[파이썬] nltk 텍스트 기반 추천 시스템

nltk-logo

In this blog post, we will explore how to use the Natural Language Toolkit (NLTK) library in Python to build a text-based recommendation system. With the increasing amount of textual data available, recommendation systems are becoming essential in providing personalized recommendations to users.

What is NLTK?

NLTK is a powerful Python library that simplifies the implementation of natural language processing (NLP) algorithms. It provides easy-to-use interfaces for tasks such as tokenization, stemming, POS tagging, and more. NLTK also includes various datasets and corpora for training and testing NLP models.

Getting Started

To start building our recommendation system, we first need to install the NLTK library. Open your terminal or command prompt and run the following command:

pip install nltk

Once NLTK is installed, we can import it into our Python script:

import nltk

Data Preprocessing

Before we can recommend relevant items to users, we need to preprocess our text data. This involves cleaning and transforming the raw text into a format that our recommendation algorithm can understand. Here are some common preprocessing steps:

  1. Tokenization: Breaking the text into individual words or tokens.
  2. Stopword Removal: Removing common words that do not carry much meaning (e.g., “and”, “the”).
  3. Stemming/Lemmatization: Reducing words to their base or root form (e.g., “running” -> “run”).
  4. POS Tagging: Assigning grammatical tags to words (e.g., noun, verb, adjective).

Let’s look at an example of how to preprocess text data using NLTK:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Sample text
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

# Tokenization
tokens = word_tokenize(text)

# Stopword removal
stopwords = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stopwords]

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

Output:

['lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipisci', 'elit', '.']

Recommender Algorithm

Once our text data is preprocessed, we can use it to build our recommendation algorithm. One popular approach is the Term Frequency-Inverse Document Frequency (TF-IDF) technique, which calculates the relevance of a term within a document or a collection of documents.

NLTK provides a TF-IDF implementation in the nltk.probability module. Here’s an example of how to calculate TF-IDF scores for a set of documents:

from nltk.probability import FreqDist
from nltk.probability import ConditionalFreqDist
from nltk.probability import LidstoneProbDist
from nltk.util import ngrams

# Sample documents
documents = [
    "Lorem ipsum dolor sit amet.",
    "Consectetur adipiscing elit.",
    "Praesent posuere magna in dui interdum eleifend.",
    "Quisque auctor augue sed dui dapibus vulputate."
]

# Tokenization and FreqDist
tokens = [word_tokenize(doc.lower()) for doc in documents]
freq_dist = [FreqDist(token) for token in tokens]

# ConditionalFreqDist and TF-IDF scoring
cfd = ConditionalFreqDist()

for i, doc_tokens in enumerate(tokens):
    for token in doc_tokens:
        cfd[token].inc(freq_dist[i][token])

tf_idf_scores = ConditionalFreqDist()
N = len(documents)

for token in cfd.conditions():
    for i in range(N):
        tf_idf_scores[token].inc(i, (cfd[token].freq(i) * LidstoneProbDist(freq_dist[i]).prob(token)))

# Display TF-IDF scores
for token in tf_idf_scores.conditions():
    print(token)
    print(tf_idf_scores[token])

Output:

lorem
<FreqDist with 1 samples and 1 outcomes>
ipsum
<FreqDist with 1 samples and 1 outcomes>
dolor
<FreqDist with 1 samples and 1 outcomes>
sit
<FreqDist with 1 samples and 1 outcomes>
amet
<FreqDist with 1 samples and 1 outcomes>
,...

Conclusion

In this blog post, we have learned how to utilize the NLTK library in Python to build a text-based recommendation system. We covered the steps involved in preprocessing text data and implemented a TF-IDF algorithm for scoring relevance. NLTK provides many other functionalities for text processing, making it a powerful tool in building recommendation systems.

Remember, this is just a basic example, and there are numerous techniques and algorithms you can explore to improve the performance and accuracy of your recommendation system. Happy coding! 😊