[Python] TF-IDF Vectorization with NLTK

TF-IDF (Term Frequency-Inverse Document Frequency) is a popular technique used in natural language processing to measure the importance of a term within a document or a collection of documents. In this blog post, we will explore how to perform TF-IDF vectorization using the Natural Language Toolkit (NLTK) library in Python.

Prerequisites

Before we dive into the code, make sure you have the NLTK library installed. If not, you can install it using pip:

pip install nltk

We also need to download the NLTK data the code below relies on: the punkt tokenizer models and the stop-word lists. Open a Python shell and execute the following commands:

import nltk
nltk.download('stopwords')
nltk.download('punkt')

TF-IDF Vectorization

Step 1: Preprocessing the Documents

The first step in performing TF-IDF vectorization is to preprocess the collection of documents. Preprocessing involves removing any unnecessary characters, tokenizing the text, and removing stop words (common words like “the”, “is”, etc.).

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

def preprocess_documents(documents):
    # Remove unnecessary characters
    documents = [re.sub(r'\W+', ' ', doc) for doc in documents]
    
    # Tokenize the text
    tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    processed_documents = [[word for word in doc if word not in stop_words] for doc in tokenized_documents]
    
    return processed_documents

Step 2: Calculating Term Frequencies

The next step is to calculate the term frequency (TF) of each term in each document. In its simplest form, term frequency is the raw count of how many times a term appears in a document; a common variant normalizes this count by the total number of terms in the document. The implementation below uses raw counts.

from collections import defaultdict

def calculate_term_frequency(documents):
    term_frequency = defaultdict(lambda: defaultdict(int))
    
    for i, document in enumerate(documents):
        for term in document:
            term_frequency[i][term] += 1
    
    return term_frequency
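A quick check of this function on two tiny pre-tokenized documents (the token lists are invented for illustration) shows the per-document counts it produces:

```python
from collections import defaultdict

def calculate_term_frequency(documents):
    # Map document index -> term -> raw count
    term_frequency = defaultdict(lambda: defaultdict(int))
    for i, document in enumerate(documents):
        for term in document:
            term_frequency[i][term] += 1
    return term_frequency

# Hypothetical pre-tokenized documents for illustration
docs = [["cat", "sat", "mat", "cat"], ["dog", "sat", "log"]]
tf = calculate_term_frequency(docs)
print(tf[0]["cat"])  # 2: "cat" occurs twice in the first document
print(tf[1]["dog"])  # 1
```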

Step 3: Calculating Inverse Document Frequencies

After calculating the term frequencies, we need to calculate the inverse document frequency (IDF) for each term. IDF measures how informative a term is across the entire collection: it is the logarithm of the total number of documents divided by the number of documents that contain the term. A term that appears in every document gets an IDF of 0, while rarer terms receive higher scores.

import math

def calculate_inverse_document_frequency(documents):
    document_frequency = defaultdict(int)
    
    for document in documents:
        for term in set(document):
            document_frequency[term] += 1
    
    total_documents = len(documents)
    inverse_document_frequency = {term: math.log(total_documents / frequency) for term, frequency in document_frequency.items()}
    
    return inverse_document_frequency
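Running this function on a small invented corpus illustrates the point above: a term shared by every document scores 0, while a term unique to one document scores log(N/1):

```python
import math
from collections import defaultdict

def calculate_inverse_document_frequency(documents):
    # Map term -> number of documents containing it
    document_frequency = defaultdict(int)
    for document in documents:
        for term in set(document):
            document_frequency[term] += 1
    total_documents = len(documents)
    return {term: math.log(total_documents / frequency)
            for term, frequency in document_frequency.items()}

# Hypothetical pre-tokenized documents for illustration
docs = [["cat", "sat"], ["dog", "sat"]]
idf = calculate_inverse_document_frequency(docs)
print(idf["sat"])  # 0.0: "sat" appears in every document, so it carries no signal
print(idf["cat"])  # log(2/1) ≈ 0.6931
```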

Step 4: Calculating TF-IDF Scores

Finally, we can calculate the TF-IDF score for each term in each document by multiplying the term frequency with the inverse document frequency.

def calculate_tfidf_scores(term_frequency, inverse_document_frequency):
    tfidf_scores = defaultdict(lambda: defaultdict(float))
    
    for document, terms in term_frequency.items():
        for term, frequency in terms.items():
            tfidf_scores[document][term] = frequency * inverse_document_frequency[term]
    
    return tfidf_scores

Putting it All Together

Now let’s put all the pieces together and use the functions we’ve defined above to perform TF-IDF vectorization.

def tfidf_vectorization(documents):
    processed_documents = preprocess_documents(documents)
    term_frequency = calculate_term_frequency(processed_documents)
    inverse_document_frequency = calculate_inverse_document_frequency(processed_documents)
    tfidf_scores = calculate_tfidf_scores(term_frequency, inverse_document_frequency)
    
    return tfidf_scores
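To see the whole pipeline in action without depending on the NLTK downloads, here is a minimal end-to-end run that skips the preprocessing step and feeds pre-tokenized, lowercased documents (the token lists are invented for illustration) through the TF, IDF, and TF-IDF functions defined above:

```python
import math
from collections import defaultdict

def calculate_term_frequency(documents):
    term_frequency = defaultdict(lambda: defaultdict(int))
    for i, document in enumerate(documents):
        for term in document:
            term_frequency[i][term] += 1
    return term_frequency

def calculate_inverse_document_frequency(documents):
    document_frequency = defaultdict(int)
    for document in documents:
        for term in set(document):
            document_frequency[term] += 1
    total_documents = len(documents)
    return {term: math.log(total_documents / frequency)
            for term, frequency in document_frequency.items()}

def calculate_tfidf_scores(term_frequency, inverse_document_frequency):
    tfidf_scores = defaultdict(lambda: defaultdict(float))
    for document, terms in term_frequency.items():
        for term, frequency in terms.items():
            tfidf_scores[document][term] = frequency * inverse_document_frequency[term]
    return tfidf_scores

# Invented pre-tokenized documents for illustration
documents = [
    ["python", "nltk", "text", "python"],
    ["python", "data", "text"],
    ["data", "mining"],
]
tf = calculate_term_frequency(documents)
idf = calculate_inverse_document_frequency(documents)
scores = calculate_tfidf_scores(tf, idf)

# "python" appears twice in document 0 and in 2 of the 3 documents,
# so its score there is 2 * log(3/2)
print(round(scores[0]["python"], 4))  # 0.8109
```

In a real run you would call preprocess_documents first so that tokenization and stop-word removal are handled by NLTK.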

Conclusion

In this blog post, we have learned how to perform TF-IDF vectorization using NLTK in Python. We started by preprocessing the documents, then calculated the term frequencies, the inverse document frequencies, and finally the TF-IDF scores. TF-IDF is widely used in text mining and information retrieval for tasks such as document classification, search ranking, and keyword extraction. NLTK provides a robust set of tools for natural language processing, and TF-IDF vectorization is just one of many techniques you can leverage for analyzing text data.