TF-IDF (Term Frequency-Inverse Document Frequency) is a popular technique used in natural language processing to measure the importance of a term within a document or a collection of documents. In this blog post, we will explore how to perform TF-IDF vectorization using the Natural Language Toolkit (NLTK) library in Python.
Prerequisites
Before we dive into the code, make sure you have the NLTK library installed. If not, you can install it using pip:
pip install nltk
Also, we need to download the necessary NLTK data. Open a Python shell and execute the following commands:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
TF-IDF Vectorization
Step 1: Preprocessing the Documents
The first step in performing TF-IDF vectorization is to preprocess the collection of documents. Preprocessing involves removing any unnecessary characters, tokenizing the text, and removing stop words (common words like “the”, “is”, etc.).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
def preprocess_documents(documents):
# Remove unnecessary characters
documents = [re.sub(r'\W+', ' ', doc) for doc in documents]
# Tokenize the text
tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]
# Remove stop words
stop_words = set(stopwords.words('english'))
processed_documents = [[word for word in doc if word not in stop_words] for doc in tokenized_documents]
return processed_documents
Step 2: Calculating Term Frequencies
The next step is to calculate the term frequency (TF) for each term in each document. Term frequency is the number of times a term appears in a document divided by the total number of terms in that document.
from collections import defaultdict
def calculate_term_frequency(documents):
term_frequency = defaultdict(lambda: defaultdict(int))
for i, document in enumerate(documents):
for term in document:
term_frequency[i][term] += 1
return term_frequency
Step 3: Calculating Inverse Document Frequencies
After calculating the term frequencies, we need to calculate the inverse document frequency (IDF) for each term. IDF measures the importance of a term in the entire collection of documents. It is calculated by taking the logarithm of the total number of documents divided by the number of documents that contain the term.
import math
def calculate_inverse_document_frequency(documents):
document_frequency = defaultdict(int)
for document in documents:
for term in set(document):
document_frequency[term] += 1
total_documents = len(documents)
inverse_document_frequency = {term: math.log(total_documents / frequency) for term, frequency in document_frequency.items()}
return inverse_document_frequency
Step 4: Calculating TF-IDF Scores
Finally, we can calculate the TF-IDF score for each term in each document by multiplying the term frequency with the inverse document frequency.
def calculate_tfidf_scores(term_frequency, inverse_document_frequency):
tfidf_scores = defaultdict(lambda: defaultdict(float))
for document, terms in term_frequency.items():
for term, frequency in terms.items():
tfidf_scores[document][term] = frequency * inverse_document_frequency[term]
return tfidf_scores
Putting it All Together
Now let’s put all the pieces together and use the functions we’ve defined above to perform TF-IDF vectorization.
def tfidf_vectorization(documents):
processed_documents = preprocess_documents(documents)
term_frequency = calculate_term_frequency(processed_documents)
inverse_document_frequency = calculate_inverse_document_frequency(processed_documents)
tfidf_scores = calculate_tfidf_scores(term_frequency, inverse_document_frequency)
return tfidf_scores
Conclusion
In this blog post, we have learned how to perform TF-IDF vectorization using NLTK in Python. We started by preprocessing the documents, then calculated the term frequencies, inverse document frequencies, and finally, the TF-IDF scores. This technique is widely used in text mining and information retrieval for tasks such as document classification, information retrieval, and keyword extraction. NLTK provides a robust set of tools for natural language processing, and TF-IDF vectorization is just one of many techniques you can leverage for analyzing text data.