[Python] Using TfidfVectorizer and CountVectorizer with nltk

In natural language processing (NLP), text data needs to be transformed into numerical representations before it can be analyzed or modeled. Two popular tools for text vectorization in Python are scikit-learn's CountVectorizer and TfidfVectorizer, which can be combined with a tokenizer from nltk. They transform text data into matrices of token counts or term frequency-inverse document frequency (TF-IDF) scores.

Here, we will explore the basic usage of both TfidfVectorizer and CountVectorizer from scikit-learn, with nltk's word_tokenize as the tokenizer.

CountVectorizer

CountVectorizer is a simple and widely used text vectorization method that converts a collection of text documents into a matrix of token counts. Each row represents a document, and each column represents a unique token. The matrix entry at row “i” and column “j” denotes the number of occurrences of token “j” in document “i”.
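To make the row and column layout concrete, here is a minimal pure-Python sketch of such a count matrix. The toy corpus and the whitespace tokenization are assumptions made for illustration; this is not how CountVectorizer is implemented internally.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased; tokens are split on whitespace.
documents = ["the cat sat", "the dog sat on the mat"]
tokenized = [doc.split() for doc in documents]

# One column per unique token, in sorted order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document: how often each vocabulary token occurs in it.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[token] for token in vocabulary])

print(vocabulary)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in count_matrix:
    print(row)       # [1, 0, 0, 0, 1, 1] then [0, 1, 1, 1, 1, 2]
```

The entry 2 in the last column of the second row says that "the" occurs twice in the second document.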

To use CountVectorizer, start by importing the necessary libraries:

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

We also need a list of text documents. Assuming we have a list called documents, we can apply CountVectorizer as follows:

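# Use nltk's word_tokenize instead of scikit-learn's default regex tokenizer.
count_vectorizer = CountVectorizer(tokenizer=word_tokenize)
# Learn the vocabulary and build the document-term count matrix in one step.
X = count_vectorizer.fit_transform(documents)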

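Here, tokenizer=word_tokenize replaces scikit-learn's default regular-expression tokenizer with nltk's word tokenizer, which also emits punctuation marks as separate tokens. Note that word_tokenize relies on nltk's Punkt tokenizer data, which has to be downloaded once with nltk.download before the first call. The resulting X is a sparse matrix with one row per document and one column per token in the learned vocabulary, holding the token counts.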

TfidfVectorizer

TfidfVectorizer builds the same kind of document-term matrix as CountVectorizer, but it replaces the raw counts with TF-IDF weights. TF-IDF stands for Term Frequency-Inverse Document Frequency, a numerical statistic that reflects how important a word is to a document within a collection or corpus: a token scores high when it occurs often in a document but appears in few documents overall.

TfidfVectorizer is used the same way as CountVectorizer. First, import the required libraries:
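The weighting can be sketched in a few lines of pure Python. The version below assumes the formula behind scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row); the toy corpus and whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased; tokens are split on whitespace.
documents = ["the cat sat", "the dog sat on the mat"]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each token.
df = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}

# Smoothed inverse document frequency (assumed to match smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

tfidf_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Term frequency times IDF for every vocabulary token.
    row = [counts[t] * idf[t] for t in vocabulary]
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in row))
    tfidf_matrix.append([v / norm for v in row])

for token in vocabulary:
    print(token, round(idf[token], 3))
```

In this corpus "the" and "sat" appear in every document, so their IDF is exactly 1, while tokens such as "cat" that appear in only one document receive a larger weight.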

from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

Assuming we have a list of text documents called documents, we can apply TfidfVectorizer as follows:

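# Use nltk's word_tokenize instead of scikit-learn's default regex tokenizer.
tfidf_vectorizer = TfidfVectorizer(tokenizer=word_tokenize)
# Learn the vocabulary and IDF weights, then build the TF-IDF matrix.
X = tfidf_vectorizer.fit_transform(documents)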

The resulting X matrix has the same shape as the count matrix, but each entry is the TF-IDF weight of a token in a document rather than a raw count.

Conclusion

In this blog post, we explored the basic usage of scikit-learn's TfidfVectorizer and CountVectorizer together with nltk's word_tokenize in Python. CountVectorizer is a simple approach that represents text data as token counts, while TfidfVectorizer extends this by weighting the counts with TF-IDF. Consider using these vectorization techniques to transform text data into numerical representations for NLP tasks such as text classification, sentiment analysis, and topic modeling.