[파이썬] Gensim TfidfModel 사용

In this blog post, we will explore the usage of the Gensim TfidfModel in Python. Gensim is a Python library used for topic modeling and document similarity analysis. The TfidfModel is a powerful tool that helps in determining the importance of words in a document or corpus.

What is TfidfModel?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The main idea behind the TF-IDF model is to give more weightage to the words that appear frequently in a document but less frequently in the entire corpus.

The TfidfModel in Gensim calculates the TF-IDF values for a given set of texts. It takes a vectorized input corpus and returns a transformed corpus where each document is represented by a list of tuples with the word ID and its corresponding TF-IDF score.

Installing Gensim

Before we begin, make sure you have Gensim installed in your Python environment. You can install Gensim using pip:

$ pip install gensim

Alternatively, you can install it using conda:

$ conda install -c conda-forge gensim

Example Usage

Let’s see an example of how to use the TfidfModel in Gensim.

from gensim.models import TfidfModel
from gensim.corpora import Dictionary

# Example corpus
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry", "date"],
    ["apple", "cherry", "date"],
    ["date", "elderberry", "fig"]
]

# Create a dictionary from the corpus
dictionary = Dictionary(corpus)

# Convert the corpus into a bag-of-words format
bow_corpus = [dictionary.doc2bow(doc) for doc in corpus]

# Create a TF-IDF model
tfidf = TfidfModel(bow_corpus)

# Transform the original corpus into TF-IDF representation
tfidf_corpus = tfidf[bow_corpus]

# Print the TF-IDF representation of each document
for doc in tfidf_corpus:
    print(doc)

In the above example, we have a corpus with four documents. We first create a dictionary from the corpus using the Dictionary class in Gensim. Next, we convert the corpus into a bag-of-words format using the doc2bow function. Then, we create a TF-IDF model using the TfidfModel class and apply it to the bag-of-words corpus to obtain the TF-IDF representation.

Finally, we print the TF-IDF representation of each document in the corpus. The output will be a list of tuples, where each tuple contains the word ID and its corresponding TF-IDF score.

Conclusion

The Gensim TfidfModel is a useful tool for calculating the TF-IDF values of words in a document or corpus. It can help in various tasks such as keyword extraction, document similarity analysis, and information retrieval. By using the TfidfModel in Gensim, you can gain insights into the importance of words in your text data and improve the performance of your natural language processing applications.