[Python] Gensim Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a topic modeling technique that is widely used in Natural Language Processing (NLP) to discover hidden topics in a collection of documents. Gensim is a popular Python library that provides an efficient implementation of LDA.

In this blog post, we will walk through an example of running LDA using Gensim in Python.

Installation

Before we start, let’s make sure we have Gensim installed. Open your command prompt and run the following command:

pip install gensim

Preparing the Data

The first step is to prepare our data for LDA. We need a collection of documents, where each document is represented as a list of words. For this example, let’s assume we have the following documents:

documents = [
    ['apple', 'banana', 'orange', 'juice'],
    ['apple', 'smoothie', 'mango'],
    ['banana', 'peach', 'smoothie'],
    ['orange', 'mango', 'juice'],
    ['apple', 'banana', 'peach']
]
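Real data rarely arrives as ready-made word lists. As a minimal sketch of how raw text could be turned into the list-of-words form used above, the snippet below lowercases each text, keeps alphabetic runs, and drops a few stopwords. The sample sentences, the STOPWORDS set, and the tokenize helper are all hypothetical and exist only for illustration; a real project would likely need a more careful preprocessing step.

```python
import re

# Hypothetical raw texts, invented for illustration only.
raw_texts = [
    "Apple and banana smoothie!",
    "Fresh orange juice, mango juice.",
]

# A tiny, hand-picked stopword set (an assumption, not a standard list).
STOPWORDS = {"and", "the", "a", "of"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOPWORDS]


tokenized = [tokenize(text) for text in raw_texts]
print(tokenized)
```

Repeated words are kept on purpose, because the word counts are what the bag-of-words step relies on later.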

Creating a Dictionary and Corpus

Next, we need to create a dictionary and a corpus from our documents. The dictionary maps each unique word to an integer ID, and the corpus stores each document as a bag of words: a list of (word ID, count) pairs recording how often each word occurs in that document.

from gensim import corpora

# Create the dictionary
dictionary = corpora.Dictionary(documents)

# Create the corpus
corpus = [dictionary.doc2bow(doc) for doc in documents]
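To make the two structures concrete, here is a plain-Python sketch of the same idea, without Gensim. It assigns IDs in order of first appearance, which is an assumption made for readability; Gensim's Dictionary may number the words differently. The names token2id, to_bow, and toy_corpus are illustrative only.

```python
from collections import Counter

# The first two documents from the example above.
documents = [
    ['apple', 'banana', 'orange', 'juice'],
    ['apple', 'smoothie', 'mango'],
]

# Give each distinct word an integer ID in order of first appearance.
# (Gensim's own ID assignment may order the words differently.)
token2id = {}
for doc in documents:
    for word in doc:
        token2id.setdefault(word, len(token2id))


def to_bow(doc):
    """Return the document as a sorted list of (word ID, count) pairs."""
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in documents]
print(token2id)
print(toy_corpus)
```

Each inner list has one entry per distinct word in that document, so words that do not occur in a document are simply absent from its list.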

Running LDA

Now that we have our corpus and dictionary, we can proceed to run LDA. In Gensim, LDA can be trained using the LdaModel class.

from gensim.models import LdaModel

# Train the LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)

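In the above code, we set the number of topics to 3 and the number of passes to 10. A pass is one complete sweep of training over the whole corpus; it is separate from the iterations parameter, which caps the inference iterations run on each document. Feel free to adjust these parameters based on your data and requirements.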

Exploring the Topics

Once the LDA model is trained, we can explore the discovered topics and their associated keywords.

# Print the topics and keywords
for topic in lda_model.print_topics():
    print(topic)

The print_topics() method returns a list of (topic ID, string) pairs, where each string lists the topic’s most significant words along with their weights.

Evaluating the Model

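To evaluate the LDA model, we can use metrics such as topic coherence or perplexity. Coherence is computed with Gensim’s CoherenceModel class; the c_v measure needs the tokenized documents rather than the bag-of-words corpus. The log_perplexity() method returns a per-word likelihood bound, not the perplexity itself; the perplexity is 2 raised to the negative of that bound.

from gensim.models import CoherenceModel

# Calculate the c_v coherence score from the tokenized documents
# (processes=1 keeps the computation in a single process)
coherence_model = CoherenceModel(model=lda_model, texts=documents, dictionary=dictionary, coherence='c_v', processes=1)
coherence_score = coherence_model.get_coherence()
print('Coherence Score:', coherence_score)

# log_perplexity() returns the per-word likelihood bound; convert it to perplexity
per_word_bound = lda_model.log_perplexity(corpus)
perplexity = 2 ** (-per_word_bound)
print('Perplexity:', perplexity)

Higher coherence scores and lower perplexity values generally indicate better models. Perplexity measured on the training corpus says little about how well the model generalizes, and on a corpus as small as this one both numbers are only illustrative.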

Conclusion

Gensim provides a powerful and efficient implementation of Latent Dirichlet Allocation for topic modeling. In this blog post, we have illustrated how to use Gensim to train an LDA model, explore topics, and evaluate the model’s performance.

By applying LDA to your own text data, you can gain valuable insights into the hidden topics and structure within your documents.
