[Python] Coherence Score in Gensim

Gensim is a popular Python library used for topic modeling and document similarity analysis. One of the important aspects of topic modeling is evaluating the quality of the generated topics. Coherence score is a widely used metric for measuring the quality of topic models. In this blog post, we will explore how to calculate coherence scores using Gensim in Python.

What is Coherence Score?

Coherence score measures the semantic coherence of the topics generated by a topic model, that is, how well the top words within a topic fit together. It is typically estimated from how often those words co-occur in a reference corpus. A high coherence score indicates that the generated topics are coherent and meaningful.
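To make the idea concrete, here is a minimal sketch of a UMass-style score written in plain Python. It is a simplified illustration and not Gensim's implementation: the function name umass_style_coherence, the toy documents, and the smoothing constant eps=1.0 are all assumptions made for this example.

```python
import math
from itertools import combinations


def umass_style_coherence(topic_words, documents, eps=1.0):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over word pairs with j < i.

    D(...) counts the documents that contain all of the given words.
    Simplified sketch for illustration, not Gensim's implementation.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for j, i in combinations(range(len(topic_words)), 2):
        w_j, w_i = topic_words[j], topic_words[i]
        denom = doc_freq(w_j)
        if denom == 0:
            continue  # skip words that never occur in the documents
        scores.append(math.log((doc_freq(w_i, w_j) + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy, already-tokenized documents (illustrative data only).
documents = [
    ["cat", "dog", "pet", "food"],
    ["cat", "dog", "vet"],
    ["dog", "pet", "walk"],
    ["stock", "market", "price"],
    ["stock", "price", "trade"],
    ["market", "trade", "bank"],
]

coherent = umass_style_coherence(["dog", "cat", "pet"], documents)
mixed = umass_style_coherence(["dog", "stock", "walk"], documents)
print(f"coherent topic: {coherent:.3f}")
print(f"mixed topic:    {mixed:.3f}")
```

The topic whose words tend to appear in the same documents scores higher than the one that mixes unrelated words, which is the intuition behind every coherence measure discussed here.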

Installing Gensim

Before we begin, make sure you have Gensim installed in your Python environment. You can install Gensim by running the following command:

pip install gensim
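To confirm that the installation worked, a small check like the following can help. It is only a sketch: the helper name installed_version is an assumption for this example, and it relies on importlib.metadata from the standard library (Python 3.8 or later).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("gensim")
if version:
    print(f"gensim version: {version}")
else:
    print("gensim is not installed in this environment")
```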

Calculating Coherence Score in Gensim

To calculate the coherence score in Gensim, we need to preprocess our corpus and train a topic model first. Let’s assume that we have already preprocessed our corpus and trained a topic model using Gensim’s LDA (Latent Dirichlet Allocation) algorithm. Here’s an example of how to calculate coherence score with Gensim:

from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Load the dictionary and the tokenized documents (a list of token lists).
# 'c_v' is estimated from the raw tokens, not from a bag-of-words corpus.
dictionary = Dictionary.load('<path_to_dictionary>')
texts = <load_texts_function>('<path_to_tokenized_texts>')

# Load the trained LDA model
lda_model = LdaModel.load('<path_to_lda_model>')

# Calculate coherence score
coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()

# Print the coherence score
print("Coherence Score: %.3f" % coherence_score)

Here, we first load the dictionary and the tokenized documents that we used for training the topic model. Then, we load the trained LDA model using Gensim’s LdaModel class. We then create a CoherenceModel object, passing the LDA model, the tokenized documents (the texts argument), and the dictionary as arguments. Note that texts expects a list of token lists, not the bag-of-words corpus: 'c_v' counts co-occurrences in a sliding window over the tokens. We specify the coherence measure by setting coherence='c_v'; other options are 'u_mass', 'c_uci', and 'c_npmi'. Of these, 'u_mass' can be computed from the bag-of-words corpus by passing corpus= instead of texts=. Finally, we call the get_coherence() method to obtain the coherence score.

Interpreting the Coherence Score

The range of the coherence score depends on the measure you choose. 'c_v' typically falls between 0 and 1, 'c_npmi' lies between -1 and 1, and 'u_mass' is usually negative, with values closer to zero being better. In every case, a higher score indicates better coherence. The absolute value of the score does not hold much significance on its own; what matters is the relative score among different topic models. Therefore, use the coherence score as a comparative metric: score the candidate models with the same measure on the same data, and prefer the one with the highest coherence score.

Conclusion

In this blog post, we learned how to calculate coherence scores using Gensim in Python. Coherence score is a valuable metric for evaluating the quality of topic models and can help us understand how well the top words within each generated topic fit together. With Gensim’s CoherenceModel, we can calculate coherence scores in a few lines and compare different topic models, which helps in making informed decisions in topic modeling tasks.