[python] 파이썬 gensim을 이용한 토픽 모델링 성능 평가 방법

19 Dec 2023

python

텍스트 데이터에서 토픽 모델링은 많은 관심을 받고 있습니다. gensim은 파이썬에서 자연어 처리와 토픽 모델링을 위한 라이브러리 중 하나로, 이 라이브러리를 사용하여 토픽 모델링의 성능을 평가하는 것은 중요한 과제입니다. 이번에는 gensim을 사용하여 토픽 모델링의 성능을 평가하는 방법에 대해 알아보겠습니다.

1. 코퍼스 및 딕셔너리 생성

gensim을 사용하여 토픽 모델링을 수행하기 위해서는 우선 코퍼스(corpus)와 딕셔너리(dictionary)를 생성해야 합니다. gensim을 사용하여 이러한 데이터 구조를 만드는 방법은 다음과 같습니다:

from gensim import corpora
documents = [/* your list of documents */]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

2. 토픽 모델링 수행

데이터 구조가 준비되면, gensim을 사용하여 토픽 모델링을 수행할 수 있습니다. 여기에는 Latent Dirichlet Allocation(LDA) 등의 알고리즘을 사용할 수 있습니다.

from gensim import models
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=10)

3. 성능 평가

토픽 모델링의 성능을 평가하기 위해서는 coherence score와 perplexity 등의 지표를 사용할 수 있습니다. gensim 라이브러리에서는 이러한 지표를 계산하는 기능을 제공합니다.

from gensim.models.coherencemodel import CoherenceModel
cm = CoherenceModel(model=lda_model, texts=documents, dictionary=dictionary, coherence='c_v')
coherence = cm.get_coherence()

토픽 모델링의 perplexity를 계산하기 위해서는 다음과 같은 방법을 사용할 수 있습니다:

perplexity = lda_model.log_perplexity(corpus)

gensim을 사용하여 토픽 모델링의 성능을 평가하는 방법을 알아보았습니다. 이를 통해 효과적인 토픽 모델링을 위한 파라미터 튜닝 및 모델 선택을 진행할 수 있을 것입니다.

Reference

Gensim Documentation: https://radimrehurek.com/gensim/
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of machine Learning research, 3, 993-1022.