Seaborn과 Gensim의 조합을 사용한 토픽 모델링 결과 시각화

20 Oct 2023

seaborn

소개

토픽 모델링은 문서 집합에서 주제를 추출하는 기술입니다. Gensim은 Python에서 토픽 모델링을 수행하기 위한 강력한 도구입니다. 이를 사용하여 토픽 모델링을 수행하고 결과를 시각화하기 위해 Seaborn을 사용하는 방법에 대해 알아보겠습니다.

필요한 패키지 설치

아래의 명령어를 사용하여 필요한 패키지를 설치합니다:

pip install gensim seaborn

예시 코드

아래는 Gensim을 사용하여 토픽 모델링을 수행하고 결과를 Seaborn을 사용하여 시각화하는 예시 코드입니다.

import gensim
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# 문서 집합과 사전 생성
documents = [["apple", "banana", "orange"],
             ["cat", "dog", "elephant"],
             ["apple", "dog", "cat"],
             ["banana", "elephant", "orange"]]

dictionary = gensim.corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# LDA 모델 생성과 학습
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# 토픽-단어 분포 확인
topics = []
for topic_id in range(lda_model.num_topics):
    topic_words = lda_model.show_topic(topic_id, topn=3)
    topics.append([word for word, _ in topic_words])

# 데이터프레임 생성
df = pd.DataFrame(topics, columns=['Word 1', 'Word 2', 'Word 3'])

# 시각화
sns.set(style="whitegrid")
plt.figure(figsize=(8, 6))
ax = sns.barplot(data=df, orient="h")
ax.set_title("Topic Word Distribution")
ax.set_xlabel("Word Probability")
plt.show()

결과

위의 코드를 실행하면, 토픽 모델링 결과에 대한 시각화가 생성됩니다. 시각화된 결과는 각 주제에 대한 주요 단어의 분포를 보여줍니다.

결론

Seaborn과 Gensim의 조합은 토픽 모델링 결과를 시각적으로 파악하는 데 매우 유용합니다. 이를 통해 데이터 과학자나 연구자는 문서 집합에서 주요 토픽을 추출하고 시각화하여 인사이트를 얻을 수 있습니다. Seaborn의 다양한 시각화 기능을 통해 토픽 모델링 결과를 더욱 멋지고 직관적으로 표현할 수 있습니다.

참고 자료

Gensim: https://radimrehurek.com/gensim/
Seaborn: https://seaborn.pydata.org/