[python] 파이썬 gensim 라이브러리 소개

19 Dec 2023

python

Gensim은 “unstructured digital text”를 효율적으로 다루기 위한 Python 라이브러리입니다. 이 라이브러리는 토픽 모델링, 문서 유사성, 자연어 처리 등 다양한 기능을 제공하여 텍스트 데이터를 다루는데 유용합니다.

주요 기능

Gensim 라이브러리에는 다음과 같은 주요 기능이 포함되어 있습니다:

토픽 모델링: Latent Semantic Analysis(LSA), Latent Dirichlet Allocation(LDA) 등을 활용하여 텍스트에서 토픽을 추출합니다.
문서 유사성 계산: 문서 간의 유사성을 계산하고 비교하는 기능을 제공합니다.
Word2Vec: 단어 간의 의미론적 유사성을 계산하여 단어 벡터를 생성하는 기능을 제공합니다.

설치 방법

Gensim 라이브러리는 pip를 통해 간편하게 설치할 수 있습니다:

pip install gensim

예제

다음은 Gensim을 사용하여 간단한 토픽 모델링을 수행하는 예제 코드입니다:

from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string

# 문서
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# 텍스트 전처리
texts = [preprocess_string(doc) for doc in documents]

# 단어 사전 생성
dictionary = corpora.Dictionary(texts)

# 단어-문서 매트릭스 생성
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA 모델 학습
lda_model = LdaModel(corpus, id2word=dictionary, num_topics=2)

결론

Gensim은 텍스트 분석 작업을 간편하게 수행할 수 있는 강력한 도구입니다. 이 라이브러리를 활용하여 텍스트 데이터로부터 유의미한 정보를 추출하고, 다양한 자연어 처리 작업을 수행할 수 있습니다.

참고 자료

Gensim 공식 문서: https://radimrehurek.com/gensim/
Gensim GitHub 저장소: https://github.com/RaRe-Technologies/gensim