Gensim을 사용하여 텍스트 기반 추천 시스템 구현하기

09 Nov 2023

gensim

텍스트 데이터를 기반으로 추천 시스템을 구현하기 위해 Gensim이라는 파이썬 라이브러리를 사용할 수 있습니다. Gensim은 자연어 처리와 토픽 모델링을 위한 툴킷으로, 텍스트 데이터를 효율적으로 처리하고 유의미한 추천을 제공하는 데 도움을 줍니다.

1. Gensim 설치하기

먼저, Gensim을 설치해야 합니다. 다음 명령을 사용하여 설치할 수 있습니다:

pip install gensim

2. 텍스트 데이터 전처리하기

추천 시스템을 구현하기 전에, 텍스트 데이터를 전처리해야 합니다. 일반적으로 텍스트 데이터의 전처리 단계에는 토큰화(Tokenization), 불용어 제거(Stopwords removal), 어간 추출(Stemming) 등이 포함됩니다.

import gensim
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

# 텍스트 데이터 전처리 예시
documents = ["텍스트 데이터 전처리 예시입니다.",
             "텍스트 데이터를 효율적으로 처리하기 위해 전처리가 필요합니다.",
             "Gensim을 사용하면 텍스트 기반 추천 시스템을 구현할 수 있습니다."]

# 텍스트 데이터 토큰화
tokenized_documents = [doc.split() for doc in documents]

# Dictionary 생성
dictionary = Dictionary(tokenized_documents)

# 불용어 제거
stopwords = ['텍스트', '데이터', '전처리']
dictionary.filter_tokens([dictionary.token2id[stopword] for stopword in stopwords])

# 어간 추출
stemmer = gensim.parsing.PorterStemmer()
preprocessed_documents = [[stemmer.stem(token) for token in doc_tokens] for doc_tokens in tokenized_documents]

# TF-IDF 행렬 생성
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]
tfidf = TfidfModel(corpus)
tfidf_matrix = tfidf[corpus]

3. 텍스트 유사도 측정하기

이제 전처리된 데이터를 사용하여 텍스트 간의 유사도를 측정할 수 있습니다. 유사도 측정에는 코사인 유사도(Cosine similarity)를 자주 사용합니다.

from gensim.similarities import MatrixSimilarity

# TF-IDF 행렬에 대한 코사인 유사도 계산
index = MatrixSimilarity(tfidf_matrix)

# 코사인 유사도 계산 예시
test_document = "텍스트 데이터 처리를 위해 Gensim을 사용하는 방법은 무엇인가요?"
test_tokenized_document = test_document.split()
test_preprocessed_document = [stemmer.stem(token) for token in test_tokenized_document]
test_bow = dictionary.doc2bow(test_preprocessed_document)
test_tfidf = tfidf[test_bow]

# 유사한 문서 검색
similarity_scores = index[test_tfidf]

# 가장 유사한 문서 순위 출력
sorted_docs = sorted(enumerate(similarity_scores), key=lambda item: -item[1])
for doc_index, score in sorted_docs:
    print(f"문서 #{doc_index + 1}, 유사도: {score}")

마무리

이제 Gensim을 활용하여 텍스트 기반 추천 시스템을 구현하는 방법을 알게 되었습니다. Gensim은 텍스트 데이터의 전처리와 유사도 측정을 효율적으로 처리하는데 도움을 주는 강력한 도구입니다. 이를 통해 사용자에게 유의미한 추천을 제공할 수 있습니다.

텍스트 기반 추천 시스템, 자연어 처리, Gensim에 대해 더 알고 싶다면 #추천시스템, #자연어처리 해시태그를 참고하세요.

참고 자료: