[python] NLTK를 사용해 텍스트의 유사도를 계산하는 방법은 무엇인가요?

30 Nov 2023

python

전처리: 비교할 텍스트들을 토큰화하고 정규화합니다. 이 단계에서 텍스트를 단어로 나누고, 소문자로 변환하며, 불용어(Stopwords)를 제거하는 등의 작업을 수행합니다.
문서 벡터화: TF-IDF(Term Frequency-Inverse Document Frequency)나 문서 임베딩 방법을 사용하여 텍스트를 벡터로 변환합니다. TF-IDF는 단어의 빈도와 역문서 빈도를 고려하여 각 단어에 가중치를 부여하는 방법입니다. 문서 임베딩은 단어를 밀집 벡터로 변환하는 방법으로, 미리 학습된 임베딩 모델을 사용하거나, Word2Vec, GloVe 등의 알고리즘으로 직접 임베딩을 학습할 수 있습니다.
유사도 계산: 벡터화된 텍스트들 간의 유사도를 계산합니다. 가장 일반적인 유사도 측정 방법은 코사인 유사도(Cosine Similarity)입니다. 코사인 유사도는 벡터 간의 각도를 이용하여 유사도를 측정합니다. 두 벡터가 완전히 일치하는 경우에는 1이고, 완전히 다른 경우에는 0입니다.

아래는 NLTK를 사용하여 텍스트의 유사도를 계산하는 간단한 예제 코드입니다:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 텍스트 전처리
stop_words = set(stopwords.words('english'))

text1 = "This is the first document."
text2 = "This document is the second document."
text3 = "And this is the third one."

tokens1 = [word for word in word_tokenize(text1.lower()) if word.isalpha() and word not in stop_words]
tokens2 = [word for word in word_tokenize(text2.lower()) if word.isalpha() and word not in stop_words]
tokens3 = [word for word in word_tokenize(text3.lower()) if word.isalpha() and word not in stop_words]

# 문서 벡터화
corpus = [' '.join(tokens1), ' '.join(tokens2), ' '.join(tokens3)]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus)

# 유사도 계산
similarity_matrix = cosine_similarity(vectors)

print("Similarity between text1 and text2:", similarity_matrix[0, 1])
print("Similarity between text1 and text3:", similarity_matrix[0, 2])
print("Similarity between text2 and text3:", similarity_matrix[1, 2])

이 예제에서는 NLTK를 사용하여 문서 전처리를 수행하고, TF-IDF 벡터화를 사용하여 문서를 벡터로 변환한 후, 코사인 유사도를 계산하여 텍스트 간의 유사도를 측정합니다.