[python] NLTK를 사용해 문장의 유사도를 계산하는 방법은 무엇인가요?

30 Nov 2023

python

NLTK를 사용하여 문장의 유사도를 계산하는 예제를 보여드리겠습니다. 먼저, 다음과 같이 필요한 라이브러리와 모듈을 import합니다:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

다음으로, 비교하고자 하는 두 문장을 불러옵니다:

sentence1 = "I love playing soccer"
sentence2 = "I enjoy playing football"

토큰화(Tokenization)를 수행하여 각 문장을 단어로 분할합니다:

tokens1 = word_tokenize(sentence1)
tokens2 = word_tokenize(sentence2)

불용어(Stopwords)를 제거하고 어근 추출(Stemming) 또는 표제어 추출(Lemmatization)을 수행하여 각 단어를 정규화합니다:

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

normalized_words1 = [lemmatizer.lemmatize(word.lower()) for word in tokens1 if word.lower() not in stop_words]
normalized_words2 = [lemmatizer.lemmatize(word.lower()) for word in tokens2 if word.lower() not in stop_words]

TF-IDF 벡터화(Vectorization)를 수행하여 각 문장을 벡터로 변환합니다:

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([sentence1, sentence2])

코사인 유사도(Cosine Similarity)를 계산하여 문장의 유사도를 측정합니다:

similarity_score = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

이제, similarity_score 변수에는 두 문장의 유사도가 0과 1 사이의 값으로 저장됩니다. 값이 1에 가까울수록 유사도가 높다는 의미입니다.

이와 같은 방법으로 NLTK를 활용하여 문장의 유사도를 계산할 수 있습니다. 자세한 내용은 NLTK와 scikit-learn의 공식 문서를 참조하시기 바랍니다.