[python] NLTK를 사용해 문장의 감성을 분류하는 방법은 무엇인가요?

30 Nov 2023

python

데이터셋 준비: 감성 분류를 위해 레이블링된 데이터셋이 필요합니다. 일반적으로 긍정과 부정으로 레이블링된 문장들로 구성됩니다.
전처리: 데이터셋의 문장들을 토큰화하고 각 토큰을 정제합니다. 불용어(Stopwords)를 제거하고 어간 추출(Stemming) 또는 표제어 추출(Lemmatization)을 수행할 수 있습니다.
특성 추출: 문장에서 특성을 추출하는 과정입니다. 가장 일반적인 방법은 Bag-of-Words 모델을 사용하는 것입니다. 이 모델은 문장을 단어의 빈도로 표현합니다.
모델 학습: 추출된 특성을 기반으로 모델을 학습시킵니다. 다양한 알고리즘을 사용할 수 있습니다. 예를 들어, Naive Bayes, SVM, 혹은 딥러닝 기반의 모델을 사용할 수 있습니다.
모델 평가: 학습한 모델을 평가하여 감성 분류의 정확도를 확인합니다. 일반적으로 데이터셋을 학습/테스트 셋으로 분할하고 테스트 셋에서의 정확도를 평가합니다.

다음은 예제 코드입니다:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# 데이터셋 로드
positive_sentences = ["This movie is great!", "I loved the acting in this film."]
negative_sentences = ["The plot was predictable.", "The actress gave a terrible performance."]

# 전처리
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return dict([(token, True) for token in filtered_tokens])

positive_features = [(preprocess(sentence), "positive") for sentence in positive_sentences]
negative_features = [(preprocess(sentence), "negative") for sentence in negative_sentences]

# 데이터셋 분할
split_ratio = 0.8
positive_cutoff = int(len(positive_features) * split_ratio)
negative_cutoff = int(len(negative_features) * split_ratio)

train_set = positive_features[:positive_cutoff] + negative_features[:negative_cutoff]
test_set = positive_features[positive_cutoff:] + negative_features[negative_cutoff:]

# 모델 학습
model = SklearnClassifier(MultinomialNB())
model.train(train_set)

# 모델 평가
accuracy = nltk.classify.accuracy(model, test_set)
print("Accuracy:", accuracy)

이 예제 코드에서는 NLTK를 사용하여 감성 분류를 수행하는 예시입니다. 데이터셋을 레이블링하고 전처리한 후, Bag-of-Words 기반의 특성을 추출하고 Multinomial Naive Bayes 모델을 사용하여 분류를 수행합니다.

참고 문서: