[python] NLTK를 사용해 텍스트 분류를 수행하는 방법은 무엇인가요?

30 Nov 2023

python

먼저, NLTK를 설치해야 합니다. 아래의 명령을 사용하여 NLTK를 설치할 수 있습니다.

pip install nltk

NLTK를 설치했다면, 다음과 같이 필요한 모듈을 가져옵니다.

import nltk
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentIntensityAnalyzer

이제 영화 리뷰 데이터셋을 로딩하고 텍스트를 토큰화하여 전처리합니다.

# 리뷰 데이터셋 로딩
nltk.download('movie_reviews')

# 리뷰 데이터셋에서 긍정/부정 리뷰를 로딩하여 문서와 레이블로 분리
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# 문서를 토큰화하여 전처리
all_words = []
for words, _ in documents:
    all_words.extend(words)

# 단어들의 빈도수를 계산하여 피처를 생성
word_freq = nltk.FreqDist(all_words)
word_features = list(word_freq.keys())[:5000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# 피처를 사용하여 훈련 데이터셋과 테스트 데이터셋을 생성
featuresets = [(document_features(doc), category) for (doc, category) in documents]
train_set = featuresets[:1500]
test_set = featuresets[1500:]

이제 생성된 훈련 데이터셋을 바탕으로 Naive Bayes 분류기를 학습시키고, 테스트 데이터셋에서의 성능을 평가해보겠습니다.

# Naive Bayes 분류기 학습
classifier = NaiveBayesClassifier.train(train_set)

# 테스트 데이터셋에서의 성능 평가
accuracy = nltk.classify.accuracy(classifier, test_set)
print('Accuracy:', accuracy)

또한 NLTK의 SentimentIntensityAnalyzer를 사용하면 텍스트의 감성을 추정할 수도 있습니다.

# SentimentIntensityAnalyzer 생성
sia = SentimentIntensityAnalyzer()

# 텍스트의 감성 추정
sentiment_scores = sia.polarity_scores("This movie is very bad.")
print(sentiment_scores)

위의 코드를 사용하여 NLTK를 활용해 텍스트 분류를 수행할 수 있습니다. 자세한 내용은 NLTK 공식 문서(https://www.nltk.org/)를 참조하시기 바랍니다.