[python] NLTK를 사용해 테스트 데이터를 분류하는 방법은 무엇인가요?

30 Nov 2023

python

NLTK를 사용하여 텍스트를 분류하는 절차는 다음과 같습니다:

NLTK를 설치합니다.
```
pip install nltk
```

필요한 모듈을 임포트합니다.

import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

훈련 데이터와 테스트 데이터를 준비합니다. ```python
예시 훈련 데이터

positive_tweets = [(“I love this car”, “positive”), (“This view is amazing”, “positive”), (“I feel great”, “positive”)] negative_tweets = [(“I hate this car”, “negative”), (“I have a headache”, “negative”), (“I’m feeling sad”, “negative”)]

테스트 데이터

test_tweet = “I feel happy”

4. 훈련 데이터를 가공합니다. 이 단계에서는 단어를 토큰화하고, 불용어(stopwords)를 제거하는 등의 전처리 작업을 수행합니다.
```python
# 훈련 데이터 전처리
def preprocess(tweet):
    tokens = word_tokenize(tweet)
    tokens = [token.lower() for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stopwords.words("english")]
    return tokens

# 전처리된 훈련 데이터 생성
positive_tweets_processed = [(preprocess(tweet), sentiment) for tweet, sentiment in positive_tweets]
negative_tweets_processed = [(preprocess(tweet), sentiment) for tweet, sentiment in negative_tweets]

훈련 데이터를 기반으로 나이브 베이즈 분류기를 학습시킵니다. ```python
훈련 데이터 합치기

train_data = positive_tweets_processed + negative_tweets_processed

나이브 베이즈 분류기 학습

classifier = NaiveBayesClassifier.train(train_data)

6. 테스트 데이터를 분류합니다.
```python
# 테스트 데이터 전처리
test_tweet_processed = preprocess(test_tweet)

# 분류 결과 출력
classification = classifier.classify(dict([(token, True) for token in test_tweet_processed]))
print(f"Classification result: {classification}")

NLTK를 사용하여 텍스트 데이터를 분류하는 방법에 대한 간략한 예시를 제시했습니다. 더 복잡한 텍스트 분류 작업을 위해서는 데이터의 특성에 맞게 전처리 및 알고리즘을 설정해야합니다. NLTK에는 다양한 분류 알고리즘과 전처리 기능이 제공되므로, 필요에 따라 추가적인 공부와 적용이 필요할 수 있습니다.

더 자세한 내용은 NLTK 공식 문서를 참고하시기 바랍니다. NLTK 문서: http://www.nltk.org/

예시 훈련 데이터

테스트 데이터

훈련 데이터 합치기

나이브 베이즈 분류기 학습