[Python] NLTK Naive Bayes Classifier

Introduction

The Naive Bayes classifier is a probabilistic machine learning model commonly used for text classification. It applies Bayes' theorem under the "naive" assumption that features are independent of one another given the class label. In this tutorial, we will implement a Naive Bayes classifier using the NLTK library in Python.
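To make the independence assumption concrete, here is a minimal sketch of the decision rule: the score for each label is its prior probability multiplied by the probability of each observed word given that label, and the scores are then normalized. The priors, likelihoods, and the `posterior` helper are all made up for illustration; they are not taken from NLTK or from any real corpus.

```python
# Toy illustration of the Naive Bayes decision rule.
# All probabilities below are invented for the example.

# Hypothetical prior probabilities P(label)
priors = {"pos": 0.5, "neg": 0.5}

# Hypothetical likelihoods P(word appears | label)
likelihoods = {
    "pos": {"great": 0.6, "boring": 0.1},
    "neg": {"great": 0.2, "boring": 0.5},
}

def posterior(observed_words):
    """Return P(label | observed_words) under the independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in observed_words:
            # Independence: multiply the per-word likelihoods together
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())
    # Normalize so the probabilities sum to 1
    return {label: score / total for label, score in scores.items()}

result = posterior(["great"])
print(result)  # "pos" comes out at roughly 0.75, "neg" at roughly 0.25
```

With these invented numbers, seeing the word "great" makes the positive label about three times as likely as the negative one.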

Step 1: Installing NLTK

If you haven’t installed the NLTK library already, you can use the following command to install it:

pip install nltk

Step 2: Importing the Necessary Libraries

First, we need to import the NLTK library and the necessary modules. Execute the following code:

import random

import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

Step 3: Preparing the Data

Next, we need to prepare our training data. We can use the NLTK movie reviews corpus for this purpose. It contains 2,000 movie reviews, 1,000 labeled positive and 1,000 labeled negative.

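# Download the corpus (only needed the first time)
nltk.download('movie_reviews')

# Pair each review's word list with its category ('pos' or 'neg')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so the train/test split is not ordered by category
random.shuffle(documents)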

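# Count how often each lowercased word occurs across the whole corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Select the 2000 most common words as features
# (most_common sorts by frequency; keys() alone gives no such guarantee)
word_features = [word for word, _ in all_words.most_common(2000)]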

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
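To see what a single feature set looks like, the following sketch applies the same idea to a three-word vocabulary. The names `toy_word_features`, `toy_document_features`, and `toy_review` are hypothetical stand-ins for the real 2000-word feature list and corpus documents.

```python
# Hypothetical, tiny vocabulary standing in for the 2000-word feature list
toy_word_features = ["great", "boring", "plot"]

def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    document_words = set(document)  # set membership tests are fast
    return {word: (word in document_words) for word in vocabulary}

# A made-up, already tokenized review
toy_review = ["the", "plot", "was", "great"]
toy_features = toy_document_features(toy_review, toy_word_features)
print(toy_features)
# {'great': True, 'boring': False, 'plot': True}
```

Each document becomes a dictionary with one boolean per vocabulary word, which is the input format NLTK classifiers expect.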

Step 4: Training the Classifier

Now we can train our Naive Bayes classifier on the feature sets we created. We use the first 1,500 feature sets for training and hold out the remaining 500 for testing.

train_set = featuresets[:1500]
test_set = featuresets[1500:]

classifier = NaiveBayesClassifier.train(train_set)
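Once trained, the classifier can label new feature dictionaries and report which features it found most discriminative. The sketch below uses a made-up four-example training set (the `toy_` names are hypothetical) so that it runs without downloading the corpus; the same `classify` and `show_most_informative_features` calls should work on the `classifier` trained above.

```python
from nltk.classify import NaiveBayesClassifier

# Invented training data: (feature dictionary, label) pairs
toy_train_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
]

toy_classifier = NaiveBayesClassifier.train(toy_train_set)

# Classify a new, unseen feature dictionary
toy_prediction = toy_classifier.classify({"great": True, "boring": False})
print(toy_prediction)  # 'pos' for this toy data

# Print the features whose presence or absence best separates the labels
toy_classifier.show_most_informative_features(2)
```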

Step 5: Evaluating the Classifier

Finally, let’s evaluate our classifier by measuring its accuracy on the test set.

print("Accuracy:", accuracy(classifier, test_set))

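The accuracy function returns the fraction of test examples the classifier labeled correctly, as a value between 0 and 1. Multiply it by 100 to get a percentage. Because the documents are shuffled randomly, the exact value varies from run to run.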
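As a rough illustration of the calculation, accuracy is the number of matching predictions divided by the total number of examples. The gold labels and predictions below are invented for the example.

```python
# Hypothetical gold labels and classifier predictions for four reviews
gold_labels = ["pos", "neg", "pos", "neg"]
predicted_labels = ["pos", "neg", "neg", "neg"]

# Count positions where the prediction matches the gold label
correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
manual_accuracy = correct / len(gold_labels)
print("Accuracy:", manual_accuracy)  # Accuracy: 0.75
```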

Summary

In this tutorial, we implemented a Naive Bayes classifier using the NLTK library in Python: we loaded the movie reviews corpus, turned each review into word-presence features, trained the classifier, and measured its accuracy on held-out data. Naive Bayes is simple and fast to train, and it is widely used for text classification tasks such as spam filtering, sentiment analysis, and document categorization.