Introduction
Naive Bayes classifier is a probabilistic machine learning model that is commonly used for text classification. It is based on the Bayes’ theorem and the assumption of independence between features. In this tutorial, we will explore how to implement the Naive Bayes classifier using the NLTK library in Python.
Step 1: Installing NLTK
If you haven’t installed the NLTK library already, you can use the following command to install it:
pip install nltk
Step 2: Importing the Necessary Libraries
First, we need to import the NLTK library and the necessary modules. Execute the following code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
Step 3: Preparing the Data
Next, we need to prepare our training data. We can use the NLTK movie reviews corpus for this purpose. The movie reviews corpus contains 2000 movie reviews, classified into positive and negative reviews.
nltk.download('movie_reviews')
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = []
for w in movie_reviews.words():
all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
# Select the top 2000 most common words as features
word_features = list(all_words.keys())[:2000]
def document_features(document):
document_words = set(document)
features = {}
for word in word_features:
features[word] = (word in document_words)
return features
featuresets = [(document_features(d), c) for (d,c) in documents]
Step 4: Training the Classifier
Now, we can train our Naive Bayes classifier using the featuresets we created.
train_set = featuresets[:1500]
test_set = featuresets[1500:]
classifier = NaiveBayesClassifier.train(train_set)
Step 5: Evaluating the Classifier
Finally, let’s evaluate our classifier by measuring its accuracy on the test set.
print("Accuracy:", accuracy(classifier, test_set))
The accuracy represents the percentage of correct predictions made by the classifier.
Summary
In this tutorial, we learned how to implement the Naive Bayes classifier using the NLTK library in Python. The Naive Bayes classifier is a powerful tool for text classification tasks. It is widely used in various applications such as spam filtering, sentiment analysis, and document categorization.