[python] Text Classification

12 Dec 2023

python

In this blog post, we will explore text classification in Python. Text classification is the process of categorizing texts into different predefined classes or categories. This is a common task in natural language processing (NLP) and is used in various applications such as spam filtering, sentiment analysis, and topic labeling.

What is Text Classification?

Text classification, also known as text categorization, is the process of assigning tags or categories to text data according to its content. The goal of text classification is to automatically classify text documents into one or more predefined categories based on their content.

Text Classification in Python

Python provides several libraries and tools that can be used for text classification tasks. Some of the popular libraries for text classification in Python include:

NLTK (Natural Language Toolkit): NLTK is a leading platform for building Python programs to work with human language data.
scikit-learn: Scikit-learn is a popular machine learning library in Python that provides simple and efficient tools for data mining and data analysis.
TensorFlow: TensorFlow is an open-source machine learning library developed by Google for building and training machine learning models.

Using NLTK for Text Classification

NLTK provides various tools and algorithms that can be used for text classification tasks. Here’s an example of how to use NLTK for text classification:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import random

# Sample text data
documents = [
    (['hello', 'world'], 'greeting'),
    (['machine', 'learning', 'is', 'fun'], 'positive'),
    (['nlp', 'is', 'challenging'], 'negative'),
    # Add more sample data
]

# Preprocessing the text data
all_words = []
for doc in documents:
    all_words += doc[0]

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

# Feature extraction
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

# Create feature sets
featuresets = [(find_features(rev), category) for (rev, category) in documents]
random.shuffle(featuresets)

# Training and testing the classifier
training_set = featuresets[:len(featuresets)//2]
testing_set = featuresets[len(featuresets)//2:]

classifier = nltk.NaiveBayesClassifier.train(training_set)

# Test the classifier
print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)

In this example, we use NLTK to preprocess the text data, create features, and train a Naive Bayes classifier for text classification.

Using scikit-learn for Text Classification

Scikit-learn provides a wide range of tools for text classification, including various algorithms and preprocessing techniques. Here’s an example of how to use scikit-learn for text classification:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn import metrics

# Sample text data
texts = ['hello world', 'machine learning is fun', 'nlp is challenging']
categories = ['greeting', 'positive', 'negative']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, categories, test_size=0.25, random_state=42)

# Define a pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

# Test the model
predicted = model.predict(X_test)
print(metrics.classification_report(y_test, predicted))

In this example, we use scikit-learn to preprocess the text data, create a pipeline with a CountVectorizer and a Multinomial Naive Bayes classifier, and then train and test the model for text classification.

Conclusion

In this blog post, we have explored text classification in Python using NLTK and scikit-learn. Text classification is an essential task in natural language processing, and Python provides powerful tools and libraries for implementing text classification algorithms.

Text classification is a fundamental task in the field of natural language processing (NLP), and Python provides a rich ecosystem of libraries and tools for building and evaluating text classification models.

If you are interested in learning more about text classification, I would recommend checking out the official documentation and tutorials for NLTK and scikit-learn.

References:

NLTK: https://www.nltk.org/
scikit-learn: https://scikit-learn.org/stable/
TensorFlow: https://www.tensorflow.org/