[파이썬] nltk 다중 라벨 텍스트 분류

Introduction

The Natural Language Toolkit (nltk) is a powerful library for text processing and analysis in Python. One common task in natural language processing (NLP) is text classification, where we categorize text documents into predefined labels or classes. While nltk provides various techniques for binary text classification, it also supports multiple label text classification. In this blog post, we will explore how to perform multiple label text classification using nltk in Python.

Prerequisites

To follow along with the examples in this blog post, you need to have the following installed:

Data Preparation

Before we dive into text classification, we need some labeled text data. For multiple label text classification, each document in the dataset can belong to one or more labels. Let’s say we have a dataset of customer reviews categorized into different product categories like “electronics”, “clothing”, and “books”.

Each review will have multiple labels associated with it. For example, a review for an electronics product may have labels like “electronics” and “positive”, indicating both the category and sentiment of the review.

Text Preprocessing

To prepare our text data for classification, we need to perform some preprocessing steps such as tokenization, removing stopwords, and stemming. The nltk library provides various functions to perform these tasks.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # Stem the tokens
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    
    return stemmed_tokens

Text Classification

Once our text data is preprocessed, we can proceed with the text classification task. To perform multiple label text classification, we can use binary relevance or classifier chains approach.

In the binary relevance approach, we train a separate binary classifier for each label independently. Each classifier predicts whether a document belongs to a particular label or not. We can use popular machine learning algorithms such as Naive Bayes, Support Vector Machines (SVM), or Random Forests for binary classification.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Initialize the classifier
classifier = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=preprocess_text)),
    ('clf', OneVsRestClassifier(LinearSVC()))
])

# Train the classifier
classifier.fit(train_data, train_labels)

# Predict labels for new documents
predictions = classifier.predict(test_data)

Evaluation

To evaluate the performance of our text classification model, we can use metrics like accuracy, precision, recall, and F1-score. These metrics provide insights into how well our model is performing on the given test dataset.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate accuracy
accuracy = accuracy_score(test_labels, predictions)

# Calculate precision
precision = precision_score(test_labels, predictions, average='micro')

# Calculate recall
recall = recall_score(test_labels, predictions, average='micro')

# Calculate F1-score
f1 = f1_score(test_labels, predictions, average='micro')

Conclusion

The nltk library in Python offers powerful tools for performing multiple label text classification. By following the steps outlined in this blog post, you can preprocess your text data, train a text classification model using binary relevance or classifier chains, and evaluate its performance using various metrics.

Keep in mind that the success of your text classification task heavily depends on the quality of the labeled data and the preprocessing techniques used. Experiment with different algorithms and preprocessing steps to find the best combination for your specific use case.