[Python] Text Clustering with NLTK

In natural language processing (NLP), text clustering is the task of grouping similar documents together based on their content. This technique can be useful for various applications, such as document organization, information retrieval, and recommendation systems.

In this blog post, we will explore how to perform text clustering using the Natural Language Toolkit (NLTK) library in Python. NLTK is a popular library for NLP tasks, providing a wide range of functionalities and algorithms.

Installing the NLTK library

Before we get started, make sure you have NLTK installed on your machine. You can install it using pip:

pip install nltk

Once NLTK is installed, we need to download a couple of additional resources: the stopword list and the Punkt tokenizer models. We can do this by running the following code:

import nltk

nltk.download('stopwords')
nltk.download('punkt')
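
Note that on recent NLTK releases, word_tokenize loads the punkt_tab resource instead, so if you hit a LookupError mentioning punkt_tab, run nltk.download('punkt_tab') as well.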

Preprocessing the text data

To perform text clustering, we first need to preprocess the text data. This involves cleaning and transforming the raw text into a format suitable for clustering.

The preprocessing steps typically include:

  1. Tokenization: Splitting the text into individual words or tokens.
  2. Lowercasing: Converting the text to lowercase to ensure case-insensitive clustering.
  3. Stopword removal: Removing common words that do not carry much information, such as “the” and “is”.
  4. Stemming or Lemmatization: Reducing words to their base or root form to handle different word variations.

Let’s see an example of how to preprocess the text data using NLTK:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Define the stopword set and the stemmer
# (named stop_words so it does not shadow the imported stopwords module)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Example text
text = "This is an example sentence for text clustering."

# Tokenize the text
tokens = word_tokenize(text)

# Keep alphabetic tokens only (dropping punctuation), remove stopwords, and stem
filtered_tokens = [
    stemmer.stem(token.lower())
    for token in tokens
    if token.isalpha() and token.lower() not in stop_words
]

print(filtered_tokens)

The output of this code will be:

['exampl', 'sentenc', 'text', 'cluster']
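
If you prefer lemmatization over stemming (step 4 above), NLTK’s WordNetLemmatizer can be swapped in to produce dictionary forms instead of truncated stems. A minimal sketch, assuming the wordnet resource has been downloaded via nltk.download('wordnet'):

from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Reuse the tokens from the example above, lemmatizing instead of stemming
lemmas = [
    lemmatizer.lemmatize(token.lower())
    for token in tokens
    if token.isalpha() and token.lower() not in stop_words
]

print(lemmas)  # ['example', 'sentence', 'text', 'clustering']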

Vectorizing the text data

After preprocessing the text data, we need to represent each document as a numerical vector. This allows us to apply clustering algorithms that operate on numerical data.

One popular approach for text vectorization is term frequency-inverse document frequency (TF-IDF). This technique calculates a score that represents how important a word is to a document in a collection.
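
Concretely, the score multiplies how often a term occurs in a document (term frequency) by a penalty for terms that appear in many documents (inverse document frequency). Here is a minimal sketch of the classic textbook formulation; note that scikit-learn’s TfidfVectorizer, used below, applies a smoothed IDF and L2 normalization, so its numbers differ slightly:

import math

def tf_idf(term, document, corpus):
    # Term frequency: share of the document's tokens that are this term
    tf = document.count(term) / len(document)
    # Inverse document frequency: log of (total docs / docs containing the term)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["text", "clustering"], ["text", "mining"], ["image", "processing"]]
print(tf_idf("clustering", corpus[0], corpus))  # 0.5 * ln(3/1), about 0.549
print(tf_idf("text", corpus[0], corpus))        # 0.5 * ln(3/2), about 0.203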

scikit-learn provides a TfidfVectorizer class that can be used to perform TF-IDF vectorization, and we can plug NLTK’s stopword list into it. Here’s an example:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the stopword list
# (scikit-learn may warn that a few NLTK stop words, such as contractions,
#  do not match its tokenizer; the warning is harmless here)
stop_words = list(stopwords.words('english'))

# Example documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Perform vectorization
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectors = vectorizer.fit_transform(documents)

print(vectors.toarray())

The output of this code will be a matrix where each row represents a document and each column represents a term from the vocabulary (here, in alphabetical order: document, first, one, second, third):

[[0.62922751 0.77722116 0.         0.         0.        ]
 [0.78722298 0.         0.         0.61666846 0.        ]
 [0.         0.         0.70710678 0.         0.70710678]
 [0.62922751 0.77722116 0.         0.         0.        ]]
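
To see which column corresponds to which term, ask the vectorizer for its vocabulary (get_feature_names_out is available from scikit-learn 1.0; older versions use get_feature_names):

print(vectorizer.get_feature_names_out())
# ['document' 'first' 'one' 'second' 'third']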

Applying clustering algorithms

Once we have the vectorized representation of the text data, we can apply various clustering algorithms to group similar documents together.

NLTK provides an implementation of the k-means clustering algorithm (KMeansClusterer), which is widely used for text clustering and works well with cosine distance. Here’s an example of how to use it:

import numpy as np
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance

# Example vectors (the TF-IDF matrix from the previous section;
# in practice you would pass vectors.toarray() directly)
vectors = np.array([
    [0.62922751, 0.77722116, 0.0, 0.0, 0.0],
    [0.78722298, 0.0, 0.0, 0.61666846, 0.0],
    [0.0, 0.0, 0.70710678, 0.0, 0.70710678],
    [0.62922751, 0.77722116, 0.0, 0.0, 0.0],
])

# Define the number of clusters
num_clusters = 2

# Perform clustering with cosine distance
# (avoid_empty_clusters guards against an unlucky random initialization)
clusterer = KMeansClusterer(num_clusters, distance=cosine_distance, avoid_empty_clusters=True)
clusters = clusterer.cluster(vectors, assign_clusters=True)

print(clusters)

The output will be a list with one cluster label per document. The labels themselves are arbitrary and can change between runs, since k-means starts from randomly chosen means, but documents 1, 2, and 4 (the ones sharing the term “document”) will typically land in the same cluster:

[0, 0, 1, 0]
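
With the assignments in hand, it is straightforward to group the original documents by cluster, assuming the documents list from the vectorization section is still in scope:

# Print the documents grouped by their assigned cluster label
for cluster_id in range(num_clusters):
    print(f"Cluster {cluster_id}:")
    for doc, label in zip(documents, clusters):
        if label == cluster_id:
            print(f"  {doc}")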

Conclusion

In this blog post, we learned how to perform text clustering in Python. We covered the preprocessing steps to clean and transform the text data with NLTK, vectorized the documents using scikit-learn’s TF-IDF implementation, and applied NLTK’s k-means clusterer.

Text clustering is a powerful technique that can help in organizing and extracting insights from large collections of text data. With NLTK’s handy functionalities, it becomes easier to implement and experiment with different clustering approaches.