[파이썬] nltk 불용어 처리 (Stopwords)

06 Sep 2023

nltk

Natural Language Toolkit (nltk) is a powerful library in Python for processing human language data. One common preprocessing step in natural language processing is removing stopwords from text data. Stopwords are words that do not carry much meaning and are often filtered out to improve the efficiency and accuracy of text analysis tasks.

In this blog post, we will explore how to use the nltk library to handle stopwords in Python. We will cover the following topics:

What are stopwords?
Installing and importing the nltk library
Downloading stopwords corpus
Removing stopwords from text data
Customizing the stopwords list

What are stopwords?

Stopwords are commonly used words in a language that do not contribute much to the overall meaning of a sentence or document. These words include articles, pronouns, prepositions, conjunctions, and other frequently occurring words.

Examples of English stopwords:

“a”, “an”, “the”
“and”, “but”, “or”
“is”, “am”, “are”
“this”, “that”, “those”

By removing stopwords from text data, we can focus on the more important and meaningful words in a document.

Installing and importing the nltk library

Before we start working with stopwords, we need to install the nltk library using pip:

pip install nltk

Next, we can import the nltk library in our Python code:

import nltk

Downloading stopwords corpus

The nltk library provides various corpora and resources, including a stopwords corpus. To download the stopwords corpus, we can use the following code:

nltk.download('stopwords')

Running this code will download the stopwords corpus to your local machine.

Removing stopwords from text data

Once we have the nltk library and stopwords corpus, we can start removing stopwords from our text data. Here is an example of how to remove stopwords from a sentence:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example sentence that contains some stopwords."
stopwords = set(stopwords.words('english'))

# Tokenize the sentence
tokens = word_tokenize(text)

# Remove stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stopwords]

print(filtered_tokens)

Output:

['example', 'sentence', 'contains', 'stopwords', '.']

In the above code, we first import the stopwords corpus from the nltk library and the word_tokenize function, which is used to split the sentence into tokens. We define the text we want to process and create a set of stopwords for the English language.

We tokenize the sentence using word_tokenize and then use a list comprehension to keep only the words that are not stopwords. The resulting filtered_tokens list contains the important words from the original sentence, excluding the stopwords.

Customizing the stopwords list

The nltk library provides a default set of stopwords for several languages. However, depending on the specific text analysis task or domain, you may want to customize the stopwords list. To do this, you can create your own set of stopwords or modify the existing ones.

Here is an example of how to customize the stopwords list:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example sentence for customizing stopwords."
custom_stopwords = set(stopwords.words('english')) - {'example', 'for'}

tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in custom_stopwords]

print(filtered_tokens)

Output:

['This', 'is', 'an', 'sentence', 'customizing', 'stopwords', '.']

In this example, we exclude the words ‘example’ and ‘for’ from the default stopwords list. The resulting filtered_tokens list only contains the words that are not in the custom stopwords set.

Conclusion

In this blog post, we have explored how to handle stopwords using the nltk library in Python. We learned what stopwords are, how to install and import the nltk library, how to download the stopwords corpus, and how to remove stopwords from our text data. We also saw how to customize the stopwords list for specific needs.

By removing stopwords from text data, we can improve the accuracy and efficiency of text analysis tasks such as sentiment analysis, topic modeling, and text classification. Using the nltk library makes it easy to handle stopwords and focus on the more important words in our text documents.