[파이썬] nltk 어간 추출 (Stemming)

In natural language processing (NLP), stemming is the process of reducing words to their base or root form. The base form, or stem, can be used to identify words that have the same meaning, despite their different inflections.

NLTK (Natural Language Toolkit) is a popular library in Python that provides tools for textual data processing, including stemming functionality. NLTK offers various stemming algorithms, such as the Porter Stemmer and the Snowball Stemmer.

Installing NLTK

To use NLTK for stemming, you first need to install the library. Open your terminal or command prompt and enter the following command:

pip install nltk

This command will install the NLTK library on your system.

Stemming with NLTK

Now that you have NLTK installed, let’s perform stemming on some words using NLTK’s Porter Stemmer algorithm:

import nltk
from nltk.stem import PorterStemmer

# Initialize the PorterStemmer
stemmer = PorterStemmer()

# Perform stemming on words
word_list = ["running", "runs", "ran", "runner", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in word_list]

print(stemmed_words)

When you run the above code, you will get the following output:

['run', 'run', 'ran', 'runner', 'easili', 'fairli']

As you can see, the words have been stemmed to their base form. The words “running” and “runs” have been reduced to “run”, “easily” has been reduced to “easili”, and “fairly” has been reduced to “fairli”.

Conclusion

Stemming plays an important role in various NLP tasks, such as text preprocessing, information retrieval, and sentiment analysis. NLTK provides powerful stemming algorithms that allow you to perform stemming on words and reduce them to their base form.

By using the NLTK library and its stemming functionality, you can simplify textual data analysis and improve the accuracy of your NLP models.