[파이썬] nltk 원형 복원 (Lemmatization)

Lemmatization is the process of reducing a word to its base or root form. It helps in standardizing various forms of words and simplifying text analysis. In this blog post, we will explore how to perform lemmatization using the Natural Language Toolkit (NLTK) library in Python.

Installation

Before we start, make sure you have NLTK installed on your system. You can install it using the following command:

pip install nltk

Importing the necessary libraries

We need to import the nltk library and download the WordNet corpus, which is required for lemmatization.

import nltk
nltk.download('wordnet')

Lemmatization using NLTK

NLTK provides a class called WordNetLemmatizer which can be used for lemmatization. Let’s see how to use it:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "running"
lemma = lemmatizer.lemmatize(word, pos='v')

print("Word:", word)
print("Lemma:", lemma)

In the above example, we create an instance of the WordNetLemmatizer class. We then pass the word “running” and the part of speech tag (‘v’ for verb in this case) to the lemmatize method. The method returns the base form of the word, which in this case is “run”.

Lemmatization with different part of speech tags

NLTK supports lemmatization with different part of speech (POS) tags such as noun, verb, adjective, and adverb.

nouns = ["cat", "dogs", "mice", "feet"]
verbs = ["playing", "running", "ate", "driven", "washes"]

print("Nouns:")
for noun in nouns:
    lemma = lemmatizer.lemmatize(noun, pos='n')
    print(noun, " --> ", lemma)

print("\nVerbs:")
for verb in verbs:
    lemma = lemmatizer.lemmatize(verb, pos='v')
    print(verb, " --> ", lemma)

In the above example, we lemmatize a list of nouns and verbs. For nouns, we pass the POS tag ‘n’ to the lemmatize method, and for verbs, we pass the POS tag ‘v’. The output will show the original word and its lemmatized form.

Conclusion

Lemmatization is a useful technique for text normalization and analysis. In this blog post, we learned how to perform lemmatization using the NLTK library in Python. NLTK provides an easy-to-use WordNetLemmatizer class to accomplish this task. By leveraging lemmatization, we can simplify text processing and improve the accuracy of our text analysis tasks.