[Python] Feature Engineering with `nltk`

Introduction to nltk

nltk (the Natural Language Toolkit) is a widely used library for natural language processing in Python. It provides tools for tokenization, tagging, and working with text corpora, which we can leverage for feature extraction and engineering tasks.
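
The tokenizer, stop word, and tagger examples below rely on NLTK data packages that are not bundled with the library itself. As a rough sketch (exact resource names can differ slightly between NLTK releases), they can be downloaded once with nltk.download:

import nltk

# Tokenizer models used by word_tokenize (very recent releases may also need 'punkt_tab')
nltk.download('punkt')

# English stop word list used for text normalization
nltk.download('stopwords')

# Tagger model used by pos_tag (may be named 'averaged_perceptron_tagger_eng' in newer releases)
nltk.download('averaged_perceptron_tagger')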

Basics of Feature Engineering with nltk

Tokenization

Tokenization is the process of breaking text into individual words or tokens. nltk provides a tokenizer module that allows us to split text into tokens. Let’s take a look at an example:

import nltk
from nltk.tokenize import word_tokenize

text = "This is an example sentence for tokenization."

# word_tokenize splits the text into words and punctuation symbols
tokens = word_tokenize(text)

print(tokens)

Output:

['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']
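
Once the text is tokenized, even simple statistics over the tokens can serve as features. The snippet below is just an illustrative sketch; the feature names are made up for this example:

from nltk.tokenize import word_tokenize

text = "This is an example sentence for tokenization."
tokens = word_tokenize(text)

# Hand-crafted features derived from the token list (names are illustrative only)
features = {
    "num_tokens": len(tokens),                                      # number of tokens in the sentence
    "avg_token_length": sum(len(t) for t in tokens) / len(tokens),  # mean token length in characters
}

print(features)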

Text Normalization

Text normalization is the process of converting text into a standard format. It includes tasks such as removing punctuation, converting to lowercase, and removing stop words. nltk provides convenient methods for performing these tasks. Here’s an example:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example sentence for text normalization."

tokens = word_tokenize(text)

# Lowercase the tokens and drop anything that is not purely alphabetic (e.g. punctuation)
text_lower = [token.lower() for token in tokens if token.isalpha()]

# Remove common English stop words such as "this", "is", and "for"
stop_words = set(stopwords.words('english'))
text_normalized = [token for token in text_lower if token not in stop_words]

print(text_normalized)

Output:

['example', 'sentence', 'text', 'normalization']
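
In practice it is convenient to bundle these steps into a single helper. The normalize_text function below is a hypothetical name, not part of nltk; it simply combines the tokenization, lowercasing, punctuation filtering, and stop word removal shown above:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))

def normalize_text(text):
    """Tokenize, lowercase, drop punctuation, and remove English stop words."""
    tokens = word_tokenize(text)
    lowered = [token.lower() for token in tokens if token.isalpha()]
    return [token for token in lowered if token not in STOP_WORDS]

print(normalize_text("This is an example sentence for text normalization."))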

Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the process of assigning a grammatical label to each word in a sentence. nltk provides a POS tagging module that allows us to perform this task. Let’s see an example:

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "This is an example sentence for POS tagging."

tokens = word_tokenize(text)

# pos_tag assigns a Penn Treebank tag (e.g. DT, NN, VBZ) to each token
pos_tags = pos_tag(tokens)

print(pos_tags)

Output:

[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('for', 'IN'), ('POS', 'NNP'), ('tagging', 'VBG'), ('.', '.')]
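
The tags themselves can be turned into features, for example by counting how often each part of speech occurs. The sketch below is one possible approach; grouping tags by their first letter into coarse classes (N for nouns, V for verbs, and so on) is just an illustrative choice:

from collections import Counter

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

text = "This is an example sentence for POS tagging."
pos_tags = pos_tag(word_tokenize(text))

# Count the full Penn Treebank tags ...
tag_counts = Counter(tag for _, tag in pos_tags)

# ... or collapse them into coarse classes by their first letter
coarse_counts = Counter(tag[0] for _, tag in pos_tags)

print(tag_counts)
print(coarse_counts)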

Text Vectorization

Text vectorization is the process of converting text into numerical vectors that machine learning algorithms can work with. While nltk handles the text processing side, the vectorization itself is usually done with scikit-learn, whose CountVectorizer and TfidfVectorizer convert text into bag-of-words or TF-IDF representations.

Here’s an example using CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() replaces the older get_feature_names(), which was removed in recent scikit-learn releases
print(vectorizer.get_feature_names_out().tolist())

# Each row is the bag-of-words count vector for one document
print(X.toarray())

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]]
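
TfidfVectorizer works the same way but down-weights terms that appear in many documents, so words like "this" and "is" contribute less than distinctive ones. A minimal sketch on the same corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out().tolist())

# Each row is the TF-IDF weighted vector for one document
print(X.toarray().round(2))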

Conclusion

Feature engineering is crucial for improving the performance of machine learning models. In this blog post, we explored some basic feature engineering techniques using the nltk library in Python, together with scikit-learn for vectorization. nltk provides a wide range of functionalities for text processing, making it a valuable tool in any natural language processing project.