[파이썬] 불용어(stopwords) 처리 방법

04 Sep 2023

python

불용어(stopwords)는 문장에서 일반적으로 많이 사용되는 단어로서, 문장의 의미를 해석하는 데는 도움이 되지 않습니다. 예를 들어 ‘the’, ‘is’, ‘and’와 같은 단어가 일반적인 불용어입니다.

불용어 처리는 자연어 처리(Natural Language Processing) 작업에서 중요한 단계 중 하나로, 올바른 단어 토큰화와 문장 분석을 위해 필요합니다. 이번 포스트에서는 Python에서 불용어를 처리하는 몇 가지 방법을 살펴보겠습니다.

NLTK(Natural Language Toolkit)를 활용한 불용어 처리

Python에서 가장 많이 사용되는 라이브러리 중 하나인 NLTK를 사용하여 불용어 처리를 할 수 있습니다. NLTK는 다양한 언어 처리 작업에 유용한 도구와 데이터를 제공합니다.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# 불용어 목록을 가져옵니다.
stop_words = set(stopwords.words('english'))

# 예시 문장
sentence = 'This is an example sentence, showing off the stop words removal.'

# 문장을 단어로 토큰화합니다.
words = word_tokenize(sentence)

# 불용어가 아닌 단어만 추출합니다.
filtered_words = [word for word in words if word.casefold() not in stop_words]

print(filtered_words)

위의 코드는 NLTK를 사용하여 영어 문장에서 불용어를 처리하는 예시입니다. 우선 nltk.download를 사용하여 불용어 목록과 토큰화를 위한 데이터를 다운로드합니다. 그리고 stopwords.words('english')를 통해 영어 불용어 목록을 가져옵니다.

문장을 토큰화한 후, filtered_words 리스트 컴프리헨션을 사용하여 불용어가 아닌 단어만 추출합니다. word.casefold()를 사용하여 대소문자를 구별하지 않고 불용어를 처리합니다.

sklearn의 CountVectorizer를 활용한 불용어 처리

또 다른 방법으로는 scikit-learn 라이브러리의 CountVectorizer를 사용하는 것입니다. CountVectorizer는 텍스트 데이터를 벡터로 변환하는 데 사용되며, 불용어 처리도 기본적으로 지원합니다.

from sklearn.feature_extraction.text import CountVectorizer

# 예시 문장
corpus = ['This is an example sentence, showing off the stop words removal.']

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())

위의 코드에서 CountVectorizer(stop_words='english')를 사용하여 불용어 처리를 할 수 있습니다. stop_words='english'는 영어 불용어 목록을 적용한다는 의미입니다.

fit_transform 메서드를 사용하여 문장을 벡터로 변환합니다. 마지막으로 vectorizer.get_feature_names()를 호출하여 변환된 벡터에 대응하는 단어 목록을 얻을 수 있습니다.

불용어 처리는 효과적인 자연어 처리를 위해 필수적인 단계입니다. Python에서는 NLTK나 scikit-learn과 같은 라이브러리를 활용하여 불용어를 처리할 수 있습니다. 이를 통해 더 정확하고 의미 있는 결과를 얻을 수 있습니다.