[파이썬] 불용어(stopwords) 개념과 예제

18 Aug 2023

gensim

불용어(stopwords)는 자연어 처리 작업에서 분석에 큰 의미가 없거나 너무 자주 나타나서 중요하지 않은 단어들을 말합니다. 이러한 단어들은 분석 결과를 개선하기 위해 제거하는 것이 일반적입니다. NLTK(Natural Language Toolkit) 라이브러리에서는 영어 불용어 목록을 제공하며, 이를 사용하여 텍스트 데이터에서 불용어를 제거할 수 있습니다.

아래는 간단한 불용어 제거 예제를 보여드리겠습니다.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## 불용어 목록 다운로드 (한 번만 실행)
import nltk
nltk.download('stopwords')

## 원본 텍스트
text = "Natural language processing is a field of artificial intelligence."

## 단어 토큰화
tokens = word_tokenize(text)

## 불용어 제거
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

## 결과 출력
print("Original Text:", text)
print("Tokens:", tokens)
print("Filtered Tokens (Without stopwords):", filtered_tokens)` 

위의 코드에서 set(stopwords.words('english'))를 사용하여 NLTK에서 제공하는 영어 불용어 목록을 불러옵니다. 그 다음 텍스트를 단어로 토큰화하고, 불용어를 제거한 후 결과를 출력합니다.

예제 출력:

`Original Text: Natural language processing is a field of artificial intelligence.
Tokens: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', '.']
Filtered Tokens (Without stopwords): ['Natural', 'language', 'processing', 'field', 'artificial', 'intelligence', '.']` 

불용어를 제거하면 분석에 불필요한 단어를 걸러내어 모델의 성능을 향상시킬 수 있습니다.