[python] NLTK를 사용해 텍스트의 특정 패턴을 제거하는 방법은 무엇인가요?

30 Nov 2023

python

NLTK를 설치한 후에, 먼저 필요한 패키지를 임포트합니다.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

NLTK에서는 영어의 일반적인 불용어(stopwords) 목록을 제공합니다. 불용어는 텍스트에서 자주 나타나지만 실질적인 의미가 없는 단어입니다. 이러한 불용어를 제거하면 텍스트의 특정 패턴을 정제할 수 있습니다.

불용어 목록을 다운로드하고, 텍스트를 토큰화한 후 불용어를 제거하는 예제 코드를 작성해보겠습니다.

nltk.download('stopwords')

def remove_pattern(text):
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    filtered_text = [word for word in tokens if not word in stop_words]
    return " ".join(filtered_text)

위의 코드에서 remove_pattern 함수는 인자로 전달된 텍스트에서 불용어를 제거한 새로운 텍스트를 반환합니다. 이 함수를 사용하려면 아래와 같이 호출하면 됩니다.

text = "This is an example sentence. We want to remove certain patterns from it."
clean_text = remove_pattern(text)
print(clean_text)

실행 결과는 다음과 같습니다.

This example sentence . We want remove certain patterns .

위의 예제에서는 영어 불용어를 사용하였지만, NLTK는 다른 언어에 대한 불용어 목록도 제공하므로 해당 언어에 맞게 사용할 수 있습니다.

더 자세한 내용은 NLTK 문서를 참조하세요.