[python] NLTK를 사용해 텍스트의 정제를 어떻게 할 수 있나요?

30 Nov 2023

python

텍스트 데이터는 다양한 형태의 불필요한 문자, 기호, 또는 불용어(stop words)를 포함할 수 있습니다. 이러한 불필요한 요소들을 제거하고, 텍스트를 정제하는 과정은 자연어 처리(NLP) 작업에서 매우 중요합니다. NLTK(Natural Language Toolkit)는 텍스트 정제를 위한 다양한 기능을 제공하는 Python 라이브러리입니다.

아래 예시 코드를 통해 NLTK를 사용해 텍스트를 정제하는 방법을 살펴보겠습니다:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def text_preprocessing(text):
    # 소문자로 변환
    text = text.lower()

    # 특수문자 제거
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # 토큰화
    tokens = word_tokenize(text)

    # 불용어 제거
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if not word in stop_words]

    # 어간 추출(Stemming)
    stemmer = nltk.stem.PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    # 정제된 텍스트 반환
    return ' '.join(tokens)

# 텍스트 정제 예시
text = "Hello, World! This is an example text for text preprocessing."
clean_text = text_preprocessing(text)
print(clean_text)

위 코드에서는 먼저 lower() 메서드를 사용하여 모든 텍스트를 소문자로 변환합니다. 그런 다음 re.sub() 메서드를 사용하여 특수 문자를 제거합니다. 그 후 word_tokenize() 함수를 사용하여 텍스트를 토큰화합니다. 다음으로, stopwords.words('english')를 이용하여 영어 불용어(stop words)를 로드하고, 이를 통해 토큰화된 단어 중 불용어를 제거합니다. 마지막으로, PorterStemmer()를 사용하여 어간 추출을 수행합니다.

예시 입력인 “Hello, World! This is an example text for text preprocessing.”은 정제된 후 “hello world exampl text text preprocess”로 출력됩니다.

이와 같은 텍스트 정제 과정을 통해 데이터를 청소하고, 불필요한 요소를 제거하여 자연어 처리 작업에 바로 적용할 수 있습니다. NLTK는 이외에도 다양한 기능을 제공하므로, 자세한 내용은 NLTK의 공식 문서를 참고하시기 바랍니다.

참고 문서

NLTK 공식 문서: https://www.nltk.org/
NLTK stop words: https://www.nltk.org/book/ch02.html#stopwords_index_term
NLTK stemmers: https://www.nltk.org/api/nltk.stem.html