[python] NLTK를 사용해 텍스트의 어깨동무 문제를 해결하는 방법은 무엇인가요?

30 Nov 2023

python

토큰화(Tokenization): NLTK는 단어나 문장과 같은 텍스트를 작은 단위인 토큰으로 분리할 수 있는 기능을 제공합니다. 이를 통해 어깨동무 문제가 발생하는 긴 문장이나 문서를 작은 단위로 나눌 수 있습니다.

import nltk
nltk.download('punkt')  # 토큰화를 위해 필요한 데이터를 다운로드

from nltk.tokenize import word_tokenize, sent_tokenize

sentence = "안녕하세요. 자연어 처리에 관심이 있으신가요?"
tokens = word_tokenize(sentence)  # 단어 토큰화
sentences = sent_tokenize(sentence)  # 문장 토큰화

print(tokens)
print(sentences)

정규화(Normalization): NLTK는 단어의 형태를 일반화하여 어깨동무 문제를 해결할 수 있습니다. 이를테면 단수와 복수 형태, 과거형과 현재형 등을 통합하여 처리할 수 있습니다.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["cats", "running", "better"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)

불용어 제거(Stopword removal): NLTK는 불용어라고 불리는 일반적인 단어들을 제거하여 어깨동무 문제를 완화할 수 있습니다. 이를 위해 불용어 리스트를 사용할 수 있습니다.

from nltk.corpus import stopwords

nltk.download('stopwords')  # 불용어를 다운로드

stopwords_list = stopwords.words('english')
text = "This is an example sentence showcasing the stopword removal."
cleaned_text = [word for word in text.split() if word.lower() not in stopwords_list]

print(cleaned_text)

NLTK를 사용해 텍스트의 어깨동무 문제를 해결하는 기능은 다양하게 제공됩니다. 이외에도 명사 추출, 형태소 분석, 문서 분류 등 다양한 기능을 통해 텍스트 처리 과정에서 발생하는 어깨동무 문제를 해결할 수 있습니다. NLTK의 공식 문서와 예제 코드들을 참고하면 더욱 자세한 정보를 얻을 수 있습니다.