[파이썬] `gensim`에서의 정규화

06 Sep 2023

gensim

정규화는 텍스트 데이터의 전처리 단계로, 단어를 일정한 형식으로 변환하여 데이터를 단순화하고 일반화하는 과정입니다. 다양한 정규화 기법이 있지만, 여기서는 주로 토큰화, 소문자 변환, 특수 문자 제거 등을 다루겠습니다.

먼저, gensim을 설치하고 로드해야 합니다. 아래의 코드를 사용하여 설치할 수 있습니다:

!pip install gensim

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, preprocess_string

# 입력 텍스트
text = "This is an example sentence. It includes punctuation marks!"

# 텍스트 정규화
tokens = simple_preprocess(text)
lowercase = [token.lower() for token in tokens]
stopwords_removed = remove_stopwords(lowercase)
punctuation_stripped = strip_punctuation(stopwords_removed)

# 결과 출력
print(punctuation_stripped)

위의 코드에서 simple_preprocess 함수는 텍스트를 작은 단위로 분리합니다. remove_stopwords 함수는 불용어를 제거하고, strip_punctuation 함수는 문장부호를 제거합니다.

정규화된 결과는 다음과 같습니다:

['example', 'sentence', 'includes', 'punctuation', 'marks']

gensim을 사용하여 텍스트 데이터를 정규화함으로써 데이터의 일관성을 유지하고, 불필요한 정보를 제거하여 모델 성능을 향상시킬 수 있습니다. 이를 통해 토픽 모델링, 문서 유사도 계산 등의 다양한 자연어 처리 작업을 보다 효과적으로 수행할 수 있습니다.

참고: gensim documentation을 참조하여 더 많은 기능과 사용 방법을 알아보세요.