[파이썬] 텍스트 데이터의 토큰화 개념과 예제

18 Aug 2023

gensim

텍스트 데이터의 토큰화(tokenization)는 텍스트를 작은 단위로 나누는 작업을 의미합니다. 이 작은 단위를 토큰(token)이라고 하며, 일반적으로 단어 단위로 나누는 것이 일반적입니다. 토큰화는 자연어 처리 작업의 첫 단계 중 하나로, 텍스트를 분석 가능한 형태로 변환하는 중요한 과정입니다.

아래는 간단한 텍스트 토큰화 예제를 보여드리겠습니다.

from nltk.tokenize import word_tokenize

## 원본 텍스트
text = "Natural language processing is a field of artificial intelligence."

## 단어 토큰화
tokens = word_tokenize(text)

## 토큰 출력
print("Original Text:", text)
print("Tokens:", tokens)` 

위의 코드에서 word_tokenize() 함수는 NLTK(Natural Language Toolkit) 라이브러리를 사용하여 텍스트를 단어로 토큰화합니다.

예제 출력:

Original Text: Natural language processing is a field of artificial intelligence.
Tokens: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', '.']` 

위의 예제에서는 원본 텍스트를 공백과 문장 부호를 기준으로 단어로 토큰화하여 출력하였습니다. 이렇게 텍스트를 작은 단위로 나누는 것은 자연어 처리 작업의 다음 단계에서 단어 간의 관계를 파악하거나 문서 구조를 이해하는 데 도움이 됩니다.