[python] NLTK를 사용해 문서 주제 모델링을 수행하는 방법은 무엇인가요?

30 Nov 2023

python

NLTK 설치: NLTK를 설치하기 위해 터미널 창에서 다음 명령을 실행합니다.
```
pip install nltk
```
NLTK 코퍼스 다운로드: NLTK는 다양한 코퍼스 데이터를 제공합니다. 예를 들어, ‘reuters’ 코퍼스를 사용하여 문서 주제 모델링을 수행해보겠습니다. 다음 코드를 사용하여 코퍼스를 다운로드합니다.
```
import nltk
nltk.download('reuters')
```
텍스트 전처리: NLTK를 사용하여 텍스트 데이터를 전처리합니다. 문장 토큰화, 단어 토큰화, 불용어 제거 등의 작업을 수행할 수 있습니다. 예를 들어, 다음 코드는 문서들을 문장 단위로 토큰화합니다. ```python from nltk.corpus import reuters

doc_ids = reuters.fileids() # 모든 문서의 아이디 가져오기 docs = [reuters.raw(doc_id) for doc_id in doc_ids] # 각 문서를 가져와서 리스트에 저장

tokenizer = nltk.sent_tokenize tokenized_docs = [tokenizer(doc) for doc in docs] # 문서들을 문장 단위로 토큰화

4. 주제 모델링: 토큰화된 문서를 바탕으로 주제 모델링을 수행합니다. 대표적인 주제 모델링 알고리즘인 잠재 디리클레 할당(Latent Dirichlet Allocation, LDA)를 사용합니다. 다음 코드는 LDA 모델을 학습하고 주제를 추출하는 예입니다.
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 전처리 함수 정의
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    tokens = word_tokenize(text.lower())
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]

    return ' '.join(filtered_tokens)

# 텍스트 전처리 및 피처 벡터화
preprocessed_docs = [preprocess_text(doc) for doc in tokenized_docs]
vectorizer = CountVectorizer(max_features=1000)  # 상위 1000개의 단어만 사용
X = vectorizer.fit_transform(preprocessed_docs)

# LDA 모델 학습
n_topics = 10  # 주제 개수 지정
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(X)

# 주제 추출
feature_names = vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    print(f"주제 {topic_idx+1}")
    top_words = [feature_names[i] for i in topic.argsort()[:-10 - 1: -1]]
    print(top_words)

위의 코드는 NLTK를 사용하여 문서 주제 모델링을 수행하는 기본적인 방법을 보여줍니다. 자세한 내용은 NLTK의 문서와 LDA 알고리즘에 대해 추가적인 공부를 하시면 도움이 될 것입니다.