[python] NLTK를 사용해 문서의 중심성을 계산하는 방법은 무엇인가요?

30 Nov 2023

python

NLTK(Natural Language Toolkit)는 파이썬에서 자연어 처리 작업을 수행하기 위해 사용되는 라이브러리입니다. NLTK를 사용하여 문서의 중심성을 계산하는 방법을 알아보겠습니다.

먼저, 문서의 중심성을 계산하기 위해 NetworkX 라이브러리를 설치해야 합니다. NetworkX는 그래프 이론 및 네트워크 분석을 위한 파이썬 패키지입니다. 설치하기 위해 아래 명령을 실행해주세요.

pip install networkx

그리고 NLTK 라이브러리도 필요합니다. NLTK를 설치하려면 다음 명령을 실행하세요.

pip install nltk

이제 NLTK를 사용하여 문서의 중심성을 계산하는 예제 코드를 작성해보겠습니다.

import nltk
import networkx as nx

# 문서 텍스트
document = "네트워크 분석은 다양한 사회, 기술 및 자연 시스템을 모델링하고 분석하는 도구입니다. 중심성은 네트워크의 노드가 얼마나 중요한지를 측정하는 지표입니다."

# 문서를 문장으로 분할
sentences = nltk.sent_tokenize(document)

# 문장에서 단어 토큰화
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# 단어에서 불용어 제거
stop_words = nltk.corpus.stopwords.words('english')
filtered_words = [[word for word in sentence if word.lower() not in stop_words] for sentence in tokenized_sentences]

# 문장간 유사도 계산
similarity_matrix = nx.adjacency_matrix(nx.Graph(filtered_words))

# 중심성 계산
centralities = nx.degree_centrality(similarity_matrix)

# 결과 출력
for sentence, centrality in zip(sentences, centralities):
    print(sentence)
    print("중심성:", centrality)
    print()

이 코드는 먼저 문서를 문장으로 분할하고, 각 문장을 단어로 토큰화합니다. 그런 다음 불용어를 제거하고 문장 간의 유사도를 계산하기 위해 그래프를 만듭니다. 마지막으로 중심성을 계산하고 결과를 출력합니다.

이 예제 코드를 실행해보면 문서의 각 문장의 중심성을 확인할 수 있습니다.