[python] NLTK를 사용해 문서의 중요도를 평가하는 방법은 무엇인가요?

30 Nov 2023

python

NLTK를 사용하여 문서의 중요도를 평가하는 과정은 다음과 같습니다:

1. 텍스트 전처리

먼저, 문서에서 불필요한 요소를 제거하고 텍스트를 클린하게 만들어야 합니다. 이 단계에는 문장 토큰화, 단어 토큰화, 스톱 워드 제거 등이 포함될 수 있습니다.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

def preprocess_text(text):
    # 문장 토큰화
    sentences = sent_tokenize(text)
    
    preprocessed_text = []
    for sentence in sentences:
        # 단어 토큰화
        words = word_tokenize(sentence)
        
        # 소문자로 변환
        words = [word.lower() for word in words]
        
        # 스톱 워드 제거
        words = [word for word in words if word not in set(stopwords.words('english'))]
        
        preprocessed_text.extend(words)
    
    return preprocessed_text

2. 키워드 추출 및 중요도 계산

NLTK는 다양한 방법을 제공하여 텍스트에서 키워드를 추출하고 중요도를 계산할 수 있습니다. 예를 들어, TF-IDF(Term Frequency-Inverse Document Frequency) 방법을 사용하여 단어의 출현 빈도와 문서 전체에서의 빈도를 고려하여 중요도를 결정할 수 있습니다.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize

def calculate_word_importance(text):
    # 전처리된 텍스트 가져오기
    preprocessed_text = preprocess_text(text)
    
    # 단어 빈도 계산
    fdist = FreqDist(preprocessed_text)
    
    # 문서에서 가장 빈도가 높은 단어 추출
    most_common_words = fdist.most_common(10)
    
    # 결과 출력
    for word, frequency in most_common_words:
        print(f"{word}: {frequency}")

위의 calculate_word_importance 함수는 전처리된 텍스트를 가져오고, 각 단어의 빈도를 계산하여 가장 빈도가 높은 단어를 추출합니다. 이것은 중요한 단어로 간주될 수 있습니다.

이러한 방법을 사용하여 NLTK를 통해 문서의 중요도를 평가할 수 있습니다. 그러나 중요도 평가는 응용 프로그램 및 특정 도메인에 따라 달라질 수 있으므로 독립적인 평가가 필요할 수 있습니다.