[python] NLTK를 사용해 텍스트 요약을 수행하는 방법은 무엇인가요?

30 Nov 2023

python

먼저, NLTK를 설치해야 합니다. 다음 명령어를 사용하여 NLTK를 설치합니다.

pip install nltk

NLTK에서는 다양한 요약 알고리즘을 제공합니다. 간단히 텍스트 문서를 요약하는 예제를 살펴보겠습니다.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

def summarize_text(text):
    # 텍스트 토큰화
    words = word_tokenize(text)
    sentences = sent_tokenize(text)

    # 불용어 제거
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word.casefold() not in stop_words]

    # 단어 빈도수 계산
    word_freq = nltk.FreqDist(words)
    
    # 문장 점수 계산
    sentence_scores = {}
    for sentence in sentences:
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word_freq.keys():
                if sentence not in sentence_scores.keys():
                    sentence_scores[sentence] = word_freq[word]
                else:
                    sentence_scores[sentence] += word_freq[word]

    # 상위 3개 문장 반환
    summary_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)
    summary = " ".join(summary_sentences)

    return summary

# 테스트를 위한 예제 텍스트
text = "Natural Language Processing (NLP) is a subfield of artificial intelligence concerned with the interaction between computers and human language. NLP techniques are used to analyze and understand text data. Text summarization is an important NLP task that involves condensing a large amount of text into a shorter version without losing the main ideas. NLTK provides various algorithms for text summarization. In this example, we will use NLTK to implement a simple text summarization algorithm."

# 텍스트 요약 실행
summary = summarize_text(text)
print(summary)

이 코드에서는 NLTK의 word_tokenize 및 sent_tokenize 함수를 사용하여 텍스트를 토큰화합니다. 불용어(stop words)를 제거하고, 단어의 빈도수를 계산하여 문장들에 점수를 매깁니다. 마지막으로, 상위 3개의 문장을 선택하여 요약을 생성합니다.

위 예제는 NLTK를 사용한 간단한 텍스트 요약 방법을 보여주고 있습니다. 다른 요약 알고리즘을 사용하려면 NLTK의 다른 함수와 클래스를 참조하십시오.

참고 자료: