[파이썬] nltk 자동 요약

Natural Language Processing (NLP) is a field of study that focuses on the interactions between computers and humans using natural language. One of the tasks in NLP is text summarization, which involves condensing a larger text into a shorter, more concise form while retaining the key information. In this blog post, we will explore how to perform automatic text summarization using the Natural Language Toolkit (NLTK) library in Python.

Installation

Before we begin, make sure you have NLTK installed. You can install it using pip:

pip install nltk

Text Preprocessing

The first step in automatic text summarization is to preprocess the text. This involves removing any unnecessary characters, stopwords, and performing tokenization. Let’s see an example:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

def preprocess_text(text):
    # Tokenize the text into sentences and words
    sentences = sent_tokenize(text)
    words = [word.lower() for sentence in sentences for word in word_tokenize(sentence)]
    
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    filtered_words = [word for word in words if word not in stop_words and word.isalpha()]
    
    return filtered_words

# Example text
text = "This is an example sentence. It will be preprocessed for text summarization."

# Preprocess the text
preprocessed_text = preprocess_text(text)

In the code above, we first import necessary NLTK modules for sentence and word tokenization, as well as stopwords. We define a function preprocess_text that takes in a text, tokenizes it into sentences and words, and removes stopwords and non-alphabetic characters. The resulting preprocessed text is stored in the preprocessed_text variable.

Text Summarization

Once the text is preprocessed, we can move on to the text summarization step. There are different approaches to text summarization, such as extraction-based and abstraction-based. In this example, we will focus on an extraction-based approach using the NLTK library.

from nltk.probability import FreqDist
from heapq import nlargest

def summarize_text(text, n):
    # Preprocess the text
    preprocessed_text = preprocess_text(text)
    
    # Calculate the frequency distribution of words
    fdist = FreqDist(preprocessed_text)
    
    # Get the top n most frequent words
    top_n_words = nlargest(n, fdist, key=fdist.get)
    
    return " ".join(top_n_words)

# Example text
text = "This is an example sentence. It will be summarized using the extraction-based approach."

# Summarize the text
summary = summarize_text(text, 5)

In the code above, we define a function summarize_text that takes in a preprocessed text and the number of words to include in the summary. We calculate the frequency distribution of words in the preprocessed text using the FreqDist class from NLTK. Then, we get the top n most frequent words using the nlargest function from the heapq module. Finally, we join the top words together to form the summary.

Conclusion

In this blog post, we explored how to perform automatic text summarization using the NLTK library in Python. We learned about text preprocessing, including tokenization and removing stopwords. We also implemented an extraction-based approach to generate text summaries based on word frequency. Text summarization is a complex task with many different approaches, and NLTK provides a convenient and powerful toolset to help you get started.