[Python] Key Sentence Extraction with nltk

In Natural Language Processing (NLP), extracting the key or most important sentences from a text document is extremely useful: it lets you summarize the content or quickly identify its main ideas. A popular way to build such an extractor is with the nltk library in Python.

Installation

Before we start, make sure you have nltk installed on your system. You can install it using pip:

pip install nltk

Key Sentence Extraction with nltk

Let’s go through the steps to perform key sentence extraction using nltk in Python:

Step 1: Importing the required libraries

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

First, we import the nltk library and specifically import the sent_tokenize function from the nltk.tokenize module. We also need to download the required tokenizer data using nltk.download('punkt').

Step 2: Loading and Tokenizing the Text

To extract key sentences, we first need to load the text document and tokenize it into sentences. We can do this using the sent_tokenize function.

with open('input.txt', 'r') as file:
    text = file.read()
    
sentences = sent_tokenize(text)

In this example, we assume the text is stored in a file named input.txt. Adjust the code according to your requirements.

Step 3: Preprocessing

Before extracting key sentences, it’s usually helpful to preprocess the text. Depending on your use case, this can include removing stop words and punctuation or lowercasing the text.
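For example, here is a minimal preprocessing sketch that lowercases each sentence, tokenizes it into words, and drops English stop words and punctuation using nltk's stopwords corpus. The preprocess helper defined here is just for illustration (the scoring sketch in Step 4 reuses it).

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    # Lowercase, split into word tokens, then drop stop words and punctuation
    words = word_tokenize(sentence.lower())
    return [word for word in words if word not in stop_words and word not in string.punctuation]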

Step 4: Calculating Sentence Scores

The next step is to calculate scores for each sentence in the text. There are various algorithms and approaches available for this purpose, including term frequency-inverse document frequency (TF-IDF), TextRank, or a combination of different features.

# Your sentence scoring algorithm goes here
scores = calculate_sentence_scores(sentences)

This step involves implementing, or reusing, an algorithm that assigns each sentence a score reflecting its relevance or importance; a simple frequency-based sketch is shown below.
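As one concrete illustration, here is a minimal, frequency-based way calculate_sentence_scores could be written: each sentence is scored by summing the normalized document frequencies of its content words. This is only a sketch that reuses the preprocess helper from Step 3; TF-IDF or TextRank would be drop-in alternatives.

from collections import Counter

def calculate_sentence_scores(sentences):
    # Count how often each content word appears across the whole document
    word_freq = Counter()
    for sentence in sentences:
        word_freq.update(preprocess(sentence))

    # Normalize so the most frequent word contributes a weight of 1.0
    max_freq = max(word_freq.values()) if word_freq else 1

    # Score each sentence by the sum of its words' normalized frequencies
    scores = {}
    for sentence in sentences:
        scores[sentence] = sum(word_freq[word] / max_freq for word in preprocess(sentence))
    return scores

The result is a dictionary mapping each sentence to its score, which is what the selection step below expects.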

Step 5: Selecting Key Sentences

Finally, based on the scores calculated in the previous step, you can select the top N sentences as key sentences.

import heapq

num_sentences = 5
key_sentences = heapq.nlargest(num_sentences, sentences, key=scores.get)

Here, we select the top 5 sentences using the nlargest function from the heapq module. Adjust the num_sentences variable according to your requirements.

Step 6: Displaying the Key Sentences

for sentence in key_sentences:
    print(sentence)

Here we simply iterate over the key sentences and print them. You could instead store them in a separate file, display them on a webpage, or use them for any other purpose.
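For instance, to write them to a file (summary.txt is just a placeholder name):

with open('summary.txt', 'w') as file:
    file.write('\n'.join(key_sentences))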

Conclusion

In this blog post, we discussed the nltk library and how to use it for key sentence extraction in Python. nltk provides tools that make it easy to tokenize and preprocess text, calculate sentence scores, and select the key sentences. By combining these capabilities, you can quickly summarize text documents or identify their main ideas and key points.