[파이썬] 텍스트 요약과 추출적 요약

In natural language processing, text summarization is the task of automatically generating a concise and coherent summary of a given text. There are two main approaches to text summarization: abstractive and extractive summarization.

Abstractive Summarization

Abstractive summarization involves creating a summary that captures the main ideas of the original text in a new and cohesive way. This method relies on natural language generation techniques to produce summaries that may contain words or phrases not present in the original text. It requires a deeper understanding of the text and often uses advanced machine learning models like seq2seq and transformers.

Here is an example of how to perform abstractive summarization in Python using the Hugging Face library, which provides state-of-the-art models and pre-trained weights:

from transformers import pipeline

summarizer = pipeline("summarization")

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."

summary = summarizer(text, max_length=50, min_length=10, do_sample=False)[0]["summary_text"]

print("Abstractive Summary:")
print(summary)

This code snippet utilizes the pipeline function from Hugging Face’s transformers library to create a summarization pipeline. It takes the input text, specifies the maximum and minimum lengths of the summary, and generates the abstractive summary using the pre-trained model.

Extractive Summarization

Extractive summarization aims to extract the most important sentences or phrases from the original text to create a summary. It avoids the generation of new content and instead selects the most representative parts of the text. Common approaches include text ranking algorithms and graph-based methods.

Let’s see an example of how to perform extractive summarization in Python using the Gensim library, which provides tools for topic modeling and document similarity analysis:

from gensim.summarization import summarize

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."

summary = summarize(text, ratio=0.3)

print("Extractive Summary:")
print(summary)

In this code snippet, we use the summarize function from Gensim’s summarization module. It takes the input text and the summarization ratio (a value between 0 and 1 representing the fraction of the original text to retain) as inputs. The function then returns the extractive summary.

Conclusion

Both abstractive and extractive summarization techniques have their own advantages and disadvantages. Abstractive summarization allows for a more flexible and creative generation of summaries, while extractive summarization maintains the factual accuracy of the original text.

Depending on the specific requirements and constraints of your project, you can choose the most suitable approach for text summarization in Python.