[Python] NLTK Discourse Analysis

In natural language processing, discourse analysis refers to the study of the structure and organization of written or spoken communication. It aims to uncover the relationships between sentences and identify how these sentences contribute to the overall meaning of a text. The Natural Language Toolkit (NLTK) is a popular Python library that, while it has no dedicated discourse module, provides useful building blocks for discourse analysis tasks.

In this blog post, we will explore the nltk library and its discourse analysis capabilities. We will cover three main areas:

  1. Sentence Segmentation: Breaking down a text into individual sentences.
  2. Coherence and Cohesion: Analyzing the coherence and cohesion in a text.
  3. Discourse Parsing: Parsing the structure of a discourse.
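
All of the examples below use standard NLTK APIs. Before running them, install NLTK and fetch the data packages the examples rely on (a one-time setup; the stopwords list is only used in the cohesion sketch later in the post):

# Install first with:  pip install nltk
import nltk

# One-time downloads used by the examples in this post
nltk.download("punkt")      # pretrained Punkt sentence model
nltk.download("stopwords")  # English stopword list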

1. Sentence Segmentation

The first step in discourse analysis is segmenting a text into individual sentences. The NLTK library provides several tools for sentence segmentation, most notably the unsupervised Punkt tokenizer, which ships with a pretrained English model. Let’s see an example of how to use it:

import nltk

text = "I love working with nltk. It provides so many useful tools. However, sometimes it can be a bit challenging to get started."

# Load the pretrained English Punkt tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

# Segment the text into sentences
sentences = tokenizer.tokenize(text)

# Print the segmented sentences
for sentence in sentences:
    print(sentence)

Output:

I love working with nltk.
It provides so many useful tools.
However, sometimes it can be a bit challenging to get started.

In the above code, we load the pretrained Punkt model and use it to segment the text into sentences; each sentence is then printed on its own line. The convenience function nltk.sent_tokenize() wraps exactly this tokenizer, so it gives the same result in a single call.
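
Because Punkt is unsupervised, you can also train it on raw text from your own domain when the pretrained model mishandles domain-specific abbreviations. A minimal sketch, where domain_text is a placeholder for a large string of raw text from your corpus:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Placeholder: a large string of raw text from your domain
# (e.g. legal filings or biomedical abstracts)
domain_text = "..."

# Training is unsupervised: Punkt learns abbreviations and likely
# sentence starters directly from the raw text passed to it
domain_tokenizer = PunktSentenceTokenizer(domain_text)

sentences = domain_tokenizer.tokenize(domain_text)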

2. Coherence and Cohesion

Once we have segmented our text into sentences, we can analyze the coherence and cohesion within the text. Coherence refers to the overall flow and logic of the text, while cohesion focuses on the connections between sentences.

NLTK has no dedicated coherence metric, but its collocation tools give us a handy proxy for cohesion. One popular measure is Pointwise Mutual Information (PMI), which quantifies how strongly two words are associated: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), so a high score means the pair co-occurs far more often than chance would predict. We can calculate PMI scores using the nltk.collocations module:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

sentences = [
    "The weather is sunny and the sky is clear.",
    "I went for a walk in the park.",
    "I enjoyed the fresh air.",
    "However, I forgot to bring my umbrella.",
]

# Tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Create a BigramCollocationFinder
finder = BigramCollocationFinder.from_documents(tokenized_sentences)

# Calculate the PMI for bigrams
bigram_measures = BigramAssocMeasures()
pmi_scores = finder.score_ngrams(bigram_measures.pmi)

# Print the top 5 bigrams with high PMI scores
for bigram, pmi in pmi_scores[:5]:
    print(" ".join(bigram), pmi)

Output:

However , 5.087
The weather 5.087
a walk 5.087
bring my 5.087
for a 5.087

In the code above, we tokenize each sentence into words and build a BigramCollocationFinder from the tokenized sentences, then score every bigram with score_ngrams() and print the five highest-scoring ones. On a corpus this tiny the scores are not very informative: any bigram whose two words each occur exactly once gets the maximum possible score of log2(N) (N = 34 tokens here, about 5.087), and score_ngrams() breaks ties alphabetically. PMI becomes a meaningful cohesion signal only on larger texts.
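
PMI looks at word pairs within sentences; a complementary and very simple cohesion signal is lexical overlap between adjacent sentences. This is not a built-in NLTK metric, just a minimal sketch assembled from NLTK's word tokenizer and stopword list:

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def content_words(sentence):
    # Keep only alphabetic, non-stopword tokens, lowercased
    return {w.lower() for w in nltk.word_tokenize(sentence)
            if w.isalpha() and w.lower() not in stop_words}

sentences = [
    "I went for a walk in the park.",
    "The park was quiet and the air was fresh.",
    "I enjoyed the fresh air.",
]

# Content-word overlap between each pair of adjacent sentences
for prev, curr in zip(sentences, sentences[1:]):
    overlap = content_words(prev) & content_words(curr)
    print(sorted(overlap))  # ['park'], then ['air', 'fresh']

Repeated content words across sentence boundaries are a classic cohesive device, so a long run of empty overlaps can flag a passage whose sentences do not hang together lexically.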

3. Discourse Parsing

Discourse parsing involves analyzing the structure and relationships between sentences in a text. It helps to uncover the discourse structure and identify connections such as cause and effect, contrast, or elaboration.

NLTK does not include a full discourse parser (for example, a Rhetorical Structure Theory parser), but nltk.parse.corenlp provides an interface to Stanford CoreNLP that can produce a constituency parse for every sentence in a text; these sentence-level trees are a common starting point for discourse-level analysis. To use this interface, you need Java and Stanford CoreNLP installed on your system (the classes below locate the CoreNLP jars via the CORENLP and CORENLP_MODELS environment variables). Here’s an example:

from nltk.parse.corenlp import CoreNLPServer, CoreNLPParser

# Start a CoreNLP server in the background (the jars are located
# via the CORENLP / CORENLP_MODELS environment variables)
server = CoreNLPServer()
server.start()

# Create a parser that sends its requests to the running server
parser = CoreNLPParser(url=server.url)

text = "I love working with nltk. It provides so many useful tools. However, sometimes it can be a bit challenging to get started."

# parse_text() splits the text into sentences and parses each one
parsed_discourse = parser.parse_text(text)

# Print each sentence's constituency tree
for sentence_tree in parsed_discourse:
    sentence_tree.pretty_print()

# Shut the server down when finished
server.stop()

Output (abridged to the top of the first sentence’s tree; the exact shape and tags depend on your CoreNLP version):

      ROOT
       |
       S
   ____|______________
  NP        VP        .
  |      ___|____     |
 PRP   VBP       S    .
  |     |        |
  I    love     ...
In the above code, we start a CoreNLP server, point a CoreNLPParser at it, and call parse_text(), which splits the text into sentences and returns one parse tree per sentence. Each tree is drawn with pretty_print(), and the server is shut down with server.stop() once we are done.
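
These are still sentence-level syntax trees rather than a true discourse structure. A lightweight, parser-free way to get a genuinely discourse-level signal is to flag explicit discourse connectives at sentence starts, since they often mark relations such as contrast or cause. The connective-to-relation mapping below is hand-picked for illustration, not an NLTK resource:

import nltk

# A small, hand-picked mapping from connectives to the discourse
# relations they typically signal (illustrative, not exhaustive)
CONNECTIVES = {
    "however": "contrast",
    "therefore": "cause-effect",
    "moreover": "elaboration",
    "meanwhile": "temporal",
}

text = ("I love working with nltk. It provides so many useful tools. "
        "However, sometimes it can be a bit challenging to get started.")

for i, sentence in enumerate(nltk.sent_tokenize(text), start=1):
    first_word = nltk.word_tokenize(sentence)[0].lower()
    relation = CONNECTIVES.get(first_word)
    if relation:
        print(f"Sentence {i}: '{first_word}' signals a {relation} relation")

Here the third sentence opens with “However”, so the script reports a likely contrast relation between it and the preceding sentences.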

Conclusion

In this blog post, we have introduced discourse analysis with NLTK in Python. We covered sentence segmentation, a collocation-based look at coherence and cohesion, and sentence-by-sentence parsing through the CoreNLP interface. NLTK provides a wide range of building blocks for analyzing the structure of a text, and by combining them you can gain valuable insights into a discourse and improve various natural language processing applications.