[파이썬] nltk Collocations(연어) 및 Bigrams

Introduction

Collocations and bigrams play a significant role in natural language processing (NLP) and text analysis. In this blog post, we will explore the concept of collocations and bigrams using the Natural Language Toolkit (nltk) library in Python.

What are Collocations?

In linguistics, collocations are words that tend to appear together frequently. They provide valuable insights into the semantic relationships between words in a given text. For example, words like “salt” and “pepper” often appear together as collocates in phrases like “salt and pepper,” “salt to taste,” etc.

Why are Collocations Important?

Understanding collocations is crucial as they can aid in improving language modeling, information retrieval, semantic analysis, and text classification tasks. By identifying and analyzing collocations, we can gain a deeper understanding of the underlying structure and meaning within a text.

Using nltk for Collocations

Python’s nltk library provides a range of tools to work with collocations. The most common approach is to use the Collocations class in the nltk.collocations module. Here’s a simple example:

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Load sample text data
text = "This is a sample sentence for demonstrating collocations."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Create a BigramCollocationFinder
finder = BigramCollocationFinder.from_words(tokens)

# Filter out common nltk stop words
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in nltk.corpus.stopwords.words('english'))

# Get the top 10 most frequent bigrams
bigrams = finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)

# Display the result
for bigram in bigrams:
    print(bigram)

In this example, we import the necessary modules from the nltk library and tokenize the input text. We then create a BigramCollocationFinder and apply a filter to remove common stopwords. Lastly, we retrieve the top 10 most frequent bigrams using the nbest() method and display the results.

Conclusion

Collocations and bigrams are powerful tools for gaining insights into the relationships between words in a text. In this blog post, we explored the concept of collocations and demonstrated their usage in Python using nltk’s Collocations class. By leveraging nltk’s functionality, we can easily discover and analyze collocations to enhance our natural language processing and text analysis tasks.