[파이썬] nltk 단어 네트워크 및 그래프 분석

Natural Language Toolkit (NLTK) is a popular library for natural language processing (NLP) tasks in Python. It provides various tools and techniques for analyzing textual data. One interesting aspect of NLTK is its ability to analyze word networks and perform graph analysis. In this blog post, we will explore how to use NLTK for word network and graph analysis.

Tokenization and Corpus Creation

Before we can start with network and graph analysis, we need to tokenize the text into individual words or tokens. NLTK provides several tokenization methods that can be used based on the requirements. Once we have the tokens, we can create a corpus, which is a collection of texts or documents.

import nltk
from nltk.corpus import inaugural

# Load the inaugural corpus
corpus = inaugural.words()

# Tokenize the corpus
tokens = nltk.word_tokenize(' '.join(corpus))

# Create a nltk.Text object for further analysis
text = nltk.Text(tokens)

Building Word Networks

Now that we have the tokenized corpus, we can build word networks using NLTK. A word network, also known as a co-occurrence network, captures the relationships between words based on how frequently they appear together in the text. NLTK provides a function to build a word network, which returns a graph object.

from nltk import bigrams
import networkx as nx

# Get the bigrams from the text
bigrms = list(bigrams(text))

# Create a weighted graph from the bigrams
graph = nx.Graph()

for bigram in bigrms:
    w1, w2 = bigram
    if w1 != w2:
        if graph.has_edge(w1, w2):
            graph[w1][w2]['weight'] += 1
        else:
            graph.add_edge(w1, w2, weight=1)

Visualizing Word Networks

Once the word network is built, we can visualize it using various graph visualization techniques. NLTK provides a simple graph plotting function that can be used to visualize the word network.

import matplotlib.pyplot as plt

# Plot the graph
pos = nx.spring_layout(graph)
nx.draw_networkx(graph, pos, node_size=1000, font_size=10, with_labels=True)

# Show the plot
plt.show()

Analyzing Word Networks

With the word network and visualization in place, we can now perform various graph analysis tasks using NLTK. For example, we can calculate the degree centrality of each word, which measures how central a word is within the network.

# Calculate degree centrality
degree_centrality = nx.degree_centrality(graph)

# Print the top 10 words with the highest degree centrality
sorted_degree_centrality = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:10]

for word, centrality in sorted_degree_centrality:
    print(f"{word}: {centrality}")

Conclusion

NLTK provides powerful tools for word network and graph analysis in Python. By leveraging these capabilities, we can gain insights into the relationships between words in textual data. Whether it’s exploring co-occurrence patterns, visualizing word networks, or performing graph analysis, NLTK offers a wide range of functionality for NLP tasks.

In this blog post, we learned how to tokenize a corpus, build a word network, visualize it, and perform graph analysis using NLTK. This is just the tip of the iceberg, and NLTK has much more to offer for NLP enthusiasts. So go ahead, dive into NLTK, and discover the hidden connections within your texts!