[파이썬] nltk 청크 나누기 (Chunking)

Chunking, also known as Shallow Parsing, is a natural language processing (NLP) technique used to group words in a sentence into meaningful chunks based on their syntactic structure. It helps in extracting specific pieces of information from text, such as identifying noun phrases, verb phrases, etc.

The Natural Language Toolkit (NLTK) is a powerful Python library that provides various tools and algorithms for NLP tasks. NLTK offers a simple and efficient way to perform chunking using the inbuilt chunking module.

Installing NLTK

To get started with chunking in NLTK, you need to install the NLTK library. Open your terminal or command prompt and run the following command:

pip install nltk

Importing NLTK and Preprocessing Text

Once NLTK is installed, you need to import it in your Python script:

import nltk

Next, you will need to download the necessary NLTK packages for tokenization and chunking. Run the following commands:

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

After downloading the required packages, you need to preprocess your text by tokenizing it into words and tagging each word with its part-of-speech (POS) tag:

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "I love natural language processing and chunking."
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)

Chunking with NLTK

Now that we have preprocessed our text, we can perform chunking using the NLTK chunking module. NLTK provides a predefined grammar pattern for chunking, which can be modified as per your requirements.

from nltk.chunk import RegexpParser

# Define grammar pattern for chunking
grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN>}  # Noun phrase: determiner/possessive pronoun + adjective(s) + noun(s)
    VP: {<VB.*><NP|PP|CLAUSE>+$}  # Verb phrase: verb + noun phrase/prepositional phrase/clause
    PP: {<IN><NP>}  # Prepositional phrase: preposition + noun phrase
    CLAUSE: {<NP><VP>}  # Clause: noun phrase + verb phrase
"""

# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(grammar)

# Perform chunking on tagged tokens
chunks = chunk_parser.parse(tagged_tokens)

# Print the resulting chunks
print(chunks)

Exploring Chunking Results

The output of chunking is a tree-like structure known as a parse tree or a chunk tree. You can visualize the parse tree or extract specific information from it.

To visualize the parse tree, you can use the .draw() method:

chunks.draw()

To extract specific information from the parse tree, you can iterate over the tree and access its nodes:

for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print('Noun Phrase:', ' '.join([word for word, pos in subtree.leaves()]))
    elif subtree.label() == 'VP':
        print('Verb Phrase:', ' '.join([word for word, pos in subtree.leaves()]))
    elif subtree.label() == 'PP':
        print('Prepositional Phrase:', ' '.join([word for word, pos in subtree.leaves()]))
    elif subtree.label() == 'CLAUSE':
        print('Clause:', ' '.join([word for word, pos in subtree.leaves()]))

Conclusion

Chunking is a valuable technique in natural language processing for extracting meaningful information from text. NLTK provides a convenient way to perform chunking using its inbuilt chunking module. By defining appropriate grammar patterns, you can extract specific chunks like noun phrases, verb phrases, prepositional phrases, and more. It’s always worth exploring and experimenting with different patterns to achieve the desired chunking results.

Happy chunking!