[Python] Extracting Verbs and Noun Phrases with NLTK

Natural Language Toolkit (NLTK) is a widely used library for natural language processing (NLP) in Python. It provides tools for processing and analyzing text, including tokenization, part-of-speech tagging, and chunking, which together make it possible to extract verbs and noun phrases.

In this blog post, we will explore how to use NLTK to extract verbs and noun phrases from a given text using Python.

Installation

Before we begin, make sure you have NLTK installed on your machine. You can install NLTK using pip:

pip install nltk

Text Preprocessing

First, we need to preprocess our text by tokenizing it into words. NLTK provides the word_tokenize function for this purpose. Here’s an example of how to use it:

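import nltk

# One-time model downloads; newer NLTK releases may instead need 'punkt_tab' and 'averaged_perceptron_tagger_eng'
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = "I love playing soccer with my friends."
tokens = nltk.word_tokenize(text)  # split the sentence into word and punctuation tokens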
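To see what the tokens look like without downloading any models, the sketch below uses NLTK's TreebankWordTokenizer, a rule-based word tokenizer that needs no extra data. It is a stand-in for word_tokenize, not the same function: word_tokenize also splits the text into sentences first. For a single sentence like ours the tokens should come out the same, but treat that as an assumption rather than a guarantee.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "I love playing soccer with my friends."

# Naive whitespace splitting leaves the final period glued to the last word.
naive_tokens = text.split()

# The rule-based Treebank tokenizer separates the sentence-final period.
tokens = TreebankWordTokenizer().tokenize(text)

print(naive_tokens)  # last item is 'friends.'
print(tokens)        # last two items are 'friends' and '.'
```

Keeping punctuation as its own token matters for the later steps, because the POS tagger assigns a tag to each token separately.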

Verb Extraction

To extract verbs from the text, we can use NLTK’s part-of-speech (POS) tagging functionality. POS tagging assigns a category (such as noun, verb, adjective, etc.) to each word in a sentence.

pos_tags = nltk.pos_tag(tokens)
verbs = [word for word, pos in pos_tags if pos.startswith('V')]

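In the above code, the list comprehension keeps only the words whose POS tag starts with ‘V’. NLTK’s default English tagger uses the Penn Treebank tag set, in which every verb tag (VB, VBD, VBG, VBN, VBP, VBZ) begins with ‘V’. Modal verbs such as “can” or “will” are tagged ‘MD’, so this filter does not include them.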
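The tag filter can also be written as a small reusable function. The sketch below is illustrative only: the names VERB_TAGS, extract_verbs, and include_modals are made up for this example, and the tagged sentence is written by hand, so real nltk.pos_tag output may differ.

```python
# Penn Treebank verb tags, the tag set used by NLTK's default English tagger.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def extract_verbs(tagged_tokens, include_modals=False):
    """Return the words whose POS tag marks them as verbs."""
    wanted = VERB_TAGS | ({"MD"} if include_modals else set())
    return [word for word, tag in tagged_tokens if tag in wanted]


# Hand-written tags for illustration; real pos_tag output may differ.
tagged = [
    ("I", "PRP"), ("can", "MD"), ("play", "VB"), ("soccer", "NN"),
    ("and", "CC"), ("loved", "VBD"), ("playing", "VBG"), (".", "."),
]

print(extract_verbs(tagged))                       # ['play', 'loved', 'playing']
print(extract_verbs(tagged, include_modals=True))  # ['can', 'play', 'loved', 'playing']
```

Matching against an explicit set of tags, rather than a prefix, makes it obvious which tags count as verbs and lets you decide case by case whether modals belong in the result.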

Noun Phrase Extraction

NLTK also provides chunking functionality to extract noun phrases from text. Chunking is the process of grouping words into “chunks” based on their POS tags. Here’s an example of how to do noun phrase extraction:

grammar = r"NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
chunks = chunk_parser.parse(pos_tags)

noun_phrases = []
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        words = [word for word, pos in subtree.leaves()]
        noun_phrase = ' '.join(words)
        noun_phrases.append(noun_phrase)

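In the above code, we define a simple grammar for noun phrases: an NP (Noun Phrase) consists of an optional determiner (‘DT’), followed by zero or more adjectives (‘JJ’), and ending with a singular common noun (‘NN’). RegexpParser applies this pattern to the tagged tokens and returns a tree, from which we collect the words of every subtree labeled ‘NP’. Because the pattern matches only the ‘NN’ tag, plural nouns (‘NNS’) and proper nouns (‘NNP’) are not captured, and neither are possessive pronouns such as “my” (‘PRP$’); in the example sentence, “my friends” would therefore not be extracted as a phrase.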
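If you also want plural nouns, proper nouns, and possessive pronouns, the grammar can be widened. The following is a minimal sketch of one possible pattern, not the only reasonable one. The POS tags are written by hand so the example runs without the tagger model; on real text you would pass the output of nltk.pos_tag instead.

```python
import nltk

# Hand-written Penn Treebank tags for the example sentence (an assumption;
# run nltk.pos_tag on your own text to get real tags).
pos_tags = [
    ("I", "PRP"), ("love", "VBP"), ("playing", "VBG"), ("soccer", "NN"),
    ("with", "IN"), ("my", "PRP$"), ("friends", "NNS"), (".", "."),
]

# Optional determiner or possessive pronoun, any number of adjectives,
# then one or more nouns of any kind (NN, NNS, NNP, NNPS).
grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}"
chunk_parser = nltk.RegexpParser(grammar)
chunks = chunk_parser.parse(pos_tags)

# Collect the words of every NP subtree, in sentence order.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunks.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['soccer', 'my friends']
```

The `<NN.*>+` part also groups consecutive nouns into a single phrase, which is usually what you want for compound nouns. The pronoun “I” is still left out, since it is tagged ‘PRP’ and the pattern requires at least one noun.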

Conclusion

In this blog post, we have demonstrated how to use NLTK in Python to extract verbs and noun phrases from a given text. NLTK provides handy functionalities like tokenization, POS tagging, and chunking, which can be combined to perform various NLP tasks.

By extracting verbs and noun phrases, you can gain valuable insights from text data and perform further analysis or processing. NLTK’s extensive documentation and community support make it a great choice for NLP tasks.

Remember to install NLTK and preprocess your text before extracting verbs and noun phrases. Happy coding with NLTK!