[파이썬] nltk 개체명 인식 (Named Entity Recognition)

NLTK (Natural Language Toolkit) is a popular Python library for processing natural language data. One of the useful features of NLTK is named entity recognition, which allows us to extract and classify named entities from text.

Named entity recognition (NER) is the process of identifying and classifying named entities in text into predefined categories such as person names, organization names, locations, medical terms, etc. It is an important task in natural language processing (NLP) and can be used in various applications such as information retrieval, text summarization, question answering, and machine translation.

Installing NLTK

To get started with NLTK, you need to install it. Open your terminal (or command prompt) and run the following command:

pip install nltk

After installation, you can import it into your Python script using the following code:

import nltk

Using NLTK for Named Entity Recognition

Once NLTK is installed, you can use it to perform named entity recognition on your text data. NLTK provides various pre-trained models and tools for this purpose.

Tokenization

Before performing named entity recognition, we need to tokenize the text into individual words or tokens. Tokenization is the process of splitting text into smaller units, such as words or sentences. NLTK provides a built-in tokenizer that we can use:

from nltk.tokenize import word_tokenize

text = "John is studying at Stanford University."
tokens = word_tokenize(text)

Named Entity Recognition

After tokenization, we can use NLTK’s named entity recognition module to identify named entities in the text. NLTK provides a pre-trained model called nltk.ne_chunk() that uses part-of-speech tagging to identify named entities. Here is an example:

from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

text = "John is studying at Stanford University."
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)
named_entities = ne_chunk(tagged_tokens)

print(named_entities)

The ne_chunk() function takes a list of tagged tokens as input and returns a parse tree with named entities labeled. You can then iterate over the parse tree to extract the named entities and their corresponding labels.

Conclusion

Named entity recognition is a powerful technique in natural language processing that allows us to extract and classify named entities from text. NLTK provides pre-trained models and tools for performing named entity recognition in Python. By using NLTK’s named entity recognition module, we can easily identify important entities in text data and use them for various NLP tasks.