[Python] nltk n-gram model

The n-gram model is a language model that predicts the next word in a sequence from the previous n-1 words. For example, a bigram model (n = 2) predicts each word from the single word before it, while a trigram model (n = 3) uses the two preceding words. N-gram models are widely used in natural language processing (NLP) tasks such as speech recognition, machine translation, and text generation, and the nltk (Natural Language Toolkit) library provides convenient utilities for building them in Python.

To use the nltk n-gram model, follow these steps:

Step 1: Install nltk

Ensure that nltk is installed on your system. If not, you can install it using the following command:

pip install nltk
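
The word_tokenize function used below also needs the Punkt tokenizer data, which is downloaded separately (on recent nltk versions the resource may be named punkt_tab instead of punkt):

import nltk
nltk.download('punkt')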

Step 2: Import the necessary libraries

Import the required libraries in your Python script:

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

Step 3: Preprocess the text data

Tokenize the text into a list of words. You can use the word_tokenize function from nltk to perform tokenization:

text = "I love to code in Python"
tokens = word_tokenize(text)
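
For this sentence there is no punctuation to split off, so tokens is simply the list of words:

print(tokens)
# ['I', 'love', 'to', 'code', 'in', 'Python']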

Step 4: Generate n-grams

Next, generate n-grams from the tokenized words using the ngrams function from nltk. Specify the value of n based on the desired context size:

n = 2  # Generate bigrams
ngram_list = list(ngrams(tokens, n))
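
For the tokens above, this produces one tuple for each pair of adjacent words:

print(ngram_list)
# [('I', 'love'), ('love', 'to'), ('to', 'code'), ('code', 'in'), ('in', 'Python')]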

Step 5: Count the frequency of n-grams

Create a frequency distribution of the n-grams using FreqDist from nltk:

frequency_dist = nltk.FreqDist(ngram_list)
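
frequency_dist behaves like a Counter, so you can inspect the most frequent n-grams directly (in this toy example every bigram occurs exactly once):

print(frequency_dist.most_common(3))
# [(('I', 'love'), 1), (('love', 'to'), 1), (('to', 'code'), 1)]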

Step 6: Predict the next word

To predict the next word given the previous n-1 words, build a nltk.ConditionalFreqDist that maps each context (the first n-1 words of every n-gram) to the words that follow it, and take the most frequent continuation with max():

cond_dist = nltk.ConditionalFreqDist((gram[:-1], gram[-1]) for gram in ngram_list)
previous_words = ('in',)  # with bigrams, the context is a single word
predicted_word = cond_dist[previous_words].max()  # -> 'Python'

You can extend the above code to predict multiple next words or to build a complete language model that generates text.
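
As a rough sketch of that idea, assuming the cond_dist and n defined in the previous steps (generate is just a hypothetical helper name), a greedy generator can repeatedly append the most likely continuation of the last n-1 words:

def generate(seed, length=10):
    # seed: tuple of n-1 starting words; greedily append the most frequent continuation
    words = list(seed)
    for _ in range(length):
        context = tuple(words[-(n - 1):])
        if context not in cond_dist:
            break  # no observed continuation for this context
        words.append(cond_dist[context].max())
    return ' '.join(words)

print(generate(('I',)))  # on the toy sentence: 'I love to code in Python'

Trained on a single sentence this simply reproduces it; with a larger corpus, and with sampling from the distribution instead of always taking max(), the generated text becomes more varied.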

The nltk n-gram model is a simple yet effective tool for language modeling: it predicts the next word from contextual information. Whether you are working on recommendation systems, chatbots, or text generation tasks, the n-gram model can play a useful role in improving the quality and relevance of the output.