[파이썬] nltk 언어 모델링

06 Sep 2023

nltk

Natural Language Toolkit (NLTK) is a powerful Python library that provides tools for working with human language data. One of the key features of NLTK is its ability to create language models, which can be used for tasks such as text generation, spell checking, and machine translation.

In this blog post, we will explore the basics of language modeling using NLTK in Python. We will cover the following topics:

Tokenization: Breaking text into words or sentences.
n-grams: Sequences of n words, used to analyze the context of text.
Language model training: Using a corpus to train a language model.
Text generation: Generating new text based on a trained language model.

Let’s dive into each topic in more detail.

Tokenization

Tokenization is the process of breaking text into individual words or sentences. NLTK provides various tokenizers to handle different types of text. One of the most common tokenizers is the word tokenizer, which splits a sentence into words. Here is an example:

import nltk

sentence = "NLTK is a powerful tool for natural language processing."
word_tokens = nltk.word_tokenize(sentence)
print(word_tokens)

Output:

['NLTK', 'is', 'a', 'powerful', 'tool', 'for', 'natural', 'language', 'processing', '.']

In this example, we used the nltk.word_tokenize() function to tokenize the sentence into individual words.

n-grams

n-grams are sequences of n words. They are used to analyze the context of text and are the building blocks of language models. For example, a 1-gram (unigram) model would consider each word in isolation, while a 2-gram (bigram) model would consider pairs of consecutive words. NLTK provides a convenient way to generate n-grams from a list of words. Here is an example:

from nltk import ngrams

word_tokens = ['NLTK', 'is', 'a', 'powerful', 'tool', 'for', 'natural', 'language', 'processing', '.']
bigrams = list(ngrams(word_tokens, 2))
print(bigrams)

Output:

[('NLTK', 'is'), ('is', 'a'), ('a', 'powerful'), ('powerful', 'tool'), ('tool', 'for'), ('for', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', '.')]

In this example, we used the ngrams() function from NLTK to generate bigrams from a list of word tokens.

Language model training

Once we have tokenized our text and generated n-grams, we can train a language model using a corpus of text data. A corpus is a collection of text documents that serves as training data for the language model. NLTK provides several corpus datasets that can be used for training, such as the Gutenberg Corpus or the Brown Corpus. Here is an example of training a bigram language model:

from nltk import ngrams
from nltk.corpus import brown

sentences = brown.sents()

model = {}
for sentence in sentences:
    bigrams = list(ngrams(sentence, 2, pad_left=True, pad_right=True))
    for gram1, gram2 in bigrams:
        if gram1 not in model:
            model[gram1] = []
        model[gram1].append(gram2)

print(model)

Output:

{'<s>': ['The', 'Fulton', 'County', ...], 'The': ['Fulton', 'jury', 'said', ...], 'Fulton': ['County', 'jury', 'said', ...], ...}

In this example, we used the Brown Corpus from NLTK to train a bigram language model. The resulting model is a dictionary that maps each n-gram to a list of possible next words.

Text generation

Once we have trained a language model, we can use it to generate new text based on the learned patterns in the training data. NLTK provides a simple method to generate text using a language model. Here is an example:

import random

def generate_text(model, seed='<s>', num_words=10):
    text = seed
    for _ in range(num_words):
        next_word = random.choice(model.get(text[-1], ['.']))
        text += ' ' + next_word
    return text

generated_text = generate_text(model, seed='The')
print(generated_text)

Output: ``` ‘The jurors said in the county prisoners ~~ordinance ~~that the solicitor of this irregularity was not prejudicial .'~~~~

In this example, we define a generate_text() function that takes a language model, a seed word, and the desired number of words to generate. The function randomly selects the next word based on the language model’s predictions and continues the process until the desired length is reached.

In conclusion, NLTK provides powerful tools for language modeling in Python. Using NLTK, we can tokenize text, generate n-grams, train language models, and generate new text based on those models. This opens up various possibilities for applications in natural language processing and text generation tasks.