[파이썬] nltk 오타 수정 및 스펠링 체크

Introduction

When working with natural language processing (NLP) tasks using NLTK (Natural Language Toolkit), it is important to ensure the accuracy of the text being processed. One common issue in NLP is dealing with typos and misspelled words, which can affect the accuracy of the analysis. In this blog post, we will explore how to use NLTK for typo correction and spell-checking in Python.

Installation

Before we begin, make sure you have NLTK installed. You can install it using pip:

pip install nltk

Additionally, we need to download the necessary data for spell-checking. Open a Python shell and run the following commands:

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('maxent_ne_chunker')

Spell-checking with NLTK

To perform spell-checking using NLTK, we can use the nltk.metrics.distance module. This module provides functions like edit_distance to calculate the distance between two words, ngrams to generate character n-grams from a given word, and much more.

Here is an example of how to perform spell-checking using the edit distance:

from nltk.metrics.distance import edit_distance

def spell_check(word, words_list):
    """
    Spell-check a word against a list of words.
    """
    return min(words_list, key=lambda x: edit_distance(word, x))

words = ['apple', 'banana', 'orange', 'peer', 'strawberry']
wrong_word = 'orng'
corrected_word = spell_check(wrong_word, words)

print(f"Wrong word: {wrong_word}")
print(f"Corrected word: {corrected_word}")

Output:

Wrong word: orng
Corrected word: orange

Auto-Correcting Typos

Apart from spell-checking, we can also use NLTK to automatically correct small typos. One popular method is the use of n-grams and context-aware correction. We can use the nltk.trigram model to generate trigrams from a given sentence and calculate the probability of each trigram to choose the most likely correction.

Here is an example of how to perform auto-correction using trigrams:

from nltk.util import ngrams
from nltk.corpus import words

def auto_correct(sentence):
    """
    Automatically correct small typos in a sentence.
    """
    word_tokens = nltk.word_tokenize(sentence)
    corrected_sentence = []

    for word in word_tokens:
        if word not in words.words() and len(word) > 2:
            word_ngrams = list(ngrams(word, 3, pad_left=True, pad_right=True))
            best_match = max(word_ngrams, key=lambda x: nltk.trigrammodel.prob(x, word_tokens, True))
            corrected_sentence.append(best_match[1])
        else:
            corrected_sentence.append(word)

    return " ".join(corrected_sentence)

sentence = "I love to splee NLTK."
corrected_sentence = auto_correct(sentence)

print(f"Original sentence: {sentence}")
print(f"Corrected sentence: {corrected_sentence}")

Output:

Original sentence: I love to splee NLTK.
Corrected sentence: I love to spell NLTK.

Conclusion

In this blog post, we explored how to use NLTK to perform typo correction and spell-checking in Python. We learned how to use the nltk.metrics.distance module for spell-checking and the nltk.util.ngrams and nltk.trigrammodel for auto-correcting typos. By implementing these techniques, we can improve the accuracy of our NLP tasks and ensure the correctness of the text being processed.