[Python] Question classification and response generation with NLTK

Natural Language Toolkit (NLTK) is a popular Python library for text processing and analysis. It does not include a ready-made question answering system, but its tokenizers, classifiers, and corpora are enough to build a simple one. In this blog post, we will explore how to classify questions and generate responses using NLTK.

Question Classification

Question classification assigns a question to a category based on the type of answer it expects, such as a location (LOC), a person (HUM), or a number (NUM). NLTK does not ship a pre-trained question classifier, so we will train one using NLTK's tokenizer and its SklearnClassifier wrapper around scikit-learn. To start with, make sure NLTK and scikit-learn are installed. If not, you can install them using the following command:

pip install nltk scikit-learn

Once NLTK is installed, download the tokenizer data that word_tokenize relies on by running the following code:

import nltk

# Punkt sentence tokenizer models, required by word_tokenize
nltk.download('punkt')
# Newer NLTK releases load the tokenizer from 'punkt_tab' instead
nltk.download('punkt_tab')

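With the tokenizer data in place, you can train a classifier. The code below downloads NLTK's qc corpus of labeled questions, turns each question into bag-of-words features, and trains a Naive Bayes model through SklearnClassifier. Labels in the corpus look like LOC:city; only the coarse part before the colon is kept.

import nltk
from nltk.classify import SklearnClassifier
from nltk.corpus import qc
from nltk.tokenize import word_tokenize
from sklearn.naive_bayes import MultinomialNB

nltk.download('qc')  # labeled questions used as training data

def preprocess(text):
    # Bag-of-words features: every lowercased token maps to True
    tokens = word_tokenize(text.lower())
    return {word: True for word in tokens}

def train_classifier():
    # qc.tuples() yields (label, question) pairs; keep the coarse label
    # (DESC, ENTY, ABBR, HUM, NUM or LOC) that precedes the colon
    train_set = [(preprocess(text), label.split(':')[0])
                 for label, text in qc.tuples('train.txt')]
    return SklearnClassifier(MultinomialNB()).train(train_set)

classifier = train_classifier()

def classify_question(question):
    # classify() returns the predicted label as a string
    return classifier.classify(preprocess(question))

# Example Usage
question = "What is the capital of France?"
classification = classify_question(question)
print(f"The question '{question}' belongs to the {classification} category.")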
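
Before relying on a trained model, it can help to have a trivial baseline to compare it against. The following is a minimal sketch of a rule-based classifier that guesses the coarse category from the first one or two words of the question. The function name baseline_label and both rule tables are assumptions made for illustration; they are not part of NLTK.

```python
# Hypothetical keyword baseline: guess the coarse category from the leading
# question words. The rule tables are illustrative assumptions, not an NLTK API.
TWO_WORD_RULES = {"how many": "NUM", "how much": "NUM"}
ONE_WORD_RULES = {
    "where": "LOC",
    "who": "HUM",
    "whom": "HUM",
    "whose": "HUM",
    "when": "NUM",   # dates fall under the numeric category
    "why": "DESC",
    "how": "DESC",
}

def baseline_label(question, default="ENTY"):
    """Return a coarse label for the question using simple prefix rules."""
    words = question.lower().split()
    if not words:
        return default
    # Two-word rules take priority so "how many" is not read as plain "how"
    first_two = " ".join(words[:2])
    if first_two in TWO_WORD_RULES:
        return TWO_WORD_RULES[first_two]
    return ONE_WORD_RULES.get(words[0], default)

for q in ["Where is the Eiffel Tower?", "How many moons does Mars have?"]:
    print(q, "->", baseline_label(q))
```

A baseline like this misses most "What ..." questions, which is exactly where a trained classifier should do better.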

Question Answering

Once you have classified a question, you can generate a response based on its category. For example, if the question belongs to the category "LOC" (location), you can look up the place in a knowledge base or scrape a web page for it. The example below takes the scraping route and uses the requests and beautifulsoup4 packages to fetch a Wikipedia article.

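import requests
from bs4 import BeautifulSoup

def get_location_info(location):
    # Illustrative only: fetch the Wikipedia article for the location and
    # return its first non-empty paragraph; a knowledge base would also work
    url = f"https://en.wikipedia.org/wiki/{location}"
    headers = {"User-Agent": "nltk-qa-example/0.1"}  # placeholder value
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    info_div = soup.find("div", {"class": "mw-parser-output"})
    if info_div is None:
        return None
    for paragraph in info_div.find_all("p"):
        text = paragraph.get_text().strip()
        if text:
            return text
    return None

# Example Usage
if classification == 'LOC':
    location = "France"  # hard-coded here; a full system would extract it from the question
    location_info = get_location_info(location)
    print(f"Information about {location}: {location_info}")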
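
The URL above is built by pasting the location straight into the path, which only works for single-word titles such as "France". As a hedged sketch, a helper along these lines could handle spaces and special characters; the name wikipedia_url is hypothetical, and the underscore convention reflects how Wikipedia article paths are usually written.

```python
from urllib.parse import quote

def wikipedia_url(title):
    """Build an English Wikipedia article URL for a page title (illustrative helper)."""
    # Article paths conventionally use underscores in place of spaces
    path = title.strip().replace(" ", "_")
    # Percent-encode any remaining character that is not safe in a path segment
    return "https://en.wikipedia.org/wiki/" + quote(path, safe="")

print(wikipedia_url("New York City"))
```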

By combining question classification with a category-specific answer lookup, you can build a simple question answering pipeline. NLTK supplies the tokenization, classification, and corpus tools for the first stage, while the source of the answers is up to you.
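
One way to wire the two stages together is a dispatch table that maps each category to a handler function. The sketch below is an assumption about how such glue code might look: the handler names, the HANDLERS table, and respond are all hypothetical, and the handlers return placeholder strings instead of performing real lookups.

```python
# Hypothetical handlers: each takes the question text and returns an answer string.
def answer_location(question):
    return f"[look up a place for: {question}]"

def answer_person(question):
    return f"[look up a person for: {question}]"

def answer_fallback(question):
    return "Sorry, I cannot answer that kind of question yet."

# Map coarse category labels to the handler responsible for them
HANDLERS = {"LOC": answer_location, "HUM": answer_person}

def respond(question, classify):
    """Classify the question, then route it to the handler for its category."""
    label = classify(question)
    handler = HANDLERS.get(label, answer_fallback)
    return handler(question)

# A stand-in classifier keeps the sketch runnable on its own
print(respond("Where is Paris?", classify=lambda q: "LOC"))
```

In the post's setting, classify_question would be passed as the classify argument, and answer_location would call something like get_location_info.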

In this blog post, we have seen how to use NLTK for question classification and answering. You can explore further and experiment with different question categories and answer generation approaches based on your specific needs.

Happy coding!