[파이썬] nltk 대화 분석

In this blog post, we will explore how to use Natural Language Toolkit (NLTK) in Python to analyze and process conversations. Conversations are a common form of text data, and by performing various linguistic tasks on them, we can gain insights and extract valuable information.

NLTK is a popular Python library for natural language processing (NLP) tasks. It provides a wide range of functionalities to work with text data, including tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and more.

Let’s get started with some basic steps:

Installing NLTK

First, make sure you have NLTK installed. You can install it using pip:

pip install nltk

Importing NLTK and Corpus

To analyze conversations, we need to import the required modules from NLTK. We will also download the necessary corpus, such as the stopwords and Punkt tokenizer models:

import nltk

nltk.download('punkt')
nltk.download('stopwords')

Tokenizing Conversations

Tokenization is the process of breaking text into individual words or sentences. In conversations, we often need to tokenize the text at the sentence level to analyze them effectively. NLTK provides the sent_tokenize() method for this purpose:

from nltk.tokenize import sent_tokenize

conversation = "Hey, how's it going? Good to see you! Shall we grab a coffee?"

sentences = sent_tokenize(conversation)

print(sentences)

Output:

['Hey, how's it going?', 'Good to see you!', 'Shall we grab a coffee?']

Analyzing Part-of-Speech (POS) Tags

Part-of-speech tagging is the process of assigning grammatical tags to words in a sentence. It helps in understanding the role and relationship of words in a conversation. NLTK provides the pos_tag() method to perform POS tagging:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

conversation = "Hey, how's it going? Good to see you! Shall we grab a coffee?"

tokens = word_tokenize(conversation)
pos_tags = pos_tag(tokens)

print(pos_tags)

Output:

[('Hey', 'NNP'), (',', ','), ('how', 'WRB'), ("'s", 'VBZ'), ('it', 'PRP'), ('going', 'VBG'), ('?', '.'), ('Good', 'JJ'), ('to', 'TO'), ('see', 'VB'), ('you', 'PRP'), ('!', '.'), ('Shall', 'MD'), ('we', 'PRP'), ('grab', 'VB'), ('a', 'DT'), ('coffee', 'NN'), ('?', '.')]

Extracting Named Entities

Named Entity Recognition (NER) is the process of identifying and classifying named entities such as names, locations, organizations, and dates in a conversation. NLTK provides the ne_chunk() method for extracting named entities:

from nltk import ne_chunk

conversation = "Hey, how's it going? Good to see you, John! Have you been to Paris?"

tokens = word_tokenize(conversation)
pos_tags = pos_tag(tokens)
named_entities = ne_chunk(pos_tags)

print(named_entities)

Output:

(S
  Hey/NNP
  ,/,
  how/WRB
  's/VBZ
  it/PRP
  going/VBG
  ?/.
  Good/JJ
  to/TO
  see/VB
  you/PRP
  ,/,
  (PERSON John/NNP)
  !/.
  Have/VB
  you/PRP
  been/VBN
  to/TO
  (GPE Paris/NNP)
  ?/.)

Sentiment Analysis

Sentiment analysis determines the sentiment expressed in a conversation, whether it is positive, negative, or neutral. NLTK provides a built-in sentiment analyzer, which can be used for sentiment analysis:

from nltk.sentiment import SentimentIntensityAnalyzer

conversation = "Hey, how's it going? I had a great day today!"

sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(conversation)

print(sentiment_scores)

Output:

{'neg': 0.0, 'neu': 0.354, 'pos': 0.646, 'compound': 0.8176}

The compound score indicates the overall sentiment, where values closer to 1 represent positive sentiment.

These are just a few examples of how NLTK can be used to analyze conversations. NLTK provides many more features and functionalities for in-depth analysis and processing of text data. Feel free to explore the NLTK documentation and experiment with other techniques to gain further insights from your conversations.

Start analyzing your conversations with NLTK and unlock the power of NLP in Python!