[파이썬] `nltk`와 `pandas`의 통합

nltk and pandas

Introduction

Natural Language Processing (NLP) and data analysis are two important fields in the domain of data science. nltk (Natural Language Toolkit) is a popular library used for NLP tasks, while pandas is a powerful library used for data manipulation and analysis. In this blog post, we will explore how we can integrate nltk and pandas to perform NLP tasks on large datasets.

Installing Required Libraries

Before we begin, let’s make sure we have nltk and pandas installed in our Python environment. If not, we can install them using the following commands:

!pip install nltk
!pip install pandas

Loading and Preprocessing Data

To demonstrate the integration of nltk and pandas, let’s consider a scenario where we have a dataset of customer reviews for a product. We want to analyze the sentiment of these reviews using nltk, and then store the results in a new column in our pandas DataFrame.

import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

# Load the dataset into a pandas DataFrame
data = pd.read_csv('customer_reviews.csv')

# Initialize the Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

# Apply sentiment analysis on each review and store the results in a new column
data['sentiment_score'] = data['review'].apply(lambda x: sia.polarity_scores(x)['compound'])

# Display the updated DataFrame
print(data.head())

In the code above, we first load the dataset into a pandas DataFrame. We then initialize the Sentiment Intensity Analyzer from nltk.sentiment and apply it to each review using the apply function. The sentiment score is computed using compound polarity from the analyzer’s polarity_scores method. Finally, we add the sentiment scores as a new column in our DataFrame.

Analyzing Data

Now that we have computed the sentiment scores for each review, we can perform various data analysis tasks using pandas. For example, we can calculate the average sentiment score per product category and visualize the results using a bar chart.

import matplotlib.pyplot as plt

# Calculate the average sentiment score per product category
average_sentiment = data.groupby('category')['sentiment_score'].mean()

# Create a bar chart to visualize the results
average_sentiment.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Product Category')
plt.ylabel('Average Sentiment Score')
plt.title('Sentiment Analysis of Customer Reviews')
plt.show()

In the code snippet above, we group the DataFrame by the ‘category’ column and calculate the mean of the ‘sentiment_score’ column for each category. We then create a bar chart using matplotlib.pyplot to visualize the average sentiment scores per product category.

Conclusion

Integrating nltk and pandas allows us to combine the power of NLP and data analysis to gain insights from unstructured text data. In this blog post, we explored how to perform sentiment analysis on customer reviews using nltk and store the results in a pandas DataFrame. We also learned how to analyze the sentiment scores and visualize the results using pandas and matplotlib. This integration opens up endless possibilities for analyzing and extracting valuable information from large text datasets.

Happy coding!