In this blog post, we will explore how to analyze data scraped using the requests-html
library in Python. If you’re unfamiliar with web scraping or the requests-html
library, it’s recommended to check out some introductory resources before diving into this post.
What is requests-html?
requests-html is a Python library that allows you to easily scrape and parse web pages. It is built on top of the popular requests library and provides additional capabilities such as parsing HTML content, running JavaScript code, and handling asynchronous requests.
To install requests-html, you can use pip:
$ pip install requests-html
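Since requests-html also supports asynchronous requests, it ships an AsyncHTMLSession class for fetching several pages concurrently. Here is a minimal sketch of that usage (the URLs and function names are placeholders):
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
async def get_example():
    # Placeholder URL; swap in the pages you actually want to fetch
    r = await asession.get('https://example.com')
    return r
async def get_python():
    r = await asession.get('https://python.org')
    return r
# run() schedules the coroutines and returns their results as a list
results = asession.run(get_example, get_python)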
Scraping Data with requests-html
To scrape data from a web page using requests-html, you need to perform the following steps:
- Create an HTML session: First, you need to create an instance of the HTMLSession class.
from requests_html import HTMLSession
session = HTMLSession()
- Send a request and render the page: You can use the get method of the session object to send a GET request to a specific URL. After receiving the response, use the render method to execute JavaScript code (if present) and render the page. Note that the first call to render downloads a headless Chromium browser, so it can take a while.
response = session.get('https://example.com')
response.html.render()
- Extract data with selectors: Once the page is rendered, you can extract data using CSS selectors (via the find method) or XPath expressions (via the xpath method). For example, to extract all the text from <h1> tags, you can use the xpath method.
headings = response.html.xpath('//h1//text()')
- Explore and manipulate data: After extracting the data, you can store it in a suitable data structure (e.g., list, dictionary) for further analysis.
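As a hedged sketch of what that last step might look like, the snippet below stores each heading together with the links it contains in a simple list of dictionaries. It assumes the response object from the steps above; the field names are just illustrative:
# Collect each <h1> heading along with the absolute links found inside it
scraped_records = []
for element in response.html.find('h1'):
    scraped_records.append({
        'heading': element.text,
        'links': list(element.absolute_links),
    })
print(scraped_records)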
Analyzing Scraped Data
Once we have scraped and extracted data using requests-html, we can perform various analyses on it depending on our requirements. Let’s go through a few examples:
Word Frequency Analysis
One common analysis is to determine the frequency of words in the scraped data. We can store the extracted text in a list and use the Counter class from the collections module to count the occurrences of each word.
from collections import Counter
word_list = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
word_frequency = Counter(word_list)
print(word_frequency)
The output will be:
Counter({'apple': 3, 'banana': 2, 'cherry': 1})
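With real scraped text, you would typically tokenize it before counting. A minimal sketch, assuming headings is the list of strings extracted earlier (the sample values are placeholders):
import re
from collections import Counter
headings = ['Data Analysis Tips', 'Tips for Analysis of Web Data']  # placeholder for scraped text
words = []
for text in headings:
    # Lowercase and split on word characters so 'Analysis' and 'analysis' count together
    words.extend(re.findall(r'\w+', text.lower()))
word_frequency = Counter(words)
print(word_frequency.most_common(3))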
Sentiment Analysis
If the scraped data contains text content such as reviews or comments, we can perform sentiment analysis to determine the sentiment expressed. There are several libraries available for sentiment analysis in Python, such as NLTK or TextBlob.
from textblob import TextBlob
reviews = ['I love this product!', 'This is the worst service ever!']
sentiments = []
for review in reviews:
    blob = TextBlob(review)
    # polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    sentiment_score = blob.sentiment.polarity
    if sentiment_score > 0:
        sentiments.append('Positive')
    elif sentiment_score < 0:
        sentiments.append('Negative')
    else:
        sentiments.append('Neutral')
print(sentiments)
The output will be:
['Positive', 'Negative']
Data Visualization
Another way to analyze scraped data is by visualizing it. You can use popular libraries like Matplotlib or Seaborn to create visualizations such as bar charts, scatter plots, or word clouds.
import matplotlib.pyplot as plt
word_frequency = {'apple': 3, 'banana': 2, 'cherry': 1}
words = list(word_frequency.keys())
frequency = list(word_frequency.values())
plt.bar(words, frequency)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Word Frequency Analysis')
plt.show()
This code will display a bar chart representing the word frequency.
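For the word clouds mentioned above, the third-party wordcloud package (installed with pip install wordcloud) can build one directly from the same frequency dictionary. A minimal sketch, assuming that package is available:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
word_frequency = {'apple': 3, 'banana': 2, 'cherry': 1}
# Scale each word by its frequency and render the cloud as an image
cloud = WordCloud(width=800, height=400, background_color='white')
cloud.generate_from_frequencies(word_frequency)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()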
Conclusion
In this blog post, we explored how to analyze data scraped using the requests-html library in Python. We covered the basics of web scraping with requests-html, extracting data with CSS selectors and XPath, and performing various analyses on the scraped data. Remember to always check the terms of use and legality before scraping any website, and be respectful of the website’s resources. Happy scraping and analyzing!