[파이썬] requests-html 인터랙티브 콘텐츠 스크레이핑

06 Sep 2023

requests-html

In this blog post, we will explore how to use the requests-html library in Python to scrape interactive content from websites. While traditional web scraping tools struggle with dynamic and interactive web pages, requests-html provides a simple and elegant solution for handling such scenarios.

What is requests-html?

requests-html is a Python library that combines the power of requests and pyppeteer to offer a seamless experience for scraping and parsing web pages. It allows you to execute JavaScript, interact with web elements, and extract dynamically generated content.

Installation

To get started, you’ll need to install requests-html. You can do this by running the following command:

pip install requests-html

Additionally, since requests-html utilizes pyppeteer, you will also need to install a compatible version of Chromium or Google Chrome. You can install Chromium using the following command:

sudo apt install chromium

Basic Usage

To begin scraping interactive content with requests-html, we first need to import and initialize the HTMLSession class from the library. This class allows us to create a session object to send requests and interact with web pages.

from requests_html import HTMLSession

session = HTMLSession()

Next, we can use the session object to send a GET request to a web page and render it with JavaScript support.

response = session.get('https://www.example.com')
response.html.render()

After rendering the page, we can access and extract the content we need. For example, let’s say we want to scrape the titles of all the links on the page.

# Find all the link elements
links = response.html.find('a')

# Extract the titles
for link in links:
    print(link.text)

Running JavaScript

One of the key features of requests-html is the ability to execute JavaScript code, which is essential for scraping interactive content. To do this, we can utilize the render method on an element to ensure that JavaScript is fully executed.

# Render the JavaScript on a specific element
element.render()

This feature allows us to interact with web pages and scrape the dynamically generated content that wouldn’t be available through traditional scraping methods.

Conclusion

requests-html is a powerful library for scraping interactive content in Python. With its ability to execute JavaScript, interact with web elements, and extract dynamic content, it provides an effective solution for web scraping tasks that involve complex and interactive websites.

In this blog post, we’ve only scratched the surface of what requests-html can do. I encourage you to explore the documentation, experiment with different scenarios, and leverage the full potential of this library for your web scraping needs.