[파이썬] requests-html HTMLResponse 객체

06 Sep 2023

requests-html

In web scraping and crawling scenarios, it is often necessary to extract specific information from HTML pages. requests-html is a powerful Python library that provides an easy and intuitive approach to parse HTML pages and extract data.

Installation

To get started with requests-html, you need to install it. Open your terminal or command prompt and run the following command:

pip install requests-html

Once the installation is complete, you can import HTMLSession from requests_html module:

from requests_html import HTMLSession

Fetching and Parsing HTML Pages

HTMLSession is the main class provided by requests-html for making HTTP requests and parsing HTML content. Let’s see how to use it to fetch and parse HTML pages.

Create an instance of HTMLSession:

session = HTMLSession()

Send an HTTP GET request to a URL using the get() method. This method returns an HTMLResponse object:

response = session.get('https://example.com')

Access the HTML content of the response using the html property:

html_content = response.html

Extracting Data from HTML Pages

Once you have obtained the HTML content, you can use various methods and properties of the HTMLResponse object to extract data from the HTML structure.

Finding Elements

You can find HTML elements using different methods like find(), find_all(), and CSS selectors.

# Find the first element matching a specific CSS selector
element = response.html.find('#id-of-element', first=True)

# Find all elements matching a specific CSS selector
elements = response.html.find('.class-of-elements')

# Find all anchor tags
links = response.html.find('a')

Accessing element attributes and text

Once you have obtained an element, you can access its attributes and text using the following properties:

# Get the value of an attribute
attribute_value = element.attrs['attribute-name']

# Get the text content of an element
text_content = element.text

Summary

The requests-html library provides a convenient way to fetch and parse HTML pages in Python. With its easy-to-use API, you can quickly extract data from HTML structures using CSS selectors or other methods. Whether you are building web scrapers, performing data analysis, or conducting web automation, requests-html is a versatile tool to have in your Python toolkit.

Make sure to check the official documentation of requests-html for more details and examples on how to use its functionalities effectively. Happy coding!