[파이썬] requests-html `HTMLSession` 객체

The requests-HTML library in Python provides a convenient way to scrape and parse HTML web pages. The library offers an HTMLSession object that allows you to make HTTP requests and easily navigate through the resulting HTML.

In this blog post, we will explore the capabilities of the HTMLSession object and demonstrate its usage with some example code.

Installation

Before we begin, make sure you have the requests-HTML library installed. You can install it using pip:

pip install requests-html

Creating an HTMLSession

To start using the HTMLSession, first import it from the requests_html module:

from requests_html import HTMLSession

Next, create an instance of the HTMLSession class:

session = HTMLSession()

This will create a new session object that we can use to make HTTP requests and interact with HTML content.

Making HTTP requests

The session object provides several methods to make HTTP requests, such as get(), post(), put(), and delete(). These methods work similarly to their counterparts in the requests library.

Here’s an example of how to use the get() method to fetch a web page:

response = session.get('https://example.com')

The get() method returns a Response object that contains the server’s response. We can use the html property of the Response object to access and parse the HTML content:

html_content = response.html

The HTMLSession object provides various methods to navigate through the HTML content, such as find(), find_all(), find_one(), etc. These methods allow you to locate specific elements or extract data from the HTML.

For example, to find all the anchor tags (<a>) in the HTML content, you can use the find() method with the appropriate CSS selector:

anchors = html_content.find('a')

The anchors variable will contain a list of Element objects, each representing an anchor tag found in the HTML.

Extracting data from HTML

Once you have located the desired HTML elements, you can extract their data using various methods provided by the Element objects. For example, you can retrieve the text content of an element using the text property:

for anchor in anchors:
    print(anchor.text)

This will print the text content of each anchor tag found in the HTML.

Conclusion

The HTMLSession object in the requests-HTML library provides a powerful and easy-to-use interface for scraping and parsing HTML web pages. It allows you to make HTTP requests, navigate through HTML content, and extract data with minimal effort.

In this blog post, we have explored the basic usage of the HTMLSession object. However, the library offers many more advanced features and options that can enhance your web scraping projects. Feel free to refer to the library’s documentation for more information and examples.

Happy scraping!