[파이썬] requests-html 리소스 제한하기

In web scraping or web automation tasks, it’s important to control the resources used by the HTTP requests we make. Setting resource limits can help us manage our bandwidth, prevent hitting rate limits on websites, and ensure efficient use of system resources.

In this blog post, we will explore how to limit resources in the requests-html library using Python.

Installation

First, let’s install the required libraries. Open your terminal and run the following command:

pip install requests-html

Limiting Resource Usage

The requests-html library provides several options to limit resource usage during HTTP requests. We will mainly focus on the following two options:

  1. Timeout: Limit the maximum time to wait for a response from the server.
  2. Session Retries: Set the maximum number of retries for failed requests.

Timeout

Timeout allows us to set a maximum time limit for a request. If the server takes longer to respond than the specified timeout value, an exception will be raised. We can set the timeout value in seconds in the get() method. Here’s an example:

from requests_html import HTMLSession

session = HTMLSession()

# Set a timeout of 10 seconds for the request
response = session.get('https://example.com', timeout=10)

Session Retries

The requests-html library uses a session object to manage HTTP requests. We can set the maximum number of retries for failed requests using the session.retries attribute. By default, session.retries is set to 3. Here’s an example:

from requests_html import HTMLSession

session = HTMLSession()

# Set the maximum number of retries to 5
session.retries = 5

# Make a request
response = session.get('https://example.com')

Conclusion

Limiting resource usage in web scraping and web automation tasks is crucial to maintain efficiency, avoid excessive bandwidth usage, and handle server timeouts effectively. In this blog post, we explored how to set resource limits using the requests-html library in Python.

By setting timeout values and the number of retries appropriately, we can have better control over our HTTP requests and optimize resource usage.

Remember to respect the server’s terms of service and apply resource limits responsibly to avoid any unwanted consequences.

Happy scraping!