[파이썬] requests-html 리다이렉션 처리

In web scraping or web automation, dealing with redirects is a common challenge. When using the requests-html library in Python, there are several ways to handle redirections effectively.

Method 1: Using the history attribute

The requests-html library provides an attribute called history that contains a list of all the response objects from the redirection chain. By checking the length of this list, you can determine if any redirects occurred.

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.example.com')

if response.history:
    print(f"Redirection occurred! Total redirects: {len(response.history)}")
    final_url = response.url
    print(f"Final URL: {final_url}")

In this example, we create an HTMLSession object and send a GET request to the URL 'https://www.example.com'. If any redirection occurs, the code checks the history list and prints the total number of redirects along with the final URL after redirection.

Method 2: Using the allow_redirects parameter

Another way to handle redirects in requests-html is by using the allow_redirects parameter in the get() method. By setting this parameter to False, the library will not automatically follow redirects.

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.example.com', allow_redirects=False)

if response.status_code == 302:
    redirect_url = response.headers['Location']
    print(f"Redirect URL: {redirect_url}")

In this example, we explicitly set the allow_redirects parameter to False to prevent automatic redirection. If a redirect occurs, we check the status code of the response to confirm it is a redirect (e.g., 302 Found) and then extract the redirect URL from the Location header.

Method 3: Handling redirects manually

Sometimes, you may need more control over redirects, such as performing custom actions or handling specific types of redirects differently. In such cases, you can handle redirects manually using the next() method.

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.example.com', allow_redirects=False)

while response.status_code in [301, 302]:
    redirect_url = response.headers['Location']
    response = session.get(redirect_url, allow_redirects=False)

print(f"Final URL: {response.url}")

In this example, we initially set allow_redirects to False to disable automatic redirection. Then, we use a while loop to keep checking the status code and extracting the redirect URL from the Location header until we reach the final URL after all redirects.

By employing these methods, you can effectively handle redirects when using the requests-html library in Python, ensuring accurate results in your web scraping or automation tasks.