[파이썬] requests-html 리캡챠 우회 전략

With the ever-increasing popularity of web scraping, websites are implementing various strategies to protect their data. One popular method used to prevent automated scraping is the addition of reCAPTCHA challenges. However, as developers, we often need to bypass these challenges in order to access the data we require. In this blog post, we will explore some strategies to bypass reCAPTCHA challenges using the requests-html library in Python.

Introduction to requests-html

requests-html is a Python library that allows you to easily scrape websites by handling the underlying HTTP requests and rendering of the HTML content. It provides a high-level API that integrates the functionality of both requests and BeautifulSoup, making it a powerful tool for web scraping.

Understanding reCAPTCHA

Before we discuss bypassing reCAPTCHA challenges, let’s quickly understand what reCAPTCHA is. reCAPTCHA is a service provided by Google that distinguishes between humans and automated bots. It presents different challenges (like image puzzles or checkbox verifications) that users need to solve in order to prove their humanness.

Bypassing reCAPTCHA with selenium

One approach to bypassing reCAPTCHA challenges is to use a headless browser like Selenium. Selenium allows you to automate interactions with websites, including solving reCAPTCHA challenges.

Here’s an example of using Selenium with requests-html to bypass reCAPTCHA challenges:

from requests_html import HTMLSession
from selenium import webdriver
from selenium.webdriver.common.by import By

session = HTMLSession()
driver = webdriver.Firefox()

# Load the web page
response = session.get('https://example.com')

# Check if reCAPTCHA challenge exists
captcha = response.html.find('.g-recaptcha', first=True)
if captcha:
    # Solve the reCAPTCHA challenge using Selenium
    driver.get(response.url)
    driver.switch_to.frame(driver.find_element(By.XPATH, '//iframe[contains(@src, "recaptcha")]'))
    driver.find_element(By.XPATH, '//*[contains(@class, "rc-anchor-center-item")]').click()

    # Get the updated response after solving the challenge
    cookies = driver.get_cookies()
    session.cookies.update({cookie['name']: cookie['value'] for cookie in cookies})
    response = session.get(response.url)

# Use the updated response for scraping
# ...

In this example, we first check if the web page contains a reCAPTCHA challenge. If it does, we use Selenium to automate the solving of the challenge. Once the reCAPTCHA challenge is solved, we update the session cookies and retrieve the updated response using requests-html.

Bypassing reCAPTCHA with third-party APIs

Another strategy to bypass reCAPTCHA challenges is to use third-party APIs that provide reCAPTCHA solving services. These APIs use machine learning algorithms to analyze and solve the challenges.

To use a third-party API, you need to sign up for an account and obtain an API key. There are multiple third-party providers available, such as 2Captcha, Anti-Captcha, or DeathByCaptcha.

Here’s an example of using the 2Captcha API to bypass reCAPTCHA challenges:

from requests_html import HTMLSession

session = HTMLSession()

# Load the web page
response = session.get('https://example.com')

# Check if reCAPTCHA challenge exists
captcha = response.html.find('.g-recaptcha', first=True)
if captcha:
    # Solve the reCAPTCHA challenge using 2Captcha API
    api_key = 'YOUR_2CAPTCHA_API_KEY'
    site_key = 'YOUR_RECAPTCHA_SITE_KEY'
    page_url = response.url
    params = {'key': api_key, 'method': 'userrecaptcha', 'googlekey': site_key, 'pageurl': page_url}
    response = session.post('http://2captcha.com/in.php', data=params)
    captcha_id = response.text.split('|')[1]

    # Wait for the challenge to be solved
    while True:
        response = session.get(f'http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}')
        if response.text.startswith('OK'):
            break

    # Get the solved reCAPTCHA token
    token = response.text.split('|')[1]

    # Include the solved token in your subsequent requests
    session.cookies.set('g-recaptcha-response', token)

# Use the updated response for scraping
# ...

In this example, we send a request to the 2Captcha API to solve the reCAPTCHA challenge. We provide the necessary parameters, including the API key, reCAPTCHA site key, and the URL of the web page. We then continuously check the API response until the challenge is solved. Finally, we extract the solved reCAPTCHA token and include it in the subsequent requests.

Conclusion

Bypassing reCAPTCHA challenges while web scraping can be a challenging task. In this blog post, we explored two strategies to overcome reCAPTCHA challenges using the requests-html library in Python. Whether you choose to use a headless browser like Selenium or an external reCAPTCHA solving service like 2Captcha, always ensure you comply with the terms of service of the websites you are scraping.