[Python] Bypassing ReCaptcha with Scrapy

06 Sep 2023


In web scraping, running into a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) can be frustrating. ReCaptcha is one of the most common CAPTCHA systems that web scrapers face. However, with the right techniques and tools, it is possible to get past ReCaptcha and continue scraping the desired data.

In this blog post, we will explore how to bypass ReCaptcha using Scrapy, a powerful web scraping framework in Python.

Understanding ReCaptcha

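ReCaptcha is Google's CAPTCHA service, which websites embed to block automated traffic such as scrapers. Older versions asked users to type distorted text. ReCaptcha v2 usually shows an "I'm not a robot" checkbox, sometimes followed by an image-selection challenge, while v3 scores visitors in the background without showing a challenge. In each case the site expects a valid ReCaptcha token with the request, which makes it difficult for bots to proceed with scraping.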
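To make this concrete: the standard ReCaptcha v2 widget is normally embedded as an element with the class g-recaptcha that carries the site's public key in a data-sitekey attribute, and that key is what a solver needs. The sketch below pulls it out of raw HTML using only the standard library. The names SiteKeyParser and find_sitekey are made up for this example, and pages that render the widget through JavaScript may not expose the key this way.

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Record the data-sitekey of the first ReCaptcha widget in the page."""

    def __init__(self):
        super().__init__()
        self.sitekey = None

    def handle_starttag(self, tag, attrs):
        if self.sitekey is not None:
            return  # keep only the first widget found
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "g-recaptcha" in classes and attr_map.get("data-sitekey"):
            self.sitekey = attr_map["data-sitekey"]


def find_sitekey(html):
    """Return the ReCaptcha site key found in html, or None if there is none."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.sitekey
```

Inside a Scrapy callback the same lookup is usually a one-liner: response.css('.g-recaptcha::attr(data-sitekey)').get().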

Bypassing ReCaptcha with Scrapy

To bypass ReCaptcha in Scrapy, we can utilize one of the following methods:

Solving ReCaptcha manually: This method relies on you, the operator, rather than on an external service. The spider can be written to pause when it hits a ReCaptcha and prompt you to solve it yourself, for example in a regular browser. Once you supply the resulting token, the scraping can continue.
Using a ReCaptcha solving service: There are third-party services, often human-powered, that specialize in solving ReCaptchas on behalf of users. These services have APIs that can be called from a Scrapy spider. The details of the challenge (typically the ReCaptcha site key and the page URL) are sent to the service API, and the token it returns can be used to proceed with scraping.
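Browser automation: This method involves using a browser automation tool such as Selenium or Playwright alongside Scrapy (Puppeteer is the Node.js counterpart). The automated browser loads the page and renders the ReCaptcha widget like a real browser would. The challenge itself may still have to be solved manually or by a solving service, after which Scrapy can continue the scraping process.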
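As a rough illustration of the first method, the sketch below asks the operator to paste a token obtained by solving the ReCaptcha in a normal browser, then merges it into the form fields to be posted. The function names prompt_for_token and build_form_data are hypothetical, and the only field name taken from this post is g-recaptcha-response.

```python
def prompt_for_token(ask=input, max_tries=3):
    """Ask the operator for a solved ReCaptcha token, retrying on empty input."""
    for _ in range(max_tries):
        token = ask("Paste the g-recaptcha-response token: ").strip()
        if token:
            return token
    raise RuntimeError("no ReCaptcha token was entered")


def build_form_data(token, extra_fields=None):
    """Return the POST fields for the target form, including the solved token."""
    token = token.strip()
    if not token:
        raise ValueError("empty ReCaptcha token")
    data = dict(extra_fields or {})  # copy so the caller's dict is not mutated
    data["g-recaptcha-response"] = token
    return data
```

The resulting dictionary can be passed as formdata to scrapy.FormRequest. Note that input() blocks, so while the spider waits for the operator, no other requests are processed.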

Sample Code Using 2Captcha Service

Let’s take a look at a sample code snippet that uses the 2Captcha service to solve ReCaptcha challenges:

import scrapy


class MySpider(scrapy.Spider):
    name = 'recaptcha_spider'
    api_key = 'your-2captcha-api-key'

    def start_requests(self):
        # Initial request to the webpage with ReCaptcha. The response is
        # only available in the callback, not inside start_requests.
        yield scrapy.Request(url='your-target-url', callback=self.solve_recaptcha)

    def solve_recaptcha(self, response):
        # Extract ReCaptcha challenge from response
        # (for ReCaptcha v2 this is typically the widget's data-sitekey value)
        recaptcha_challenge = response.css('your-recaptcha-selector').get()

        # Send ReCaptcha challenge to 2Captcha service
        captcha_id = self._send_recaptcha_challenge(self.api_key, recaptcha_challenge)

        # Poll 2Captcha service for solution
        # (a blocking poll here stalls all other requests until it returns)
        solution = self._poll_solution(self.api_key, captcha_id)

        # Build POST data with ReCaptcha solution and resume scraping
        data = {
            'g-recaptcha-response': solution,
            # ...
            # Other required POST parameters
            # ...
        }
        yield scrapy.FormRequest(url='your-target-url', method='POST', formdata=data, callback=self.parse)

    def _send_recaptcha_challenge(self, api_key, challenge):
        # Code to send ReCaptcha challenge to 2Captcha service and get captcha ID
        # ...
        raise NotImplementedError

    def _poll_solution(self, api_key, captcha_id):
        # Code to poll 2Captcha service for solution using captcha ID
        # ...
        raise NotImplementedError

    def parse(self, response):
        # ...
        # Parsing and extraction logic for scraped data
        # ...
        pass

Please note that this is just an example code snippet and may require additional customization based on the specific website and ReCaptcha implementation you are dealing with.
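The two helper methods left as stubs in the spider could be filled in roughly as follows. This is a minimal sketch, assuming 2Captcha's in.php and res.php endpoints, its userrecaptcha method, and its JSON replies of the form {"status": ..., "request": ...}. The function names, the injectable fetch and sleep parameters, and the retry limits are choices made for this example, so check them against the current 2Captcha documentation before relying on them.

```python
import json
import time
import urllib.parse
import urllib.request

IN_URL = "https://2captcha.com/in.php"    # assumed task submission endpoint
RES_URL = "https://2captcha.com/res.php"  # assumed result polling endpoint


def http_get_json(url, params):
    """GET url with the given query parameters and decode the JSON body."""
    full_url = url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(full_url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def send_recaptcha_challenge(api_key, sitekey, page_url, fetch=http_get_json):
    """Submit a ReCaptcha v2 task and return the captcha ID assigned to it."""
    params = {
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": sitekey,
        "pageurl": page_url,
        "json": 1,
    }
    reply = fetch(IN_URL, params)
    if reply.get("status") != 1:
        raise RuntimeError(f"task rejected: {reply.get('request')}")
    return reply["request"]


def poll_solution(api_key, captcha_id, fetch=http_get_json,
                  interval=5.0, max_attempts=24, sleep=time.sleep):
    """Poll until the token is ready, an error is reported, or attempts run out."""
    params = {"key": api_key, "action": "get", "id": captcha_id, "json": 1}
    for _ in range(max_attempts):
        reply = fetch(RES_URL, params)
        if reply.get("status") == 1:
            return reply["request"]  # the g-recaptcha-response token
        if reply.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"solver error: {reply.get('request')}")
        sleep(interval)
    raise TimeoutError("no solution received within the polling limit")
```

Passing fetch and sleep as parameters keeps the helpers easy to test without network access. Inside a spider, these calls block, so a production setup would more likely issue the 2Captcha requests as Scrapy requests or run them in a separate thread.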

Conclusion

Bypassing ReCaptcha while using Scrapy for web scraping is a challenging task. However, with the right tools and techniques, it can be overcome. Remember to always follow ethical scraping practices and ensure you have proper permissions to scrape the target website.

Happy scraping!