[파이썬] Scrapy HTTP 인증

Scrapy is a powerful web scraping framework written in Python. It provides a simple and efficient way to extract data from websites. In this blog post, we will discuss how to handle HTTP authentication in Scrapy.

Basic Authentication

Basic authentication is a commonly used method for securing websites. It requires users to provide a username and password to access protected resources. Scrapy provides built-in support for handling basic authentication.

To send HTTP authentication with Scrapy, you need to provide the username and password in the start_requests method of your spider. Here’s an example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    
    def start_requests(self):
        url = 'http://example.com/protected'
        yield scrapy.Request(url, callback=self.parse, meta={'dont_redirect': True}, 
                             headers={'Authorization': 'Basic BASE64_ENCODED_CREDENTIALS'})

    def parse(self, response):
        # Parse the response here
        pass

Make sure to replace BASE64_ENCODED_CREDENTIALS with the actual Base64-encoded credentials. You can use the base64 library in Python to encode the credentials.

Digest Authentication

Digest authentication is another method for securing websites. It works similar to basic authentication but uses a hashing algorithm to protect the password during transmission. Scrapy doesn’t provide built-in support for digest authentication, but you can handle it by setting the appropriate headers manually.

Here’s an example of how to send digest authentication with Scrapy:

import scrapy
import hashlib
from scrapy.http.headers import Headers

class MySpider(scrapy.Spider):
    name = 'my_spider'
    
    def start_requests(self):
        url = 'http://example.com/protected'
        realm = 'MyRealm'
        username = 'my_username'
        password = 'my_password'
        
        ha1 = hashlib.md5(f"{username}:{realm}:{password}".encode()).hexdigest()
        ha2 = hashlib.md5(f"GET:{url}".encode()).hexdigest()
        nonce = 'nonce'
        
        response = scrapy.Request(url, callback=self.parse, method='GET', headers={
            'Authorization': f'Digest username="{username}", realm="{realm}", nonce="{nonce}", '
                            f'uri="{url}", response="{hashlib.md5(f"{ha1}:{nonce}:{ha2}".encode()).hexdigest()}"'
        })
        
        response.meta['dont_redirect'] = True
        return response
    
    def parse(self, response):
        # Parse the response here
        pass

Make sure to replace realm, username, password, and nonce with appropriate values. Also, note that you need to compute the MD5 hash of different components to generate the response header.

Conclusion

Handling HTTP authentication in Scrapy is essential when scraping websites that require users to authenticate. By following the examples provided, you can easily send basic and digest authentication with Scrapy. This will allow you to access protected resources and extract the data you need for your web scraping projects. Happy scraping!