[Python] Handling Redirects in Scrapy

Scrapy is a powerful web scraping framework for Python that makes it easy to extract data from websites. One important aspect of web scraping is handling redirects, which lets you navigate websites that redirect requests to other URLs.

In this blog post, we’ll explore how to handle redirects in Scrapy using Python. We’ll cover different scenarios and provide code examples to demonstrate the implementation.

Scenario 1: Following redirects automatically

Scrapy provides a built-in mechanism, RedirectMiddleware, that automatically follows redirects when making requests. This behavior is controlled by the REDIRECT_ENABLED setting, which is True by default, so you don’t need to make any additional changes.

# settings.py

REDIRECT_ENABLED = True

With this setting enabled, Scrapy will automatically follow redirects and return the response from the final redirected URL. This greatly simplifies the code and allows you to focus on the scraping logic instead of handling redirects manually.
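Even with automatic following enabled, you can still inspect where a request has been: RedirectMiddleware records each intermediate URL in the request meta under the redirect_urls key, and the REDIRECT_MAX_TIMES setting (20 by default) caps how many redirects are followed per request. Below is a minimal sketch; the spider name and URL are placeholders.

# spider.py

from scrapy.spiders import Spider

class RedirectAwareSpider(Spider):

    name = "redirect_aware"  # placeholder name
    start_urls = ["http://example.com/old"]  # placeholder URL

    def parse(self, response):
        # RedirectMiddleware stores the intermediate URLs in request meta
        redirect_chain = response.meta.get("redirect_urls", [])
        if redirect_chain:
            self.logger.info(
                "Arrived at %s via %s", response.url, redirect_chain
            )
        # ...normal scraping logic continues here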

Scenario 2: Custom redirect handling

In some cases, you may want to handle redirects manually instead of relying on Scrapy’s automatic redirection handling. This could be useful for scenarios where you need to perform additional operations before following the redirect.
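This decision can also be made per request rather than per spider: setting dont_redirect in the request meta disables RedirectMiddleware for that request only, and a handle_httpstatus_list entry in the same meta lets the raw “3xx” response reach the callback. A minimal sketch, again with a placeholder name and URL:

# spider.py

from scrapy.spiders import Spider
from scrapy.http import Request

class PerRequestSpider(Spider):

    name = "per_request_redirect"  # placeholder name

    def start_requests(self):
        # dont_redirect disables RedirectMiddleware for this request only;
        # handle_httpstatus_list lets the raw 3xx response reach parse
        yield Request(
            "http://example.com/old",  # placeholder URL
            meta={"dont_redirect": True, "handle_httpstatus_list": [301, 302]},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got status %s for %s", response.status, response.url)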

To manually handle redirects across a whole spider, you can set the Spider class’s handle_httpstatus_list attribute. This attribute is a list of HTTP status codes that Scrapy should deliver to your callback instead of processing through its middleware. By default the list is empty, so “3xx” redirect responses are intercepted and followed by RedirectMiddleware; listing a status code here means the raw redirect response is passed to the parse method instead.

# spider.py

from scrapy.spiders import Spider
from scrapy.http import Request

class MySpider(Spider):

    name = "my_spider"  # placeholder name
    start_urls = ["http://example.com/old"]  # placeholder URL

    # Deliver 301 and 302 responses to the parse callback instead of
    # letting RedirectMiddleware follow them automatically
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        # Custom redirect handling logic
        if response.status in self.handle_httpstatus_list:
            # Perform additional operations here, then resolve the
            # redirect target from the Location header manually
            location = response.headers.get("Location", b"").decode()
            redirected_url = response.urljoin(location)
            yield Request(redirected_url, callback=self.parse_redirect)

    def parse_redirect(self, response):
        # Process the final redirected response here
        pass

In this example, we set the handle_httpstatus_list attribute so that “301” and “302” responses are delivered to the spider. Inside the parse method, we can perform any additional operations before reading the redirect target from the Location header and creating a new request, which is then processed in the parse_redirect method.
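If the custom logic should apply across every spider in a project, another option is to extend Scrapy’s built-in RedirectMiddleware. The sketch below is one possible approach under that assumption: it logs each redirect and then falls back to the stock behavior. The project path myproject.middlewares is a placeholder.

# middlewares.py

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class LoggingRedirectMiddleware(RedirectMiddleware):
    """Log every redirect, then fall back to the default handling."""

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):
            spider.logger.info(
                "Redirect %s from %s", response.status, request.url
            )
        return super().process_response(request, response, spider)

To activate it, swap it in for the built-in middleware, which is registered at priority 600:

# settings.py

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "myproject.middlewares.LoggingRedirectMiddleware": 600,  # placeholder path
}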

Conclusion

Handling redirects is an important aspect of web scraping, and Scrapy provides various ways to handle them effectively. By enabling automatic redirect following or implementing custom redirect handling, you can efficiently navigate through websites that use redirections and extract the desired data.

Scrapy’s flexibility allows you to easily adapt to different scenarios and customize the redirect handling process according to your scraping requirements. So, the next time you encounter redirects during web scraping with Scrapy, you know how to handle them efficiently using Python.