[파이썬] Scrapy IP 프록시 설정

When scraping websites, it is important to protect your identity and avoid getting blocked. One way to achieve this is by using IP proxies to mask your real IP address. In this blog post, we will explore how to configure IP proxies in Scrapy, a powerful web scraping framework in Python.

Installing Dependencies

Before we begin, make sure you have Scrapy and the required dependencies installed on your system. You can install Scrapy using pip:

pip install scrapy

Setting Up Proxies in Scrapy

To set up IP proxies in Scrapy, we need to override the default downloader middleware and add our custom middleware. This middleware will handle the proxy request and response.

Create a new file called middlewares.py and add the following code:

from scrapy import signals
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

class ProxyMiddleware(HttpProxyMiddleware):
  
    def process_request(self, request, spider):
        # replace 'proxy_host' and 'proxy_port' with your proxy details
        request.meta['proxy'] = 'http://proxy_host:proxy_port'

This middleware will intercept every request made by Scrapy and set the proxy attribute in the meta field of the request to our proxy address. Make sure to replace 'proxy_host' and 'proxy_port' with your actual proxy details.

Enabling the Middleware

In your Scrapy project’s settings.py file, add the following line to enable the custom proxy middleware:

DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.ProxyMiddleware': 543,
}

Replace 'yourproject' with the actual name of your Scrapy project.

Testing the Setup

To test if the proxy setup is working correctly, create a new spider or modify an existing one. Add the following code to the spider’s parse method:

def parse(self, response):
    ip_address = response.xpath('//body//span[contains(@class, "ip_address")]').get()
    yield {
        'IP': ip_address
    }

This code extracts the IP address from the website’s response. Replace the XPath expression with the appropriate one for your target website.

Run your Scrapy spider using the following command:

scrapy crawl yourspider

You should now see that the IP address returned is the IP address of the proxy server rather than your actual IP address.

Conclusion

Setting up IP proxies in Scrapy can help you bypass IP restrictions and scrape websites without revealing your true identity. By following the steps outlined in this blog post, you can configure Scrapy to utilize IP proxies effectively. Happy scraping!

If you have any questions or feedback, feel free to leave a comment below.