[파이썬] Scrapy 헤더 조작

Scrapy is a powerful web scraping framework in Python that allows you to extract data from websites. When making requests using Scrapy, it is important to understand how to manipulate headers to achieve different objectives. Headers can be used to send additional information to the server, such as user agent, language preferences, and authentication credentials.

In this blog post, we will explore various techniques to manipulate headers in Scrapy to enhance our web scraping capabilities.

1. Setting User Agents

User agents are strings that identify the web browser or client making the request. Some websites may block requests from bots or certain user agents. To overcome this, we can set a custom user agent in our Scrapy spider.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    
    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        yield scrapy.Request(url='https://www.example.com', headers=headers, callback=self.parse)
    
    def parse(self, response):
        # Parsing logic here
        pass

2. Adding Custom Headers

In addition to the user agent, we may want to send other headers to the server. These headers could be used for authentication, cookie management, or any other purpose required by the target website. We can do this by adding custom headers to our Scrapy requests.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    
    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
            'Referer': 'https://www.example.com',
            'Authorization': 'Bearer token123'
        }
        yield scrapy.Request(url='https://www.example.com', headers=headers, callback=self.parse)
    
    def parse(self, response):
        # Parsing logic here
        pass

3. Modifying Headers during Spider Runtime

There may be situations where we need to modify headers dynamically during the spider runtime. This can be done by overriding the start_requests method and using the meta attribute of the Scrapy Request object.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    
    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        yield scrapy.Request(url='https://www.example.com', headers=headers, meta={'proxy': 'http://proxy.example.com:8080'}, callback=self.parse)
    
    def parse(self, response):
        # Modifying headers dynamically
        headers = response.request.headers
        headers['Referer'] = 'https://www.example.com'
        
        # Making subsequent requests with modified headers
        yield scrapy.Request(url='https://www.example.com/next', headers=headers, callback=self.parse_next)
    
    def parse_next(self, response):
        # Parsing logic for the next page
        pass

In conclusion, understanding how to manipulate headers in Scrapy can greatly enhance our web scraping capabilities. By setting custom user agents, adding custom headers, and modifying headers during the spider runtime, we can mimic different client behaviors and access the desired data more effectively.