[Python] Setting the Scrapy User Agent

Scrapy is a powerful and flexible web scraping framework written in Python. It allows you to extract data from websites easily and efficiently. One important aspect of web scraping is to mimic the behavior of a real user browsing the website. This includes setting the user agent for your Scrapy spiders.

The user agent is a string that identifies the client's web browser and operating system. Many websites inspect it and may serve different content to, throttle, or block clients that look automated; by default, Scrapy identifies itself with a user agent of the form "Scrapy/<version> (+https://scrapy.org)". Setting a browser-like user agent can therefore improve the success rate of your web scraping and reduce the chance of being blocked.
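If you're curious what Scrapy sends when nothing is configured, you can print the framework's built-in default. The snippet below reads an internal module (scrapy.settings.default_settings), so treat it as a quick sanity check rather than a stable API:

from scrapy.settings import default_settings

# Scrapy's built-in default, e.g. "Scrapy/2.11.0 (+https://scrapy.org)".
print(default_settings.USER_AGENT)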

The simplest way to set the user agent in Scrapy is project-wide, through the USER_AGENT setting. Open the settings.py file in your Scrapy project and add the following line:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

In this example, we are setting the user agent to mimic Google Chrome on Windows. You can replace the user agent string with any other user agent you want.
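If you only need a different user agent for a single spider rather than the whole project, Scrapy also supports per-spider overrides through the custom_settings class attribute. The sketch below is illustrative; the spider name and the quotes.toscrape.com start URL are placeholders, not part of the original example:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    # Values declared here override the project-wide settings.py
    # for this spider only.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    }

    def parse(self, response):
        # Yield the text of each quote on the page.
        for text in response.css('div.quote span.text::text').getall():
            yield {'quote': text}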

Another way to set the user agent is with a custom downloader middleware. This is useful if you want to randomize the user agent or rotate it on every request instead of using a single fixed string. Here's an example; add the following class to your project's middlewares.py file:

import random


class RandomUserAgentMiddleware:

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the pool of user agents from the USER_AGENTS setting.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a random user agent and attach it to every outgoing request.
        request.headers['User-Agent'] = random.choice(self.user_agents)

Then define the pool of user agents and enable the middleware in settings.py:

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36',
]

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}

In this example, we define a middleware class called RandomUserAgentMiddleware that selects a random user agent from a list and sets it as the user agent for each request.

Remember to update the DOWNLOADER_MIDDLEWARES setting in settings.py to include your middleware.
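To confirm which user agent your requests actually carry, one quick check is to point a throwaway spider at an echo service such as https://httpbin.org/headers, which returns the received request headers as JSON. This is only a sketch for verification, not part of the original setup:

import scrapy


class HeadersCheckSpider(scrapy.Spider):
    name = 'headers_check'
    start_urls = ['https://httpbin.org/headers']

    def parse(self, response):
        # httpbin echoes the request headers back, so the User-Agent the
        # server received appears in the response body.
        self.logger.info(response.text)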

Setting the user agent for your Scrapy spiders is an important step in avoiding detection and ensuring the success of your web scraping. Remember to handle user agent selection wisely and ethically. Happy scraping!