In web scraping, it is important to respect the rules set by websites to maintain ethical and legal practices. The robots.txt file is a standard used by websites to communicate guidelines and limitations to web crawlers or robots.
In this blog post, we will explore how to implement robots.txt compliance in Python using the Scrapy framework. Scrapy is a powerful and flexible web scraping framework that provides built-in support for handling robots.txt rules.
What is robots.txt?
The robots.txt file is a text file located in the root directory of a website that specifies which parts of the website web robots are allowed to crawl. It helps prevent scraping of sensitive or copyrighted information and ensures that web crawlers do not overload the server with too many requests.
The robots.txt file follows a specific syntax, in which website owners define rules using directives such as User-agent, Disallow, and Allow. These directives tell web robots which parts of the website to access or avoid.
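For instance, the following rules (a hypothetical snippet) allow all user agents to crawl the site except anything under /private/, and Python's standard-library urllib.robotparser module can evaluate them:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body: every user agent may crawl the
# site, except anything under /private/.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/private/data"))  # False
print(parser.can_fetch("*", "https://example.com/about"))         # True
```

The same evaluation happens inside any compliant crawler before a URL is requested.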
Scrapy and robots.txt
Scrapy, being a comprehensive web scraping framework, provides convenient tools to handle robots.txt rules. When a Scrapy spider visits a website, it can automatically check the robots.txt file to determine whether it is allowed to crawl certain pages or follow links.
Scrapy uses the RobotsTxtMiddleware downloader middleware to enforce robots.txt compliance. This middleware downloads and parses the robots.txt file for each site the spider contacts and filters out requests that the rules disallow.
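Conceptually, the filtering the middleware performs can be sketched like this. This is a simplified, hypothetical stand-in built on the standard library, not Scrapy's actual implementation, which is asynchronous and integrated with Scrapy's downloader:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class SimpleRobotsFilter:
    """Hypothetical, simplified stand-in for Scrapy's RobotsTxtMiddleware:
    caches one parser per host and rejects disallowed URLs."""

    def __init__(self, user_agent="*"):
        self.user_agent = user_agent
        self._parsers = {}  # host -> RobotFileParser

    def add_rules(self, host, robots_body):
        # Real Scrapy fetches the robots.txt body over HTTP; here it is
        # supplied directly to keep the sketch offline.
        parser = RobotFileParser()
        parser.parse(robots_body.splitlines())
        self._parsers[host] = parser

    def should_request(self, url):
        parser = self._parsers.get(urlparse(url).netloc)
        if parser is None:
            return True  # no rules known for this host
        return parser.can_fetch(self.user_agent, url)

f = SimpleRobotsFilter()
f.add_rules("example.com", "User-agent: *\nDisallow: /private/")
print(f.should_request("https://example.com/private/x"))   # False
print(f.should_request("https://example.com/index.html"))  # True
```

Caching one parser per host mirrors the real middleware's behavior of fetching each site's robots.txt only once per crawl.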
Implementing robots.txt compliance in Scrapy
To enable robots.txt compliance in a Scrapy project, follow these steps:
Step 1: Create a new Scrapy project
$ scrapy startproject myproject
Step 2: Open the settings.py file
The settings.py file contains configuration settings for the Scrapy project.
Step 3: Enable the ROBOTSTXT_OBEY setting
The RobotsTxtMiddleware is already part of Scrapy's default downloader middlewares (registered at priority 100), so you do not need to add it to DOWNLOADER_MIDDLEWARES yourself. Instead, switch it on by setting the following in settings.py:
ROBOTSTXT_OBEY = True
Projects generated with scrapy startproject include this setting by default. When it is True, Scrapy fetches each site's robots.txt before crawling it and drops disallowed requests.
Step 4: Run the Scrapy spider
With the robots.txt compliance enabled, run your Scrapy spider using the command:
$ scrapy crawl myspider
Scrapy will automatically check the robots.txt file for each request the spider makes and adjust the crawling behavior accordingly.
Handling robots.txt directives
Scrapy delegates robots.txt parsing to a pluggable parser interface (the Protego library by default), and you can use those parsers, or Python's standard library, directly in your own code. Here are some examples:
Checking if a URL is allowed
You can check whether a URL may be crawled using the Protego library, which is Scrapy's default robots.txt parser. Here's an example:
from protego import Protego
robots = Protego.parse("User-agent: *\nDisallow: /private/")
allowed = robots.can_fetch("http://www.example.com/private/page", "*")
print(allowed)  # Output: False
In this example, we check whether the URL "http://www.example.com/private/page" may be crawled under the given robots.txt rules; since /private/ is disallowed for every user agent, the result is False.
Fetching the robots.txt file
You can fetch and parse a live robots.txt file with the urllib.robotparser module from the Python standard library. Here's an example:
from urllib.robotparser import RobotFileParser
robots = RobotFileParser("http://www.example.com/robots.txt")
robots.read()
print(robots.can_fetch("*", "http://www.example.com/private/"))
In this example, we download the robots.txt file for "http://www.example.com" and check whether a specific path may be fetched; can_fetch() returns True or False depending on the site's actual rules.
Conclusion
Robots.txt compliance is an essential aspect of ethical web scraping: it ensures that website owners' guidelines are respected. With Scrapy's built-in support for handling robots.txt rules, implementing compliance becomes straightforward and hassle-free.
Scrapy's RobotsTxtMiddleware, switched on through the ROBOTSTXT_OBEY setting, automatically checks the robots.txt file for each request made by a spider and adjusts crawling behavior accordingly. Additionally, standalone parsers such as Protego and urllib.robotparser let you evaluate robots.txt directives in your own code.
By incorporating Scrapy’s robots.txt support, you can build robust and responsible web scraping applications in Python. Happy scraping!
Please note that web scraping may be subject to legal restrictions and should be done responsibly and with respect for website terms of service.