[파이썬] Scrapy 스크린샷 캡처

06 Sep 2023

Scrapy

Scrapy is a powerful web scraping framework that allows developers to extract data from websites. One useful feature of Scrapy is its ability to capture screenshots of web pages. In this blog post, we will explore how to use Scrapy to capture screenshots in Python.

Installing Scrapy

To get started, we first need to install Scrapy. You can do this by running the following command in your command prompt or terminal:

pip install Scrapy

Make sure you have Python and pip installed on your system before running this command.

Creating a Scrapy Project

Once Scrapy is installed, let’s create a new Scrapy project. Open your command prompt or terminal and navigate to the directory where you want to create the project. Then, run the following command:

scrapy startproject screenshot_project

This will create a new directory called screenshot_project with the basic structure of a Scrapy project.

Writing the Spider

Next, let’s create a spider to scrape the web page and capture a screenshot. In the spiders directory of your project, create a new Python file called screenshot_spider.py. Open the file and add the following code:

import scrapy
from scrapy_screenshot import ScreenshotMiddleware

class ScreenshotSpider(scrapy.Spider):
    name = "screenshot"

    def start_requests(self):
        urls = [
            'https://www.example.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Get the current URL
        url = response.url

        # Capture the screenshot
        yield scrapy.Request(url=url, callback=self.save_screenshot)

    def save_screenshot(self, response):
        # Save the screenshot as a file
        filename = response.url.split("/")[-2] + '.png'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved screenshot as {}'.format(filename))

In this spider, we define a start_requests method that generates the URLs we want to scrape. We then define a parse method that captures the screenshot and calls the save_screenshot method.

The save_screenshot method saves the screenshot as a file using the URL as the filename.

Configuring Scrapy Settings

To enable the screenshot capturing functionality, we need to configure Scrapy settings. Open the settings.py file in your project and add the following lines:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_screenshot.middleware.ScreenshotMiddleware': 600,
}

SCRAPERPROXY_ENABLED = False

The ScreenshotMiddleware is a middleware that captures the screenshots. We set its priority to 600. The SCRAPERPROXY_ENABLED setting is optional and is used to disable the use of proxy servers.

Running the Spider

To run the spider and capture the screenshots, navigate to your project’s directory in the command prompt or terminal and run the following command:

scrapy crawl screenshot

Scrapy will start crawling the URLs specified in the start_requests method of the spider and capture screenshots of the web pages.

Conclusion

In this blog post, we have explored how to use Scrapy to capture screenshots in Python. We started by installing Scrapy and creating a new Scrapy project. Then, we wrote a spider to scrape the web page and capture a screenshot. Finally, we configured Scrapy settings and ran the spider. Scrapy’s built-in screenshot capturing functionality makes it easy to automate the task of capturing web page screenshots.