[Python] Running Scrapy in the Cloud with Scrapy Cloud

Scrapy is a powerful and flexible web scraping framework written in Python. It provides a convenient way to extract data from websites by defining spiders, which are Python classes that define how to navigate and parse web pages. One of the advantages of using Scrapy is its ability to handle large-scale web scraping tasks efficiently.

In this blog post, we will explore how to execute a Scrapy spider in the cloud using Scrapy Cloud. Scrapy Cloud is a cloud-based platform provided by Scrapinghub that allows you to schedule and run your Scrapy spiders in a distributed manner. This can be beneficial when you need to scrape a large number of websites or when you want to speed up the scraping process.

Setting up a Scrapy Cloud Account

To begin, you will need to create an account on Scrapy Cloud. Simply visit the Scrapinghub website (https://scrapinghub.com/) and sign up for a free account. Once you have successfully signed up, you will gain access to the Scrapy Cloud dashboard.

Creating a Scrapy Project

Next, we need to create a Scrapy project. Open your terminal or command prompt, navigate to your desired project directory, and run the following command:

scrapy startproject myproject

This will create a new directory named myproject that contains the basic structure of a Scrapy project.
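
The generated layout looks roughly like this (exact file names can vary slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py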

Defining a Scrapy Spider

Inside the myproject directory (more precisely, inside the inner myproject package) there is a directory named spiders. This is where we will define our Scrapy spider. Create a new Python file called quotes_spider.py within the spiders directory and open it in your preferred text editor.

Within this file, define a class named QuotesSpider that inherits from scrapy.Spider. This class represents our spider and contains the logic for navigating and scraping web pages. Here is an example:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Extract data from web page here
        pass

In this example, our spider is named "quotes" and it starts scraping from the URL 'http://quotes.toscrape.com/page/1/'. The parse method is responsible for handling the response from the web page and extracting the desired data.
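
To give a more concrete picture, here is one way the parse method could be filled in for quotes.toscrape.com. The CSS selectors below reflect that site's current markup and are meant only as a sketch:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" link, if present, and parse it with this same method
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)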

Deploying the Scrapy Spider to Scrapy Cloud

To deploy our Scrapy spider to Scrapy Cloud, we need to install the shub command-line tool. Run the following command in your terminal:

pip install shub

Once installed, run the following command to log in to Scrapy Cloud:

shub login

This will prompt you to enter your Scrapy Cloud API key, which you can find in the Scrapy Cloud dashboard. After successfully logging in, run the following command inside your project directory to deploy the spider:

shub deploy

The shub tool will package your project and upload it to Scrapy Cloud. You can monitor the deployment progress in your terminal.
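
If no target project has been configured yet, shub will typically ask for your numeric project ID (shown on the project's page in the dashboard) and store it in a scrapinghub.yml file at the root of your project. A minimal file looks like this, where 123456 is only a placeholder for your own project ID:

project: 123456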

Running the Scrapy Spider on Scrapy Cloud

Now that our spider is deployed, we can schedule and run it on Scrapy Cloud. Switch to the Scrapy Cloud dashboard and navigate to your project. From the left sidebar, click on “Jobs” and then click the “New Job” button.

Fill in the necessary details, such as the spider name, start URLs, and any other job parameters. Once you have configured the job, click the “Run” button to start the scraping process.
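
If you prefer the command line, the shub tool also provides a schedule command that starts a job for a deployed spider. Assuming the spider name from our example, the call would look like this:

shub schedule quotes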

Monitoring the Scraping Process

Scrapy Cloud provides a convenient interface to monitor and analyze the progress and results of your scraping jobs. You can view the status, logs, and statistics of your running or completed jobs from the Scrapy Cloud dashboard.
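
The same information is also reachable from the command line: shub offers log and items commands that take a job ID in project/spider/job form (the 12345/1/8 below is just a placeholder):

shub log 12345/1/8
shub items 12345/1/8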

Conclusion

Executing Scrapy spiders in the cloud using Scrapy Cloud is a powerful way to scale and speed up your web scraping tasks. It provides a distributed and efficient environment for running your spiders, making it easier to scrape large amounts of data from websites.

By following the steps outlined in this blog post, you should now have a good understanding of how to set up a Scrapy Cloud account, deploy your Scrapy project, and run your spiders in the cloud. Happy scraping!