Scrapy is a powerful and flexible web scraping framework that makes it easy to extract data from websites. Scrapy Cloud, provided by Scrapinghub, is a cloud-based platform where you can deploy and run your Scrapy spiders. In this blog post, we will walk through the steps to use Scrapy Cloud from Python.
Step 1: Install Scrapy and create a spider
First, make sure you have Scrapy installed. You can install it using pip:
pip install scrapy
Next, create a Scrapy spider. A spider is responsible for defining how to navigate a website and extract data from it. Let’s create a simple spider that scrapes quotes from “http://quotes.toscrape.com”:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/"
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get()
            }

        # Follow the "Next" link, if any, and parse the next page the same way
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
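Before deploying anything, it is worth checking that the spider works locally. Assuming you saved the code above as quotes_spider.py (the file name is just an example), you can run it directly and write the results to a JSON file:
scrapy runspider quotes_spider.py -o quotes.json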
Step 2: Deploy the spider to Scrapy Cloud
Now that we have our spider ready, let’s deploy it to Scrapy Cloud.
- Sign up for a Scrapy Cloud account at https://app.scrapinghub.com/.
- Install shub, the Scrapinghub command-line tool, using pip:
pip install shub
- Log in to Scrapinghub by running the following command and providing your Scrapinghub API key:
shub login
- Create a new Scrapy project:
scrapy startproject <project_name>
- Go to the project directory:
cd <project_name>
- Copy the spider from Step 1 into the project’s spiders/ directory (for example, as <project_name>/spiders/quotes_spider.py) so that it is included in the deployment.
- Create a “scrapinghub.yml” configuration file with the following content, replacing <project_id> with your Scrapinghub project ID:
projects:
  default: <project_id>
- Deploy the spider to Scrapy Cloud:
shub deploy
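Optionally, you can kick off a quick test run of the deployed spider straight from the command line with shub, using the spider name defined in the spider class (“quotes” in our example):
shub schedule quotes
You can then watch the job in the Scrapy Cloud dashboard. Step 3 shows how to do the same thing from Python and fetch the results.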
Step 3: Run the spider on Scrapy Cloud
After a successful deployment, you can run the spider on Scrapy Cloud through the Scrapinghub HTTP API, authenticating each request with your API key. Here’s an example of how to start the spider, wait for it to finish, and retrieve the scraped data:
import time

import requests

API_KEY = "<api_key>"          # your Scrapinghub API key
project_id = "<project_id>"    # your Scrapy Cloud project ID
spider_name = "quotes"

# Start the spider via the run API (authenticated with the API key)
RUN_URL = "https://app.scrapinghub.com/api/run.json"
response = requests.post(
    RUN_URL,
    data={"project": project_id, "spider": spider_name},
    auth=(API_KEY, ""),
)
job_id = response.json()["jobid"]  # a job key such as "<project_id>/1/2"
print(f"Spider started, job URL: https://app.scrapinghub.com/p/{job_id}")

# Poll the jobs API until the job has finished
JOBS_URL = "https://app.scrapinghub.com/api/jobs/list.json"
finished = False
while not finished:
    jobs = requests.get(
        JOBS_URL,
        params={"project": project_id, "job": job_id},
        auth=(API_KEY, ""),
    ).json()["jobs"]
    if jobs and jobs[0]["state"] == "finished":
        finished = True
    else:
        time.sleep(10)

# Retrieve the scraped data from the items storage API
items_url = f"https://storage.scrapinghub.com/items/{job_id}"
response = requests.get(items_url, params={"format": "json"}, auth=(API_KEY, ""))
data = response.json()

print("Scraped data:")
for item in data:
    print(item)
Replace <api_key> and <project_id> with your Scrapinghub API key and project ID. This code will start the spider on Scrapy Cloud, wait for it to finish, and then retrieve the scraped data from the job.
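If you would rather not assemble the HTTP calls yourself, the official python-scrapinghub client library wraps the same APIs. Here is a minimal sketch, assuming you have installed it with “pip install scrapinghub” and filled in the same API key and project ID placeholders:

import time

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("<api_key>")
project = client.get_project("<project_id>")

# Schedule a run of the "quotes" spider on Scrapy Cloud
job = project.jobs.run("quotes")
print("Job key:", job.key)

# Poll until the job has finished, re-fetching it each time for fresh metadata
while client.get_job(job.key).metadata.get("state") != "finished":
    time.sleep(10)

# Iterate over the scraped items
for item in job.items.iter():
    print(item)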
That’s it! You have successfully used Scrapy Cloud in Python to deploy and run your Scrapy spider in a distributed manner. Happy web scraping!