[Python] Scrapy REST API Scraping

Introduction

Scrapy is a powerful and flexible web scraping framework written in Python. It lets you crawl websites and extract structured data from them. In this blog post, we will learn how to build a REST API around a Scrapy spider, so that the scraped data can be easily accessed and consumed.

Setting up Scrapy

Before we start building the REST API, let’s make sure we have Scrapy installed. You can install it using pip:

pip install scrapy
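We will also use Flask later in this post to expose the scraped data over HTTP, so it is convenient to install it now as well:

pip install flask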

Building the Scrapy Spider

The first step is to build a Scrapy spider. A spider is responsible for navigating through websites and extracting the desired data. We can create a new Scrapy project and add a spider to it. Here is an example of a simple spider:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Extract the text of the first <h1> element on the page
        data = response.css('h1::text').get()

        # Yield the scraped data as an item
        yield {
            'data': data
        }

This spider visits http://example.com and extracts the text of the first <h1> element on the page. You can modify the spider to meet your specific scraping needs.
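If you are starting from scratch, the project scaffold mentioned above can be generated with Scrapy's command-line tool; the project name myproject below is just a placeholder, and the spider code goes into the project's spiders/ directory (for example myproject/spiders/myspider.py). You can then run the spider on its own to check that it works, exporting the items to a JSON file:

scrapy startproject myproject
cd myproject
scrapy crawl myspider -O items.json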

Creating the REST API

To create the REST API, we will use the Flask microframework, which makes it easy to build small web applications in Python. One thing to keep in mind is that Scrapy runs on Twisted's reactor, which can only be started once per process, so the simplest reliable approach is to run the spider in a separate process with the scrapy crawl command and read back the items it exports. Here is an example of a simple Flask application that does this:

import json
import os
import subprocess
import tempfile

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/scrape', methods=['GET'])
def scrape():
    # Scrapy's Twisted reactor can only be started once per process,
    # so run the spider in a fresh subprocess for every request.
    fd, output_path = tempfile.mkstemp(suffix='.json')
    os.close(fd)
    try:
        # -O (over)writes the exported items to the given JSON file
        subprocess.run(
            ['scrapy', 'crawl', 'myspider', '-O', output_path],
            check=True,
        )

        # Load the items the spider exported
        with open(output_path) as f:
            data = json.load(f)
    finally:
        os.remove(output_path)

    # Return the scraped data as JSON
    return jsonify(data)

if __name__ == '__main__':
    app.run()

This Flask application defines a single route, /scrape, which runs the spider via the scrapy crawl command, loads the exported items, and returns them as JSON. Start the application from the Scrapy project directory (the one containing scrapy.cfg) so that the scrapy command can find the spider.
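If you only need a one-off script rather than a long-running server, the spider can also be run in-process with CrawlerProcess, collecting the items through Scrapy's item_scraped signal. Below is a minimal sketch of that approach; the items list and collect_item function are just illustrative names, and note that process.start() can only be called once per Python process:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

items = []

def collect_item(item, response, spider):
    # Append each scraped item to an in-memory list
    items.append(dict(item))

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler('myspider')
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

print(items)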

Running the REST API

To run the REST API, you need to execute the Flask application. Simply run the following command:

python app.py

The REST API will be accessible at http://localhost:5000/scrape. When you visit this URL, the Scrapy spider will be triggered and the scraped data will be returned as JSON.
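You can also test the endpoint from the command line with curl:

curl http://localhost:5000/scrape

With the example spider above, the response is a JSON array containing one object per scraped item, something like [{"data": "Example Domain"}]; the exact value of course depends on the page being scraped.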

Conclusion

By combining Scrapy and Flask, we can build a powerful REST API for web scraping in Python. This allows us to easily access and consume the scraped data in a structured format. If you want to learn more about Scrapy and building web scrapers, check out the official Scrapy documentation. Happy scraping!