[파이썬] Scrapy 프로젝트 생성

06 Sep 2023

Scrapy

Scrapy logo

Scrapy is an open-source web crawling framework written in Python. It provides a set of tools and libraries for building web spiders to efficiently extract data from websites. In this blog post, we will learn how to create a Scrapy project in Python.

Step 1: Install Scrapy

Before we can create a Scrapy project, we need to install Scrapy itself. Open your command line interface and run the following command:

pip install scrapy

This command will install Scrapy and all its dependencies.

Step 2: Create a new Scrapy project

Now that Scrapy is installed, let’s create a new Scrapy project. Open your command line interface and navigate to the directory where you want to create your project. Then, run the following command:

scrapy startproject myproject

Replace myproject with the name you want to give your project. This command will create a new directory with the same name as your project, containing the basic structure of a Scrapy project.

Step 3: Navigate to the project directory

Next, navigate to the project directory by running the following command:

cd myproject

Replace myproject with the name of your project.

Step 4: Create a Scrapy spider

In Scrapy, a spider is responsible for defining how to crawl a website and extract data from it. To create a new spider, run the following command:

scrapy genspider myspider example.com

Replace myspider with the name you want to give your spider, and example.com with the starting URL of the website you want to crawl.

This command will create a new Python file in the spiders directory of your project, with the name myspider.py. In this file, you can define how your spider should crawl the website and extract data.

Step 5: Customize the spider

Open the myspider.py file in a text editor and define the crawling logic and data extraction code. You can use the Scrapy API to send requests to web pages, parse the HTML content, and extract data using CSS or XPath selectors.

Here’s an example of a simple spider that extracts the titles of all the articles on a website:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    
    def parse(self, response):
        article_titles = response.css('h2.article-title::text').getall()
        for title in article_titles:
            yield {
                'title': title
            }

Step 6: Run the spider

Finally, it’s time to run the spider and see it in action. Open your command line interface, navigate to the project directory, and run the following command:

scrapy crawl myspider

Replace myspider with the name of your spider.

Scrapy will start crawling the website and extract the data according to the logic defined in your spider. The extracted data will be displayed in the command line interface.

Congratulations! You have successfully created a Scrapy project in Python and built a spider to crawl a website and extract data. Scrapy provides many powerful features and options to customize your web scraping workflow, making it a popular choice among developers for data extraction tasks. Happy scraping!