Scrapy is a powerful web scraping framework in Python that enables developers to easily extract data from websites. Once you have extracted the data, it is important to choose the right storage format to save and analyze it later. In this blog post, we will explore various data storage formats supported by Scrapy and how to save data in each format using Python.
CSV (Comma-Separated Values)
CSV is one of the most common and straightforward data storage formats. It represents tabular data as plain text, where each line corresponds to a row and column values are separated by commas.
To save data in CSV format with Scrapy, you can use Python's built-in `csv` module. Here's an example code snippet:
```python
import csv

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def parse(self, response):
        # Extract data as a list of (title, description) rows
        # (placeholder selectors; replace with your own extraction logic)
        titles = response.css('h2::text').getall()
        descriptions = response.css('p::text').getall()
        data = list(zip(titles, descriptions))
        # Save data in CSV format
        with open('data.csv', 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(['Title', 'Description'])  # Write header
            writer.writerows(data)  # Write rows
```
In the above code, we first import the `csv` module. Inside the `parse` method, we open a CSV file in write mode with the built-in `open()` function, create a `csv.writer` instance, and use its `writerow` method to write the header row and `writerows` to write the data rows.
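One caveat: opening the file inside `parse` rewrites it on every parsed response, so this pattern only suits single-page scrapes. For multi-page crawls, a common alternative is Scrapy's feed exports: yield items from the spider and declare the output feed in `settings.py`. A minimal sketch, assuming Scrapy 2.1+ (where the `FEEDS` setting was introduced):

```python
# settings.py: let Scrapy's feed exports write the CSV for us
FEEDS = {
    'data.csv': {'format': 'csv'},
}
```

With this in place, the spider simply yields one dict per row (for example `yield {'Title': title, 'Description': description}`) and Scrapy handles the file writing.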
JSON (JavaScript Object Notation)
JSON is a popular data interchange format that is widely used for storing and transferring data. It is human-readable and easy to parse.
Scrapy has built-in support for exporting JSON through its feed exports, but you can also write a JSON file manually with Python's `json` module. Here's an example code snippet:
```python
import json

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def parse(self, response):
        # Extract data as a JSON-serializable structure (placeholder selector)
        data = [{'title': t} for t in response.css('h2::text').getall()]
        # Save data in JSON format
        with open('data.json', 'w') as jsonfile:
            json.dump(data, jsonfile)
```
In the above code, we import the `json` module. Inside the `parse` method, we open a JSON file in write mode and use `json.dump` to serialize the extracted data into it.
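As with CSV, the more idiomatic Scrapy route is to yield items from `parse` and let a feed export handle serialization. A minimal sketch (the selector is a placeholder for your own extraction logic):

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def parse(self, response):
        # Yield one JSON-serializable dict per record; no manual
        # file handling is needed when using feed exports.
        for title in response.css('h2::text').getall():
            yield {'title': title}
```

Running `scrapy crawl my_spider -o data.json` then exports every yielded item to `data.json`.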
MongoDB
MongoDB is a popular NoSQL database that provides a flexible and scalable storage solution. Scrapy can write to MongoDB through a custom item pipeline built on the `pymongo` library.

To save data in MongoDB, you first need to install the `pymongo` library (`pip install pymongo`). Here's an example pipeline:
```python
import pymongo

class MyMongoPipeline:
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection settings from the project's settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        # Open the MongoDB connection when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Close the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # Insert each scraped item as a document in the collection
        self.db[self.collection_name].insert_one(dict(item))
        return item
```
In the above code, we define a custom pipeline class for saving data in MongoDB. The class is initialized with a MongoDB URI and database name, which `from_crawler` reads from the project settings, and the connection is opened and closed alongside the spider. Inside the `process_item` method, we insert each scraped item into the MongoDB collection.
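For reference, the `dict(item)` call works whether the spider yields plain dicts or `scrapy.Item` instances. A minimal item definition matching the fields used earlier might look like this (the class and field names are illustrative):

```python
import scrapy

class ArticleItem(scrapy.Item):
    # Illustrative fields; match these to what your spider extracts
    title = scrapy.Field()
    description = scrapy.Field()
```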
To enable MongoDB integration, you need to configure the `settings.py` file of your Scrapy project:
```python
ITEM_PIPELINES = {
    'my_project.pipelines.MyMongoPipeline': 300,
}

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'my_database'
```
In the above code, we register the custom pipeline via the `ITEM_PIPELINES` setting (the value 300 controls the order in which pipelines run; lower values run first). We also set the MongoDB URI and database name.
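After a crawl, you can sanity-check the stored data directly with `pymongo`. A minimal sketch, assuming a MongoDB instance is running at the URI configured above:

```python
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client['my_database']['scrapy_items']

print(collection.count_documents({}))  # Number of stored items
print(collection.find_one())           # Inspect a sample document
```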
These are just a few examples of data storage formats and methods supported by Scrapy. Depending on your specific requirements, you can choose the most suitable format to save and analyze your valuable scraped data.