[파이썬] Scrapy와 BeautifulSoup 통합

06 Sep 2023

Scrapy

Scrapy와 BeautifulSoup은 둘 다 파이썬으로 작성된 웹 크롤링 및 스크레이핑 도구입니다. 각각의 도구는 각각의 장단점을 가지고 있으며, 때로는 두 도구를 함께 사용하여 크롤링 및 스크레이핑 작업을 수행하는 것이 이점이 될 수 있습니다.

Scrapy

Scrapy는 웹 크롤링 및 스크레이핑을 위한 강력한 프레임워크로, 다수의 웹 페이지를 순차적으로 방문하고 데이터를 추출하는 기능을 제공합니다. Scrapy는 비동기적으로 작동하여 동시에 여러 웹 페이지를 크롤링할 수 있어서 효율적인 크롤링이 가능합니다. 또한, 선택적인 자동화 기능 및 데이터 파이프라인 옵션도 제공됩니다.

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    
    def start_requests(self):
        urls = [
            'http://www.example.com/page1',
            'http://www.example.com/page2',
            'http://www.example.com/page3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    
    def parse(self, response):
        data = response.css('div.my-div').extract()
        yield {'data': data}

BeautifulSoup

BeautifulSoup은 HTML 및 XML 문서를 파싱하고 데이터를 추출하는 데 사용되는 파이썬 라이브러리입니다. BeautifulSoup을 사용하면 간단한 인터페이스를 통해 웹 페이지의 요소를 탐색하고 원하는 데이터를 추출할 수 있습니다. BeautifulSoup은 다양한 파싱 방법을 제공하여 손쉽게 웹 페이지를 분석할 수 있습니다.

from bs4 import BeautifulSoup
import requests

url = 'http://www.example.com/page1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = soup.find_all('div', class_='my-div')

통합 방법

Scrapy와 BeautifulSoup을 함께 사용하여 크롤링 및 스크레이핑 작업을 수행할 수 있습니다. Scrapy를 사용하여 웹 페이지를 크롤링하고, 각 웹 페이지의 내용을 BeautifulSoup을 사용하여 분석할 수 있습니다.

import scrapy
from bs4 import BeautifulSoup

class MySpider(scrapy.Spider):
    name = 'my_spider'
    
    def start_requests(self):
        urls = [
            'http://www.example.com/page1',
            'http://www.example.com/page2',
            'http://www.example.com/page3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        data = soup.find_all('div', class_='my-div')
        yield {'data': data}

Scrapy와 BeautifulSoup을 통합하여 사용하면 Scrapy의 강력한 크롤링 기능과 BeautifulSoup의 간편한 데이터 추출 기능을 모두 활용할 수 있습니다. 이를 통해 더 유연하고 효율적인 웹 크롤링 및 스크레이핑 작업을 수행할 수 있습니다.