Scraping websites that rely heavily on JavaScript can be quite challenging: the HTML the server returns often lacks the content you actually see in the browser, so traditional request-based scraping falls short. However, by pairing Scrapy, a powerful Python web scraping framework, with the Splash rendering service, we can scrape JavaScript-rendered web pages.
In this tutorial, we will focus on scraping JavaScript-rendered pages using Scrapy. Let’s get started!
Prerequisites
- Python 3.x installed on your system
- Scrapy library installed (pip install scrapy)
- Docker installed (we will use it later to run the Splash rendering service)
Creating a Scrapy project
First, let’s create a new Scrapy project. Open your terminal or command prompt and run the following command:
scrapy startproject javascript_scraping
This will create a new directory named javascript_scraping with the basic structure for our project.
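For reference, the generated project looks roughly like this (the exact set of files can vary slightly between Scrapy versions):
javascript_scraping/
    scrapy.cfg
    javascript_scraping/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py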
Next, create a new Scrapy spider using the following command:
cd javascript_scraping
scrapy genspider javascript_spider example.com
Replace example.com with the domain of the website you want to scrape.
Configuring Scrapy to render JavaScript
To enable JavaScript execution in Scrapy, we need the scrapy-splash plugin, which provides a downloader middleware called SplashMiddleware. Splash itself is a lightweight web rendering service that allows us to interact with JavaScript-rendered pages.
First, install the scrapy-splash plugin using the following command:
pip install scrapy-splash
Next, open the settings.py file in your project and add the following lines:
SPLASH_URL = 'http://localhost:8050/'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Also ensure that Splash itself is running on your system. You can install and run it with Docker using the following commands:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
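Before wiring Splash into Scrapy, it is worth confirming the service is reachable. Here is a minimal sanity check, assuming Splash is listening on the default port 8050 and that the requests library is installed:
import requests

# Ask Splash to render a page and return the resulting HTML.
# A 200 response confirms the service is up and executing JavaScript.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://example.com', 'wait': 0.5},
)
print(resp.status_code)   # expect 200
print(resp.text[:200])    # first characters of the rendered HTML
You can also simply open http://localhost:8050/ in a browser; Splash serves a small test page there.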
Modifying the Scrapy spider
Now, open the spider file javascript_spider.py located in the spiders directory and replace its contents with the following code:
import scrapy
from scrapy_splash import SplashRequest


class JavascriptSpider(scrapy.Spider):
    name = 'javascript_spider'
    allowed_domains = ['example.com']

    def start_requests(self):
        urls = ['http://example.com/page1', 'http://example.com/page2']
        for url in urls:
            # Route the request through Splash so the page's JavaScript
            # runs before the response reaches the spider.
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        # Process the JavaScript-rendered page here
        pass
Replace example.com/page1 and example.com/page2 with the URLs of the pages you want to scrape.
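If a page needs extra time to finish rendering, SplashRequest also accepts an args dictionary that is forwarded to the Splash endpoint. A minimal sketch, assuming the default render.html endpoint:
yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='render.html',
    args={'wait': 2},  # give the page about 2 seconds to execute its JavaScript
)
The wait argument simply delays the snapshot Splash takes, which is often enough for pages that load content asynchronously after the initial render.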
Running the Scrapy spider
To run the Scrapy spider and start scraping the JavaScript-rendered pages, use the following command:
scrapy crawl javascript_spider
Scrapy will send requests to Splash, which will render the JavaScript and return the HTML content to Scrapy for further processing.
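Because Splash hands back ordinary HTML, the usual Scrapy selectors work on the response. A minimal sketch of a parse method, assuming hypothetical h2.title elements on the target pages:
def parse(self, response):
    # The response body is the post-JavaScript HTML, so CSS selectors
    # see content that was rendered client-side.
    for title in response.css('h2.title::text').getall():
        yield {'title': title}
To persist the scraped items, you can use Scrapy's built-in feed exports, for example scrapy crawl javascript_spider -o items.json.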
Conclusion
In this tutorial, we learned how to scrape JavaScript-rendered pages using Scrapy. We used the scrapy-splash middleware together with the Splash rendering service to enable JavaScript execution, and modified the spider to send requests through Splash for rendering. Scrapy then processed the rendered HTML content to extract the desired data.
Scraping JavaScript-rendered pages can open up a whole new world of possibilities for web scraping. With Scrapy and Splash, you can easily scrape websites that heavily depend on JavaScript. Remember to use web scraping responsibly and always respect the website’s terms of use. Happy scraping!