[Python] Handling Dynamic Loading Elements in Scrapy

Scrapy is a powerful web scraping framework for Python. It provides a convenient way to extract data from websites by crawling their pages. However, when it comes to scraping websites with dynamically loaded content, Scrapy requires a different approach.

Dynamic content loading refers to elements of a webpage that are not present in the initial HTML source but are loaded at runtime through JavaScript or AJAX requests. These elements are often used to display additional information or to update the page without a full refresh.

In this blog post, we will explore how to handle dynamic loading elements in Scrapy using Python. We will discuss two common approaches:

  1. Using Selenium: Selenium is a popular tool for automating web browsers. It allows us to interact with webpages and scrape dynamically loaded content. We can integrate Selenium with Scrapy to load pages with dynamic elements and extract the required data.

  2. Analyzing AJAX requests: Another way to handle dynamic loading elements is to analyze the AJAX requests made by the webpage. We can intercept these requests and extract the required data directly from the response, without actually rendering the webpage.

Using Selenium in Scrapy

Selenium provides a WebDriver API that allows us to control a browser directly from our Python code. Here’s an example of how we can use Selenium in Scrapy to handle dynamic loading elements.

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # SeleniumRequest renders the page in a real browser before parsing,
        # so JavaScript-generated elements are present in the response
        yield SeleniumRequest(url='https://www.example.com', callback=self.parse)

    def parse(self, response):
        # The WebDriver that rendered the page is exposed via the request meta
        driver = response.meta['driver']
        # Extract the required data from the rendered page; the CSS selector
        # here is a placeholder for the element you actually need
        data = driver.find_element(By.CSS_SELECTOR, '.dynamic-content').text

        yield {
            'data': data,
        }

In this example, we use the SeleniumRequest class from the scrapy_selenium package to send the request. The middleware renders the page in a browser before our callback runs, and the response’s meta carries the driver object, which lets us keep interacting with the page through Selenium.
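
Note that the scrapy_selenium package must be installed (pip install scrapy-selenium) and its downloader middleware enabled before SeleniumRequest will work. Here is a minimal settings.py sketch, following the settings documented by the package; the driver name, executable path, and arguments will depend on your setup.

# settings.py -- minimal scrapy-selenium configuration (sketch)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser without a UI

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}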

We can then use the driver object to wait for and interact with the dynamic loading elements and extract the required data. Finally, we yield the data in a structured format.
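
Because dynamically loaded elements may take a moment to appear, it is usually safer to wait for them explicitly rather than reading the page immediately after it loads. Here is a small sketch using Selenium’s explicit waits; the CSS selector and timeout are placeholders for your own values.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def extract_dynamic_text(driver, css_selector='.dynamic-content', timeout=10):
    # Block until the element is present in the DOM, or raise TimeoutException
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )
    return element.text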

Analyzing AJAX Requests

Alternatively, instead of using Selenium, we can handle dynamic loading elements by analyzing the AJAX requests the webpage makes and replicating them directly. This approach requires some knowledge of web development and of network analysis tools, such as the Network tab in the browser’s developer tools.

Here’s an example of how we can analyze AJAX requests in Scrapy.

import scrapy
import json

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        yield scrapy.Request(url='https://www.example.com', callback=self.parse)

    def parse(self, response):
        # Extract the required data from the initial page
        # ...

        # The AJAX endpoint and parameters, as identified in the browser's
        # Network tab while the page loads its dynamic content
        url = 'https://www.example.com/ajax'
        params = {
            'page': '2',   # formdata values must be strings
            'limit': '10',
        }

        # Replicate the AJAX call; FormRequest sends the parameters
        # form-encoded in a POST body
        yield scrapy.FormRequest(url=url, formdata=params, callback=self.parse_ajax)

    def parse_ajax(self, response):
        # Parse the JSON payload returned by the AJAX endpoint
        ajax_data = json.loads(response.text)
        data = ajax_data['data']
        # ...

        yield {
            'data': data,
        }

In this example, we start by sending a request to the webpage and extracting the required data from the initial page. Then, we identify the AJAX request URL and parameters that are responsible for loading the dynamic content.

Next, we make the AJAX request using the scrapy.FormRequest class, which form-encodes the parameters into a POST body, and pass the callback function parse_ajax to handle the response. If the endpoint expects a GET request with query-string parameters instead, a plain scrapy.Request with the parameters encoded into the URL works the same way. Inside the parse_ajax function, we parse the JSON payload and extract the required data.
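
AJAX endpoints like this one are often paginated, so in practice the callback usually keeps requesting the next page until the endpoint stops returning results. Here is a sketch of that pattern as a drop-in replacement for parse_ajax, assuming the hypothetical page parameter and data key from the example above:

    def parse_ajax(self, response):
        # response.json() (available since Scrapy 2.2) parses the JSON payload
        ajax_data = response.json()
        data = ajax_data['data']

        for item in data:
            yield {'data': item}

        # If this page returned results, request the next one
        if data:
            next_page = response.request.meta.get('page', 2) + 1
            yield scrapy.FormRequest(
                url='https://www.example.com/ajax',
                formdata={'page': str(next_page), 'limit': '10'},
                callback=self.parse_ajax,
                meta={'page': next_page},
            )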

Conclusion

Handling dynamic loading elements in Scrapy can be a bit challenging, but with the right tools and techniques, it can be accomplished effectively. In this blog post, we explored two common approaches: using Selenium and analyzing AJAX requests.

Depending on the specific website and requirements, one approach may be more suitable than the other: replicating AJAX requests is faster and lighter since no browser is involved, while Selenium can handle pages whose requests are hard to reproduce by hand. It’s important to understand the underlying mechanisms of dynamic content loading and choose the approach accordingly.

Happy scraping!