[파이썬] Scrapy 웹 페이지 변화 감지

With the fast-paced nature of the internet, web page monitoring has become crucial for various purposes like tracking competitive prices, monitoring news updates, or scraping data for research purposes. In this blog post, we will explore how to use Scrapy, a powerful web scraping framework in Python, to detect changes in web pages and notify us whenever there is an update.

Setting up Scrapy

Before we begin, let’s make sure we have Scrapy installed. Open your terminal and run the following command to install Scrapy:

pip install scrapy

Once Scrapy is installed, we can proceed with creating a new Scrapy project.

Creating a Scrapy Project

To create a new Scrapy project, run the following command in your terminal:

scrapy startproject webpage_monitoring
cd webpage_monitoring

This will create a new Scrapy project called “webpage_monitoring” with the necessary files and folders.

Creating a Spider

A Spider is the core component of a Scrapy project. It defines how to crawl and extract data from a website. In our case, we want to detect changes in a web page, so we need to define the URL we want to monitor and the logic to compare the current version of the page with the previous version.

Create a new file called webpage_monitoring_spider.py under the webpage_monitoring/spiders directory. Open the file and add the following code:

import scrapy

class WebpageMonitoringSpider(scrapy.Spider):
    name = 'webpage_monitoring'
    start_urls = ['https://example.com']  # Replace with the URL you want to monitor

    def __init__(self):
        self.previous_version = None

    def parse(self, response):
        current_version = response.body  # Adjust this line to extract the relevant content from the web page

        if self.previous_version is not None and current_version != self.previous_version:
            # Web page has changed, notify or perform desired actions
            print("Web page has changed!")

        # Update previous version with the current version for the next comparison
        self.previous_version = current_version

In this spider, we define the start_urls list, which contains the URLs we want to monitor. Inside the parse method, we extract the content of the web page and compare it with the previous version. If there is a difference, we can perform any desired actions, such as sending a notification or storing the updated data.

Running the Spider

To run the spider and start monitoring the web page, execute the following command in your terminal:

scrapy crawl webpage_monitoring

Scrapy will start crawling the specified URL and compare the current version with the previous version. If there is a change, it will print the “Web page has changed!” message.

Automating the Monitoring Process

Monitoring a web page manually is not efficient or practical. To automate the monitoring process and receive notifications whenever a change occurs, we can use scheduling tools like cronjobs or incorporate the spider into a larger script.

For example, you can create a Python script that runs the spider at regular intervals using the schedule library:

import schedule
import time
from scrapy import cmdline

def run_spider():
    cmdline.execute("scrapy crawl webpage_monitoring".split())

schedule.every(10).minutes.do(run_spider)

while True:
    schedule.run_pending()
    time.sleep(1)

This script runs the spider every 10 minutes using the schedule library.

Conclusion

In this blog post, we learned how to use Scrapy to monitor changes in web pages. By comparing the current version with the previous version, we can detect any updates and perform desired actions. By automating this process, we can stay informed about changes happening on the web in real-time. Scrapy provides a powerful and flexible framework for web page monitoring and a wide range of web scraping tasks. Happy web monitoring!