[파이썬] requests-html 웹사이트 변경 감지

06 Sep 2023

requests-html

In this blog post, we will explore how to use the requests-html library in Python to detect changes on a website. Website monitoring is essential for various purposes, such as tracking updates on news websites, monitoring competitors’ pages, or ensuring the availability and correctness of critical web services.

Overview

requests-html is a powerful library that allows us to send HTTP requests, parse HTML content, and extract relevant information with ease. By combining this library with a few additional tools, we can create a system that periodically checks the content of a website and notifies us when changes occur.

Prerequisites

To follow this tutorial, make sure you have Python and the requests-html library installed on your system. You can install requests-html using pip:

pip install requests-html

Additionally, we will use the beautifulsoup4 library to process HTML content and the hashlib module to calculate hashes for comparison purposes. Install these dependencies as well:

pip install beautifulsoup4

Performing Initial Website Request

To start, we need to make an initial request to the website we want to monitor. We will then save the HTML content and calculate its hash value.

import requests
from bs4 import BeautifulSoup
import hashlib

# Define the website URL
url = "https://www.example.com"

# Send a GET request to the website
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Calculate the hash value of the HTML content
html_hash = hashlib.md5(response.text.encode()).hexdigest()

Checking for Changes

To monitor the website for changes, we can repeat the initial request periodically and compare the obtained hash with the previous one. If the hash values differ, it means that the webpage has been updated.

We can also extract specific information from the webpage using the soup object and send notifications as required.

import time

# Define the interval (in seconds) between each check
interval = 300  # 5 minutes

while True:
    # Perform the website request
    response = requests.get(url)
    new_hash = hashlib.md5(response.text.encode()).hexdigest()

    # Compare the new and previous hashes
    if new_hash != html_hash:
        # Website has changed, perform necessary actions here
        print("Website updated!")

        # Update the stored hash value
        html_hash = new_hash

    # Wait for the specified interval
    time.sleep(interval)

Conclusion

Using the requests-html library in Python, along with beautifulsoup4 and hashlib, we can easily detect changes on a website. By periodically comparing hash values, we can effectively monitor website updates and take appropriate actions when necessary.

This approach is a basic implementation, but you can enhance it further by customizing the comparison logic or integrating it with a notification system to receive alerts when changes occur.