[파이썬] Beautiful Soup 4 웹 스크레이핑 모니터링 및 알림

Beautiful Soup 4 is a powerful Python library for web scraping. It allows you to extract data from HTML and XML documents easily and efficiently. In this blog post, we will explore how to use Beautiful Soup 4 to scrape websites, monitor the changes in the scraped data, and receive notifications when the data changes.

Prerequisites

Before we begin, make sure you have Beautiful Soup 4 installed on your system. You can install it using pip with the following command:

pip install beautifulsoup4

Scraping Websites with Beautiful Soup 4

To scrape a website, we first need to fetch the HTML content of the webpage. We can use a library like requests to send an HTTP GET request and retrieve the HTML. Once we have the HTML content, we can pass it to Beautiful Soup to parse and extract the desired information.

Here’s an example of scraping a webpage and extracting the titles of all the articles on the page:

import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
url = "https://example.com"

# Send an HTTP GET request and retrieve the HTML content
response = requests.get(url)

# Create a Beautiful Soup object to parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

# Find all the article titles using the appropriate CSS selector
article_titles = []
for article in soup.select("article"):
    title = article.select_one("h2").text.strip()
    article_titles.append(title)

print(article_titles)

Monitoring Changes in Scraped Data

Now that we know how to scrape a webpage using Beautiful Soup, let’s explore how we can monitor changes in the scraped data. We can achieve this by regularly running our scraping script and comparing the new data with the previously scraped data.

We can store the scraped data in a file or a database and compare it with the new data every time we run the script. If there are any differences, we can take appropriate actions, such as sending notifications or triggering other workflows.

Here’s an example of how we can monitor changes in the article titles from the previous example:

import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
url = "https://example.com"

# Send an HTTP GET request and retrieve the HTML content
response = requests.get(url)

# Create a Beautiful Soup object to parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

# Find all the article titles using the appropriate CSS selector
article_titles = []
for article in soup.select("article"):
    title = article.select_one("h2").text.strip()
    article_titles.append(title)

# Read the previously scraped article titles from a file or a database
with open("previous_titles.txt", "r") as file:
    previous_titles = file.read().splitlines()

# Compare the new and previous article titles
new_titles = [title for title in article_titles if title not in previous_titles]

if new_titles:
    print("New article titles:")
    for title in new_titles:
        print(title)
    # Update the previous titles with the new titles
    with open("previous_titles.txt", "w") as file:
        file.write("\n".join(article_titles))
else:
    print("No new article titles")

Receiving Notifications for Data Changes

To receive notifications for data changes, we can integrate our scraping script with a notification service, such as email, SMS, or push notification services. When we detect changes in the scraped data, we can trigger a notification to be sent to our desired recipients.

Here’s an example of sending an email notification when there are new article titles:

import smtplib
from email.mime.text import MIMEText

# ... (previous code)

if new_titles:
    # Send an email notification with the new article titles
    sender = "your-email@example.com"
    recipient = "recipient-email@example.com"
    subject = "New article titles"
    message = "\n".join(new_titles)

    msg = MIMEText(message)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient

    smtp_server = "smtp.example.com"
    smtp_port = 587
    username = "your-username"
    password = "your-password"

    with smtplib.SMTP(smtp_server, smtp_port) as server:
        server.starttls()
        server.login(username, password)
        server.sendmail(sender, recipient, msg.as_string())

In this example, we use the smtplib library to send an email notification. You can modify the code to use other notification services according to your needs.

Conclusion

Beautiful Soup 4 is a versatile library for web scraping in Python. In this blog post, we learned how to use Beautiful Soup 4 to scrape websites, monitor changes in the scraped data, and receive notifications for data changes. By combining Beautiful Soup 4 with other tools and services, we can create powerful web scraping and monitoring solutions.