[파이썬] Beautiful Soup 4 스크레이핑한 데이터의 분석

Beautiful Soup is a Python library used for web scraping purposes. It provides an easy way to extract information from HTML and XML documents. In this blog post, we will explore how to use Beautiful Soup 4 to scrape data from websites and analyze it using Python.

Installing Beautiful Soup 4

To get started, we need to install Beautiful Soup 4. Open your terminal and run the following command:

pip install beautifulsoup4

Scraping Data with Beautiful Soup 4

To scrape data from a website, we need to follow these steps:

  1. Send an HTTP request to the URL of the webpage we want to scrape.
  2. Retrieve the HTML content of the webpage.
  3. Parse the HTML content using Beautiful Soup.

Here’s an example code snippet that demonstrates how to scrape data from a website using Beautiful Soup 4:

import requests
from bs4 import BeautifulSoup

# Send a GET request to the website URL
url = "https://example.com"
response = requests.get(url)

# Check the response status code
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")
    # Extract the desired data from the parsed HTML
    data = soup.find("div", {"class": "content"}).text
    # Print the scraped data
    print(data)
else:
    print("Failed to retrieve data from the website.")

In the example above, we use the requests library to send an HTTP GET request to the website URL. We then check if the response status code is 200 (indicating a successful request). If it is, we use Beautiful Soup’s find method to extract the content of a specific <div> element with a class attribute of “content”.

Analyzing Scraped Data

Once we have scraped the data from a website, we can perform various analyses using Python. Here are a few examples:

Let’s take a look at an example of analyzing scraped data:

import pandas as pd

# Assume 'data' contains the scraped text
data = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum commodo eget leo in sagittis. Nulla nec iaculis leo. In pulvinar urna libero."

# Count the occurrences of each word in the scraped text
word_counts = {}
for word in data.split():
    if word not in word_counts:
        word_counts[word] = 1
    else:
        word_counts[word] += 1

# Convert the word counts dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(word_counts, orient="index", columns=["Count"])

# Sort the DataFrame by word count in descending order
df = df.sort_values(by="Count", ascending=False)

# Print the top 5 most frequently occurring words
print(df.head(5))

In this example, we use the pandas library to analyze the scraped text. We count the occurrences of each word using a dictionary, convert the dictionary to a pandas DataFrame, sort the DataFrame by word count, and finally print the top 5 most frequently occurring words.

Conclusion

Beautiful Soup 4 is a powerful tool for scraping and analyzing data from websites. It provides an intuitive and Pythonic way to extract information from HTML and XML documents. By combining Beautiful Soup with other libraries such as requests and pandas, you can perform a wide range of data scraping and analysis tasks efficiently in Python.