[파이썬] Beautiful Soup 4 자바스크립트 기반 페이지 스크레이핑

06 Sep 2023

beautiful soup

In web scraping, it’s common to encounter web pages that heavily rely on JavaScript for rendering their content. Traditional web scraping techniques may not be sufficient to extract data from such pages. However, with Beautiful Soup 4 (BS4) library in Python, you can still scrape JavaScript-based pages.

Why Beautiful Soup 4?

Beautiful Soup is a powerful library in Python that allows you to parse HTML and XML documents effortlessly. It provides a simple and intuitive way to navigate, search, and modify the parsed data. Beautiful Soup 4 provides robust support for dealing with JavaScript-based pages by combining it with other libraries like requests and selenium.

Prerequisites

Before we get started, make sure you have the following installed:

Python (version 3.x recommended)
Beautiful Soup 4 (beautifulsoup4 package)
requests (requests package)
selenium (selenium package) (optional)

You can install beautifulsoup4 and requests by running the following command in your terminal:

pip install beautifulsoup4 requests

Using Beautiful Soup 4 for JavaScript-based Page Scraping

To illustrate how Beautiful Soup 4 can be used for scraping JavaScript-based pages, let’s consider an example where we scrape a website that loads its content using JavaScript.

Step 1: Send a GET request using `requests`

We start by sending a GET request to the page we want to scrape using the requests library. The response will contain the raw HTML content of the page.

import requests

url = "https://example.com"
response = requests.get(url)
html_content = response.text

Step 2: Parse the HTML content with Beautiful Soup

Next, we need to parse the HTML content using Beautiful Soup. This converts the raw HTML into a structured tree that we can navigate and extract data from.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")

Step 3: Extract data from the parsed HTML

Now that we have the parsed HTML in the soup object, we can use Beautiful Soup’s various methods to extract the desired data. This includes finding elements, navigating the HTML tree, and accessing attributes and text content.

# Find all <a> tags with a specific class
links = soup.find_all("a", class_="link-class")

# Get the text content of each link
link_texts = [link.text for link in links]

# Print the extracted link texts
for text in link_texts:
    print(text)

Optional: Scraping JavaScript-generated content using Selenium

If the web page heavily relies on JavaScript for rendering content that cannot be accessed with Beautiful Soup alone, we can combine it with Selenium to automate a web browser.

from selenium import webdriver

# Launch a web browser (e.g., Chrome)
driver = webdriver.Chrome()

# Access the page URL
driver.get(url)

# Wait for the page to load (if necessary)
# ...

# Get the page source (after the JavaScript execution)
html_content = driver.page_source

# Close the web browser
driver.quit()

# Parse the HTML content with Beautiful Soup as before
soup = BeautifulSoup(html_content, "html.parser")

# Extract data using Beautiful Soup methods
# ...

By using Selenium, we can effectively scrape JavaScript-generated content by simulating user interactions on the page.

Conclusion

Beautiful Soup 4 is a valuable tool for scraping JavaScript-based pages in Python. With its intuitive API and support for navigating and manipulating parsed data, it enables you to extract valuable information from complicated web pages. Whether you use it standalone or in combination with other libraries like Selenium, Beautiful Soup 4 simplifies the scraping process and makes it accessible to developers of all levels.