[파이썬] Beautiful Soup 4 JSON 형식으로 내보내기

Beautiful Soup is a popular Python library used for web scraping and parsing HTML and XML documents. It simplifies the process of extracting data from these documents by providing a convenient API. In addition to parsing HTML and XML, Beautiful Soup also allows exporting the extracted data in different formats. One of the formats it supports is JSON.

In this blog post, we will explore how to use Beautiful Soup 4 to parse HTML and XML documents and export the extracted data in JSON format.

Installing Beautiful Soup 4

Before we can start, we need to have Beautiful Soup 4 installed. You can install it using pip by running the following command:

pip install beautifulsoup4

Parsing HTML/XML using Beautiful Soup 4

Once Beautiful Soup is installed, we can start parsing HTML or XML documents. Let’s assume we have an HTML document named “example.html”. We can parse it using the following code:

from bs4 import BeautifulSoup

with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

Here, we use the open function to open the HTML file, and then pass the file object to Beautiful Soup’s BeautifulSoup class along with the parser to use (html.parser in this case).

Extracting Data using Beautiful Soup 4

Now that we have the parsed document, we can use Beautiful Soup’s API to extract the desired data. The API provides various methods and selectors to navigate and search through the document.

# Extracting data from HTML elements
title = soup.title.string
first_paragraph = soup.p.string

# Extracting data using CSS selectors
links = soup.select("a[href^='http']")

In the above example, we extract the title of the document, the content of the first paragraph, and all the links that start with “http”.

Exporting Data in JSON Format

To export the extracted data in JSON format, we can use the built-in json module in Python. We convert the extracted data into a Python dictionary or list, and then use the json.dump function to write the data into a JSON file.

import json

data = {
    "title": title,
    "first_paragraph": first_paragraph,
    "links": [link.get("href") for link in links],
}

with open("output.json", "w") as fp:
    json.dump(data, fp, indent=4)

In the above example, we create a dictionary named data containing the extracted data. We then open a new file named “output.json” in write mode and use json.dump to write the data into the file. The indent=4 argument is optional and adds indentation to improve readability.

Conclusion

Beautiful Soup 4 provides a powerful and convenient way to parse HTML and XML documents and extract data from them. By using its API and combining it with the json module in Python, we can easily export the extracted data in JSON format.

I hope this blog post has helped you understand how to export data in JSON format using Beautiful Soup 4 in Python. Happy scraping!