[Python] Using html.parser and the xml parser in Beautiful Soup 4

06 Sep 2023


Beautiful Soup is a Python library for parsing HTML and XML documents, commonly used for web scraping. It provides a simple API for navigating, searching, and modifying the parsed document.

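In this blog post, we will explore how to use two of the parsers that Beautiful Soup 4 supports: html.parser, which ships with Python's standard library, and xml, which is provided by the lxml package. Beautiful Soup has no parser named xml.parser.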

Installation

To install Beautiful Soup, use pip. Parsing XML additionally requires the lxml package, so install it at the same time:

pip install beautifulsoup4 lxml

Usage

To start parsing HTML or XML documents using Beautiful Soup, we first need to import it:

from bs4 import BeautifulSoup

Parsing HTML

Let’s begin by parsing an HTML document. We first open the HTML file and pass it to BeautifulSoup along with the parser we want to use, which in this case is html.parser:

from bs4 import BeautifulSoup

with open("example.html") as file:
    soup = BeautifulSoup(file, "html.parser")

Using the parsed HTML, we can now extract data from the document. For example, to find all the links in the HTML document, we can use the find_all method:

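# href=True skips anchors without an href attribute, which would otherwise raise a KeyError
links = soup.find_all("a", href=True)
for link in links:
    print(link["href"])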
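To try the link extraction without an example.html file on disk, here is a minimal sketch that parses an inline string instead. The sample markup and its URLs are made up for illustration. It uses link.get("href"), which returns None rather than raising an error when an anchor has no href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the contents of example.html
html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <a name="anchor-only">No href</a>
  <a href="/relative/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .get("href") yields None for anchors that have no href attribute
hrefs = [link.get("href") for link in soup.find_all("a")]

# Keep only the anchors that actually point somewhere
found = [h for h in hrefs if h is not None]
print(found)
```

Whether to filter with href=True in find_all or to check for None afterwards is mostly a matter of taste; the second form is handy when you also want to inspect the anchors that lack a link.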

Parsing XML

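Beautiful Soup can also parse XML documents, and the process is very similar to parsing HTML. Note that there is no parser named xml.parser; passing that name raises bs4.FeatureNotFound. XML support comes from the lxml package and is selected with the parser name "xml" (or its alias "lxml-xml"). We pass the XML file and "xml" to BeautifulSoup:

from bs4 import BeautifulSoup

with open("example.xml") as file:
    soup = BeautifulSoup(file, "xml")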
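Because a mistyped parser name is an easy error to make, it can help to check which names Beautiful Soup accepts on your machine. The sketch below assumes a small helper, can_parse_with, which is a hypothetical name and not part of Beautiful Soup. It relies on the library raising FeatureNotFound when it cannot find a parser matching the requested name.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def can_parse_with(parser_name):
    """Return True if Beautiful Soup finds a parser for parser_name (hypothetical helper)."""
    try:
        BeautifulSoup("<root/>", parser_name)
        return True
    except FeatureNotFound:
        return False


print(can_parse_with("html.parser"))  # built into Python's standard library
print(can_parse_with("xml.parser"))   # not a parser name Beautiful Soup knows
print(can_parse_with("xml"))          # depends on whether lxml is installed
```

If the last check reports that "xml" is unavailable, installing lxml with pip should resolve it.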

Once we have the parsed XML, we can extract data just like we did with HTML. For instance, to find all the tags with the name “item” in the XML document, we can use the find_all method:

items = soup.find_all("item")
for item in items:
    print(item.text)
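The same extraction can be tried on an inline string. This is a minimal sketch that assumes lxml is installed and uses the "xml" parser name; the catalog document and its id attributes are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, standing in for the contents of example.xml
xml = """
<catalog>
  <item id="1">Apple</item>
  <item id="2">Banana</item>
</catalog>
"""

# "xml" selects lxml's XML parser, so lxml must be installed
soup = BeautifulSoup(xml, "xml")

items = soup.find_all("item")
names = [item.text for item in items]   # text content of each <item>
ids = [item["id"] for item in items]    # attribute access works as with HTML tags
print(names, ids)
```

Attributes are read with the same dictionary-style syntax used for href in the HTML example.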

Conclusion

In this blog post, we have seen how to use html.parser and the lxml-backed xml parser in Beautiful Soup 4 to parse and extract data from HTML and XML documents. Its flexibility and simplicity make Beautiful Soup a valuable tool for web scraping and data extraction tasks in Python.

Beautiful Soup provides many other useful features and methods, so make sure to refer to the official documentation for more information on manipulating and traversing parsed documents. Happy parsing!