[Python] Navigating the Tag Tree Structure in Beautiful Soup 4

06 Sep 2023


Beautiful Soup is a popular Python library used for web scraping and parsing HTML or XML documents. It provides an easy way to navigate and search the document’s tree-like structure, making it convenient to extract desired information from web pages.

In this blog post, we will explore how to navigate and search the tree structure of HTML tags using Beautiful Soup 4 in Python.

Installing Beautiful Soup 4

Before we start, let’s make sure we have Beautiful Soup 4 installed. You can install it using pip by running the following command:

pip install beautifulsoup4

If you don’t have pip installed, you can refer to the official Python documentation to install it.
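One quick way to confirm the installation worked is to import the package and print its version. This is only a sanity check; the version string you see depends on what pip installed.

```python
# The beautifulsoup4 package is imported under the name bs4
import bs4

# Prints the installed version string
print(bs4.__version__)
```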

Loading an HTML document

To begin, we need to load the HTML document we want to work with. The BeautifulSoup constructor accepts either a string containing the markup or an open file handle. Here’s an example that reads a local file into a string and parses it:

from bs4 import BeautifulSoup

# Load the HTML document
with open('index.html', 'r', encoding='utf-8') as file:
    html = file.read()

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

In the code above, we use the BeautifulSoup constructor to create a BeautifulSoup object called soup. We pass the HTML document and the name of the parser to use as arguments. Here that is html.parser, the parser built into Python's standard library, so it needs no extra installation.
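The same constructor also works when the HTML is already in memory as a string. The sketch below uses a small made-up snippet (not the index.html file from above) so it can be run without any file on disk.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet used only for illustration
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse the string with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Accessing a tag name as an attribute returns the first matching tag
title_text = soup.title.string
print(title_text)   # Demo
print(soup.p.name)  # p
```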

Navigating the tree structure

Once we have the BeautifulSoup object, we can navigate and search the tree structure of the HTML tags. Here are some common attributes for navigation:

.contents: Returns a list of all direct children of the current tag, including text nodes.
.parent: Returns the parent tag of the current tag.
.next_sibling: Returns the next node at the same level (a tag or a text node), or None if there is none.
.previous_sibling: Returns the previous node at the same level (a tag or a text node), or None if there is none.
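To make these concrete, here is a minimal sketch on a made-up snippet with one div containing two paragraphs separated by a line break. The snippet and variable names are illustrative only. Note that the line break between the paragraphs is itself a child of the div.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up snippet: a div with two paragraphs separated by a newline
html = "<div><p>one</p>\n<p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
first_p = div.p

# .contents holds both paragraphs and the newline text between them
children = div.contents
print(len(children))  # 3

# .parent goes one level up
print(first_p.parent.name)  # div

# The node right after the first paragraph is text, not a tag
print(isinstance(first_p.next_sibling, NavigableString))  # True

# Nothing comes before the first paragraph inside the div
print(first_p.previous_sibling)  # None
```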

Here’s an example that uses these navigation attributes:

# Navigate the tree structure
first_div = soup.div
next_sibling_of_div = first_div.next_sibling
parent_of_div = first_div.parent

# Print the results
print(f"Next sibling of div: {next_sibling_of_div}")
print(f"Parent of div: {parent_of_div}")

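In the code above, we start by accessing the first div tag in the document using soup.div. Then, we use the .next_sibling and .parent attributes to move to the next sibling and the parent respectively. If the source HTML is formatted with line breaks and indentation, the printed sibling is often just whitespace text rather than a tag.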
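When only the next tag is wanted, one option is find_next_sibling(), which skips over text nodes. The sketch below compares it with .next_sibling on a made-up snippet; the id values are hypothetical and exist only to tell the two divs apart.

```python
from bs4 import BeautifulSoup

# Made-up, indented HTML so that whitespace sits between the two divs
html = """
<body>
  <div id="first">A</div>
  <div id="second">B</div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
first_div = soup.div

# .next_sibling returns the whitespace text that follows the first div
raw_sibling = first_div.next_sibling
print(raw_sibling.strip() == "")  # True

# find_next_sibling("div") returns the next sibling that is a div tag
tag_sibling = first_div.find_next_sibling("div")
print(tag_sibling["id"])  # second
```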

Searching for specific tags

Apart from navigation, Beautiful Soup provides ways to search for specific tags or elements in the document. Here are some common methods for searching:

.find(): Returns the first tag that matches the specified criteria, or None if nothing matches.
.find_all(): Returns a list-like ResultSet of all tags that match the specified criteria. It is empty if nothing matches.

Here’s an example that demonstrates these searching methods:

# Search for specific tags
first_h1 = soup.find('h1')
all_paragraphs = soup.find_all('p')

# Print the results
print(f"First h1 tag: {first_h1}")
print(f"All paragraphs: {all_paragraphs}")

In the code above, we use the .find() method to find the first h1 tag and the .find_all() method to find all p tags in the document.
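Since .find() can return None, it is worth checking the result before using it. The following sketch, again on a made-up snippet, shows both the matching and the non-matching case.

```python
from bs4 import BeautifulSoup

# Made-up snippet with one heading and two paragraphs
html = "<h1>Title</h1><p>First</p><p>Second</p>"
soup = BeautifulSoup(html, "html.parser")

# A match returns the tag; no match returns None
first_h1 = soup.find("h1")
missing = soup.find("h2")

if first_h1 is not None:
    print(first_h1.get_text())  # Title
print(missing)  # None

# find_all() returns every match, or an empty result when there is none
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)  # ['First', 'Second']

tables = soup.find_all("table")
print(len(tables))  # 0
```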

Conclusion

Beautiful Soup 4 provides extensive features for navigating and searching the tree structure of HTML tags. By using the methods and techniques discussed in this blog post, you can easily extract the desired information from HTML documents during web scraping or parsing tasks in Python.

In the next blog post, we will explore more advanced features of Beautiful Soup 4, such as filtering tags based on attributes and manipulating the content of the tags. Stay tuned!