[Python] Using the `children` and `descendants` attributes in Beautiful Soup 4

Beautiful Soup is a powerful Python library for web scraping. It allows you to extract data from HTML and XML documents in an intuitive and convenient way. One of the key features of Beautiful Soup is the ability to navigate and search through the parsed document using various methods and attributes.

In this blog post, we will focus on two navigation attributes in Beautiful Soup 4: children and descendants. Both let us walk the nodes beneath an element in an HTML/XML document, but they differ in how deep they go.

The children attribute

The children attribute is used to access the direct children of an element. It returns an iterator over the element's immediate child nodes, which include text strings as well as tags.

Here’s an example to demonstrate its usage:

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div>
      <h1>Beautiful Soup 4</h1>
      <p>A Python library for web scraping</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
div_element = soup.find('div')

for child in div_element.children:
    print(child)

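In the above example, we have an HTML document with a div element containing an h1 and a p element. We use the find method to locate the div element and then iterate over its children. Note that the newlines and indentation between the tags are parsed as text nodes too, so an actual run also prints blank lines for them. With those whitespace-only strings left out, the output of the above code is: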

<h1>Beautiful Soup 4</h1>
<p>A Python library for web scraping</p>

As you can see, the children attribute yields only the direct children of the given element. The h1 and p tags appear, but the text inside them sits one level deeper and is not yielded on its own.
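Because whitespace between tags counts as a child, it is often convenient to keep only the tag children. The following is a minimal sketch of one way to do that, assuming an isinstance check against Tag is acceptable for your use case; the names all_children, tag_children, and tag_names are illustrative and not part of the library.

```python
from bs4 import BeautifulSoup, Tag

html = """
<div>
  <h1>Beautiful Soup 4</h1>
  <p>A Python library for web scraping</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div_element = soup.find("div")

# Every direct child, including the whitespace strings between the tags
all_children = list(div_element.children)

# Keep only Tag objects, dropping the NavigableString (text) nodes
tag_children = [child for child in div_element.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]

print(len(all_children))  # 5: three whitespace strings plus the h1 and p tags
print(tag_names)          # ['h1', 'p']
```

Calling div_element.find_all(recursive=False) is another way to get just the direct child tags.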

The descendants attribute

The descendants attribute is used to access all the descendants of an element, recursively. It returns a generator that yields every node beneath the given element in document order: its children, their children, and so on, including the text strings inside each tag.

Here’s an example to demonstrate its usage:

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div>
      <h1>Beautiful Soup 4</h1>
      <p>A Python library for web scraping</p>
      <ul>
        <li>Easy to use</li>
        <li>Flexible</li>
        <li>Powerful</li>
      </ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
div_element = soup.find('div')

for descendant in div_element.descendants:
    print(descendant)

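In the above example, we have an HTML document with a nested structure. The div element contains an h1, p, and ul element, and the ul element contains li elements. We use the find method to locate the div element and then iterate over its descendants. As with children, the whitespace between tags is yielded as text nodes, and the ul is printed with its original inner indentation. With the whitespace-only strings left out and the indentation trimmed, the output of the above code is: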

<h1>Beautiful Soup 4</h1>
Beautiful Soup 4
<p>A Python library for web scraping</p>
A Python library for web scraping
<ul>
<li>Easy to use</li>
<li>Flexible</li>
<li>Powerful</li>
</ul>
<li>Easy to use</li>
Easy to use
<li>Flexible</li>
Flexible
<li>Powerful</li>
Powerful

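As you can see, the descendants attribute yields every node below the given element, depth first. Each tag is followed by its own contents, so the ul is printed once as a whole and then each li and its text are printed again as separate descendants.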
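To make the difference concrete, here is a small sketch that counts both kinds of nodes and records the traversal order. It assumes a compact snippet written without whitespace between tags (so no whitespace-only strings appear); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Tag

# Compact markup with no whitespace between tags, chosen for illustration
html = "<div><h1>Title</h1><ul><li>One</li><li>Two</li></ul></div>"

soup = BeautifulSoup(html, "html.parser")
div_element = soup.find("div")

# Direct children only: the h1 and the ul
child_count = len(list(div_element.children))

# Every nested node: tags and the strings inside them
descendant_count = len(list(div_element.descendants))

# Label each descendant with its tag name, or with its text if it is a string
order = [
    node.name if isinstance(node, Tag) else str(node)
    for node in div_element.descendants
]

print(child_count)       # 2
print(descendant_count)  # 7
print(order)             # ['h1', 'Title', 'ul', 'li', 'One', 'li', 'Two']
```

The order list shows the depth-first walk: each tag is immediately followed by its own contents before the next sibling is visited.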

Conclusion

In this blog post, we have explored two navigation attributes in Beautiful Soup 4: children and descendants. Use children when you only need the nodes directly beneath an element, and descendants when you need to walk the whole subtree. In both cases, remember that text strings, including whitespace between tags, are yielded alongside the tags.