[Python] Filtering Tags Based on Conditions in Beautiful Soup 4

Beautiful Soup is a Python library widely used for web scraping. It parses HTML and XML documents and makes it easy to extract data from them. One of its most useful features is the ability to search for tags that match specific conditions.

In this blog post, we will explore how to perform tag filtering in Beautiful Soup 4 based on different conditions. Let’s get started!

Installation

Before we dive into the code, make sure you have Beautiful Soup 4 installed on your system. To install it, you can use pip:

pip install beautifulsoup4

Alternatively, you can install Beautiful Soup 4 using conda:

conda install beautifulsoup4

Importing Beautiful Soup

To begin using Beautiful Soup, we need to import it in our Python script. The package is installed as beautifulsoup4 but imported as bs4, and the only name we need from it is the BeautifulSoup class:

from bs4 import BeautifulSoup

HTML Parsing

To demonstrate tag filtering, we first need some HTML content to work with. We can either download an HTML file or use a string containing the HTML code. For the sake of simplicity, let’s use a string:

html_content = """
<html>
<body>
<div class="container">
<h1>Welcome to my website</h1>
<p>This is some sample text.</p>
<a href="https://www.example.com">Link 1</a>
<a href="https://www.example.com">Link 2</a>
</div>
<div class="container">
<h2>About me</h2>
<p>I am a web developer.</p>
<a href="https://www.example.com">Link 3</a>
</div>
</body>
</html>
"""

Creating a BeautifulSoup Object

Next, we create a BeautifulSoup object to parse the HTML content. We pass the HTML content and the name of a parser to the BeautifulSoup constructor. Here we use 'html.parser', the parser built into Python’s standard library, so nothing extra needs to be installed:

soup = BeautifulSoup(html_content, 'html.parser')

Filtering by Tag Name

The simplest form of tag filtering is to select tags based on their name. Beautiful Soup provides the find_all() method to search for all occurrences of a specific tag. Here’s an example:

tags = soup.find_all('a')
for tag in tags:
    print(tag)

This code will find all the <a> tags in the HTML content and print them out.

Output:

<a href="https://www.example.com">Link 1</a>
<a href="https://www.example.com">Link 2</a>
<a href="https://www.example.com">Link 3</a>
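The name filter is not limited to a single string. As a minimal sketch, the snippet below repeats a trimmed copy of the sample markup so that it runs on its own, then passes a list of tag names, uses the limit argument, and calls find() for a single result. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample markup, repeated so the snippet runs on its own.
html_content = """
<div class="container">
<h1>Welcome to my website</h1>
<p>This is some sample text.</p>
<a href="https://www.example.com">Link 1</a>
<a href="https://www.example.com">Link 2</a>
</div>
<div class="container">
<h2>About me</h2>
<p>I am a web developer.</p>
<a href="https://www.example.com">Link 3</a>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")

# A list matches any of the given tag names, in document order.
headings = soup.find_all(["h1", "h2"])
heading_texts = [tag.get_text() for tag in headings]

# limit stops the search after the given number of matches.
first_two_links = soup.find_all("a", limit=2)
link_texts = [tag.get_text() for tag in first_two_links]

# find() returns only the first match, or None when nothing matches.
first_paragraph = soup.find("p")
missing = soup.find("table")

print(heading_texts)
print(link_texts)
print(first_paragraph.get_text())
print(missing)
```

Because find() can return None, it is worth checking the result before calling methods on it when the tag may be absent.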

Filtering by Class Name

Sometimes we are only interested in tags that have a specific class name. Beautiful Soup lets us filter by class using the class_ parameter (with a trailing underscore, because class is a reserved word in Python). Here’s an example:

tags = soup.find_all('div', class_='container')
for tag in tags:
    print(tag)

This code will find all the <div> tags with the class name “container” and print them out.

Output:

<div class="container">
<h1>Welcome to my website</h1>
<p>This is some sample text.</p>
<a href="https://www.example.com">Link 1</a>
<a href="https://www.example.com">Link 2</a>
</div>
<div class="container">
<h2>About me</h2>
<p>I am a web developer.</p>
<a href="https://www.example.com">Link 3</a>
</div>
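One detail worth knowing is how class_ behaves when a tag carries more than one class. The sketch below uses hypothetical markup (the highlight and sidebar classes are not part of the sample page) to show that class_ matches a tag if any one of its classes equals the given string, and that a CSS selector passed to select() can require several classes at once.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div carries two classes.
html = """
<div class="container"><p>First</p></div>
<div class="container highlight"><p>Second</p></div>
<div class="sidebar"><p>Third</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches when any single class on the tag equals the string.
containers = soup.find_all("div", class_="container")
container_texts = [div.get_text() for div in containers]

# A CSS selector can require both classes on the same tag.
highlighted = soup.select("div.container.highlight")
highlighted_texts = [div.get_text() for div in highlighted]

print(container_texts)
print(highlighted_texts)
```

The select() method relies on the soupsieve package, which is normally installed together with recent versions of beautifulsoup4.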

Filtering by Other Attributes

In addition to filtering by tag name and class name, Beautiful Soup allows us to filter tags based on other attributes as well. We can use the attrs parameter to specify the attribute-value pairs we are interested in. Here’s an example:

tags = soup.find_all('a', attrs={'href': 'https://www.example.com'})
for tag in tags:
    print(tag)

This code will find all the <a> tags whose href attribute is exactly “https://www.example.com” and print them out. In our sample document every link points to that URL, so all three links match.

Output:

<a href="https://www.example.com">Link 1</a>
<a href="https://www.example.com">Link 2</a>
<a href="https://www.example.com">Link 3</a>
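An attribute value does not have to be an exact string. As a rough sketch, the snippet below uses hypothetical links (these URLs and the is_blog helper are made up for illustration) to show three other kinds of attribute filter that find_all() accepts: True, a compiled regular expression, and a function.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with a mix of link styles.
html = """
<a href="https://www.example.com/docs">Docs</a>
<a href="https://blog.example.com/post">Blog</a>
<a href="/relative/path">Relative</a>
<a name="anchor-only">No href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# True matches any tag that has the attribute, whatever its value.
with_href = [a.get_text() for a in soup.find_all("a", href=True)]

# A compiled pattern is searched for within the attribute value.
absolute = [a.get_text() for a in soup.find_all("a", href=re.compile(r"^https://"))]


def is_blog(href):
    """Illustrative helper: receives the attribute value, or None if absent."""
    return href is not None and "blog." in href


blog = [a.get_text() for a in soup.find_all("a", href=is_blog)]

print(with_href)
print(absolute)
print(blog)
```

Keyword arguments such as href=... and the attrs dictionary are interchangeable for most attributes; attrs is the safer choice for names that are not valid Python identifiers, such as data-id.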

Conclusion

Beautiful Soup 4 makes it straightforward to search HTML and XML documents for the tags you care about. With the filtering techniques explained in this blog post, you can narrow a page down to the relevant elements instead of walking the whole document by hand.

In this post, we covered basic tag filtering by tag name, class name, and other attributes. You can further explore Beautiful Soup’s documentation to discover more advanced filtering options and techniques.
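As one example of those more advanced options, find_all() also accepts a function in place of a tag name. The function is called with each tag and should return True for the tags to keep. The sketch below is only an illustration: is_container_without_links is a hypothetical helper, and the markup is a shortened variant of the sample page in which the second container has no link.

```python
from bs4 import BeautifulSoup

# Shortened variant of the sample page; the second container has no link.
html = """
<div class="container">
<h1>Welcome to my website</h1>
<p>This is some sample text.</p>
<a href="https://www.example.com">Link 1</a>
</div>
<div class="container">
<h2>About me</h2>
<p>I am a web developer.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def is_container_without_links(tag):
    """Illustrative filter: a div with class "container" that holds no <a> tag."""
    return (
        tag.name == "div"
        and "container" in tag.get("class", [])
        and tag.find("a") is None
    )


matches = soup.find_all(is_container_without_links)
headings = [div.find(["h1", "h2"]).get_text() for div in matches]

print(headings)
```

A function filter is useful when a condition combines several properties of a tag, which a single name or attribute filter cannot express.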

I hope you found this blog post helpful. Happy web scraping with Beautiful Soup 4!