[파이썬] Beautiful Soup 4 CSS 선택자 사용하기

06 Sep 2023

beautiful soup

Beautiful Soup is a powerful library in Python for web scraping. It allows us to extract data from HTML and XML documents effortlessly. In addition to its robust parsing capabilities, Beautiful Soup also provides a convenient way to search and navigate through the parsed data using CSS selectors.

What are CSS selectors?

CSS selectors are patterns used to select HTML elements based on their class, id, attributes, and other properties. They are commonly used in CSS stylesheets to apply styling to specific elements on a web page. Beautiful Soup leverages CSS selectors to locate elements within the parsed HTML structure.

How to use CSS selectors with Beautiful Soup 4

To use CSS selectors in Beautiful Soup, we need to import the library and create a BeautifulSoup object from the HTML document. Here’s an example:

from bs4 import BeautifulSoup

# Assume html_doc is the HTML document we want to parse
soup = BeautifulSoup(html_doc, 'html.parser')

Once we have the BeautifulSoup object, we can use the .select() method to find elements using CSS selectors. The .select() method accepts a CSS selector as a string argument and returns a list of matching elements. Here’s an example:

# Select all <a> elements with class "link"
links = soup.select('a.link')

# Select the first <h1> element with id "title"
title = soup.select_one('h1#title')

# Select all <img> elements with alt attribute containing "logo"
logos = soup.select('img[alt*=logo]')

In the above example, soup.select('a.link') selects all <a> elements with the class “link”, soup.select_one('h1#title') selects the first <h1> element with the id “title”, and soup.select('img[alt*=logo]') selects all <img> elements with an alt attribute containing the word “logo”.

Additional tips and examples

Selecting by attribute values

We can use CSS selectors to select elements based on their attribute values. For example:

# Select all <a> elements with href attribute starting with "https://"
links = soup.select('a[href^="https://"]')

# Select all <input> elements with type attribute set to "text"
text_inputs = soup.select('input[type="text"]')

Combining multiple selectors

We can combine multiple CSS selectors to narrow down our search. For example:

# Select all <div> elements inside <section> elements
divs_inside_sections = soup.select('section div')

# Select all <a> elements inside <div> elements with class "container"
links_in_container = soup.select('div.container a')

Navigating the parsed structure

Beautiful Soup provides additional methods, such as .parent, .find_next_sibling, and .find_all_next, to navigate and search within the parsed structure. These methods can be useful when we need to extract data based on the element’s position or relationship with other elements.

Conclusion

Using CSS selectors with Beautiful Soup 4 can greatly simplify the process of scraping and extracting data from HTML documents. By leveraging the familiar syntax of CSS selectors, we can easily locate and retrieve the desired elements with minimal code. The ability to combine selectors and navigate the parsed structure further enhances the flexibility and power of Beautiful Soup. So go ahead and give it a try in your next web scraping project!