[파이썬] Beautiful Soup 4 웹 페이지 인코딩 다루기

06 Sep 2023

beautiful soup

Introduction

When working with web scraping in Python, we often encounter web pages with different encodings. Handling these encodings correctly is essential to ensure that the text we scrape is accurate and usable. Beautiful Soup 4, a popular Python library for web scraping, provides built-in support for handling web page encodings effortlessly. In this blog post, we will explore how to deal with web page encodings using Beautiful Soup 4 in Python.

Understanding Encodings

Before diving into Beautiful Soup 4’s encoding handling, it’s important to understand what encodings are. An encoding is a set of rules that maps characters to their binary representations. It defines how text data is stored and interpreted by computers. Some common encodings include UTF-8, ISO-8859-1, and ASCII.

Basic Usage of Beautiful Soup 4

To begin, we need to install Beautiful Soup 4. We can do this by running the following command:

pip install beautifulsoup4

Once installed, we can import it into our Python script:

from bs4 import BeautifulSoup

Next, we will obtain the HTML content of a web page and create a Beautiful Soup object. Here’s an example using requests to retrieve the web page:

import requests

# Make a GET request to the web page
response = requests.get("https://www.example.com")

# Create a Beautiful Soup object
soup = BeautifulSoup(response.content, "html.parser")

Handling Encodings with Beautiful Soup 4

Now that we have our Beautiful Soup object, we can handle the encoding of the web page content. Beautiful Soup 4 automatically tries to detect the encoding of the web page and converts it to Unicode, making it easier for us to work with the text data.

To access the text content of an element, we can simply use the .text attribute. Beautiful Soup will handle the encoding conversion for us. For example:

# Assuming we have a <p> tag with Korean text
paragraph = soup.find("p")
print(paragraph.text)

Beautiful Soup will handle the encoding conversion from the web page to Unicode, allowing us to print the Korean text correctly.

If we need to specify a specific encoding, we can do so by passing the encoding as an argument to the BeautifulSoup constructor. For example:

soup = BeautifulSoup(response.content, "html.parser", from_encoding="iso-8859-1")

By explicitly specifying the encoding, Beautiful Soup will convert the web page content using the given encoding.

Conclusion

Beautiful Soup 4 provides seamless support for handling web page encodings. By using Beautiful Soup, we can easily convert the web page content to Unicode, regardless of the underlying encoding. This simplifies the process of web scraping and ensures that our data is accurate and usable. With its easy-to-use API and versatility, Beautiful Soup 4 is a great choice for web scraping tasks involving different encodings.