[파이썬] 웹 스크래핑과 문학 정보 추출

01 Sep 2023

python

웹 스크래핑이란 인터넷에서 웹사이트의 내용을 자동으로 추출하는 기술을 말합니다. 이를 통해 인터넷 상의 다양한 정보를 손쉽게 수집할 수 있습니다. 이번 블로그에서는 Python을 사용하여 웹 스크래핑을 통해 문학 정보를 추출하는 방법을 알아보겠습니다.

1. 웹 스크래핑을 위한 라이브러리

Python에서 웹 스크래핑을 위해 사용되는 많은 라이브러리들이 있습니다. 그 중에서도 BeautifulSoup과 requests 라이브러리를 사용하여 웹 페이지를 가져올 수 있습니다.

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# 웹 페이지를 가져와서 BeautifulSoup 객체로 변환한 후에는 해당 객체를 통해 원하는 정보를 추출할 수 있습니다.

2. 문학 정보 추출 예제

예를 들어, Project Gutenberg에서 공개된 문학 작품 중에서 가장 인기 있는 작품들의 제목과 작가를 추출해보겠습니다.

import requests
from bs4 import BeautifulSoup

url = "https://www.gutenberg.org/browse/scores/top"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# 페이지 내에서 작품 제목과 작가를 추출하기 위해 해당 태그들을 찾아서 가져옵니다.
titles = soup.find_all("span", class_="title")
authors = soup.find_all("span", class_="subtitle")

for title, author in zip(titles, authors):
    print("작품 제목:", title.text.strip())
    print("작가:", author.text.strip())
    print()

이 코드를 실행하면 Project Gutenberg에서 인기 있는 작품들의 제목과 작가를 출력할 수 있습니다.

3. 데이터 저장하기

추출한 문학 정보를 파일로 저장하고 싶을 수도 있습니다. 예를 들어, CSV 파일로 저장해보겠습니다.

import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.gutenberg.org/browse/scores/top"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

titles = soup.find_all("span", class_="title")
authors = soup.find_all("span", class_="subtitle")

with open("literature.csv", mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Author"])

    for title, author in zip(titles, authors):
        writer.writerow([title.text.strip(), author.text.strip()])

print("문학 정보가 literature.csv 파일에 저장되었습니다.")

위 코드를 실행하면 “literature.csv”라는 파일에 추출한 문학 정보가 저장됩니다.

마무리

Python을 사용하여 웹 스크래핑을 통해 문학 정보를 추출하는 방법을 알아보았습니다. 이를 응용하여 다양한 웹 사이트에서 원하는 정보를 추출하고 활용할 수 있습니다. 웹 스크래핑은 데이터 분석, 자동화, 웹 크롤링 등 다양한 분야에서 유용하게 활용될 수 있으므로, 꼭 익혀두는 것이 좋습니다.