[python] aiohttp와 asyncio를 사용하여 비동기적으로 웹 크롤러 결과 필터링하기

21 Nov 2023

python

웹 크롤링 작업을 수행할 때, 비동기적인 방식으로 처리하면 성능과 처리 속도를 크게 향상시킬 수 있습니다. aiohttp는 비동기 HTTP 클라이언트로서, asyncio는 Python의 비동기 프로그래밍 모듈입니다. 이 두 모듈을 함께 사용하여 웹 크롤러 결과를 비동기적으로 필터링하는 방법을 알아보겠습니다.

1. aiohttp를 사용하여 웹 페이지 가져오기

먼저, aiohttp를 사용하여 비동기적으로 웹 페이지를 가져오는 방법을 알아보겠습니다. 아래는 aiohttp를 사용하여 특정 URL에서 웹 페이지를 가져오는 예제입니다.

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'https://example.com')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

위의 예제에서 fetch 함수는 aiohttp 클라이언트 세션과 URL을 인자로 받아 해당 URL의 웹 페이지를 가져옵니다. main 함수에서는 aiohttp의 비동기 클라이언트 세션을 생성하고, fetch 함수를 사용하여 웹 페이지를 가져옵니다.

2. 가져온 웹 페이지에서 결과 필터링하기

웹 페이지를 가져왔으면 이제 원하는 내용을 필터링하여 추출하는 작업을 수행할 수 있습니다. 아래는 가져온 웹 페이지에서 특정 요소를 필터링하는 예제입니다.

from bs4 import BeautifulSoup

def filter_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    # 원하는 요소들을 필터링하는 로직을 작성합니다.
    filtered_elements = soup.find_all('div', class_='result')
    return filtered_elements

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'https://example.com')
        filtered_elements = filter_page(html)
        for element in filtered_elements:
            print(element.text)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

위의 예제에서 filter_page 함수는 BeautifulSoup을 사용하여 가져온 웹 페이지에서 필터링된 요소들을 추출합니다. find_all 메서드를 사용하여 원하는 요소를 필터링할 수 있습니다. 이후 filtered_elements를 순회하며 필요한 작업을 수행할 수 있습니다.

결론

위에서는 aiohttp와 asyncio를 사용하여 비동기적으로 웹 크롤러 결과를 필터링하는 방법에 대해 알아보았습니다. 이를 통해 성능과 처리 속도를 향상시키면서 웹 크롤링 작업을 보다 효율적으로 수행할 수 있습니다.