[python] aiohttp와 asyncio를 사용하여 비동기적으로 웹 크롤러 결과 필터링하기

21 Nov 2023

python

소개

웹 크롤러는 인터넷에서 데이터를 수집하는 데 사용되는 도구입니다. 비동기 크롤링은 여러 요청을 동시에 처리하여 시간과 자원을 절약할 수 있습니다. 이번 포스트에서는 Python의 aiohttp와 asyncio를 사용하여 비동기적으로 웹 크롤링 결과를 필터링하는 방법을 알아보겠습니다.

aiohttp란?

aiohttp는 Python에서 비동기 HTTP 클라이언트 및 서버를 구축하기 위한 라이브러리입니다. 이 라이브러리는 asyncio와 함께 사용될 때 가장 효율적인 비동기 네트워킹 솔루션을 제공합니다.

asyncio란?

asyncio는 비동기 코드를 작성하기 위한 라이브러리입니다. 이를 통해 코루틴(coroutine)을 사용하여 비동기 작업을 수행할 수 있습니다. asyncio는 aiohttp와 함께 사용되어 비동기 웹 크롤러를 구현하는 데 사용됩니다.

코드 예제

다음은 aiohttp와 asyncio를 사용하여 비동기적으로 웹 크롤러 결과를 필터링하는 간단한 예제입니다.

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def filter_urls(urls):
    filtered_urls = []
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            task = asyncio.create_task(fetch(session, url))
            tasks.append(task)
        results = await asyncio.gather(*tasks)
        for url, result in zip(urls, results):
            if "keyword" in result:
                filtered_urls.append(url)
    return filtered_urls

if __name__ == "__main__":
    urls = ["https://website1.com", "https://website2.com", "https://website3.com"]
    filtered_urls = asyncio.run(filter_urls(urls))
    print(filtered_urls)

위의 코드에서는 fetch 함수를 통해 각 URL에 대한 비동기 HTTP GET 요청을 보냅니다. 그런 다음, filter_urls 함수에서는 fetch 함수를 병렬로 실행하여 결과를 가져온 후 특정 키워드가 결과에 있는지 확인합니다. 결과에 키워드가 포함된 URL은 filtered_urls 리스트에 추가됩니다.

결론

이제 aiohttp와 asyncio를 사용하여 비동기적으로 웹 크롤러 결과를 필터링하는 방법에 대해 알아보았습니다. 비동기 프로그래밍을 통해 크롤링 작업의 효율성을 향상시킬 수 있으며, aiohttp와 asyncio는 이를 달성하기 위한 강력한 도구입니다.

소개

aiohttp란?

asyncio란?

코드 예제

결론

참고 자료