[Python] Asynchronous Web Crawling with aiohttp

In this blog post, we will explore how to perform web scraping using the aiohttp library in Python. Web scraping is the process of extracting data from websites, and aiohttp is an asynchronous HTTP client/server library built on asyncio that lets us issue HTTP requests and handle responses efficiently. By leveraging asynchronous programming, we can crawl many web pages concurrently and retrieve the information we need.

Installing aiohttp

To begin, make sure you have aiohttp installed on your system. You can use the following command to install it via pip:

pip install aiohttp
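
If you plan to crawl at any real volume, aiohttp also ships an optional speedups extra that pulls in faster DNS resolution and decompression backends. Installing it is optional, and the exact set of accelerators can vary between versions:

pip install aiohttp[speedups]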

Creating an Asynchronous Web Crawler

Let’s now create a simple asynchronous web crawler using aiohttp. We will use Python’s asyncio module to handle asynchronous tasks. Here’s an example code snippet:

import asyncio
import aiohttp

async def fetch(session, url):
    # Issue a GET request and return the response body as text
    async with session.get(url) as response:
        return await response.text()

async def crawl_websites(urls):
    # One ClientSession is shared across all requests so connections are pooled
    async with aiohttp.ClientSession() as session:
        # Schedule one fetch task per URL
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        # Run the requests concurrently and collect the bodies in order
        responses = await asyncio.gather(*tasks)
        return responses

async def main():
    urls = ["https://example.com", "https://example.org", "https://example.net"]
    responses = await crawl_websites(urls)
    for response in responses:
        print(response[:100])  # Print first 100 characters of each website's response

if __name__ == "__main__":
    asyncio.run(main())  # Preferred over the deprecated get_event_loop() pattern

In the code above, we define an asynchronous function fetch that takes a ClientSession object and a URL. It makes a GET request to the URL and returns the response body as text.
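
Note that this bare fetch will happily return the body of a 404 page as if it were real content. Here is a minimal sketch of a hardened variant; the fetch_safe name, the 10-second timeout, and the empty-string fallback are illustrative choices, not aiohttp requirements:

import asyncio
import aiohttp

async def fetch_safe(session, url):
    # Abort if the whole request/response cycle takes longer than 10 seconds
    timeout = aiohttp.ClientTimeout(total=10)
    try:
        async with session.get(url, timeout=timeout) as response:
            response.raise_for_status()  # Turn 4xx/5xx statuses into exceptions
            return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        print(f"Failed to fetch {url}: {exc}")
        return ""  # Fallback so the caller still receives a result for this URL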

The crawl_websites function opens a single ClientSession, uses asyncio.create_task to schedule a fetch task for each URL, and then awaits asyncio.gather so the requests execute concurrently and the responses come back in the same order as the input URLs.
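
One caveat worth knowing: with its default settings, asyncio.gather propagates the first exception any task raises, so a single unreachable site can abort the whole crawl. If you would rather keep whatever succeeded, gather accepts return_exceptions=True, which places the exception objects in the result list instead of raising them. A brief sketch (the filtering step is an illustrative choice):

results = await asyncio.gather(*tasks, return_exceptions=True)
# Failed tasks appear as exception objects; keep only the successful bodies
pages = [r for r in results if not isinstance(r, Exception)]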

In the main function, we define a list of URLs, await crawl_websites, and print the first 100 characters of each page's response.
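
Printing raw HTML is rarely the end goal. As a purely illustrative example of extracting data from the fetched pages with nothing but the standard library, this helper pulls out each page's <title> tag (for anything more involved, a real HTML parser such as BeautifulSoup is the better tool):

import re

def extract_title(html):
    # Capture the contents of the first <title> tag, case-insensitively
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else "(no title)"

You could then swap the print call in main for print(extract_title(response)).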

To run this code, save it in a Python file (e.g., crawler.py) and execute it via the command line:

python crawler.py

Conclusion

With the power of aiohttp and asynchronous programming in Python, we can efficiently perform web scraping tasks. By making asynchronous requests, we can crawl multiple websites simultaneously, saving time and resources. The flexibility and simplicity of aiohttp make it a great choice for building web crawlers and performing various HTTP-based tasks.

Remember to be respectful when scraping: check each site's robots.txt, adhere to its terms of service and any legal requirements, and throttle your request rate so you do not overload the server.
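
One concrete way to stay polite is to cap how many requests are in flight at any moment. Below is a minimal sketch using asyncio.Semaphore; the crawl_politely and fetch_limited names and the default limit of 5 are illustrative choices:

import asyncio
import aiohttp

async def fetch_limited(semaphore, session, url):
    async with semaphore:  # Wait for a free slot before issuing the request
        async with session.get(url) as response:
            return await response.text()

async def crawl_politely(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)  # At most `limit` concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch_limited(semaphore, session, url)) for url in urls]
        return await asyncio.gather(*tasks)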