[python] aiohttp를 사용하여 비동기적으로 웹 크롤러 결과 데이터베이스에 추가하기

21 Nov 2023

python

웹 크롤링은 실시간으로 다양한 웹 페이지에서 정보를 가져와야 하는 경우에 유용한 기술입니다. 최근에는 비동기 프로그래밍이 더욱 주목받고 있으며, aiohttp는 파이썬에서 비동기 HTTP 클라이언트를 사용하기 위한 인기있는 라이브러리입니다. 이번 블로그 포스트에서는 aiohttp를 사용하여 비동기적으로 웹 크롤링한 결과를 데이터베이스에 추가하는 방법을 알아보겠습니다.

데이터베이스 설정

먼저 데이터베이스에 접속하기 위해 필요한 설정을 해야 합니다. 여기서는 PostgreSQL을 예로 들겠습니다. 다음과 같이 psycopg2 라이브러리를 사용하여 데이터베이스에 연결하고 테이블을 생성합니다.

import psycopg2

# 데이터베이스 연결 설정
connection = psycopg2.connect(
    user="your_username",
    password="your_password",
    host="your_host",
    port="your_port",
    database="your_database"
)

# 테이블 생성
def create_table():
    try:
        cursor = connection.cursor()
        cursor.execute(
            """
            CREATE TABLE IF NOT EXISTS crawled_data (
                id SERIAL PRIMARY KEY,
                url TEXT,
                data TEXT
            )
            """
        )
        connection.commit()
    except (Exception, psycopg2.Error) as error:
        print("Failed to create table.", error)
    finally:
        if connection:
            cursor.close()
            connection.close()

aiohttp를 사용하여 웹 크롤러 작성

이제 aiohttp를 사용하여 비동기적으로 웹 크롤링을 수행하는 함수를 작성해보겠습니다. 다음과 같이 crawl_page 함수를 정의합니다.

import aiohttp

async def crawl_page(session, url):
    async with session.get(url) as response:
        data = await response.text()

        # 크롤링한 데이터를 데이터베이스에 추가
        insert_data(url, data)

async def main():
    urls = [...]  # 크롤링하고자 하는 웹 페이지 URL 리스트

    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            task = asyncio.create_task(crawl_page(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())

데이터베이스에 결과 추가

위에서 crawl_page 함수에서 insert_data 함수를 호출하여 크롤링한 데이터를 데이터베이스에 추가하는 것을 볼 수 있습니다. insert_data 함수는 다음과 같이 정의됩니다.

def insert_data(url, data):
    try:
        connection = psycopg2.connect(
            user="your_username",
            password="your_password",
            host="your_host",
            port="your_port",
            database="your_database"
        )
        cursor = connection.cursor()
        cursor.execute(
            "INSERT INTO crawled_data (url, data) VALUES (%s, %s)",
            (url, data)
        )
        connection.commit()
    except (Exception, psycopg2.Error) as error:
        print("Failed to insert data.", error)
    finally:
        if connection:
            cursor.close()
            connection.close()

실행

마지막으로 main 함수를 실행하여 웹 크롤링을 수행하고 데이터베이스에 결과를 추가합니다. 프로그램을 실행하기 위해 다음 명령어를 실행하세요.

python your_script.py

이제 aiohttp를 사용하여 비동기적으로 웹 크롤러 결과를 데이터베이스에 추가하는 방법을 알게 되었습니다. 이를 통해 웹 크롤링 작업을 더욱 효율적으로 처리할 수 있습니다.