[파이썬] requests-html 로그 생성 및 분석

When working with web scraping or automated web browsing tasks in Python, the requests-html library is a popular choice due to its simplicity and powerful features. However, it’s often necessary to generate and analyze logs to gain insights into the execution flow and potential issues. In this blog post, we will explore how to generate logs in requests-html and analyze them using Python.

Generating Logs with requests-html

requests-html doesn’t provide built-in logging capabilities, but we can easily integrate it with the logging module in Python to generate logs. Here’s an example of how to set up logging for your requests-html code:

import logging
from requests_html import HTMLSession

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create an HTML session
session = HTMLSession()

# Enable logging for requests-html
session.logger.setLevel(logging.DEBUG)

# Perform your web scraping tasks
response = session.get('https://example.com')

# Log the requests-html output
logger.debug(response.html.render())

In the above code, we import logging and HTMLSession from requests-html. We then set the logging level to INFO and create a logger object. Next, we create an HTML session and set the logging level for HTMLSession to DEBUG.

Finally, we perform our web scraping tasks using session.get() and log the output using the logger’s debug() method.

Analyzing Logs with Python

Once we have generated the logs, we can analyze them using Python libraries like re for regular expressions, pandas for data manipulation, and matplotlib for data visualization. Here’s an example of how to analyze the logs to extract useful information:

import re
import pandas as pd
import matplotlib.pyplot as plt

# Read the log file
with open('log.txt', 'r') as file:
    log_data = file.read()

# Extract relevant information using regular expressions
matches = re.findall(r'Response status code: (\d+)', log_data)
status_codes = [int(match) for match in matches]

# Create a pandas DataFrame
df = pd.DataFrame({'status_code': status_codes})

# Plot the distribution of status codes
plt.hist(df['status_code'], bins=20, alpha=0.75)
plt.xlabel('Status Code')
plt.ylabel('Frequency')
plt.title('Distribution of Status Codes')
plt.show()

In the code above, we read the log file using open() and extract the response status codes using regular expressions. We then create a pandas DataFrame to store the status codes and plot their distribution using matplotlib.

This is just a basic example, but with the power of Python and its libraries, you can perform various analyses on your requests-html logs, such as extracting specific data, calculating response times, or identifying errors.

Conclusion

Generating and analyzing logs in requests-html can provide valuable insights into your web scraping or automated web browsing tasks. By integrating requests-html with the logging module and leveraging Python’s data analysis libraries, you can gain a deeper understanding of your code’s behavior and make improvements where necessary.