[Python] Analyzing Scrapy Log Files

Scrapy is a powerful and flexible web scraping framework for extracting data from websites. When a Scrapy spider runs, its log output can be captured to a file and is useful for debugging and analyzing the scraping process. In this blog post, we will explore how to analyze Scrapy log files using Python.
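If Scrapy’s log output is not already going to a file, you can point it there with the LOG_FILE setting (or the --logfile option when running scrapy crawl). A minimal sketch of the relevant settings:

# settings.py -- send Scrapy's log output to a file
LOG_FILE = 'scrapy.log'   # file the spider's log will be written to
LOG_LEVEL = 'DEBUG'       # record everything from DEBUG upward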

1. Reading the log file

To begin, we need to read the Scrapy log file using Python. We can use the open() function to open the log file and read its contents.

logfile = 'scrapy.log'
# Read the whole log file into a single string
with open(logfile, 'r', encoding='utf-8') as f:
    log_data = f.read()

Here, we assume that the log file is located in the same directory as the Python script. Make sure to provide the correct path if your log file is in a different location.
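For very large logs, reading the entire file into memory may be unnecessary; as a sketch (not needed for the rest of this post), the same file can be streamed line by line instead:

# Stream the log file line by line to keep memory usage low
with open(logfile, 'r', encoding='utf-8') as f:
    for line in f:
        # handle one log line at a time, e.g. print only ERROR lines
        if ' ERROR:' in line:
            print(line.rstrip('\n'))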

2. Parsing the log file

Next, we need to parse the log file and extract the relevant information. By default, Scrapy writes each log entry using the format %(asctime)s [%(name)s] %(levelname)s: %(message)s (controlled by the LOG_FORMAT and LOG_DATEFORMAT settings), so every line carries a timestamp, the logger name, the log level, and the log message.

We can use regular expressions (regex) to extract this information from each log entry. The re module in Python provides functions to work with regex patterns.

import re

# A default-format Scrapy log line looks like:
# 2023-09-13 10:00:00 [scrapy.core.engine] INFO: Spider opened
log_entries = re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[\w.]+\] (\w+): (.*)', log_data)

This regex pattern captures the timestamp, log level, and log message from each entry; the logger name in square brackets is matched but not captured. The findall() function returns a list of tuples, one per log entry. If you have customized LOG_FORMAT or LOG_DATEFORMAT, adjust the pattern to match.
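As a quick sanity check, you can run the pattern against a single line in the default format (the line below is an invented but representative example):

sample = '2023-09-13 10:00:00 [scrapy.core.engine] INFO: Spider opened'
print(re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[\w.]+\] (\w+): (.*)', sample))
# [('2023-09-13 10:00:00', 'INFO', 'Spider opened')]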

3. Analyzing the log entries

Now that we have parsed the log file and obtained the log entries, we can analyze them to get insights about the scraping process. Common analysis tasks may include counting the number of log entries at each log level, identifying any errors or warnings, or extracting specific information from the log messages.

For example, let’s count the number of log entries at each log level:

# Tally how many log entries appear at each log level
log_levels = {}

for entry in log_entries:
    timestamp, level, message = entry
    if level in log_levels:
        log_levels[level] += 1
    else:
        log_levels[level] = 1

for level, count in log_levels.items():
    print(f"{level}: {count}")

This code snippet builds a dictionary, log_levels, that maps each log level to the number of entries at that level, and then prints the counts. (The same tally could be written more concisely with collections.Counter.)
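The same parsed entries also make it easy to surface problems. Here is a short sketch that keeps only the entries logged at WARNING level or above (the level names follow Python’s standard logging levels):

# Keep only the entries logged at WARNING, ERROR or CRITICAL level
problems = [(timestamp, level, message)
            for timestamp, level, message in log_entries
            if level in ('WARNING', 'ERROR', 'CRITICAL')]

for timestamp, level, message in problems:
    print(f"{timestamp} {level}: {message}")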

4. Visualizing the log data

To gain a better understanding of the log data, we can visualize it using Python libraries such as matplotlib or seaborn, which make it easy to turn the parsed entries into charts and plots.

For instance, let’s plot a bar chart showing the count of log entries at each log level with matplotlib, reusing the log_levels dictionary built in the previous step:

import matplotlib.pyplot as plt

plt.bar(log_levels.keys(), log_levels.values())
plt.xlabel('Log Level')
plt.ylabel('Count')
plt.title('Count of Log Entries at Each Log Level')
plt.show()

This code snippet creates a bar plot from the keys and values of the log_levels dictionary, labels the x- and y-axes, sets a title, and displays the figure with plt.show().
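If the script runs on a server or another environment without a display, the figure can be written to an image file instead of shown interactively (the filename here is just an example):

# Use in place of plt.show(): write the chart to an image file
plt.savefig('log_levels.png', dpi=150, bbox_inches='tight')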

Conclusion

Analyzing Scrapy log files can provide valuable insights into the scraping process, help identify any issues or errors, and optimize the scraping code. In this blog post, we learned how to read and parse Scrapy log files using Python. We also explored ways to analyze and visualize the log data for better understanding.