[Python] Automating Large-Scale Data Processing and Storage


With the increasing amount of data being generated every day, efficient processing and storage of large datasets are essential for businesses to gain valuable insights. In this blog post, we will explore how to automate the handling of large datasets in Python.

1. Reading and Manipulating Data

Python provides a powerful library called pandas for data manipulation and analysis. To handle large datasets efficiently, we can utilize the following techniques:

a) Read Data in Chunks

When dealing with large files, it is often more practical to read them in smaller chunks rather than loading the entire dataset into memory. The pandas library provides the read_csv() function with a chunksize parameter that allows us to read the data in smaller portions. By iterating over the chunks, we can process them one at a time.

import pandas as pd

chunk_size = 10000
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    # Process the chunk
    ...
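
To make the pattern concrete, here is a minimal sketch that combines the per-chunk results into a single statistic. It assumes large_data.csv contains a numeric column named column (the same placeholder names used above).

import pandas as pd

chunk_size = 10000
total = 0
count = 0
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    # Accumulate partial results so only one chunk is in memory at a time
    total += chunk['column'].sum()
    count += len(chunk)

# Combine the partial results once all chunks have been processed
average = total / count
print(average)

Because only one chunk is held in memory at a time, peak memory usage stays roughly constant regardless of the file size.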

b) Filter and Subset Data

In many cases, we only need a subset of the data for our analysis. Instead of manipulating the entire dataset, we can filter and subset the data based on specific conditions.

# Filter data based on a condition
filtered_data = chunk[chunk['column'] > 100]

# Select specific columns
subset_data = chunk[['column1', 'column2']]
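
Filtering pairs naturally with chunked reading: each chunk can be filtered and appended to an output file so the full dataset never has to fit in memory. The sketch below reuses the placeholder column names from the snippets above; the output filename is also a placeholder.

import pandas as pd

chunk_size = 10000
first_chunk = True
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    # Keep only the rows and columns we actually need
    filtered = chunk[chunk['column'] > 100][['column1', 'column2']]

    # Append each filtered chunk to the output file, writing the header only once
    filtered.to_csv('filtered_data.csv', mode='w' if first_chunk else 'a',
                    header=first_chunk, index=False)
    first_chunk = False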

2. Efficient Data Storage

Handling large datasets also involves efficient storage mechanisms. Python offers various options for storing and managing large amounts of data, such as:

a) SQLite Database

SQLite is a lightweight, file-based database that doesn't require a separate server process. It handles concurrent reads well (writes are serialized), which makes it a good fit for small to medium datasets and read-heavy workloads.

import sqlite3

# Create a connection and cursor (the database file is created if it does not exist)
conn = sqlite3.connect('data.db')
cursor = conn.cursor()

# Insert data into an existing table; `data` is a tuple with one value per
# "?" placeholder (one placeholder per column of the table)
cursor.execute("INSERT INTO table_name VALUES (?, ?, ...)", data)

# Commit the changes
conn.commit()

# Close the connection
conn.close()
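
When the data comes from pandas, an alternative worth knowing is DataFrame.to_sql(), which generates the INSERT statements for you. A minimal sketch, assuming the same large_data.csv file and table name as above:

import sqlite3
import pandas as pd

conn = sqlite3.connect('data.db')

chunk_size = 10000
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    # Append each chunk to the table; pandas creates the table on the first write
    chunk.to_sql('table_name', conn, if_exists='append', index=False)

conn.close()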

b) HDF5 Format

HDF5 (Hierarchical Data Format) is designed for storing and managing large, complex datasets. It provides high-performance reads and writes and supports partial I/O, so you can slice into a dataset on disk without loading the whole thing into memory.

import h5py
import numpy as np

# Example data; in practice this would come from your processing pipeline
data = np.random.rand(1000, 1000).astype('float32')

# Create an HDF5 file
file = h5py.File('data.h5', 'w')

# Create a dataset and write the data to it
dataset = file.create_dataset('dataset_name', shape=(1000, 1000), dtype='float32')
dataset[:] = data

# Close the file
file.close()
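
One practical benefit of HDF5 is that slicing a dataset only reads the requested portion from disk. Continuing from the file created above:

import h5py

# Open the file read-only and slice the dataset without loading it all into memory
with h5py.File('data.h5', 'r') as f:
    dataset = f['dataset_name']
    first_rows = dataset[:100]   # only these rows are read from disk
    print(first_rows.shape)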

c) Apache Parquet

Apache Parquet is a columnar storage file format that offers a high compression ratio and efficient query performance. It is particularly useful for analytical workloads on large datasets.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Example DataFrame; in practice this is the data you have processed
data = pd.DataFrame({'column1': range(1000), 'column2': range(1000)})

# Create an Arrow Table from the DataFrame
table = pa.Table.from_pandas(data)

# Write the Table to a Parquet file
pq.write_table(table, 'data.parquet')
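
Because Parquet is columnar, reading back only the columns you need is cheap. A small sketch, reusing the placeholder column names from earlier:

import pyarrow.parquet as pq

# Read only the required columns instead of the whole file
table = pq.read_table('data.parquet', columns=['column1', 'column2'])

# Convert to a pandas DataFrame for further processing
df = table.to_pandas()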

3. Automation with Python Scripts

To automate the data processing and storage tasks, we can create Python scripts and schedule them to run at specific intervals.

a) Using Python’s sched module

Python's sched module provides a general-purpose event scheduler that runs inside a Python process. By defining a function that performs the data processing and storage operations, we can schedule it to run at a chosen time; each scheduled event fires only once, so recurring jobs re-schedule themselves (see the sketch after the example below).

import sched
import time

def data_processing():
    # Data processing code here
    pass

def main():
    # Create a scheduler
    scheduler = sched.scheduler(time.time, time.sleep)

    # Schedule the data processing function to run once at an absolute time
    # (here: 60 seconds from now, as an example)
    time_to_run = time.time() + 60
    scheduler.enterabs(time_to_run, 1, data_processing)

    # Run the scheduler; this blocks until all scheduled events have fired
    scheduler.run()

if __name__ == '__main__':
    main()
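
Since each scheduled event fires only once, a recurring job can be built by having the task re-schedule itself. A minimal sketch of that pattern (the daily interval is an example value):

import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def data_processing():
    # Data processing code here
    pass

def run_periodically(interval_seconds):
    # Do the work, then schedule the next run so the task repeats
    data_processing()
    scheduler.enter(interval_seconds, 1, run_periodically, argument=(interval_seconds,))

# Start immediately, then repeat once per day (86400 seconds)
scheduler.enter(0, 1, run_periodically, argument=(86400,))
scheduler.run()

Note that this keeps a Python process running; for unattended daily runs, a system scheduler such as cron (below) is usually simpler.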

b) Using cron jobs

Another approach to automation is cron, the time-based job scheduler found on Unix-like operating systems. We can create a cron job that executes our Python script at predefined intervals.

# Edit crontab using the command `crontab -e`
# Add the following line to schedule the script to run daily at midnight
0 0 * * * python /path/to/script.py
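
Since cron jobs run without a terminal, it is common to redirect the script's output to a log file so failures stay visible; for example (paths are placeholders):

# Run daily at midnight and append output, including errors, to a log file
0 0 * * * python /path/to/script.py >> /path/to/script.log 2>&1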

Conclusion

By leveraging the power of Python and its libraries like pandas, sqlite3, h5py, and pyarrow, we can handle and automate the processing and storage of large datasets efficiently. Whether it’s reading data in chunks, filtering subsets, or utilizing various storage options, Python provides the tools necessary to handle the ever-growing data demands of businesses.

Remember, automation is key to saving time and effort when dealing with massive amounts of data. With the techniques and tips mentioned in this blog post, you can streamline your data processing and storage workflows with ease.