[Python] fastai Data Versioning

In any machine learning project, it is essential to keep track of the versions of the datasets being used. Data versioning allows you to maintain a record of the changes made to the datasets over time and enables reproducibility of your machine learning workflows.

fastai itself does not ship a dedicated versioning tool, but its data loading APIs work well with a simple directory-based versioning scheme. In this blog post, we will walk through how to manage data versions for a fastai project using plain Python.

1. Initialize a Data Repository

To start with data versioning, you need to set up a data repository to store the different versions of your datasets. You can create a separate directory or use a version control system like Git to organize and manage your data versions.

import os
import shutil

data_repo = os.path.join(os.getcwd(), 'data_repo')
os.makedirs(data_repo, exist_ok=True)

Here, we create a data_repo directory using the os module's makedirs() function. The exist_ok=True argument ensures that no error is raised if the directory already exists.

2. Version Your Datasets

Once the data repository is set up, you can create different versions of your datasets and store them in the repository. One way to version your datasets is by creating timestamped directories for each version.

import datetime

timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
version_dir = os.path.join(data_repo, f'version_{timestamp}')
os.makedirs(version_dir)

In this code snippet, we format the current time with the strftime() method of a datetime object to produce a timestamp. We then create a directory with a version-specific name inside the data_repo directory.

3. Copy Datasets to Versioned Directory

Next, you need to copy the dataset files to the versioned directory using the shutil module.

source_dir = os.path.join(os.getcwd(), 'datasets')
dataset_files = os.listdir(source_dir)

# Copy only regular files; skip any subdirectories
for file in dataset_files:
    source_path = os.path.join(source_dir, file)
    if os.path.isfile(source_path):
        shutil.copy(source_path, version_dir)

Here, we assume that the original datasets live in a directory named 'datasets'. We use the listdir() function to enumerate its contents, skip anything that is not a regular file, and copy each dataset file into the versioned directory with shutil.copy().
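To make each version self-describing, you can also record a simple manifest of file checksums alongside the copied data. This is a minimal sketch using only the standard library; the manifest.txt file name and its format are arbitrary choices for illustration, not a fastai convention.

import hashlib

manifest_path = os.path.join(version_dir, 'manifest.txt')
with open(manifest_path, 'w') as manifest:
    for file in sorted(os.listdir(version_dir)):
        file_path = os.path.join(version_dir, file)
        if not os.path.isfile(file_path) or file == 'manifest.txt':
            continue
        # Hash each file so the version's contents can be verified later
        digest = hashlib.sha256(open(file_path, 'rb').read()).hexdigest()
        manifest.write(f'{digest}  {file}\n')

Comparing two versions is then just a matter of diffing their manifest files.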

4. Track Data Versions with Git (Optional)

If you are using a version control system like Git, you can track the changes made to the data repository and easily revert to previous versions if needed.

import subprocess

subprocess.run(['git', 'init'], cwd=data_repo)
subprocess.run(['git', 'add', '-A'], cwd=data_repo)
subprocess.run(['git', 'commit', '-m', 'Initial commit'], cwd=data_repo)

By executing these commands, you initialize a new Git repository, add all the files within the data repository, and make an initial commit.
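If you commit each new version as you create it, a lightweight tag per commit makes individual versions easy to find later. The sketch below simply reuses the timestamped directory name as the tag name; this is one possible convention, not something required by Git or fastai.

subprocess.run(['git', 'add', '-A'], cwd=data_repo)
subprocess.run(['git', 'commit', '-m', f'Add version_{timestamp}'], cwd=data_repo)
# Tag the commit so this version can be checked out by name later
subprocess.run(['git', 'tag', f'version_{timestamp}'], cwd=data_repo)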

5. Retrieve a Specific Data Version

To retrieve a specific version of the dataset, you can simply access the desired version directory.

specific_version = os.path.join(data_repo, 'version_20211231_120000')

Once you have the path to the desired version directory, you can load the dataset from that location for further processing.
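For example, if the versioned directory contains an image classification dataset organized into one subfolder per class, you could point fastai's data loaders directly at it. This is only a sketch under that assumption; for tabular or text data you would use the corresponding fastai loader instead.

from fastai.vision.all import *

# Build DataLoaders from the chosen version directory
dls = ImageDataLoaders.from_folder(
    specific_version,
    valid_pct=0.2,          # hold out 20% of the images for validation
    item_tfms=Resize(224),  # resize images to a uniform size
)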

Conclusion

In this blog post, we explored how to manage data versions for fastai projects using Python. By following these steps, you can effectively track the changes made to your datasets, ensure reproducibility, and easily retrieve specific versions when needed. Effective data versioning is crucial for maintaining transparency and traceability in machine learning projects.