[파이썬] pandas Tidy 데이터 원칙

In the world of data analysis and manipulation, pandas is often the go-to library for most Python programmers. It provides a powerful and flexible toolkit for working with structured data. One of the key principles to keep in mind when using pandas is the concept of “tidy data”.

Tidy data is a framework introduced by Hadley Wickham, the creator of the R programming language’s tidyverse. The concept revolves around organizing and structuring data in a consistent manner, making it easier to analyze and visualize.

In this blog post, we’ll explore the pandas tidy data principles and demonstrate how to implement them in Python.

1. Each variable forms a column

The first principle of tidy data is that each variable should form a column. In other words, each distinct piece of information or attribute should be represented as a separate column in the DataFrame.

For example, if we are analyzing a dataset of student information, each variable such as name, age, gender, and grade should be stored in its own column.

import pandas as pd

data = {
    'name': ['John', 'Amy', 'Michael', 'Emily'],
    'age': [20, 22, 21, 19],
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'grade': [85, 90, 78, 92]
}

df = pd.DataFrame(data)

2. Each observation forms a row

The second principle of tidy data states that each observation should form a row. This means that each individual data point or entry should be stored in its own row in the DataFrame.

For instance, if we have multiple measurements for a single student, each measurement should have its own row.

import pandas as pd

data = {
    'name': ['John', 'John', 'Amy', 'Amy'],
    'subject': ['Math', 'Physics', 'Math', 'Physics'],
    'grade': [85, 80, 90, 92]
}

df = pd.DataFrame(data)

3. Each type of observational unit forms a table

The third and final principle of tidy data states that each type of observational unit should form a separate table. In simpler terms, related data should be stored in separate DataFrames.

For example, if we have information about students and their courses, it would be better to have two separate DataFrames: one for student information and another for course information. These DataFrames can be linked together using a unique identifier such as a student ID.

import pandas as pd

# Student DataFrame
student_data = {
    'student_id': [1, 2, 3],
    'name': ['John', 'Amy', 'Michael'],
    'age': [20, 22, 21]
}

student_df = pd.DataFrame(student_data)

# Course DataFrame
course_data = {
    'student_id': [1, 2, 3],
    'course': ['Math', 'Physics', 'English'],
    'grade': [85, 90, 78]
}

course_df = pd.DataFrame(course_data)

By adhering to the pandas tidy data principles, you can ensure that your data is well-organized and structured, making it easier to perform analysis and extract insights.

In conclusion, tidy data is an important concept to consider when using pandas for data manipulation. By following the principles of tidy data, you can create more efficient and intuitive workflows for data analysis tasks.

Happy coding with pandas!