[파이썬] pandas 넓은 데이터 vs 긴 데이터

06 Sep 2023

pandas

When working with data in Python, especially with large datasets, it is crucial to choose the right data structure that optimizes performance and efficiency. Two common data structures in the pandas library are 넓은 데이터 (wide data) and 긴 데이터 (long data). These structures differ in their organization and manipulation of data, and understanding their characteristics can help make informed decisions when analyzing and manipulating datasets. In this blog post, we will explore the differences between 넓은 데이터 and 긴 데이터, and discuss their advantages and use cases.

넓은 데이터 (Wide Data)

In 넓은 데이터, also known as “wide format” or “wide data frame,” the values of different variables are stored in separate columns. Each row represents an individual observation or sample. This format is commonly used for cross-sectional analysis and is well-suited for simple calculations and operations that involve comparing or summarizing variables.

Here is an example of 넓은 데이터:

import pandas as pd

data_wide = {
    'Country': ['USA', 'Canada', 'Germany'],
    'Population': [327.2, 37.6, 82.79],
    'GDP': [21427, 1703, 3955],
    'Area': [9833517, 9976140, 357582]
}

df_wide = pd.DataFrame(data_wide)

In this example, each variable (Country, Population, GDP, Area) has its own column, and each row represents a specific country. This format makes it easy to compare values across variables, such as calculating the average GDP or finding the country with the largest area.

긴 데이터 (Long Data)

In 긴 데이터, also known as “long format” or “tidy data,” different observations of a variable are stacked in a single column, with additional columns providing contextual information. This format is commonly used for time series analysis, multi-level data, and when dealing with complex relationships between variables.

Here is an example of 긴 데이터:

import pandas as pd

data_long = {
    'Year': [2018, 2018, 2019, 2019],
    'Country': ['USA', 'Canada', 'USA', 'Canada'],
    'Population': [327.2, 37.6, 331.8, 37.8],
    'GDP': [21427, 1703, 21433, 1797],
    'Area': [9833517, 9976140, 9833517, 9976140]
}

df_long = pd.DataFrame(data_long)

In this example, each row represents a unique combination of Year and Country, with individual observations for Population, GDP, and Area in separate columns. This format allows for easier analysis of trends, comparisons, and aggregations based on different variables.

Advantages and Use Cases

The choice between 넓은 데이터 and 긴 데이터 depends on the specific analysis or task at hand. Here are some advantages and common use cases for each format:

넓은 데이터 (Wide Data)

Easy to understand and visualize, especially for simple comparisons and calculations.
Suitable for simple cross-sectional analysis and data manipulation tasks.
Works well for datasets with a small number of variables.

긴 데이터 (Long Data)

Facilitates more complex analyses, such as time series analysis or multi-level modeling.
Allows for easy manipulation and aggregation based on different variables.
Supports efficient storage and retrieval of data.
Works well for datasets with a large number of variables or observations.

In summary, 넓은 데이터 and 긴 데이터 are two different ways of organizing and structuring data in pandas. Choosing the appropriate format depends on the specific requirements of the analysis or task. Understanding the characteristics and advantages of both formats can help you make informed decisions when working with pandas in Python.