[Python] pandas Boolean Indexing

Boolean indexing is a core feature of pandas, a popular data manipulation library for Python. With Boolean indexing, we can select subsets of data based on conditions, which keeps filtering code short and readable.

What is Boolean indexing?

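Boolean indexing in pandas selects rows of a DataFrame or elements of a Series based on one or more conditions. Each condition is typically a comparison (such as >=, <, or ==) that yields True or False for every row. Multiple conditions are combined element-wise with the operators & (and), | (or), and ~ (not), with each condition wrapped in parentheses.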
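As a minimal sketch of combining conditions, assuming a small illustrative DataFrame with Name and TestScore columns (the variable names both, either, and not_high are made up for this example): each comparison gets its own parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Illustrative sample data (an assumption for this sketch)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'TestScore': [85, 92, 88, 95],
})

# AND: scores from 88 up to (but not including) 95
both = df[(df['TestScore'] >= 88) & (df['TestScore'] < 95)]

# OR: scores below 88, or 95 and above
either = df[(df['TestScore'] < 88) | (df['TestScore'] >= 95)]

# NOT: invert a condition with ~
not_high = df[~(df['TestScore'] >= 90)]

print(both['Name'].tolist())      # ['Bob', 'Charlie']
print(either['Name'].tolist())    # ['Alice', 'Dave']
print(not_high['Name'].tolist())  # ['Alice', 'Charlie']
```

Python's own and, or, and not keywords do not work here, since they try to reduce a whole Series to a single True or False and pandas raises a ValueError.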

Example scenario

Let’s say we have a DataFrame that contains information about students and their test scores. We want to filter the DataFrame to only include the rows where the test score is greater than or equal to 90. We can achieve this using Boolean indexing.

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'TestScore': [85, 92, 88, 95]}

df = pd.DataFrame(data)

# Boolean indexing to filter rows
filtered_df = df[df['TestScore'] >= 90]

print(filtered_df)

Output:

   Name  TestScore
1   Bob         92
3  Dave         95

Explanation

In the example above, we import the pandas library and create a DataFrame called df with student names (Name) and their respective test scores (TestScore).

We then use Boolean indexing on df by creating a filter condition (df['TestScore'] >= 90) inside square brackets. This condition checks whether the TestScore column is greater than or equal to 90 for each row. The result of this condition is a boolean Series, where True represents the rows that satisfy the condition and False represents the rows that do not.
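To make the intermediate step visible, the following sketch builds the condition on its own and inspects it. It assumes the same sample data as the example; the variable name mask is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'TestScore': [85, 92, 88, 95],
})

# The comparison is evaluated element-wise and returns a boolean Series
mask = df['TestScore'] >= 90

print(mask.tolist())  # [False, True, False, True]
print(mask.dtype)     # bool
print(mask.sum())     # 2, because True counts as 1 when summed
```

The mask shares the index of df, which is how pandas lines up each True or False value with the matching row.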

Passing this boolean Series as an index to the original DataFrame df filters the DataFrame and returns only the rows where the condition is True. We assign this filtered DataFrame to a new variable called filtered_df.
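One detail worth checking is that filtering leaves df untouched and that the result keeps the original index labels. The sketch below illustrates this with the same sample data; the reset_index(drop=True) step and the name renumbered are optional additions for illustration, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'TestScore': [85, 92, 88, 95],
})

filtered_df = df[df['TestScore'] >= 90]

# The original DataFrame still has all four rows
print(len(df))                     # 4

# The filtered rows keep their original index labels
print(filtered_df.index.tolist())  # [1, 3]

# Optionally renumber the rows from 0, discarding the old labels
renumbered = filtered_df.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
```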

Finally, we print filtered_df, which contains only the rows where the test score is greater than or equal to 90.

Conclusion

Boolean indexing in pandas lets us filter data based on conditions. By combining comparisons with the logical operators &, |, and ~, we can extract subsets of rows from a DataFrame or specific elements from a Series. This makes it a basic building block for data analysis, from simple threshold filters to selections that depend on several conditions at once.