[Python][scikit-learn] scikit-learn SMOTE

06 Sep 2023


In machine learning, dealing with imbalanced datasets can be a challenging task. Imbalanced datasets refer to datasets where the number of instances in one class is much higher or lower than the number of instances in another class. This can negatively impact the performance of many machine learning algorithms, as they often struggle to effectively learn from minority class samples.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular technique used to address the class imbalance problem. Rather than duplicating existing rows, it generates synthetic samples for the minority class by interpolating between a minority sample and one of its nearest minority-class neighbors, thereby increasing the number of instances for that class.
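The interpolation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the helper names (`nearest_neighbors`, `smote_like_sample`), the toy points, and the choice of `k=2` are all assumptions made for the example.

```python
import random


def nearest_neighbors(x, points, k):
    """Return the k points closest to x (squared Euclidean distance), excluding x itself."""
    others = [p for p in points if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the line segment between x and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)

# Toy minority-class points (hypothetical data, two features each)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]

synthetic = []
for _ in range(4):
    x = rng.choice(minority)                                    # pick a minority sample
    neighbor = rng.choice(nearest_neighbors(x, minority, k=2))  # pick one of its neighbors
    synthetic.append(smote_like_sample(x, neighbor, rng))       # interpolate between them

for point in synthetic:
    print(point)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.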

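In this blog post, we will explore how to apply SMOTE to an imbalanced dataset in Python. SMOTE itself is not part of Scikit-learn; it is provided by the companion imbalanced-learn library (imported as imblearn), which follows the Scikit-learn API.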

Prerequisites

Python 3.x
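Scikit-learn library (pip install scikit-learn)
imbalanced-learn library, which provides SMOTE (pip install imbalanced-learn)
Matplotlib, for the plots (pip install matplotlib)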

Dataset

Let’s start by loading an example dataset that has imbalanced classes. For this demonstration, we will use the breast cancer dataset from Scikit-learn, which is moderately imbalanced: 212 malignant samples against 357 benign ones.

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
y = data.target
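Before resampling, it helps to confirm how imbalanced the labels actually are. A minimal check, assuming NumPy is available (it is installed alongside Scikit-learn); the variable name `counts` is only illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# np.bincount counts how often each integer label (0, 1, ...) occurs;
# position i in the result corresponds to data.target_names[i]
counts = np.bincount(y)
for name, count in zip(data.target_names, counts):
    print(f"{name}: {count}")
# malignant: 212
# benign: 357
```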

Applying SMOTE

To apply SMOTE to the dataset, we need to import the SMOTE class from the imblearn.over_sampling module.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

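Here, we create an instance of the SMOTE class and then call the fit_resample() method, passing in the feature matrix X and the target vector y. This method applies SMOTE to the dataset and returns the resampled feature matrix X_resampled and the resampled target vector y_resampled. With the default settings, the minority class is oversampled until it has as many samples as the majority class. Because the synthetic samples are drawn at random, pass random_state to SMOTE if you need reproducible results.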

Visualizing the Results

To understand the impact of SMOTE on the dataset, we can plot a bar chart comparing the class distribution before and after applying SMOTE.

import matplotlib.pyplot as plt
import numpy as np

# data.target and y_resampled are NumPy arrays, which have no value_counts() method.
# np.bincount counts each integer label; index i lines up with data.target_names[i].
class_distribution_before = dict(zip(data.target_names, np.bincount(data.target)))
class_distribution_after = dict(zip(data.target_names, np.bincount(y_resampled)))

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.bar(class_distribution_before.keys(), class_distribution_before.values())
plt.title('Class Distribution Before SMOTE')

plt.subplot(122)
plt.bar(class_distribution_after.keys(), class_distribution_after.values())
plt.title('Class Distribution After SMOTE')

plt.tight_layout()
plt.show()

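The code above draws two bar charts side by side with matplotlib: the class distribution before SMOTE on the left and after SMOTE on the right. With the default SMOTE settings, the right-hand chart shows both classes at the same height, because the minority class has been oversampled up to the size of the majority class.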

Conclusion

In this blog post, we explored how to apply SMOTE to an imbalanced dataset in Python, using the imbalanced-learn library together with Scikit-learn. By generating synthetic samples for the minority class, SMOTE helps address the class imbalance problem in machine learning. Applying SMOTE can improve how well a model handles the minority class, although the size of the gain depends on the dataset and the model.

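Remember to install the required dependencies and experiment with different datasets to see where SMOTE helps your machine learning models. On imbalanced data, judge the results with metrics such as recall, F1, or balanced accuracy rather than plain accuracy, which can look high even when the minority class is mostly misclassified.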

I hope you found this blog post helpful! If you have any questions or suggestions, feel free to leave a comment below.