Scikit-Learn is a powerful machine learning library that provides a wide range of functionalities for data preprocessing, model training, and evaluation. One of the important tasks in machine learning is sampling, which involves selecting a subset of data points from a larger dataset.
In this blog post, we will explore different sampling methods available in Scikit-Learn and discuss how they can be used in Python for various machine learning tasks.
1. Random Sampling
Random sampling is a commonly used technique where data points are randomly selected from the dataset without any specific criteria. This method is useful when you have a large dataset and want to create a smaller, representative subset for training or testing purposes.
from sklearn.model_selection import train_test_split
X = ... # input features
y = ... # target labels
# Randomly split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In the above example, we use the train_test_split
function from Scikit-Learn’s model_selection
module to randomly split the data into training and testing sets. The test_size
parameter specifies the proportion of the dataset to be used for testing, and random_state
ensures reproducibility of the results.
2. Stratified Sampling
Stratified sampling is a sampling technique that ensures the class distribution in the dataset is preserved in the sampled subset. This method is particularly useful when dealing with imbalanced datasets, where one class may be significantly underrepresented.
from sklearn.model_selection import StratifiedShuffleSplit
X = ... # input features
y = ... # target labels
# Perform stratified sampling using a predefined number of splits
stratified_split = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_index, test_index in stratified_split.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
In the above example, we use the StratifiedShuffleSplit
class from Scikit-Learn’s model_selection
module to perform stratified sampling. The n_splits
parameter specifies the number of splits to perform, and test_size
determines the proportion of the dataset to be used for testing.
3. Oversampling and Undersampling
Oversampling and undersampling are techniques used to address the issue of imbalanced datasets. In oversampling, the minority class is replicated to match the number of samples in the majority class, whereas in undersampling, the majority class is reduced to match the number of samples in the minority class.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
X = ... # input features
y = ... # target labels
# Random oversampling
oversampler = RandomOverSampler(random_state=42)
X_oversampled, y_oversampled = oversampler.fit_resample(X, y)
# Random undersampling
undersampler = RandomUnderSampler(random_state=42)
X_undersampled, y_undersampled = undersampler.fit_resample(X, y)
In the example above, we use the RandomOverSampler
and RandomUnderSampler
classes from the imblearn
package to perform oversampling and undersampling respectively. The fit_resample
method is used to fit the sampling strategy to the data and generate the resampled output.
These are just a few of the sampling methods available in Scikit-Learn and its related packages. Depending on the specific requirements of your machine learning task, you can choose the appropriate sampling technique to improve the performance and reliability of your models.
Sampling plays a crucial role in machine learning, as it helps in obtaining representative subsets of data for training, testing, and evaluation. With Scikit-Learn, you have access to a variety of sampling techniques that can be easily applied in Python, enabling you to work with different types of datasets and improve the efficiency and accuracy of your machine learning models. So, make sure to familiarize yourself with these sampling methods and experiment with them in your own projects!