[Python][scikit-learn] Hyperparameter Tuning with scikit-learn

Scikit-learn is one of the most popular machine learning libraries in Python. It provides a wide array of machine learning algorithms and tools to simplify the process of building and training models. One essential aspect of building effective models is hyperparameter tuning.

Hyperparameters are configuration values set before training begins; unlike model parameters (such as the split thresholds of a decision tree), they cannot be learned from the data. They control the behavior of the algorithm and can greatly impact the performance of your model. Tuning these hyperparameters means searching for the combination of values that maximizes the model's performance.
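
To make the distinction concrete, here is a minimal sketch (using RandomForestClassifier purely as an example): hyperparameters like n_estimators are passed to the constructor up front, while learned parameters such as feature_importances_ only exist after fitting.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters are fixed before training, at construction time
model = RandomForestClassifier(n_estimators=100, max_depth=5)

# Learned parameters only exist after fitting
model.fit(X, y)
print(model.feature_importances_)  # learned from the data, not set by us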

In this blog post, we’ll explore how to tune hyperparameters using scikit-learn in Python. We’ll cover two common strategies: Grid Search and Randomized Search.

Grid Search is a brute-force method that exhaustively evaluates every combination of hyperparameters in a predefined grid, scoring the model via cross-validation for each one. Its advantage is that it is guaranteed to find the best-scoring combination within the specified grid; its drawback is cost, since the number of combinations multiplies with every hyperparameter you add.

Here is an example of how to perform grid search using scikit-learn:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data so the example runs end-to-end
X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=42
)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 4, 6]
}

# Create the model
model = RandomForestClassifier(random_state=42)

# Instantiate the GridSearchCV object (5-fold cross-validation)
grid_search = GridSearchCV(model, param_grid, cv=5)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

In this example, we create a grid of possible values for the hyperparameters n_estimators, max_depth, and min_samples_split of a Random Forest Classifier. We then create an instance of GridSearchCV and fit it on the training data, which cross-validates every combination in the grid. The best hyperparameters are then available through the best_params_ attribute.
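
The fitted GridSearchCV object carries more than just the best parameters. As a short follow-on sketch (continuing the example above, which holds out X_test and y_test):

from sklearn.metrics import accuracy_score

# Mean cross-validated score of the best combination
print(grid_search.best_score_)

# The best model is refit on the full training set and ready for prediction
y_pred = grid_search.best_estimator_.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Per-combination details live in cv_results_
print(grid_search.cv_results_['mean_test_score'])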

Randomized Search is an alternative approach to hyperparameter tuning that samples combinations at random from specified distributions. Because you fix the number of candidates to try (n_iter), the computational budget stays under your control, which makes it practical for large or continuous search spaces.

Here is an example of how to perform randomized search using scikit-learn:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data so the example runs end-to-end
X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=42
)

# Define the hyperparameter distributions
# (C is conventionally searched on a log scale, so loguniform is a natural choice)
param_dist = {
    'C': loguniform(1e-2, 1e2),
    'gamma': ['scale', 'auto'],
    'kernel': ['linear', 'rbf', 'poly']
}

# Create the model
model = SVC()

# Instantiate the RandomizedSearchCV object; n_iter controls the budget
random_search = RandomizedSearchCV(
    model, param_distributions=param_dist, n_iter=10, cv=5, random_state=42
)

# Fit the randomized search on the training data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_

In this example, we define a search space for the hyperparameters C, gamma, and kernel of a Support Vector Classifier, drawing C from a log-uniform distribution. RandomizedSearchCV samples 10 candidate combinations (n_iter=10), cross-validates each on the training data, and the best hyperparameters are again available through the best_params_ attribute.
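
As with grid search, the search object refits the best model on the full training set by default (refit=True), so it can score held-out data directly. A small sketch using the X_test split from above:

# With refit=True (the default), the search object acts as the best model
print(random_search.score(X_test, y_test))

# Inspect every sampled candidate and its mean CV score
for params, score in zip(random_search.cv_results_['params'],
                         random_search.cv_results_['mean_test_score']):
    print(round(score, 3), params)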

Both grid search and randomized search are effective techniques for hyperparameter tuning. As a rough rule of thumb, grid search suits small, discrete search spaces, while randomized search scales better to many hyperparameters or continuous ranges. Experimenting with different combinations of hyperparameters can help you find the configuration that gets the most out of your machine learning model.
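
To make the cost difference concrete with the examples above: the grid defines 3 × 3 × 3 = 27 combinations, so with 5-fold cross-validation grid search trains 27 × 5 = 135 models during the search (plus one final refit), while the randomized search with n_iter=10 trains only 10 × 5 = 50, no matter how large the search space is.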

So next time you’re working on a machine learning project using scikit-learn, don’t forget to explore the power of hyperparameter tuning for optimizing your models!

Happy coding!