[Python] Bayesian Optimization of xgboost Hyperparameters

xgboost (Extreme Gradient Boosting) is a popular machine learning library built on the gradient boosting framework, known for its strong performance on tasks such as regression, classification, and ranking.

In this blog post, we will explore how to optimize xgboost hyperparameters using Bayesian Optimization. Bayesian Optimization is a sample-efficient method for tuning hyperparameters: it builds a probabilistic model of the objective function and uses it to decide which hyperparameter combination to evaluate next.

Installing Required Packages

Before we begin, make sure you have xgboost and bayesian-optimization installed in your Python environment. The examples below also use pandas and scikit-learn. You can install all of these with pip:

pip install xgboost
pip install bayesian-optimization
pip install pandas scikit-learn

Loading Data

To demonstrate the Bayesian Optimization process, let’s assume we have a dataset stored in a CSV file. We will use the pandas library to load the data:

import pandas as pd

# Load data from CSV
data = pd.read_csv('data.csv')
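
Before any preprocessing, it is worth a quick look at the loaded frame to check column types and missing values. This is just a sanity check; the columns shown are whatever your own CSV contains.

# Inspect the first few rows, the column dtypes, and missing-value counts
print(data.head())
print(data.dtypes)
print(data.isna().sum())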

Preprocessing Data

Before applying xgboost, it is essential to preprocess the data. This typically involves handling missing values, encoding categorical variables, and splitting the data into training and testing sets. Here is an example covering the encoding and the train/test split (a sketch for the missing-value step follows the code):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Split features and target variable
X = data.drop('target', axis=1)
y = data['target']

# Encode categorical variables
cat_cols = X.select_dtypes(include='object').columns
for col in cat_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
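
The example above skips the missing-value handling mentioned earlier. A minimal sketch of that step, assuming numeric columns can take the column median and categorical columns a placeholder label, might look like this (run it before the label encoding):

# Fill missing numeric values with the column median
num_cols = X.select_dtypes(include='number').columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())

# Fill missing categorical values with an explicit placeholder label
cat_cols = X.select_dtypes(include='object').columns
X[cat_cols] = X[cat_cols].fillna('missing')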

Defining the Objective Function

In Bayesian Optimization, we need to define an objective function that returns the value to be maximized. In our case, the objective function will evaluate the xgboost model with a chosen evaluation metric, such as accuracy or AUC-ROC (a metric to be minimized, like mean squared error, should be negated). Here's an example using accuracy:

import xgboost as xgb
from sklearn.metrics import accuracy_score

# Define the objective function (bayes_opt passes each hyperparameter as a keyword argument)
def objective(n_estimators, learning_rate, max_depth, subsample, gamma):
    # Create an xgboost classifier with the given hyperparameters;
    # n_estimators and max_depth are sampled as floats, so cast them to int
    clf = xgb.XGBClassifier(
        n_estimators=int(n_estimators),
        learning_rate=learning_rate,
        max_depth=int(max_depth),
        subsample=subsample,
        gamma=gamma,
        random_state=42
    )

    # Train the classifier
    clf.fit(X_train, y_train)

    # Predict on the test set
    y_pred = clf.predict(X_test)

    # Return the accuracy score (the value to maximize)
    return accuracy_score(y_test, y_pred)
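
Accuracy is only one option. If you would rather optimize AUC-ROC, as mentioned above, a small variant of the same objective (assuming a binary target) can score predicted probabilities instead; pass it as f= to the optimizer in place of objective.

from sklearn.metrics import roc_auc_score

def objective_auc(n_estimators, learning_rate, max_depth, subsample, gamma):
    clf = xgb.XGBClassifier(
        n_estimators=int(n_estimators),
        learning_rate=learning_rate,
        max_depth=int(max_depth),
        subsample=subsample,
        gamma=gamma,
        random_state=42
    )
    clf.fit(X_train, y_train)
    # Score the predicted probability of the positive class
    y_proba = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, y_proba)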

Bayesian Optimization

Now that we have the objective function defined, we can proceed with the Bayesian Optimization process using the bayesian-optimization library. This library provides a convenient interface for maximizing a black-box function over a bounded search space. Here's an example of how you can use it with xgboost:

from bayes_opt import BayesianOptimization

# Define the hyperparameter search space
pbounds = {
    'n_estimators': (100, 1000),
    'learning_rate': (0.01, 0.3),
    'max_depth': (3, 10),
    'subsample': (0.5, 1),
    'gamma': (0, 10)
}

# Perform Bayesian Optimization (random_state makes the run reproducible)
optimizer = BayesianOptimization(f=objective, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=10, n_iter=20)

After the optimization process completes, you can access the best result through optimizer.max, a dictionary containing the best score ('target') and the corresponding hyperparameters ('params'). These optimized hyperparameters can then be used to train the final xgboost model.
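
As a minimal sketch of that last step, the snippet below reads the best hyperparameters out of optimizer.max, casts the integer-valued ones back from floats, and fits a final model on the training data:

best_params = optimizer.max['params']
print('Best score:', optimizer.max['target'])

# Rebuild the classifier with the best hyperparameters found
final_model = xgb.XGBClassifier(
    n_estimators=int(best_params['n_estimators']),
    learning_rate=best_params['learning_rate'],
    max_depth=int(best_params['max_depth']),
    subsample=best_params['subsample'],
    gamma=best_params['gamma'],
    random_state=42
)
final_model.fit(X_train, y_train)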

Conclusion

In this blog post, we explored how to optimize xgboost hyperparameters using Bayesian Optimization. The technique searches the hyperparameter space efficiently and typically improves model performance over default settings. Next time you work with xgboost, consider using Bayesian Optimization to fine-tune your model.