[Python] Applying Ensemble Techniques to an XGBoost Model

XGBoost is a popular gradient boosting library known for its high performance and efficiency. Thanks to its strong predictive power, it is widely used in real-world applications. One way to push the performance of XGBoost even further is through ensemble techniques.

In this blog post, we will explore how to apply ensemble techniques to improve the performance of an XGBoost model using Python.

What is Ensemble Learning?

Ensemble learning is a technique where multiple machine learning models are combined to make more accurate predictions than a single model. It leverages the principle of “wisdom of the crowd” by aggregating the predictions of multiple models.
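As a tiny illustration of this aggregation, suppose three classifiers each make a few mistakes on different samples; a simple majority vote can still recover the correct answer on most of them (the numbers below are purely illustrative):

import numpy as np

# Hypothetical 0/1 predictions from three models on five samples (illustrative only)
preds = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 1, 0, 0],  # model B
    [0, 0, 1, 1, 0],  # model C
])

# Majority vote: average across models and round to the nearest class
majority = np.round(preds.mean(axis=0)).astype(int)
print(majority)  # [1 0 1 1 0]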

Why use Ensemble with XGBoost?

XGBoost is a powerful algorithm that often produces good results on its own. However, by combining multiple XGBoost models we can further improve predictive power and generalization: aggregating several models reduces variance (and thus overfitting), makes the final prediction more robust to quirks of any single training run, and often improves overall accuracy.

Bagging Ensemble with XGBoost

One popular ensemble technique is bagging, which stands for bootstrap aggregating. Bagging involves training multiple models on different subsets of the training data, and then aggregating their predictions to make a final prediction.

To apply bagging with XGBoost, we can take the following steps:

  1. Randomly sample subsets of the training data with replacement.
  2. Train an XGBoost model on each subset.
  3. Combine the predictions of all models by averaging or voting.
Here is a sketch of these steps in code (load_data() stands in for your own data-loading routine):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# Load and split the data (load_data() is a placeholder for your own loader)
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a list to store the models
models = []

# Perform bagging with 5 models
for i in range(5):
    # Draw a bootstrap sample of the training data (sampled with replacement)
    X_subset, y_subset = resample(X_train, y_train, replace=True, random_state=i)

    # Train an XGBoost model on the bootstrap sample
    model = xgb.XGBClassifier()
    model.fit(X_subset, y_subset)

    # Add the model to the list
    models.append(model)

# Make predictions with every model
predictions = []
for model in models:
    y_pred = model.predict(X_test)
    predictions.append(y_pred)

# Majority vote: average the 0/1 predictions across models and round
ensemble_predictions = np.round(np.mean(predictions, axis=0)).astype(int)

# Evaluate the ensemble
accuracy = accuracy_score(y_test, ensemble_predictions)
print("Ensemble Accuracy:", accuracy)

Boosting Ensemble with XGBoost

Another popular ensemble technique is boosting, which trains models in sequence so that each new model tries to correct the mistakes of the previous ones. This yields a strong ensemble capable of handling complex patterns in the data. XGBoost is itself an implementation of gradient boosting, so rather than wrapping it in another algorithm, we can control the boosting process directly.

To do this, we can use the xgb.train function of the XGBoost library. We can specify the number of boosting rounds and pass an evaluation set to monitor performance as the rounds progress.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load and split the data (load_data() is a placeholder for your own loader)
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the data to XGBoost's DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set the parameters for XGBoost
param = {'max_depth': 3, 'eta': 0.1, 'objective': 'binary:logistic', 'eval_metric': 'error'}
num_round = 10

# Train the model, printing the error on the held-out set after each round
model = xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')])

# Predictions are probabilities under binary:logistic; round to get class labels
y_pred = model.predict(dtest)
predictions = [round(value) for value in y_pred]

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Boosting Accuracy:", accuracy)

Conclusion

Ensemble techniques such as bagging and boosting can significantly improve the performance of XGBoost models. By combining multiple models, we can enhance the predictive power and generalization of the model. In this blog post, we have explored how to apply bagging and boosting ensemble techniques using the XGBoost library in Python. Experimenting with different ensemble methods and hyperparameter tuning can further enhance the performance of the models.
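
As a starting point for the hyperparameter tuning mentioned above, here is a minimal sketch using scikit-learn's RandomizedSearchCV (the parameter ranges are arbitrary examples, not recommendations):

from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# A small, illustrative search space over common XGBoost hyperparameters
param_dist = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
}
search = RandomizedSearchCV(xgb.XGBClassifier(), param_dist,
                            n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)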