[Python] Monitoring and Visualizing the XGBoost Training Process

XGBoost is a powerful machine learning algorithm known for its efficiency and accuracy. However, when training complex models with large datasets, it can be challenging to understand how the model is learning and improving over time. In this blog post, we will explore how to monitor and visualize the training process of an XGBoost model in Python.

Install Required Packages

First, let’s make sure that we have the necessary packages installed. We will be using xgboost, matplotlib, and graphviz for this tutorial. You can install these packages using pip:

pip install xgboost matplotlib graphviz
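
To confirm the installation, you can import xgboost and print its version (the exact version string will depend on your environment):

import xgboost as xgb

print(xgb.__version__)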

Training the XGBoost Model

To begin, we need to train an XGBoost model. For the sake of simplicity, let’s use a binary classification problem as an example.

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the data into DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set the parameters for the XGBoost model
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 3
}

# Train the XGBoost model
num_rounds = 100
model = xgb.train(params, dtrain, num_boost_round=num_rounds, evals=[(dtest, 'test')])

Monitoring Training Progress

To monitor the training progress of the XGBoost model, we can pass a watchlist to the evals parameter of the xgb.train() function. By providing a list of DMatrices together with their names, we can observe the evaluation metric on each dataset at every boosting round. We also pass an empty dictionary to the evals_result parameter so that the per-round metric values are stored for plotting later.

# Create a watchlist for monitoring the training progress
watchlist = [(dtrain, 'train'), (dtest, 'test')]

# Dictionary that will collect the evaluation results for every boosting round
evals_result = {}

# Train the XGBoost model and monitor its progress
model = xgb.train(params, dtrain, num_boost_round=num_rounds, evals=watchlist, evals_result=evals_result)
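
XGBoost can also stop training automatically when the monitored metric stops improving. As a minimal sketch (reusing the params, dtrain, and watchlist defined above, with an illustrative patience of 10 rounds), xgb.train accepts an early_stopping_rounds argument:

# Stop if the last metric in the watchlist ('test' logloss) does not improve for 10 rounds
model_es = xgb.train(params, dtrain, num_boost_round=num_rounds,
                     evals=watchlist, early_stopping_rounds=10)

# The booster records the round with the best test score
print(model_es.best_iteration, model_es.best_score)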

Visualizing Training Progress

To visualize the training progress, we can plot the evaluation metric values collected in evals_result at each boosting round using matplotlib. In this example, we will plot the log loss values for both the training and testing datasets.

import matplotlib.pyplot as plt

# Extract the per-round log loss values collected during training
train_logloss = evals_result['train']['logloss']
test_logloss = evals_result['test']['logloss']

# Plot the log loss values
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_rounds+1), train_logloss, label='train')
plt.plot(range(1, num_rounds+1), test_logloss, label='test')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('Training Progress')
plt.legend()
plt.show()
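
Visualizing Individual Trees

Since we installed graphviz at the start, we can also use XGBoost's built-in plotting utility to render individual trees from the trained booster. As a quick sketch, the snippet below draws the first tree (index 0 is just an illustrative choice):

# Render the first tree of the ensemble (requires the graphviz package)
xgb.plot_tree(model, num_trees=0)
plt.gcf().set_size_inches(12, 8)
plt.show()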

Conclusion

In this blog post, we have explored how to monitor and visualize the training process of an XGBoost model in Python. By monitoring the evaluation metrics at each boosting round, we can gain insights into how the model is learning and improving. Additionally, visualizing the training progress helps to understand if the model is overfitting or underfitting. This knowledge can be invaluable in fine-tuning the model and achieving better performance in real-world scenarios.