[Python] CatBoost Benchmark and Comparison with Competing Models

CatBoost is a powerful gradient boosting library that is highly popular for its native handling of categorical features. In this blog post, we will run a benchmark comparison of CatBoost against other competing models in Python.

The Dataset

For this comparison, we will use the famous Titanic dataset, which contains information about passengers on the Titanic and whether they survived. It is widely used for classification tasks and is easy to obtain; scikit-learn, for example, can fetch a copy from OpenML.
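If you do not have a local titanic.csv, the following sketch shows one way to download an equivalent copy via scikit-learn. Note this assumes network access, and that the OpenML copy uses lowercase column names (survived, pclass, sex, ...), so the feature names in the later snippets would need adjusting:

from sklearn.datasets import fetch_openml

# Download the Titanic dataset from OpenML as a pandas DataFrame
titanic = fetch_openml('titanic', version=1, as_frame=True)
data = titanic.frame  # features plus the 'survived' target column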

The Models

We will compare the performance of CatBoost with two other popular gradient boosting libraries: XGBoost and LightGBM. Both are known for their efficiency and effectiveness on large-scale datasets.

Installation

Before we can start comparing the models, let’s make sure all the necessary libraries are installed. To install CatBoost, XGBoost, and LightGBM, run the following commands in your Python environment:

pip install catboost
pip install xgboost
pip install lightgbm
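To confirm the installation, you can print each library’s version (a quick sanity check; the exact versions you see will differ):

import catboost
import lightgbm
import xgboost

# Print the installed versions to confirm the setup
print(catboost.__version__, xgboost.__version__, lightgbm.__version__)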

Loading the Data

Let’s begin by loading the Titanic dataset, preparing its features, and splitting it into training and testing sets. The raw CSV contains string columns and missing values, so a little preprocessing is needed before all three libraries can consume it:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Titanic dataset (a local titanic.csv with a 'Survived' column)
data = pd.read_csv('titanic.csv')

# Keep a small set of informative features; 'Sex' and 'Embarked' are categorical
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = data[features].copy()
y = data['Survived']

# Fill missing categorical values and cast the categorical columns to the
# pandas 'category' dtype, which LightGBM and XGBoost can consume directly
cat_features = ['Sex', 'Embarked']
X[cat_features] = X[cat_features].fillna('missing').astype('category')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
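A quick look at the prepared data confirms the shapes and that the categorical columns carry the expected dtype:

# Inspect the prepared data
print(X_train.shape, X_test.shape)  # e.g. (712, 7) (179, 7) for the 891-row Kaggle CSV
print(X_train.dtypes)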

Training and Evaluation

Now, we will train and evaluate each model using the same training and testing sets. For simplicity, we will use the default hyperparameters of each model (aside from telling each library about the categorical columns). Note that these hyperparameters can be tuned to improve performance; a sketch of what that looks like is shown below.
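As a taste of what tuning involves, all three classifiers accept their key knobs directly in the constructor, with broadly parallel meanings (number of trees, learning rate, tree size). The values below are illustrative, not tuned:

from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb

# Illustrative (untuned) settings; each library names the knobs slightly differently
cb = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6)
xgbm = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6)
lgbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31)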

CatBoost

Let’s start by training and evaluating the CatBoost model:

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# Create a CatBoost model; cat_features tells it which columns to treat as
# categorical, and verbose=0 silences the per-iteration training log
catboost_model = CatBoostClassifier(cat_features=cat_features, verbose=0)

# Fit the model on the training data
catboost_model.fit(X_train, y_train)

# Make predictions on the test data
catboost_predictions = catboost_model.predict(X_test)

# Calculate the accuracy of the model
catboost_accuracy = accuracy_score(y_test, catboost_predictions)

XGBoost

Next, let’s train and evaluate the XGBoost model:

import xgboost as xgb

# Create an XGBoost model; enable_categorical=True lets it consume the pandas
# 'category' columns (requires a reasonably recent XGBoost and the 'hist' tree method)
xgboost_model = xgb.XGBClassifier(enable_categorical=True, tree_method='hist')

# Fit the model on the training data
xgboost_model.fit(X_train, y_train)

# Make predictions on the test data
xgboost_predictions = xgboost_model.predict(X_test)

# Calculate the accuracy of the model
xgboost_accuracy = accuracy_score(y_test, xgboost_predictions)

LightGBM

Lastly, let’s train and evaluate the LightGBM model:

import lightgbm as lgb

# Create a LightGBM model; it consumes pandas 'category' columns natively
lightgbm_model = lgb.LGBMClassifier()

# Fit the model on the training data
lightgbm_model.fit(X_train, y_train)

# Make predictions on the test data
lightgbm_predictions = lightgbm_model.predict(X_test)

# Calculate the accuracy of the model
lightgbm_accuracy = accuracy_score(y_test, lightgbm_predictions)

Comparison of Results

Finally, let’s compare the accuracies of the three models:

print("Catboost Accuracy:", catboost_accuracy)
print("XGBoost Accuracy:", xgboost_accuracy)
print("LightGBM Accuracy:", lightgbm_accuracy)

The output shows an accuracy score for each model. On a dataset this small, the exact numbers depend on the train/test split and on library versions, so treat a single run as indicative rather than definitive.
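For a steadier comparison, you could average accuracy over several folds with cross-validation. A minimal sketch, reusing the models defined above (all three expose the scikit-learn estimator interface):

from sklearn.model_selection import cross_val_score

# Average accuracy over 5 folds for a more stable comparison
for name, model in [('CatBoost', catboost_model),
                    ('XGBoost', xgboost_model),
                    ('LightGBM', lightgbm_model)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")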

Conclusion

In this blog post, we performed a benchmark comparison of CatBoost with XGBoost and LightGBM on the Titanic dataset. The results gave us an accuracy score for each model, letting us evaluate and compare their performance. Remember that the models’ hyperparameters can be tuned further to achieve better results.

CatBoost, with its native handling of categorical features, provides a strong alternative for gradient boosting tasks and is worth considering when building classification models in Python.