[Python] Integrating `catboost` with `scikit-learn`

CatBoost and Scikit-learn

CatBoost is a popular gradient boosting library known for its high performance and ease of use. Scikit-learn is a widely used Python machine learning library that provides a consistent estimator interface for a variety of classification and regression algorithms. In this blog post, we will explore how to integrate CatBoost with Scikit-learn so you can leverage the benefits of both libraries in your machine learning projects.

Installing the Required Libraries

First, make sure you have both CatBoost and Scikit-learn installed in your Python environment. You can install them with the following commands:

pip install catboost
pip install scikit-learn

Once installed, you're ready to start using CatBoost and Scikit-learn together.
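
To verify the installation, you can import both packages and print their versions as a quick sanity check (the exact version numbers will depend on your environment):

import catboost
import sklearn

# Print the installed versions to confirm both packages import correctly
print(f"CatBoost version: {catboost.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")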

Training a CatBoost Model with Scikit-learn Integration

To train a CatBoost model through its Scikit-learn-compatible interface, import the required libraries and create an instance of the CatBoostClassifier class. Here's an example code snippet:

from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a CatBoost classifier (verbose=0 suppresses per-iteration training output)
model = CatBoostClassifier(verbose=0)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

In the above code, we import the necessary classes and functions from CatBoost and Scikit-learn. The Iris dataset is loaded from Scikit-learn's example datasets, and we split it into train and test sets with the train_test_split() function. We then create an instance of the CatBoostClassifier class and train it on the training data with the fit() method. Finally, we evaluate the model's accuracy on the test set with the score() method, just as we would with any Scikit-learn estimator.
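
Because CatBoostClassifier follows the Scikit-learn estimator interface, the fitted model also plugs into the rest of Scikit-learn's tooling. The following sketch (reusing the model, X, and y objects from above) passes predictions to a Scikit-learn metric and runs cross-validation with cross_val_score; the verbose=0 argument only silences CatBoost's training log:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Predict test-set labels; ravel() flattens the result in case CatBoost
# returns the predictions as a column vector
y_pred = model.predict(X_test).ravel()
print(f"Accuracy via accuracy_score: {accuracy_score(y_test, y_pred)}")

# 5-fold cross-validation on the full dataset using Scikit-learn's utilities
scores = cross_val_score(CatBoostClassifier(verbose=0), X, y, cv=5)
print(f"Cross-validation accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")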

Hyperparameter Tuning with Scikit-learn’s API

One of the strengths of Scikit-learn is its unified API for hyperparameter tuning. Because CatBoost's estimators follow this API, you can optimize their hyperparameters with Scikit-learn tools such as grid search or random search.

Here's an example of how to perform hyperparameter tuning for CatBoost using Scikit-learn's GridSearchCV:

from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'iterations': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.001]
}

# Create a CatBoost classifier (verbose=0 suppresses per-iteration training output)
model = CatBoostClassifier(verbose=0)

# Set up grid search over the parameter grid with 3-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=3)

# Run the search; the best model is then refit on the full training set
grid_search.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = grid_search.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

In this code, we define a dictionary param_grid that contains the hyperparameters to tune. We create an instance of the GridSearchCV class by passing in the CatBoost model, the parameter grid, and the number of cross-validation folds. Then, we fit the grid search object to the training data, which evaluates every parameter combination and refits the model with the best one. Finally, we evaluate that best model's accuracy on the test set using the score() method.
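
After fitting, the grid search object exposes the usual Scikit-learn attributes, so you can inspect which combination won and reuse the refitted model directly; swapping GridSearchCV for RandomizedSearchCV works the same way. A short sketch of that follow-up step:

# Inspect the best hyperparameter combination and its cross-validation score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")

# best_estimator_ is the CatBoost model refit on the whole training set,
# so it can be used for predictions like any other fitted estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)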

Conclusion

In this blog post, we have seen how to integrate CatBoost with Scikit-learn in Python. We learned how to train a CatBoost model through Scikit-learn's API and how to tune its hyperparameters with Scikit-learn's GridSearchCV. This integration lets us take advantage of the powerful features of both libraries and build better machine learning models.