[Python] Cross-Validation with statsmodels

Cross-validation is a powerful technique used in machine learning and statistics to evaluate the performance of a model. It helps us estimate how well a model will generalize to new, unseen data. Statsmodels, a popular statistical library in Python, does not ship its own cross-validation splitters, but its models pair naturally with the splitters in scikit-learn, which makes cross-validation straightforward to set up.

In this blog post, we will explore various approaches to cross-validation using statsmodels. We will cover the following topics:

  1. K-Fold Cross-Validation: Splitting the dataset into k equal-sized folds and training the model on k-1 folds while testing on the remaining fold.
  2. Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold cross-validation where each fold contains only one observation.
  3. Repeated Cross-Validation: Repeating k-fold cross-validation multiple times with different random splits.
  4. Stratified Cross-Validation: Ensuring each fold contains approximately the same proportion of target classes as the whole dataset.
  5. Time Series Cross-Validation: Handling temporal dependencies in time series data during cross-validation.

K-Fold Cross-Validation

K-Fold cross-validation is a widely used technique that splits the dataset into k equal-sized folds or partitions. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold acting as the test set exactly once.

Let’s see how we can perform k-fold cross-validation using statsmodels:

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

# A small synthetic binary-classification dataset; X holds the features, y the target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=100) > 0).astype(int)

k = 3  # Number of folds
kf = KFold(n_splits=k, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = sm.Logit(y_train, X_train)  # Fit your model here
    result = model.fit()
    
    # Evaluate the model using appropriate metrics
    predictions = result.predict(X_test)
    # Perform evaluation here

Here, we use the KFold class from the sklearn.model_selection module to create the folds. We iterate over the train-test splits using the .split() method of the KFold object. Inside the loop, we can train our model on the training data and evaluate it on the test data.
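
The "Perform evaluation here" comment can be filled in with whatever metric suits the problem. As a minimal sketch, assuming the binary setup above and a 0.5 decision threshold, the per-fold accuracies can be collected and averaged like this:

accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    result = sm.Logit(y_train, X_train).fit(disp=0)  # disp=0 silences the optimizer output

    # Turn predicted probabilities into class labels and score them against the held-out fold
    predicted_labels = (result.predict(X_test) >= 0.5).astype(int)
    accuracies.append((predicted_labels == y_test).mean())

print(f"Mean accuracy over {k} folds: {np.mean(accuracies):.3f}")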

Leave-One-Out Cross-Validation (LOOCV)

Leave-One-Out cross-validation (LOOCV) is a special case of k-fold cross-validation where each fold contains only one observation. It is particularly useful when working with small datasets. The LeaveOneOut class from the sklearn.model_selection module provides a convenient way to perform LOOCV with statsmodels models.

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = sm.Logit(y_train, X_train)  # Fit your model here
    result = model.fit()
    
    # Evaluate the model using appropriate metrics
    predictions = result.predict(X_test)
    # Perform evaluation here

In this case, LeaveOneOut splits the dataset in such a way that each fold only contains a single observation. The loop iterates over all the observations, allowing us to fit and evaluate the model iteratively.
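
Because each iteration produces exactly one held-out prediction, a convenient pattern is to store those predictions in an array and score them all at the end. A small sketch along those lines, reusing the X, y, and loo objects defined above:

loo_predictions = np.empty(len(y))

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train = y[train_index]

    result = sm.Logit(y_train, X_train).fit(disp=0)
    loo_predictions[test_index] = result.predict(X_test)  # one probability per held-out observation

loo_accuracy = ((loo_predictions >= 0.5).astype(int) == y).mean()
print(f"LOOCV accuracy: {loo_accuracy:.3f}")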

Repeated Cross-Validation

Sometimes, a single run of k-fold cross-validation might not be sufficient to obtain reliable estimates of model performance. Repeated cross-validation can provide more robust estimates by repeating k-fold cross-validation multiple times with different random splits. Statsmodels does not have built-in support for repeated cross-validation, but we can achieve this by nesting two loops.

n_repeats = 5  # Number of times to repeat the cross-validation
kf = KFold(n_splits=k, shuffle=True)

for _ in range(n_repeats):
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        model = sm.Logit(y_train, X_train)  # Fit your model here
        result = model.fit()
        
        # Evaluate the model using appropriate metrics
        predictions = result.predict(X_test)
        # Perform evaluation here

In the above code snippet, we wrap the k-fold cross-validation loop inside a loop that repeats the process n_repeats times. This allows us to obtain more stable performance estimates.
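
If you prefer not to manage the outer loop yourself, scikit-learn also offers a RepeatedKFold splitter that yields all n_splits × n_repeats train-test splits from a single iterator. A brief sketch using the same X and y as before:

from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=42)

for train_index, test_index in rkf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    result = sm.Logit(y_train, X_train).fit(disp=0)
    predictions = result.predict(X_test)
    # Perform evaluation here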

Stratified Cross-Validation

In some scenarios, it is important to ensure that each fold contains approximately the same proportion of target classes as the whole dataset. Stratified cross-validation aims to preserve the class proportions during the cross-validation process. The StratifiedKFold class from the sklearn.model_selection module can be used to achieve this.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=k, shuffle=True)

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = sm.Logit(y_train, X_train)  # Fit your model here
    result = model.fit()
    
    # Evaluate the model using appropriate metrics
    predictions = result.predict(X_test)
    # Perform evaluation here

Here, StratifiedKFold splits the dataset while preserving the class proportions. We pass both the features X and the target y to the .split() method to ensure that the class labels are stratified properly.
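
To verify that the stratification behaves as expected, a quick sketch that prints the share of positive labels in each fold (reusing the skf, X, and y objects from above) can help:

for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    train_rate = y[train_index].mean()
    test_rate = y[test_index].mean()
    print(f"Fold {fold}: positive rate in train={train_rate:.2f}, test={test_rate:.2f}")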

Time Series Cross-Validation

Time series data often exhibits temporal dependencies that must be taken into account during cross-validation. For such scenarios, the TimeSeriesSplit class from the sklearn.model_selection module creates train-test splits based on consecutive time periods, and the resulting folds can be used to fit statsmodels models just as before.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=k)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = sm.Logit(y_train, X_train)  # Fit your model here
    result = model.fit()
    
    # Evaluate the model using appropriate metrics
    predictions = result.predict(X_test)
    # Perform evaluation here

The TimeSeriesSplit class splits the data in chronological order, ensuring that each subsequent fold uses data that comes after the previous fold. This approach accounts for the temporal nature of the data and helps evaluate the model’s performance more accurately.
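
To make the expanding-window behaviour concrete, the following sketch (using a toy array of 10 consecutive time periods rather than the dataset above) simply prints which observations land in each split:

periods = np.arange(10)  # stand-in for 10 consecutive time periods

for fold, (train_index, test_index) in enumerate(tscv.split(periods), start=1):
    print(f"Split {fold}: train={train_index.tolist()}, test={test_index.tolist()}")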

Conclusion

Cross-validation plays a crucial role in assessing the performance of a model. In this blog post, we explored various approaches to cross-validation using the statsmodels library in Python. We covered k-fold cross-validation, leave-one-out cross-validation, repeated cross-validation, stratified cross-validation, and time series cross-validation.

Combining statsmodels models with scikit-learn's splitters provides a powerful yet intuitive way to perform cross-validation, allowing us to evaluate and compare different models effectively. By leveraging these cross-validation techniques, we can make more informed decisions when building and fine-tuning our statistical and machine learning models.