[파이썬] statsmodels 모델 진단

Statsmodels is a powerful Python library that provides a wide range of statistical models and tools for data analysis. One of the key features of Statsmodels is its ability to perform model diagnostics, which helps analysts and data scientists assess the quality and validity of their models.

In this blog post, we will explore various model diagnostics techniques available in Statsmodels and how they can be used to evaluate the performance of your models. We will cover three main areas of model diagnostics: residual analysis, goodness-of-fit tests, and influential observations.

Residual Analysis

Residual analysis is a common technique used to assess the fit of a model to the data. It involves examining the residuals, which are the differences between the observed values and the predicted values from the model. Statsmodels provides various methods to analyze residuals, including:

Plots of Residuals

Statsmodels allows you to plot the residuals against the fitted values or against the independent variable(s) of your model. These plots can help you identify patterns or relationships in the residuals that may suggest problems with your model.

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit your model
model = sm.OLS(y, X)
results = model.fit()

# Plot residuals against fitted values
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()

Autocorrelation of Residuals

Autocorrelation refers to the correlation between the residuals at different time points. High autocorrelation may indicate that the model is not adequately capturing the temporal dependencies in the data. Statsmodels provides a Durbin-Watson test to check for autocorrelation:

import statsmodels.stats.stattools as smtools

# Perform Durbin-Watson test
dw_statistic = smtools.durbin_watson(results.resid)

if dw_statistic > 1.5 and dw_statistic < 2.5:
    print("No significant autocorrelation detected.")
else:
    print("Significant autocorrelation detected.")

Goodness-of-Fit Tests

Goodness-of-fit tests assess how well the model fits the data. Statsmodels offers several statistical tests to evaluate the goodness of fit, including:

Chi-Square Test

The chi-square test compares the observed frequencies with the expected frequencies from the model. If the chi-square test statistic is significant, it indicates that the model does not fit the data well.

import scipy.stats as stats

# Perform chi-square test
chi2_statistic, p_value = stats.chisquare(observed_freq, expected_freq)

if p_value < 0.05:
    print("Model does not fit the data well.")
else:
    print("Model fits the data well.")

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test compares the cumulative distribution function (CDF) of the observed data with the CDF of the model. If the test statistic is significant, it suggests that the model is not a good fit to the data.

# Perform Kolmogorov-Smirnov test
ks_statistic, p_value = stats.kstest(observed_data, model_cdf)

if p_value < 0.05:
    print("Model does not fit the data well.")
else:
    print("Model fits the data well.")

Influential Observations

Influential observations, also known as outliers, can have a significant impact on the model’s performance and parameter estimates. Statsmodels provides methods to detect influential observations, including:

Cook’s Distance

Cook’s distance measures the influence of each observation on the model’s predictions. A high Cook’s distance suggests that the observation has a strong influence on the model, potentially affecting the model’s parameter estimates.

# Calculate Cook's distance
influence = results.get_influence()
cook_distance = influence.cooks_distance

# Check for influential observations
if any(cook_distance > threshold):
    print("Influential observations detected.")
else:
    print("No influential observations detected.")

Leverage

Leverage measures how much an observation deviates from the average values of the independent variables. Observations with high leverage can have a disproportionate impact on the model’s predictions.

# Calculate leverage
leverage = influence.hat_matrix_diag

# Check for high leverage observations
if any(leverage > threshold):
    print("High leverage observations detected.")
else:
    print("No high leverage observations detected.")

By utilizing these model diagnostic techniques in Statsmodels, you can gain valuable insights into the performance and assumptions of your statistical models. This allows you to make informed decisions about model selection, parameter estimation, and potential improvements to your analysis.

Remember to always carefully interpret the results and consider the context of your data when conducting model diagnostics.