[Python] statsmodels Influence Diagnostics

In many statistical analysis tasks, it is essential to identify influential or outlying observations in a dataset. These observations can significantly affect the results of our analysis and the performance of our model. The statsmodels library in Python provides tools for diagnosing the influence of individual data points on regression models.

In this blog post, we will explore various techniques offered by statsmodels for influence diagnostics and demonstrate how to use them with Python code examples.

1. Cook’s Distance

One popular method for measuring the influence of a data point is through Cook’s distance, which evaluates the effect of each observation on the estimated regression coefficients. Higher values of Cook’s distance indicate greater influence of a specific observation on the model.

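To calculate Cook’s distance using statsmodels, we fit a model with the ols function from the statsmodels.formula.api module and call get_influence() on the fitted results. The cooks_distance attribute of the returned object is a tuple of two arrays: the distances themselves and their associated p-values. Here’s an example: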

import statsmodels.formula.api as smf
import pandas as pd

# Read the dataset
data = pd.read_csv('data.csv')

# Fit the linear regression model
model = smf.ols('y ~ x', data=data).fit()

# Calculate Cook's distance; cooks_distance returns a tuple
# of (distances, p-values)
influence = model.get_influence()
cooks_d, p_values = influence.cooks_distance

# Print the Cook's distance values
print(cooks_d)
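
The raw distances are easier to read once they are compared against a cutoff. The sketch below is one possible way to do that, not a definitive recipe: it uses synthetic data with one deliberately planted point (both are assumptions for illustration, since data.csv is not available here) and the common 4/n rule of thumb, which is a convention rather than a statsmodels default.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): linear trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Plant one point that is extreme in x and far off the trend
x[-1] = 20.0
y[-1] = 0.0
data = pd.DataFrame({"x": x, "y": y})

model = smf.ols("y ~ x", data=data).fit()

# cooks_distance is a tuple: (distances, p-values)
cooks_d, _ = model.get_influence().cooks_distance
cooks_d = np.asarray(cooks_d)

# Rule of thumb (a convention, not a statsmodels default): flag D_i > 4 / n
threshold = 4 / len(data)
flagged = np.flatnonzero(cooks_d > threshold)

print("Flagged row positions:", flagged)
print("Most influential row:", int(np.argmax(cooks_d)))
```

The cutoff is only a screening device. Flagged rows deserve a closer look, not automatic removal.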

2. Leverage and Influence

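Another important concept in influence diagnostics is the combination of leverage and influence. Leverage measures how far an observation's predictor values lie from the average of the predictor variables; statsmodels reports it as hat_diag, the diagonal of the hat matrix. Influence quantifies how much the regression coefficients change because of an observation. A point with high leverage is not necessarily influential, but high leverage combined with a large residual usually is.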

To compute leverage and influence statistics using statsmodels, we can utilize the get_influence().summary_frame() method. Here’s an example:

import statsmodels.api as sm
import pandas as pd

# Read the dataset
data = pd.read_csv('data.csv')

# Add an intercept column to the predictors only
X = sm.add_constant(data[['x']])

# Fit the linear regression model
model = sm.OLS(data['y'], X).fit()

# Calculate leverage and influence
influence = model.get_influence()
summary_frame = influence.summary_frame()

# Print the leverage and influence statistics
print(summary_frame[['hat_diag', 'cooks_d']])

3. DFFITS

DFFITS is a combined measure of both the leverage and the residual of an observation. It indicates the scaled change in the fitted value for a data point when that point is removed from the analysis. Larger absolute DFFITS values correspond to more influential observations.

Statsmodels provides a convenient way to calculate DFFITS through get_influence().dffits, which returns a tuple containing the DFFITS values and a suggested cutoff. Here’s an example:

import statsmodels.formula.api as smf
import pandas as pd

# Read the dataset
data = pd.read_csv('data.csv')

# Fit the linear regression model
model = smf.ols('y ~ x', data=data).fit()

# Calculate DFFITS; dffits returns a tuple of
# (DFFITS values, suggested threshold)
influence = model.get_influence()
dffits, threshold = influence.dffits

# Print the DFFITS values
print(dffits)
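
The second element returned by dffits is a cutoff that statsmodels computes as 2 * sqrt(k / n), where k is the number of model parameters and n the number of observations. A minimal sketch of using it to filter rows follows, again on synthetic data with one planted point (an assumption for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
x[-1], y[-1] = 20.0, 0.0  # planted influential point
data = pd.DataFrame({"x": x, "y": y})

model = smf.ols("y ~ x", data=data).fit()

# dffits is a tuple: (DFFITS values, suggested threshold)
dffits, threshold = model.get_influence().dffits

# Compare absolute values, since DFFITS can be negative
mask = np.abs(np.asarray(dffits)) > threshold
flagged = data[mask]

print("Threshold:", threshold)
print(flagged)
```

Comparing absolute values matters here: a point that pulls its own fitted value downward has a large negative DFFITS and would be missed by a one-sided check.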

Conclusion

In this blog post, we explored the influence diagnostics provided by the statsmodels library in Python. These methods allow us to detect and evaluate the impact of individual observations on our regression models.

By utilizing Cook’s distance, leverage and influence, and DFFITS statistics, we can gain insights into the influential data points and make informed decisions on how to handle them in our analysis.

It is crucial to examine the influence of observations carefully and consider their impact while interpreting regression model results. Incorporating influence diagnostics in our workflow can lead to more accurate and reliable statistical analysis.