[파이썬] statsmodels 분포 적합성 검정

In statistical analysis, it is common to assess the goodness of fit of a probability distribution to a given dataset. The statsmodels library in Python provides a convenient way to perform various distributional fitting tests to evaluate the suitability of different distributions.

In this blog post, we will learn how to use statsmodels for distribution fitting tests in Python.

Getting Started

Before we can perform distribution fitting tests, we need to install the statsmodels library. You can install it using pip:

pip install statsmodels

Once the installation is complete, we can import the necessary modules and functions:

import numpy as np
import statsmodels.api as sm
from scipy import stats

Distribution Fitting Tests

To perform distribution fitting tests using statsmodels, we typically follow these steps:

  1. Generate or import a dataset.
  2. Fit a distribution to the dataset.
  3. Perform a goodness-of-fit test to evaluate the fitting.

Let’s explore a couple of examples to understand these steps.

Example 1: Normal Distribution

Suppose we have a dataset that we believe follows a normal distribution. We can perform the normality test using the statsmodels library. Here’s how to do it:

# Step 1: Generate or import dataset
np.random.seed(0)
data = np.random.normal(0, 1, 1000)

# Step 2: Fit a distribution
params = stats.norm.fit(data)
sample_mean = np.mean(data)
sample_std = np.std(data)

# Step 3: Perform goodness-of-fit test
kstest_result = stats.kstest(data, 'norm', args=params)
adtest_result = stats.anderson(data, 'norm')

print(f"Sample Mean: {sample_mean:.3f}")
print(f"Sample Std: {sample_std:.3f}")
print("Kolmogorov-Smirnov Test:")
print(f"  Statistic: {kstest_result.statistic:.3f}")
print(f"  P-value: {kstest_result.pvalue:.3f}")
print("Anderson-Darling Test:")
print(f"  Statistic: {adtest_result.statistic:.3f}")
print(f"  Critical Values: {adtest_result.critical_values}")

In this example, we generate a random dataset of size 1000 from the standard normal distribution. We then use the fit function from the scipy.stats module to estimate the parameters of the normal distribution. Finally, we perform the Kolmogorov-Smirnov (KS) test and the Anderson-Darling (AD) test to assess the goodness of fit. The results include the sample mean, sample standard deviation, test statistics, and p-values.

Example 2: Exponential Distribution

Let’s consider another example where we have a dataset that we believe follows an exponential distribution. We can use the same steps as before to perform the distribution fitting tests:

# Step 1: Generate or import dataset
np.random.seed(0)
data = np.random.exponential(1, 1000)

# Step 2: Fit a distribution
params = stats.expon.fit(data)
sample_mean = np.mean(data)
sample_std = np.std(data)

# Step 3: Perform goodness-of-fit test
kstest_result = stats.kstest(data, 'expon', args=params)
adtest_result = stats.anderson(data, 'expon')

print(f"Sample Mean: {sample_mean:.3f}")
print(f"Sample Std: {sample_std:.3f}")
print("Kolmogorov-Smirnov Test:")
print(f"  Statistic: {kstest_result.statistic:.3f}")
print(f"  P-value: {kstest_result.pvalue:.3f}")
print("Anderson-Darling Test:")
print(f"  Statistic: {adtest_result.statistic:.3f}")
print(f"  Critical Values: {adtest_result.critical_values}")

In this example, we generate a random dataset of size 1000 from the exponential distribution with a rate parameter of 1. We fit the exponential distribution to the data, perform the KS test and the AD test, and print the results.

Conclusion

The statsmodels library provides a powerful set of tools for performing distribution fitting tests in Python. By following a few simple steps, we can assess the goodness of fit of various distributions to our datasets. This allows us to make informed decisions and draw meaningful conclusions when analyzing data.