[파이썬] statsmodels 주성분 회귀

In statistics, Principal Component Regression (PCR) is a technique used for regression analysis when dealing with multicollinearity among predictors. It combines the concepts of Principal Component Analysis (PCA) and linear regression to create a more robust regression model.

In this blog post, we will explore how to perform PCR using the statsmodels library in Python. Let’s get started!

Step 1: Importing the necessary libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

In this example, we are importing numpy for numerical computations, pandas for data manipulation, statsmodels for regression analysis, and StandardScaler from scikit-learn for data standardization.

Step 2: Load and preprocess the data

# Load the data
data = pd.read_csv("data.csv")

# Separate the predictors and the target variable
X = data.drop("target_variable", axis=1)
y = data["target_variable"]

# Standardize the predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Assuming you have your data in a CSV file named “data.csv”, we load the data using pandas and separate the predictors (X) and the target variable (y). Then, we apply standardization to the predictors using StandardScaler.

Step 3: Perform Principal Component Regression

# Perform Principal Component Analysis
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Select the desired number of principal components
num_components = 5
X_pca_selected = X_pca[:, :num_components]

# Add constant to X_pca_selected
X_pca_selected = sm.add_constant(X_pca_selected)

# Fit the regression model
model = sm.OLS(y, X_pca_selected)
results = model.fit()

Now we perform Principal Component Analysis (PCA) on the standardized predictors (X_scaled) using PCA from scikit-learn. We then select the desired number of principal components (num_components) and add a constant column to the selected components. Finally, we fit the Multiple Linear Regression model using sm.OLS from statsmodels and obtain the results.

Step 4: Analyze the results

print(results.summary())

We can analyze the results using the summary() method of the fitted model. It provides detailed information such as R-squared, coefficients, p-values, and more.

Conclusion

Principal Component Regression is a useful technique to overcome multicollinearity issues in regression analysis. By combining the power of PCA and linear regression, it allows us to build more robust models with improved interpretability.

In this blog post, we demonstrated how to perform Principal Component Regression using the statsmodels library in Python. Make sure to experiment with different settings and hyperparameters to optimize your models according to your specific needs.

Happy coding!