In statistics, Principal Component Regression (PCR) is a technique used for regression analysis when dealing with multicollinearity among predictors. It combines the concepts of Principal Component Analysis (PCA) and linear regression to create a more robust regression model.
In this blog post, we will explore how to perform PCR using the statsmodels
library in Python. Let’s get started!
Step 1: Importing the necessary libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
In this example, we are importing numpy
for numerical computations, pandas
for data manipulation, statsmodels
for regression analysis, and StandardScaler
from scikit-learn for data standardization.
Step 2: Load and preprocess the data
# Load the data
data = pd.read_csv("data.csv")
# Separate the predictors and the target variable
X = data.drop("target_variable", axis=1)
y = data["target_variable"]
# Standardize the predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Assuming you have your data in a CSV file named “data.csv”, we load the data using pandas
and separate the predictors (X
) and the target variable (y
). Then, we apply standardization to the predictors using StandardScaler
.
Step 3: Perform Principal Component Regression
# Perform Principal Component Analysis
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Select the desired number of principal components
num_components = 5
X_pca_selected = X_pca[:, :num_components]
# Add constant to X_pca_selected
X_pca_selected = sm.add_constant(X_pca_selected)
# Fit the regression model
model = sm.OLS(y, X_pca_selected)
results = model.fit()
Now we perform Principal Component Analysis (PCA) on the standardized predictors (X_scaled
) using PCA
from scikit-learn. We then select the desired number of principal components (num_components
) and add a constant column to the selected components. Finally, we fit the Multiple Linear Regression model using sm.OLS
from statsmodels
and obtain the results.
Step 4: Analyze the results
print(results.summary())
We can analyze the results using the summary()
method of the fitted model. It provides detailed information such as R-squared, coefficients, p-values, and more.
Conclusion
Principal Component Regression is a useful technique to overcome multicollinearity issues in regression analysis. By combining the power of PCA and linear regression, it allows us to build more robust models with improved interpretability.
In this blog post, we demonstrated how to perform Principal Component Regression using the statsmodels
library in Python. Make sure to experiment with different settings and hyperparameters to optimize your models according to your specific needs.
Happy coding!