[파이썬] statsmodels 연속형 변수 분석

Introduction to Continuous Variables

Continuous variables are numerical variables that can take on any value within a specific range or interval. Examples of continuous variables include age, height, temperature, and income. Analyzing these variables allows us to gain insights into relationships and patterns present in the data.

Getting Started with Statsmodels

To start using Statsmodels for analyzing continuous variables, first make sure you have it installed. You can install Statsmodels by running the following command:

pip install statsmodels

Once you have Statsmodels installed, you can import the necessary modules and functions in your Python program:

import statsmodels.api as sm
import statsmodels.formula.api as smf

Simple Linear Regression

One of the most common techniques used for analyzing continuous variables is simple linear regression. Simple linear regression examines the relationship between a dependent variable (Y) and an independent variable (X) by fitting a linear equation to the observed data.

Here’s an example of how to perform simple linear regression using Statsmodels:

# Load the dataset
dataset = sm.datasets.get_rdataset('mtcars').data

# Define the dependent and independent variables
y = dataset['mpg']  # Dependent variable
x = dataset['hp']   # Independent variable

# Add a constant term to the independent variable
x = sm.add_constant(x)

# Create a linear regression model
model = sm.OLS(y, x)

# Fit the model
results = model.fit()

# Print the summary statistics
print(results.summary())

In this example, we load the “mtcars” dataset from the built-in datasets in Statsmodels. We then define the dependent variable (y) as the “mpg” column and the independent variable (x) as the “hp” column. We add a constant term to the independent variable using sm.add_constant. Next, we create a linear regression model using sm.OLS and fit the model using model.fit(). Finally, we print the summary statistics of the fitted model using results.summary().

The summary statistics provide information about the coefficient estimates, standard errors, p-values, and other goodness-of-fit metrics. These statistics help evaluate the strength and significance of the relationship between the variables.

Multiple Linear Regression

Multiple linear regression extends simple linear regression by considering multiple independent variables. It allows analyzing the relationships between a dependent variable and multiple predictors simultaneously.

Here’s an example of how to perform multiple linear regression using Statsmodels:

# Define the dependent and independent variables
y = dataset['mpg']  # Dependent variable
x = dataset[['hp', 'wt', 'qsec']]  # Independent variables

# Add a constant term to the independent variable
x = sm.add_constant(x)

# Create a multiple linear regression model
model = sm.OLS(y, x)

# Fit the model
results = model.fit()

# Print the summary statistics
print(results.summary())

In this example, we define the dependent variable (y) as the “mpg” column, and the independent variables (x) as the “hp”, “wt”, and “qsec” columns. We add a constant term to the independent variable, create a multiple linear regression model, fit the model, and print the summary statistics using the same procedures as in simple linear regression.

Conclusion

Statsmodels provides a comprehensive toolkit for analyzing continuous variables in Python. In this blog post, we introduced simple linear regression and multiple linear regression using Statsmodels. These techniques allow us to explore relationships between continuous variables and make predictions based on the observed data.

Remember to install Statsmodels and familiarize yourself with its functions and methods to unlock more advanced analytics capabilities. Happy analyzing!