[파이썬] statsmodels 생존 분석

Survival analysis is a statistical method used to analyze time-to-event data, where the event represents a specific outcome such as death, failure, or relapse. In Python, the statsmodels library provides a powerful set of functions for conducting survival analysis.

Installation

Before getting started with statsmodels survival analysis, make sure you have the library installed. You can install it using pip:

!pip install statsmodels

Importing the Required Libraries

To start working with survival analysis in statsmodels, import the necessary libraries:

import pandas as pd
import numpy as np
import statsmodels.api as sm
from lifelines import KaplanMeierFitter

The pandas library is used for data manipulation, while numpy is used for numerical operations. The statsmodels.api module is the main module for conducting statistical analyses, and the lifelines library provides additional functionality for survival analysis.

Loading the Data

Before performing survival analysis, load your time-to-event dataset into a pandas DataFrame. The dataset should contain two columns - one for the time of the event and another indicating whether the event has occurred or not (also known as the status column).

data = pd.read_csv('survival_data.csv')

Replace 'survival_data.csv' with the path to your own dataset.

Kaplan-Meier Estimator

One of the most common methods in survival analysis is the Kaplan-Meier estimator, which estimates the survival function of a population given the observed time-to-event data.

To compute the Kaplan-Meier estimator using statsmodels, follow these steps:

# Create a KaplanMeierFitter object
kmf = KaplanMeierFitter()

# Fit the model with the time and status data
kmf.fit(data['time'], data['status'])

# Print the estimated survival function
print(kmf.survival_function_)

The fit() method fits the model to the data, and the survival_function_ attribute returns the estimated survival function at each observed time point.

Cox Proportional Hazards Model

Another commonly used survival analysis technique is the Cox proportional hazards model. This model allows you to investigate the relationship between predictor variables and the hazard rate, which is the instantaneous rate at which events occur.

To fit a Cox proportional hazards model using statsmodels, use the following code:

# Create a DataFrame with the predictor variables
X = data[['age', 'gender', 'treatment']]

# Add a constant column to the predictor variables
X = sm.add_constant(X)

# Create the survival regression model
cph = sm.PHReg(data['time'], X)

# Fit the model to the data
result = cph.fit()

# Print the coefficients and p-values
print(result.summary())

In this example, the predictor variables are age, gender, and treatment. The add_constant() function adds a constant column (intercept) to the predictor variables. The PHReg class is used to create the Cox proportional hazards model.

Conclusion

Survival analysis is a powerful statistical method for analyzing time-to-event data. In this blog post, we introduced the basics of performing survival analysis with statsmodels in Python. We covered the Kaplan-Meier estimator for estimating the survival function and the Cox proportional hazards model for investigating the relationship between predictor variables and the hazard rate.

For more advanced survival analysis techniques and models, be sure to explore the statsmodels documentation and available resources online.