Survival Analysis in R
Survival analysis is a branch of statistics designed for data where the outcome variable represents time until an event occurs. Unlike standard regression where you predict a continuous value, survival analysis handles a unique challenge: censoring. Sometimes participants drop out of a study, or the observation period ends before the event happens. These censored observations still contain valuable information—you know the person survived at least until that point.
This guide shows you how to perform survival analysis in R using the two workhorse packages: survival for the statistical machinery and survival and survminer for visualization.
Installing Required Packages
You will need three packages: the core survival package (included with base R installation in most cases), survminer for visualization, and ggpubr for additional plotting utilities.
install.packages(c("survival", "survminer", "ggpubr"))
library(survival)
library(survminer)
Key Concepts
Before diving into code, understand three core concepts:
- Survival function S(t) — The probability of surviving beyond time t
- Hazard function h(t) — The instantaneous failure rate at time t
- Censoring — When a subject leaves the study or the study ends before the event occurs
Creating Survival Objects
The Surv() function from the survival package creates a survival object that combines time and censoring information.
# Basic Surv object
# Surv(time, event)
# event = 1 means the event occurred, 0 means censored
surv_object <- Surv(time = c(5, 10, 12, 15, 20, 25),
event = c(1, 1, 0, 1, 0, 1))
surv_object
For right-censored data (the most common scenario), the syntax is straightforward. You track how long each subject was observed and whether the event occurred during that time.
Kaplan-Meier Estimation
The Kaplan-Meier estimator calculates the survival function non-parametrically. It handles censoring automatically and produces survival curves that are easy to interpret.
# Using the lung cancer dataset (included in survival)
data(lung)
# Fit Kaplan-Meier curve
km_fit <- survfit(Surv(time, status) ~ 1, data = lung)
# View summary
summary(km_fit)
The survfit() function with ~ 1 creates an overall survival curve. To compare survival between groups (e.g., male vs. female), use ~ sex instead.
Visualizing Survival Curves
The ggsurvplot() function from survminer creates publication-quality plots:
# Basic survival plot
ggsurvplot(km_fit,
data = lung,
conf.int = TRUE,
risk.table = TRUE,
ncensor.plot = TRUE,
title = "Kaplan-Meier Survival Curve")
Key options:
conf.int— Shows 95% confidence intervalrisk.table— Adds a table showing number at riskncensor.plot— Shows when censoring occurred
To compare two groups:
# Survival by sex
km_fit_by_sex <- survfit(Surv(time, status) ~ sex, data = lung)
ggsurvplot(km_fit_by_sex,
data = lung,
conf.int = TRUE,
risk.table = TRUE,
pval = TRUE,
legend.title = "Sex",
legend.labs = c("Male", "Female"))
Cox Proportional Hazards Regression
The Cox model is the workhorse of survival analysis. It estimates the effect of covariates on the hazard ratio while making fewer assumptions than parametric models.
# Fit Cox proportional hazards model
cox_fit <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
# View results
summary(cox_fit)
Interpreting the output:
- coef — Log-hazard ratio
- exp(coef) — Hazard ratio (e.g., 1.01 means 1% increased hazard per unit increase)
- se(coef) — Standard error
- z — Test statistic
- p — P-value
To extract just the hazard ratios with confidence intervals:
cox_fit <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
cox_summary <- summary(cox_fit)
cbind(
HR = round(cox_summary$coefficients[, "exp(coef)"], 3),
"95% CI" = round(cox_summary$conf.int[, c("lower .95", "upper .95")], 3),
p = round(cox_summary$coefficients[, "Pr(>|z|)"], 4)
)
Model Diagnostics
Two assumptions underpin Cox regression:
- Proportional hazards — Hazard ratios stay constant over time
- Linear relationship — Covariates have linear effects
Check proportional hazards with Schoenfeld residuals:
# Test proportional hazards assumption
test_ph <- cox.zph(cox_fit)
print(test_ph)
# Plot Schoenfeld residuals
ggcoxzph(test_ph)
If the p-value is significant (less than 0.05), the proportional hazards assumption is violated. Consider stratified models or time-dependent covariates.
Concordance Index
The concordance index (C-index) measures how well the model discriminates between subjects who experienced the event and those who did not. A value of 0.5 means no better than chance; 1.0 means perfect discrimination.
# Get C-index
concordance(cox_fit)
Practical Example: Lung Cancer Survival
Putting it all together with the lung dataset:
# Full analysis pipeline
library(survival)
library(survminer)
# Load data
data(lung)
# Descriptive: Kaplan-Meier by treatment
km_by_trt <- survfit(Surv(time, status) ~ ph.ecog, data = lung)
ggsurvplot(km_by_trt, data = lung, pval = TRUE, risk.table = TRUE)
# Model: Cox regression
cox_model <- coxph(Surv(time, status) ~ age + sex + ph.ecog + wt.loss,
data = lung)
summary(cox_model)
# Diagnostics
cox.zph(cox_model)
Conclusion
Survival analysis in R is straightforward once you understand the core concepts. The survival package handles the heavy statistical lifting, while survminer makes results interpretable through visualization. Start with Kaplan-Meier curves to understand your data, then build Cox models to quantify the effect of predictors. Always validate the proportional hazards assumption before drawing strong conclusions.
For more advanced modeling, explore accelerated failure time models, frailty models, or multi-state models depending on your specific use case.
See Also
dplyr— Data manipulation for preparing survival dataggplot2— Creating custom survival visualizationslinear-regression-r— Related statistical modeling