Survival Analysis in R
Survival analysis is a branch of statistics designed for data where the outcome variable represents time until an event occurs. Unlike standard regression where you predict a continuous value, survival analysis handles a unique challenge: censoring. Sometimes participants drop out of a study, or the observation period ends before the event happens. These censored observations still contain valuable information—you know the person survived at least until that point.
This guide shows you how to perform survival analysis in R using the two workhorse packages: survival for the statistical machinery and survival and survminer for visualization.
Installing required packages
You will need three packages: the core survival package (included with base R installation in most cases), survminer for visualization, and ggpubr for additional plotting utilities.
install.packages(c("survival", "survminer", "ggpubr"))
library(survival)
library(survminer)
Key concepts
Before diving into code, understand three core concepts:
- Survival function S(t), The probability of surviving beyond time t
- Hazard function h(t), The instantaneous failure rate at time t
- Censoring, When a subject leaves the study or the study ends before the event occurs
Creating survival objects
The Surv() function from the survival package creates a survival object that combines time and censoring information.
# Basic Surv object
# Surv(time, event)
# event = 1 means the event occurred, 0 means censored
surv_object <- Surv(time = c(5, 10, 12, 15, 20, 25),
event = c(1, 1, 0, 1, 0, 1))
surv_object
For right-censored data (the most common scenario), the syntax is straightforward. You track how long each subject was observed and whether the event occurred during that time.
Visualizing survival curves
The ggsurvplot() function from survminer creates publication-quality plots:
# Basic survival plot
ggsurvplot(km_fit,
data = lung,
conf.int = TRUE,
risk.table = TRUE,
ncensor.plot = TRUE,
title = "Kaplan-Meier Survival Curve")
Key options:
conf.int, Shows 95% confidence intervalrisk.table, Adds a table showing number at riskncensor.plot, Shows when censoring occurred
To compare two groups:
# Survival by sex
km_fit_by_sex <- survfit(Surv(time, status) ~ sex, data = lung)
ggsurvplot(km_fit_by_sex,
data = lung,
conf.int = TRUE,
risk.table = TRUE,
pval = TRUE,
legend.title = "Sex",
legend.labs = c("Male", "Female"))
Cox proportional hazards regression
The Cox model is the workhorse of survival analysis. It estimates the effect of covariates on the hazard ratio while making fewer assumptions than parametric models.
# Fit Cox proportional hazards model
cox_fit <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
# View results
summary(cox_fit)
Interpreting the output:
- coef, Log-hazard ratio
- exp(coef), Hazard ratio (e.g., 1.01 means 1% increased hazard per unit increase)
- se(coef), Standard error
- z, Test statistic
- p, P-value
To extract just the hazard ratios with confidence intervals:
cox_fit <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
cox_summary <- summary(cox_fit)
cbind(
HR = round(cox_summary$coefficients[, "exp(coef)"], 3),
"95% CI" = round(cox_summary$conf.int[, c("lower .95", "upper .95")], 3),
p = round(cox_summary$coefficients[, "Pr(>|z|)"], 4)
)
Model diagnostics
Two assumptions underpin Cox regression:
- Proportional hazards, Hazard ratios stay constant over time
- Linear relationship, Covariates have linear effects
Check proportional hazards with Schoenfeld residuals:
# Test proportional hazards assumption
test_ph <- cox.zph(cox_fit)
print(test_ph)
# Plot Schoenfeld residuals
ggcoxzph(test_ph)
If the p-value is significant (less than 0.05), the proportional hazards assumption is violated. Consider stratified models or time-dependent covariates.
Concordance index
The concordance index (C-index) measures how well the model discriminates between subjects who experienced the event and those who did not. A value of 0.5 means no better than chance; 1.0 means perfect discrimination.
# Get C-index
concordance(cox_fit)
Practical example: lung cancer survival
Putting it all together with the lung dataset:
# Full analysis pipeline
library(survival)
library(survminer)
# Load data
data(lung)
# Descriptive: Kaplan-Meier by treatment
km_by_trt <- survfit(Surv(time, status) ~ ph.ecog, data = lung)
ggsurvplot(km_by_trt, data = lung, pval = TRUE, risk.table = TRUE)
# Model: Cox regression
cox_model <- coxph(Surv(time, status) ~ age + sex + ph.ecog + wt.loss,
data = lung)
summary(cox_model)
# Diagnostics
cox.zph(cox_model)
The survival data structure
Survival analysis requires two pieces of information for each observation: the time until the event or censoring, and whether the event occurred or the observation was censored. survival::Surv(time, event) creates a survival object. event is a binary indicator: 1 if the event occurred, 0 if censored (the observation left the study without the event occurring).
Right censoring (the most common type) occurs when the study ends before the event occurs. Left censoring occurs when the event is known to have occurred before the observation period. Interval censoring occurs when the event is known to have occurred within a time interval but the exact time is unknown.
The kaplan-Meier estimator
survfit(Surv(time, event) ~ group, data = df) fits the Kaplan-Meier estimator separately for each group. The result is a non-parametric estimate of the survival function: the probability of surviving past time t. ggsurvfit::ggsurvfit(km_fit) produces publication-quality survival curves with ggplot2.
survdiff(Surv(time, event) ~ group, data = df) tests whether survival curves differ between groups using the log-rank test. The p-value from this test is the standard way to compare group survival in clinical research.
The cox proportional hazards model
coxph(Surv(time, event) ~ covariate1 + covariate2, data = df) fits the semiparametric Cox model. Coefficients are on the log-hazard scale. exp(coef(model)) gives hazard ratios: a coefficient of 0.5 means the covariate is associated with a hazard that is exp(0.5) ≈ 1.65 times higher.
The proportional hazards assumption requires that the hazard ratio between groups is constant over time. cox.zph(model) tests this assumption and plot(cox.zph(model)) visualizes the Schoenfeld residuals. Violations indicate that time-varying coefficients are needed.
Cox model interpretation
The Cox proportional hazards model estimates hazard ratios, not survival probabilities directly. A hazard ratio of 1.5 for a binary covariate means the event rate is 50% higher in that group at any given time point, assuming the proportional hazards assumption holds. Test the proportional hazards assumption with cox.zph(), a significant p-value suggests the ratio changes over time, which violates the model’s assumption.
Time-Varying covariates
When a covariate changes over follow-up (e.g., drug dose), use the counting process format: Surv(tstart, tstop, event). Each row represents an interval of time during which covariates are constant. This is more complex to construct but necessary for accurate inference when exposures change over time. The survSplit() function from the survival package automates the data restructuring.
Competing risks
When subjects can experience more than one type of event (e.g., death from disease vs. death from other causes), standard Kaplan-Meier curves overestimate event probability. Use Fine-Gray subdistribution hazard models via the cmprsk package or tidycmprsk for tidy output. The event of interest uses Surv(..., type = "mstate") syntax.
Survival data structure
Survival analysis handles time-to-event data where the event may not have occurred for all subjects by the end of the study. This partial observation, knowing the event had not occurred as of the last contact, but not knowing when or whether it will occur, is called censoring. Ignoring censored observations by excluding them biases the analysis. Survival analysis methods correctly incorporate the information that the event had not yet occurred at the censoring time.
Survival data requires two columns: the time and an event indicator. The time column is the time from the start of follow-up to either the event or censoring. The event indicator is 1 when the event was observed and 0 when the observation was censored. The Surv() function from the survival package combines these two columns into a survival object that the analysis functions understand.
Kaplan-Meier estimation
The Kaplan-Meier estimator is the nonparametric estimate of the survival function, the probability of surviving beyond each time point. It steps down at each event time, with the step size proportional to the number of events relative to the number still at risk. The survival curve approaches but may not reach zero if many observations are censored at late time points.
Comparing survival curves between groups uses the log-rank test, which tests the null hypothesis that the survival functions are identical. A significant result means survival differs between groups but does not quantify the difference. The hazard ratio from a Cox model quantifies the difference in event rates between groups, accounting for censoring and potential confounders.
Cox proportional hazards model
The Cox model relates covariates to the hazard rate, the instantaneous event rate conditional on surviving to that time, without assuming a specific distribution for the baseline hazard. This semiparametric approach is the standard regression method for survival data. The model produces hazard ratios: a hazard ratio of 2 for a binary covariate means the event rate in that group is twice the rate in the reference group at every time point.
The proportional hazards assumption requires that the hazard ratio between groups is constant over time. Testing this assumption with Schoenfeld residuals is a standard diagnostic step. Violations of proportional hazards, where treatment effects change over time, require modeling the time-varying effect or splitting follow-up time into periods.
Conclusion
Survival analysis in R is straightforward once you understand the core concepts. The survival package handles the heavy statistical lifting, while survminer makes results interpretable through visualization. Start with Kaplan-Meier curves to understand your data, then build Cox models to quantify the effect of predictors. Always validate the proportional hazards assumption before drawing strong conclusions.
For more advanced modeling, explore accelerated failure time models, frailty models, or multi-state models depending on your specific use case.
See also
dplyr— Data manipulation for preparing survival dataggplot2— Creating custom survival visualizationslinear-regression-r— Related statistical modeling