How to Compute Summary Statistics for Columns in R
Computing summary statistics for columns is one of the first things you do when exploring a new dataset in R. Base R’s summary() gives a quick overview, while dplyr::summarise() produces a tidy data frame you can use downstream. The two treat missing values differently: summary() sets NAs aside and reports how many it found, whereas the functions you call inside summarise(), such as mean() and sd(), return NA unless you pass na.rm = TRUE.
library(dplyr)

# Small example dataset used throughout
df <- data.frame(
  department = c("Sales", "Sales", "Engineering", "Engineering", "Sales"),
  salary = c(50000, 55000, 70000, 75000, 52000),
  age = c(25, 30, 35, 40, 22)
)

# Single-column summary
df %>%
  summarise(
    mean = mean(age, na.rm = TRUE),
    median = median(age, na.rm = TRUE),
    sd = sd(age, na.rm = TRUE),
    min = min(age, na.rm = TRUE),
    max = max(age, na.rm = TRUE)
  )
For grouped summaries, call group_by() before summarise(): df %>% group_by(department) %>% summarise(n = n(), mean_salary = mean(salary)). Here n() counts the rows in each group. To summarise every numeric column per group in one call, write summarise(across(where(is.numeric), mean)).
# Grouped summary
df %>%
  group_by(department) %>%
  summarise(
    n = n(),
    mean_salary = mean(salary)
  )
# Quick base R overview
summary(df$age)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    22.0    25.0    30.0    30.4    35.0    40.0
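Missing values are where summary() and the individual statistic functions diverge. The following is a minimal sketch, assuming a hypothetical vector ages_with_na that is not part of the dataset above: summary() appends an NA's count only when missing values are present, while mean() returns NA unless na.rm = TRUE is supplied.

```r
# Hypothetical vector with one missing value (not part of df above)
ages_with_na <- c(25, 30, NA, 40, 22)

# summary() computes its statistics on the non-missing values
# and adds an "NA's" entry with the number of missing values
age_summary <- summary(ages_with_na)
age_summary

# mean() propagates the NA by default
mean(ages_with_na)

# With na.rm = TRUE the missing value is dropped first
mean_age <- mean(ages_with_na, na.rm = TRUE)
mean_age
```

The same na.rm = TRUE argument applies to median(), sd(), min(), and max() when they are used inside summarise().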
For richer exploratory summaries, skimr::skim(df) prints mini-histograms for numeric columns and separates output by variable type. psych::describe() adds skewness and kurtosis. Both take less effort than writing custom summarise() calls when you are getting acquainted with an unfamiliar dataset.
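As a rough sketch of how these two helpers might be called, assuming skimr and psych are installed (neither ships with R, so each call is guarded and skipped if the package is missing):

```r
df <- data.frame(
  department = c("Sales", "Sales", "Engineering", "Engineering", "Sales"),
  salary = c(50000, 55000, 70000, 75000, 52000),
  age = c(25, 30, 35, 40, 22)
)

# skim() summarises every column, grouped by variable type
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(df))
}

# describe() is aimed at numeric data, so this sketch passes only
# the numeric columns; the result includes skew and kurtosis
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(df[, c("salary", "age")]))
}
```

Selecting the numeric columns before calling describe() is a choice made here to keep the output simple, not a requirement stated by the package.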
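When you need a specific subset of statistics across many columns, summarise(across(where(is.numeric), list(mean = mean, sd = sd))) applies the same functions to every numeric column in one call and names the results column_function, for example salary_mean. For weighted statistics, mean() and var() do not accept weights. Base R provides stats::weighted.mean() for the mean, and Hmisc::wtd.mean() and Hmisc::wtd.var() cover both the weighted mean and the weighted variance through a weights argument.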
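The across() pattern and a weighted mean can be sketched as follows. The weights vector w is hypothetical and chosen only for illustration. The weighted mean uses stats::weighted.mean(), which ships with R, and the Hmisc call is guarded in case that package is not installed.

```r
library(dplyr)

df <- data.frame(
  department = c("Sales", "Sales", "Engineering", "Engineering", "Sales"),
  salary = c(50000, 55000, 70000, 75000, 52000),
  age = c(25, 30, 35, 40, 22)
)

# Mean and sd of every numeric column; results are named
# <column>_<function>: salary_mean, salary_sd, age_mean, age_sd
overall <- df %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
overall

# The same statistics computed separately for each department
by_dept <- df %>%
  group_by(department) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
by_dept

# Hypothetical weights, one per row (illustrative only)
w <- c(1, 2, 1, 2, 1)

# Weighted mean of salary with the stats package
weighted_salary <- weighted.mean(df$salary, w)
weighted_salary

# Hmisc equivalents, run only if the package is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::wtd.mean(df$salary, weights = w))
  print(Hmisc::wtd.var(df$salary, weights = w))
}
```

Passing a named list to across() is what produces the suffixed column names; an unnamed list would yield numeric suffixes instead.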