
How to Compute Summary Statistics for Columns in R

Computing summary statistics for columns is one of the first things you do when exploring a new dataset in R. Base R’s summary() gives a quick overview, while dplyr::summarise() produces a tidy data frame you can use downstream. Functions such as mean(), median(), and sd() skip missing values when you pass na.rm = TRUE; summary() takes no such argument and instead reports the number of missing values in an extra NA's column.

library(dplyr)

df <- data.frame(
  department = c("Sales", "Sales", "Engineering", "Engineering", "Sales"),
  salary     = c(50000, 55000, 70000, 75000, 52000),
  age        = c(25, 30, 35, 40, 22)
)

# Single-column summary
df %>%
  summarise(
    mean   = mean(age, na.rm = TRUE),
    median = median(age, na.rm = TRUE),
    sd     = sd(age, na.rm = TRUE),
    min    = min(age, na.rm = TRUE),
    max    = max(age, na.rm = TRUE)
  )
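
To see why na.rm = TRUE matters, here is a minimal sketch using a hypothetical vector, age_na, which is not part of the dataset above and contains one missing value:

```r
# Hypothetical vector with one missing value (not part of df above)
age_na <- c(25, 30, NA, 40, 22)

mean(age_na)                 # NA: a single missing value propagates to the result
mean(age_na, na.rm = TRUE)   # 29.25: computed from the four observed values
sum(is.na(age_na))           # 1: how many values na.rm = TRUE will drop
```

Counting the missing values first is a cheap check that dropping them will not silently discard a large share of the column.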

For grouped summaries, add group_by() before summarise(): df %>% group_by(department) %>% summarise(n = n(), mean_salary = mean(salary)). To summarise every numeric column per group in one call, replace the summarise() step with summarise(across(where(is.numeric), mean)).

# Grouped summary
df %>%
  group_by(department) %>%
  summarise(
    n           = n(),
    mean_salary = mean(salary)
  )

# Quick base R overview
summary(df$age)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    22.0    25.0    30.0    30.4    35.0    40.0
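
The across() form mentioned earlier, which summarises every numeric column per group, might look like the sketch below. It rebuilds the same df so that it runs on its own. The result name by_dept is only illustrative, and the lambda passes na.rm = TRUE as a precaution even though this dataset has no missing values.

```r
library(dplyr)

df <- data.frame(
  department = c("Sales", "Sales", "Engineering", "Engineering", "Sales"),
  salary     = c(50000, 55000, 70000, 75000, 52000),
  age        = c(25, 30, 35, 40, 22)
)

# Mean of every numeric column within each department
by_dept <- df %>%
  group_by(department) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

by_dept
# One row per department, with columns department, salary, and age
```

Because department is a character column and also the grouping variable, where(is.numeric) selects only salary and age.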

For richer exploratory summaries, skimr::skim(df) prints mini-histograms for numeric columns and separates output by variable type. psych::describe() adds skewness and kurtosis. Both take less effort than writing custom summarise() calls when you are getting acquainted with an unfamiliar dataset.

When you need a specific subset of statistics across many columns, summarise(across(where(is.numeric), list(mean = mean, sd = sd))) applies the same functions to every numeric column in one call. For weighted statistics, Hmisc::wtd.mean() and Hmisc::wtd.var() accept a weights argument that base R’s mean() and var() lack; for the mean alone, base R provides weighted.mean().
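
As a rough sketch of both ideas, the code below applies a named list of functions with across() and then computes a weighted mean. The weights vector w is hypothetical and made up purely for illustration. The Hmisc call runs only if that package is installed.

```r
library(dplyr)

df <- data.frame(
  department = c("Sales", "Sales", "Engineering", "Engineering", "Sales"),
  salary     = c(50000, 55000, 70000, 75000, 52000),
  age        = c(25, 30, 35, 40, 22)
)

# Several statistics for every numeric column; the list names become
# column suffixes, giving salary_mean, salary_sd, age_mean, age_sd
stats <- df %>%
  summarise(across(
    where(is.numeric),
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)),
    .names = "{.col}_{.fn}"
  ))

# Hypothetical weights, one per row (an assumption for this example)
w <- c(1, 2, 1, 1, 3)

# Base R weighted mean
wm_base <- weighted.mean(df$salary, w)   # 57625

# Hmisc equivalents, run only when the package is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  wm_hmisc <- Hmisc::wtd.mean(df$salary, weights = w)
  wv_hmisc <- Hmisc::wtd.var(df$salary, weights = w)
}
```

The "{.col}_{.fn}" pattern is the default naming when a list of functions is supplied; writing it out makes it easier to change later.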

See also

  • mean(), Calculate the arithmetic mean
  • sd(), Standard deviation
  • range(), Minimum and maximum