How to Group Data and Summarise with dplyr in R
To compute summary statistics per group in R, pair dplyr::group_by() with summarise(). Together they answer questions like “what is the average salary per department?” or “how many sales happened each month?”. You split the data into groups, compute a statistic for each group, and combine the results into a tidy summary table, all inside a single pipeline.
library(dplyr)
# Group by department and compute summary statistics
df %>%
  group_by(department) %>%
  summarise(
    mean_salary = mean(salary, na.rm = TRUE),
    median_salary = median(salary, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
# Simple count per group
df %>% count(department)
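To try the pipeline above without a real dataset, the sketch below builds a small data frame first. The department names and salary figures are invented for illustration, and the result names salary_summary and dept_counts are assumptions rather than part of the original example.

```r
library(dplyr)

# Hypothetical toy data: values are made up for illustration only
df <- data.frame(
  department = c("Sales", "Sales", "Sales", "IT", "IT"),
  salary     = c(40000, 50000, 60000, 60000, 80000)
)

# One row per department, with all statistics computed in a single call
salary_summary <- df %>%
  group_by(department) %>%
  summarise(
    mean_salary   = mean(salary, na.rm = TRUE),
    median_salary = median(salary, na.rm = TRUE),
    n             = n(),
    .groups       = "drop"   # return an ungrouped data frame
  )

# count() is shorthand for group_by() + summarise(n = n())
dept_counts <- df %>% count(department)

print(salary_summary)
print(dept_counts)
```

With this toy data, Sales has three rows and IT has two, so the n column in both results should read 3 and 2 respectively.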
dplyr::summarise() is usually the most readable option and returns a tidy data frame (a tibble) with one row per group. List all statistics in one summarise() call rather than calling base R’s aggregate() once per statistic. For large datasets (millions of rows), data.table’s dt[, .(mean_salary = mean(salary)), by = department] syntax is often substantially faster.
# Base R equivalent for the mean (the formula interface drops rows with NA by default)
aggregate(salary ~ department, data = df, FUN = mean)
# data.table approach
library(data.table)
dt <- as.data.table(df)
dt[, .(mean_salary = mean(salary, na.rm = TRUE)), by = department]
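As a rough check that the three approaches agree, the sketch below runs each of them on the same small data frame and compares the group means. It assumes both dplyr and data.table are installed; the toy values and the names by_dplyr, by_base, and by_dt are hypothetical.

```r
library(dplyr)
library(data.table)

# Hypothetical toy data: values are made up for illustration only
df <- data.frame(
  department = c("Sales", "Sales", "IT", "IT"),
  salary     = c(40000, 50000, 60000, 80000)
)

# dplyr: groups come back sorted by department
by_dplyr <- df %>%
  group_by(department) %>%
  summarise(mean_salary = mean(salary), .groups = "drop")

# Base R: the result column keeps the name of the left-hand side, salary
by_base <- aggregate(salary ~ department, data = df, FUN = mean)

# data.table: groups come back in order of first appearance, so sort them
by_dt <- as.data.table(df)[, .(mean_salary = mean(salary)), by = department]
by_dt <- by_dt[order(department)]

# Both comparisons should report TRUE for this data
print(all.equal(by_dplyr$mean_salary, by_base$salary))
print(all.equal(by_dplyr$mean_salary, by_dt$mean_salary))
```

The sort step matters when comparing results side by side: data.table's by keeps groups in the order they first appear in the data, whereas group_by() and aggregate() return them sorted.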
Common summary functions include mean(), median(), sum(), min(), max(), sd(), and the row-count helpers n() (dplyr) and .N (data.table). The statistical functions accept na.rm = TRUE to drop missing values before computing; without it, a single NA in a group makes that group's result NA. n() and .N take no such argument and count every row, including rows where the summarised column is missing. When you need multiple statistics per group, list them all inside a single summarise() call rather than running separate aggregate() commands and merging the results; this gives you one clean output data frame.
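The effect of na.rm, and the difference between counting rows and counting non-missing values, can be seen in a minimal sketch like the one below. The data are invented, and the column names in the result (mean_default, mean_na_rm, n_rows, n_non_missing) are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data with one missing salary in the Sales group
df <- data.frame(
  department = c("Sales", "Sales", "Sales", "IT", "IT"),
  salary     = c(40000, 50000, NA, 60000, 80000)
)

na_demo <- df %>%
  group_by(department) %>%
  summarise(
    mean_default  = mean(salary),                 # NA if any value in the group is missing
    mean_na_rm    = mean(salary, na.rm = TRUE),   # missing values dropped before averaging
    n_rows        = n(),                          # counts every row, including NA salaries
    n_non_missing = sum(!is.na(salary)),          # counts only the observed salaries
    .groups       = "drop"
  )

print(na_demo)
```

For the Sales group, mean_default comes out as NA while mean_na_rm is 45000, and n_rows is 3 while n_non_missing is 2. If a mean is reported alongside a count, sum(!is.na(x)) is usually the count that matches it.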
See also
- group_by(), dplyr grouping
- filter(), Filter rows
- mutate(), Add columns