dplyr::summarise()
summarise() (also spelled summarize()) collapses a tibble or data frame to one row per group by computing summary statistics, or to a single row when the data is ungrouped. It is one of the most-used functions in the tidyverse for aggregating data. dplyr ships both spellings to cover UK and US conventions; they are identical.
Syntax
summarise(.data, ..., .by = NULL, .groups = NULL)
summarize(.data, ..., .by = NULL, .groups = NULL) # identical
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| .data | tibble / data.frame | required | Input data frame or tibble |
| ... | name-value expressions | none | Summary columns defined as name = expression; each expression should return a single value per group |
| .by | tidy-select | NULL | Columns to group by for this call only (dplyr 1.1+). Give a single bare name, or wrap several in c(). The result is always ungrouped. |
| .groups | character | NULL | Grouping structure of the result when the input was grouped with group_by(): "drop_last", "drop", "keep", or "rowwise". Cannot be combined with .by. |
The .by parameter is the modern alternative to group_by() for one-off grouped summaries. It accepts bare column names directly: .by = c(cyl, am) instead of group_by(cyl, am) |> summarise(...).
Examples
Basic summarise
The simplest form of summarise() computes aggregate statistics across all rows of the data frame and returns a single row of the same type as the input (a plain data frame for mtcars). Each name-value pair defines a new column holding the result of the specified summary function:
library(dplyr)
# Overall summary of mtcars
mtcars |> summarise(avg_mpg = mean(mpg), total_hp = sum(hp))
#>    avg_mpg total_hp
#> 1 20.09062     4694
Grouped summarise with .by (dplyr 1.1+)
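The .by argument groups data for the duration of a single summarise() call, eliminating the need for a separate group_by() call. This is the modern dplyr pattern for one-off grouped summaries. It computes the same values as the older two-step approach, with two differences: the result is always ungrouped, and groups appear in order of first appearance rather than sorted by key: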
# Average mpg and total hp per cylinder/gear combination
mtcars |> summarise(
  avg_mpg = mean(mpg),
  total_hp = sum(hp),
  .by = c(cyl, gear)
)
#>   cyl gear avg_mpg total_hp
#> 1   6    4  19.750      466
#> 2   4    4  26.925      608
#> ...
#> (8 rows in total, one per cyl/gear combination present in mtcars)
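To make the relationship between .by and the two-step group_by() form concrete, the sketch below computes the same summary both ways and compares the values. The variable names by_result and gb_result are illustrative only, and the comparison assumes dplyr 1.1 or later.

```r
library(dplyr)

# One-step form: .by groups only for this call; the result is ungrouped
by_result <- mtcars |>
  summarise(avg_mpg = mean(mpg), .by = c(cyl, gear))

# Two-step form: group_by() then summarise(), dropping all groups afterwards
gb_result <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# .by keeps groups in order of first appearance; group_by() sorts by the keys.
# Sorting the .by result the same way lets the two be compared row by row.
by_sorted <- by_result |> arrange(cyl, gear)
all.equal(by_sorted$avg_mpg, gb_result$avg_mpg)
```

If row order matters downstream, adding an explicit arrange() is safer than relying on either grouping style's default ordering.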
Sorting and multiple statistics
summarise() has no sorting argument. With .by, output rows follow the order in which each group first appears in the data. To put the largest groups first, pipe the result into arrange(desc(n)), or use count(cyl, sort = TRUE) when a row count is all you need. Multiple summary statistics can be computed in a single summarise() call, and across() applies the same function to many columns at once. Use .names with glue syntax like "mean_{.col}" to construct descriptive output column names automatically:
# Sort by group size, largest first
mtcars |> summarise(n = n(), .by = cyl) |> arrange(desc(n))
# Multiple statistics in one call
mtcars |> summarise(
  mean_mpg = mean(mpg), median_mpg = median(mpg),
  sd_mpg = sd(mpg), min_mpg = min(mpg), max_mpg = max(mpg),
  .by = cyl
)
# across() with dynamic .names: mean of all Sepal columns by species
iris |>
  summarise(across(starts_with("Sepal"), mean, .names = "mean_{.col}"), .by = Species)
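As a further illustration of across() with .names, the sketch below applies two functions to each selected column by passing a named list, and uses the {.fn} placeholder alongside {.col}. The result name sepal_stats is an illustrative choice, not part of dplyr.

```r
library(dplyr)

# Apply mean() and sd() to every Sepal column, one row per species.
# The list names ("mean", "sd") fill the {.fn} placeholder in .names.
sepal_stats <- iris |>
  summarise(
    across(
      starts_with("Sepal"),
      list(mean = mean, sd = sd),
      .names = "{.fn}_{.col}"
    ),
    n = n(),
    .by = Species
  )

names(sepal_stats)
```

Output columns are generated column by column, so both statistics for Sepal.Length come before those for Sepal.Width.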
Handling NA values
Most summary functions propagate NA by default, so a single missing value in a group makes that group's result NA. Groups without missing values are unaffected. summarise() has no global switch for this; pass na.rm = TRUE to each summary function that should skip missing values:
df <- tibble(x = c(1, 2, NA, 4), g = c("a", "a", "b", "b"))
# Default: NA propagates within the group that contains it
df |> summarise(mean_x = mean(x), .by = g)
#> # A tibble: 2 × 2
#>   g     mean_x
#>   <chr>  <dbl>
#> 1 a        1.5
#> 2 b       NA
# Drop missing values inside the summary function
df |> summarise(mean_x = mean(x, na.rm = TRUE), .by = g)
#> # A tibble: 2 × 2
#>   g     mean_x
#>   <chr>  <dbl>
#> 1 a        1.5
#> 2 b        4
Writing .na.rm = TRUE as an argument to summarise() does not work: it is treated as a name-value pair and creates a column called .na.rm.
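When several columns need the same missing-value treatment, one option is to wrap the summary function in an anonymous function inside across(). The sketch below uses a small made-up tibble; the column names x and y and the result name na_summary are hypothetical.

```r
library(dplyr)

# Hypothetical data with missing values in two columns
df <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, 2, NA, 4),
  y = c(NA, 10, 20, 30)
)

na_summary <- df |>
  summarise(
    # na.rm = TRUE is passed to mean() for every selected column
    across(c(x, y), \(v) mean(v, na.rm = TRUE), .names = "mean_{.col}"),
    # Counting the missing values keeps the dropped rows visible
    n_missing_x = sum(is.na(x)),
    .by = g
  )

na_summary
```

Reporting a missing-value count next to the cleaned statistic is a cheap way to avoid silently hiding data quality problems.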
The .groups argument
Available since dplyr 1.0.0, .groups controls the grouping structure of the result when the input was grouped with group_by(). It is not deprecated, but it cannot be combined with .by, which always returns ungrouped output. The accepted values are "drop_last" (the default when every group yields one row), "drop", "keep", and "rowwise":
# Input grouped by cyl and gear; "drop_last" leaves the result grouped by cyl
mtcars |>
  group_by(cyl, gear) |>
  summarise(n = n(), .groups = "drop_last")
To get an ungrouped result, use .groups = "drop", call ungroup() at the end of the pipeline, or group with .by in the first place. A result produced with .by needs no ungroup():
mtcars |>
  summarise(n = n(), .by = cyl)
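To see what each .groups value does to the grouping of the result, the sketch below inspects the output with group_vars(). The variable names are illustrative only.

```r
library(dplyr)

grouped <- mtcars |> group_by(cyl, gear)

# "drop_last" removes only the last grouping variable (gear)
vars_drop_last <- grouped |>
  summarise(n = n(), .groups = "drop_last") |>
  group_vars()

# "drop" returns an ungrouped tibble
vars_drop <- grouped |>
  summarise(n = n(), .groups = "drop") |>
  group_vars()

# "keep" preserves the original grouping
vars_keep <- grouped |>
  summarise(n = n(), .groups = "keep") |>
  group_vars()

list(drop_last = vars_drop_last, drop = vars_drop, keep = vars_keep)
```

Setting .groups explicitly also suppresses the informational message that summarise() prints when it picks the default for data grouped by more than one variable.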
count() and tally() shortcuts
For simple row counts, dplyr provides two convenience functions that save writing the full summarise() expression by hand. count() takes the grouping columns as arguments, while tally() counts rows in data that an earlier group_by() has already grouped:
# count() is a specialised summarise for counting rows
mtcars |> count(cyl, am)
mtcars |> count(cyl, wt = hp) # weighted count: sum of hp per cyl
# tally() counts rows within existing groups
mtcars |> group_by(cyl) |> tally()
Both are roughly equivalent to summarise(n = n(), .by = ...) but are shorter to type and communicate intent more directly. One visible difference: count() returns groups sorted by the grouping columns, whereas .by keeps them in order of first appearance.
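The correspondence between count() and summarise() can be checked directly. The sketch below spells out both the plain and the weighted count in summarise() form; arrange() is added because count() returns groups sorted by key while .by keeps first-appearance order. Variable names are illustrative.

```r
library(dplyr)

# Plain count: number of rows per cyl/am combination
counted <- mtcars |> count(cyl, am)
summarised <- mtcars |>
  summarise(n = n(), .by = c(cyl, am)) |>
  arrange(cyl, am)

# Weighted count: wt = hp sums hp within each group instead of counting rows
weighted <- mtcars |> count(cyl, wt = hp)
summed <- mtcars |>
  summarise(n = sum(hp), .by = cyl) |>
  arrange(cyl)

all.equal(counted$n, summarised$n)
all.equal(weighted$n, summed$n)
```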
Common gotchas
Unquoted column names. dplyr uses tidy evaluation, so column names are unquoted. To pass a column name programmatically from a function argument, use the embrace operator {{ }}. This captures the expression passed by the caller and injects it into the summarise() call:
my_summary <- function(data, col) {
  data |> summarise(result = mean({{ col }}))
}
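The same embrace pattern extends to the grouping columns and to the output name. The helper below is a hypothetical sketch, not part of dplyr: summarise_mean, col, and by are names chosen for illustration, and it assumes dplyr 1.1+ for .by and a recent rlang for the "mean_{{ col }}" := naming syntax.

```r
library(dplyr)

# Hypothetical helper: `col` is the column to average, `by` the grouping
# column(s), both passed unquoted by the caller
summarise_mean <- function(data, col, by) {
  data |>
    summarise(
      # Glue-style name built from the caller's column, e.g. mean_mpg
      "mean_{{ col }}" := mean({{ col }}, na.rm = TRUE),
      .by = {{ by }}
    )
}

mpg_by_cyl <- mtcars |> summarise_mean(mpg, by = cyl)
hp_by_cyl_am <- mtcars |> summarise_mean(hp, by = c(cyl, am))

names(mpg_by_cyl)
```

Because .by uses tidy selection, the caller can pass a single column or several wrapped in c() without any change to the helper.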
Empty ... returns one row. Calling summarise() with no summary expressions on ungrouped data returns a single row with no columns. This is mostly a curiosity; in practice, summarise() is almost always called with at least one name-value pair:
mtcars |> summarise()
#> data frame with 0 columns and 1 row
Grouped summarise changes row count. When used on grouped data, summarise collapses each group to a single row. The output has as many rows as there are groups.
NA propagation. Most summary functions return NA when the input contains missing values. Pass na.rm = TRUE to each function that should ignore them; summarise() itself has no na.rm switch.
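.by uses tidy selection. A single column can be given bare (.by = cyl); several must be wrapped in c(), as in .by = c(col1, col2). Unlike group_by(), .by cannot create grouping columns from expressions, and it cannot be used on data that is already grouped.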
Summary functions
Any function that takes a vector and returns a single value works inside summarise(). Common choices:
| Function | What it returns |
|---|---|
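| mean(x) | Arithmetic mean |
| median(x) | Median value |
| sum(x) | Sum of values |
| sd(x) | Standard deviation |
| min(x) | Minimum value |
| max(x) | Maximum value |
| n() | Number of rows in the current group (takes no arguments) |
| n_distinct(x) | Count of unique values |
| first(x) | First value in group |
| last(x) | Last value in group |
| nth(x, n) | Nth value in group |
| IQR(x) | Interquartile range |
| quantile(x, probs) | Quantile; use a single probs value so the result has length one |
Most of these do not ignore NA by default: mean(), median(), sum(), sd(), min(), and max() return NA when the input contains one, while IQR() and quantile() raise an error. Pass na.rm = TRUE to the individual function to strip missing values before computing. n_distinct() counts NA as a distinct value unless na.rm = TRUE is set.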
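A few of the functions in the table deserve a short illustration, in particular quantile(), which returns one value per element of probs. The sketch below keeps summarise() at one row per group by requesting one quantile per output column, then shows reframe() (dplyr 1.1+) as an alternative when several rows per group are wanted. The names per_group and long_quantiles are illustrative.

```r
library(dplyr)

# One quantile per output column keeps the result at one row per group
per_group <- mtcars |>
  summarise(
    q25 = quantile(mpg, 0.25),
    q75 = quantile(mpg, 0.75),
    n_gears = n_distinct(gear),  # number of distinct gear values
    first_hp = first(hp),        # hp of the first row in each group
    .by = cyl
  )

# reframe() allows any number of rows per group, here one per probability
long_quantiles <- mtcars |>
  reframe(
    prob = c(0.25, 0.75),
    mpg_q = quantile(mpg, c(0.25, 0.75)),
    .by = cyl
  )

per_group
long_quantiles
```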
See also
- dplyr::group_by() — longer-form grouping, required for multiple chained operations
- dplyr::across() — apply functions across multiple columns in summarise
- dplyr::mutate() — create new columns without collapsing rows
- dplyr::count() — shortcut for counting rows by group