
dplyr::summarise()

summarise(.data, ..., .by = NULL, .groups = NULL)

summarise() (also spelled summarize()) collapses a tibble or data frame to one row per group by computing summary statistics; on ungrouped data it returns a single row. It is one of the most-used functions in the tidyverse for aggregating data. dplyr ships both spellings as a convenience for UK and US conventions, and they are identical.

Syntax

summarise(.data, ..., .by = NULL, .groups = NULL)
summarize(.data, ..., .by = NULL, .groups = NULL)  # identical

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| .data | tibble / data.frame | required | Input data frame or tibble |
| ... | name-value expressions | none | Summary columns defined as name = expression |
| .by | tidy-select | NULL | Group by these columns for this call only (dplyr 1.1.0+). Takes a bare column name or a selection such as c(cyl, am). |
| .groups | character | NULL | Grouping structure of the result when .data is grouped: "drop_last", "drop", "keep", or "rowwise" (dplyr 1.0.0+) |

The .by parameter is the modern alternative to group_by() for one-off grouped summaries. It accepts bare column names directly: .by = c(cyl, am) instead of group_by(cyl, am) |> summarise(...).

Examples

Basic summarise

The simplest form of summarise() computes aggregate statistics across all rows of the data frame and returns a single row. Each name-value pair defines a new column holding the result of its summary expression. The output keeps the class of the input, so the plain data frame mtcars yields a plain data frame:

library(dplyr)

# Overall summary of mtcars
mtcars |> summarise(avg_mpg = mean(mpg), total_hp = sum(hp))
#>    avg_mpg total_hp
#> 1 20.09062     4694

Grouped summarise with .by (dplyr 1.1+)

The .by argument groups the data for a single summarise() call, eliminating the need for a separate group_by() step. It computes the same per-group values as the older two-step approach, with two differences: the result is always ungrouped, and groups appear in order of first appearance in the data rather than sorted by the grouping columns:

# Average mpg and total hp per cylinder/gear combination
mtcars |> summarise(
  avg_mpg = mean(mpg),
  total_hp = sum(hp),
  .by = c(cyl, gear)
)
#>   cyl gear avg_mpg total_hp
#> 1   6    4  19.750      466
#> 2   4    4  26.925      608
#> ...
#> (8 rows, one per cyl/gear combination present in mtcars)
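As a rough sketch of how the two styles relate, the following compares group_by() plus summarise() with .by on mtcars. The names via_group_by, via_by, and sorted_by are illustrative only, and the code assumes dplyr 1.1.0 or later:

```r
library(dplyr)

# Two-step approach: output rows are sorted by the grouping columns
via_group_by <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg_mpg = mean(mpg), total_hp = sum(hp), .groups = "drop")

# One-step approach: output rows follow order of first appearance
via_by <- mtcars |>
  summarise(avg_mpg = mean(mpg), total_hp = sum(hp), .by = c(cyl, gear))

# Put the .by result in the same row order, then compare the values
sorted_by <- arrange(via_by, cyl, gear)
all.equal(via_group_by$avg_mpg, sorted_by$avg_mpg)
all.equal(via_group_by$total_hp, sorted_by$total_hp)
```

Both comparisons should report equality, since only the row order and the class of the result differ between the two approaches.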

Sorting and multiple statistics

summarise() has no sorting argument of its own. To order the output by group size, pipe the result into arrange(), or use count() with sort = TRUE when a row count is all you need. Multiple summary statistics can be computed in a single summarise() call, and across() applies the same function to many columns at once. Use .names with glue syntax like "mean_{.col}" to construct descriptive output column names automatically:

# Sort by group size, largest first
mtcars |> summarise(n = n(), .by = cyl) |> arrange(desc(n))

# Multiple statistics in one call
mtcars |> summarise(
  mean_mpg = mean(mpg), median_mpg = median(mpg),
  sd_mpg = sd(mpg), min_mpg = min(mpg), max_mpg = max(mpg),
  .by = cyl
)

# across() with dynamic .names — mean of all Sepal columns by species
iris |>
  summarise(across(starts_with("Sepal"), mean, .names = "mean_{.col}"), .by = Species)
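across() also accepts a named list of functions, in which case the glue placeholder {.fn} is available alongside {.col}. A minimal sketch, where the result name sepal_stats is illustrative:

```r
library(dplyr)

# Mean and standard deviation of every Sepal column, per species
sepal_stats <- iris |>
  summarise(
    across(
      starts_with("Sepal"),
      list(mean = mean, sd = sd),
      .names = "{.fn}_{.col}"
    ),
    .by = Species
  )

names(sepal_stats)
```

Each selected column is paired with each function, so two columns and two functions give four summary columns next to Species.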

Handling NA values

Most summary functions propagate NA by default, so a single missing value in a group makes that group's result NA. summarise() has no global switch for this; pass na.rm = TRUE to each function that should drop missing values before computing:

df <- tibble(x = c(1, 2, NA, 4), g = c("a", "a", "b", "b"))

# Default: the NA in group b propagates
df |> summarise(mean_x = mean(x), .by = g)
#> # A tibble: 2 × 2
#>   g     mean_x
#>   <chr>  <dbl>
#> 1 a        1.5
#> 2 b       NA

# Drop missing values inside the summary function
df |> summarise(mean_x = mean(x, na.rm = TRUE), .by = g)
#> # A tibble: 2 × 2
#>   g     mean_x
#>   <chr>  <dbl>
#> 1 a        1.5
#> 2 b        4

When several summary functions need the same treatment, na.rm = TRUE has to be supplied to each one.
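When several columns need the same handling, one option is to pass na.rm = TRUE through an anonymous function inside across(). The scores data and the name na_safe below are hypothetical, made up for illustration:

```r
library(dplyr)

# Hypothetical data with missing values in two columns
scores <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, 2, NA, 4),
  y = c(NA, 10, 20, 30)
)

# Apply mean() with na.rm = TRUE to both columns in one step
na_safe <- scores |>
  summarise(
    across(c(x, y), \(v) mean(v, na.rm = TRUE), .names = "mean_{.col}"),
    .by = g
  )

na_safe
```

The \(v) shorthand requires R 4.1 or later, the same version that introduced the native pipe used throughout this page.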

The .groups argument

Available since dplyr 1.0.0, .groups controls the grouping structure of the output when the input was grouped with group_by(). It is not deprecated. The accepted values are "drop_last" (remove the last grouping level, the default when every group yields one row), "drop" (return ungrouped data), "keep" (retain all grouping levels), and "rowwise" (treat each row as its own group):

# Group by two columns, then drop all grouping in the result
mtcars |>
  group_by(cyl, am) |>
  summarise(n = n(), .groups = "drop")

With .by the question does not arise, because the result is always ungrouped. In a group_by() pipeline, calling ungroup() at the end is an alternative to .groups = "drop" that makes the intent explicit, although dplyr may still print an informational message about the intermediate grouping:

mtcars |>
  group_by(cyl, am) |>
  summarise(n = n()) |>
  ungroup()
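To see what each .groups value does to the grouping of the result, a quick check with group_vars() can help. This is a small sketch; the names grouped, vars_drop_last, vars_drop, and vars_keep are illustrative:

```r
library(dplyr)

grouped <- mtcars |> group_by(cyl, am)

# Grouping columns that remain after each option
vars_drop_last <- grouped |>
  summarise(n = n(), .groups = "drop_last") |>
  group_vars()
vars_drop <- grouped |>
  summarise(n = n(), .groups = "drop") |>
  group_vars()
vars_keep <- grouped |>
  summarise(n = n(), .groups = "keep") |>
  group_vars()

vars_drop_last  # "cyl"
vars_drop       # character(0)
vars_keep       # "cyl" "am"
```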

count() and tally() shortcuts

For simple row counts, dplyr provides two convenience functions that save writing the full summarise() expression by hand. Both add a count column named n by default but take different inputs: count() groups by the columns you name, while tally() counts within whatever grouping the data already has from a prior step in the pipeline:

# count() is a specialised summarise for counting rows
mtcars |> count(cyl, am)
mtcars |> count(cyl, wt = hp)   # weighted count

# tally() is the lower-level helper that counts rows of already-grouped data
mtcars |> group_by(cyl) |> tally()

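Both are roughly equivalent to summarise(n = n(), .by = ...), although their output is ordered by the grouping columns rather than by first appearance. They are shorter to type and communicate intent more directly for this specific pattern.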
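The relationship between count() and summarise() can be checked directly. In this sketch the names by_count, by_summarise, weighted, and summed are illustrative, and arrange() is used only to line up the row order:

```r
library(dplyr)

# Plain count versus the summarise() spelling
by_count <- mtcars |> count(cyl, am)
by_summarise <- mtcars |>
  summarise(n = n(), .by = c(cyl, am)) |>
  arrange(cyl, am)
all.equal(by_count$n, by_summarise$n)

# A weighted count is a grouped sum of the weight column
weighted <- mtcars |> count(cyl, wt = hp)
summed <- mtcars |>
  summarise(n = sum(hp), .by = cyl) |>
  arrange(cyl)
all.equal(weighted$n, summed$n)
```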

Common gotchas

Unquoted column names. dplyr uses tidy evaluation, so column names are unquoted. To pass a column name programmatically from a function argument, use the embrace operator {{ }}. This captures the expression passed by the caller and injects it into the summarise() call:

my_summary <- function(data, col) {
  data |> summarise(result = mean({{ col }}))
}
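A short usage sketch for the helper above. The function is repeated so the block runs on its own, and my_summary_chr is a hypothetical variant for the case where the column name arrives as a string, using the .data pronoun:

```r
library(dplyr)

my_summary <- function(data, col) {
  data |> summarise(result = mean({{ col }}))
}

# The caller passes the column unquoted
res_mpg <- my_summary(mtcars, mpg)

# Whole expressions are captured too
res_ratio <- my_summary(mtcars, hp / wt)

# Hypothetical variant: column name supplied as a character string
my_summary_chr <- function(data, col_name) {
  data |> summarise(result = mean(.data[[col_name]]))
}
res_chr <- my_summary_chr(mtcars, "mpg")
```

Embracing suits interactive-style helpers, while the .data[[ ]] form is one way to handle names stored in variables or configuration.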

Empty ... returns no columns. Calling summarise() with no summary expressions on ungrouped data returns one row and zero columns; on grouped data it returns one row per group containing only the grouping columns. This is mostly a curiosity, since in practice summarise() is almost always called with at least one name-value pair:

mtcars |> summarise()
#> data frame with 0 columns and 1 row

Grouped summarise changes row count. When used on grouped data, summarise collapses each group to a single row. The output has as many rows as there are groups.

NA propagation. Most summary functions return NA when the input contains missing values. Pass na.rm = TRUE to each function that should ignore them.

.by uses tidy selection. Pass a single bare column name (.by = col1) or wrap several in c() (.by = c(col1, col2)); no quoting is needed. .by cannot be combined with data that is already grouped by group_by().

Summary functions

Any function that takes a vector and returns a single value works inside summarise(). Common choices:

| Function | What it returns |
| --- | --- |
| mean(x) | Arithmetic mean |
| median(x) | Median value |
| sum(x) | Sum of values |
| sd(x) | Standard deviation |
| min(x) | Minimum value |
| max(x) | Maximum value |
| n() | Count of rows (no arguments) |
| n_distinct(x) | Count of unique values |
| first(x) | First value in group |
| last(x) | Last value in group |
| nth(x, n) | Nth value in group |
| IQR(x) | Interquartile range |
| quantile(x, probs) | Quantile values |

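These functions do not ignore NA by default. mean(), median(), sum(), sd(), min(), and max() return NA when the input contains a missing value, and quantile() and IQR() raise an error; pass na.rm = TRUE to the individual function to strip missing values before computing. n() counts every row regardless of missing values.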

See also