
dplyr::distinct

distinct(.data, ..., .keep_all = FALSE)

distinct() removes duplicate rows from a data frame, keeping only unique combinations of the columns you specify. It is considerably faster than base R’s unique.data.frame(), and it returns an object of the same type as its input, so a tibble in gives a tibble out.

Basic usage

By default, distinct() checks all columns and returns only rows that are fully unique:

library(dplyr)

df <- tibble(
  x = c(1, 1, 2, 1),
  y = c("a", "a", "b", "a")
)

distinct(df)
# # A tibble: 2 × 2
#       x y
# 1     1 a
# 2     2 b

When two rows share the same values in every column, distinct() keeps only the first one and discards the rest. The result is a tibble containing only the unique rows, preserving column types and order as they appear in the input.

Selecting specific columns

Sometimes you want uniqueness based on only a subset of columns rather than the full row. Passing column names to distinct() compares rows using only those named columns and drops any columns not mentioned in the call:

distinct(df, x)
# # A tibble: 2 × 1
#       x
# 1     1
# 2     2

Only x is returned here. To keep the rest of your data while still checking uniqueness on a subset of columns, set .keep_all = TRUE: all original columns are then preserved, and the first row for each distinct value of x is kept.

Keeping all columns with .keep_all

The .keep_all argument controls whether unmentioned columns are retained in the output. Set it to TRUE when you need to deduplicate by a subset of columns without losing any of the other data:

distinct(df, x, .keep_all = TRUE)
# # A tibble: 2 × 2
#       x y
# 1     1 a
# 2     2 b

Without .keep_all = TRUE, specifying x alone would drop y. With .keep_all = TRUE, you get the first row of each unique x value while keeping all columns. When multiple rows share the same x, only the first occurrence by row order is retained.
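Because the retained row is the first one in row order, sorting beforehand is one way to choose which row survives. The sketch below assumes a hypothetical visits tibble (the table and its column names are invented for illustration) and keeps the most recent row per id:

```r
library(dplyr)

# Hypothetical data: several visits per id (names are illustrative only)
visits <- tibble(
  id    = c(1, 1, 2, 2),
  day   = c(3, 7, 5, 2),
  score = c(10, 20, 30, 40)
)

# Sort so the row to keep comes first within each id,
# then deduplicate on id while retaining every column
latest <- visits %>%
  arrange(id, desc(day)) %>%
  distinct(id, .keep_all = TRUE)

latest
```

Swapping desc(day) for day would keep the earliest visit instead.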

Computed columns

You can create expressions on the fly to determine uniqueness:

distinct(df, diff = abs(x - 1))
# # A tibble: 2 × 1
#    diff
#   <dbl>
# 1     0
# 2     1

This checks uniqueness based on abs(x - 1) and returns the computed column, named diff, as the only column in the output. Add .keep_all = TRUE to retain the original columns as well.
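Combining a computed expression with .keep_all = TRUE can be sketched as follows, reusing the same df as in the earlier examples. The result should carry the original columns together with diff, taken from the first row for each distinct value:

```r
library(dplyr)

df <- tibble(
  x = c(1, 1, 2, 1),
  y = c("a", "a", "b", "a")
)

# Uniqueness is judged on the computed value only;
# .keep_all = TRUE retains x and y from the first matching row
res <- distinct(df, diff = abs(x - 1), .keep_all = TRUE)

names(res)
```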

Using across() for multiple columns

When you need to check uniqueness across many columns, across() lets you use select-helper semantics:

df2 <- tibble(
  name = c("Alice", "Alice", "Bob", "Bob"),
  age  = c(25, 25, 30, 30),
  city = c("NY", "NY", "LA", "LA")
)

distinct(df2, across(contains("city")))
# # A tibble: 2 × 1
#   city
# 1 NY
# 2 LA

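This is the modern replacement for the superseded distinct_all(), distinct_at(), and distinct_if() scoped variants (they still work, but are no longer recommended). Using across() gives you access to the full range of tidy-select helpers like starts_with(), ends_with(), contains(), and where(), making column selection both more readable and more flexible than the older scoped functions. In dplyr 1.1.0 and later, pick() accepts the same helpers and is the recommended choice when you only select columns without transforming them.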
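As a further sketch of the same pattern, the following reuses df2 with two other tidy-select helpers. The variable names by_numeric and by_prefix are illustrative only:

```r
library(dplyr)

df2 <- tibble(
  name = c("Alice", "Alice", "Bob", "Bob"),
  age  = c(25, 25, 30, 30),
  city = c("NY", "NY", "LA", "LA")
)

# Deduplicate on every numeric column (only age in this data)
by_numeric <- distinct(df2, across(where(is.numeric)))

# Deduplicate on columns whose names start with "n" or "c"
by_prefix <- distinct(df2, across(c(starts_with("n"), starts_with("c"))))
```

Both calls return two rows here, because each selected column set has only two unique combinations.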

Grouped data frames

When your data is grouped with group_by(), the grouping variables are always included in the uniqueness check regardless of whether you explicitly list them. This means distinct() on a grouped data frame never drops grouping columns from the output, even if they are not among the columns you passed:

df_g <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, 1, 2, 1)
) %>% group_by(g)

distinct(df_g, x)
# # A tibble: 3 × 2
# # Groups: g [2]
#       g     x
#   <dbl> <dbl>
# 1     1     1
# 2     2     2
# 3     2     1

Notice that g appears in the output even though you did not explicitly mention it; grouping columns are always retained.
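If the grouping column should not take part in the check, one option is to remove the grouping first. A minimal sketch, reusing df_g from above:

```r
library(dplyr)

df_g <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, 1, 2, 1)
) %>% group_by(g)

# Drop the grouping so that only x is used for uniqueness
res <- df_g %>%
  ungroup() %>%
  distinct(x)

res
```

After ungroup(), the result contains only the x column and is no longer grouped.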

Common pitfalls

Dropped columns by default. distinct(df, col_a) returns only col_a, not the full rows of df deduplicated on col_a. Use .keep_all = TRUE if you need the other columns.

NA is a distinct value. A row containing NA in the specified columns is not removed automatically; NA counts as its own value when checking uniqueness, and rows that share the same NA pattern collapse into one.

Row order, not argument order, decides which row is kept. distinct(df, x, y) and distinct(df, y, x) identify the same unique (x, y) combinations and keep the same rows; “first” always means first in the row order of the input. To control which row survives, sort the data with arrange() before calling distinct().
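The NA pitfall can be checked with a small sketch. The df_na tibble below is made up for illustration; it shows that distinct() collapses rows with matching NA values but does not drop them, so removing incomplete rows remains a separate step:

```r
library(dplyr)

# Illustrative data containing missing values
df_na <- tibble(
  x = c(1, NA, NA, 2),
  y = c("a", "b", "b", NA)
)

# The two (NA, "b") rows collapse into one;
# no row is removed merely for containing NA
deduped <- distinct(df_na)

# Dropping incomplete rows is an explicit, separate step
complete <- deduped %>%
  filter(!is.na(x), !is.na(y))
```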

See also