rguides

Data Wrangling with dplyr

dplyr is the grammar of data manipulation in R. Part of the tidyverse, it provides a consistent set of verbs that help you solve the most common data transformation challenges. This guide covers the six core dplyr functions that form the foundation of data wrangling in R.

Why dplyr?

Raw data rarely arrives in the format you need for analysis. Before you can visualize or model your data, you must transform it: filtering rows, selecting columns, creating new variables, grouping data, and summarizing values. dplyr makes these operations intuitive through a grammar of data manipulation.

The key insight behind dplyr is that every data manipulation task can be broken down into three components:

  • What to operate on, the data frame
  • What to do, the verb (function)
  • How to connect operations, the pipe operator

dplyr verbs are designed to be composable. You chain them together using the pipe operator (%>%), creating a readable workflow where each step transforms the data.

The pipeline

For this guide, we’ll use the starwars dataset from dplyr, which contains information about Star Wars characters.

library(dplyr)
data(starwars)

The pipe operator (%>%) takes the result of the left side and passes it as the first argument to the function on the right. This reads like a sentence: “take the data, then filter it, then select these columns.”

filter(): subsetting rows

filter() keeps rows where conditions are TRUE. Multiple conditions separated by commas are combined with AND — each condition must be true for the row to survive. Use | for OR when either condition suffices, and %in% for membership tests that are more readable than chained == with |.

# Filter characters taller than 180cm
filter(starwars, height > 180)

# Filter humans from Alderaan — comma means AND
filter(starwars, homeworld == "Alderaan", species == "Human")

# Filter using logical OR or membership
filter(starwars, eye_color == "red" | eye_color == "yellow")

Common comparison operators work as expected: ==, !=, >, >=, <, <=. Use & for AND and | for OR. Remember: filter excludes rows where the condition is FALSE or NA.

select(): choosing columns

select() chooses columns by name, position, or pattern. Use starts_with(), ends_with(), and contains() for pattern-based selection, a range (name:mass) for contiguous columns, and negative selection (-column) to exclude. All helpers can be mixed in a single call.

# Select by name, pattern, or range
select(starwars, name, height, mass)
select(starwars, starts_with("mass"), ends_with("color"))
select(starwars, name:mass)

# Exclude columns with negative selection
select(starwars, -hair_color, -skin_color, -eye_color)

mutate(): creating new variables

mutate() adds new columns while preserving existing ones. Columns are computed sequentially within a single call, so a column created on one line can be used in the next. Use across() to apply the same transformation to multiple columns at once.

# Create a single derived column
mutate(starwars, height_m = height / 100)

# Multiple columns at once — bmi references height_m computed on the line above
mutate(starwars,
  height_m = height / 100,
  bmi = mass / (height_m^2)
)

# Transform multiple columns with across()
mutate(starwars, across(c(height, mass), as.character))

mutate() computes values sequentially, so you can reference newly created columns in the same call.

arrange(): sorting data

arrange() sorts rows by one or more columns. Use it to order your data meaningfully.

# Sort by height ascending
arrange(starwars, height)
# Sort by multiple columns
arrange(starwars, desc(mass), height)
# Put NA values first or last
arrange(starwars, desc(mass), na.last = FALSE)

group_by() and summarise(): aggregation

These two functions work together to create grouped summaries, dplyr’s most powerful combination.

# Group by species and calculate mean height
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    avg_height = mean(height, na.rm = TRUE)
  )
# Multiple summary statistics
starwars %>%
  filter(!is.na(homeworld)) %>%
  group_by(homeworld) %>%
  summarise(
    count = n(),
    avg_height = mean(height, na.rm = TRUE),
    total_mass = sum(mass, na.rm = TRUE),
    .groups = "drop"
  )

The .groups argument controls how grouping metadata is dropped after summarization.

Combining operations

The real power of dplyr emerges when you chain multiple operations. Each verb does one thing well, but together they handle complex transformations.

starwars %>%
  filter(!is.na(mass), mass < 200) %>%
  select(name, species, homeworld, mass) %>%
  mutate(mass_kg = mass * 0.453592) %>%
  group_by(species) %>%
  summarise(
    count = n(),
    avg_mass_kg = mean(mass_kg),
    .groups = "drop"
  ) %>%
  arrange(desc(count))

This pipeline: filters characters with valid mass, selects relevant columns, converts to kilograms, groups by species, summarizes, and sorts by count.

When to use dplyr

dplyr excels when you need readable, maintainable data transformation code. It’s ideal for exploratory analysis and data preprocessing pipelines. The syntax translates directly to SQL through dbplyr when working with databases.

However, dplyr may not be the best choice for:

  • Extremely large datasets that don’t fit in memory (consider data.table)
  • Simple one-liners where the overhead isn’t worth it
  • Production code where maximum performance is critical

Installing dplyr

install.packages("dplyr")
library(dplyr)

For the full tidyverse experience:

install.packages("tidyverse")
library(tidyverse)

The full dplyr verb set

The core dplyr verbs cover most data manipulation needs. Row operations: filter() selects rows, arrange() sorts, slice() selects by position. Column operations: select() chooses columns, rename() renames, mutate() adds or transforms, transmute() creates and drops. Group operations: group_by() adds grouping, summarise() collapses, ungroup() removes grouping.

Joining: left_join(), inner_join(), full_join(), anti_join(), semi_join() cover all join types. Set operations: union(), intersect(), setdiff() for set operations on rows. bind_rows() and bind_cols() for combining data frames.

Column-wise operations with across()

across() applies functions to multiple columns within mutate() or summarise(). mutate(df, across(where(is.numeric), scale)) standardizes all numeric columns. summarise(df, across(starts_with("q"), list(mean = mean, sd = sd))) computes mean and SD for columns starting with “q”, naming the results automatically.

c_across() works within rowwise(): rowwise(df) |> mutate(total = sum(c_across(starts_with("q")))) sums across columns per row.

Window functions

lag() and lead() shift values by n positions. cumsum(), cummean(), cummax() compute cumulative statistics. row_number(), rank(), dense_rank(), ntile(), and percent_rank() provide row-level ranking within groups. All window functions are used inside mutate() after group_by() — the grouping defines the window. These window functions are the key to computing running totals, session boundaries, and relative ranks in grouped data.

slider::slide_dbl() from the slider package provides flexible rolling window functions that integrate cleanly with mutate(), more flexible than zoo::rollmean() or RcppRoll::roll_mean() for complex window types.

Joining data frames

The join verbs handle most data combination needs. left_join() keeps all rows from the left table. inner_join() keeps only matching rows. anti_join() keeps left-table rows with no match on the right, the inverse of semi_join(). Use by = c("x_col" = "y_col") when key column names differ between tables. join_by() from dplyr 1.1+ provides a more expressive syntax and supports inequality joins like join_by(start <= date, date < end).

Working with multiple tables

bind_rows() stacks data frames vertically. bind_cols() adds columns side by side. For combining more than two tables, pass a list: bind_rows(list_of_dfs) or purrr::reduce(list_of_dfs, left_join, by = "key"). When column names differ across data frames you want to stack, dplyr::bind_rows() handles the mismatch by filling absent columns with NA, unlike base R rbind() which requires identical columns.

The pipe-Based mental model

dplyr is designed to be used with the pipe operator, which passes the output of one function as the first argument to the next. A dplyr pipeline reads as a sequence of English verbs: “take this data, then filter to rows meeting this condition, then select these columns, then group by this variable, then summarize.” This readability is the primary reason dplyr became the dominant R data manipulation interface, the code communicates intent, not mechanics.

Each dplyr verb does one thing. filter selects rows. select chooses columns. mutate creates or modifies columns. arrange reorders rows. summarize reduces rows to summaries. group_by establishes groups that subsequent verbs respect. Combining these building blocks in different orders handles almost every data manipulation task. The learning curve is low because there are few verbs to learn and they compose predictably.

Grouped operations

group_by does not change the data itself — it attaches grouping metadata that subsequent verbs use. After group_by, mutate adds columns computed within each group rather than across the whole data frame. summarize reduces each group to one row. The grouping remains active through the pipeline until you call ungroup() explicitly. Forgetting to ungroup is a common source of subtle bugs, particularly when data leaves one pipeline and enters another that does not expect grouping.

Multiple grouping variables create nested groups. Calling summarize on data grouped by two variables collapses the innermost grouping by default and keeps the outer grouping. The .groups argument controls this behavior explicitly. For complex grouped operations, being explicit about when to ungroup prevents unexpected behavior downstream.

Joins and relational data

The join family — left_join, inner_join, right_join, full_join, anti_join, semi_join — implements SQL-style joins with tidyverse syntax. left_join keeps all rows from the left data frame and attaches matching rows from the right. inner_join keeps only rows with matches in both. anti_join returns rows from the left that have no match in the right, useful for finding records missing from a reference table.

Join columns are specified with the by argument. When both data frames have the same column name for the join key, by = “id” works. When the column names differ, use by = c(“local_name” = “remote_name”). Always specify join columns explicitly rather than relying on natural join behavior, which matches on all columns with identical names — an implicit join on unexpected columns causes silent incorrect results.

Summary

VerbPurpose
filter()Keep rows matching conditions
select()Choose columns by name
mutate()Create new columns
arrange()Sort rows by columns
group_by()Define grouping for summaries
summarise()Aggregate within groups

Master these six verbs, and you’ll handle the vast majority of data wrangling tasks in R. Combine them with the pipe operator to create readable, compositional data transformation pipelines.