Data Wrangling with dplyr

· 4 min read · Updated March 10, 2026 · intermediate
dplyr tidyverse data-wrangling

dplyr provides a grammar of data manipulation for R. Part of the tidyverse, it offers a consistent set of verbs for the most common data transformation tasks. This guide covers the six core dplyr functions that form the foundation of data wrangling in R.

Why dplyr?

Raw data rarely arrives in the format you need for analysis. Before you can visualize or model your data, you must transform it: filtering rows, selecting columns, creating new variables, grouping data, and summarizing values. dplyr makes these operations intuitive through a grammar of data manipulation.

The key insight behind dplyr is that most data manipulation tasks can be broken down into three components:

  • What to operate on — the data frame
  • What to do — the verb (function)
  • How to connect operations — the pipe operator

dplyr verbs are designed to be composable. You chain them together using the pipe operator (%>%), creating a readable workflow where each step transforms the data.

The Pipeline

For this guide, we’ll use the starwars dataset from dplyr, which contains information about Star Wars characters.

library(dplyr)
# starwars is available as soon as dplyr is attached; data() is optional
data(starwars)

The pipe operator (%>%) takes the result of the left side and passes it as the first argument to the function on the right. This reads like a sentence: “take the data, then filter it, then select these columns.”
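To make the first-argument rule concrete, the short sketch below writes the same two-step query twice, once as a nested call and once with the pipe, and checks that the results match. The object names nested and piped are illustrative only.

```r
library(dplyr)

# Nested call: read inside-out (filter runs first, then select)
nested <- select(filter(starwars, height > 180), name, height)

# Piped call: read top-to-bottom, each result feeds the next verb
piped <- starwars %>%
  filter(height > 180) %>%
  select(name, height)

# Compare the two results
identical(nested, piped)
```

On R 4.1 or later, the native pipe |> can be used the same way for simple chains like this one.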

filter(): Subsetting Rows

filter() keeps the rows for which every condition evaluates to TRUE. Conditions separated by commas are combined with AND.

# Filter characters who are taller than 180cm
filter(starwars, height > 180)
# Filter humans from Alderaan
filter(starwars, homeworld == "Alderaan", species == "Human")
# Filter using logical OR
filter(starwars, eye_color == "red" | eye_color == "yellow")

Common comparison operators work as expected: ==, !=, >, >=, <, <=. Use & for AND and | for OR. Remember: filter excludes rows where the condition is FALSE or NA.
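The NA rule is easy to overlook, so here is a small sketch of how it plays out on the hair_color column, together with %in% as a compact alternative to chaining == with |. The object names are made up for illustration.

```r
library(dplyr)

# A plain comparison silently drops rows where hair_color is NA
blond <- filter(starwars, hair_color == "blond")

# To keep the unknown values, test for them explicitly with is.na()
blond_or_unknown <- filter(starwars, hair_color == "blond" | is.na(hair_color))

# %in% matches against a set of values in one condition
red_or_yellow <- filter(starwars, eye_color %in% c("red", "yellow"))

nrow(blond)
nrow(blond_or_unknown)
nrow(red_or_yellow)
```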

select(): Choosing Columns

select() chooses columns by name or position. It’s essential for focusing on relevant variables.

# Select specific columns
select(starwars, name, height, mass)
# Select columns by pattern
select(starwars, starts_with("mass"), ends_with("color"))
# Select a range of columns
select(starwars, name:mass)
# Exclude columns with negative selection
select(starwars, -hair_color, -skin_color, -eye_color)

Helper functions like starts_with(), ends_with(), contains(), matches(), and everything() make selection flexible.
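The helpers not used in the example above can be sketched as follows. where() is not in the list above; it is a further selection helper (available in dplyr 1.0.0 and later) that picks columns by a predicate on their contents rather than their names. The object names are illustrative.

```r
library(dplyr)

# contains(): columns whose names include a literal string
color_cols <- select(starwars, name, contains("color"))

# matches(): columns whose names match a regular expression
size_cols <- select(starwars, matches("^(height|mass)$"))

# where(): columns whose values satisfy a predicate, here numeric columns
numeric_cols <- select(starwars, name, where(is.numeric))

# everything(): move species to the front and keep the rest in order
reordered <- select(starwars, species, everything())

names(color_cols)
names(size_cols)
names(numeric_cols)
names(reordered)[1:3]
```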

mutate(): Creating New Variables

mutate() adds new columns while preserving existing ones. Use it to derive features from existing data.

# Create a new column for height in meters
mutate(starwars, height_m = height / 100)
# Create multiple columns at once
mutate(starwars,
  height_m = height / 100,
  bmi = mass / (height_m^2)
)
# Use across() to apply a transformation to multiple columns
mutate(starwars, across(c(height, mass), as.character))

mutate() computes values sequentially, so you can reference newly created columns in the same call.
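As a rough illustration of that sequential behavior, the sketch below derives bmi from a column created one line earlier. It also uses the .keep argument of mutate() (dplyr 1.0.0 and later) to retain only the inputs and outputs of the calculation, which makes the result easier to inspect. The name bmi_tbl is illustrative.

```r
library(dplyr)

bmi_tbl <- starwars %>%
  mutate(
    height_m = height / 100,      # created first
    bmi = mass / height_m^2,      # refers to height_m from the line above
    .keep = "used"                # keep only the columns involved
  )

names(bmi_tbl)
```

Reversing the two expressions would fail, because height_m would not exist yet when bmi is computed.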

arrange(): Sorting Data

arrange() sorts rows by one or more columns. Use it to order your data meaningfully.

# Sort by height ascending
arrange(starwars, height)
# Sort by multiple columns
arrange(starwars, desc(mass), height)
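# arrange() always sorts NA values last, even with desc(), and has no na.last argument.
# To list NA values first, sort on !is.na(mass) before the column itself
arrange(starwars, !is.na(mass), desc(mass))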

group_by() and summarise(): Aggregation

These two functions work together to create grouped summaries — dplyr’s most powerful combination.

# Group by species and calculate mean height
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    avg_height = mean(height, na.rm = TRUE)
  )
# Multiple summary statistics
starwars %>%
  filter(!is.na(homeworld)) %>%
  group_by(homeworld) %>%
  summarise(
    count = n(),
    avg_height = mean(height, na.rm = TRUE),
    total_mass = sum(mass, na.rm = TRUE),
    .groups = "drop"
  )

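The .groups argument controls the grouping of the result: "drop" returns an ungrouped tibble, while "drop_last", the default when each group yields one row, removes only the last grouping variable.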
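The effect of .groups is easiest to see with two grouping variables. The sketch below summarises the same grouped data three ways and inspects the remaining grouping with group_vars(). The object names are illustrative.

```r
library(dplyr)

by_two <- group_by(starwars, species, sex)

# "drop_last": the result stays grouped by species only
kept <- summarise(by_two, n = n(), .groups = "drop_last")
group_vars(kept)

# "drop": the result has no grouping at all
dropped <- summarise(by_two, n = n(), .groups = "drop")
group_vars(dropped)

# "keep": the result keeps both original grouping variables
same <- summarise(by_two, n = n(), .groups = "keep")
group_vars(same)
```

Leftover grouping affects every later verb in the pipeline, so ending a summary with .groups = "drop" is a reasonable habit when no further grouped work is planned.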

Combining Operations

The real power of dplyr emerges when you chain multiple operations. Each verb does one thing well, but together they handle complex transformations.

starwars %>%
  filter(!is.na(mass), mass < 200) %>%
  select(name, species, homeworld, mass) %>%
  # mass is recorded in kilograms; convert it to pounds
  mutate(mass_lb = mass / 0.453592) %>%
  group_by(species) %>%
  summarise(
    count = n(),
    avg_mass_lb = mean(mass_lb),
    .groups = "drop"
  ) %>%
  arrange(desc(count))

This pipeline keeps characters with a known mass under 200 kg, selects the relevant columns, converts mass from kilograms to pounds, groups by species, summarizes each group, and sorts by count.

When to Use dplyr

dplyr excels when you need readable, maintainable data transformation code. It’s ideal for exploratory analysis and data preprocessing pipelines. The syntax translates directly to SQL through dbplyr when working with databases.
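As a rough sketch of that translation, the example below builds a query against a stand-in table and prints the SQL that dbplyr would generate, without connecting to any database. It assumes the dbplyr package is installed; lazy_frame() and simulate_sqlite() are dbplyr testing helpers, and the columns and values of the stand-in table are invented for illustration. The exact SQL text may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# A table stand-in with no real database behind it;
# simulate_sqlite() asks for SQLite-flavoured SQL
remote <- lazy_frame(name = "example", height = 1, con = simulate_sqlite())

# Ordinary dplyr verbs, recorded rather than executed
query <- remote %>%
  filter(height > 180) %>%
  select(name, height)

# Print the generated SQL instead of running it
show_query(query)
```

With a real connection, the same verbs would run inside the database, and collect() would bring the result into R.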

However, dplyr may not be the best choice for:

  • Datasets that don’t fit in memory (consider a database backend through dbplyr)
  • Simple one-liners where the overhead isn’t worth it
  • Production code where maximum performance is critical (consider data.table)

Installing dplyr

install.packages("dplyr")
library(dplyr)

For the full tidyverse experience:

install.packages("tidyverse")
library(tidyverse)

Summary

Verb         Purpose
filter()     Keep rows matching conditions
select()     Choose columns by name
mutate()     Create new columns
arrange()    Sort rows by columns
group_by()   Define grouping for summaries
summarise()  Aggregate within groups

Master these six verbs, and you’ll handle the vast majority of data wrangling tasks in R. Combine them with the pipe operator to create readable, composable data transformation pipelines.