Data Manipulation with dplyr

· 4 min read · Updated March 7, 2026 · beginner
dplyr data-manipulation tidyverse r beginner

dplyr is the cornerstone of the tidyverse and one of the most popular R packages for a reason. It provides a consistent set of verbs (functions) that help you solve the most common data manipulation challenges. In this tutorial, you’ll learn the six essential dplyr verbs that will transform how you work with data in R.

Installing and Loading dplyr

Before we begin, make sure dplyr is installed and loaded:

# Install tidyverse (includes dplyr)
install.packages("tidyverse")

# Load dplyr specifically
library(dplyr)

For this tutorial, we’ll use a sample dataset. Let’s create a simple data frame representing a small retail business:

# Create sample sales data
sales <- data.frame(
  product = c("Laptop", "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse", "Keyboard"),
  category = c("Electronics", "Accessories", "Accessories", "Electronics", "Electronics", 
               "Accessories", "Accessories"),
  region = c("North", "South", "North", "East", "West", "South", "North"),
  units_sold = c(5, 20, 15, 8, 3, 25, 12),
  price = c(999, 29, 79, 349, 999, 29, 79)
)

# Calculate revenue
sales$revenue <- sales$units_sold * sales$price

print(sales)

The Pipe Operator: Your New Best Friend

Before diving into the verbs, you need to know about the pipe operator (%>%). It takes the output of one function and passes it as input to the next. Think of it as “then”:

# Without pipe: nested approach
result <- arrange(mutate(sales, revenue = units_sold * price), region)

# With pipe: readable chain
result <- sales %>%
  mutate(revenue = units_sold * price) %>%
  arrange(region)

Both produce the same result, but the piped version is much easier to read. R 4.1+ also supports the native pipe |>, which works identically.

filter(): Selecting Rows

The filter() function selects rows based on conditions. It’s similar to SQL’s WHERE clause:

# Filter rows where region is "North"
north_sales <- sales %>%
  filter(region == "North")

print(north_sales)

You can combine multiple conditions using logical operators:

# Electronics products in the North region with more than 5 units sold
high_electronics <- sales %>%
  filter(category == "Electronics", 
         region == "North", 
         units_sold > 5)

print(high_electronics)

Common comparison operators: == (equals), != (not equals), >, <, >=, <= Common logical operators: & (and), | (or), ! (not)

select(): Choosing Columns

The select() function chooses which columns to keep (or exclude):

# Select specific columns
product_revenue <- sales %>%
  select(product, revenue)

print(product_revenue)

Handy helpers make selection easier:

# Select columns that start with "re"
sales %>%
  select(starts_with("re"))

# Select columns containing "u"
sales %>%
  select(contains("u"))

# Select all columns except region
sales %>%
  select(-region)

You can also use rename() to change column names while keeping all columns:

renamed <- sales %>%
  rename(units = units_sold)

print(renamed)

mutate(): Creating New Columns

Use mutate() to create new columns based on existing ones:

# Add a discount column and a discounted price
sales_with_discount <- sales %>%
  mutate(discount = ifelse(revenue > 500, 0.15, 0),
         final_price = revenue * (1 - discount))

print(sales_with_discount)

You can create multiple columns in one mutate() call:

multiplied <- sales %>%
  mutate(
    revenue = units_sold * price,
    tax = revenue * 0.08,
    revenue_after_tax = revenue + tax
  )

print(multiplied)

arrange(): Sorting Rows

The arrange() function sorts rows. By default, it sorts in ascending order:

# Sort by revenue (lowest to highest)
by_revenue <- sales %>%
  arrange(revenue)

print(by_revenue)

Use desc() for descending order:

# Sort by revenue (highest to lowest)
by_revenue_desc <- sales %>%
  arrange(desc(revenue))

print(by_revenue_desc)

Sort by multiple columns by listing them:

# First by category (alphabetically), then by revenue (descending)
sorted_multi <- sales %>%
  arrange(category, desc(revenue))

print(sorted_multi)

group_by() and summarise(): Aggregation

These two verbs work together to create summaries by group:

# Group by category and summarize
category_summary <- sales %>%
  group_by(category) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_units = mean(units_sold),
    n_transactions = n()
  )

print(category_summary)

The summarise() function creates new columns using aggregation functions like sum(), mean(), n(), min(), max(), sd(), and more:

# Detailed summary by region
region_summary <- sales %>%
  group_by(region) %>%
  summarise(
    total_revenue = sum(revenue),
    total_units = sum(units_sold),
    avg_price = mean(price),
    num_products = n_distinct(product)
  )

print(region_summary)

count() and tally(): Quick Counting

For simple counts, these verbs are incredibly useful:

# Count transactions by category
sales %>%
  count(category)

# Equivalent to:
sales %>%
  group_by(category) %>%
  tally()

Add a weight variable for weighted counts:

# Count total units sold by category
sales %>%
  count(category, wt = units_sold)

Chaining It All Together

The real power of dplyr comes from chaining these verbs together. Here’s a complete example:

final_report <- sales %>%
  # Filter: focus on Electronics
  filter(category == "Electronics") %>%
  # Add new columns
  mutate(discount = 0.1,
         after_discount = revenue * (1 - discount)) %>%
  # Group by region
  group_by(region) %>%
  # Summarise each group
  summarise(
    total_revenue = sum(after_discount),
    total_units = sum(units_sold),
    avg_order = mean(after_discount)
  ) %>%
  # Sort by revenue descending
  arrange(desc(total_revenue))

print(final_report)

Summary

You’ve learned the six essential dplyr verbs:

VerbPurpose
filter()Select rows based on conditions
select()Choose columns to keep
mutate()Create new columns
arrange()Sort rows
group_by()Define groups for aggregation
summarise()Create summary statistics

These verbs form the foundation of data manipulation in R. Once you master them, you’ll find yourself writing cleaner, more readable data analysis code. In the next tutorial, we’ll explore tidyr for reshaping data between wide and long formats.