Database Queries with dbplyr

· 5 min read · Updated March 17, 2026 · intermediate
r dbplyr dplyr database sql tidyverse

The dbplyr package translates your dplyr code into SQL automatically. Instead of writing raw SQL queries, you write familiar dplyr statements—filter(), select(), mutate(), group_by()—and dbplyr converts them into efficient database queries. This approach combines the readability of R with the performance of database execution.

Installation and Setup

Install dbplyr along with DBI and a database driver:

install.packages(c("dbplyr", "DBI", "RSQLite"))

For other databases, add the appropriate driver:

# PostgreSQL
install.packages("RPostgres")

# MySQL
install.packages("RMySQL")

Connecting to a Database

Create a database connection using DBI, then wrap it with dbplyr::tbl():

library(dbplyr)
library(DBI)

# SQLite example (in-memory for demonstration)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a sample table to query
dbExecute(con, "
  CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    product TEXT,
    region TEXT,
    amount NUMERIC,
    date TEXT
  )
")

dbExecute(con, "INSERT INTO sales (product, region, amount, date) VALUES
  ('Widget A', 'North', 1500, '2026-01-15'),
  ('Widget B', 'South', 2300, '2026-01-16'),
  ('Widget A', 'North', 1800, '2026-01-17'),
  ('Widget B', 'East', 900, '2026-01-18'),
  ('Widget A', 'West', 3200, '2026-01-19')")

Translating dplyr to SQL

The magic happens when you use tbl() to reference a database table:

# Reference the remote table
sales <- tbl(con, "sales")

# Now query it using dplyr verbs
sales %>%
  filter(region == "North")

This doesn’t run in R—it generates SQL and executes on the database:

SELECT *
FROM sales
WHERE region = 'North'

The query is lazy. It only executes when you explicitly collect results.

Essential dplyr Verbs on Databases

Every major dplyr verb translates to SQL:

filter() — WHERE clauses

high_value <- sales %>%
  filter(amount > 1000) %>%
  filter(region %in% c("North", "South"))
# SQL: WHERE amount > 1000 AND region IN ('North', 'South')

select() — SELECT columns

sales %>%
  select(product, amount)
# SQL: SELECT product, amount FROM sales

mutate() — CREATE new columns

sales %>%
  mutate(amount_usd = amount)
# SQL: SELECT *, amount AS amount_usd FROM sales

arrange() — ORDER BY

sales %>%
  arrange(desc(amount))
# SQL: ORDER BY amount DESC

group_by() and summarize()

sales %>%
  group_by(product) %>%
  summarize(
    total = sum(amount, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  )
# SQL: SELECT product, SUM(amount) AS total, COUNT(*) AS count
#      FROM sales GROUP BY product

The .groups argument controls output grouping. Use "drop" for an ungrouped tibble.

join() operations

# Create a products table
dbExecute(con, "CREATE TABLE products (id INTEGER, category TEXT)")
dbExecute(con, "INSERT INTO products VALUES (1, 'Electronics'), (2, 'Office')")

products <- tbl(con, "products")

# Join sales with products
sales %>%
  inner_join(products, by = c("product" = "id"))
# SQL: SELECT * FROM sales INNER JOIN products ON sales.product = products.id

Collecting Results

By default, dbplyr operations are lazy—they build a query but don’t execute it. Call collect() to pull results into R:

# Lazy query — no data fetched yet
query <- sales %>%
  filter(amount > 1000) %>%
  select(product, amount)

# Execute and bring into R
results <- query %>% collect()
class(results)
# [1] "tbl_df" "tbl" "data.frame"

For small results, collect() is automatic when you use incompatible R operations:

# This automatically collects because %in% with a local vector
sales %>%
  filter(product %in% c("Widget A", "Widget B")) %>%
  head(10)

Window Functions

dbplyr translates window functions like lag(), lead(), rank():

sales %>%
  arrange(date) %>%
  mutate(
    prev_amount = lag(amount),
    next_amount = lead(amount),
    rank = row_number(),
    pct_of_total = amount / sum(amount) * 100
  ) %>%
  collect()

These generate SQL window functions (OVER(), LAG(), LEAD(), RANK()).

Counting and Distinct

# Count rows per group
sales %>%
  count(region, wt = amount, name = "total_amount")
# SQL: SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region

# Get unique values
sales %>%
  distinct(region)
# SQL: SELECT DISTINCT region FROM sales

SQL at a Glance

Preview the generated SQL without executing:

sales %>%
  filter(amount > 1000) %>%
  group_by(product) %>%
  summarize(total = sum(amount)) %>%
  show_query()

This outputs the SQL for inspection—useful for debugging and optimization.

Performance Considerations

Push down predicates

Filter early to reduce data transfer:

# Good: filter before join
sales %>%
  filter(amount > 1000) %>%
  inner_join(products, by = c("product" = "id")) %>%
  collect()

# Less efficient: collect, then filter in R
sales %>%
  inner_join(products, by = c("product" = "id")) %>%
  filter(amount > 1000) %>%
  collect()

Use database aggregation

# Database does the work
sales %>%
  group_by(region) %>%
  summarize(avg = mean(amount)) %>%
  collect()

Check the query plan

For large tables, verify the database uses indexes:

sales %>% filter(amount > 1000) %>% explain()

Working with Multiple Databases

The same dplyr code works across databases. dbplyr automatically generates appropriate SQL:

# Same R code, different databases
# PostgreSQL
con_pg <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
tbl(con_pg, "sales") %>% filter(region == "North")

# MySQL
con_mysql <- dbConnect(RMySQL::MySQL(), dbname = "mydb")
tbl(con_mysql, "sales") %>% filter(region == "North")

Disconnecting

dbDisconnect(con)

Common Pitfalls

Local R objects in filters

# This works but may be slow
target_regions <- c("North", "South")
sales %>% filter(region %in% target_regions) %>% collect()

# Better: pass the filter to the database
sales %>% filter(region %in% !!target_regions) %>% collect()

Forgetting to collect

Large lazy tables look like data but aren’t loaded:

big_result <- sales %>% filter(amount > 1000)
nrow(big_result)  # Triggers collection

Unsupported operations

Some R operations can’t translate to SQL. You’ll get an error or unexpected results:

# This won't translate
sales %>% mutate(in_r = sample(n()))  # Error or unexpected

See Also