Database Queries with dbplyr
The dbplyr package translates your dplyr code into SQL automatically. Instead of writing raw SQL queries, you write familiar dplyr statements—filter(), select(), mutate(), group_by()—and dbplyr converts them into efficient database queries. This approach combines the readability of R with the performance of database execution.
Installation and Setup
Install dbplyr along with DBI and a database driver:
install.packages(c("dbplyr", "DBI", "RSQLite"))
For other databases, add the appropriate driver:
# PostgreSQL
install.packages("RPostgres")
# MySQL / MariaDB (RMariaDB is the maintained driver)
install.packages("RMariaDB")
Connecting to a Database
Create a database connection using DBI, then reference tables with dplyr::tbl():
library(DBI)
library(dplyr)
library(dbplyr)
# SQLite example (in-memory for demonstration)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Create a sample table to query
dbExecute(con, "
CREATE TABLE sales (
id INTEGER PRIMARY KEY,
product TEXT,
region TEXT,
amount NUMERIC,
date TEXT
)
")
dbExecute(con, "INSERT INTO sales (product, region, amount, date) VALUES
('Widget A', 'North', 1500, '2026-01-15'),
('Widget B', 'South', 2300, '2026-01-16'),
('Widget A', 'North', 1800, '2026-01-17'),
('Widget B', 'East', 900, '2026-01-18'),
('Widget A', 'West', 3200, '2026-01-19')")
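If your data already lives in an R data frame, dplyr::copy_to() is a convenient alternative to hand-written DDL: it creates the table and returns a lazy reference in one step. A minimal sketch on its own throwaway connection (con2 and sales_df are illustrative names):

```r
library(DBI)
library(dplyr)

# A separate throwaway connection for this sketch
con2 <- dbConnect(RSQLite::SQLite(), ":memory:")

sales_df <- data.frame(
  product = c("Widget A", "Widget B"),
  region  = c("North", "South"),
  amount  = c(1500, 2300)
)

# copy_to() uploads the data frame and returns a lazy tbl
sales_tbl <- copy_to(con2, sales_df, name = "sales", temporary = FALSE)
sales_tbl %>% filter(amount > 2000)

dbDisconnect(con2)
```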
Translating dplyr to SQL
The magic happens when you use tbl() to reference a database table:
# Reference the remote table
sales <- tbl(con, "sales")
# Now query it using dplyr verbs
sales %>%
filter(region == "North")
This doesn’t compute anything in R; dbplyr generates SQL for the database to execute:
SELECT *
FROM sales
WHERE region = 'North'
The query is lazy: printing it fetches only a short preview, and nothing else runs until you explicitly collect the results.
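Laziness is easy to verify. A minimal sketch using dbplyr::memdb_frame(), which copies a local data frame into a throwaway in-memory SQLite table:

```r
library(dplyr)
library(dbplyr)

# memdb_frame() copies local data into an in-memory SQLite table
# and returns a lazy reference to it
db <- memdb_frame(region = c("North", "South"), amount = c(1, 2))

q <- db %>% filter(region == "North")
inherits(q, "tbl_lazy")  # TRUE: still a query description, not data
q %>% show_query()       # prints the SQL that would run
collect(q)               # executes the query and returns a tibble
```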
Essential dplyr Verbs on Databases
Every major dplyr verb translates to SQL:
filter() — WHERE clauses
high_value <- sales %>%
filter(amount > 1000) %>%
filter(region %in% c("North", "South"))
# SQL: WHERE amount > 1000 AND region IN ('North', 'South')
select() — SELECT columns
sales %>%
select(product, amount)
# SQL: SELECT product, amount FROM sales
mutate() — computed columns in SELECT
sales %>%
mutate(amount_eur = amount * 0.92)
# SQL: SELECT *, amount * 0.92 AS amount_eur FROM sales
arrange() — ORDER BY
sales %>%
arrange(desc(amount))
# SQL: ORDER BY amount DESC
group_by() and summarize()
sales %>%
group_by(product) %>%
summarize(
total = sum(amount, na.rm = TRUE),
count = n(),
.groups = "drop"
)
# SQL: SELECT product, SUM(amount) AS total, COUNT(*) AS count
# FROM sales GROUP BY product
The .groups argument controls output grouping. Use "drop" for an ungrouped tibble.
join() operations
# Create a products table
dbExecute(con, "CREATE TABLE products (id INTEGER, category TEXT)")
dbExecute(con, "INSERT INTO products VALUES (1, 'Electronics'), (2, 'Office')")
products <- tbl(con, "products")
# Join sales with products on the shared id column
sales %>%
inner_join(products, by = "id")
# SQL: SELECT * FROM sales INNER JOIN products ON sales.id = products.id
Collecting Results
By default, dbplyr operations are lazy—they build a query but don’t execute it. Call collect() to pull results into R:
# Lazy query — no data fetched yet
query <- sales %>%
filter(amount > 1000) %>%
select(product, amount)
# Execute and bring into R
results <- query %>% collect()
class(results)
# [1] "tbl_df" "tbl" "data.frame"
Operations that feel local often still run on the database. head(), for example, translates to LIMIT rather than collecting:
sales %>%
filter(product %in% c("Widget A", "Widget B")) %>%
head(10)
# SQL: ... WHERE product IN ('Widget A', 'Widget B') LIMIT 10
Window Functions
dbplyr translates window functions like lag(), lead(), rank():
sales %>%
arrange(date) %>%
mutate(
prev_amount = lag(amount),
next_amount = lead(amount),
rank = row_number(),
pct_of_total = amount / sum(amount) * 100
) %>%
collect()
These generate SQL window functions (OVER(), LAG(), LEAD(), RANK()).
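You can inspect the window-function translation without a live database using dbplyr::lazy_frame() and a simulated connection; window_order() sets the ORDER BY inside OVER(). The column names here are illustrative:

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds queries against a simulated backend,
# purely for inspecting SQL translations
lf <- lazy_frame(date = 1, amount = 1, con = simulate_sqlite())

lf %>%
  window_order(date) %>%
  mutate(prev_amount = lag(amount)) %>%
  show_query()
# The generated SQL contains a LAG(...) OVER (ORDER BY ...) clause
```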
Counting and Distinct
# Count rows per group
sales %>%
count(region, wt = amount, name = "total_amount")
# SQL: SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region
# Get unique values
sales %>%
distinct(region)
# SQL: SELECT DISTINCT region FROM sales
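Relatedly, n_distinct() inside summarize() becomes COUNT(DISTINCT ...). A sketch against a simulated connection, so no database is needed:

```r
library(dplyr)
library(dbplyr)

lf <- lazy_frame(region = "x", con = simulate_sqlite())

# n_distinct() translates to COUNT(DISTINCT ...)
lf %>%
  summarize(n_regions = n_distinct(region)) %>%
  show_query()
```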
SQL at a Glance
Preview the generated SQL without executing:
sales %>%
filter(amount > 1000) %>%
group_by(product) %>%
summarize(total = sum(amount)) %>%
show_query()
This outputs the SQL for inspection—useful for debugging and optimization.
Performance Considerations
Push down predicates
Filter early to reduce data transfer:
# Good: filter before the join so the database moves less data
sales %>%
filter(amount > 1000) %>%
inner_join(products, by = "id") %>%
collect()
# Less efficient: collect everything, then filter in R
sales %>%
inner_join(products, by = "id") %>%
collect() %>%
filter(amount > 1000)
Use database aggregation
# The database aggregates; only the summary rows travel to R
sales %>%
group_by(region) %>%
summarize(avg = mean(amount, na.rm = TRUE)) %>%
collect()
# SQL: SELECT region, AVG(amount) AS avg FROM sales GROUP BY region
Check the query plan
For large tables, verify the database uses indexes:
sales %>% filter(amount > 1000) %>% explain()
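For instance, with SQLite you can add an index and compare plans. A sketch on its own throwaway connection; the index name is made up:

```r
library(DBI)
library(dplyr)

con2 <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con2, "sales", data.frame(amount = c(500, 1500, 3200)))

sales2 <- tbl(con2, "sales")
sales2 %>% filter(amount > 1000) %>% explain()  # full table scan

# idx_sales_amount is an illustrative name
dbExecute(con2, "CREATE INDEX idx_sales_amount ON sales(amount)")
sales2 %>% filter(amount > 1000) %>% explain()  # can now use the index

dbDisconnect(con2)
```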
Working with Multiple Databases
The same dplyr code works across databases. dbplyr automatically generates appropriate SQL:
# Same R code, different databases
# PostgreSQL
con_pg <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
tbl(con_pg, "sales") %>% filter(region == "North")
# MariaDB / MySQL
con_mariadb <- dbConnect(RMariaDB::MariaDB(), dbname = "mydb")
tbl(con_mariadb, "sales") %>% filter(region == "North")
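You can see dialect differences without connecting to anything via dbplyr's simulated backends (lazy_frame() and the simulate_*() helpers; the column x is illustrative):

```r
library(dplyr)
library(dbplyr)

lf_pg <- lazy_frame(x = 1, con = simulate_postgres())
lf_my <- lazy_frame(x = 1, con = simulate_mysql())

# Identical dplyr code, backend-specific SQL: PostgreSQL quotes
# identifiers as "x", MySQL as `x`
lf_pg %>% select(x) %>% show_query()
lf_my %>% select(x) %>% show_query()
```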
Disconnecting
dbDisconnect(con)
Common Pitfalls
Local R objects in filters
target_regions <- c("North", "South")
# Fine: a local vector in %in% translates to an IN clause
sales %>% filter(region %in% target_regions) %>% collect()
# Safer: !! embeds the values when the query is built, which avoids
# ambiguity if target_regions ever matched a column name
sales %>% filter(region %in% !!target_regions) %>% collect()
Forgetting to collect
Large lazy tables look like data but aren’t loaded, and base R functions can’t see their size:
big_result <- sales %>% filter(amount > 1000)
nrow(big_result) # NA: the row count is unknown until the query runs
big_result %>% tally() # counts the rows on the database instead
Unsupported operations
Some R functions have no SQL translation; dbplyr raises an error when it generates the query:
# sample() has no SQL translation
sales %>% mutate(in_r = sample(n())) # errors when the query is built
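The usual workaround is to do the translatable part on the database, collect(), then finish in R. A sketch using memdb_frame():

```r
library(dplyr)
library(dbplyr)

db <- memdb_frame(amount = c(1, 2, 3))

# sample() can't be translated, so collect() first and run the
# untranslatable step locally on the returned tibble
db %>%
  collect() %>%
  mutate(shuffled = sample(n()))
```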
See Also
dplyr-filter — Filter rows in R and SQL
dplyr-mutate — Create new columns
dplyr-select — Select and rename columns
dplyr-group-by — Group data for aggregation
r-databases-dbi — Connecting to databases with DBI
r-duckdb — Fast in-memory analytics with DuckDB