rguides

Database Queries in R with dbplyr and dplyr

The dbplyr package translates your dplyr code into SQL automatically. Instead of writing raw SQL queries, you write familiar dplyr statements like filter(), select(), mutate(), and group_by() — and dbplyr converts them into efficient database queries. This approach combines the readability of R with the performance of database execution.

Installation and setup

Install dbplyr along with DBI and a database driver:

install.packages(c("dbplyr", "DBI", "RSQLite"))

The SQLite driver is bundled with R and works out of the box for local development and testing. It stores the database in a single file or in memory, making it ideal for prototyping queries before deploying to a production system. For production databases, install the driver package that matches your backend: RPostgres for PostgreSQL, RMySQL or RMariaDB for MySQL-compatible databases, and odbc for SQL Server and other enterprise systems. Each driver connects to its respective database through DBI’s common interface, so your dbplyr code stays the same regardless of the backend.

# Install database-specific drivers alongside dbplyr
# PostgreSQL
install.packages("RPostgres")
# MySQL / MariaDB
install.packages("RMySQL")

Connecting to a database

Create a database connection using DBI, then wrap it with dbplyr::tbl(). The connection object carries authentication, host, port, and database name. Once dbplyr has a connection, it treats remote tables like local data frames for all dplyr operations, translating your R code into SQL behind the scenes. To follow along, the example uses an in-memory SQLite database so you can experiment without installing a separate database server:

library(dbplyr)
library(DBI)

# SQLite example (in-memory for demonstration)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a sample table to query
dbExecute(con, "
  CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    product TEXT,
    region TEXT,
    amount NUMERIC,
    date TEXT
  )
")

dbExecute() sends SQL statements that don’t return result sets — DDL commands like CREATE TABLE and DML commands like INSERT. With the table structure in place, you need sample data to query. Populate the table with a few rows of sales records so the examples that follow have something to work with:

# Populate the table with sample sales data
dbExecute(con, "INSERT INTO sales (product, region, amount, date) VALUES
  ('Widget A', 'North', 1500, '2026-01-15'),
  ('Widget B', 'South', 2300, '2026-01-16'),
  ('Widget A', 'North', 1800, '2026-01-17'),
  ('Widget B', 'East', 900, '2026-01-18'),
  ('Widget A', 'West', 3200, '2026-01-19')")

Translating dplyr to SQL

Dbplyr’s core premise is simple but powerful: instead of writing SQL, you write dplyr code against a remote table reference. Dbplyr intercepts your dplyr verbs and converts them into the equivalent SQL for your database. The tbl() function is the entry point: it creates a lazy reference to a database table that you can query with dplyr as if it were a local data frame. The following example queries a sales table for rows where region is "North":

# Reference the remote table
sales <- tbl(con, "sales")

# Now query it using dplyr verbs
sales %>%
  filter(region == "North")

This doesn’t run in R; it generates SQL and executes on the database. Under the hood, dbplyr inspects your dplyr pipeline, builds an abstract SQL query tree, and then renders it in the dialect of your connected database. That generated SQL is what actually runs; your R session never touches the raw data until you explicitly request it:

SELECT *
FROM sales
WHERE region = 'North'

The query is lazy: it only executes when you call collect() to pull results into R. Lazy evaluation is the key advantage, since you can chain multiple dplyr verbs, each adding clauses to the SQL, and dbplyr sends only the final optimized query to the database. This means filtering, selecting, and mutating all compile into a single SQL statement rather than running as separate round-trips.

Essential dplyr verbs on databases

Every major dplyr verb translates to SQL, and the mapping is intuitive: filter() becomes WHERE, select() becomes SELECT, arrange() becomes ORDER BY. The key advantage is that you write R you already know, and dbplyr generates optimized SQL behind the scenes. You can inspect the generated SQL at any time with show_query() to verify what gets sent to the database.

# filter() → SQL WHERE clause with AND for multiple conditions
high_value <- sales %>%
  filter(amount > 1000) %>%
  filter(region %in% c("North", "South"))
# SQL: WHERE amount > 1000 AND region IN ('North', 'South')

# select() → SQL SELECT for column projection
sales %>%
  select(product, amount)
# SQL: SELECT product, amount FROM sales

# mutate() → SQL SELECT with computed columns
sales %>%
  mutate(amount_usd = amount)
# SQL: SELECT *, amount AS amount_usd FROM sales

The verb-to-SQL translation preserves the logical flow of your dplyr pipeline. Each verb appends to the query, and dbplyr only sends the complete SQL statement when you call collect(). This lazy execution means you can build complex queries incrementally without hitting the database until you are ready to retrieve results. The next block shows how arrange() and group_by() with summarize() map to their SQL counterparts, and both are essential for producing sorted and aggregated output directly in the database:

# arrange() → SQL ORDER BY
sales %>%
  arrange(desc(amount))
# SQL: ORDER BY amount DESC

# group_by() + summarize() → SQL GROUP BY with aggregate functions
sales %>%
  group_by(product) %>%
  summarize(
    total = sum(amount, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  )
# SQL: SELECT product, SUM(amount) AS total, COUNT(*) AS count
#      FROM sales GROUP BY product

The .groups argument in summarize() controls whether the result remains grouped. Using "drop" returns an ungrouped tibble, matching the behavior most users expect after a summary, while "keep" preserves the grouping structure for further grouped operations downstream. After mastering single-table operations, the natural next step is joining tables together, and dbplyr translates dplyr joins into SQL JOINs with the same consistent mapping:

# Join operations translate to SQL JOINs with ON clauses
dbExecute(con, "CREATE TABLE products (id INTEGER, category TEXT)")
dbExecute(con, "INSERT INTO products VALUES (1, 'Electronics'), (2, 'Office')")

products <- tbl(con, "products")
sales %>%
  inner_join(products, by = c("product" = "id"))
# SQL: SELECT * FROM sales INNER JOIN products ON sales.product = products.id

Join translation is one of dbplyr’s strongest features. The database handles the matching of rows across tables, which is far more efficient than pulling both tables into R and merging them locally. The by argument maps column names between the two tables, and dbplyr converts the dplyr join syntax into the appropriate SQL JOIN ... ON clause for your database dialect.

Collecting results

By default, dbplyr operations are lazy: they build a query but don’t execute it. Call collect() to pull results into R as a local tibble. This separation between query construction and execution gives you control over when network round-trips happen and how much data is transferred:

# Lazy query — no data fetched yet
query <- sales %>%
  filter(amount > 1000) %>%
  select(product, amount)

# Execute and bring into R
results <- query %>% collect()
class(results)
# [1] "tbl_df" "tbl" "data.frame"

For small results, collect() is automatic when you use incompatible R operations, such as %in% with a local R vector. Dbplyr detects that the operation cannot be translated to SQL, so it quietly executes the pending query, pulls the data into R, and performs the incompatible operation locally. The database still executes the initial filtering, so only the matching rows transfer across the network for R-side processing:

# This automatically collects because %in% with a local vector
sales %>%
  filter(product %in% c("Widget A", "Widget B")) %>%
  head(10)

Counting and distinct

Counting rows per group and finding unique values are common database operations that every data analyst reaches for regularly. Both translate cleanly from dplyr to SQL, and because the database handles the grouping and deduplication, these operations remain fast even on tables with millions of rows. The examples below show count() with weighted totals and distinct() for deduplication:

# Count rows per group
sales %>%
  count(region, wt = amount, name = "total_amount")
# SQL: SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region

# Get unique values
sales %>%
  distinct(region)
# SQL: SELECT DISTINCT region FROM sales

Both count() and distinct() are fully translated to SQL and execute efficiently on the database. The wt argument in count() lets you weight the row count by a numeric column, useful for summing transaction amounts grouped by region or category without a separate summarize() step.

SQL at a glance

Preview the generated SQL without executing the query. show_query() prints the exact SQL that dbplyr would send. This is invaluable for debugging unexpected results, learning SQL from dplyr code, and verifying query correctness before running on large tables. The example below chains a filter, group, and summary, then inspects the SQL that dbplyr would generate for that pipeline:

sales %>%
  filter(amount > 1000) %>%
  group_by(product) %>%
  summarize(total = sum(amount)) %>%
  show_query()

This outputs the SQL for inspection, which is useful for debugging and optimization. You can also run explain() to see the query execution plan; the database shows which indexes it uses and whether it performs a full table scan.

Performance considerations

Push down predicates

Filter early to reduce data transfer. When the database filters rows before joining, it processes less data and sends fewer rows across the network. The two approaches below produce identical results, but only the first keeps the heavy lifting inside the database:

# Good: filter before join
sales %>%
  filter(amount > 1000) %>%
  inner_join(products, by = c("product" = "id")) %>%
  collect()

# Less efficient: collect, then filter in R
sales %>%
  inner_join(products, by = c("product" = "id")) %>%
  filter(amount > 1000) %>%
  collect()

Use database aggregation

Aggregating in the database reduces the data transferred to R. When you call summarize() before collect(), the database performs the aggregation and returns only the summary rows, potentially reducing millions of rows to a handful of summary statistics before the transfer. The same principle applies to window functions and other analytical operations — push computation to the database whenever the required functions are supported:

# Database does the work
sales %>%
  group_by(region) %>%
  summarize(avg = mean(amount)) %>%
  collect()

Check the query plan

For large tables, verify the database’s execution plan to identify slow queries before they run in production. explain() returns the plan, showing which indexes are used, whether the query requires a full table scan, and how the database estimates row counts at each stage. Catching a missing index early can turn a multi-second query into a sub-millisecond one:

sales %>% filter(amount > 1000) %>% explain()

Working with multiple databases

The same dplyr code works across databases. dbplyr automatically generates the appropriate SQL dialect for each backend, so you can switch from SQLite for development to PostgreSQL for production without rewriting your analysis code. The database connection object determines the SQL dialect:

# Same R code, different databases
# PostgreSQL
con_pg <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
tbl(con_pg, "sales") %>% filter(region == "North")

# MySQL
con_mysql <- dbConnect(RMySQL::MySQL(), dbname = "mydb")
tbl(con_mysql, "sales") %>% filter(region == "North")

Always close your database connection when you are done. Leaving connections open can exhaust database connection pools on production servers and may cause resource leaks in long-running scripts. The shutdown = TRUE argument ensures the database process is fully stopped rather than just disconnected:

Disconnecting

dbDisconnect(con)

Common pitfalls

Local R objects in filters

Passing local R objects into database filters often triggers unintended collection. Use the !! (bang-bang) operator to ensure the value is injected into the SQL query directly rather than forcing the database to compare against a locally computed vector. Without !!, dbplyr may silently pull the entire table into R to evaluate the condition locally, defeating the purpose of using a database in the first place:

# This works but may be slow — may trigger collection of all rows
target_regions <- c("North", "South")
sales %>% filter(region %in% target_regions) %>% collect()

The !! operator tells dbplyr to inline the R variable directly into the generated SQL. The database receives WHERE region IN ('North', 'South') as a literal string in the query, which it can evaluate efficiently using indexes. Always use !! when passing R vectors into filter conditions on database-backed tables:

# Better: pass the filter to the database with !!
sales %>% filter(region %in% !!target_regions) %>% collect()

Forgetting to collect

Large lazy tables look like data frames in your environment but are not actually loaded into R memory. Calling nrow() on a lazy table forces collection, which may fetch millions of rows unexpectedly. Always use collect() explicitly and inspect the returned object’s class to confirm the data has been materialized. A quick check with is.tbl() or class() tells you whether you are holding a remote reference or a local data frame:

big_result <- sales %>% filter(amount > 1000)
nrow(big_result)  # Triggers collection

Unsupported operations

Some R functions have no SQL equivalent and cannot be translated by dbplyr. Functions like sample(), complex regular expression operations, and certain statistical transformations produce errors or silently degrade to collecting all data before computation. When you encounter a “no known translation” error, the workaround is to call collect() first, then apply the R function on the local data frame. For production pipelines, consider whether the untranslatable operation can be restructured as a database-native computation:

# This won't translate
sales %>% mutate(in_r = sample(n()))  # Error or unexpected

See also