Querying Data with DuckDB from R
DuckDB is an in-process analytical database that runs entirely within R without requiring external servers. It excels at querying data across analytical workloads, handling millions of rows efficiently while speaking SQL. For R programmers accustomed to working with data frames, DuckDB offers a powerful alternative that scales beyond what base R can handle.
This guide covers installing DuckDB, connecting to data sources, running SQL queries, and integrating with the tidyverse ecosystem.
Why DuckDB for R?
DuckDB addresses several pain points in R data analysis. Traditional approaches involve loading entire datasets into memory, which fails with files larger than available RAM. CSV parsing in R is single-threaded, making large file imports slow. DuckDB solves these problems by running analytical queries directly on files without loading them into R memory.
The database executes vectorized queries using columnar storage, often achieving 10-100x speedups over base R operations. It supports full SQL including joins, window functions, and aggregations. Best of all, DuckDB requires no setup, no server, and no configuration.
Installation
Install DuckDB and its R bindings from CRAN:
install.packages("duckdb")
install.packages("DBI")
With DuckDB and the DBI package installed, you gain the core database connectivity layer that lets R communicate with DuckDB’s query engine. The next step adds dbplyr and dplyr, which together bridge DuckDB’s SQL engine with R’s familiar data-manipulation syntax. Rather than writing raw SQL for every operation, these packages translate filter(), group_by(), and summarise() into optimized DuckDB queries, letting you stay within the tidyverse workflow while benefiting from DuckDB’s analytical speed.
install.packages("dbplyr")
install.packages("dplyr")
Once all four packages are installed, calling library() makes their functions available in your R session. The duckdb package provides the database driver and connection functions, while DBI supplies the standard database interface that R packages expect. The dplyr and dbplyr pair gives you the translation layer that converts familiar R data-manipulation verbs into SQL that DuckDB executes efficiently. With these loaded, you are ready to open a connection and begin querying data directly from files or in-memory tables.
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)
The packages are now available in your session, but DuckDB itself has not been started yet. The next section walks through opening connections, which is the gateway between your R script and DuckDB’s query engine. Choosing between an in-memory database and a file-backed one is the first practical decision you will make, and that choice directly affects both query performance and whether your data survives after the R session ends.
Connecting to DuckDB
Create an in-memory database for temporary work or a file-based database for persistence:
# In-memory database (fastest, but data lost on restart)
con <- dbConnect(duckdb::duckdb())
# File-based database (persists between sessions)
con <- dbConnect(duckdb::duckdb(), "my_analytics.duckdb")
# Close connection when done
dbDisconnect(con, shutdown = TRUE)
In-memory databases are useful for ETL pipelines and temporary analysis. File-based databases persist your data and indexes, making subsequent queries faster.
Reading data into DuckDB
From CSV and Parquet files
DuckDB reads CSV and Parquet files directly from disk without loading them into R memory, keeping the data in DuckDB’s columnar storage so that only query results move to R. The read_csv_auto() function detects column types, delimiters, and headers automatically. For Parquet, DuckDB can read single files, glob patterns matching multiple files, and even files stored in S3 or other cloud object storage.
con <- dbConnect(duckdb::duckdb())
# CSV — auto-detects types, headers, and delimiters
dbExecute(con, "CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_data.csv')")
# Parquet — reads single files or wildcard patterns
dbExecute(con, "CREATE TABLE data AS SELECT * FROM read_parquet('*.parquet')")
From R data frames
To join R data with external sources, register an existing data frame as a DuckDB virtual table. The data is not copied; DuckDB references the R object directly and queries it in place, making this efficient for ad-hoc joins between in-memory R data and on-disk database tables.
duckdb_register(con, "sales", sales_df)
result <- dbGetQuery(con, "SELECT * FROM sales LIMIT 10")
Registering an R data frame as a DuckDB table is a lightweight operation that avoids data duplication. Once registered, you can query the R object with full SQL syntax, combining it with DuckDB-native tables in joins, aggregations, and subqueries. This capability is especially helpful during exploratory analysis when you have reference data already loaded in R that you need to cross-reference against larger datasets stored in Parquet or CSV files.
Running SQL queries
DuckDB supports standard SQL with window functions, CTEs, and correlated subqueries. Use dbGetQuery() for SELECT statements that return results and dbExecute() for INSERT, UPDATE, DELETE, and DDL statements.
# SELECT returns results as an R data frame
results <- dbGetQuery(con, "SELECT * FROM sales WHERE quantity > 10")
# Aggregations with GROUP BY, ORDER BY, and computed columns
by_product <- dbGetQuery(con, "
SELECT product,
SUM(quantity) as total_qty,
SUM(quantity * price) as revenue
FROM sales
GROUP BY product
ORDER BY revenue DESC
")
Window functions
The basic aggregation queries shown above collapse all rows per group into a single summary row. Window functions preserve every row while computing aggregates over a defined window of related rows, which makes them indispensable for running totals, percentiles, and rankings without losing the original row-level detail. The example below adds a running revenue total per product and a revenue rank, both computed inside DuckDB rather than in R.
# Running totals and rankings
analysis <- dbGetQuery(con, "
SELECT *,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) as running_total,
RANK() OVER (PARTITION BY product ORDER BY revenue DESC) as rank
FROM daily_sales
")
Window functions in DuckDB run with the same vectorized execution engine that powers its regular aggregations, so you can partition and rank across millions of rows without the performance penalty you would incur by implementing equivalent logic in R with grouped data frames. After running analytical SQL through DuckDB, you may want to switch to a more familiar syntax for data transformation tasks, which brings us to the dplyr integration.
Using dplyr with DuckDB
The real power of DuckDB in R emerges when combined with dplyr syntax, which lets you write R code that translates to efficient SQL behind the scenes. Instead of toggling between SQL strings and R data frame operations, you build a single pipeline in dplyr that DuckDB converts automatically into an optimized query plan, executing it entirely within the database engine.
Basic pipeline
# Connect to DuckDB with a sample dataset
con <- dbConnect(duckdb::duckdb(), ":memory:")
duckdb_read_csv(con, "sales", "sales_data.csv")
# Use dplyr syntax
sales_summary <- con %>%
tbl("sales") %>%
group_by(product) %>%
summarise(
total_qty = sum(quantity),
avg_price = mean(price),
n_orders = n()
) %>%
collect()
The collect() action executes the query and returns a tibble, pulling the final result set from DuckDB into R memory. Without collect(), you get a lazy query reference that DuckDB can optimize further before execution, which is useful when you want to inspect or chain additional operations before materializing the data. Sometimes, however, you need DuckDB to run a query and store the intermediate result inside the database itself rather than pulling it into R, which is where compute() becomes valuable.
Push query to database
Force execution in DuckDB rather than R:
# This generates SQL and runs it in DuckDB
large_result <- con %>%
tbl("sales") %>%
filter(quantity > 100) %>%
mutate(revenue = quantity * price) %>%
group_by(product) %>%
summarise(across(where(is.numeric), sum)) %>%
compute() # Creates temporary table in DuckDB
The compute() function materializes the query as a temporary table in DuckDB, useful for complex multi-step transformations. When you chain many dplyr verbs together, DuckDB generates increasingly complex SQL, and compute() forces it to write the intermediate result so subsequent steps start from a clean, pre-aggregated table rather than building ever-deeper query trees.
Performance tips
Getting near-peak performance from DuckDB requires a few deliberate habits around indexing, data loading, and resource allocation. These three practices together can turn a query that takes minutes into one that finishes in seconds, especially when you are working with files that contain tens of millions of rows. Each subsection below addresses one dimension of performance, starting with the indexes that make filter and join operations fast.
Use indexes
Create indexes for frequently filtered columns:
dbExecute(con, "CREATE INDEX idx_sales_product ON sales(product)")
dbExecute(con, "CREATE INDEX idx_sales_date ON sales(date)")
Indexes speed up filter and join operations by letting DuckDB locate relevant rows without scanning the entire table. Once you have created indexes on the columns you query most often, the next performance concern is how you get data into DuckDB in the first place. Loading data row by row through repeated INSERT statements is surprisingly slow, so bulk loading matters just as much as proper indexing.
Batch inserts
For large data imports, batch inserts:
# Much faster than row-by-row inserts
dbExecute(con, "INSERT INTO sales SELECT * FROM read_csv_auto('large_file.csv')")
A single bulk INSERT that reads directly from a file avoids the overhead of sending thousands of individual rows from R to DuckDB. The CSV parsing happens inside DuckDB’s own multithreaded reader, which is dramatically faster than reading the file into R first and then transferring the resulting data frame. With your data loaded efficiently, the remaining piece of the performance puzzle is telling DuckDB how much of your machine’s memory it is allowed to use.
Memory management
Control DuckDB memory usage:
# Set memory limit (in bytes)
dbExecute(con, "SET memory_limit='4GB'")
# Check current usage
dbGetQuery(con, "SELECT * FROM duckdb_memory()")
DuckDB defaults to using roughly 80% of system memory for its buffer pool, which is usually fine on a dedicated analysis machine but can cause swapping if you are running other memory-intensive processes alongside it. Setting an explicit limit with SET memory_limit and checking actual consumption through duckdb_memory() gives you visibility into the engine’s resource usage so you can tune it for your specific hardware and workload.
Working with multiple data sources
DuckDB can join data from different file formats simultaneously:
# Join CSV with Parquet
combined <- dbGetQuery(con, "
SELECT s.*, c.customer_name, c.region
FROM read_csv_auto('sales.csv') s
JOIN read_parquet('customers.parquet') c
ON s.customer_id = c.id
WHERE s.date >= '2026-01-01'
")
This is powerful for analytics workflows that combine multiple data sources without loading them into R. The query above joins a CSV file of sales transactions with a Parquet file of customer records, applying a date filter before the join so DuckDB reads only the rows it needs from each source.
Example: analyzing website logs
A practical use case that highlights everything covered so far is analyzing web server logs. These files are typically large CSV exports that grow quickly and are impractical to load entirely into an R data frame. DuckDB queries the file directly, applying aggregations at the storage level, so even a multi-gigabyte access log returns results in milliseconds.
# Assume you have Apache logs in CSV format
logs <- dbConnect(duckdb::duckdb(), "logs.duckdb")
duckdb_read_csv(logs, "access_log", "access_log.csv")
# Analyze traffic patterns
traffic <- dbGetQuery(logs, "
SELECT
date_trunc('hour', timestamp) as hour,
path,
COUNT(*) as hits,
AVG(response_time) as avg_time
FROM access_log
WHERE status = 200
GROUP BY 1, 2
ORDER BY hits DESC
LIMIT 20
")
This same pattern of querying data directly from files, without pre-loading or importing, applies to any tabular format DuckDB supports, including Parquet, JSON, and Arrow. By combining DuckDB’s file-native SQL engine with dplyr integration, R programmers get a workflow that scales from quick exploratory queries on a laptop to scheduled ETL pipelines processing hundreds of gigabytes of data.
See also
r-sqlite— SQLite for simpler database needsr-arrow-parquet— Working with Parquet filesr-data-table— Alternative high-performance data manipulation