Querying Data with DuckDB from R
DuckDB is an in-process analytical database that runs entirely within R without requiring external servers. It excels at analytical workloads, handling millions of rows efficiently while speaking SQL. For R programmers accustomed to working with data frames, DuckDB offers a powerful alternative that scales beyond what base R can handle.
This guide covers installing DuckDB, connecting to data sources, running SQL queries, and integrating with the tidyverse ecosystem.
Why DuckDB for R?
DuckDB addresses several pain points in R data analysis. Traditional approaches load entire datasets into memory, which fails once a file outgrows available RAM. Base R's CSV parsing is single-threaded, making large imports slow. DuckDB sidesteps both problems by running analytical queries directly on files, without ever loading them into R's memory.
The database executes vectorized queries using columnar storage, often achieving 10-100x speedups over base R operations. It supports full SQL including joins, window functions, and aggregations. Best of all, DuckDB requires no setup, no server, and no configuration.
Installation
Install DuckDB and its R bindings from CRAN:
install.packages("duckdb")
install.packages("DBI")
For tidyverse integration, also install:
install.packages("dbplyr")
install.packages("dplyr")
Load the packages:
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)
Connecting to DuckDB
Create an in-memory database for temporary work or a file-based database for persistence:
# In-memory database (fastest, but data lost on restart)
con <- dbConnect(duckdb::duckdb())
# File-based database (persists between sessions)
con <- dbConnect(duckdb::duckdb(), "my_analytics.duckdb")
# Close connection when done
dbDisconnect(con, shutdown = TRUE)
In-memory databases are useful for ETL pipelines and temporary analysis. File-based databases persist data and indexes across sessions, so you avoid re-importing on the next run.
Reading Data into DuckDB
From CSV Files
Import a CSV file directly into DuckDB without loading into R memory:
con <- dbConnect(duckdb::duckdb())
# Read CSV directly into DuckDB
dbExecute(con, "CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_data.csv')")
# Or use the convenience function
duckdb_read_csv(con, "sales", "sales_data.csv")
The read_csv_auto function automatically detects column types, delimiters, and headers. For more control, use read_csv with explicit options.
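A minimal sketch of those explicit options; the column names and types here are illustrative, and the example writes its own sample file so it runs anywhere:

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb::duckdb())

# Write a small semicolon-delimited sample so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("product;quantity;price",
             "Widget A;5;9.99",
             "Widget B;12;4.50"),
           csv_path)

# read_csv with an explicit delimiter, header flag, and column types
sales <- dbGetQuery(con, sprintf("
  SELECT *
  FROM read_csv('%s',
                delim = ';',
                header = true,
                columns = {'product': 'VARCHAR',
                           'quantity': 'INTEGER',
                           'price': 'DOUBLE'})
", csv_path))

dbDisconnect(con, shutdown = TRUE)
```

Declaring columns skips type sniffing, which also makes reads reproducible when a sample of rows would mislead auto-detection.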
From Parquet Files
DuckDB excels at reading Parquet files, which are columnar and compressed:
# Read Parquet file
dbExecute(con, "CREATE TABLE data AS SELECT * FROM read_parquet('data.parquet')")
# Read multiple files at once
dbExecute(con, "CREATE TABLE data AS SELECT * FROM read_parquet('*.parquet')")
Parquet is the recommended format for large analytical datasets. DuckDB can read Parquet files directly from S3 or other cloud storage without downloading the entire file.
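A sketch of remote reads via DuckDB's httpfs extension; the bucket name, region, and path below are placeholders, not a real dataset:

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb::duckdb())

# The httpfs extension adds HTTP(S) and S3 support
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")

# Configure S3 access (all values here are placeholders)
dbExecute(con, "SET s3_region = 'us-east-1'")
# dbExecute(con, "SET s3_access_key_id = '...'")
# dbExecute(con, "SET s3_secret_access_key = '...'")

# DuckDB issues range requests, fetching only the row groups and
# columns the query touches rather than the whole file
sales_by_product <- dbGetQuery(con, "
  SELECT product, SUM(quantity) AS total_qty
  FROM read_parquet('s3://my-bucket/sales/*.parquet')
  GROUP BY product
")
```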
From R Data Frames
Expose an R data frame to DuckDB. Note that duckdb_register() creates a virtual table over the data frame without copying it; use dbWriteTable() if you want a physical copy stored in the database:
# Register the data frame as a virtual table (no copy is made)
duckdb_register(con, "sales", sales_df)
# Now query it with SQL
result <- dbGetQuery(con, "SELECT * FROM sales LIMIT 10")
This approach is useful when you need to join R data with external data sources.
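A self-contained sketch of that pattern: an in-memory R data frame is registered, then joined in one query against an external CSV (written here just so the example runs anywhere; the column names are illustrative):

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb::duckdb())

# In-memory R data: a lookup table of customer regions
customers <- data.frame(customer_id = c(1, 2),
                        region      = c("East", "West"))
duckdb_register(con, "customers", customers)

# "External" data: a small CSV written so the example is self-contained
orders_csv <- tempfile(fileext = ".csv")
writeLines(c("customer_id,amount", "1,100", "2,250", "1,40"), orders_csv)

# Join the registered data frame with the file in a single query
result <- dbGetQuery(con, sprintf("
  SELECT c.region, SUM(o.amount) AS revenue
  FROM read_csv_auto('%s') o
  JOIN customers c ON o.customer_id = c.customer_id
  GROUP BY c.region
  ORDER BY c.region
", orders_csv))

dbDisconnect(con, shutdown = TRUE)
```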
Running SQL Queries
Basic Queries
Execute SQL queries and return results to R:
# Fetch all results
results <- dbGetQuery(con, "SELECT * FROM sales WHERE quantity > 10")
# Execute without returning results
dbExecute(con, "INSERT INTO sales VALUES (6, 'Widget D', 20, 24.99, '2026-03-11')")
Aggregations
DuckDB handles aggregations efficiently:
# Total sales by product
by_product <- dbGetQuery(con, "
SELECT product,
SUM(quantity) as total_qty,
SUM(quantity * price) as revenue
FROM sales
GROUP BY product
ORDER BY revenue DESC
")
Window Functions
Use SQL window functions for complex analytics:
# Running totals and rankings
analysis <- dbGetQuery(con, "
SELECT *,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) as running_total,
RANK() OVER (PARTITION BY product ORDER BY revenue DESC) as revenue_rank
FROM daily_sales
")
Using dplyr with DuckDB
The real power of DuckDB in R emerges when combined with dplyr syntax. This lets you write R code that translates to efficient SQL.
Basic Pipeline
# Connect to DuckDB with a sample dataset
con <- dbConnect(duckdb::duckdb(), ":memory:")
duckdb_read_csv(con, "sales", "sales_data.csv")
# Use dplyr syntax
sales_summary <- con %>%
tbl("sales") %>%
group_by(product) %>%
summarise(
total_qty = sum(quantity),
avg_price = mean(price),
n_orders = n()
) %>%
collect()
The collect() call executes the query in DuckDB and returns the result as a tibble. Until then, the pipeline is a lazy query: dbplyr only builds SQL, so DuckDB sees and optimizes the whole query at once.
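To see the laziness in action, show_query() prints the SQL dbplyr would send without executing anything; the tiny registered table here is just for illustration:

```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb())
duckdb_register(con, "sales",
                data.frame(product  = c("A", "A", "B"),
                           quantity = c(2, 3, 4)))

lazy <- tbl(con, "sales") %>%
  group_by(product) %>%
  summarise(total_qty = sum(quantity, na.rm = TRUE))

# Print the generated SQL; nothing has run in DuckDB yet
show_query(lazy)

# collect() finally executes the query and pulls the result into R
res <- collect(lazy)

dbDisconnect(con, shutdown = TRUE)
```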
Push Query to Database
Force execution in DuckDB rather than R:
# This generates SQL and runs it in DuckDB
large_result <- con %>%
tbl("sales") %>%
filter(quantity > 100) %>%
mutate(revenue = quantity * price) %>%
group_by(product) %>%
summarise(across(where(is.numeric), sum)) %>%
compute() # Creates temporary table in DuckDB
The compute() function materializes the query as a temporary table in DuckDB, useful for complex multi-step transformations.
Performance Tips
Use Indexes
DuckDB's ART indexes chiefly accelerate highly selective point lookups and enforce uniqueness constraints; full-table analytical scans are already accelerated by built-in zone maps, so create indexes only for columns you filter on selectively:
dbExecute(con, "CREATE INDEX idx_sales_product ON sales(product)")
dbExecute(con, "CREATE INDEX idx_sales_date ON sales(date)")
Batch Inserts
For large imports, insert data in bulk rather than row by row:
# Much faster than row-by-row inserts
dbExecute(con, "INSERT INTO sales SELECT * FROM read_csv_auto('large_file.csv')")
Memory Management
Control DuckDB memory usage:
# Set a memory limit (accepts human-readable sizes like '4GB')
dbExecute(con, "SET memory_limit='4GB'")
# Check current usage
dbGetQuery(con, "SELECT * FROM duckdb_memory()")
Working with Multiple Data Sources
DuckDB can join data from different file formats simultaneously:
# Join CSV with Parquet
combined <- dbGetQuery(con, "
SELECT s.*, c.customer_name, c.region
FROM read_csv_auto('sales.csv') s
JOIN read_parquet('customers.parquet') c
ON s.customer_id = c.id
WHERE s.date >= '2026-01-01'
")
This is powerful for analytics workflows that combine multiple data sources without loading them into R.
Example: Analyzing Website Logs
A practical use case is analyzing web server logs:
# Assume you have Apache logs in CSV format
logs <- dbConnect(duckdb::duckdb(), "logs.duckdb")
duckdb_read_csv(logs, "access_log", "access_log.csv")
# Analyze traffic patterns
traffic <- dbGetQuery(logs, "
SELECT
date_trunc('hour', timestamp) as hour,
path,
COUNT(*) as hits,
AVG(response_time) as avg_time
FROM access_log
WHERE status = 200
GROUP BY 1, 2
ORDER BY hits DESC
LIMIT 20
")
See Also
r-sqlite — SQLite for simpler database needs
r-arrow-parquet — Working with Parquet files
r-data-table — Alternative high-performance data manipulation