Querying Data with DuckDB from R

· 5 min read · Updated March 11, 2026 · intermediate
r databases duckdb analytics big-data

DuckDB is an in-process analytical database that runs entirely within R without requiring external servers. It excels at analytical workloads, handling millions of rows efficiently while speaking SQL. For R programmers accustomed to working with data frames, DuckDB offers a powerful alternative that scales beyond what base R can handle.

This guide covers installing DuckDB, connecting to data sources, running SQL queries, and integrating with the tidyverse ecosystem.

Why DuckDB for R?

DuckDB addresses several pain points in R data analysis. Traditional approaches load entire datasets into memory, which fails once files exceed available RAM. Base R's CSV parser (read.csv) is single-threaded, making large imports slow. DuckDB sidesteps both problems by running parallel analytical queries directly on files, without loading them into R memory.

The database executes vectorized queries using columnar storage, often achieving 10-100x speedups over base R operations. It supports full SQL including joins, window functions, and aggregations. Best of all, DuckDB requires no setup, no server, and no configuration.

Installation

Install DuckDB and its R bindings from CRAN:

install.packages("duckdb")
install.packages("DBI")

For tidyverse integration, also install:

install.packages("dbplyr")
install.packages("dplyr")

Load the packages:

library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

Connecting to DuckDB

Create an in-memory database for temporary work or a file-based database for persistence:

# In-memory database (fastest, but data lost on restart)
con <- dbConnect(duckdb::duckdb())

# File-based database (persists between sessions)
con <- dbConnect(duckdb::duckdb(), "my_analytics.duckdb")

# Close connection when done
dbDisconnect(con, shutdown = TRUE)

In-memory databases suit scratch analysis and transformation pipelines whose results you export elsewhere. File-based databases persist your tables and indexes between sessions, so you avoid re-importing data on every run.

Reading Data into DuckDB

From CSV Files

Import a CSV file directly into DuckDB without loading into R memory:

con <- dbConnect(duckdb::duckdb())

# Read CSV directly into DuckDB
dbExecute(con, "CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_data.csv')")

# Or use the convenience function
duckdb_read_csv(con, "sales", "sales_data.csv")

The read_csv_auto function automatically detects column types, delimiters, and headers. For more control, use read_csv with explicit options.
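As a sketch of what explicit options look like, here the delimiter, header, and column types are spelled out instead of auto-detected (the schema is assumed from this article's sales examples, and a tiny CSV is written first so the snippet is self-contained):

```r
library(duckdb)

con <- dbConnect(duckdb())

# Create a small sample file so the example runs on its own
writeLines(c("id,product,quantity,price,date",
             "1,Widget A,5,9.99,2026-03-01"),
           "sales_data.csv")

# read_csv with an explicit schema: no type sniffing, no surprises
dbExecute(con, "
  CREATE TABLE sales AS
  SELECT * FROM read_csv('sales_data.csv',
    delim = ',',
    header = true,
    columns = {
      'id': 'INTEGER',
      'product': 'VARCHAR',
      'quantity': 'INTEGER',
      'price': 'DOUBLE',
      'date': 'DATE'
    })
")
```

Explicit schemas are worth the typing when auto-detection guesses wrong, for example an ID column that looks numeric but should stay text.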

From Parquet Files

DuckDB excels at reading Parquet files, which are columnar and compressed:

# Read Parquet file
dbExecute(con, "CREATE TABLE data AS SELECT * FROM read_parquet('data.parquet')")

# Read multiple files at once
dbExecute(con, "CREATE TABLE data AS SELECT * FROM read_parquet('*.parquet')")

Parquet is the recommended format for large analytical datasets. DuckDB can read Parquet files directly from S3 or other cloud storage without downloading the entire file.
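Remote reads go through the httpfs extension; a sketch (the bucket path is a placeholder, and this requires network access):

```r
library(duckdb)

con <- dbConnect(duckdb())

# httpfs adds HTTP(S) and S3 support; DuckDB fetches only the byte
# ranges it needs from the Parquet file, not the whole object
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")

# Placeholder URL: substitute a bucket and object of your own
remote_count <- dbGetQuery(con, "
  SELECT COUNT(*) AS n FROM read_parquet('s3://my-bucket/data.parquet')
")
```

For private buckets, supply credentials via DuckDB settings (or a CREATE SECRET statement in newer DuckDB versions) before querying.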

From R Data Frames

Expose R data frames to DuckDB, either as a zero-copy view or as a real table:

# Register a virtual table over the data frame (no data is copied)
duckdb_register(con, "sales", sales_df)

# Or copy the data into a DuckDB table
dbWriteTable(con, "sales_copy", sales_df)

# Now query it with SQL
result <- dbGetQuery(con, "SELECT * FROM sales LIMIT 10")

Registration is cheap because DuckDB reads the data frame in place; use dbWriteTable when the table should outlive the data frame or persist in a file-based database. Either way, this approach is useful when you need to join R data with external data sources.
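For example, a lookup table that exists only as an R data frame can be joined directly against a DuckDB table in one SQL statement (both sides are created inline here so the sketch is self-contained):

```r
library(duckdb)

con <- dbConnect(duckdb())

# Stand-in for a table loaded from a CSV or Parquet file
dbExecute(con, "CREATE TABLE sales (product VARCHAR, quantity INTEGER)")
dbExecute(con, "INSERT INTO sales VALUES ('Widget A', 5), ('Widget B', 3)")

# Lookup data that only exists as an R data frame
products_df <- data.frame(
  product  = c("Widget A", "Widget B"),
  category = c("hardware", "hardware")
)
duckdb_register(con, "products", products_df)

# Join the registered frame against the DuckDB table
joined <- dbGetQuery(con, "
  SELECT s.product, s.quantity, p.category
  FROM sales s
  JOIN products p ON s.product = p.product
")
```

DuckDB scans the registered data frame in place, so there is no copy step before the join.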

Running SQL Queries

Basic Queries

Execute SQL queries and return results to R:

# Fetch all results
results <- dbGetQuery(con, "SELECT * FROM sales WHERE quantity > 10")

# Execute without returning results
dbExecute(con, "INSERT INTO sales VALUES (6, 'Widget D', 20, 24.99, '2026-03-11')")

Aggregations

DuckDB handles aggregations efficiently:

# Total sales by product
by_product <- dbGetQuery(con, "
  SELECT product, 
         SUM(quantity) as total_qty,
         SUM(quantity * price) as revenue
  FROM sales
  GROUP BY product
  ORDER BY revenue DESC
")

Window Functions

Use SQL window functions for complex analytics:

# Running totals and rankings
analysis <- dbGetQuery(con, "
  SELECT *,
         SUM(revenue) OVER (PARTITION BY product ORDER BY date) as running_total,
         RANK() OVER (PARTITION BY product ORDER BY revenue DESC) as revenue_rank
  FROM daily_sales
")

Using dplyr with DuckDB

The real power of DuckDB in R emerges when combined with dplyr syntax. This lets you write R code that translates to efficient SQL.

Basic Pipeline

# Connect to DuckDB with a sample dataset
con <- dbConnect(duckdb::duckdb(), ":memory:")
duckdb_read_csv(con, "sales", "sales_data.csv")

# Use dplyr syntax
sales_summary <- con %>% 
  tbl("sales") %>%
  group_by(product) %>%
  summarise(
    total_qty = sum(quantity),
    avg_price = mean(price),
    n_orders = n()
  ) %>%
  collect()

Calling collect() executes the query and returns the result as a tibble. Until then, the pipeline is a lazy query: dbplyr accumulates the verbs and translates them into a single SQL statement that DuckDB can plan and optimize as a whole.
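You can inspect the SQL that dbplyr generates before anything runs by calling show_query() on the lazy object. A small sketch with an inline table:

```r
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb())
dbExecute(con, "CREATE TABLE sales (product VARCHAR, quantity INTEGER, price DOUBLE)")
dbExecute(con, "INSERT INTO sales VALUES ('Widget A', 12, 9.99), ('Widget B', 3, 4.99)")

# No query has run yet: this object is just a description of the computation
lazy <- tbl(con, "sales") %>%
  filter(quantity > 10) %>%
  summarise(total = sum(quantity, na.rm = TRUE))

show_query(lazy)         # prints the translated SQL without executing it

result <- collect(lazy)  # only now does DuckDB execute the query
```

Inspecting the generated SQL is a good habit when a dplyr pipeline behaves differently against DuckDB than it does against a local data frame.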

Materialize Results in DuckDB

Keep intermediate results inside DuckDB rather than pulling them into R:

# This generates SQL and runs it in DuckDB
large_result <- con %>%
  tbl("sales") %>%
  filter(quantity > 100) %>%
  mutate(revenue = quantity * price) %>%
  group_by(product) %>%
  summarise(across(where(is.numeric), sum)) %>%
  compute()  # Creates temporary table in DuckDB

The compute() function materializes the query as a temporary table in DuckDB, useful for complex multi-step transformations.

Performance Tips

Use Indexes

Create indexes for columns used in selective filters. DuckDB's columnar scans are already fast, so indexes help most with highly selective point lookups rather than full-table aggregations:

dbExecute(con, "CREATE INDEX idx_sales_product ON sales(product)")
dbExecute(con, "CREATE INDEX idx_sales_date ON sales(date)")

Batch Inserts

For large data imports, insert in bulk with a single statement:

# Much faster than row-by-row inserts
dbExecute(con, "INSERT INTO sales SELECT * FROM read_csv_auto('large_file.csv')")

Memory Management

Control DuckDB memory usage:

# Set a memory limit (accepts human-readable sizes such as '4GB')
dbExecute(con, "SET memory_limit='4GB'")

# Check current usage
dbGetQuery(con, "SELECT * FROM duckdb_memory()")

Working with Multiple Data Sources

DuckDB can join data from different file formats simultaneously:

# Join CSV with Parquet
combined <- dbGetQuery(con, "
  SELECT s.*, c.customer_name, c.region
  FROM read_csv_auto('sales.csv') s
  JOIN read_parquet('customers.parquet') c
    ON s.customer_id = c.id
  WHERE s.date >= '2026-01-01'
")

This is powerful for analytics workflows that combine multiple data sources without loading them into R.

Example: Analyzing Website Logs

A practical use case is analyzing web server logs:

# Assume you have Apache logs in CSV format
logs <- dbConnect(duckdb::duckdb(), "logs.duckdb")
duckdb_read_csv(logs, "access_log", "access_log.csv")

# Analyze traffic patterns
traffic <- dbGetQuery(logs, "
  SELECT 
    date_trunc('hour', timestamp) as hour,
    path,
    COUNT(*) as hits,
    AVG(response_time) as avg_time
  FROM access_log
  WHERE status = 200
  GROUP BY 1, 2
  ORDER BY hits DESC
  LIMIT 20
")

See Also