R and Databases: Advanced Patterns
This tutorial covers advanced patterns for working with databases in R. You’ll learn connection pooling, parameterized queries, transaction management, and how to work across multiple database backends.
What you’ll learn
This tutorial covers the key concepts and practical techniques for working with R and Databases. By the end, you will know how to apply the core functions in real data analysis workflows.
Connection pooling
Connection pooling prevents the overhead of establishing a new database connection for each query. The pool package manages a reusable set of connections and handles connection health checks automatically.
library(DBI)
library(pool)
# Create a connection pool with specific settings
pool <- dbPool(
drv = RMySQL::MySQL(),
host = "localhost",
dbname = "analytics",
user = "app_user",
password = Sys.getenv("DB_PASSWORD"),
minConnections = 2,
maxConnections = 10,
idleTimeout = 300 # seconds before idle connection closes
)
# Queries use pooled connections automatically
result <- dbGetQuery(pool, "SELECT * FROM users LIMIT 100")
# Always close the pool when done
poolClose(pool)
The pool maintains a set of active connections and reuses them. This is essential for Shiny apps and long-running scripts where establishing connections becomes a bottleneck. The pool also handles connection timeouts and automatically recreates failed connections.
Parameterized queries
Never interpolate user input directly into SQL. Use parameterized queries to prevent SQL injection attacks and improve query performance through query plan caching:
# Unsafe - DON'T DO THIS THIS IS A SECURITY RISK
query <- paste0("SELECT * FROM users WHERE id = ", user_id)
# Safe - use parameterized queries
query <- "SELECT * FROM users WHERE id = ?"
result <- dbGetQuery(pool, query, params = list(user_id))
For dbplyr, use the local() function to safely embed R vectors in generated SQL:
library(dplyr)
user_ids <- c(1, 2, 3, 4, 5)
users <- tbl(pool, "users") %>%
filter(id %in% local(user_ids)) %>%
collect()
The local() function evaluates the R vector locally and sends the values safely to the database as a temporary inline table. This approach works well for small to medium-sized vectors.
Transaction management
Wrap multiple operations in a transaction to ensure atomicity. Either all operations succeed together, or none do:
# Start a transaction
dbBegin(pool)
tryCatch({
# First operation: update order status
dbExecute(pool, "UPDATE orders SET status = 'shipped' WHERE id = ?",
params = list(order_id))
# Second operation: record shipping date
dbExecute(pool, "UPDATE orders SET shipped_at = NOW() WHERE id = ?",
params = list(order_id))
# Third operation: create shipment record
dbExecute(pool, "INSERT INTO shipments (order_id, created_at) VALUES (?, NOW())",
params = list(order_id))
# Commit if all succeed
dbCommit(pool)
message("Order ", order_id, " shipped successfully")
}, error = function(e) {
# Rollback on any error
dbRollback(pool)
stop("Failed to ship order: ", e$message)
})
Transactions are critical when updating multiple related tables. They ensure data consistency even if a query fails partway through.
Working with multiple databases
Use a configuration approach to manage connections to different database backends:
# Define database configurations
configs <- list(
production = list(
drv = RPostgres::Postgres(),
host = "prod-db.example.com",
dbname = "main",
port = 5432
),
analytics = list(
drv = RMySQL::MySQL(),
host = "analytics-db.example.com",
dbname = "warehouse",
port = 3306
),
warehouse = list(
drv = duckdb::duckdb(),
dbdir = "/data/warehouse.duckdb"
)
)
# Helper to create connections
connect_db <- function(env = c("production", "analytics", "warehouse")) {
env <- match.arg(env)
cfg <- configs[[env]]
do.call(dbConnect, cfg)
}
# Query different databases in the same pipeline
prod_con <- connect_db("production")
analytics_con <- connect_db("analytics")
# Pull from production, join with analytics data
users <- tbl(prod_con, "users") %>%
inner_join(tbl(analytics_con, "activity"), by = "id") %>%
filter(date >= "2025-01-01") %>%
group_by(id) %>%
summarise(total_activity = n()) %>%
collect()
This pattern enables polyglot data access—using the right database for each task while maintaining a consistent R interface.
Async queries with dbplyr
For long-running queries, use the future package with dbplyr for non-blocking execution:
library(future)
library(promises)
plan(multisession)
# Execute query in background
query_future <- future({
tbl(pool, "large_table") %>%
filter(date > "2025-01-01") %>%
group_by(category) %>%
summarise(count = n(), .groups = "drop") %>%
collect()
})
# Do other work here while query runs...
# Retrieve results when ready
result <- value(query_future)
This pattern is valuable for Shiny apps where you don’t want the UI to freeze during complex queries. Combine with reactlog to visualize reactive dependencies.
Error handling best practices
Always handle database errors gracefully:
safe_query <- function(con, query, ...) {
tryCatch(
dbGetQuery(con, query, ...),
error = function(e) {
warning("Query failed: ", e$message)
NULL
}
)
}
# Check connection health before queries
is_connected <- function(con) {
tryCatch(
dbExecute(con, "SELECT 1") > 0,
error = function(e) FALSE
)
}
Connection pooling with pool
For Shiny apps and APIs with multiple concurrent users, the pool package manages database connections efficiently. dbPool() creates a pool that checks out connections from a shared set rather than opening a new connection per request. This is essential for Shiny, without a pool, each user session would open and close connections, potentially exhausting the database’s connection limit.
Advanced query patterns
SQL window functions become powerful in R workflows: RANK() OVER (PARTITION BY dept ORDER BY salary DESC) computes within-department salary ranks. In dbplyr: group_by(dept) |> mutate(rank = row_number(desc(salary))). Recursive CTEs handle hierarchical data (org charts, bill of materials). DuckDB and modern PostgreSQL support both efficiently.
Database migrations
Schema changes in production databases require care. The dm package models database schemas in R and generates migration SQL. For simple schemas, dbCreateTable(), dbAddColumn(), and manual SQL via dbExecute() handle migrations. Always back up before migrating. Apply migrations transactionally: run in a transaction so a partial failure does not leave the schema in an inconsistent state.
Full-Text search
PostgreSQL’s full-text search (tsvector, tsquery) is accessible from R via raw SQL in dbGetQuery(). CREATE INDEX ... USING gin(to_tsvector(...)) creates an index; WHERE to_tsvector(content) @@ to_tsquery('search term') queries it. For SQLite, FTS5 provides similar functionality. For complex search needs, external search engines (Elasticsearch, OpenSearch) accessed via httr2 provide better relevance ranking and faceting.
Transactions
Transactions group multiple SQL operations atomically, either all succeed or all are rolled back. DBI::dbBegin(con) starts a transaction. DBI::dbCommit(con) commits. DBI::dbRollback(con) rolls back. Use tryCatch() to roll back on error:
DBI::dbBegin(con)
tryCatch({
dbExecute(con, "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
dbExecute(con, "UPDATE accounts SET balance = balance + 100 WHERE id = 2")
DBI::dbCommit(con)
}, error = function(e) {
DBI::dbRollback(con)
stop(e)
})
Without transactions, a failure partway through a multi-step operation leaves the database in an inconsistent state.
Schema management
For production databases, manage schema changes with migration files rather than running DDL statements manually. The dbmate CLI tool (language-agnostic) or DBImigrateR manage numbered migration files. Each migration has an up and down SQL block; running migrations applies unapplied ones in order.
In development, DBI::dbCreateTable(con, "table_name", df_prototype) creates a table matching the column types of a data frame. DBI::dbWriteTable(con, "table", df, append = TRUE) inserts rows into an existing table; overwrite = TRUE drops and recreates.
DBI::dbListTables(con) lists all tables. DBI::dbListFields(con, "table") lists column names. DBI::dbExistsTable(con, "table") checks if a table exists. These introspection functions let code adapt to the current schema.
Query performance
EXPLAIN ANALYZE <query> (PostgreSQL syntax) returns the query execution plan with actual timing. Run this via dbGetQuery(con, "EXPLAIN ANALYZE SELECT ...") to identify slow operations. Common bottlenecks: full table scans on large tables (add an index), inefficient joins (verify join columns are indexed), and returning more data than needed (add WHERE filters and SELECT specific columns).
Indexes in R: dbExecute(con, "CREATE INDEX idx_user_id ON orders(user_id)"). Index maintenance has overhead, each insert/update/delete updates all indexes on that table. Create indexes only on columns you filter or join on frequently.
For analytics queries (aggregations, wide scans), DuckDB is faster than PostgreSQL because it uses columnar storage. Switch the connection driver to duckdb::duckdb() and point it at a DuckDB file or an in-memory database for analytical workloads.
Beyond basic querying
Basic database interaction in R covers connecting, reading tables, and writing simple queries. Advanced usage covers the patterns that matter for production applications: connection pooling for high-concurrency scenarios, transactions for operations that must succeed or fail atomically, schema management for evolving database structures, and performance tuning for queries that are unacceptably slow.
These techniques matter most when R is not just reading data for analysis but participating in a data pipeline or serving data for an application. Shiny apps that handle many concurrent users need connection pooling. ETL processes that update multiple tables atomically need transactions. Applications that run in production for months need schema migration tools.
Transactions for data integrity
A transaction groups multiple database operations into a unit that either completes entirely or rolls back entirely. Without transactions, a failure halfway through a multi-step update leaves the database in an inconsistent intermediate state. With a transaction, the database guarantees that either all changes are committed or none are, preserving consistency.
In DBI, transactions are managed with dbBegin, dbCommit, and dbRollback. Wrapping a series of dbExecute calls with these transaction controls ensures atomicity. The idiomatic pattern uses tryCatch to catch errors and call dbRollback in the error handler. When the wrapped operations succeed, dbCommit makes the changes permanent. When they fail, dbRollback undoes any partial changes.
Parameterized queries at scale
For batch inserts of many rows, executing one parameterized query per row is much slower than using DBI’s batch insert capabilities. The dbAppendTable function inserts an entire data frame in a single operation that the database driver translates into an efficient bulk insert. For updates that modify specific rows based on a key, parameterized queries with vectorized parameter binding execute the query once per parameter set but can be much faster than individual calls when the database driver batches them.
Connection-level settings affect query performance. Setting the application_name connection parameter helps identify R sessions in database server logs. Adjusting the statement_timeout prevents long-running queries from blocking others. For read-heavy analytical queries, setting the transaction isolation level to READ COMMITTED or using read replicas reduces lock contention with transactional workloads.
Next steps
Now that you understand r and databases, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.
See also
- Database Queries with dbplyr — Basic database operations with dbplyr
- Using dtplyr: dplyr Syntax on data.table — Fast data manipulation
- Arrow and Parquet Files in R — Modern data file patterns