R and Databases: Advanced Patterns
This tutorial covers advanced patterns for working with databases in R. You’ll learn connection pooling, parameterized queries, transaction management, and how to work across multiple database backends.
Connection Pooling
Connection pooling prevents the overhead of establishing a new database connection for each query. The pool package manages a reusable set of connections and handles connection health checks automatically.
library(DBI)
library(pool)
# Create a connection pool with specific settings
pool <- dbPool(
  drv = RMariaDB::MariaDB(),  # RMariaDB is the maintained successor to RMySQL
  host = "localhost",
  dbname = "analytics",
  user = "app_user",
  password = Sys.getenv("DB_PASSWORD"),
  minSize = 2,        # pool's argument names are minSize/maxSize
  maxSize = 10,
  idleTimeout = 300   # seconds before an idle connection is closed
)
# Queries use pooled connections automatically
result <- dbGetQuery(pool, "SELECT * FROM users LIMIT 100")
# Always close the pool when done
poolClose(pool)
The pool maintains a set of active connections and reuses them. This is essential for Shiny apps and long-running scripts where establishing connections becomes a bottleneck. The pool also handles connection timeouts and automatically recreates failed connections.
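The lifecycle above can be sketched end to end with a self-contained example. This uses a temporary SQLite file in place of a real server (the table and values are illustrative); in a Shiny app you would create the pool once at the top level and register cleanup with `shiny::onStop(function() poolClose(pool))`:

```r
library(DBI)
library(pool)

# A temporary SQLite file stands in for a real database server here
pool <- dbPool(RSQLite::SQLite(), dbname = tempfile(fileext = ".sqlite"))

dbExecute(pool, "CREATE TABLE users (id INTEGER, name TEXT)")
dbExecute(pool, "INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")

# Each call transparently checks a connection out and returns it
dbGetQuery(pool, "SELECT COUNT(*) AS n FROM users")

poolClose(pool)
```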
Parameterized Queries
Never interpolate user input directly into SQL. Use parameterized queries to prevent SQL injection attacks and improve query performance through query plan caching:
# Unsafe - DON'T DO THIS; it is vulnerable to SQL injection
query <- paste0("SELECT * FROM users WHERE id = ", user_id)
# Safe - use parameterized queries
query <- "SELECT * FROM users WHERE id = ?"
result <- dbGetQuery(pool, query, params = list(user_id))
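Parameter binding also extends to batches: when params is a list of equal-length vectors, DBI executes the statement once per element, which is a safe way to bulk-insert without string-building. A runnable sketch against an in-memory SQLite database (table and values are illustrative):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE users (id INTEGER, name TEXT)")

# One prepared INSERT, executed once per row of the bound vectors
dbExecute(
  con,
  "INSERT INTO users (id, name) VALUES (?, ?)",
  params = list(c(1L, 2L, 3L), c("ada", "grace", "edsger"))
)

dbGetQuery(con, "SELECT name FROM users WHERE id = ?", params = list(2L))
dbDisconnect(con)
```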
For dbplyr, a local R vector can be embedded safely in the generated SQL with the !! (bang-bang) operator, which forces the vector to be evaluated in R rather than translated to SQL:
library(dplyr)
user_ids <- c(1, 2, 3, 4, 5)
users <- tbl(pool, "users") %>%
  filter(id %in% !!user_ids) %>%
  collect()
The !! operator evaluates the R vector locally, and dbplyr inlines the values into the query as a literal IN (...) list. This approach works well for small to medium-sized vectors.
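For larger vectors, a very long IN (...) list gets unwieldy. dbplyr (2.2.0 and later) offers copy_inline(), which turns a local data frame into an inline SQL table you can semi-join against. A runnable sketch with SQLite (the table and column names are illustrative):

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "users", data.frame(id = 1:10, score = (1:10) * 10))

# copy_inline() embeds the local data frame as an inline table in the
# generated SQL, avoiding a long IN (...) list
wanted <- data.frame(id = c(2L, 5L, 7L))
tbl(con, "users") %>%
  semi_join(copy_inline(con, wanted), by = "id") %>%
  collect()

DBI::dbDisconnect(con)
```

For vectors too large even for that, copying to a temporary table with copy_to() is the usual next step.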
Transaction Management
Wrap multiple operations in a transaction to ensure atomicity. Either all operations succeed together, or none do:
# Transactions need a single dedicated connection: check one out of the
# pool rather than calling dbBegin() on the pool object itself
con <- poolCheckout(pool)
dbBegin(con)
tryCatch({
  # First operation: update order status
  dbExecute(con, "UPDATE orders SET status = 'shipped' WHERE id = ?",
            params = list(order_id))
  # Second operation: record shipping date
  dbExecute(con, "UPDATE orders SET shipped_at = NOW() WHERE id = ?",
            params = list(order_id))
  # Third operation: create shipment record
  dbExecute(con, "INSERT INTO shipments (order_id, created_at) VALUES (?, NOW())",
            params = list(order_id))
  # Commit if all succeed
  dbCommit(con)
  message("Order ", order_id, " shipped successfully")
}, error = function(e) {
  # Rollback on any error
  dbRollback(con)
  stop("Failed to ship order: ", e$message)
}, finally = {
  # Always return the connection to the pool
  poolReturn(con)
})
Transactions are critical when updating multiple related tables. They ensure data consistency even if a query fails partway through.
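The pool package also ships a convenience wrapper, poolWithTransaction(), which checks a connection out, runs your callback inside BEGIN/COMMIT, rolls back automatically if the callback throws, and returns the connection to the pool. A runnable sketch with SQLite (the accounts table is illustrative):

```r
library(DBI)
library(pool)

pool <- dbPool(RSQLite::SQLite(), dbname = tempfile(fileext = ".sqlite"))
dbExecute(pool, "CREATE TABLE accounts (id INTEGER, balance REAL)")
dbExecute(pool, "INSERT INTO accounts VALUES (1, 100), (2, 0)")

# Both updates commit together, or neither does
poolWithTransaction(pool, function(con) {
  dbExecute(con, "UPDATE accounts SET balance = balance - 40 WHERE id = 1")
  dbExecute(con, "UPDATE accounts SET balance = balance + 40 WHERE id = 2")
})

dbGetQuery(pool, "SELECT id, balance FROM accounts ORDER BY id")
poolClose(pool)
```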
Working with Multiple Databases
Use a configuration approach to manage connections to different database backends:
# Define database configurations
configs <- list(
  production = list(
    drv = RPostgres::Postgres(),
    host = "prod-db.example.com",
    dbname = "main",
    port = 5432
  ),
  analytics = list(
    drv = RMariaDB::MariaDB(),  # maintained successor to RMySQL
    host = "analytics-db.example.com",
    dbname = "warehouse",
    port = 3306
  ),
  warehouse = list(
    drv = duckdb::duckdb(),
    dbdir = "/data/warehouse.duckdb"
  )
)
# Helper to create connections
connect_db <- function(env = c("production", "analytics", "warehouse")) {
  env <- match.arg(env)
  cfg <- configs[[env]]
  do.call(dbConnect, cfg)
}
# Query different databases in the same pipeline
prod_con <- connect_db("production")
analytics_con <- connect_db("analytics")
# Pull from production, join with analytics data. dbplyr cannot join
# tbls that live on different connections, so collect one side first
# and let dplyr copy it into the other database (copy = TRUE)
activity <- tbl(analytics_con, "activity") %>%
  filter(date >= "2025-01-01") %>%
  collect()

users <- tbl(prod_con, "users") %>%
  inner_join(activity, by = "id", copy = TRUE) %>%
  group_by(id) %>%
  summarise(total_activity = n()) %>%
  collect()
This pattern enables polyglot data access—using the right database for each task while maintaining a consistent R interface.
Async Queries with dbplyr
For long-running queries, use the future package with dbplyr for non-blocking execution:
library(future)
library(promises)
plan(multisession)
# Execute the query in a background R process. DBI connections cannot
# be shared across processes, so open a fresh one inside the future
query_future <- future({
  con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "analytics")
  on.exit(DBI::dbDisconnect(con))
  tbl(con, "large_table") %>%
    filter(date > "2025-01-01") %>%
    group_by(category) %>%
    summarise(count = n(), .groups = "drop") %>%
    collect()
})
# Do other work here while query runs...
# Retrieve results when ready
result <- value(query_future)
This pattern is valuable for Shiny apps where you don’t want the UI to freeze during complex queries. Combine with reactlog to visualize reactive dependencies.
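In a Shiny server you would typically chain the promise rather than block on value(). A hedged sketch with promises::future_promise() (the %...>% pipe delivers the result to a callback once the worker finishes; the SELECT 1 query is a stand-in for real work, and the connection is again opened inside the worker):

```r
library(future)
library(promises)
plan(multisession)

# The promise resolves when the background query finishes; the UI
# thread is never blocked while it runs
p <- future_promise({
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, "SELECT 1 AS ok")
}) %...>%
  (function(df) message("query returned ", nrow(df), " row(s)"))
```

In a render function you would simply return the promise; Shiny knows how to wait for it.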
Error Handling Best Practices
Always handle database errors gracefully:
safe_query <- function(con, query, ...) {
  tryCatch(
    dbGetQuery(con, query, ...),
    error = function(e) {
      warning("Query failed: ", e$message)
      NULL
    }
  )
}
# Check connection health before queries. A round-trip query verifies the
# server is actually reachable (DBI::dbIsValid only checks the handle)
is_connected <- function(con) {
  tryCatch(
    nrow(dbGetQuery(con, "SELECT 1")) == 1,
    error = function(e) FALSE
  )
}
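For transient failures such as dropped connections or deadlocks, a small retry wrapper with exponential backoff is often worth the extra lines. The helper below is a sketch: query_with_retry and its defaults are illustrative names, not part of DBI.

```r
library(DBI)

# Hypothetical helper: retry a query a few times, doubling the wait
# between attempts, before giving up
query_with_retry <- function(con, query, retries = 3, wait = 0.5, ...) {
  for (attempt in seq_len(retries)) {
    result <- tryCatch(dbGetQuery(con, query, ...), error = identity)
    if (!inherits(result, "error")) return(result)
    if (attempt == retries) {
      stop("Query failed after ", retries, " attempts: ",
           conditionMessage(result))
    }
    Sys.sleep(wait * 2^(attempt - 1))
  }
}

con <- dbConnect(RSQLite::SQLite(), ":memory:")
query_with_retry(con, "SELECT 1 AS ok")
dbDisconnect(con)
```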
See Also
- Database Queries with dbplyr — Basic database operations with dbplyr
- Using dtplyr: dplyr Syntax on data.table — Fast data manipulation
- Arrow and Parquet Files in R — Modern data file patterns