R and Databases: Advanced Patterns

· 4 min read · Updated March 17, 2026 · advanced
r databases dbplyr dplyr data

This tutorial covers advanced patterns for working with databases in R. You’ll learn connection pooling, parameterized queries, transaction management, and how to work across multiple database backends.

Connection Pooling

Connection pooling avoids the overhead of establishing a new database connection for each query. The pool package manages a reusable set of connections and handles connection health checks automatically.

library(DBI)
library(pool)

# Create a connection pool with specific settings
pool <- dbPool(
  drv = RMariaDB::MariaDB(),  # maintained successor to the legacy RMySQL driver
  host = "localhost",
  dbname = "analytics",
  user = "app_user",
  password = Sys.getenv("DB_PASSWORD"),
  minSize = 2,       # connections kept open at all times
  maxSize = 10,      # upper bound under load
  idleTimeout = 300  # seconds before an idle connection closes
)

# Queries use pooled connections automatically
result <- dbGetQuery(pool, "SELECT * FROM users LIMIT 100")

# Always close the pool when done
poolClose(pool)

The pool maintains a set of active connections and reuses them. This is essential for Shiny apps and long-running scripts where establishing connections becomes a bottleneck. The pool also handles connection timeouts and automatically recreates failed connections.
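As a runnable sketch (assuming the RSQLite package is available, so no server is needed), a pool behaves like a single DBI connection for most operations. A file-backed SQLite database is used here because each in-memory SQLite connection would see its own separate database:

```r
library(DBI)
library(pool)

# Temporary SQLite file shared by all pooled connections
db_file <- tempfile(fileext = ".sqlite")
pool <- dbPool(RSQLite::SQLite(), dbname = db_file)

# The pool object can be used wherever a DBI connection is expected
dbWriteTable(pool, "users", data.frame(id = 1:3, name = c("ann", "bob", "cy")))
dbGetQuery(pool, "SELECT COUNT(*) AS n FROM users")

poolClose(pool)
```

Swapping the SQLite driver for your production driver is the only change needed; the rest of the pool API is backend-agnostic.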

Parameterized Queries

Never interpolate user input directly into SQL. Use parameterized queries to prevent SQL injection attacks and improve query performance through query plan caching:

# Unsafe - DON'T DO THIS: SQL injection risk
query <- paste0("SELECT * FROM users WHERE id = ", user_id)

# Safe - use parameterized queries
query <- "SELECT * FROM users WHERE id = ?"
result <- dbGetQuery(pool, query, params = list(user_id))

For dbplyr, use the !! ("bang-bang") operator to force local evaluation and safely embed R vectors in the generated SQL:

library(dplyr)

user_ids <- c(1, 2, 3, 4, 5)
users <- tbl(pool, "users") %>%
  filter(id %in% !!user_ids) %>%
  collect()

The !! operator evaluates user_ids locally and inlines the values into the SQL as an IN clause (WHERE id IN (1, 2, 3, 4, 5)), rather than treating user_ids as a column name on the database side. This approach works well for small to medium-sized vectors; for very large vectors, copy the values into a temporary table and join against it instead.
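You can inspect exactly what dbplyr will send to the database with show_query(). A minimal sketch using dbplyr's lazy_frame(), which simulates a remote table so no database connection is needed:

```r
library(dplyr)
library(dbplyr)

user_ids <- c(1L, 2L, 3L)

# lazy_frame() builds a fake remote table purely for inspecting SQL
users <- lazy_frame(id = integer(), name = character(), .name = "users")

users %>%
  filter(id %in% !!user_ids) %>%
  show_query()
# The values are inlined as literals, roughly: ... WHERE (`id` IN (1, 2, 3))
```

This is a useful habit before running a new pipeline against production: the rendered SQL makes it obvious whether a vector was inlined or mistaken for a column.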

Transaction Management

Wrap multiple operations in a transaction to ensure atomicity. Either all operations succeed together, or none do:

# Check out a single connection so BEGIN, COMMIT, and ROLLBACK all run
# on it; calling dbBegin() on the pool itself is unsafe because each
# subsequent query may be routed to a different connection
con <- poolCheckout(pool)
dbBegin(con)

tryCatch({
  # First operation: update order status
  dbExecute(con, "UPDATE orders SET status = 'shipped' WHERE id = ?",
            params = list(order_id))
  
  # Second operation: record shipping date
  dbExecute(con, "UPDATE orders SET shipped_at = NOW() WHERE id = ?",
            params = list(order_id))
  
  # Third operation: create shipment record
  dbExecute(con, "INSERT INTO shipments (order_id, created_at) VALUES (?, NOW())",
            params = list(order_id))
  
  # Commit if all succeed
  dbCommit(con)
  message("Order ", order_id, " shipped successfully")
}, error = function(e) {
  # Roll back on any error
  dbRollback(con)
  stop("Failed to ship order: ", e$message)
}, finally = poolReturn(con))  # always return the connection to the pool

Transactions are critical when updating multiple related tables. They ensure data consistency even if a query fails partway through.
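When using the pool package specifically, poolWithTransaction() wraps this pattern for you: it checks out one connection, runs your callback between BEGIN and COMMIT, rolls back automatically if the callback errors, and returns the connection afterwards. A sketch against the same hypothetical orders and shipments tables:

```r
library(DBI)
library(pool)

ship_order <- function(pool, order_id) {
  poolWithTransaction(pool, function(conn) {
    # All statements run on the same checked-out connection
    dbExecute(conn, "UPDATE orders SET status = 'shipped' WHERE id = ?",
              params = list(order_id))
    dbExecute(conn, "INSERT INTO shipments (order_id, created_at) VALUES (?, NOW())",
              params = list(order_id))
  })
}
```

This removes the manual dbBegin/dbCommit/dbRollback/poolReturn bookkeeping entirely.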

Working with Multiple Databases

Use a configuration approach to manage connections to different database backends:

# Define database configurations
configs <- list(
  production = list(
    drv = RPostgres::Postgres(),
    host = "prod-db.example.com",
    dbname = "main",
    port = 5432
  ),
  analytics = list(
    drv = RMySQL::MySQL(),
    host = "analytics-db.example.com",
    dbname = "warehouse",
    port = 3306
  ),
  warehouse = list(
    drv = duckdb::duckdb(),
    dbdir = "/data/warehouse.duckdb"
  )
)

# Helper to create connections
connect_db <- function(env = c("production", "analytics", "warehouse")) {
  env <- match.arg(env)
  cfg <- configs[[env]]
  do.call(dbConnect, cfg)
}

# Query different databases in the same pipeline
prod_con <- connect_db("production")
analytics_con <- connect_db("analytics")

# dbplyr cannot join tables that live on two different connections
# inside the database, so pull each side down and join locally
# (fine as long as the results fit in memory)
users <- tbl(prod_con, "users") %>%
  collect()

activity <- tbl(analytics_con, "activity") %>%
  filter(date >= "2025-01-01") %>%
  group_by(id) %>%
  summarise(total_activity = n()) %>%
  collect()

result <- inner_join(users, activity, by = "id")

This pattern enables polyglot data access—using the right database for each task while maintaining a consistent R interface.
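When one side of the join is small, dplyr's copy = TRUE argument offers an alternative: per the dplyr join documentation, if x and y come from different sources, y is copied into x's source (as a temporary table) so the join runs inside one database. A sketch, assuming the same prod_con and analytics_con connections and that the user may create temporary tables:

```r
library(dplyr)

users <- tbl(prod_con, "users") %>%
  inner_join(tbl(analytics_con, "activity"),
             by = "id",
             copy = TRUE) %>%   # copies 'activity' into a temp table on prod
  filter(date >= "2025-01-01") %>%
  group_by(id) %>%
  summarise(total_activity = n()) %>%
  collect()
```

Prefer the collect-then-join approach when the foreign table is large, since copy = TRUE has to transfer the whole table through your R session.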

Async Queries with dbplyr

For long-running queries, use the future package with dbplyr for non-blocking execution:

library(future)
library(promises)

plan(multisession)

# Execute the query in a background R process. Database connections
# cannot be shared across processes, so the worker opens its own
# connection and closes it when done.
query_future <- future({
  library(dplyr)
  con <- DBI::dbConnect(RMariaDB::MariaDB(),
                        host = "localhost", dbname = "analytics",
                        user = "app_user",
                        password = Sys.getenv("DB_PASSWORD"))
  on.exit(DBI::dbDisconnect(con))
  tbl(con, "large_table") %>%
    filter(date > "2025-01-01") %>%
    group_by(category) %>%
    summarise(count = n(), .groups = "drop") %>%
    collect()
})

# Do other work here while query runs...

# Retrieve results when ready
result <- value(query_future)

This pattern is valuable for Shiny apps where you don’t want the UI to freeze during complex queries. Combine with the promises package to deliver the result to a reactive output without blocking.
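Inside a Shiny server function, return the promise chain instead of calling value(), which would block. A minimal sketch (the app.db file and events table are illustrative, and RSQLite stands in for a real server):

```r
library(shiny)
library(future)
library(promises)

plan(multisession)

server <- function(input, output, session) {
  output$summary <- renderTable({
    future({
      # The worker opens its own connection; connections don't cross processes
      con <- DBI::dbConnect(RSQLite::SQLite(), "app.db")  # hypothetical file
      on.exit(DBI::dbDisconnect(con))
      DBI::dbGetQuery(con,
        "SELECT category, COUNT(*) AS n FROM events GROUP BY category")
    }) %...>%
      head(10)   # runs when the query finishes; the UI stays responsive
  })
}
```

The %...>% operator from promises is the asynchronous analogue of %>%: each step runs only once the previous future resolves.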

Error Handling Best Practices

Always handle database errors gracefully:

safe_query <- function(con, query, ...) {
  tryCatch(
    dbGetQuery(con, query, ...),
    error = function(e) {
      warning("Query failed: ", e$message)
      NULL
    }
  )
}

# Check connection health before queries. dbIsValid() only verifies the
# R-side object, so send a trivial query to test the server round trip.
is_connected <- function(con) {
  tryCatch(
    nrow(dbGetQuery(con, "SELECT 1")) == 1,
    error = function(e) FALSE
  )
}
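Transient failures such as deadlocks or briefly dropped connections are often worth retrying rather than surfacing immediately. A small base-R sketch with exponential backoff; the helper name and defaults are illustrative:

```r
with_retry <- function(expr, max_tries = 3, base_wait = 0.5) {
  code <- substitute(expr)   # capture unevaluated so each attempt re-runs it
  env <- parent.frame()
  for (i in seq_len(max_tries)) {
    out <- tryCatch(list(value = eval(code, env)), error = function(e) e)
    if (!inherits(out, "error")) return(out$value)
    if (i == max_tries) stop(out)
    Sys.sleep(base_wait * 2^(i - 1))   # waits 0.5s, 1s, 2s, ...
  }
}

# result <- with_retry(dbGetQuery(con, "SELECT * FROM users"))
```

Reserve this for errors that are plausibly transient; retrying a syntax error or a constraint violation just triples the failure time.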

See Also