rguides

R and Databases: Advanced Patterns

This tutorial covers advanced patterns for working with databases in R. You’ll learn connection pooling, parameterized queries, transaction management, and how to work across multiple database backends.

What you’ll learn

This tutorial covers the key concepts and practical techniques for working with R and Databases. By the end, you will know how to apply the core functions in real data analysis workflows.

Connection pooling

Connection pooling prevents the overhead of establishing a new database connection for each query. The pool package manages a reusable set of connections and handles connection health checks automatically.

library(DBI)
library(pool)

# Create a connection pool with specific settings
pool <- dbPool(
  drv = RMySQL::MySQL(),
  host = "localhost",
  dbname = "analytics",
  user = "app_user",
  password = Sys.getenv("DB_PASSWORD"),
  minConnections = 2,
  maxConnections = 10,
  idleTimeout = 300  # seconds before idle connection closes
)

# Queries use pooled connections automatically
result <- dbGetQuery(pool, "SELECT * FROM users LIMIT 100")

# Always close the pool when done
poolClose(pool)

The pool maintains a set of active connections and reuses them. This is essential for Shiny apps and long-running scripts where establishing connections becomes a bottleneck. The pool also handles connection timeouts and automatically recreates failed connections.

Parameterized queries

Never interpolate user input directly into SQL. Use parameterized queries to prevent SQL injection attacks and improve query performance through query plan caching:

# Unsafe - DON'T DO THIS THIS IS A SECURITY RISK
query <- paste0("SELECT * FROM users WHERE id = ", user_id)

# Safe - use parameterized queries
query <- "SELECT * FROM users WHERE id = ?"
result <- dbGetQuery(pool, query, params = list(user_id))

For dbplyr, use the local() function to safely embed R vectors in generated SQL:

library(dplyr)

user_ids <- c(1, 2, 3, 4, 5)
users <- tbl(pool, "users") %>%
  filter(id %in% local(user_ids)) %>%
  collect()

The local() function evaluates the R vector locally and sends the values safely to the database as a temporary inline table. This approach works well for small to medium-sized vectors.

Transaction management

Wrap multiple operations in a transaction to ensure atomicity. Either all operations succeed together, or none do:

# Start a transaction
dbBegin(pool)

tryCatch({
  # First operation: update order status
  dbExecute(pool, "UPDATE orders SET status = 'shipped' WHERE id = ?",
            params = list(order_id))
  
  # Second operation: record shipping date
  dbExecute(pool, "UPDATE orders SET shipped_at = NOW() WHERE id = ?",
            params = list(order_id))
  
  # Third operation: create shipment record
  dbExecute(pool, "INSERT INTO shipments (order_id, created_at) VALUES (?, NOW())",
            params = list(order_id))
  
  # Commit if all succeed
  dbCommit(pool)
  message("Order ", order_id, " shipped successfully")
}, error = function(e) {
  # Rollback on any error
  dbRollback(pool)
  stop("Failed to ship order: ", e$message)
})

Transactions are critical when updating multiple related tables. They ensure data consistency even if a query fails partway through.

Working with multiple databases

Use a configuration approach to manage connections to different database backends:

# Define database configurations
configs <- list(
  production = list(
    drv = RPostgres::Postgres(),
    host = "prod-db.example.com",
    dbname = "main",
    port = 5432
  ),
  analytics = list(
    drv = RMySQL::MySQL(),
    host = "analytics-db.example.com",
    dbname = "warehouse",
    port = 3306
  ),
  warehouse = list(
    drv = duckdb::duckdb(),
    dbdir = "/data/warehouse.duckdb"
  )
)

# Helper to create connections
connect_db <- function(env = c("production", "analytics", "warehouse")) {
  env <- match.arg(env)
  cfg <- configs[[env]]
  do.call(dbConnect, cfg)
}

# Query different databases in the same pipeline
prod_con <- connect_db("production")
analytics_con <- connect_db("analytics")

# Pull from production, join with analytics data
users <- tbl(prod_con, "users") %>%
  inner_join(tbl(analytics_con, "activity"), by = "id") %>%
  filter(date >= "2025-01-01") %>%
  group_by(id) %>%
  summarise(total_activity = n()) %>%
  collect()

This pattern enables polyglot data access—using the right database for each task while maintaining a consistent R interface.

Async queries with dbplyr

For long-running queries, use the future package with dbplyr for non-blocking execution:

library(future)
library(promises)

plan(multisession)

# Execute query in background
query_future <- future({
  tbl(pool, "large_table") %>%
    filter(date > "2025-01-01") %>%
    group_by(category) %>%
    summarise(count = n(), .groups = "drop") %>%
    collect()
})

# Do other work here while query runs...

# Retrieve results when ready
result <- value(query_future)

This pattern is valuable for Shiny apps where you don’t want the UI to freeze during complex queries. Combine with reactlog to visualize reactive dependencies.

Error handling best practices

Always handle database errors gracefully:

safe_query <- function(con, query, ...) {
  tryCatch(
    dbGetQuery(con, query, ...),
    error = function(e) {
      warning("Query failed: ", e$message)
      NULL
    }
  )
}

# Check connection health before queries
is_connected <- function(con) {
  tryCatch(
    dbExecute(con, "SELECT 1") > 0,
    error = function(e) FALSE
  )
}

Connection pooling with pool

For Shiny apps and APIs with multiple concurrent users, the pool package manages database connections efficiently. dbPool() creates a pool that checks out connections from a shared set rather than opening a new connection per request. This is essential for Shiny, without a pool, each user session would open and close connections, potentially exhausting the database’s connection limit.

Advanced query patterns

SQL window functions become powerful in R workflows: RANK() OVER (PARTITION BY dept ORDER BY salary DESC) computes within-department salary ranks. In dbplyr: group_by(dept) |> mutate(rank = row_number(desc(salary))). Recursive CTEs handle hierarchical data (org charts, bill of materials). DuckDB and modern PostgreSQL support both efficiently.

Database migrations

Schema changes in production databases require care. The dm package models database schemas in R and generates migration SQL. For simple schemas, dbCreateTable(), dbAddColumn(), and manual SQL via dbExecute() handle migrations. Always back up before migrating. Apply migrations transactionally: run in a transaction so a partial failure does not leave the schema in an inconsistent state.

PostgreSQL’s full-text search (tsvector, tsquery) is accessible from R via raw SQL in dbGetQuery(). CREATE INDEX ... USING gin(to_tsvector(...)) creates an index; WHERE to_tsvector(content) @@ to_tsquery('search term') queries it. For SQLite, FTS5 provides similar functionality. For complex search needs, external search engines (Elasticsearch, OpenSearch) accessed via httr2 provide better relevance ranking and faceting.

Transactions

Transactions group multiple SQL operations atomically, either all succeed or all are rolled back. DBI::dbBegin(con) starts a transaction. DBI::dbCommit(con) commits. DBI::dbRollback(con) rolls back. Use tryCatch() to roll back on error:

DBI::dbBegin(con)
tryCatch({
  dbExecute(con, "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
  dbExecute(con, "UPDATE accounts SET balance = balance + 100 WHERE id = 2")
  DBI::dbCommit(con)
}, error = function(e) {
  DBI::dbRollback(con)
  stop(e)
})

Without transactions, a failure partway through a multi-step operation leaves the database in an inconsistent state.

Schema management

For production databases, manage schema changes with migration files rather than running DDL statements manually. The dbmate CLI tool (language-agnostic) or DBImigrateR manage numbered migration files. Each migration has an up and down SQL block; running migrations applies unapplied ones in order.

In development, DBI::dbCreateTable(con, "table_name", df_prototype) creates a table matching the column types of a data frame. DBI::dbWriteTable(con, "table", df, append = TRUE) inserts rows into an existing table; overwrite = TRUE drops and recreates.

DBI::dbListTables(con) lists all tables. DBI::dbListFields(con, "table") lists column names. DBI::dbExistsTable(con, "table") checks if a table exists. These introspection functions let code adapt to the current schema.

Query performance

EXPLAIN ANALYZE <query> (PostgreSQL syntax) returns the query execution plan with actual timing. Run this via dbGetQuery(con, "EXPLAIN ANALYZE SELECT ...") to identify slow operations. Common bottlenecks: full table scans on large tables (add an index), inefficient joins (verify join columns are indexed), and returning more data than needed (add WHERE filters and SELECT specific columns).

Indexes in R: dbExecute(con, "CREATE INDEX idx_user_id ON orders(user_id)"). Index maintenance has overhead, each insert/update/delete updates all indexes on that table. Create indexes only on columns you filter or join on frequently.

For analytics queries (aggregations, wide scans), DuckDB is faster than PostgreSQL because it uses columnar storage. Switch the connection driver to duckdb::duckdb() and point it at a DuckDB file or an in-memory database for analytical workloads.

Beyond basic querying

Basic database interaction in R covers connecting, reading tables, and writing simple queries. Advanced usage covers the patterns that matter for production applications: connection pooling for high-concurrency scenarios, transactions for operations that must succeed or fail atomically, schema management for evolving database structures, and performance tuning for queries that are unacceptably slow.

These techniques matter most when R is not just reading data for analysis but participating in a data pipeline or serving data for an application. Shiny apps that handle many concurrent users need connection pooling. ETL processes that update multiple tables atomically need transactions. Applications that run in production for months need schema migration tools.

Transactions for data integrity

A transaction groups multiple database operations into a unit that either completes entirely or rolls back entirely. Without transactions, a failure halfway through a multi-step update leaves the database in an inconsistent intermediate state. With a transaction, the database guarantees that either all changes are committed or none are, preserving consistency.

In DBI, transactions are managed with dbBegin, dbCommit, and dbRollback. Wrapping a series of dbExecute calls with these transaction controls ensures atomicity. The idiomatic pattern uses tryCatch to catch errors and call dbRollback in the error handler. When the wrapped operations succeed, dbCommit makes the changes permanent. When they fail, dbRollback undoes any partial changes.

Parameterized queries at scale

For batch inserts of many rows, executing one parameterized query per row is much slower than using DBI’s batch insert capabilities. The dbAppendTable function inserts an entire data frame in a single operation that the database driver translates into an efficient bulk insert. For updates that modify specific rows based on a key, parameterized queries with vectorized parameter binding execute the query once per parameter set but can be much faster than individual calls when the database driver batches them.

Connection-level settings affect query performance. Setting the application_name connection parameter helps identify R sessions in database server logs. Adjusting the statement_timeout prevents long-running queries from blocking others. For read-heavy analytical queries, setting the transaction isolation level to READ COMMITTED or using read replicas reduces lock contention with transactional workloads.

Next steps

Now that you understand r and databases, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.

See also