
Pipeline Automation with targets

A 30-second pipeline with five targets is easy to babysit by hand. The same pipeline at 500 targets, scheduled for 2am, with three collaborators pushing to the same branch, is where pipeline automation in R pays for itself. The targets package gives you the dependency graph. This guide covers the rest of pipeline automation: controllers for parallel and cluster execution, static and dynamic branching, error strategies for batch jobs, watch mode for long runs, renv for reproducibility, and CI for unattended rebuilds. It picks up where the general targets primer leaves off and focuses on the operational side. For the canonical function reference, see the targets manual on rOpenSci and the tarchetypes reference.

Project layout and the _targets.R contract

targets reads one file by default: _targets.R. That script must end with a list of target objects created by tar_target(). Put tar_option_set() at the top of the file, before any tar_target(), so option overrides apply to every target that follows.

# _targets.R
library(targets)

tar_option_set(
  packages = c("readr", "dplyr", "ggplot2"),
  format = "qs"
)

tar_source()  # sources every .R file in R/

list(
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, read_csv(file, col_types = cols())),
  tar_target(model, lm(Ozone ~ Temp, data)),
  tar_target(plot, ggplot(data) + geom_point(aes(Temp, Ozone)))
)

A few constraints are easy to miss. tar_make() runs _targets.R in a fresh R process via callr, so .Rprofile globals and your interactive environment are invisible. Anything a target command needs must be loaded inside _targets.R itself, typically through tar_source(), explicit source() calls, or tar_option_set(packages = ...). Function dependencies are tracked by source hash, so editing a helper in R/ reruns every target that calls it. Pin package versions with renv (covered below) and treat the _targets/ directory as a build artifact: do not commit it, but do back it up if reruns are expensive. The fresh-process model is a feature, not a bug. It is also why debugging “works on my machine” requires running tar_make() from a clean session before you trust the result.
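
The function-tracking rule above can be checked in a throwaway directory. The following is a minimal sketch, not a recommended project layout: the helper double_it() and the target names raw and doubled are hypothetical, and the pipeline is written with tar_script() purely so the example is self-contained.

```r
library(targets)

# Work in a throwaway directory so the demo never touches a real project.
demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

# First version of the pipeline; double_it() is a hypothetical helper.
tar_script({
  double_it <- function(x) x * 2
  list(
    tar_target(raw, c(1, 2, 3)),
    tar_target(doubled, double_it(raw))
  )
}, ask = FALSE)

tar_make()                     # runs _targets.R in a fresh R process
after_build <- tar_outdated()  # checked right after a successful build

# Second version: only the body of the helper changes.
tar_script({
  double_it <- function(x) x * 2 + 0
  list(
    tar_target(raw, c(1, 2, 3)),
    tar_target(doubled, double_it(raw))
  )
}, ask = FALSE)

after_edit <- tar_outdated()   # targets that the edited helper invalidates

print(after_build)
print(after_edit)
setwd(old_wd)
```

Only the target that calls the edited helper should be reported after the edit; raw has the same command and no dependency on double_it(), so it stays up to date.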

Choosing a controller

For a single machine, crew_controller_local(workers = N) is the right default. tar_make() is wired to crew by default (use_crew = TRUE), so once a controller is set in tar_option_set(), parallel scheduling happens automatically.

library(crew)
tar_option_set(controller = crew_controller_local(workers = 4))

For HPC, the crew.cluster package provides crew_controller_slurm(), crew_controller_sge(), crew_controller_lsf(), and crew_controller_pbs(). crew_controller_group() mixes controllers, and per-target routing uses tar_resources(crew = tar_resources_crew(controller = "name")). The set of cluster controllers varies by crew.cluster version; pin to crew >= 0.3.0 and targets >= 1.2.0 to keep the API stable. For autoscaling, tune seconds_idle, tasks_max, and seconds_wall on the controller; the defaults are conservative for shared clusters where idle workers hold resources that other jobs need. For cloud bursting, the crew.aws.batch package provides a crew_controller_aws_batch() controller that launches AWS Batch workers on demand.
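
As a sketch of controller groups and per-target routing, the fragment below uses two local controllers so it can be tried without a cluster. The controller names "light" and "heavy", the worker counts, the idle timeout, and the two targets are all illustrative assumptions; on HPC, one of the crew.cluster controllers would take the place of a local one.

```r
# _targets.R (fragment; names and numbers are illustrative)
library(targets)
library(crew)

tar_option_set(
  controller = crew_controller_group(
    crew_controller_local(name = "light", workers = 2, seconds_idle = 30),
    crew_controller_local(name = "heavy", workers = 1, seconds_idle = 30)
  ),
  # Default route: targets go to "light" unless they say otherwise.
  resources = tar_resources(
    crew = tar_resources_crew(controller = "light")
  )
)

list(
  tar_target(small_job, sqrt(16)),
  tar_target(
    big_job,
    sum(as.numeric(seq_len(1e6))),
    # Per-target override: send this target to the "heavy" controller.
    resources = tar_resources(
      crew = tar_resources_crew(controller = "heavy")
    )
  )
)
```

Setting a default route in tar_option_set() keeps the per-target overrides limited to the few targets that actually need different workers.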

Static and dynamic branching

tar_map() is the static version. You give it a tibble of parameter combinations and it emits one target per row, with a stable suffix derived from the column values.

library(tarchetypes)

tar_map(
  values = tibble::tibble(
    method  = c("lm", "rf", "gbm"),
    n_trees = c(NA, 500, 200)
  ),
  tar_target(model, fit(method = method, n_trees = n_trees, data))
)
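
To see which target names a tar_map() call generates before running anything, tar_manifest() can be pointed at the script. This is a sketch under assumptions: fit() is a stand-in helper, the data target uses the built-in airquality data set, and names = "method" is added so the suffix comes from the method column only.

```r
library(targets)

demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  library(tarchetypes)
  # Stand-in for a real model-fitting helper (hypothetical).
  fit <- function(method, n_trees, data) {
    list(method = method, n_trees = n_trees, rows = nrow(data))
  }
  list(
    tar_target(data, datasets::airquality),
    tar_map(
      values = tibble::tibble(
        method  = c("lm", "rf", "gbm"),
        n_trees = c(NA, 500, 200)
      ),
      names = "method",  # build suffixes from this column only
      tar_target(model, fit(method = method, n_trees = n_trees, data))
    )
  )
}, ask = FALSE)

# Lists every target and its substituted command without building anything.
manifest <- tar_manifest()
print(manifest)
setwd(old_wd)
```

The manifest also shows the literal values substituted into each command, which is a quick way to catch a mistyped parameter grid.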

Dynamic branching is the pattern = argument on a single target. The pattern is a small language for slicing and combining upstream branches:

  • pattern = map(x) runs one branch per element of x
  • pattern = map(x, y) zips x and y into tuples
  • pattern = cross(x, y) produces the Cartesian product
  • pattern = slice(x, index = c(3, 4)) picks branches by index
  • pattern = sample(x, n) draws a random subset of n branches

The iteration argument controls how branches are aggregated. "vector" (default) uses vctrs::vec_c(), "list" uses list(), and "group" requires a tar_group column in the upstream data frame. Use "list" for objects with attributes (models, plots) where vec_c() would silently drop structure. Use tar_pattern() to preview the branch layout without running anything; it returns a tibble of the suffixes. For a tour of the map family that this mirrors, see purrr functional programming.
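
The pattern language can be explored with tar_pattern() alone, with no pipeline and no data store. In this sketch the names x, y, and z and their branch counts are hypothetical; each named integer tells tar_pattern() how many branches the corresponding upstream target would have.

```r
library(targets)

# Each named integer is a pretend branch count for an upstream target.
zipped  <- tar_pattern(map(x, y), x = 3, y = 3)                  # pairs x and y
crossed <- tar_pattern(cross(x, y), x = 2, y = 3)                # all combinations
mixed   <- tar_pattern(cross(x, map(y, z)), x = 2, y = 3, z = 3) # nested patterns

print(zipped)
print(crossed)

# One row per emulated branch.
branch_counts <- c(
  zipped  = nrow(zipped),
  crossed = nrow(crossed),
  mixed   = nrow(mixed)
)
print(branch_counts)
```

Each row of the returned tibble names the upstream branches that one downstream branch would receive.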

Errors, retries, and watch mode

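tar_target() has an error argument with five values. "stop" (default) halts the run. "continue" keeps dependent targets running on the stale upstream value. "null" returns NULL from the failed target and feeds NULL downstream on reload. "abridge" lets in-flight targets finish but launches no new ones. "trim" behaves like "abridge" except that queued targets may still start when they are neither downstream of the error nor sibling branches of the dynamic branch that failed. That makes "trim" a reasonable pick for nightly batch jobs: unaffected regions of the graph keep building and you get partial results. If sibling branches must also run past a failed branch, "null" is the closer fit.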

tar_option_set(error = "trim")
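
The effect of a non-default error strategy is easiest to see on a tiny pipeline with one deliberately failing target. The sketch below uses error = "continue" because its behaviour is the simplest to reason about; the target names and the simulated failure are hypothetical, and swapping in another strategy is a quick way to compare them.

```r
library(targets)

demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  tar_option_set(error = "continue")
  list(
    tar_target(ok, 1 + 1),
    tar_target(bad, stop("simulated failure")),
    tar_target(unaffected, ok * 10)  # does not depend on `bad`
  )
}, ask = FALSE)

# Guard the call in case the run signals a condition when it finishes.
try(suppressWarnings(tar_make()), silent = TRUE)

errored <- tar_errored()                    # names of targets that failed
unaffected_value <- tar_read(unaffected)    # built despite the failure

print(errored)
print(unaffected_value)
setwd(old_wd)
```

In a batch job, checking tar_errored() after the run is one way to decide whether partial results are good enough to publish.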

For a long pipeline, run tar_watch(seconds = 5) in a separate R session. It opens a Shiny app that live-updates the dependency graph and progress bars while tar_make() runs in another process. For an audit before a run, tar_manifest() returns a tibble of every target and tar_visnetwork() renders the graph in HTML. To force a rerun without touching source, use tar_invalidate(any_of(c("model", "plot"))). To build only one target plus the upstream targets it depends on, call tar_make(names = any_of("plot")); downstream targets are left alone. Set cue = tar_cue(mode = "never") on an expensive target to lock in a cached result you do not want to recompute by accident: the target then reruns only if it is missing from the metadata or its last run errored.
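
Invalidation can be tried safely on a two-target toy pipeline. This is a sketch with hypothetical stand-in commands for model and plot; it builds once, invalidates one target, and asks tar_outdated() what the next run would rebuild.

```r
library(targets)

demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  list(
    tar_target(model, mean(c(1, 2, 3))),        # stand-in for a fitted model
    tar_target(plot, paste("mean is", model))   # stand-in for a figure
  )
}, ask = FALSE)

tar_make()
fresh <- tar_outdated()          # state right after a full build

tar_invalidate(any_of("model"))  # forget the stored record for `model`
stale <- tar_outdated()          # what the next tar_make() would rebuild

print(fresh)
print(stale)
setwd(old_wd)
```

Running tar_outdated() before the nightly job is a cheap dry run: it reports what would be rebuilt without executing any target commands.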

Reproducible automation with renv and tarchetypes

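tar_renv() writes _targets_packages.R, a script of library() calls for the packages the pipeline declares, so that renv::snapshot() can detect them and record their versions in renv.lock. That pairs naturally with renv::restore() in CI. To render reports as targets, use tar_quarto() or tar_render() from tarchetypes: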

list(
  tar_target(raw, read_csv("data.csv", show_col_types = FALSE)),
  tar_quarto(report, path = "report.qmd", execute_params = list(data = raw))
)

tar_quarto() scans the .qmd for tar_read() and tar_load() calls and wires the upstream targets as dependencies. One gotcha: load the packages the report needs at the top of the .qmd itself, not only in _targets.R. The tarchetypes discussion #124 documents the propagation gap. For parameterised reports with one branch per row of a parameter grid, use tar_quarto_rep() instead of tar_quarto(). For non-Quarto reports, tar_render() works the same way for .Rmd files.
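
Returning to the renv step at the start of this section, the sketch below shows what tar_renv() produces for a toy pipeline. The package list in tar_option_set() is only an example, and the renv::snapshot() call that would follow in a real renv project is left as a comment so the block runs without one.

```r
library(targets)

demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  tar_option_set(packages = c("stats", "utils"))  # example package list
  list(tar_target(x, median(c(1, 5, 9))))
}, ask = FALSE)

# Writes _targets_packages.R next to _targets.R.
tar_renv()

packages_file_exists <- file.exists("_targets_packages.R")
package_lines <- readLines("_targets_packages.R")
print(head(package_lines))

# In a real renv project, the next step would be:
# renv::snapshot()

setwd(old_wd)
```

Because renv discovers dependencies by scanning source files, the generated file is what makes packages that are only named in tar_option_set(packages = ...) visible to it.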

Running the pipeline from CI

In GitHub Actions, the typical job is a one-liner: Rscript -e 'targets::tar_make()', then upload the rendered report and the _targets/ store. A minimal workflow looks like this:

# .github/workflows/pipeline.yml
name: pipeline
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"   # 2am UTC nightly

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2
      - run: Rscript -e 'targets::tar_make()'
      - uses: actions/upload-artifact@v4
        with:
          name: report
          path: report.html

Two commands make CI runs clean. tar_destroy(ask = FALSE) starts from scratch on a rebuild job, and tar_prune() removes _targets/objects/ files for targets that have been deleted from _targets.R. Set ask = FALSE deliberately, since the default prompts interactively and will hang in a non-interactive runner. Cache ~/.cache/R/renv and the _targets/ store between runs to avoid re-installing packages and re-running already-up-to-date targets. The GitHub Actions for R guide covers the broader setup.
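
The pruning behaviour described above can be verified on a toy store. In this sketch the target names kept and dropped are hypothetical; the pipeline is built with both, the script is rewritten without dropped, and tar_objects() lists the stored objects before and after tar_prune().

```r
library(targets)

demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

# Build a pipeline with two targets.
tar_script({
  list(
    tar_target(kept, 1),
    tar_target(dropped, 2)
  )
}, ask = FALSE)
tar_make()
objects_before <- tar_objects()  # names of objects in the data store

# Remove one target from the script, then prune the store.
tar_script({
  list(tar_target(kept, 1))
}, ask = FALSE)
tar_prune()
objects_after <- tar_objects()

print(objects_before)
print(objects_after)
setwd(old_wd)
```

In CI, a prune step before the cache is saved keeps the cached store from growing with objects that no longer belong to any target.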

Common gotchas

A few traps show up repeatedly in production pipelines:

  • tar_option_set() must precede tar_target() calls in _targets.R. Targets capture option defaults at the moment they are created.
  • format = "file" requires the command to return a character vector of paths to existing files. With a cloud repository, only a single path (length 1) is allowed, and it cannot be a directory.
  • format = "auto" is convenient but creates around 20,000 internal data copies for pipelines with 10,000+ targets. Pick "qs" or "feather" for big projects.
  • tar_cue(mode = "never") is the usual way to compute an expensive object once and keep it cached. The target reruns only if it is missing from the metadata or its last run errored.
  • The priority argument on tar_target() was deprecated in 2025. Passing it is silently ignored. Use cue or rearrange the dependency graph.
  • Removing a target from _targets.R does not delete its cached file. Run tar_prune() to clean up.
  • tar_load_everything() loads every cached target into the global environment. Useful for ad-hoc exploration, but avoid it in production scripts.
  • Branching over a length-1 vector still creates a dynamic target with a single branch, visible in tar_visnetwork() as a square node. If you want a static target, omit pattern.

Where pipeline automation goes next

Once a pipeline runs unattended, the next questions are usually about scale and cost. Start by measuring: tar_meta(fields = c("name", "seconds", "bytes")) returns per-target runtime and storage. If a few targets dominate, dynamic branching is the cheapest fix; splitting one long target into many short ones parallelises naturally on crew. If storage dominates, move large intermediate data to cloud repository = "aws" with tar_resources_aws(). If reruns are dominated by package install time, cache renv and skip renv::restore() when renv.lock is unchanged. The shape of the answer depends on the bottleneck, but the same loop applies: measure, change one thing, run, measure again. That loop is the heart of pipeline automation with targets, and it scales from a five-target homework script to a 50,000-target production pipeline.
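
The measuring step can be sketched on a toy pipeline. The three targets below are hypothetical stand-ins for a cheap target, a storage-heavy one, and a slow one; the ranking is plain base R on the tibble that tar_meta() returns.

```r
library(targets)

demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  list(
    tar_target(small, 1),
    tar_target(large, rnorm(1e5)),   # dominates storage
    tar_target(slow, {
      Sys.sleep(1)                   # dominates runtime
      1
    })
  )
}, ask = FALSE)
tar_make()

# One row per target, with runtime in seconds and stored size in bytes.
meta <- tar_meta(
  fields = any_of(c("name", "seconds", "bytes")),
  targets_only = TRUE
)

by_time <- meta[order(meta$seconds, decreasing = TRUE), ]
by_size <- meta[order(meta$bytes, decreasing = TRUE), ]

print(by_time)
print(by_size)
setwd(old_wd)
```

The top rows of each ranking show whether runtime or storage is the bottleneck to attack first.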
