
Reproducible Pipelines with targets

The targets package brings Make-like pipeline functionality to R. If you’ve ever rerun an analysis and gotten different results because something upstream changed, targets is the solution you’ve been looking for. It tracks dependencies between steps, runs only what changed, and keeps your results reproducible across sessions.

Why reproducible pipelines matter

Data science workflows are fragile. You load raw data, clean it, transform it, model it, and report it. Each step depends on the previous one, so when you change the cleaning code, everything downstream must be rerun. Tracking this by hand is unreliable: we either rerun the whole analysis or, worse, forget to rerun a step and report stale results.

targets solves this by building a dependency graph of your pipeline. Each target is a step that produces an R object. When you run the pipeline, targets checks what changed and runs only the affected targets. The output of each target is cached on disk, so you don’t recalculate things that haven’t changed.

This matters for several reasons. First, you save time—large analyses that took hours now take minutes because only changed parts rerun. Second, you get reproducibility—you have tangible evidence that your outputs match your code and data. Third, you reduce cognitive load—you stop worrying about what needs to rerun and let targets handle it.

Your first targets pipeline

Install targets from CRAN. The package is self-contained with no system dependencies beyond a working R installation, so it works on Linux, macOS, and Windows without extra setup:

install.packages("targets")

Create a new targets project. The tar_script() function writes an example _targets.R file to your working directory, and in an interactive session it asks before overwriting an existing one. The _targets/ directory that caches pipeline outputs is created later, the first time the pipeline runs:

library(targets)
tar_script()

The _targets.R file is where you define your pipeline. Here’s a simple example:

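# _targets.R
library(targets)

# Packages loaded for every target; summarise() and n() come from dplyr
tar_option_set(packages = "dplyr")

# Define the pipeline
list(
  # format = "file" makes targets watch the CSV itself for changes
  tar_target(survey_file, "data/survey.csv", format = "file"),
  tar_target(raw_data, read.csv(survey_file)),
  tar_target(cleaned_data, raw_data |>
    na.omit() |>
    subset(age > 18)),
  tar_target(summary_stats, summarise(cleaned_data,
    mean_income = mean(income),
    n = n()))
)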

Each tar_target() call creates a step in your pipeline. The first argument is the target name, which is how you refer to the target later. The second argument is the R code that produces the output. targets detects dependencies between steps automatically: because the command for cleaned_data mentions raw_data, targets knows that a change to raw_data requires recomputing cleaned_data and everything downstream of it.
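To make the idea of automatic dependency detection concrete, here is a minimal base R sketch. It is not how targets is implemented internally; it only illustrates the general technique of static code analysis, using all.vars() to list the symbols in an unevaluated command and keeping those that match known target names. The helper name find_upstream() is hypothetical.

```r
# Hypothetical helper: which known targets does a command mention?
find_upstream <- function(command, target_names) {
  # all.vars() lists the variable names used in an unevaluated expression
  intersect(all.vars(command), target_names)
}

target_names <- c("raw_data", "cleaned_data", "summary_stats")

# quote() captures each command without running it
cleaned_cmd <- quote(raw_data |> na.omit() |> subset(age > 18))
summary_cmd <- quote(summarise(cleaned_data, mean_income = mean(income), n = n()))

find_upstream(cleaned_cmd, target_names)  # "raw_data"
find_upstream(summary_cmd, target_names)  # "cleaned_data"
```

Column names such as age and income also appear in all.vars(), but they are dropped because they are not target names.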

Run the pipeline with tar_make(). This is the command you will use most often: it reads your _targets.R file, builds the dependency graph, checks which targets are out of date, and re-executes only those:

tar_make()

targets saves each target’s return value as a file in the _targets/objects/ directory and records a hash of it in _targets/meta/. The next time you run tar_make(), it decides whether each target is still valid by comparing hashes of the target’s command, the functions it calls, its upstream targets, and any files declared with format = "file". A path that only appears as a string inside a command, such as read.csv("data/survey.csv"), is not tracked. Only outdated targets rerun; everything else is skipped and its stored value is left untouched. This is the mechanism that turns hour-long analyses into minute-long iterations.
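The skip-or-rerun behavior can be observed directly. The following sketch assumes the targets package is installed. It builds a throwaway two-target pipeline in a temporary directory (the target names x and total are made up for the demonstration), runs it, and then asks tar_outdated() what would rerun before and after the command of one target is edited.

```r
library(targets)

# Work in a scratch project so the demo does not touch your own files
proj <- file.path(tempdir(), "targets-skip-demo")
dir.create(proj, showWarnings = FALSE)
old_wd <- setwd(proj)

# Write _targets.R with two targets: total depends on x
tar_script({
  list(
    tar_target(x, 1:10),
    tar_target(total, sum(x))
  )
}, ask = FALSE)

tar_make()                             # first run builds both targets
outdated_after_run <- tar_outdated()   # nothing is left to do

# Rewrite _targets.R, changing only the command of `total`
tar_script({
  list(
    tar_target(x, 1:10),
    tar_target(total, mean(x))
  )
}, ask = FALSE)

outdated_after_edit <- tar_outdated()  # only `total` needs to rerun

setwd(old_wd)
```

After the first run outdated_after_run is empty, and after the edit outdated_after_edit names only total, because x and its command are unchanged.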

Read a specific target’s cached value with tar_read(), which pulls the object from disk without executing any part of the pipeline:

tar_read(summary_stats)

Nothing is rerun. tar_load(summary_stats) does the same but assigns the value to a variable named summary_stats in your session.

The pipeline graph

Understanding the pipeline graph is central to using targets effectively. Each target is a node in a directed acyclic graph (DAG). Edges represent dependencies—target B depends on target A if B’s code references A.

When you define tar_target(cleaned_data, raw_data |> ...), targets automatically detects that cleaned_data depends on raw_data. You don’t need to manually specify dependencies.

Visualize your pipeline with tar_visnetwork():

tar_visnetwork()

This opens an interactive graph showing all targets and their connections. It’s invaluable for understanding complex pipelines and debugging dependency issues.

The graph has several important properties. It’s always acyclic—you can’t have A depend on B while B depends on A. This prevents infinite loops. The order of definition in your _targets.R doesn’t matter; targets resolves dependencies automatically.
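Why definition order does not matter can be shown with a small topological sort in base R. This is an illustrative sketch only, not the algorithm targets uses internally; the helper name resolve_order() and the hand-written dependency list are assumptions made for the example.

```r
# Hypothetical helper: order targets so each comes after its dependencies
resolve_order <- function(deps) {
  done <- character(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    # A target is ready once all of its upstream targets are done
    ready <- Filter(function(name) all(deps[[name]] %in% done), remaining)
    if (length(ready) == 0) {
      stop("cycle detected among: ", paste(remaining, collapse = ", "))
    }
    done <- c(done, ready)
    remaining <- setdiff(remaining, ready)
  }
  done
}

# Deliberately listed "backwards"
deps <- list(
  summary_stats = "cleaned_data",
  cleaned_data  = "raw_data",
  raw_data      = character(0)
)

resolve_order(deps)  # "raw_data" "cleaned_data" "summary_stats"
```

A cyclic input such as list(a = "b", b = "a") never produces a ready target, so the function stops with an error instead of looping forever.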

Branching: running targets in parallel

Branching lets you create multiple targets from a single definition. It’s useful when you need to process data in parallel chunks or run the same operation on different subsets.

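Static branching creates targets before the pipeline runs. It is provided by tar_map() in the tarchetypes package, which stamps out one copy of a target for each entry in values:

# _targets.R
library(targets)
library(tarchetypes)

list(
  tar_map(
    # The two file names are placeholders for your own inputs
    values = list(file = c("data/a.csv", "data/b.csv")),
    tar_target(data_processed, read.csv(file))
  )
)

Each statically branched target is an ordinary target whose name combines data_processed with a suffix derived from values, and each one appears as its own node in tar_visnetwork().

Dynamic branching creates targets while the pipeline runs, making it the right choice when you cannot know the number of branches ahead of time. An upstream target determines how many branches to create: pattern = map(files) defines one branch of data_processed per element of files, so the number of branches follows the data without any change to _targets.R:

list(
  tar_target(files, list.files("data", pattern = "\\.csv$", full.names = TRUE)),
  tar_target(data_processed, read.csv(files), pattern = map(files))
)

If files has three elements, data_processed gets three branches. Branch names end in a hash rather than a running index, so read a single branch by position with tar_read(data_processed, branches = 1). Keep in mind that files itself reruns only when it is invalidated; tar_files() in tarchetypes is designed for tracking a set of input files that changes between runs.

Branches, like any other independent targets, can run in parallel. Supply a crew controller, for example tar_option_set(controller = crew::crew_controller_local(workers = 4)), and tar_make() spreads the work across four local R processes. The crew.cluster package provides controllers for HPC schedulers.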

Cloud storage

By default, targets stores data in the _targets/ folder in your project. For large projects or team workflows, you might want cloud storage.

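For Amazon S3, set repository = "aws" and describe the bucket with tar_resources_aws(). targets talks to S3 through the paws.storage package, so install that first. In the examples below, "my-project" is a placeholder prefix (a folder inside the bucket):

tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-bucket", prefix = "my-project")
  )
)

Google Cloud Storage works similarly with repository = "gcp" and tar_resources_gcp(), which relies on the googleCloudStorageR package. Both arguments can also be set on a single target:

tar_target(gcs_data, read.csv("data/survey.csv"),
  repository = "gcp",
  resources = tar_resources(
    gcp = tar_resources_gcp(bucket = "my-bucket", prefix = "my-project")
  ))

The repository argument controls where a target’s value is stored; pipeline metadata still lives in the local _targets/ folder. It is separate from the storage argument, which accepts "main", "worker", or "none" and controls which process saves the value when targets run on parallel workers. With storage = "none", targets does not save the value at all and your own code is responsible for writing it.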

Cloud storage is particularly useful when your results are too large for a local disk, when you want to share them across machines, or when the pipeline must stay reproducible independent of any one local disk.

Advanced patterns: tar_group and iteration

tar_group() assigns group indices to a data frame that has already been grouped with dplyr::group_by(), so that downstream targets can process each group as its own branch. tar_target(split_data, df |> dplyr::group_by(region) |> tar_group(), iteration = "group") marks the groups of df, and tar_target(model, fit_model(split_data), pattern = map(split_data)) then fits one model per group. Here region stands in for whichever column defines your groups.
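A runnable sketch of grouped branching follows, assuming the targets and dplyr packages are installed. It uses the built-in mtcars data grouped by cyl purely for illustration, and it runs in a temporary directory. Note that tar_group() expects a data frame already grouped with group_by(), and that iteration = "list" on the model target keeps the fitted models as a plain list.

```r
library(targets)

# Scratch project for the demo
proj <- file.path(tempdir(), "targets-group-demo")
dir.create(proj, showWarnings = FALSE)
old_wd <- setwd(proj)

tar_script({
  tar_option_set(packages = "dplyr")
  list(
    # One group per cylinder count; tar_group() records the group index
    tar_target(
      split_data,
      mtcars |> group_by(cyl) |> tar_group(),
      iteration = "group"
    ),
    # One branch, and so one fitted model, per group
    tar_target(
      model,
      lm(mpg ~ wt, data = split_data),
      pattern = map(split_data),
      iteration = "list"
    )
  )
}, ask = FALSE)

tar_make()
models <- tar_read(model)  # list with one lm fit per group

setwd(old_wd)
```

Because mtcars has three distinct values of cyl, models holds three fits.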

The pattern argument controls how branches are created: map() iterates element-wise over one or more upstream targets of equal length, cross() forms every combination, and head(x, n = 2) keeps only the first n branches of x. Patterns compose: pattern = cross(data, cross(params_a, params_b)) generates one branch for each data source crossed with each parameter combination.
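Branch counts can be checked without running a pipeline. The sketch below assumes the targets package is installed and uses tar_pattern(), which emulates a pattern given the length of each upstream target. The names x, y, data, params_a, and params_b are illustrative only.

```r
library(targets)

# Each named argument is the number of elements of that upstream target
mapped  <- tar_pattern(map(x, y), x = 3, y = 3)
crossed <- tar_pattern(cross(x, y), x = 2, y = 3)
nested  <- tar_pattern(
  cross(data, cross(params_a, params_b)),
  data = 2, params_a = 2, params_b = 3
)
first2  <- tar_pattern(head(x, n = 2), x = 5)

# The result has one row per branch that the pattern would create
nrow(mapped)   # 3: elements are paired up
nrow(crossed)  # 6: 2 x 3 combinations
nrow(nested)   # 12: 2 x (2 x 3) combinations
nrow(first2)   # 2: only the first two branches are kept
```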

Integration with other tools

targets works with renv through tar_renv(), which writes the pipeline’s packages to a file that renv can scan, and with crew for distributed computing on HPC clusters. Packages that every target should load are declared with tar_option_set(packages = ...). The tarchetypes package provides helper functions for common patterns: tar_quarto() renders a Quarto document as a target that depends on its data inputs, invalidating and re-rendering when the inputs change.

Skipping unchanged work

targets identifies outdated targets by hashing each target’s command, the functions that command calls, files tracked with format = "file", and the values of upstream targets. When you run tar_make(), only outdated targets re-execute. If you change a helper function called by a target, targets automatically marks that target and everything downstream of it as outdated, without you specifying the dependency graph manually.
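The effect of editing a helper function can be sketched as follows, assuming the targets package is installed. The helper scale_up() and the three targets are hypothetical and exist only for this demonstration, which runs in a temporary directory.

```r
library(targets)

# Scratch project for the demo
proj <- file.path(tempdir(), "targets-helper-demo")
dir.create(proj, showWarnings = FALSE)
old_wd <- setwd(proj)

tar_script({
  scale_up <- function(x) x * 2   # helper used by `scaled`
  list(
    tar_target(x, 1:3),
    tar_target(scaled, scale_up(x)),
    tar_target(total, sum(scaled))
  )
}, ask = FALSE)
tar_make()

# Same targets and commands; only the body of the helper changes
tar_script({
  scale_up <- function(x) x * 3
  list(
    tar_target(x, 1:3),
    tar_target(scaled, scale_up(x)),
    tar_target(total, sum(scaled))
  )
}, ask = FALSE)

outdated <- tar_outdated()  # targets that would rerun on the next tar_make()

setwd(old_wd)
```

Here outdated contains scaled, which calls the helper, and total, which sits downstream of it, while x is still up to date.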

Inspecting the pipeline

tar_visnetwork() renders an interactive dependency graph showing all targets and their status (outdated, up to date, errored). tar_manifest() returns a data frame with one row per target, listing its name and command, which is useful for auditing a large pipeline. tar_read(target_name) loads any target’s value from the cache without re-running the pipeline.
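A short sketch of auditing a pipeline with tar_manifest(), assuming the targets package is installed. The two targets are placeholders, and the pipeline does not need to have been run for the manifest to be available.

```r
library(targets)

# Scratch project for the demo
proj <- file.path(tempdir(), "targets-manifest-demo")
dir.create(proj, showWarnings = FALSE)
old_wd <- setwd(proj)

tar_script({
  list(
    tar_target(x, 1:10),
    tar_target(total, sum(x))
  )
}, ask = FALSE)

# One row per target; the command column holds the code as text
manifest <- tar_manifest()
manifest[, c("name", "command")]

setwd(old_wd)
```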

Branching for iteration

Dynamic branching with pattern = map(some_target) creates one branch per element of an input without listing each one explicitly. It is the targets equivalent of lapply(), and it scales to thousands of branches without any change to the pipeline definition. Static branching with tar_map() from tarchetypes instead expands the pipeline in _targets.R before it runs. Use static branching for a small, well-defined set of variations and dynamic branching for data-driven iteration.

Pipeline orchestration vs. simple scripts

A simple R analysis script runs from top to bottom every time. Whatever you change, every step re-runs. If the step that reads the raw data takes ten minutes, you wait ten minutes even if you only changed the visualization code. For small analyses this is fine. For complex analyses with many steps and slow computations, this becomes frustrating and error-prone.

targets is a pipeline orchestration framework that tracks what has changed and only re-runs the steps that depend on those changes. When you modify the visualization code, targets knows that only the visualization step and its dependents need to re-run. The data reading and cleaning steps, which have not changed and whose inputs have not changed, are skipped. This selective re-execution makes iteration fast.

Defining a pipeline

A targets pipeline is a list of target objects defined in a _targets.R file. Each target specifies a name and a command (an R expression or function call). Its dependencies are the other targets and functions that the command mentions: if a target uses the output of another, it depends on it. targets infers the dependency graph from the commands and determines the correct execution order.

When the pipeline runs with tar_make(), targets compares the current hash of each target’s command and dependencies with the hashes recorded from the last run. Targets whose command or dependencies have changed are re-run. Targets whose inputs are unchanged are skipped, and their stored values remain available through tar_read(). The cache is stored in a _targets directory alongside the pipeline definition. Stored objects can be large, so think twice before committing the whole directory to version control; the gittargets package is designed for versioning the data store alongside the code.

Functions over scripts

The idiomatic targets workflow writes analysis code as functions rather than as script-style top-level code. Each function takes inputs and returns an output. The pipeline calls these functions with specific targets as arguments. This function-based approach makes each step testable in isolation: you can call the function directly with test data without running the full pipeline. It also clarifies the dependencies between steps, because what each step needs and what it produces is explicit in the function signature.
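As a sketch of this style, the survey steps from earlier can be written as two plain functions and exercised on a tiny hand-made data frame. The function names clean_survey() and summarise_income(), the file name R/functions.R, and the test data are all assumptions made for illustration; the code uses base R only.

```r
# R/functions.R (hypothetical file): one small function per pipeline step
clean_survey <- function(raw) {
  complete <- na.omit(raw)                      # drop rows with missing values
  complete[complete$age > 18, , drop = FALSE]   # keep adults only
}

summarise_income <- function(cleaned) {
  data.frame(mean_income = mean(cleaned$income), n = nrow(cleaned))
}

# Try the steps on a tiny data frame, no pipeline required
test_raw <- data.frame(
  age    = c(17, 25, 40, NA),
  income = c(0, 30000, 50000, 42000)
)
cleaned <- clean_survey(test_raw)
summarise_income(cleaned)

# In _targets.R the same functions become the commands:
#   tar_target(cleaned_data, clean_survey(raw_data))
#   tar_target(summary_stats, summarise_income(cleaned_data))
```

In a real project the functions would typically live in files under R/ and be loaded at the top of _targets.R, for example with tar_source().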

Error diagnosis is easier in a function-based pipeline because each function has a name. When a step fails, the error message identifies which function failed and with what arguments. In a script-based approach, failure messages are line numbers in a long script, requiring more context to interpret.
