Reproducible Pipelines with targets

· 4 min read · Updated March 11, 2026 · intermediate
reproducibility pipeline targets workflow make

The targets package brings Make-like pipeline functionality to R. If you’ve ever rerun an analysis and gotten different results because something upstream changed, targets is the solution you’ve been looking for. It tracks dependencies between steps, runs only what changed, and keeps your results reproducible across sessions.

Why Reproducible Pipelines Matter

Data science workflows are fragile. You load raw data, clean it, transform it, model it, and report it. Each step depends on the previous one. When you change the cleaning code, you need to rerun everything downstream. Most of us don’t do this manually—we rerun everything or, worse, we forget to rerun something and get wrong results.

targets solves this by building a dependency graph of your pipeline. Each target is a step that produces an R object. When you run the pipeline, targets checks what changed and runs only the affected targets. The output of each target is cached on disk, so you don’t recalculate things that haven’t changed.

This matters for several reasons. First, you save time—large analyses that took hours now take minutes because only changed parts rerun. Second, you get reproducibility—you have tangible evidence that your outputs match your code and data. Third, you reduce cognitive load—you stop worrying about what needs to rerun and let targets handle it.

Your First targets Pipeline

Install targets from CRAN:

install.packages("targets")

Create a starter pipeline file with tar_script():

tar_script()

This writes a _targets.R file to your working directory. That’s where you define your pipeline. Here’s a simple example:

# _targets.R
library(targets)

# summarise() and n() come from dplyr; load it for every target
tar_option_set(packages = "dplyr")

# Define the pipeline
list(
  tar_target(raw_data, read.csv("data/survey.csv")),
  tar_target(cleaned_data, raw_data |>
    na.omit() |>
    subset(age > 18)),
  tar_target(summary_stats, summarise(cleaned_data,
    mean_income = mean(income),
    n = n()))
)

Each tar_target() call creates a step in your pipeline. The first argument is the target name (how you’ll refer to it later). The second argument is the R code that produces the output.

Run the pipeline with tar_make():

tar_make()

targets saves each target’s output to the _targets/objects/ directory. The next time you run tar_make(), it checks whether each target’s dependencies changed. Only outdated targets rerun.
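Before committing to a run, you can ask which targets are stale. tar_outdated() returns the names of targets whose code or upstream dependencies have changed since the last tar_make() (an empty vector means everything is current):

```r
library(targets)

# List the targets that would rerun on the next tar_make()
tar_outdated()
```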

Read a specific target’s value with tar_read():

tar_read(summary_stats)

This loads the cached output without rerunning anything.
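A companion helper, tar_load(), assigns the cached value into your session under the target’s own name, which is handy for interactive inspection:

```r
library(targets)

# Equivalent to summary_stats <- tar_read(summary_stats)
tar_load(summary_stats)
summary_stats
```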

The Pipeline Graph

Understanding the pipeline graph is central to using targets effectively. Each target is a node in a directed acyclic graph (DAG). Edges represent dependencies—target B depends on target A if B’s code references A.

When you define tar_target(cleaned_data, raw_data |> ...), targets automatically detects that cleaned_data depends on raw_data. You don’t need to manually specify dependencies.
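You can watch this static code analysis work with tar_deps(), which lists the symbols an expression references; for the cleaning step, the result includes raw_data:

```r
library(targets)

# Inspect the dependencies targets detects in a piece of pipeline code
tar_deps(raw_data |> na.omit() |> subset(age > 18))
```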

Visualize your pipeline with tar_visnetwork():

tar_visnetwork()

This opens an interactive graph showing all targets and their connections. It’s invaluable for understanding complex pipelines and debugging dependency issues.

The graph has several important properties. It’s always acyclic—you can’t have A depend on B while B depends on A. This prevents infinite loops. The order of definition in your _targets.R doesn’t matter; targets resolves dependencies automatically.
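To see the order independence concretely, this sketch of a _targets.R behaves identically with the two targets swapped, because targets sorts the graph topologically before execution:

```r
# _targets.R
library(targets)

list(
  # Listed first, but runs second because it depends on raw_data
  tar_target(cleaned_data, na.omit(raw_data)),
  tar_target(raw_data, read.csv("data/survey.csv"))
)
```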

Branching: Running Targets in Parallel

Branching lets you create multiple targets from a single definition. It’s useful when you need to process data in parallel chunks or run the same operation on different subsets.

Dynamic branching creates branches while the pipeline runs, so you don’t need to know how many there will be ahead of time. Use pattern = map() over an upstream target:

# _targets.R
list(
  tar_target(
    files,
    list.files("data", pattern = "\\.csv$", full.names = TRUE),
    format = "file"
  ),
  tar_target(data_processed, read.csv(files),
    pattern = map(files))
)

The pattern = map(files) tells targets to create one branch per element of files. If you have three CSV files, you get three branches of data_processed, each with an automatically generated suffix rather than a fixed name. The format = "file" flag makes targets watch the file contents themselves, so editing one CSV invalidates only the branch that reads it.

Static branching, by contrast, expands targets when _targets.R is parsed, before the pipeline runs. It comes from the tarchetypes package and needs the branching values up front:

# _targets.R
library(targets)
library(tarchetypes)
list(
  tar_map(
    values = list(region = c("north", "south")),
    tar_target(model, fit_model(region))
  )
)

Here fit_model() is a placeholder for your own modeling function. tar_map() generates two fully named targets, model_north and model_south, which appear individually in tar_visnetwork().

Dynamic branches can run in parallel. In current versions of targets, parallelism comes from the crew package: set tar_option_set(controller = crew::crew_controller_local(workers = 4)) in _targets.R, then run tar_make() as usual. Companion packages to crew provide launchers for traditional clusters and cloud backends.
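Putting the pieces together, here is a minimal sketch of a crew-enabled _targets.R (assuming the crew package is installed and a data/ directory of CSVs exists):

```r
# _targets.R
library(targets)

# Run up to 4 branches at a time in local R processes
tar_option_set(
  controller = crew::crew_controller_local(workers = 4)
)

list(
  tar_target(
    files,
    list.files("data", pattern = "\\.csv$", full.names = TRUE),
    format = "file"
  ),
  # One branch per file; independent branches run concurrently
  tar_target(data_processed, read.csv(files), pattern = map(files))
)
```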

Cloud Storage

By default, targets stores data in the _targets/ folder in your project. For large projects or team workflows, you might want cloud storage.

Amazon S3 support uses the paws.storage package under the hood. You opt in by setting a repository and pointing targets at a bucket:

# _targets.R
library(targets)
tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-bucket", prefix = "targets")
  )
)

Google Cloud Storage works similarly through the googleCloudStorageR package:

tar_option_set(
  repository = "gcp",
  resources = tar_resources(
    gcp = tar_resources_gcp(bucket = "my-bucket", prefix = "targets")
  )
)

With a cloud repository configured, tar_make() uploads each target’s output to the bucket, and local metadata keeps track of what lives where. You can also set repository = "aws" on individual tar_target() calls to mix local and cloud storage within one pipeline.

Cloud storage is particularly useful when you have large datasets that are slow to download, want to share data across machines, or need to ensure reproducibility independent of local disk.

See Also