
Sharing Data Artifacts with pins

Sharing data across R projects and between team members should be straightforward. The pins package makes data artifacts discoverable, versioned, and reproducible by letting you “pin” objects to a board: a storage location that other sessions and colleagues can read from without managing file paths or copying files manually. Whether you are sharing a reference dataset, a trained model, or an ETL pipeline output, pins provides a consistent interface that works across local folders, cloud storage, and Posit Connect.

Boards

Every pin lives in a pin board. Create a board based on how you want to share data:

library(pins)

# Local storage (same computer, different R sessions)
board <- board_local()

# Shared folder (Dropbox, network drive)
board <- board_folder("~/Dropbox")

# Posit Connect
board <- board_connect()

# Amazon S3
board <- board_s3("my-bucket")

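# Azure Blob Storage (board_azure() takes an AzureStor container object;
# the URL and the AZURE_SAS environment variable below are placeholders)
container <- AzureStor::storage_container(
  "https://myaccount.blob.core.windows.net/my-container",
  sas = Sys.getenv("AZURE_SAS")
)
board <- board_azure(container)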

For testing, board_temp() creates a temporary board that disappears when your R session ends. This is particularly useful for unit tests and exploratory work where you want to experiment with pins without leaving artifacts behind on a shared board.
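
As a minimal sketch of that kind of experiment, the snippet below writes a built-in dataset to a temporary board and reads it back. The pin name "mtcars-demo" is only an illustrative choice.

```r
library(pins)

# Temporary board: its storage is cleaned up when the R session ends
board <- board_temp()

# Write a built-in dataset, then read it back from the board
board |> pin_write(mtcars, "mtcars-demo", type = "rds")
roundtrip <- board |> pin_read("mtcars-demo")

# The object read back should match what was written
identical(roundtrip, mtcars)
```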

File types

Pins automatically detects how to save your data, but you can specify different formats:

Type      Function                      Use case
RDS       saveRDS()                     Any R object, R-only
CSV       write.csv()                   Plain text, language-independent
Parquet   nanoparquet::write_parquet()  Large tabular data, efficient
JSON      jsonlite::write_json()        Nested structures
QS        qs2::qs_save()                Fast binary, large objects

board |> pin_write(df, "my-data", type = "parquet")

Avoid pinning files over 500 MB; remote boards transfer pins over HTTP, which becomes slow and unreliable for very large files. For datasets that large, store the file in cloud storage and pin only a lightweight metadata reference instead.
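
One possible shape for such a reference is a small list pinned as JSON. This is only a sketch: the bucket path, field names, and pin name below are hypothetical, and nothing here downloads the large file itself.

```r
library(pins)

board <- board_temp()

# Hypothetical pointer to a large file that lives outside the board
pointer <- list(
  uri = "s3://example-bucket/exports/events.parquet",
  format = "parquet"
)

# Pin only the small pointer, not the data it refers to
board |> pin_write(pointer, "events-pointer", type = "json")

# Consumers read the pointer and then fetch the file with their own tooling
ref <- board |> pin_read("events-pointer")
ref$uri
```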

Metadata

Every pin stores metadata automatically. Access it with pin_meta() to inspect the file size, hash, creation date, and object type without loading the full dataset into memory. This is useful when you want to check if a pin has been updated before deciding to read it:

board |> pin_meta("mtcars-dataset")
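
The fields mentioned above can be pulled out of the returned list individually. The sketch below uses a temporary board so it runs on its own; on a real board you would skip the write step.

```r
library(pins)

board <- board_temp()
board |> pin_write(mtcars, "mtcars-dataset", type = "rds")

# pin_meta() returns a list; the data itself is not loaded
meta <- board |> pin_meta("mtcars-dataset")

meta$type       # storage format used for the pin
meta$file_size  # size of the stored file(s)
meta$pin_hash   # hash of the pin contents
meta$created    # creation time as a date-time
```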

The metadata system supports custom fields you define, letting you attach ownership information, pipeline version numbers, or source descriptions directly to the stored data. Additional metadata travels with the pin and is available to anyone who reads it later on any machine:

board |> pin_write(df, "analysis", 
  metadata = list(owner = "data-team", version = "1.0"))
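
When the pin is read back, custom fields appear under the user element of pin_meta(). A small self-contained sketch, using a temporary board and a stand-in data frame:

```r
library(pins)

board <- board_temp()
df <- head(mtcars)  # stand-in for the real analysis data

board |> pin_write(df, "analysis", type = "rds",
  metadata = list(owner = "data-team", version = "1.0"))

# Custom metadata is stored separately from the built-in fields
meta <- board |> pin_meta("analysis")
meta$user$owner
```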

Tags provide lightweight organization without requiring a formal metadata schema. Use them to group pins by environment (production, staging), refresh frequency (daily, weekly), or data domain (sales, marketing). Tags are stored with the pin's metadata, so you can use them to filter a large collection down to the specific subset you need:

board |> pin_write(df, "dataset", tags = c("production", "daily"))
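
One way to filter by tag is to read each pin's metadata and check its tags field. This is a sketch, not a built-in pins feature: has_tag() is a hypothetical helper, and the pin names are illustrative.

```r
library(pins)

board <- board_temp()
board |> pin_write(head(mtcars), "sales-daily", type = "rds",
  tags = c("production", "daily"))
board |> pin_write(head(iris), "scratch", type = "rds")

# Hypothetical helper: TRUE when the pin's metadata lists the given tag
has_tag <- function(board, name, tag) {
  tag %in% pin_meta(board, name)$tags
}

# Keep only the pins tagged "production"
production <- Filter(function(n) has_tag(board, n, "production"), pin_list(board))
production
```

Because this reads the metadata of every pin, it suits small to medium boards; on a large remote board it may be slow.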

When to use pins

Pins works well when a single process writes data that multiple processes read:

  • ETL pipelines storing daily model outputs
  • Shared reference datasets across projects
  • Caching remote data locally

A typical pipeline pattern: a scheduled cron job runs an R script that pulls data from a database, transforms it, and pins the result to a board. Multiple downstream reports, dashboards, and analyses then read from that pin without ever touching the source database. This isolates consumers from source system availability and avoids redundant queries competing for database resources.
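
The pattern can be sketched in a few lines. Everything specific here is an assumption for illustration: fetch_orders() stands in for the database query, the temporary board stands in for a shared board, and "daily-totals" is an arbitrary pin name.

```r
library(pins)

# Stand-in for a shared board such as board_connect() or board_s3()
board <- board_temp()

# Hypothetical extract step; a real job would query the source database here
fetch_orders <- function() {
  data.frame(region = c("east", "west", "east"), amount = c(120, 80, 45))
}

# Producer: runs on a schedule and is the only process that writes the pin
orders <- fetch_orders()
daily_totals <- aggregate(amount ~ region, data = orders, FUN = sum)
board |> pin_write(daily_totals, "daily-totals", type = "rds")

# Consumer: reports and dashboards read the pin, never the database
totals <- board |> pin_read("daily-totals")
totals
```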

Pins is not designed for concurrent writes. Don’t use it for Shiny apps where multiple users write simultaneously; the package can’t manage conflicts between concurrent writers. For read-heavy workloads where a single scheduled job produces data that many downstream consumers read, pins handles the workload reliably without extra infrastructure.

Finding pins

Use pin_search() to discover pins on a board:

board |> pin_search("model")

What gets searched depends on the board, but most boards match the search string against pin names and titles. When a team maintains dozens of pins across multiple boards, pin_search() becomes the primary way to discover what data is available without hunting through directories or documentation. Combined with metadata and tagging, search makes a shared board function as a lightweight data catalog.
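
A small runnable sketch of a search on a temporary board follows. The pin names and title are illustrative; in this case the match comes from the pin name and title.

```r
library(pins)

board <- board_temp()
board |> pin_write(mtcars, "cars-model-input", type = "rds",
  title = "Model training data")
board |> pin_write(head(iris), "iris-sample", type = "rds")

# Returns a data frame with one row per matching pin
found <- board |> pin_search("model")
found$name

# Calling pin_search() with no search term lists every pin on the board
everything <- board |> pin_search()
```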

Caching expensive computations

Pins also works well as a caching layer for intermediate results. Save the output of a slow query or model fit and skip recomputation on subsequent runs:

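name <- "sales-summary"
max_age <- 3600  # seconds

# The cache is stale if the pin is missing or was created more than an hour ago.
# pin_meta() supplies the creation time; the pinned data itself has no timestamp.
stale <- !pin_exists(board, name) ||
  pin_meta(board, name)$created < Sys.time() - max_age

if (stale) {
  sales <- compute_monthly_sales()
  pin_write(board, sales, name, type = "parquet")
}

cached <- pin_read(board, name)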

This pattern checks for a cached pin and only recomputes when it is missing or stale, avoiding expensive database queries or model refits on every script run. The approach is particularly effective in scheduled reports and dashboards where the underlying source data changes infrequently but the report renders on a fixed schedule regardless.

How pins compares to alternatives

Pins fills a gap between ad-hoc file sharing and full data infrastructure. Compared to saving RDS files to a shared network drive, pins adds versioning, metadata, and search without requiring a database or object store. Compared to checking data into git, pins handles large binary objects efficiently and avoids bloating repository history.

For teams already using Posit Connect, the board_connect() backend integrates with existing authentication and access control. For cloud-native teams, the S3, GCS, and Azure backends store pins alongside other cloud assets using the same IAM policies. For solo developers or small teams, board_folder() with a shared directory is enough to get started in minutes. The common pin_write() and pin_read() API means you can start with a local folder and migrate to cloud storage later without changing your analysis code, which reduces the risk of lock-in and makes it practical to evolve your data infrastructure as your team grows.
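
To illustrate the migration point, the sketch below keeps the board as a function argument so the analysis code never names a backend. publish_summary() and load_summary() are hypothetical helpers, and the two boards here are local stand-ins for a shared folder and a cloud board.

```r
library(pins)

# Hypothetical helpers: the board is passed in, never hard-coded
publish_summary <- function(board, data) {
  pin_write(board, data, "summary", type = "rds")
}
load_summary <- function(board) {
  pin_read(board, "summary")
}

# Start with a folder board (a temporary directory stands in for a shared drive)
shared_dir <- file.path(tempdir(), "shared-pins")
dir.create(shared_dir, showWarnings = FALSE)
folder_board <- board_folder(shared_dir)
publish_summary(folder_board, head(mtcars))
from_folder <- load_summary(folder_board)

# Later, swap in another board; the helper code stays the same
other_board <- board_temp()  # stand-in for board_s3(), board_connect(), ...
publish_summary(other_board, head(mtcars))
from_other <- load_summary(other_board)
```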

Versioning and reproducibility

On a versioned board, every call to pin_write() on an existing pin name creates a new version rather than overwriting. Versioning defaults vary by board type; board_folder() and board_temp(), for example, are unversioned unless you pass versioned = TRUE. You can list all versions to see the update history, read a specific historical version by its timestamped version ID, and prune old versions to manage storage. This built-in versioning means any analysis that records which pin version it read can be reproduced exactly, even after the pin has been updated with newer data.
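
The listing, reading, and pruning steps might look like the following on a temporary board with versioning switched on. The pin name and data are illustrative, and the pause exists only so the two writes get distinct timestamps in this demo.

```r
library(pins)

# Versioning is requested explicitly for the temporary board
board <- board_temp(versioned = TRUE)

board |> pin_write(data.frame(x = 1:3), "scores", type = "rds")
Sys.sleep(1.1)  # version IDs carry a timestamp with one-second resolution
board |> pin_write(data.frame(x = 4:6), "scores", type = "rds")

# List the update history: one row per version
versions <- board |> pin_versions("scores")

# Read the oldest version by its ID, and the latest by default
oldest_id <- versions$version[order(versions$created)][1]
first <- board |> pin_read("scores", version = oldest_id)
latest <- board |> pin_read("scores")

# Keep only the most recent version to reclaim storage
board |> pin_versions_prune("scores", n = 1)
remaining <- board |> pin_versions("scores")
```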

For audit-sensitive workflows, combine pin versioning with custom metadata that records the generating script, data source, and runtime parameters. A downstream report can display these metadata fields to document its data provenance. When something looks wrong in the output, tracing back through the pin version history identifies exactly which version of the input data produced the result, without needing to examine logs or guess which pipeline run populated the data. For regulatory environments where every data transformation must be traceable, pins versioning paired with metadata provides a lightweight audit trail without the overhead of a dedicated data catalog or lineage tool.
