rguides

Arrow Big Data in R: Datasets, Partitioning, and S3

Why your laptop runs out of RAM before R does

A base R data.frame copies on every assignment, a dplyr pipeline keeps intermediate tibbles in memory, and readr::read_csv() slurps the whole file into RAM before you touch it. That works fine on a 50 MB CSV. It falls over on the 9 GB Seattle library checkouts file, on a year of NYC taxi trips, or on a 70 GB public Parquet dataset sitting in S3.

arrow is the answer when the data is too big to load but small enough that you can still describe the analysis you want. The package gives you a columnar in-memory type (arrow::Table), a multi-file abstraction that never materialises the underlying data (FileSystemDataset), and a dplyr backend that turns your familiar verbs into lazy query plans executed by the Arrow C++ library. You write filter(), select(), group_by(), summarise(). Arrow reads only the columns and row groups it needs, in parallel, and returns a tibble.

This guide is the larger-than-memory companion to the single-file Parquet guide. Where that one stops at read_parquet() and write_parquet(), this one starts: directory-shaped datasets, partitioning, pushdown filtering, and reading from S3 and GCS.

Datasets, not Tables: the abstraction that makes big data tractable

There are two arrow types you will see most often:

  • arrow::Table: an in-memory columnar table. Useful for data that fits in RAM (a few GB) but is annoying to materialise as a data.frame.
  • arrow::FileSystemDataset (and its subclasses like ParquetDataset, CsvDataset): a handle to a directory or set of files. It scans the directory, reads file footers, parses Hive-style partition keys, and infers a unified schema. No rows are read.

The key mental model: a FileSystemDataset is to read_csv() what a database connection is to dbReadTable(). You build a query, the engine decides how to execute it. If you accidentally call as_tibble(ds) or run a verb arrow does not support on a Dataset, it will either error out or fall back to collect()-ing the whole thing. That fallback is the most common way to OOM a session.

open_dataset() is the entry point. For everything below, assume library(arrow); library(dplyr).

Opening a multi-file dataset with open_dataset()

The signature you will reach for:

open_dataset(
  sources,
  schema = NULL,
  partitioning = NULL,
  hive_style = TRUE,
  unify_schemas = NULL,
  format = c("parquet", "csv", "tsv", "text", "json", "feather", "ipc")
)

sources is a directory, a single file, or a character vector of file paths. format defaults to "parquet"; on a CSV directory you either pass format = "csv" or use the convenience wrapper open_csv_dataset(). The Hive layout is auto-detected when hive_style = TRUE.

For a directory like:

nyc-taxi/year=2023/month=1/part-0.parquet
nyc-taxi/year=2023/month=2/part-0.parquet
nyc-taxi/year=2024/month=1/part-0.parquet

the call itself is one line. Pass the directory path to open_dataset() and arrow handles the rest from there: no file paths to enumerate, no schema to declare, no reader function to swap when the format changes. The directory layout and the Parquet file footers give you both for free.

ds <- open_dataset("nyc-taxi")
ds
# FileSystemDataset with 36 Parquet files
# vendor_name: string
# pickup_datetime: timestamp[ms]
# passenger_count: int64
# trip_distance: double
# fare_amount: double
# tip_amount: double
# year: int32
# month: int32

The Hive-style year=2023/month=1/ segments are parsed as real columns, not as opaque directory names. The schema comes from the Parquet file footers, which is fast and accurate (no inference from a 1000-row sample).

If your layout uses raw folders (for example, 2023/01/data.parquet), pass the partition columns explicitly:

ds2 <- open_dataset("taxi", partitioning = c("year", "month"))

The schema argument is the escape hatch when unify_schemas does not pick the type you want. It is also the right tool when the first 1000 rows of a sample do not reflect the full data: blank ISBNs, rare NAs, or other edge cases that flip the inferred type. Use it any time you would rather commit to a column type up front than gamble on inference rules.

And if the type for a column drifts across files (e.g. passenger_count is int32 in some and int64 in others), pin the schema:

ds <- open_dataset(
  "mixed-files",
  schema = schema(
    passenger_count = int64(),
    fare_amount = float64()
  )
)

Lazy dplyr: filter, select, group_by, summarise

Every dplyr verb on a Dataset is lazy. filter(), select(), mutate(), group_by(), summarise(), arrange(), count(), distinct(), and head() all record the operation and return another FileSystemDataset (query). Nothing reads any data until you call collect() (materialise to an R tibble) or compute(path) (materialise to a new on-disk dataset).

A typical analysis:

ds |>
  filter(year == 2024, fare_amount > 100) |>
  select(tip_amount, total_amount, passenger_count) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  group_by(passenger_count) |>
  summarise(
    median_tip_pct = median(tip_pct),
    n = n()
  ) |>
  collect()
# # A tibble: 10 × 3
#    passenger_count median_tip_pct      n
#             <int64>         <dbl>  <int>
#  1              1          16.6 143087
#  2              2          16.2  34418
#  3              5          16.7   5806

The same verbs work on an in-memory arrow::Table, but those are eager by default. A verb arrow does not implement on a Table will auto-collect() first. The “auto-collect” behaviour is convenient in interactive work and dangerous in production code: a “big data” pipeline that accidentally reads a Dataset into a Table will silently work on the test data and OOM on production data. Always reach for open_dataset() first.

The win on this lazy pipeline comes from three mechanics working together:

  1. Lazy evaluation. Each verb records, nothing executes. You can stack filter |> mutate |> group_by |> summarise without paying for any intermediate result.
  2. Predicate pushdown. filter(year == 2024) lets Arrow skip whole directories (the partition value is in the path) and skip row groups inside Parquet files using min/max statistics stored in the file footer. Files that do not match are never read.
  3. Projection pushdown. select(tip_amount, total_amount, passenger_count) reads only those three columns from each row group. The other nine columns stay on disk.

Partitioning choice matters. The columns you partition on should be the ones you filter on most often, and the resulting files should be hundreds of MB each: not 1 MB (file listing dominates) and not 10 GB (you lose parallelism).

From huge CSV to fast Parquet with one pipeline

A common entry point: you have a 9 GB CSV that will not fit in RAM, and you want to analyse it. arrow can read it lazily and write it back out as a partitioned Parquet dataset in one pipeline:

huge <- open_csv_dataset(
  sources = "data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),
  format = "csv"
)

huge |>
  mutate(year = year(CheckoutTime)) |>
  write_dataset(
    path = "data/seattle-parquet",
    partitioning = "year",
    format = "parquet"
  )

# Verify
list.files("data/seattle-parquet", recursive = TRUE)
# [1] "year=2005/part-0.parquet" "year=2006/part-0.parquet" ...

This is also the moment to learn write_dataset(). The signature mirrors open_dataset() in the things that matter (format, partitioning, hive_style), and adds a few performance knobs (max_rows_per_file, max_rows_per_group) that control how Arrow splits the output. Read these once and you will rarely need to touch them again.

The signature:

write_dataset(
  dataset,
  path,
  format = c("parquet", "csv", "tsv", "text", "feather", "ipc"),
  partitioning = NULL,
  hive_style = TRUE,
  ...,
  max_partitions = 1024L,
  max_open_files = 900L,
  max_rows_per_file = 0L,
  min_rows_per_group = 0L,
  max_rows_per_group = 1024L * 1024L
)

write_dataset() writes the dataset one piece at a time. It accepts a Table, a FileSystemDataset, a dplyr query, or a data.frame, and you do not need the full result in memory. The partitioning argument may be a string or a character vector of column names, and the resulting layout is key=value/part-N.parquet.

Parquet is columnar on disk, so the same columns read the same way they are stored. There is no R-level row-to-column reshape; data moves through Arrow buffers, not via R data.frames. The same query on the resulting Parquet dataset is usually 10–100× faster than on the original CSV.

compute() vs collect(): when to keep data on disk

Two ways to materialise a query:

  • collect() returns an R tibble. The whole result must fit in RAM.
  • compute("path") writes the result back to disk as a new Parquet dataset. The whole result only needs to fit on disk.

The decision rule: if the result is too big for RAM but small enough to keep on disk, use compute(). If the result is small enough to want in R, use collect(). If the result is too big for both, drop down to map_batches() (next section).

# Pipeline that filters to 2 B rows of summaries, too big for RAM,
# small enough to land on disk.
result_path <- ds |>
  filter(year >= 2020) |>
  group_by(year, month) |>
  summarise(total_trips = n(), avg_fare = mean(fare_amount)) |>
  compute("intermediate-parquet")

# Now `result_path` is a small, fast dataset you can collect() safely
result_path |> collect()

compute() returns a FileSystemDataset pointing at the new directory, so subsequent filter() / select() / collect() calls keep the same lazy semantics.

Cloud storage: reading from S3 and GCS

Pass an S3FileSystem (or GCSFileSystem) wherever a path is accepted:

bucket <- s3_bucket("arrow-datasets/nyc-taxi-tiny", anonymous = TRUE)

ds <- open_dataset(bucket)
ds |> count(year) |> collect()
# # A tibble: 2 × 2
#    year     n
#   <int> <int>
# 1  2023   111
# 2  2024   230

For GCS, the equivalent is gs_bucket("bucket-name", anonymous = TRUE). The Apache Arrow project hosts two versions of the NYC taxi data on arrow-datasets: nyc-taxi-tiny (~70 MB, useful for prototyping) and nyc-taxi (~70 GB, ~1.7 B rows).

Caching a remote dataset locally speeds up iteration, because the alternative (repeatedly listing thousands of small files on S3) can take minutes per open_dataset() call:

copy_files(bucket, "local-cache/nyc-taxi")
ds <- open_dataset("local-cache/nyc-taxi")

A note on the arrow build: S3 and GCS support depend on how the package was compiled. Run arrow::arrow_info() and check $capabilities$s3 and $capabilities$gcs before assuming cloud access works. The CRAN binary for Linux, macOS, and Windows ships with both; some Linux distributions build arrow without S3.

Streaming work over a too-big dataset with map_batches()

Sometimes even compute() is not an option, because the result you actually want is “one number per batch” rather than “the full filtered table.” map_batches() lets you write a per-batch callback that Arrow feeds row groups to:

ds |>
  filter(year == 2024) |>
  select(tip_amount, total_amount) |>
  map_batches(~ tibble(
    n     = nrow(.x),
    mean_tip_pct = mean(.x$tip_amount / .x$total_amount, na.rm = TRUE)
  )) |>
  collect()
# # A tibble: 1 × 2
#       n mean_tip_pct
#   <int>        <dbl>
# 1    47         0.157

Each batch is a single Arrow Table; the callback returns whatever it wants (a tibble, a model object, a one-row summary) and Arrow concatenates the per-batch results. This is the right tool for things like “fit one model per partition” or “compute a checksum per file.”

The catch: map_batches() is marked experimental in the arrow documentation. Do not ship it in production ETL without a fallback.

Common mistakes when going big in R

  • Forgetting format = "csv" (or open_csv_dataset()). open_dataset() defaults to "parquet"; on a CSV directory it fails or picks up no files. Use the explicit wrappers when in doubt.
  • Hive-style assumption. If your directories are 2023/01/data.parquet instead of year=2023/month=1/data.parquet, you must pass partitioning = c("year", "month") or arrow will treat the directory names as opaque.
  • Type drift across files. If passenger_count is int32 in some files and int64 in others, open with an explicit schema = schema(...) to force a single type.
  • Auto-collect() on in-memory Tables. A dplyr pipeline on an in-memory arrow Table auto-collect()s on verbs arrow does not implement. That is convenient in interactive sessions and a production hazard.
  • S3 listing on tiny files. If your partitioning is too fine-grained, open_dataset(s3://...) spends minutes listing. Coarsen the partition granularity or cache locally with copy_files().
  • JSON writing is not supported. open_dataset(..., format = "json") works; write_dataset(..., format = "json") does not. Convert JSON inputs to Parquet and move on.
  • map_batches() is experimental. Documented as such in arrow; do not put it in production ETL without a fallback path.

Conclusion

If you have ever watched read_csv() eat 16 GB of RAM and your session vanish, arrow is the right next step. Start with open_dataset() on a directory of Parquet files, write your usual filter |> group_by |> summarise pipeline, finish with collect() or compute(). The verbs are the same ones you already know. Arrow just does not pay for the intermediates.

For the single-file case, the Parquet IO guide is the right place to start. For an in-memory OLAP companion that speaks the same dplyr dialect, see the DuckDB guide. If your bottleneck is RAM rather than disk, the R memory management guide is worth a read first.

See Also

  • dplyr filter(): the verb that does the most work in a lazy arrow pipeline, and the one that triggers predicate pushdown.
  • dplyr group_by() pairs with summarise() to define the per-group shape of the result.
  • dplyr summarise(): the verb that turns a query into something you actually collect().