Arrow Big Data in R: Datasets, Partitioning, and S3
Why your laptop runs out of RAM before R does
A base R data.frame copies on every assignment, a dplyr pipeline keeps intermediate tibbles in memory, and readr::read_csv() slurps the whole file into RAM before you touch it. That works fine on a 50 MB CSV. It falls over on the 9 GB Seattle library checkouts file, on a year of NYC taxi trips, or on a 70 GB public Parquet dataset sitting in S3.
arrow is the answer when the data is too big to load but small enough that you can still describe the analysis you want. The package gives you a columnar in-memory type (arrow::Table), a multi-file abstraction that never materialises the underlying data (FileSystemDataset), and a dplyr backend that turns your familiar verbs into lazy query plans executed by the Arrow C++ library. You write filter(), select(), group_by(), summarise(). Arrow reads only the columns and row groups it needs, in parallel, and returns a tibble.
This guide is the larger-than-memory companion to the single-file Parquet guide. Where that one stops at read_parquet() and write_parquet(), this one starts: directory-shaped datasets, partitioning, pushdown filtering, and reading from S3 and GCS.
Datasets, not Tables: the abstraction that makes big data tractable
There are two arrow types you will see most often:
arrow::Table: an in-memory columnar table. Useful for data that fits in RAM (a few GB) but is annoying to materialise as a data.frame.arrow::FileSystemDataset(and its subclasses likeParquetDataset,CsvDataset): a handle to a directory or set of files. It scans the directory, reads file footers, parses Hive-style partition keys, and infers a unified schema. No rows are read.
The key mental model: a FileSystemDataset is to read_csv() what a database connection is to dbReadTable(). You build a query, the engine decides how to execute it. If you accidentally call as_tibble(ds) or run a verb arrow does not support on a Dataset, it will either error out or fall back to collect()-ing the whole thing. That fallback is the most common way to OOM a session.
open_dataset() is the entry point. For everything below, assume library(arrow); library(dplyr).
Opening a multi-file dataset with open_dataset()
The signature you will reach for:
open_dataset(
sources,
schema = NULL,
partitioning = NULL,
hive_style = TRUE,
unify_schemas = NULL,
format = c("parquet", "csv", "tsv", "text", "json", "feather", "ipc")
)
sources is a directory, a single file, or a character vector of file paths. format defaults to "parquet"; on a CSV directory you either pass format = "csv" or use the convenience wrapper open_csv_dataset(). The Hive layout is auto-detected when hive_style = TRUE.
For a directory like:
nyc-taxi/year=2023/month=1/part-0.parquet
nyc-taxi/year=2023/month=2/part-0.parquet
nyc-taxi/year=2024/month=1/part-0.parquet
the call itself is one line. Pass the directory path to open_dataset() and arrow handles the rest from there: no file paths to enumerate, no schema to declare, no reader function to swap when the format changes. The directory layout and the Parquet file footers give you both for free.
ds <- open_dataset("nyc-taxi")
ds
# FileSystemDataset with 36 Parquet files
# vendor_name: string
# pickup_datetime: timestamp[ms]
# passenger_count: int64
# trip_distance: double
# fare_amount: double
# tip_amount: double
# year: int32
# month: int32
The Hive-style year=2023/month=1/ segments are parsed as real columns, not as opaque directory names. The schema comes from the Parquet file footers, which is fast and accurate (no inference from a 1000-row sample).
If your layout uses raw folders (for example, 2023/01/data.parquet), pass the partition columns explicitly:
ds2 <- open_dataset("taxi", partitioning = c("year", "month"))
The schema argument is the escape hatch when unify_schemas does not pick the type you want. It is also the right tool when the first 1000 rows of a sample do not reflect the full data: blank ISBNs, rare NAs, or other edge cases that flip the inferred type. Use it any time you would rather commit to a column type up front than gamble on inference rules.
And if the type for a column drifts across files (e.g. passenger_count is int32 in some and int64 in others), pin the schema:
ds <- open_dataset(
"mixed-files",
schema = schema(
passenger_count = int64(),
fare_amount = float64()
)
)
Lazy dplyr: filter, select, group_by, summarise
Every dplyr verb on a Dataset is lazy. filter(), select(), mutate(), group_by(), summarise(), arrange(), count(), distinct(), and head() all record the operation and return another FileSystemDataset (query). Nothing reads any data until you call collect() (materialise to an R tibble) or compute(path) (materialise to a new on-disk dataset).
A typical analysis:
ds |>
filter(year == 2024, fare_amount > 100) |>
select(tip_amount, total_amount, passenger_count) |>
mutate(tip_pct = 100 * tip_amount / total_amount) |>
group_by(passenger_count) |>
summarise(
median_tip_pct = median(tip_pct),
n = n()
) |>
collect()
# # A tibble: 10 × 3
# passenger_count median_tip_pct n
# <int64> <dbl> <int>
# 1 1 16.6 143087
# 2 2 16.2 34418
# 3 5 16.7 5806
The same verbs work on an in-memory arrow::Table, but those are eager by default. A verb arrow does not implement on a Table will auto-collect() first. The “auto-collect” behaviour is convenient in interactive work and dangerous in production code: a “big data” pipeline that accidentally reads a Dataset into a Table will silently work on the test data and OOM on production data. Always reach for open_dataset() first.
The win on this lazy pipeline comes from three mechanics working together:
- Lazy evaluation. Each verb records, nothing executes. You can stack
filter |> mutate |> group_by |> summarisewithout paying for any intermediate result. - Predicate pushdown.
filter(year == 2024)lets Arrow skip whole directories (the partition value is in the path) and skip row groups inside Parquet files using min/max statistics stored in the file footer. Files that do not match are never read. - Projection pushdown.
select(tip_amount, total_amount, passenger_count)reads only those three columns from each row group. The other nine columns stay on disk.
Partitioning choice matters. The columns you partition on should be the ones you filter on most often, and the resulting files should be hundreds of MB each: not 1 MB (file listing dominates) and not 10 GB (you lose parallelism).
From huge CSV to fast Parquet with one pipeline
A common entry point: you have a 9 GB CSV that will not fit in RAM, and you want to analyse it. arrow can read it lazily and write it back out as a partitioned Parquet dataset in one pipeline:
huge <- open_csv_dataset(
sources = "data/seattle-library-checkouts.csv",
col_types = schema(ISBN = string()),
format = "csv"
)
huge |>
mutate(year = year(CheckoutTime)) |>
write_dataset(
path = "data/seattle-parquet",
partitioning = "year",
format = "parquet"
)
# Verify
list.files("data/seattle-parquet", recursive = TRUE)
# [1] "year=2005/part-0.parquet" "year=2006/part-0.parquet" ...
This is also the moment to learn write_dataset(). The signature mirrors open_dataset() in the things that matter (format, partitioning, hive_style), and adds a few performance knobs (max_rows_per_file, max_rows_per_group) that control how Arrow splits the output. Read these once and you will rarely need to touch them again.
The signature:
write_dataset(
dataset,
path,
format = c("parquet", "csv", "tsv", "text", "feather", "ipc"),
partitioning = NULL,
hive_style = TRUE,
...,
max_partitions = 1024L,
max_open_files = 900L,
max_rows_per_file = 0L,
min_rows_per_group = 0L,
max_rows_per_group = 1024L * 1024L
)
write_dataset() writes the dataset one piece at a time. It accepts a Table, a FileSystemDataset, a dplyr query, or a data.frame, and you do not need the full result in memory. The partitioning argument may be a string or a character vector of column names, and the resulting layout is key=value/part-N.parquet.
Parquet is columnar on disk, so the same columns read the same way they are stored. There is no R-level row-to-column reshape; data moves through Arrow buffers, not via R data.frames. The same query on the resulting Parquet dataset is usually 10–100× faster than on the original CSV.
compute() vs collect(): when to keep data on disk
Two ways to materialise a query:
collect()returns an R tibble. The whole result must fit in RAM.compute("path")writes the result back to disk as a new Parquet dataset. The whole result only needs to fit on disk.
The decision rule: if the result is too big for RAM but small enough to keep on disk, use compute(). If the result is small enough to want in R, use collect(). If the result is too big for both, drop down to map_batches() (next section).
# Pipeline that filters to 2 B rows of summaries, too big for RAM,
# small enough to land on disk.
result_path <- ds |>
filter(year >= 2020) |>
group_by(year, month) |>
summarise(total_trips = n(), avg_fare = mean(fare_amount)) |>
compute("intermediate-parquet")
# Now `result_path` is a small, fast dataset you can collect() safely
result_path |> collect()
compute() returns a FileSystemDataset pointing at the new directory, so subsequent filter() / select() / collect() calls keep the same lazy semantics.
Cloud storage: reading from S3 and GCS
Pass an S3FileSystem (or GCSFileSystem) wherever a path is accepted:
bucket <- s3_bucket("arrow-datasets/nyc-taxi-tiny", anonymous = TRUE)
ds <- open_dataset(bucket)
ds |> count(year) |> collect()
# # A tibble: 2 × 2
# year n
# <int> <int>
# 1 2023 111
# 2 2024 230
For GCS, the equivalent is gs_bucket("bucket-name", anonymous = TRUE). The Apache Arrow project hosts two versions of the NYC taxi data on arrow-datasets: nyc-taxi-tiny (~70 MB, useful for prototyping) and nyc-taxi (~70 GB, ~1.7 B rows).
Caching a remote dataset locally speeds up iteration, because the alternative (repeatedly listing thousands of small files on S3) can take minutes per open_dataset() call:
copy_files(bucket, "local-cache/nyc-taxi")
ds <- open_dataset("local-cache/nyc-taxi")
A note on the arrow build: S3 and GCS support depend on how the package was compiled. Run arrow::arrow_info() and check $capabilities$s3 and $capabilities$gcs before assuming cloud access works. The CRAN binary for Linux, macOS, and Windows ships with both; some Linux distributions build arrow without S3.
Streaming work over a too-big dataset with map_batches()
Sometimes even compute() is not an option, because the result you actually want is “one number per batch” rather than “the full filtered table.” map_batches() lets you write a per-batch callback that Arrow feeds row groups to:
ds |>
filter(year == 2024) |>
select(tip_amount, total_amount) |>
map_batches(~ tibble(
n = nrow(.x),
mean_tip_pct = mean(.x$tip_amount / .x$total_amount, na.rm = TRUE)
)) |>
collect()
# # A tibble: 1 × 2
# n mean_tip_pct
# <int> <dbl>
# 1 47 0.157
Each batch is a single Arrow Table; the callback returns whatever it wants (a tibble, a model object, a one-row summary) and Arrow concatenates the per-batch results. This is the right tool for things like “fit one model per partition” or “compute a checksum per file.”
The catch: map_batches() is marked experimental in the arrow documentation. Do not ship it in production ETL without a fallback.
Common mistakes when going big in R
- Forgetting
format = "csv"(oropen_csv_dataset()).open_dataset()defaults to"parquet"; on a CSV directory it fails or picks up no files. Use the explicit wrappers when in doubt. - Hive-style assumption. If your directories are
2023/01/data.parquetinstead ofyear=2023/month=1/data.parquet, you must passpartitioning = c("year", "month")or arrow will treat the directory names as opaque. - Type drift across files. If
passenger_countisint32in some files andint64in others, open with an explicitschema = schema(...)to force a single type. - Auto-
collect()on in-memory Tables. A dplyr pipeline on an in-memoryarrow Tableauto-collect()s on verbs arrow does not implement. That is convenient in interactive sessions and a production hazard. - S3 listing on tiny files. If your partitioning is too fine-grained,
open_dataset(s3://...)spends minutes listing. Coarsen the partition granularity or cache locally withcopy_files(). - JSON writing is not supported.
open_dataset(..., format = "json")works;write_dataset(..., format = "json")does not. Convert JSON inputs to Parquet and move on. map_batches()is experimental. Documented as such in arrow; do not put it in production ETL without a fallback path.
Conclusion
If you have ever watched read_csv() eat 16 GB of RAM and your session vanish, arrow is the right next step. Start with open_dataset() on a directory of Parquet files, write your usual filter |> group_by |> summarise pipeline, finish with collect() or compute(). The verbs are the same ones you already know. Arrow just does not pay for the intermediates.
For the single-file case, the Parquet IO guide is the right place to start. For an in-memory OLAP companion that speaks the same dplyr dialect, see the DuckDB guide. If your bottleneck is RAM rather than disk, the R memory management guide is worth a read first.
See Also
- dplyr
filter(): the verb that does the most work in a lazy arrow pipeline, and the one that triggers predicate pushdown. - dplyr
group_by()pairs withsummarise()to define the per-group shape of the result. - dplyr
summarise(): the verb that turns a query into something you actuallycollect().