Working with Parquet Files in R with the Arrow Package
Working with Parquet files in R is straightforward with the Arrow package, which provides read_parquet() and write_parquet() for this columnar storage format. Parquet offers efficient compression and fast column-oriented reads. This guide shows how to use both functions effectively.
Why Parquet matters
If you’re working with large datasets, Parquet files can often reduce your storage footprint by 50-90% compared to CSV, depending on the data. The columnar format means queries that only need a few columns load much faster because they skip the irrelevant data. For data science workflows where you repeatedly load the same datasets, this format is a practical choice.
The Arrow package connects R to the Apache Arrow ecosystem, giving you these benefits without leaving your familiar R workflow.
Installing Arrow
You need R 4.0.0 or later to use the Arrow package. Install it from CRAN:
install.packages("arrow")
On macOS and Windows, the binary package includes the Arrow C++ library. On Linux, you may need to install additional system dependencies—check the Arrow installation docs for details.
Reading Parquet files
The read_parquet() function loads Parquet files into R. It returns a tibble by default; pass as_data_frame = FALSE to get an Arrow Table instead.
library(arrow)
# Read a Parquet file into a tibble
df <- read_parquet("data/sales.parquet")
head(df)
Selecting columns
The basic read_parquet() call loads every column. For wide tables with dozens of columns, this wastes memory and I/O time on data you will never use. The col_select argument tells Arrow to skip unneeded columns at the file level, before any data reaches R’s memory. It accepts a character vector of column names or a tidyselect expression.
# Keep only specific columns
df <- read_parquet("data/sales.parquet",
                   col_select = c("date", "amount", "customer_id"))
# Use tidy select helpers
df <- read_parquet("data/sales.parquet",
                   col_select = starts_with("s"))
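To see column selection end to end, the sketch below writes a small file to a temporary location and reads it back twice. The sample data frame and its column names are made up for illustration and stand in for data/sales.parquet.

```r
library(arrow)

# Hypothetical sample data standing in for data/sales.parquet
sales <- data.frame(
  date = as.Date("2024-01-01") + 0:4,
  amount = c(10.5, 20, 7.25, 13, 99),
  customer_id = 101:105,
  store = c("N", "S", "N", "E", "W"),
  sku = c("a1", "b2", "c3", "d4", "e5")
)

path <- tempfile(fileext = ".parquet")
write_parquet(sales, path)

# Select by name: only these two columns are read from the file
by_name <- read_parquet(path, col_select = c("date", "amount"))

# Select with a tidyselect helper: every column whose name starts with "s"
by_helper <- read_parquet(path, col_select = starts_with("s"))

names(by_name)
names(by_helper)
```

Both calls return ordinary tibbles, so the rest of your code does not need to know that columns were skipped.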
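Memory-mapped reading
Column selection reduces how much data you read; memory mapping changes how the file itself is accessed. read_parquet() memory-maps local files by default (mmap = TRUE), so the operating system pages in only the parts of the file that Arrow touches instead of copying the whole file into a buffer first. Parquet data is compressed and encoded, though, so the columns you request are still decoded into memory, and memory mapping alone does not let you load a file larger than RAM. To keep the decoded data out of R’s heap, return an Arrow Table instead of a tibble with as_data_frame = FALSE.
# Read as an Arrow Table instead of a tibble (mmap = TRUE is the default)
tbl <- read_parquet("data/large_file.parquet", as_data_frame = FALSE)
class(tbl)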
Writing Parquet files
Reading is only half the story. Once you have transformed your data in R, writing it back to Parquet preserves the columnar layout and compression benefits for downstream consumers. The write_parquet() function saves R data frames to Parquet format with sensible defaults that work well for most use cases.
library(dplyr)
# Create some sample data
df <- tibble(
  id = 1:1000,
  value = rnorm(1000),
  category = sample(c("A", "B", "C"), 1000, replace = TRUE)
)
# Make sure the target directory exists, then write to Parquet
dir.create("output", showWarnings = FALSE)
write_parquet(df, "output/data.parquet")
Compression options
The default settings produce a valid Parquet file, but tuning the compression algorithm can shrink your output significantly or speed up repeated reads. Parquet supports multiple compression algorithms. The default is snappy, which gives you fast reads with reasonable compression and works well as a general-purpose starting point.
# Use gzip compression for smaller files
write_parquet(df, "output/data.parquet", compression = "gzip")
# Use brotli for even better compression
write_parquet(df, "output/data.parquet", compression = "brotli")
# Disable compression entirely
write_parquet(df, "output/data.parquet", compression = "uncompressed")
Controlling row groups
Compression operates within each row group, so the group size determines the granularity of both compression and I/O. The chunk_size parameter controls how many rows go into each row group. Smaller row groups give you better random access; larger groups give better throughput for full scans and typically achieve higher compression ratios since more similar values are grouped together.
# Write with 10,000 rows per row group
write_parquet(df, "output/data.parquet", chunk_size = 10000)
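One way to confirm how a file was split is to open it with ParquetFileReader and look at its metadata. This is a minimal sketch that assumes chunk_size acts as the maximum number of rows per row group; the 25,000-row data frame is arbitrary.

```r
library(arrow)

df <- data.frame(id = 1:25000, value = rnorm(25000))
path <- tempfile(fileext = ".parquet")

# Ask for at most 10,000 rows per row group
write_parquet(df, path, chunk_size = 10000)

# Open the file and inspect its metadata without loading the data into R
reader <- ParquetFileReader$create(path)
reader$num_rows
reader$num_row_groups
```

Checking num_row_groups on files produced by other tools is also a quick way to spot files with very many tiny row groups, which tend to scan slowly.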
Performance tuning
Choosing compression
The right compression depends on your use case:
- snappy: Fastest, moderate compression. Good for frequent reads.
- gzip: Slower, better compression. Good for archival.
- brotli: Best compression, moderate speed. Good for cold storage.
- zstd: Good balance of compression ratio and speed. Availability depends on how your Arrow build was compiled, so check before relying on it.
# Check which codecs are available
codec_is_available("zstd")
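Because the trade-offs depend heavily on the data, it is worth measuring on a sample of your own. The sketch below writes the same made-up data frame with each codec your build supports and reports the file sizes; the data and the list of codecs tried are illustrative assumptions.

```r
library(arrow)

set.seed(1)
df <- data.frame(
  id = 1:50000,
  category = sample(c("A", "B", "C"), 50000, replace = TRUE),
  value = rnorm(50000)
)

# Keep only the codecs this Arrow build actually supports
candidates <- c("snappy", "gzip", "zstd")
candidates <- candidates[vapply(candidates, codec_is_available, logical(1))]
codecs <- c("uncompressed", candidates)

# Write one temporary file per codec and record its size in bytes
sizes <- vapply(codecs, function(codec) {
  path <- tempfile(fileext = ".parquet")
  write_parquet(df, path, compression = codec)
  file.size(path)
}, numeric(1))

sizes
```

Wrapping the write and a subsequent read_parquet() call in system.time() extends the same loop to compare speed as well as size.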
Working with large files
Picking the right codec helps with storage, but the bigger performance win comes from not materializing the entire dataset as an R data frame. Reading with as_data_frame = FALSE returns an Arrow Table, which lives in Arrow’s memory rather than R’s heap. The Arrow dplyr verbs execute in C++ rather than R, so you can filter and aggregate before converting anything to R objects. Note that read_parquet() still reads the whole file into the Table; for data that does not fit in memory at all, use open_dataset(), which scans files lazily.
# Read into an Arrow Table (Arrow memory, not R's heap)
tbl <- read_parquet("large_file.parquet", as_data_frame = FALSE)
# Filter with Arrow's dplyr backend, then pull the result into R
library(dplyr)
filtered <- tbl |>
  filter(value > 100) |>
  collect()
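A self-contained version of this pattern, using a small temporary file in place of large_file.parquet, might look like the following. The data is invented for illustration; the point is that filter() and summarise() run inside Arrow and only the one-row summary is converted to an R object.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for a large file: values 0.5, 1.0, ..., 500
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = 1:1000, value = seq(0.5, 500, by = 0.5)), path)

tbl <- read_parquet(path, as_data_frame = FALSE)

# Nothing is computed until collect() is called
query <- tbl |>
  filter(value > 100) |>
  summarise(n = n(), mean_value = mean(value))

result <- collect(query)
result
```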
Common options reference
read_parquet parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | string | required | Path to Parquet file |
| col_select | character | NULL | Columns to read |
| as_data_frame | logical | TRUE | Return tibble (FALSE = Arrow Table) |
| mmap | logical | TRUE | Use memory mapping |
write_parquet parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| x | data.frame | required | Data to write |
| sink | string | required | Output file path |
| compression | string | "snappy" | Compression algorithm |
| chunk_size | integer | NULL | Rows per row group |
| version | string | "2.4" | Parquet format version |
Troubleshooting
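File not found: read_parquet() resolves relative paths against the current working directory. Check getwd(), and use file.exists() or normalizePath(path, mustWork = TRUE) to verify the path before reading.
Type mismatches: Every Parquet column has exactly one type. If a column holds mixed types (for example, a list column mixing numbers and strings), convert it to a single type before writing.
Version errors: Some older Parquet readers don’t support format version 2.6. Use version = "2.4" (the default) for broad compatibility, or version = "1.0" if a reader still rejects the file.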
Schema and type handling
Parquet files store schema information (column names and types) in the file footer, alongside the row-group metadata. To inspect it, read the file as an Arrow Table and look at its $schema field. Arrow has a richer type system than R: it distinguishes int8, int16, int32, int64, float32, float64, and various date/time types. When reading into R, Arrow types are mapped to R types. By default an int64 column becomes an R integer when all of its values fit in 32 bits and a bit64::integer64 otherwise; set options(arrow.int64_downcast = FALSE) to always get integer64.
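The sketch below illustrates how 64-bit integers come back into R. It relies on my understanding of current arrow releases: int64 values that fit in 32 bits are downcast to R integer unless the arrow.int64_downcast option is set to FALSE, in which case they arrive as bit64::integer64. The column names are hypothetical.

```r
library(arrow)

# Build a Table with an explicit int64 column and write it out
tbl <- arrow_table(
  id = Array$create(1:3, type = int64()),
  score = c(0.5, 1.5, 2.5)
)
path <- tempfile(fileext = ".parquet")
write_parquet(tbl, path)

# The schema travels with the file
read_parquet(path, as_data_frame = FALSE)$schema

# Default behaviour: small int64 values are downcast to R integer
default_class <- class(read_parquet(path)$id)

# Opt out of downcasting so int64 always maps to bit64::integer64
old <- options(arrow.int64_downcast = FALSE)
strict_class <- class(read_parquet(path)$id)
options(old)

default_class
strict_class
```

Restoring the previous option value afterwards keeps the change from leaking into the rest of the session.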
Columnar storage benefits
Parquet stores data column-by-column rather than row-by-row. Reading only specific columns is fast because the reader skips the bytes for unneeded columns entirely. For an analytical query that uses 3 of 50 similarly sized columns, Parquet reads roughly 3/50 of the data. This columnar layout also enables efficient compression: each column has a uniform data type, so dictionary and run-length encodings compress repeated values aggressively. A 1GB CSV often compresses to 100-200MB as Parquet.
The Arrow format
Apache Arrow defines an in-memory columnar format for analytical data. Columns are stored as contiguous arrays, enabling SIMD (vectorized CPU) operations on entire columns without row-by-row processing. Arrow is the format that makes data transfer between R, Python, Spark, and databases efficient, because the same memory layout works across all these systems.
arrow::read_parquet("file.parquet", as_data_frame = FALSE) reads a Parquet file into an Arrow Table. arrow::as_arrow_table(df) converts an R data frame to an Arrow Table in memory. Both support the same dplyr verbs through Arrow’s compute engine: operations are translated to Arrow C++ rather than running in R.
Parquet is a separate Apache project that complements Arrow. Where Arrow is an in-memory format, Parquet is a persistent storage format with compression and encoding optimizations. A Parquet file is often 2-10x smaller than a CSV of the same data, and reading a few columns is usually much faster because only the requested columns are read and decompressed.
Lazy evaluation
arrow::open_dataset("directory/") creates a reference to a Parquet dataset without reading it. Apply dplyr verbs and they accumulate without executing. collect() fetches the result.
ds <- open_dataset("data/sales/")
result <- ds %>%
  filter(year == 2024, region == "North") %>%
  group_by(product) %>%
  summarise(total = sum(revenue)) %>%
  collect()
Arrow evaluates this in C++ without reading unnecessary data: it skips partitions that do not match the filter and reads only the columns used. For large datasets, this can be orders of magnitude faster than reading everything into R first.
Partitioned datasets
Hive-style partitioning stores each partition in a directory: data/year=2024/region=North/file.parquet. Arrow’s dataset API discovers these directories automatically and uses partition values as filter predicates: filtering on year == 2024 skips the directories for other years entirely.
arrow::write_dataset(df, "data/", partitioning = c("year", "region")) writes a partitioned dataset. Partition by columns that you commonly filter on. Too many partitions (e.g., one per user ID) creates overhead from many small files; too few loses the filtering benefit.
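The following sketch ties partitioned writing and lazy reading together. It writes a tiny, made-up sales table into a temporary directory partitioned by year and region, lists the resulting key=value directories, and then runs a filtered aggregation through open_dataset().

```r
library(arrow)
library(dplyr)

# Hypothetical sales data, small enough to verify by hand
sales <- data.frame(
  year = c(2023L, 2023L, 2024L, 2024L, 2024L),
  region = c("North", "South", "North", "North", "South"),
  product = c("a", "b", "a", "b", "a"),
  revenue = c(10, 20, 30, 40, 50)
)

root <- tempfile("sales_")
write_dataset(sales, root, partitioning = c("year", "region"))

# One directory per year/region combination, Hive-style
files <- list.files(root, recursive = TRUE)
files

# Lazy query: only the year=2024/region=North partition needs to be scanned
result <- open_dataset(root) |>
  filter(year == 2024, region == "North") |>
  group_by(product) |>
  summarise(total = sum(revenue)) |>
  arrange(product) |>
  collect()
result
```

The partition columns are stored in the directory names rather than inside the files, and open_dataset() reconstructs them as regular columns.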
Data types and schema
Arrow has richer types than R’s base types. Arrow int32 maps to R integer; float64 maps to R double; utf8 maps to R character; dictionary (categorical) maps to R factor. Nested types (list, struct, map) have no base-vector equivalent and come back as list or data frame columns.
arrow::schema(id = int64(), name = utf8(), value = float32()) defines a schema. read_parquet() does not take a schema argument; to impose types, pass schema = my_schema to open_dataset(), or read the file as an Arrow Table and cast it with tbl$cast(my_schema). read_parquet("file.parquet", col_select = c("id", "value")) reads only specific columns, which is faster than reading all columns and selecting later.
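One way to coerce column types after reading is to cast an Arrow Table to a target schema. This is a sketch, not the only approach: the column names are hypothetical, and it widens id from int32 to int64 while leaving the other columns as they are.

```r
library(arrow)

path <- tempfile(fileext = ".parquet")
write_parquet(
  data.frame(id = 1:3, name = c("x", "y", "z"), value = c(1.5, 2.5, 3.5)),
  path
)

# Target schema: same column names, with id widened to 64 bits
target <- schema(id = int64(), name = utf8(), value = float64())

tbl <- read_parquet(path, as_data_frame = FALSE)
tbl$schema                # id was written from an R integer, so it is int32

casted <- tbl$cast(target)
casted$schema             # id is now int64

id_type <- casted[["id"]]$type$ToString()
```

The target schema needs to name the same columns as the Table; casts that would lose information fail by default rather than silently truncating.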
DuckDB integration
Arrow and DuckDB integrate tightly. duckdb::duckdb_register_arrow(con, "my_view", arrow_table) registers an Arrow table as a DuckDB view, enabling SQL queries over Arrow data without copying. arrow::to_duckdb(arrow_ds) returns a dbplyr tbl backed by DuckDB, combining Arrow’s efficient I/O with DuckDB’s SQL engine.
This combination, Arrow for storage and DuckDB for computation, is a practical alternative to Spark for datasets up to hundreds of gigabytes on a single machine. DuckDB also reads Parquet files natively: DBI::dbGetQuery(con, "SELECT * FROM read_parquet('data/*.parquet') WHERE year = 2024").
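As a minimal sketch of the native Parquet path, the example below writes a small file with Arrow and aggregates it with DuckDB SQL. It assumes the DBI and duckdb packages are installed; the table contents and column names are invented for illustration.

```r
library(arrow)
library(DBI)

# Write a small Parquet file with Arrow
path <- tempfile(fileext = ".parquet")
write_parquet(
  data.frame(year = c(2023L, 2024L, 2024L), revenue = c(10, 20, 30)),
  path
)
# Forward slashes keep the path safe inside the SQL string on Windows
path <- normalizePath(path, winslash = "/")

# In-memory DuckDB database; read_parquet() here is DuckDB's SQL function
con <- dbConnect(duckdb::duckdb())
res <- dbGetQuery(con, sprintf(
  "SELECT year, SUM(revenue) AS total
   FROM read_parquet('%s')
   GROUP BY year
   ORDER BY year",
  path
))
dbDisconnect(con, shutdown = TRUE)

res
```

If you already hold an Arrow Table or Dataset in R, arrow::to_duckdb() avoids the round trip through a file path.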
Choosing between Arrow and data.table
Arrow and data.table address overlapping but distinct use cases. data.table excels at in-memory aggregations and joins with very low overhead: if your data fits comfortably in RAM, data.table is often faster and simpler. Arrow shines when data exceeds available memory, when you need to read only a subset of columns or rows from large files, or when you need interoperability with other systems that read Parquet. Many workflows combine both: use Arrow to read and filter large Parquet files down to a manageable subset, then convert to data.table or a tibble for the final aggregation.
See also
- Reading and Writing CSV Files in R — For handling the most common text-based data format
- Data Wrangling with dplyr — Transform your data after loading
- R Memory Management — Optimize memory usage for large datasets