Working with Parquet Files using Arrow
Parquet is a columnar storage file format that gives you efficient compression and fast query performance. The Arrow package in R provides read_parquet() and write_parquet() functions that make working with Parquet files straightforward. This guide shows you how to use them effectively.
Why Parquet Matters
If you’re working with large datasets, Parquet files can reduce your storage footprint by 50-90% compared to CSV. The columnar format means queries that only need a few columns load much faster because they skip the irrelevant data. For data science workflows where you repeatedly load the same datasets, this format is a practical choice.
The Arrow package connects R to the Apache Arrow ecosystem, giving you these benefits without leaving your familiar R workflow.
Installing Arrow
You need R 4.0.0 or later to use the Arrow package. Install it from CRAN:
install.packages("arrow")
On macOS and Windows, the binary package includes the Arrow C++ library. On Linux, you may need to install additional system dependencies—check the Arrow installation docs for details.
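After installing, a quick sanity check confirms the package loads and reports its version; a minimal sketch (the requireNamespace() guard keeps it safe to run even where arrow is absent):

```r
# Quick sanity check: is arrow installed, and which version?
if (requireNamespace("arrow", quietly = TRUE)) {
  message("arrow ", utils::packageVersion("arrow"), " is available")
  print(arrow::arrow_info())  # build details, including which features were compiled in
} else {
  message('arrow is not installed; run install.packages("arrow")')
}
```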
Reading Parquet Files
The read_parquet() function loads Parquet files into R. It returns a tibble by default, or an Arrow Table if you set as_data_frame = FALSE.
library(arrow)
# Read a Parquet file into a tibble
df <- read_parquet("data/sales.parquet")
head(df)
Selecting Columns
You can load only the columns you need using col_select. This is useful for large files where you don’t need every column.
# Keep only specific columns
df <- read_parquet("data/sales.parquet",
                   col_select = c("date", "amount", "customer_id"))
# Use tidy select helpers
df <- read_parquet("data/sales.parquet",
                   col_select = starts_with("s"))
Memory-Mapped Reading
Memory mapping is on by default (mmap = TRUE), so Arrow can read the file without copying all of it into memory at once. To actually work with files larger than available RAM, combine it with as_data_frame = FALSE so the data stays in an Arrow Table rather than being materialized as a tibble.
# Read as Arrow Table instead of tibble
df <- read_parquet("data/large_file.parquet", as_data_frame = FALSE)
class(df)
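An Arrow Table converts back to an ordinary R data frame whenever you need one. A minimal round-trip sketch using a temp file in place of a real dataset (as.data.frame() and dplyr::collect() both materialize the data):

```r
if (requireNamespace("arrow", quietly = TRUE)) {
  library(arrow)
  path <- tempfile(fileext = ".parquet")
  write_parquet(data.frame(x = 1:5), path)

  tbl <- read_parquet(path, as_data_frame = FALSE)  # Arrow Table: data stays outside R's heap
  df  <- as.data.frame(tbl)                         # materializes an ordinary data frame
  str(df)
}
```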
Writing Parquet Files
The write_parquet() function saves R data frames to Parquet format.
library(dplyr)
# Create some sample data
df <- tibble(
  id = 1:1000,
  value = rnorm(1000),
  category = sample(c("A", "B", "C"), 1000, replace = TRUE)
)
# Write to Parquet
write_parquet(df, "output/data.parquet")
Compression Options
Parquet supports multiple compression algorithms. The default is snappy, which gives you fast reads with reasonable compression.
# Use gzip compression for smaller files
write_parquet(df, "output/data.parquet", compression = "gzip")
# Use brotli for even better compression
write_parquet(df, "output/data.parquet", compression = "brotli")
# Disable compression entirely
write_parquet(df, "output/data.parquet", compression = "uncompressed")
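To see the trade-off on your own data, you can write the same data frame with each codec and compare the resulting file sizes. A sketch using temp files; which codecs are available depends on how your arrow build was compiled, hence the codec_is_available() guard:

```r
if (requireNamespace("arrow", quietly = TRUE)) {
  library(arrow)
  df <- data.frame(id = 1:10000, value = rnorm(10000))

  sizes <- c()
  for (codec in c("uncompressed", "snappy", "gzip")) {
    # Skip codecs this arrow build was not compiled with
    if (codec != "uncompressed" && !codec_is_available(codec)) next
    path <- tempfile(fileext = ".parquet")
    write_parquet(df, path, compression = codec)
    sizes[codec] <- file.size(path)
  }
  print(sizes)  # bytes on disk per codec
}
```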
Controlling Row Groups
The chunk_size parameter controls how many rows go into each row group. Smaller row groups give you better random access; larger groups give better throughput for full scans.
# Write with 10,000 rows per row group
write_parquet(df, "output/data.parquet", chunk_size = 10000)
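You can confirm the resulting layout by opening the file's metadata. A sketch using ParquetFileReader from arrow's lower-level API, assuming its num_row_groups field; the temp file stands in for your own data:

```r
if (requireNamespace("arrow", quietly = TRUE)) {
  library(arrow)
  path <- tempfile(fileext = ".parquet")
  write_parquet(data.frame(id = 1:100000, value = rnorm(100000)),
                path, chunk_size = 10000)

  reader <- ParquetFileReader$create(path)
  print(reader$num_row_groups)  # how many row groups the writer produced
}
```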
Performance Tuning
Choosing Compression
The right compression depends on your use case:
- snappy: Fastest, moderate compression. Good for frequent reads.
- gzip: Slower, better compression. Good for archival.
- brotli: Best compression, moderate speed. Good for cold storage.
- zstd: Balanced speed and compression. Availability depends on how your Arrow build was compiled, so check before relying on it.
# Check which codecs are available
codec_is_available("zstd")
Working with Large Files
For datasets that don’t fit comfortably in memory, read the file as an Arrow Table and filter it with dplyr verbs before collecting the result into R:
# Read as an Arrow Table and filter before collecting
table <- read_parquet("large_file.parquet", as_data_frame = FALSE)
# Arrow evaluates the dplyr verbs lazily; collect() materializes the result
library(dplyr)
filtered <- table |>
  filter(value > 100) |>
  collect()
Common Options Reference
read_parquet Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | string | required | Path to Parquet file |
| col_select | character | NULL | Columns to read |
| as_data_frame | logical | TRUE | Return tibble (FALSE = Arrow Table) |
| mmap | logical | TRUE | Use memory mapping |
write_parquet Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| x | data.frame | required | Data to write |
| sink | string | required | Output file path |
| compression | string | "snappy" | Compression algorithm |
| chunk_size | integer | NULL | Rows per row group |
| version | string | "2.4" | Parquet format version |
Troubleshooting
File not found: verify the path with file.exists() before reading; normalizePath(path, mustWork = FALSE) shows the absolute path R is actually resolving.
Type mismatches: each Parquet column has a single declared type. If a column in your data frame mixes types (for example, a list column holding different classes), clean or coerce it before writing.
Version errors: Some older Parquet readers don’t support version 2.6. Use version = "2.4" for maximum compatibility.
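The path check above can be scripted as a defensive wrapper; a hypothetical helper along these lines (read_parquet_checked is not part of arrow):

```r
# Hypothetical wrapper: fail with a clear message before arrow touches the file
read_parquet_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", normalizePath(path, mustWork = FALSE))
  }
  arrow::read_parquet(path)
}
```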
See Also
- Reading and Writing CSV Files in R — For handling the most common text-based data format
- Data Wrangling with dplyr — Transform your data after loading
- R Memory Management — Optimize memory usage for large datasets