Working with Parquet Files using Arrow

· 4 min read · Updated March 11, 2026 · intermediate
Tags: r, arrow, parquet, data

Parquet is a columnar storage file format that gives you efficient compression and fast query performance. The Arrow package in R provides read_parquet() and write_parquet() functions that make working with Parquet files straightforward. This guide shows you how to use them effectively.

Why Parquet Matters

If you’re working with large datasets, Parquet files can reduce your storage footprint by 50-90% compared to CSV. The columnar format means queries that only need a few columns load much faster because they skip the irrelevant data. For data science workflows where you repeatedly load the same datasets, this format is a practical choice.

The Arrow package connects R to the Apache Arrow ecosystem, giving you these benefits without leaving your familiar R workflow.

Installing Arrow

You need R 4.0.0 or later to use the Arrow package. Install it from CRAN:

install.packages("arrow")

On macOS and Windows, the binary package includes the Arrow C++ library. On Linux, you may need to install additional system dependencies—check the Arrow installation docs for details.
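After installation, you can confirm that your build includes Parquet support. The arrow package ships helpers that report the bundled C++ library version and which optional features were compiled in:

```r
library(arrow)

# Print Arrow C++ version and compiled-in capabilities
arrow_info()

# Confirm Parquet support specifically (TRUE on standard CRAN binaries)
arrow_with_parquet()
```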

Reading Parquet Files

The read_parquet() function loads Parquet files into R. It returns a tibble by default, or an Arrow Table if you prefer.

library(arrow)

# Read a Parquet file into a tibble
df <- read_parquet("data/sales.parquet")
head(df)

Selecting Columns

You can load only the columns you need using col_select. This is useful for large files where you don’t need every column.

# Keep only specific columns
df <- read_parquet("data/sales.parquet", 
                   col_select = c("date", "amount", "customer_id"))

# Use tidy select helpers
df <- read_parquet("data/sales.parquet",
                   col_select = starts_with("s"))

Memory-Mapped Reading

Set mmap = TRUE (the default) to use memory mapping for large files. Combined with as_data_frame = FALSE, this lets you work with files larger than available RAM, because the Arrow Table can reference the mapped file rather than copying everything into R's memory.

# Read as Arrow Table instead of tibble
df <- read_parquet("data/large_file.parquet", as_data_frame = FALSE)
class(df)

Writing Parquet Files

The write_parquet() function saves R data frames to Parquet format.

library(dplyr)

# Create some sample data
df <- tibble(
  id = 1:1000,
  value = rnorm(1000),
  category = sample(c("A", "B", "C"), 1000, replace = TRUE)
)

# Write to Parquet
write_parquet(df, "output/data.parquet")

Compression Options

Parquet supports multiple compression algorithms. The default is snappy, which gives you fast reads with reasonable compression.

# Use gzip compression for smaller files
write_parquet(df, "output/data.parquet", compression = "gzip")

# Use brotli for even better compression
write_parquet(df, "output/data.parquet", compression = "brotli")

# Disable compression entirely
write_parquet(df, "output/data.parquet", compression = "uncompressed")
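To choose a codec empirically, write the same data with each algorithm and compare the resulting file sizes. A minimal sketch, using temporary files and only codecs that standard binary builds include:

```r
library(arrow)

df <- data.frame(id = 1:1000, value = rnorm(1000))

# Write the same data with each codec and report the size on disk
for (codec in c("uncompressed", "snappy", "gzip")) {
  path <- tempfile(fileext = ".parquet")
  write_parquet(df, path, compression = codec)
  cat(sprintf("%-12s %6d bytes\n", codec, file.size(path)))
}
```

Exact sizes depend on your data; highly repetitive columns compress far better than random values like these.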

Controlling Row Groups

The chunk_size parameter controls how many rows go into each row group. Smaller row groups give you better random access; larger groups give better throughput for full scans.

# Write with 10,000 rows per row group
write_parquet(df, "output/data.parquet", chunk_size = 10000)
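You can verify the row-group layout by reading the file's metadata back with ParquetFileReader. A sketch using a small hypothetical frame: 1,000 rows written with chunk_size = 250 should report four row groups.

```r
library(arrow)

df <- data.frame(id = 1:1000, value = rnorm(1000))
path <- tempfile(fileext = ".parquet")
write_parquet(df, path, chunk_size = 250)

# Inspect the row-group layout of the file we just wrote
reader <- ParquetFileReader$create(path)
reader$num_row_groups  # expect 4 groups of 250 rows each
```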

Performance Tuning

Choosing Compression

The right compression depends on your use case:

  • snappy: Fastest, moderate compression. Good for frequent reads.
  • gzip: Slower, better compression. Good for archival.
  • brotli: Best compression, moderate speed. Good for cold storage.
  • zstd: Balanced speed and compression, available when your Arrow build includes zstd support.

# Check which codecs are available
codec_is_available("zstd")
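To check all candidates at once, map codec_is_available() over the codec names. The results depend on how your Arrow binary was built:

```r
library(arrow)

# TRUE/FALSE for each codec in this build
sapply(c("snappy", "gzip", "brotli", "zstd", "lz4"), codec_is_available)
```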

Working with Large Files

For datasets that don’t fit in memory, read the file as an Arrow Table and push filtering down to Arrow, so only the filtered result is materialized in R when you call collect():

# Read and process in chunks
table <- read_parquet("large_file.parquet", as_data_frame = FALSE)

# Filter using Arrow dplyr
library(dplyr)
filtered <- table |> 
  filter(value > 100) |>
  collect()
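The same approach works for aggregation: dplyr verbs on an Arrow Table are translated into Arrow compute operations, and only the summarised result enters R memory at collect(). A sketch assuming the sales-style columns (customer_id, amount) used in earlier examples:

```r
library(arrow)
library(dplyr)

table <- read_parquet("data/sales.parquet", as_data_frame = FALSE)

# Aggregation runs in Arrow; only the small summary crosses into R
totals <- table |>
  group_by(customer_id) |>
  summarise(total = sum(amount)) |>
  collect()
```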

Common Options Reference

read_parquet Parameters

Parameter      Type       Default    Description
file           string     required   Path to Parquet file
col_select     character  NULL       Columns to read
as_data_frame  logical    TRUE       Return tibble (FALSE = Arrow Table)
mmap           logical    TRUE       Use memory mapping

write_parquet Parameters

Parameter    Type        Default    Description
x            data.frame  required   Data to write
sink         string      required   Output file path
compression  string      "snappy"   Compression algorithm
chunk_size   integer     NULL       Rows per row group
version      string      "2.4"      Parquet format version

Troubleshooting

File not found: read_parquet() resolves relative paths against your current working directory. Verify the path with file.exists() or normalizePath(mustWork = TRUE) before reading.

Type mismatches: Parquet has strict type enforcement. If your data has mixed types in a column, you may need to clean it before writing.
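As a sketch of that cleanup, coerce the offending column to a single type before writing. This example uses a hypothetical frame where a numeric column arrived as character:

```r
library(arrow)

# Hypothetical frame where 'amount' was read in as character
df <- data.frame(id = 1:3, amount = c("10.5", "20", "NA"))

# Coerce to one type first; the string "NA" becomes a real NA (with a warning)
df$amount <- suppressWarnings(as.numeric(df$amount))

write_parquet(df, tempfile(fileext = ".parquet"))
```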

Version errors: Some older Parquet readers don’t support version 2.6. Use version = "2.4" for maximum compatibility.
