Using Polars from R
Polars started as a Rust-based alternative to pandas, designed for speed. It has since grown into a multi-language ecosystem, with a native R binding that lets you tap into its performance from your existing R workflows.
This guide shows you how to get started with polars in R and when it makes sense to choose it over dplyr.
What is Polars?
Polars is a DataFrame library written in Rust that emphasizes:
- Speed — Often 5-10x faster than pandas for large datasets
- Memory efficiency — Uses Apache Arrow under the hood
- Lazy evaluation — Optimizes query plans automatically
- Type safety — A strict schema catches type errors before a query executes
The R package, simply called polars, provides a nearly complete port of the Python API. You get most of the functionality of Python Polars but with R-native syntax conventions.
Installation
Install the package from CRAN:
install.packages("polars")
Load it like any other package:
library(polars)
The package loads quickly and has minimal dependencies, making it a lightweight addition to your R setup.
Your First Polars DataFrame
Create a DataFrame directly in Polars:
df <- pl$DataFrame(
name = c("Alice", "Bob", "Charlie", "Diana"),
age = c(25L, 30L, 35L, 28L),
salary = c(55000L, 72000L, 89000L, 61000L)
)
df
Output:
shape: (4, 3)
┌─────────┬─────┬────────┐
│ name ┆ age ┆ salary │
│ str ┆ i32 ┆ i32 │
├─────────┼─────┼────────┤
│ Alice ┆ 25 ┆ 55000 │
│ Bob ┆ 30 ┆ 72000 │
│ Charlie ┆ 35 ┆ 89000 │
│ Diana ┆ 28 ┆ 61000 │
└─────────┴─────┴────────┘
The output format shows you the schema upfront—useful for understanding your data structure. You can see the column names, their data types (str, i32), and the actual values in one glance.
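You can also query this metadata programmatically instead of reading it off the printed frame. A minimal sketch, assuming the `df` created above (accessor names reflect the r-polars API and may vary slightly across versions):

```r
library(polars)

df <- pl$DataFrame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(25L, 30L, 35L, 28L),
  salary = c(55000L, 72000L, 89000L, 61000L)
)

# Inspect structure without printing the whole frame
df$schema   # named list mapping column names to Polars data types
df$columns  # character vector of column names
df$shape    # number of rows and columns
```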
Reading Data
Polars reads files extremely fast. Here is how to read a CSV:
df <- pl$read_csv("data.csv")
For larger files, scan the CSV lazily with scan_csv(), which defers reading until you collect the result:
lazy_df <- pl$scan_csv("large_data.csv")
df <- lazy_df$collect()
Polars also supports:
- read_parquet() — For Parquet files, the gold standard for columnar storage
- read_json() — For JSON files, both line-delimited and standard
- read_ipc() — For Arrow IPC files, great for interoperability
- read_delta() — For Delta Lake tables, useful in data lake architectures
Reading Parquet files is particularly useful since they are compressed and maintain type information, leading to faster loads than CSV.
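As a quick illustration, here is a sketch of a Parquet round trip; the file path is an assumption, and `df` is the example frame from earlier:

```r
library(polars)

df <- pl$DataFrame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(25L, 30L, 35L, 28L)
)

# Write to Parquet, then read it back; unlike CSV,
# column types survive the round trip
df$write_parquet("people.parquet")
df2 <- pl$read_parquet("people.parquet")
```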
Basic Operations
Selecting Columns
df$select("name", "salary")
# Or use explicit column expressions
df$select(pl$col("name", "salary"))
Filtering Rows
# Filter people over 30
df$filter(pl$col("age") > 30)
Creating New Columns
df$with_columns(
(pl$col("salary") * 1.1)$alias("new_salary")
)
Grouping and Aggregating
# Assuming df has a department column
df$group_by("department")$agg(
  pl$col("salary")$mean()$alias("avg_salary"),
  pl$col("name")$count()$alias("count")
)
Joining DataFrames
Polars provides several join types similar to dplyr:
# Left join
employees <- pl$DataFrame(
id = 1:3,
name = c("Alice", "Bob", "Charlie")
)
departments <- pl$DataFrame(
dept_id = 1:3,
department = c("Engineering", "Sales", "Marketing")
)
# Join on different column names
employees$join(
  departments,
  left_on = "id",
  right_on = "dept_id",
  how = "left"
)
Window Functions
Window functions in Polars let you perform calculations across rows:
df <- pl$DataFrame(
department = c("A", "A", "B", "B"),
salary = c(50000, 60000, 55000, 65000)
)
# Calculate running total within each department
df$with_columns(
pl$col("salary")$sum()$over("department")$alias("dept_total")
)
Lazy Mode
Polars shines with lazy evaluation, which optimizes your entire pipeline before execution:
lazy_df <- pl$scan_csv("data.csv")
result <- lazy_df$
  filter(pl$col("age") > 25)$
  select("name", "salary")$
  sort("salary")$
  collect() # Execute the optimized plan
The $collect() call triggers execution. Between creation and collection, Polars builds an optimized query plan.
Lazy mode is particularly powerful when you have complex transformations. Polars can eliminate unnecessary operations, push filters down to the source, and parallelize work automatically.
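To see what the optimizer actually does, you can print the query plan before collecting. A sketch assuming a `data.csv` with `age`, `name`, and `salary` columns; `$explain()` is the LazyFrame method for this in r-polars:

```r
library(polars)

query <- pl$scan_csv("data.csv")$
  filter(pl$col("age") > 25)$
  select("name", "salary")

# Show the optimized plan Polars intends to run (no data is read yet).
# You should see the filter pushed down into the CSV scan.
cat(query$explain())
```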
Performance Comparison
Polars typically outperforms dplyr on large datasets. Here is a quick benchmark:
library(dplyr)
library(microbenchmark)
# Create a large dataset
big_df <- data.frame(
x = rep(1:1000, each = 1000),
y = rnorm(1000000),
z = sample(letters, 1000000, replace = TRUE)
)
# dplyr approach
dplyr_result <- big_df %>%
group_by(x) %>%
summarise(mean_y = mean(y))
# Polars approach
polars_result <- pl$DataFrame(big_df)$group_by("x")$agg(
pl$col("y")$mean()$alias("mean_y")
)
On datasets with millions of rows, Polars often finishes 5-10x faster. The speedup comes from several factors: better memory layout, multi-threaded execution, and query optimization.
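To turn the comparison above into actual timings, you could wrap both approaches in microbenchmark(); a sketch assuming `big_df` from above and that dplyr and polars are loaded:

```r
# Compare median runtimes over 10 iterations; note the polars
# expression includes the data.frame-to-Polars conversion cost
microbenchmark(
  dplyr = big_df %>% group_by(x) %>% summarise(mean_y = mean(y)),
  polars = pl$DataFrame(big_df)$group_by("x")$agg(
    pl$col("y")$mean()$alias("mean_y")
  ),
  times = 10
)
```

For a fairer comparison of query speed alone, convert `big_df` to a Polars DataFrame once outside the benchmark.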
When to Use Polars
Choose Polars when:
- You work with datasets over 1GB
- Processing time is critical
- You need features not available in dplyr (like streaming mode)
- You are porting Python Polars code to R
Stick with dplyr when:
- You prefer the tidyverse syntax
- Your data fits comfortably in memory
- You need maximum compatibility with R packages
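The streaming mode mentioned above processes data in batches, so queries can run over files larger than memory. A hedged sketch: the `streaming` argument to `$collect()` and the column names `category` and `value` are assumptions, and the streaming API has changed across polars versions.

```r
library(polars)

# Stream an aggregation over a CSV that may not fit in memory;
# scan_csv() builds a lazy plan, collect(streaming = TRUE) runs
# it in batches instead of loading everything at once
result <- pl$scan_csv("huge.csv")$
  group_by("category")$
  agg(pl$col("value")$sum()$alias("total"))$
  collect(streaming = TRUE)
```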
Converting Between Polars and R
Polars to R
# Polars DataFrame to R data.frame
as.data.frame(polars_df)
# Polars DataFrame to tibble (requires the tibble package)
tibble::as_tibble(polars_df)
R to Polars
# R data.frame to Polars
pl$DataFrame(r_dataframe)
# R tibble to Polars
pl$DataFrame(r_tibble)
Common Pitfalls
Syntax Differences
Polars uses different syntax than dplyr. For example:
# dplyr
df %>% filter(age > 30) %>% select(name)
# Polars
df$filter(pl$col("age") > 30)$select("name")
NSE Evaluation
Polars does not support non-standard evaluation. Always reference columns explicitly with pl$col():
# Bare column names, as in dplyr verbs, do not work
df$filter(age > 30) # Error: object 'age' not found
# Use explicit col()
df$filter(pl$col("age") > 30) # Works
Next Steps
To deepen your Polars knowledge:
- Explore the official Polars R documentation
- Learn about pl$DataFrame and pl$LazyFrame methods
- Try integrating Polars into your existing tidyverse workflows
See Also
- filter() — dplyr's filtering function
- mutate() — Creating columns with dplyr
- c() — Base R's combine function
- readr-and-file-import — Fast file import with readr