Using dtplyr: dplyr Syntax on data.table
dtplyr lets you write dplyr code that runs on data.table’s engine. You get the readable, composable syntax of dplyr with the raw speed of data.table — no need to choose between readability and performance.
Why dtplyr?
dplyr is beloved for its readable syntax, but data.table is faster. dtplyr bridges this gap by translating your dplyr code into data.table calls under the hood.
When you’re working with large datasets (millions of rows), the speed difference matters. A grouped summarize that takes 2 seconds in dplyr might complete in 200ms with data.table. dtplyr gives you that performance without rewriting your code.
The tradeoff: dtplyr adds a small overhead for translation. For small datasets (<100k rows), plain dplyr is often faster. For larger data, dtplyr wins.
Getting Started
Install from CRAN:
install.packages("dtplyr")
Load it alongside dplyr:
library(dtplyr)
library(dplyr)
dtplyr works by converting tibbles to lazy data.tables. You still write dplyr code, but it executes as data.table.
Basic Usage
Create a lazy data.table from any data frame:
library(dtplyr)
library(dplyr)
dt <- mtcars |>
lazy_dt()
The lazy_dt() function creates a lazy data.table that records your dplyr operations but doesn’t execute them until you call collect().
Filtering Rows
dt |>
filter(cyl == 4) |>
collect()
The filter() translates to data.table’s i argument. The result is materialised when you call collect().
Selecting Columns
dt |>
select(mpg, cyl, disp) |>
collect()
Adding New Columns
dt |>
mutate(kpm = mpg / 0.425,
hp_per_cyl = hp / cyl) |>
collect()
This translates to data.table’s j argument for column creation.
Grouping and Summarising
dt |>
group_by(cyl) |>
summarise(mean_mpg = mean(mpg),
n = n()) |>
collect()
Grouping translates to data.table’s by argument, and summarise becomes the j expression with grouping.
How Translation Works
dtplyr translates dplyr verbs to data.table syntax. Here’s what happens under the hood:
| dplyr | data.table |
|---|---|
filter() | DT[i, ] |
select() | DT[, .(cols)] |
mutate() | DT[, :=(new = ...)] |
group_by() | DT[, by = "grp"] |
summarise() | DT[, .(agg = ...), by = "grp"] |
arrange() | DT[order(...)] |
left_join() | DT2[DT1] |
You can see the generated data.table code with show_query():
dt |>
filter(cyl > 4) |>
mutate(kpm = mpg * 0.425) |>
group_by(cyl) |>
summarise(mean_kpm = mean(kpm)) |>
show_query()
This prints the equivalent data.table code, which is useful for learning and debugging.
Performance Tips
Always Use collect() at the End
Lazy evaluation means nothing runs until you call collect(). Chaining multiple operations without collecting is efficient — dtplyr combines them into a single data.table call.
# Efficient: one data.table call
dt |>
filter(cyl > 4) |>
mutate(kpm = mpg * 0.425) |>
group_by(cyl) |>
summarise(mean_kpm = mean(kpm)) |>
collect()
# Less efficient: multiple materializations
filtered <- dt |> filter(cyl > 4) |> collect()
mutated <- filtered |> mutate(kpm = mpg * 0.425) |> collect()
# ...
Use summarise() with Multiple Functions
data.table handles multiple summary functions efficiently:
dt |>
group_by(cyl) |>
summarise(mean = mean(mpg),
sd = sd(mpg),
min = min(mpg),
max = max(mpg),
n = n()) |>
collect()
Prefer data.table’s Native Functions
For operations that don’t have dplyr equivalents, you can drop to data.table syntax:
dt |>
filter(cyl > 4) |>
mutate(!!sym("kpl") := mpg * 0.425) |>
collect()
Limitations
dtplyr doesn’t support everything in dplyr:
- Window functions are limited — some advanced ranking functions may not translate
- Complex joins with non-equi conditions can be tricky
- Some tidyverse packages like tidyr pivoting may not fully work
For these cases, convert back to a regular data.table or tibble and use the native functions.
When to Use dtplyr
Use dtplyr when:
- You have large datasets (>100k rows)
- Your code is already written in dplyr
- You need readable code with data.table speed
Stick with plain dplyr when:
- Your data fits in memory easily
- You need advanced tidyverse features
- You’re working with small datasets where translation overhead matters
See Also
- data.table — the underlying engine
- dplyr — the syntax you write
- Fast Data Manipulation with data.table — going native