Fast Data Manipulation with data.table
data.table is one of the fastest ways to manipulate tabular data in R. It extends data frames with a concise syntax that handles filtering, selection, and aggregation in fewer keystrokes than base R or dplyr. This guide covers the fundamentals and builds toward real-world workflows.
Why data.table?
When working with large datasets, speed matters. data.table modifies data in place, avoiding the copy-on-modify behavior that slows down data frames and tibbles. The package is widely used in industry, including at Google, Microsoft, and The New York Times, for analyses involving millions of rows.
The core insight behind data.table is its three-part syntax:
DT[i, j, by]
- i: which rows to select (like WHERE in SQL)
- j: which columns to select or compute (like SELECT)
- by: how to group the computation (like GROUP BY)
This single line replaces what would be multiple operations in dplyr.
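Putting the three parts together, a single bracket call can filter, aggregate, and group at once. A quick sketch using the built-in iris data, which the rest of this guide also uses:

```r
library(data.table)
iris_dt <- as.data.table(iris)

# i: keep long sepals; j: mean petal length; by: per species
iris_dt[Sepal.Length > 5, .(mean_petal = mean(Petal.Length)), by = Species]
```

The equivalent dplyr pipeline would need a filter, a group_by, and a summarise.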
Getting Started
Install and load the package:
install.packages("data.table")
library(data.table)
Convert your data frame to a data.table:
iris_dt <- as.data.table(iris)
# Or create directly
iris_dt <- data.table(
  id = 1:100,
  value = rnorm(100),
  group = sample(letters[1:4], 100, replace = TRUE)
)
Subsetting Rows with i
The first argument filters rows. Pass a logical condition inside the brackets.
# Filter rows where species is setosa
iris_dt[Species == "setosa"]
# Multiple conditions — use & for AND
iris_dt[Sepal.Length > 5 & Species == "versicolor"]
# Use | for OR
iris_dt[Species == "setosa" | Species == "virginica"]
# Filter by row number (first 10 rows)
iris_dt[1:10]
# Filter by vector membership
iris_dt[Species %in% c("setosa", "virginica")]
You can also use the %between% operator for range checks:
iris_dt[Sepal.Length %between% c(5, 6)]
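For character vectors, data.table also ships %chin%, an optimized drop-in for %in%. A small sketch (Species is a factor, so it is converted to character first):

```r
library(data.table)
iris_dt <- as.data.table(iris)

# %chin% is a fast %in% for character vectors
iris_dt[as.character(Species) %chin% c("setosa", "virginica")]
```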
Selecting and Computing Columns with j
The second argument controls what happens to columns. Unlike dplyr’s select and mutate, data.table combines both in j.
# Select specific columns
iris_dt[, .(Sepal.Length, Species)]
# Or using the character vector shortcut
iris_dt[, c("Sepal.Length", "Species")]
# Compute new columns
iris_dt[, Sepal.Area := Sepal.Length * Sepal.Width]
The := operator assigns by reference, modifying the table in-place without copying.
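Because := modifies the table in place, every name bound to the same data.table sees the change; use copy() when you need an independent snapshot. A minimal illustration:

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT          # same underlying table, not a copy
backup <- copy(DT)   # explicit deep copy, safe from later := updates

DT[, x := x * 10L]
alias$x    # 10 20 30 — the alias saw the in-place update
backup$x   # 1 2 3 — the copy is unaffected
```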
# Multiple new columns at once
iris_dt[, `:=`(
  Sepal.Area = Sepal.Length * Sepal.Width,
  Petal.Area = Petal.Length * Petal.Width
)]
# Compute and return without modifying
iris_dt[, sqrt(Sepal.Length)]
For expressions that return multiple values, wrap in .() to get a data.table back:
iris_dt[, .(Mean.Length = mean(Sepal.Length))]
Grouping with by
The third argument groups the computation. Combine j and by for aggregations.
# Count rows per species
iris_dt[, .N, by = Species]
# Multiple aggregations
iris_dt[, .(
  Mean.Length = mean(Sepal.Length),
  Max.Width = max(Sepal.Width),
  Count = .N
), by = Species]
# Group by multiple columns
mtcars_dt <- as.data.table(mtcars)
mtcars_dt[, .(mpg_mean = mean(mpg)), by = .(cyl, gear)]
.N is a special symbol that holds the number of rows in each group.
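.N is also useful outside of by: in j it gives the total row count, and in i it indexes the last row. Two common idioms:

```r
library(data.table)
iris_dt <- as.data.table(iris)

iris_dt[, .N]   # total row count: 150
iris_dt[.N]     # the last row
```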
Chaining Operations
Chain expressions by adding more brackets:
# Filter, then group, then sort
iris_dt[
  Sepal.Length > 5,
  .(Mean.Length = mean(Sepal.Length)),
  by = Species
][order(-Mean.Length)]
This reads as: “From iris_dt, where Sepal.Length > 5, compute mean by Species, then sort by Mean.Length descending.”
Setting Keys for Speed
Keys make row subsetting and grouping faster by sorting data internally.
setkey(iris_dt, Species)
# Now filtering by Species is much faster
iris_dt["setosa"]
# Key by multiple columns (using mtcars, which has cyl and gear)
mtcars_dt <- as.data.table(mtcars)
setkey(mtcars_dt, cyl, gear)
mtcars_dt[.(4, 3)]  # Rows where cyl == 4 and gear == 3
The .() syntax is a shortcut for list() when used in i for key lookup.
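You can inspect the current key with key() and haskey(). A short sketch:

```r
library(data.table)
iris_dt <- as.data.table(iris)

haskey(iris_dt)          # FALSE — no key set yet
setkey(iris_dt, Species)
key(iris_dt)             # "Species"
iris_dt["setosa", .N]    # keyed lookup and count in one call: 50
```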
Fast File Operations
data.table includes fread() and fwrite() for blazing-fast file I/O.
# Read a CSV much faster than read.csv
dt <- fread("large_file.csv")
# Write a CSV faster than write.csv
fwrite(dt, "output.csv")
# Read with specific column types
dt <- fread("file.csv",
  colClasses = c(id = "character", value = "numeric")
)
# Skip rows while reading
dt <- fread("file.csv", skip = 10)
These functions are often 10-100x faster than base R equivalents.
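fread can also parse an in-memory string via its text argument, which is handy for quick experiments, and its select argument reads only the columns you need. A small sketch with made-up data:

```r
library(data.table)

csv <- "id,value,group\n1,2.5,a\n2,3.7,b\n3,1.1,a"

# Parse a CSV from a string instead of a file
dt <- fread(text = csv)

# Read only a subset of columns
dt_small <- fread(text = csv, select = c("id", "group"))
```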
Common Patterns
Update by Reference with :=
Modify existing columns:
DT[, price := price * 1.1] # Increase all prices by 10%
Conditional Updates
DT[category == "A", price := price * 0.9]
DT[quantity < 10, status := "low_stock"]
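For multi-branch conditionals, data.table provides fcase(), a fast vectorized case-when. A small sketch with hypothetical column names:

```r
library(data.table)

DT <- data.table(quantity = c(5, 50, 500))

# fcase() checks conditions in order, like SQL's CASE WHEN
DT[, status := fcase(
  quantity < 10,  "low_stock",
  quantity < 100, "ok",
  default = "overstocked"
)]
```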
Add Rows
# Add new rows (rbind returns a new table, so assign the result)
iris_dt <- rbind(iris_dt, new_row_dt)
# Or use rbindlist for many rows
iris_dt <- rbindlist(list(iris_dt, row1, row2, row3))
Remove Columns
DT[, column_to_remove := NULL]
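To drop several columns at once, pass a character vector on the left of := (sketch with placeholder names):

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6, c = 7:9)

# Remove both columns by reference
DT[, c("b", "c") := NULL]
names(DT)  # "a"
```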
Rolling Joins
data.table supports fast rolling joins:
# Example: carry the last known value forward to missing dates
DT1 <- data.table(
  date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-05")),
  value = c(10, 12, 15)
)
DT2 <- data.table(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                   "2024-01-04", "2024-01-05"))
)
setkey(DT1, date)
setkey(DT2, date)
DT1[DT2, roll = TRUE]  # each DT2 date gets the most recent DT1 value at or before it
When to Use data.table
data.table is ideal when:
- Your dataset has millions of rows
- You need maximum performance for iteration
- You’re building production pipelines
- Memory is constrained (in-place modification helps)
Consider alternatives when:
- Readability is more important than speed (dplyr may be clearer)
- Working in a tidyverse-only environment
- Integrating with tidyverse packages that expect tibbles
Summary
| Syntax | Purpose |
|---|---|
| DT[i] | Filter rows |
| DT[, j] | Select or compute columns |
| DT[i, j] | Filter then compute |
| DT[i, j, by] | Filter, compute, group |
| := | Assign by reference |
| .N | Count rows in group |
| .() | Wrap result in data.table |
| setkey() | Set sort key for fast lookups |
| fread() / fwrite() | Fast file I/O |
The i-j-by pattern is the heart of data.table. Once it clicks, you’ll manipulate data faster than ever before.
See Also
- Data Wrangling with dplyr — The tidyverse alternative for readable data manipulation
- Reading and Writing CSV Files in R — Compare base R vs data.table file I/O performance
- DuckDB for R — In-memory SQL database for analytical queries