Fast Data Manipulation with data.table
data.table provides fast data manipulation for tabular data in R. It extends data frames with a concise syntax that handles filtering, selection, and aggregation in fewer keystrokes than base R or dplyr. This guide covers the fundamentals and builds toward real-world workflows where speed and memory efficiency matter.
Why data.table?
When working with large datasets, speed matters. data.table operates on data in-place, avoiding the copy-on-modify behavior that slows down data frames and tibbles. The package is used by data scientists at Google, Microsoft, and New York Times for analyses involving millions of rows.
The core insight behind data.table is its three-part syntax:
DT[i, j, by]
- i: rows to select (like
WHEREin SQL) - j: columns to compute (like
SELECT) - by: grouping variable (like
GROUP BY)
One line of data.table code replaces what would be multiple operations in dplyr.
Getting started
Install and load the package:
install.packages("data.table")
library(data.table)
Convert an existing data frame to a data.table with as.data.table(). The conversion preserves all columns and rows but adds the data.table class, unlocking the bracket syntax and in-place assignment operators. You can also create a data.table directly from vectors, mirroring the data.frame() constructor but with better default behavior: strings are not coerced to factors.
iris_dt <- as.data.table(iris)
# Or create one directly from vectors
iris_dt <- data.table(
id = 1:100,
value = rnorm(100),
group = sample(letters[1:4], 100, replace = TRUE)
)
Subsetting rows with i
The first argument filters rows. Pass a logical condition inside the brackets: the expression is evaluated in the context of the data.table, so you can reference column names directly without quoting them or prefixing with the table name.
# Filter rows where species is setosa
iris_dt[Species == "setosa"]
# Multiple conditions — use & for AND, | for OR
iris_dt[Sepal.Length > 5 & Species == "versicolor"]
iris_dt[Species == "setosa" | Species == "virginica"]
# Filter by row number (first 10 rows)
iris_dt[1:10]
Logical conditions in i combine naturally: & requires both conditions to be true, | requires either. The conditions are vectorized, so each row is tested independently. Unlike dplyr’s filter(), data.table expressions in i do not need to reference the data frame name; column names are resolved directly from the table’s environment.
# Filter by vector membership
iris_dt[Species %in% c("setosa", "virginica")]
# The %between% operator checks a range
iris_dt[Sepal.Length %between% c(5, 6)]
Selecting and computing columns with j
The second argument controls what happens to columns. Unlike dplyr’s separate select() and mutate(), data.table combines column selection, computation, and assignment into the single j argument. What j returns depends on what you put there: a list returns a new data.table, a vector returns a plain vector, and := modifies in place.
# Select specific columns — .() is shorthand for list()
iris_dt[, .(Sepal.Length, Species)]
# Or use a character vector for column names
iris_dt[, c("Sepal.Length", "Species")]
The .() function wraps column references into a list, producing a data.table result. Using a character vector is convenient when column names are stored in a variable, but .() gives you named columns and allows computed expressions alongside simple selections.
# Compute new columns with := (assignment by reference, no copy)
iris_dt[, Sepal.Area := Sepal.Length * Sepal.Width]
# Multiple new columns at once
iris_dt[, `:=`(
Sepal.Area = Sepal.Length * Sepal.Width,
Petal.Area = Petal.Length * Petal.Width
)]
The := operator is data.table’s key innovation. It modifies the table in place without making a copy, which means adding a column to a 10-million-row table is nearly instantaneous. The trade-off is that in-place modification changes the object for all references to it, which differs from R’s usual copy-on-modify behavior and can surprise if you are not expecting it.
# Compute and return a result without modifying the table
iris_dt[, sqrt(Sepal.Length)]
# Wrap in .() for a named data.table result
iris_dt[, .(Mean.Length = mean(Sepal.Length))]
Grouping with by
The third argument groups the computation. Combine j and by for split-apply-combine aggregations that run in a single pass over the data. The special symbol .N counts rows in each group, and .SD (Subset of Data) refers to the group’s data for more complex operations.
# Count rows per species using the special .N symbol
iris_dt[, .N, by = Species]
# Multiple aggregations in one pass — mean, max, and count per group
iris_dt[, .(
Mean.Length = mean(Sepal.Length),
Max.Width = max(Sepal.Width),
Count = .N
), by = Species]
Each expression in j is evaluated independently for every group, so you can compute as many summary statistics as you need in a single call. The by argument accepts a single column name, a list of columns with .(), or a computed grouping variable. Grouping by multiple columns creates a cross-product of group levels; every combination of the grouping variables gets its own row in the result.
# Group by multiple columns with .() (shorthand for list())
DT <- data.table(mtcars)
DT[, .(mpg_mean = mean(mpg)), by = .(cyl, gear)]
Chaining operations
data.table operations can be chained by appending more [] calls to the result of the previous one. Each bracket receives the output of the bracket before it, creating a readable pipeline that avoids intermediate variable assignments. This pattern reads naturally from left to right: filter, then aggregate, then sort.
# Filter, then group, then sort — each [] operates on the previous result
iris_dt[
Sepal.Length > 5,
.(Mean.Length = mean(Sepal.Length)),
by = Species
][order(-Mean.Length)]
This reads as: “From iris_dt, where Sepal.Length > 5, compute mean by Species, then sort by Mean.Length descending.” Chaining keeps related operations together and eliminates the need for temporary variables that clutter the workspace and consume memory.
Setting keys for speed
Keys make row subsetting and grouping dramatically faster by sorting data internally and creating a binary search tree. When a key is set, subsetting uses binary search instead of a full vector scan; on large tables, this can be the difference between milliseconds and seconds.
setkey(iris_dt, Species)
# Now filtering by Species uses fast binary search
iris_dt["setosa"]
Key-based subsetting uses the DT["value"] syntax without an explicit column name. data.table matches “setosa” against the key column automatically. The .() syntax in i creates a list for multi-column key lookups, matching each element against the corresponding key column in order.
# Key by multiple columns and look up specific combinations
setkey(DT, cyl, gear)
DT[.(4, 3)] # Get rows where cyl==4 and gear==3
Fast file operations
data.table includes fread() and fwrite() for file I/O that outperforms both base R and readr on large files. These functions use multi-threaded C code and intelligent type detection to read and write CSVs at speeds often 10-100x faster than their base R equivalents.
# Read CSV much faster than read.csv — auto-detects types and separators
iris_dt <- fread("large_file.csv")
# Write CSV faster than write.csv — supports on-the-fly compression
fwrite(iris_dt, "output.csv")
fread() handles gzip and bzip2 compressed files transparently, detects column types by sampling rather than reading the entire file first, and can read directly from URLs. The colClasses argument lets you override the auto-detected types when you know the schema in advance.
# Read with explicit column types and skip header rows
iris_dt <- fread("file.csv",
colClasses = c(id = "character", value = "numeric"),
skip = 10
)
Common patterns
Update by reference and conditional operations
The := operator modifies columns in place without allocating a new table. This is the mechanism that makes data.table so memory-efficient for large datasets: adding a column to a 10-million-row table takes microseconds rather than the seconds a copy would require. Use it alone for bulk updates across all rows, or combine with i to update only the rows that match a condition.
# Update all rows — no filtering
DT[, price := price * 1.1]
# Conditional updates — filter in i, assign in j
DT[category == "A", price := price * 0.9]
DT[quantity < 10, status := "low_stock"]
Conditional updates with DT[i, col := value] only touch rows that match the i condition; unmatched rows keep their existing values. This is equivalent to an SQL UPDATE ... WHERE statement and is dramatically faster than subsetting, modifying, and recombining.
Adding and removing data
# Add new rows — rbind works with data.tables
rbind(iris_dt, new_row_dt)
# rbindlist is faster for many small tables
iris_dt <- rbindlist(list(iris_dt, row1, row2, row3))
# Remove columns by assigning NULL
DT[, column_to_remove := NULL]
Assigning NULL to a column deletes it in place without copying the table. This is the data.table equivalent of select(-column_to_remove) in dplyr, but it modifies the original object rather than returning a new one.
Rolling joins
Rolling joins align data from two tables where an exact key match is not required. The most common use case is carrying forward the last known value in time series data: for each timestamp in one table, find the nearest preceding timestamp in another.
# Rolling forward: for each date in DT2, find the nearest preceding date in DT1
DT1 <- data.table(
date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-05")),
value = c(10, NA, 15)
)
DT2 <- data.table(
date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"))
)
setkey(DT1, date)
setkey(DT2, date)
DT2[DT1, value := i.value, roll = TRUE]
When to use data.table
data.table is ideal when:
- Your dataset has millions of rows
- You need maximum performance for iteration
- You’re building production pipelines
- Memory is constrained (in-place modification helps)
Consider alternatives when:
- Readability is more important than speed (dplyr may be clearer)
- Working in a tidyverse-only environment
- Integrating with tidyverse packages that expect tibbles
Installing data.table
install.packages("data.table")
library(data.table)
Reference semantics
Unlike data frames, data.table modifies in place by reference. := is the in-place assignment operator: DT[, new_col := x * 2] adds a column without copying the entire table. copy(DT) creates a deep copy when you need two independent versions. The reference semantics are the key performance advantage, large table operations avoid the O(n) copy that base R and dplyr trigger.
Keys and indexing
setkey(DT, id) sorts the table by id and marks it as the key column. Keyed joins are faster than non-keyed: DT1[DT2] joins on the key. setindex(DT, col) creates a secondary index without sorting, useful for columns frequently used in i filters. DT[.("value")] does a fast binary search when a key is set, equivalent to DT[id == "value"] but using the sorted structure.
Chaining and piping
data.table operations can be chained with [][]: DT[i][j][, .(result)] applies filters and transformations in sequence. Each [] is a separate data.table operation. With the native R pipe, DT[age > 30] |> (\(dt) dt[, mean(salary), by = dept])() is equivalent. For complex pipelines, chaining with [][] is conventional in data.table code because it keeps the DT syntax without breaking the column expression.
The data.table syntax
data.table extends data.frame with a concise [i, j, by] syntax that parallels SQL: i is the WHERE clause (row filter), j is the SELECT clause (column operations), and by is the GROUP BY clause. All three are optional. DT[i, j, by] reads: “take DT, subset rows by i, then compute j grouped by by.”
DT[x > 0] filters rows. DT[, .(mean_val = mean(value))] computes a summary. DT[, .(mean_val = mean(value)), by = category] groups. The .() is shorthand for list(). Multiple operations in one call: DT[x > 0, .(n = .N, avg = mean(y)), by = group].
:= modifies columns by reference (no copy): DT[, new_col := x * 2] adds a column in-place. DT[, c("a", "b") := list(x + 1, y - 1)] adds multiple columns. DT[, col := NULL] deletes a column. These in-place modifications avoid R’s copy-on-modify semantics, saving memory for large data.
Keys and indexes
setkey(DT, id) sorts the table by id and sets it as the primary key. Keyed tables support fast binary search. DT[.(10)] retrieves rows where id == 10 using binary search rather than a full scan. For large tables, this is orders of magnitude faster.
setindex(DT, col) creates an index without sorting the table. An index (secondary key) enables fast subset without physically reordering data. Multiple indexes can coexist on different columns.
Rolling joins use DT[DT2, roll = TRUE], for each row in DT2, finds the nearest preceding row in DT by the key column. This is essential for aligning time series data at different intervals.
Grouping and aggregation
.N is the number of rows in the current group. .SD is the subset of data for the current group (a data.table). .SDcols specifies which columns .SD contains.
DT[, lapply(.SD, mean), by = group, .SDcols = is.numeric] computes the mean of all numeric columns per group. This is the data.table equivalent of group_by() %>% summarise(across(where(is.numeric), mean)).
DT[, head(.SD, 2), by = group] returns the first two rows of each group, the data.table equivalent of slice_head(n = 2) in dplyr. .SD operations are flexible but slower than built-in aggregation functions.
Reshaping data
dcast(DT, row_formula ~ col_formula, value.var = "value") reshapes from long to wide (pivot_wider equivalent). melt(DT, id.vars = "id", measure.vars = c("x", "y")) reshapes from wide to long (pivot_longer equivalent). Both support multiple value variables simultaneously.
dcast() with fun.aggregate = sum aggregates when there are multiple values per cell. fill = 0 replaces missing cells with zero.
Joining data.tables
DT1[DT2, on = "id"] performs a right join, similar to SQL RIGHT JOIN on the key. DT1[DT2, on = "id", nomatch = NULL] performs an inner join (drops unmatched). DT1[DT2, on = "id", nomatch = NA] is the right join (keeps all DT2 rows, NA for missing DT1 values).
merge(DT1, DT2, by = "id", all.x = TRUE) performs a left join with familiar syntax. data.table’s merge() is faster than the base R version for large tables.
Performance best practices
Reading CSV: data.table::fread("file.csv") is the fastest CSV reader in R, outperforming readr::read_csv() for large files. fread auto-detects column types, handles compressed files, and supports parallel reading.
Writing: fwrite(DT, "file.csv") writes CSV faster than write.csv(). It supports gzip compression on the fly with compress = "gzip".
For complex operations, chain [ calls: DT[x > 0][, total := sum(y), by = group][order(-total)] filters, then aggregates, then sorts in one clean expression. Each [ returns the modified data.table, enabling method chaining without intermediate objects.
The i, j, by syntax
The data.table bracket syntax is DT[i, j, by], where i selects rows, j selects or computes columns, and by groups the operation. Understanding this three-part structure is the key to reading data.table code. Unlike base R’s matrix-style indexing, j can contain arbitrary expressions, function calls, multi-column computations, assignments with :=, not just column names or indices.
The := operator modifies columns in place without copying the data. This is the main source of data.table’s memory efficiency: where dplyr’s mutate() creates a new data frame, := modifies the existing one. The trade-off is that in-place modification changes the object everywhere it is referenced, which differs from R’s usual copy-on-modify semantics. This can cause surprising behavior when the same data.table is referenced from multiple variables.
Summary
| Syntax | Purpose |
|---|---|
DT[i] | Filter rows |
DT[, j] | Select or compute columns |
DT[i, j] | Filter then compute |
DT[i, j, by] | Filter, compute, group |
:= | Assign by reference |
.N | Count rows in group |
.() | Wrap result in data.table |
setkey() | Set sort key for fast lookups |
fread() / fwrite() | Fast file I/O |
The i-j-k pattern is the heart of data.table. Once it clicks, you’ll manipulate data faster than ever before.
See also
- Data Wrangling with dplyr — The tidyverse alternative for readable data manipulation
- DuckDB for R — In-memory SQL database for analytical queries on large datasets
- Databases with DBI — Connect R to relational databases with DBI
- dbplyr — SQL Translation — Translate dplyr verbs to SQL for database backends
- Arrow and Parquet in R — Columnar data formats for big data workflows