Fast Data Manipulation with data.table

· 4 min read · Updated March 11, 2026 · intermediate
data.table data-manipulation performance cran

data.table is one of the fastest ways to manipulate tabular data in R. It extends data frames with a concise syntax that handles filtering, selection, and aggregation in fewer keystrokes than base R or dplyr. This guide covers the fundamentals and builds toward real-world workflows.

Why data.table?

When working with large datasets, speed matters. data.table operates on data in place, avoiding the copy-on-modify behavior that slows down data frames and tibbles. The package is used by data scientists at Google, Microsoft, and The New York Times for analyses involving millions of rows.

The core insight behind data.table is its three-part syntax:

DT[i, j, by]
  • i: rows to select (like WHERE in SQL)
  • j: columns to compute (like SELECT)
  • by: grouping variable (like GROUP BY)

This single line replaces what would be multiple operations in dplyr.
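As a quick sketch of how the three parts combine (using R's built-in iris dataset), the following filters, aggregates, and groups in a single call:

```r
library(data.table)

iris_dt <- as.data.table(iris)

# i, j, and by in one call: filter rows, aggregate a column, group the result
result <- iris_dt[
  Sepal.Length > 5,                   # i: like WHERE Sepal.Length > 5
  .(mean_width = mean(Sepal.Width)),  # j: like SELECT mean(Sepal.Width)
  by = Species                        # by: like GROUP BY Species
]
print(result)
```

The equivalent dplyr pipeline would chain filter(), group_by(), and summarise().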

Getting Started

Install and load the package:

install.packages("data.table")
library(data.table)

Convert your data frame to a data.table:

iris_dt <- as.data.table(iris)
# Or create directly
iris_dt <- data.table(
  id = 1:100,
  value = rnorm(100),
  group = sample(letters[1:4], 100, replace = TRUE)
)

Subsetting Rows with i

The first argument filters rows. Pass a logical condition inside the brackets.

# Filter rows where species is setosa
iris_dt[Species == "setosa"]
# Multiple conditions — use & for AND
iris_dt[Sepal.Length > 5 & Species == "versicolor"]
# Use | for OR
iris_dt[Species == "setosa" | Species == "virginica"]
# Filter by row number (first 10 rows)
iris_dt[1:10]
# Filter by vector membership
iris_dt[Species %in% c("setosa", "virginica")]

You can also use the %between% operator for range checks:

iris_dt[Sepal.Length %between% c(5, 6)]

Selecting and Computing Columns with j

The second argument controls what happens to columns. Unlike dplyr’s select and mutate, data.table combines both in j.

# Select specific columns
iris_dt[, .(Sepal.Length, Species)]

# Or using the character vector shortcut
iris_dt[, c("Sepal.Length", "Species")]
# Compute new columns
iris_dt[, Sepal.Area := Sepal.Length * Sepal.Width]

The := operator assigns by reference, modifying the table in-place without copying.

# Multiple new columns at once
iris_dt[, `:=`(
  Sepal.Area = Sepal.Length * Sepal.Width,
  Petal.Area = Petal.Length * Petal.Width
)]
# Compute and return without modifying
iris_dt[, sqrt(Sepal.Length)]
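A quick way to see the by-reference behavior is with a small toy table (a sketch; the copy() call is how data.table makes a true duplicate when you need one):

```r
library(data.table)

dt <- data.table(x = 1:3)
dt2 <- dt          # NOT a copy: both names point to the same table
dt[, y := x * 2]   # := modifies the table in place
print(names(dt2))  # dt2 now has the y column too

dt3 <- copy(dt)    # copy() makes a true, independent duplicate
dt3[, z := 0]
print("z" %in% names(dt))  # dt is unaffected by changes to dt3
```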

To get a data.table back (rather than a plain vector), wrap the expression in .():

iris_dt[, .(Mean.Length = mean(Sepal.Length))]

Grouping with by

The third argument groups the computation. Combine j and by for aggregations.

# Count rows per species
iris_dt[, .N, by = Species]
# Multiple aggregations
iris_dt[, .(
  Mean.Length = mean(Sepal.Length),
  Max.Width = max(Sepal.Width),
  Count = .N
), by = Species]
# Group by multiple columns
mtcars_dt <- data.table(mtcars)
mtcars_dt[, .(mpg_mean = mean(mpg)), by = .(cyl, gear)]

.N is a special symbol that holds the number of rows in the current group.
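.N also works outside of by; for example, it can count all rows or index the last row (a small sketch):

```r
library(data.table)

iris_dt <- as.data.table(iris)

total <- iris_dt[, .N]             # total row count: 150
last_row <- iris_dt[.N]            # .N in i selects the last row
tail_rows <- iris_dt[(.N - 2):.N]  # last three rows
print(total)
```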

Chaining Operations

Chain expressions by adding more brackets:

# Filter, then group, then sort
iris_dt[
  Sepal.Length > 5,
  .(Mean.Length = mean(Sepal.Length)),
  by = Species
][order(-Mean.Length)]

This reads as: “From iris_dt, where Sepal.Length > 5, compute mean by Species, then sort by Mean.Length descending.”

Setting Keys for Speed

Keys make row subsetting and grouping faster by sorting data internally.

setkey(iris_dt, Species)

# Now filtering by Species is much faster
iris_dt["setosa"]
# Key by multiple columns
mtcars_dt <- data.table(mtcars)
setkey(mtcars_dt, cyl, gear)
mtcars_dt[.(4, 3)]  # Rows where cyl == 4 and gear == 3

The .() syntax is a shortcut for list() when used in i for key lookup.
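The equivalence with list() can be checked directly; this sketch keys a copy of the built-in mtcars data:

```r
library(data.table)

mtcars_dt <- data.table(mtcars)
setkey(mtcars_dt, cyl, gear)

a <- mtcars_dt[.(4, 5)]     # .() shorthand in i
b <- mtcars_dt[list(4, 5)]  # identical spelled-out form
print(identical(a, b))      # the two lookups return the same rows
```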

Fast File Operations

data.table includes fread() and fwrite() for blazing-fast file I/O.

# Read CSV much faster than read.csv
iris_dt <- fread("large_file.csv")

# Write CSV faster than write.csv
fwrite(iris_dt, "output.csv")
# Read with specific column types
iris_dt <- fread("file.csv", 
  colClasses = c(id = "character", value = "numeric")
)

# Skip rows while reading
iris_dt <- fread("file.csv", skip = 10)

These functions are often 10-100x faster than base R equivalents.
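fread() also accepts a literal CSV string in place of a file path, which is handy for quick experiments (a minimal sketch):

```r
library(data.table)

# fread() parses a literal string just like a file on disk
dt <- fread("id,value\n1,3.5\n2,7.0\n3,1.2")
print(dt)
str(dt)  # column types are detected automatically
```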

Common Patterns

Update by Reference with :=

Modify existing columns:

DT[, price := price * 1.1]  # Increase all prices by 10%

Conditional Updates

DT[category == "A", price := price * 0.9]
DT[quantity < 10, status := "low_stock"]

Add Rows

# Add new rows
rbind(iris_dt, new_row_dt)

# Or use rbindlist for many rows
iris_dt <- rbindlist(list(iris_dt, row1, row2, row3))

Remove Columns

DT[, column_to_remove := NULL]

Rolling Joins

data.table supports fast rolling joins:

# Example: roll the last known value forward onto a dense date grid
DT1 <- data.table(
  date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-05")),
  value = c(10, 12, 15)
)
DT2 <- data.table(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                   "2024-01-04", "2024-01-05"))
)

setkey(DT1, date)
setkey(DT2, date)
DT1[DT2, roll = TRUE]  # each DT2 date gets the most recent DT1 value

When to Use data.table

data.table is ideal when:

  • Your dataset has millions of rows
  • You need maximum performance for iteration
  • You’re building production pipelines
  • Memory is constrained (in-place modification helps)

Consider alternatives when:

  • Readability is more important than speed (dplyr may be clearer)
  • Working in a tidyverse-only environment
  • Integrating with tidyverse packages that expect tibbles

Summary

Syntax                Purpose
DT[i]                 Filter rows
DT[, j]               Select or compute columns
DT[i, j]              Filter then compute
DT[i, j, by]          Filter, compute, group
:=                    Assign by reference
.N                    Count rows in group
.()                   Wrap result in a data.table
setkey()              Set sort key for fast lookups
fread() / fwrite()    Fast file I/O

The i-j-by pattern is the heart of data.table. Once it clicks, you’ll manipulate data faster than ever before.

See Also