rguides

data.table R Guide: fread, Joins, and In-Place Updates

This guide picks up where the basics guide leaves off. If you already know the i, j, and by slots and have used := once or twice, the material here is the next layer: the on= join syntax, fread/fwrite for fast I/O, secondary indices, non-equi and rolling joins, the .SD/.SDcols machinery, and how to write code that constructs data.table calls programmatically.

Why data.table stays useful in 2026

The pitch has not changed since Matt Dowle started the package: do more in one expression, mutate in place, and let the C internals handle the boring parts. The 1.18 line (current on CRAN, vignette refreshed 2026-05-04) keeps the same syntax but tightens performance, adds more fcase/fifelse coverage, and ships faster rolling-window helpers (frollmean, frollsum, frollapply). If you are reaching for a 10 GB CSV or doing 100 million row joins, base R and dplyr will both still finish the job. data.table just finishes it with less code and less memory.

Installing and creating a data.table

Install from CRAN and load it. There is no separate package to enable.

install.packages("data.table")
library(data.table)

A data.table is a data.frame with a few extras tacked on. The constructor behaves the same way you would build a data.frame, with the same column types and the same recycling rules, but the result inherits from both classes at once:

dt <- data.table(
  id    = c("b", "b", "b", "a", "a", "c"),
  x     = 1:6,
  y     = 7:12,
  z     = 13:18
)
class(dt)
# [1] "data.table" "data.frame"

If you already have a data.frame lying around, you can promote it in place rather than copying. setDT() walks the input list and rewrites the class vector on the spot, so the original object becomes a data.table without a fresh allocation:

df <- data.frame(a = 1:3, b = letters[1:3])
setDT(df)         # df is now a data.table, no assignment needed

setDT() modifies the input in place, which is the default for almost every set* function in the package. Hold that thought: it is the source of most “my function broke my data” bug reports.

The x[i, j, by] general form

Every data.table query fits this shape:

DT[i, j, by]

Read it as: take DT, subset rows using i, then compute j, grouped by by. The ?data.table reference page lists every argument, but the three that matter day to day are i, j, and by (with keyby as a sorted variant, and on = to pick the join key).

A minimal example:

dt[id == "a" & x > 2, .(sum_x = sum(x), mean_y = mean(y))]
#    id sum_x mean_y
# 1:  a     9     10

.() is an alias for list(). Wrapping the output in .() lets you return multiple columns with names. Skipping it returns a single vector.

Column names inside i and j are evaluated in the table’s scope, which means you skip the dt$x prefix entirely. That is the source of the next gotcha, when you want to refer to a column by a variable name. Hold that thought for the Programming with data.table section below.

Reading and writing files with fread and fwrite

fread() and fwrite() are the package’s fast I/O pair. They beat read.csv and write.csv by 5-50x on real files and they guess types for you.

flights <- fread("flights14.csv")
class(flights)
# [1] "data.table" "data.frame"
dim(flights)
# [1] 253316     11

The full fread signature is wider than most people need, but a handful of arguments are worth memorising. Reach for these first when you are tuning read performance or probing unfamiliar files in a hurry:

fread(
  input, sep = "auto", nrows = Inf, header = "auto",
  na.strings = "NA", stringsAsFactors = FALSE,
  select = NULL, drop = NULL, colClasses = NULL,
  data.table = TRUE, nThread = getDTthreads()
)

select and drop read only the columns you ask for. On a 200-column file, that is the difference between a 2 GB allocation and a 200 MB one. nrows is a hard cap, useful when you are probing a file you have never seen. sep = "auto" sniffs the separator from ,\t|;:. fread() also accepts URLs (http://, https://) and shell commands through cmd = "grep error logfile".

fwrite() mirrors it. The interesting flag is quote = "auto", which only quotes fields that contain a separator, quote, or newline. Base R’s write.csv() quotes everything instead.

fwrite(flights, "flights_clean.csv")
fwrite(flights, "flights.tsv", sep = "\t")
fwrite(flights, "flights.tsv.gz", compress = "gzip")

dateTimeAs controls how POSIXct columns are serialised. "ISO" is the safest default for round-tripping through another tool; "squash" is the human-friendly "yyyy-mm-dd HH:MM:SS".

Joins: equi, rolling, and non-equi

The on = argument is the modern way to join. It replaces the older setkey()-then-subset pattern, and it is the only way to write non-equi and rolling joins.

Equi join

airports <- data.table(
  iata = c("JFK", "LAX", "ORD"),
  city = c("New York", "Los Angeles", "Chicago")
)

airports[flights, on = .(iata = origin)]
# adds city to flights, keeps all flights

Anti-join (not-join)

Drop rows on the left that have a match on the right. In data.table, you get this for free by putting ! in front of the right-hand table, and the syntax reads naturally as “flights where there is no matching airport”:

flights[!airports, on = .(origin = iata)]
# rows in flights with no matching airport

Rolling join

A rolling join matches each row on the left to the most recent row on the right. This is the killer feature for time-series “as of” joins, like last-known price, last-seen status, or last-recorded measurement.

trades <- data.table(
  symbol = c("A", "A", "A"),
  t      = c(1, 2, 3),
  qty    = c(10, 20, 15)
)
prices <- data.table(
  symbol = c("A", "A", "A"),
  t      = c(1, 3),
  px     = c(100, 105)
)

prices[trades, on = .(symbol, t), roll = TRUE]
#    symbol     t   px  qty
# 1:      A     1  100   10
# 2:      A     2  100   20   # rolled back to the last available price
# 3:      A     3  105   15

roll = -Inf rolls backwards strictly; roll = TRUE rolls in the direction of the index. rollends controls whether the first and last interval are treated as closed.

Non-equi and range joins

A non-equi join matches each row on the left to every row on the right that satisfies an inequality. The syntax is on = .(x >= a, x <= b).

# prices where the trade was open
trades[prices, on = .(start_date <= date, end_date >= date), nomatch = 0]

The nomatch = 0 drops rows with no match. If your inequality matches many rows on both sides, data.table refuses by default to avoid building a Cartesian product you did not intend:

options(datatable.allow.cartesian = TRUE)

Only set this when you actually want the full cross product. If you hit this error, the fix is almost always to add another column to on =.

Keys vs secondary indices

Two ways to make lookups fast. They look similar and they are not interchangeable.

Toolsetkey(DT, col)setindex(DT, col)
EffectReorders the data in RAM. Subsequent subset on the key column is a binary search.Builds an index without reordering.
on = still required?Optional, but recommended.Required.
Build costO(n) radix sort.O(n) radix sort.
Use whenYou filter or join on the same column repeatedly and the row order does not matter to you.You need fast lookups but the table order is meaningful, or you have many candidate join columns.

A secondary index is closer to a database index than to a key. Auto-indexing kicks in the first time you join on a column via on = if no index exists yet; subsequent joins reuse it. Disable with options(datatable.auto.index = FALSE) if the first-call cost matters more than the second-call speedup.

setkey(flights, origin)
flights[.(c("JFK", "LAX"))]    # uses the key, no on= needed

setindex(flights, dest)
flights[.("ORD"), on = "dest"] # uses the secondary index

setorder() and setorderv() are the in-place sort you want when you do not need a key. x[order(x$col)] copies the whole table, while setorder(x, col) does not.

Working with .SD and .SDcols

.SD is a Subset of Data — a data.table containing the rows of the current group, restricted to the columns listed in .SDcols. If you do not pass .SDcols, .SD is empty, and most uses of it will error.

# WRONG: .SD is empty because .SDcols was not set
dt[, lapply(.SD, mean), by = id]
# Error in FUN(X[[i]]) ... argument "FUN" is missing

# RIGHT: restrict to the columns you want summarised
dt[, lapply(.SD, mean), by = id, .SDcols = c("x", "y", "z")]
#    id     x     y     z
# 1:  b   2.0   8.0  14.0
# 2:  a   5.0  11.0  17.0
# 3:  c   6.0  12.0  18.0

.SDcols accepts more shapes than the character vector you saw above. The full list of accepted forms includes character vectors, integer indices, column ranges, predicate functions, and regex patterns via the patterns() helper:

# Integer indices
dt[, lapply(.SD, mean), by = id, .SDcols = 2:4]

# Column range (data.table 1.10+)
dt[, lapply(.SD, mean), by = id, .SDcols = x:z]

# A predicate function
dt[, lapply(.SD, mean), by = id, .SDcols = is.numeric]

# Regex via patterns()
dt[, lapply(.SD, mean), by = id, .SDcols = patterns("^price")]

The patterns() helper is the one to remember. It accepts a vector of regexes, and melt() can use the same trick to grab several measure sets at once.

Reshaping with melt and dcast

melt() and dcast() are the long-format and wide-format conversions. They are the same functions as in the reshape2 package, with a few extras.

wide <- data.table(id = c("a", "b"),
                   q1 = c(1, 2), q2 = c(3, 4), q3 = c(5, 6))

# Wide -> long
long <- melt(wide, id.vars = "id",
             measure.vars = c("q1", "q2", "q3"),
             variable.name = "quarter", value.name = "sales")
long
#    id quarter sales
# 1:  a       q1     1
# 2:  b       q1     2
# 3:  a       q2     3
# 4:  b       q2     4
# 5:  a       q3     5
# 6:  b       q3     6

# Long -> wide
dcast(long, id ~ quarter, value.var = "sales")
#    id q1 q2 q3
# 1:  a  1  3  5
# 2:  b  2  4  6

The newer pattern-based melt() (1.9.6+) grabs several measure sets by regex. Useful when your wide columns are named like q1_value, q1_status, q2_value, q2_status:

melt(wide, id.vars = "id",
     measure.vars = patterns("^q", "_value$"),
     value.name = c("quarter", "value"))

shift, frollmean, and friends

Lag, lead, and rolling means are first-class operations on a data.table. You do not need an external package or a windowed recipe; the primitives live in the namespace and respect by = for grouped calculations, so the rolling mean of one column is a one-liner:

dt[, .(x, x_lag1 = shift(x, 1), x_lead1 = shift(x, 1, type = "lead"))]
dt[, x_diff := x - shift(x, 1), by = id]   # diff within group
dt[, x_rollmean3 := frollmean(x, 3), by = id]  # rolling mean, on-line

frollmean, frollsum, and frollapply are the fast rolling-window family added in 1.10+. If you are doing any time-series work, this is usually the reason people switch from data.frame to data.table.

Programming with data.table

This is the part dplyr users trip on. Inside j, column names are looked up in the table, not in the calling environment. So dt[, my_col] does not return a column called my_col; it returns the literal value of my_col if it exists in the calling environment, or an error otherwise.

The fix is .., which walks one level up the scope:

my_col <- "x"

dt[, ..my_col]                          # pull the column
dt[, .(..my_col, mean_x = mean(x))]     # mix with computed columns

For programmatic i, the same scoping rule applies. Reach for get() to walk back up to the calling environment and pull the variable that holds the column name as a string:

dt[get(my_col) > 5]

For programmatic .SDcols, you can use mget() to pull a column-name vector from somewhere else in the call stack, or simply pass the character vector directly when it is already in scope:

cols <- c("x", "y")
dt[, lapply(.SD, mean), .SDcols = mget(cols, as.environment(dt))]
# or, more commonly:
dt[, lapply(.SD, mean), .SDcols = cols]

For fully programmatic queries (say, a Shiny app that takes a column name and a threshold from the user), read the vignette("datatable-programming") page. The short version is that env = list(my_col = "x") lets you pass bindings in explicitly.

In-place updates with := and set*

The := operator is what makes data.table fast on memory. It modifies the table by reference; no copy.

flights[, dep_delay_z := (dep_delay - mean(dep_delay, na.rm = TRUE)) /
                              sd(dep_delay, na.rm = TRUE), by = origin]

flights[, c("gain", "speed") := .(arr_delay - dep_delay,
                                   distance / (air_time/60))]

flights[, c("dep_delay_z", "gain") := NULL]   # delete by reference

A common mistake: := returns its right-hand side invisibly, not the table. If you type dt[, new := x*2] at the REPL, you will see NULL in the console and wonder where the column went. The column was added. To print the result, add an empty []:

dt[, new := x*2][]

The set* family follows the same rules. setnames(dt, old, new), setorder(dt, col), set(dt, i, j, value), setattr(): all modify in place. Use copy(dt) when you want a deep copy, especially inside a function that takes a data.table argument. Otherwise you will surprise the caller by mutating their data.

Performance notes

A few things are worth knowing even if you never write C code:

  • := and the set* family modify in place, so you avoid the per-step copy that base R and dplyr make.
  • setkey and setorder use radix sort, O(n) on integer and numeric columns.
  • Many aggregations are sped up by GForce, a C-level specialisation that fires when j is a single function applied to a single column, e.g. dt[, sum(x), by = g] or dt[, .N, by = g]. This is the reason dt[, .N, by = grp] beats dplyr::count(dt, grp) on large data.
  • data.table is multi-threaded by default. getDTthreads() reports the current setting, setDTthreads(n) changes it. The default is the number of physical cores.
  • If you want to confirm any of this, run bench::mark() or microbenchmark::microbenchmark() on your actual data.

Common mistakes

A short list, in order of how often they bite:

  1. Forgetting copy() inside a function. The caller’s data.table gets mutated.
  2. Assuming := returns the table. It returns NULL. Add [] to print.
  3. Cross-product joins fail by default. Set allow.cartesian = TRUE only when you really mean it; more often the fix is a tighter on = clause.
  4. Joining with by = instead of on =. by is the grouping key for j, not the join key. Joins are driven by the columns of the right-hand i table, and you pick which with on =.
  5. Logical NA in i is treated as FALSE. Different from base R, where NA propagates.
  6. Row numbers in print are not row names. The 1:, 2:, … at the start of a printed data.table are row numbers, not names. rownames(dt) <- NULL is a no-op.
  7. Modifying a data.table inside a dplyr pipeline. If you must mix, copy() first.

Conclusion

The pattern that holds the package together is short verbs on a single expression. x[i, j, by] is the only syntax you really need to memorise; the rest is argument lists on fread, fwrite, melt, and dcast. The two design choices that make it fast (in-place mutation and C-level specialisation for common aggregations) are also the two choices that make it easy to write bugs. Use copy() inside functions, read the table after a :=, and lean on on = for joins.

If you want a one-liner summary: data.table is still the fastest general-purpose data manipulation tool in R, and the 1.18 line has not added anything that breaks what you already know.

See also