Apply Family in R: lapply, sapply, tapply and More
The apply family is a set of base R functions that let you iterate over vectors, lists, or matrices without writing explicit loops. These functions are fundamental to writing idiomatic R code that is both concise and expressive. Unlike for loops, which require manual index management and result accumulation, the apply family encapsulates the iteration pattern so you focus on what to compute rather than how to loop. This guide covers apply(), lapply(), sapply(), and tapply(), showing when to use each one, plus related functions like replicate() and mapply() that extend the family into simulation and multi-argument mapping.
apply() for matrices
apply() works on matrices or arrays. It takes three main arguments: the data, the margin (1 for rows, 2 for columns), and the function to apply.
# Create a sample matrix
mat <- matrix(1:9, nrow = 3, ncol = 3)
mat
# [,1] [,2] [,3]
# [1,] 1 4 7
# [2,] 2 5 8
# [3,] 3 6 9
# Sum each row (margin = 1)
apply(mat, 1, sum)
# [1] 12 15 18
# Sum each column (margin = 2)
apply(mat, 2, sum)
# [1] 6 15 24
The margin argument determines the direction of iteration: 1 operates across rows, 2 across columns. You can pass extra arguments to the function being applied by adding them after the function name. Common examples include na.rm = TRUE for ignoring missing values when computing means or sums, or trim for trimmed means.
# Calculate mean of each column, removing NAs
mat_with_na <- mat
mat_with_na[1, 1] <- NA
apply(mat_with_na, 2, mean, na.rm = TRUE)
# [1] 2.5 5.0 8.0
When the built-in functions don’t meet your needs, you can pass an anonymous function defined inline. This pattern is common for derived statistics that need more than one step, such as the spread between the largest and smallest value in each row or column.
# Custom function: range (max - min)
apply(mat, 1, function(x) max(x) - min(x))
# [1] 6 6 6
The apply function excels with structured matrix data, but R programs more commonly work with lists and data frames. For list-based iteration, lapply() is the foundational tool. It always returns a list, preserving element names and handling heterogeneous output types without coercion.
lapply() for lists and vectors
lapply() always returns a list. It takes a vector or list as input and applies a function to each element.
# Create a list of vectors
data_list <- list(
a = c(1, 2, 3),
b = c(4, 5, 6),
c = c(7, 8, 9)
)
# Calculate mean of each element
lapply(data_list, mean)
# $a
# [1] 2
# $b
# [1] 5
# $c
# [1] 8
lapply() is the foundation of many data manipulation workflows in R. A common pattern loads multiple CSV files into a list of data frames using lapply(files, read.csv), then applies transformations like dim() or summary() to inspect each one. This approach scales to any number of files without copying code.
# Collect several data frames in a named list
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- data.frame(x = 4:6, y = c("d", "e", "f"))
all_df <- list(df1 = df1, df2 = df2)
# Check dimensions of each data frame
lapply(all_df, dim)
# $df1
# [1] 3 2
# $df2
# [1] 3 2
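The multi-file pattern mentioned above can be sketched end to end with temporary files. This is a minimal illustration, not a recommended project layout: the directory name apply_demo, the file names, and the variables csv_dir, files, frames, and combined are all made up for the example.

```r
# Write two small CSV files to a temporary directory (illustrative data only)
csv_dir <- file.path(tempdir(), "apply_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
          file.path(csv_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6, y = c("d", "e", "f")),
          file.path(csv_dir, "part2.csv"), row.names = FALSE)

# Collect the paths, then read every file with a single lapply() call
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(files, read.csv)

# Name the list elements after the files so each result stays identifiable
names(frames) <- basename(files)

# Inspect each data frame
lapply(frames, dim)

# Stack them into one data frame when the columns match
combined <- do.call(rbind, frames)
```

Because lapply() always returns a list, the same two lines work whether the directory holds two files or two hundred.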
For cases where a list of individual results is cumbersome, sapply() simplifies the output automatically. It wraps lapply() and attempts to collapse the result into a vector or matrix when all elements have compatible lengths and types. This convenience makes sapply() popular in interactive use, though the automatic type conversion can surprise you if inputs produce unequal-length results.
sapply() for simplified output
sapply() is a user-friendly wrapper around lapply(). It tries to simplify the output into a vector or matrix when possible.
# With sapply, lists get simplified to vectors
result <- sapply(data_list, mean)
result
# a b c
# 2 5 8
# Check the type
typeof(result)
# [1] "double"
When simplification fails, for instance when each element of the input produces an output of a different length, sapply() silently returns a list instead of a vector. This unpredictable return type is the main reason experienced R programmers prefer lapply() or vapply() in production code, where consistent output shapes matter.
# This returns a list because elements have different lengths
mixed_list <- list(
vec1 = c(1, 2),
vec2 = c(3, 4, 5)
)
sapply(mixed_list, function(x) x * 2)
# $vec1
# [1] 2 4
# $vec2
# [1] 6 8 10
Use sapply() when you want concise code and are comfortable with automatic type conversion. For production code where predictable output matters, lapply() gives you more control.
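As a sketch of the vapply() alternative mentioned above: vapply() takes a FUN.VALUE template describing what each result must look like, and it raises an error when a result does not match. The variable names means and outcome are illustrative.

```r
data_list <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# FUN.VALUE = numeric(1) declares that each result is exactly one double
means <- vapply(data_list, mean, FUN.VALUE = numeric(1))
means
# a b c
# 2 5 8

# If a result does not match the template, vapply() stops with an error
# instead of quietly changing the return type
mixed_list <- list(vec1 = c(1, 2), vec2 = c(3, 4, 5))
outcome <- tryCatch(
  vapply(mixed_list, function(x) x * 2, FUN.VALUE = numeric(2)),
  error = function(e) conditionMessage(e)
)
outcome  # the error message, because vec2 yields three values, not two
```

The trade-off is a little more typing in exchange for a return type you can rely on.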
tapply() for grouped operations
tapply() applies a function to subsets of a vector, defined by a grouping factor.
# Sample data: values and groups
scores <- c(85, 92, 78, 88, 95, 72)
group <- c("A", "A", "B", "B", "A", "B")
# Calculate mean score by group
tapply(scores, group, mean)
# A B
# 90.66667 79.33333
This is equivalent to the group_by() plus summarise() pattern in dplyr. The base R function aggregate() uses a formula interface that many R users find familiar, though its performance on large datasets lags behind both dplyr and data.table. For quick grouped summaries without loading packages, the tapply/aggregate combination works well.
# Same result using aggregate() from base R
df <- data.frame(scores = scores, group = group)
aggregate(scores ~ group, data = df, FUN = mean)
tapply() handles multiple grouping factors by accepting a list of factor vectors. The result is a matrix with rows and columns corresponding to the factor levels, which is ideal for two-way summary tables common in statistical reporting. For three or more grouping factors, consider switching to aggregate() or dplyr::group_by() with summarise(), as the output dimensionality becomes harder to work with.
# Two grouping factors
treatment <- c("drug", "drug", "placebo", "drug", "placebo", "placebo")
tapply(scores, list(group, treatment), mean)
# drug placebo
# A 88.5 95
# B 88.0 75
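For comparison, the same two-factor summary can be produced in long format with aggregate(), which is the shape that tends to scale better as grouping factors are added. This is a small sketch that reuses the data from this section; the name by_both is illustrative.

```r
scores <- c(85, 92, 78, 88, 95, 72)
group <- c("A", "A", "B", "B", "A", "B")
treatment <- c("drug", "drug", "placebo", "drug", "placebo", "placebo")
df <- data.frame(scores, group, treatment)

# Long format: one row per observed combination of group and treatment
by_both <- aggregate(scores ~ group + treatment, data = df, FUN = mean)
by_both

# Look up a single cell by filtering rows rather than indexing a matrix
by_both$scores[by_both$group == "A" & by_both$treatment == "drug"]
# [1] 88.5
```

Adding a third factor only adds a column here, whereas tapply() would return a three-dimensional array.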
When to use which
Choose the right function based on your data structure and output needs. Each function targets a specific data shape, and picking the wrong one leads to unnecessary type conversions or awkward code:
- apply(), matrices or arrays, row/column operations. Use when data has at least two dimensions and you need to compute per-row or per-column summaries.
- lapply(), vectors or lists, when you need list output. The safe default for list iteration; always returns a list with one element per input element.
- sapply(), quick exploratory work with automatic simplification. Prefer vapply() or lapply() in scripts where predictable output matters.
- tapply(), grouped statistics by factor levels. The base R equivalent of a SQL GROUP BY; returns a named array keyed by factor combinations.
For operations on multiple parallel inputs, mapply() and Map() extend the pattern to functions of two or more arguments.
Common patterns
The apply family shines in data analysis workflows:
# Apply a function to multiple columns
numeric_cols <- mtcars[, c("mpg", "disp", "hp")]
lapply(numeric_cols, function(x) round(mean(x), 2))
# $mpg
# [1] 20.09
# $disp
# [1] 230.72
# $hp
# [1] 146.69
replicate() for simulation
replicate(n, expr) is a wrapper around sapply() for repeated evaluation of an expression. For example, replicate(1000, mean(rnorm(100))) generates 1000 sample means, each computed from 100 standard normal draws. It returns a vector when expr returns a scalar and a matrix when expr returns a vector of fixed length. This was the base R idiom for Monte Carlo simulation long before purrr existed, and it is still useful for simple simulations without tidyverse dependencies.
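A short sketch of the return shapes described above. The seed value and the variable names are arbitrary choices for the example.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

# Scalar expression -> numeric vector with one entry per replication
sample_means <- replicate(1000, mean(rnorm(100)))
length(sample_means)
# [1] 1000

# Vector expression -> matrix with one column per replication
ranges <- replicate(5, range(rnorm(100)))
dim(ranges)
# [1] 2 5

# simplify = FALSE keeps the results as a list instead
as_list <- replicate(5, range(rnorm(100)), simplify = FALSE)
```

Note that expr is re-evaluated on every replication, which is why each call draws fresh random numbers.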
mapply() and Map()
mapply(FUN, ...) applies FUN with multiple arguments in parallel. Where lapply(x, fun) calls fun(x[[1]]), fun(x[[2]]), …, mapply(fun, x, y) calls fun(x[[1]], y[[1]]), fun(x[[2]], y[[2]]), … This is the vectorized equivalent of a two-argument loop.
Map(FUN, ...) is mapply() with SIMPLIFY = FALSE, always returning a list. Prefer Map() when you want a list result and mapply() when you want simplification.
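A minimal sketch of the difference, using two made-up vectors x and y:

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# mapply() simplifies the result to a vector when it can
sums <- mapply(function(a, b) a + b, x, y)
sums
# [1] 11 22 33

# Map() runs the same pairwise calls but always returns a list
sums_list <- Map(function(a, b) a + b, x, y)

# Arguments are paired by position, so rep() is called as
# rep(1, 3), rep(2, 2), rep(3, 1); the unequal lengths force a list result
reps <- mapply(rep, 1:3, 3:1)
```

The last call shows that mapply() shares the sapply() caveat: when results differ in length, it falls back to a list.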
Relationship to purrr
purrr::map() is the tidyverse equivalent of lapply(). purrr::map_dbl(), map_chr(), map_int(), and map_lgl() are type-safe equivalents of vapply(). purrr::map2() is the equivalent of Map() for two parallel inputs, and purrr::pmap() handles any number of parallel inputs.
purrr has cleaner syntax for complex anonymous functions, better error messages, and integrates with the pipe. For new tidyverse-oriented code, prefer purrr. For base-R code or packages that want to minimize dependencies, use apply, lapply, vapply, and Map.
Performance notes
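The apply functions are not automatically faster than explicit loops. lapply() and vapply() run their iteration in C, but they still call your R function once per element, and apply() itself is implemented with an R-level for loop. A for loop that preallocates its result usually performs comparably, so for most everyday tasks the difference is negligible. Readability should guide your choice.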
For large datasets, consider data.table or dplyr for better performance. These packages rely on optimized compiled code (C and C++) under the hood.
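A rough way to check how much the choice matters on your own machine is to time both styles on the same task. This is only a sketch: the input size, the repeat count, and the helper names loop_version and vapply_version are arbitrary, and system.time() is a coarse measuring tool.

```r
x <- as.list(1:10000)

# Explicit loop with a preallocated result vector
loop_version <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[[i]]^2
  out
}

# The same computation with vapply()
vapply_version <- function(x) vapply(x, function(v) v^2, numeric(1))

# Timings vary by machine and R version, so no expected output is shown
system.time(for (k in 1:20) loop_version(x))
system.time(for (k in 1:20) vapply_version(x))

# Both approaches give the same answer
all.equal(loop_version(x), vapply_version(x))
```

For anything beyond a quick check, a dedicated benchmarking package gives more reliable numbers than a single system.time() call.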
Conclusion
Master the apply family and you will write more expressive R code. Each function serves a specific purpose. Start with lapply() for safety, sapply() for exploration, and apply() for matrices. Use tapply() whenever you need grouped summaries.
See also
- purrr Functional Programming, The tidyverse approach to iteration with type-stable outputs
- dplyr Basics, Data manipulation with the pipe and grouped operations
- Data Tables in R, High-performance alternative for large datasets