Memory Management in R: Avoid Pitfalls and Handle Large Data
Memory management in R works differently than in languages like C or Python. R uses copy-on-modify semantics and automatic garbage collection, which simplifies coding but can cause unexpected memory usage patterns. Understanding these internals helps you write code that handles large datasets efficiently.
How R manages memory
R allocates memory dynamically and manages it through a garbage collector that runs automatically. When you create objects, R allocates space from the heap. When objects are no longer referenced, the garbage collector reclaims that memory.
Copy-on-Modify
When you assign an object to a second name, R doesn’t immediately create a copy. Both names point to the same memory, and R duplicates the data only when one of them is modified while the object is still shared. This is called copy-on-modify.
x <- 1:1000
y <- x # No copy made yet - y points to same memory as x
# Modifying y triggers the copy
y[1] <- 0
# Now y has its own copy
This behavior matters in functions. Passing a large vector to a function doesn’t duplicate it unless the function modifies the input. This is one of R’s smartest memory features: you can pass large data between functions without paying the copy cost, as long as you treat the argument immutably within the function body.
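A small experiment makes the function-argument behavior visible. This is a sketch, not a benchmark: read_only() and modifies() are hypothetical helpers introduced here for illustration, and the tracemem() calls are guarded because they assume an R build with memory profiling enabled.

```r
# Hypothetical helpers for illustration only
read_only <- function(v) sum(v)   # reads v, never modifies it
modifies <- function(v) {
  v[1] <- 0                       # first write to v forces a private copy
  v
}

x <- runif(1e6)
first <- x[1]

# tracemem() needs memory profiling support, so check before using it
tracing <- capabilities("profmem")
if (tracing) tracemem(x)

total <- read_only(x)   # no copy message is expected here
z <- modifies(x)        # a tracemem[... -> ...] message is expected here

if (tracing) untracemem(x)

identical(x[1], first)  # TRUE: the caller's vector is untouched
z[1] == 0               # TRUE: only the function's copy changed
```

The caller's x stays intact in both cases; the difference is only whether R had to allocate a second million-element vector to guarantee that.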
Garbage collection
R’s garbage collector runs automatically when the memory heap reaches a threshold. You don’t need to call gc() manually for normal usage. The collector identifies objects with no remaining references and frees their memory. Local variables created inside a function body become eligible for collection once the function returns (unless something still references them, such as a returned closure), and their memory is reclaimed at the next collection.
Use gcinfo(TRUE) to see when garbage collection occurs:
gcinfo(TRUE)
x <- integer(100000)
x <- c(x, 1:18) # Allocates a new vector; a report is printed if this triggers gc
gcinfo(FALSE)
The output shows memory used by R’s two internal memory categories and how much was collected. Ncells count cons cells, the fixed-size nodes that back language objects such as expressions and environments, while Vcells measure the vector heap in 8-byte units, which stores numeric and character data. Most user-facing datasets reside in Vcells, so that number is typically the one to watch.
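The same statistics are available programmatically, because gc() returns them as a matrix. The sketch below reads the Vcells figure before, during, and after holding a large vector; used_mb() is a hypothetical helper name, and the exact numbers will vary with your session.

```r
# Hypothetical helper: Mb of vector heap currently in use.
# Column 2 of the gc() matrix is the "(Mb)" figure for "used".
used_mb <- function() gc()["Vcells", 2]

before <- used_mb()
big <- numeric(5e6)     # 5 million doubles, about 38 Mb
during <- used_mb()
rm(big)                 # drop the only reference
after <- used_mb()

c(before = before, during = during, after = after)
```

The "during" value should exceed the other two by roughly the size of the vector, which shows both the allocation and its reclamation once the reference is gone.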
Measuring object size
R provides several ways to measure memory usage, but the numbers can be confusing because different functions measure different things: allocated bytes, shared bytes, or total process memory.
object.size()
The base R function object.size() returns the memory allocation for an object:
x <- rep(1:10, each = 100)
object.size(x)
# 4048 bytes (1000 integers at 4 bytes each, plus a 48-byte header on 64-bit R)
This function reports what R allocates, which tends to overestimate because it doesn’t account for shared references. When multiple variables point to the same underlying data (as happens after assignment without modification), object.size() counts each variable as if it held an independent copy. For simple atomic vectors this error is negligible, but for nested lists and data frames it can inflate estimates dramatically.
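The over-counting is easy to reproduce with base R alone. In this sketch, model is a hypothetical list that stores the same data frame in two slots, standing in for any object with shared components.

```r
train <- data.frame(a = runif(1e5), b = runif(1e5))

# Hypothetical "model" object holding the same data frame twice
model <- list(data = train, fitted_on = train, coef = c(1, 2))

size_train <- as.numeric(object.size(train))
size_model <- as.numeric(object.size(model))

size_model / size_train   # a little over 2: the shared data frame is counted twice
```

No second copy of train exists in memory, so the real footprint of model is close to that of train alone.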
lobstr::obj_size()
The lobstr package provides a more accurate measurement that accounts for object sharing. Instead of summing the sizes of individual components, obj_size() tracks which memory regions are referenced multiple times and counts them only once. This gives you the true marginal memory cost of adding a reference.
library(lobstr)
x <- 1:1000
y <- list(x, x, x) # Three references to same object
object.size(y)
# Roughly three times the size of x (counts x once per list slot)
obj_size(y)
# Much smaller (recognizes the shared memory and counts x once)
The difference matters when you’re measuring memory usage of complex objects with shared components. For instance, a tree model object might reference the same training data frame from multiple internal slots; object.size() would multiply-count it, while obj_size() reflects the actual memory footprint.
Total memory usage
Check R’s total memory consumption with lobstr::mem_used():
lobstr::mem_used()
# e.g. 1.2 GB
This figure combines the cons cells and vector heap that R has in use, as tracked by the garbage collector. It is R’s own accounting of memory allocation, which may differ from what the operating system reports because R requests memory from the OS in chunks and does not immediately return freed memory.
Common memory pitfalls
Several patterns cause unexpected memory growth.
Growing vectors in loops
The most common mistake is growing a vector inside a loop:
# BAD: Creates a new vector on each iteration
result <- numeric(0)
for (i in 1:10000) {
result <- c(result, i)
}
# GOOD: Pre-allocate the vector
result <- numeric(10000)
for (i in 1:10000) {
result[i] <- i
}
Each c() call copies the entire vector. With 10,000 iterations, you create thousands of copies unnecessarily, and the total bytes copied grows quadratically with the number of iterations. Pre-allocating the result vector eliminates copies entirely and runs orders of magnitude faster on large loop counts.
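The cost difference can be measured directly. In this sketch, grow() and prealloc() are hypothetical wrappers around the two loops; timings depend on the machine, so none are shown.

```r
grow <- function(n) {
  result <- numeric(0)
  for (i in seq_len(n)) result <- c(result, i)   # copies all existing elements each time
  result
}

prealloc <- function(n) {
  result <- numeric(n)
  for (i in seq_len(n)) result[i] <- i           # writes into storage allocated once
  result
}

n <- 20000
system.time(a <- grow(n))
system.time(b <- prealloc(n))
identical(a, b)    # TRUE: same result, very different cost

# Elements copied by the growing version: 0 + 1 + ... + (n - 1)
n * (n - 1) / 2
```

Doubling n roughly quadruples the number of elements grow() has to copy, while prealloc() only doubles its work.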
Unintended object retention
Objects remain in memory until explicitly removed or they go out of scope. In scripts and notebooks, old objects accumulate unnoticed because R’s global environment persists across cell evaluations. A typical interactive session may hold dozens of intermediate data frames that consume memory long after they’re needed.
ls() # Shows all objects in your environment
rm(large_object) # Remove specific object
rm(list = ls()) # Clear everything
The workspace grows over time, consuming memory even when objects aren’t actively used. Regular cleanup with rm() and periodic restarts of the R session are the simplest ways to reclaim memory during long development sessions.
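Before deleting anything, it helps to know which objects are worth removing. The sketch below defines a hypothetical helper, largest_objects(), that ranks the objects in an environment by object.size(); it is demonstrated on a scratch environment so it does not depend on what your session contains.

```r
# Hypothetical helper: the n largest objects in an environment, in bytes
largest_objects <- function(env = globalenv(), n = 5) {
  objs <- mget(ls(env), envir = env)
  sizes <- vapply(objs, function(obj) as.numeric(object.size(obj)), numeric(1))
  head(sort(sizes, decreasing = TRUE), n)
}

# Scratch environment standing in for a cluttered workspace
scratch <- new.env()
assign("big_table", data.frame(x = runif(1e5)), envir = scratch)
assign("small_flag", TRUE, envir = scratch)

largest_objects(scratch)           # big_table is listed first
rm("big_table", envir = scratch)   # drop the reference so it can be collected
```

Called with no arguments, the helper inspects the global environment, which is usually where forgotten intermediate data frames live.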
Data frame copies
Operations that seem to modify a data frame in place actually create a new data frame object, although usually not a full copy of its data:
df <- data.frame(x = 1:1e6, y = rnorm(1e6))
# Creates a new data frame that shares the untouched columns x and y
df$z <- df$x + df$y
# transform() and within() also return a new data frame rather than modifying in place
df <- transform(df, z = x + y)
Since R 3.1.0, adding or replacing a column makes a shallow copy: R copies the list of column pointers and allocates only the new column, so the extra memory is roughly the size of that column. Full copies still occur when an operation rewrites every column, such as modifying a row, subsetting rows, or reordering. In those cases R holds both the original and the new data frame until the old one is released, so a 2 GB data frame can temporarily need about 4 GB.
Reducing memory usage
Remove unneeded objects
The simplest approach is deleting objects you no longer need:
rm(object1, object2)
gc() # Force a collection; this may prompt R to return memory to the OS
Call gc() after removing large objects if you need the memory back promptly. R would reclaim it at the next automatic collection anyway, but that may not happen before a memory-sensitive phase of your analysis. Whether freed memory actually goes back to the operating system depends on the platform and allocator, so treat this as a request rather than a guarantee.
Use appropriate data types
Smaller types use less memory. Storing whole-number data as integer (4 bytes per element) instead of numeric (double, 8 bytes per element) halves the memory footprint, which matters when working with vectors containing millions of elements.
# Numeric (double) uses 8 bytes per element
x <- as.numeric(1:1000) # Uses ~8 KB
# Integer uses 4 bytes per element; 1:1000 is already an integer vector
y <- 1:1000 # Uses ~4 KB
# Logical also uses 4 bytes per element, half the 8-byte pointer a character flag needs
flags <- rep(TRUE, 1000) # Uses ~4 KB
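Per-element costs can be checked directly rather than taken on trust. This sketch measures large vectors of each atomic type with object.size() and divides by the length, so the fixed vector header becomes negligible.

```r
n <- 1e6
sizes <- c(
  double  = as.numeric(object.size(numeric(n))),
  integer = as.numeric(object.size(integer(n))),
  logical = as.numeric(object.size(logical(n)))
)

round(sizes / n, 2)   # bytes per element: double 8, integer 4, logical 4
```

Note that logical vectors cost the same as integer vectors in R, so the saving from flags comes from avoiding double or character storage, not from a 1-byte representation.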
Process data in chunks
For files larger than available memory, reading the entire file at once is impossible. Chunked processing reads a fixed number of lines at a time, processes the batch, and discards it before reading the next. This keeps the in-memory footprint proportional to the chunk size rather than the full file size. The pattern uses a connection object and a loop that reads until the file is exhausted.
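read_chunked <- function(filepath, process, chunk_size = 10000) {
  # process is a caller-supplied function applied to each chunk of lines
  con <- file(filepath, "r")
  on.exit(close(con), add = TRUE)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Handle this chunk; it is overwritten by the next read
    process(lines)
  }
  invisible(NULL)
}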
The readr package provides read_csv_chunked() for this pattern, which adds automatic column type parsing and handles the chunking loop internally. For CSV files, read_csv_chunked() is both faster and safer than manually implementing the connection-based loop shown above.
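To see the connection-based loop work end to end without extra packages, it can be run on a small temporary file. The sketch below is self-contained: the file contents, the chunk size of 10, and the running-sum task are all made up for the demonstration.

```r
# Write a tiny one-column CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("value", as.character(1:25)), path)

con <- file(path, "r")
header <- readLines(con, n = 1)     # consume the header line once

total <- 0
n_chunks <- 0
repeat {
  lines <- readLines(con, n = 10)   # at most 10 data lines in memory at a time
  if (length(lines) == 0) break
  total <- total + sum(as.numeric(lines))
  n_chunks <- n_chunks + 1
}
close(con)
unlink(path)

total      # 325, the sum of 1 to 25
n_chunks   # 3 chunks of 10, 10, and 5 lines
```

Only the running total survives between iterations, which is what keeps memory use tied to the chunk size rather than the file size.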
Use disk.frame or arrow for large data
External memory solutions handle datasets larger than RAM by storing data on disk and loading only the portions needed for each operation. disk.frame virtualizes data frames so they behave like in-memory objects while transparently chunking computations. Arrow takes a different approach by memory-mapping files, letting the operating system decide which pages to keep in RAM.
library(disk.frame)
library(arrow)
# disk.frame keeps data on disk, processes in chunks
df <- as.disk.frame(mtcars, outdir = "df/", nchunks = 2)
# Arrow allows memory-mapped files
tbl <- arrow::open_dataset("parquet_files/")
The gc() function
The gc() function triggers garbage collection and reports memory statistics. Calling gc() is rarely necessary for performance, because R’s automatic collector runs often enough that manual calls rarely improve throughput. Its real value is that a collection after releasing large objects may prompt R to return memory to the operating system, and that it gives accurate memory statistics. Both matter in long-running processes that allocate and release large objects over time.
gc()
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  450000  6.0    1200000 17.0   676000  8.5
# Vcells 1200000  9.2    1600000 12.2  1300000  9.9
The output table has one row for Ncells (cons cells, which back language objects) and one for Vcells (the vector heap, which holds data). Each row shows the amount currently used, the threshold that triggers the next collection, and the maximum used since the last reset, with each count followed by its size in Mb.
Call gc(reset = TRUE) to reset the “max used” statistics to the current usage:
gc(reset = TRUE)
# Later gc() calls report max used relative to this baseline
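A clean baseline makes it possible to estimate the peak memory of a single step. The sketch below wraps the reset-run-read sequence in a hypothetical helper, peak_vcells_mb(), and applies it to adding a column to a data frame. The "max used" value is sampled when collections run, so treat the result as an estimate rather than an exact peak.

```r
# Hypothetical helper: peak vector-heap usage (Mb) recorded while evaluating expr
peak_vcells_mb <- function(expr) {
  invisible(gc(reset = TRUE))      # reset "max used" to the current usage
  force(expr)                      # evaluate the lazily passed expression now
  stats <- gc()
  stats["Vcells", ncol(stats)]     # last column is the "(Mb)" figure for "max used"
}

df <- data.frame(x = runif(1e6), y = runif(1e6))

baseline   <- peak_vcells_mb(NULL)
add_column <- peak_vcells_mb(df$z <- df$x + df$y)

add_column - baseline   # extra Mb needed at the peak of the assignment
```

Comparing the difference with the size of one double column (about 7.6 Mb for a million rows) indicates whether the step allocated only the new column or duplicated existing ones as well.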
Memory profiling tools
profvis
The profvis package visualizes memory usage during code execution. Unlike time-based profilers that show where your code spends its time, profvis tracks both time and memory allocation per line. The flame graph view highlights allocation hot spots, making it easy to identify which function or expression is responsible for unexpected memory growth.
library(profvis)
profvis({
data <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
data$z <- data$x + data$y
mean(data$z)
})
The profiling output shows memory allocated and released at each line, helping identify where memory spikes occur. Pay attention to lines where allocation far exceeds release: those are the points where intermediate objects accumulate and stay alive until the garbage collector can reclaim them.
lobstr::obj_addr()
Get the memory address of an object with obj_addr() to verify whether two variables share memory. Two variables pointing to the same address confirm that no copy has occurred yet, which is the desired state when working with large objects. This function is primarily a debugging tool for understanding when copy-on-modify triggers in your code.
library(lobstr)
x <- 1:10
y <- x
obj_addr(x) == obj_addr(y) # TRUE - same address
y <- c(x) # c() allocates a new vector
obj_addr(x) == obj_addr(y) # FALSE - different addresses
tracemem()
The built-in tracemem() function prints messages when an object is copied. After calling tracemem() on an object, any operation that duplicates it prints a diagnostic message to the console showing the memory addresses of the original and the copy. It requires an R build with memory profiling enabled, which the CRAN binaries provide. This is invaluable for finding unexpected copies in code that should, in theory, operate in place.
x <- runif(1e6)
tracemem(x) # Returns the address of x, e.g. "<0x...>"
y <- x # Second reference, no copy yet
y[1] <- 0 # Prints: tracemem[0x... -> 0x...]:
untracemem(x) # Stop tracing
This helps identify exactly where copy-on-modify triggers in your code. Place tracemem() on the object before the suspicious operation and check the console output when the operation runs. If a copy message appears where you didn’t expect one, the surrounding R code has created a reference that forced duplication.
Working with large datasets efficiently
Avoid loading entire files
Use column selection and filtering on database-backed tables through dplyr’s SQL translation, which the dbplyr package provides. The select() and filter() calls are translated to SQL and executed on the database server, so only the result set (not the full table) is transferred to R when collect() runs. This pattern works with any DBI-compatible database.
library(dplyr)
# db is assumed to be a lazy table created with tbl() on a DBI connection
# Only select needed columns, filter early
result <- db %>%
select(id, value, date) %>%
filter(date > "2024-01-01") %>%
collect()
Clone less
Every object you create uses memory. Combine transformations into a single transform() or mutate() call instead of saving each step under a new name. Each named intermediate (df1, df2) stays in memory until it is removed, so performing all transformations in one call avoids keeping extra data frames alive during the pipeline.
# Instead of creating multiple transformed versions
df1 <- transform(df, log_x = log(x))
df2 <- transform(df1, sqrt_y = sqrt(y))
# Transform once
df <- transform(df, log_x = log(x), sqrt_y = sqrt(y))
Monitor memory limits
On Windows with R versions before 4.2.0, memory.limit() reports or sets the maximum memory R can allocate; exceeding it throws an error rather than letting the OS handle the shortfall through paging. From R 4.2.0 onward the function is no longer supported: it returns Inf with a warning, and R is limited only by what the operating system provides.
memory.limit() # Get current limit in MB (Windows, R < 4.2.0)
memory.limit(8000) # Raise the limit to about 8 GB (Windows, R < 4.2.0)
Linux and macOS have never enforced an R-level limit this way. On these systems R allocates until the OS refuses a request, at which point R reports a “cannot allocate vector” error, or until the kernel’s out-of-memory killer terminates the process.
Memory allocation patterns to avoid
Growing a vector in a loop is the most common memory mistake in R. result <- c(); for(i in 1:n) result <- c(result, i) copies the entire vector on each iteration — O(n²) total work and memory allocations for an O(n) problem. Pre-allocate with result <- vector("numeric", n) and fill by index.
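The same problem applies to data frames: building a data frame row by row in a loop is slow because each rbind() copies all the rows accumulated so far. Collect the pieces in a list (here called results) and combine them once at the end with do.call(rbind, results) or dplyr::bind_rows(results).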
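A minimal sketch of the collect-then-combine pattern in base R follows. The per-iteration data frame is a stand-in for whatever each step of a real loop produces.

```r
n <- 100
rows <- vector("list", n)            # pre-allocated list, one slot per result

for (i in seq_len(n)) {
  # Stand-in for real per-iteration work
  rows[[i]] <- data.frame(id = i, value = i^2)
}

result <- do.call(rbind, rows)       # one combine step at the end
dim(result)                          # 100 rows, 2 columns
```

Filling list slots never copies earlier results, so the only large allocation happens in the final rbind().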
Copy-on-modify semantics
R uses copy-on-modify: when you modify a variable that has been assigned to another variable, R makes a copy before modifying. y <- x; y[1] <- 0 makes a copy of x for y. Until the modification, both x and y share the same memory (R uses reference counting to determine when copies are needed). tracemem(x) traces when an object is copied. Understanding copy-on-modify helps predict memory usage: operations that look like they operate on one object may be operating on two.
Large datasets
For data that does not fit in memory, arrow::open_dataset() opens Parquet files without loading them, and queries are pushed down to the Parquet reader. DBI::dbConnect(duckdb::duckdb()) opens a DuckDB connection that runs SQL on disk-based data. data.table can process a file in chunks using fread() with the nrows and skip arguments. These tools allow analysis of datasets much larger than available RAM without cluster computing.
Tools for memory analysis
pryr::object_size() measures the memory occupied by an R object, including all referenced objects, and counts shared components only once. This is more accurate than object.size() from base R, which counts shared components repeatedly and does not measure environments. The pryr package has since been superseded by lobstr, whose obj_size() does the same job. For a data frame, either function includes the memory for all columns.
gc() forces garbage collection and returns statistics about memory usage after the collection. gcinfo(TRUE) prints a message each time garbage collection runs, showing how frequently R reclaims memory during a computation. This is useful for identifying scripts that allocate large amounts of temporary memory.
memoise::memoise() caches function results, which is a form of memory management for expensive functions: it trades memory for computation time. The cache can be bounded in size with memoise::memoise(f, cache = cachem::cache_mem(max_size = 100 * 1024^2)) to prevent unbounded memory growth.
When to optimize memory
Most R analyses do not require memory optimization. The time to optimize is when R’s memory usage causes problems: slowdowns due to garbage collection, out-of-memory errors, or paging to disk. Measure first with lobstr::obj_size() and profmem::profmem() to find the actual bottleneck before spending time on optimization. Often a single large object or a loop with poor allocation is responsible for most of the memory pressure, and fixing that one issue is sufficient.
Summary
R’s copy-on-modify semantics protect against accidental mutation but can cause excessive copying in tight loops. The main strategies to reduce memory pressure are: preallocate output vectors instead of growing them, use data.table for in-place modification of large data frames, process data in chunks using readr’s chunked reader for files too large to fit in RAM, and check object sizes with lobstr::obj_size() to identify the largest allocations. When memory use becomes critical, profile with profmem before optimizing.
See also
- Performance Benchmarking in R — measuring and comparing R code performance
- High-Performance Data Manipulation with data.table — in-place modification for large datasets
- Parallel Computing in R — distribute computation across multiple cores
- Working with Arrow and Parquet — memory-mapped file access for out-of-core data
- Reading and Writing CSV Files in R — chunked file reading strategies