Memory Management in R
Memory management in R works differently than in languages like C or Python. R uses copy-on-modify semantics and automatic garbage collection, which simplifies coding but can cause unexpected memory usage patterns. Understanding these internals helps you write code that handles large datasets efficiently.
How R Manages Memory
R allocates memory dynamically and manages it through a garbage collector that runs automatically. When you create objects, R allocates space from the heap. When objects are no longer referenced, the garbage collector reclaims that memory.
Copy-on-Modify
When you modify an object in R, the language doesn’t always create a copy immediately. Instead, it uses copy-on-modify semantics: an object shared by more than one name is copied only at the moment you write to it, and at that point R duplicates the data.
x <- 1:1000
y <- x # No copy made yet - y points to same memory as x
# Modifying y triggers the copy
y[1] <- 0
# Now y has its own copy
This behavior matters in functions. Passing a large vector to a function doesn’t duplicate it unless the function modifies the input.
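A minimal base-R sketch makes the function-boundary behavior concrete (the function names are illustrative):

```r
big <- as.numeric(1:1e6)          # ~8 MB vector

reads_only <- function(v) sum(v)  # only reads v: no copy is made
modifies <- function(v) {
  v[1] <- 0                       # the first write triggers the copy
  v
}

total <- reads_only(big)          # big is passed without duplication
altered <- modifies(big)          # big is duplicated inside the function
big[1]                            # still 1: the caller's vector is untouched
```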
Garbage Collection
R’s garbage collector runs automatically when the memory heap reaches a threshold. You don’t need to call gc() manually for normal usage. The collector identifies objects with no remaining references and frees their memory.
Use gcinfo(TRUE) to see when garbage collection occurs:
gcinfo(TRUE)
x <- integer(1e7)
x <- c(x, x) # Repeated large allocations eventually trigger gc
gcinfo(FALSE)
The output shows memory used by different object types and how much was collected.
Measuring Object Size
R provides several ways to measure memory usage, but the numbers can be confusing.
object.size()
The base R function object.size() returns the memory allocation for an object:
x <- rep(1:10, each = 100) # 1,000 integers
object.size(x)
# 4048 bytes
This function reports what R allocates for the object, but because it cannot detect shared references, it counts shared components once per reference and so overestimates for lists with shared elements.
lobstr::obj_size()
The lobstr package provides a more accurate measurement that accounts for object sharing:
library(lobstr)
x <- runif(1000)
y <- list(x, x, x) # Three references to the same vector
object.size(y)
# about 24 KB (counts x three times)
obj_size(y)
# about 8 KB (recognizes the shared memory)
The difference matters when you’re measuring memory usage of complex objects with shared components.
Total Memory Usage
Check R’s total memory consumption with mem_used():
mem_used()
# e.g. 59.4 MB (varies by session)
This shows heap memory and cons cells used by R’s internal structures.
Common Memory Pitfalls
Several patterns cause unexpected memory growth.
Growing Vectors in Loops
The most common mistake is growing a vector inside a loop:
# BAD: Creates a new vector on each iteration
result <- numeric(0)
for (i in 1:10000) {
result <- c(result, i)
}
# GOOD: Pre-allocate the vector
result <- numeric(10000)
for (i in 1:10000) {
result[i] <- i
}
Each c() call allocates a new vector and copies every existing element, so the loop does quadratic work: across 10,000 iterations that is roughly 50 million element copies and 10,000 throwaway allocations.
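When each element requires a function call, vapply() allocates the output vector once and fills it in place; when the computation is vectorizable, a single expression avoids the loop entirely. A sketch:

```r
# vapply() pre-allocates one output vector of the declared shape
result1 <- vapply(1:10000, function(i) i * 2.0, numeric(1))

# Vectorized: no loop at all, one allocation
result2 <- (1:10000) * 2.0

identical(result1, result2) # TRUE
```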
Unintended Object Retention
Objects remain in memory until explicitly removed or they go out of scope. In scripts and notebooks, old objects accumulate:
ls() # Shows all objects in your environment
rm(large_object) # Remove specific object
rm(list = ls()) # Clear everything
The workspace grows over time, consuming memory even when objects aren’t actively used.
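Wrapping intermediate work in a function keeps large temporaries scoped: once the function returns, they have no remaining references and become collectable. A sketch with an illustrative helper:

```r
summarize_big <- function() {
  tmp <- as.numeric(1:1e6) # ~8 MB temporary, local to the function
  mean(tmp)                # only this scalar survives the call
}
m <- summarize_big()       # tmp is now unreferenced and collectable
m # 500000.5
```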
Data Frame Copies
Operations that seem to modify data in place often create copies:
df <- data.frame(x = 1:1e6, y = rnorm(1e6))
# Adding a column allocates it and copies the data frame's structure
df$z <- df$x + df$y
# transform() and within() also create copies
df <- transform(df, z = x + y)
For very large data frames, these intermediate copies double memory usage temporarily.
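Base R's tracemem() makes these hidden copies visible (on builds with memory profiling enabled, which includes the standard CRAN binaries); each printed line marks a duplication:

```r
df <- data.frame(x = 1:1e5, y = rnorm(1e5))
tracemem(df)        # report every duplication of df
df$z <- df$x + df$y # the assignment prints a tracemem line: df was copied
untracemem(df)      # stop tracing
```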
Reducing Memory Usage
Remove Unneeded Objects
The simplest approach is deleting objects you no longer need:
rm(object1, object2)
gc() # Force garbage collection to return memory to OS
Call gc() after removing large objects if you need the memory back immediately.
Use Appropriate Data Types
Smaller types use less memory:
# numeric (double) uses 8 bytes per element
x <- as.numeric(1:1000) # Uses ~8 KB
# integer uses 4 bytes per element
y <- 1:1000 # Uses ~4 KB
# Logical flags (4 bytes each) are still far cheaper than character strings
flags <- rep(TRUE, 1000) # Uses ~4 KB
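object.size() confirms the difference; rep() is used here so the vectors are materialized in full rather than stored as compact sequences:

```r
x_dbl <- rep(1.0, 1000) # double: 8 bytes per element
x_int <- rep(1L, 1000)  # integer: 4 bytes per element
object.size(x_dbl)      # ~8 KB
object.size(x_int)      # ~4 KB
```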
Process Data in Chunks
For files larger than available memory, read and process in chunks:
read_chunked <- function(filepath, process, chunk_size = 10000) {
  con <- file(filepath, "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    process(lines) # handle each chunk with a caller-supplied function
  }
}
The readr package provides read_csv_chunked() for this pattern.
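A self-contained run of the chunked pattern on a temporary file, summing 25 numbers ten lines at a time:

```r
tmp <- tempfile()
writeLines(as.character(1:25), tmp)

con <- file(tmp, "r")
total <- 0
repeat {
  chunk <- readLines(con, n = 10)         # at most 10 lines in memory
  if (length(chunk) == 0) break
  total <- total + sum(as.numeric(chunk)) # reduce the chunk, then drop it
}
close(con)
unlink(tmp)
total # 325
```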
Use disk.frame or arrow for Large Data
External memory solutions handle datasets larger than RAM:
library(disk.frame)
library(arrow)
# disk.frame keeps data on disk, processes in chunks
df <- as.disk.frame(mtcars, outdir = "df/", nchunks = 2)
# Arrow allows memory-mapped files
tbl <- arrow::open_dataset("parquet_files/")
The gc() Function
The gc() function triggers garbage collection and reports memory statistics. Despite what you might think, calling gc() is rarely necessary for performance:
gc()
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  450000  6.0    1200000 17.0   676000  8.5
# Vcells 1200000  9.2    1600000 12.2  1300000  9.9
R automatically runs garbage collection when needed. The only reason to call gc() manually is to return memory to the operating system, which matters in long-running processes or when you want accurate memory reporting.
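gc() also returns its statistics invisibly as a matrix, which is handy for logging in long-running jobs. A sketch:

```r
stats <- gc()
# Rows are Ncells (language objects) and Vcells (vector storage)
stats["Vcells", "used"] # vector cells currently in use
stats["Ncells", "used"] # cons cells currently in use
```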
Call gc(reset = TRUE) to reset the "max used" statistics:
gc(reset = TRUE)
# Now gc() reports from a clean baseline
Memory Profiling Tools
profvis
The profvis package visualizes memory usage during code execution:
library(profvis)
profvis({
data <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
data$z <- data$x + data$y
mean(data$z)
})
The profiling output shows memory allocated and released at each line, helping identify where memory spikes occur.
lobstr::obj_addr()
Get the memory address of an object to verify whether two variables share memory:
library(lobstr)
x <- 1:10
y <- x
obj_addr(x) == obj_addr(y) # TRUE - same address
y[1] <- 0L # Modifying y forces a copy
obj_addr(x) == obj_addr(y) # FALSE - different addresses
tracemem()
The built-in tracemem() function prints messages when an object is copied:
x <- 1:1e6
tracemem(x)
x[1] <- 0 # Prints: tracemem[0x... -> 0x...]
This helps identify exactly where copy-on-modify triggers in your code.
Working with Large Datasets Efficiently
Avoid Loading Entire Files
Use column selection and filtering in database queries or with packages like disk.frame:
library(dplyr)
# db is a remote table, e.g. db <- tbl(con, "sales") for a DBI connection
# Only select needed columns and filter early - the work happens in the database
result <- db %>%
  select(id, value, date) %>%
  filter(date > "2024-01-01") %>%
  collect() # Only the reduced result is pulled into memory
Clone Less
Every object you create uses memory. Reuse existing objects when possible:
# Instead of creating multiple transformed versions
df1 <- transform(df, log_x = log(x))
df2 <- transform(df1, sqrt_y = sqrt(y))
# Transform once
df <- transform(df, log_x = log(x), sqrt_y = sqrt(y))
Monitor Memory Limits
Before R 4.2, Windows builds exposed memory.limit() to query and raise R's memory cap:
memory.limit() # Get current limit in MB (Windows, R < 4.2 only)
memory.limit(8000) # Raise the limit to 8 GB
memory.limit() was removed in R 4.2. Current versions defer to the operating system on all platforms, as Linux and macOS always have.
Summary
R’s memory management is automatic but not invisible. The copy-on-modify behavior means passing objects to functions is cheap until you modify them. Use lobstr::obj_size() for accurate measurements, avoid growing vectors in loops, and remove objects you no longer need. For truly large data, consider external memory solutions like disk.frame or Arrow. The gc() function is rarely needed for performance but helps return memory to the OS in long-running processes.