Parallel purrr with furrr: Speed Up R Iterations
The furrr package brings parallel processing to your tidyverse workflows using the familiar purrr API. If you already write map() calls, switching to parallel purrr with furrr requires minimal code changes but can give you massive speedups on CPU-intensive tasks that would otherwise run sequentially.
Why furrr?
The purrr package gives you elegant functional iteration. But by default, purrr runs sequentially. Each iteration waits for the previous one to finish. When you have hundreds or thousands of items to process, this adds up.
furrr replaces purrr’s sequential mapping functions with parallel versions. You swap map() for future_map(), and your code runs across multiple cores automatically.
The magic comes from the future package, which handles the parallelism details. furrr translates your purrr-style code into future-based parallel execution.
Setup
Install both packages from CRAN:
install.packages(c("furrr", "future"))
Load them together. The future package provides the parallel backend infrastructure, while furrr wraps purrr’s mapping functions to use that backend. The plan(multisession) call configures the parallel strategy before any mapping work begins — without this, future_map() would still run, but it would fall back to sequential execution in the main process:
library(furrr)
library(future)
plan(multisession) # Use multiple R sessions
The plan() function controls how futures are resolved and determines where your parallel code runs. multisession creates separate R processes on your machine, each with its own memory space. This is the safest cross-platform option and works on Windows, macOS, and Linux without additional configuration. For a quick test on a single machine, this is usually the right choice.
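As a minimal sketch of how plan() changes where futures run, the snippet below switches between strategies and checks the worker count with future's nbrOfWorkers(). The worker count of 2 is an arbitrary choice for illustration, not a recommendation.

```r
library(future)

# Default strategy: futures are resolved in the current R session
plan(sequential)
n_seq <- nbrOfWorkers()

# Two background R sessions (2 is an arbitrary illustrative value)
plan(multisession, workers = 2)
n_par <- nbrOfWorkers()

# Return to sequential execution; this also shuts down the background workers
plan(sequential)

c(sequential = n_seq, multisession = n_par)
```

Calling plan(sequential) at the end of a script is a simple way to release the background sessions once the parallel work is done.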
Basic parallel mapping
Here’s the simplest example - transforming a numeric vector:
library(purrr)
library(furrr)
plan(multisession)
slow_function <- function(x) {
  Sys.sleep(0.1) # Simulate work
  x * 2
}
# Sequential (standard purrr): about 1 second (10 * 0.1)
system.time(result <- map(1:10, slow_function))
# Parallel: roughly 0.3 seconds plus worker overhead on a 4-core machine
system.time(result <- future_map(1:10, slow_function))
The interface is identical to purrr. Change the function name from map() to future_map() and your loops run in parallel. The system.time() output confirms the speedup: the sequential version takes roughly 1 second because each Sys.sleep(0.1) blocks for 100ms in series, while the parallel version distributes the work across cores so the total wall-clock time drops to a fraction of the sequential runtime.
Different output types
Just like purrr, furrr provides variants for different output types. These type-stable variants return vectors of the expected type rather than lists, avoiding the need to post-process results with unlist() or map_dbl():
- future_map(): list output
- future_map_lgl(): logical vector
- future_map_int(): integer vector
- future_map_dbl(): double vector
- future_map_chr(): character vector
- future_map_dfr(): row-bound data frame
- future_map_dfc(): column-bound data frame
Example with data frames:
library(tidyverse)
library(furrr)
library(future)
plan(multisession)
# Apply transformation to each group in parallel
results <- iris %>%
  split(.$Species) %>%
  future_map_dfr(~.x %>% mutate(sepal_area = Sepal.Length * Sepal.Width))
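The atomic-vector variants can be sketched the same way. The inputs below are made up for illustration; each call returns a plain vector of the stated type rather than a list.

```r
library(furrr)
plan(multisession, workers = 2)  # 2 workers is an arbitrary illustrative value

squares <- future_map_dbl(1:5, ~ .x^2)                 # double vector
labels  <- future_map_chr(1:3, ~ paste0("item_", .x))  # character vector
is_even <- future_map_lgl(1:4, ~ .x %% 2 == 0)         # logical vector

plan(sequential)
```

If the mapped function returns something that does not fit the requested type, the typed variants raise an error instead of silently coercing, which is usually what you want in a pipeline.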
Progress bars
Parallel code can feel slow if you do not see progress, especially for long-running batch jobs with hundreds or thousands of iterations. The progressr package integrates with furrr to display a progress bar that updates as each element finishes. Register a progress handler, wrap your map call in with_progress(), create a progressor with progressor(), and call it once per iteration inside the mapped function. Without that call, no progress is reported:
library(furrr)
library(future)
library(progressr)
plan(multisession)
handlers("txtprogressbar")
# The progressor p() signals one step each time an element finishes
with_progress({
  p <- progressor(steps = 100)
  results <- future_map(1:100, function(x) {
    p()
    x^2
  }, .options = furrr_options(seed = TRUE))
})
The .options argument also lets you set a random seed for reproducibility.
Error handling
furrr works with purrr’s safety functions. Wrap your function with safely() or possibly() to prevent a single failed iteration from aborting the entire parallel computation. With safely(), each result is a two-element list containing either a result or an error, letting you separate successes from failures after the map completes:
# safely() returns a list with result and error
safe_divide <- safely(function(x, y) {
  if (y == 0) stop("division by zero") # x / 0 is Inf in R, not an error
  x / y
}, otherwise = NA_real_)
results <- future_map(1:10, ~safe_divide(.x, sample(0:1, 1)),
  .options = furrr_options(seed = TRUE))
# Extract results and flag the failed iterations
successes <- map_dbl(results, "result")
failed <- !map_lgl(map(results, "error"), is.null)
The possibly() variant is simpler — it returns a default value on error instead of a two-element list. This is convenient when you want the map to produce a clean vector without post-processing, sacrificing the ability to inspect which iterations failed:
safe_log <- possibly(log, otherwise = NA_real_)
# log("a") throws an error, so it becomes NA; log(-1) would only warn and return NaN
future_map(list(1, "a", 2), safe_log)
Performance tips
Chunking
By default, furrr splits the input into one chunk per worker, which keeps scheduling overhead low but can leave workers idle when some elements take much longer than others. The chunk_size option of furrr_options() sets the number of elements per chunk: smaller chunks balance uneven workloads better, at the cost of more round-trips between the main process and the workers:
future_map(1:10000, slow_function,
  .options = furrr_options(chunk_size = 100))
Limiting workers
Setting the number of workers to match your available cores prevents oversubscription, where too many R processes compete for CPU time and degrade performance. A good starting point is the number of physical cores minus one, leaving a core free for the operating system and the main R process:
plan(multisession, workers = 4)
Seed setting
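Always set a seed for reproducible results when your mapped function uses random number generation. The seed argument of furrr_options() uses L’Ecuyer-CMRG streams to give each element its own statistically independent random stream, so results do not depend on how the work is split across workers. With seed = TRUE the streams are derived from the current RNG state, so call set.seed() first or pass an integer seed to get the same numbers on every run. Without the option, furrr warns that random numbers were generated without a declared seed, and results are not reproducible:
future_map(1:100, ~rnorm(1), .options = furrr_options(seed = 123L))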
Common pitfalls
Shared state
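Each worker is a separate R process with its own memory space. The future package detects the global variables your function refers to and copies them to the workers, so reading a lookup table defined in the parent session works. What does not work is changing parent-session state from a worker: the change is applied to the worker’s copy and then lost. This is the most common source of confusion when first using parallel operations in R:
# Reading a global works: global_lookup is exported automatically
global_lookup <- c(a = 1, b = 2)
future_map(c("a", "b"), ~global_lookup[[.x]])
# Writing does not: each worker updates its own copy
counter <- 0
future_map(1:10, ~{ counter <<- counter + 1 })
After this runs, counter is still 0 in the parent session. Return values from the mapped function and aggregate them in the main process instead.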
Side effects
Writing to files or modifying global state from inside future_map() can cause race conditions where multiple workers attempt to write to the same file simultaneously. Each worker runs in its own process, so there is no coordination between them for shared resources. Return the data from the map and perform side effects in the main process after collecting results:
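# Bad: every worker writes to the same file
future_map(df$path, ~write.csv(read.csv(.x), "output.csv"))
# Good: read in parallel, then combine and write once in the main process
combined <- future_map_dfr(df$path, read.csv)
write.csv(combined, "output.csv", row.names = FALSE)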
Small tasks
Parallelism adds overhead. If each iteration takes less than 10 milliseconds, sequential purrr might actually be faster.
Setting up a plan
furrr uses future plans to determine where parallel work runs. plan(multisession) creates R worker processes in the background; it is the safest option on all platforms including Windows. plan(multicore) uses fork() for lower overhead but works only on Unix-like systems. plan(sequential) runs everything in the current process, which is useful for debugging since parallel errors can be harder to trace.
Set the number of workers with plan(multisession, workers = 4). A good default is parallelly::availableCores() minus one, leaving a core for the main process and OS tasks.
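The "available cores minus one" default could be written as follows. This is a sketch: the max() guard is an added assumption so that a single-core machine still gets one worker.

```r
library(future)

# Leave one core free, but never request fewer than one worker
n_workers <- max(1L, parallelly::availableCores() - 1L)

plan(multisession, workers = n_workers)
nbrOfWorkers()

plan(sequential)
```

availableCores() respects common environment settings and scheduler limits, so it is usually a safer basis than a hard-coded number on shared machines.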
Random number reproducibility
Parallel code with random numbers requires careful seed management. furrr sets up L’Ecuyer-CMRG streams when you pass furrr_options(seed = TRUE). Each element gets a different, statistically independent sub-sequence, so results do not depend on the number of workers. For results that repeat across runs, call set.seed() beforehand or pass an integer seed instead of TRUE. Without the seed option, furrr warns about undeclared random number use and results differ between runs.
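One way to check reproducibility is to run the same map twice with a fixed integer seed and compare the results. The seed value 123L is arbitrary; the comparison is expected to come out TRUE.

```r
library(furrr)
plan(multisession, workers = 2)  # 2 workers is an arbitrary illustrative value

# Same integer seed on both runs, so both runs should draw the same numbers
run1 <- future_map_dbl(1:5, ~ rnorm(1), .options = furrr_options(seed = 123L))
run2 <- future_map_dbl(1:5, ~ rnorm(1), .options = furrr_options(seed = 123L))

identical(run1, run2)

plan(sequential)
```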
When parallelism helps
Parallelism is most effective for CPU-bound tasks with minimal inter-process communication. Model fitting on independent folds, simulation replications, and bootstrapping are ideal. Parallelism adds overhead: process startup, data serialization, and result collection. For fast operations (under a millisecond each), the overhead exceeds the benefit. Benchmark with microbenchmark or bench::mark() before committing to parallel execution.
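A rough way to see the trade-off on your own machine is to time a trivial task and a deliberately slow one under both purrr and furrr. The functions tiny and slow are hypothetical stand-ins, and the timings vary by machine, so no expected numbers are given; system.time() is used here only to avoid extra dependencies.

```r
library(purrr)
library(furrr)
plan(multisession, workers = 2)  # 2 workers is an arbitrary illustrative value

tiny <- function(x) x + 1                       # microseconds of work
slow <- function(x) { Sys.sleep(0.05); x + 1 }  # 50 ms of simulated work

t_tiny_seq <- system.time(map(1:200, tiny))[["elapsed"]]
t_tiny_par <- system.time(future_map(1:200, tiny))[["elapsed"]]
t_slow_seq <- system.time(map(1:20, slow))[["elapsed"]]
t_slow_par <- system.time(future_map(1:20, slow))[["elapsed"]]

# Compare elapsed seconds side by side
c(tiny_seq = t_tiny_seq, tiny_par = t_tiny_par,
  slow_seq = t_slow_seq, slow_par = t_slow_par)

plan(sequential)
```

For serious comparisons, bench::mark() or microbenchmark give repeated measurements rather than a single timing.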
Passing data to workers
Each furrr worker is a separate R process with no shared memory. Data is serialized, sent to the worker, and the result is serialized back. Objects referenced inside the mapped function are automatically detected and exported. For large shared objects (a big data frame accessed by every iteration), it is more efficient to write the data to disk (as a Parquet file or pin) and have each worker read it independently rather than serializing and sending it to every worker.
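The read-from-disk pattern might look like the sketch below. It uses an RDS file rather than Parquet or a pin purely to avoid extra dependencies, mtcars stands in for a large table, and mean_mpg_for_cyl is a hypothetical helper.

```r
library(furrr)
plan(multisession, workers = 2)  # 2 workers is an arbitrary illustrative value

# Stand-in for a large shared table, written to disk once
lookup_path <- tempfile(fileext = ".rds")
saveRDS(mtcars, lookup_path)

# Only the short path string travels to the workers, not the table itself
mean_mpg_for_cyl <- function(cyl_value, path) {
  big_table <- readRDS(path)
  mean(big_table$mpg[big_table$cyl == cyl_value])
}

means <- future_map_dbl(c(4, 6, 8), mean_mpg_for_cyl, path = lookup_path)

plan(sequential)
```

This version re-reads the file for every element, which only pays off when reading is cheaper than serializing the object to each worker; it also assumes all workers can see the same file system.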
Progress tracking
furrr_options(stdout = TRUE) forwards worker output to the main console, which is useful for debugging but verbose in production. For progress bars, use progressr: inside a with_progress() call, create a progressor with p <- progressor(steps = n), run future_map(), and call p() inside the mapped function. This works across all future backends without changing the parallel code.
Debugging parallel code
Parallel errors can be harder to debug than sequential errors because the worker’s state is not visible in the main session. Debug by switching to plan(sequential) — all code runs in the main process, making browser(), traceback(), and print() debugging available. Once the bug is fixed, switch back to the parallel plan. If the bug only occurs in parallel (e.g., a race condition or missing export), inspect the worker’s environment with future::value(future({ls()})).
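The switch to plan(sequential) can be sketched with a deliberately broken function. Both buggy and inputs are hypothetical; the tryCatch() is only there so the example captures the message instead of stopping.

```r
library(furrr)

# Hypothetical function with a bug: it fails on character input
buggy <- function(x) {
  if (is.character(x)) stop("unexpected character input: ", x)
  x * 2
}
inputs <- list(1, 2, "three", 4)

# Run in the main process so browser() and traceback() are available
plan(sequential)
outcome <- tryCatch(
  future_map(inputs, buggy),
  error = function(e) conditionMessage(e)
)
outcome
```

Because only the plan() call changes, the same future_map() line can be rerun under the parallel plan once the failing input has been identified.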
Combining with tidyverse
furrr integrates directly with tidyverse patterns. In a dplyr pipeline, replace mutate(result = map(col, slow_fn)) with mutate(result = future_map(col, slow_fn)) to parallelize the map. The rest of the pipeline remains sequential. For model fitting across groups, group_modify(~ future_map_dfr(...)) parallelizes within each group. The future_pmap() variant handles multiple-argument iteration over list columns or data frame rows.
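A small sketch of the mutate() pattern, using a made-up params table: future_pmap() iterates over three columns in parallel, and future_map_int() summarizes the resulting list column. The seed value is arbitrary.

```r
library(dplyr)
library(furrr)
plan(multisession, workers = 2)  # 2 workers is an arbitrary illustrative value

# Hypothetical simulation settings, one row per scenario
params <- tibble(mean = c(0, 5, 10), sd = c(1, 2, 3), n = c(3, 4, 5))

sims <- params %>%
  mutate(
    # One parallel call per row; arguments are matched by position
    draws = future_pmap(
      list(mean, sd, n),
      function(mean, sd, n) rnorm(n, mean, sd),
      .options = furrr_options(seed = 1L)
    ),
    n_draws = future_map_int(draws, length)
  )

plan(sequential)
```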
The furrr value proposition
Parallelizing code in R traditionally required explicit setup: creating a cluster, exporting variables, managing workers, and cleaning up. furrr removes this friction by making purrr’s map functions work in parallel through a one-line backend change. You write map calls in the normal way, call plan() once to enable parallelism, and the same code runs across multiple CPU cores. Switching back to sequential execution means calling plan(sequential).
The cost of parallelism is serialization overhead: each worker runs in a separate R process, and data must be sent to workers and results returned. For small or fast computations, this overhead exceeds the time saved by parallelism. There is no universal threshold, because the break-even point depends on how much data each task ships and on the backend. As a rough guide, tasks that each take a few tenths of a second or more usually benefit, while tasks in the millisecond range often run faster sequentially.
Choosing a backend
furrr uses the future package as its backend. The plan() function selects the execution strategy. On a single machine, plan(multisession) starts multiple R processes and is the safest cross-platform option. On Linux and macOS, plan(multicore) uses forked processes, which is faster because it avoids copying the parent environment, but it does not work on Windows or in RStudio on any platform.
For cluster or HPC environments, future supports distributed backends that spread work across multiple machines. The same furrr code runs unchanged — only the plan() call differs. This backend portability is the main reason to choose furrr over lower-level parallelism approaches when you anticipate needing to scale beyond a single machine.
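Backend portability can be illustrated on a single machine by running one function under two plans. run_analysis is a hypothetical helper; a cluster backend would be swapped in the same way, by changing only the plan() call.

```r
library(furrr)

# Hypothetical analysis step; note that it never mentions a backend
run_analysis <- function(x) future_map_dbl(x, ~ sqrt(.x) + 1)

plan(sequential)
res_seq <- run_analysis(1:8)

plan(multisession, workers = 2)  # 2 workers is an arbitrary illustrative value
res_par <- run_analysis(1:8)

plan(sequential)

# The two runs are expected to agree
all.equal(res_seq, res_par)
```

Keeping plan() out of reusable functions and setting it once at the top of a script lets the caller decide where the work runs.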
Managing state and side effects
Workers in a multisession plan run in separate R processes that do not share memory. Variables defined in the parent session are automatically copied to workers through future’s automatic export mechanism, but modifications made in workers do not propagate back. Side effects like writing to a shared database or updating a global variable are not safe in parallel code.
Random number generation requires explicit handling in parallel code. The future_map functions use the future package’s RNG-safe generators when .options = furrr_options(seed = TRUE) is set, which ensures reproducible and statistically independent random sequences across workers. Forgetting to set this option when your code uses random numbers produces results that vary between runs and may have subtle correlations between parallel branches.
Monitoring progress
furrr integrates with the progressr package for progress reporting. Create a progressor inside a with_progress() call and signal it from the mapped function, and a progress bar advances as workers complete tasks. Progress reporting works across all future backends including parallel ones, which is useful for long-running batch jobs where you need to know how much work remains.
Conclusion
furrr makes parallel processing accessible to tidyverse users. The learning curve is minimal if you already know purrr. The speedup can be dramatic for CPU-bound tasks.
Start with future_map() replacing your map() calls. Add progress bars with progressr. Use safely/possibly for error handling. Profile with microbenchmark to verify you’re actually gaining speed.
See also
- Parallel Computing in R: mclapply, parLapply, and the foundations furrr builds on
- Functional Programming with purrr: The sequential mapping functions that furrr parallelizes
- The apply Family: apply, lapply, sapply, tapply: Base R iteration patterns
- Fast Data Manipulation with data.table: Combine fast data processing with parallel iteration