Random Forests in R

Random forests are one of the most widely used machine learning algorithms in R. They build hundreds of decision trees and combine their predictions, producing models that are accurate, stable, and relatively resistant to overfitting. This tutorial covers how random forests work, how to train them using the randomForest package and the tidymodels framework, which hyperparameters to tune, and how to interpret feature importance scores.

What you’ll learn

This tutorial covers the key concepts and practical techniques for working with random forests in R. By the end, you will know how to apply the core functions in real data analysis workflows.

What is a random forest?

A random forest is an ensemble of many individual decision trees. Each tree is trained on a bootstrap sample of the data, sampling rows with replacement, rather than the full dataset. At each node in each tree, only a random subset of features is considered for splitting. When making a prediction, every tree in the forest casts a vote and the final output is the majority vote for classification or the average for regression.

This two-layer randomness, bootstrap sampling of rows and random selection of features at each split, produces a collection of trees that are diverse yet each reasonably accurate. Averaging or voting across diverse models cancels out individual errors, yielding a final prediction that is more reliable than any single tree could produce.
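The bootstrap-and-vote part of this idea can be sketched by hand. The example below is a simplified illustration, not how randomForest is implemented: it uses rpart (a recommended package shipped with R) to grow each tree, and it performs bagging only, because rpart does not expose the per-split feature sampling that a real random forest adds. The names n_trees, trees, votes and bagged_pred are made up for this sketch.

```r
library(rpart)  # recommended package shipped with R

set.seed(1)
n_trees <- 25
n <- nrow(iris)

# Grow each tree on a bootstrap sample (rows drawn with replacement).
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(n, size = n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})

# Every tree votes on every row; this gives a 150 x 25 character matrix.
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = iris, type = "class"))
})

# The ensemble prediction is the most common class in each row of votes.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Training accuracy of the bagged ensemble (optimistic, shown only as a check).
mean(bagged_pred == iris$Species)
```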

The key advantages are:

  • Each tree learns a different part of the data, so the forest as a whole handles non-linearity and feature interactions without explicit engineering.
  • Adding more trees rarely hurts performance; the out-of-bag (OOB) error stabilises once enough trees have been grown.
  • The algorithm handles both classification and regression natively and is robust to outliers that affect only a minority of bootstrap samples.

Training with the randomForest package

The randomForest package is the canonical implementation in R. It is a port by Andy Liaw and Matthew Wiener of the original Fortran code by Leo Breiman and Adele Cutler. Install it from CRAN:

install.packages("randomForest")

Classification example

The iris dataset is built into R and provides a convenient test case for classification:

library(randomForest)

set.seed(31)
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf_model)

The output reports the OOB error estimate and confusion matrix automatically.
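The same numbers can also be read from the fitted object rather than from the printed summary. The sketch below refits the model so that it runs on its own, then pulls the per-tree OOB error from the err.rate matrix and the confusion matrix from the confusion component; oob_by_tree is just an illustrative variable name.

```r
library(randomForest)

set.seed(31)
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

# err.rate has one row per tree; the "OOB" column is the cumulative OOB error.
oob_by_tree <- rf_model$err.rate[, "OOB"]
tail(oob_by_tree, 1)   # OOB error of the full 500-tree forest

# Confusion matrix built from OOB predictions, with a class.error column.
rf_model$confusion
```

Plotting oob_by_tree against the tree index is a quick way to check whether the error has levelled off.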

Regression example

For regression tasks, the target variable must be numeric. The MASS package ships the Boston housing dataset:

library(MASS)

set.seed(7)
rf_boston <- randomForest(medv ~ ., data = Boston, ntree = 500, mtry = 4)
print(rf_boston)

The output reports the mean squared error (MSE) on OOB predictions and the percentage of variance the model explains.
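For regression forests, the fitted object stores these values per tree in the mse and rsq components. A minimal, self-contained sketch (final_mse and final_rsq are illustrative names):

```r
library(randomForest)
data(Boston, package = "MASS")

set.seed(7)
rf_boston <- randomForest(medv ~ ., data = Boston, ntree = 500, mtry = 4)

# mse and rsq each hold one value per tree, computed on OOB predictions.
final_mse <- tail(rf_boston$mse, 1)   # OOB mean squared error with all trees
final_rsq <- tail(rf_boston$rsq, 1)   # OOB pseudo R-squared with all trees

sqrt(final_mse)   # OOB RMSE, in the units of medv
final_rsq
```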

Key hyperparameters

Four hyperparameters control the behaviour of a random forest. Understanding what each one does helps you navigate the bias-variance trade-off when tuning a model.

Parameter  What it controls                   Default (classification)  Default (regression)
ntree      Number of trees in the forest      500                       500
mtry       Features considered at each split  floor(sqrt(p))            floor(p / 3)
nodesize   Minimum size of terminal nodes     1                         5
maxnodes   Maximum terminal nodes per tree    unlimited                 unlimited

ntree controls how many trees are grown. Error usually stabilises after a few hundred trees, but using 500 or more gives a stable OOB estimate.

mtry is the number of features randomly sampled as candidates at each split. A lower value forces trees to be more diverse. For classification, floor(sqrt(p)) is a conventional starting point; for regression, floor(p / 3) is standard.

nodesize sets the minimum number of observations in a terminal node. Smaller values allow deeper, more complex trees; larger values stop splitting earlier and produce shallower trees.

maxnodes directly limits the number of terminal nodes per tree, providing another way to control complexity.

Here is how to pass all four to randomForest():

set.seed(42)
rf_tuned <- randomForest(
  Species ~ .,
  data = iris,
  ntree = 300,
  mtry = 3,
  nodesize = 2,
  maxnodes = 20
)
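One simple way to see what a hyperparameter does is to refit the forest over a few values and compare the OOB error. The loop below is a sketch of that idea for mtry on iris, which has only four predictors, so the differences are likely to be small; oob_for_mtry is an illustrative name.

```r
library(randomForest)

# Refit with each candidate mtry, using the same seed so runs are comparable.
oob_for_mtry <- sapply(1:4, function(m) {
  set.seed(42)
  fit <- randomForest(Species ~ ., data = iris, ntree = 300, mtry = m)
  fit$err.rate[fit$ntree, "OOB"]   # OOB error after the last tree
})
names(oob_for_mtry) <- paste0("mtry_", 1:4)

oob_for_mtry
```

On larger problems, the cross-validated grid search with tidymodels is usually a better tool than a manual loop like this.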

Making predictions

Use predict() with a fitted randomForest object:

pred_class <- predict(rf_model, newdata = iris[1:5, ])
pred_prob <- predict(rf_model, newdata = iris[1:5, ], type = "prob")
pred_num <- predict(rf_boston, newdata = Boston[1:5, ])
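Predicting on rows the model was trained on overstates accuracy, so a held-out set gives a fairer picture. The sketch below assumes a simple random split of iris into 120 training and 30 test rows; the split size and the names train, test and rf_split are choices made for illustration.

```r
library(randomForest)

set.seed(99)
test_rows <- sample(nrow(iris), size = 30)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

rf_split <- randomForest(Species ~ ., data = train, ntree = 500)

# Predict classes for rows the forest has never seen.
test_pred <- predict(rf_split, newdata = test)

table(predicted = test_pred, actual = test$Species)
test_acc <- mean(test_pred == test$Species)
test_acc
```

Calling predict() without newdata returns the OOB predictions for the training rows, which is another way to avoid this optimism.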

A tidymodels pipeline

The tidymodels ecosystem provides a consistent interface through parsnip’s rand_forest():

library(tidymodels)

rf_spec <- rand_forest(trees = 500, mtry = 2, min_n = 5) |>
  set_engine("randomForest") |>
  set_mode("classification")

rf_fit <- rf_spec |> fit(Species ~ ., data = iris)
predict(rf_fit, new_data = iris[1:5, ])

For large datasets, switching the engine to ranger is a one-line change:

rf_spec_ranger <- rand_forest(trees = 500, mtry = 2, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("classification")
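Fitting and predicting work the same way regardless of engine. A short sketch, assuming the ranger package is installed, that fits the ranger-backed specification and requests class probabilities (rf_fit_ranger and probs are illustrative names):

```r
library(tidymodels)

rf_spec_ranger <- rand_forest(trees = 500, mtry = 2, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("classification")

set.seed(3)
rf_fit_ranger <- rf_spec_ranger |> fit(Species ~ ., data = iris)

# type = "prob" returns one .pred_<class> column per outcome level.
probs <- predict(rf_fit_ranger, new_data = iris[1:5, ], type = "prob")
probs
```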

Hyperparameter tuning with tune_grid

The tidymodels workflow integrates with cross-validation and grid search:

library(titanic)

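# Classification mode needs a factor outcome, so convert Survived (and Sex) to factors
titanic <- titanic_train |>
  dplyr::select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) |>
  mutate(
    Survived = factor(Survived),
    Sex = factor(Sex),
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)
  )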

rf_tune_spec <- rand_forest(trees = 300, mtry = tune(), min_n = tune()) |>
  set_engine("randomForest") |>
  set_mode("classification")

set.seed(123)  # make the fold assignment reproducible
folds <- vfold_cv(titanic, v = 5, strata = Survived)

rf_grid <- grid_regular(mtry(range = c(2, 6)), min_n(range = c(5, 20)), levels = 3)

tune_results <- rf_tune_spec |>
  tune_grid(Survived ~ ., resamples = folds, grid = rf_grid,
            metrics = metric_set(accuracy, roc_auc))

show_best(tune_results, metric = "roc_auc")

best_params <- select_best(tune_results, metric = "roc_auc")
rf_spec_final <- rf_tune_spec |> finalize_model(best_params)
rf_final <- rf_spec_final |> fit(Survived ~ ., data = titanic)

grid_regular() evaluates all combinations at the specified number of levels. For larger search spaces, grid_max_entropy() draws a space-filling sample instead; recent dials releases offer grid_space_filling() for the same purpose.

How random forests work

A random forest trains many decision trees, each on a random bootstrap sample of the training data, with a random subset of features considered at each split. Individual trees overfit but make different errors, so the ensemble average (regression) or majority vote (classification) is typically more accurate than any single tree. Combining bagging with feature randomisation lowers the correlation between trees, which in turn lowers the variance of the ensemble.

Feature importance

ranger::importance() extracts feature importance scores from a fitted model, but only if importance was requested at training time; the default, importance = "none", stores nothing. importance = "impurity" measures how much each feature reduces node impurity across all splits in all trees. importance = "permutation" shuffles each feature and measures the resulting drop in accuracy, which is more reliable but slower. Variable importance guides feature selection but can be misleading for correlated features: both may show low importance even though either one alone predicts well.
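As a small illustration, the sketch below fits two ranger models on iris, one per importance mode, and compares the rankings. The object names fit_impurity and fit_permutation are made up for this example.

```r
library(ranger)

set.seed(11)
fit_impurity <- ranger(Species ~ ., data = iris, num.trees = 500,
                       importance = "impurity")

set.seed(11)
fit_permutation <- ranger(Species ~ ., data = iris, num.trees = 500,
                          importance = "permutation")

# importance() returns a named numeric vector, one entry per feature.
imp_impurity <- sort(importance(fit_impurity), decreasing = TRUE)
imp_permutation <- sort(importance(fit_permutation), decreasing = TRUE)

imp_impurity
imp_permutation
```

The two scales are not comparable with each other; only the ordering within each vector is meaningful.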

Hyperparameters

Key hyperparameters for tuning in ranger: num.trees (number of trees; more trees help until the error plateaus, after which they only add computation time), mtry (features per split; the default is floor(sqrt(p))), and min.node.size (minimum leaf size, which controls tree depth). The ranger defaults are reasonable for most cases, so tune only if performance is unsatisfactory. Use cross-validation via caret or tidymodels to select hyperparameters.

Out-of-Bag error

Random forests come with a built-in estimate of generalisation error. Each tree's bootstrap sample contains about 63% of the distinct training rows; the remaining 37% or so (the out-of-bag, or OOB, rows) act as a validation set for that tree, so performance can be estimated without a separate held-out set. ranger(outcome ~ ., data = train)$prediction.error gives the OOB error directly. For large datasets, OOB error is cheaper to compute than k-fold CV and usually gives a reliable estimate of generalisation performance.
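The 63% figure follows from sampling with replacement: the chance that a given row appears at least once in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. A quick base R check, with variable names chosen for illustration:

```r
set.seed(123)
n <- 10000

# Draw one bootstrap sample and count how many distinct rows it contains.
boot_rows <- sample(n, size = n, replace = TRUE)
in_bag_fraction <- length(unique(boot_rows)) / n

# Theoretical probability that a given row is drawn at least once.
theoretical <- 1 - (1 - 1 / n)^n

in_bag_fraction       # simulated share of rows that are in-bag
theoretical           # close to 1 - exp(-1)
1 - in_bag_fraction   # simulated out-of-bag share
```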

Tree ensembles and the bias-variance trade-off

Random forests are an ensemble method that reduces prediction error by combining many individual decision trees into one stronger model. Each fully grown tree has low bias but high variance: a tree trained on slightly different data produces a different structure and different predictions. By averaging predictions from many trees trained on bootstrapped samples, random forests reduce variance with little increase in bias. This variance reduction is why ensembles outperform individual trees.

The diversity of trees in the ensemble is critical. If all trees are highly correlated, always splitting on the same features in similar ways, averaging them does not reduce variance much. The random feature selection at each split (considering only a random subset of features rather than all features) is what creates diversity. Trees that differ because they each see different subsets of features disagree more with each other, and their average is more accurate than correlated trees’ average.
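A standard result makes this concrete: if B trees each have prediction variance sigma2 and pairwise correlation rho, the variance of their average is rho * sigma2 + (1 - rho) * sigma2 / B. The first term does not shrink as trees are added, so only lower correlation reduces it. The helper below simply evaluates that formula; ensemble_variance is a hypothetical function written for this illustration, and the equal-variance, equal-correlation setting is an idealisation.

```r
# Variance of the mean of B identically distributed predictions with
# common variance sigma2 and pairwise correlation rho.
ensemble_variance <- function(rho, sigma2 = 1, B = 500) {
  rho * sigma2 + (1 - rho) * sigma2 / B
}

# With 500 trees, the floor set by the correlation dominates.
rhos <- c(0.9, 0.5, 0.1)
var_by_rho <- sapply(rhos, ensemble_variance)
names(var_by_rho) <- paste0("rho_", rhos)
var_by_rho

# Adding trees helps only the second term.
ensemble_variance(rho = 0.5, B = 10)
ensemble_variance(rho = 0.5, B = 1000)
```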

Implementation choices in R

The randomForest package is the original R implementation, based directly on Breiman’s algorithm. The ranger package is a faster, memory-efficient reimplementation that supports parallel computation and handles larger datasets. The randomForestSRC package adds support for survival outcomes and mixed-type features. For most classification and regression tasks, ranger is the practical choice because it is fast enough to make cross-validation and hyperparameter tuning feasible.

The tidymodels integration through the ranger engine provides a consistent interface. The number of trees, mtry, and the minimum node size are set in the model specification, so they can be tuned inside a tidymodels workflow alongside preprocessing steps, using the same cross-validation infrastructure.
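A minimal sketch of such a workflow follows, assuming tidymodels and ranger are installed. It bundles a recipe with a ranger-backed specification and evaluates it with 5-fold cross-validation. The normalisation step is there only to show where preprocessing goes; tree-based models do not need scaled inputs. The names rf_recipe, rf_wf and cv_results are illustrative.

```r
library(tidymodels)

# Preprocessing lives in the recipe and is re-estimated within each fold.
rf_recipe <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors())

rf_spec <- rand_forest(trees = 500, mtry = 2, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_recipe(rf_recipe) |>
  add_model(rf_spec)

set.seed(5)
folds <- vfold_cv(iris, v = 5, strata = Species)

# Fit the whole workflow on each fold and collect the accuracy.
cv_results <- fit_resamples(rf_wf, resamples = folds,
                            metrics = metric_set(accuracy))
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```

Replacing fixed values with tune() in rf_spec and swapping fit_resamples() for tune_grid() turns the same workflow into a grid search.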

Interpreting and explaining predictions

Random forests are not inherently interpretable: the prediction comes from averaging hundreds of trees, and there is no single parameter to examine. Permutation-based variable importance provides a global explanation of which features matter most. Partial dependence plots show the marginal relationship between one feature and the predicted outcome, averaging over all other features.
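The randomForest package includes partialPlot() for this purpose. The sketch below computes the partial dependence of the predicted median home value on lstat in the Boston data, with plotting switched off so that the returned x and y values can be inspected directly; the choice of lstat and the name pd are for illustration only.

```r
library(randomForest)
data(Boston, package = "MASS")

set.seed(7)
rf_boston <- randomForest(medv ~ ., data = Boston, ntree = 300)

# plot = FALSE skips the figure; the function returns a list with
# x (grid of lstat values) and y (average prediction at each value).
pd <- partialPlot(rf_boston, pred.data = Boston, x.var = "lstat",
                  plot = FALSE)

head(data.frame(lstat = pd$x, avg_prediction = pd$y))
```

Setting plot = TRUE (the default) draws the curve instead.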

For individual predictions, the SHAP (SHapley Additive exPlanations) framework attributes the difference between a prediction and the baseline prediction to each feature. The shapviz package visualises SHAP values that have been computed by packages such as kernelshap, fastshap, or treeshap, which work with randomForest and ranger models. SHAP values satisfy consistency and efficiency properties that make them theoretically grounded explainability measures. For applications where explaining individual predictions matters (medical decisions, credit scoring), SHAP values are a widely used approach.
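One possible pipeline, sketched here under the assumption that the kernelshap and shapviz packages are installed, computes model-agnostic Kernel SHAP values for a few rows and passes them to shapviz for plotting. Kernel SHAP is only one of several estimators, and the row counts, background sample size and object names below are arbitrary choices for the example.

```r
library(randomForest)
library(kernelshap)
library(shapviz)
data(Boston, package = "MASS")

set.seed(7)
rf_boston <- randomForest(medv ~ ., data = Boston, ntree = 200)

features  <- setdiff(names(Boston), "medv")
X_explain <- Boston[1:5, features]                         # rows to explain
bg_X      <- Boston[sample(nrow(Boston), 50), features]    # background data

# Kernel SHAP: one row of attributions per explained observation.
ks <- kernelshap(rf_boston, X = X_explain, bg_X = bg_X, verbose = FALSE)
shap_matrix <- ks$S

# shapviz wraps the result for plotting.
sv <- shapviz(ks)
sv_importance(sv)             # mean absolute SHAP value per feature
sv_waterfall(sv, row_id = 1)  # contribution breakdown for the first row
```

Each row of shap_matrix, added to the baseline prediction, should approximately reproduce the model's prediction for that row, which is the efficiency property mentioned above.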

Summary

Random forests are a reliable, low-configuration machine learning algorithm suitable for both classification and regression tasks. The randomForest package gives you direct control over ntree, mtry, nodesize, and maxnodes, while the tidymodels framework provides a consistent interface through parsnip’s rand_forest().

Key takeaways:

  • Set ntree to at least 300–500 and leave it there; focus tuning effort on mtry and min_n.
  • Use permutation-based importance from vip rather than Gini importance for more reliable rankings.
  • tidymodels handles the boilerplate for cross-validation and hyperparameter search, producing a tuned model ready for new data.
