Random Forests in R

· 5 min read · Updated March 26, 2026 · intermediate
machine-learning random-forest tidymodels classification regression

Random forests are one of the most widely used machine learning algorithms in R. They build hundreds of decision trees and combine their predictions, producing models that are accurate, stable, and relatively resistant to overfitting. This tutorial covers how random forests work, how to train them using the randomForest package and the tidymodels framework, which hyperparameters to tune, and how to interpret feature importance scores.

What Is a Random Forest?

A random forest is an ensemble of many individual decision trees. Each tree is trained on a bootstrap sample of the data — sampling rows with replacement — rather than the full dataset. At each node in each tree, only a random subset of features is considered for splitting. When making a prediction, every tree in the forest casts a vote and the final output is the majority vote for classification or the average for regression.

This two-layer randomness — bootstrap sampling of rows and random selection of features at each split — produces a collection of trees that are diverse yet each reasonably accurate. Averaging or voting across diverse models cancels out individual errors, yielding a final prediction that is more reliable than any single tree could produce.
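The row-level randomness is easy to see in base R: a bootstrap sample of n rows drawn with replacement contains only about 63% of the distinct rows, and the rows never drawn become that tree's out-of-bag (OOB) set. A quick sketch:

```r
# Draw a bootstrap sample of row indices from iris (150 rows)
set.seed(123)
boot_idx <- sample(nrow(iris), replace = TRUE)

# Fraction of distinct rows that made it into the sample: roughly 0.63
length(unique(boot_idx)) / nrow(iris)

# The rows never drawn are "out-of-bag" for this tree: roughly 0.37
oob_idx <- setdiff(seq_len(nrow(iris)), boot_idx)
length(oob_idx) / nrow(iris)
```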

The key advantages are:

  • Each tree learns a different part of the data, so the forest as a whole handles non-linearity and feature interactions without explicit engineering.
  • Adding more trees rarely hurts performance; the out-of-bag (OOB) error stabilises once enough trees have been grown.
  • The algorithm handles both classification and regression natively and is robust to outliers that affect only a minority of bootstrap samples.
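The second point is easy to verify: a fitted randomForest object stores the OOB error after each additional tree, and calling plot() on it draws the curve (a sketch, assuming the randomForest package is installed):

```r
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

# OOB error as a function of the number of trees; the curve
# typically flattens well before 500 trees
plot(fit, main = "OOB error vs. number of trees")
```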

Training with the randomForest Package

The randomForest package is the canonical implementation in R, maintained by Andy Liaw and based on the original Fortran code by Leo Breiman and Adele Cutler. Install it from CRAN:

install.packages("randomForest")

Classification Example

The iris dataset is built into R and provides a convenient test case for classification:

library(randomForest)

set.seed(31)
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf_model)

The output reports the OOB error estimate and confusion matrix automatically.
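The same numbers can be pulled out of the fitted object programmatically: err.rate holds the running OOB error per tree, and confusion holds the class-level counts:

```r
# OOB error rate after the final (500th) tree
tail(rf_model$err.rate[, "OOB"], 1)

# Confusion matrix with per-class error in the last column
rf_model$confusion
```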

Regression Example

For regression tasks, the target variable must be numeric. The MASS package ships the Boston housing dataset:

library(MASS)

set.seed(7)
rf_boston <- randomForest(medv ~ ., data = Boston, ntree = 500, mtry = 4)
print(rf_boston)

The output reports the mean squared error (MSE) on OOB predictions and the percentage of variance the model explains.
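As with classification, these figures live on the fitted object: mse and rsq are per-tree vectors, so the last element corresponds to the full forest:

```r
tail(rf_boston$mse, 1)        # OOB mean squared error of the full forest
tail(rf_boston$rsq, 1) * 100  # percent of variance explained
```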

Key Hyperparameters

Four hyperparameters control the behaviour of a random forest. Understanding what each one does helps you navigate the bias-variance trade-off when tuning a model.

Parameter   What it controls                    Default (classification)   Default (regression)
ntree       Number of trees in the forest       500                        500
mtry        Features considered at each split   floor(sqrt(p))             floor(p / 3)
nodesize    Minimum size of terminal nodes      1                          5
maxnodes    Maximum terminal nodes per tree     unlimited                  unlimited

ntree controls how many trees are grown. Error usually stabilises after a few hundred trees, but using 500 or more gives a stable OOB estimate.

mtry is the number of features randomly sampled as candidates at each split. A lower value forces trees to be more diverse. For classification, floor(sqrt(p)) is a conventional starting point; for regression, floor(p / 3) is standard.

nodesize sets the minimum number of observations in a terminal node. Smaller values allow deeper, more complex trees; larger values prune the trees aggressively.

maxnodes directly limits the number of terminal nodes per tree, providing another way to control complexity.
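For a quick mtry search without leaving the package, randomForest also ships tuneRF(), which grows small forests at increasing and decreasing mtry values and keeps the one with the lowest OOB error (a sketch; the step factor and improvement threshold shown here are illustrative choices):

```r
library(randomForest)

set.seed(10)
tuneRF(
  x = iris[, -5],        # predictors
  y = iris$Species,      # response
  ntreeTry = 300,        # trees per candidate forest
  stepFactor = 1.5,      # multiply/divide mtry by this each step
  improve = 0.01,        # minimum relative OOB improvement to keep searching
  trace = FALSE
)
```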

Here is how to pass all four to randomForest():

set.seed(42)
rf_tuned <- randomForest(
  Species ~ .,
  data = iris,
  ntree = 300,
  mtry = 3,
  nodesize = 2,
  maxnodes = 20
)

Feature Importance

Gini importance is recorded by default; pass importance = TRUE when fitting to also compute the permutation-based measure:

rf_model <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 300)
importance(rf_model, type = 1)  # permutation importance (mean decrease in accuracy)
importance(rf_model, type = 2)  # Gini importance (mean decrease in node impurity)

The vip package produces cleaner visualisations and permutation-based importance:

library(vip)
vip(rf_model)

# vi_permute() needs the training data, the target column,
# a metric, and a prediction wrapper:
vi_permute(
  rf_model,
  train = iris,
  target = "Species",
  metric = "accuracy",
  pred_wrapper = function(object, newdata) predict(object, newdata = newdata),
  nsim = 10
)

Making Predictions

Use predict() with a fitted randomForest object:

pred_class <- predict(rf_model, newdata = iris[1:5, ])
pred_prob <- predict(rf_model, newdata = iris[1:5, ], type = "prob")
pred_num <- predict(rf_boston, newdata = Boston[1:5, ])

A Tidymodels Pipeline

The tidymodels ecosystem provides a consistent interface through parsnip’s rand_forest():

library(tidymodels)

rf_spec <- rand_forest(trees = 500, mtry = 2, min_n = 5) |>
  set_engine("randomForest") |>
  set_mode("classification")

rf_fit <- rf_spec |> fit(Species ~ ., data = iris)
predict(rf_fit, new_data = iris[1:5, ])

Switching the engine to ranger is a one-line change for large datasets:

rf_spec_ranger <- rand_forest(trees = 500, mtry = 2, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("classification")

Hyperparameter Tuning with tune_grid

The tidymodels workflow integrates with cross-validation and grid search:

library(titanic)

titanic <- titanic_train |>
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) |>
  mutate(
    Survived = factor(Survived),  # classification engines need a factor outcome
    Sex = factor(Sex),
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)
  )

rf_tune_spec <- rand_forest(trees = 300, mtry = tune(), min_n = tune()) |>
  set_engine("randomForest") |>
  set_mode("classification")

folds <- vfold_cv(titanic, v = 5, strata = Survived)

rf_grid <- grid_regular(mtry(range = c(2, 6)), min_n(range = c(5, 20)), levels = 3)

tune_results <- rf_tune_spec |>
  tune_grid(Survived ~ ., resamples = folds, grid = rf_grid,
            metrics = metric_set(accuracy, roc_auc))

show_best(tune_results, metric = "roc_auc")

best_params <- select_best(tune_results, metric = "roc_auc")
rf_spec_final <- rf_tune_spec |> finalize_model(best_params)
rf_final <- rf_spec_final |> fit(Survived ~ ., data = titanic)

grid_regular() evaluates all combinations at the specified number of levels. For larger search spaces, grid_max_entropy() draws a space-filling sample instead.
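For comparison, a ten-point space-filling grid over the same ranges looks like this (a sketch; grid_max_entropy() comes from the dials package, which tidymodels loads):

```r
set.seed(5)
rf_grid_sf <- grid_max_entropy(
  mtry(range = c(2, 6)),
  min_n(range = c(5, 20)),
  size = 10  # total number of candidate points, not levels per parameter
)
rf_grid_sf
```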

Summary

Random forests are a reliable, low-configuration machine learning algorithm suitable for both classification and regression tasks. The randomForest package gives you direct control over ntree, mtry, nodesize, and maxnodes, while the tidymodels framework provides a consistent interface through parsnip’s rand_forest().

Key takeaways:

  • Set ntree to at least 300–500 and leave it there; focus tuning effort on mtry and min_n.
  • Use permutation-based importance from vip rather than Gini importance for more reliable rankings.
  • tidymodels handles the boilerplate for cross-validation and hyperparameter search, producing a tuned model ready for new data.
