Random Forests in R

· 4 min read · Updated March 7, 2026 · intermediate
random-forest tidymodels machine-learning classification regression

Random forests are among the most versatile and powerful machine learning algorithms. They build many decision trees on bootstrapped samples of the training data and output the majority-vote class (classification) or the mean prediction (regression) of the individual trees. This ensemble approach substantially reduces variance and overfitting compared to a single decision tree.

In this tutorial, you will learn how to build random forest models in R using the tidymodels framework. You will prepare data, train models, tune hyperparameters, and evaluate performance.

Prerequisites

Before starting, you should be familiar with:

  • Basic R syntax and data manipulation with dplyr
  • The concept of machine learning (training/test splits)
  • Previous tutorials in this series on regression and classification

You will need the following packages:

install.packages(c("tidymodels", "ranger", "vip", "titanic"))
library(tidymodels)
library(ranger)
library(vip)

Preparing the Data

We will use the Titanic training data from the titanic package to predict survival. Let us load and inspect the data:

# Load data
library(titanic)
titanic <- titanic_train

# Quick look
glimpse(titanic)

The dataset contains information about passengers including their class, sex, age, fare, and whether they survived.
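
Before modeling, it is worth checking how balanced the outcome is. A quick sketch with dplyr (in the Kaggle training data, roughly 38% of passengers survived):

```r
library(dplyr)
library(titanic)

# Tabulate the outcome and its proportions
titanic_train %>%
  count(Survived) %>%
  mutate(prop = n / sum(n))
```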

Let us prepare the data for modeling:

titanic_clean <- titanic %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked) %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1)),
    across(where(is.character), factor)
  )

# Recipe steps cannot be piped onto a data frame directly;
# build a recipe to impute missing Age values, then bake it
titanic_clean <- recipe(Survived ~ ., data = titanic_clean) %>%
  step_impute_median(all_numeric_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

glimpse(titanic_clean)

Building the Random Forest Model

The ranger package provides a fast implementation of random forests. Let us set up a model specification:

rf_spec <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) %>%
  set_mode("classification") %>%
  # impurity importance scores are needed later for vip()
  set_engine("ranger", importance = "impurity")

Key hyperparameters:

  • mtry: Number of predictors sampled at each split
  • trees: Number of trees in the forest
  • min_n: Minimum number of observations required to split a node
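
If you would rather skip tuning, the three hyperparameters can also be fixed directly. The values below are illustrative defaults, not results from this tutorial's tuning run:

```r
library(tidymodels)

# A non-tuned specification with fixed hyperparameters (illustrative values)
rf_fixed <- rand_forest(
  mtry = floor(sqrt(7)),  # common heuristic: about the square root of the 7 predictors
  trees = 500,            # ranger's default number of trees
  min_n = 10
) %>%
  set_mode("classification") %>%
  set_engine("ranger")
```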

Creating a Workflow

A workflow bundles the model specification with preprocessing:

rf_workflow <- workflow() %>%
  add_formula(Survived ~ .) %>%
  add_model(rf_spec)

Hyperparameter Tuning

Random forests need their hyperparameters tuned for optimal performance. We will use cross-validation:

set.seed(123)
folds <- vfold_cv(titanic_clean, v = 5)

rf_tune <- rf_workflow %>%
  tune_grid(
    resamples = folds,
    grid = 10,
    metrics = metric_set(accuracy, roc_auc)
  )

# Show best parameters
show_best(rf_tune, metric = "roc_auc")

The tuning process evaluates different combinations of hyperparameters using 5-fold cross-validation.
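
Passing `grid = 10` lets tune_grid() construct a space-filling grid automatically. You can also supply an explicit grid; in this sketch the range for mtry is chosen by hand, capped at the 7 predictors:

```r
library(tidymodels)

# An explicit 3 x 3 x 3 grid of candidate hyperparameters
rf_grid <- grid_regular(
  mtry(range = c(2, 7)),       # upper bound set by hand: 7 predictors
  trees(range = c(200, 1000)),
  min_n(range = c(2, 20)),
  levels = 3
)
```

Pass it to the tuner with tune_grid(resamples = folds, grid = rf_grid).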

Selecting the Best Model

Extract the best performing model based on ROC AUC:

best_params <- select_best(rf_tune, metric = "roc_auc")

rf_final <- rf_workflow %>%
  finalize_workflow(best_params)

# Fit on full training data
rf_fit <- rf_final %>%
  fit(titanic_clean)

rf_fit

Evaluating Performance

Let us assess the model's predictive ability on the training data:

# Predictions
predictions <- rf_fit %>%
  predict(titanic_clean) %>%
  bind_cols(predict(rf_fit, titanic_clean, type = "prob")) %>%
  bind_cols(titanic_clean)

# Confusion matrix
conf_mat(predictions, Survived, .pred_class)

# Metrics
accuracy(predictions, Survived, .pred_class)
# "1" is the second factor level, so tell roc_auc which probability column is the event
roc_auc(predictions, Survived, .pred_1, event_level = "second")
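
Because these metrics are computed on the same data the forest was fit on, they will overstate real-world performance. A held-out split gives a more honest estimate; this sketch reuses the titanic_clean and rf_final objects from above:

```r
library(tidymodels)
set.seed(123)

# Split before fitting so the test metrics are not inflated
titanic_split <- initial_split(titanic_clean, prop = 0.8, strata = Survived)
rf_test_fit <- fit(rf_final, training(titanic_split))

predict(rf_test_fit, testing(titanic_split)) %>%
  bind_cols(testing(titanic_split)) %>%
  accuracy(Survived, .pred_class)
```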

Variable Importance

Random forests naturally provide variable importance scores:

# Extract variable importance
rf_fit %>%
  extract_fit_engine() %>%
  vip(num_features = 8)

This shows which features most influenced predictions. Passenger class, sex, and age are typically the most important predictors of survival.

Making Predictions on New Data

To predict on new data, supply the same columns with any missing values already handled; because imputation was applied before the workflow was fit, it is not repeated automatically at prediction time:

# Example: predict on a new passenger
new_passenger <- tibble(
  Pclass = 1,
  Sex = "male",
  Age = 30,
  SibSp = 0,
  Parch = 0,
  Fare = 100,
  Embarked = "C"
)

rf_fit %>%
  predict(new_passenger) %>%
  bind_cols(predict(rf_fit, new_passenger, type = "prob"))

Tuning Tips

  • More trees generally improve performance but increase computation time
  • mtry is typically set to around the square root of the number of predictors for classification
  • min_n indirectly controls tree depth; larger values grow shallower trees and reduce overfitting
  • For imbalanced data, consider class weights or sampling strategies
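
The last tip can be sketched with ranger's class.weights engine argument, which weights the outcome classes during tree growing. The weights below are illustrative, ordered by factor level ("0" then "1"):

```r
library(tidymodels)

# Sketch: up-weighting the "survived" class for imbalanced data
rf_weighted <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger", class.weights = c(1, 1.5))  # order: levels "0", "1"
```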

When to Use Random Forests

Random forests excel when:

  • You need robust predictions without extensive tuning
  • Your data has many features or complex interactions
  • Interpretability via variable importance scores is sufficient

Consider alternatives when:

  • You need highly interpretable models (use decision trees)
  • Linear relationships dominate (use logistic regression)
  • Maximum predictive accuracy is critical (try gradient boosting)

Next Steps

Now that you have mastered random forests, continue to the next tutorial in this series: Gradient Boosting with xgboost. You will learn about another powerful ensemble method that often outperforms random forests on structured data.

To deepen your understanding, experiment with:

  • Regression problems (set_mode("regression"))
  • Different imputation strategies
  • Feature engineering before modeling
  • Combining predictions from multiple model types