Random Forests in R — Build, Tune, and Interpret
Random forests combine hundreds of decision trees into a single model that is usually more accurate and more stable than any individual tree. Each tree trains on a bootstrap sample of the data and considers only a random subset of features at each split, which decorrelates the trees so that the ensemble averages out their individual errors rather than memorising noise. This tutorial walks through building a random forest classifier for the Titanic survival dataset using tidymodels, from data preprocessing through hyperparameter tuning to model evaluation and interpretation.
Prerequisites
Before starting, you should be familiar with:
- Basic R syntax and data manipulation with dplyr
- Core machine learning concepts such as training/test splits
- Previous tutorials in this series on regression and classification
You will need the following packages:
install.packages(c("tidymodels", "ranger", "vip", "titanic"))
library(tidymodels)
library(ranger)
library(vip)
The tidymodels ecosystem bundles modeling, preprocessing, and evaluation into a single consistent interface. The Titanic dataset is a classic binary classification problem: given passenger attributes like class, sex, and age, can we predict who survived the sinking? Start by loading the training set from the titanic package, which ships the Kaggle training data as titanic_train, and inspect the columns with glimpse().
Preparing the data
# Load the training set shipped with the titanic package
titanic <- titanic::titanic_train
# Quick look
glimpse(titanic)
Each row represents one passenger. The raw data has 12 columns; the code below keeps eight of them: the binary Survived outcome plus predictors for passenger class, sex, age, number of siblings or spouses aboard, number of parents or children aboard, fare paid, and port of embarkation. Before feeding this data into a random forest, you need to convert character columns to factors and handle missing values (Age has many). The recipe below imputes missing numeric predictors with the median before baking the preprocessed dataset.
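titanic_clean <- titanic %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked) %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1)),
    across(where(is.character), factor)
  ) %>%
  # A recipe must be created before any step_*() function can be added
  recipe(Survived ~ ., data = .) %>%
  # Fill missing numeric predictors (mainly Age) with the column median
  step_impute_median(all_numeric_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)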
glimpse(titanic_clean)
With the data cleaned, the next step is defining the model. The rand_forest() function from parsnip creates a model specification that is independent of the underlying engine, so you can swap between ranger, randomForest, or spark without rewriting your workflow. The ranger engine is preferred for its speed on larger datasets and native support for parallel processing. The specification below marks three hyperparameters for tuning: mtry, trees, and min_n.
Building the model
rf_spec <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()
) %>%
set_mode("classification") %>%
set_engine("ranger", importance = "impurity")  # store impurity importance so vip() can plot it later
These three hyperparameters control the bias-variance tradeoff in your forest. The mtry parameter determines how many predictors each tree considers at a split: smaller values produce more diverse trees but weaker individual predictors. The trees parameter sets the ensemble size, with diminishing returns beyond a few hundred. The min_n parameter is a regulariser. Larger values prevent the trees from growing too deep and overfitting to noise in small datasets. A tidymodels workflow bundles the model specification with a formula or recipe so that fitting and prediction happen through a single unified interface.
Creating a workflow
rf_workflow <- workflow() %>%
add_formula(Survived ~ .) %>%
add_model(rf_spec)
Tuning identifies the hyperparameter combination that produces the best held-out performance. The tune_grid() function from the tune package evaluates each candidate across cross-validation folds, measuring metrics such as ROC AUC for classification or RMSE for regression. After the tuning grid has been evaluated, select_best() picks the candidate with the best value of the chosen metric (highest for ROC AUC, lowest for RMSE), and finalize_workflow() applies those hyperparameters to the specification so you can fit the final model on the full training set. The selection code in the next section reads the tuning results from an object named rf_tune.
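The tuning step described above can be sketched as follows. This is a minimal example, not the only way to do it: it assumes the titanic_clean data and the rf_workflow object from the earlier sections already exist, and the fold count (5), the grid size (10 candidates), and the seed are arbitrary choices made for illustration.

```r
library(tidymodels)

set.seed(123)

# 5-fold cross-validation, stratified so each fold keeps the survival ratio
titanic_folds <- vfold_cv(titanic_clean, v = 5, strata = Survived)

# Evaluate 10 candidate combinations of mtry, trees, and min_n.
# tune_grid() fills in the upper bound of mtry from the number of predictors.
rf_tune <- tune_grid(
  rf_workflow,
  resamples = titanic_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy)
)

# Top candidates ranked by cross-validated ROC AUC
show_best(rf_tune, metric = "roc_auc", n = 5)
```

The result is stored as rf_tune, which is the name the selection code expects. A larger grid or more folds gives a finer search at the cost of longer run time.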
Selecting the best model
best_params <- select_best(rf_tune, metric = "roc_auc")
rf_final <- rf_workflow %>%
finalize_workflow(best_params)
# Fit on full training data
rf_fit <- rf_final %>%
fit(titanic_clean)
rf_fit
Once the final model is fitted, the real question is how well it generalises. Evaluation on the training data gives an optimistic estimate, so cross-validated metrics are preferred, but examining in-sample predictions still reveals whether the model has learned anything useful at all. The code below generates class predictions, class probabilities, a confusion matrix, and key classification metrics.
Evaluating performance
# Predictions
predictions <- rf_fit %>%
predict(titanic_clean) %>%
bind_cols(predict(rf_fit, titanic_clean, type = "prob")) %>%
bind_cols(titanic_clean)
# Confusion matrix
conf_mat(predictions, Survived, .pred_class)
# Metrics
accuracy(predictions, Survived, .pred_class)
roc_auc(predictions, Survived, .pred_1, event_level = "second")  # "1" is the second factor level
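Because the metrics above are computed on the training data, a resampled estimate is a useful cross-check. The sketch below assumes the finalized workflow rf_final and the titanic_clean data from the earlier sections; the fold count and seed are arbitrary.

```r
library(tidymodels)

set.seed(234)

# Fresh folds for estimating performance of the finalized workflow
cv_folds <- vfold_cv(titanic_clean, v = 5, strata = Survived)

# Refit the finalized workflow on each fold and score the held-out part
rf_cv <- fit_resamples(
  rf_final,
  resamples = cv_folds,
  metrics = metric_set(roc_auc, accuracy)
)

# Mean and standard error of each metric across folds
collect_metrics(rf_cv)
```

These estimates are still computed on data that was also used for tuning, so they can be mildly optimistic. Holding out a test set with initial_split() before any tuning is the stricter option.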
One of the practical advantages of random forests over many other flexible models is that they produce variable importance scores. Each split in every tree reduces impurity, and the total reduction attributed to each predictor across all trees measures that predictor’s contribution to the model. The vip package plots these scores in a clean, readable format, making it easy to identify which features drive predictions.
Variable importance
# Extract variable importance (the ranger engine must be set with importance = "impurity" or "permutation")
rf_fit %>%
extract_fit_engine() %>%
vip(num_features = 8)
This shows which features most influenced predictions. Passenger class, sex, and age are typically the most important predictors of survival. Once you understand what the model relies on, the natural next step is applying it to unseen data. New observations must go through the same preprocessing pipeline as the training data: same factor levels, same imputation rules. Otherwise the predictions will be unreliable.
Making predictions on new data
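# Example: predict on a new passenger, matching the training column types
new_passenger <- tibble(
  Pclass = 1L,
  Sex = factor("male", levels = levels(titanic_clean$Sex)),
  Age = 30,
  SibSp = 0L,
  Parch = 0L,
  Fare = 100,
  Embarked = factor("C", levels = levels(titanic_clean$Embarked))
)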
rf_fit %>%
predict(new_passenger) %>%
bind_cols(predict(rf_fit, new_passenger, type = "prob"))
Tuning tips
- More trees generally improve performance but increase computation time
- mtry is typically set near the square root of the number of predictors for classification
- min_n limits tree depth indirectly; larger values reduce overfitting
- For imbalanced data, consider class weights or sampling strategies
When to reach for ensemble models
Random forests excel when:
- You need reliable predictions without extensive tuning
- Interpretability via variable importance is sufficient
- Your data has many features or interactions
- You want a first-pass ranking of predictors to guide feature selection
Consider alternatives when:
- You need highly interpretable models (use decision trees)
- Linear relationships dominate (use logistic regression)
- Maximum predictive accuracy is critical (try gradient boosting)
The ensemble learning mechanism
Random forests build many decision trees on bootstrap samples of the training data and average their predictions. Each tree uses a random subset of features at each split (controlled by mtry), decorrelating the trees so their errors do not compound. Averaging many diverse trees reduces variance without increasing bias.
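The effect of decorrelation can be illustrated with the standard formula for the variance of an average of B identically distributed predictions, each with variance sigma2 and pairwise correlation rho. The function name and the example values below are hypothetical and chosen only for illustration.

```r
# Variance of the mean of B equally correlated predictions:
#   rho * sigma2 + (1 - rho) * sigma2 / B
# The first term does not shrink as B grows, so lowering rho is what matters.
ensemble_variance <- function(rho, sigma2, B) {
  rho * sigma2 + (1 - rho) * sigma2 / B
}

# Highly correlated trees versus decorrelated trees, 500 trees each
high_corr <- ensemble_variance(rho = 0.8, sigma2 = 1, B = 500)
low_corr  <- ensemble_variance(rho = 0.2, sigma2 = 1, B = 500)

c(high_corr = high_corr, low_corr = low_corr)
```

With these inputs the variance is about 0.80 for the correlated trees and about 0.20 for the decorrelated ones, which is why restricting the features offered at each split can help even though it weakens individual trees.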
randomForest::randomForest(y ~ ., data = train, ntree = 500, mtry = floor(sqrt(ncol(train) - 1))) fits a model. The default mtry for classification is the square root of the number of features (rounded down); for regression it is one-third of the number of features. Increasing ntree helps until the error plateaus. Check with plot(model, main = "Error vs trees").
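As a concrete version of the call above, here is a minimal sketch that uses the built-in iris data in place of the generic train data frame. The dataset and the seed are stand-ins chosen so the example runs without any extra files.

```r
library(randomForest)

set.seed(42)

p <- ncol(iris) - 1                 # number of predictors (4 for iris)

rf_model <- randomForest(
  Species ~ .,
  data = iris,
  ntree = 500,
  mtry = floor(sqrt(p))             # the classification default, written out
)

print(rf_model)                     # OOB error estimate and confusion matrix

# OOB and per-class error as a function of the number of trees
plot(rf_model, main = "Error vs trees")
```

If the error curves are flat well before the last tree, a smaller ntree would give a similar model in less time.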
ranger::ranger(y ~ ., data = train) is a faster implementation, especially for large datasets. It supports parallel processing natively and offers a formula interface similar to randomForest, although argument names differ (for example num.trees instead of ntree). For feature selection workflows and large datasets, ranger is often preferred.
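The equivalent ranger call might look like the sketch below, again with iris standing in for the training data. The thread count and seed are arbitrary illustration values.

```r
library(ranger)

rg_model <- ranger(
  Species ~ .,
  data = iris,
  num.trees = 500,      # ranger's name for ntree
  num.threads = 2,      # trees are grown in parallel across threads
  seed = 42
)

print(rg_model)

# Out-of-bag misclassification rate for a classification forest
rg_model$prediction.error
```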
Out-of-Bag error
Because each tree trains on a bootstrap sample, roughly one-third of observations (about 37%) are left out of each tree; these are the out-of-bag (OOB) observations. The model uses them to estimate generalization error without a separate validation set. In randomForest, model$err.rate (classification) or model$mse (regression) gives the OOB error as a function of the number of trees.
OOB error is usually close to a cross-validation estimate and comes almost for free, because it is computed while the forest is being fitted. This makes random forests particularly efficient for exploratory modeling when you want a quick performance estimate.
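The one-third figure and the OOB error curve can both be checked directly. The sketch below first simulates bootstrap samples in base R, then reads the OOB error from a randomForest fit on iris. The sample size, replicate count, tree count, and seed are illustration values only.

```r
library(randomForest)

# Probability that a given row is missing from a bootstrap sample of size n
n <- 1000
analytic_oob <- (1 - 1 / n)^n        # approaches exp(-1) as n grows

# Simulation: share of distinct rows that appear in each bootstrap sample
set.seed(42)
in_bag <- replicate(200, length(unique(sample(n, replace = TRUE))) / n)
oob_fraction <- 1 - mean(in_bag)

c(analytic = analytic_oob, simulated = oob_fraction)

# OOB error after 1, 2, ..., ntree trees for a classification forest
rf_model <- randomForest(Species ~ ., data = iris, ntree = 300)
oob_curve <- rf_model$err.rate[, "OOB"]
final_oob <- oob_curve[rf_model$ntree]
final_oob
```

Both the analytic and the simulated values land near 0.37, which is where the "roughly one-third" rule of thumb comes from.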
Feature importance
importance(model) returns variable importance. For classification forests, type = 1 gives mean decrease in accuracy (permute the variable’s values in the OOB samples and observe the accuracy drop), and type = 2 gives mean decrease in Gini impurity (total decrease in node impurity from splits on that variable, averaged over trees). type = 1 is only available when the forest was fitted with importance = TRUE. Mean decrease in accuracy is more reliable but slower to compute.
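varImpPlot(model) produces a dot plot of importance scores. ranger::importance(ranger_model) gives importance from a ranger model, provided the model was fitted with importance = "impurity" or importance = "permutation". The two measures often agree on the top-ranked variables, but they can diverge, particularly when predictors are correlated or differ widely in their number of unique values.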
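A short sketch of both importance types with randomForest, using iris as a stand-in dataset. The importance() function is called with its package prefix because ranger exports a function of the same name.

```r
library(randomForest)

set.seed(42)

# importance = TRUE is required for the permutation-based measure (type = 1)
rf_imp <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

# Mean decrease in accuracy (permutation-based, computed on OOB samples)
acc_importance <- randomForest::importance(rf_imp, type = 1)

# Mean decrease in Gini impurity
gini_importance <- randomForest::importance(rf_imp, type = 2)

acc_importance
gini_importance

# Dot plots of both measures side by side
varImpPlot(rf_imp)
```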
Correlation between features distorts importance for features that share information. If features A and B are highly correlated, importance is split between them, understating each one’s individual contribution. Partial dependence plots show the marginal effect of each feature on the prediction, averaged over the values of the other features.
Hyperparameter tuning
The three key hyperparameters: ntree (more is better, diminishing returns after 200-500), mtry (try values from 2 to ncol/2), nodesize (minimum node size; larger values reduce overfitting on small datasets).
Grid search with caret: tunegrid <- expand.grid(mtry = c(2, 4, 6, 8, 10)) and train(y ~ ., data = train, method = "rf", tuneGrid = tunegrid, trControl = trainControl(method = "cv", number = 5)).
For high-dimensional data in which only a few features are informative, larger mtry often helps because each split is more likely to be offered at least one useful feature. When a few strong predictors would otherwise dominate every tree, smaller mtry increases forest diversity.
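The caret grid search described above could be run as in the sketch below. It uses iris, which has only four predictors, so the mtry grid is 1 to 4 instead of the 2 to 10 shown in the text. The ntree value is passed through to randomForest and, like the seed, is an arbitrary illustration choice.

```r
library(caret)

set.seed(42)

# Candidate values of mtry; iris has 4 predictors, so 1:4 covers the full range
tunegrid <- expand.grid(mtry = 1:4)

rf_caret <- train(
  Species ~ .,
  data = iris,
  method = "rf",
  tuneGrid = tunegrid,
  trControl = trainControl(method = "cv", number = 5),
  ntree = 200            # forwarded to randomForest()
)

rf_caret$results         # cross-validated accuracy for each mtry
rf_caret$bestTune        # the mtry value with the best accuracy
```

Attaching caret in the same session as tidymodels masks a few function names, so running this in a fresh session avoids surprises.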
Imbalanced classes
Random forests tend to favour the majority class on severely imbalanced classification tasks (e.g., 99% class A, 1% class B). Strategies: strata = y together with sampsize = c(100, 100) in randomForest() draws a balanced bootstrap sample for each tree; classwt = c(1, 99) weights the minority class more heavily.
ROSE::ovun.sample(formula, data, method = "over") oversamples the minority class. DMwR::SMOTE() generates synthetic minority class examples by interpolation; DMwR has since been archived on CRAN, and themis::step_smote() offers the same technique as a recipe step. After balancing, retrain the model and evaluate with precision/recall rather than accuracy.
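The balanced bootstrap can be tried on a small synthetic dataset. Everything about the data below (class sizes, the shifted mean of x1, the noise feature x2) is made up for illustration.

```r
library(randomForest)

set.seed(42)

# Synthetic imbalanced data: 950 "common" rows and 50 "rare" rows
y <- factor(rep(c("common", "rare"), times = c(950, 50)))
x <- data.frame(
  x1 = rnorm(1000, mean = ifelse(y == "rare", 1.5, 0)),  # weakly informative
  x2 = rnorm(1000)                                       # pure noise
)

# Default forest: each bootstrap sample mirrors the 95/5 imbalance
rf_plain <- randomForest(x = x, y = y, ntree = 300)

# Balanced forest: each tree draws 50 rows per class (order follows levels(y))
rf_balanced <- randomForest(
  x = x, y = y, ntree = 300,
  strata = y, sampsize = c(50, 50)
)

# OOB confusion matrices: rows are true classes, class.error is 1 - recall
rf_plain$confusion
rf_balanced$confusion
```

Comparing the class.error entry for the rare class across the two matrices shows how the sampling scheme trades majority-class accuracy for minority-class recall.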
Why ensemble methods outperform single trees
A single decision tree grown to full depth tends to overfit the training data: it learns the noise along with the signal. Random forests reduce overfitting by building many trees, each trained on a bootstrap sample of the training data with a random subset of features available at each split. The bootstrap sampling makes each tree see slightly different data, and the feature subsampling ensures that the trees are not all dominated by the same strong predictors. Averaging predictions across many imperfect but weakly correlated trees produces a prediction that is typically better than that of any individual tree.
The key hyperparameters in random forests control the tradeoff between bias and variance. More trees reduce variance by averaging more imperfect predictions, with diminishing returns beyond a few hundred trees. The number of features available at each split (the mtry parameter) controls the strength of individual trees and their correlation with each other. Fewer features per split produces more diverse, weaker trees that average to a stronger ensemble.
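The diminishing returns from adding trees can be observed by refitting a forest at several sizes and comparing OOB error. The sketch below uses ranger on iris; the tree counts and seed are arbitrary illustration values.

```r
library(ranger)

tree_counts <- c(10, 50, 200, 500)

# OOB misclassification rate for each ensemble size
oob_error <- sapply(tree_counts, function(k) {
  ranger(Species ~ ., data = iris, num.trees = k, seed = 42)$prediction.error
})

data.frame(trees = tree_counts, oob_error = oob_error)
```

On a small, easy dataset like iris the curve flattens quickly; noisier or higher-dimensional data usually needs more trees before the error stabilises.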
Feature importance interpretation
Feature importance in random forests measures how much each variable contributes to reducing prediction error across all trees. The impurity-based importance, which measures total reduction in node impurity from splits on each variable, is fast to compute but biased toward variables with many unique values. Permutation importance, which measures how much prediction error increases when each variable’s values are randomly shuffled, is less biased but more expensive to compute.
Importance scores are relative: they tell you which variables are most and least important, not whether any variable has practical significance. High importance means the model relies on that variable; it does not mean the relationship is causal. Two highly correlated predictors may split importance between them unpredictably, making importance scores unreliable when the feature set contains redundant variables.
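To compare the two measures side by side, the sketch below adds a pure-noise column to iris and fits ranger once with impurity importance and once with permutation importance. The noise column is a hypothetical addition made only to show how each measure treats an uninformative variable with many unique values.

```r
library(ranger)

set.seed(42)

dat <- iris
dat$noise <- rnorm(nrow(dat))   # hypothetical predictor with no signal

rg_impurity <- ranger(
  Species ~ ., data = dat, num.trees = 500,
  importance = "impurity", seed = 42
)

rg_permutation <- ranger(
  Species ~ ., data = dat, num.trees = 500,
  importance = "permutation", seed = 42
)

# Both vectors are named by predictor, in the same order
importance_table <- cbind(
  impurity = ranger::importance(rg_impurity),
  permutation = ranger::importance(rg_permutation)
)
importance_table
```

Impurity importance is never negative, so the noise column receives some credit whenever a tree happens to split on it, whereas its permutation importance should sit close to zero.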
Tuning and evaluation
Cross-validated tuning of mtry and the minimum node size uses the caret or tidymodels frameworks for systematic search. The default mtry of the square root of the number of predictors for classification and one-third for regression is often close to optimal. Tuning other hyperparameters typically yields smaller improvements. The most impactful change to random forest performance is usually the feature set, not the hyperparameters.
Out-of-bag error provides a free performance estimate without setting aside a separate test set. Each training observation appears in roughly two-thirds (about 63%) of the bootstrap samples and is excluded from the remaining third. For each observation, the prediction aggregated over only those trees that did not see it is its out-of-bag prediction. OOB error estimates generalization performance without requiring a separate validation split, making it useful for small datasets where holding out a test set wastes valuable training data.
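In ranger, the predictions stored on a fitted forest are these out-of-bag predictions, so the reported error can be reproduced by hand. A minimal sketch on iris, with an arbitrary seed:

```r
library(ranger)

rg <- ranger(Species ~ ., data = iris, num.trees = 500, seed = 42)

# For each row, the majority vote of the trees that did not train on it
oob_pred <- rg$predictions

# Share of rows whose OOB prediction differs from the true class
manual_oob_error <- mean(oob_pred != iris$Species, na.rm = TRUE)

c(manual = manual_oob_error, reported = rg$prediction.error)
```

The na.rm = TRUE guard covers the rare case of a row that ends up in every bootstrap sample and therefore has no OOB prediction.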
Next steps
Now that you have mastered random forests, continue to the next tutorial in this series: Gradient Boosting with xgboost. You will learn about another powerful ensemble method that often outperforms random forests on structured data.
To deepen your understanding, experiment with:
- Regression problems (set_mode("regression"))
- Different imputation strategies
- Feature engineering before modeling
- Combining predictions from multiple model types
See also
- Tidymodels Regression: apply the same tidymodels workflow to regression problems
- XGBoost in R: a gradient boosting alternative that often beats random forests on structured data
- Classification with Caret: another framework for training and tuning classification models
- Machine Learning in R: foundational concepts and terminology for the full series