Hyperparameter Tuning in R
Hyperparameter tuning is the process of finding the best settings for a machine learning model before training begins. Unlike model parameters, which the algorithm learns automatically from data, hyperparameters are external configurations you set manually. Getting them right often determines whether your model barely works or performs significantly better than a baseline.
What you’ll learn
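This tutorial covers grid search, random search, and Bayesian optimization, along with the cross-validation and early-stopping techniques that keep tuning results trustworthy. The examples use caret, xgboost, and tidymodels, so by the end you will know how to apply each approach in a real analysis workflow.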
Parameters vs hyperparameters
A quick distinction keeps things clear. Parameters are what the model learns during training: regression coefficients, decision tree split points, or neural network weights. Hyperparameters control how that learning happens. How many trees should a random forest contain? How deep should a gradient boosting model grow before stopping? These are hyperparameter choices.
If you train a random forest with ntree = 50 and it learns 50 trees, those tree structures are parameters. The 50 you chose is a hyperparameter.
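The same split shows up in a much smaller model. The sketch below is illustrative only: it swaps the random forest for a polynomial regression on the built-in mtcars data, and the `degree` variable is an assumed name. The degree is chosen before fitting, while the coefficients are estimated from the data.

```r
# Hyperparameter: chosen by the analyst before fitting.
degree <- 2

# Fit a polynomial regression of fuel economy on horsepower.
fit <- lm(mpg ~ poly(hp, degree), data = mtcars)

# Parameters: estimated from the data during fitting.
params <- coef(fit)
print(params)

# A degree-d polynomial has d + 1 coefficients (intercept plus one per term).
length(params)
```

Changing `degree` changes how many coefficients are learned, but the value of `degree` itself never comes out of the fit.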
Examples across common algorithms:
| Algorithm | Hyperparameters |
|---|---|
| Random Forest | mtry (variables per split), ntree (number of trees) |
| XGBoost | eta (learning rate), max_depth, nrounds |
| SVM | C (cost parameter), kernel type |
| KNN | k (number of neighbors) |
Grid search
Grid search is the most straightforward tuning approach. You define a discrete set of values for each hyperparameter, and the method evaluates every possible combination.
With caret, you pass a tuning grid using expand.grid():
library(caret)
# iris has four predictors, so mtry can only range from 1 to 4
tune_grid <- expand.grid(mtry = 1:4)
set.seed(42)  # make the fold assignment reproducible
model <- train(
  Species ~ .,
  data = iris,
  method = "rf",
  tuneGrid = tune_grid,
  trControl = trainControl(method = "cv", number = 5)
)
print(model$bestTune)
The trControl argument handles resampling. Here, 5-fold cross-validation evaluates each grid point. caret then selects the combination with the best average performance.
Grid search is exhaustive, which sounds appealing. The problem is combinatorial explosion. Tuning 4 hyperparameters with 5 values each means 5^4 = 625 combinations. Add a 5th hyperparameter and you jump to 3125. For slow models, this becomes prohibitively expensive.
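The arithmetic is easy to check with expand.grid() itself. This is a small sketch with placeholder hyperparameter names (`a` through `e`) and an assumed 5-fold resampling scheme:

```r
# Five candidate values for each hypothetical hyperparameter.
vals <- 1:5

grid4 <- expand.grid(a = vals, b = vals, c = vals, d = vals)
nrow(grid4)  # 5^4 = 625

grid5 <- expand.grid(a = vals, b = vals, c = vals, d = vals, e = vals)
nrow(grid5)  # 5^5 = 3125

# With 5-fold cross-validation, every combination costs five model fits.
n_folds <- 5
nrow(grid4) * n_folds  # 3125 fits
nrow(grid5) * n_folds  # 15625 fits
```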
Grid search works well when you have 1-2 hyperparameters and want guaranteed coverage. It’s simple to reason about and trivially parallelizable.
Random search
Random search samples hyperparameter combinations from defined ranges instead of enumerating a full grid. You specify how many combinations to try:
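library(caret)
tune_length <- 30 # number of random combinations to draw
set.seed(42)
model <- train(
  Species ~ .,
  data = iris,
  method = "rf",
  tuneLength = tune_length,
  # search = "random" is what switches caret from its default grid to random sampling
  trControl = trainControl(method = "cv", number = 5, search = "random")
)
# Only four distinct mtry values exist for iris, so a budget of 30
# matters more for models with several tuning parameters
print(model$bestTune)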
Bergstra and Bengio (2012) showed that random search often outperforms grid search in high-dimensional spaces. The intuition: if only a few hyperparameters actually matter, a grid spends most of its budget repeating the same handful of values for the important ones, while random search tries a new value of every hyperparameter on every trial.
Random search is also easier to work with when hyperparameters have continuous ranges. You can sample from distributions rather than picking discrete values, giving finer resolution without combinatorial blowup.
The trade-off is reproducibility. Without a fixed seed, reruns sample different combinations and can return different results. If reproducibility matters for your project, call set.seed() before each run.
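Sampling from distributions takes only a few lines of base R. The sketch below draws candidate settings for three hyperparameters. The names (`eta`, `max_depth`, `subsample`) and their ranges are assumptions chosen for illustration, not recommended defaults.

```r
set.seed(123)  # fixed seed so the same candidates are drawn on every run
n_candidates <- 20

candidates <- data.frame(
  # Learning rate: uniform on the log10 scale between 0.001 and 0.3.
  eta = 10^runif(n_candidates, min = log10(0.001), max = log10(0.3)),
  # Tree depth: integers from 2 to 10, sampled with replacement.
  max_depth = sample(2:10, n_candidates, replace = TRUE),
  # Row subsampling fraction: uniform between 0.5 and 1.
  subsample = runif(n_candidates, min = 0.5, max = 1)
)

head(candidates)
```

Sampling the learning rate on a log scale spreads the draws evenly across orders of magnitude, which a uniform draw on the raw scale would not do.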
Bayesian optimization
Grid search and random search choose every configuration up front and ignore the scores of earlier trials. Bayesian optimization instead fits a probabilistic surrogate model of the objective function to the results so far, then uses that model to decide which configuration to try next.
The rBayesianOptimization package implements this in R. You define an objective function that trains a model and returns a score:
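library(caret)
library(rBayesianOptimization)
# Objective: cross-validated accuracy of a random forest for one (mtry, nodesize) pair
obj_func <- function(mtry, nodesize) {
  fit <- train(
    Species ~ .,
    data = iris,
    method = "rf",
    tuneGrid = data.frame(mtry = mtry),
    nodesize = nodesize,  # passed through to randomForest()
    trControl = trainControl(method = "cv", number = 3)
  )
  # Score is maximized; Pred is required by the package but unused here
  list(Score = fit$results$Accuracy, Pred = 0)
}
set.seed(42)
result <- BayesianOptimization(
  obj_func,
  bounds = list(
    mtry = c(1L, 4L),       # iris has four predictors
    nodesize = c(1L, 10L)
  ),
  init_points = 5,
  n_iter = 10  # the search space is small, so a short run is enough
)
print(result$Best_Par)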
The ParBayesianOptimization package also fits a Gaussian process surrogate. Its bayesOpt() function can evaluate several candidate points per iteration, optionally in parallel through a registered foreach backend:
library(caret)
library(ParBayesianOptimization)
# Scoring function: returns a list whose Score element is maximized
obj_fn <- function(mtry, nodesize) {
  fit <- train(
    Species ~ ., data = iris, method = "rf",
    tuneGrid = data.frame(mtry = mtry),
    nodesize = nodesize,  # passed through to randomForest()
    trControl = trainControl(method = "cv", number = 3)
  )
  list(Score = fit$results$Accuracy)
}
set.seed(42)
results <- bayesOpt(
  FUN = obj_fn,
  bounds = list(mtry = c(1L, 4L), nodesize = c(1L, 10L)),
  initPoints = 5,  # must exceed the number of tuned parameters
  iters.n = 10     # optimization steps after the initial points
)
getBestPars(results)
The key advantage of Bayesian optimization is sample efficiency. For expensive objective functions (slow models, large datasets), it typically finds better configurations in far fewer evaluations than grid or random search.
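To show what "build a model of the objective, then use it to pick the next point" means, here is a deliberately small base R sketch. It is not how either package above is implemented. It assumes a one-dimensional toy objective, a zero-mean Gaussian process with a squared-exponential kernel, and an upper-confidence-bound rule with a multiplier of 2. All function and variable names are made up for the example.

```r
# Toy objective to maximize; in practice this would be a cross-validated score.
objective <- function(x) -(x - 0.3)^2

# Squared-exponential kernel between two numeric vectors.
sq_exp <- function(a, b, length_scale = 0.2) {
  exp(-outer(a, b, "-")^2 / (2 * length_scale^2))
}

# Gaussian process posterior mean and standard deviation at new points.
gp_posterior <- function(x_obs, y_obs, x_new, noise = 1e-6) {
  K <- sq_exp(x_obs, x_obs) + diag(noise, length(x_obs))
  K_star <- sq_exp(x_new, x_obs)
  K_inv <- solve(K)
  mu <- as.vector(K_star %*% K_inv %*% y_obs)
  v <- 1 - rowSums((K_star %*% K_inv) * K_star)
  list(mean = mu, sd = sqrt(pmax(v, 0)))
}

set.seed(1)
candidates <- seq(0, 1, length.out = 101)
x_obs <- runif(3)            # initial random evaluations
y_obs <- objective(x_obs)

for (i in 1:10) {
  post <- gp_posterior(x_obs, y_obs, candidates)
  ucb <- post$mean + 2 * post$sd        # acquisition: optimistic estimate
  x_next <- candidates[which.max(ucb)]  # evaluate the most promising point
  x_obs <- c(x_obs, x_next)
  y_obs <- c(y_obs, objective(x_next))
}

best_x <- x_obs[which.max(y_obs)]
best_x
```

Each pass through the loop spends one evaluation of the objective, chosen where the surrogate is either promising (high mean) or uncertain (high standard deviation).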
Cross-validation for hyperparameter tuning
Tuning against a single validation split tends to overfit that split: the chosen hyperparameters look better there than they will on new data. Cross-validation gives you a more reliable performance estimate.
Simple hold-out splitting is easy but unreliable:
library(caret)
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
A single split can be misleading due to random variation in how data gets distributed.
K-fold cross-validation is the standard. The data splits into k folds, and the model trains on k-1 folds while validating on the remaining fold. This repeats k times, giving you k performance estimates:
library(caret)
trControl <- trainControl(method = "cv", number = 5)
model <- train(
  Species ~ .,
  data = iris,
  method = "rf",
  trControl = trControl
)
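To see what that resampling does under the hood, the following base R sketch runs 5-fold cross-validation by hand. It is an illustration rather than caret's implementation, and it assumes a plain linear model on the built-in mtcars data so that it runs without extra packages.

```r
set.seed(42)
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(1:k, length.out = n))

rmse_per_fold <- sapply(1:k, function(fold) {
  train_rows <- mtcars[fold_id != fold, ]
  valid_rows <- mtcars[fold_id == fold, ]
  fit <- lm(mpg ~ wt + hp, data = train_rows)   # train on k - 1 folds
  pred <- predict(fit, newdata = valid_rows)    # validate on the held-out fold
  sqrt(mean((valid_rows$mpg - pred)^2))
})

rmse_per_fold        # k separate estimates
mean(rmse_per_fold)  # the averaged score used to compare settings
```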
Repeated K-fold runs the process multiple times with different random splits, giving more stable estimates:
trControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3
)
Early stopping for iterative models
Gradient boosting models and neural networks train iteratively. Without some form of stopping, they keep fitting until they overfit the training data. Early stopping monitors validation performance and halts training when it stops improving.
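The stopping rule itself is simple. The sketch below applies a patience-based rule to a made-up vector of validation losses. The numbers and the `patience` value are assumptions for illustration, and real libraries track the same quantities internally.

```r
# Hypothetical per-round validation losses: improving, then getting worse.
valid_loss <- c(0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.51, 0.53, 0.56)
patience <- 3  # stop after this many rounds without improvement

best_loss <- Inf
best_iter <- 0
for (iter in seq_along(valid_loss)) {
  if (valid_loss[iter] < best_loss) {
    best_loss <- valid_loss[iter]
    best_iter <- iter
  } else if (iter - best_iter >= patience) {
    break  # no improvement for `patience` consecutive rounds
  }
}

best_iter  # 5: the round whose model is kept
iter       # 8: the round at which training stopped
```

Training halts three rounds after the best score, and rounds 9 and 10 are never computed.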
XGBoost in R supports early stopping natively:
library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1)
params <- list(
  objective = "multi:softmax",
  num_class = 3,
  max_depth = 6,
  eta = 0.1,
  eval_metric = "mlogloss"
)
model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 500,
  evals = list(train = dtrain, eval = dtrain),
  early_stopping_rounds = 20,
  verbose = FALSE
)
cat("Best iteration:", model$best_iteration, "\n")
Note that the example above uses the same data for both train and eval for simplicity. In practice, you should pass a separate validation set to eval so early stopping actually prevents overfitting.
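caret's xgbTree method does not supply xgboost with an evaluation set during resampling, so early_stopping_rounds has nothing to monitor there. A comparable effect comes from treating nrounds as a tuning parameter and letting cross-validation pick the number of boosting rounds:
library(caret)
# xgbTree requires every tuning column; only nrounds varies here
xgb_grid <- expand.grid(
  nrounds = c(50, 100, 200),
  max_depth = 6,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
trControl <- trainControl(method = "cv", number = 5)
model <- train(
  Species ~ .,
  data = iris,
  method = "xgbTree",
  tuneGrid = xgb_grid,
  trControl = trControl
)
model$bestTune$nrounds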
A complete tuning workflow with tidymodels
The tidymodels framework provides a modular alternative to caret. Here’s an end-to-end example tuning a gradient boosting model on the Hitters dataset:
library(tidymodels)
library(ISLR)
data("Hitters", package = "ISLR")
# Drop players with a missing salary and model salary on the log scale
hitters <- Hitters %>%
  filter(!is.na(Salary)) %>%
  mutate(Salary = log(Salary))
set.seed(42)
split <- initial_split(hitters, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)
folds <- vfold_cv(train_data, v = 5)
# parsnip has no "gbm" engine for boost_tree(); xgboost is its default
gbm_spec <- boost_tree(
  trees = 500,
  learn_rate = tune(),
  tree_depth = tune(),
  min_n = tune(),
  mtry = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
# learn_rate() ranges are given on the log10 scale, so convert the raw bounds
gbm_params <- parameters(
  learn_rate(range = log10(c(0.01, 0.3))),
  tree_depth(range = c(2, 8)),
  min_n(range = c(5, 30)),
  mtry(range = c(2, ncol(hitters) - 1))
)
set.seed(42)
gbm_grid <- grid_max_entropy(gbm_params, size = 30)
gbm_tuned <- gbm_spec %>%
  tune_grid(
    Salary ~ .,
    resamples = folds,
    grid = gbm_grid,
    metrics = metric_set(rmse, rsq, mae)
  )
show_best(gbm_tuned, metric = "rmse")
best_params <- select_best(gbm_tuned, metric = "rmse")
final_model <- gbm_spec %>%
  finalize_model(best_params) %>%
  fit(Salary ~ ., data = train_data)
test_pred <- predict(final_model, test_data)
# Hold-out RMSE on the log-salary scale
rmse_vec(test_data$Salary, test_pred$.pred)
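The grid_max_entropy() function creates a space-filling design: like random search it does not enumerate a full grid, but it places points so they cover the parameter space more evenly than uniform random sampling tends to. Recent dials releases expose the same kind of design through grid_space_filling().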
The tidymodels approach composes cleanly. Pass a grid from grid_regular(), grid_random(), or grid_max_entropy() to tune_grid() for grid or random search, or call tune_bayes() in place of tune_grid() for Bayesian optimization.
Grid search vs random search
Grid search evaluates all combinations of specified parameter values. For 3 parameters with 5 values each, grid search evaluates 125 combinations. Random search samples the parameter space randomly. With the same budget of 125 evaluations, it often finds better results than grid search because it tries more distinct values of each parameter, which matters most when only a few parameters affect performance.
In dials, grid_regular(penalty(), mixture(), levels = 5) creates a regular grid of 25 combinations. grid_random(penalty(), mixture(), size = 50) draws 50 random combinations. grid_latin_hypercube(penalty(), mixture(), size = 50) creates a space-filling design that spreads its 50 points more evenly across the space than pure random sampling.
What hyperparameters are
Hyperparameters control the learning algorithm, not the learned model. A random forest’s number of trees is a hyperparameter: it is set before training and determines how the training process runs. The split variables and thresholds inside each tree are parameters, because they are determined by the training data. This distinction matters because hyperparameters must be set by the analyst (or a tuning process), while parameters are learned automatically.
Different model types have different hyperparameters. Linear models have regularization strength. Decision trees have maximum depth and minimum samples per leaf. Neural networks have learning rate, batch size, and architecture details. Knowing which hyperparameters matter most for a given model type is the starting point for tuning. For tree-based models, the number of features per split and tree depth are most impactful. For regularized linear models, the regularization strength matters most.
Grid search and random search
Grid search exhaustively evaluates every combination of hyperparameter values in a specified grid. With three values for each of three hyperparameters, grid search runs 27 model fits. The result is complete coverage of the specified search space, which is useful when the number of hyperparameters and values is small.
Random search selects hyperparameter combinations at random from specified distributions rather than from a grid. With the same total number of model fits, random search covers more of the relevant hyperparameter space because it does not repeat the same values for one hyperparameter while varying another. For problems with many hyperparameters, random search often finds better configurations with fewer evaluations than grid search.
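The coverage argument can be checked directly. This sketch assumes three placeholder hyperparameters (`a`, `b`, `c`), each scaled to the unit interval, and counts how many distinct values each strategy tries under the same budget:

```r
set.seed(1)
budget <- 125

# Grid: 5 values per hyperparameter, all 5^3 combinations.
grid <- expand.grid(
  a = seq(0, 1, length.out = 5),
  b = seq(0, 1, length.out = 5),
  c = seq(0, 1, length.out = 5)
)

# Random search: the same number of evaluations, drawn uniformly.
random <- data.frame(a = runif(budget), b = runif(budget), c = runif(budget))

# Distinct values tried for each hyperparameter.
sapply(grid, function(x) length(unique(x)))    # 5 each
sapply(random, function(x) length(unique(x)))  # 125 each
```

If only `a` influences the score, the grid has effectively run 5 experiments, each repeated 25 times, while random search has run 125 different ones.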
Nested cross-validation
Using the same data for hyperparameter tuning and performance evaluation produces optimistically biased performance estimates. The tuning process selects hyperparameters that look good on the evaluation data, which inflates the apparent performance. Nested cross-validation addresses this by using an outer loop for performance evaluation and an inner loop for hyperparameter tuning. The outer loop ensures that tuning and evaluation are on different data.
In tidymodels, tune_grid() performs the inner-loop cross-validation for hyperparameter selection. Building the outer loop with rsample’s nested_cv() and running the tuning separately within each outer split keeps the selection step out of the performance estimate. For small datasets where the bias is significant, nested cross-validation gives a more honest picture of expected performance on new data.
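The two loops are easier to follow in code. The sketch below is a base R illustration rather than the tidymodels workflow: it assumes polynomial regression on mtcars, with the polynomial degree standing in for the hyperparameter, and all helper names are invented for the example.

```r
set.seed(42)
assign_folds <- function(n, k) sample(rep(1:k, length.out = n))

# Mean validation RMSE of a degree-d polynomial fit under k-fold CV.
cv_rmse <- function(data, degree, k) {
  fold_id <- assign_folds(nrow(data), k)
  mean(sapply(1:k, function(f) {
    fit <- lm(mpg ~ poly(hp, degree), data = data[fold_id != f, ])
    pred <- predict(fit, newdata = data[fold_id == f, ])
    sqrt(mean((data$mpg[fold_id == f] - pred)^2))
  }))
}

degrees <- 1:3
outer_k <- 4
outer_id <- assign_folds(nrow(mtcars), outer_k)

outer_rmse <- sapply(1:outer_k, function(f) {
  outer_train <- mtcars[outer_id != f, ]
  outer_test <- mtcars[outer_id == f, ]
  # Inner loop: choose the degree using only the outer training data.
  inner_scores <- sapply(degrees, function(d) cv_rmse(outer_train, d, k = 3))
  best_degree <- degrees[which.min(inner_scores)]
  # Outer loop: refit with the chosen degree and score on untouched data.
  fit <- lm(mpg ~ poly(hp, best_degree), data = outer_train)
  pred <- predict(fit, newdata = outer_test)
  sqrt(mean((outer_test$mpg - pred)^2))
})

mean(outer_rmse)  # performance estimate that never saw its own test fold
```

Different outer folds may select different degrees. That is expected: the outer loop estimates how well the whole tuning procedure performs, not one specific setting.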
Next steps
Now that you understand hyperparameter tuning in R, explore the related topics below to apply these techniques in more complex scenarios.
See also
- Cross-Validation in R — Resampling methods for reliable model evaluation
- Random Forests in R — The random forest algorithm and its hyperparameters
- Tidymodels Regression — Building regression models with the tidymodels framework