# Cross-Validation in R

## Why Cross-Validation?
When you train a machine learning model, you want to know how well it will perform on data it has never seen before. That’s called generalisation — and it’s the whole point of building a model in the first place.
A common mistake is to evaluate a model only on the data it was trained with. This always looks great on paper (your model “knows” those answers!), but it tells you nothing about real-world performance. Your model might have simply memorised the training set rather than learning patterns that transfer.
Cross-validation solves this by holding out portions of your data during training and using them to simulate how the model would perform on new data. Instead of a single unreliable estimate, you get a distribution of performance scores that gives you a much clearer picture.
In this tutorial, you’ll learn the main cross-validation strategies, when to use each one, and how to implement them in R using both caret and tidymodels.
## Hold-Out Validation
The simplest form of validation is a single train/test split. You randomly partition your data — typically 80% for training and 20% for testing — then train on one set and evaluate on the other.
```r
set.seed(42)
# Sample 80% of the row indices for training; the rest is the test set
idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
model <- lm(mpg ~ wt + hp, data = train)
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((pred - test$mpg)^2))
rmse
# [1] 2.87
```
This is fast and easy, but it has a serious drawback: **high variance**. Your estimate of model performance depends entirely on which rows ended up in the training set. Run the same code with a different random seed and you might get a noticeably different RMSE. With small or medium datasets, that variance makes it hard to trust any single result.
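You can see this variance directly by re-running the split under different seeds. This is an illustrative base-R sketch (same model and data as above), not something you would report as a final estimate:

```r
# Re-run the hold-out split with 20 different seeds and collect the RMSEs
holdout_rmse <- function(seed) {
  set.seed(seed)
  idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  pred <- predict(fit, newdata = mtcars[-idx, ])
  sqrt(mean((pred - mtcars$mpg[-idx])^2))
}
rmses <- sapply(1:20, holdout_rmse)
range(rmses)  # the spread across seeds is the variance you are exposed to
```

On a 32-row dataset like `mtcars`, the gap between the smallest and largest RMSE is typically large relative to the mean, which is exactly why a single split is hard to trust.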
**When to use hold-out validation:** Very large datasets (100k+ rows) where a single split is computationally convenient and the dataset is large enough that random variation averages out naturally.
## K-Fold Cross-Validation
K-fold cross-validation is the workhorse of model evaluation. The data is split into **k** roughly equal partitions called *folds*. The model is trained on k−1 folds and evaluated on the remaining fold. This process repeats k times, rotating which fold is held out each time.
All k evaluation scores are then averaged to produce a single performance estimate.
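To make the procedure concrete, here is a minimal base-R sketch of the loop just described, using `mtcars` and the same `lm()` model as in the hold-out example (k = 5 keeps the folds a reasonable size on 32 rows):

```r
set.seed(42)
k <- 5
n <- nrow(mtcars)
# Randomly assign each row to one of k folds
fold_id <- sample(rep(1:k, length.out = n))
rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[fold_id != i, ]  # k-1 folds for training
  test  <- mtcars[fold_id == i, ]  # the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((pred - test$mpg)^2))
}
mean(rmse)  # average of the k fold-level scores
```

In practice you would use caret or rsample (shown below in this tutorial) rather than a hand-rolled loop, but the loop is all that k-fold cross-validation is.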
The most common choices are **k = 5** and **k = 10**. Here's why they dominate in practice:
### The Bias–Variance Trade-Off
The choice of k involves a fundamental trade-off between **bias** (how systematically wrong your estimate is) and **variance** (how much your estimate varies across different samples).
| k value | Bias | Variance | Computational cost |
|---------|------|----------|--------------------|
| Small (k=2, 3) | High | Low | Very low |
| Medium (k=5) | Moderate | Moderate | Moderate |
| Large (k=10, n) | Low | High | High |
With a small k like 2 or 3, each training set is much smaller than the full dataset — with k = 2, the model sees only half the data on each run — so your estimate is pessimistically biased: it overestimates generalisation error because each model was trained on relatively little data. However, the training sets overlap little, so the fold-level scores are less correlated with each other and their average has low variance.
With a large k, each training set is almost the full dataset (in LOOCV, where k = n, it's literally the full dataset minus one point). This reduces bias — you're training on almost everything each time — but the training sets overlap heavily, making them highly correlated. Averaging highly correlated scores does little to cancel out noise, so the estimate can swing significantly from one dataset to another. That's high variance.
**k = 10** is the most widely recommended default because it hits a sweet spot: low enough bias to be trustworthy, and low enough variance to be stable. Empirical studies such as Breiman and Spector (1992) and Kohavi (1995) found it a strong choice across a wide range of problems.
**Rule of thumb:**
- Use **k = 10** as your default
- Increase k (or use LOOCV) when you have limited data and every training example matters
- Decrease k when model training is slow and you need faster iteration
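The LOOCV case from the rule of thumb above is simple enough to write directly in base R — each of the n model fits holds out exactly one row:

```r
n <- nrow(mtcars)
# One prediction error per left-out row
errs <- sapply(1:n, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  predict(fit, newdata = mtcars[i, , drop = FALSE]) - mtcars$mpg[i]
})
sqrt(mean(errs^2))  # LOOCV RMSE
```

Note that LOOCV fits n models, so it only stays cheap for fast-to-train models like this linear regression.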
## Implementing K-Fold in R with caret
The **caret** package (Classification And REgression Training) is the most widely used R package for model training and evaluation. It provides a consistent interface to hundreds of model types and handles cross-validation through two core functions:
- `trainControl()` — specifies the resampling scheme
- `train()` — fits the model using the specified scheme
Here's a complete example using the Boston housing dataset (from the MASS package) to predict median home value (`medv`) from several predictors:
```r
library(caret)
library(MASS)
train_control <- trainControl(method = "cv", number = 10)
model <- train(
  medv ~ lstat + rm + age + tax,
  data = Boston,
  trControl = train_control,
  method = "lm"
)
model
# Linear Regression
#
# 506 samples
# 4 predictor
#
# No pre-processing
# Resampling: Cross-Validated (10 fold)
# Summary of sample sizes: 455, 456, 455, 456, 455, 455, 456, 455, 456, 455
#
# Resampling results:
# RMSE Rsquared MAE
# 5.13 0.63 3.87
```
`trainControl()` accepts a `method` argument that handles the main strategies. The key options are:
| Method | Description |
|--------|-------------|
| `"cv"` | K-fold cross-validation |
| `"repeatedcv"` | Repeated K-fold (more stable) |
| `"LOOCV"` | Leave-one-out cross-validation |
| `"boot"` | Bootstrap resampling |
| `"LGOCV"` | Leave-group-out (repeated random train/test splits) |
## Implementing K-Fold with tidymodels and rsample
The **tidymodels** ecosystem offers a more modern, tidyverse-flavored approach through the **rsample** package. Instead of a single `train()` call, you explicitly create resampling objects, then map model fitting and prediction over them.
### Creating folds with `vfold_cv()`
```r
library(tidymodels)
set.seed(42)
folds <- vfold_cv(mtcars, v = 10)
folds
# # 10-fold cross-validation
# # A tibble: 10 × 2
# splits id
# <list> <chr>
# 1 <split [29/3]> Fold01
# 2 <split [29/3]> Fold02
# ...
```
Each row of the tibble is one fold. The `splits` column contains `rsplit` objects — you extract the analysis (training) and assessment (test) data with `analysis()` and `assessment()`.
### Fitting and evaluating across folds
```r
fit_model <- function(split) {
  lm(mpg ~ wt + hp, data = analysis(split))
}
folds %>%
  mutate(
    model = map(splits, fit_model),
    pred = map2(model, splits, ~ predict(.x, newdata = assessment(.y)))
  ) %>%
  mutate(
    truth = map(splits, ~ assessment(.x)$mpg),
    rmse = map2_dbl(pred, truth, ~ sqrt(mean((.x - .y)^2)))
  ) %>%
  summarise(mean_rmse = mean(rmse))
# # A tibble: 1 × 1
# mean_rmse
# <dbl>
# 2.91
```
### Single hold-out with `validation_split()`
When you just want a quick single split the tidymodels way (note that recent versions of rsample deprecate `validation_split()` in favour of `initial_validation_split()`, though it still works):

```r
set.seed(42)
val_set <- validation_split(mtcars, prop = 0.8)
val_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
val_spec %>%
  fit(mpg ~ wt + hp, data = analysis(val_set$splits[[1]]))
```
The rsample package also provides `mc_cv()` for Monte Carlo (random hold-out repeated many times) and `group_vfold_cv()` for situations where groups must stay together — for example, when the same patient appears in multiple records.
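The grouped case is easy to get wrong with plain random folds, so it is worth seeing what `group_vfold_cv()` guards against. Here is a base-R sketch of group-aware fold assignment, using a hypothetical `patient` column (both the data frame and the column name are made up for illustration):

```r
set.seed(42)
# 12 patients, 3 records each -- records from one patient must stay together
df <- data.frame(patient = rep(1:12, each = 3), y = rnorm(36))
k <- 4
groups <- unique(df$patient)
# Assign folds at the *group* level, then propagate the assignment to rows
group_fold <- sample(rep(1:k, length.out = length(groups)))
fold_id <- group_fold[match(df$patient, groups)]
# Check: every patient's records share a single fold
all(tapply(fold_id, df$patient, function(x) length(unique(x)) == 1))
```

If you instead assigned folds row by row, records from the same patient could land in both the training and assessment sets, leaking information and inflating your performance estimate.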
## Repeated K-Fold for More Stable Estimates
A single k-fold estimate can still be noisy. To reduce variance, you repeat the entire k-fold procedure multiple times with different random partitions each time. The final estimate is averaged across all repeats.
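Conceptually, repeated k-fold is just the k-fold loop wrapped in an outer loop that re-randomises the fold assignment each time. A base-R sketch (k = 5, 3 repeats, same `mtcars` model as before):

```r
set.seed(42)
k <- 5
cv_rmse <- function(data) {
  # Fresh random fold assignment on every call
  fold_id <- sample(rep(1:k, length.out = nrow(data)))
  sapply(1:k, function(i) {
    fit <- lm(mpg ~ wt + hp, data = data[fold_id != i, ])
    pred <- predict(fit, newdata = data[fold_id == i, ])
    sqrt(mean((pred - data$mpg[fold_id == i])^2))
  })
}
scores <- replicate(3, cv_rmse(mtcars))  # k x repeats matrix of RMSEs
mean(scores)  # final estimate, averaged over all folds and repeats
sd(scores)    # spread of the fold-level scores
```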
With caret, this is simply `method = "repeatedcv"` with `number` (the k) and `repeats`:
```r
train_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)
set.seed(42)
model <- train(
  mpg ~ wt + hp + disp,
  data = mtcars,
  trControl = train_control,
  method = "lm"
)
model$results
# RMSE Rsquared MAE RMSESD RsquaredSD MAESD
# 1 2.97564 0.794521 2.40857 0.423107 0.068451 0.337567
```
The `RMSESD` column tells you how much the RMSE varied across folds and repeats — a small value relative to the mean RMSE means your estimate is stable.
Typical choices are `number = 10, repeats = 5` or `number = 5, repeats = 10`. The total number of model fits is k × repeats, so keep an eye on computation time for slow models.
## Stratified K-Fold for Classification
For classification problems with imbalanced classes, random k-fold can produce folds where some classes are missing or severely underrepresented in certain splits. This makes your evaluation unreliable.
**Stratified k-fold** preserves the class distribution across folds, ensuring each fold looks roughly like the full dataset.
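Under the hood, stratification just means assigning folds within each class rather than across the whole dataset. A base-R sketch on `iris` (50 rows per species, so each of 5 folds gets exactly 10 rows of each class):

```r
set.seed(42)
k <- 5
fold_id <- integer(nrow(iris))
for (cls in levels(iris$Species)) {
  rows <- which(iris$Species == cls)
  # Assign folds separately within each class
  fold_id[rows] <- sample(rep(1:k, length.out = length(rows)))
}
table(fold_id, iris$Species)  # every cell is 10: folds mirror the class mix
```

With imbalanced data the counts per cell differ across classes, but the *proportions* within each fold still match the full dataset, which is the property that matters.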
### With caret
```r
# twoClassSummary requires a two-class outcome, so drop one species first
iris2 <- droplevels(subset(iris, Species != "setosa"))
train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
set.seed(42)
model <- train(
  Species ~ Petal.Length + Petal.Width,
  data = iris2,
  method = "rpart",
  trControl = train_control,
  metric = "ROC"
)
```
`classProbs = TRUE` and `twoClassSummary` tell caret to compute class probabilities and evaluate using AUC, which is more informative than accuracy for imbalanced problems. Note that for classification, caret's `train()` already stratifies its folds on the outcome by default, so `method = "cv"` is all you need.
### With tidymodels
```r
folds <- vfold_cv(iris, v = 5, strata = Species)
```
The `strata` argument handles everything automatically — each fold maintains roughly the same proportion of each Species as the full dataset.
## Choosing k and Avoiding Common Mistakes
Here's a practical decision guide:
| Situation | Recommended approach |
|-----------|----------------------|
| Default, no reason to change | k = 10 |
| Small dataset (< 500 rows) | k = 10 or LOOCV |
| Very large dataset (> 100k rows) | k = 5 (faster, still stable) |
| Binary classification, imbalanced | Stratified k = 10 |
| Slow model (e.g. deep learning) | k = 5 with 3 repeats |
| Time series | Time-ordered splits: caret's `"timeslice"`, rsample's `rolling_origin()` |
| Repeated measures or clustering | `group_vfold_cv()` with group argument |
### Common mistakes to avoid
**1. Evaluating on training data.** Never report training set performance as your model's expected performance — it's always optimistically biased. Always hold out data for evaluation.
**2. Using the same split for hyperparameter tuning and final evaluation.** If you tune your model by repeatedly checking performance on a fixed test set, you'll overfit to that test set. Use nested cross-validation or a separate validation set for tuning.
**3. Ignoring variance in your CV results.** If your RMSE swings wildly across folds (high `RMSESD`), your estimate is unreliable. Consider repeated k-fold to get a more stable average.
**4. Choosing k too small to be meaningful.** k = 2 gives a training set that's only 50% of the data — the bias is enormous. k = 3 is acceptable for quick sanity checks but not for final evaluation.
**5. Forgetting to set a seed.** Cross-validation involves random partitioning. Without `set.seed()`, your results won't be reproducible, and different random states will give different estimates.
## Summary
Cross-validation is how you move from "the model fits the training data well" to "the model will generalise reliably." The key ideas:
- **Hold-out validation** is simple but high-variance — only use it for large datasets or quick checks.
- **K-fold cross-validation** (especially k = 10) gives a reliable, low-bias estimate with moderate computational cost.
- **Repeated k-fold** reduces variance further when you need more stability.
- **Stratified k-fold** is essential for classification with imbalanced classes.
- **caret** gives you a unified `train()` interface for most model types.
- **tidymodels/rsample** gives you explicit control over resampling with a tidy data workflow.
These techniques carry directly into hyperparameter tuning — when you combine cross-validation with a grid or random search over tuning parameters, you get the most robust model selection workflow available.
## See Also
- [Introduction to Supervised Learning in R](/tutorials/supervised-learning-intro-r/) — foundational concepts behind the models you're evaluating
- [Random Forests in R](/tutorials/random-forests-r/) — extending cross-validation for model comparison and ensemble methods