Model Evaluation and Cross-Validation

Building a machine learning model is only half the battle. Once you’ve trained a model, you need to know how well it performs on new, unseen data. Model evaluation and cross-validation are essential techniques that help you understand your model’s predictive power and generalization ability.

In this tutorial, you’ll learn how to properly evaluate machine learning models in R using train/test splits, k-fold cross-validation, and essential metrics like accuracy, precision, recall, F1 score, and ROC AUC.

What you’ll learn

This tutorial covers the key concepts and practical techniques for model evaluation and cross-validation in R. By the end, you will know how to split data into training and test sets, run k-fold cross-validation with caret and tidymodels, and choose metrics that fit your problem.

Why model evaluation matters

When you build a model on your training data, it will naturally perform well on that data—after all, it has seen the answers. The true test is how the model performs on data it has never seen before. This is called generalization.

A model that performs well on training data but poorly on new data is said to be overfitting. Conversely, a model that performs poorly on both training and test data is underfitting. Proper evaluation helps you diagnose these problems and choose the right model.

Train/test split

The simplest approach to model evaluation is to split your data into two sets:

  • Training set: Used to build the model
  • Test set: Used to evaluate the model

A common split is 70/30 or 80/20. The caret package makes this easy:

library(caret)
library(tidyverse)

# Load example data
data("iris")

# Set seed for reproducibility
set.seed(123)

# Create train/test split (70/30)
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Train a model
model <- train(Species ~ ., data = train_data, method = "rpart")

# Make predictions on test data
predictions <- predict(model, test_data)

# Calculate accuracy
confusionMatrix(predictions, test_data$Species)

The confusion matrix shows you not just accuracy, but also metrics like sensitivity and specificity for each class.

K-fold cross-validation

A single train/test split can be misleading—especially if your data happens to split in a favorable or unfavorable way. K-fold cross-validation addresses this by splitting data into k folds, training on k-1 folds, and testing on the remaining fold, rotating through all folds.

This gives you k different accuracy estimates, which you average:

# 5-fold cross-validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 5)

# Train with cross-validation
model_cv <- train(Species ~ ., 
                  data = iris, 
                  method = "rpart",
                  trControl = train_control)

# View cross-validation results
print(model_cv)

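The printed output shows the accuracy (and Cohen's kappa) averaged across all folds for each candidate tuning value. The fold-to-fold standard deviations are stored in model_cv$results (the AccuracySD and KappaSD columns), and the per-fold values for the selected model are in model_cv$resample. Together they give you a sense of how stable your model is.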
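
To make the fold rotation concrete, here is a minimal base R sketch of 5-fold cross-validation done by hand. It is not how caret implements it internally. The nearest-centroid classifier and the names fold_id, centroids, and fold_accuracy are assumptions chosen only so the example needs no extra packages.

```r
# Manual 5-fold cross-validation in base R (illustrative sketch).
# The classifier is a simple nearest-centroid rule, used here only
# because it needs no extra packages.
set.seed(123)
k <- 5
features <- as.matrix(iris[, 1:4])
labels <- iris$Species

# Randomly assign every row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  in_test <- fold_id == i
  train_x <- features[!in_test, , drop = FALSE]
  train_y <- labels[!in_test]
  test_x <- features[in_test, , drop = FALSE]

  # "Train" on the other k-1 folds: one mean vector (centroid) per class
  centroids <- apply(train_x, 2, function(col) tapply(col, train_y, mean))

  # Squared Euclidean distance from each test row to each class centroid
  dists <- sapply(rownames(centroids), function(cl) {
    rowSums(sweep(test_x, 2, centroids[cl, ])^2)
  })

  # Predict the class with the closest centroid and score the held-out fold
  pred <- colnames(dists)[max.col(-dists, ties.method = "first")]
  fold_accuracy[i] <- mean(pred == as.character(labels[in_test]))
}

fold_accuracy        # one accuracy estimate per fold
mean(fold_accuracy)  # the cross-validated estimate
sd(fold_accuracy)    # spread across folds
```

Each row is used for testing exactly once and for training k-1 times, which is why the averaged estimate depends less on one lucky or unlucky split.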

You can also use k-fold cross-validation with the tidymodels framework:

library(tidymodels)

# Define the model (the ranger package must be installed for this engine)
rf_spec <- rand_forest(trees = 100) |> 
  set_engine("ranger") |> 
  set_mode("classification")

# Define resampling
set.seed(123)
iris_folds <- vfold_cv(iris, v = 5)

# Fit with resampling
rf_res <- rf_spec |> 
  fit_resamples(Species ~ ., resamples = iris_folds)

# Collect metrics
collect_metrics(rf_res)

Evaluation metrics

Beyond accuracy, there are several important metrics to consider:

Accuracy

The proportion of correct predictions. Simple but can be misleading with imbalanced classes.

Precision

Of all instances predicted as positive, how many are actually positive?

Recall (Sensitivity)

Of all actual positives, how many did we correctly predict?

F1 score

The harmonic mean of precision and recall—useful when classes are imbalanced.

# Using caret for detailed metrics
confusionMatrix(predictions, test_data$Species, mode = "everything")
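
The definitions above can also be computed by hand from the four cells of a binary confusion matrix. The sketch below uses a small set of made-up labels (hypothetical data, not output from the iris model) so that each count is easy to verify.

```r
# Precision, recall, and F1 computed by hand from hypothetical labels.
# The data are invented for illustration: 4 actual positives, 6 negatives.
actual <- c("pos", "pos", "pos", "pos",
            "neg", "neg", "neg", "neg", "neg", "neg")
predicted <- c("pos", "pos", "pos", "neg",
               "pos", "pos", "neg", "neg", "neg", "neg")

tp <- sum(predicted == "pos" & actual == "pos")  # true positives
fp <- sum(predicted == "pos" & actual == "neg")  # false positives
fn <- sum(predicted == "neg" & actual == "pos")  # false negatives
tn <- sum(predicted == "neg" & actual == "neg")  # true negatives

accuracy <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)  # of predicted positives, how many are right
recall <- tp / (tp + fn)     # of actual positives, how many were found
f1 <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

Here the model finds three of the four positives (recall 0.75) but two of its five positive predictions are wrong (precision 0.6), so F1 lands between the two at about 0.67.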

ROC AUC

The Area Under the ROC Curve measures the model’s ability to distinguish between classes. AUC of 0.5 is random; 1.0 is perfect.

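library(pROC)

# Get probability predictions
probabilities <- predict(model, test_data, type = "prob")

# Calculate ROC AUC for one class against the rest (here: setosa)
is_setosa <- as.integer(test_data$Species == "setosa")
roc_curve <- roc(response = is_setosa, predictor = probabilities[, "setosa"],
                 levels = c(0, 1), direction = "<")
auc(roc_curve)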

For multiclass problems, use the one-vs-all approach or the multiclass.roc() function from the pROC package.
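
One way to build intuition for AUC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The base R sketch below computes it from ranks (the Mann-Whitney formulation). The function name auc_by_rank and the scores are made up for illustration. In practice pROC::auc() is the more convenient tool.

```r
# AUC as a rank statistic (illustrative sketch, hypothetical data).
auc_by_rank <- function(scores, is_positive) {
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  # Sum of positive ranks minus its minimum possible value,
  # divided by the number of positive/negative pairs
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.35, 0.6, 0.4, 0.2)
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# 7 of the 9 positive/negative pairs are ordered correctly
auc_by_rank(scores, is_positive)
```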

Choosing the right metric

The right metric depends on your problem:

  • Balanced classes: Accuracy is fine
  • Imbalanced classes: Use F1, precision, or recall
  • Ranking predictions: Use AUC
  • Cost-sensitive errors: Define custom loss functions

Putting it all together

Here’s a complete workflow combining everything:

library(caret)
library(tidyverse)
library(mlbench)  # provides the PimaIndiansDiabetes data set

# Prepare data
data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes

# Create stratified split
set.seed(456)
train_index <- createDataPartition(df$diabetes, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Define cross-validation
train_control <- trainControl(method = "cv", number = 10, 
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

# Train multiple models ("rf" needs randomForest, "gbm" needs gbm)
models <- list(
  rf = train(diabetes ~ ., data = train_data, method = "rf", 
             trControl = train_control, metric = "ROC"),
  gbm = train(diabetes ~ ., data = train_data, method = "gbm", 
              trControl = train_control, metric = "ROC",
              verbose = FALSE),  # silence gbm's per-iteration log
  logit = train(diabetes ~ ., data = train_data, method = "glm", 
                trControl = train_control, metric = "ROC")
)

# Compare on test set, treating "pos" (diabetes) as the positive class
map(models, ~ predict(.x, test_data) |> 
    confusionMatrix(test_data$diabetes, positive = "pos"))

Summary

Model evaluation is crucial for building reliable machine learning models:

  1. Train/test splits give you a quick estimate of performance
  2. K-fold cross-validation provides more reliable estimates
  3. Multiple metrics (accuracy, precision, recall, F1, AUC) give you a complete picture
  4. Choose metrics based on your problem and class distribution

With these techniques, you can confidently evaluate and compare models to find the best solution for your data science problems.

Regression metrics

For regression models: RMSE (root mean squared error) is in the same units as the outcome and penalizes large errors; MAE (mean absolute error) is less sensitive to outliers; R-squared is the proportion of variance explained. caret::postResample(predicted, actual) or yardstick::metrics(df, truth, estimate) compute all three. Both report R-squared as the squared correlation between predictions and observations, which always lies between 0 and 1. The traditional definition, 1 - SSE/SST (available as yardstick::rsq_trad()), is not bounded below at 0: it is negative for a model that predicts worse than the mean.
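
The three regression metrics are short enough to write out in base R, which also makes it easy to see how the traditional R-squared can go negative. The helper names (rmse, mae, r2_traditional) and the numbers below are made up for illustration.

```r
# Regression metrics by hand (illustrative sketch, hypothetical data).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Traditional R-squared: 1 - SSE / SST
r2_traditional <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual <- c(3, 5, 7, 9)
good_pred <- c(2.5, 5.5, 6, 10)
bad_pred <- c(9, 7, 5, 3)  # predictions in exactly the wrong order

rmse(actual, good_pred)            # sqrt(0.625), about 0.79
mae(actual, good_pred)             # 0.75
r2_traditional(actual, good_pred)  # 0.875

# A model worse than predicting the mean has negative traditional R-squared,
# while the squared correlation still reports a "perfect" 1
r2_traditional(actual, bad_pred)   # -3
cor(actual, bad_pred)^2            # 1
```

The last two lines show why it matters which definition a function reports: the squared correlation rewards any linear relationship, even one pointing the wrong way.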

Classification metrics

For binary classification: accuracy is the fraction of correct predictions, but it is misleading for imbalanced classes. Precision (positive predictive value), recall (sensitivity/TPR), and F1 (harmonic mean of precision and recall) are better for imbalanced problems. caret::confusionMatrix(predicted, actual, positive = "yes") returns the confusion matrix together with sensitivity, specificity, and related statistics; add mode = "everything" to include precision, recall, and F1. AUC (area under the ROC curve) summarizes performance across all thresholds.

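Cross-validation

caret::trainControl(method = "cv", number = 10) sets 10-fold cross-validation. CV estimates generalization error by training on 9 folds and testing on the remaining fold, rotating until every fold has served as the test fold once. Stratified CV is recommended for classification so that each fold has representative class proportions; caret stratifies its folds on the outcome by default when the outcome is a factor. method = "repeatedcv" (with the repeats argument) is a different option: it repeats the whole k-fold procedure several times to reduce the variance of the estimate. rsample::vfold_cv(data, v = 10, strata = outcome), part of tidymodels, creates stratified folds with tidy workflow integration.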
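
To show what stratification means in practice, here is a minimal base R sketch that assigns folds class by class. It is not caret's or rsample's actual implementation, and the function name stratified_folds is hypothetical.

```r
# Stratified fold assignment in base R (illustrative sketch).
stratified_folds <- function(y, k) {
  fold_id <- integer(length(y))
  for (cl in unique(as.character(y))) {
    idx <- which(y == cl)
    # Shuffle fold labels within this class so every fold
    # receives an (almost) equal share of the class
    fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold_id
}

set.seed(123)
folds <- stratified_folds(iris$Species, k = 5)

# Every fold contains 10 rows of each species
table(fold = folds, species = iris$Species)
```

With purely random folds the class counts per fold would vary by chance, which matters most when one class is rare.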

Learning curves

Learning curves plot training and validation performance as a function of training set size. A large gap between training and validation scores indicates high variance (overfitting); low scores for both indicate high bias (underfitting). Generate learning curves by training on subsets of increasing size and computing CV scores at each size. Learning curves diagnose whether adding more data would help (high variance) or whether a better model is needed (high bias).
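
The sketch below builds a small learning curve in base R on simulated data. To stay short it scores each fit on one fixed validation set rather than running cross-validation at each size. The data-generating process, the degree-5 polynomial model, and all names are assumptions for illustration.

```r
# Learning curve on simulated data (illustrative sketch).
set.seed(42)
n <- 400
sim <- data.frame(x = runif(n, -2, 2))
sim$y <- sin(2 * sim$x) + rnorm(n, sd = 0.3)

pool <- sim[1:300, ]     # rows available for training
valid <- sim[301:400, ]  # fixed validation set (a simplification of CV)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

sizes <- c(20, 50, 100, 200, 300)
learning_curve <- do.call(rbind, lapply(sizes, function(m) {
  train <- pool[seq_len(m), ]
  fit <- lm(y ~ poly(x, 5), data = train)  # fairly flexible model
  data.frame(
    train_size = m,
    train_rmse = rmse(train$y, predict(fit, newdata = train)),
    valid_rmse = rmse(valid$y, predict(fit, newdata = valid))
  )
}))

learning_curve

# Plot both curves against training set size
matplot(learning_curve$train_size,
        learning_curve[, c("train_rmse", "valid_rmse")],
        type = "b", pch = 1:2, lty = 1:2,
        xlab = "Training set size", ylab = "RMSE")
legend("topright", legend = c("training", "validation"), pch = 1:2, lty = 1:2)
```

A gap between the two curves that narrows as the training set grows suggests variance that more data can reduce. Two curves that sit close together at a high error suggest bias instead.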

Beyond accuracy

Model evaluation starts by choosing the right metrics for the problem. Accuracy is appropriate for balanced classification. For imbalanced problems (fraud detection, disease diagnosis, rare events), accuracy is misleading: a model that predicts only the majority class still achieves high accuracy.
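
A small made-up example shows the problem. Assume 1,000 transactions of which 10 are fraudulent, and a "model" that labels everything as legitimate. The data and names here are hypothetical.

```r
# Accuracy on an imbalanced problem (illustrative sketch, hypothetical data).
actual <- c(rep("fraud", 10), rep("legit", 990))
predicted <- rep("legit", 1000)  # always predict the majority class

accuracy <- mean(predicted == actual)
recall <- sum(predicted == "fraud" & actual == "fraud") / sum(actual == "fraud")

accuracy  # 0.99
recall    # 0: not a single fraud case is caught
```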

Precision measures what fraction of positive predictions are actually positive. Recall measures what fraction of actual positives the model catches. The F1 score (harmonic mean) balances both. Which matters more depends on the cost of errors: in fraud detection, high recall (catch most fraud) matters more than high precision (some false alarms are acceptable). In spam filtering, the reverse may be true: sending a legitimate email to the spam folder is usually worse than letting an occasional spam message through.

For regression, RMSE penalizes large errors more than MAE. RMSE is appropriate when large errors are especially costly. MAE is easier to interpret and more robust to outliers. R-squared measures the proportion of variance explained but depends on the variance in the test set, making it hard to compare across datasets with different target variance.

Calibration

A well-calibrated model produces probabilities that match observed frequencies. If a model predicts 70% probability for 1000 examples, approximately 700 should actually be positive. probably::cal_plot_windowed(pred_df, truth, .pred_positive) plots observed event rates against predicted probabilities over sliding windows, showing calibration quality.

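Logistic regression is usually well-calibrated. Random forests and gradient boosting are often less so: depending on the data and tuning, their probabilities can be pushed too far toward 0 and 1 (overconfident) or pulled away from them (underconfident), so check the calibration plot rather than assuming a direction. Calibration correction (probably::cal_estimate_logistic()) fits a logistic regression on the predicted probabilities to correct the calibration.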
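
The idea behind a calibration plot can be sketched in base R by binning predictions and comparing each bin's average prediction with its observed event rate. The data are simulated, and the function name calibration_table and the overconfident distortion are assumptions for illustration, not the probably package's method.

```r
# Binned calibration check in base R (illustrative sketch, simulated data).
calibration_table <- function(predicted, outcome, n_bins = 10) {
  bins <- cut(predicted, breaks = seq(0, 1, length.out = n_bins + 1),
              include.lowest = TRUE)
  data.frame(
    bin = levels(bins),
    mean_predicted = as.vector(tapply(predicted, bins, mean)),
    observed_rate = as.vector(tapply(outcome, bins, mean)),
    n = as.vector(table(bins))
  )
}

set.seed(1)
n <- 10000
true_prob <- runif(n)
outcome <- rbinom(n, 1, true_prob)  # events occur at exactly the stated rate

# Well-calibrated scores: the prediction equals the true probability
calibrated <- calibration_table(true_prob, outcome)

# Overconfident scores: same ranking, but pushed toward 0 and 1
overconfident_prob <- plogis(2.5 * qlogis(true_prob))
overconfident <- calibration_table(overconfident_prob, outcome)

calibrated     # mean_predicted tracks observed_rate closely
overconfident  # high predictions overshoot, low predictions undershoot
```

Because the distortion keeps the ranking of the scores intact, AUC is identical for both sets of predictions. Calibration is a separate property from discrimination.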

Threshold selection

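Classification thresholds are analytical choices, not fixed at 0.5. probably::threshold_perf(pred_df, truth, .pred_positive, thresholds = seq(0.1, 0.9, 0.05)) evaluates performance at multiple thresholds. By default it reports sensitivity, specificity, and the J-index; pass metrics = yardstick::metric_set(precision, recall, f_meas) to get precision, recall, and F1 instead. Plot the result with ggplot2 to find the threshold that best balances the metrics for your use case.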

pROC::roc() computes the full ROC curve. pROC::coords(roc_obj, "best") finds the threshold that maximizes the sum of sensitivity and specificity. pROC::auc() summarizes discriminative ability as a single number between 0.5 (random) and 1.0 (perfect).
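
The same threshold sweep can be sketched in base R to see what these functions compute. The scores are simulated from two beta distributions and all names are hypothetical, so treat this as an illustration of the idea, not a replacement for probably or pROC.

```r
# Threshold sweep in base R (illustrative sketch, simulated scores).
set.seed(7)
is_positive <- rep(c(TRUE, FALSE), times = c(200, 800))
score <- ifelse(is_positive, rbeta(1000, 4, 2), rbeta(1000, 2, 4))

thresholds <- seq(0.1, 0.9, by = 0.05)
perf <- do.call(rbind, lapply(thresholds, function(t) {
  pred_pos <- score >= t  # classify as positive at or above the threshold
  tp <- sum(pred_pos & is_positive)
  fp <- sum(pred_pos & !is_positive)
  fn <- sum(!pred_pos & is_positive)
  tn <- sum(!pred_pos & !is_positive)
  data.frame(
    threshold = t,
    precision = tp / (tp + fp),
    recall = tp / (tp + fn),
    specificity = tn / (tn + fp)
  )
}))
perf$f1 <- 2 * perf$precision * perf$recall / (perf$precision + perf$recall)
perf$youden_j <- perf$recall + perf$specificity - 1

perf[which.max(perf$f1), ]        # threshold with the best F1
perf[which.max(perf$youden_j), ]  # threshold with the best sensitivity + specificity
```

The two criteria need not agree: with imbalanced classes, the F1-optimal threshold is often higher than the one that maximizes sensitivity plus specificity, because F1 penalizes false positives through precision.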

Next steps

Now that you understand model evaluation and cross-validation, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.