Model Evaluation and Cross-Validation
Building a machine learning model is only half the battle. Once you’ve trained a model, you need to know how well it performs on new, unseen data. Model evaluation and cross-validation are essential techniques that help you understand your model’s predictive power and generalization ability.
In this tutorial, you’ll learn how to properly evaluate machine learning models in R using train/test splits, k-fold cross-validation, and essential metrics like accuracy, precision, recall, F1 score, and ROC AUC.
Why Model Evaluation Matters
When you build a model on your training data, it will naturally perform well on that data—after all, it has seen the answers. The true test is how the model performs on data it has never seen before. This is called generalization.
A model that performs well on training data but poorly on new data is said to be overfitting. Conversely, a model that performs poorly on both training and test data is underfitting. Proper evaluation helps you diagnose these problems and choose the right model.
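One way to diagnose these problems is to compare training and test error directly. Here is a small base-R sketch on synthetic data (the data-generating process and the degree-15 polynomial are illustrative choices, not from the tutorial): an overfit model fits the training points much better than it predicts new ones.

```r
# Synthetic data: a smooth signal plus noise
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
dat <- data.frame(x, y)
train_idx <- 1:70
test_idx  <- 71:100

# A simple model vs. a deliberately over-flexible one
fit_simple  <- lm(y ~ x, data = dat[train_idx, ])
fit_complex <- lm(y ~ poly(x, 15), data = dat[train_idx, ])

# Root-mean-squared error on a given subset of rows
rmse <- function(fit, idx) {
  sqrt(mean((dat$y[idx] - predict(fit, newdata = dat[idx, ]))^2))
}

round(c(simple_train  = rmse(fit_simple, train_idx),
        simple_test   = rmse(fit_simple, test_idx),
        complex_train = rmse(fit_complex, train_idx),
        complex_test  = rmse(fit_complex, test_idx)), 3)
```

A large gap between training and test error signals overfitting; high error on both signals underfitting.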
Train/Test Split
The simplest approach to model evaluation is to split your data into two sets:
- Training set: Used to build the model
- Test set: Used to evaluate the model
A common split is 70/30 or 80/20. The caret package makes this easy:
library(caret)
library(tidyverse)
# Load example data
data("iris")
# Set seed for reproducibility
set.seed(123)
# Create train/test split (70/30)
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
# Train a model
model <- train(Species ~ ., data = train_data, method = "rpart")
# Make predictions on test data
predictions <- predict(model, test_data)
# Calculate accuracy
confusionMatrix(predictions, test_data$Species)
The confusion matrix shows you not just accuracy, but also metrics like sensitivity and specificity for each class.
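Continuing from the code above, the object returned by confusionMatrix() can also be inspected programmatically rather than just printed, which is handy when you want to pull out individual metrics:

```r
# Store the result instead of printing it
cm <- confusionMatrix(predictions, test_data$Species)

cm$overall["Accuracy"]   # overall accuracy as a single number
cm$byClass               # per-class sensitivity, specificity, etc.
```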
K-Fold Cross-Validation
A single train/test split can be misleading—especially if your data happens to split in a favorable or unfavorable way. K-fold cross-validation addresses this by splitting data into k folds, training on k-1 folds, and testing on the remaining fold, rotating through all folds.
This gives you k different accuracy estimates, which you average:
# 5-fold cross-validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 5)
# Train with cross-validation
model_cv <- train(Species ~ .,
                  data = iris,
                  method = "rpart",
                  trControl = train_control)
# View cross-validation results
print(model_cv)
The output shows accuracy and the kappa statistic averaged across all folds, giving you a more stable performance estimate than a single train/test split.
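If you want to see how much performance varies from fold to fold, the per-fold results are stored in the fitted object's resample component (continuing from the model above):

```r
model_cv$resample                # accuracy and kappa for each fold
sd(model_cv$resample$Accuracy)   # spread across folds = model stability
```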
You can also use k-fold cross-validation with the tidymodels framework:
library(tidymodels)
# Define the model
rf_spec <- rand_forest(trees = 100) |>
  set_engine("ranger") |>   # ranger is the default engine; stated here explicitly
  set_mode("classification")
# Define resampling
set.seed(123)
iris_folds <- vfold_cv(iris, v = 5)
# Fit with resampling
rf_res <- rf_spec |>
  fit_resamples(Species ~ ., resamples = iris_folds)
# Collect metrics
collect_metrics(rf_res)
Evaluation Metrics
Beyond accuracy, there are several important metrics to consider:
Accuracy
The proportion of correct predictions. Simple but can be misleading with imbalanced classes.
Precision
Of all cases predicted as positive, how many are actually positive?
Recall (Sensitivity)
Of all actual positives, how many did we correctly predict?
F1 Score
The harmonic mean of precision and recall—useful when classes are imbalanced.
# Using caret for detailed metrics
confusionMatrix(predictions, test_data$Species, mode = "everything")
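To make the formulas concrete, here is a hand-computed sketch for a binary problem. The counts (TP, FP, FN) are made-up illustrative numbers, not from the iris example:

```r
# Illustrative counts from a hypothetical binary confusion matrix
TP <- 40   # true positives
FP <- 10   # false positives
FN <- 5    # false negatives

precision <- TP / (TP + FP)                          # 40 / 50 = 0.8
recall    <- TP / (TP + FN)                          # 40 / 45 ~= 0.889
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean

c(precision = precision, recall = recall, f1 = f1)
```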
ROC AUC
The Area Under the ROC Curve measures the model’s ability to distinguish between classes. AUC of 0.5 is random; 1.0 is perfect.
# ROC curves require the pROC package
library(pROC)
# Get class probability predictions
probabilities <- predict(model, test_data, type = "prob")
# One-vs-all ROC for the "setosa" class
roc_setosa <- roc(response = test_data$Species == "setosa",
                  predictor = probabilities[, "setosa"])
auc(roc_setosa)
For multiclass problems, use the one-vs-all approach or the multiclass.roc() function from the pROC package.
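Continuing from the probability predictions above, a multiclass AUC can be sketched with pROC like this (assuming the probability columns are named after the factor levels, as predict() with type = "prob" produces):

```r
library(pROC)
# Multiclass AUC: pROC averages pairwise comparisons between classes
multiclass.roc(test_data$Species, probabilities)
```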
Choosing the Right Metric
The right metric depends on your problem:
- Balanced classes: Accuracy is fine
- Imbalanced classes: Use F1, precision, or recall
- Ranking predictions: Use AUC
- Cost-sensitive errors: Define custom loss functions
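For the cost-sensitive case, caret lets you supply your own summary function to trainControl(). Below is a minimal sketch assuming false negatives cost five times as much as false positives; the weights and the name costSummary are illustrative, not from the original:

```r
# Custom summary function for caret: data has obs and pred columns,
# lev is the vector of class levels (lev[1] is the "positive" class)
costSummary <- function(data, lev = NULL, model = NULL) {
  fn <- sum(data$obs == lev[1] & data$pred != lev[1])  # missed positives
  fp <- sum(data$obs != lev[1] & data$pred == lev[1])  # false alarms
  c(Cost = 5 * fn + 1 * fp)
}
```

You would then pass it via trainControl(summaryFunction = costSummary) and select models with train(..., metric = "Cost", maximize = FALSE).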
Putting It All Together
Here’s a complete workflow combining everything:
library(caret)
library(tidyverse)
library(mlbench)   # provides the PimaIndiansDiabetes data set
# Prepare data
data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes
# Create stratified split
set.seed(456)
train_index <- createDataPartition(df$diabetes, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Define cross-validation
train_control <- trainControl(method = "cv", number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)
# Train multiple models
models <- list(
  rf = train(diabetes ~ ., data = train_data, method = "rf",
             trControl = train_control, metric = "ROC"),
  gbm = train(diabetes ~ ., data = train_data, method = "gbm",
              trControl = train_control, metric = "ROC",
              verbose = FALSE),   # suppress gbm's verbose training log
  logit = train(diabetes ~ ., data = train_data, method = "glm",
                trControl = train_control, metric = "ROC")
)
# Compare on test set
map(models, ~ predict(.x, test_data) |>
      confusionMatrix(test_data$diabetes))
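Beyond test-set confusion matrices, caret can also compare the cross-validated ROC of the three models directly with resamples(). This sketch continues from the models list above; for the cleanest comparison, the models should ideally be trained on identical fold assignments (e.g. via a shared index in trainControl):

```r
# Collect cross-validation results from all three models
cv_results <- resamples(models)
summary(cv_results)   # ROC, sensitivity, specificity per model
```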
Summary
Model evaluation is crucial for building reliable machine learning models:
- Train/test splits give you a quick estimate of performance
- K-fold cross-validation provides more robust estimates
- Multiple metrics (accuracy, precision, recall, F1, AUC) give you a complete picture
- Choose metrics based on your problem and class distribution
With these techniques, you can confidently evaluate and compare models to find the best solution for your data science problems.