Model Evaluation and Cross-Validation
Building a machine learning model is only half the battle. Once you’ve trained a model, you need to know how well it performs on new, unseen data. Model evaluation and cross-validation are essential techniques that help you understand your model’s predictive power and generalization ability.
In this tutorial, you’ll learn how to properly evaluate machine learning models in R using train/test splits, k-fold cross-validation, and essential metrics like accuracy, precision, recall, F1 score, and ROC AUC.
Why Model Evaluation Matters
When you build a model on your training data, it will naturally perform well on that data—after all, it has seen the answers. The true test is how the model performs on data it has never seen before. This is called generalization.
A model that performs well on training data but poorly on new data is said to be overfitting. Conversely, a model that performs poorly on both training and test data is underfitting. Proper evaluation helps you diagnose these problems and choose the right model.
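One way to diagnose these problems is to compare training and test error directly. Here is a small base-R sketch on synthetic data (the data-generating process and the degree-15 polynomial are illustrative choices, not from the tutorial): an overfit model fits the training points much better than it predicts new ones.

```r
# Synthetic data: a smooth signal plus noise
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
dat <- data.frame(x, y)
train_idx <- 1:70
test_idx  <- 71:100

# A simple model vs. a deliberately over-flexible one
fit_simple  <- lm(y ~ x, data = dat[train_idx, ])
fit_complex <- lm(y ~ poly(x, 15), data = dat[train_idx, ])

# Root-mean-squared error on a given subset of rows
rmse <- function(fit, idx) {
  sqrt(mean((dat$y[idx] - predict(fit, newdata = dat[idx, ]))^2))
}

round(c(simple_train  = rmse(fit_simple, train_idx),
        simple_test   = rmse(fit_simple, test_idx),
        complex_train = rmse(fit_complex, train_idx),
        complex_test  = rmse(fit_complex, test_idx)), 3)
```

A large gap between training and test error signals overfitting; high error on both signals underfitting.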
Train/Test Split
The simplest approach to model evaluation is to split your data into two sets:
- Training set: Used to build the model
- Test set: Used to evaluate the model
A common split is 70/30 or 80/20. The caret package makes this easy:
library(caret)
library(tidyverse)
# Load example data
data("iris")
# Set seed for reproducibility
set.seed(123)
# Create train/test split (70/30)
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
# Train a model
model <- train(Species ~ ., data = train_data, method = "rpart")
# Make predictions on test data
predictions <- predict(model, test_data)
# Calculate accuracy
confusionMatrix(predictions, test_data$Species)
The confusion matrix shows you not just accuracy, but also metrics like sensitivity and specificity for each class.
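Continuing from the code above, the object returned by confusionMatrix() can also be inspected programmatically rather than just printed, which is handy when you want to pull out individual metrics:

```r
# Store the result instead of printing it
cm <- confusionMatrix(predictions, test_data$Species)

cm$overall["Accuracy"]   # overall accuracy as a single number
cm$byClass               # per-class sensitivity, specificity, etc.
```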
K-Fold Cross-Validation
A single train/test split can be misleading—especially if your data happens to split in a favorable or unfavorable way. K-fold cross-validation addresses this by splitting data into k folds, training on k-1 folds, and testing on the remaining fold, rotating through all folds.
This gives you k different accuracy estimates, which you average:
# 5-fold cross-validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 5)
# Train with cross-validation
model_cv <- train(Species ~ .,
                  data = iris,
                  method = "rpart",
                  trControl = train_control)
# View cross-validation results
print(model_cv)
The output shows accuracy and the kappa statistic averaged across all folds, giving you a more stable performance estimate than a single train/test split.
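If you want to see how much performance varies from fold to fold, the per-fold results are stored in the fitted object's resample component (continuing from the model above):

```r
model_cv$resample                # accuracy and kappa for each fold
sd(model_cv$resample$Accuracy)   # spread across folds = model stability
```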
You can also use k-fold cross-validation with the tidymodels framework:
library(tidymodels)
# Define the model
rf_spec <- rand_forest(trees = 100) |>
  set_engine("ranger") |>   # ranger is the default engine; stated here explicitly
  set_mode("classification")
# Define resampling
set.seed(123)
iris_folds <- vfold_cv(iris, v = 5)
# Fit with resampling
rf_res <- rf_spec |>
  fit_resamples(Species ~ ., resamples = iris_folds)
# Collect metrics
collect_metrics(rf_res)
Evaluation Metrics
Beyond accuracy, there are several important metrics to consider:
Accuracy
The proportion of correct predictions. Simple but can be misleading with imbalanced classes.
Precision
Of all cases predicted as positive, how many are actually positive?
Recall (Sensitivity)
Of all actual positives, how many did we correctly predict?
F1 Score
The harmonic mean of precision and recall—useful when classes are imbalanced.
# Using caret for detailed metrics
confusionMatrix(predictions, test_data$Species, mode = "everything")
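To make the formulas concrete, here is a hand-computed sketch for a binary problem. The counts (TP, FP, FN) are made-up illustrative numbers, not from the iris example:

```r
# Illustrative counts from a hypothetical binary confusion matrix
TP <- 40   # true positives
FP <- 10   # false positives
FN <- 5    # false negatives

precision <- TP / (TP + FP)                          # 40 / 50 = 0.8
recall    <- TP / (TP + FN)                          # 40 / 45 ~= 0.889
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean

c(precision = precision, recall = recall, f1 = f1)
```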
ROC AUC
The Area Under the ROC Curve measures the model’s ability to distinguish between classes. AUC of 0.5 is random; 1.0 is perfect.
# ROC curves require the pROC package
library(pROC)
# Get class probability predictions
probabilities <- predict(model, test_data, type = "prob")
# One-vs-all ROC for the "setosa" class
roc_setosa <- roc(response = test_data$Species == "setosa",
                  predictor = probabilities[, "setosa"])
auc(roc_setosa)
For multiclass problems, use the one-vs-all approach or the multiclass.roc() function from the pROC package.
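Continuing from the probability predictions above, a multiclass AUC can be sketched with pROC like this (assuming the probability columns are named after the factor levels, as predict() with type = "prob" produces):

```r
library(pROC)
# Multiclass AUC: pROC averages pairwise comparisons between classes
multiclass.roc(test_data$Species, probabilities)
```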
Choosing the Right Metric
The right metric depends on your problem:
- Balanced classes: Accuracy is fine
- Imbalanced classes: Use F1, precision, or recall
- Ranking predictions: Use AUC
- Cost-sensitive errors: Define custom loss functions
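For the cost-sensitive case, caret lets you supply your own summary function to trainControl(). Below is a minimal sketch assuming false negatives cost five times as much as false positives; the weights and the name costSummary are illustrative, not from the original:

```r
# Custom summary function for caret: data has obs and pred columns,
# lev is the vector of class levels (lev[1] is the "positive" class)
costSummary <- function(data, lev = NULL, model = NULL) {
  fn <- sum(data$obs == lev[1] & data$pred != lev[1])  # missed positives
  fp <- sum(data$obs != lev[1] & data$pred == lev[1])  # false alarms
  c(Cost = 5 * fn + 1 * fp)
}
```

You would then pass it via trainControl(summaryFunction = costSummary) and select models with train(..., metric = "Cost", maximize = FALSE).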
Putting It All Together
Here’s a complete workflow combining everything:
library(caret)
library(tidyverse)
library(mlbench)   # provides the PimaIndiansDiabetes data set
# Prepare data
data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes
# Create stratified split
set.seed(456)
train_index <- createDataPartition(df$diabetes, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Define cross-validation
train_control <- trainControl(method = "cv", number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)
# Train multiple models
models <- list(
  rf = train(diabetes ~ ., data = train_data, method = "rf",
             trControl = train_control, metric = "ROC"),
  gbm = train(diabetes ~ ., data = train_data, method = "gbm",
              trControl = train_control, metric = "ROC",
              verbose = FALSE),   # suppress gbm's verbose training log
  logit = train(diabetes ~ ., data = train_data, method = "glm",
                trControl = train_control, metric = "ROC")
)
# Compare on test set
map(models, ~ predict(.x, test_data) |>
      confusionMatrix(test_data$diabetes))
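Beyond test-set confusion matrices, caret can also compare the cross-validated ROC of the three models directly with resamples(). This sketch continues from the models list above; for the cleanest comparison, the models should ideally be trained on identical fold assignments (e.g. via a shared index in trainControl):

```r
# Collect cross-validation results from all three models
cv_results <- resamples(models)
summary(cv_results)   # ROC, sensitivity, specificity per model
```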
Summary
Model evaluation is crucial for building reliable machine learning models:
- Train/test splits give you a quick estimate of performance
- K-fold cross-validation provides more robust estimates
- Multiple metrics (accuracy, precision, recall, F1, AUC) give you a complete picture
- Choose metrics based on your problem and class distribution
With these techniques, you can confidently evaluate and compare models to find the best solution for your data science problems.