Model Evaluation Metrics in R

· 10 min read · Updated March 27, 2026 · intermediate
r machine-learning model-evaluation metrics caret yardstick

Model evaluation is where your analysis either proves its worth or reveals a hidden flaw. You can train a model that fits your training data perfectly, but if it generalizes poorly to new data, the numbers are misleading you. The metrics you choose to measure performance are not just a report card — they shape which model you ship, which hyperparameters you select, and ultimately what decisions get made.

This tutorial covers the full landscape of model evaluation in R. You’ll learn regression metrics like RMSE, MAE, and R², classification metrics like accuracy, precision, recall, F1, and AUC-ROC, how to read a confusion matrix, how to run k-fold cross-validation with caret, and how to compute metrics cleanly with yardstick. Every section includes runnable R code you can adapt immediately.

Why Model Evaluation Matters

A model can fit its training data perfectly by simply memorizing it, yet score terribly on any real evaluation of new data. Without held-out evaluation, you might never know. Evaluating on training data alone is one of the most common mistakes in applied machine learning, because it produces flattering but completely unreliable numbers.

Good evaluation serves three purposes. First, it protects against overfitting — when a model is too complex and picks up noise rather than signal. Second, it protects against underfitting — when a model is too simple to capture the underlying pattern. Third, it gives you a defensible basis for comparing candidate models.

Metrics also surface data problems. A sudden drop in accuracy often signals a distribution shift in your incoming data rather than a model going bad. In regulated industries like finance or healthcare, documented evaluation metrics are a prerequisite for deployment, not an afterthought.

Train/Test Split — The Baseline

The simplest evaluation strategy is a random hold-out. You shuffle your data, set aside a test fraction (typically 20–30%), train on the rest, and evaluate on the held-out set.

set.seed(42)
idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

model <- lm(mpg ~ wt + hp, data = train)
pred  <- predict(model, newdata = test)

# Compute RMSE on held-out test set
library(yardstick)
rmse_vec(truth = test$mpg, estimate = pred)
# [1] 2.871

The single-split approach is fast and works well for large datasets where you have plenty of data to spare. Its weakness is high variance — a different random split can give meaningfully different results. The test set also cannot be reused without risking information leakage.
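That variance is easy to see directly. Here is a minimal base-R sketch (the helper name rmse_for_seed is ours, not from any package) that repeats the same 80/20 hold-out under different seeds:

```r
# Repeat the hold-out split with different seeds and compare test RMSE
rmse_for_seed <- function(seed) {
  set.seed(seed)
  idx  <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
  fit  <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  pred <- predict(fit, newdata = mtcars[-idx, ])
  sqrt(mean((mtcars$mpg[-idx] - pred)^2))  # RMSE on the held-out rows
}
round(sapply(1:5, rmse_for_seed), 3)  # five noticeably different estimates
```

With only 32 rows in mtcars the spread between seeds is substantial, which is exactly why cross-validation is preferred for small datasets.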

For model selection and hyperparameter tuning, you need a more robust approach (k-fold cross-validation). For final reporting, the held-out test set is your gold standard.

Regression Metrics: RMSE, MAE, R², MAPE

When your target variable is continuous, you’re doing regression and you’ll need regression-specific metrics.

RMSE — Root Mean Squared Error

RMSE is the square root of the average squared difference between predicted and actual values. It is in the same units as your target variable, which makes it interpretable. Because it squares the errors, RMSE penalizes large mistakes heavily.

library(yardstick)

rmse_vec(truth = c(3.1, 2.0, 4.5), estimate = c(3.0, 2.2, 4.0))
# [1] 0.3162

MAE — Mean Absolute Error

MAE averages the absolute differences. Unlike RMSE, every error contributes proportionally to its magnitude — MAE does not overweight outliers.

mae_vec(truth = c(3.1, 2.0, 4.5), estimate = c(3.0, 2.2, 4.0))
# [1] 0.2667

R² — Coefficient of Determination

R² measures the proportion of variance in the target that your model explains, where 1 means a perfect fit. Note that yardstick's rsq_vec() computes the squared correlation between truth and estimate, which is bounded between 0 and 1; the traditional definition, available as rsq_trad_vec(), can go negative when your model is worse than simply predicting the mean — a useful warning sign.

rsq_vec(truth = c(3.1, 2.0, 4.5), estimate = c(3.0, 2.2, 4.0))
# [1] 0.9999739

MAPE — Mean Absolute Percentage Error

MAPE expresses errors as a percentage of the actual values. This makes it scale-independent — a MAPE of 10% means your predictions are off by 10% on average regardless of whether you’re predicting temperatures in Celsius or dollar amounts.

# yardstick ships mape_vec(); the formula is mean(|truth - estimate| / truth) * 100
mape_vec(truth = c(100, 110, 115), estimate = c(102, 108, 114))
# [1] 1.5626

One caveat: MAPE is undefined when any actual value is zero, and it puts disproportionate weight on small actual values since a small absolute error represents a large percentage error there.

Computing Multiple Regression Metrics at Once

yardstick makes it easy to compute several metrics simultaneously using a tidy data frame:

library(dplyr)

eval_df <- tibble(
  truth    = c(3.1, 2.0, 4.5, 1.8, 5.2),
  estimate = c(3.0, 2.2, 4.0, 2.0, 5.0)
)

eval_df %>%
  metrics(truth, estimate)
# # A tibble: 3 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard       0.276
# 2 rsq     standard       0.985
# 3 mae     standard       0.24

Classification Metrics: Accuracy, Precision, Recall, F1, AUC-ROC

When your target is categorical — spam vs. not-spam, fraud vs. legitimate — you’re doing classification and you’ll need classification-specific metrics.

Confusion Matrix — Reading It in R

The confusion matrix is the foundation of classification evaluation. It tabulates every combination of predicted and actual class:

library(yardstick)

# conf_mat() takes a data frame as its first argument
cm_df <- data.frame(
  truth    = factor(c("no", "no", "yes", "yes", "no", "yes")),
  estimate = factor(c("no", "yes", "yes", "yes", "no", "yes"))
)
cm <- conf_mat(cm_df, truth, estimate)
print(cm)
#          Truth
# Prediction no yes
#       no   2   0
#       yes  1   3

From this 2×2 table you derive every other classification metric. Each cell has a name: True Negatives (TN) are correct “no” predictions, True Positives (TP) are correct “yes” predictions, False Positives (FP) are “no” cases predicted as “yes”, and False Negatives (FN) are “yes” cases predicted as “no”.
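The four cells map directly onto the metrics that follow. A base-R sketch recomputing them from the same vectors, with no packages and "yes" taken as the positive class:

```r
# Count the four confusion-matrix cells by hand
truth    <- c("no", "no", "yes", "yes", "no", "yes")
estimate <- c("no", "yes", "yes", "yes", "no", "yes")

TP <- sum(truth == "yes" & estimate == "yes")  # correct "yes"
TN <- sum(truth == "no"  & estimate == "no")   # correct "no"
FP <- sum(truth == "no"  & estimate == "yes")  # "no" predicted as "yes"
FN <- sum(truth == "yes" & estimate == "no")   # "yes" predicted as "no"

c(accuracy  = (TP + TN) / length(truth),
  precision = TP / (TP + FP),
  recall    = TP / (TP + FN))
```

Every metric in the rest of this section is some ratio of these four counts.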

Accuracy

Accuracy is the proportion of correct predictions overall:

accuracy_vec(
  truth    = factor(c("yes", "no", "yes", "yes", "no")),
  estimate = factor(c("yes", "no", "yes", "no",  "no"))
)
# [1] 0.8

Accuracy is intuitive but dangerously misleading on imbalanced data. A model that always predicts the majority class achieves 99% accuracy on a 99/1 imbalanced dataset — while being completely useless.
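This failure mode is easy to reproduce. A base-R sketch with 990 negatives, 10 positives, and a "model" that only ever answers "no":

```r
# Majority-class predictor on a 99/1 imbalanced dataset
truth    <- factor(c(rep("no", 990), rep("yes", 10)))
estimate <- factor(rep("no", 1000), levels = levels(truth))

mean(estimate == truth)                  # accuracy: 0.99
sum(estimate == "yes" & truth == "yes")  # positives actually found: 0
```

Recall for the "yes" class is exactly zero, which is why the metrics below matter.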

Precision

Precision asks: of everything predicted as positive, how many are actually positive? High precision means few false positives.

# the factor levels sort as c("no", "yes") and yardstick treats the first
# level as the event by default, so mark "yes" with event_level = "second"
precision_vec(
  truth    = factor(c("yes", "yes", "no", "no", "yes")),
  estimate = factor(c("yes", "no",  "no", "yes", "yes")),
  event_level = "second"
)
# [1] 0.6667

Recall (Sensitivity)

Recall asks: of everything that is actually positive, how many did we find? High recall means few false negatives.

recall_vec(
  truth    = factor(c("yes", "yes", "no", "no", "yes")),
  estimate = factor(c("yes", "no",  "no", "yes", "yes")),
  event_level = "second"
)
# [1] 0.6667

F1 Score

F1 is the harmonic mean of precision and recall — a single number that balances both. It penalizes extreme imbalance between them.

f_meas_vec(
  truth    = factor(c("yes", "yes", "no", "no", "yes")),
  estimate = factor(c("yes", "no",  "no", "yes", "yes")),
  event_level = "second"
)
# [1] 0.6667
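The harmonic mean is quick to verify by hand. With precision and recall both at 2/3, as in the examples above:

```r
# F1 as the harmonic mean of precision and recall
prec <- 2 / 3
rec  <- 2 / 3
f1   <- 2 * prec * rec / (prec + rec)
round(f1, 4)
# [1] 0.6667
```

Because the harmonic mean is dominated by the smaller of the two inputs, F1 collapses toward zero whenever either precision or recall does.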

AUC-ROC

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at every classification threshold. AUC is the area under this curve; a random classifier scores 0.5, a perfect classifier 1.0, and values below 0.5 indicate a ranking that is systematically inverted.

AUC is threshold-independent and works well for imbalanced datasets, since it evaluates ranking quality rather than a single threshold decision:

roc_auc_vec(
  truth    = factor(c("yes", "yes", "no", "no", "yes")),
  estimate = c(0.9, 0.3, 0.2, 0.8, 0.95),
  event_level = "second"   # the scores are probabilities of "yes"
)
# [1] 0.8333
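AUC also has a useful probabilistic reading: it is the chance that a randomly chosen positive is ranked above a randomly chosen negative (ties count half). A base-R sketch using the same scores:

```r
# Pairwise-ranking view of AUC, no packages required
truth  <- c("yes", "yes", "no", "no", "yes")
scores <- c(0.9, 0.3, 0.2, 0.8, 0.95)

pos <- scores[truth == "yes"]
neg <- scores[truth == "no"]

# fraction of (positive, negative) pairs where the positive outranks the negative
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
round(auc, 4)
```

This pairwise view explains why AUC measures ranking quality rather than any single threshold decision.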

K-Fold Cross-Validation with caret

A single train/test split gives you one noisy estimate. K-fold cross-validation divides your data into k equal folds, trains k times using k-1 folds for training and 1 for validation, then averages the results. This dramatically reduces estimate variance and is the standard approach for model selection.

library(caret)

set.seed(42)
train_control <- trainControl(method = "cv", number = 5)

model <- train(mpg ~ wt + hp + disp, data = mtcars,
               method = "lm",
               trControl = train_control)

print(model$results)
#   intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
# 1      TRUE 3.051839 0.7538649 2.452836 0.9043507  0.1473156 0.7145961

The output shows the mean metric value across all 5 folds plus the standard deviation — the SD tells you how stable the estimate is. A large SD means the model’s performance varies a lot depending on which fold it sees.
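What caret does here can be spelled out in a few lines of base R. A sketch of manual 5-fold CV (the fold-assignment scheme is ours; caret's internals differ in detail):

```r
# Manual 5-fold cross-validation for the same linear model
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp + disp, data = mtcars[folds != i, ])  # train on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == i, ])           # predict held-out fold
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
})

c(mean_rmse = mean(fold_rmse), sd_rmse = sd(fold_rmse))
```

Every row is used for validation exactly once, which is what makes the averaged estimate so much more stable than a single split.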

Using yardstick for Tidy Metric Computation

yardstick is part of the tidymodels ecosystem and provides consistent, pipe-friendly functions for computing metrics. Each metric comes in two forms — a data-frame version (rmse()) and a vector version (rmse_vec()) — and the data-frame versions respect dplyr groups, making it trivial to compute metrics per group for breakdown analysis.

library(yardstick)
library(dplyr)

# Regression example
eval_df <- tibble(
  truth    = c(3.1, 2.0, 4.5, 1.8, 5.2),
  estimate = c(3.0, 2.2, 4.0, 2.0, 5.0)
)

eval_df %>% rmse(truth, estimate)
# # A tibble: 1 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard       0.276

eval_df %>% rsq(truth, estimate)
# # A tibble: 1 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rsq     standard       0.985

For classification, yardstick works with factors and can also use probability columns for ROC-AUC:

clf_df <- tibble(
  truth     = factor(c("yes", "yes", "no", "no", "yes")),
  estimate  = factor(c("yes", "no",  "no", "yes", "yes")),
  prob_yes  = c(0.9, 0.4, 0.2, 0.6, 0.8)
)

clf_df %>% conf_mat(truth, estimate)
#          Truth
# Prediction no yes
#       no   1   1
#       yes  1   2

# prob_yes holds probabilities of "yes", the second factor level
clf_df %>% roc_auc(truth, prob_yes, event_level = "second")
# # A tibble: 1 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 roc_auc binary         0.833
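The grouped behavior mentioned above makes per-segment breakdowns one group_by() away. A sketch with a hypothetical segment column (the data here is made up for illustration):

```r
library(dplyr)
library(yardstick)

grouped <- tibble(
  segment  = rep(c("A", "B"), each = 3),
  truth    = c(3.1, 2.0, 4.5, 1.8, 5.2, 3.3),
  estimate = c(3.0, 2.2, 4.0, 2.0, 5.0, 3.5)
)

# One RMSE row per segment
grouped %>%
  group_by(segment) %>%
  rmse(truth, estimate)
```

This pattern is how you spot a model that performs well overall but poorly on one customer segment, region, or time period.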

Choosing the Right Metric

Metric selection is not arbitrary — it should reflect the actual costs of different error types in your problem domain.

For regression: use MAE when outliers should not dominate your evaluation. Use RMSE/MSE when large errors are especially costly and your error distribution is approximately normal. Use MAPE for business reporting where relative errors matter (forecasting demand, for instance). When comparing models with different numbers of predictors, report adjusted R² — plain R² never decreases as you add predictors, even irrelevant ones.
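Adjusted R² is one line in base R. A quick sketch on mtcars — adding a predictor can only raise plain R², while the adjusted version applies a penalty for model size:

```r
# Plain vs adjusted R² as predictors are added
fit2 <- lm(mpg ~ wt + hp,        data = mtcars)
fit3 <- lm(mpg ~ wt + hp + disp, data = mtcars)

summary(fit2)$r.squared       # plain R², two predictors
summary(fit3)$r.squared       # never lower than fit2's
summary(fit2)$adj.r.squared   # penalized for predictor count
summary(fit3)$adj.r.squared   # the fairer comparison
```

If the adjusted value drops when you add a predictor, that predictor is not paying for its complexity.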

For classification: start by asking whether your classes are balanced. If they are roughly equal, accuracy is a reasonable baseline. For imbalanced data, AUC-ROC and F1 are far more informative. If a false positive is costly (flagging a legitimate email as spam), optimize for precision. If a false negative is costly (missing a cancer case), optimize for recall. If you need to balance both, F1 is the natural choice.

Complete Example: Logistic Regression Evaluation

Putting it all together, here is a full end-to-end evaluation of a logistic regression model using the tidymodels framework:

library(tidymodels)
library(yardstick)

set.seed(31)
df <- mtcars %>%
  mutate(mpg_high = ifelse(mpg > 20, "high", "low") %>% factor(levels = c("low", "high")))

split  <- initial_split(df, prop = 0.75, strata = mpg_high)
train  <- training(split)
test   <- testing(split)

model <- glm(mpg_high ~ wt + hp + disp, data = train, family = binomial)
pred  <- predict(model, test, type = "response")
pred_class <- ifelse(pred > 0.5, "high", "low") %>% factor(levels = c("low", "high"))

eval_df <- tibble(
  truth    = test$mpg_high,
  estimate = pred_class,
  prob     = pred
)

eval_df %>%
  conf_mat(truth, estimate) %>%
  print()

eval_df %>%
  metrics(truth, estimate) %>%
  print()
# # A tibble: 2 × 3
#   .metric  .estimator .estimate
#   <chr>    <chr>          <dbl>
# 1 accuracy binary          0.75
# 2 kap      binary          0.467

# "high" is the second factor level, so score P(high) explicitly
eval_df %>% roc_auc(truth, prob, event_level = "second")

Summary

Model evaluation is not a single number — it is a discipline. The right evaluation setup starts with a proper train/test split, uses cross-validation for model selection and hyperparameter tuning, and picks metrics that reflect the real consequences of prediction errors in your specific problem.

R’s two main ecosystems for evaluation are caret, a mature unified interface for over 200 models with built-in resampling, and tidymodels with yardstick, the modern tidyverse-aligned suite that composes cleanly into reproducible pipelines. Both are worth knowing.

Always evaluate on held-out data, never on training data. Prefer k-fold cross-validation over a single split when you need reliable estimates. And choose your metric deliberately — the number you optimize for becomes the goal your model pursues.

See Also

  • Supervised Learning in R — an introduction to the supervised learning framework underlying all the models evaluated in this tutorial
  • Cross-Validation in R — a deeper look at k-fold CV, LOOCV, and resampling strategies for reliable performance estimates
  • Linear Regression in R — building and interpreting linear models, the foundation of regression evaluation
  • Logistic Regression in R — extending linear models to binary classification with probability outputs