Building Classification Models with tidymodels in R
Classification with tidymodels is one of the most productive workflows in the R machine learning ecosystem. When your target variable is categorical (spam or not spam, species A or B, churned or retained), you reach for a classifier. The tidymodels framework gives you a consistent, readable pipeline for this work: define your model, preprocess your data, bundle them into a workflow, fit, predict, and evaluate.
This tutorial walks through building classification models with tidymodels, from data splitting to final evaluation. You will see two algorithms in action (logistic regression and random forests), learn how to interpret probability predictions, and get comfortable with the evaluation metrics that actually matter for classification tasks.
The classification problem
In regression you predict a continuous number. In classification you predict a discrete label. The mechanics differ (most classifiers output class probabilities rather than raw numbers), but the overall structure is the same. You have features, you have a known outcome, and you want a model that maps the former to the latter.
tidymodels handles both regression and classification with the same pipeline design. The main differences are setting mode = "classification" when you define your model and storing the outcome as a factor, which parsnip requires for classification models.
Setting up the tidymodels workflow
The tidymodels ecosystem spans several packages, each handling one stage of the pipeline. You load them all at once with the metapackage:
library(tidymodels)
If this is your first time, install with install.packages("tidymodels"). The metapackage pulls in parsnip for model specification, rsample for data splitting, recipes for preprocessing, workflows for bundling everything together, and yardstick for evaluation.
Split your data
Always hold out a test set before doing anything else. Use initial_split() from rsample, and always use strata when your classes are imbalanced:
split <- initial_split(my_data, prop = 0.8, strata = outcome_variable)
train_data <- training(split)
test_data <- testing(split)
Stratified sampling ensures both sets have roughly the same proportion of each class. Skipping this step on imbalanced data gives you a misleading test set.
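A quick way to confirm that stratification did its job is to compare class proportions in the two partitions. The sketch below uses a hypothetical imbalanced dataset (`toy`, roughly 20% "yes"); the object names and the 80/20 ratio are assumptions made for illustration only.

```r
library(tidymodels)

set.seed(123)
# Hypothetical imbalanced data: about 20% of rows are "yes"
toy <- tibble(
  x = rnorm(1000),
  outcome_variable = factor(
    sample(c("no", "yes"), 1000, replace = TRUE, prob = c(0.8, 0.2))
  )
)

toy_split <- initial_split(toy, prop = 0.8, strata = outcome_variable)

# Class proportions in each partition
train_prop <- training(toy_split) |>
  count(outcome_variable) |>
  mutate(prop = n / sum(n))
test_prop <- testing(toy_split) |>
  count(outcome_variable) |>
  mutate(prop = n / sum(n))

train_prop
test_prop
```

If the two prop columns differ by more than a percentage point or two on your own data, check that strata points at the outcome column.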
Define a model specification
The parsnip package provides a unified interface to dozens of algorithms. You specify the model type, the computational engine, and the mode.
Logistic regression is the baseline classifier for binary outcomes:
log_reg_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
Random forests handle nonlinearity and interactions without much tweaking. They vote across hundreds of decision trees, each trained on a bootstrap sample of the data with a random subset of predictors considered at each split. This gives you built-in protection against overfitting and naturally captures interactions between features that a linear model would miss:
rf_spec <- rand_forest(trees = 200, mode = "classification") |>
  set_engine("ranger")
Each algorithm has its own tuning parameters (trees for random forests, penalty for regularised logistic regression), but the interface is identical. You can swap one specification for another without changing anything else in the pipeline. The trees argument controls how many trees to grow; 200 is a reasonable starting point that balances speed and stability (ranger grows 500 trees if you leave the argument unset).
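Because the interface is shared, a regularised logistic regression is specified the same way. The sketch below assumes the glmnet engine; the penalty and mixture values are arbitrary placeholders rather than recommendations, and fitting this specification would require the glmnet package to be installed.

```r
library(tidymodels)

# Lasso-style logistic regression; penalty = 0.01 is an arbitrary illustrative value
reg_log_spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

reg_log_spec
```

The resulting object can be passed to add_model() exactly like the other specifications.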
Preprocess with a recipe
The recipes package handles feature engineering. A recipe is a blueprint describing what transformations to apply. It runs inside the workflow during fitting, which means preprocessing and modelling happen together and preprocessing never accidentally leaks information from the test set.
For most classification problems something like this works well:
rec <- recipe(outcome_variable ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_impute_knn(all_predictors()) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors())
- step_normalize() centres and scales numeric predictors to mean zero and variance one. Many algorithms need this.
- step_impute_knn() fills missing values using k-nearest neighbours. Recipes do not handle NAs automatically; without imputation, fit() can fail or silently drop rows, depending on the engine.
- step_zv() removes predictors that have only one value across all rows.
- step_dummy() converts categorical variables to numeric dummy columns.
If your data is already clean and numeric, you can skip the recipe entirely and use a formula interface directly in the workflow. But for real-world datasets with missing values and categorical columns, a recipe is the safest default.
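To see what a recipe actually produces, you can prep() it and bake() the training data outside a workflow, purely for inspection. The sketch below uses a hypothetical six-row dataset and swaps in step_impute_median() for the k-nearest-neighbours step to keep the toy example small; the column names are assumptions.

```r
library(tidymodels)

# Hypothetical toy data with one missing value and one categorical predictor
toy <- tibble(
  age = c(23, 35, NA, 51, 42, 29),
  plan = factor(c("a", "b", "a", "c", "b", "a")),
  outcome_variable = factor(c("no", "yes", "no", "yes", "yes", "no"))
)

toy_rec <- recipe(outcome_variable ~ ., data = toy) |>
  step_impute_median(all_numeric_predictors()) |>  # stand-in for step_impute_knn()
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# prep() estimates the parameters; bake(new_data = NULL) returns the processed training set
baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

baked
```

This is only for looking at the result. Inside a workflow, fit() performs the same prep step for you.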
Bundle into a workflow
A workflow combines the model specification and the preprocessing recipe (or formula) into one object. This is the key design choice in tidymodels: the workflow owns both preprocessing and modelling as a single unit. When you call fit() on a workflow, it runs recipes::prep() on the training data to estimate preprocessing parameters, then fits the model on the preprocessed result. The tight coupling prevents data leakage because preprocessing parameters are only ever estimated from the data passed to fit(); at prediction time the workflow applies those stored parameters to new data without re-estimating them.
Here is how you bundle a random forest specification with the preprocessing recipe:
wf <- workflow() |>
  add_model(rf_spec) |>
  add_recipe(rec)
The same recipe works with any model specification. You can just as easily swap in the logistic regression specification — the workflow does not care what algorithm sits inside, it only needs a model spec and a preprocessing blueprint that match the dataset. This composability is why swapping algorithms in tidymodels is a one-line change:
wf <- workflow() |>
  add_model(log_reg_spec) |>
  add_recipe(rec)
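If a workflow already exists, the model can also be replaced in place with update_model() rather than rebuilding the object. The sketch below uses a hypothetical toy dataset and a bare recipe; nothing is fitted, so it only demonstrates the swap itself.

```r
library(tidymodels)

# Hypothetical data, used only so the recipe has something to describe
toy <- tibble(
  x = rnorm(20),
  outcome_variable = factor(rep(c("no", "yes"), 10))
)

wf_rf <- workflow() |>
  add_model(rand_forest(mode = "classification") |> set_engine("ranger")) |>
  add_recipe(recipe(outcome_variable ~ ., data = toy))

# Swap the random forest for logistic regression; the recipe is untouched
wf_lr <- wf_rf |>
  update_model(logistic_reg() |> set_engine("glm"))

extract_spec_parsnip(wf_lr)
```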
For simpler cases where you have clean predictors and no missing values, you can use add_formula() instead of a recipe. The formula interface is more concise but skips the transformation steps that recipes provide. workflows still expands factor predictors into indicator columns for engines that need them, but nothing is imputed or normalised, so missing values may cause the fit to fail or rows to be dropped, depending on the engine:
wf <- workflow() |>
  add_model(log_reg_spec) |>
  add_formula(outcome_variable ~ predictor1 + predictor2 + predictor3)
Fitting and predicting
Once you have a workflow with a model specification and a preprocessing blueprint, the next step is to train the model on your data and generate predictions. The workflow object handles all the plumbing — you call fit() once and tidymodels takes care of preprocessing, model fitting, and keeping everything aligned.
Train the model
One call to fit() handles everything. It runs the recipe steps, estimates model parameters, and returns a fitted workflow that you can immediately use for prediction:
fitted_wf <- fit(wf, data = train_data)
Inside, fit() calls recipes::prep() to estimate preprocessing parameters from the training data (means, standard deviations, imputation neighbours), then fits the model. Never call bake() manually on training data before fitting; that causes double-processing and wrong parameters. The workflow object itself keeps track of whether it has been fitted, so you can always check fitted_wf$trained to confirm the model is ready before calling predict().
Hard class predictions
The default predict() call returns hard class labels — the single most likely category for each row. This is the output you use when you need a final yes/no or A/B/C decision:
predictions <- predict(fitted_wf, new_data = test_data)
# Returns a tibble with one column: .pred_class
Hard class predictions are the right choice when you need actionable decisions, but they throw away threshold information. If you want to adjust the decision threshold later (say, to favour recall over precision), you need probability predictions instead.
Probability predictions
Most classifiers can output probabilities. Pass type = "prob" to get the per-class probability estimates rather than a single label. This gives you the flexibility to choose your own classification threshold rather than accepting the default 0.5 cutoff:
predictions_prob <- predict(fitted_wf, new_data = test_data, type = "prob")
# Returns one probability column per class: .pred_level1, .pred_level2, ... (no .pred_class column)
You need probability predictions to compute ROC AUC and other metrics that evaluate the model across thresholds. The default type = "class" only gives you labels. Probability columns are also essential if you plan to create calibration plots or set custom decision thresholds for imbalanced datasets.
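Applying a custom threshold is a one-step transformation of the probability column. The sketch below uses a hypothetical tibble shaped like the output of predict(..., type = "prob") for levels "no" and "yes"; the 0.3 cutoff and the .pred_custom column name are assumptions chosen for illustration.

```r
library(tidymodels)

# Hypothetical probability predictions
probs <- tibble(
  .pred_no  = c(0.90, 0.65, 0.40, 0.20),
  .pred_yes = c(0.10, 0.35, 0.60, 0.80)
)

threshold <- 0.3  # assumed cutoff, lowered to favour recall for "yes"

custom <- probs |>
  mutate(
    .pred_custom = factor(
      if_else(.pred_yes >= threshold, "yes", "no"),
      levels = c("no", "yes")  # keep the same level order as the outcome
    )
  )

custom
```

With this cutoff the second row (probability 0.35) is labelled "yes", whereas the default 0.5 cutoff would label it "no".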
Attach predictions to original data
augment() sticks the predictions back onto your test data as new columns. This keeps everything aligned by row and is the cleanest way to work with evaluation — you get the original features, the predicted class, and the probability estimates all in one tibble:
results <- augment(fitted_wf, new_data = test_data)
# Original columns + .pred_class, .pred_level1, .pred_level2, ...
The column names for probabilities depend on how your outcome variable is encoded. For a binary factor with levels c("no", "yes"), you’ll see .pred_no and .pred_yes. Check levels(train_data$outcome_variable) to confirm which column is which.
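To check the column naming on a small case, the sketch below fits a logistic regression workflow to hypothetical simulated data with levels "no" and "yes" and lists the columns that augment() returns. The data and object names are assumptions.

```r
library(tidymodels)

set.seed(1)
# Hypothetical data: one numeric predictor, noisy binary outcome
toy <- tibble(
  x = rnorm(200),
  outcome_variable = factor(
    if_else(x + rnorm(200) > 0, "yes", "no"),
    levels = c("no", "yes")
  )
)

toy_fit <- workflow() |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  add_formula(outcome_variable ~ x) |>
  fit(data = toy)

aug_names <- names(augment(toy_fit, new_data = toy))
aug_names
```

The result should include .pred_class, .pred_no, and .pred_yes alongside the original columns; their position in the tibble can differ between package versions.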
Evaluation metrics
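The yardstick package provides tidymodels-compatible metric functions. All of them take a data frame and a truth column; class metrics such as accuracy() also take an estimate column of predicted labels, while probability metrics such as roc_auc() take the probability column(s) instead.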
Accuracy
Accuracy is the proportion of correct predictions:
accuracy(results, truth = outcome_variable, estimate = .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.935
Accuracy is intuitive but misleading on imbalanced data. If 95% of your cases are negative, a model that always predicts negative gets 95% accuracy. Always check additional metrics.
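The always-negative case can be reproduced in a few lines. The sketch below builds a hypothetical set of 95 negatives and 5 positives and scores a "model" that predicts "no" for every row; the data are made up for illustration.

```r
library(tidymodels)

lvls <- c("no", "yes")
# Hypothetical imbalanced truth with a constant prediction
baseline <- tibble(
  truth = factor(c(rep("no", 95), rep("yes", 5)), levels = lvls),
  .pred_class = factor(rep("no", 100), levels = lvls)
)

acc  <- accuracy(baseline, truth = truth, estimate = .pred_class)
bal  <- bal_accuracy(baseline, truth = truth, estimate = .pred_class)
# event_level = "second" makes "yes" the class of interest
sens_yes <- sensitivity(baseline, truth = truth, estimate = .pred_class,
                        event_level = "second")

bind_rows(acc, bal, sens_yes)
```

Accuracy comes out at 0.95, while balanced accuracy is 0.5 and sensitivity for "yes" is 0, which is what reveals that the classifier never finds a positive case.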
ROC AUC
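ROC AUC (area under the receiver operating characteristic curve) measures how well your model separates classes across all possible thresholds. A value of 0.5 corresponds to random guessing and 1.0 to perfect separation; a value well below 0.5 usually means the probability column and the event level are mismatched. yardstick treats the first factor level as the event by default, so with levels c("no", "yes") you pass .pred_yes together with event_level = "second":
roc_auc(results, truth = outcome_variable, .pred_yes, event_level = "second")
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 roc_auc binary 0.94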
For multiclass problems, pass every probability column (for example .pred_a:.pred_c) instead of a single one; roc_auc() then defaults to the Hand-Till estimator, and you can request macro averaging with estimator = "macro" or "macro_weighted". ROC AUC does not tell you where the model fails; it only summarises overall separability. To see which specific classes are confused, you need a confusion matrix.
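The effect of the event level on ROC AUC is easy to see on a tiny example. The sketch below uses six hypothetical rows with levels c("no", "yes") and scores the same .pred_yes column twice, once with yardstick's default event level and once with event_level = "second".

```r
library(tidymodels)

# Hypothetical truth and predicted probability of "yes"
toy <- tibble(
  truth = factor(c("no", "no", "no", "yes", "yes", "yes"), levels = c("no", "yes")),
  .pred_yes = c(0.1, 0.3, 0.6, 0.4, 0.8, 0.9)
)

# Default: the first level ("no") is the event, so .pred_yes is scored backwards
auc_default <- roc_auc(toy, truth = truth, .pred_yes)

# "yes" is the second level, so say so explicitly
auc_second <- roc_auc(toy, truth = truth, .pred_yes, event_level = "second")

bind_rows(auc_default, auc_second)
```

The two estimates are complementary (they sum to 1), so an unexpectedly low AUC is often a sign that the column and the event level do not match.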
Confusion matrix
A confusion matrix tabulates true classes against predicted classes, showing which misclassification patterns dominate. The diagonal entries are correct predictions and the off-diagonal entries reveal class pairs the model struggles to distinguish:
conf_mat(results, truth = outcome_variable, estimate = .pred_class)
# Truth
# Prediction no yes
# no 89 7
# yes 6 98
A confusion matrix is most useful when you need precision and recall for each class separately. The raw tabular output tells you the counts, but you often want standardised metrics like sensitivity and specificity. Use summary() on the confusion matrix to extract these in a tidy format:
conf_mat(results, truth = outcome_variable, estimate = .pred_class) |>
  summary() |>
  filter(.metric %in% c("accuracy", "sens", "spec", "ppv", "npv"))
# Positive class = first factor level ("no") unless you pass event_level = "second" to summary()
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.935
# 2 sens binary 0.937
# 3 spec binary 0.933
# 4 ppv binary 0.927
# 5 npv binary 0.942
- Sensitivity (recall): proportion of actual positives correctly predicted
- Specificity: proportion of actual negatives correctly predicted
- PPV (precision): proportion of positive predictions that are correct
- NPV: proportion of negative predictions that are correct
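These definitions can be checked by hand from the four cell counts. The sketch below copies the counts from the confusion matrix printed earlier in this section and assumes yardstick's default convention that the first factor level ("no") is the positive class; it uses base R only.

```r
# Counts from the confusion matrix (rows = prediction, columns = truth)
cm <- matrix(
  c(89, 6, 7, 98),
  nrow = 2,
  dimnames = list(Prediction = c("no", "yes"), Truth = c("no", "yes"))
)

# Treat "no" (the first level) as the positive class
tp <- cm["no", "no"]    # predicted no, actually no
fn <- cm["yes", "no"]   # predicted yes, actually no
fp <- cm["no", "yes"]   # predicted no, actually yes
tn <- cm["yes", "yes"]  # predicted yes, actually yes

by_hand <- c(
  accuracy = (tp + tn) / sum(cm),
  sens     = tp / (tp + fn),
  spec     = tn / (tn + fp),
  ppv      = tp / (tp + fp),
  npv      = tn / (tn + fn)
)

round(by_hand, 3)
```

Swapping the roles of "no" and "yes" in the four assignments gives the values you would get with event_level = "second".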
Other useful metrics
Beyond the confusion matrix summary, yardstick provides individual functions for each metric. These are useful when you want to track a specific metric across resamples or need to report one number rather than a table:
# F1 score — harmonic mean of precision and recall
f_meas(results, truth = outcome_variable, estimate = .pred_class)
# Sensitivity (true positive rate)
sensitivity(results, truth = outcome_variable, estimate = .pred_class)
# Specificity (true negative rate)
specificity(results, truth = outcome_variable, estimate = .pred_class)
# Balanced accuracy — average of sensitivity and specificity
bal_accuracy(results, truth = outcome_variable, estimate = .pred_class)
All four functions share the same calling convention: pass the data frame, name the truth column, name the estimate column. The F1 score is especially useful when the positive class is rare because it ignores true negatives and penalises a classifier that trades away either precision or recall. Balanced accuracy is the mean of sensitivity and specificity, giving equal weight to both classes regardless of their prevalence.
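Because the calling convention is shared, several class metrics can be bundled with metric_set() and computed in one call. The sketch below uses six hypothetical predictions; the object name cls_metrics is an assumption.

```r
library(tidymodels)

lvls <- c("no", "yes")
# Hypothetical truth and hard class predictions
toy <- tibble(
  truth       = factor(c("no", "no", "no", "yes", "yes", "yes"), levels = lvls),
  .pred_class = factor(c("no", "no", "yes", "yes", "yes", "no"), levels = lvls)
)

# One function that computes all five metrics at once
cls_metrics <- metric_set(accuracy, f_meas, sensitivity, specificity, bal_accuracy)

metric_tbl <- cls_metrics(toy, truth = truth, estimate = .pred_class)
metric_tbl
```

The result is a single tibble with one row per metric, which is convenient for reporting or for comparing models side by side.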
A complete example
Putting it all together with a random forest on simulated data. First, simulate the data and split it. We create three predictors with different characteristics (numeric with group separation, numeric with weaker separation, and categorical) plus a balanced binary outcome:
library(tidymodels)
set.seed(42)
n <- 800
my_data <- tibble(
  predictor1 = rnorm(n, mean = c(rep(0, n/2), rep(3, n/2)), sd = 1.5),
  predictor2 = rnorm(n, mean = c(rep(0, n/2), rep(2, n/2)), sd = 1.2),
  predictor3 = sample(letters[1:3], n, replace = TRUE),
  outcome_variable = factor(rep(c("no", "yes"), each = n/2))
)
split <- initial_split(my_data, prop = 0.8, strata = outcome_variable)
train_data <- training(split)
test_data <- testing(split)
Now build the workflow with a random forest, fit it to the training data, and generate predictions on the held-out test set. The recipe normalises numeric predictors and converts the categorical column to dummy variables. The random forest uses 200 trees, which is usually enough to stabilise predictions without excessive computation time:
rf_spec <- rand_forest(trees = 200, mode = "classification") |>
  set_engine("ranger")
rec <- recipe(outcome_variable ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
wf <- workflow() |>
  add_model(rf_spec) |>
  add_recipe(rec)
fitted_wf <- fit(wf, data = train_data)
results <- augment(fitted_wf, new_data = test_data)
With the predictions in hand, compute accuracy and ROC AUC. The class means are about two standard deviations apart on predictor1 (3 / 1.5) and about 1.7 on predictor2 (2 / 1.2), so the classes overlap somewhat: expect accuracy around 0.9 and an ROC AUC well above 0.9. The exact figures depend on the seed and package versions, so treat the printed values as illustrative. Checking both metrics together gives you a fuller picture than either one alone:
accuracy(results, truth = outcome_variable, estimate = .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# 1 accuracy binary 0.95
roc_auc(results, truth = outcome_variable, .pred_yes, event_level = "second")
# # A tibble: 1 x 3
# .metric .estimator .estimate
# 1 roc_auc binary 0.98
Finally, inspect the confusion matrix to see where the model made mistakes. The diagonal entries (74 and 78) are correct classifications, while the off-diagonal entries show the model confused 3 actual “yes” cases as “no” and 5 actual “no” cases as “yes”:
conf_mat(results, truth = outcome_variable, estimate = .pred_class)
# Truth
# Prediction no yes
# no 74 3
# yes 5 78
Common gotchas
Do not bake test data before predict(). The workflow handles preprocessing internally. If you manually call bake() on test data and then pass the result to predict(), your data gets preprocessed twice and your predictions will be wrong.
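Factor levels must match between train and test. The outcome column in test data must have the same levels as in training data. If your test subset is small and happens to be missing a class, do not call droplevels(): yardstick expects truth and estimate to share the same level set. Re-declare the factor with the training levels instead, for example factor(test_data$outcome_variable, levels = levels(train_data$outcome_variable)).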
Use type = "prob" for ROC AUC. The default predict(..., type = "class") gives you hard labels, not probabilities. ROC AUC requires the probability columns.
Stratify your splits on imbalanced data. Without strata = outcome_variable, a random split of a dataset with a rare class can leave the test set with very few, or even zero, minority-class examples. Most metrics will then be undefined or misleading.
Check which probability column is which. In binary classification with levels c("no", "yes"), .pred_no and .pred_yes are the two probability columns. yardstick treats the first factor level as the positive class by default, so call levels(train_data$outcome_variable) to check the order and pass event_level = "second" when the class you care about is the second level. For multiclass, you get one column per level.
Next steps
Now that you understand classification with tidymodels, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.
See also
- Regression with tidymodels — same framework, continuous outcomes
- Introduction to Supervised Learning in R — concepts behind the pipeline
- Model Evaluation in R — deeper dive into yardstick metrics
- Feature Engineering in R — building better predictors with recipes