Classification Models with tidymodels
Classification predicts categorical outcomes. You might predict whether a customer will churn, whether an email is spam, or whether a tumor is malignant. tidymodels provides a unified interface for all these tasks. This guide shows you how to build, evaluate, and compare classification models.
The classification workflow
The basic steps mirror the regression workflow. You split data, define a model, create a workflow, and evaluate. Classification adds a few important differences: you deal with class labels instead of continuous values, you measure performance with different metrics, and you often face class imbalance.
You will work with the Titanic dataset. It records survival outcomes along with passenger details. The goal is to predict whether a passenger survived.
Data preparation
The Titanic dataset has missing values and categorical variables that need processing. Load the data and create a recipe:
library(tidymodels)
library(titanic)
data(titanic_train, package = "titanic")
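titanic <- titanic_train |>
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked) |>
  mutate(
    # The outcome must be a factor for classification
    Survived = factor(Survived, levels = c("0", "1")),
    # The raw data codes missing ports as empty strings, not NA
    Embarked = na_if(Embarked, "")
  )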
titanic_rec <- recipe(Survived ~ ., data = titanic) |>
  step_impute_mode(Embarked) |>
  step_impute_median(Age) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())
This recipe fills missing Embarked values with the most common port (the mode) and missing Age values with the median age. It then creates dummy variables for the categorical predictors and removes zero-variance predictors.
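To see what a recipe like this produces, you can prep() it and bake() the training data. The sketch below uses a small made-up data frame (toy, a hypothetical stand-in with a few Titanic-like columns) rather than the real dataset, so the values are illustrative only:

```r
library(recipes)

# Hypothetical toy data with Titanic-like columns (not the real dataset)
toy <- data.frame(
  Survived = factor(c("0", "1", "1", "0", "1", "0")),
  Sex      = factor(c("male", "female", "female", "male", "female", "male")),
  Age      = c(22, 38, NA, 35, 27, NA),
  Embarked = factor(c("S", "C", NA, "S", "Q", "S"))
)

toy_rec <- recipe(Survived ~ ., data = toy) |>
  step_impute_mode(Embarked) |>
  step_impute_median(Age) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# prep() estimates the mode and median from the data;
# bake(new_data = NULL) returns the processed training data
baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

baked
```

The baked tibble should have no missing values, and the factor columns should be replaced by indicator columns such as Sex_male and Embarked_S.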
Now split the data:
set.seed(123)
titanic_split <- initial_split(titanic, strata = Survived, prop = 0.8)
titanic_train <- training(titanic_split)
titanic_test <- testing(titanic_split)
The strata argument ensures both sets have similar survival proportions. This matters for classification with imbalanced classes.
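A quick way to check the effect of strata is to compare class proportions in the two sets. This is a minimal sketch on simulated data (sim is a hypothetical data frame with a 20% positive class), not on the Titanic data:

```r
library(rsample)

set.seed(123)
# Hypothetical imbalanced data: 800 "no" rows and 200 "yes" rows
sim <- data.frame(
  y = factor(rep(c("no", "yes"), times = c(800, 200))),
  x = rnorm(1000)
)

sim_split <- initial_split(sim, prop = 0.8, strata = y)

# Share of the "yes" class in each set
p_train <- mean(training(sim_split)$y == "yes")
p_test  <- mean(testing(sim_split)$y == "yes")

c(train = p_train, test = p_test)
```

With stratification the two proportions should be nearly identical. Without it they can drift apart, especially in small datasets.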
Logistic regression
Start with logistic regression. It is interpretable, fast to train, and serves as an excellent baseline against which more complex models can be compared. Define the specification with parsnip:
log_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
log_wf <- workflow() |>
  add_recipe(titanic_rec) |>
  add_model(log_spec)
log_fit <- fit(log_wf, titanic_train)
The workflow is now fit on the training data, so the next step is to predict on the test data. The type = "class" argument returns hard class predictions rather than class probabilities, giving you the single most likely class for each observation:
predictions <- predict(log_fit, titanic_test, type = "class")
Confusion matrix
A confusion matrix shows how your predictions compare to actual outcomes:
confusion <- predictions |>
  bind_cols(titanic_test |> select(Survived)) |>
  conf_mat(truth = Survived, estimate = .pred_class)
confusion
The output shows true negatives, false positives, false negatives, and true positives. Each row represents the predicted class and each column the actual class.
Extract accuracy directly:
# Store the result as acc so it does not mask yardstick's accuracy() function
acc <- predictions |>
  bind_cols(titanic_test |> select(Survived)) |>
  accuracy(truth = Survived, estimate = .pred_class)
acc
Decision trees
Decision trees split data based on feature values. They are easy to interpret (you can trace the path from root to leaf), but individual trees often overfit the training data. Create a tree specification, using cost_complexity to control how aggressively the tree is pruned:
tree_spec <- decision_tree(cost_complexity = 0.01) |>
  set_engine("rpart") |>
  set_mode("classification")
The cost_complexity parameter sets the pruning penalty: higher values prune more branches and produce simpler trees. To cap the depth directly, use the tree_depth argument.
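A tree specification is fit the same way as the logistic regression: put it in a workflow and call fit(). The sketch below is self-contained, so it uses two_class_dat from the modeldata package as stand-in data and a formula instead of a recipe; the names dat, tree_wf, tree_fit, and tree_preds are illustrative:

```r
library(tidymodels)

# Stand-in data: two_class_dat ships with the modeldata package
dat <- modeldata::two_class_dat

tree_spec <- decision_tree(cost_complexity = 0.01, tree_depth = 5) |>
  set_engine("rpart") |>
  set_mode("classification")

tree_wf <- workflow() |>
  add_formula(Class ~ A + B) |>
  add_model(tree_spec)

tree_fit <- fit(tree_wf, data = dat)

# Hard class predictions, one row per observation
tree_preds <- predict(tree_fit, dat, type = "class")
head(tree_preds)
```

On the Titanic data the same pattern should apply with add_recipe(titanic_rec) in place of add_formula().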
Random forests
Random forests combine many decision trees. Each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. The final prediction aggregates the votes of all trees:
rf_spec <- rand_forest(mtry = 3, trees = 200) |>
  set_engine("ranger") |>
  set_mode("classification")
rf_wf <- workflow() |>
  add_recipe(titanic_rec) |>
  add_model(rf_spec)
rf_fit <- fit(rf_wf, titanic_train)
The trees argument sets how many trees to grow. The mtry argument sets how many features each tree considers at each split.
Class probability predictions
For many applications you need probabilities instead of hard class predictions. Generate them:
prob_predictions <- predict(rf_fit, titanic_test, type = "prob")
head(prob_predictions)
The tibble contains one probability column per class. The column names combine the .pred_ prefix with the factor levels, here .pred_0 and .pred_1.
ROC curves
An ROC curve plots the true positive rate against the false positive rate at every threshold:
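# "1" (survived) is the second factor level, so mark it as the event.
# The result is stored as roc_data to avoid masking roc_curve().
roc_data <- prob_predictions |>
  bind_cols(titanic_test |> select(Survived)) |>
  roc_curve(truth = Survived, .pred_1, event_level = "second")
autoplot(roc_data)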
The curve shows how sensitivity and specificity change as you move the classification threshold. A perfect classifier hugs the top-left corner.
Calculate the area under the curve:
# event_level = "second" matches the .pred_1 column to the "1" class
auc <- prob_predictions |>
  bind_cols(titanic_test |> select(Survived)) |>
  roc_auc(truth = Survived, .pred_1, event_level = "second")
auc
An AUC of 0.5 means random guessing. A value of 1.0 means perfect classification.
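One detail worth checking is which factor level yardstick treats as the event. By default it is the first level, so the probability column you pass has to match. The sketch below illustrates this with two_class_example, a dataset that ships with yardstick (its truth column has levels Class1 and Class2); the object names are illustrative:

```r
library(yardstick)

# Default: the first level (Class1) is the event, so pass its column
auc_first <- roc_auc(two_class_example, truth, Class1)

# Treat the second level (Class2) as the event instead
auc_second <- roc_auc(two_class_example, truth, Class2, event_level = "second")

# Mismatch: Class2 probabilities scored against Class1 as the event
auc_wrong <- roc_auc(two_class_example, truth, Class2)

c(
  first  = auc_first$.estimate,
  second = auc_second$.estimate,
  wrong  = auc_wrong$.estimate
)
```

The first two values should agree, while the mismatched call gives an AUC below 0.5. An AUC well below 0.5 on your own data is often a sign of this kind of level mismatch rather than a bad model.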
Multiple metrics
Evaluate many metrics at once with metric_set(). Bundling accuracy, sensitivity, specificity, and F1 into a single call returns a tidy data frame with all four metrics side by side:
multi_metrics <- metric_set(accuracy, sensitivity, specificity, f_meas)
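# "1" (survived) is the second factor level, so treat it as the positive class
predictions |>
  bind_cols(titanic_test |> select(Survived)) |>
  multi_metrics(truth = Survived, estimate = .pred_class, event_level = "second")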
The sensitivity function calculates true positive rate. The specificity function calculates true negative rate. The f_meas function balances precision and recall.
Cross-validation
Use cross-validation to estimate how your model will perform on new data:
# Set a seed so the fold assignment is reproducible
set.seed(123)
folds <- vfold_cv(titanic_train, v = 5, strata = Survived)
rf_res <- rf_wf |>
  fit_resamples(folds)
collect_metrics(rf_res)
This evaluates the random forest on five different train/validation splits. The metrics show mean performance and standard error across folds.
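fit_resamples() also accepts a custom metric set and can keep the out-of-fold predictions. The following is a minimal, self-contained sketch that uses two_class_dat from modeldata and a logistic regression as stand-ins; the object names are illustrative:

```r
library(tidymodels)

# Stand-in data: two_class_dat ships with the modeldata package
dat <- modeldata::two_class_dat

set.seed(123)
folds <- vfold_cv(dat, v = 5, strata = Class)

log_wf <- workflow() |>
  add_formula(Class ~ A + B) |>
  add_model(logistic_reg() |> set_engine("glm"))

cv_res <- fit_resamples(
  log_wf,
  resamples = folds,
  # Mix hard-class and probability metrics in one set
  metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
  # Keep out-of-fold predictions for later inspection
  control = control_resamples(save_pred = TRUE)
)

cv_metrics <- collect_metrics(cv_res)
cv_metrics
```

With save_pred = TRUE, collect_predictions(cv_res) returns the held-out predictions for every fold, which is useful for drawing resampled ROC curves or confusion matrices.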
Handling class imbalance
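The Titanic data has more victims than survivors: about 62% of the passengers in the training file died. Other datasets have far more extreme imbalances. One option is class weights, which ranger accepts through its class.weights engine argument:
# Illustrative weights in factor-level order ("0", "1"); tune them for your data
rf_balanced <- rand_forest(mtry = 3, trees = 200) |>
  set_engine("ranger", class.weights = c(1, 1.6)) |>
  set_mode("classification")
The weights tell the model to penalize errors on the up-weighted minority class more heavily.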
Classification metrics and thresholds
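For binary classification, the default prediction threshold is 0.5: observations whose event probability is above 0.5 are classified as positive. probably::threshold_perf(predictions, truth, estimate, thresholds = seq(0, 1, 0.01)) evaluates metrics at each threshold. By default it reports sensitivity, specificity, and the J-index; pass metrics = metric_set(precision, recall, f_meas) to get precision, recall, and F1 instead. Plotting the result with ggplot2 visualizes the tradeoff.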
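As a rough sketch of a threshold scan, the example below runs threshold_perf() on two_class_example from yardstick. It assumes a recent version of the probably package (one that supports the metrics argument) is installed; the names thresholds and perf are illustrative:

```r
library(yardstick)
library(probably)

thresholds <- seq(0.1, 0.9, by = 0.1)

# Class1 is the first factor level of truth, so it is the event by default
perf <- threshold_perf(
  two_class_example,
  truth = truth,
  estimate = Class1,
  thresholds = thresholds,
  metrics = metric_set(precision, recall, f_meas)
)

perf
```

The result is a long tibble with one row per threshold and metric, so it can be passed straight to ggplot2 with .threshold on the x-axis and .estimate on the y-axis.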
ROC curves (yardstick::roc_curve(predictions, truth, .pred_positive)) and precision-recall curves (yardstick::pr_curve()) are the standard tools for evaluating classifiers without committing to a threshold. ROC AUC is relatively insensitive to class imbalance; PR AUC is more informative when the positive class is rare.
For imbalanced classes, consider case weights (create a weight column with hardhat::importance_weights() and register it in the workflow with add_case_weights()) or resampling steps from the themis package: step_downsample() reduces the majority class, and step_smote() oversamples the minority class with synthetic examples.
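The effect of downsampling is easiest to see on a small example. This sketch assumes the themis package is installed and uses simulated data (imb is a hypothetical data frame with a 90/10 class split):

```r
library(recipes)
library(themis)

set.seed(123)
# Hypothetical imbalanced data: 90 "no" rows and 10 "yes" rows
imb <- data.frame(
  y = factor(rep(c("no", "yes"), times = c(90, 10))),
  x = rnorm(100)
)

# under_ratio = 1 shrinks the majority class to the minority class size
down_rec <- recipe(y ~ x, data = imb) |>
  step_downsample(y, under_ratio = 1, seed = 123)

# bake(new_data = NULL) returns the resampled training data
balanced <- down_rec |>
  prep() |>
  bake(new_data = NULL)

table(balanced$y)
```

Because the step is skipped when baking new data, test rows and assessment folds keep their original class balance.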
Multi-class classification
tidymodels handles multi-class classification through the same workflow as binary classification. metric_set(accuracy, kap, bal_accuracy) collects metrics that are defined for more than two classes. For per-class detail, conf_mat(predictions, truth, estimate) computes the confusion matrix, and calling autoplot(cm, type = "heatmap") on the stored result visualizes it as a heatmap. conf_mat_resampled() from tune averages confusion matrices across cross-validation folds; it needs resampling results that were fit with control_resamples(save_pred = TRUE).
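A minimal multi-class sketch, using hpc_cv (a four-class example dataset that ships with yardstick, with the true class in obs and the predicted class in pred); the object names are illustrative:

```r
library(yardstick)
library(ggplot2)

# Metrics that generalize beyond two classes
multi_metrics <- metric_set(accuracy, kap, bal_accuracy)

mc_res <- multi_metrics(hpc_cv, truth = obs, estimate = pred)
mc_res

# Confusion matrix: rows are predictions, columns are the truth
cm <- conf_mat(hpc_cv, truth = obs, estimate = pred)
autoplot(cm, type = "heatmap")
```

The .estimator column in mc_res shows how each metric was generalized to the multi-class case, for example by macro averaging.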
Preprocessing with recipes
Classification models typically require preprocessing: encoding categorical variables, imputing missing values, and scaling numeric features. recipes handles this in a declarative pipeline. recipe(outcome ~ ., data = train) starts a recipe, and step_* functions add preprocessing steps. step_normalize() centers and scales numeric predictors. step_dummy() converts factors to indicator columns (pass one_hot = TRUE for a full one-hot encoding). step_impute_mean() fills missing values with column means.
The recipe learns parameters (means, SDs, factor levels) from the training data only; it never sees the test set during fitting. prep() calculates those parameters; bake() applies them to new data. Bundling the recipe in a workflow() ensures the same preprocessing is applied consistently during training, cross-validation, and prediction.
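The split between prep() and bake() can be illustrated with a tiny example. The data frames train and test below are made up for illustration:

```r
library(recipes)

# Hypothetical training and test data
train <- data.frame(y = factor(c("a", "b", "a")), x = c(2, 4, 6))
test  <- data.frame(y = factor(c("b", "a")),      x = c(4, 8))

norm_rec <- recipe(y ~ x, data = train) |>
  step_normalize(all_numeric_predictors())

# prep() learns mean = 4 and sd = 2 from the training rows only
norm_prep <- prep(norm_rec, training = train)

# bake() reuses those training statistics on the new data
test_baked <- bake(norm_prep, new_data = test)
test_baked$x
# (4 - 4) / 2 = 0 and (8 - 4) / 2 = 2
```

The test values are scaled with the training mean and standard deviation, not their own, which is what prevents information from the test set leaking into preprocessing.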
Evaluating multiple models
A tidymodels classification workflow makes it easy to compare multiple algorithms. Define several model specifications, combine them with one or more preprocessors using workflow_set(), and run workflow_map("fit_resamples", resamples = folds) to resample all of them in one call. Compare results with collect_metrics(), rank_results(), and autoplot(). This approach avoids the manual bookkeeping of fitting models separately and reduces the risk of inconsistent preprocessing between model comparisons.
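A comparison along these lines might look like the sketch below. It uses two_class_dat from modeldata as stand-in data and a formula as the only preprocessor; the IDs basic, logit, and tree are arbitrary labels chosen for illustration:

```r
library(tidymodels)

# Stand-in data: two_class_dat ships with the modeldata package
dat <- modeldata::two_class_dat

set.seed(123)
folds <- vfold_cv(dat, v = 5, strata = Class)

# Cross every preprocessor with every model specification
wf_set <- workflow_set(
  preproc = list(basic = Class ~ A + B),
  models = list(
    logit = logistic_reg() |> set_engine("glm"),
    tree  = decision_tree() |> set_engine("rpart") |> set_mode("classification")
  )
)

# Run fit_resamples() on each workflow with the same folds
wf_res <- workflow_map(wf_set, "fit_resamples", resamples = folds, seed = 123)

collect_metrics(wf_res)
rank_results(wf_res, rank_metric = "roc_auc")
```

Because every workflow is evaluated on the same folds, differences in the ranked metrics reflect the models rather than the resampling.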
Classification vs regression
Classification predicts a discrete class label; regression predicts a continuous value. The modeling workflow in tidymodels is similar for both: the same split, recipe, workflow, and tuning infrastructure applies. What changes is the mode of the model specification (set_mode("classification") instead of set_mode("regression")) and the performance metrics. Classification uses accuracy, ROC AUC, precision, recall, and F1 score. Regression uses RMSE, MAE, and R-squared.
The outcome variable for classification must be a factor in R. tidymodels checks this and raises an informative error if the outcome is character or numeric. Converting the outcome to a factor before the training/test split ensures consistent levels across all splits and folds. The factor levels define the class names that appear in predictions and metrics.
Probability vs class predictions
Classification models can produce either class label predictions or class probability predictions. Class predictions are the most likely class for each observation. Probability predictions are the estimated probability of each class. Probability predictions are more informative because they carry the confidence of the prediction, not just the outcome. Thresholding probabilities at 0.5 to produce class predictions loses this information.
Setting type = "prob" in the predict call returns a data frame with one column per class, where each row sums to 1. These probability columns are the inputs to ROC AUC computation and probability calibration checks. For imbalanced classes, you may want to move the decision threshold away from 0.5, for example lowering it for the minority class to increase recall at the cost of precision. That adjustment requires the probability predictions.
Imbalanced classes
Class imbalance, where one class is much rarer than another, is common in classification problems: fraud detection, disease diagnosis, anomaly detection. A model that predicts the majority class for every observation achieves high accuracy but has no predictive value for the rare class. Addressing imbalance requires resampling strategies, adjusted class weights, or both.
The themis package extends tidymodels with over- and undersampling recipe steps. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority class examples. Downsampling randomly removes majority class examples. These steps go into the recipe. When the recipe is used inside a workflow, they are applied only to the training portion of each resample and never to the data being predicted.
Summary
You now know how to build classification models with tidymodels. The key steps are:
- Prepare data with a recipe that handles missing values and categorical variables
- Split data with initial_split(), optionally using stratification
- Define a model with parsnip: try logistic regression, decision trees, random forests, or gradient boosting
- Create a workflow combining recipe and model
- Evaluate with predict() on test data or fit_resamples() for cross-validation
- Measure performance with accuracy, confusion matrices, ROC curves, and AUC
Classification with tidymodels follows the same patterns as regression. The main differences are the metrics you use and the attention you pay to class imbalance. Apply these techniques to your own classification problems.
See also
- dplyr Data Wrangling — Data preprocessing techniques
- purrr Functional Programming — Functional iteration patterns
- R Memory Management — Optimizing large datasets
last_fit(workflow, split) fits the final model on the full training set and evaluates it on the test set in one call, giving final performance estimates.
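A minimal sketch of that final step, again using two_class_dat from modeldata and a logistic regression as stand-ins (the names dat_split, final_wf, and final_res are illustrative):

```r
library(tidymodels)

# Stand-in data: two_class_dat ships with the modeldata package
dat <- modeldata::two_class_dat

set.seed(123)
dat_split <- initial_split(dat, prop = 0.8, strata = Class)

final_wf <- workflow() |>
  add_formula(Class ~ A + B) |>
  add_model(logistic_reg() |> set_engine("glm"))

# Fit on the training portion, evaluate once on the test portion
final_res <- last_fit(final_wf, split = dat_split)

collect_metrics(final_res)
head(collect_predictions(final_res))
```

Run last_fit() only once, after model selection is finished, so the test set remains an honest estimate of performance on new data.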