Machine Learning in R with tidymodels: An Introduction
tidymodels is a collection of R packages for building and evaluating machine learning models. It brings the tidyverse philosophy to model fitting, giving you a consistent way to approach machine learning in R. If you have used dplyr or ggplot2, tidymodels will feel familiar. The framework handles data splitting, preprocessing, model training, and performance assessment with a unified interface.
Why tidymodels?
R has many modeling packages. Each has its own syntax. The caret package unified some of them, but tidymodels goes further. It separates the model specification from the data preprocessing and from the evaluation. This separation makes your code easier to read and maintain.
The core packages are:
- parsnip: defines models without touching the data
- recipes: handles preprocessing steps
- rsample: creates train/test splits and resamples
- yardstick: measures model performance
- workflows: combines preprocessing and models
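You install the core packages with one command. Engine packages are installed separately; the examples below use ranger for random forests:
install.packages(c("tidymodels", "ranger"))
library(tidymodels)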
Your first model
You will use the iris dataset. It contains measurements for 150 flowers across three species. The goal is to predict Species from the other columns.
Start by splitting the data into training and testing sets. initial_split() reserves a portion of the data for final evaluation, and training() / testing() extract the respective subsets:
set.seed(42)
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
This puts 80% of the data in training and keeps 20% for testing. The seed ensures reproducibility. Separating test data early prevents information leakage: your model must never see the test set during training or hyperparameter tuning, or the performance estimate becomes unrealistically optimistic.
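With a small dataset, a purely random split can leave the classes unevenly represented in the test set. As a minimal sketch of one way to guard against that, the strata argument of initial_split() samples within each species separately. The object names strat_split, strat_train, and strat_test are illustrative and not part of the original walkthrough.

```r
library(rsample)

set.seed(42)

# strata = Species asks rsample to draw the 80% sample within each species
strat_split <- initial_split(iris, prop = 0.8, strata = Species)
strat_train <- training(strat_split)
strat_test  <- testing(strat_split)

# Class counts in each subset
table(strat_train$Species)
table(strat_test$Species)
```

Comparing the two tables shows whether every species is represented in both subsets.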
Next, define a model specification using parsnip: rand_forest() declares the model type, set_engine() selects the implementation package, and set_mode() tells parsnip this is a classification task:
rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")
This does not fit the model yet. It only records the algorithm, the engine, and the mode; nothing is computed until the specification is fitted to data.
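A specification can also carry hyperparameters. The sketch below passes trees and min_n, which are parsnip's engine-independent argument names for rand_forest(); the values 500 and 5 are arbitrary illustrations, not tuned recommendations, and rf_spec_custom is a hypothetical name.

```r
library(parsnip)

# trees and min_n are translated to the engine's own argument names at fit time
rf_spec_custom <- rand_forest(trees = 500, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("classification")

# Printing shows the specification only; no data has been seen yet
rf_spec_custom
```

Because the specification is independent of the data, the same object can be reused in several workflows.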
Now create a workflow that combines a formula and the model specification. The workflow object bundles the model and formula together, ensuring they stay linked throughout training, prediction, and evaluation:
iris_wf <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(rf_spec)
Fit the model on the training data with fit(). This function triggers the full training pipeline: it prepares the formula, passes data to the engine through parsnip’s translation layer, and returns a fitted model object ready for prediction. Until this call, the workflow is a blueprint; fit() is the moment the blueprint becomes a trained artifact:
iris_fit <- fit(iris_wf, iris_train)
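Once fit() has run, the trained workflow can be inspected. As a minimal sketch, assuming the ranger package is installed, extract_fit_engine() returns the object that the engine package itself produced, which is useful when you need engine-specific details. The name engine_fit is illustrative.

```r
library(tidymodels)

set.seed(42)
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)

rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

iris_fit <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(rf_spec) |>
  fit(data = iris_train)

# The underlying ranger object, exactly as the engine package returned it
engine_fit <- extract_fit_engine(iris_fit)

# Number of trees the forest was grown with
engine_fit$num.trees
```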
With the model now trained, you can generate predictions on unseen data. Passing the fitted workflow and the held-out test set to predict() applies the same formula and preprocessing to the new observations. predict() uses only the predictor columns, so the Species values in the test set play no part in the predictions. The output is a tidy tibble with a .pred_class column of predicted species:
predictions <- predict(iris_fit, iris_test)
Evaluating performance
Raw predictions are only useful when compared against ground truth. To measure how often the model’s predicted species matches the actual species, bind the predictions to the test labels and pass both columns to yardstick’s accuracy(). The correct column is not used by accuracy(); it flags each row so you can inspect individual misses:
results <- iris_test |>
  select(Species) |>
  bind_cols(predict(iris_fit, iris_test)) |>
  mutate(correct = .pred_class == Species)
accuracy(results, truth = Species, estimate = .pred_class)
The output shows the accuracy as a proportion. With only 30 test rows, each misclassification moves accuracy by about 0.033, so the exact value depends on the seed. Values of roughly 0.9 or higher are typical, because the three iris species are easy to separate.
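Accuracy is a single number, so it does not show which species get confused with each other. The sketch below uses conf_mat() from yardstick to tabulate predicted against actual classes; the name cm is illustrative, and the setup repeats the earlier steps so the block runs on its own.

```r
library(tidymodels)

set.seed(42)
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

iris_fit <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(rf_spec) |>
  fit(data = iris_train)

results <- iris_test |>
  select(Species) |>
  bind_cols(predict(iris_fit, iris_test))

# Cross-tabulate predicted classes against the true classes
cm <- conf_mat(results, truth = Species, estimate = .pred_class)
cm
```

Off-diagonal cells are the misclassifications.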
You can also generate predictions with probability scores by specifying type = "prob". This returns a tibble with one column per class, where each row sums to 1:
predict(iris_fit, iris_test, type = "prob")
Pass these probability columns to roc_auc() from yardstick, together with the true labels, to measure how well the model ranks the classes.
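As a minimal sketch of that step, the probability columns are bound to the true labels and handed to roc_auc(). The column names .pred_setosa, .pred_versicolor, and .pred_virginica follow parsnip's .pred_<level> convention for this dataset; the names probs and auc are illustrative.

```r
library(tidymodels)

set.seed(42)
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

iris_fit <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(rf_spec) |>
  fit(data = iris_train)

# One probability column per class
probs <- predict(iris_fit, iris_test, type = "prob")

# For a multiclass outcome, pass every class probability column
auc <- iris_test |>
  select(Species) |>
  bind_cols(probs) |>
  roc_auc(truth = Species, .pred_setosa, .pred_versicolor, .pred_virginica)
auc
```

The .estimator column of the result reports which multiclass averaging method yardstick applied.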
Preprocessing with recipes
Real data needs cleaning before modeling. Missing values, transformations, and encoding categorical variables are common tasks. The recipes package handles these with a standardized pipeline.
Consider a modified iris dataset with missing values:
iris_missing <- iris
iris_missing$Sepal.Length[1:10] <- NA
Creating intentional missing values simulates the kind of real-world data you will encounter. A recipe specifies the preprocessing steps declaratively: you list what to do without executing anything. This deferred execution means the same recipe can be applied to training and test data consistently:
iris_rec <- recipe(Species ~ ., data = iris_missing) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())
This recipe first replaces NA values in numeric predictors with the column mean, then centers and scales each numeric predictor to have mean zero and standard deviation one. Declaring steps does not transform any data. Calling prep() estimates the means and standard deviations from the data supplied to recipe(), which should be the training set in a real project, and bake() applies those learned parameters to produce the cleaned dataset:
iris_prep <- prep(iris_rec)
iris_cleaned <- bake(iris_prep, new_data = NULL)
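The prep() and bake() pair matters most when training and new data differ. The sketch below assumes a hypothetical split of the modified data (miss_split, miss_train, and miss_test are illustrative names), estimates the imputation and scaling parameters on the training rows only, and then applies them to the test rows.

```r
library(recipes)
library(rsample)

iris_missing <- iris
iris_missing$Sepal.Length[1:10] <- NA

set.seed(42)
miss_split <- initial_split(iris_missing, prop = 0.8)
miss_train <- training(miss_split)
miss_test  <- testing(miss_split)

rec <- recipe(Species ~ ., data = miss_train) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

# Estimate means and standard deviations from the training rows only
rec_prep <- prep(rec, training = miss_train)

# Apply the learned parameters to each subset
baked_train <- bake(rec_prep, new_data = NULL)
baked_test  <- bake(rec_prep, new_data = miss_test)

# Remaining missing values per column in the processed test set
colSums(is.na(baked_test))
```

The test rows are scaled with the training means and standard deviations, so no information from the test set enters the preprocessing.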
bake(new_data = NULL) returns the preprocessed version of the data the recipe was prepped on. To transform new data, pass it to the new_data argument instead. For production use, embedding the recipe inside a workflow is the recommended approach because it guarantees that preprocessing and modeling stay paired and cannot be accidentally mismatched:
iris_wf_rec <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(rf_spec)
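Fitting and predicting work exactly as before: pass iris_wf_rec to fit(), then pass the fitted result to predict(). The workflow now preps the recipe on the data given to fit() and bakes new data automatically.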
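A short end-to-end sketch makes this concrete. It assumes a hypothetical split of the data with missing values (miss_train and miss_test are illustrative names) and that ranger is installed; the workflow imputes and normalizes internally during both fit() and predict().

```r
library(tidymodels)

iris_missing <- iris
iris_missing$Sepal.Length[1:10] <- NA

set.seed(42)
miss_split <- initial_split(iris_missing, prop = 0.8)
miss_train <- training(miss_split)
miss_test  <- testing(miss_split)

rec <- recipe(Species ~ ., data = miss_train) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

# fit() preps the recipe on miss_train, then trains the model on the result
wf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec) |>
  fit(data = miss_train)

# predict() bakes miss_test with the learned parameters before predicting
preds <- predict(wf_fit, miss_test)
head(preds)
```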
Cross-validation
A single train/test split gives one accuracy estimate. Cross-validation gives a more reliable picture. It splits the training data into folds, trains on some folds, and validates on the rest. This process repeats across all folds.
Create a 5-fold cross-validation resample:
iris_folds <- vfold_cv(iris_train, v = 5)
Creating the resamples object splits training data into 5 disjoint sets, each serving as a validation fold in turn. fit_resamples() then trains the model 5 times, each time holding out one fold for evaluation, and collects the performance metrics across all folds to give you mean accuracy with a standard error:
iris_res <- iris_wf_rec |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))
Cross-validation produces metrics for each fold, but you want a single summary. collect_metrics() averages the per-fold results and calculates the standard error, giving you both the expected accuracy and a measure of how much it varies across different data splits:
collect_metrics(iris_res)
This shows the mean accuracy and its standard error across the five folds. A lower standard error means your accuracy estimate is more stable and the model generalizes consistently. With cross-validation confirming the random forest works, you may now ask whether a simpler model would do just as well.
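When the standard error looks large, the per-fold values show where the variation comes from. As a minimal sketch, collect_metrics() accepts summarize = FALSE to return one row per fold instead of the average. A formula workflow is used here to keep the block short, and the name per_fold is illustrative.

```r
library(tidymodels)

set.seed(42)
iris_train <- training(initial_split(iris, prop = 0.8))
iris_folds <- vfold_cv(iris_train, v = 5)

rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

iris_res <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(rf_spec) |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))

# One row per fold rather than a single averaged row
per_fold <- collect_metrics(iris_res, summarize = FALSE)
per_fold
```

Each validation fold holds only 24 rows here, so a single misclassification shifts that fold's accuracy by about 0.04.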
Comparing models
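You might want to compare the random forest against a simpler linear model. Because Species has three classes, logistic_reg(), which handles binary outcomes only, does not apply; multinom_reg() is its multiclass counterpart (multinomial logistic regression). Define both specifications:
rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")
log_spec <- multinom_reg() |>
  set_engine("nnet") |>
  set_mode("classification")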
Both model specifications look similar because parsnip abstracts away engine-specific syntax. fit_resamples() applies each model to every cross-validation fold and collects accuracy scores, enabling a fair comparison since both models see identical data splits and use the same recipe. This side-by-side workflow is the standard pattern for model selection in tidymodels:
# iris_wf_rec already contains a model, so build a fresh workflow per specification
rf_res <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(rf_spec) |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))
log_res <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(log_spec) |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))
The fitted resample objects contain per-fold metrics for both models. collect_metrics() extracts them as tidy data frames so you can compare the random forest and logistic regression side by side, checking both mean accuracy and standard error to determine whether the extra complexity of a random forest is justified:
collect_metrics(rf_res)
collect_metrics(log_res)
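On iris the two models often perform similarly, so read the mean accuracies in light of their standard errors: if the difference is within about one standard error, the simpler model is a reasonable choice. Either way, the comparison helps you justify your model choice with data.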
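The two collect_metrics() tables can be stacked into one for easier reading. The sketch below is one way to do that, under stated assumptions: it uses multinom_reg() with the nnet engine as the linear model because Species has three classes, a recipe with normalization only, and a hypothetical helper named fit_one.

```r
library(tidymodels)

set.seed(42)
iris_train <- training(initial_split(iris, prop = 0.8))
iris_folds <- vfold_cv(iris_train, v = 5)

rec <- recipe(Species ~ ., data = iris_train) |>
  step_normalize(all_numeric_predictors())

# Named list of specifications; the names become the model labels
specs <- list(
  random_forest = rand_forest() |>
    set_engine("ranger") |>
    set_mode("classification"),
  multinomial = multinom_reg() |>
    set_engine("nnet") |>
    set_mode("classification")
)

# Hypothetical helper: resample one specification on the shared folds
fit_one <- function(spec) {
  workflow() |>
    add_recipe(rec) |>
    add_model(spec) |>
    fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy)) |>
    collect_metrics()
}

# One row per model, labelled by the list names
comparison <- purrr::map(specs, fit_one) |>
  bind_rows(.id = "model")
comparison
```

Both models are evaluated on the same folds, so the mean and std_err columns are directly comparable.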
See also
- Model Deployment with Vetiver in R: serve tidymodels workflows as production APIs
- Testing R Code with testthat: validate model pipelines with automated tests
- Reproducible Environments with renv: lock package versions for model reproducibility