Introduction to tidymodels

· 4 min read · Updated March 10, 2026 · intermediate
tidymodels machine-learning parsnip recipes rsample yardstick

tidymodels is a collection of R packages for building and evaluating machine learning models. It brings the tidyverse philosophy to model fitting. If you have used dplyr or ggplot2, tidymodels will feel familiar. The framework handles data splitting, preprocessing, model training, and performance assessment with a consistent interface.

Why tidymodels?

R has many modeling packages. Each has its own syntax. The caret package unified some of them, but tidymodels goes further. It separates the model specification from the data preprocessing and from the evaluation. This separation makes your code easier to read and maintain.

The core packages are:

  • parsnip — defines models without touching the data
  • recipes — handles preprocessing steps
  • rsample — creates train/test splits and resamples
  • yardstick — measures model performance
  • workflows — combines preprocessing and models

You install everything with one command:

install.packages("tidymodels")
library(tidymodels)

Your First Model

You will use the iris dataset. It contains measurements for 150 flowers across three species. The goal is to predict Species from the other columns.

Start by splitting the data into training and testing sets:

set.seed(42)
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)

This puts 80% of the data in training and keeps 20% for testing. The seed ensures reproducibility.
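A plain random split can leave the three species unevenly represented. If that matters, initial_split() also accepts a strata argument that keeps class proportions balanced across the two sets; for example:

```r
# Stratified split: each set keeps the species in roughly equal proportion
set.seed(42)
iris_split_strat <- initial_split(iris, prop = 0.8, strata = Species)
table(training(iris_split_strat)$Species)
```

With 150 balanced rows the difference is small, but stratification becomes important on imbalanced data.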

Next, define a model. You will use a random forest classifier:

rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

This does not fit the model yet. It only specifies the algorithm and the engine. The set_engine() call tells parsnip which package (here, ranger) performs the actual computation.
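Because the specification is separate from the computation, you can swap engines without touching the rest of the pipeline. As a sketch, show_engines() lists the engines parsnip knows for a model type, and the same rand_forest() spec can target the randomForest package instead of ranger (the trees argument here is just an illustrative choice):

```r
# List the engines parsnip supports for random forests
show_engines("rand_forest")

# Same model type, different engine (requires the randomForest package)
rf_spec_alt <- rand_forest(trees = 500) |>
  set_engine("randomForest") |>
  set_mode("classification")
```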

Now create a workflow that combines a formula and the model specification:

iris_wf <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(rf_spec)

Fit the model on the training data:

iris_fit <- fit(iris_wf, iris_train)

The model is trained. You can now predict on new data:

predictions <- predict(iris_fit, iris_test)

Evaluating Performance

Accuracy is a simple metric for classification. Compare the predictions to the actual species. The predictions object from above already holds the .pred_class column, so there is no need to call predict() a second time:

results <- iris_test |>
  select(Species) |>
  bind_cols(predictions)

accuracy(results, truth = Species, estimate = .pred_class)

The output shows the accuracy as a proportion. For most random seeds, you will see accuracy above 0.95. The random forest separates the three iris species with minimal error.
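To see where the few errors fall, a confusion matrix from yardstick complements the single accuracy number; using the results table from above:

```r
# Predictions down the rows, true species across the columns
conf_mat(results, truth = Species, estimate = .pred_class)
```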

You can also generate predictions with probability scores:

predict(iris_fit, iris_test, type = "prob")

This returns the probability for each class. Use it with roc_auc() from yardstick to measure how well the model ranks the classes.
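As a sketch of that workflow, bind the class probabilities to the true labels and pass the three .pred_* columns to roc_auc(), which handles the multiclass case:

```r
prob_results <- iris_test |>
  select(Species) |>
  bind_cols(predict(iris_fit, iris_test, type = "prob"))

# Multiclass ROC AUC computed from the per-class probability columns
roc_auc(prob_results, truth = Species,
        .pred_setosa, .pred_versicolor, .pred_virginica)
```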

Preprocessing with Recipes

Real data needs cleaning before modeling. Missing values, transformations, and encoding categorical variables are common tasks. The recipes package handles these with a standardized pipeline.

Consider a modified iris dataset with missing values:

iris_missing <- iris
iris_missing$Sepal.Length[1:10] <- NA

Create a recipe that imputes the missing values and scales the numeric columns:

iris_rec <- recipe(Species ~ ., data = iris_missing) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

This recipe first replaces NA values in numeric columns with the column mean. Then it centers and scales each numeric predictor to have mean zero and standard deviation one.
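To inspect what a recipe will do before running it, tidy() summarizes the steps in order:

```r
# One row per step: the operation, whether it has been trained, and its id
tidy(iris_rec)
```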

Apply the recipe to your data using prep() and bake():

iris_prep <- prep(iris_rec)
iris_cleaned <- bake(iris_prep, new_data = NULL)

The bake(new_data = NULL) step applies the transformations to the original training data. To transform new data, pass it to the new_data argument instead.
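For example, the prepped recipe can transform the iris_test split from earlier. The imputation means and scaling parameters come from the data the recipe was prepped on, not from the new rows:

```r
iris_test_baked <- bake(iris_prep, new_data = iris_test)
head(iris_test_baked)
```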

Add the recipe to your workflow instead of the formula:

iris_wf_rec <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(rf_spec)

The rest of the code stays the same. The workflow now handles preprocessing automatically.
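As a quick check, the recipe workflow fits and predicts exactly like the formula version did, assuming the iris_train and iris_test splits from earlier:

```r
iris_fit_rec <- fit(iris_wf_rec, iris_train)
predict(iris_fit_rec, iris_test)
```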

Cross-Validation

A single train/test split gives one accuracy estimate. Cross-validation gives a more reliable picture. It splits the training data into folds, trains on some folds, and validates on the rest. This process repeats across all folds.

Create a 5-fold cross-validation resample:

iris_folds <- vfold_cv(iris_train, v = 5)
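Each row of the resample object is one fold. If you want to look inside, rsample's analysis() and assessment() functions pull the training and validation portions of a single fold:

```r
iris_folds

# The analysis set of the first fold (the portion used for training there)
analysis(iris_folds$splits[[1]])
```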

Fit the model on each fold using fit_resamples():

iris_res <- iris_wf_rec |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))

Collect the results:

collect_metrics(iris_res)

This shows the mean accuracy and its standard error across the five folds. A lower standard error means your accuracy estimate is more stable.
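If you want the per-fold numbers rather than the aggregate, collect_metrics() takes a summarize argument:

```r
# One accuracy value per fold instead of the mean and standard error
collect_metrics(iris_res, summarize = FALSE)
```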

Comparing Models

You might want to compare the random forest against a regression baseline. Species has three levels, so logistic_reg() with the "glm" engine is not the right tool here: glm logistic regression handles only two classes. Use multinom_reg() instead. The rf_spec from earlier already defines the random forest, so only the new specification is needed:

multinom_spec <- multinom_reg() |>
  set_engine("nnet") |>
  set_mode("classification")

Fit both models using fit_resamples(). The iris_wf_rec workflow already contains rf_spec, so resample it directly; calling add_model() on a workflow that already has a model raises an error:

rf_res <- iris_wf_rec |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))

multinom_res <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(multinom_spec) |>
  fit_resamples(resamples = iris_folds, metrics = metric_set(accuracy))

Compare the results:

collect_metrics(rf_res)
collect_metrics(multinom_res)

Both models do well on this dataset, and the resampled metrics let you justify your model choice with data rather than intuition.
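When the number of candidates grows, the workflowsets package (also part of tidymodels) can resample many preprocessing/model combinations in one call. A sketch, with a hypothetical third candidate added purely for illustration:

```r
# A decision tree as an extra candidate, just to show the pattern
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

# Cross every recipe with every model specification
iris_set <- workflow_set(
  preproc = list(rec = iris_rec),
  models = list(forest = rf_spec, tree = tree_spec)
)

# Run fit_resamples() over each workflow in the set
iris_set_res <- workflow_map(
  iris_set, "fit_resamples",
  resamples = iris_folds, metrics = metric_set(accuracy)
)

# Rank the candidates by their resampled accuracy
rank_results(iris_set_res)
```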

Summary

tidymodels provides a unified interface for machine learning in R. You define what you want to do, not how each package implements it. The key steps are:

  1. Split your data with initial_split()
  2. Define a model with parsnip
  3. Create preprocessing recipes
  4. Combine them in a workflow
  5. Fit and evaluate with fit() or fit_resamples()
  6. Measure performance with yardstick

The framework scales from simple linear regression to complex ensembles. It integrates with the tidyverse, so you can pipe operations together. As your models grow more sophisticated, tidymodels keeps your code organized.