Tidymodels Regression: Build, Tune, and Evaluate Models in R

March 7, 2026 · 8 min read · Updated May 28, 2026 · intermediate

tidymodels · regression · machine-learning · modeling · predictive

Tidymodels regression gives R users a unified framework for building predictive models. Whether you are forecasting house prices, estimating customer lifetime value, or modeling sales trends, the tidymodels ecosystem (parsnip for model specification, recipes for preprocessing, workflows for pipeline assembly, and yardstick for evaluation) turns regression from a one-off script into a reproducible process. This tutorial walks through the complete tidymodels regression workflow: splitting data, specifying a model, building a preprocessing recipe, assembling a workflow, fitting the model, evaluating performance with cross-validation, and tuning hyperparameters for better results.

Prerequisites

Before starting this tutorial, make sure you have the following packages installed:

install.packages(c("tidymodels", "tidyverse", "modeldata", "glmnet"))

The install.packages() call fetches everything you need in one go: the tidymodels metapackage bundles parsnip, recipes, workflows, tune, and yardstick into a single install target, tidyverse provides the data manipulation tools (dplyr, ggplot2, tidyr) you will use alongside your models, and glmnet supplies the engine for the regularized models later in the tutorial. Once installed, loading the libraries makes the modeling and data-wrangling functions available in your current R session.

library(tidymodels)
library(tidyverse)
library(modeldata)

Attaching tidymodels loads parsnip (model specification), recipes (preprocessing), workflows (pipelines), tune (hyperparameter tuning), and yardstick (metrics), along with rsample (data splitting and resampling), dials (tuning grids), and broom (tidy model summaries), all of which appear later in this tutorial.

Understanding the data

For this tutorial, we’ll use the ames dataset from the modeldata package. This dataset contains information about houses in Ames, Iowa, with the goal of predicting sale price (Sale_Price).

data(ames)
glimpse(ames)

The dataset contains 2,930 observations and 74 variables. The target variable is Sale_Price, and we’ll use various features like the number of rooms, square footage, location, and quality ratings to predict it.

Let’s examine the distribution of the target variable:

ames %>%
  ggplot(aes(x = Sale_Price)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +
  labs(title = "Distribution of House Sale Prices",
       x = "Sale Price (log scale)") +
  theme_minimal()

Raw sale prices are right-skewed, which is why the histogram uses a log-scaled x-axis. The same skew suggests log-transforming the target variable for modeling.

Building your first regression model

Step 1: split the data

The first step in any modeling project is to split your data into training and testing sets. The training set is used to build and tune the model, while the testing set provides an unbiased estimate of model performance on new data.

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

We use stratified sampling based on Sale_Price to ensure that the distribution of house prices is similar in both sets.
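To check what stratification buys you, it can help to compare the outcome distribution in the two sets. The sketch below uses a simulated right-skewed outcome rather than ames; the data frame toy and its price column are hypothetical stand-ins for illustration.

```r
library(rsample)

set.seed(123)
# Hypothetical right-skewed outcome standing in for Sale_Price
toy <- data.frame(price = rlnorm(1000, meanlog = 12, sdlog = 0.4))

# Same call pattern as the ames split: 80% training, stratified on the outcome
toy_split <- initial_split(toy, prop = 0.8, strata = price)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row lands in exactly one of the two sets
c(train = nrow(toy_train), test = nrow(toy_test))

# With stratification, the quartiles of the outcome should be close in both sets
rbind(train = quantile(toy_train$price),
      test  = quantile(toy_test$price))
```

If the two rows of quantiles differ widely on your own data, the test set may not be representative of the training set.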

Step 2: define the model

In tidymodels, we use the parsnip package to specify models. For linear regression, we use linear_reg() with the lm engine:

lm_model <- linear_reg() %>%
  set_engine("lm")

This creates a linear regression model specification. The model hasn’t been fit yet; we have just defined what type of model we want.

Step 3: create a recipe

Recipes define the preprocessing steps for your data. This includes handling missing values, encoding categorical variables, scaling numeric features, and more.

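# Log-transform the outcome outside the recipe so that training,
# resampling, and test-set evaluation all use the same scale
ames_train <- ames_train %>% mutate(Sale_Price = log(Sale_Price))
ames_test <- ames_test %>% mutate(Sale_Price = log(Sale_Price))

lm_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood + Overall_Qual,
                    data = ames_train) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())

Let’s break down what’s happening:

mutate(Sale_Price = log(Sale_Price)) log-transforms the sale price (remember, prices are right-skewed). Doing this inside the recipe with step_log(Sale_Price, skip = TRUE) would skip the step at prediction time, so predictions would come back on the log scale while the held-out prices stayed in dollars, and the resulting metrics would be meaningless
recipe(Sale_Price ~ ...) declares the outcome and the predictors
step_other() groups neighborhoods with fewer than 30 training rows into an “other” category
step_dummy() converts categorical predictors to numeric dummy variables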
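To see what steps like these produce, you can prep() a recipe and bake() it on a small data set. The sketch below uses a hypothetical toy data frame (the columns price, area, and hood are invented for illustration and are not part of ames), with two common neighborhoods and two rare ones.

```r
library(recipes)

set.seed(1)
# Hypothetical toy data: levels A and B are common, C and D are rare
toy <- data.frame(
  price = rnorm(100, mean = 12, sd = 0.3),
  area  = rnorm(100, mean = 1500, sd = 300),
  hood  = factor(rep(c("A", "B", "C", "D"), times = c(50, 40, 5, 5)))
)

toy_recipe <- recipe(price ~ area + hood, data = toy) %>%
  step_other(hood, threshold = 30) %>%    # pool levels seen fewer than 30 times
  step_dummy(all_nominal_predictors())    # one 0/1 column per non-reference level

# prep() estimates the steps on the data;
# bake(new_data = NULL) returns the processed training set
baked <- toy_recipe %>% prep() %>% bake(new_data = NULL)

names(baked)
sum(baked$hood_other)  # rows that were pooled from the rare levels
```

The original hood column is replaced by indicator columns, and the two rare levels share a single pooled indicator. Inside a workflow you never call prep() or bake() yourself; this is only a way to inspect the result.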

Step 4: create a workflow

A workflow object bundles the model specification and preprocessing recipe together so you can fit and predict with a single, consistent interface. Without a workflow, you would need to manually apply the recipe steps before passing data to the model. The workflow automates that coordination.

lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

The add_model() and add_recipe() calls register each component. At this point the workflow knows what preprocessing to apply and what model to fit, but neither has been executed yet; the data has not been touched.

Step 5: fit the model

Calling fit() on the workflow runs the recipe steps on the training data, then passes the preprocessed result to the linear regression engine. Everything happens inside the workflow, so there is no risk of re-estimating a transformation on the test data. The recipe stores the parameter estimates learned from the training set and reuses them whenever new data is processed.

lm_fit <- lm_workflow %>%
  fit(data = ames_train)

You have now built your first tidymodels regression model. The fitted workflow object contains both the trained linear model coefficients and the recipe’s learned preprocessing parameters, all ready for prediction on new data without manual retransformation. Examining the model summary confirms the coefficient estimates and overall fit diagnostics:

lm_fit %>%
  extract_fit_engine() %>%
  summary()

Evaluating model performance

Making predictions

Once the workflow is fitted, predict() applies the same recipe preprocessing to the test data and generates predictions. Binding the predictions to the observed sale prices with bind_cols() creates a single data frame that yardstick can consume: every row contains both the true Sale_Price and the model’s .pred estimate, side by side. Both columns must be on the same scale for the metrics to mean anything.

ames_test_preds <- lm_fit %>%
  predict(new_data = ames_test) %>%
  bind_cols(ames_test %>% select(Sale_Price))
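A model trained on log prices returns predictions on the log scale. If you want to report error in dollars, exponentiate the predictions first. The following is a minimal base-R sketch on simulated data (the variables area and price and the coefficients used to generate them are hypothetical), not the ames workflow itself.

```r
set.seed(42)
# Hypothetical data: price depends multiplicatively on living area
area  <- runif(200, min = 800, max = 3000)
price <- exp(11 + 0.0004 * area + rnorm(200, sd = 0.1))
toy   <- data.frame(area = area, price = price)

# The model is fit to log(price), so predict() returns log prices
log_fit  <- lm(log(price) ~ area, data = toy)
log_pred <- predict(log_fit, newdata = toy)

# exp() undoes the natural log; the result is back in dollars
dollar_pred <- exp(log_pred)

# RMSE on each scale: both are valid, but they are not comparable
rmse_log    <- sqrt(mean((log(toy$price) - log_pred)^2))
rmse_dollar <- sqrt(mean((toy$price - dollar_pred)^2))
c(log_scale = rmse_log, dollar_scale = rmse_dollar)
```

Mixing the two scales, for example comparing log predictions against dollar prices, produces an error figure that is close to the average price and says nothing about model quality.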

Calculating metrics

The yardstick package provides a standard interface for computing regression performance metrics. metrics() returns RMSE, R-squared, and MAE in one call; the individual metric functions shown further down give you finer control over what you report:

ames_test_preds %>%
  metrics(truth = Sale_Price, estimate = .pred)

RMSE is the square root of the mean squared error, expressed in the same units as the target you evaluate against, and it penalizes large errors more heavily than small ones. Lower is better, but what counts as “low” depends on the scale of your outcome. R-squared tells you what proportion of variance the model captures (values near 1 are ideal, but an R-squared that is much higher on training data than on held-out data signals overfitting). MAE is less sensitive to outliers than RMSE, making it a better choice when your data has extreme values. Calling the metric functions one at a time lets you report exactly the ones you need:

# Use distinct object names so the yardstick functions are not masked
rmse_res <- rmse(ames_test_preds, truth = Sale_Price, estimate = .pred)
rsq_res <- rsq(ames_test_preds, truth = Sale_Price, estimate = .pred)
mae_res <- mae(ames_test_preds, truth = Sale_Price, estimate = .pred)

bind_rows(rmse_res, rsq_res, mae_res)
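It can be useful to see what these three numbers are computed from. The base-R sketch below works through them by hand on four made-up values (truth and pred are hypothetical). The R-squared line uses the squared correlation between truth and prediction, which is the definition yardstick’s rsq() follows.

```r
# Hypothetical observed values and predictions
truth <- c(3, 5, 7, 9)
pred  <- c(2, 5, 8, 12)

err <- truth - pred              # errors: 1, 0, -1, -3

rmse_manual <- sqrt(mean(err^2)) # square, average, take the root
mae_manual  <- mean(abs(err))    # average absolute error
rsq_manual  <- cor(truth, pred)^2

c(rmse = rmse_manual, mae = mae_manual, rsq = rsq_manual)
```

The single error of 3 contributes 9 of the 11 units of squared error, which is why RMSE comes out larger than MAE here.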

Visualizing predictions

A residual plot helps us understand where our model is making errors:

ames_test_preds %>%
  mutate(.resid = Sale_Price - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Predicted Sale Price",
       y = "Residual") +
  theme_minimal()

If residuals are randomly scattered around zero, our model is capturing the underlying patterns well. Patterns in residuals (like curvature) suggest we might need additional features or a different model type.

Building more complex models

Regularized regression

Ridge regression adds an L2 penalty that shrinks coefficients toward zero without eliminating any, which helps when predictors are correlated. Lasso applies an L1 penalty that can drive coefficients all the way to zero, effectively performing variable selection. The mixture parameter in linear_reg() controls the balance: 0 selects ridge, 1 selects lasso, and values in between blend both penalties (elastic net).

# Ridge regression (L2 penalty)
ridge_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 0) %>%
  set_engine("glmnet")

# Lasso regression (L1 penalty)
lasso_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 1) %>%
  set_engine("glmnet")
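The shrinkage effect of the L2 penalty can be seen without glmnet. The sketch below solves the ridge problem in closed form on simulated data with two nearly identical predictors. The helper ridge_coef(), the data, and the penalty value of 50 are all assumptions made for illustration; glmnet scales its penalty differently, so these numbers will not match glmnet output.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1

# Center predictors and outcome so no intercept is needed
X <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
y <- 2 * x1 + rnorm(n)
y <- y - mean(y)

# Hypothetical helper: closed-form ridge solution (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, y, lambda = 0)    # no penalty: ordinary least squares
ridge <- ridge_coef(X, y, lambda = 50)   # L2 penalty shrinks the coefficients

rbind(ols = ols, ridge = ridge)
```

The ridge coefficients have a smaller overall magnitude than the least-squares ones, but neither is exactly zero. Setting a coefficient to exactly zero is what the L1 penalty in lasso adds.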

Cross-validation

Cross-validation gives you an honest estimate of out-of-sample performance. Instead of evaluating your model on a single train/test split (which can give optimistic results if you got lucky with the split), k-fold CV trains and evaluates the model k times on different partitions of the data. Each observation ends up in a validation fold exactly once, and the average metric across all folds is your performance estimate.

set.seed(123)  # fold assignment is random, so fix the seed for reproducibility
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)

The strata argument keeps the distribution of sale prices in each fold close to that of the full training set, which is especially important when the outcome is skewed. With the folds defined, fitting the ridge model across all resamples and collecting the metrics gives a cross-validated estimate of its performance. To judge whether regularization actually helps, run fit_resamples() on the original lm_workflow with the same folds and compare the two sets of metrics. The ridge version looks like this:

ridge_res <- lm_workflow %>%
  update_model(ridge_spec) %>%
  fit_resamples(resamples = ames_folds)

collect_metrics(ridge_res)
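To make the mechanics of k-fold cross-validation concrete, here is a minimal base-R sketch of roughly what vfold_cv() and fit_resamples() automate. It uses simulated data and a plain lm() fit (the data frame toy is hypothetical), and it leaves out stratification for brevity.

```r
set.seed(123)
n <- 100
# Hypothetical data: a noisy linear relationship
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

k <- 5
# Assign each row to one of k folds at random, with equal fold sizes
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  analysis   <- toy[fold_id != i, ]   # k - 1 folds used for fitting
  assessment <- toy[fold_id == i, ]   # held-out fold used for scoring
  fit  <- lm(y ~ x, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$y - pred)^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # the cross-validated estimate
```

The spread of the per-fold values is as informative as their mean: a wide spread means a single train/test split could easily have been misleading.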

Hyperparameter tuning

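For regularized models, the penalty is rarely known in advance, so we tune it. Setting penalty = tune() marks it as a value to be chosen, and tune_grid() evaluates every candidate in a grid across the resamples so you can pick the best-performing one: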

ridge_tune_spec <- linear_reg(mode = "regression",
                              penalty = tune(),
                              mixture = 0) %>%
  set_engine("glmnet")

ridge_wf <- lm_workflow %>%
  update_model(ridge_tune_spec)

param_grid <- grid_regular(penalty(), levels = 50)

tune_res <- ridge_wf %>%
  tune_grid(resamples = ames_folds,
            grid = param_grid)

show_best(tune_res, metric = "rmse")

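The show_best() function lists the top candidates (five by default) ranked by the specified metric. To carry on from there, select_best() returns the single best parameter set and finalize_workflow() plugs it into the workflow so it can be fit like any other.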
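One common way to finish a tuning run is to take the best penalty and refit on the full training data. The sketch below shows that pattern end to end on simulated data instead of ames; the data frame toy and its columns are hypothetical, and the grid is kept small so it runs quickly. It assumes the glmnet package is installed.

```r
library(tidymodels)

set.seed(123)
# Hypothetical simulated data: two informative predictors and one noise column
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  x3 = rnorm(200),
  y  = 3 * x1 - 2 * x2 + rnorm(200)
)

toy_folds <- vfold_cv(toy, v = 5)

# Ridge specification with the penalty left open for tuning
toy_wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet"))

toy_tune <- tune_grid(toy_wf,
                      resamples = toy_folds,
                      grid = grid_regular(penalty(), levels = 10))

# Pick the candidate with the lowest cross-validated RMSE
best_penalty <- select_best(toy_tune, metric = "rmse")

# Replace tune() with the chosen value, then fit on all training rows
final_wf  <- finalize_workflow(toy_wf, best_penalty)
final_fit <- fit(final_wf, data = toy)

best_penalty
```

The finalized workflow behaves like the untuned one from earlier steps, so predict() works on it in the same way.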

Interpreting your model

Variable importance

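Understanding which features drive predictions is crucial. For a linear model, a quick first look is to plot the largest coefficients. Keep in mind that a coefficient’s size depends on the units of its predictor, so a per-square-foot effect and a neighborhood dummy are not directly comparable:

lm_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  slice_max(abs_estimate, n = 10) %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Feature",
       y = "Coefficient Estimate",
       title = "Ten Largest Coefficients by Absolute Value") +
  theme_minimal()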

Understanding coefficients

Coefficients tell us how the predicted target changes when one feature changes and the others are held fixed. For log-transformed targets, we interpret coefficients as percentage changes:

# If Gr_Liv_Area increases by 1 unit, Sale_Price changes by (exp(coefficient) - 1) * 100%, holding other predictors fixed
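A short worked example makes the formula concrete. The coefficient value below is hypothetical, chosen only to show the arithmetic; substitute the estimate from your own fitted model.

```r
# Hypothetical coefficient on Gr_Liv_Area in a model of log(Sale_Price)
b <- 0.0003

# Percentage change in price for one extra square foot
pct_one <- (exp(b) - 1) * 100

# For a 100 square foot increase, scale the coefficient before exponentiating
pct_100 <- (exp(100 * b) - 1) * 100

c(one_sq_ft = pct_one, hundred_sq_ft = pct_100)  # about 0.03% and about 3.05%
```

For small coefficients the shortcut of reading b * 100 as the percentage change is close (3% versus about 3.05% here), but the gap widens as the coefficient grows.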

Best practices

Start simple: Begin with a basic linear regression before moving to more complex models
Preprocess consistently: Use recipes to ensure the same transformations are applied to training and test data
Validate properly: Always use cross-validation to estimate model performance
Tune hyperparameters: Don’t assume default parameters are optimal
Interpret results: Understand what your model is telling you, not just the predictions

Summary

You have now worked through the complete tidymodels regression workflow:

Split your data into training and testing sets
Specify your model using parsnip
Create a recipe for preprocessing with recipes
Combine model and recipe in a workflow
Fit the model to training data
Evaluate performance using yardstick metrics
Tune hyperparameters for better performance
Interpret coefficients to understand feature effects

The tidymodels framework gives you a consistent, tidy approach to regression modeling in R. The same patterns (split, recipe, model spec, workflow, fit, evaluate) carry over directly to classification, clustering, and survival analysis. Learning the workflow once makes the entire tidymodels ecosystem accessible.

Next steps

Now that you understand regression with tidymodels, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.