Tidymodels Regression: Build, Tune, and Evaluate Models in R
Tidymodels regression gives R users a unified framework for building predictive models. Whether you are forecasting house prices, estimating customer lifetime value, or modeling sales trends, the tidymodels ecosystem (parsnip for model specification, recipes for preprocessing, workflows for pipeline assembly, and yardstick for evaluation) turns regression from a one-off script into a reproducible process. This tutorial walks through the complete tidymodels regression workflow: splitting data, specifying a model, building a preprocessing recipe, assembling a workflow, fitting the model, evaluating performance with cross-validation, and tuning hyperparameters for better results.
Prerequisites
Before starting this tutorial, make sure you have the following packages installed:
install.packages(c("tidymodels", "tidyverse", "modeldata", "glmnet"))
The install.packages() call fetches everything you need in one go: the tidymodels metapackage bundles parsnip, recipes, workflows, tune, and yardstick into a single install target, tidyverse provides the data manipulation tools (dplyr, ggplot2, tidyr) you will use alongside your models, and glmnet supplies the engine for the regularized models later in the tutorial, which tidymodels does not install for you. Once installed, loading the libraries makes all the modeling and data-wrangling functions available in your current R session.
library(tidymodels)
library(tidyverse)
library(modeldata)
Besides the packages named above, attaching tidymodels also loads rsample (for data splitting and resampling) and dials (for tuning grids), both of which are used later in this tutorial.
Understanding the data
For this tutorial, we’ll use the ames dataset from the modeldata package. This dataset contains information about houses in Ames, Iowa, with the goal of predicting sale price (Sale_Price).
data(ames)
glimpse(ames)
The dataset contains 2,930 observations and 74 variables. The target variable is Sale_Price, and we’ll use various features like the number of rooms, square footage, location, and quality ratings to predict it.
Let’s examine the distribution of the target variable:
ames %>%
  ggplot(aes(x = Sale_Price)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +
  labs(title = "Distribution of House Sale Prices",
       x = "Sale Price (log scale)") +
  theme_minimal()
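On the raw scale, sale prices are right-skewed; the log-scaled axis used here pulls in the long right tail, so the histogram looks closer to symmetric. That suggests log-transforming the target variable for modeling.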
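A quick numeric check can back up the visual impression. The sketch below compares the mean and median of Sale_Price on the raw and natural-log scales; a mean sitting well above the median is a sign of right skew. The helper names (price_summary, rel_gap) are made up for this illustration.

```r
library(modeldata)
data(ames, package = "modeldata")

# Mean vs. median on the raw scale and on the natural-log scale.
# A mean well above the median points to a long right tail.
price_summary <- data.frame(
  scale  = c("raw", "log"),
  mean   = c(mean(ames$Sale_Price), mean(log(ames$Sale_Price))),
  median = c(median(ames$Sale_Price), median(log(ames$Sale_Price)))
)

# Relative gap between mean and median on each scale
price_summary$rel_gap <-
  (price_summary$mean - price_summary$median) / price_summary$median

price_summary
```

If the gap shrinks noticeably on the log scale, the transformation is doing its job.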
Building your first regression model
Step 1: split the data
The first step in any modeling project is to split your data into training and testing sets. The training set is used to build and tune the model, while the testing set provides an unbiased estimate of model performance on new data.
set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
We use stratified sampling based on Sale_Price to ensure that the distribution of house prices is similar in both sets.
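One way to see what stratification buys you is to compare the outcome's quartiles in the two partitions. This is a minimal sketch that repeats the split from above; the split_check name and the choice of quartiles are assumptions made for illustration.

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)

# Stack both partitions with a label, then summarise the outcome in each.
# With stratified sampling the quartiles should be close to each other.
split_check <- bind_rows(
  training(ames_split) %>% mutate(set = "train"),
  testing(ames_split) %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  summarise(
    n      = n(),
    q25    = quantile(Sale_Price, 0.25),
    median = median(Sale_Price),
    q75    = quantile(Sale_Price, 0.75)
  )

split_check
```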
Step 2: define the model
In tidymodels, we use the parsnip package to specify models. For linear regression, we use linear_reg() with the lm engine:
lm_model <- linear_reg() %>%
  set_engine("lm")
This creates a linear regression model specification. The model hasn’t been fit yet; we have just defined what type of model we want.
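To see what a specification will eventually run, parsnip's translate() prints the engine call template. The sketch below rebuilds the same specification so that it runs on its own; the data arguments in the printed template are placeholders until fit() is called.

```r
library(tidymodels)

lm_model <- linear_reg() %>%
  set_engine("lm")

# translate() shows the underlying stats::lm() call that parsnip will build.
# Nothing is fitted here; the specification is still just a description.
translate(lm_model)
```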
Step 3: create a recipe
Recipes define the preprocessing steps for your data. This includes handling missing values, encoding categorical variables, scaling numeric features, and more.
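# Log-transform the outcome outside the recipe so that training, resampling,
# and test data are all scored on the same (natural log) scale.
ames_train <- ames_train %>% mutate(Sale_Price = log(Sale_Price))
ames_test <- ames_test %>% mutate(Sale_Price = log(Sale_Price))
lm_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood + Bldg_Type,
                    data = ames_train) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())
Let’s break down what’s happening:
- The two mutate() calls log-transform the sale price (remember, prices are right-skewed). Doing this outside the recipe matters: a step_log(Sale_Price, skip = TRUE) step is applied when fitting but skipped when predicting, so test-set and cross-validation metrics would compare log-scale predictions with dollar-scale prices. Because the log has no estimated parameters, applying it to both sets causes no leakage.
- recipe(Sale_Price ~ ...) defines the model formula. Bldg_Type stands in for Overall_Qual, which the modeldata copy of ames does not include.
- step_other() groups neighborhoods with fewer than 30 training houses into an “other” category
- step_dummy() converts categorical variables to numeric dummy variables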
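A recipe is only a plan until it is estimated. To inspect what the steps actually produce, you can run prep() and bake() by hand. The sketch below is a self-contained illustration with a smaller, hypothetical recipe (demo_recipe) and the outcome logged up front; inside a workflow you would not normally call these functions yourself.

```r
library(tidymodels)
data(ames, package = "modeldata")

# Assumption for this sketch: log the outcome before building the recipe
ames_log <- ames %>% mutate(Sale_Price = log(Sale_Price))

demo_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Neighborhood, data = ames_log) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates what the steps need (here, which neighborhoods are rare);
# bake(new_data = NULL) returns the processed version of the training data.
baked <- demo_recipe %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

The column names show one dummy variable per retained neighborhood plus a pooled Neighborhood_other column, with the first factor level dropped as the reference.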
Step 4: create a workflow
A workflow object bundles the model specification and preprocessing recipe together so you can fit and predict with a single, consistent interface. Without a workflow, you would need to manually apply the recipe steps before passing data to the model. The workflow automates that coordination.
lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)
The add_model() and add_recipe() calls register each component. At this point the workflow knows what preprocessing to apply and what model to fit, but neither has been executed yet; the data has not been touched.
Step 5: fit the model
Calling fit() on the workflow estimates the recipe steps on the training data, then passes the preprocessed result to the linear regression engine. Everything happens inside the workflow, so there is no risk of estimating a preprocessing parameter from the test data: quantities such as which neighborhoods count as rare are learned from the training set only, and the recipe stores them for reuse at prediction time.
lm_fit <- lm_workflow %>%
  fit(data = ames_train)
You have now built your first tidymodels regression model. The fitted workflow object contains both the trained linear model coefficients and the recipe’s learned preprocessing parameters, all ready for prediction on new data without manual retransformation. Examining the model summary confirms the coefficient estimates and overall fit diagnostics:
lm_fit %>%
  extract_fit_engine() %>%
  summary()
Evaluating model performance
Making predictions
Once the workflow is fitted, predict() applies the same recipe preprocessing to the test data and generates predictions. Binding the predictions to the observed sale prices with bind_cols() creates a single data frame that yardstick can consume: every row contains both the true Sale_Price and the model’s .pred estimate, side by side. Both columns need to be on the same scale (here, log price) for the metrics to mean anything.
ames_test_preds <- lm_fit %>%
  predict(new_data = ames_test) %>%
  bind_cols(ames_test %>% select(Sale_Price))
Calculating metrics
The yardstick package provides a standard interface for computing regression performance metrics. metrics() returns RMSE, R-squared, and MAE in one call, but computing each metric individually gives you finer control over what you report:
ames_test_preds %>%
  metrics(truth = Sale_Price, estimate = .pred)
RMSE is the square root of the mean squared error, expressed in the units of the outcome as it was modeled (log price here, not dollars). Lower is better, but what counts as “low” depends on the scale of your outcome. R-squared, which yardstick computes as the squared correlation between truth and estimate, tells you how much of the variance the model captures; values near 1 are ideal, but a training R-squared far above the test or cross-validated one signals overfitting. MAE is less sensitive to outliers than RMSE, making it a better choice when your data has extreme values.
# Use distinct names so the results do not mask the yardstick functions
rmse_res <- rmse(ames_test_preds, truth = Sale_Price, estimate = .pred)
rsq_res <- rsq(ames_test_preds, truth = Sale_Price, estimate = .pred)
mae_res <- mae(ames_test_preds, truth = Sale_Price, estimate = .pred)
bind_rows(rmse_res, rsq_res, mae_res)
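To make the three definitions concrete, the sketch below computes each metric by hand on a small set of made-up log-price values and checks the results against yardstick's vector functions. The numbers in truth and pred are hypothetical and chosen only for illustration.

```r
library(yardstick)

# Hypothetical observed and predicted values on the log-price scale
truth <- c(12.1, 11.8, 12.5, 11.9, 12.3)
pred  <- c(12.0, 11.9, 12.2, 12.0, 12.4)

manual <- c(
  rmse = sqrt(mean((truth - pred)^2)),  # root of the mean squared error
  mae  = mean(abs(truth - pred)),       # mean absolute error
  rsq  = cor(truth, pred)^2             # squared correlation, as in yardstick
)

from_yardstick <- c(
  rmse = rmse_vec(truth, pred),
  mae  = mae_vec(truth, pred),
  rsq  = rsq_vec(truth, pred)
)

rbind(manual, from_yardstick)
```

Because RMSE squares the errors before averaging, the single larger miss (12.5 vs. 12.2) weighs more heavily in RMSE than in MAE.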
Visualizing predictions
A residual plot helps us understand where our model is making errors:
ames_test_preds %>%
  mutate(.resid = Sale_Price - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Predicted Sale Price",
       y = "Residual") +
  theme_minimal()
If residuals are randomly scattered around zero, our model is capturing the underlying patterns well. Patterns in residuals (like curvature) suggest we might need additional features or a different model type.
Building more complex models
Regularized regression
Ridge regression adds an L2 penalty that shrinks coefficients toward zero without eliminating any, which helps when predictors are correlated. Lasso applies an L1 penalty that can drive coefficients all the way to zero, effectively performing variable selection. The mixture parameter in linear_reg() controls the balance: 0 selects ridge, 1 selects lasso, and values in between blend both penalties (elastic net).
# Ridge regression (L2 penalty)
ridge_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 0) %>%
  set_engine("glmnet")
# Lasso regression (L1 penalty)
lasso_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 1) %>%
  set_engine("glmnet")
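The difference between the two penalties is easiest to see on data where the truth is known. The sketch below simulates five predictors of which only x1 carries signal, then fits ridge and lasso at the same penalty. The simulated data, the penalty of 0.5, and the fit_coefs() helper are all assumptions made for this illustration, not part of the Ames analysis.

```r
library(tidymodels)  # the glmnet package must also be installed

set.seed(42)
n <- 200
sim <- tibble(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n)
) %>%
  mutate(y = 3 * x1 + rnorm(n))  # only x1 is related to y

penalty_value <- 0.5  # hypothetical value, large enough to show the effect

# Fit a glmnet model for a given mixture and return the slope estimates
fit_coefs <- function(mixture) {
  fitted <- linear_reg(penalty = penalty_value, mixture = mixture) %>%
    set_engine("glmnet") %>%
    fit(y ~ ., data = sim)
  coefs <- as.matrix(coef(extract_fit_engine(fitted), s = penalty_value))
  coefs[-1, 1]  # drop the intercept, keep a named vector of slopes
}

ridge_coefs <- fit_coefs(mixture = 0)
lasso_coefs <- fit_coefs(mixture = 1)

tibble(
  term  = names(ridge_coefs),
  ridge = ridge_coefs,
  lasso = lasso_coefs
)
```

Ridge keeps a small, nonzero estimate for every predictor, while lasso sets the noise predictors to exactly zero and keeps a shrunken estimate for x1.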
Cross-Validation
Cross-validation gives you an honest estimate of out-of-sample performance. Instead of evaluating your model on a single train/test split (which can give optimistic results if you got lucky with the split), k-fold CV trains and evaluates the model k times on different partitions of the data. Each observation ends up in a validation fold exactly once, and the average metric across all folds is your performance estimate.
set.seed(123)
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)
Setting a seed makes the fold assignment reproducible. The strata argument ensures each fold maintains roughly the same distribution of sale prices as the full training set, which is especially important when the outcome is skewed. Whether regularization improves on the basic linear regression can only be judged by resampling lm_workflow on the same folds as well. With the folds defined, fitting the ridge model across all resamples and collecting its metrics looks like this:
ridge_res <- lm_workflow %>%
  update_model(ridge_spec) %>%
  fit_resamples(resamples = ames_folds)
collect_metrics(ridge_res)
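A fair comparison resamples both models on identical folds. The following self-contained sketch does that with a reduced, hypothetical recipe and the outcome logged before splitting; object names such as base_wf and cv_compare are assumptions for this example.

```r
library(tidymodels)  # glmnet must also be installed
data(ames, package = "modeldata")

# Assumption: log the outcome up front so every fold is scored on the log scale
ames_log <- ames %>% mutate(Sale_Price = log(Sale_Price))

set.seed(123)
train <- training(initial_split(ames_log, prop = 0.8, strata = Sale_Price))
folds <- vfold_cv(train, v = 5, strata = Sale_Price)

rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood, data = train) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())

base_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

ridge_wf <- base_wf %>%
  update_model(linear_reg(penalty = 0.1, mixture = 0) %>% set_engine("glmnet"))

# Same folds for both workflows, so differences reflect the models only
cv_compare <- bind_rows(
  lm    = collect_metrics(fit_resamples(base_wf, resamples = folds)),
  ridge = collect_metrics(fit_resamples(ridge_wf, resamples = folds)),
  .id = "model"
)

cv_compare
```

The std_err column helps judge whether a gap in mean RMSE is larger than the fold-to-fold noise.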
Hyperparameter tuning
For regularized models, we often want to tune the penalty parameter. The tune_grid() function helps us find the optimal value:
ridge_tune_spec <- linear_reg(mode = "regression",
                              penalty = tune(),
                              mixture = 0) %>%
  set_engine("glmnet")
ridge_wf <- lm_workflow %>%
  update_model(ridge_tune_spec)
param_grid <- grid_regular(penalty(), levels = 50)
tune_res <- ridge_wf %>%
  tune_grid(resamples = ames_folds,
            grid = param_grid)
show_best(tune_res, metric = "rmse")
The show_best() function displays the top candidates (five by default) ranked by the specified metric.
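Tuning results only become a usable model once the winning value is plugged back into the workflow. One common pattern, sketched below as a self-contained example, uses select_best(), finalize_workflow(), and last_fit(), which refits on the full training set and scores the test set once. The reduced recipe, the 20-level grid, and the object names are assumptions chosen to keep the example short.

```r
library(tidymodels)  # glmnet must also be installed
data(ames, package = "modeldata")

# Assumption: outcome logged before splitting
ames_log <- ames %>% mutate(Sale_Price = log(Sale_Price))

set.seed(123)
split <- initial_split(ames_log, prop = 0.8, strata = Sale_Price)
folds <- vfold_cv(training(split), v = 5, strata = Sale_Price)

rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood,
              data = training(split)) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet"))

tuned <- tune_grid(wf, resamples = folds,
                   grid = grid_regular(penalty(), levels = 20))

# Pick the penalty with the lowest cross-validated RMSE
best_penalty <- select_best(tuned, metric = "rmse")

# Fix that value in the workflow, refit on the training set,
# and evaluate once on the held-out test set
final_fit <- wf %>%
  finalize_workflow(best_penalty) %>%
  last_fit(split)

collect_metrics(final_fit)
```

Because the test set is touched only in last_fit(), the reported metrics are not influenced by the tuning search.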
Interpreting your model
Variable importance
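Understanding which features drive predictions is crucial. Raw coefficient size is only a rough guide here: the recipe does not normalize the predictors, so the coefficient of a dummy variable and a per-square-foot coefficient are not on comparable scales. With that caveat, the ten largest coefficients are:
lm_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  slice_max(abs_estimate, n = 10) %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(x = "Feature",
       y = "Coefficient Estimate",
       title = "Ten Largest Coefficients by Absolute Value") +
  theme_minimal()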
Understanding coefficients
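Coefficients tell us how each feature affects the target variable. Because the target is on the natural-log scale, a coefficient translates into a percentage change in price, holding the other predictors fixed:
# If Gr_Liv_Area increases by 1 unit, Sale_Price changes by (exp(coefficient) - 1) * 100%,
# which is approximately coefficient * 100% when the coefficient is small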
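The conversion is easy to try on simulated data where the true effect is known. The sketch below uses base R only; the variable names and the true slope of 0.0004 are made up for illustration and are not estimates from the Ames data.

```r
set.seed(1)
n <- 500
area <- runif(n, 500, 3000)

# Simulated prices: each extra unit of area multiplies price by exp(0.0004)
price <- exp(11 + 0.0004 * area + rnorm(n, sd = 0.1))

fit <- lm(log(price) ~ area)
b <- coef(fit)[["area"]]

pct_per_unit      <- (exp(b) - 1) * 100        # effect of +1 unit of area
pct_per_100_units <- (exp(100 * b) - 1) * 100  # effect of +100 units of area

c(per_unit = pct_per_unit, per_100_units = pct_per_100_units)
```

Percentage effects compound rather than add, so for a change of k units the conversion is exp(k * b) - 1, not k times the one-unit percentage.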
Best practices
- Start simple: Begin with a basic linear regression before moving to more complex models
- Preprocess consistently: Use recipes to ensure the same transformations are applied to training and test data
- Validate properly: Always use cross-validation to estimate model performance
- Tune hyperparameters: Don’t assume default parameters are optimal
- Interpret results: Understand what your model is telling you, not just the predictions
Summary
You have now worked through the complete tidymodels regression workflow:
- Split your data into training and testing sets
- Specify your model using parsnip
- Create a recipe for preprocessing with recipes
- Combine model and recipe in a workflow
- Fit the model to training data
- Evaluate performance using yardstick metrics
- Tune hyperparameters for better performance
- Interpret coefficients to understand feature effects
The tidymodels framework gives you a consistent, tidy approach to regression modeling in R. The same patterns (split, recipe, model spec, workflow, fit, evaluate) carry over directly to classification, clustering, and survival analysis. Learning the workflow once makes the entire tidymodels ecosystem accessible.
Next steps
Now that you understand regression with tidymodels, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.
See also
- Classification with tidymodels: the same workflow applied to categorical outcomes
- Feature engineering in R: advanced preprocessing with recipes
- Cross-validation in R: deeper dive into resampling strategies