Regression with tidymodels
Regression is one of the most fundamental techniques in machine learning. Whether you’re predicting house prices, forecasting sales, or estimating customer lifetime value, regression models provide a solid foundation for understanding relationships between variables. The tidymodels framework offers a consistent, tidy approach to building regression models in R.
In this tutorial, you’ll learn how to use tidymodels to build regression models from scratch. We’ll cover the complete workflow: data preprocessing, model specification, training, evaluation, and interpretation.
Prerequisites
Before starting this tutorial, make sure you have the following packages installed:
install.packages(c("tidymodels", "tidyverse", "modeldata"))
Load the libraries:
library(tidymodels)
library(tidyverse)
library(modeldata)
The tidymodels metapackage loads a collection of packages including parsnip (for model specification), recipes (for preprocessing), workflows (for pipelines), tune (for hyperparameter tuning), and yardstick (for metrics).
Understanding the Data
For this tutorial, we’ll use the ames dataset from the modeldata package. This dataset contains information about houses in Ames, Iowa, with the goal of predicting sale price (Sale_Price).
data(ames)
glimpse(ames)
The dataset contains 2,930 observations and 74 variables. The target variable is Sale_Price, and we’ll use various features like the number of rooms, square footage, location, and quality ratings to predict it.
Let’s examine the distribution of the target variable:
ames %>%
  ggplot(aes(x = Sale_Price)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +
  labs(title = "Distribution of House Sale Prices",
       x = "Sale Price (log scale)") +
  theme_minimal()
The sale prices are right-skewed, which suggests we might want to log-transform the target variable for modeling.
Building Your First Regression Model
Step 1: Split the Data
The first step in any modeling project is to split your data into training and testing sets. The training set is used to build and tune the model, while the testing set provides an unbiased estimate of model performance on new data.
set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
We use stratified sampling based on Sale_Price to ensure that the distribution of house prices is similar in both sets.
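One quick way to confirm the stratification did its job is to compare price quantiles in the two sets; with stratified sampling they should be very close. A minimal sketch, reusing the split from above:

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)

# With stratification, the quartiles of the two sets should nearly match
quantile(training(ames_split)$Sale_Price, probs = c(0.25, 0.5, 0.75))
quantile(testing(ames_split)$Sale_Price, probs = c(0.25, 0.5, 0.75))
```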
Step 2: Define the Model
In tidymodels, we use the parsnip package to specify models. For linear regression, we use linear_reg() with the lm engine:
lm_model <- linear_reg() %>%
  set_engine("lm")
This creates a linear regression model specification. The model hasn’t been fit yet—we’ve just defined what type of model we want.
Step 3: Create a Recipe
Recipes define the preprocessing steps for your data. This includes handling missing values, encoding categorical variables, scaling numeric features, and more.
lm_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood + Overall_Qual,
                    data = ames_train) %>%
  step_log(Sale_Price, skip = TRUE) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())
Let’s break down what’s happening:
- recipe(Sale_Price ~ ...) defines the model formula
- step_log() log-transforms the sale price (remember, prices are right-skewed)
- step_other() groups rare neighborhoods into an "Other" category
- step_dummy() converts categorical variables to numeric dummy variables
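If you want to see what the recipe actually produces, you can prep() it (which estimates the steps from the training data) and bake() the result. A quick sketch, assuming the recipe and training set defined above:

```r
# prep() estimates each step from the training data;
# bake(new_data = NULL) returns the processed training set.
lm_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```

You should see the Neighborhood dummy columns created by step_dummy(), with rare levels collapsed into an "other" column.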
Step 4: Create a Workflow
Workflows combine the model and recipe into a single object, making it easy to fit and predict:
lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)
Step 5: Fit the Model
Now we can fit the model to our training data:
lm_fit <- lm_workflow %>%
  fit(data = ames_train)
That’s it! We’ve built our first tidymodels regression model. Let’s examine the results:
lm_fit %>%
  extract_fit_engine() %>%
  summary()
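summary() prints a lot of text; if you would rather work with the coefficients as a data frame, broom's tidy() method works directly on the fitted workflow:

```r
# Coefficient table as a tibble: term, estimate, std.error, statistic, p.value
lm_fit %>%
  tidy() %>%
  arrange(p.value)
```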
Evaluating Model Performance
Making Predictions
We can use the fitted workflow to make predictions on new data. One caveat: because the recipe log-transforms the outcome (with skip = TRUE, so the step is learned from training data but never applied when predicting), the model's predictions are on the log scale. To compare like with like, we log-transform the observed prices as well:
ames_test_preds <- lm_fit %>%
  predict(new_data = ames_test) %>%
  bind_cols(ames_test %>% select(Sale_Price)) %>%
  mutate(Sale_Price = log(Sale_Price))
Calculating Metrics
The yardstick package provides various metrics for evaluating regression models:
ames_test_preds %>%
  metrics(truth = Sale_Price, estimate = .pred)
Key metrics include:
- RMSE (Root Mean Squared Error): The square root of the average squared error, in the same units as the target variable; it penalizes large errors more heavily
- RSQ (R-squared): The proportion of variance in the target explained by the model
- MAE (Mean Absolute Error): The average absolute difference between predictions and actual values
Let’s calculate each metric separately:
# Use distinct names so we don't mask the yardstick functions rmse(), rsq(), mae()
rmse_val <- rmse(ames_test_preds, truth = Sale_Price, estimate = .pred)
rsq_val  <- rsq(ames_test_preds, truth = Sale_Price, estimate = .pred)
mae_val  <- mae(ames_test_preds, truth = Sale_Price, estimate = .pred)
bind_rows(rmse_val, rsq_val, mae_val)
Visualizing Predictions
A residual plot helps us understand where our model is making errors:
ames_test_preds %>%
  mutate(.resid = Sale_Price - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Predicted Sale Price",
       y = "Residual") +
  theme_minimal()
If residuals are randomly scattered around zero, our model is capturing the underlying patterns well. Patterns in residuals (like curvature) suggest we might need additional features or a different model type.
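Curvature can be hard to judge by eye; one way to make it visible is to overlay a smoother on the residual plot. A sketch reusing the predictions from above:

```r
ames_test_preds %>%
  mutate(.resid = Sale_Price - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.3) +
  # A loess trend line: if it stays roughly flat at zero, there is
  # no systematic curvature left in the residuals.
  geom_smooth(se = FALSE, color = "darkorange") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()
```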
Building More Complex Models
Regularized Regression
Ridge regression adds a penalty term to prevent overfitting, while Lasso can perform variable selection by shrinking coefficients to zero. Let’s try both:
# Ridge regression (L2 penalty)
# Ridge regression (L2 penalty)
ridge_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 0) %>%
  set_engine("glmnet")

# Lasso regression (L1 penalty)
lasso_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 1) %>%
  set_engine("glmnet")
Cross-Validation
Cross-validation helps us estimate how well our model will perform on new data. K-fold cross-validation splits the training data into K subsets and trains the model K times, using each subset once for validation:
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)
Now we can fit our model across all folds:
ridge_res <- lm_workflow %>%
  update_model(ridge_spec) %>%
  fit_resamples(resamples = ames_folds)
collect_metrics(ridge_res)
Hyperparameter Tuning
For regularized models, we often want to tune the penalty parameter. The tune_grid() function helps us find the optimal value:
ridge_tune_spec <- linear_reg(mode = "regression",
                              penalty = tune(),
                              mixture = 0) %>%
  set_engine("glmnet")

ridge_wf <- lm_workflow %>%
  update_model(ridge_tune_spec)

param_grid <- grid_regular(penalty(), levels = 50)

tune_res <- ridge_wf %>%
  tune_grid(resamples = ames_folds,
            grid = param_grid)
show_best(tune_res, metric = "rmse")
The show_best() function displays the top models based on the specified metric.
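Once a good penalty is found, it can be plugged back into the workflow: select_best() picks the winning parameters, finalize_workflow() substitutes them for tune(), and last_fit() fits on the full training set and evaluates once on the test set:

```r
best_penalty <- select_best(tune_res, metric = "rmse")

final_wf <- ridge_wf %>%
  finalize_workflow(best_penalty)

# last_fit() trains on the training portion of the split
# and evaluates on the held-out test set.
final_res <- final_wf %>%
  last_fit(ames_split)

collect_metrics(final_res)
```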
Interpreting Your Model
Variable Importance
Understanding which features drive predictions is crucial. Keep in mind that raw coefficient magnitudes are only directly comparable when predictors are on similar scales (for example, after centering and scaling in the recipe); with that caveat, we can plot the largest coefficients:
lm_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  slice_max(abs_estimate, n = 10) %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Feature",
       y = "Coefficient Estimate",
       title = "Top 10 Most Important Features") +
  theme_minimal()
Understanding Coefficients
Coefficients tell us how each feature affects the target variable. For log-transformed targets, we interpret coefficients as percentage changes:
# If Gr_Liv_Area increases by 1 unit, Sale_Price increases by approximately (exp(coefficient) - 1) * 100%
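As a small worked example of that formula (the coefficient value here is made up for illustration, not taken from the fitted model):

```r
# Suppose the Year_Built coefficient on the log-price scale is 0.003.
coef_year <- 0.003
pct_change <- (exp(coef_year) - 1) * 100
pct_change   # about 0.3% higher price per additional year
```

For small coefficients, (exp(b) - 1) is close to b itself, which is why coefficients on a log outcome are often read directly as approximate proportional changes.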
Best Practices
- Start simple: Begin with a basic linear regression before moving to more complex models
- Preprocess consistently: Use recipes to ensure the same transformations are applied to training and test data
- Validate properly: Always use cross-validation to estimate model performance
- Tune hyperparameters: Don’t assume default parameters are optimal
- Interpret results: Understand what your model is telling you, not just the predictions
Summary
In this tutorial, you’ve learned the complete tidymodels workflow for regression:
- Split your data into training and testing sets
- Specify your model using parsnip
- Create a recipe for preprocessing with recipes
- Combine model and recipe in a workflow
- Fit the model to training data
- Evaluate performance using yardstick metrics
- Tune hyperparameters for better performance
- Interpret coefficients to understand feature effects
The tidymodels framework provides a consistent, tidy approach to regression modeling in R. As you continue your machine learning journey, you’ll find these same patterns apply to classification models, clustering, and more advanced techniques.