Regression with tidymodels
Regression is one of the most fundamental techniques in machine learning. Whether you’re predicting house prices, forecasting sales, or estimating customer lifetime value, regression models provide a solid foundation for understanding relationships between variables. The tidymodels framework offers a consistent, tidy approach to building regression models in R.
In this tutorial, you’ll learn how to use tidymodels to build regression models from scratch. We’ll cover the complete workflow: data preprocessing, model specification, training, evaluation, and interpretation.
Prerequisites
Before starting this tutorial, make sure you have the following packages installed:
install.packages(c("tidymodels", "tidyverse", "modeldata"))
Load the libraries:
library(tidymodels)
library(tidyverse)
library(modeldata)
The tidymodels metapackage loads a collection of packages including parsnip (for model specification), recipes (for preprocessing), workflows (for pipelines), tune (for hyperparameter tuning), and yardstick (for metrics).
Understanding the Data
For this tutorial, we’ll use the ames dataset from the modeldata package. This dataset contains information about houses in Ames, Iowa, with the goal of predicting sale price (Sale_Price).
data(ames)
glimpse(ames)
The dataset contains 2,930 observations and 74 variables. The target variable is Sale_Price, and we’ll use various features like the number of rooms, square footage, location, and quality ratings to predict it.
Let’s examine the distribution of the target variable:
ames %>%
  ggplot(aes(x = Sale_Price)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +
  labs(title = "Distribution of House Sale Prices",
       x = "Sale Price (log scale)") +
  theme_minimal()
The sale prices are right-skewed, which suggests we might want to log-transform the target variable for modeling.
Building Your First Regression Model
Step 1: Split the Data
The first step in any modeling project is to split your data into training and testing sets. The training set is used to build and tune the model, while the testing set provides an unbiased estimate of model performance on new data.
set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
We use stratified sampling based on Sale_Price to ensure that the distribution of house prices is similar in both sets.
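One quick way to confirm the stratification did its job is to compare price quantiles in the two sets; with stratified sampling they should be very close. A minimal sketch, reusing the split from above:

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)

# With stratification, the quartiles of the two sets should nearly match
quantile(training(ames_split)$Sale_Price, probs = c(0.25, 0.5, 0.75))
quantile(testing(ames_split)$Sale_Price, probs = c(0.25, 0.5, 0.75))
```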
Step 2: Define the Model
In tidymodels, we use the parsnip package to specify models. For linear regression, we use linear_reg() with the lm engine:
lm_model <- linear_reg() %>%
  set_engine("lm")
This creates a linear regression model specification. The model hasn’t been fit yet—we’ve just defined what type of model we want.
Step 3: Create a Recipe
Recipes define the preprocessing steps for your data. This includes handling missing values, encoding categorical variables, scaling numeric features, and more.
lm_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood + Overall_Qual,
                    data = ames_train) %>%
  step_log(Sale_Price, skip = TRUE) %>%
  step_other(Neighborhood, threshold = 30) %>%
  step_dummy(all_nominal_predictors())
Let’s break down what’s happening:
- recipe(Sale_Price ~ ...) defines the model formula
- step_log() log-transforms the sale price (remember, prices are right-skewed)
- step_other() groups rare neighborhoods into an "Other" category
- step_dummy() converts categorical variables to numeric dummy variables
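If you want to see what the recipe actually produces, you can prep() it (which estimates the steps from the training data) and bake() the result. A quick sketch, assuming the recipe and training set defined above:

```r
# prep() estimates each step from the training data;
# bake(new_data = NULL) returns the processed training set.
lm_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```

You should see the Neighborhood dummy columns created by step_dummy(), with rare levels collapsed into an "other" column.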
Step 4: Create a Workflow
Workflows combine the model and recipe into a single object, making it easy to fit and predict:
lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)
Step 5: Fit the Model
Now we can fit the model to our training data:
lm_fit <- lm_workflow %>%
  fit(data = ames_train)
That’s it! We’ve built our first tidymodels regression model. Let’s examine the results:
lm_fit %>%
  extract_fit_engine() %>%
  summary()
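summary() prints a lot of text; if you would rather work with the coefficients as a data frame, broom's tidy() method works directly on the fitted workflow:

```r
# Coefficient table as a tibble: term, estimate, std.error, statistic, p.value
lm_fit %>%
  tidy() %>%
  arrange(p.value)
```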
Evaluating Model Performance
Making Predictions
We can use the fitted workflow to make predictions on new data. One caveat: because the recipe log-transforms the outcome (with skip = TRUE, so the step is learned from training data but never applied when predicting), the model's predictions are on the log scale. To compare like with like, we log-transform the observed prices as well:
ames_test_preds <- lm_fit %>%
  predict(new_data = ames_test) %>%
  bind_cols(ames_test %>% select(Sale_Price)) %>%
  mutate(Sale_Price = log(Sale_Price))
Calculating Metrics
The yardstick package provides various metrics for evaluating regression models:
ames_test_preds %>%
  metrics(truth = Sale_Price, estimate = .pred)
Key metrics include:
- RMSE (Root Mean Squared Error): The square root of the average squared error, in the same units as the target variable; it penalizes large errors more heavily
- RSQ (R-squared): The proportion of variance in the target explained by the model
- MAE (Mean Absolute Error): The average absolute difference between predictions and actual values
Let’s calculate each metric separately:
# Use distinct names so we don't mask the yardstick functions rmse(), rsq(), mae()
rmse_val <- rmse(ames_test_preds, truth = Sale_Price, estimate = .pred)
rsq_val  <- rsq(ames_test_preds, truth = Sale_Price, estimate = .pred)
mae_val  <- mae(ames_test_preds, truth = Sale_Price, estimate = .pred)
bind_rows(rmse_val, rsq_val, mae_val)
Visualizing Predictions
A residual plot helps us understand where our model is making errors:
ames_test_preds %>%
  mutate(.resid = Sale_Price - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Predicted Sale Price",
       y = "Residual") +
  theme_minimal()
If residuals are randomly scattered around zero, our model is capturing the underlying patterns well. Patterns in residuals (like curvature) suggest we might need additional features or a different model type.
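Curvature can be hard to judge by eye; one way to make it visible is to overlay a smoother on the residual plot. A sketch reusing the predictions from above:

```r
ames_test_preds %>%
  mutate(.resid = Sale_Price - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(alpha = 0.3) +
  # A loess trend line: if it stays roughly flat at zero, there is
  # no systematic curvature left in the residuals.
  geom_smooth(se = FALSE, color = "darkorange") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()
```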
Building More Complex Models
Regularized Regression
Ridge regression adds a penalty term to prevent overfitting, while Lasso can perform variable selection by shrinking coefficients to zero. Let’s try both:
# Ridge regression (L2 penalty)
# Ridge regression (L2 penalty)
ridge_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 0) %>%
  set_engine("glmnet")

# Lasso regression (L1 penalty)
lasso_spec <- linear_reg(mode = "regression",
                         penalty = 0.1,
                         mixture = 1) %>%
  set_engine("glmnet")
Cross-Validation
Cross-validation helps us estimate how well our model will perform on new data. K-fold cross-validation splits the training data into K subsets and trains the model K times, using each subset once for validation:
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)
Now we can fit our model across all folds:
ridge_res <- lm_workflow %>%
  update_model(ridge_spec) %>%
  fit_resamples(resamples = ames_folds)
collect_metrics(ridge_res)
Hyperparameter Tuning
For regularized models, we often want to tune the penalty parameter. The tune_grid() function helps us find the optimal value:
ridge_tune_spec <- linear_reg(mode = "regression",
                              penalty = tune(),
                              mixture = 0) %>%
  set_engine("glmnet")

ridge_wf <- lm_workflow %>%
  update_model(ridge_tune_spec)

param_grid <- grid_regular(penalty(), levels = 50)

tune_res <- ridge_wf %>%
  tune_grid(resamples = ames_folds,
            grid = param_grid)
show_best(tune_res, metric = "rmse")
The show_best() function displays the top models based on the specified metric.
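Once a good penalty is found, it can be plugged back into the workflow: select_best() picks the winning parameters, finalize_workflow() substitutes them for tune(), and last_fit() fits on the full training set and evaluates once on the test set:

```r
best_penalty <- select_best(tune_res, metric = "rmse")

final_wf <- ridge_wf %>%
  finalize_workflow(best_penalty)

# last_fit() trains on the training portion of the split
# and evaluates on the held-out test set.
final_res <- final_wf %>%
  last_fit(ames_split)

collect_metrics(final_res)
```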
Interpreting Your Model
Variable Importance
Understanding which features drive predictions is crucial. Keep in mind that raw coefficient magnitudes are only directly comparable when predictors are on similar scales (for example, after centering and scaling in the recipe); with that caveat, we can plot the largest coefficients:
lm_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  slice_max(abs_estimate, n = 10) %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Feature",
       y = "Coefficient Estimate",
       title = "Top 10 Most Important Features") +
  theme_minimal()
Understanding Coefficients
Coefficients tell us how each feature affects the target variable. For log-transformed targets, we interpret coefficients as percentage changes:
# If Gr_Liv_Area increases by 1 unit, Sale_Price increases by approximately (exp(coefficient) - 1) * 100%
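As a small worked example of that formula (the coefficient value here is made up for illustration, not taken from the fitted model):

```r
# Suppose the Year_Built coefficient on the log-price scale is 0.003.
coef_year <- 0.003
pct_change <- (exp(coef_year) - 1) * 100
pct_change   # about 0.3% higher price per additional year
```

For small coefficients, (exp(b) - 1) is close to b itself, which is why coefficients on a log outcome are often read directly as approximate proportional changes.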
Best Practices
- Start simple: Begin with a basic linear regression before moving to more complex models
- Preprocess consistently: Use recipes to ensure the same transformations are applied to training and test data
- Validate properly: Always use cross-validation to estimate model performance
- Tune hyperparameters: Don’t assume default parameters are optimal
- Interpret results: Understand what your model is telling you, not just the predictions
Summary
In this tutorial, you’ve learned the complete tidymodels workflow for regression:
- Split your data into training and testing sets
- Specify your model using parsnip
- Create a recipe for preprocessing with recipes
- Combine model and recipe in a workflow
- Fit the model to training data
- Evaluate performance using yardstick metrics
- Tune hyperparameters for better performance
- Interpret coefficients to understand feature effects
The tidymodels framework provides a consistent, tidy approach to regression modeling in R. As you continue your machine learning journey, you’ll find these same patterns apply to classification models, clustering, and more advanced techniques.