Introduction to Supervised Learning in R

· 6 min read · Updated March 26, 2026 · beginner
r machine-learning tidymodels parsnip regression classification

Supervised learning is a branch of machine learning where you build a model to predict an outcome from input data. The “supervised” part means your training data comes with the correct answers already labeled — you give the model examples of inputs paired with their known outputs, and it learns the relationship between them.

When the dust settles, supervised learning splits into two main problems. If the target you want to predict is a continuous number — house prices, temperature, fuel efficiency — you are doing regression. If the target is a categorical label — spam or not spam, species A or species B — you are doing classification. Every algorithm, metric, and interpretation decision flows from this distinction.

This tutorial walks through the supervised learning workflow in R using the tidymodels framework. You will see how to split data, define a preprocessing recipe, fit a model, and evaluate it. Two fully runnable examples cover the regression and classification cases separately.

The tidymodels Framework

R has no shortage of machine learning packages. The problem is that they all have different APIs: lm() uses formula syntax, randomForest() has its own defaults and data conventions, glm() needs a family = argument. tidymodels solves this by giving you a consistent interface across a wide range of algorithms and engines. It also separates the mechanics of preprocessing from modeling, which makes your code easier to reproduce and audit.

tidymodels is a meta-package that loads several focused libraries:

  • parsnip — defines the model type (linear regression, logistic regression, decision tree) independently of the underlying engine (lm, glmnet, ranger).
  • recipes — defines the preprocessing steps (scaling, encoding, imputation) as a reusable blueprint.
  • workflows — combines a recipe and a model into a single pipeline so you fit and predict in one place.
  • rsample — creates train/test splits and cross-validation folds.
  • yardstick — computes evaluation metrics.

You install everything with:

install.packages("tidymodels")

Your First Regression Model

You will predict a car’s fuel efficiency (mpg) from its weight (wt) using the mtcars dataset that ships with R. This is a classic regression problem.

Step 1 — Split the Data

Reserve 80% of the data for training and hold out 20% for testing. Avoid tuning your model on the test set.

library(tidymodels)

data(mtcars)

set.seed(123)  # make the random split reproducible

split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

Step 2 — Define a Recipe

A recipe captures every transformation you apply to the data. Here you declare that mpg is the target and wt is the predictor. For this simple example no additional preprocessing is needed, but recipes become powerful when you need to centre, scale, or encode variables.

rec <- recipe(mpg ~ wt, data = train)
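To see where recipes earn their keep, here is a hypothetical richer recipe (illustrative only, not used in the rest of this tutorial) that normalises numeric predictors and dummy-encodes any factor columns. The step functions and selectors are real recipes functions.

```r
# Illustrative recipe with preprocessing steps: normalise numeric
# predictors and dummy-encode any factor predictors.
rec_full <- recipe(mpg ~ wt + hp + cyl, data = train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
```

Each step is recorded but not executed until the workflow is fitted, so the exact same transformations are replayed on the test set at prediction time.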

Step 3 — Specify the Model

With parsnip you state what kind of model you want and which computational engine to use. linear_reg() with set_engine("lm") gives you ordinary least squares regression.

model <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

Step 4 — Combine into a Workflow

A workflow keeps the recipe and model together. This prevents preprocessing decisions from drifting between training and prediction.

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(model)

Step 5 — Fit and Predict

fitted <- fit(wf, data = train)

predictions <- predict(fitted, test) |>
  bind_cols(test)

predictions
#    .pred  mpg cyl  disp hp drat    wt  qsec vs am gear carb
#  1 23.148 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#  2 17.385 15.2   8 301.0 335 3.54 3.570 17.14  0  1    5    8
#  ...

Step 6 — Evaluate

Use yardstick metrics to see how far the predictions stray from the true values.

metrics(predictions, truth = mpg, estimate = .pred)
#   .metric .estimator .estimate
# 1 rmse    standard       2.85
# 2 rsq     standard       0.76
# 3 mae     standard       2.32

RMSE (root mean squared error) measures the typical size of a prediction error in the same units as the target. Lower is better. R² (rsq) is the proportion of variance in mpg explained by the model — 0.76 means the weight variable captures a strong majority of the variation in fuel efficiency.
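You can verify the RMSE definition by computing it by hand from the predictions tibble and comparing against yardstick:

```r
# RMSE by hand: root of the mean of the squared errors.
manual_rmse <- sqrt(mean((predictions$mpg - predictions$.pred)^2))
manual_rmse

# yardstick's rmse() gives the same number (up to floating point).
rmse(predictions, truth = mpg, estimate = .pred)
```

Squaring before averaging penalises large misses more heavily than small ones, which is why RMSE is usually a little larger than MAE.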

Your First Classification Model

Now switch to a classification problem. Predict whether a car has an automatic or manual transmission (am) from its fuel efficiency (mpg) and weight (wt).

mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))

set.seed(123)  # make the random split reproducible

split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

rec <- recipe(am ~ mpg + wt, data = train)

model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(model)

fitted <- fit(wf, data = train)

Generate both class predictions and probability estimates, then evaluate with metrics suited to classification.

predictions <- predict(fitted, test, type = "prob") |>
  bind_cols(predict(fitted, test)) |>
  bind_cols(test)

accuracy(predictions, truth = am, estimate = .pred_class)
#   .metric  .estimator .estimate
# 1 accuracy binary         0.875

# event_level = "second" because "manual" is the second factor level,
# and yardstick treats the first level as the event by default
roc_auc(predictions, truth = am, .pred_manual, event_level = "second")
#   .metric .estimator .estimate
# 1 roc_auc binary         0.917

Accuracy tells you the proportion of correct predictions — 87.5% here. ROC AUC measures the model’s ability to discriminate between classes across all decision thresholds. A score of 0.917 is well above the 0.5 random baseline, indicating strong separation between automatic and manual cars based on fuel efficiency and weight.
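Accuracy alone hides which class the errors fall on. yardstick's conf_mat() cross-tabulates predicted against true labels so you can see false positives and false negatives separately:

```r
# Confusion matrix: rows are predictions, columns are the truth.
conf_mat(predictions, truth = am, estimate = .pred_class)
```

On a heavily imbalanced target, a model can score high accuracy while misclassifying the rare class entirely, and the confusion matrix is the quickest way to catch that.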

Getting a More Reliable Performance Estimate

A single train/test split gives you one noisy estimate of performance. Cross-validation repeats the train-evaluate cycle across multiple splits of the data, producing a more stable picture. The caret package, tidymodels' predecessor, shows the idea compactly:

library(caret)

train_control <- trainControl(method = "cv", number = 5)

model <- train(mpg ~ wt + hp + cyl,
               data = mtcars,
               method = "lm",
               trControl = train_control)

print(model)
# Linear Regression
#
# 32 samples
# 3 predictor
#
# No pre-processing
# Resampling: Cross-Validated (5 fold)
# Summary of sample sizes: 25, 25, 26, 26, 26
#
# Resampling results:
#   RMSE      Rsquared   MAE
#   2.65      0.83       2.19

vfold_cv() from rsample does the same within the tidymodels universe. Since wf is now the classification workflow from the previous section, the resampled metrics are accuracy and ROC AUC:

folds <- vfold_cv(train, v = 5)

fitted_cv <- fit_resamples(wf, resamples = folds)
collect_metrics(fitted_cv)
#   .metric  .estimator  mean     n std_err .config
# 1 accuracy binary     0.84      5  0.05   Preprocessor1_Model1
# 2 roc_auc  binary     0.90      5  0.04   Preprocessor1_Model1

Five-fold cross-validation trains on 80% of the training set and validates on the remaining 20%, repeating this five times with different folds. The reported metric is the average across all five rounds, and the standard error tells you how much the metric varies between folds.

Choosing Between Regression and Classification

The choice is driven by your target variable. Ask yourself: does the target take ordered numeric values that could be any number in a range? Use regression. Is the target a discrete label with no meaningful ordering? Use classification.

A common source of confusion is the logistic regression name. Despite the “regression” in its name, logistic regression is a classification algorithm. It outputs probabilities that an observation belongs to a particular class, which you then threshold to produce a predicted label.
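The thresholding step can be made explicit. This sketch (assuming the predictions tibble from the classification example above) reproduces the class label by hand from the predicted probability; for a binary glm, predict()'s class output corresponds to a 0.5 cutoff:

```r
library(dplyr)

# Threshold the manual-transmission probability at 0.5 by hand and
# compare with the .pred_class column parsnip produced.
predictions |>
  mutate(manual_label = if_else(.pred_manual > 0.5, "manual", "automatic")) |>
  select(.pred_manual, manual_label, .pred_class)
```

Nothing forces you to keep the 0.5 cutoff: if false negatives are costlier than false positives, you can lower the threshold and trade precision for recall.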

For regression problems common algorithms include linear regression, decision trees, random forests, and gradient boosting machines. For classification the list includes logistic regression, k-nearest neighbours, support vector machines, and neural networks. The tidymodels framework lets you swap between these without changing your preprocessing pipeline.
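Swapping algorithms is where the workflow pays off. This sketch (assuming the ranger package is installed) replaces the logistic regression in the classification workflow with a random forest while keeping the recipe untouched:

```r
# Same recipe, different model: only the model spec changes.
rf_model <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

wf_rf <- wf |> update_model(rf_model)
fitted_rf <- fit(wf_rf, data = train)
```

Because preprocessing lives in the recipe, none of the split, prediction, or evaluation code needs to change when the algorithm does.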

What Comes Next

With the basic workflow in place, you can expand in several directions. Adding more features through additional recipe steps — step_scale(), step_center(), step_dummy() for categorical encoding — lets you handle real-world messy data. Switching the engine to glmnet, ranger, or xgboost lets you try more sophisticated algorithms without rewriting your pipeline. And using tune() with cross-validation gives you a principled way to select hyperparameters rather than guessing.
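As a taste of hyperparameter tuning, here is a minimal sketch (assuming the glmnet package is installed) that marks the regularisation penalty of a lasso logistic regression as tunable and searches over a grid with the cross-validation folds:

```r
# Mark penalty as a tuning parameter rather than fixing a value.
tune_model <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

tune_wf <- wf |> update_model(tune_model)
folds   <- vfold_cv(train, v = 5)

# Try 20 candidate penalty values, scored by cross-validation.
tuned <- tune_grid(tune_wf, resamples = folds, grid = 20)
show_best(tuned, metric = "roc_auc")
```

show_best() ranks the candidate values by the chosen metric, replacing guesswork with a resampled estimate of each configuration's performance.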

See Also

  • tryCatch() — handle errors and exceptions in R, useful when modelling pipelines fail