Introduction to Machine Learning in R

March 7, 2026 · 7 min read · Updated May 29, 2026 · beginner

Tags: machine-learning, introduction, caret, tidymodels, prediction

Machine learning (ML) is transforming how we extract insights from data. In this tutorial, you’ll learn what machine learning is, why R is an excellent choice for ML work, and how to build your first predictive model.

What you’ll learn

This tutorial covers the key concepts and practical techniques of machine learning in R. By the end, you will know how to apply the core functions in real data analysis workflows.

What is machine learning?

Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Instead of writing rigid rules, we give algorithms examples, and they find patterns themselves.

There are three main types of machine learning:

Supervised Learning: Learning from labeled data to make predictions (e.g., predicting house prices)
Unsupervised Learning: Finding patterns in unlabeled data (e.g., customer segmentation)
Reinforcement Learning: Learning through trial and error (e.g., game-playing agents)

Most practical ML work falls under supervised learning, which is what we’ll focus on here.

Why use R for machine learning?

R was designed for statistical computing, making it naturally suited for machine learning. Here’s why data scientists love R for ML:

Rich ecosystem: Hundreds of ML packages for every algorithm
Statistical foundation: Built-in support for statistical modeling
Visualization: Smooth integration with ggplot2 for model diagnostics
Community: Strong academic and data science community

Popular machine learning packages in R

R has several frameworks for machine learning:

Caret

The caret package (Classification And REgression Training) provides a unified interface to over 200 ML algorithms. It’s excellent for beginners because the same syntax works across different models.

# Install and load caret
install.packages("caret")
library(caret)

# Train a random forest (method = "rf" uses the randomForest package,
# which must be installed separately)
model <- train(Species ~ ., data = iris, method = "rf")

Tidymodels

The tidymodels framework is the modern successor to caret, built around tidyverse principles. It provides a collection of packages for modeling and machine learning.

# Install tidymodels
install.packages("tidymodels")
library(tidymodels)

# Define a model specification
rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Fit the model (the "ranger" engine requires the ranger package)
rf_fit <- rf_spec %>% fit(Species ~ ., data = iris)

Other important packages

Package	Purpose
`xgboost`	Gradient boosting implementation
`randomForest`	Random forest algorithms
`glmnet`	Regularized regression
`rpart`	Decision trees
`e1071`	Support vector machines
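
Each of these packages can also be called directly, without going through caret or tidymodels. As a minimal sketch (the iris data and the object names tree_fit and tree_pred are chosen here for illustration and are not part of the tutorial's workflow), this fits a decision tree with rpart, which ships with standard R installations:

```r
# Sketch: calling rpart directly instead of going through caret
library(rpart)

# Fit a classification tree; method = "class" requests classification
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict class labels (on the training data, for illustration only)
tree_pred <- predict(tree_fit, newdata = iris, type = "class")

# Cross-tabulate predictions against the true species
table(Predicted = tree_pred, Actual = iris$Species)
```

The trade-off is that each package has its own arguments and prediction conventions (type = "class" here), which is the inconsistency that caret and tidymodels smooth over.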

Building your first model

Let’s walk through the complete machine learning workflow in R. We’ll predict whether a tumor is malignant or benign using the Wisconsin Breast Cancer dataset.

Step 1: load and explore data

# Load the data (requires the mlbench package: install.packages("mlbench"))
# The Wisconsin Breast Cancer data is stored in mlbench as "BreastCancer"
data("BreastCancer", package = "mlbench")
df <- BreastCancer

# Quick look at the data
head(df)
# Columns: Id, Cl.thickness, Cell.size, Cell.shape, Marg.adhesion,
# Epith.c.size, Bare.nuclei, Bl.cromatin, Normal.nucleoli, Mitoses, Class

# Check class distribution
table(df$Class)
#    benign malignant 
#       458       241

Step 2: prepare the data

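# Drop the sample ID, which is an identifier rather than a predictor
# (assigning NULL is harmless if the column is absent)
df$Id <- NULL

# Remove rows with missing values
df <- na.omit(df)

# Convert the predictors from factors to numeric (some algorithms need this);
# keep Class as a factor so caret treats the task as classification
predictors <- setdiff(names(df), "Class")
df[predictors] <- lapply(df[predictors], function(x) as.numeric(as.character(x)))

# Split into training and testing sets, stratified by class
library(caret)
set.seed(123)
train_index <- createDataPartition(df$Class, p = 0.8, list = FALSE)

train_data <- df[train_index, ]
test_data <- df[-train_index, ]

Step 3: train a model

# Using caret to train a logistic regression
# (with a two-level factor outcome, method = "glm" fits a binomial model)
set.seed(123)
model <- train(
  Class ~ ., 
  data = train_data,
  method = "glm",
  trControl = trainControl(method = "cv", number = 5)
)

# View model results
print(model)

Step 4: make predictions

# Predict on test set
predictions <- predict(model, newdata = test_data)

# Evaluate performance, treating "malignant" as the positive class
confusionMatrix(predictions, test_data$Class, positive = "malignant")
# Illustrative output; exact counts depend on the split and package versions
# Confusion Matrix and Statistics
# 
#            Reference
# Prediction  benign malignant
#   benign        89         3
#   malignant      0        48
#                                         
#                Accuracy : 0.9786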

Understanding model evaluation

When evaluating ML models, you’ll encounter several key metrics:

Accuracy: Percentage of correct predictions
Precision: Of predicted positives, how many are truly positive
Recall: Of actual positives, how many did we catch
F1-Score: Harmonic mean of precision and recall
AUC-ROC: Ability to distinguish between classes

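The confusion matrix above shows 97.86% accuracy, an excellent result for a first model. Accuracy alone can hide costly errors, though: the 3 malignant tumors predicted as benign put recall for the malignant class at 48/51, or about 94%.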
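
To make the metric definitions concrete, the sketch below recomputes them by hand from the four counts in the confusion matrix above, treating malignant as the positive class. The variable names (tp, fp, fn, tn) are introduced here for illustration. AUC-ROC is left out because it needs predicted probabilities, not just the counts.

```r
# Counts from the confusion matrix above (malignant = positive class)
tp <- 48  # predicted malignant, actually malignant
fp <- 0   # predicted malignant, actually benign
fn <- 3   # predicted benign, actually malignant
tn <- 89  # predicted benign, actually benign

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)
#  accuracy precision    recall        f1 
#    0.9786    1.0000    0.9412    0.9697
```

In practice, caret's confusionMatrix() reports most of these for you; working through the arithmetic once makes the printed statistics easier to read.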

Common machine learning algorithms

Here are algorithms you’ll encounter frequently:

Linear and logistic regression

Linear regression handles continuous targets, and logistic regression handles binary classification. Both are simple, interpretable, and fast.

train(Class ~ ., data = train_data, method = "glm")

Decision trees

Splits data based on feature values to make predictions. Easy to visualize and interpret.

train(Class ~ ., data = train_data, method = "rpart")

Random forests

Ensemble of decision trees that votes for the final prediction. Generally more accurate than single trees.

train(Class ~ ., data = train_data, method = "rf")

Gradient boosting

Sequentially builds trees that correct errors from previous trees. Often achieves state-of-the-art performance.

train(Class ~ ., data = train_data, method = "gbm")

Model types overview

Regression models predict continuous outcomes: linear regression (lm), regularized regression (ridge, lasso with glmnet), gradient boosting (xgboost), random forest (ranger).

Classification models predict categorical outcomes: logistic regression, random forest, XGBoost, support vector machines (kernlab). In tidymodels the model interface is the same as for regression: change mode = "regression" to mode = "classification" in the model spec.
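
As a small sketch of switching the mode, the example below builds two parsnip specs that differ only in set_mode(). The decision_tree() model with the rpart engine is chosen here because rpart ships with R; the mtcars and iris data and the object names are illustrative assumptions, not part of the tutorial.

```r
library(tidymodels)

# Same model type and engine; only the mode differs
tree_reg <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_cls <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# A numeric outcome for regression, a factor outcome for classification
reg_fit <- fit(tree_reg, mpg ~ ., data = mtcars)
cls_fit <- fit(tree_cls, Species ~ ., data = iris)

reg_pred <- predict(reg_fit, new_data = mtcars)  # numeric column .pred
cls_pred <- predict(cls_fit, new_data = iris)    # factor column .pred_class

head(reg_pred)
head(cls_pred)
```

Note that the prediction column name changes with the mode (.pred versus .pred_class), which is worth knowing before passing results to an evaluation function.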

Unsupervised methods include k-means clustering (kmeans()), hierarchical clustering (hclust()), and PCA for dimensionality reduction (prcomp()).
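
All three functions live in base R's stats package, so they can be tried without installing anything. The sketch below runs them on the four numeric iris measurements; the choice of three clusters, nstart = 25, and complete linkage are illustrative settings, not recommendations from this tutorial.

```r
# Use only the numeric measurements; the Species label is deliberately ignored
x <- scale(iris[, 1:4])

# k-means with 3 clusters; nstart = 25 tries several random starting points
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)
table(km$cluster)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)
table(hc_groups)

# PCA: proportion of variance explained by each principal component
pca <- prcomp(x)
summary(pca)$importance["Proportion of Variance", ]
```

Scaling first matters because k-means, hierarchical clustering, and PCA all depend on distances or variances, so an unscaled feature with a large range would dominate.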

Data splitting

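A test set held out from all training and tuning gives an unbiased estimate of final model performance. Calling set.seed(42) before splitting makes the split reproducible. For time series data, split by time rather than randomly: a model trained on observations that come after its test period has effectively seen the future, which inflates its apparent performance. rsample::initial_time_split(data, prop = 0.8) creates a temporal split by taking the first 80% of rows for training, so sort the data by time first.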

Cross-validation within the training set estimates performance during tuning: vfold_cv(train, v = 10) creates 10 folds. Use the test set only once, at the very end, to report final performance.
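
The sketch below puts the temporal split and the cross-validation folds together. The daily data frame is hypothetical, simulated here only so the example runs on its own; the rsample functions are the ones named above.

```r
library(rsample)

# Hypothetical daily series, already ordered by date
set.seed(42)
daily <- data.frame(
  date  = seq(as.Date("2024-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# First 80% of rows go to training, the last 20% to testing
time_split <- initial_time_split(daily, prop = 0.8)
train <- training(time_split)
test  <- testing(time_split)

# Every training date precedes every test date
max(train$date) < min(test$date)

# 10-fold cross-validation inside the training set only
folds <- vfold_cv(train, v = 10)
folds
```

Note that vfold_cv() assigns rows to folds at random, which is fine for tuning on independent rows but ignores time order; for genuinely temporal tuning, rsample also offers rolling-origin style resampling.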

The machine learning workflow

Machine learning in R follows a consistent workflow: define the problem (classification, regression, clustering), prepare data (clean, encode, split), train a model, evaluate performance, and iterate. The tidymodels ecosystem standardizes each step with a coherent API.

The train/test split is foundational. rsample::initial_split(df, prop = 0.8, strata = outcome) creates a stratified split, so the training and test sets have approximately the same proportion of outcome classes. training(split) and testing(split) extract the sets. Never evaluate a model on data used for training.

Feature engineering goes in a recipes::recipe(). This records preprocessing steps that are fit on training data and applied to test data, preventing data leakage. step_normalize() centers and scales numeric features using means and standard deviations computed from the training set only.
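
A minimal sketch of the split-then-preprocess pattern follows, using iris as a stand-in dataset (an assumption for illustration). It shows that prep() learns the normalization statistics from the training set and bake() reuses them on the test set.

```r
library(tidymodels)

# Stratified 80/20 split on the outcome
set.seed(42)
split <- initial_split(iris, prop = 0.8, strata = Species)
train <- training(split)
test  <- testing(split)

# Define preprocessing; nothing is estimated yet
rec <- recipe(Species ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates means and standard deviations from the training set only
rec_prepped <- prep(rec, training = train)

# bake() applies those training-set statistics to any data
train_baked <- bake(rec_prepped, new_data = NULL)
test_baked  <- bake(rec_prepped, new_data = test)

# Training means are zero by construction; test means generally are not
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
round(colMeans(train_baked[num_cols]), 3)
round(colMeans(test_baked[num_cols]), 3)
```

The nonzero test-set means are the point: the test data is transformed with numbers it had no part in computing, which is what prevents leakage.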

Supervised learning overview

Supervised learning trains on labeled examples: each row has a known outcome. Classification predicts categories (spam/not spam, species, customer segment). Regression predicts continuous values (house price, revenue, temperature).

Algorithm families: linear methods (logistic regression, linear regression, elastic net) are fast, interpretable, and work well when relationships are approximately linear. Tree-based methods (decision trees, random forests, gradient boosting) handle nonlinear relationships and interactions without feature engineering. Support vector machines find maximum-margin boundaries. Neural networks learn arbitrary function approximations given enough data.

No single algorithm dominates. Gradient boosting (XGBoost, LightGBM) is a frequent winner in tabular data competitions, while linear models are faster to train and easier to interpret. For a new problem, try linear models first: if they perform adequately, the simplicity is valuable.
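
One way to act on the "linear first" advice is to fit a linear baseline and a tree-based alternative on the same split and compare held-out error. The sketch below does this on mtcars with base R's lm() and rpart; the dataset, the two predictors, the 24-row training size, and the rmse() helper are all illustrative assumptions.

```r
library(rpart)

# Hold out 8 of the 32 rows for testing
set.seed(42)
train_idx <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Helper (defined here, not from a package): root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Baseline: linear regression on two predictors
lm_fit <- lm(mpg ~ wt + hp, data = train)

# Alternative: regression tree on the same predictors
tree_fit <- rpart(mpg ~ wt + hp, data = train, method = "anova")

# Compare held-out error; lower is better
test_rmse <- c(
  linear = rmse(test$mpg, predict(lm_fit, newdata = test)),
  tree   = rmse(test$mpg, predict(tree_fit, newdata = test))
)
test_rmse
```

With a dataset this small the comparison is noisy, so treat it as a template for the workflow and not as evidence about which model family is better.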

Getting started with tidymodels

The tidymodels ecosystem provides a unified interface to dozens of R modeling packages. Install the full set with install.packages("tidymodels"). The core packages are rsample for splitting data, recipes for preprocessing, parsnip for model specification, workflows for bundling recipes and models, tune for hyperparameter search, and yardstick for evaluation. These packages share a consistent API design, so what you learn in one transfers directly to the others. The tidymodels website provides extensive documentation and worked examples across common machine learning tasks.
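
To show how the core packages fit together, here is a minimal end-to-end sketch. It assumes a two-class version of iris (setosa dropped) and a logistic regression with the glm engine, both chosen for illustration because they need no extra packages; tune is omitted since this model has no hyperparameters to search.

```r
library(tidymodels)

# Two-class problem: keep two of the three iris species
two_class <- droplevels(subset(iris, Species != "setosa"))

# rsample: stratified train/test split
set.seed(42)
split <- initial_split(two_class, prop = 0.8, strata = Species)
train <- training(split)
test  <- testing(split)

# recipes: preprocessing definition
rec <- recipe(Species ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# parsnip: model specification
spec <- logistic_reg() %>%
  set_engine("glm")

# workflows: bundle the recipe and the model, then fit on training data
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)
wf_fit <- fit(wf, data = train)

# yardstick: score predictions on the held-out test set
preds <- bind_cols(test, predict(wf_fit, new_data = test))
accuracy(preds, truth = Species, estimate = .pred_class)
```

Because the recipe travels inside the workflow, predict() applies the training-set preprocessing to new data automatically.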

Model evaluation

Use cross-validation for performance estimates on small-to-medium datasets. rsample::vfold_cv(train, v = 10) creates 10-fold CV splits. tune::fit_resamples(workflow, resamples) fits the model on each fold. collect_metrics() averages performance across folds.

Avoid relying on accuracy alone for imbalanced problems: a model that predicts the majority class every time has high accuracy but no utility. Use precision, recall, F1, and AUC-ROC for classification; RMSE, MAE, and R-squared for regression.

yardstick::metric_set(accuracy, roc_auc, f_meas) creates a multi-metric evaluator. conf_mat(pred_df, truth, estimate) builds the confusion matrix, and calling summary() on the result reports statistics derived from it. Understanding where the model fails (which classes it confuses) guides feature engineering.
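
The evaluation functions in this section can be combined as in the sketch below. It assumes a two-class subset of iris standing in for the training set and a logistic regression on two sepal measurements; those choices, and all object names, are illustrative.

```r
library(tidymodels)

# Stand-in for a training set: two iris species, two predictors
two_class <- droplevels(subset(iris, Species != "setosa"))

# 10-fold cross-validation, stratified by the outcome
set.seed(42)
folds <- vfold_cv(two_class, v = 10, strata = Species)

wf <- workflow() %>%
  add_formula(Species ~ Sepal.Length + Sepal.Width) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Class metrics (accuracy, f_meas) and a probability metric (roc_auc) together
cls_metrics <- metric_set(accuracy, roc_auc, f_meas)

# Fit on each fold and keep the out-of-fold predictions
cv_results <- fit_resamples(
  wf,
  resamples = folds,
  metrics   = cls_metrics,
  control   = control_resamples(save_pred = TRUE)
)

# One row per metric, averaged over the 10 folds
cv_metrics <- collect_metrics(cv_results)
cv_metrics

# Confusion matrix pooled over the out-of-fold predictions
cv_preds <- collect_predictions(cv_results)
cm <- conf_mat(cv_preds, truth = Species, estimate = .pred_class)
cm
summary(cm)  # derived statistics such as sensitivity and specificity
```

Setting save_pred = TRUE is what makes the pooled confusion matrix possible; without it, only the aggregated metrics are kept.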

Summary

Machine learning in R is accessible and powerful. In this tutorial, you learned:

What machine learning is and its main types
Why R is an excellent choice for ML work
Key packages: caret and tidymodels
The complete ML workflow: load data, prepare, train, predict, evaluate
Common algorithms and how to implement them

In the next tutorial, we’ll dive deeper into classification with the caret package, exploring how to compare multiple algorithms and tune model parameters for better performance.

Next steps

Now that you’ve built your first ML model, here’s what to explore next:

Try different algorithms: Compare random forests, gradient boosting, and SVMs
Tune hyperparameters: Use grid search to find optimal settings
Learn feature engineering: Create new predictive features from existing data
Explore tidymodels: The modern framework for production-ready ML pipelines