Introduction to Machine Learning in R

· 4 min read · Updated March 7, 2026 · beginner
machine-learning introduction caret tidymodels prediction

Machine learning (ML) is transforming how we extract insights from data. In this tutorial, you’ll learn what machine learning is, why R is an excellent choice for ML work, and how to build your first predictive model.

What is Machine Learning?

Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Instead of writing rigid rules, we give algorithms examples, and they find patterns themselves.

There are three main types of machine learning:

  • Supervised Learning: Learning from labeled data to make predictions (e.g., predicting house prices)
  • Unsupervised Learning: Finding patterns in unlabeled data (e.g., customer segmentation)
  • Reinforcement Learning: Learning through trial and error (e.g., game-playing agents)

Most practical ML work falls under supervised learning, which is what we’ll focus on here.
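To make "learning from labeled examples" concrete, here is a minimal supervised-learning sketch in base R using the built-in cars dataset: the labeled examples are speed/stopping-distance pairs, and the fitted model predicts a distance for a speed it has never seen.

```r
# Labeled data: each row pairs a speed (input) with a stopping distance (label)
head(cars)

# "Learn" the pattern from the examples by fitting a linear model
fit <- lm(dist ~ speed, data = cars)

# Predict the stopping distance for an unseen speed of 21 mph
predict(fit, newdata = data.frame(speed = 21))
```

This is the essence of supervised learning; the algorithms later in this tutorial follow the same fit-then-predict pattern with more flexible models.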

Why Use R for Machine Learning?

R was designed for statistical computing, making it naturally suited for machine learning. Here’s why data scientists love R for ML:

  • Rich ecosystem: Hundreds of ML packages for every algorithm
  • Statistical foundation: Built-in support for statistical modeling
  • Visualization: Seamless integration with ggplot2 for model diagnostics
  • Community: Strong academic and data science community

R has several frameworks for machine learning:

caret

The caret package (Classification And REgression Training) provides a unified interface to over 200 ML algorithms. It’s excellent for beginners because the same syntax works across different models.

# Install and load caret
install.packages("caret")
library(caret)

# Train a random forest (method = "rf" uses the randomForest package)
model <- train(Species ~ ., data = iris, method = "rf")

tidymodels

The tidymodels framework is the modern successor to caret, built on tidyverse design principles. It provides a collection of packages for modeling and machine learning.

# Install tidymodels (plus ranger, which provides the random forest engine)
install.packages(c("tidymodels", "ranger"))
library(tidymodels)

# Define a model specification
rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Fit the model
rf_spec %>% fit(Species ~ ., data = iris)

Other Important Packages

Package         Purpose
xgboost         Gradient boosting implementation
randomForest    Random forest algorithms
glmnet          Regularized regression
rpart           Decision trees
e1071           Support vector machines

Building Your First Model

Let’s walk through the complete machine learning workflow in R. We’ll predict whether a tumor is malignant or benign using the Wisconsin Breast Cancer dataset, available as BreastCancer in the mlbench package.

Step 1: Load and Explore Data

# Load the data (install.packages("mlbench") first if needed)
data("BreastCancer", package = "mlbench")
df <- BreastCancer

# Quick look at a few columns
head(df[, c("Cl.thickness", "Cell.size", "Cell.shape", "Bare.nuclei", "Class")])
#   Cl.thickness Cell.size Cell.shape Bare.nuclei  Class
# 1            5         1          1           1 benign
# 2            5         4          4          10 benign

# Check class distribution
table(df$Class)
#    benign malignant 
#       458       241

Step 2: Prepare the Data

# Drop the Id column and remove rows with missing values
df$Id <- NULL
df <- na.omit(df)

# Convert the factor predictors to numeric scores (glm expects numeric inputs);
# keep Class as a factor so caret treats this as a classification problem
df[, 1:9] <- lapply(df[, 1:9], function(x) as.numeric(as.character(x)))

# Split into training and testing sets, stratified by class
set.seed(123)
split <- createDataPartition(df$Class, p = 0.8, list = FALSE)

train_data <- df[split, ]
test_data <- df[-split, ]
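createDataPartition stratifies the split so both sets keep the same class balance. If you only need a quick random split, the same idea can be sketched in base R with sample() (no stratification), shown here on iris as stand-in data:

```r
# A quick 80/20 split with base R, using iris as stand-in data
set.seed(123)
n <- nrow(iris)
train_idx <- sample(n, size = round(0.8 * n))

train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

nrow(train_data)  # 120
nrow(test_data)   # 30
```

For imbalanced data like the breast cancer set, prefer the stratified split above so rare classes are represented in both partitions.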

Step 3: Train a Model

# Using caret to train a logistic regression
set.seed(123)
model <- train(
  Class ~ ., 
  data = train_data,
  method = "glm",
  trControl = trainControl(method = "cv", number = 5)
)

# View model results
print(model)

Step 4: Make Predictions

# Predict on test set
predictions <- predict(model, newdata = test_data)

# Evaluate performance
confusionMatrix(predictions, test_data$Class)
# Confusion Matrix and Statistics
# 
#            Reference
# Prediction  benign malignant
#   benign        89         3
#   malignant      0        48
#                                         
#                Accuracy : 0.9786

Understanding Model Evaluation

When evaluating ML models, you’ll encounter several key metrics:

  • Accuracy: Percentage of correct predictions
  • Precision: Of predicted positives, how many are truly positive
  • Recall: Of actual positives, how many did we catch
  • F1-Score: Harmonic mean of precision and recall
  • AUC-ROC: Ability to distinguish between classes

The confusion matrix above shows 97.86% accuracy, which is excellent for a first model. That said, accuracy alone can be misleading when classes are imbalanced, so check precision and recall as well.
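These metrics can be computed directly from the confusion-matrix counts. Taking malignant as the positive class, the matrix above gives TP = 48, FP = 0, FN = 3, TN = 89 (a base-R sketch):

```r
# Counts from the confusion matrix, with "malignant" as the positive class
TP <- 48; FP <- 0; FN <- 3; TN <- 89

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # correct / total
precision <- TP / (TP + FP)                   # predicted positives that are right
recall    <- TP / (TP + FN)                   # actual positives we caught
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 4)
#  accuracy precision    recall        f1 
#    0.9786    1.0000    0.9412    0.9697
```

Notice that recall (0.9412) is lower than accuracy: three malignant tumors were missed, which matters far more here than the headline accuracy suggests.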

Common Machine Learning Algorithms

Here are algorithms you’ll encounter frequently:

Linear and Logistic Regression

For continuous targets (regression) or binary classification (logistic regression). Simple, interpretable, and fast.

train(Class ~ ., data = train_data, method = "glm")

Decision Trees

Splits data based on feature values to make predictions. Easy to visualize and interpret.

train(Class ~ ., data = train_data, method = "rpart")

Random Forests

Ensemble of decision trees that votes for the final prediction. Generally more accurate than single trees.

train(Class ~ ., data = train_data, method = "rf")
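The voting step can be illustrated with a toy base-R sketch (this is not how randomForest works internally): three hypothetical trees each predict a class for five observations, and the majority wins.

```r
# Hypothetical predictions from three trees for five observations
tree_preds <- rbind(
  c("benign", "malignant", "benign",    "malignant", "benign"),
  c("benign", "malignant", "malignant", "malignant", "benign"),
  c("benign", "benign",    "benign",    "malignant", "malignant")
)

# Majority vote per column (i.e., per observation)
vote <- function(x) names(which.max(table(x)))
apply(tree_preds, 2, vote)
# [1] "benign"    "malignant" "benign"    "malignant" "benign"
```

Because each tree sees a different bootstrap sample of the data, their individual errors tend to cancel out in the vote.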

Gradient Boosting

Sequentially builds trees that correct errors from previous trees. Often achieves state-of-the-art performance.

train(Class ~ ., data = train_data, method = "gbm")

Next Steps

Now that you’ve built your first ML model, here’s what to explore next:

  1. Try different algorithms: Compare random forests, gradient boosting, and SVMs
  2. Tune hyperparameters: Use grid search to find optimal settings
  3. Learn feature engineering: Create new predictive features from existing data
  4. Explore tidymodels: The modern framework for production-ready ML pipelines
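As a starting point for item 2, caret's train() accepts a tuneGrid built with base R's expand.grid(): every row is one hyperparameter combination to be evaluated under cross-validation. The values below for gradient boosting are illustrative defaults, not tuned advice:

```r
# Candidate hyperparameter combinations for method = "gbm"
grid <- expand.grid(
  n.trees           = c(100, 200),
  interaction.depth = c(1, 3),
  shrinkage         = 0.1,
  n.minobsinnode    = 10
)
nrow(grid)  # 4 combinations to evaluate

# Passed to caret like this (requires the caret and gbm packages):
# model <- train(Class ~ ., data = train_data, method = "gbm",
#                tuneGrid = grid,
#                trControl = trainControl(method = "cv", number = 5))
```

caret then reports cross-validated performance for each row of the grid and refits the best combination on the full training set.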

Summary

Machine learning in R is accessible and powerful. In this tutorial, you learned:

  • What machine learning is and its main types
  • Why R is an excellent choice for ML work
  • Key packages: caret and tidymodels
  • The complete ML workflow: load data, prepare, train, predict, evaluate
  • Common algorithms and how to implement them

In the next tutorial, we’ll dive deeper into classification with the caret package, exploring how to compare multiple algorithms and tune model parameters for better performance.