Introduction to Machine Learning in R
Machine learning (ML) is transforming how we extract insights from data. In this tutorial, you’ll learn what machine learning is, why R is an excellent choice for ML work, and how to build your first predictive model.
What you’ll learn
This tutorial covers the key concepts and practical techniques for machine learning in R. By the end, you will know how to apply the core functions in real data analysis workflows.
What is machine learning?
Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Instead of writing rigid rules, we give algorithms examples, and they find patterns themselves.
There are three main types of machine learning:
- Supervised Learning: Learning from labeled data to make predictions (e.g., predicting house prices)
- Unsupervised Learning: Finding patterns in unlabeled data (e.g., customer segmentation)
- Reinforcement Learning: Learning through trial and error (e.g., game-playing agents)
Most practical ML work falls under supervised learning, which is what we’ll focus on here.
Why use R for machine learning?
R was designed for statistical computing, making it naturally suited for machine learning. Here’s why data scientists love R for ML:
- Rich ecosystem: Hundreds of ML packages for every algorithm
- Statistical foundation: Built-in support for statistical modeling
- Visualization: Smooth integration with ggplot2 for model diagnostics
- Community: Strong academic and data science community
Popular machine learning packages in R
R has several frameworks for machine learning:
Caret
The caret package (Classification And REgression Training) provides a unified interface to over 200 ML algorithms. It’s excellent for beginners because the same syntax works across different models.
# Install and load caret
install.packages("caret")
library(caret)
# Train a simple model
model <- train(Species ~ ., data = iris, method = "rf")
Tidymodels
The tidymodels framework is the modern successor to caret, built around tidyverse principles. It is a collection of packages for modeling and machine learning that share one interface.
# Install tidymodels
install.packages("tidymodels")
library(tidymodels)
# Define a model specification
rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")
# Fit the model
rf_spec %>% fit(Species ~ ., data = iris)
Other important packages
| Package | Purpose |
|---|---|
| xgboost | Gradient boosting implementation |
| randomForest | Random forest algorithms |
| glmnet | Regularized regression |
| rpart | Decision trees |
| e1071 | Support vector machines |
Building your first model
Let’s walk through the complete machine learning workflow in R. We’ll predict whether a tumor is malignant or benign using the Wisconsin Breast Cancer dataset.
Step 1: load and explore data
# Load the data (the dataset is named BreastCancer in the mlbench package)
data("BreastCancer", package = "mlbench")
df <- BreastCancer
# Quick look at the data
head(df)
# Columns: Id, nine cytology scores (Cl.thickness, Cell.size, Cell.shape,
# Marg.adhesion, Epith.c.size, Bare.nuclei, Bl.cromatin, Normal.nucleoli,
# Mitoses), and the outcome Class
# Check class distribution
table(df$Class)
#    benign malignant
#       458       241
Step 2: prepare the data
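# Drop the sample ID, which carries no predictive information
df$Id <- NULL
# Remove rows with missing values
df <- na.omit(df)
# The predictors are stored as factors; convert them to numeric scores
predictors <- setdiff(names(df), "Class")
df[predictors] <- lapply(df[predictors], function(x) as.numeric(as.character(x)))
# Encode the outcome as a 0/1 factor so caret fits a classifier, not a regression
df$Class <- factor(ifelse(df$Class == "malignant", 1, 0), levels = c(0, 1))
# Split into training and testing sets, stratified by class
set.seed(123)
train_idx <- createDataPartition(df$Class, p = 0.8, list = FALSE)
train_data <- df[train_idx, ]
test_data <- df[-train_idx, ]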
Step 3: train a model
# Using caret to train a logistic regression
set.seed(123)
model <- train(
  Class ~ .,
  data = train_data,
  method = "glm",
  trControl = trainControl(method = "cv", number = 5)
)
# View model results
print(model)
Step 4: make predictions
# Predict on test set
predictions <- predict(model, newdata = test_data)
# Evaluate performance, treating malignant (1) as the positive class
confusionMatrix(predictions, test_data$Class, positive = "1")
# Example output (exact counts depend on the split):
# Confusion Matrix and Statistics
#
#           Reference
# Prediction  0  1
#          0 89  3
#          1  0 48
#
# Accuracy : 0.9786
Understanding model evaluation
When evaluating ML models, you’ll encounter several key metrics:
- Accuracy: Percentage of correct predictions
- Precision: Of predicted positives, how many are truly positive
- Recall: Of actual positives, how many did we catch
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Ability to distinguish between classes
The confusion matrix above shows 97.86% accuracy (137 of 140 test cases classified correctly), an excellent result for a first model.
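As a worked check, the other metrics in the list can be computed by hand from the four counts in that confusion matrix, treating class 1 (malignant) as the positive class. This is a minimal base R sketch; the variable names are illustrative.

```r
# Counts read off the confusion matrix (rows = prediction, columns = reference)
tn <- 89  # predicted 0, actually 0
fn <- 3   # predicted 0, actually 1
fp <- 0   # predicted 1, actually 0
tp <- 48  # predicted 1, actually 1

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)
#  accuracy precision    recall        f1
#    0.9786    1.0000    0.9412    0.9697
```

All three errors are malignant cases predicted as benign, so recall is lower than precision. For a screening task that is usually the costlier kind of mistake, and accuracy alone does not show it.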
Common machine learning algorithms
Here are algorithms you’ll encounter frequently:
Linear and logistic regression
For continuous targets (regression) or binary classification (logistic regression). Simple, interpretable, and fast.
train(Class ~ ., data = train_data, method = "glm")
Decision trees
Splits data based on feature values to make predictions. Easy to visualize and interpret.
train(Class ~ ., data = train_data, method = "rpart")
Random forests
Ensemble of decision trees that votes for the final prediction. Generally more accurate than single trees.
train(Class ~ ., data = train_data, method = "rf")
Gradient boosting
Sequentially builds trees that correct errors from previous trees. Often achieves state-of-the-art performance.
train(Class ~ ., data = train_data, method = "gbm")
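To choose between algorithms on a given dataset, one option is to train them under the same resampling scheme and compare their cross-validated scores. The sketch below assumes caret and rpart are installed and uses a two-class subset of iris as a stand-in for the tumor data; the object names are illustrative.

```r
library(caret)

# Two-class subset of iris so that logistic regression applies
two_class <- droplevels(subset(iris, Species != "setosa"))

# Same resampling scheme for both models
ctrl <- trainControl(method = "cv", number = 5)

# Resetting the seed before each call gives both models identical folds
set.seed(123)
fit_glm <- train(Species ~ ., data = two_class, method = "glm", trControl = ctrl)
set.seed(123)
fit_tree <- train(Species ~ ., data = two_class, method = "rpart", trControl = ctrl)

# Collect the cross-validated accuracy and kappa of both models
results <- resamples(list(logistic = fit_glm, tree = fit_tree))
summary(results)
```

Because both models see the same folds, differences in the summary reflect the algorithms rather than the luck of the split.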
Next steps
Now that you’ve built your first ML model, here’s what to explore next:
- Try different algorithms: Compare random forests, gradient boosting, and SVMs
- Tune hyperparameters: Use grid search to find optimal settings
- Learn feature engineering: Create new predictive features from existing data
- Explore tidymodels: The modern framework for production-ready ML pipelines
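As a starting point for the hyperparameter tuning item, a grid search in caret might look like the sketch below. It tunes the complexity parameter (cp) of an rpart decision tree on iris; the candidate values are arbitrary and chosen only for illustration.

```r
library(caret)

# Candidate values for rpart's complexity parameter (illustrative choices)
grid <- expand.grid(cp = c(0.001, 0.01, 0.1))

set.seed(123)
tuned <- train(
  Species ~ .,
  data = iris,
  method = "rpart",
  tuneGrid = grid,
  trControl = trainControl(method = "cv", number = 5)
)

tuned$results   # cross-validated accuracy for each cp value
tuned$bestTune  # the cp value caret selected
```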
Summary
Machine learning in R is accessible and powerful. In this tutorial, you learned:
- What machine learning is and its main types
- Why R is an excellent choice for ML work
- Key packages: caret and tidymodels
- The complete ML workflow: load data, prepare, train, predict, evaluate
- Common algorithms and how to implement them
In the next tutorial, we’ll dive deeper into classification with the caret package, exploring how to compare multiple algorithms and tune model parameters for better performance.
Model types overview
Regression models predict continuous outcomes: linear regression (lm), regularized regression (ridge, lasso with glmnet), gradient boosting (xgboost), random forest (ranger).
Classification models predict categorical outcomes: logistic regression, random forest, XGBoost, support vector machines (kernlab). In tidymodels the model interface is the same as for regression: change mode = "regression" to mode = "classification" in the model spec.
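The mode switch can be sketched with a decision tree, assuming tidymodels and rpart are installed. The datasets (mtcars, iris) and object names are chosen only for illustration.

```r
library(tidymodels)

# One model type, one engine; only the mode differs
tree_reg <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_cls <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Regression: numeric outcome
reg_fit <- tree_reg %>% fit(mpg ~ ., data = mtcars)
# Classification: factor outcome
cls_fit <- tree_cls %>% fit(Species ~ ., data = iris)

reg_pred <- predict(reg_fit, new_data = head(mtcars, 3))  # has a .pred column
cls_pred <- predict(cls_fit, new_data = head(iris, 3))    # has a .pred_class column
```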
Unsupervised methods include k-means clustering (kmeans()), hierarchical clustering (hclust()), and PCA for dimensionality reduction (prcomp()).
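All three functions ship with base R. The sketch below applies them to the iris measurements with the species label set aside; the choice of three clusters is an assumption made for this example.

```r
# Drop the label and standardize the features
features <- scale(iris[, 1:4])

# k-means with 3 clusters (nstart restarts guard against poor initializations)
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 25)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# PCA: cumulative variance captured by the first two components
pca <- prcomp(features)
summary(pca)$importance["Cumulative Proportion", 1:2]

# Compare the discovered clusters with the species labels that were held back
table(cluster = km$cluster, species = iris$Species)
```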
Data splitting
A test set held out from all training and tuning gives an unbiased estimate of final model performance. Calling set.seed(42) before splitting makes the split reproducible. For time series data, split by time rather than randomly: a model must not be trained on observations that come after the ones it is evaluated on. rsample::initial_time_split(data, prop = 0.8) creates a temporal split, using the first 80% of rows for training and the rest for testing.
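A minimal sketch of a temporal split, assuming rsample is installed. The daily data frame is made up for illustration and is already sorted by date, which initial_time_split() relies on because it splits by row order.

```r
library(rsample)

# Hypothetical daily series, already sorted by date
set.seed(42)
daily <- data.frame(
  date  = seq(as.Date("2024-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# The first 80% of rows become training data, the last 20% the test set
ts_split <- initial_time_split(daily, prop = 0.8)
train_ts <- training(ts_split)
test_ts  <- testing(ts_split)

range(train_ts$date)  # earliest and latest training dates
range(test_ts$date)   # every test date falls after the training period
```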
Cross-validation within the training set estimates performance during tuning: vfold_cv(train, v = 10) creates 10 folds. Use the test set only once, at the very end, to report final performance.
The machine learning workflow
Machine learning in R follows a consistent workflow: define the problem (classification, regression, clustering), prepare data (clean, encode, split), train a model, evaluate performance, and iterate. The tidymodels ecosystem standardizes each step with a coherent API.
The train/test split is foundational. rsample::initial_split(df, prop = 0.8, strata = outcome) creates a stratified split, so the training and test sets have approximately the same proportion of outcome classes. training(split) and testing(split) extract the two sets. Never evaluate a model on data used for training.
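The effect of stratification can be checked directly. This sketch assumes rsample is installed and uses iris, with Species standing in for the outcome column.

```r
library(rsample)

set.seed(42)
iris_split <- initial_split(iris, prop = 0.8, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

# Class proportions should be nearly identical in both sets
prop.table(table(iris_train$Species))
prop.table(table(iris_test$Species))
```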
Feature engineering goes in a recipes::recipe(). This records preprocessing steps that are fit on training data and applied to test data, preventing data leakage. step_normalize() centers and scales numeric features using means and standard deviations computed from the training set only.
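The sketch below shows how a recipe keeps test data out of the preprocessing statistics, assuming tidymodels is installed. The mtcars data and the car_* names are illustrative.

```r
library(tidymodels)

set.seed(42)
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Declare the preprocessing; nothing is computed yet
rec <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates means and standard deviations from the training set only
rec_prepped <- prep(rec, training = car_train)

# bake() applies those training-set statistics to any data set
train_baked <- bake(rec_prepped, new_data = NULL)  # the processed training set
test_baked  <- bake(rec_prepped, new_data = car_test)

colMeans(train_baked[, c("hp", "wt")])  # zero up to floating-point error
colMeans(test_baked[, c("hp", "wt")])   # generally not zero
```

The test-set means are generally not zero because the test rows were scaled with the training-set statistics, which is the behavior that prevents leakage.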
Supervised learning overview
Supervised learning trains on labeled examples: each row has a known outcome. Classification predicts categories (spam/not spam, species, customer segment). Regression predicts continuous values (house price, revenue, temperature).
Algorithm families: linear methods (logistic regression, linear regression, elastic net) are fast, interpretable, and work well when relationships are approximately linear. Tree-based methods (decision trees, random forests, gradient boosting) handle nonlinear relationships and interactions without feature engineering. Support vector machines find maximum-margin boundaries. Neural networks learn arbitrary function approximations given enough data.
No single algorithm dominates. Gradient boosting (XGBoost, LightGBM) is a frequent winner in tabular data competitions, while linear models are faster to train and easier to interpret. For a new problem, try linear models first: if they perform adequately, the simplicity is valuable.
Getting started with tidymodels
The tidymodels ecosystem provides a unified interface to dozens of R modeling packages. Install the full set with install.packages("tidymodels"). The core packages are rsample for splitting data, recipes for preprocessing, parsnip for model specification, workflows for bundling recipes and models, tune for hyperparameter search, and yardstick for evaluation. These packages share a consistent API design, so what you learn in one transfers directly to the others. The tidymodels website provides extensive documentation and worked examples across common machine learning tasks.
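The sketch below shows one way these packages fit together, with a comment marking which package each step comes from. It assumes tidymodels and rpart are installed. The decision tree and the iris data are illustrative choices, not a recommendation.

```r
library(tidymodels)

set.seed(42)
iris_split <- initial_split(iris, prop = 0.8, strata = Species)   # rsample

iris_rec <- recipe(Species ~ ., data = training(iris_split)) %>%  # recipes
  step_normalize(all_numeric_predictors())

tree_spec <- decision_tree() %>%                                  # parsnip
  set_engine("rpart") %>%
  set_mode("classification")

iris_wf <- workflow() %>%                                         # workflows
  add_recipe(iris_rec) %>%
  add_model(tree_spec)

# Fitting the workflow preps the recipe and trains the model in one step
iris_fit <- fit(iris_wf, data = training(iris_split))

# Predict on the held-out rows and score them
iris_pred <- predict(iris_fit, new_data = testing(iris_split)) %>%
  bind_cols(testing(iris_split))

accuracy(iris_pred, truth = Species, estimate = .pred_class)      # yardstick
```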
Model evaluation
Use cross-validation for performance estimates on small-to-medium datasets. rsample::vfold_cv(train, v = 10) creates 10-fold CV splits. tune::fit_resamples(workflow, resamples) fits the model on each fold. collect_metrics() averages performance across folds.
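These three calls can be combined as in the sketch below, assuming tidymodels and rpart are installed. For brevity the full iris data stands in for a training set, and the decision tree is an arbitrary choice of model.

```r
library(tidymodels)

# In a real project, pass the training set here rather than the full data
set.seed(42)
folds <- vfold_cv(iris, v = 10, strata = Species)

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(tree_spec)

# Fit on 9 folds and evaluate on the 10th, once per fold
cv_results <- fit_resamples(wf, resamples = folds)

# One row per metric, averaged over the 10 folds
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```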
Avoid relying on accuracy for imbalanced problems: a model that predicts the majority class every time has high accuracy but no utility. Use precision, recall, F1, and AUC-ROC for classification; RMSE, MAE, and R-squared for regression.
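The majority-class problem is easy to reproduce in base R. The class counts below are made up for illustration.

```r
# Hypothetical imbalanced outcome: 950 negatives, 50 positives
truth <- factor(c(rep("neg", 950), rep("pos", 50)), levels = c("neg", "pos"))

# A "model" that always predicts the majority class
pred <- factor(rep("neg", 1000), levels = c("neg", "pos"))

accuracy <- mean(pred == truth)
recall   <- sum(pred == "pos" & truth == "pos") / sum(truth == "pos")

accuracy  # 0.95
recall    # 0
```

The model is right 95% of the time and still misses every positive case, which is why recall or AUC-ROC is the more informative measure here.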
yardstick::metric_set(accuracy, roc_auc, f_meas) creates a multi-metric evaluator. conf_mat(pred_df, truth, estimate) builds the confusion matrix, and calling summary() on the result adds statistics such as sensitivity and specificity. Understanding where the model fails (which classes it confuses) guides feature engineering.
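A short sketch of both functions, assuming yardstick is installed. It uses two_class_example, a small data set of predictions that ships with yardstick, in place of the pred_df mentioned above.

```r
library(yardstick)

# Columns: truth, Class1 and Class2 (predicted probabilities), predicted (class)
data(two_class_example)

multi_metric <- metric_set(accuracy, roc_auc, f_meas)

# Class metrics use `estimate`; roc_auc uses the probability column for the first level
metric_results <- multi_metric(
  two_class_example,
  truth = truth,
  estimate = predicted,
  Class1
)
metric_results

# Cross-tabulate predictions against the truth
cm <- conf_mat(two_class_example, truth = truth, estimate = predicted)
cm
summary(cm)  # sensitivity, specificity, and related statistics
```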