Logistic Regression in R
Logistic regression is a fundamental classification technique used when the target variable is binary (0/1, TRUE/FALSE, yes/no). Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities between 0 and 1. In this tutorial, you’ll learn how to build, interpret, and evaluate logistic regression models in R.
When to Use Logistic Regression
Logistic regression is ideal when you need to:
- Predict binary outcomes (will a customer churn or not?)
- Understand the effect of predictors on odds of success
- Get probability scores for ranking observations
- Build a baseline model before trying more complex algorithms
The glm() Function in R
R’s base glm() function fits generalized linear models. For logistic regression, we set family = "binomial" (the binomial family uses the logit link by default).
Basic Logistic Regression
Let’s fit a simple logistic regression model using the built-in mtcars dataset. We’ll predict whether a car has an automatic (0) or manual (1) transmission based on miles per gallon (mpg) and number of cylinders (cyl):
# Load data and prepare
data(mtcars)
# Convert am (transmission) to factor for clarity
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
# Fit logistic regression model
model <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")
# View model summary
summary(model)
# Coefficients (log-odds)
coef(model)
# (Intercept) mpg cyl
# 19.42588 -1.40911 -1.33045
The coefficients are in log-odds scale. To interpret them, we need to exponentiate.
Interpreting Coefficients
Odds Ratios
The exponentiated coefficients give us odds ratios:
# Calculate odds ratios
exp(coef(model))
# (Intercept) mpg cyl
# 2.968408e+08 0.245257 0.264375
# More readable: confidence intervals for odds ratios
exp(confint(model))
# Waiting for profiling to be done...
# 2.5 % 97.5 %
# (Intercept) 2.036e+01 4.329e+15
# mpg 0.1429889 0.4147277
# cyl 0.1068283 0.6536044
Interpretation:
- For every 1-unit increase in mpg, the odds of having a manual transmission multiply by 0.245 (a decrease of about 75%), holding cyl constant
- For every additional cylinder, the odds multiply by 0.264 (a decrease of about 74%), holding mpg constant
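To connect log-odds, odds ratios, and probabilities, base R’s plogis() (the inverse logit) is handy. The numbers below are round illustrations, not output from the fitted model:

```r
# plogis() is the inverse logit: it maps log-odds to a probability
plogis(0)        # log-odds of 0 means even odds, i.e. probability 0.5

# A slope of -1.41 on the log-odds scale multiplies the odds by
# exp(-1.41) ~ 0.244 for each one-unit increase in the predictor
exp(-1.41)

# Starting from even odds, a one-unit increase shifts the probability to
plogis(0 - 1.41)   # ~ 0.196
```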
Predicted Probabilities
Use predict() with type = "response" to get probabilities:
# Predicted probabilities for original data
predicted_probs <- predict(model, type = "response")
# Create data frame with actual and predicted
results <- data.frame(
actual = mtcars$am,
predicted_prob = predicted_probs,
predicted_class = ifelse(predicted_probs > 0.5, "Manual", "Automatic")
)
head(results, 10)
# actual predicted_prob predicted_class
# 1 Manual 0.9650490 Manual
# 2 Manual 0.9650490 Manual
# 3 Manual 0.9607077 Manual
# 4 Automatic 0.6656524 Manual
# 5 Automatic 0.9866218 Manual
# 6 Automatic 0.1016491 Automatic
# ...
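predict() can also score observations the model has never seen: pass a newdata frame with the same predictor columns. The 28 mpg, 4-cylinder car below is hypothetical, chosen only for illustration:

```r
# Refit so this chunk is self-contained
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
model <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")

# Hypothetical new car: 28 mpg, 4 cylinders
new_car <- data.frame(mpg = 28, cyl = 4)

predict(model, newdata = new_car, type = "response")  # probability scale
predict(model, newdata = new_car, type = "link")      # log-odds (the default)
```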
Model Evaluation
Confusion Matrix
# Create predicted class
predicted_class <- ifelse(predicted_probs > 0.5, "Manual", "Automatic")
# Confusion matrix
table(Predicted = predicted_class, Actual = mtcars$am)
# Calculate accuracy manually
mean(predicted_class == mtcars$am)
# [1] 0.8125
Using caret for Metrics
The caret package provides comprehensive evaluation:
library(caret)
# Confusion matrix with caret
confusionMatrix(as.factor(predicted_class), mtcars$am)
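If you’d rather avoid the caret dependency, the same metrics fall out of the 2x2 table directly. A base-R sketch, treating "Manual" as the positive class:

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
model <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")
predicted_probs <- predict(model, type = "response")

# Fix the factor levels so the table always has both rows
predicted_class <- factor(ifelse(predicted_probs > 0.5, "Manual", "Automatic"),
                          levels = c("Automatic", "Manual"))

cm <- table(Predicted = predicted_class, Actual = mtcars$am)
tp <- cm["Manual", "Manual"]       # true positives
fn <- cm["Automatic", "Manual"]    # false negatives
fp <- cm["Manual", "Automatic"]    # false positives
tn <- cm["Automatic", "Automatic"] # true negatives

c(sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp))
```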
ROC Curve and AUC
Visualize model performance with an ROC curve:
library(pROC)
# Calculate ROC curve
roc_obj <- roc(mtcars$am, predicted_probs, levels = c("Automatic", "Manual"))
# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Logistic Regression")
# Calculate AUC
auc(roc_obj)
# Area under the curve: 0.9375
An AUC of 0.94 indicates excellent discrimination, though keep in mind it is computed on the training data, so it is likely optimistic.
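The 0.5 cutoff is a default, not a law. pROC’s coords() can report the threshold that maximizes Youden’s J (sensitivity + specificity - 1); a sketch, refitting the model so the chunk stands alone:

```r
library(pROC)

data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
model <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")
roc_obj <- roc(mtcars$am, predict(model, type = "response"),
               levels = c("Automatic", "Manual"))

# Best threshold by Youden's J, with the sensitivity/specificity it achieves
coords(roc_obj, "best",
       ret = c("threshold", "sensitivity", "specificity"),
       best.method = "youden")
```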
Multiple Logistic Regression
In practice, you’ll include multiple predictors. Here’s a more complete example:
# Fit model with more predictors
model_full <- glm(am ~ mpg + cyl + disp + hp + wt,
data = mtcars,
family = "binomial")
# Compare models with AIC
AIC(model, model_full)
# df AIC
# model 3 9.036475
# model_full 6 13.684785
# The simpler model is better (lower AIC)
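Because model is nested inside model_full, a likelihood ratio test via anova() is another way to compare them; a sketch (refitting both so the chunk is self-contained):

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
model      <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")
model_full <- glm(am ~ mpg + cyl + disp + hp + wt,
                  data = mtcars, family = "binomial")

# Chi-squared test on the change in deviance between nested models:
# a large p-value means the extra predictors don't significantly improve fit
anova(model, model_full, test = "Chisq")
```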
Step-by-Step Example: Customer Churn Prediction
Let’s walk through a complete example with simulated customer data:
# Create sample customer churn data
set.seed(123)
n <- 200
churn_data <- data.frame(
tenure = runif(n, 0, 48),
monthly_charge = runif(n, 30, 150),
contract = sample(c("Month-to-Month", "One-Year", "Two-Year"), n, replace = TRUE),
payment_method = sample(c("Credit Card", "Bank Transfer", "Electronic Check"), n, replace = TRUE)
)
# Simulate churn (more likely for month-to-month, higher charges, lower tenure)
churn_data$churn <- ifelse(
runif(n) < 0.1 + 0.3 * (churn_data$contract == "Month-to-Month") +
0.002 * churn_data$monthly_charge - 0.01 * churn_data$tenure,
1, 0
)
# Fit model
churn_model <- glm(churn ~ tenure + monthly_charge + contract + payment_method,
data = churn_data,
family = "binomial",
na.action = na.exclude)
# Check coefficients
summary(churn_model)$coefficients
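Before trusting the churn model, evaluate it on data it wasn’t fit to. A minimal base-R train/test split (the 70/30 ratio and the reduced predictor set are arbitrary choices for illustration):

```r
# Recreate the simulated churn data so this chunk is self-contained
set.seed(123)
n <- 200
churn_data <- data.frame(
  tenure = runif(n, 0, 48),
  monthly_charge = runif(n, 30, 150),
  contract = sample(c("Month-to-Month", "One-Year", "Two-Year"), n, replace = TRUE)
)
churn_data$churn <- ifelse(
  runif(n) < 0.1 + 0.3 * (churn_data$contract == "Month-to-Month") +
    0.002 * churn_data$monthly_charge - 0.01 * churn_data$tenure,
  1, 0
)

# 70/30 train/test split
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- churn_data[train_idx, ]
test  <- churn_data[-train_idx, ]

fit <- glm(churn ~ tenure + monthly_charge + contract,
           data = train, family = "binomial")

# Accuracy on the held-out 30%
test_probs <- predict(fit, newdata = test, type = "response")
mean((test_probs > 0.5) == test$churn)
```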
Best Practices
- Check for multicollinearity using car::vif(); high VIF values indicate problematic predictors
- Assess linearity: logistic regression assumes a linear relationship between each predictor and the log-odds. Use the Box-Tidwell test or plot each predictor against the logit
- Don't overfit: AIC, cross-validation, or held-out test sets help prevent overfitting
- Check for outliers: influential observations can distort coefficients
- Consider regularization: for high-dimensional data, use glmnet with elastic net
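The multicollinearity check above can be sketched with car::vif(), assuming the car package is installed; here applied to the five-predictor mtcars model from earlier:

```r
library(car)

data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
model_full <- glm(am ~ mpg + cyl + disp + hp + wt,
                  data = mtcars, family = "binomial")

# Rule of thumb: VIF above roughly 5-10 signals problematic collinearity
vif(model_full)
```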
Summary
Logistic regression in R is straightforward with glm(family = "binomial"). Key points:
- Coefficients are in log-odds; exponentiate for odds ratios
- Use type = "response" for predicted probabilities
- Evaluate with confusion matrices, ROC curves, and AUC
- Check assumptions before trusting results
With these fundamentals, you can build classification models for real-world prediction problems.