Logistic Regression in R
Logistic regression is a fundamental classification technique used when the target variable is binary (0/1, TRUE/FALSE, yes/no). Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities between 0 and 1. In this tutorial, you’ll learn how to build, interpret, and evaluate logistic regression models in R.
What you’ll learn
This tutorial covers the key concepts and practical techniques for logistic regression in R. By the end, you will know how to fit a model with glm(), interpret its coefficients as odds ratios, generate predicted probabilities, and evaluate classification performance.
When to use logistic regression
Logistic regression is ideal when you need to:
- Predict binary outcomes (will a customer churn or not?)
- Understand the effect of predictors on odds of success
- Get probability scores for ranking observations
- Build a baseline model before trying more complex algorithms
The glm() function in R
R’s base glm() function fits generalized linear models. For logistic regression, we set family = "binomial".
Basic logistic regression
Let’s fit a simple logistic regression model using the built-in mtcars dataset. We’ll predict whether a car has an automatic (0) or manual (1) transmission based on miles per gallon and number of cylinders:
# Load data and prepare
data(mtcars)
# Convert am (transmission) to factor for clarity
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
# Fit logistic regression model
model <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")
# View model summary
summary(model)
# Coefficients (log-odds)
coef(model)
# Returns a named vector with one entry each for (Intercept), mpg, and cyl
The coefficients are on the log-odds scale and describe the second factor level ("Manual"). To interpret them as odds ratios, we need to exponentiate them.
Interpreting coefficients
Odds ratios
The exponentiated coefficients give us odds ratios:
# Calculate odds ratios
exp(coef(model))
# Returns one odds ratio per term, in the same order as coef(model)
# Profile-likelihood confidence intervals for the odds ratios
exp(confint(model))
# Returns a matrix with 2.5 % and 97.5 % columns and one row per term
Interpretation:
- An odds ratio above 1 means each 1-unit increase in the predictor multiplies the odds of a manual transmission by that factor, holding the other predictor fixed; an odds ratio below 1 means the odds shrink
- The percent change in the odds is (odds ratio - 1) * 100, so an odds ratio of 0.25 is a 75% decrease and an odds ratio of 1.5 is a 50% increase
- A confidence interval that includes 1 means the data are compatible with no effect of that predictor
Predicted probabilities
Use predict() with type = "response" to get probabilities:
# Predicted probabilities for original data
predicted_probs <- predict(model, type = "response")
# Create data frame with actual and predicted
results <- data.frame(
actual = mtcars$am,
predicted_prob = predicted_probs,
predicted_class = ifelse(predicted_probs > 0.5, "Manual", "Automatic")
)
head(results, 10)
# Prints the first ten cars, one row per car (row names are the car models),
# with the actual transmission, the predicted probability of "Manual",
# and the class assigned at the 0.5 threshold
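To score observations that were not used for fitting, predict() also accepts a newdata argument. The sketch below refits the same two-predictor model on a fresh copy of mtcars and scores three hypothetical cars; the new_cars values are made up for illustration only.

```r
# Fit the two-predictor model on an unmodified copy of mtcars (am is 0/1 here)
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# Hypothetical cars to score (values chosen only for illustration)
new_cars <- data.frame(mpg = c(15, 22, 30), cyl = c(8, 6, 4))

# type = "response" returns P(am = 1), the probability of a manual transmission
new_cars$prob_manual <- predict(fit, newdata = new_cars, type = "response")
new_cars$class <- ifelse(new_cars$prob_manual > 0.5, "Manual", "Automatic")
print(new_cars)
```

The column names in newdata must match the predictor names used in the model formula.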
Model evaluation
Confusion matrix
# Create predicted class
predicted_class <- ifelse(predicted_probs > 0.5, "Manual", "Automatic")
# Confusion matrix
table(Predicted = predicted_class, Actual = mtcars$am)
# Calculate accuracy manually
mean(predicted_class == mtcars$am)
# [1] 0.8125
Using caret for metrics
The caret package provides comprehensive evaluation:
library(caret)
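# Confusion matrix with caret; give the predictions the same factor levels
# as the actual values and treat "Manual" as the positive class
confusionMatrix(factor(predicted_class, levels = levels(mtcars$am)), mtcars$am, positive = "Manual")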
ROC curve and AUC
Visualize model performance with an ROC curve:
library(pROC)
# Calculate ROC curve
roc_obj <- roc(mtcars$am, predicted_probs, levels = c("Automatic", "Manual"))
# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Logistic Regression")
# Calculate AUC
auc(roc_obj)
# Area under the curve: 0.9375
An AUC of 0.94 indicates excellent discrimination, although an AUC computed on the same 32 cars used to fit the model is optimistic.
Multiple logistic regression
In practice, you’ll include multiple predictors. Here’s a more complete example:
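# Fit model with more predictors
# With only 32 cars, glm() may warn that fitted probabilities of 0 or 1 occurred,
# a sign that the predictors separate the classes almost perfectly
model_full <- glm(am ~ mpg + cyl + disp + hp + wt,
data = mtcars,
family = "binomial")
# Compare models with AIC
AIC(model, model_full)
# Returns a data frame with one row per model: df is the number of estimated
# coefficients (3 for model, 6 for model_full) and AIC is the criterion value
# Prefer the model with the lower AIC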
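For nested models, a likelihood ratio test complements the AIC comparison. The following is a minimal sketch using anova() on two small models fitted to a fresh copy of mtcars; the model names small and large are illustrative.

```r
cars <- datasets::mtcars
small <- glm(am ~ mpg, data = cars, family = binomial)
large <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# Likelihood ratio test: does adding cyl reduce the deviance significantly?
lrt <- anova(small, large, test = "Chisq")
print(lrt)

# AIC for the same pair, for comparison
print(AIC(small, large))
```

The test is only valid when the smaller model is a special case of the larger one and both are fitted to the same rows.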
Step-by-step example: customer churn prediction
Let’s walk through a complete example with simulated customer data:
# Create sample customer churn data
set.seed(123)
n <- 200
churn_data <- data.frame(
tenure = runif(n, 0, 48),
monthly_charge = runif(n, 30, 150),
contract = sample(c("Month-to-Month", "One-Year", "Two-Year"), n, replace = TRUE),
payment_method = sample(c("Credit Card", "Bank Transfer", "Electronic Check"), n, replace = TRUE)
)
# Simulate churn (more likely for month-to-month, higher charges, lower tenure)
churn_data$churn <- ifelse(
runif(n) < 0.1 + 0.3 * (churn_data$contract == "Month-to-Month") +
0.002 * churn_data$monthly_charge - 0.01 * churn_data$tenure,
1, 0
)
# Fit model
churn_model <- glm(churn ~ tenure + monthly_charge + contract + payment_method,
data = churn_data,
family = "binomial",
na.action = na.exclude)
# Check coefficients
summary(churn_model)$coefficients
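A model like this is usually judged on data it has not seen. The sketch below shows one way to do that with a simple 70/30 split; the customers data frame, its coefficients, and the split ratio are all assumptions made for this illustration, not part of the example above.

```r
set.seed(42)
n <- 500
# Hypothetical customer data, simulated only for this sketch
customers <- data.frame(
  tenure = runif(n, 0, 48),
  monthly_charge = runif(n, 30, 150)
)
# True relationship on the log-odds scale (an assumption of the simulation)
eta <- -0.5 - 0.05 * customers$tenure + 0.01 * customers$monthly_charge
customers$churn <- rbinom(n, size = 1, prob = plogis(eta))

# 70/30 train/test split
train_idx <- sample(n, size = round(0.7 * n))
train <- customers[train_idx, ]
test <- customers[-train_idx, ]

fit <- glm(churn ~ tenure + monthly_charge, data = train, family = binomial)

# Score the held-out rows and compute accuracy at the 0.5 threshold
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)
test_accuracy <- mean(test_pred == test$churn)
print(table(Predicted = test_pred, Actual = test$churn))
print(test_accuracy)
```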
Best practices
- Check for multicollinearity using car::vif(); high VIF values indicate problematic predictors
- Assess linearity: logistic regression assumes a linear relationship between predictors and log-odds. Use the Box-Tidwell test or plot each predictor against the logit
- Don’t overfit: AIC, cross-validation, or held-out test sets help prevent overfitting
- Check for outliers: influential observations can distort coefficients
- Consider regularization: for high-dimensional data, use glmnet with elastic net
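To make the multicollinearity check concrete, here is a base R sketch of the classic variance inflation factor, computed by regressing each predictor on the others. It is not the car::vif() implementation, and car::vif() applied to a glm object may report somewhat different values.

```r
cars <- datasets::mtcars
predictors <- c("mpg", "cyl", "disp", "hp", "wt")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = cars))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; common rules of thumb flag values above 5 or 10.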
Summary
Logistic regression in R is straightforward with glm(family = "binomial"). Key points:
- Coefficients are in log-odds; exponentiate for odds ratios
- Use type = "response" for predicted probabilities
- Evaluate with confusion matrices, ROC curves, and AUC
- Check assumptions before trusting results
With these fundamentals, you can build classification models for real-world prediction problems.
Model fitting
Logistic regression in R uses glm() with family = binomial. The binary outcome must be coded as 0/1 or as a factor with exactly two levels. glm(outcome ~ x1 + x2, data = train, family = binomial) estimates log-odds coefficients. predict(model, newdata, type = "response") returns predicted probabilities (0 to 1), not log-odds.
Model diagnostics
Logistic regression does not have the same residual diagnostics as linear regression. Check for convergence warnings from glm(). hoslem.test() from the ResourceSelection package runs the Hosmer-Lemeshow goodness-of-fit test. The deviance-based pseudo R-squared, 1 - deviance/null.deviance, measures improvement over the null model; as a rule of thumb, values of 0.2 to 0.4 already indicate a good fit, so do not expect values near 1.
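The pseudo R-squared and the convergence flag can both be read directly from the fitted object. A minimal sketch on a fresh copy of mtcars:

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# Deviance-based pseudo R-squared: 1 - residual deviance / null deviance
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance
print(pseudo_r2)

# glm() records whether the fitting algorithm converged
print(fit$converged)
```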
Classification performance
predict(model, type = "response") > 0.5 converts probabilities to TRUE/FALSE predictions. caret::confusionMatrix(predicted, actual) computes accuracy, sensitivity, and specificity. The ROC curve (pROC::roc()) summarizes performance across all thresholds: an AUC of 0.5 is no better than random guessing and 1.0 is perfect. For imbalanced classes, precision-recall curves are more informative than ROC curves.
Logistic regression fundamentals
Logistic regression models the probability of a binary outcome. The linear predictor is passed through the logistic function to constrain predictions to [0, 1]: P(Y=1) = 1 / (1 + exp(-Xb)). Coefficients are on the log-odds scale; exp(coef(model)) converts them to odds ratios.
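The link between the two prediction scales can be checked numerically. This sketch applies the logistic function by hand to the linear predictor and compares the result with what predict() reports; plogis() is base R's built-in logistic function.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg, data = cars, family = binomial)

# Linear predictor Xb (log-odds) and the probabilities reported by predict()
eta <- predict(fit, type = "link")
p_response <- predict(fit, type = "response")

# Apply the logistic function by hand; plogis() is the built-in equivalent
p_manual <- 1 / (1 + exp(-eta))
print(all.equal(p_manual, p_response))
print(all.equal(plogis(eta), p_response))
```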
glm(outcome ~ predictor1 + predictor2, data = df, family = binomial(link = "logit")) fits the model. The family = binomial argument specifies the distribution; link = "logit" is the default and most common. link = "probit" or link = "cloglog" are alternatives with different tail behavior.
Interpreting coefficients: a coefficient of 0.5 means a one-unit increase in the predictor multiplies the odds of the outcome by exp(0.5) ≈ 1.65. For a categorical predictor, the coefficient compares one level to the reference level. relevel(factor, ref = "level_name") changes the reference level.
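As an illustration of reference levels, the sketch below treats cyl as a categorical predictor and refits the model after relevel(). The variable names cyl_f, fit_ref4, and fit_ref8 are chosen for this example.

```r
cars <- datasets::mtcars
cars$cyl_f <- factor(cars$cyl)   # levels "4", "6", "8"; "4" is the reference
fit_ref4 <- glm(am ~ cyl_f, data = cars, family = binomial)
# Intercept = odds at the reference level; others = odds ratios 6 vs 4 and 8 vs 4
print(exp(coef(fit_ref4)))

# Make 8 cylinders the reference level and refit
cars_ref8 <- cars
cars_ref8$cyl_f <- relevel(cars$cyl_f, ref = "8")
fit_ref8 <- glm(am ~ cyl_f, data = cars_ref8, family = binomial)
# Odds ratios are now 4 vs 8 and 6 vs 8
print(exp(coef(fit_ref8)))
```

Changing the reference level does not change the fitted probabilities, only which comparisons the coefficients express: the 8 vs 4 log-odds coefficient in the first fit is the negative of the 4 vs 8 coefficient in the second.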
Model fit and diagnostics
summary(model) reports coefficients, standard errors, z-values, and p-values. The AIC (Akaike Information Criterion) compares models fitted to the same data; a lower AIC indicates a better fit relative to model complexity.
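The null deviance (intercept-only model) and residual deviance (fitted model) quantify the improvement from adding predictors. Under the null hypothesis that no predictor matters, the difference approximately follows a chi-squared distribution with degrees of freedom equal to the number of estimated coefficients beyond the intercept; pchisq(model$null.deviance - model$deviance, df = model$df.null - model$df.residual, lower.tail = FALSE) tests the overall model significance.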
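The overall significance test can be assembled from components stored on the fitted object. A minimal sketch on a fresh copy of mtcars:

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# Drop in deviance from the intercept-only model to the fitted model
dev_drop <- fit$null.deviance - fit$deviance
# Degrees of freedom: number of estimated coefficients beyond the intercept
df_drop <- fit$df.null - fit$df.residual

p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
print(c(deviance_drop = dev_drop, df = df_drop, p_value = p_value))
```

Taking the degrees of freedom from the model object avoids miscounting when a categorical predictor contributes more than one coefficient.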
Residual diagnostics differ from linear regression. residuals(model) returns deviance residuals by default, the same as residuals(model, type = "deviance"). plot(model) produces diagnostic plots, but they are harder to interpret than for linear regression. The binned residual plot from arm::binnedplot() is more informative for logistic regression.
Predictions and classification
predict(model, newdata, type = "response") returns predicted probabilities. type = "link" returns predictions on the log-odds scale. A threshold (typically 0.5) converts probabilities to class predictions: predicted_class <- ifelse(predicted_prob > 0.5, "yes", "no").
The 0.5 threshold is arbitrary. If false negatives are more costly than false positives, lower the threshold. The ROC curve shows the tradeoff between sensitivity and specificity across all thresholds. pROC::roc(actual, predicted_prob) computes the ROC; pROC::auc() summarizes it as a single number. AUC of 0.5 is random; 1.0 is perfect.
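caret::confusionMatrix(predicted_class, actual) reports the confusion matrix together with accuracy, sensitivity, specificity, PPV, and NPV; both arguments must be factors with the same levels. yardstick::conf_mat(df, truth, estimate) builds the confusion matrix, and calling summary() on the result lists the corresponding metrics.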
Regularized logistic regression
For high-dimensional data (many predictors relative to observations), ordinary logistic regression tends to overfit. Ridge (L2) and lasso (L1) regularization penalize large coefficients.
glmnet::glmnet(x_matrix, y_vector, family = "binomial", alpha = 0) fits ridge logistic regression. alpha = 1 fits lasso (which performs variable selection by shrinking some coefficients to zero). alpha between 0 and 1 is elastic net.
glmnet::cv.glmnet() chooses the regularization strength (lambda) via cross-validation. coef(cv_model, s = "lambda.min") extracts coefficients at the lambda that minimizes cross-validated error. lambda.1se uses a larger penalty (more regularization) that is within one standard error of the minimum.
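The calls above can be combined into a short workflow. This sketch assumes the glmnet package is installed and skips the fit otherwise; the data are simulated so that only the first two of twenty predictors matter, which is an assumption of the example.

```r
set.seed(1)
n <- 200
p <- 20
# Simulated predictors; only the first two truly affect the outcome
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x", seq_len(p))
y <- rbinom(n, size = 1, prob = plogis(1.5 * x[, 1] - 1 * x[, 2]))

if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 1 is the lasso; cv.glmnet() picks lambda by cross-validation
  cv_fit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)
  print(cv_fit$lambda.min)
  print(cv_fit$lambda.1se)
  # Sparse coefficient vector at the more heavily regularized lambda
  print(coef(cv_fit, s = "lambda.1se"))
} else {
  message("glmnet is not installed; skipping the regularized fit")
}
```

glmnet expects a numeric predictor matrix rather than a formula; for a data frame with factors, model.matrix() is the usual way to build one.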
Multinomial and ordinal logistic regression
For outcomes with more than two categories, multinomial logistic regression fits a separate log-odds equation for each category relative to a reference. nnet::multinom(y ~ x, data = df) fits this model in R.
Ordinal logistic regression (proportional odds model) handles ordered categories. MASS::polr(ordered_factor ~ x, data = df) fits the model. The proportional odds assumption, that the effect of each predictor is the same across all category thresholds, should be tested with brant::brant(model).
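A small proportional odds fit might look like the sketch below. The three-level rating outcome is simulated from a latent variable purely for illustration, and the Brant test is left out because it needs the separate brant package.

```r
set.seed(7)
n <- 300
x <- rnorm(n)
# Latent-variable simulation: cut a noisy linear score into three ordered levels
latent <- 1.5 * x + rlogis(n)
rating <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
              labels = c("low", "medium", "high"), ordered_result = TRUE)
d <- data.frame(rating = rating, x = x)

# Hess = TRUE stores the Hessian so that summary() can report standard errors
fit <- MASS::polr(rating ~ x, data = d, Hess = TRUE)
print(coef(fit))   # slope on the log-odds scale, shared across thresholds
print(fit$zeta)    # estimated cut-points between adjacent categories
```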
Logistic regression for binary outcomes
Logistic regression models the probability of a binary outcome as a function of predictors. Unlike linear regression, the response variable is not a continuous quantity but a probability constrained between 0 and 1. The logistic transformation (log-odds or logit) converts this probability to an unbounded quantity that linear predictors can model without constraint. The model predicts log-odds; exponentiating the coefficients gives odds ratios.
The interpretive currency of logistic regression is the odds ratio. An odds ratio of 2 means the odds of the outcome double for each one-unit increase in the predictor. When the outcome is rare, odds ratios are approximately equal to relative risks, but the two diverge as the outcome becomes more common or the effect grows larger. The distinction matters for interpretation in clinical and public health applications.
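A quick calculation shows how far the two measures can drift apart. The helper relative_risk() below is a hypothetical function written for this illustration; it converts a baseline risk and an odds ratio into the implied relative risk.

```r
# Convert a baseline risk and an odds ratio into the implied relative risk
relative_risk <- function(baseline_risk, odds_ratio) {
  odds0 <- baseline_risk / (1 - baseline_risk)  # baseline odds
  odds1 <- odds_ratio * odds0                   # odds after the change
  risk1 <- odds1 / (1 + odds1)                  # back to a probability
  risk1 / baseline_risk
}

rr_rare   <- relative_risk(0.01, odds_ratio = 2)  # rare outcome (1% baseline)
rr_common <- relative_risk(0.40, odds_ratio = 2)  # common outcome (40% baseline)
print(round(c(rare = rr_rare, common = rr_common), 3))
# rare is about 1.98, common is about 1.43
```

With a 1% baseline risk, an odds ratio of 2 is nearly a doubling of risk; with a 40% baseline, the same odds ratio corresponds to a relative risk of only about 1.43.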
Model fit assessment
Logistic regression does not have an R-squared equivalent that is universally accepted. Pseudo-R-squared measures — McFadden’s, Cox-Snell, Nagelkerke — all attempt to quantify the proportion of variance explained, but each has different properties and none is directly interpretable as R-squared is in linear regression. For comparing models, likelihood ratio tests and AIC are more principled than pseudo-R-squared.
The Hosmer-Lemeshow test evaluates calibration — whether the predicted probabilities agree with observed event rates across deciles of the predicted probability distribution. A significant Hosmer-Lemeshow test indicates miscalibration: the model’s probability estimates are systematically wrong for some range of predicted values. Calibration is distinct from discrimination (AUC-ROC) — a model can discriminate well between cases and non-cases while being poorly calibrated.
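The idea behind a decile-based calibration check can be sketched in base R without running the formal test. The data here are simulated from a known logistic model, so good calibration is expected by construction.

```r
set.seed(11)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Group observations into deciles of predicted probability
bin <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, by = 0.1)),
           include.lowest = TRUE, labels = FALSE)

# Compare mean predicted probability with the observed event rate per bin
calibration <- data.frame(
  bin = 1:10,
  mean_predicted = as.numeric(tapply(p_hat, bin, mean)),
  observed_rate = as.numeric(tapply(y, bin, mean))
)
print(round(calibration, 3))
```

In a well-calibrated model the two columns track each other closely; large gaps in particular bins point to ranges where the probabilities are systematically off.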
Predictions and decision thresholds
Logistic regression outputs probabilities. Converting probabilities to class labels requires choosing a threshold. The default threshold of 0.5 is the natural choice when the model is well calibrated and both kinds of error cost the same, but it may not be appropriate for imbalanced classes or when the cost of false positives and false negatives differs. Moving the threshold affects the sensitivity/specificity tradeoff: lower thresholds increase sensitivity (fewer missed cases) at the cost of specificity (more false positives).
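The tradeoff is easy to see by sweeping a few thresholds. This sketch uses in-sample predictions on a fresh copy of mtcars and treats a manual transmission (am = 1) as the positive class; the three thresholds are arbitrary choices for illustration.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)
p_hat <- fitted(fit)
actual <- cars$am  # 1 = manual (positive class), 0 = automatic

thresholds <- c(0.3, 0.5, 0.7)
sweep_result <- t(sapply(thresholds, function(th) {
  pred <- as.integer(p_hat > th)
  c(threshold = th,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}))
print(round(sweep_result, 3))
```

Raising the threshold can only keep sensitivity the same or lower it, and can only keep specificity the same or raise it.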
ROC curves visualize the tradeoff across all possible thresholds. The area under the ROC curve (AUC) summarizes discriminative performance as a single number. An AUC of 0.5 is no better than chance; 1.0 is perfect discrimination. For imbalanced classes, the precision-recall curve shows the precision/recall tradeoff, which is more informative than ROC when the negative class is much larger than the positive class.
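AUC also has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it in base R with the rank-based (Mann-Whitney) formula and checks the result against a brute-force pairwise count; it is an illustration of the definition, not a replacement for pROC.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)
p_hat <- fitted(fit)
actual <- cars$am

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank-based (Mann-Whitney) formula; average ranks handle tied scores
r <- rank(p_hat)
auc_rank <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Brute-force check: share of positive/negative pairs ranked correctly,
# counting ties as one half
score_diff <- outer(p_hat[actual == 1], p_hat[actual == 0], "-")
auc_pairs <- mean((score_diff > 0) + 0.5 * (score_diff == 0))

print(c(rank_formula = auc_rank, pairwise = auc_pairs))
```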
Next steps
Now that you understand logistic regression in R, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.