
Logistic Regression in R

Logistic regression is a fundamental classification technique used when the target variable is binary (0/1, TRUE/FALSE, yes/no). Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities between 0 and 1. In this tutorial, you’ll learn how to build, interpret, and evaluate logistic regression models in R.

What you’ll learn

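This tutorial covers the key concepts and practical techniques for working with logistic regression in R: fitting models with glm(), interpreting coefficients as odds ratios, generating predicted probabilities, and evaluating classification performance. By the end, you will know how to apply the core functions in real data analysis workflows.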

When to use logistic regression

Logistic regression is ideal when you need to:

  • Predict binary outcomes (will a customer churn or not?)
  • Understand the effect of predictors on odds of success
  • Get probability scores for ranking observations
  • Build a baseline model before trying more complex algorithms

The glm() function in R

R’s base glm() function fits generalized linear models. For logistic regression, we set family = "binomial".

Basic logistic regression

Let’s fit a simple logistic regression model using the built-in mtcars dataset. We’ll predict whether a car has an automatic (0) or manual (1) transmission based on miles per gallon and number of cylinders:

# Load data and prepare
data(mtcars)

# Convert am (transmission) to factor for clarity
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Fit logistic regression model
model <- glm(am ~ mpg + cyl, data = mtcars, family = "binomial")

# View model summary
summary(model)

# Coefficients (log-odds), returned as a named vector
# with entries (Intercept), mpg, and cyl
coef(model)

The coefficients are on the log-odds scale. To interpret them, we need to exponentiate.

Interpreting coefficients

Odds ratios

The exponentiated coefficients give us odds ratios:

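# Calculate odds ratios
exp(coef(model))

# Profile-likelihood confidence intervals for the odds ratios
exp(confint(model))

Interpretation:

  • An odds ratio above 1 means the odds of a manual transmission rise as the predictor increases; a value below 1 means they fall
  • The mpg odds ratio is the factor by which the odds multiply for each additional mile per gallon, holding cyl fixed. A value of 1.5, for example, would mean a 50% increase in the odds, and a value of 0.25 would mean a 75% decrease
  • The cyl odds ratio is read the same way for each additional cylinder, holding mpg fixed
  • The exponentiated intercept is the odds when every predictor equals 0, which no real car satisfies, so it is usually not interpreted
  • A confidence interval that includes 1 means the data are compatible with no effect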
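
As a rough sketch of how odds ratios, intervals, and percent changes in the odds can be collected in one table, the snippet below refits the same formula on a fresh copy of mtcars so that it runs on its own. The helper name odds_ratio_table is made up for this example, and it uses Wald intervals from confint.default() rather than the profile-likelihood intervals that confint() produces, so the bounds will differ slightly.

```r
# Fresh copy so the example does not depend on earlier changes to mtcars
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# Hypothetical helper: odds ratios, Wald 95% intervals, percent change in odds
odds_ratio_table <- function(model) {
  est <- coef(model)
  ci  <- confint.default(model)   # Wald intervals on the log-odds scale
  data.frame(
    odds_ratio = exp(est),
    lower_95   = exp(ci[, 1]),
    upper_95   = exp(ci[, 2]),
    pct_change = (exp(est) - 1) * 100,
    row.names  = names(est)
  )
}

or_table <- odds_ratio_table(fit)
print(round(or_table[-1, ], 3))   # drop the intercept row, which is not an odds ratio
```

The pct_change column is simply (odds ratio - 1) * 100, which is the conversion used in the interpretation above.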

Predicted probabilities

Use predict() with type = "response" to get probabilities:

# Predicted probabilities for original data
predicted_probs <- predict(model, type = "response")

# Create data frame with actual and predicted
results <- data.frame(
  actual = mtcars$am,
  predicted_prob = predicted_probs,
  predicted_class = ifelse(predicted_probs > 0.5, "Manual", "Automatic")
)

head(results, 10)
# Rows are labeled with the car names from mtcars.
# predicted_prob is the fitted probability of "Manual", so
# predicted_class is "Manual" whenever that probability exceeds 0.5.
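
The same predict() call also scores cars the model has not seen. The sketch below is illustrative only: the two rows in new_cars are invented values, and the model is refit on a fresh copy of mtcars so the snippet runs on its own.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# Two hypothetical cars (values invented for illustration)
new_cars <- data.frame(mpg = c(18, 30), cyl = c(8, 4))

log_odds <- predict(fit, newdata = new_cars, type = "link")      # linear predictor
prob     <- predict(fit, newdata = new_cars, type = "response")  # P(am == 1)

print(cbind(new_cars, log_odds, prob))
```

The column names in newdata must match the predictor names used in the formula; otherwise predict() stops with an error about a missing object.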

Model evaluation

Confusion matrix

# Create predicted class
predicted_class <- ifelse(predicted_probs > 0.5, "Manual", "Automatic")

# Confusion matrix
table(Predicted = predicted_class, Actual = mtcars$am)

# Calculate accuracy manually
mean(predicted_class == mtcars$am)
# [1] 0.8125
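
Accuracy is only one number that can be read off the confusion matrix. As a minimal sketch, assuming "Manual" is treated as the positive class, sensitivity, specificity, and precision can be computed by hand from the same table. The snippet refits the model on a fresh copy of mtcars so it is self-contained.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

prob <- predict(fit, type = "response")
pred <- factor(ifelse(prob > 0.5, "Manual", "Automatic"),
               levels = levels(cars$am))   # keep both levels even if one is never predicted

cm <- table(Predicted = pred, Actual = cars$am)

# Treat "Manual" as the positive class
tp <- cm["Manual", "Manual"];       fn <- cm["Automatic", "Manual"]
tn <- cm["Automatic", "Automatic"]; fp <- cm["Manual", "Automatic"]

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),   # share of manual cars predicted as manual
  specificity = tn / (tn + fp),   # share of automatic cars predicted as automatic
  precision   = tp / (tp + fp)    # share of "Manual" predictions that are correct
)
print(round(metrics, 3))
```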

Using caret for metrics

The caret package provides comprehensive evaluation:

library(caret)

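# Confusion matrix with caret
# (both arguments must be factors with the same levels)
confusionMatrix(factor(predicted_class, levels = levels(mtcars$am)),
                mtcars$am,
                positive = "Manual")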

ROC curve and AUC

Visualize model performance with an ROC curve:

library(pROC)

# Calculate ROC curve
roc_obj <- roc(mtcars$am, predicted_probs, levels = c("Automatic", "Manual"))

# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Logistic Regression")

# Calculate AUC
auc(roc_obj)
# Area under the curve: 0.9375

An AUC of 0.94 indicates excellent discrimination, although it is measured on the same 32 cars used to fit the model and is therefore optimistic.
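
To see what the AUC measures, it helps to compute it without a package. The sketch below relies on the standard interpretation of AUC as the probability that a randomly chosen positive case (manual) receives a higher score than a randomly chosen negative case (automatic), with ties counted as one half. The function name auc_rank is an assumption of this example, not part of pROC.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)
prob <- predict(fit, type = "response")
y <- cars$am   # 1 = manual, 0 = automatic

# Rank-based (Mann-Whitney) AUC; tied scores receive average ranks
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Direct definition: compare every manual car with every automatic car
score_diff <- outer(prob[y == 1], prob[y == 0], "-")
auc_pairs  <- mean((score_diff > 0) + 0.5 * (score_diff == 0))

print(c(rank_based = auc_rank(prob, y), pairwise = auc_pairs))
```

Both calculations give the same number, which is also what pROC::auc() reports for the same scores and labels.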

Multiple logistic regression

The model above already uses two predictors. In practice, you’ll often consider more. Here’s a larger example:

# Fit model with more predictors
model_full <- glm(am ~ mpg + cyl + disp + hp + wt, 
                  data = mtcars, 
                  family = "binomial")

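# Compare models with AIC
# (the df column counts estimated coefficients: 3 for model, 6 for model_full)
AIC(model, model_full)

# Prefer the model with the lower AIC. With only 32 cars, the larger model
# may also produce warnings about fitted probabilities of 0 or 1.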

Step-by-step example: customer churn prediction

Let’s walk through a complete example with simulated customer data:

# Create sample customer churn data
set.seed(123)
n <- 200
churn_data <- data.frame(
  tenure = runif(n, 0, 48),
  monthly_charge = runif(n, 30, 150),
  contract = sample(c("Month-to-Month", "One-Year", "Two-Year"), n, replace = TRUE),
  payment_method = sample(c("Credit Card", "Bank Transfer", "Electronic Check"), n, replace = TRUE)
)

# Simulate churn (more likely for month-to-month, higher charges, lower tenure)
churn_data$churn <- ifelse(
  runif(n) < 0.1 + 0.3 * (churn_data$contract == "Month-to-Month") + 
    0.002 * churn_data$monthly_charge - 0.01 * churn_data$tenure,
  1, 0
)

# Fit model
churn_model <- glm(churn ~ tenure + monthly_charge + contract + payment_method,
                   data = churn_data,
                   family = "binomial",
                   na.action = na.exclude)

# Check coefficients
summary(churn_model)$coefficients

Best practices

  1. Check for multicollinearity using car::vif(). High VIF values (a common rule of thumb is above 5 to 10) indicate predictors that are largely explained by the others

  2. Assess linearity. Logistic regression assumes a linear relationship between continuous predictors and the log-odds. Use the Box-Tidwell test or plot each predictor against the logit

  3. Don’t overfit. AIC, cross-validation, or held-out test sets help detect and prevent overfitting

  4. Check for outliers. Influential observations can distort coefficients

  5. Consider regularization. For high-dimensional data, use glmnet with elastic net
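
The multicollinearity check in item 1 can be illustrated without extra packages. The sketch below computes the textbook variance inflation factor, 1 / (1 - R²), by regressing each predictor on all the others. This is an approximation for illustration: car::vif() applied to a fitted glm works from the model's coefficient covariance, so its values may differ somewhat. The helper name vif_one is made up for this example.

```r
predictors <- datasets::mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]

# VIF for one predictor: regress it on all the others and use 1 / (1 - R^2)
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  r2 <- summary(lm(reformulate(others, response = name), data = data))$r.squared
  1 / (1 - r2)
}

vif_values <- sapply(names(predictors), vif_one, data = predictors)
print(round(vif_values, 2))
```

Predictors with large values here are strongly predictable from the rest, which is a hint that their individual coefficients in the logistic model will be unstable.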

Summary

Logistic regression in R is straightforward with glm(family = "binomial"). Key points:

  • Coefficients are in log-odds; exponentiate for odds ratios
  • Use type = "response" for predicted probabilities
  • Evaluate with confusion matrices, ROC curves, and AUC
  • Check assumptions before trusting results

With these fundamentals, you can build classification models for real-world prediction problems.

Model fitting

Logistic regression in R uses glm() with family = binomial. The binary outcome must be coded as 0/1 or as a factor with exactly two levels. glm(outcome ~ x1 + x2, data = train, family = binomial) estimates log-odds coefficients. predict(model, newdata, type = "response") returns predicted probabilities (0 to 1), not log-odds.

Model diagnostics

Logistic regression does not have the same residual diagnostics as linear regression. Check for convergence warnings from glm(). hoslem.test() from the ResourceSelection package performs the Hosmer-Lemeshow goodness-of-fit test. The pseudo R-squared (1 - deviance/null.deviance) measures improvement over the null model. Values of 0.2-0.4 are often cited as indicating a good fit, but the number is not comparable to R-squared from linear regression.
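
As a small sketch of the pseudo R-squared formula, the deviances stored on the fitted glm object can be used directly. For a 0/1 outcome this matches McFadden's version computed from log-likelihoods, which the snippet checks against an intercept-only fit. The mtcars model is used only as a convenient example.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

# glm() records whether the fitting algorithm converged
print(fit$converged)

# Deviance-based pseudo R-squared
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance

# McFadden's form from log-likelihoods, using an intercept-only fit
null_fit <- update(fit, . ~ 1)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))

print(c(deviance_based = pseudo_r2, mcfadden = mcfadden))
```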

Classification performance

predict(model, type = "response") > 0.5 converts probabilities to binary predictions. caret::confusionMatrix(predicted, actual), which expects both arguments as factors with the same levels, computes accuracy, sensitivity, and specificity. The ROC curve (pROC::roc()) summarizes performance across all thresholds; an AUC of 0.5 is no better than random and 1.0 is perfect. For imbalanced classes, precision-recall curves are more informative than ROC curves.

Logistic regression fundamentals

Logistic regression models the probability of a binary outcome. The linear predictor is passed through the logistic function to constrain predictions to [0, 1]: P(Y=1) = 1 / (1 + exp(-Xb)). Coefficients are on the log-odds scale; exp(coef(model)) converts them to odds ratios.
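
The logistic function is short enough to write out and check numerically. In the sketch below, inv_logit is a name chosen for this example; base R already provides the same transformation as plogis() and its inverse, the logit, as qlogis().

```r
# Logistic (inverse-logit) function written out from the formula
inv_logit <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-4, -1, 0, 1, 4)   # example values of the linear predictor Xb
p   <- inv_logit(eta)

print(rbind(eta = eta, probability = round(p, 3)))

# Base R equivalents
print(all.equal(p, plogis(eta)))   # plogis() is the logistic function
print(all.equal(qlogis(p), eta))   # qlogis() is the logit, its inverse
```

A linear predictor of 0 maps to a probability of exactly 0.5, and the curve flattens toward 0 and 1 as the linear predictor moves away from 0 in either direction.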

glm(outcome ~ predictor1 + predictor2, data = df, family = binomial(link = "logit")) fits the model. The family = binomial argument specifies the distribution; link = "logit" is the default and most common. link = "probit" or link = "cloglog" are alternatives with different tail behavior.

Interpreting coefficients: a coefficient of 0.5 means a one-unit increase in the predictor multiplies the odds of the outcome by exp(0.5) ≈ 1.65. For a categorical predictor, the coefficient compares one level to the reference level. relevel(factor, ref = "level_name") changes the reference level.
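
To illustrate how the reference level shapes the coefficients of a categorical predictor, the sketch below treats cyl in mtcars as a factor and fits the same model twice with different reference levels. The column name cyl_ref6 is introduced only for this example.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)   # levels "4", "6", "8"; "4" is the reference

fit_ref4 <- glm(am ~ cyl, data = cars, family = binomial)

cars$cyl_ref6 <- relevel(cars$cyl, ref = "6")   # same factor, "6" as the reference
fit_ref6 <- glm(am ~ cyl_ref6, data = cars, family = binomial)

print(coef(fit_ref4))   # cyl6 and cyl8 compare against 4 cylinders
print(coef(fit_ref6))   # cyl_ref64 and cyl_ref68 compare against 6 cylinders
```

The two fits describe the same model: the 6-versus-4 coefficient in the first is the negative of the 4-versus-6 coefficient in the second, and the deviance is unchanged.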

Model fit and diagnostics

summary(model) reports coefficients, standard errors, z-values, and p-values. The AIC (Akaike Information Criterion) compares models fitted to the same data; a lower AIC indicates a better fit relative to model complexity.

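The null deviance (intercept-only model) and residual deviance (fitted model) quantify the improvement from adding predictors. Under the null hypothesis that none of the predictors matter, their difference follows an approximate chi-squared distribution with degrees of freedom equal to the number of added coefficients, so pchisq(model$null.deviance - model$deviance, df = model$df.null - model$df.residual, lower.tail = FALSE) tests the overall model significance.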
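
A minimal sketch of this overall test, using the deviances and degrees of freedom stored on a fitted glm object (the mtcars model is only a convenient example). The same test can be obtained by comparing the model against an intercept-only fit with anova().

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)

lr_stat <- fit$null.deviance - fit$deviance   # drop in deviance from adding predictors
lr_df   <- fit$df.null - fit$df.residual      # number of coefficients added
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p_value = p_value))

# The same likelihood ratio test via anova()
lr_table <- anova(update(fit, . ~ 1), fit, test = "Chisq")
print(lr_table)
```

Taking the degrees of freedom from the fitted object avoids miscounting when a factor predictor contributes several coefficients.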

Residual diagnostics differ from linear regression. Deviance residuals are the default returned by residuals(model) and can be requested explicitly with type = "deviance". plot(model) produces diagnostic plots, but they are harder to interpret than for linear regression. The binned residual plot from arm::binnedplot() is more informative for logistic regression.

Predictions and classification

predict(model, newdata, type = "response") returns predicted probabilities. type = "link" returns predictions on the log-odds scale. A threshold (typically 0.5) converts probabilities to class predictions: predicted_class <- ifelse(predicted_prob > 0.5, "yes", "no").

The 0.5 threshold is arbitrary. If false negatives are more costly than false positives, lower the threshold. The ROC curve shows the tradeoff between sensitivity and specificity across all thresholds. pROC::roc(actual, predicted_prob) computes the ROC; pROC::auc() summarizes it as a single number. AUC of 0.5 is random; 1.0 is perfect.
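
The effect of moving the threshold can be tabulated directly. The sketch below evaluates a handful of cut-offs (chosen arbitrarily for illustration) on the mtcars model and reports sensitivity and specificity at each one. The helper name rates_at is an assumption of this example.

```r
cars <- datasets::mtcars
fit <- glm(am ~ mpg + cyl, data = cars, family = binomial)
prob <- predict(fit, type = "response")
y <- cars$am == 1   # TRUE for the positive class (manual)

# Sensitivity and specificity at a given cut-off
rates_at <- function(threshold) {
  pred <- prob > threshold
  c(threshold   = threshold,
    sensitivity = mean(pred[y]),     # positives predicted as positive
    specificity = mean(!pred[!y]))   # negatives predicted as negative
}

sweep_table <- t(sapply(c(0.2, 0.35, 0.5, 0.65, 0.8), rates_at))
print(round(sweep_table, 3))
```

Reading down the table, sensitivity can only stay the same or fall as the threshold rises, while specificity can only stay the same or rise. The ROC curve plots this same tradeoff for every possible threshold.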

caret::confusionMatrix(predicted_class, actual) reports a confusion matrix together with accuracy, sensitivity, specificity, PPV, and NPV. yardstick::conf_mat(df, truth, estimate) builds the confusion matrix, and calling summary() on the result returns the corresponding metrics.

Regularized logistic regression

For high-dimensional data (many predictors relative to observations), ordinary logistic regression tends to overfit. Ridge (L2) and lasso (L1) regularization penalize large coefficients.

glmnet::glmnet(x_matrix, y_vector, family = "binomial", alpha = 0) fits ridge logistic regression. alpha = 1 fits lasso (which performs variable selection by shrinking some coefficients to zero). alpha between 0 and 1 is elastic net.

glmnet::cv.glmnet() chooses the regularization strength (lambda) via cross-validation. coef(cv_model, s = "lambda.min") extracts coefficients at the lambda that minimizes cross-validated error. lambda.1se uses a larger penalty (more regularization) that is within one standard error of the minimum.
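
A hedged sketch of cross-validated lasso logistic regression on simulated data follows. The data-generating setup (20 predictors, of which only x1 and x2 matter) is an assumption made up for this example, and the glmnet calls are wrapped in a requireNamespace() check so the snippet does nothing if the package is not installed.

```r
set.seed(42)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("x", 1:p)))
# Simulated outcome: only x1 and x2 carry signal (an assumption of this toy data)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x[, "x1"] - 1 * x[, "x2"]))

if (requireNamespace("glmnet", quietly = TRUE)) {
  cv_lasso <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)

  # Coefficients at the two conventional choices of lambda
  coef_min <- as.matrix(coef(cv_lasso, s = "lambda.min"))
  coef_1se <- as.matrix(coef(cv_lasso, s = "lambda.1se"))

  # Number of non-zero coefficients, excluding the intercept
  n_nonzero <- c(lambda.min = sum(coef_min[-1, 1] != 0),
                 lambda.1se = sum(coef_1se[-1, 1] != 0))
  print(n_nonzero)

  # Predicted probabilities for the first five rows
  pred_prob <- predict(cv_lasso, newx = x[1:5, ], s = "lambda.1se", type = "response")
  print(round(pred_prob, 3))
}
```

glmnet expects a numeric matrix of predictors rather than a formula, so data frames with factors are typically converted first, for example with model.matrix().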

Multinomial and ordinal logistic regression

For outcomes with more than two categories, multinomial logistic regression fits a separate log-odds equation for each category relative to a reference. nnet::multinom(y ~ x, data = df) fits this model in R.
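
As an illustrative sketch, the built-in iris data has a three-level outcome (Species) that can be modeled with nnet::multinom(). The choice of the two sepal measurements as predictors is arbitrary, and trace = FALSE only silences the optimizer's progress output.

```r
# setosa, the first factor level, serves as the reference category
fit_multi <- nnet::multinom(Species ~ Sepal.Length + Sepal.Width,
                            data = iris, trace = FALSE)

print(coef(fit_multi))   # one row of log-odds coefficients per non-reference species

probs <- predict(fit_multi, type = "probs")       # one probability column per species
print(head(round(probs, 3)))

predicted <- predict(fit_multi, type = "class")   # most probable species per row
print(table(Predicted = predicted, Actual = iris$Species))
```

Each row of probs sums to 1, and each coefficient row is interpreted like a binary logistic model comparing that species with the reference.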

Ordinal logistic regression (proportional odds model) handles ordered categories. MASS::polr(ordered_factor ~ x, data = df) fits the model. The proportional odds assumption, that the effect of each predictor is the same across all category thresholds, should be tested with brant::brant(model).

Logistic regression for binary outcomes

Logistic regression models the probability of a binary outcome as a function of predictors. Unlike linear regression, the response variable is not a continuous quantity but a probability constrained between 0 and 1. The logistic transformation (log-odds or logit) converts this probability to an unbounded quantity that linear predictors can model without constraint. The model predicts log-odds; exponentiating the coefficients gives odds ratios.

The interpretive currency of logistic regression is the odds ratio. An odds ratio of 2 means the odds of the outcome are twice as high for a one-unit increase in the predictor. When the outcome is rare, odds ratios are approximately equal to relative risks, but the two diverge as the outcome becomes more common or the effect gets larger. The distinction matters for interpretation in clinical and public health applications.
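
A short numeric illustration of how odds ratios and relative risks diverge: both scenarios below have a relative risk of exactly 2, but the baseline risk differs. The risks are invented for illustration, and compare_or_rr is a name chosen for this example.

```r
odds <- function(p) p / (1 - p)

compare_or_rr <- function(risk_unexposed, risk_exposed) {
  c(risk_unexposed = risk_unexposed,
    risk_exposed   = risk_exposed,
    relative_risk  = risk_exposed / risk_unexposed,
    odds_ratio     = odds(risk_exposed) / odds(risk_unexposed))
}

# Same relative risk (2) at two very different baseline risks
or_vs_rr <- rbind(rare_outcome   = compare_or_rr(0.01, 0.02),
                  common_outcome = compare_or_rr(0.40, 0.80))
print(round(or_vs_rr, 3))
```

For the rare outcome the odds ratio is about 2.02, close to the relative risk of 2. For the common outcome it is 6, so reading that odds ratio as "six times as likely" would badly overstate the effect.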

Model fit assessment

Logistic regression does not have an R-squared equivalent that is universally accepted. Pseudo-R-squared measures — McFadden’s, Cox-Snell, Nagelkerke — all attempt to quantify the proportion of variance explained, but each has different properties and none is directly interpretable as R-squared is in linear regression. For comparing models, likelihood ratio tests and AIC are more principled than pseudo-R-squared.

The Hosmer-Lemeshow test evaluates calibration — whether the predicted probabilities agree with observed event rates across deciles of the predicted probability distribution. A significant Hosmer-Lemeshow test indicates miscalibration: the model’s probability estimates are systematically wrong for some range of predicted values. Calibration is distinct from discrimination (AUC-ROC) — a model can discriminate well between cases and non-cases while being poorly calibrated.
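
The grouping behind a decile-based calibration check can be sketched in base R. The data below are simulated from a logistic model (an assumption of this example), so the fitted model should be well calibrated. The snippet only builds the comparison table; it does not compute the Hosmer-Lemeshow test statistic.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))   # simulated; true model is logistic
fit <- glm(y ~ x, family = binomial)
prob <- fitted(fit)

# Ten groups of roughly equal size, ordered by predicted probability
decile <- cut(prob, breaks = quantile(prob, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE, labels = FALSE)

calibration <- data.frame(
  decile         = 1:10,
  n              = as.vector(table(decile)),
  mean_predicted = as.vector(tapply(prob, decile, mean)),
  observed_rate  = as.vector(tapply(y, decile, mean))
)
print(round(calibration, 3))
```

In a well-calibrated model the mean_predicted and observed_rate columns track each other closely in every row. Large gaps in particular deciles point to the range of predictions where the probabilities are off.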

Predictions and decision thresholds

Logistic regression outputs probabilities. Converting probabilities to class labels requires choosing a threshold. The default threshold of 0.5 assigns each observation to its more probable class, which is a sensible choice when the probabilities are well calibrated and both kinds of error cost the same, but it may not be appropriate for imbalanced classes or when the cost of false positives and false negatives differs. Moving the threshold affects the sensitivity/specificity tradeoff: lower thresholds increase sensitivity (fewer missed cases) at the cost of specificity (more false positives).

ROC curves visualize the tradeoff across all possible thresholds. The area under the ROC curve (AUC) summarizes discriminative performance as a single number. An AUC of 0.5 is no better than chance; 1.0 is perfect discrimination. For imbalanced classes, the precision-recall curve shows the precision/recall tradeoff, which is more informative than ROC when the negative class is much larger than the positive class.

Next steps

Now that you understand logistic regression in R, explore related topics to deepen your knowledge and apply these techniques in more complex scenarios.