Generalised Linear Models in R
Generalised Linear Models (GLMs) are the workhorse of statistical modeling when your response variable isn’t nicely normally distributed. If you’ve got binary outcomes (yes/no, success/failure), counts (number of events), or proportions, GLMs are your friend. In R, fitting a GLM takes just one line of code, but understanding what comes out the other end takes a bit more work, which is exactly what this guide covers.
What are gLMs and when should you use them?
Ordinary linear regression assumes your outcome is continuous and normally distributed. But what if you’re modeling whether a customer churns, how many clicks an ad gets, or the number of defects in a batch? Those don’t fit the normal assumption.
GLMs extend linear regression to handle non-normal response variables by combining three components:
- Distribution family: The probability distribution of your response (binomial for binary, Poisson for counts, etc.)
- Linear predictor: A linear combination of your predictors (like regular regression)
- Link function: The mathematical bridge that connects the linear predictor to the expected value of the response
The key insight is that you model the expected value through a function, not directly. For logistic regression, you model the log-odds; for Poisson regression, you model the log of the expected count.
GLM theory brief
A GLM has three parts:
-
Random component: The distribution of Y (your response) from the exponential family: binomial, Poisson, Gaussian, Gamma, etc.
-
Systematic component: The linear predictor ( \eta = \beta_0 + \beta_1 X_1 + … + \beta_p X_p )
-
Link function: ( g(\mu) = \eta ), maps the expected value ( \mu ) to the linear predictor
Common link functions:
- Identity link: ( \mu = \eta ), just regular linear regression
- Logit link: ( \log(\mu/(1-\mu)) = \eta ), logistic regression
- Log link: ( \log(\mu) = \eta ), Poisson regression
The flexibility to choose both distribution and link function independently is what makes GLMs so powerful.
Logistic regression for binary outcomes
When your outcome is binary (0/1, TRUE/FALSE, yes/no), logistic regression is the standard approach. You’re modeling the probability that ( Y = 1 ).
Here’s a working example with the classic iris dataset (converting it to binary for demonstration):
# Create binary outcome: setosa vs. not setosa
iris$is_setosa <- ifelse(iris$Species == "setosa", 1, 0)
# Fit logistic regression
logit_model <- glm(is_setosa ~ Sepal.Length + Sepal.Width,
data = iris,
family = binomial(link = "logit"))
# View model summary
summary(logit_model)
The output shows coefficients on the log-odds scale. To interpret them more intuitively, you will want odds ratios — exponentiating the coefficients transforms them from log-odds to multiplicative effects on the odds. A positive coefficient corresponds to an odds ratio above 1, meaning the predictor increases the probability of the outcome. The next section demonstrates the Poisson counterpart for counting outcomes, where the interpretation shifts from odds to rate ratios.
Poisson regression for count data
When you are counting events — customer purchases, website visits, defects per batch — Poisson regression is usually the right tool. The log link ensures predicted counts stay positive, and the model naturally handles the variance structure of count data:
# Simulate count data
set.seed(42)
n <- 200
count_data <- data.frame(
visits = rpois(n, lambda = 5),
ad_spend = runif(n, 0, 100),
category = sample(c("A", "B", "C"), n, replace = TRUE)
)
# Fit Poisson regression
poisson_model <- glm(visits ~ ad_spend + category,
data = count_data,
family = poisson(link = "log"))
summary(poisson_model)
Model diagnostics
Once you have fit a model, the next step is evaluating whether it actually describes the data well. GLM diagnostics focus on a few key areas: deviance for overall fit, residuals for spotting patterns, and overdispersion for detecting when the variance structure is wrong. Skipping diagnostics is the most common mistake in applied GLM work — a model with significant coefficients can still be a poor fit if the distributional assumptions are violated.
Deviance
Deviance measures how well your model fits the data compared to a saturated model that has one parameter per observation. Lower deviance is better, but you need a reference point — comparing residual deviance to null deviance tells you how much explanatory power your predictors add:
# Null deviance vs. residual deviance
logit_model$deviance
logit_model$null.deviance
Residuals
Several types of residuals help you spot problems:
# Pearson residuals
residuals(logit_model, type = "pearson")
# Deviance residuals
residuals(logit_model, type = "deviance")
# Plot residuals
plot(logit_model)
Look for patterns in the residuals plot — systematic patterns indicate model misspecification. For GLMs, the standardised deviance residuals should appear approximately normally distributed with constant variance if the model is well-specified. A funnel shape in the residuals-vs-fitted plot suggests the variance function is wrong, and a curved trend suggests the link function should be changed.
The standard plot(model) diagnostic plots are less interpretable for GLMs than for linear models. For GLM-appropriate residual diagnostics, use DHARMa::simulateResiduals() which simulates from the fitted model and compares observed residuals to the simulation distribution, producing standardized residual plots that are comparable across model families.
Overdispersion
Poisson models assume variance equals the mean. When variance exceeds mean (overdispersion), your standard errors will be too small and your p-values will be misleadingly significant. Check this by comparing the residual deviance to the residual degrees of freedom:
# Calculate dispersion statistic
dispersion <- poisson_model$deviance / poisson_model$df.residual
dispersion
Overdispersion above 1.5 or 2 is a red flag. You might need negative binomial regression (see the MASS package) or a quasi-likelihood model.
Interpreting coefficients
The raw coefficients aren’t always intuitive. Here’s how to make sense of them:
Odds ratios (Logistic regression)
Exponentiate coefficients to get odds ratios:
# Calculate odds ratios
odds_ratios <- exp(coef(logit_model))
odds_ratios
# With confidence intervals
exp(confint(logit_model))
An odds ratio of 1.5 means the odds of the outcome increase by 50% for a one-unit increase in that predictor. An odds ratio of 0.7 means the odds decrease by 30%.
Incidence rate ratios (Poisson regression)
Same idea: exponentiate for interpretation:
# Calculate incidence rate ratios
irr <- exp(coef(poisson_model))
irr
A practical note: IRRs close to 1 mean the predictor has minimal effect. If you’re seeing IRRs of 0.99 vs 1.5, the practical impact matters more than statistical significance.
Interpreting coefficients in gLMs
GLM coefficients are on the scale of the linear predictor, not the response scale. For logistic regression, coefficients are log-odds, coef(model)["age"] is the change in log-odds per unit of age. exp(coef(model)["age"]) is the odds ratio: a coefficient of 0.05 means each additional year of age multiplies the odds of the outcome by exp(0.05) ≈ 1.05.
For Poisson regression with a log link, exp(coef(model)["treatment"]) is the ratio of expected counts between groups. A coefficient of 0.3 means the treatment group has exp(0.3) ≈ 1.35 times the expected count of the control group.
Overdispersion
Poisson regression assumes the mean and variance of counts are equal. When the observed variance exceeds the mean (overdispersion), standard errors are underestimated and inference is too optimistic. Test for overdispersion with AER::dispersiontest(model). Address it with a quasi-Poisson model (family = quasipoisson) which estimates a dispersion parameter, or with a negative binomial model (MASS::glm.nb(formula, data)) which directly models overdispersion.
Regularization with glmnet
glmnet fits regularized GLMs with lasso (L1) and ridge (L2) penalties. For logistic regression, glmnet(X, y, family = "binomial", alpha = 1) fits a lasso model that performs variable selection. The regularization path (coefficients as a function of lambda) shows which variables enter the model at different regularization levels. cv.glmnet() selects the optimal lambda via cross-validation.
For high-dimensional data (many features, relatively few observations), regularization prevents overfitting and improves prediction accuracy. alpha = 0 gives ridge regression (all variables stay in the model, but coefficients are shrunk). alpha = 0.5 gives elastic net (a mix of lasso and ridge, useful when predictors are correlated).
Conclusion
GLMs open up a huge range of modeling possibilities beyond ordinary least squares. The key is matching your distribution family and link function to your data type: binomial with logit for binary outcomes, Poisson with log for counts. Once you have fit a model, don’t skip diagnostics: check for overdispersion, examine residuals, and verify your model assumptions hold.
The glm() function in R handles all of this with a consistent interface. Swap out the family argument and you’re done. That’s the beauty of the GLM framework: same syntax, different statistical assumptions.
Summary
glm() extends linear modeling to non-Gaussian response distributions through the exponential family and link functions. Logistic regression for binary outcomes and Poisson regression for count data cover the majority of practical cases. Model evaluation requires distribution-appropriate metrics, deviance and AIC for model comparison, residual diagnostics for assumption checking. The tidymodels framework wraps glm() in a consistent interface that makes cross-validation and comparison with other model types straightforward.
See also
- Linear Regression in R, for continuous outcomes where standard regression applies
- Hypothesis Testing in R, for making inferences about your model coefficients
- R for Statistics, broader overview of statistical modeling in R