Regression Models in R: lm(), glm(), and Diagnostics

May 17, 2026 · 13 min read ·Updated May 28, 2026 ·intermediate

rregressionstatisticslinear-modelsglm

Regression models are the workhorse of statistical analysis. They let you quantify relationships between variables, predict outcomes, and understand which factors actually matter. R’s built-in lm() and glm() functions handle most of what you need for ordinary linear regression and generalized linear models—logistic, Poisson, gamma, and more.

This guide walks through fitting regression models, interpreting output, and checking assumptions so you can use these tools confidently on real data.

What you’ll learn

This tutorial covers the key concepts and practical techniques for working with Regression Models in R. By the end, you will know how to apply the core functions in real data analysis workflows.

Fitting linear regression with `lm()`

The lm() function fits ordinary least squares (OLS) linear regression. It expects a formula describing the relationship you want to model, plus a data frame.

# Simple linear regression
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

The formula mpg ~ wt reads as “mpg predicted by weight.” Running summary() on the fitted model object prints the coefficient table, residual standard error, R-squared, and the overall F-test. This single function call is your starting point for understanding the strength and direction of the relationship between your predictor and outcome.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.285      1.878   19.86  < 2e-16 ***
wt           -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.3e-10

Intercept (37.285) is the predicted mpg when weight is zero, which is physically meaningless here but needed for the model. wt coefficient (-5.344) means each additional 1000 lb reduces mpg by about 5.3. The p-value (< 0.001) tells you this relationship is unlikely to be noise.

R-squared (0.753) means weight explains 75% of the variance in mpg. Adjusted R-squared (0.745) penalizes for adding predictors, so it stays honest when you add more variables.

Adding multiple predictors

Include more variables with +:

# Multiple regression: mpg predicted by weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)

When you add horsepower alongside weight, each coefficient estimates the effect of that predictor while holding the other constant. This partialling-out separates multiple regression from fitting separate simple models, because each coefficient now represents a unique contribution rather than a raw bivariate relationship. The adjusted R-squared increases only when the new predictor adds explanatory power beyond what the existing predictors already capture. You can also check the significance of individual coefficients with the t-tests printed by summary(). Adding too many correlated predictors inflates standard errors, so check variance inflation factors before trusting the coefficient estimates.

# Main effects plus interaction
model <- lm(mpg ~ wt * hp, data = mtcars)

An interaction term lets the effect of one predictor depend on the level of another. In practical terms, wt * hp asks whether the penalty of extra weight on fuel efficiency changes at different horsepower levels. Always center continuous predictors before fitting interactions so the main effects are interpretable at meaningful values rather than at zero. When you have many potential predictors and want a quick exploratory model, the dot shortcut includes every column in the data frame except the outcome variable.

# All variables except mpg
model <- lm(mpg ~ ., data = mtcars)

Generalized linear models with `glm()`

glm() extends lm() to handle non-normal outcomes—binary events, counts, positive continuous values, through different error distributions and link functions. For a deeper treatment of binary outcome modeling, see the dedicated Logistic Regression in R tutorial.

Logistic regression for binary outcomes

When your outcome is 0/1 (admitted or not, purchased or not), logistic regression models the probability:

# Predict admission based on GRE score and GPA
adm_model <- glm(admitted ~ gre + gpa, data = admission_df,
                 family = binomial)
summary(adm_model)

The coefficients from logistic regression live on the log-odds scale, which is not directly intuitive for most readers. A value of 0 means no effect, positive values increase the odds, and negative values decrease them. Because the scale is logarithmic, small coefficient differences translate to meaningful probability changes near the center of the distribution and become negligible near the extremes. Transforming these coefficients into more interpretable metrics makes the results accessible to non-statistical audiences. The quickest way to get readable numbers is to exponentiate the coefficients, which converts them from log-odds to odds ratios.

exp(coef(adm_model))
# gre coefficient: exp(0.002) = 1.002
# Each 1-point GRE increase multiplies odds by ~1.002

Exponentiating the coefficients gives you odds ratios, showing the multiplicative change in odds for a one-unit increase in the predictor. An odds ratio of 1.002 for GRE means each additional point multiplies the odds of admission by 1.002, a modest effect that accumulates across the full score range. Odds ratios are easier to report than raw log-odds coefficients and appear frequently in medical and social science literature.

predict(adm_model, type = "response")

The type = "response" argument converts the log-odds predictions through the logistic function, giving you probabilities between 0 and 1. Logistic regression handles binary outcomes well, but many real-world problems involve count data instead. When your response variable counts occurrences rather than flags events, you need a different distributional family.

Poisson regression for count data

Count outcomes (number of visits, accidents, days absent) often follow a Poisson distribution:

# Model days absent by age and sex
absences_model <- glm(days_absent ~ age + sex,
                      data = employee,
                      family = poisson)
summary(absences_model)

Poisson regression assumes the mean and variance of the outcome are equal, an assumption that real-world count data frequently violates through overdispersion. When the variance exceeds the mean, standard errors become too small and p-values become artificially significant. Checking the ratio of residual deviance to degrees of freedom gives a quick diagnostic; values well above 1 suggest switching to family = quasipoisson to adjust the standard errors upward.

# Offset by log(exposure) for rate comparison
model <- glm(accidents ~ age + offset(log_exposure),
             data = drivers,
             family = poisson)

When to use `gaussian` VS other families

Outcome Type	Distribution	Link
Continuous, normal	`gaussian` (default)	identity
Binary (0/1)	`binomial`	logit
Count, Poisson	`poisson`	log
Positive continuous (skewed)	`Gamma`	inverse
Overdispersed counts	`quasipoisson`	log

gaussian with lm() and glm(family = "gaussian") give identical results; lm() is just more convenient for OLS.

Extracting results from model objects

Both lm() and glm() return model objects with useful components:

# Coefficient table
coef(model)
# (Intercept)         wt
#    37.2851     -5.3445

# Confidence intervals for coefficients
confint(model, level = 0.95)
#                   2.5 %    97.5 %
# (Intercept)  33.4505   41.1196
# wt           -6.4740   -4.2151

# Fitted values (predicted y)
fitted(model)

# Residuals (observed - predicted)
residuals(model)

Base R extraction functions give you raw vectors and matrices, which works fine for quick checks but becomes cumbersome when you need to merge results with the original data or format tables for publication. The broom package wraps these extraction functions into tidy data frames that slot naturally into ggplot2 visualizations and dplyr pipelines, turning model diagnostics from a separate workflow into a natural extension of your data analysis.

library(broom)

# Coefficient table as data.frame
tidy(model)
#   term        estimate std.error statistic  p.value
# 1 (Intercept)    37.28      1.878    19.857 2.6e-19
# 2 wt              -5.34     0.559    -9.563 1.3e-10

# Model-level statistics
glance(model)
#   r.squared adj.r.squared   sigma statistic p.value    df
# 1     0.753         0.745    3.046      91.4  1.3e-10     1

# Augment: add predictions and residuals to your data
augment(model, data = mtcars)
#   .rownames  mpg    wt  .fitted .resid    .hat .sigma
# 1     Mazda RX4  21.0 2.620   23.28   -2.280  0.115   3.14

The .fitted column has predictions, .resid has residuals, .hat shows influence on fitted values, and .sigma is the residual standard error when that observation is excluded. Having predictions and residuals alongside the original data makes it easy to spot which specific observations the model fits poorly. Once you are satisfied with how well the model captures the training data patterns, the natural next step is to apply it to observations that the model has never seen before.

Making predictions on new data

Use predict() with a newdata argument. The function applies your fitted model to fresh observations and can return point estimates along with confidence or prediction intervals, depending on the interval argument you choose. Predicting on new data is where your model transitions from a descriptive summary to a tool you can actually use for decision-making. Always make sure the new data frame contains columns with the same names and types that the model was originally fitted on.

# Predict mpg for a car weighing 3000 lbs
predict(model, newdata = data.frame(wt = 3.0))
#        1
#  21.251

# Get confidence intervals for the predicted mean
predict(model, newdata = df, interval = "confidence")
#      fit   lwr   upr
# 1  21.25  19.8  22.7

# Get prediction intervals for new observations
predict(model, newdata = df, interval = "prediction")
#      fit   lwr   upr
# 1  21.25  15.1  27.4

Prediction intervals are wider than confidence intervals because they account for both uncertainty in the mean and the irreducible observation-level noise. Generating predictions is useful, but before you trust those numbers, you need to verify that the model’s underlying assumptions actually hold for your data. Skipping diagnostics is the most common reason regression results fail to generalize.

Checking model assumptions

Linear regression makes four key assumptions. The built-in diagnostic plot surfaces them. Calling plot() on a fitted model object produces the four standard diagnostic charts in a single command, making it easy to spot violations at a glance.

plot(model, which = 1:4)

1. Residuals vs Fitted (linearity): Points should scatter randomly around the horizontal line at 0. A curved pattern means the relationship isn’t linear; consider transforming predictors or adding polynomial terms.

2. Normal Q-Q (normality): Points should follow the diagonal. Deviations in the tails suggest outliers or skewed residuals.

3. Scale-Location (homoscedasticity): The line should be roughly flat and horizontal. A funnel shape (spread widening) means non-constant variance; glm(family = "Gamma") or MASS::rlm() may help.

4. Residuals vs Use: Points outside the dashed Cook’s distance lines are influential—removing them would substantially change the model. Investigate these cases.

Multicollinearity and outliers

library(car)

# Variance Inflation Factor - values > 5 or 10 indicate problems
vif(model)
#        wt       hp
#     2.94     2.94

# Formal outlier test
outlierTest(model)

If VIF is high, the correlated predictors are redundant; consider dropping one or using principal components. An outlier test flags observations with unusually large residuals, which merit investigation before trusting the fitted model.

Comparing nested models

Use anova() to test whether adding variables significantly improves fit. The function compares a reduced model against a full model and reports whether the additional terms explain enough variance to justify their inclusion.

# Compare full model to reduced model
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(reduced, full)

The p-value tests the null hypothesis that the additional coefficients are all zero. A small value means at least one of them matters.

Common pitfalls

Logistic regression gives log-odds coefficients. If you want odds ratios, exp(coef(model)). If you want probabilities, predict(model, type = "response"). Reporting raw coefficients without context is confusing.

glm() doesn’t produce R-squared. Use deviance-based pseudo-R² instead:

pseudo_r2 <- 1 - model$deviance / model$null.deviance

Or rely on glance(model)$deviance for model comparison.

Character variables get dummy-coded automatically. lm(y ~ sex) creates a binary dummy variable. This is usually what you want, but be aware.

NAs drop entire rows by default. If any variable in the formula has an NA, that row is excluded. For complex missing-data patterns, consider multiple imputation with the mice package.

Interactions between continuous variables create polynomial terms. x1:x2 gives x1 * x2. It is not matrix multiplication; it is a new predictor. Use I() for arithmetic in formulas: I(x1 * x2) forces actual multiplication.

Linear regression basics

lm(y ~ x1 + x2, data = df) fits an OLS regression. summary(model) shows coefficients, standard errors, t-statistics, p-values, and R-squared. coef(model) extracts coefficients. predict(model, newdata) generates predictions. resid(model) returns residuals.

Formula notation: y ~ x1 + x2 fits main effects; y ~ x1 * x2 fits main effects plus interaction x1:x2; y ~ x1 + I(x1^2) adds a polynomial term; y ~ . includes all columns except the outcome.

Diagnostics

Residual plots diagnose regression assumptions. plot(model) generates four diagnostic plots: residuals vs fitted (linearity), Q-Q plot (normality), scale-location (homoscedasticity), and a plot that flags observations with outsized influence on the model fit. Cook’s distance identifies highly influential observations. car::vif(model) computes variance inflation factors; values above 5 or 10 indicate multicollinearity.

Multiple regression

With multiple predictors, coefficients represent the partial effect of each predictor holding others constant. Standardize predictors with scale() to make coefficients comparable across variables with different units. car::Anova(model) computes Type III ANOVA tests for each predictor. step(model) performs stepwise variable selection by AIC, useful for exploration but should not be used as the sole selection method for inference.

Model comparison

Compare nested models with anova(model1, model2); the F-test assesses whether the additional predictors improve fit significantly. Compare non-nested models with AIC: AIC(model1, model2) (lower is better). Cross-validation via caret or tidymodels gives honest out-of-sample performance estimates that R-squared from the training set does not. See the ANOVA in R tutorial for a broader treatment of variance testing in R.

The linear model framework

Linear regression in R uses the formula interface, which describes the model structure in readable notation. A formula separates the response variable on the left from the predictors on the right with a tilde. Adding predictors with plus includes them additively. Multiplying predictors with an asterisk includes main effects and their interaction. The formula interface is consistent across R’s modeling functions; the same syntax works for linear models, generalized linear models, mixed models, and many other model types.

The lm function fits a linear model by ordinary least squares. The result is a model object that stores the fitted coefficients, residuals, and metadata for subsequent analysis. The summary function extracts coefficient estimates, standard errors, t-statistics, p-values, and model fit statistics. The model object also works with generic functions: coef extracts coefficients, fitted extracts fitted values, residuals extracts residuals, and predict generates predictions for new data.

Interpreting coefficients

Coefficient interpretation depends on the predictor type and the model structure. For a continuous predictor, the coefficient estimates the change in response for a one-unit increase. This holds all other predictors constant. For a binary indicator variable, the coefficient represents the estimated difference in response between the two levels.

For models with interactions, main effect coefficients are the effect of that predictor when all interacting variables are zero. This makes main effect interpretation depend on how you center the interacting variables. Centering continuous variables before fitting interaction models helps. The main effect coefficients then represent the effect at the mean of the other variable rather than at zero.

Checking assumptions

Ordinary least squares regression needs several assumptions to hold. The relationship must be linear, errors must be independent, error variance must be constant (homoscedasticity), and errors should be normally distributed for valid inference. Violating these assumptions causes problems. Coefficient estimates may become biased, standard errors can be wrong, and p-values may become invalid. The severity depends on which assumption breaks and how badly.

Residual plots are the primary tool for checking regression assumptions. Plotting residuals versus fitted values detects non-linearity and heteroscedasticity (non-constant variance). A Q-Q plot of residuals checks normality. Plotting residuals in order of data collection detects temporal autocorrelation. The ggfortify package provides an autoplot() function that generates the full set of diagnostic plots for any lm object.

Next steps

Now that you understand regression models in R, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.