Regression Models in R
Regression models are the workhorse of statistical analysis. They let you quantify relationships between variables, predict outcomes, and understand which factors actually matter. R’s built-in lm() and glm() functions handle most of what you need for ordinary linear regression and generalized linear models—logistic, Poisson, gamma, and more.
This guide walks through fitting regression models, interpreting output, and checking assumptions so you can use these tools confidently on real data.
What you’ll learn
This tutorial covers the key concepts and practical techniques for working with Regression Models in R. By the end, you will know how to apply the core functions in real data analysis workflows.
Fitting linear regression with lm()
The lm() function fits ordinary least squares (OLS) linear regression. It expects a formula describing the relationship you want to model, plus a data frame.
# Simple linear regression
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
The formula mpg ~ wt reads as “mpg predicted by weight.” The output gives you:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.878 19.86 < 2e-16 ***
wt -5.344 0.559 -9.56 1.3e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared: 0.753, Adjusted R-squared: 0.745
F-statistic: 91.4 on 1 and 30 DF, p-value: 1.3e-10
Intercept (37.285) is the predicted mpg when weight is zero—physically meaningless here but needed for the model. wt coefficient (-5.344) means each additional 1000 lb reduces mpg by about 5.3. The p-value (< 0.001) tells you this relationship is unlikely to be noise.
R-squared (0.753) means weight explains 75% of the variance in mpg. Adjusted R-squared (0.745) penalizes for adding predictors, so it stays honest when you add more variables.
Adding multiple predictors
Include more variables with +:
# Multiple regression: mpg predicted by weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
Now each coefficient is adjusted for the other variables in the model—the wt effect is what remains after controlling for hp, and vice versa.
Use * for interactions:
# Main effects plus interaction
model <- lm(mpg ~ wt * hp, data = mtcars)
This expands to wt + hp + wt:hp. If you want additive only (no interaction), use wt + hp.
To include all available predictors, use . in the formula:
# All variables except mpg
model <- lm(mpg ~ ., data = mtcars)
Generalized linear models with glm()
glm() extends lm() to handle non-normal outcomes—binary events, counts, positive continuous values—through different error distributions and link functions.
Logistic regression for binary outcomes
When your outcome is 0/1 (admitted or not, purchased or not), logistic regression models the probability:
# Predict admission based on GRE score and GPA
adm_model <- glm(admitted ~ gre + gpa, data = admission_df,
family = binomial)
summary(adm_model)
The coefficients are on the log-odds scale. A positive coefficient means higher values increase the log-odds of admission.
To convert to odds ratios:
exp(coef(adm_model))
# gre coefficient: exp(0.002) = 1.002
# Each 1-point GRE increase multiplies odds by ~1.002
To get predicted probabilities instead of log-odds:
predict(adm_model, type = "response")
The type = "response" argument converts the log-odds predictions through the logistic function, giving you probabilities between 0 and 1.
Poisson regression for count data
Count outcomes (number of visits, accidents, days absent) often follow a Poisson distribution:
# Model days absent by age and sex
absences_model <- glm(days_absent ~ age + sex,
data = employee,
family = poisson)
summary(absences_model)
For rate modeling (e.g., accidents per driver-year), use an offset to account for exposure:
# Offset by log(exposure) for rate comparison
model <- glm(accidents ~ age + offset(log_exposure),
data = drivers,
family = poisson)
When to use gaussian VS other families
| Outcome Type | Distribution | Link |
|---|---|---|
| Continuous, normal | gaussian (default) | identity |
| Binary (0/1) | binomial | logit |
| Count, Poisson | poisson | log |
| Positive continuous (skewed) | Gamma | inverse |
| Overdispersed counts | quasipoisson | log |
gaussian with lm() and glm(family = "gaussian") give identical results—lm() is just more convenient for OLS.
Extracting results from model objects
Both lm() and glm() return model objects with useful components:
# Coefficient table
coef(model)
# (Intercept) wt
# 37.2851 -5.3445
# Confidence intervals for coefficients
confint(model, level = 0.95)
# 2.5 % 97.5 %
# (Intercept) 33.4505 41.1196
# wt -6.4740 -4.2151
# Fitted values (predicted y)
fitted(model)
# Residuals (observed - predicted)
residuals(model)
The broom package makes this cleaner for reporting:
library(broom)
# Coefficient table as data.frame
tidy(model)
# term estimate std.error statistic p.value
# 1 (Intercept) 37.28 1.878 19.857 2.6e-19
# 2 wt -5.34 0.559 -9.563 1.3e-10
# Model-level statistics
glance(model)
# r.squared adj.r.squared sigma statistic p.value df
# 1 0.753 0.745 3.046 91.4 1.3e-10 1
# Augment: add predictions and residuals to your data
augment(model, data = mtcars)
# .rownames mpg wt .fitted .resid .hat .sigma
# 1 Mazda RX4 21.0 2.620 23.28 -2.280 0.115 3.14
The .fitted column has predictions, .resid has residuals, .hat shows use (influence on fitted values), and .sigma is the Cook’s distance component.
Making predictions on new data
Use predict() with a newdata argument:
# Predict mpg for a car weighing 3000 lbs
predict(model, newdata = data.frame(wt = 3.0))
# 1
# 21.251
# Get confidence intervals for the predicted mean
predict(model, newdata = df, interval = "confidence")
# fit lwr upr
# 1 21.25 19.8 22.7
# Get prediction intervals for new observations
predict(model, newdata = df, interval = "prediction")
# fit lwr upr
# 1 21.25 15.1 27.4
Prediction intervals are wider than confidence intervals because they account for both uncertainty in the mean and the irreducible observation-level noise.
Checking model assumptions
Linear regression makes four key assumptions. The built-in diagnostic plot surfaces them:
plot(model, which = 1:4)
1. Residuals vs Fitted (linearity): Points should scatter randomly around the horizontal line at 0. A curved pattern means the relationship isn’t linear—consider transforming predictors or adding polynomial terms.
2. Normal Q-Q (normality): Points should follow the diagonal. Deviations in the tails suggest outliers or skewed residuals.
3. Scale-Location (homoscedasticity): The line should be roughly flat and horizontal. A funnel shape (spread widening) means non-constant variance—glm(family = "Gamma") or MASS::rlm() may help.
4. Residuals vs Use: Points outside the dashed Cook’s distance lines are influential—removing them would substantially change the model. Investigate these cases.
Multicollinearity and outliers
library(car)
# Variance Inflation Factor - values > 5 or 10 indicate problems
vif(model)
# wt hp
# 2.94 2.94
# Formal outlier test
outlierTest(model)
If VIF is high, the correlated predictors are redundant—consider dropping one or using principal components.
Comparing nested models
Use anova() to test whether adding variables significantly improves fit:
# Compare full model to reduced model
reduced <- lm(mpg ~ wt, data = mtcars)
full <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(reduced, full)
The p-value tests the null hypothesis that the additional coefficients are all zero. A small value means at least one of them matters.
Common pitfalls
Logistic regression gives log-odds coefficients. If you want odds ratios, exp(coef(model)). If you want probabilities, predict(model, type = "response"). Reporting raw coefficients without context is confusing.
glm() doesn’t produce R-squared. Use deviance-based pseudo-R² instead:
pseudo_r2 <- 1 - model$deviance / model$null.deviance
Or rely on glance(model)$deviance for model comparison.
Character variables get dummy-coded automatically. lm(y ~ sex) creates a binary dummy variable. This is usually what you want, but be aware.
NAs drop entire rows by default. If any variable in the formula has an NA, that row is excluded. For complex missing-data patterns, consider multiple imputation with the mice package.
Interactions between continuous variables create polynomial terms. x1:x2 gives x1 * x2. It’s not matrix multiplication—it’s a new predictor. Use I() for arithmetic in formulas: I(x1 * x2) forces actual multiplication.
Next steps
Now that you understand regression models in r, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.
Linear regression basics
lm(y ~ x1 + x2, data = df) fits an OLS regression. summary(model) shows coefficients, standard errors, t-statistics, p-values, and R-squared. coef(model) extracts coefficients. predict(model, newdata) generates predictions. resid(model) returns residuals.
Formula notation: y ~ x1 + x2 fits main effects; y ~ x1 * x2 fits main effects plus interaction x1:x2; y ~ x1 + I(x1^2) adds a polynomial term; y ~ . includes all columns except the outcome.
Diagnostics
Residual plots diagnose regression assumptions. plot(model) generates four diagnostic plots: residuals vs fitted (linearity), Q-Q plot (normality), scale-location (homoscedasticity), and residuals vs use (influential points). Cook’s distance identifies highly influential observations. car::vif(model) computes variance inflation factors, values above 5-10 indicate multicollinearity.
Multiple regression
With multiple predictors, coefficients represent the partial effect of each predictor holding others constant. Standardize predictors with scale() to make coefficients comparable across variables with different units. car::Anova(model) computes Type III ANOVA tests for each predictor. step(model) performs stepwise variable selection by AIC, useful for exploration but should not be used as the sole selection method for inference.
Model comparison
Compare nested models with anova(model1, model2), the F-test assesses whether the additional predictors improve fit significantly. Compare non-nested models with AIC: AIC(model1, model2) (lower is better). Cross-validation via caret or tidymodels gives honest out-of-sample performance estimates that R-squared from the training set does not.
The linear model framework
Linear regression in R uses the formula interface, which describes the model structure in readable notation. A formula separates the response variable on the left from the predictors on the right with a tilde. Adding predictors with plus includes them additively. Multiplying predictors with an asterisk includes main effects and their interaction. The formula interface is consistent across R’s modeling functions, the same syntax works for linear models, generalized linear models, mixed models, and many other model types.
The lm function fits a linear model by ordinary least squares. The result is a model object that stores the fitted coefficients, residuals, and metadata for subsequent analysis. The summary function extracts coefficient estimates, standard errors, t-statistics, p-values, and model fit statistics. The model object also works with generic functions: coef extracts coefficients, fitted extracts fitted values, residuals extracts residuals, and predict generates predictions for new data.
Interpreting coefficients
Coefficient interpretation depends on the predictor type and the model structure. For a continuous predictor, the coefficient represents the estimated change in the response for a one-unit increase in the predictor, holding all other predictors constant. For a binary indicator variable, the coefficient represents the estimated difference in the response between the two levels.
For models with interactions, main effect coefficients are the effect of that predictor when all interacting variables are zero. This makes main effect interpretation dependent on the centering of the interacting variables. Centering continuous variables before fitting interaction models produces main effect coefficients with more natural interpretations, they represent the effect at the mean of the other variable rather than at zero.
Checking assumptions
Ordinary least squares regression makes assumptions: linearity of the relationship, independence of errors, constant variance of errors, and normality of errors for inference. Violating these assumptions biases coefficient estimates, produces incorrect standard errors, or invalidates p-values, depending on which assumption is violated and how severely.
Residual plots are the primary tool for checking regression assumptions. Plotting residuals versus fitted values detects non-linearity and heteroscedasticity, non-constant variance. A Q-Q plot of residuals checks normality. Plotting residuals in order of data collection detects temporal autocorrelation. The autoplot function from the ggfortify package produces a standard set of diagnostic plots for any lm object, making routine assumption checking easy.
See also
- Linear Regression in R, a dedicated walkthrough for simple and multiple OLS regression
- Logistic Regression in R, binary outcome modeling with
glm(family = binomial) - ANOVA in R, comparing means and testing group differences