rguides

Regression Models in R

Regression models are the workhorse of statistical analysis. They let you quantify relationships between variables, predict outcomes, and understand which factors actually matter. R’s built-in lm() and glm() functions handle most of what you need for ordinary linear regression and generalized linear models—logistic, Poisson, gamma, and more.

This guide walks through fitting regression models, interpreting output, and checking assumptions so you can use these tools confidently on real data.

What you’ll learn

This tutorial covers the key concepts and practical techniques for working with Regression Models in R. By the end, you will know how to apply the core functions in real data analysis workflows.

Fitting linear regression with lm()

The lm() function fits ordinary least squares (OLS) linear regression. It expects a formula describing the relationship you want to model, plus a data frame.

# Simple linear regression
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

The formula mpg ~ wt reads as “mpg predicted by weight.” The output gives you:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.285      1.878   19.86  < 2e-16 ***
wt           -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.3e-10

Intercept (37.285) is the predicted mpg when weight is zero—physically meaningless here but needed for the model. wt coefficient (-5.344) means each additional 1000 lb reduces mpg by about 5.3. The p-value (< 0.001) tells you this relationship is unlikely to be noise.

R-squared (0.753) means weight explains 75% of the variance in mpg. Adjusted R-squared (0.745) penalizes for adding predictors, so it stays honest when you add more variables.

Adding multiple predictors

Include more variables with +:

# Multiple regression: mpg predicted by weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)

Now each coefficient is adjusted for the other variables in the model—the wt effect is what remains after controlling for hp, and vice versa.

Use * for interactions:

# Main effects plus interaction
model <- lm(mpg ~ wt * hp, data = mtcars)

This expands to wt + hp + wt:hp. If you want additive only (no interaction), use wt + hp.

To include all available predictors, use . in the formula:

# All variables except mpg
model <- lm(mpg ~ ., data = mtcars)

Generalized linear models with glm()

glm() extends lm() to handle non-normal outcomes—binary events, counts, positive continuous values—through different error distributions and link functions.

Logistic regression for binary outcomes

When your outcome is 0/1 (admitted or not, purchased or not), logistic regression models the probability:

# Predict admission based on GRE score and GPA
adm_model <- glm(admitted ~ gre + gpa, data = admission_df,
                 family = binomial)
summary(adm_model)

The coefficients are on the log-odds scale. A positive coefficient means higher values increase the log-odds of admission.

To convert to odds ratios:

exp(coef(adm_model))
# gre coefficient: exp(0.002) = 1.002
# Each 1-point GRE increase multiplies odds by ~1.002

To get predicted probabilities instead of log-odds:

predict(adm_model, type = "response")

The type = "response" argument converts the log-odds predictions through the logistic function, giving you probabilities between 0 and 1.

Poisson regression for count data

Count outcomes (number of visits, accidents, days absent) often follow a Poisson distribution:

# Model days absent by age and sex
 absences_model <- glm(days_absent ~ age + sex,
                       data = employee,
                       family = poisson)
summary(absences_model)

For rate modeling (e.g., accidents per driver-year), use an offset to account for exposure:

# Offset by log(exposure) for rate comparison
model <- glm(accidents ~ age + offset(log_exposure),
             data = drivers,
             family = poisson)

When to use gaussian VS other families

Outcome TypeDistributionLink
Continuous, normalgaussian (default)identity
Binary (0/1)binomiallogit
Count, Poissonpoissonlog
Positive continuous (skewed)Gammainverse
Overdispersed countsquasipoissonlog

gaussian with lm() and glm(family = "gaussian") give identical results—lm() is just more convenient for OLS.

Extracting results from model objects

Both lm() and glm() return model objects with useful components:

# Coefficient table
coef(model)
# (Intercept)         wt
#    37.2851     -5.3445

# Confidence intervals for coefficients
confint(model, level = 0.95)
#                   2.5 %    97.5 %
# (Intercept)  33.4505   41.1196
# wt           -6.4740   -4.2151

# Fitted values (predicted y)
fitted(model)

# Residuals (observed - predicted)
residuals(model)

The broom package makes this cleaner for reporting:

library(broom)

# Coefficient table as data.frame
tidy(model)
#   term        estimate std.error statistic  p.value
# 1 (Intercept)    37.28      1.878    19.857 2.6e-19
# 2 wt              -5.34     0.559    -9.563 1.3e-10

# Model-level statistics
glance(model)
#   r.squared adj.r.squared   sigma statistic p.value    df
# 1     0.753         0.745    3.046      91.4  1.3e-10     1

# Augment: add predictions and residuals to your data
augment(model, data = mtcars)
#   .rownames  mpg    wt  .fitted .resid    .hat .sigma
# 1     Mazda RX4  21.0 2.620   23.28   -2.280  0.115   3.14

The .fitted column has predictions, .resid has residuals, .hat shows use (influence on fitted values), and .sigma is the Cook’s distance component.

Making predictions on new data

Use predict() with a newdata argument:

# Predict mpg for a car weighing 3000 lbs
predict(model, newdata = data.frame(wt = 3.0))
#        1
#  21.251

# Get confidence intervals for the predicted mean
predict(model, newdata = df, interval = "confidence")
#      fit   lwr   upr
# 1  21.25  19.8  22.7

# Get prediction intervals for new observations
predict(model, newdata = df, interval = "prediction")
#      fit   lwr   upr
# 1  21.25  15.1  27.4

Prediction intervals are wider than confidence intervals because they account for both uncertainty in the mean and the irreducible observation-level noise.

Checking model assumptions

Linear regression makes four key assumptions. The built-in diagnostic plot surfaces them:

plot(model, which = 1:4)

1. Residuals vs Fitted (linearity): Points should scatter randomly around the horizontal line at 0. A curved pattern means the relationship isn’t linear—consider transforming predictors or adding polynomial terms.

2. Normal Q-Q (normality): Points should follow the diagonal. Deviations in the tails suggest outliers or skewed residuals.

3. Scale-Location (homoscedasticity): The line should be roughly flat and horizontal. A funnel shape (spread widening) means non-constant variance—glm(family = "Gamma") or MASS::rlm() may help.

4. Residuals vs Use: Points outside the dashed Cook’s distance lines are influential—removing them would substantially change the model. Investigate these cases.

Multicollinearity and outliers

library(car)

# Variance Inflation Factor - values > 5 or 10 indicate problems
vif(model)
#        wt       hp
#     2.94     2.94

# Formal outlier test
outlierTest(model)

If VIF is high, the correlated predictors are redundant—consider dropping one or using principal components.

Comparing nested models

Use anova() to test whether adding variables significantly improves fit:

# Compare full model to reduced model
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(reduced, full)

The p-value tests the null hypothesis that the additional coefficients are all zero. A small value means at least one of them matters.

Common pitfalls

Logistic regression gives log-odds coefficients. If you want odds ratios, exp(coef(model)). If you want probabilities, predict(model, type = "response"). Reporting raw coefficients without context is confusing.

glm() doesn’t produce R-squared. Use deviance-based pseudo-R² instead:

pseudo_r2 <- 1 - model$deviance / model$null.deviance

Or rely on glance(model)$deviance for model comparison.

Character variables get dummy-coded automatically. lm(y ~ sex) creates a binary dummy variable. This is usually what you want, but be aware.

NAs drop entire rows by default. If any variable in the formula has an NA, that row is excluded. For complex missing-data patterns, consider multiple imputation with the mice package.

Interactions between continuous variables create polynomial terms. x1:x2 gives x1 * x2. It’s not matrix multiplication—it’s a new predictor. Use I() for arithmetic in formulas: I(x1 * x2) forces actual multiplication.

Next steps

Now that you understand regression models in r, explore these related topics to deepen your knowledge and apply these techniques in more complex scenarios.

Linear regression basics

lm(y ~ x1 + x2, data = df) fits an OLS regression. summary(model) shows coefficients, standard errors, t-statistics, p-values, and R-squared. coef(model) extracts coefficients. predict(model, newdata) generates predictions. resid(model) returns residuals.

Formula notation: y ~ x1 + x2 fits main effects; y ~ x1 * x2 fits main effects plus interaction x1:x2; y ~ x1 + I(x1^2) adds a polynomial term; y ~ . includes all columns except the outcome.

Diagnostics

Residual plots diagnose regression assumptions. plot(model) generates four diagnostic plots: residuals vs fitted (linearity), Q-Q plot (normality), scale-location (homoscedasticity), and residuals vs use (influential points). Cook’s distance identifies highly influential observations. car::vif(model) computes variance inflation factors, values above 5-10 indicate multicollinearity.

Multiple regression

With multiple predictors, coefficients represent the partial effect of each predictor holding others constant. Standardize predictors with scale() to make coefficients comparable across variables with different units. car::Anova(model) computes Type III ANOVA tests for each predictor. step(model) performs stepwise variable selection by AIC, useful for exploration but should not be used as the sole selection method for inference.

Model comparison

Compare nested models with anova(model1, model2), the F-test assesses whether the additional predictors improve fit significantly. Compare non-nested models with AIC: AIC(model1, model2) (lower is better). Cross-validation via caret or tidymodels gives honest out-of-sample performance estimates that R-squared from the training set does not.

The linear model framework

Linear regression in R uses the formula interface, which describes the model structure in readable notation. A formula separates the response variable on the left from the predictors on the right with a tilde. Adding predictors with plus includes them additively. Multiplying predictors with an asterisk includes main effects and their interaction. The formula interface is consistent across R’s modeling functions, the same syntax works for linear models, generalized linear models, mixed models, and many other model types.

The lm function fits a linear model by ordinary least squares. The result is a model object that stores the fitted coefficients, residuals, and metadata for subsequent analysis. The summary function extracts coefficient estimates, standard errors, t-statistics, p-values, and model fit statistics. The model object also works with generic functions: coef extracts coefficients, fitted extracts fitted values, residuals extracts residuals, and predict generates predictions for new data.

Interpreting coefficients

Coefficient interpretation depends on the predictor type and the model structure. For a continuous predictor, the coefficient represents the estimated change in the response for a one-unit increase in the predictor, holding all other predictors constant. For a binary indicator variable, the coefficient represents the estimated difference in the response between the two levels.

For models with interactions, main effect coefficients are the effect of that predictor when all interacting variables are zero. This makes main effect interpretation dependent on the centering of the interacting variables. Centering continuous variables before fitting interaction models produces main effect coefficients with more natural interpretations, they represent the effect at the mean of the other variable rather than at zero.

Checking assumptions

Ordinary least squares regression makes assumptions: linearity of the relationship, independence of errors, constant variance of errors, and normality of errors for inference. Violating these assumptions biases coefficient estimates, produces incorrect standard errors, or invalidates p-values, depending on which assumption is violated and how severely.

Residual plots are the primary tool for checking regression assumptions. Plotting residuals versus fitted values detects non-linearity and heteroscedasticity, non-constant variance. A Q-Q plot of residuals checks normality. Plotting residuals in order of data collection detects temporal autocorrelation. The autoplot function from the ggfortify package produces a standard set of diagnostic plots for any lm object, making routine assumption checking easy.

See also