Linear Regression in R
Linear regression is one of the most fundamental and widely used statistical modeling techniques in data science and statistics. It models the relationship between a continuous response variable and one or more predictor variables with a linear equation (a straight line in the single-predictor case). Despite its relative simplicity, linear regression serves as the foundation for understanding statistical modeling and remains highly relevant for many real-world applications.
In this comprehensive tutorial, you will learn how to perform both simple and multiple linear regression in R using the built-in lm() function. We will cover model fitting, detailed interpretation of results, diagnostic checks to validate assumptions, making predictions on new data, and visualization techniques. By the end of this guide, you will have a solid understanding of how to apply linear regression to real-world data analysis problems and how to interpret the output correctly.
When to Use Linear Regression
Linear regression is appropriate in many analytical scenarios, but it is particularly well-suited for certain types of problems. Understanding when to use linear regression will help you apply it correctly and avoid misuse.
Linear regression is appropriate when your response variable is continuous (not categorical or binary), when you want to understand and quantify the relationship between predictors and response, and when you need interpretable results whose coefficients have clear meanings. It also requires that the underlying assumptions are reasonably met: linearity (the relationship is a straight line), independence of observations, normality of residuals, and homoscedasticity (constant variance of residuals).
Common applications of linear regression include predicting house prices from square footage and location, estimating sales revenue from advertising spend and marketing budget allocation, analyzing test scores from study time and attendance, forecasting demand from price and seasonal factors, and many other business and scientific applications where understanding the magnitude and direction of relationships is important.
Simple Linear Regression
Simple linear regression uses exactly one predictor variable to explain variation in the response variable. The model is: y = β0 + β1·x + ε. Here y is the response variable, x is the predictor variable, β0 is the intercept (the predicted value of y when x equals zero), β1 is the slope (the change in y for each one-unit increase in x), and ε is the error term (the difference between observed and predicted values that cannot be explained by the model).
The goal of linear regression is to find the best-fitting line that minimizes the sum of squared errors (the differences between observed and predicted values). This method is called ordinary least squares (OLS) estimation.
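To make the least-squares idea concrete, here is a minimal sketch that computes the OLS slope and intercept by hand, using the mtcars variables that appear later in this tutorial. For a single predictor, minimizing the sum of squared errors has a closed-form solution; lm() does this (and much more) for you.

```r
# Sketch: ordinary least squares by hand for a single predictor
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form OLS: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)

round(c(intercept = beta0, slope = beta1), 3)
```

These values match the coefficients reported by lm(mpg ~ wt, data = mtcars) below.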
Example: Predicting Fuel Efficiency
Let us work through a practical example using the mtcars dataset, which is built into R and contains information about various car models. We will predict miles per gallon (mpg) based on the weight of the car (wt):
# Load the mtcars dataset (built into R)
data(mtcars)
# Examine the relevant variables
head(mtcars[, c("mpg", "wt")])
# Get summary statistics
summary(mtcars$wt)
Now fit a simple linear regression model:
# Fit a simple linear regression model
# Predict mpg (miles per gallon) from wt (weight in 1000 lbs)
model <- lm(mpg ~ wt, data = mtcars)
# View the complete model summary
summary(model)
The output shows the key coefficients: the intercept is 37.285 and the weight coefficient is -5.344, so the fitted equation is mpg = 37.285 - 5.344 × wt. The interpretation: for every additional 1000 pounds of weight, predicted miles per gallon decreases by about 5.34. This makes intuitive sense: heavier cars typically have lower fuel efficiency.
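You can also extract the estimated coefficients from the model object with coef() and plug them into the equation yourself; for example, the predicted mpg for a 3000 lb car (wt = 3):

```r
# Extract the fitted coefficients as a named vector
model <- lm(mpg ~ wt, data = mtcars)
coef(model)

# Plug into the fitted equation by hand: predicted mpg at wt = 3
unname(coef(model)[1] + coef(model)[2] * 3)  # about 21.25
```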
Interpreting the Model Summary
The summary() function provides comprehensive information about your regression model. Understanding each component is essential for proper interpretation and for determining whether your model is useful.
Key elements to examine: the Residuals block gives summary statistics of the differences between observed and fitted values. The Coefficients section reports the intercept and slope estimates along with their standard errors, t-values, and p-values, which indicate statistical significance. The Residual Standard Error represents the typical magnitude of prediction errors, in the same units as the response variable. R-squared indicates the proportion of variance in the response explained by the predictor, and the F-statistic tests the overall significance of the model.
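The same quantities can be accessed programmatically from the object returned by summary(), which is convenient for reports or further computation:

```r
# Access summary components directly instead of reading printed output
model <- lm(mpg ~ wt, data = mtcars)
s <- summary(model)

s$coefficients    # estimates, standard errors, t-values, p-values
s$sigma           # residual standard error
s$r.squared       # proportion of variance explained
s$adj.r.squared   # R-squared adjusted for the number of predictors
s$fstatistic      # F-statistic with its degrees of freedom
```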
Multiple Linear Regression
Multiple regression extends simple regression by using two or more predictor variables simultaneously. This allows you to examine the effect of each predictor while controlling for the others, which is crucial for understanding the independent contribution of each variable.
The formula extends to: y = β0 + β1·x1 + β2·x2 + … + βn·xn + ε. Each coefficient represents the effect of that predictor on the response while holding all other predictors constant, often called a partial or ceteris paribus interpretation.
Example: Adding More Predictors
# Fit multiple regression with weight and horsepower as predictors
model_multi <- lm(mpg ~ wt + hp, data = mtcars)
# View complete summary
summary(model_multi)
Interpreting these coefficients requires understanding that they represent partial effects. The weight coefficient (-3.88) means that, controlling for horsepower, each additional 1000 pounds of weight decreases predicted mpg by 3.88. The horsepower coefficient (-0.032) means that, controlling for weight, each additional horsepower decreases predicted mpg by 0.032.
Both predictors are statistically significant (p < 0.05), indicating that both weight and horsepower are meaningful independent predictors of fuel efficiency.
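A natural follow-up is to test formally whether adding hp improves on the weight-only model, and to attach confidence intervals to the estimates. Because the two models are nested, anova() performs the partial F-test:

```r
# Fit the nested models
model <- lm(mpg ~ wt, data = mtcars)
model_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: does hp add explanatory power beyond wt?
anova(model, model_multi)

# 95% confidence intervals for the coefficients of the larger model
confint(model_multi, level = 0.95)
```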
Model Diagnostics
Before trusting your regression results, you must check that the underlying assumptions are reasonably met. R provides convenient diagnostic plots through the base graphics system.
1. Linearity Assumption
The relationship between each predictor and the response should be linear:
# Plot residuals vs fitted values to check linearity
plot(model, which = 1)
If you see a curve or a funnel shape, the linearity assumption may be violated.
2. Normality of Residuals
Residuals should follow a normal distribution for valid hypothesis testing:
# Q-Q plot to check normality
plot(model, which = 2)
If points deviate substantially from the diagonal, the normality assumption may be questionable.
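As a formal complement to the Q-Q plot, the Shapiro-Wilk test checks the residuals against the null hypothesis of normality; a small p-value suggests a departure from normality:

```r
# Shapiro-Wilk normality test on the model residuals
model <- lm(mpg ~ wt, data = mtcars)
shapiro.test(residuals(model))
```

Formal tests and visual checks complement each other: with small samples the test has little power, and with large samples it flags trivial deviations, so the Q-Q plot remains worth inspecting.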
3. Homoscedasticity (Constant Variance)
The variance of residuals should be constant across all levels of predictors:
# Scale-location plot to check homoscedasticity
plot(model, which = 3)
A funnel shape indicates heteroscedasticity.
4. Checking for Outliers
Some data points can disproportionately affect the regression results:
# Residuals vs leverage plot to identify influential points
plot(model, which = 5)
Points with high leverage and large residuals are concerning.
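Influence can also be quantified numerically with Cook's distance; a common rule of thumb flags observations above 4/n for closer inspection. It is also handy to draw all four diagnostic plots in a single panel:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for every observation; flag points above 4/n
d <- cooks.distance(model)
round(sort(d[d > 4 / nrow(mtcars)], decreasing = TRUE), 3)

# All four diagnostic plots in one 2x2 panel
op <- par(mfrow = c(2, 2))
plot(model)
par(op)
```

Flagged points deserve a second look, not automatic removal; an influential observation may be a data error or a genuinely unusual but valid case.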
Making Predictions
One of the primary purposes of regression is making predictions on new data:
# Create new data for prediction
new_cars <- data.frame(wt = c(2.0, 2.5, 3.0, 3.5, 4.0))
# Predict mpg for these new weight values
predictions <- predict(model, newdata = new_cars)
predictions
You can also obtain confidence intervals:
# 95% confidence intervals for the mean response
predict(model, newdata = new_cars, interval = "confidence")
Or prediction intervals (wider):
# 95% prediction intervals for new observations
predict(model, newdata = new_cars, interval = "prediction")
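Prediction intervals are wider because they account for the residual scatter of an individual new observation, not just uncertainty about the mean response. A quick comparison of the interval widths makes this visible:

```r
model <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2.0, 3.0, 4.0))

conf_int <- predict(model, newdata = new_cars, interval = "confidence")
pred_int <- predict(model, newdata = new_cars, interval = "prediction")

# Prediction intervals are wider at every weight
cbind(wt = new_cars$wt,
      ci_width = conf_int[, "upr"] - conf_int[, "lwr"],
      pi_width = pred_int[, "upr"] - pred_int[, "lwr"])
```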
Visualizing Results
Visualizations help communicate regression results effectively:
library(ggplot2)
# Create scatter plot with regression line and confidence band
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
  labs(
    title = "Linear Regression: Fuel Efficiency vs Weight",
    subtitle = "mpg = 37.29 - 5.34 × wt",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()
Summary
Linear regression is a powerful and interpretable technique for modeling continuous outcomes. In this tutorial, you learned how to fit simple and multiple linear regression using lm(), interpret coefficients in context while understanding partial effects, perform model diagnostics to verify assumptions, make predictions on new data, and visualize results with ggplot2.
The key takeaways are that the lm() function provides everything you need for basic regression analysis, coefficient interpretation requires careful attention to the holding other variables constant framing, and diagnostic checks are essential for valid inference.
In the next tutorial in this series, we will cover logistic regression, which extends linear regression concepts to binary classification problems.