Bayesian Modelling with rstanarm
Bayesian statistics offers a different way to think about data analysis. Instead of asking “what do the data alone tell us?”, Bayesian methods ask “what do the data tell us given what we already believe?” The rstanarm package brings this powerful framework to R users through a familiar interface.
What Makes Bayesian Different
In classical (frequentist) statistics, you treat parameters as unknown fixed values and compute the probability of observing your data given those parameters. In Bayesian statistics, you treat parameters themselves as random variables with their own probability distributions.
You start with a prior distribution — your belief about a parameter before seeing the data. Then you combine this with the data through likelihood to get the posterior distribution — your updated belief after considering the evidence.
The formula is simple:
posterior ∝ likelihood × prior
This is Bayes’ theorem in action. The posterior becomes your new baseline for inference, and it serves as the prior for any future data you collect.
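To make the proportionality concrete, here is a minimal grid-approximation sketch in base R. The data values and the assumed standard deviation of 5 are invented for illustration; the point is the mechanics — evaluate the likelihood and prior on a grid of candidate parameter values, multiply pointwise, and normalize.

```r
# posterior ∝ likelihood × prior, computed on a grid (illustrative values)
theta <- seq(10, 50, by = 0.1)             # candidate values for mean mpg
prior <- dnorm(theta, mean = 30, sd = 10)  # prior belief: around 30 mpg
y <- c(21, 24, 19, 27)                     # a few hypothetical mpg observations

# likelihood of the data at each candidate value (sd = 5 assumed known)
lik <- sapply(theta, function(m) prod(dnorm(y, mean = m, sd = 5)))

posterior <- lik * prior
posterior <- posterior / sum(posterior)    # normalize so the grid sums to 1

theta[which.max(posterior)]                # posterior mode, pulled toward the data
```

Note how the posterior mode lands between the prior mean (30) and the sample mean (about 22.8), closer to the data because four observations carry more precision here than the vague prior.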
Getting Started with rstanarm
First, install and load the package:
install.packages("rstanarm")
library(rstanarm)
rstanarm provides an interface to Stan, a powerful probabilistic programming language. The main functions mirror familiar R regression functions but add prior specification and Bayesian interpretation.
Fitting Your First Model
Let’s work with a classic example — predicting fuel efficiency based on car weight:
# Load built-in data
data(mtcars)
# Fit a Bayesian linear regression
# Compare to: lm(mpg ~ wt, data = mtcars)
fit <- stan_glm(mpg ~ wt, data = mtcars,
                prior = normal(0, 10),
                prior_intercept = normal(30, 10),
                chains = 4, iter = 2000)
The syntax should look familiar — it’s almost identical to glm(). The key differences are the prior specifications and the output.
Understanding Priors
Priors are where Bayesian analysis gets interesting. They encode your existing knowledge:
- Weakly informative priors — say “I don’t know much, but it’s probably in this range”
- Strong priors — express strong beliefs based on previous research
- Flat priors — essentially say “I know nothing” (similar to maximum likelihood)
For the intercept, normal(30, 10) says “before seeing data, I think fuel efficiency is around 30 mpg, with standard deviation 10.” This is fairly vague — the data will dominate the final estimate.
For the coefficient, normal(0, 10) is a common default that says “I think the effect could be positive or negative, but large effects are less likely.”
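After fitting, you can ask rstanarm which priors were actually applied — by default some priors are autoscaled to the data, so the values used can differ from the values you typed. Assuming the `fit` object from the example above:

```r
# Report the priors the model actually used; with autoscaling enabled,
# these can differ from the values supplied in the stan_glm() call
prior_summary(fit)
```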
Interpreting the Posterior
Now let’s look at what the model gives us:
print(fit)
You’ll see point estimates (median, mean) and uncertainty intervals. The output looks different from lm() — instead of a single coefficient with a p-value, you get a distribution.
To get credible intervals (the Bayesian equivalent of confidence intervals):
posterior_interval(fit, prob = 0.9)
A 90% credible interval for the weight coefficient might be [-5.8, -3.2]. You can interpret this as: “Given our priors and the data, there’s a 90% probability that the true coefficient lies between -5.8 and -3.2.”
This is more intuitive than the frequentist interpretation — you’re making probability statements about the parameter, not about repeated sampling.
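Because the posterior is represented as a set of draws, probability statements like these are just proportions of draws. A sketch, assuming the `fit` object from earlier:

```r
# Extract posterior draws as a matrix (one column per parameter)
draws <- as.matrix(fit)

# Posterior probability that the weight coefficient is negative
mean(draws[, "wt"] < 0)

# Posterior probability the slope is steeper than -4 mpg per 1000 lb
mean(draws[, "wt"] < -4)
```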
Checking Model Fit
rstanarm provides several diagnostics:
# Check convergence with trace plots
plot(fit, "trace")
# Posterior predictive checks
pp_check(fit)
Calling plot(fit, "trace") shows trace plots — you want to see “fuzzy caterpillars” rather than systematic patterns. If the chains converged properly, the plots should look like stationary random noise.
Posterior predictive checks compare simulated data from your model to the actual data. If the real data falls within the simulated range, your model is reasonable.
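Beyond visual checks, numerical diagnostics are worth a look. The summary of an rstanarm fit includes Rhat and effective sample size; Rhat close to 1 (conventionally below about 1.01) suggests the chains agree with each other:

```r
# Numerical convergence diagnostics: Rhat should be near 1, and n_eff
# (effective sample size) should be a reasonable fraction of the total
# post-warmup draws
summary(fit)
```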
Adding More Complexity
You can fit more elaborate models:
# Multiple predictors
fit_full <- stan_glm(mpg ~ wt + cyl + hp, data = mtcars,
                     prior = normal(0, 5),
                     chains = 4)
# Generalized linear models
fit_poisson <- stan_glm(count ~ treatment, data = your_data,
                        family = poisson(),
                        prior = normal(0, 2))
Making Predictions
Prediction works similarly to other R models:
# Point prediction
predict(fit, newdata = data.frame(wt = 3.0))
# Full posterior for predictions (includes uncertainty)
y_pred <- posterior_predict(fit, newdata = data.frame(wt = 3.0))
quantile(y_pred, c(0.025, 0.5, 0.975))
The key advantage: you get uncertainty in predictions automatically, including both parameter uncertainty and observation-level variation.
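To see those two sources of uncertainty separately, recent rstanarm versions provide posterior_epred() for draws of the expected value (parameter uncertainty only), while posterior_predict() adds observation-level noise on top. A sketch, assuming the `fit` object from earlier:

```r
new_car <- data.frame(wt = 3.0)

mu <- posterior_epred(fit, newdata = new_car)    # draws of the *mean* mpg at wt = 3
yp <- posterior_predict(fit, newdata = new_car)  # draws of a *new observation*

c(sd(mu), sd(yp))  # predictive draws are more spread out than mean draws
```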
Why Use Bayesian Methods
Bayesian approaches shine when:
- You have meaningful prior information to incorporate
- You need interpretable probability statements
- Working with hierarchical/nested data structures
- Small sample sizes where priors add stability
- You want to propagate uncertainty naturally through models
The computational cost is higher than maximum likelihood — Markov Chain Monte Carlo sampling takes time — but modern computers make this increasingly practical. The rstanarm package handles most of the computational complexity behind the scenes, letting you focus on model specification and interpretation rather than algorithm details.
One practical consideration: Bayesian models can take several minutes to fit, especially with many iterations or complex models. Plan accordingly when working with large datasets or hierarchical models.
Common Pitfalls
Watch out for these common issues when starting with Bayesian modeling:
- Uninformative priors that are too wide can lead to numerical instability
- Not running enough iterations can result in noisy, unreliable estimates — always check convergence
- Interpreting credible intervals incorrectly — they represent probability of the parameter, not repeated sampling
- Forgetting that priors influence results — with small data, your priors matter a lot
See Also
- lm() — classic OLS fitting for comparison
- predict() — making predictions with fitted models
- Linear Regression in R — classic regression techniques