Posterior Predictive Checks for Model Validation in R
Posterior predictive checks (PPCs) are a fundamental tool in Bayesian analysis. They let you ask: “If I repeatedly drew new data from my model, would it look like the data I actually observed?” If the answer is yes, your model is consistent with the data. If no, something is wrong.
This tutorial teaches you how to perform meaningful posterior predictive checks using brms and bayesplot, with practical examples you can apply to your own models.
What you’ll learn
This tutorial covers the key concepts and practical techniques for working with Posterior Predictive Checks. By the end, you will know how to apply the core functions in real data analysis workflows.
The intuition behind pPCs
The core idea is straightforward. After fitting a Bayesian model, you have a full posterior distribution—not just point estimates. Each draw from this posterior represents a plausible set of parameters that could have generated your data.
A posterior predictive check works like this:
- Draw a parameter vector from the posterior
- Use those parameters to simulate new data
- Compare the simulated data to your actual observations
If the real data looks like a typical outcome from the model, you’re in good shape. If the real data looks unusual compared to what the model typically produces, your model is missing something important.
Setting up your environment
# Required packages
install.packages(c("brms", "bayesplot", "ggplot2", "tidybayes"))
library(brms)
library(bayesplot)
library(ggplot2)
library(tidybayes)
For this tutorial, we use the built-in mtcars dataset to fit a model predicting fuel efficiency. The brm() function uses the same formula syntax as lm() or glm() — here we predict miles per gallon from weight, number of cylinders, and horsepower. The family = gaussian() argument specifies a normal error distribution, which is the default for continuous unbounded outcomes like MPG. Setting refresh = 0 suppresses the per-iteration progress output to keep the tutorial output clean while the sampler runs four independent Markov chains to diagnose convergence.
# Fit a simple model for demonstration
model <- brm(
mpg ~ wt + cyl + hp,
data = mtcars,
family = gaussian(),
chains = 4,
seed = 123,
refresh = 0
)
Basic visual checks with pp_check
The pp_check function
bayesplot provides the pp_check function for quick visual checks. Once your model has finished sampling, pp_check() generates simulated datasets by drawing parameter values from the posterior and using them to produce new y-values. The function overlays a density curve of your observed outcome against densities from many simulated replicates. A well-fitting model produces replicate densities that envelop the observed density — if the observed curve sits far outside the simulated range, the model’s distributional assumptions are likely wrong and need revisiting.
# Most basic PPC: compare observed y to simulated y's
pp_check(model)
This plots your actual data (one dot per observation) overlaid on a distribution of simulated datasets from the posterior. If your model fits well, the observed data should fall in the middle of the simulated distribution.
Types of PPC plots
bayesplot offers several PPC plot styles:
# Histogram overlay
pp_check(model, type = "hist", n = 100)
# Scatter: observed vs predicted mean
pp_check(model, type = "scatter")
# Scatter: observed vs predicted with intervals
pp_check(model, type = "scatter_avg")
# Rootogram (counts)
pp_check(model, type = "rootogram")
Interpreting the plots
The key question is: do your observations look plausible under the model?
- Good fit: Observations fall in the bulk of the simulated distribution
- Poor fit: Observations cluster in the tails or outside the simulated range
For example, if you see extreme outliers in your data that the model never reproduces, that’s a clear signal your model needs improvement.
Test statistics as pPCs
Beyond visual checks, you can use test statistics to quantify model fit.
Defining test statistics
A test statistic (or discrepancy measure) summarizes some aspect of your data. Common choices include:
# Example test statistics
mean(mtcars$mpg) # Sample mean
sd(mtcars$mpg) # Sample standard deviation
max(mtcars$mpg) # Maximum value
sum(mtcars$mpg < 15) # Count of low mpg cars
Comparing observed to simulated
Once you’ve chosen a test statistic that captures the aspect of your data you care about, the next step is computing that same statistic on each simulated dataset and checking where the observed value falls within the simulated distribution. If the observed statistic sits near the center of the simulated values, your model reproduces that aspect of the data well. Values in the extreme tails signal that the model systematically misrepresents that structural feature, which may require adding predictors, changing the likelihood, or restructuring the linear predictor.
# Get posterior predictive samples
yrep <- predict(model)
# Calculate test statistic for each simulated dataset
test_stat <- apply(yrep, 1, mean) # Mean for each simulated dataset
# Observed test statistic
obs_stat <- mean(mtcars$mpg)
# Visualize
ggplot() +
geom_histogram(aes(test_stat), bins = 30, fill = "steelblue", alpha = 0.7) +
geom_vline(aes(xintercept = obs_stat), color = "red", linewidth = 2) +
labs(
title = "Posterior Predictive Check: Sample Mean",
x = "Simulated Mean MPG",
y = "Frequency"
)
The red line shows where your observed data falls. If it’s in the tails of the distribution, your model systematically mispredicts the mean.
Formal test statistics
For a more formal approach, calculate the proportion of simulated datasets with test statistics more extreme than observed:
# Bayesian p-value
p_value <- mean(test_stat >= obs_stat)
# Two-sided p-value
p_value_two_sided <- mean(abs(test_stat - mean(test_stat)) >=
abs(obs_stat - mean(test_stat)))
# Ideal: p-value around 0.5
# Suspicious: p-value near 0 or 1
cat("One-sided p-value:", p_value, "\n")
cat("Two-sided p-value:", p_value_two_sided, "\n")
A p-value near 0.5 suggests good fit. Extreme p-values (near 0 or 1) indicate problems.
Categorical and count data
Binary outcomes
Working with binary outcomes requires a different set of discrepancy measures than continuous data. Instead of comparing means or standard deviations, you check whether the model captures the proportion of successes — for example, the fraction of cars achieving more than 20 MPG. If your model systematically predicts too many or too few successes relative to the observed rate, it indicates the linear predictor or link function needs adjustment. This check is particularly important for logistic and probit regressions where calibration directly affects downstream probability estimates.
# Example: Proportion of "successes"
prop_success <- function(y) mean(y == 1)
# Apply to posterior predictions
yrep_binary <- predict(model, transform = plogis) > 0.5
test_props <- apply(yrep_binary, 1, prop_success)
obs_prop <- prop_success(mtcars$mpg > 20)
ggplot() +
geom_histogram(aes(test_props), bins = 30, fill = "darkgreen", alpha = 0.7) +
geom_vline(aes(xintercept = obs_prop), color = "red", linewidth = 2)
Count data
Moving from binary proportions to count-valued outcomes introduces additional considerations about distribution shape. Count data can be overdispersed — where the variance exceeds the mean — or zero-inflated, and your PPCs should specifically test whether the model captures these patterns. The bar chart overlay compares the frequency of each count value in your observed data against the distribution of counts across many posterior predictive replicates, making it easy to spot inflated zero counts or missing tails.
# For count outcomes, check distribution shape
pp_check(model, type = "bar")
pp_check(model, type = "freqpoly")
Group-Level pPCs
When your model includes grouping structures — like schools within districts or repeated measures within subjects — checking the overall distribution isn’t enough. A model might reproduce overall marginal patterns perfectly while failing for specific subgroups. Running pp_check() with the group argument partitions predictions by group, revealing whether certain clusters are systematically over- or under-predicted. This is especially valuable for random-intercept models where shrinkage can mask poor fit within individual groups.
# If you have a hierarchical model
# Check predictions for specific groups
pp_check(model, group = "group_name", type = "scatter")
# Or overlay observations for specific groups
pp_check(model, group = "school_id", type = "intervals")
Predictive calibration
Calibration plots
Calibration checks whether predicted probabilities match observed frequencies. Beyond checking distributional fit, you need to verify that your model’s uncertainty estimates are honest — a calibrated model should have prediction intervals that cover observed values at the advertised rate. The type = "calibration" option in pp_check() bins predicted probabilities and plots the observed proportion for each bin, so systematic deviations from the diagonal reveal where your model is overconfident or underconfident.
# For probability predictions
pp_check(model, type = "calibration")
This shows how well your model’s uncertainty matches the true outcome rates.
Interval coverage
Check whether your uncertainty intervals contain the truth at the expected rate. While calibration plots give a qualitative view, computing interval coverage provides a numeric benchmark. By constructing prediction intervals at multiple confidence levels — say 50%, 80%, and 95% — and counting how often observed values land inside them, you directly quantify calibration. A well-calibrated model should have a 95% interval that covers roughly 95% of observations, and systematic undercoverage means your predictions are overconfident.
# Get prediction intervals
yrep_pred <- predict(model, interval = "prediction")
# Check 50%, 80%, 95% interval coverage
coverage <- function(yrep, y, prob) {
lower <- apply(yrep, 2, quantile, (1 - prob) / 2)
upper <- apply(yrep, 2, quantile, (1 + prob) / 2)
mean(y >= lower & y <= upper)
}
coverage_50 <- coverage(yrep_pred, mtcars$mpg, 0.50)
coverage_80 <- coverage(yrep_pred, mtcars$mpg, 0.80)
coverage_95 <- coverage(yrep_pred, mtcars$mpg, 0.95)
cat("50% interval covers:", coverage_50, "of observations\n")
cat("80% interval covers:", coverage_80, "of observations\n")
cat("95% interval covers:", coverage_95, "of observations\n")
If your 95% intervals only cover 80% of observations, your uncertainty is underestimated.
Custom pPCs with tidybayes
While pp_check() and bayesplot provide convenient canned visualizations, the tidybayes package gives you direct access to posterior predictive draws as tidy data frames. This allows you to build fully customized diagnostic plots — comparing density curves with ggplot2, faceting by predictor levels, or computing any bespoke test statistic you can express in R code. With add_predicted_draws(), each row of your original data gets repeated across every posterior draw, yielding a long-format dataset where you can slice, summarize, and visualize the full predictive distribution using standard tidyverse verbs.
# Generate predictions
predictions <- add_predicted_draws(mtcars, model)
# Custom PPC: distribution comparison
ggplot(predictions, aes(x = .prediction, group = .draw)) +
geom_density(alpha = 0.1, color = "gray") +
geom_density(aes(x = mpg), data = mtcars, color = "red", linewidth = 2) +
labs(title = "Posterior Predictive Distribution vs Observed")
Common patterns and what they mean
Right-Skewed residuals
If your observed data consistently exceeds predictions for high values, the first common failure mode is at play — right-skewed residuals occur when the outcome variable has a heavier tail than the normal distribution allows. A scatter plot of observed versus predicted values makes this visible as data points clustering above the diagonal at the upper end of the outcome range, suggesting a log-normal or Student-t likelihood would fit better than the default Gaussian.
pp_check(model, type = "scatter")
This suggests you need a different likelihood (perhaps log-normal) or a transformation.
Systematic bias
If all predictions are too low, you may be dealing with systematic directional bias — a shift of the entire predictive distribution relative to the observed data. Unlike skew-related problems that affect only the tails, this pattern pushes all predictions above or below observations regardless of predictor value. Common causes include omitting a key predictor with a strong effect, misspecifying the functional form, or failing to center predictor variables when using weakly informative priors on the intercept.
# Your model consistently underpredicts
# Consider adding predictors or changing the model family
Heteroscedasticity
If prediction accuracy varies with the predictor, non-constant error variance — heteroscedasticity — has likely crept into your model. In a residuals-versus-fitted plot, this manifests as a funnel or fan shape where the spread of residuals widens or narrows as predictions change. The standard Gaussian model assumes identical error variance for all observations, so when this assumption fails, you need either a variance-modeling approach via bf() formulas in brms or a distribution like Student-t that can accommodate varying scale.
# Residuals vary across the range of x
# Consider adding a variance model or using a reliable likelihood
Adding test statistics in brms
For integrated PPCs within brms, the loo package provides the LOO-PIT diagnostic — a complementary calibration check that works directly with the log-likelihood. For a well-calibrated model, the probability integral transform of each observation under its leave-one-out predictive distribution should follow a Uniform(0, 1) distribution. The loo package computes these values efficiently using Pareto-smoothed importance sampling, and plotting them against a uniform reference reveals systematic deviations — departures from uniformity indicate the predictive distribution is consistently too narrow or too wide.
library(loo)
# Calculate LOO-PIT (Leave-One-Out Probability Integral Transform)
# Good calibration: uniform distribution
loo_model <- loo(model)
plot(loo_model)
How PPCs work
Posterior predictive checks (PPCs) generate new datasets from the posterior distribution and compare them to the observed data. If the model captures the data-generating process well, the observed data should look like a typical draw from the posterior predictive distribution. Systematic differences indicate model misfit.
PPCs in brms
pp_check(fit) creates an overlay density plot: the observed outcome distribution (dark line) overlaid on multiple posterior predictive distributions (light lines). If the observed distribution is atypical relative to the simulated ones, the model is misspecified. pp_check(fit, type = "stat", stat = "mean") checks that the posterior predictive mean matches the observed mean. pp_check(fit, type = "scatter_avg") plots observed vs mean predicted values.
Specific checks
Choose PPC statistics relevant to your research question. For count data: check that zero-inflation is captured with pp_check(fit, type = "rootogram"). For binary outcomes: check calibration with calibration plots. For time series: check autocorrelation with pp_check(fit, type = "stat_grouped", stat = "mean", group = "time_period"). Generic PPCs catch gross misfit; targeted PPCs detect specific patterns.
Acting on PPC failures
When PPCs reveal misfit, consider: adding missing predictors, using a different likelihood family (e.g., negative binomial instead of Poisson for overdispersed counts), adding a random effect for overdispersion, or including a non-linear term. Iterate the model-check-revise cycle until PPCs are satisfactory. A model that fails PPCs for features relevant to your inference should not be used for those inferences, even if the parameter estimates look reasonable.
What PPCs reveal about model fit
Posterior predictive checks (PPCs) are the primary tool for assessing Bayesian model fit. The idea is simple: a good model should generate data that looks like the data you actually observed. Simulate data from the fitted model and compare to the observations, if they look different, the model has a systematic misfit.
PPCs are more informative than summary statistics like deviance or information criteria because they reveal the specific nature of misfit. A model might have good DIC but still fail to reproduce observed clustering, tails, or time trends. Visual PPCs make the failure mode immediately apparent.
The comparison is always against summary statistics of the data, not individual data points. Common comparisons: the marginal distribution (histogram or density), group means, quantiles, the range, autocorrelation structure, or any other feature relevant to the scientific question. A mismatch in the tails suggests the distributional assumption is wrong. A mismatch in the mean across groups suggests a missing predictor.
The brms workflow
After fitting a model with brms, posterior predictive checks are one function call away. The brms package integrates with the bayesplot package to produce a variety of check plots with minimal code.
When a check reveals misfit, the response is to consider what model change would address it. Heavy tails in the residuals suggest a Student-t likelihood rather than Gaussian. Bimodal simulated data that doesn’t match unimodal observed data might indicate missing subgroup effects. Zero-inflation in count data that the model doesn’t capture suggests a zero-inflated or hurdle model.
The iterative process, fit, check, modify, refit, is the core of Bayesian model building. Posterior predictive checks are the feedback mechanism that guides this iteration. Unlike frequentist goodness-of-fit tests, PPCs do not produce a single pass/fail; they produce a rich picture of where the model succeeds and where it falls short, enabling targeted improvements.
Common check types
Density overlay checks compare the density of observed data against a sample of posterior predictive densities. Each posterior predictive density is drawn from one posterior sample, so the overlay shows the range of data the model considers plausible. If the observed density is within this range, the model captures the marginal distribution.
Statistics checks compute a test statistic on the observed data and on many posterior predictive datasets and compare the distribution. A Bayesian p-value (the fraction of simulated datasets where the statistic exceeds the observed value) near 0.5 indicates the model explains the statistic well; values near 0 or 1 indicate systematic misfit.
Error-vs-predicted plots for regression models show residuals as a function of predicted values. Non-random patterns indicate model misfit: a curved pattern suggests a missing nonlinear term, a fan pattern indicates heteroscedasticity, systematic over- or under-prediction at specific ranges indicates the model is not capturing the full relationship.
Summary
Posterior predictive checks are essential for Bayesian model validation:
| Check Type | What It Tests | When to Use |
|---|---|---|
| pp_check visual | Overall distribution fit | Always start here |
| Test statistics | Specific aspects of data | When you care about particular patterns |
| Group-level | Hierarchical structure | For multilevel models |
| Calibration | Uncertainty accuracy | For prediction intervals |
| LOO-PIT | Calibration | For model comparison |
Next steps
After mastering PPCs:
- Explore leave-one-out cross-validation (LOO) for model comparison
- Learn about posterior-predictive simulation for model expansion
- Apply PPCs to your specific domain problems
See also
-
dplyr::filter() - Basic data filtering for understanding examples
-
Prior Selection in Bayesian Models — Choosing appropriate priors that lead to valid PPCs
-
Getting Started with brms — Fitting Bayesian models that you’ll validate with PPCs
-
Linear Regression in R — Understanding regression fundamentals before going Bayesian