Descriptive Statistics in R
Descriptive statistics summarize the main features of a dataset. They give you a quick overview of your data before you dive into more complex analyses. In this tutorial, you’ll learn how to calculate descriptive statistics in R using both base R functions and the tidyverse approach.
What you’ll learn
This tutorial covers the key concepts and practical techniques for computing descriptive statistics in R. By the end, you will know how to apply the core functions in real data analysis workflows.
What are descriptive statistics?
Descriptive statistics include measures of central tendency (mean, median, mode) and measures of spread (variance, standard deviation, range, quartiles). These metrics help you understand where your data clusters and how much it varies.
Measures of central tendency
Mean
The mean is the arithmetic average of a dataset:
# Calculate mean
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)
mean(heights)
# [1] 172.4
Median
The median is the middle value when data is sorted:
median(heights)
# [1] 171
The median is more robust to outliers than the mean. If you have extreme values, the median often better represents the “typical” value.
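To make this concrete, here is a small sketch that appends one implausible value to the heights vector and compares the two measures. The extra value is a hypothetical data-entry error, not part of the original example.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# Hypothetical data-entry error: 1720 typed instead of 172
heights_outlier <- c(heights, 1720)

mean(heights)            # 172.4
mean(heights_outlier)    # pulled far upward by the single extreme value
median(heights)          # 171
median(heights_outlier)  # 172
```

A single bad value moves the mean by more than 100 units, while the median shifts by only one.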
Mode
R doesn’t have a built-in function for the statistical mode (the base mode() function reports an object’s storage type instead). Here’s how to calculate it:
# Custom mode function: returns the most frequent value
# (if several values tie, the first one encountered wins)
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Test it
colors <- c("red", "blue", "red", "green", "blue", "blue")
get_mode(colors)
# [1] "blue"
Measures of spread
Variance and standard deviation
Variance measures how far the values lie from the mean, as the average squared deviation (var() divides by n - 1, giving the sample variance):
# Variance
var(heights)
# [1] 50.48889
# Standard deviation
sd(heights)
# [1] 7.105553
The standard deviation is the square root of variance. It’s often easier to interpret because it’s in the same units as your data.
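As a check on that relationship, the following sketch recomputes both quantities by hand. The names manual_var and manual_sd are illustrative only.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

n <- length(heights)
deviations <- heights - mean(heights)

# Sample variance: sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (n - 1)
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(heights))  # TRUE
all.equal(manual_sd, sd(heights))    # TRUE
```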
Range
The range is the difference between the maximum and minimum:
max(heights) - min(heights)
# [1] 23
# Or use range() to get min and max
range(heights)
# [1] 162 185
Interquartile range (IQR)
The IQR is the range between the 25th and 75th percentiles:
IQR(heights)
# [1] 9
Quartiles and percentiles
Using quantile()
The quantile() function calculates any percentile:
quantile(heights)
#     0%    25%    50%    75%   100%
# 162.00 168.25 171.00 177.25 185.00
# Specific quartiles
quantile(heights, probs = c(0.25, 0.5, 0.75))
#    25%    50%    75%
# 168.25 171.00 177.25
Using summary()
The summary() function gives you a quick overview:
summary(heights)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   162.0   168.2   171.0   172.4   177.2   185.0
Grouped statistics
Using tapply()
Calculate statistics by group:
# Create grouped data
gender <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
values <- c(100, 90, 95, 85, 88, 92, 98, 87, 91, 89)
tapply(values, gender, mean)
#    F    M
# 88.6 94.4
Using dplyr
The tidyverse approach:
library(dplyr)
df <- data.frame(
  gender = rep(c("M", "F"), each = 5),
  score = c(100, 95, 88, 91, 98, 90, 85, 92, 87, 89)
)
df %>%
  group_by(gender) %>%
  summarise(
    mean = mean(score),
    median = median(score),
    sd = sd(score),
    n = n()
  )
Handling missing values
Many statistical functions have an na.rm parameter to handle missing values:
data_with_na <- c(1, 2, NA, 4, 5, NA, 7)
mean(data_with_na)
# [1] NA
mean(data_with_na, na.rm = TRUE)
# [1] 3.8
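# Apply to all numeric columns at once
# (colMeans() fails if any column is non-numeric, such as gender)
colMeans(df[sapply(df, is.numeric)], na.rm = TRUE)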
Common gotchas
NA handling
Always check for missing values before calculating statistics:
# Check for NAs
anyNA(heights)
# [1] FALSE
# Count NAs
sum(is.na(data_with_na))
# [1] 2
Outliers
Use multiple measures to understand your data:
# If data has outliers, compare mean vs median
# Large difference suggests outliers are affecting the mean
# Check for outliers: values more than 1.5 x IQR beyond the hinges
boxplot.stats(heights)$out
# numeric(0)  (no outliers in this sample)
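To see what a flagged value looks like, here is a short sketch that adds one extreme measurement to the heights vector. The value 250 is a hypothetical addition for illustration.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# Hypothetical extra measurement that is far from the rest
heights_extra <- c(heights, 250)

boxplot.stats(heights)$out        # numeric(0): nothing flagged
boxplot.stats(heights_extra)$out  # 250
```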
Summary
- Use mean() and median() for central tendency
- Use var() and sd() for spread
- Use quantile() for percentiles
- Use summary() for a quick overview
- Remember na.rm = TRUE for missing data
- Compare multiple measures to understand outliers
Next steps
Continue with Hypothesis Testing in R to learn how to make inferences from your data.
Practical examples
Descriptive stats on real data
R comes with built-in datasets you can practice on:
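As one possible illustration, the functions covered above can be applied to the mtcars dataset. The choice of mtcars and its mpg column is an assumption here; any built-in dataset would do.

```r
# mtcars ships with R (datasets package); mpg is miles per gallon
data(mtcars)

mean(mtcars$mpg)
median(mtcars$mpg)
sd(mtcars$mpg)
summary(mtcars$mpg)

# Mean mpg by number of cylinders
tapply(mtcars$mpg, mtcars$cyl, mean)
```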
Using fivenum()
For a quick five-number summary:
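A minimal sketch, assuming the heights vector defined earlier in this tutorial, with the default quantile() quartiles shown for comparison:

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

fivenum(heights)
# [1] 162 168 171 178 185

# Default quartiles from quantile(), for comparison
quantile(heights, probs = c(0.25, 0.75))
#    25%    75%
# 168.25 177.25
```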
fivenum() returns: minimum, lower hinge, median, upper hinge, maximum. It’s similar to quantile() but computes the hinges with a different algorithm, so the results can differ slightly from the default quartiles.
Central tendency
Three measures of central tendency: mean() is the arithmetic average and is sensitive to outliers; median() is the middle value when sorted and is robust to outliers; the mode has no built-in function in R, but names(which.max(table(x))) returns the most frequent value (as a character string). For skewed distributions, the median is usually a more informative center than the mean. mean(x, trim = 0.1) computes a 10% trimmed mean, which drops the lowest and highest 10% of values before averaging; it is robust but less extreme than the median.
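The sketch below compares these measures on a small made-up sample with one extreme value. The vector x is hypothetical and chosen so the arithmetic is easy to follow.

```r
# Hypothetical sample with one extreme value
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 50)

mean(x)              # 9
mean(x, trim = 0.1)  # drops one value from each end: 4.75
median(x)            # 5

# Most frequent value via table(); names() recovers the value itself
names(which.max(table(x)))  # "5"
```

The trimmed mean lands between the ordinary mean and the median here, because it discards the extreme value but still averages the remaining observations.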
Spread
var() computes the sample variance with an n - 1 denominator; sd() is its square root. IQR(x) returns the interquartile range (Q3 - Q1), a robust measure of spread that is largely unaffected by outliers. mad(x) returns the median absolute deviation, which is even more robust; by default it is multiplied by the constant 1.4826 so that it estimates the standard deviation for normally distributed data. range(x) gives the minimum and maximum.
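A brief sketch of how these spread measures react to one extreme value. The sample x is hypothetical.

```r
# Hypothetical sample with one extreme value
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 50)

sd(x)    # inflated by the value 50
IQR(x)   # 2.5
mad(x)   # median absolute deviation, scaled by 1.4826 by default

# The same quantity computed by hand
1.4826 * median(abs(x - median(x)))

diff(range(x))  # range as a single number: 48
```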
Distribution shape
skewness() from the moments or e1071 package measures asymmetry. Positive skew has a long right tail; negative skew has a long left tail. kurtosis() measures tail heaviness: a normal distribution has kurtosis 3, which is what moments::kurtosis() reports, whereas e1071::kurtosis() returns excess kurtosis (about 0 for normal data). shapiro.test(x) formally tests normality (Shapiro-Wilk). qqnorm(x); qqline(x) visually assesses normality with a Q-Q plot: points should fall close to the line for a normal distribution.
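For readers without those packages installed, the underlying moment formulas can be sketched in base R. The helper names moment_skewness and moment_kurtosis are hypothetical, and package implementations may apply small-sample corrections, so their results can differ slightly.

```r
# Moment-based skewness: third central moment over variance^1.5
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based kurtosis: fourth central moment over variance^2
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(42)
right_skewed <- rexp(1000)  # exponential: long right tail
symmetric <- rnorm(1000)    # normal: roughly symmetric

moment_skewness(right_skewed)  # clearly positive
moment_skewness(symmetric)     # close to 0
moment_kurtosis(symmetric)     # close to 3
```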
Summary statistics in depth
Descriptive statistics summarize the distribution of a variable. For numeric data, the key statistics fall into three groups: measures of central tendency (what is typical), measures of spread (how variable is the data), and measures of shape (skewness and kurtosis).
mean(x, na.rm = TRUE) and median(x, na.rm = TRUE) describe central tendency. The mean is sensitive to outliers; the median is robust. When they differ substantially, the distribution is likely skewed or contains outliers. Base R’s mode() reports an object’s storage type, not the statistical mode; use table(x) for discrete data or a kernel density estimate for continuous data.
var(x), sd(x), IQR(x), and range(x) describe spread. range() returns the min and max as a two-element vector; diff(range(x)) gives the range as a scalar. As a robust measure of spread, mad(x) (median absolute deviation) is less sensitive to outliers than sd().
skewness() and kurtosis() from the moments or e1071 packages describe shape. Positive skewness means a long right tail; negative means a long left tail. Kurtosis measures tail heaviness relative to a normal distribution.
Frequency tables and cross-tabulations
table(x) counts occurrences of each unique value. For ordered factors, the table follows level order. prop.table(table(x)) converts counts to proportions. addmargins(table(x, y)) adds row and column totals to a cross-tabulation.
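A small sketch of these three functions on made-up survey responses. The group and answer vectors are hypothetical.

```r
# Hypothetical survey responses
group <- c("A", "A", "B", "B", "B", "A", "B", "A")
answer <- c("yes", "no", "yes", "yes", "no", "yes", "yes", "no")

counts <- table(answer)
counts              # no: 3, yes: 5
prop.table(counts)  # proportions that sum to 1

cross <- table(group, answer)
addmargins(cross)   # adds a "Sum" row and column
```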
janitor::tabyl(df, var1, var2) produces a cleaner cross-tabulation with built-in percentage formatting. adorn_percentages("row") adds row percentages; adorn_totals("row") adds totals. janitor::tabyl() integrates with the pipe workflow better than base R table().
For weighted frequency tables, questionr::wtd.table(x, weights = w) computes weighted counts. Survey data almost always comes with design weights that should be applied to all summaries.
Grouped summaries with dplyr
The standard workflow for descriptive statistics on grouped data:
df %>%
  group_by(category) %>%
  summarise(
    n = n(),
    mean_val = mean(value, na.rm = TRUE),
    median_val = median(value, na.rm = TRUE),
    sd_val = sd(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  )
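across() applies multiple summary functions to multiple columns: summarise(across(where(is.numeric), list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)))). Older code often passes na.rm = TRUE directly as an extra argument to across(), but that form is deprecated as of dplyr 1.1.0. The result column names follow the colname_functionname convention.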
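Here is a sketch of the across() pattern on a tiny made-up data frame, assuming dplyr 1.0 or later is installed. The data and column names are hypothetical, and the summary functions are wrapped in anonymous functions so that na.rm is set explicitly.

```r
library(dplyr)

# Hypothetical data frame for illustration
df <- data.frame(
  category = c("a", "a", "b", "b"),
  value = c(1, NA, 3, 5),
  weight = c(10, 20, 30, NA)
)

result <- df %>%
  group_by(category) %>%
  summarise(
    across(
      where(is.numeric),
      list(
        mean = function(x) mean(x, na.rm = TRUE),
        sd = function(x) sd(x, na.rm = TRUE)
      )
    )
  )

result
# Columns: category, value_mean, value_sd, weight_mean, weight_sd
```

Note that sd() returns NA for a group with only one non-missing value, which is worth checking for before reporting the table.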
Missing data analysis
Descriptive statistics without examining missingness can be misleading. sum(is.na(x)) counts missing values. mean(is.na(x)) gives the fraction missing. colSums(is.na(df)) counts missing by column.
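These three checks can be sketched on a small made-up data frame. The columns and values are hypothetical.

```r
# Hypothetical data frame with some missing values
df <- data.frame(
  age = c(23, NA, 31, 45, NA),
  income = c(50000, 62000, NA, 58000, 61000),
  city = c("Oslo", "Bergen", NA, "Oslo", "Bergen")
)

sum(is.na(df$age))   # 2 missing values in age
mean(is.na(df$age))  # 0.4, i.e. 40% missing
colSums(is.na(df))   # age 2, income 1, city 1
```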
visdat::vis_miss(df) visualizes the missing data pattern across all columns as a heatmap. naniar::miss_var_summary(df) produces a tidy table of missingness by variable. naniar::gg_miss_upset(df) shows which combinations of columns co-occur as missing, valuable for understanding whether missingness is random or structured.
mice::md.pattern(df) shows the pattern of missingness across variables. Such patterns help you reason about whether the data are MCAR (missing completely at random), MAR (missing at random, depends on observed variables), or MNAR (missing not at random, depends on unobserved variables), though determining the mechanism requires domain knowledge, not just data.
Distribution testing and comparison
shapiro.test(x) tests for normality (null hypothesis: the data is normally distributed). It is sensitive to sample size: for large samples, even minor non-normality rejects the null. For practical purposes, examine a Q-Q plot with qqnorm(x); qqline(x) and consider whether deviations matter for your intended analysis.
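The sample-size effect can be explored with a simulation sketch like the one below. The t distribution with 10 degrees of freedom is an arbitrary choice of a mildly non-normal shape, and the exact p-values depend on the seed.

```r
set.seed(123)
# Draws from a t distribution with 10 degrees of freedom: close to normal,
# but with slightly heavier tails (an arbitrary choice for illustration)
small_sample <- rt(50, df = 10)
large_sample <- rt(5000, df = 10)  # shapiro.test() accepts 3 to 5000 values

p_small <- shapiro.test(small_sample)$p.value
p_large <- shapiro.test(large_sample)$p.value  # usually much smaller

c(small = p_small, large = p_large)
```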
ks.test(x, y) compares two empirical distributions (Kolmogorov-Smirnov test). t.test(x, y) tests whether two group means are equal. For non-normal data or ordinal outcomes, wilcox.test(x, y) (Mann-Whitney U) is the non-parametric alternative.
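A minimal sketch of the three tests on simulated data. The group sizes, means, and standard deviations are arbitrary assumptions, and no particular p-values are implied.

```r
set.seed(1)
# Simulated groups; sample sizes and parameters are arbitrary
group_a <- rnorm(30, mean = 50, sd = 10)
group_b <- rnorm(30, mean = 55, sd = 10)

t_res <- t.test(group_a, group_b)       # Welch two-sample t-test by default
w_res <- wilcox.test(group_a, group_b)  # Mann-Whitney U / Wilcoxon rank-sum
k_res <- ks.test(group_a, group_b)      # two-sample Kolmogorov-Smirnov

c(t = t_res$p.value, wilcoxon = w_res$p.value, ks = k_res$p.value)
```

Each call returns an object of class "htest", so the p-value, test statistic, and other components can be extracted by name in the same way for all three.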
psych::describe(df) computes a comprehensive set of descriptive statistics (mean, SD, median, trimmed mean, MAD, min, max, skewness, kurtosis, SE) for all numeric columns at once. It is quicker than assembling individual summarise() calls and useful for data quality checks.
Descriptive statistics as data understanding
Descriptive statistics are the first thing you do with new data. Before modeling, before visualization, before any analysis, you need to understand what is in the dataset: the range of values, the distribution shape, the presence and pattern of missing data, and any obvious anomalies. This understanding shapes every subsequent decision in the analysis.
The summary function in base R provides minimum, maximum, quartiles, and mean for each numeric column, and counts of each level for factor columns. For a first look at a dataset, summary gives a broad overview in one call. For deeper understanding, distribution plots and grouped summaries reveal patterns that the five-number summary misses.
Understanding distributions
A mean and standard deviation fully describe a normal distribution but are misleading for skewed, bimodal, or heavy-tailed distributions. A distribution with two peaks has a mean between the peaks that is not representative of either group. A distribution with extreme outliers has a mean that is pulled toward the outliers. Visualizing the full distribution with a histogram, density plot, or violin plot is more informative than summary statistics alone.
Skewness measures asymmetry. Right-skewed (positive skew) distributions have a long right tail — income is a common example. Left-skewed distributions have a long left tail. Moderately skewed data can often be usefully analyzed with standard methods; severely skewed data often benefits from a log transformation before analysis. Checking skewness before modeling helps identify when transformations might improve model fit or the interpretation of results.
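The effect of a log transformation can be sketched with simulated data. The log-normal "income" values and their parameters are assumptions made for illustration.

```r
set.seed(7)
# Simulated right-skewed "income" values (log-normal, arbitrary parameters)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# A mean well above the median is a sign of right skew
c(mean = mean(income), median = median(income))

# After the log transform, mean and median sit close together
log_income <- log(income)
c(mean = mean(log_income), median = median(log_income))
```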
Grouped summaries
Most substantive questions in data analysis involve comparing groups. Grouped summaries — means by treatment arm, proportions by demographic group, medians by product category — are the statistical expression of these comparisons. The dplyr group_by and summarize workflow makes grouped summaries readable. The result is a summary table that quantifies the differences between groups.
Reporting grouped summaries with measures of uncertainty — standard errors, confidence intervals, or interquartile ranges — is more informative than point estimates alone. A group mean difference that is large relative to its standard error is more noteworthy than one that is small relative to it. The size of the group, which determines the uncertainty, is as important as the point estimate for interpreting group comparisons.
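One way to produce such a table in base R is sketched below. The scores data frame and the helper summarise_group are hypothetical, and the interval is a standard t-based 95% confidence interval for the mean.

```r
# Hypothetical scores for two groups of different sizes
scores <- data.frame(
  group = c(rep("control", 4), rep("treatment", 8)),
  score = c(70, 74, 68, 72, 75, 78, 74, 80, 77, 79, 76, 81)
)

# Group size, mean, standard error, and 95% confidence limits
summarise_group <- function(x) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                     # standard error of the mean
  half_width <- qt(0.975, df = n - 1) * se  # half-width of the 95% interval
  c(n = n, mean = mean(x), se = se,
    lower = mean(x) - half_width, upper = mean(x) + half_width)
}

res <- t(sapply(split(scores$score, scores$group), summarise_group))
res
```

The smaller control group gets a wider interval per unit of standard deviation, which is the role group size plays in interpreting the comparison.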