Descriptive Statistics in R

Descriptive statistics summarize the main features of a dataset. They give you a quick overview of your data before you dive into more complex analyses. In this tutorial, you’ll learn how to calculate descriptive statistics in R using both base R functions and the tidyverse approach.

What you’ll learn

This tutorial covers the key concepts and practical techniques for computing descriptive statistics in R: central tendency, spread, quantiles, grouped summaries, and missing values. By the end, you will know how to apply the core functions in real data analysis workflows.

What are descriptive statistics?

Descriptive statistics include measures of central tendency (mean, median, mode) and measures of spread (variance, standard deviation, range, quartiles). These metrics help you understand where your data clusters and how much it varies.

Measures of central tendency

Mean

The mean is the arithmetic average of a dataset:

# Calculate mean
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)
mean(heights)
# [1] 172.4

Median

The median is the middle value when data is sorted:

median(heights)
# [1] 171

The median is more robust to outliers than the mean. If you have extreme values, the median often better represents the “typical” value.
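
To see this in practice, the sketch below appends one implausible value to the heights vector (the 250 is an invented outlier, not part of the original data) and compares how far each measure moves.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# Append a single extreme value (hypothetical data-entry error)
heights_out <- c(heights, 250)

# The mean shifts by about 7 units ...
mean(heights_out) - mean(heights)

# ... while the median moves by only 1
median(heights_out) - median(heights)
# [1] 1
```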

Mode

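R doesn’t have a built-in function for the statistical mode (the base mode() function returns an object's storage type instead). Here’s how to calculate it; if several values are tied, this version returns the first one it encounters: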

# Custom mode function
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Test it
colors <- c("red", "blue", "red", "green", "blue", "blue")
get_mode(colors)
# [1] "blue"
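
get_mode() returns a single value even when several values are tied for the highest count. As a minimal sketch, a hypothetical helper (get_modes() is made up for this example and is not a base R function) can return every tied value using table():

```r
# Hypothetical helper: return all values that share the highest count
get_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

get_modes(c("red", "blue", "red", "blue", "green"))
# [1] "blue" "red"
```

Because names() always returns text, numeric input comes back as character strings and may need as.numeric() afterwards.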

Measures of spread

Variance and standard deviation

Variance is the average squared deviation from the mean. var() divides by n - 1, which gives the sample variance:

# Variance
var(heights)
# [1] 50.48889

# Standard deviation
sd(heights)
# [1] 7.105553

The standard deviation is the square root of variance. It’s often easier to interpret because it’s in the same units as your data.
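
As a worked check, the sample variance can be computed by hand from its definition: the sum of squared deviations from the mean, divided by n - 1. This is only a sketch, and manual_var is an illustrative name.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# Sum of squared deviations from the mean, divided by n - 1
n <- length(heights)
manual_var <- sum((heights - mean(heights))^2) / (n - 1)

# Both comparisons should return TRUE
all.equal(manual_var, var(heights))
all.equal(sqrt(manual_var), sd(heights))
```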

Range

The range is the difference between the maximum and minimum:

max(heights) - min(heights)
# [1] 23

# Or use range() to get min and max
range(heights)
# [1] 162 185

Interquartile range (IQR)

The IQR is the difference between the 75th and 25th percentiles:

IQR(heights)
# [1] 9

Quartiles and percentiles

Using quantile()

The quantile() function calculates any percentile:

quantile(heights)
#     0%    25%    50%    75%   100%
# 162.00 168.25 171.00 177.25 185.00

# Specific quartiles
quantile(heights, probs = c(0.25, 0.5, 0.75))
#    25%    50%    75%
# 168.25 171.00 177.25
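
The probs argument accepts any values between 0 and 1, so the same call handles deciles or other cut points. The sketch below also checks that IQR() matches the gap between the 75th and 25th percentiles.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# 10th and 90th percentiles
quantile(heights, probs = c(0.1, 0.9))

# IQR() is the 75th percentile minus the 25th
q <- quantile(heights, probs = c(0.25, 0.75))
all.equal(unname(q[2] - q[1]), IQR(heights))
# [1] TRUE
```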

Using summary()

The summary() function gives you a quick overview:

summary(heights)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   162.0   168.2   171.0   172.4   177.2   185.0

Grouped statistics

Using tapply()

Calculate statistics by group:

# Create grouped data
gender <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
values <- c(100, 90, 95, 85, 88, 92, 98, 87, 91, 89)

tapply(values, gender, mean)
#    F    M
# 88.6 94.4
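
tapply() applies one function at a time. As a sketch of one base R way to get several statistics per group, split() the values by group and summarise each piece (by_group is just an illustrative name):

```r
gender <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
values <- c(100, 90, 95, 85, 88, 92, 98, 87, 91, 89)

# One column per group, one row per statistic
by_group <- sapply(split(values, gender), function(v) {
  c(mean = mean(v), median = median(v), n = length(v))
})
by_group
```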

Using dplyr

The tidyverse approach:

library(dplyr)

df <- data.frame(
  gender = rep(c("M", "F"), each = 5),
  score = c(100, 95, 88, 91, 98, 90, 85, 92, 87, 89)
)

df %>%
  group_by(gender) %>%
  summarise(
    mean = mean(score),
    median = median(score),
    sd = sd(score),
    n = n()
  )

Handling missing values

Many statistical functions have an na.rm parameter to handle missing values:

data_with_na <- c(1, 2, NA, 4, 5, NA, 7)

mean(data_with_na)
# [1] NA

mean(data_with_na, na.rm = TRUE)
# [1] 3.8

# Apply to all numeric columns at once (colMeans() fails on non-numeric columns such as gender)
colMeans(df[sapply(df, is.numeric)], na.rm = TRUE)

Common gotchas

NA handling

Always check for missing values before calculating statistics:

# Check for NAs
anyNA(heights)
# [1] FALSE

# Count NAs
sum(is.na(data_with_na))
# [1] 2

Outliers

Use multiple measures to understand your data:

# If data has outliers, compare mean vs median
# Large difference suggests outliers are affecting the mean

# Check for outliers
boxplot.stats(heights)$out
# Returns the values flagged as outliers (an empty vector if there are none)
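
boxplot.stats() flags points lying more than 1.5 times the hinge spread beyond the hinges. A sketch with one invented extreme value (250, added only for illustration) shows the difference between data with and without a flagged point:

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# Nothing is flagged in the original data
boxplot.stats(heights)$out
# numeric(0)

# Add a hypothetical extreme value and check again
boxplot.stats(c(heights, 250))$out
# [1] 250
```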

Summary

  • Use mean() and median() for central tendency
  • Use var() and sd() for spread
  • Use quantile() for percentiles
  • Use summary() for a quick overview
  • Remember na.rm = TRUE for missing data
  • Compare multiple measures to understand outliers

Next steps

Continue with Hypothesis Testing in R to learn how to make inferences from your data.

Practical examples

Descriptive stats on real data

R comes with built-in datasets you can practice on:
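
For instance, a minimal sketch with mtcars, one of the datasets that ships with R (the choice of dataset and columns here is only an illustration):

```r
# mtcars is part of the datasets package, which is loaded by default
data(mtcars)

# Overview of fuel efficiency (miles per gallon)
summary(mtcars$mpg)

# Spread of the same column
sd(mtcars$mpg)

# Mean mpg by number of cylinders
tapply(mtcars$mpg, mtcars$cyl, mean)
```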

Using fivenum()

For a quick five-number summary:
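
A minimal sketch, reusing the heights vector from earlier in the tutorial:

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

fivenum(heights)
# [1] 162 168 171 178 185
```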

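This returns the minimum, lower hinge, median, upper hinge, and maximum. The result is similar to quantile(), but the hinges are the medians of the lower and upper halves of the sorted data, so they can differ slightly from the quartiles that quantile() computes by default.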

Central tendency

Three measures of central tendency: mean() is the arithmetic average and is sensitive to outliers; median() is the middle value of the sorted data and is robust to outliers; the mode has no built-in function in R, but names(which.max(table(x))) returns the most frequent value (as a character string). For skewed distributions, the median is usually a more informative center than the mean. mean(x, trim = 0.1) computes a 10% trimmed mean, which drops the lowest and highest 10% of values before averaging; it is more robust than the plain mean, though less so than the median.
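
The sketch below puts these measures side by side on a small right-skewed vector (the incomes values are made up for illustration).

```r
# Hypothetical incomes in thousands, with one very large value
incomes <- c(28, 31, 31, 35, 38, 42, 45, 51, 60, 400)

mean(incomes)              # pulled upward by the 400
# [1] 76.1
median(incomes)            # middle of the sorted values
# [1] 40
mean(incomes, trim = 0.1)  # drops the lowest and highest 10% first
# [1] 41.625

# Most frequent value via table(); names() recovers the value as text
names(which.max(table(incomes)))
# [1] "31"
```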

Spread

var() computes the sample variance with an n - 1 denominator; sd() is its square root. IQR(x) returns the interquartile range (Q3 - Q1), a robust measure of spread that is largely unaffected by outliers. mad(x) returns the median absolute deviation, which is even more robust; by default it is multiplied by a constant (1.4826) so that it estimates the standard deviation for normally distributed data. range(x) gives the minimum and maximum.
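
As a sketch of how these measures react to one extreme value, compare them on a small vector with and without an invented outlier (the 250 is hypothetical). The constant = 1 call shows the raw median absolute deviation before R's default scaling.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)
heights_out <- c(heights, 250)  # hypothetical extreme value

# sd() inflates sharply; IQR() and mad() change much less
c(sd = sd(heights), iqr = IQR(heights), mad = mad(heights))
c(sd = sd(heights_out), iqr = IQR(heights_out), mad = mad(heights_out))

# Unscaled MAD: the median of absolute deviations from the median
mad(heights, constant = 1) == median(abs(heights - median(heights)))
# [1] TRUE
```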

Distribution shape

skewness() from the moments or e1071 package measures asymmetry. Positive skew means a long right tail; negative skew means a long left tail. kurtosis() measures tail heaviness. A normal distribution has kurtosis 3, which is what moments::kurtosis() reports; e1071::kurtosis() reports excess kurtosis, which is 0 for a normal distribution. shapiro.test(x) formally tests normality (Shapiro-Wilk). qqnorm(x); qqline(x) visually assesses normality with a Q-Q plot: points should fall close to the line for a normal distribution.
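
If neither package is installed, moment-based versions can be sketched in base R. The helper names below (skew_moment, kurt_moment) are made up for this example, and packages may use slightly different estimators, so results can differ in small samples.

```r
# Hypothetical helpers: moment-based skewness and (non-excess) kurtosis
skew_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurt_moment <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(42)                  # for reproducibility
right_skewed <- rexp(1000)    # exponential data has a long right tail
normal_draws <- rnorm(1000)   # simulated normal data

skew_moment(right_skewed)     # positive for right-skewed data
kurt_moment(normal_draws)     # close to 3 for normal data
```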

Summary statistics in depth

Descriptive statistics summarize the distribution of a variable. For numeric data, the key statistics fall into three groups: measures of central tendency (what is typical), measures of spread (how variable is the data), and measures of shape (skewness and kurtosis).

mean(x, na.rm = TRUE) and median(x, na.rm = TRUE) describe central tendency. The mean is sensitive to outliers; the median is robust. When they differ substantially, the distribution is skewed. Base R has no function for the statistical mode (mode() returns an object's storage type); use table(x) for discrete data or a kernel density estimate for continuous data.

var(x), sd(x), IQR(x), and range(x) describe spread. range() returns the min and max as a two-element vector; diff(range(x)) gives the range as a scalar. For a robust measure of spread, mad(x) (median absolute deviation) is less sensitive to outliers than sd().

skewness() and kurtosis() from the moments or e1071 packages describe shape. Positive skewness means a long right tail; negative means a long left tail. Kurtosis measures tail heaviness relative to a normal distribution.

Frequency tables and cross-tabulations

table(x) counts occurrences of each unique value. For ordered factors, the table follows level order. prop.table(table(x)) converts counts to proportions. addmargins(table(x, y)) adds row and column totals to a cross-tabulation.
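
A short sketch using the built-in mtcars dataset (chosen here only for illustration), tabulating cars by number of cylinders and gears:

```r
data(mtcars)

# Counts of each unique value
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Convert counts to proportions (they sum to 1)
prop.table(cyl_counts)

# Cross-tabulation with row and column totals
addmargins(table(cyl = mtcars$cyl, gear = mtcars$gear))
```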

janitor::tabyl(df, var1, var2) produces a cleaner cross-tabulation that works with janitor's adorn_*() helpers. adorn_percentages("row") converts counts to row-wise proportions, adorn_pct_formatting() formats them as percentages, and adorn_totals("row") adds a totals row. janitor::tabyl() integrates with the pipe workflow better than base R table().

For weighted frequency tables, questionr::wtd.table(x, weights = w) computes weighted counts. Survey data often comes with design weights, and when it does they should be applied to all summaries.

Grouped summaries with dplyr

The standard workflow for descriptive statistics on grouped data:

df %>%
  group_by(category) %>%
  summarise(
    n = n(),
    mean_val = mean(value, na.rm = TRUE),
    median_val = median(value, na.rm = TRUE),
    sd_val = sd(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  )

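across() applies multiple summary functions to multiple columns: summarise(across(where(is.numeric), list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)))). Extra arguments such as na.rm = TRUE are written inside each function because passing them through across()'s ... is deprecated as of dplyr 1.1.0. The result column names follow the colname_functionname convention.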

Missing data analysis

Descriptive statistics without examining missingness can be misleading. sum(is.na(x)) counts missing values. mean(is.na(x)) gives the fraction missing. colSums(is.na(df)) counts missing by column.
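
For example, with the built-in airquality dataset (used here only as a convenient illustration, since two of its columns contain missing values):

```r
data(airquality)

# Number of missing values in one column
sum(is.na(airquality$Ozone))
# [1] 37

# Fraction of missing values in that column
mean(is.na(airquality$Ozone))

# Missing values per column
colSums(is.na(airquality))
```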

visdat::vis_miss(df) visualizes the missing data pattern across all columns as a heatmap. naniar::miss_var_summary(df) produces a tidy table of missingness by variable. naniar::gg_miss_upset(df) shows which combinations of columns co-occur as missing, valuable for understanding whether missingness is random or structured.

mice::md.pattern(df) shows the pattern of missingness across variables. Such patterns can hint at whether data are MCAR (missing completely at random), MAR (missing at random, where missingness depends on observed variables), or MNAR (missing not at random, where missingness depends on unobserved values), though determining the mechanism requires domain knowledge, not just data.

Distribution testing and comparison

shapiro.test(x) tests for normality (null hypothesis: the data are normally distributed) and accepts between 3 and 5000 non-missing values. It is sensitive to sample size: for large samples, even minor non-normality leads to rejecting the null. For practical purposes, examine a Q-Q plot with qqnorm(x); qqline(x) and consider whether deviations matter for your intended analysis.
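
A minimal sketch on simulated data (the sample sizes and distributions are arbitrary choices for illustration). The test returns an object whose p.value element can be inspected directly:

```r
set.seed(1)  # for reproducibility

normal_sample <- rnorm(100)   # drawn from a normal distribution
skewed_sample <- rexp(100)    # drawn from an exponential distribution

# A small p-value is evidence against normality
shapiro.test(normal_sample)$p.value
shapiro.test(skewed_sample)$p.value
```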

ks.test(x, y) compares two empirical distributions (Kolmogorov-Smirnov test). t.test(x, y) tests whether two group means are equal. For non-normal data or ordinal outcomes, wilcox.test(x, y) (Mann-Whitney U) is the non-parametric alternative.
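
As a sketch, the built-in mtcars dataset can be split by transmission type (am: 0 = automatic, 1 = manual) to compare fuel efficiency between two groups. The dataset and the variable names auto and manual are illustrative choices.

```r
data(mtcars)

auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Welch two-sample t-test (the default does not assume equal variances)
t.test(auto, manual)$p.value

# Mann-Whitney U test; exact = FALSE avoids a warning caused by tied values
wilcox.test(auto, manual, exact = FALSE)$p.value

# Kolmogorov-Smirnov test comparing the two empirical distributions
# (tied values trigger a warning about the p-value, suppressed here)
suppressWarnings(ks.test(auto, manual))$p.value
```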

psych::describe(df) computes a comprehensive set of descriptive statistics (mean, SD, median, trimmed mean, MAD, min, max, skewness, kurtosis, and SE) for all numeric columns at once. It is quicker than assembling individual summarise() calls and useful for data quality checks.

Descriptive statistics as data understanding

Descriptive statistics are the first thing you do with new data. Before modeling, before visualization, before any analysis, you need to understand what is in the dataset: the range of values, the distribution shape, the presence and pattern of missing data, and any obvious anomalies. This understanding shapes every subsequent decision in the analysis.

The summary function in base R provides the minimum, maximum, quartiles, and mean for each numeric column, and counts of each level for factor columns. For a first look at a dataset, summary gives a broad overview in one call. For deeper understanding, distribution plots and grouped summaries reveal patterns that these summary numbers miss.

Understanding distributions

A mean and standard deviation fully describe a normal distribution but are misleading for skewed, bimodal, or heavy-tailed distributions. A distribution with two peaks has a mean between the peaks that is not representative of either group. A distribution with extreme outliers has a mean that is pulled toward the outliers. Visualizing the full distribution with a histogram, density plot, or violin plot is more informative than summary statistics alone.

Skewness measures asymmetry. Right-skewed (positive skew) distributions have a long right tail — income is a common example. Left-skewed distributions have a long left tail. Moderately skewed data can often be usefully analyzed with standard methods; severely skewed data often benefits from a log transformation before analysis. Checking skewness before modeling helps identify when transformations might improve model fit or the interpretation of results.
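
A rough sketch with simulated right-skewed data (log-normal draws standing in for something like income; the parameters are arbitrary) shows how a log transformation closes the gap between mean and median:

```r
set.seed(123)  # for reproducibility

# Simulated right-skewed values (hypothetical incomes)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# Before: the mean sits well above the median
mean(income) / median(income)

# After a log transformation the two are nearly equal
mean(log(income)) / median(log(income))
```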

Grouped summaries

Most substantive questions in data analysis involve comparing groups. Grouped summaries — means by treatment arm, proportions by demographic group, medians by product category — are the statistical expression of these comparisons. The dplyr group_by and summarize workflow makes grouped summaries readable. The result is a summary table that quantifies the differences between groups.

Reporting grouped summaries with measures of uncertainty — standard errors, confidence intervals, or interquartile ranges — is more informative than point estimates alone. A group mean difference that is large relative to its standard error is more noteworthy than one that is small relative to it. The size of the group, which determines the uncertainty, is as important as the point estimate for interpreting group comparisons.
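
As a minimal sketch of this idea in base R, the standard error of a group mean is sd / sqrt(n), and an approximate 95% confidence interval uses the t distribution. The data are the gender and values vectors from the tapply() example; the mean_ci helper is hypothetical.

```r
gender <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
values <- c(100, 90, 95, 85, 88, 92, 98, 87, 91, 89)

# Hypothetical helper: mean, standard error, and confidence interval for one group
mean_ci <- function(v, level = 0.95) {
  n  <- length(v)
  se <- sd(v) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(n = n, mean = mean(v), se = se,
    lower = mean(v) - t_crit * se,
    upper = mean(v) + t_crit * se)
}

# One row per group
group_summary <- t(sapply(split(values, gender), mean_ci))
group_summary
```

Because the standard error shrinks with the square root of n, the same mean difference is more convincing in larger groups.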