Descriptive Statistics in R
Descriptive statistics summarize the main features of a dataset. They give you a quick overview of your data before you dive into more complex analyses. In this tutorial, you’ll learn how to calculate descriptive statistics in R using both base R functions and the tidyverse approach.
What Are Descriptive Statistics?
Descriptive statistics include measures of central tendency (mean, median, mode) and measures of spread (variance, standard deviation, range, quartiles). These metrics help you understand where your data clusters and how much it varies.
Measures of Central Tendency
Mean
The mean is the arithmetic average of a dataset:
# Calculate mean
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)
mean(heights)
# [1] 172.4
Median
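The median is the middle value of the sorted data. With an even number of observations, as here, it is the average of the two middle values (170 and 172):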
median(heights)
# [1] 171
The median is more robust to outliers than the mean. If you have extreme values, the median often better represents the “typical” value.
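To make the robustness claim concrete, here is a small sketch. It reuses the heights vector and appends one hypothetical data-entry error (1850 instead of 185); the extra value is an assumption made purely for illustration.

```r
# Original heights from the example above
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# Hypothetical typo: 1850 entered instead of 185
heights_err <- c(heights, 1850)

mean(heights)        # 172.4
mean(heights_err)    # about 324.9, pulled far upward by one value
median(heights)      # 171
median(heights_err)  # 172, barely moves
```

A single bad value nearly doubles the mean, while the median shifts by one unit.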
Mode
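R doesn’t have a built-in function for the statistical mode (the built-in mode() returns an object's storage type, not its most frequent value). Here’s how to calculate it. Note that this version returns only the first value when several are tied: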
# Custom mode function
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Test it
colors <- c("red", "blue", "red", "green", "blue", "blue")
get_mode(colors)
# [1] "blue"
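When ties matter, one possible variant returns every value that reaches the highest count. The helper name get_modes is hypothetical, and because it relies on table(), it returns the values as character strings in sorted order.

```r
# Hypothetical helper: return all values tied for the highest frequency
get_modes <- function(x) {
  counts <- table(x)                       # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]
}

get_modes(c("red", "blue", "red", "blue", "green"))
# [1] "blue" "red"
```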
Measures of Spread
Variance and Standard Deviation
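Variance measures how far the values are spread around the mean: it is the average squared deviation from the mean. R's var() computes the sample variance, which divides by n - 1:
# Variance
var(heights)
# [1] 50.48889
# Standard deviation
sd(heights)
# [1] 7.105553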
The standard deviation is the square root of variance. It’s often easier to interpret because it’s in the same units as your data.
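As a check on the definitions, the following sketch computes both quantities by hand and compares them with the built-in functions. The variable names are chosen for illustration only.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

n <- length(heights)
deviations <- heights - mean(heights)      # distance of each value from the mean

# Sample variance: sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (n - 1)
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(heights))  # TRUE
all.equal(manual_sd, sd(heights))    # TRUE
```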
Range
The range is the difference between the maximum and minimum:
max(heights) - min(heights)
# [1] 23
# Or use range() to get min and max
range(heights)
# [1] 162 185
Interquartile Range (IQR)
The IQR is the difference between the 75th and 25th percentiles, so it covers the middle half of the data:
IQR(heights)
# [1] 9
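The same number can be derived from the two quartiles directly. This is a minimal sketch using quantile() with its default settings, which is what IQR() uses by default as well.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# 25th and 75th percentiles
q <- quantile(heights, probs = c(0.25, 0.75))

# Difference between them, with the percentile labels dropped
manual_iqr <- unname(q[2] - q[1])

all.equal(manual_iqr, IQR(heights))  # TRUE
```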
Quartiles and Percentiles
Using quantile()
The quantile() function calculates any percentile:
quantile(heights)
#     0%    25%    50%    75%   100%
# 162.00 168.25 171.00 177.25 185.00
# Specific quartiles
quantile(heights, probs = c(0.25, 0.5, 0.75))
#    25%    50%    75%
# 168.25 171.00 177.25
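Since probs accepts any values between 0 and 1, other percentiles work the same way. The sketch below also shows the type argument, which selects among the nine interpolation rules that quantile() supports (the default is type 7); the choice of type 6 here is only an example.

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

# 10th and 90th percentiles with the default rule (type = 7)
quantile(heights, probs = c(0.1, 0.9))
#   10%   90%
# 164.7 180.5

# Same 25th percentile under a different interpolation rule
q1_default <- quantile(heights, probs = 0.25)
q1_type6 <- quantile(heights, probs = 0.25, type = 6)
```

Different rules can give slightly different answers on small samples, which is worth knowing when comparing results with other software.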
Using summary()
The summary() function gives you a quick overview:
summary(heights)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   162.0   168.2   171.0   172.4   177.2   185.0
Grouped Statistics
Using tapply()
Calculate statistics by group:
# Create grouped data
gender <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
values <- c(100, 90, 95, 85, 88, 92, 98, 87, 91, 89)
tapply(values, gender, mean)
#    F    M
# 88.6 94.4
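tapply() applies one function at a time. To get several statistics per group in base R, one option is to split the vector and apply a small summary function to each piece. This is a sketch of that idea, not the only way to do it.

```r
gender <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
values <- c(100, 90, 95, 85, 88, 92, 98, 87, 91, 89)

# split() returns one vector per group; sapply() summarises each one
group_stats <- sapply(split(values, gender), function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), n = length(v))
})

group_stats  # matrix with one column per group (F, M) and one row per statistic
```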
Using dplyr
The tidyverse approach:
library(dplyr)
df <- data.frame(
  gender = rep(c("M", "F"), each = 5),
  score = c(100, 95, 88, 91, 98, 90, 85, 92, 87, 89)
)
df %>%
  group_by(gender) %>%
  summarise(
    mean = mean(score),
    median = median(score),
    sd = sd(score),
    n = n()
  )
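If dplyr is not available, the group means from the pipeline above can be approximated in base R with aggregate(). The sketch below rebuilds the same data frame so that it runs on its own; it computes only the mean, not the full set of statistics.

```r
df <- data.frame(
  gender = rep(c("M", "F"), each = 5),
  score = c(100, 95, 88, 91, 98, 90, 85, 92, 87, 89)
)

# Formula interface: summarise score separately for each gender
group_means <- aggregate(score ~ gender, data = df, FUN = mean)
group_means
```

The result is a data frame with one row per group, sorted by group label.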
Handling Missing Values
Many statistical functions have an na.rm parameter to handle missing values:
data_with_na <- c(1, 2, NA, 4, 5, NA, 7)
mean(data_with_na)
# [1] NA
mean(data_with_na, na.rm = TRUE)
# [1] 3.8
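# Column means for all numeric columns of a data frame
# (colMeans() stops with an error if any column is non-numeric, such as gender)
colMeans(df[sapply(df, is.numeric)], na.rm = TRUE)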
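The same na.rm argument appears in median(), sd(), and quantile(), among others. The short sketch below reuses data_with_na and also shows the alternative of dropping missing values up front.

```r
data_with_na <- c(1, 2, NA, 4, 5, NA, 7)

median(data_with_na, na.rm = TRUE)
# [1] 4
sd(data_with_na, na.rm = TRUE)

# quantile() stops with an error on NA input unless na.rm = TRUE
quantile(data_with_na, na.rm = TRUE)

# Alternative: remove the missing values once, then summarise
clean <- data_with_na[!is.na(data_with_na)]
length(clean)
# [1] 5
```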
Common Gotchas
NA Handling
Always check for missing values before calculating statistics:
# Check for NAs
anyNA(heights)
# [1] FALSE
# Count NAs
sum(is.na(data_with_na))
# [1] 2
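For a data frame, the same check can be run per column. The survey data frame below is hypothetical and exists only to illustrate the pattern.

```r
# Hypothetical data frame with missing values in both columns
survey <- data.frame(
  age = c(23, NA, 31, 45),
  income = c(NA, 52000, NA, 61000)
)

# Number of missing values in each column
colSums(is.na(survey))
#    age income
#      1      2

# Proportion of missing values in each column
colMeans(is.na(survey))
```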
Outliers
Use multiple measures to understand your data:
# If data has outliers, compare mean vs median
# Large difference suggests outliers are affecting the mean
# Check for outliers
boxplot.stats(heights)$out
# numeric(0)
# An empty result means no values were flagged as outliers
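A common rule of thumb flags values more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule by hand to the heights plus one hypothetical extreme value (250) and compares the result with boxplot.stats(). Note that boxplot.stats() works from hinges rather than quantile(), so the two approaches can disagree on other data.

```r
# Heights plus one hypothetical extreme value
heights_err <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169, 250)

q <- unname(quantile(heights_err, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])               # 1.5 times the IQR

# Values outside [Q1 - fence, Q3 + fence]
flagged <- heights_err[heights_err < q[1] - fence | heights_err > q[2] + fence]
flagged
# [1] 250

boxplot.stats(heights_err)$out
# [1] 250
```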
Summary
- Use mean() and median() for central tendency
- Use var() and sd() for spread
- Use quantile() for percentiles
- Use summary() for a quick overview
- Remember na.rm = TRUE for missing data
- Compare multiple measures to understand outliers
Next Steps
Continue with Hypothesis Testing in R to learn how to make inferences from your data.
Practical Examples
Descriptive Stats on Real Data
R comes with built-in datasets you can practice on:
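The original example is not shown here, so the following is only a sketch of what such practice might look like. It assumes the mtcars dataset, which ships with base R, and applies the functions covered earlier to its mpg column.

```r
# mtcars is one of the datasets bundled with R (32 cars, 11 variables)
data(mtcars)

summary(mtcars$mpg)   # five-number summary plus the mean
sd(mtcars$mpg)        # spread of fuel economy

# Mean mpg by number of cylinders
by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
by_cyl
```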
Using fivenum()
For a quick five-number summary:
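A minimal sketch, reusing the heights vector from earlier in this tutorial:

```r
heights <- c(165, 170, 175, 180, 168, 172, 178, 162, 185, 169)

five <- fivenum(heights)
five
# [1] 162 168 171 178 185
```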
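This returns: minimum, lower hinge, median, upper hinge, maximum. It’s similar to quantile(), but the hinges are the medians of the lower and upper halves of the sorted data, so they can differ slightly from the quartiles that quantile() computes by default with linear interpolation.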