fct_lump
Introduction
The fct_lump_*() family collapses rare factor levels into an “Other” category. Use this when working with categorical variables that have many infrequent levels — such as survey responses, geographic regions, or data with long tails.
Four specialized functions handle different lumping strategies:
- fct_lump_n(): keep the n most (or least) frequent levels
- fct_lump_prop(): lump by proportion threshold
- fct_lump_min(): keep levels appearing at least min times
- fct_lump_lowfreq(): automatic threshold selection
All functions require the forcats package, part of the tidyverse ecosystem.
library(forcats)
# Input: factor with 9 levels, varying frequencies
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
table(x)
# x
# A B C D E F G H I
# 40 10 5 27 1 1 1 1 1
# Output: rare levels collapsed into "Other"
result <- fct_lump_n(x, n = 3)
table(result)
# result
# A B D Other
# 40 10 27 10
fct_lump_n()
Keeps the n most frequent factor levels. Use a positive n to preserve the top n levels, or a negative n to preserve the bottom n (least frequent) levels.
# Keep the 3 most frequent levels
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_n(x, n = 3)
# Levels: A B D Other
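The ties.method argument controls how levels with equal frequency are ranked, which matters when a tie straddles the cutoff at n. The accepted values are those of base R's rank():
- "min" (default): tied levels share the lowest rank, so all of them are kept and the result can have more than n levels
- "average": tied levels share the mean of their ranks
- "first": ties are broken in level order, so earlier levels are kept first
- "last": ties are broken in reverse level order, so later levels are kept first
- "random": ties are broken at random
- "max": tied levels share the highest rank, so a tie that straddles the cutoff is lumped entirely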
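To see what each option does at the cutoff, the ranking step can be reproduced with base R's rank() on negated counts. This is only a sketch of that step, assuming that levels whose rank is at most n are the ones kept; the counts vector and the name kept_by_method are made up for illustration.

```r
# Hypothetical level counts; B and C are tied at 3
counts <- c(A = 5, B = 3, C = 3, D = 1, E = 1)
n <- 2

# Negate the counts so the most frequent level gets rank 1,
# then keep the levels whose rank is at most n
kept_by_method <- lapply(
  c(min = "min", first = "first", last = "last", max = "max"),
  function(method) names(counts)[rank(-counts, ties.method = method) <= n]
)

kept_by_method$min    # "A" "B" "C"  (the tie is kept whole)
kept_by_method$first  # "A" "B"
kept_by_method$last   # "A" "C"
kept_by_method$max    # "A"          (the tie is lumped whole)
```

With "min" the result has three kept levels even though n is 2, which is why the default is described as giving at least n levels.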
To preserve the least frequent levels instead:
# Preserve the least frequent levels (all tied at count 1)
fct_lump_n(x, n = -1)
# Levels: E F G H I Other
With n = -1 and ties.method = "min", all levels tied for least frequency are preserved. Since E, F, G, H, and I all have count 1, all five are kept.
fct_lump_prop()
Lumps levels whose share of the observations is prop or less; only levels with a share strictly greater than prop are kept. The prop argument is a fraction (0.10 = 10%).
# Lump levels appearing in at most 10% of observations
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
# Total observations: 87, so 10% = 8.7
fct_lump_prop(x, prop = 0.10)
# Levels: A B D Other
# A: 40 (46%), B: 10 (11%), D: 27 (31%), Other: 10 (11%)
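The threshold comparison can be checked by hand with base R. The sketch below recomputes each level's share with prop.table() and lists the levels whose share exceeds 0.10. It mirrors the rule described above rather than calling forcats, and the names shares and kept are illustrative.

```r
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of observations per level
shares <- prop.table(table(x))
round(shares[c("A", "B", "C", "D")], 3)

# Levels strictly above the 10% threshold are the ones kept
kept <- names(shares)[as.vector(shares) > 0.10]
kept  # "A" "B" "D"
```

B survives because 10 of 87 observations is about 11.5%, just above the threshold, while C at about 5.7% does not.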
A negative prop reverses the direction: levels appearing in at most -prop of observations are kept, and the more frequent levels are lumped:
# Keep levels appearing in at most 5% of observations
fct_lump_prop(x, prop = -0.05)
# Levels: E F G H I Other
# E, F, G, H, I: 1 each (1%), Other: 82 (94%)
fct_lump_min()
Preserves levels that appear at least min times. Everything below the threshold goes to “Other”.
# Keep levels appearing at least 5 times
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_min(x, min = 5)
# Levels: A B C D Other
Useful when levels need a minimum sample size for statistical validity.
fct_lump_lowfreq()
Lumps the least frequent levels together, choosing the cutoff automatically so that “Other” remains the smallest level.
# Automatic threshold selection
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_lowfreq(x)
# Levels: A D Other
# A: 40, D: 27, Other: 20
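One way to picture the automatic cutoff is to walk through the levels from most to least frequent and stop as soon as the level just kept is larger than everything still remaining combined. The sketch below illustrates that idea in base R. It is not the forcats source, and the helper name lowfreq_keep and the counts are hypothetical.

```r
# Hypothetical helper: returns the names of the levels that would be kept
lowfreq_keep <- function(counts) {
  sorted <- sort(counts, decreasing = TRUE)
  remaining <- sum(sorted)
  for (i in seq_along(sorted)) {
    # Total count of all levels after position i
    remaining <- remaining - sorted[[i]]
    # Stop once the lumped remainder would be smaller than this level
    if (sorted[[i]] > remaining) {
      return(names(sorted)[seq_len(i)])
    }
  }
  names(sorted)
}

counts <- c(a = 50, b = 30, c = 10, d = 5, e = 5)
kept <- lowfreq_keep(counts)
kept                                    # "a" "b"
sum(counts[!names(counts) %in% kept])   # 20, smaller than both kept levels
```

Stopping after a alone is not possible here because the remainder (50) would not be smaller than a itself.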
Weighted Frequency
The w argument lets you weight observations differently when calculating frequencies:
# Weights let some observations count for more than others
x <- factor(c("A", "A", "B", "B", "B", "C"))
w <- c(5, 5, 1, 1, 1, 1)
# Without weights, B is the most frequent level (3 observations)
fct_lump_n(x, n = 1)
# Levels: B Other
# With weights, A has the largest total (5 + 5 = 10, versus 3 for B and 1 for C)
fct_lump_n(x, n = 1, w = w)
# Levels: A Other
Weights must match the length of the input factor.
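Weighted frequencies can be inspected before lumping. Here is a minimal sketch with base R's tapply(), assuming the weighted frequency of a level is the sum of the weights of its observations; f and wts are made-up data.

```r
f <- factor(c("A", "A", "B", "B", "B", "C"))
wts <- c(5, 5, 1, 1, 1, 1)

# Weights must be the same length as the factor
stopifnot(length(wts) == length(f))

# Unweighted frequency: number of observations per level
table(f)

# Weighted frequency: sum of weights per level
weighted <- tapply(wts, f, sum)
weighted  # A = 10, B = 3, C = 1
```

Comparing the two tables shows at a glance whether weighting changes the order of the levels, and therefore which ones a lumping call would keep.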
Arguments Reference
| Argument | Type | Description |
|---|---|---|
| f | factor or character | Input factor (character is silently coerced) |
| n | integer | For fct_lump_n(): positive keeps top n, negative keeps bottom n |
| prop | numeric | For fct_lump_prop(): proportion threshold |
| min | integer | For fct_lump_min(): minimum frequency to preserve |
| w | numeric (optional) | Weights for frequency calculation; must match length of f |
| other_level | string | Label for the lumped category; default: "Other" |
| ties.method | string | For fct_lump_n() only; options: "min", "average", "first", "last", "random", "max" |
Common Gotchas
Negative n inverts behavior. A negative n preserves the least frequent levels, not the most.
Character vectors are silently coerced to factors. No warning is given.
Weights must match input length exactly. Mismatched lengths cause an error.
ties.method only applies to fct_lump_n(), not to fct_lump_prop() or fct_lump_min().
The “Other” level is always placed last in the levels vector.
See Also
- factor(): base R function for creating factors
- dplyr::count(): count observations by group
- table(): base R function for cross-tabulation