fct_lump
Introduction
The fct_lump_*() family collapses rare factor levels into an “Other” category. Use this when working with categorical variables that have many infrequent levels, such as survey responses, geographic regions, or data with long tails.
Four specialized functions handle different lumping strategies:
fct_lump_n(): keep the n most (or least) frequent levels
fct_lump_prop(): lump by proportion threshold
fct_lump_min(): keep levels appearing at least min times
fct_lump_lowfreq(): lump the least frequent levels while keeping “Other” the smallest level
All four functions come from the forcats package, part of the tidyverse. Collapsing rare levels reduces the number of distinct levels shown in plots and tables. The example below creates a factor with 9 levels at varying frequencies, then keeps the three most frequent levels and collapses the six rarest into “Other”:
library(forcats)
# Input: factor with 9 levels, varying frequencies
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
table(x)
# x
# A B C D E F G H I
# 40 10 5 27 1 1 1 1 1
# Output: rare levels collapsed into "Other"
result <- fct_lump_n(x, n = 3)
table(result)
# result
# A B D Other
# 40 10 27 10
fct_lump_n()
fct_lump_n() keeps the n most frequent factor levels and collapses the rest. A positive n selects the top levels by count, while a negative n selects the bottom (least frequent) levels. This is the most straightforward lumping function when you know exactly how many levels you want in the output:
# Keep the 3 most frequent levels
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_n(x, n = 3)
# Levels: A B D Other
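The ties.method argument controls what happens when multiple levels share the same frequency at the cutoff boundary. Its values are the tie-breaking rules of base R's rank(). With the default "min", all tied levels are included, so you may end up with more than n retained levels if there is a tie for the nth position:
"min" (default): tied levels all take the lowest rank, so every level tied at the cutoff is kept (at least n levels)
"average": tied levels take the mean of their ranks
"first": ties are broken by level order, earlier levels first
"last": ties are broken by level order, later levels first
"random": ties are broken at random
"max": tied levels all take the highest rank, so a tie that straddles the cutoff is lumped (at most n levels)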
To preserve the least frequent levels instead of the most frequent ones, use a negative n. This is useful when you want to highlight rare categories and group the common ones:
# Preserve the least frequent levels (all tied at count 1)
fct_lump_n(x, n = -1)
# Levels: E F G H I Other
With n = -1 and ties.method = "min", all levels tied for least frequency are preserved. Since E, F, G, H, and I all have count 1, all five are kept.
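The effect of ties.method at the cutoff can be checked on a small factor. The sketch below uses a hypothetical factor in which "b" and "c" are tied, and assumes the tie-breaking rules behave as in base R's rank(); the variable names are illustrative only.

```r
library(forcats)

# Hypothetical factor: "b" and "c" are tied at 3 observations each
x <- factor(rep(c("a", "b", "c", "d", "e"), times = c(5, 3, 3, 1, 1)))

# "min": both tied levels share rank 2, so both survive the n = 2 cutoff
keep_min <- levels(fct_lump_n(x, n = 2, ties.method = "min"))
keep_min
# "a" "b" "c" "Other"

# "first": the tie is broken by level order, so only "b" takes rank 2
keep_first <- levels(fct_lump_n(x, n = 2, ties.method = "first"))
keep_first
# "a" "b" "Other"

# "max": both tied levels take rank 3 and fall past the cutoff
keep_max <- levels(fct_lump_n(x, n = 2, ties.method = "max"))
keep_max
# "a" "Other"
```

Only "first", "last", and "random" guarantee that a tie is split; "min" and "max" treat the tied levels as a group, keeping or lumping them together.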
fct_lump_prop()
fct_lump_prop() lumps levels based on a proportional threshold rather than a fixed count. Levels that make up no more than prop of the observations are collapsed into “Other,” where prop is a fraction between 0 and 1. Because the threshold is relative, the same prop value adapts to datasets of different sizes:
# Lump levels that make up at most 10% of observations
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
# Total observations: 87, so 10% = 8.7
fct_lump_prop(x, prop = 0.10)
# Levels: A B D Other
# A: 40 (46%), B: 10 (11%), D: 27 (31%), Other: 10 (11%)
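To see why those three levels survive, it can help to compute each level's share by hand. The following is a base R sketch of the comparison, not forcats's own code; shares and kept are illustrative names.

```r
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of the 87 observations held by each level
counts <- as.vector(table(x))
shares <- counts / sum(counts)
names(shares) <- levels(x)

# Levels whose share is above the 10% threshold are the ones that stay
kept <- names(shares)[shares > 0.10]
kept
# "A" "B" "D"
```

C holds 5 of 87 observations (about 5.7%), so it falls below the threshold along with the five single-observation levels.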
A negative prop reverses the direction: levels that make up at most -prop of the observations are preserved, and the more common levels are lumped into “Other.” This mirrors a negative n in fct_lump_n() and is useful for isolating rare categories. For example, prop = -0.05 keeps every level present in 5% or fewer of the rows and lumps the rest:
# Preserve levels appearing in at most 5% of observations
fct_lump_prop(x, prop = -0.05)
# Levels: E F G H I Other
# E, F, G, H, I: 1 each (1%), Other: 82 (94%)
# C (5 of 87, about 5.7%) is above the threshold, so it is lumped with A, B, and D
fct_lump_min()
fct_lump_min() preserves levels with at least a minimum absolute count and lumps everything below that threshold. This is the most intuitive function when your criterion is a specific sample size rather than a proportion or rank, and it reads naturally when the domain rule is stated as “I need at least X observations per level for this analysis”:
# Keep levels appearing at least 5 times
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_min(x, min = 5)
# Levels: A B C D Other
This is useful when statistical validity requires a minimum sample size per level, for example before running ANOVA or chi-square tests, where small cell counts can make the results unreliable. Setting min to 5 or 10 is a common heuristic for excluding sparsely populated categories from formal modeling.
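A quick way to confirm that lumping produced usable cell sizes is to tabulate the result. This is a minimal sketch using the same example factor; lumped and cell_counts are illustrative names.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

lumped <- fct_lump_min(x, min = 5)
cell_counts <- table(lumped)
cell_counts
# A 40, B 10, C 5, D 27, Other 5

# Every cell, including "Other", reaches the minimum for this data
all(cell_counts >= 5)
# TRUE
```

The check matters because “Other” is simply the sum of the lumped levels. With fewer rare observations it could itself fall below min, so it is worth verifying rather than assuming.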
fct_lump_lowfreq()
fct_lump_lowfreq() takes no threshold argument. It lumps together the least frequent levels while ensuring that “Other” is still the smallest level, which saves you from having to guess a value for n, prop, or min based on manual inspection. Only the tail of the distribution is collapsed:
# Automatic threshold selection
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_lowfreq(x)
# Levels: A D Other
# A: 40, D: 27, Other: 20 (B, C, and E through I combined)
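The forcats documentation describes fct_lump_lowfreq() as lumping the least frequent levels while ensuring that “Other” remains the smallest level. The base R sketch below shows one way to apply that rule to the example counts by hand. It is an illustration under that assumption, not the package's implementation, and counts, remaining, and last_kept are hypothetical names.

```r
# Example counts, sorted from most to least frequent
counts <- sort(
  c(A = 40, B = 10, C = 5, D = 27, E = 1, F = 1, G = 1, H = 1, I = 1),
  decreasing = TRUE
)

# remaining[i] is the combined count of every level ranked after position i
remaining <- sum(counts) - cumsum(counts)

# Stop at the first level that is larger than everything ranked after it
last_kept <- unname(which(counts > remaining)[1])

kept <- names(counts)[seq_len(last_kept)]
other_total <- sum(counts[-seq_len(last_kept)])

kept
# "A" "D"
other_total
# 20
```

Here A (40) is not larger than the 47 observations ranked after it, but D (27) is larger than the remaining 20, so everything after D is lumped and “Other” stays smaller than both kept levels.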
Weighted frequency
The w argument applies observation weights when computing frequencies, so that some observations contribute more to the count than others. This is the equivalent of using count(x, wt = w) inside the lumping logic. Weighted frequencies change which levels are considered rare because a level with many low-weight observations may get lumped while a level with few high-weight observations remains:
# Weights: the single C observation counts five times as much as the others
x <- factor(c("A", "A", "B", "B", "B", "C"))
w <- c(1, 1, 1, 1, 1, 5)
# Without weights: A = 2, B = 3, C = 1, so B is the most frequent level
fct_lump_n(x, n = 1)
# Levels: B Other
# With weights: A = 2, B = 3, C = 5, so C is the most frequent level
fct_lump_n(x, n = 1, w = w)
# Levels: C Other
Weights must match the length of the input factor.
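One common use of w is data that has already been aggregated, where each row carries a tally instead of representing a single observation. The sketch below uses hypothetical colour and tally vectors to illustrate this.

```r
library(forcats)

# Hypothetical pre-aggregated data: one row per colour with its tally
colour <- c("red", "blue", "green", "pink")
tally <- c(50, 30, 3, 2)

# Unweighted, all four levels tie at one row each, so nothing is lumped
unweighted <- fct_lump_n(colour, n = 2)
nlevels(unweighted)
# 4

# Weighted by tally, "red" and "blue" are the top two levels
lumped <- fct_lump_n(colour, n = 2, w = tally)
as.character(lumped)
# "red" "blue" "Other" "Other"
```

Without the weights the lumping is based on row counts, which carry no information once the data has been tallied.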
Arguments reference
| Argument | Type | Description |
|---|---|---|
| f | factor or character | Input factor (character is silently coerced) |
| n | integer | For fct_lump_n(): positive keeps top n, negative keeps bottom n |
| prop | numeric | For fct_lump_prop(): proportion threshold |
| min | integer | For fct_lump_min(): minimum frequency to preserve |
| w | numeric (optional) | Weights for frequency calculation; must match length of f |
| other_level | string | Label for the lumped category; default: "Other" |
| ties.method | string | For fct_lump_n() only; options: "min", "average", "first", "last", "random", "max" |
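The other_level argument is the only one not shown in the earlier examples. A minimal sketch, using "Rare" as an arbitrary replacement label:

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Same lumping as fct_lump_n(x, n = 3), with a custom label for the lumped group
relabelled <- fct_lump_n(x, n = 3, other_level = "Rare")
levels(relabelled)
# "A" "B" "D" "Rare"
```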
Common gotchas
Negative n inverts behavior. A negative n preserves the least frequent levels, not the most.
Character vectors are silently coerced to factors. No warning is given.
Weights must match input length exactly. Mismatched lengths cause an error.
ties.method only applies to fct_lump_n(), not to fct_lump_prop() or fct_lump_min().
The “Other” level is always placed last in the levels vector.
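Several of these points can be checked directly. The sketch below uses a hypothetical character vector; the error is caught with tryCatch() so that only the fact of the failure is recorded, since the exact message may differ between forcats versions.

```r
library(forcats)

chars <- c("a", "a", "a", "b", "b", "c", "d")  # plain character vector

lumped <- fct_lump_n(chars, n = 2)
is.factor(lumped)
# TRUE: the input was coerced without a warning
levels(lumped)
# "a" "b" "Other", with "Other" in last position

# Two weights for seven observations: the length mismatch is expected to fail
outcome <- tryCatch(
  {
    fct_lump_n(chars, n = 2, w = c(1, 2))
    "no error"
  },
  error = function(e) "error"
)
outcome
```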
See also
factor(): base R function for creating factors
dplyr::count(): count observations by group
table(): base R function for cross-tabulation