fct_lump

Updated May 19, 2026 · Tidyverse

r · forcats · factors

Introduction

The fct_lump_*() family collapses rare factor levels into an “Other” category. Use this when working with categorical variables that have many infrequent levels — such as survey responses, geographic regions, or data with long tails.

Four specialized functions handle different lumping strategies:

fct_lump_n(): keep the n most (or least) frequent levels
fct_lump_prop(): lump levels whose share of observations is at or below a threshold
fct_lump_min(): keep levels appearing at least min times
fct_lump_lowfreq(): lump the least frequent levels while keeping “Other” the smallest level

All functions require the forcats package, part of the tidyverse ecosystem.

library(forcats)

# Input: factor with 9 levels, varying frequencies
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
table(x)
# x
#  A  B  C  D  E  F  G  H  I
# 40 10  5 27  1  1  1  1  1

# Output: rare levels collapsed into "Other"
result <- fct_lump_n(x, n = 3)
table(result)
# result
#    A    B    D Other
#   40   10   27   10

fct_lump_n()

Keeps the n most frequent factor levels. Use a positive n to preserve the top n levels, or a negative n to preserve the bottom n (least frequent) levels.

# Keep the 3 most frequent levels
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_n(x, n = 3)
# Levels: A B D Other

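The ties.method argument controls how levels with equal counts are ranked, which matters when a tie falls on the cutoff. The accepted values are those of base R's rank():

"min" (default): tied levels all take the lowest rank in the tie, so a tie at the cutoff is kept whole and you may get more than n levels
"average": tied levels all take the mean of their ranks
"first": ties are broken by level order, with earlier levels ranked first
"last": ties are broken by level order, with later levels ranked first
"random": ties are broken at random
"max": tied levels all take the highest rank in the tie, so a tie at the cutoff is lumped whole and you may get fewer than n levels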

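To make the effect of ties.method concrete, the sketch below asks for n = 5 on the nine-level example factor, where the five rarest levels are tied at a count of 1 and the cutoff falls inside that tie. It assumes levels are ranked with base R's rank(), which is how forcats documents this argument; the variable names are illustrative only.

```r
library(forcats)

# Nine levels; E through I are tied at a count of 1
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# "min": every tied level gets rank 5, so all are kept and nothing is lumped
keep_min <- levels(fct_lump_n(x, n = 5, ties.method = "min"))

# "first": ties are ranked in level order, so only E makes the cut
keep_first <- levels(fct_lump_n(x, n = 5, ties.method = "first"))

# "max": every tied level gets rank 9, so the whole tie is lumped
keep_max <- levels(fct_lump_n(x, n = 5, ties.method = "max"))

print(keep_min)
print(keep_first)
print(keep_max)
```

Under these assumptions, "min" should return all nine original levels, "first" should return A to E plus Other, and "max" should return A to D plus Other.
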
To preserve the least frequent levels instead:

# Preserve the least frequent levels (all tied at count 1)
fct_lump_n(x, n = -1)
# Levels: E F G H I Other

With n = -1 and ties.method = "min", all levels tied for least frequency are preserved. Since E, F, G, H, and I all have count 1, all five are kept.

fct_lump_prop()

Lumps levels whose share of observations is at or below prop, so a level is kept only if it accounts for more than prop of the total. The prop argument is a fraction (0.10 = 10%).

# Lump levels appearing in fewer than 10% of observations
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Total observations: 87, so 10% = 8.7
fct_lump_prop(x, prop = 0.10)
# Levels: A B D Other
# A: 40 (46%), B: 10 (11%), D: 27 (31%), Other: 10 (11%)
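
One way to check which levels clear a positive threshold is to compute the shares by hand and compare them with what fct_lump_prop() keeps. This is a minimal sketch using base R's table() and prop.table(); the names shares, kept_by_hand, and kept_by_forcats are illustrative.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of observations per level, computed with base R
shares <- prop.table(table(x))

# Levels whose share is strictly above the 10% threshold
kept_by_hand <- names(shares)[as.vector(shares) > 0.10]

# Levels that fct_lump_prop() leaves alone (everything except "Other")
kept_by_forcats <- setdiff(levels(fct_lump_prop(x, prop = 0.10)), "Other")

print(round(shares, 3))
print(identical(kept_by_hand, kept_by_forcats))
```

Both approaches should agree on A, B, and D for this data.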

Use a negative prop to reverse the direction: levels appearing in at most -prop of observations are kept, and the more common levels are lumped instead:

# Keep only levels appearing in at most 5% of observations
fct_lump_prop(x, prop = -0.05)
# Levels: E F G H I Other
# E to I: 1 each (1%), Other: 82 (94%), made up of A, B, C, and D

fct_lump_min()

Preserves levels that appear at least min times. Everything below the threshold goes to “Other”.

# Keep levels appearing at least 5 times
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

fct_lump_min(x, min = 5)
# Levels: A B C D Other

Useful when levels need a minimum sample size for statistical validity.
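
If a minimum sample size is the goal, note that the threshold applies to the original levels, not to the lumped category itself. The sketch below uses a small hypothetical factor y (not from the examples above) in which the combined “Other” group still falls short of min, and flags it with a base R check.

```r
library(forcats)

# Hypothetical data: two common levels and two rare ones
y <- factor(rep(c("a", "b", "c", "d"), times = c(10, 10, 1, 1)))

lumped <- fct_lump_min(y, min = 5)
counts <- table(lumped)
print(counts)

# Flag any level, including the lumped one, that is still under the minimum
too_small <- names(counts)[as.vector(counts) < 5]
print(too_small)
```

Here “Other” holds only 2 observations, so a follow-up step (dropping or merging it) may still be needed before modelling.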

fct_lump_lowfreq()

Lumps the least frequent levels automatically, with no threshold to set. It chooses the cutoff so that “Other” is still the smallest level after lumping.

# Automatic selection: no n, prop, or min to choose
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

fct_lump_lowfreq(x)
# Levels: A D Other
# A: 40, D: 27, Other: 20 (B, C, and E to I combined)

Weighted Frequency

The w argument lets you weight observations differently when calculating frequencies:

# Create weights: each "A" observation counts 5 times, the rest count once
x <- factor(c("A", "A", "B", "B", "B", "C"))
w <- c(5, 5, 1, 1, 1, 1)

# Without weights: A = 2, B = 3, C = 1, so B is the most frequent level
fct_lump_n(x, n = 1)
# Levels: B Other

# With weights: A = 10 (5×2), B = 3, C = 1, so A is the most frequent level
fct_lump_n(x, n = 1, w = w)
# Levels: A Other

Weights must match the length of the input factor.

Arguments Reference

Argument	Type	Description
`f`	factor or character	Input factor (character is silently coerced)
`n`	integer	For `fct_lump_n()`: positive keeps top n, negative keeps bottom n
`prop`	numeric	For `fct_lump_prop()`: proportion threshold
`min`	integer	For `fct_lump_min()`: minimum frequency to preserve
`w`	numeric (optional)	Weights for frequency calculation; must match length of `f`
`other_level`	string	Label for the lumped category; default: `"Other"`
`ties.method`	string	For `fct_lump_n()` only; options: `"min"`, `"average"`, `"first"`, `"last"`, `"random"`, `"max"`
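
As a brief illustration of other_level, the sketch below repeats the fct_lump_min() example with a custom label. The label "Rare" and the variable name relabelled are arbitrary choices for this example.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Same lumping as fct_lump_min(x, min = 5), but with a custom label
relabelled <- fct_lump_min(x, min = 5, other_level = "Rare")
print(levels(relabelled))
```

The lumped category should appear as Rare, still in the last position.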

Common Gotchas

Negative n inverts behavior. A negative n preserves the least frequent levels, not the most.

Character vectors are silently coerced to factors. No warning is given.

Weights must match input length exactly. Mismatched lengths cause an error.

ties.method only applies to fct_lump_n(), not to fct_lump_prop() or fct_lump_min().

The “Other” level is always placed last in the levels vector.
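
Several of these points can be checked directly. The sketch below uses a small hypothetical character vector, chars, and wraps the mismatched-weights call in tryCatch() so the script keeps running; the exact error message is not reproduced here because it may differ between forcats versions.

```r
library(forcats)

# Character input: accepted and returned as a factor
chars <- c("a", "a", "a", "b", "b", "c", "d")
out <- fct_lump_n(chars, n = 2)
print(is.factor(out))
print(levels(out))

# Weights of the wrong length: the call fails, captured here with tryCatch()
mismatch <- tryCatch(
  fct_lump_n(chars, n = 2, w = c(1, 2)),
  error = function(e) "error"
)
print(mismatch)
```

The result is a factor whose last level is “Other”, and the mismatched call should end up in the error handler.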