
fct_lump

Introduction

The fct_lump_*() family collapses rare factor levels into an “Other” category. Use this when working with categorical variables that have many infrequent levels, such as survey responses, geographic regions, or data with long tails.

Four specialized functions handle different lumping strategies:

  • fct_lump_n(): keep the n most (or least) frequent levels
  • fct_lump_prop(): lump by proportion threshold
  • fct_lump_min(): keep levels appearing at least min times
  • fct_lump_lowfreq(): lump the least frequent levels while keeping “Other” the smallest level

All four functions come from the forcats package, part of the tidyverse. Each one groups infrequent factor levels into a single “Other” category, reducing the number of distinct levels in plots and tables. The example below creates a factor with 9 levels at varying frequencies, then keeps the three most frequent levels and collapses the six rarest into “Other”:

library(forcats)

# Input: factor with 9 levels, varying frequencies
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
table(x)
# x
#  A  B  C  D  E  F  G  H  I
# 40 10  5 27  1  1  1  1  1

# Output: rare levels collapsed into "Other"
result <- fct_lump_n(x, n = 3)
table(result)
# result
#    A    B    D Other
#   40   10   27   10

fct_lump_n()

fct_lump_n() keeps the n most frequent factor levels and collapses the rest. A positive n selects the top levels by count, while a negative n selects the bottom (least frequent) levels. This is the most straightforward lumping function when you know exactly how many levels you want in the output:

# Keep the 3 most frequent levels
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_n(x, n = 3)
# Levels: A B D Other

The ties.method argument controls what happens when multiple levels share the same frequency at the cutoff boundary. With the default "min", all tied levels are included, so you may end up with more than n retained levels if there is a tie for the nth position:

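  • "min" (default): tied levels all get the lowest rank in the tie, so a tie at the cutoff keeps every tied level (at least n levels)
  • "average": tied levels share the mean of their ranks
  • "first": ties are broken in level order, with earlier levels ranked first
  • "last": ties are broken in reverse level order
  • "random": ties are broken at random
  • "max": tied levels all get the highest rank in the tie, so a tie that straddles the cutoff is lumped entirely (at most n levels)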
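
To make the difference concrete, here is a small sketch that reuses the factor from above and asks for five levels, so that the cutoff falls inside the five-way tie at count 1. The variable names are illustrative only:

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# "min" (default): E to I all share rank 5, so nothing is lumped
keep_min <- levels(fct_lump_n(x, n = 5))
keep_min
# "A" "B" "C" "D" "E" "F" "G" "H" "I"

# "first": ties are broken in level order, so only E makes the cut
keep_first <- levels(fct_lump_n(x, n = 5, ties.method = "first"))
keep_first
# "A" "B" "C" "D" "E" "Other"

# "max": E to I all share rank 9, so the whole tie is lumped
keep_max <- levels(fct_lump_n(x, n = 5, ties.method = "max"))
keep_max
# "A" "B" "C" "D" "Other"
```

If you need exactly n levels regardless of ties, "first" is probably the most predictable choice, at the cost of an arbitrary pick among tied levels.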

To preserve the least frequent levels instead of the most frequent ones, use a negative n. This is useful when you want to highlight rare categories and group the common ones:

# Preserve the least frequent levels (all tied at count 1)
fct_lump_n(x, n = -1)
# Levels: E F G H I Other

With n = -1 and ties.method = "min", all levels tied for least frequency are preserved. Since E, F, G, H, and I all have count 1, all five are kept.

fct_lump_prop()

fct_lump_prop() lumps levels based on a proportional threshold rather than a fixed count. Levels whose share of the observations is at most prop are collapsed into “Other,” where prop is a fraction between 0 and 1. Because the threshold is relative, it adapts automatically to datasets of different sizes:

# Lump levels appearing in fewer than 10% of observations
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Total observations: 87, so 10% = 8.7
fct_lump_prop(x, prop = 0.10)
# Levels: A B D Other
# A: 40 (46%), B: 10 (11%), D: 27 (31%), Other: 10 (11%)
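
One way to preview what a given threshold will do is to compute the level proportions yourself with base R before lumping. This is only a sanity-check sketch; the names props and kept are made up for the example:

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of observations per level
props <- prop.table(table(x))
round(props, 3)

# Levels whose share exceeds the 10% threshold
kept <- names(props)[props > 0.10]
kept
# "A" "B" "D"

# The same levels survive the lumping call
lumped <- fct_lump_prop(x, prop = 0.10)
setdiff(levels(lumped), "Other")
# "A" "B" "D"
```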

A negative prop reverses the direction: levels that appear in at most -prop of the observations are kept, and the more common levels are lumped into “Other.” This is the proportional counterpart of a negative n in fct_lump_n(). For example, prop = -0.05 keeps only the levels present in 5% or fewer rows:

# Keep levels appearing in at most 5% of observations; lump the rest
fct_lump_prop(x, prop = -0.05)
# Levels: E F G H I Other
# E, F, G, H, I: 1 each (1%), Other: 82 (94%)

fct_lump_min()

fct_lump_min() preserves levels with at least a minimum absolute count and lumps everything below that threshold. This is the most intuitive function when your criterion is a specific sample size rather than a proportion or rank, and it reads naturally when the domain rule is stated as “I need at least X observations per level for this analysis”:

# Keep levels appearing at least 5 times
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

fct_lump_min(x, min = 5)
# Levels: A B C D Other

This is useful when statistical validity requires a minimum sample size per level, for example before running ANOVA or chi-square tests, where small cell counts can skew the results. Setting min to 5 or 10 is a common heuristic for excluding sparsely populated categories from formal modeling.

fct_lump_lowfreq()

fct_lump_lowfreq() takes no threshold argument. It lumps the least frequent levels together, ensuring that “Other” is still the smallest level. This saves you from having to guess a value for n, prop, or min based on manual inspection. In the example below only A and D are kept; the remaining levels sum to 20, which is still smaller than D's 27:

# Automatic threshold selection
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

fct_lump_lowfreq(x)
# Levels: A D Other
# A: 40, D: 27, Other: 20
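
The forcats documentation describes fct_lump_lowfreq() as ensuring that “Other” remains the smallest level. The sketch below checks that property on the example factor using plain table() output; other_is_smallest is just an illustrative name:

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Counts per level after automatic lumping
counts <- table(fct_lump_lowfreq(x))
counts

# "Other" should be no larger than any retained level
other_is_smallest <-
  counts[["Other"]] <= min(counts[names(counts) != "Other"])
other_is_smallest
```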

Weighted frequency

The w argument applies observation weights when computing frequencies, so that some observations contribute more to the count than others. This is the equivalent of using count(x, wt = w) inside the lumping logic. Weighted frequencies change which levels are considered rare because a level with many low-weight observations may get lumped while a level with few high-weight observations remains:

# Create weights: the single C observation carries a large weight
x <- factor(c("A", "A", "B", "B", "B", "C"))
w <- c(1, 1, 2, 2, 2, 5)

# Without weights, the counts are A = 2, B = 3, C = 1, so C is lumped
fct_lump_n(x, n = 2)
# Levels: A B Other

# With weights, the totals are A = 2, B = 6, C = 5, so A is lumped instead
fct_lump_n(x, n = 2, w = w)
# Levels: B C Other

Weights must match the length of the input factor.

Arguments reference

Argument      Type                  Description
f             factor or character   Input factor (character is silently coerced)
n             integer               For fct_lump_n(): positive keeps top n, negative keeps bottom n
prop          numeric               For fct_lump_prop(): proportion threshold
min           integer               For fct_lump_min(): minimum frequency to preserve
w             numeric (optional)    Weights for frequency calculation; must match length of f
other_level   string                Label for the lumped category; default: "Other"
ties.method   string                For fct_lump_n() only; options: "min", "average", "first", "last", "random", "max"
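
As a quick illustration of other_level, the sketch below relabels the lumped category. The label "Rare" is an arbitrary choice for the example:

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Same lumping as fct_lump_min(x, min = 5), but with a custom label
rare <- fct_lump_min(x, min = 5, other_level = "Rare")
levels(rare)
# "A" "B" "C" "D" "Rare"
```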

Common gotchas

Negative n inverts behavior. A negative n preserves the least frequent levels, not the most.

Character vectors are silently coerced to factors. No warning is given.
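
A minimal sketch of the coercion, using a made-up character vector:

```r
library(forcats)

# Plain character input, not a factor
chr <- c("a", "a", "b", "c")

out <- fct_lump_min(chr, min = 2)
is.factor(out)
# TRUE
levels(out)
# "a" "Other"
```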

Weights must match input length exactly. Mismatched lengths cause an error.
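
The sketch below triggers that error on purpose and catches it with tryCatch(). The weight vector bad_w is hypothetical and deliberately too short; the exact wording of the error message may vary between forcats versions, so it is captured rather than quoted:

```r
library(forcats)

f <- factor(c("A", "A", "B", "B", "B", "C"))
bad_w <- c(1, 2, 3)  # three weights for six observations

res <- tryCatch(
  fct_lump_n(f, n = 2, w = bad_w),
  error = function(e) "error"
)
res
# "error"
```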

ties.method only applies to fct_lump_n(), not to fct_lump_prop() or fct_lump_min().

The “Other” level is always placed last in the levels vector. If nothing ends up being lumped, the factor is returned unchanged and no “Other” level is added.

See also