fct_lump

Introduction

The fct_lump_*() family collapses rare factor levels into an “Other” category. Use this when working with categorical variables that have many infrequent levels — such as survey responses, geographic regions, or data with long tails.

Four specialized functions handle different lumping strategies:

  • fct_lump_n() — keep the n most (or least) frequent levels
  • fct_lump_prop() — lump by proportion threshold
  • fct_lump_min() — keep levels appearing at least min times
  • fct_lump_lowfreq() — lump the least frequent levels while keeping “Other” the smallest level

All functions require the forcats package, part of the tidyverse ecosystem.

library(forcats)

# Input: factor with 9 levels, varying frequencies
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
table(x)
# x
#  A  B  C  D  E  F  G  H  I
# 40 10  5 27  1  1  1  1  1

# Output: rare levels collapsed into "Other"
result <- fct_lump_n(x, n = 3)
table(result)
# result
#    A    B    D Other
#   40   10   27   10

fct_lump_n()

Keeps the n most frequent factor levels. Use a positive n to preserve the top n levels, or a negative n to preserve the bottom n (least frequent) levels.

# Keep the 3 most frequent levels
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
fct_lump_n(x, n = 3)
# Levels: A B D Other

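The ties.method argument controls what happens when several levels are tied at the cutoff rank. It is passed on to base R's rank(), so the options are:

  • "min" (default) — tied levels all get the lowest rank of the group, so a tie at the cutoff is kept whole and you may get more than n levels
  • "average" — tied levels share the average of their ranks
  • "first" — ties are broken by position, with earlier levels ranked first
  • "last" — ties are broken by position, with later levels ranked first
  • "random" — ties are broken at random
  • "max" — tied levels all get the highest rank of the group, so a tie that straddles the cutoff is lumped whole and you may get fewer than n levels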
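The effect of ties.method is easiest to see when the cutoff falls inside a tie. The sketch below reuses the nine-level factor from the introduction and sets n = 5, a value chosen here only for illustration, so that the fifth rank lands in the five-way tie among E through I. The variable names are placeholders.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# "min": E through I all share rank 5, so every level is within the cutoff
by_min <- fct_lump_n(x, n = 5, ties.method = "min")
levels(by_min)

# "first": the tie is broken by level order, so only E gets rank 5
by_first <- fct_lump_n(x, n = 5, ties.method = "first")
levels(by_first)

# "max": E through I all share rank 9, so the whole tied group is lumped
by_max <- fct_lump_n(x, n = 5, ties.method = "max")
levels(by_max)
```

With these counts, "min" should leave all nine levels in place, "first" should keep A through E, and "max" should keep only A through D.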

To preserve the least frequent levels instead:

# Preserve the least frequent levels (all tied at count 1)
fct_lump_n(x, n = -1)
# Levels: E F G H I Other

With n = -1 and ties.method = "min", all levels tied for least frequency are preserved. Since E, F, G, H, and I all have count 1, all five are kept.

fct_lump_prop()

Lumps levels whose share of the observations is at or below prop, and keeps levels whose share is above it. The prop argument is a fraction (0.10 = 10%).

# Lump levels that make up at most 10% of observations
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Total observations: 87, so 10% = 8.7
fct_lump_prop(x, prop = 0.10)
# Levels: A B D Other
# A: 40 (46%), B: 10 (11%), D: 27 (31%), Other: 10 (11%)
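To check the threshold arithmetic by hand, the shares can be computed with base R and compared against what fct_lump_prop() keeps. This is only a sanity-check sketch; the names shares, kept, and lumped are introduced here for illustration.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of observations per level
shares <- prop.table(table(x))
round(shares, 3)

# Levels whose share is strictly above the 10% threshold
kept <- names(shares)[as.vector(shares) > 0.10]
kept

# Compare with the levels fct_lump_prop() leaves untouched
lumped <- fct_lump_prop(x, prop = 0.10)
setdiff(levels(lumped), "Other")
```

Both approaches should agree on A, B, and D. C falls short at 5 of 87 observations, about 5.7%.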

A negative prop reverses the direction: levels that make up at most -prop of the observations are kept, and the more common levels are lumped:

# Keep levels that make up at most 5% of observations, lump the rest
fct_lump_prop(x, prop = -0.05)
# Levels: E F G H I Other
# E, F, G, H, I: 1 each (1%), Other: 82 (94%)

fct_lump_min()

Preserves levels that appear at least min times. Everything below the threshold goes to “Other”.

# Keep levels appearing at least 5 times
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

fct_lump_min(x, min = 5)
# Levels: A B C D Other

Useful when levels need a minimum sample size for statistical validity.

fct_lump_lowfreq()

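Lumps the least frequent levels automatically, with no threshold argument. It lumps as many of the rarest levels as it can while ensuring that “Other” is still the smallest level.

# No threshold argument: the cutoff is derived from the counts
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

fct_lump_lowfreq(x)
# Levels: A D Other
# A: 40, D: 27, Other: 20 (B, C, and E through I)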
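The forcats documentation describes this function as lumping the least frequent levels while ensuring that "Other" is still the smallest level. The base-R sketch below shows one way to arrive at such a cutoff. It assumes the rule is to walk down the levels from most to least frequent and stop at the first level that outnumbers everything after it; that rule and the helper names are illustrative assumptions, not the package's internals.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Counts per level, most frequent first
tab <- table(x)
counts <- sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)

# remaining[i]: observations left over if the i most frequent levels are kept
remaining <- sum(counts) - cumsum(counts)

# First position where the level just kept outnumbers everything left
k <- which(counts > remaining)[1]
kept <- names(counts)[seq_len(k)]
kept

# Compare with the levels fct_lump_lowfreq() leaves untouched
setdiff(levels(fct_lump_lowfreq(x)), "Other")
```

For this factor, the walk stops at D: after keeping A (40) and D (27), only 20 observations remain, so both approaches should keep A and D and lump the rest.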

Weighted Frequency

The w argument lets you weight observations differently when calculating frequencies:

# Weights let one observation count for more than one
x <- factor(c("A", "A", "A", "B", "B", "C", "D"))
w <- c(1, 1, 1, 5, 5, 1, 1)

# Without weights, A is the most frequent level (3 observations)
fct_lump_n(x, n = 1)
# Levels: A Other

# With weights, B has the largest total (5 + 5 = 10), ahead of A (3)
fct_lump_n(x, n = 1, w = w)
# Levels: B Other

Weights must match the length of the input factor.
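A common use for w is pre-aggregated data, where each level appears on one row next to its count. The sketch below assumes a hypothetical data frame agg with made-up counts; the column names are placeholders.

```r
library(forcats)

# Hypothetical pre-aggregated data: one row per level, with its count
agg <- data.frame(
  level = c("A", "B", "C", "D", "E"),
  count = c(40, 10, 5, 27, 1)
)

# Each level appears once, so the ranking has to come from the counts
agg$lumped <- fct_lump_n(agg$level, n = 2, w = agg$count)

# Total count per lumped level
totals <- tapply(agg$count, agg$lumped, sum)
totals
```

Without w, all five levels would be tied at one observation each, so the counts have to be supplied as weights for the ranking to mean anything.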

Arguments Reference

| Argument | Type | Description |
|---|---|---|
| f | factor or character | Input factor (character is silently coerced) |
| n | integer | For fct_lump_n(): positive keeps top n, negative keeps bottom n |
| prop | numeric | For fct_lump_prop(): proportion threshold; positive lumps levels at or below prop, negative lumps levels above -prop |
| min | integer | For fct_lump_min(): minimum frequency to preserve |
| w | numeric (optional) | Weights for frequency calculation; must match length of f |
| other_level | string | Label for the lumped category; default: "Other" |
| ties.method | string | For fct_lump_n() only; options: "min", "average", "first", "last", "random", "max" |
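As a short illustration of other_level, the sketch below relabels the lumped category. The label "Rare" is an arbitrary choice made for this example.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Same lumping as fct_lump_min(x, min = 5), with a custom label
rare <- fct_lump_min(x, min = 5, other_level = "Rare")
table(rare)
```

The result should have the levels A, B, C, D, and Rare, with the five singleton levels collected under Rare.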

Common Gotchas

Negative n inverts behavior. A negative n preserves the least frequent levels, not the most.

Character vectors are silently coerced to factors. No warning is given.

Weights must match input length exactly. Mismatched lengths cause an error.

ties.method only applies to fct_lump_n(), not to fct_lump_prop() or fct_lump_min().

The “Other” level is always placed last in the levels vector.
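Several of these points can be checked directly. The sketch below uses small made-up vectors (chr, y) and a deliberately mismatched weight vector; the tryCatch() wrapper is only there so the script keeps running after the error.

```r
library(forcats)

# A character vector is accepted and comes back as a factor
chr <- c("a", "a", "b", "c")
out <- fct_lump_min(chr, min = 2)
is.factor(out)
levels(out)

# A weight vector of the wrong length is rejected
res <- tryCatch(
  fct_lump_min(chr, min = 2, w = c(1, 2)),
  error = function(e) "error"
)
res

# "Other" ends up last even when the lumped levels sort first
y <- factor(c("a", "b", "z", "z", "z"))
other_last <- levels(fct_lump_min(y, min = 2))
other_last
```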

See Also

  • factor() — base R function for creating factors
  • dplyr::count() — count observations by group
  • table() — base R function for cross-tabulation