Handling Factors with forcats

5 min read · Updated March 7, 2026 · beginner
forcats factors tidyverse categorical-data

Factors are R’s way of handling categorical data: variables that take on a limited set of values. They may look like strings, but each factor stores a fixed set of levels in a defined order, which makes them useful for statistical modeling, data visualization, and any analysis involving grouped data. The tidyverse provides the forcats package to make working with factors more convenient.

In this tutorial, you’ll learn how to create factors, reorder them for better visualizations, combine rare levels, recode categories, and more.

Creating Factors

The base R way to create factors uses the factor() function:

# Create a factor from a character vector
education <- c("High School", "Bachelor", "Master", "PhD", "Bachelor", "Master")
education_factor <- factor(education)

education_factor
# [1] High School Bachelor Master   PhD       Bachelor Master  
# Levels: Bachelor High School Master PhD

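By default, factor() sorts the levels alphabetically. Passing levels explicitly lets you set a meaningful order. This is still base R; forcats is loaded here because the rest of the tutorial relies on it: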

library(forcats)

# Create with explicit level order
education <- c("High School", "Bachelor", "Master", "PhD", "Bachelor", "Master")
education_factor <- factor(education, levels = c("High School", "Bachelor", "Master", "PhD"))

education_factor
# [1] High School Bachelor Master   PhD       Bachelor Master  
# Levels: High School Bachelor Master PhD
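
The explicit-levels call above is plain base R. As a minimal sketch of how forcats itself can control level order, the block below uses fct_inorder() (levels in order of first appearance) and fct_relevel() (move named levels to the front). The variable names by_appearance and phd_first are only illustrative.

```r
library(forcats)

education <- c("High School", "Bachelor", "Master", "PhD", "Bachelor", "Master")

# Levels in order of first appearance instead of alphabetical order
by_appearance <- fct_inorder(education)
levels(by_appearance)
# [1] "High School" "Bachelor"    "Master"      "PhD"

# Move a chosen level to the front of an existing (alphabetical) factor
phd_first <- fct_relevel(factor(education), "PhD")
levels(phd_first)
# [1] "PhD"         "Bachelor"    "High School" "Master"
```

fct_inorder() is handy when the data already arrives in a sensible order; fct_relevel() suits cases where only one or two levels need to move.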

Reordering Factor Levels with fct_reorder

When creating visualizations, you often want factor levels ordered by another variable rather than alphabetically:

library(ggplot2)
library(dplyr)

# Sample data
survey <- data.frame(
  department = c("Sales", "Engineering", "HR", "Marketing", "Engineering", 
                 "Sales", "Marketing", "HR"),
  salary = c(55000, 95000, 52000, 61000, 98000, 53000, 64000, 51000)
)

# Reorder department by median salary
ggplot(survey, aes(x = fct_reorder(department, salary, .fun = median), y = salary)) +
  geom_boxplot() +
  labs(x = "Department", y = "Salary") +
  theme_minimal()

fct_reorder() sorts the levels in ascending order of the summary function you specify (median is the default), so the boxes run from the lowest to the highest median salary and the plot is easier to read.
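
To see what fct_reorder() does without drawing a plot, you can inspect the resulting levels directly. The sketch below reuses the survey data from above and also shows the .desc argument for reversing the order; the names asc and desc are illustrative.

```r
library(forcats)

survey <- data.frame(
  department = c("Sales", "Engineering", "HR", "Marketing", "Engineering",
                 "Sales", "Marketing", "HR"),
  salary = c(55000, 95000, 52000, 61000, 98000, 53000, 64000, 51000)
)

# Ascending by median salary (median is the default summary function)
asc <- fct_reorder(survey$department, survey$salary)
levels(asc)
# [1] "HR"          "Sales"       "Marketing"   "Engineering"

# Descending order via .desc = TRUE
desc <- fct_reorder(survey$department, survey$salary, .desc = TRUE)
levels(desc)
# [1] "Engineering" "Marketing"   "Sales"       "HR"
```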

Combining Rare Levels with fct_lump

When you have many categories, some with few observations, fct_lump() combines rare levels into “Other”:

# Sample data with many categories
countries <- c("USA", "UK", "Germany", "France", "Japan", "China", "India", 
               "Brazil", "Canada", "USA", "UK", "Germany", "France", "Japan")

# Keep the 3 most common levels, lump the rest into "Other".
# Five countries tie for first place here, so all five are kept.
fct_lump(countries, n = 3)
#  [1] USA     UK      Germany France  Japan   Other   Other   Other   Other  
# [10] USA     UK      Germany France  Japan  
# Levels: France Germany Japan UK USA Other

You can also lump by proportion:

# Keep categories that appear more than 10% of the time
fct_lump(countries, prop = 0.1)
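
In forcats 0.5.0 and later, fct_lump() is superseded by more specific helpers such as fct_lump_n(), fct_lump_prop(), and fct_lump_min(). The sketch below assumes one of those versions is installed and uses a different, hypothetical countries vector without ties at the top so the result is easier to follow.

```r
library(forcats)

# Hypothetical data: USA 4, UK 3, Germany 2, France 1, Japan 1
countries <- c("USA", "USA", "USA", "USA", "UK", "UK", "UK",
               "Germany", "Germany", "France", "Japan")

# Keep the 2 most frequent levels; everything else becomes "Other"
top2 <- fct_lump_n(countries, n = 2)
levels(top2)
# [1] "UK"    "USA"   "Other"

# Keep levels seen at least twice, with a custom label for the rest
min2 <- fct_lump_min(countries, min = 2, other_level = "Rare")
levels(min2)
# [1] "Germany" "UK"      "USA"     "Rare"
```

The lumped level is always placed last, while the kept levels stay in their original (here alphabetical) order.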

Recoding Levels with fct_recode

Sometimes you need to rename categories for clarity or consistency:

# Original data
response <- c("Yes", "No", "Maybe", "Yes", "No", "Yes")

# Recode levels
fct_recode(response,
  "Agree" = "Yes",
  "Disagree" = "No",
  "Uncertain" = "Maybe"
)
# [1] Agree     Disagree  Uncertain Agree     Disagree  Agree    
# Levels: Uncertain Disagree Agree
# (each new name takes the position of the level it replaces: Maybe, No, Yes)
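
Because fct_recode() only renames, each new level keeps the position of the level it replaced. If you want a specific display order afterwards, one option is to follow it with fct_relevel(), as in this sketch (the names recoded and reordered are illustrative):

```r
library(forcats)

response <- c("Yes", "No", "Maybe", "Yes", "No", "Yes")

# Renamed levels keep the positions of the originals (Maybe, No, Yes)
recoded <- fct_recode(response,
  Agree = "Yes",
  Disagree = "No",
  Uncertain = "Maybe"
)
levels(recoded)
# [1] "Uncertain" "Disagree"  "Agree"

# Put the levels into the order you want to display
reordered <- fct_relevel(recoded, "Agree", "Uncertain", "Disagree")
levels(reordered)
# [1] "Agree"     "Uncertain" "Disagree"
```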

Collapsing Levels with fct_collapse

To group several related categories under a single level, use fct_collapse():

# Original data
region <- c("NY", "CA", "TX", "FL", "WA", "NY", "CA", "TX", "FL")

# Collapse into broader categories
fct_collapse(region,
  West = c("CA", "WA"),
  South = c("TX", "FL"),
  Northeast = c("NY")
)
# [1] Northeast West    South    South    West     Northeast West    South   
# [9] South   
# Levels: West South Northeast
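
When only one group matters, you do not have to list every remaining level by hand. The sketch below assumes a forcats version that supports the other_level argument of fct_collapse() (0.5.0 or later); the label "Elsewhere" and the name west_vs_rest are made up for illustration.

```r
library(forcats)

region <- c("NY", "CA", "TX", "FL", "WA", "NY", "CA", "TX", "FL")

# Name only the group of interest; all remaining levels go into other_level
west_vs_rest <- fct_collapse(region,
  West = c("CA", "WA"),
  other_level = "Elsewhere"
)
levels(west_vs_rest)
# [1] "West"      "Elsewhere"

# Check how many observations ended up in each group
fct_count(west_vs_rest)
```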

Counting Levels with fct_count

fct_count() shows how many observations fall into each level:

# Sample data
colors <- c("Blue", "Red", "Green", "Blue", "Blue", "Red", "Yellow", "Blue")

fct_count(colors, sort = TRUE)
# # A tibble: 4 × 2
#   f         n
#   <fct> <int>
# 1 Blue      4
# 2 Red       2
# 3 Green     1
# 4 Yellow    1
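
fct_count() can also report proportions. A short sketch, reusing the colors vector from above and the prop argument, which adds a column p with each level's share of the total:

```r
library(forcats)

colors <- c("Blue", "Red", "Green", "Blue", "Blue", "Red", "Yellow", "Blue")

# prop = TRUE adds a column p holding n divided by the total count
counts <- fct_count(colors, sort = TRUE, prop = TRUE)
counts$n
# [1] 4 2 1 1
counts$p
# [1] 0.500 0.250 0.125 0.125
```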

Practical Example: Survey Analysis

Putting it all together on simulated survey data:

library(dplyr)
library(ggplot2)
library(forcats)

# Simulated survey data (seed fixed so the results are reproducible)
set.seed(123)
survey_data <- tibble(
  education = sample(c("High School", "Bachelor", "Master", "PhD", "Some College"), 
                    200, replace = TRUE),
  income = rnorm(200, mean = 50000, sd = 15000)
)

# Clean up: lump rare categories and reorder
survey_clean <- survey_data %>%
  mutate(
    education = fct_lump(education, n = 3),
    education = fct_reorder(education, income, .fun = median)
  )

# Analyze by education level
survey_clean %>%
  group_by(education) %>%
  summarise(
    n = n(),
    median_income = median(income)
  )

Summary

This tutorial covered the following tools for working with categorical data. All of them come from forcats except factor(), which is base R:

Function         Purpose
factor()         Create factors with custom levels
fct_reorder()    Reorder levels by another variable
fct_lump()       Combine rare levels into “Other”
fct_recode()     Rename individual levels
fct_collapse()   Group levels into categories
fct_count()      Count observations per level

Mastering these functions will make your R data wrangling much more efficient, especially when preparing data for visualization or statistical modeling.

Why Factors Matter

Understanding when to use factors versus character vectors is crucial for effective R programming. Factors are essential when:

  1. Statistical modeling: Many modeling functions in R automatically treat factor variables as categorical predictors and create the corresponding dummy variables.

  2. Ordinal data: When the order of categories matters (like “Low”, “Medium”, “High”), the level order records it, and a factor created with ordered = TRUE also supports comparisons such as <.

  3. Memory efficiency: A factor stores one integer code per element plus a single copy of each label. R already caches repeated strings, so the saving over a character vector is real but often modest.

  4. Consistent groupings: A factor carries its full set of levels into every subset of the data, and values outside the declared levels become NA instead of silently creating new categories, which helps surface typos and case mismatches.
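
The ordinal and consistency points can be sketched in a few lines of base R. The vectors satisfaction and sizes are made-up illustrations, not data from this tutorial.

```r
# Ordinal data: an ordered factor supports comparisons between levels
satisfaction <- factor(
  c("Low", "High", "Medium", "Low"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
below_high <- satisfaction < "High"
below_high
# [1]  TRUE FALSE  TRUE  TRUE

# Consistent groupings: a value outside the declared levels becomes NA,
# so the case mismatch "Small" is easy to spot
sizes <- factor(c("small", "Small", "large"), levels = c("small", "large"))
is.na(sizes)
# [1] FALSE  TRUE FALSE
```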

# Example: Factors in modeling
library(dplyr)

# Create sample data (seed fixed so the fitted coefficients are reproducible)
set.seed(42)
model_data <- tibble(
  treatment = factor(rep(c("Control", "Drug A", "Drug B"), each = 30)),
  outcome = c(rnorm(30, mean = 10), rnorm(30, mean = 15), rnorm(30, mean = 12))
)

# Linear model automatically treats factor as categorical
lm(outcome ~ treatment, data = model_data)

lm() expands the treatment variable into dummy variables and uses the first level (Control, because levels default to alphabetical order) as the reference category.
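
Since the reference category is simply the first level, you can change it by reordering the levels. The sketch below rebuilds similar simulated data with an arbitrary seed and uses fct_relevel() to make "Drug A" the baseline; the names default_terms and releveled_terms are illustrative.

```r
library(forcats)

set.seed(1)  # arbitrary seed so the simulated numbers are repeatable
model_data <- data.frame(
  treatment = factor(rep(c("Control", "Drug A", "Drug B"), each = 30)),
  outcome = c(rnorm(30, mean = 10), rnorm(30, mean = 15), rnorm(30, mean = 12))
)

# Default: the first level ("Control") is the reference category
default_terms <- names(coef(lm(outcome ~ treatment, data = model_data)))
default_terms
# [1] "(Intercept)"     "treatmentDrug A" "treatmentDrug B"

# Move "Drug A" to the front so it becomes the reference instead
model_data$treatment <- fct_relevel(model_data$treatment, "Drug A")
releveled_terms <- names(coef(lm(outcome ~ treatment, data = model_data)))
releveled_terms
# [1] "(Intercept)"      "treatmentControl" "treatmentDrug B"
```

The fitted values are the same in both models; only the baseline that the coefficients are measured against changes.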

Working with Missing Levels

Sometimes you want to keep levels that don’t appear in your data, for example so that empty categories still show up in tables and plots. At other times you want to drop them:

# Keep all levels even if not present in data
colors <- factor(c("Red", "Blue", "Red"), levels = c("Red", "Blue", "Green", "Yellow"))
levels(colors)
# [1] "Red"    "Blue"   "Green"  "Yellow"

# Drop unused levels
colors <- fct_drop(colors)
levels(colors)
# [1] "Red" "Blue"
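
Unused levels most often appear after subsetting. The sketch below uses a small hypothetical orders data frame to show how a filtered factor keeps its empty levels until fct_drop() removes them, and how fct_expand() goes the other way by adding a level that has no observations yet.

```r
library(forcats)

# Hypothetical data for illustration
orders <- data.frame(
  color = factor(c("Red", "Blue", "Green", "Red")),
  qty = c(3, 1, 2, 5)
)

# Subsetting keeps every original level, even those with no rows left
red_only <- orders[orders$color == "Red", ]
levels(red_only$color)
# [1] "Blue"  "Green" "Red"

# Drop the levels that no longer occur
red_only$color <- fct_drop(red_only$color)
levels(red_only$color)
# [1] "Red"

# Add a level that is not in the data yet
expanded <- fct_expand(red_only$color, "Purple")
levels(expanded)
# [1] "Red"    "Purple"
```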

Together, factor() with explicit levels and fct_drop() let you decide exactly which levels a factor carries, whether or not each one appears in the data.