Strings and Factors in R

12 min read · beginner
strings factors characters r-fundamentals categorical-data
Part 7 of the r-fundamentals series

Strings and factors are two data types you’ll use constantly in R data analysis. Strings store text, while factors represent categorical variables, which makes them essential for statistical modeling and data visualization. This tutorial covers how to create, manipulate, and convert between the two.

Creating Strings in R

Strings in R are created using quotation marks. You can use either double quotes (") or single quotes ('):

# Double quotes
name <- "Alice"
print(name)

# Single quotes
city <- 'London'
print(city)

# String with quotes inside
quote <- "She said \"hello\""
print(quote)

Both forms create exactly the same string; the only practical difference is which quote character has to be escaped inside it. Double quotes are more common in R code, and the tidyverse style guide prefers them.
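The escaping point is easier to see side by side. The sketch below uses two illustrative variables, quote_a and quote_b, and also shows that print() displays the escaped form while writeLines() displays the text itself:

```r
# Two ways to write the same string: escape the inner quotes,
# or wrap the string in the other kind of quote.
quote_a <- "She said \"hello\""
quote_b <- 'She said "hello"'

identical(quote_a, quote_b)
# [1] TRUE

# print() shows the escaped form; writeLines() shows the raw text
print(quote_a)
# [1] "She said \"hello\""
writeLines(quote_a)
# She said "hello"
```

The backslashes are part of how the string is displayed, not part of its contents.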

Checking String Length

The nchar() function returns the number of characters in a string:

text <- "Hello, World!"
nchar(text)
# [1] 13

# Works with vectors too
words <- c("apple", "banana", "cherry")
nchar(words)
# [1] 5 6 6
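For text with accented or other non-ASCII characters, the number of characters and the number of bytes can differ. A small sketch, assuming the string is UTF-8 encoded (the variable name accented is illustrative):

```r
# "\u00e9" is the single character e-acute, written as a Unicode escape
accented <- "caf\u00e9"

# Default: count characters
nchar(accented)
# [1] 4

# type = "bytes" counts bytes; e-acute takes two bytes in UTF-8
nchar(accented, type = "bytes")
# [1] 5
```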

String Manipulation with Base R

Base R provides several functions for working with strings. They are less consistent than the stringr package, but they need no extra packages and appear in a lot of existing R code, so they are worth knowing.

Concatenating Strings

Use paste() to combine strings:

first_name <- "John"
last_name <- "Doe"

# Default: separated by space
paste(first_name, last_name)
# [1] "John Doe"

# Custom separator
paste(first_name, last_name, sep = "_")
# [1] "John_Doe"

# paste0: no separator
paste0(first_name, last_name)
# [1] "JohnDoe"
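paste() is also vectorized, and its collapse argument joins the elements of a vector into one string. A short sketch with an illustrative fruits vector:

```r
fruits <- c("apple", "banana", "cherry")

# Vectorized: one result per element
paste("fruit:", fruits)
# [1] "fruit: apple"  "fruit: banana" "fruit: cherry"

# collapse joins all elements into a single string
paste(fruits, collapse = ", ")
# [1] "apple, banana, cherry"

# sep joins the arguments element by element, then collapse joins the results
paste(seq_along(fruits), fruits, sep = ". ", collapse = "; ")
# [1] "1. apple; 2. banana; 3. cherry"
```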

Extracting Substrings

The substr() function extracts portions of a string:

text <- "R programming"
substr(text, 1, 9)
# [1] "R program"

# Replace in place
substr(text, 1, 1) <- "r"
text
# [1] "r programming"
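Because substr() takes start and stop positions, combining it with nchar() is one way to extract text counted from the end of a string. A sketch using made-up file names:

```r
filenames <- c("report.csv", "notes.txt", "plot.png")

# Last three characters: start at nchar - 2, stop at nchar
n <- nchar(filenames)
substr(filenames, n - 2, n)
# [1] "csv" "txt" "png"

# Everything except the four-character suffix (".csv", ".txt", ".png")
substr(filenames, 1, n - 4)
# [1] "report" "notes"  "plot"
```

This works here only because every suffix has the same length; for variable-length extensions a pattern-based function such as sub() is a better fit.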

Case Conversion

Convert between uppercase and lowercase:

text <- "Hello World"

toupper(text)
# [1] "HELLO WORLD"

tolower(text)
# [1] "hello world"
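A common use of case conversion is normalizing text before comparing it. The sketch below uses illustrative answers and words vectors, and also combines toupper() with substr() to capitalize a first letter:

```r
answers <- c("Yes", "YES", "yes", "No")

# Normalize case first so "Yes", "YES" and "yes" all compare equal
tolower(answers) == "yes"
# [1]  TRUE  TRUE  TRUE FALSE

# Capitalize the first letter of each word
words <- c("alice", "bob")
paste0(toupper(substr(words, 1, 1)), substr(words, 2, nchar(words)))
# [1] "Alice" "Bob"
```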

Finding and Replacing Patterns

The grep() family searches for patterns, which are treated as regular expressions by default, and gsub() replaces every match:

colors <- c("red", "blue", "green", "redapple")

# Find matches (returns indices)
grep("red", colors)
# [1] 1 4

# Returns TRUE/FALSE
grepl("red", colors)
# [1]  TRUE FALSE FALSE  TRUE

# Replace patterns
gsub("red", "RED", colors)
# [1] "RED"      "blue"     "green"    "REDapple"
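A few arguments change how these functions match. The following sketch, with a slightly modified colors vector, shows value, ignore.case and fixed, plus the difference between sub() and gsub():

```r
colors <- c("red", "Blue", "green", "redapple")

# value = TRUE returns the matching strings instead of their indices
grep("red", colors, value = TRUE)
# [1] "red"      "redapple"

# ignore.case = TRUE matches regardless of case
grepl("blue", colors, ignore.case = TRUE)
# [1] FALSE  TRUE FALSE FALSE

# sub() replaces only the first match in each string; gsub() replaces all
sub("e", "E", "green")
# [1] "grEen"
gsub("e", "E", "green")
# [1] "grEEn"

# "." is a regex wildcard; fixed = TRUE treats the pattern literally
gsub(".", "-", "a.b")
# [1] "---"
gsub(".", "-", "a.b", fixed = TRUE)
# [1] "a-b"
```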

Introduction to Factors

Factors are R’s way of storing categorical data. Internally, each value is stored as an integer code that points into a vector of level labels. This representation is compact and gives every variable a fixed, known set of categories, which is what statistical modeling functions rely on.

Creating a Factor

Convert a character vector to a factor using factor():

# Simple factor
gender <- factor(c("male", "female", "male", "female"))
print(gender)
# [1] male   female male   female
# Levels: female male

# Check it's a factor
class(gender)
# [1] "factor"
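To see the integer codes behind a factor, you can inspect it with typeof() and as.integer(). A minimal sketch that rebuilds the same gender factor:

```r
gender <- factor(c("male", "female", "male", "female"))

# The underlying storage type
typeof(gender)
# [1] "integer"

# The codes are positions in levels(gender)
as.integer(gender)
# [1] 2 1 2 1
levels(gender)
# [1] "female" "male"

# Indexing the levels by the codes reconstructs the labels
levels(gender)[as.integer(gender)]
# [1] "male"   "female" "male"   "female"
```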

Understanding Factor Levels

Levels are the unique values in your categorical data:

colors <- factor(c("red", "blue", "green", "blue", "red"))

# Get the levels
levels(colors)
# [1] "blue"  "green" "red"

# Number of levels
nlevels(colors)
# [1] 3
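Levels belong to the factor rather than to the values currently in it, so they survive subsetting. The sketch below (no_green is an illustrative name) counts values with table() and removes unused levels with droplevels():

```r
colors <- factor(c("red", "blue", "green", "blue", "red"))

# table() counts observations per level
table(colors)
# colors
#  blue green   red 
#     2     1     2 

# Subsetting keeps all levels, even ones with no remaining observations
no_green <- colors[colors != "green"]
nlevels(no_green)
# [1] 3

# droplevels() removes the unused ones
nlevels(droplevels(no_green))
# [1] 2
```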

Specifying Level Order

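By default, levels are sorted alphabetically, and the exact order depends on your locale. Use the levels argument to control the order:

# Default: sorted (output shown for a typical English locale;
# in the C locale, uppercase sorts first, so "PhD" would lead)
education <- factor(c("high school", "bachelor", "master", "PhD"))
levels(education)
# [1] "bachelor"    "high school" "master"      "PhD"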

# Custom order (important for modeling and visualization)
education <- factor(c("high school", "bachelor", "master", "PhD"), 
                    levels = c("high school", "bachelor", "master", "PhD"))
levels(education)
# [1] "high school" "bachelor"    "master"      "PhD"
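Setting levels controls display and modeling order, but comparisons such as < only work on ordered factors. A sketch, assuming an illustrative sizes variable and the ordered = TRUE argument of factor():

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

sizes
# [1] small  large  medium small 
# Levels: small < medium < large

# Comparisons follow the level order, not the alphabet
sizes < "large"
# [1]  TRUE FALSE  TRUE  TRUE

# max() and min() also respect the level order
as.character(max(sizes))
# [1] "large"
```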

Converting Between Strings and Factors

Understanding how to convert between data types is essential:

String to Factor

# Create character vector
colors <- c("red", "blue", "green", "blue", "red")

# Convert to factor
color_factor <- factor(colors)
color_factor
# [1] red   blue  green blue  red  
# Levels: blue green red

Factor to String

# Convert factor back to character
as.character(color_factor)
# [1] "red"   "blue"  "green" "blue"  "red"

# Caution: as.numeric() on a factor returns the internal integer codes
as.numeric(color_factor)
# [1] 3 1 2 1 3 (level positions, not values)

# This matters most when the labels look like numbers
scores <- factor(c("10", "20", "10", "30"))
as.numeric(scores)
# [1] 1 2 1 3 (codes, not the scores)

# Go through character first to recover the actual values
as.numeric(as.character(scores))
# [1] 10 20 10 30
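R's help page for factor() also describes an equivalent idiom for factors whose labels are numbers: convert the levels once, then index them by the factor. A sketch with an illustrative readings factor:

```r
readings <- factor(c("10", "20", "10", "30"))

# Convert the three levels to numbers, then index by the factor's codes
as.numeric(levels(readings))[readings]
# [1] 10 20 10 30

# Same result as converting through character
identical(as.numeric(levels(readings))[readings],
          as.numeric(as.character(readings)))
# [1] TRUE
```

The levels-based form converts each distinct label only once, which can help on long vectors with few levels.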

Working with Factors in Data Frames

Before R 4.0.0, read.csv() and read.table() converted string columns to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so strings stay as character vectors unless you ask for factors. You can set the behavior explicitly:

# Keep strings as characters (the default since R 4.0.0)
df <- read.csv("data.csv", stringsAsFactors = FALSE)

# Convert specific columns to factors afterwards
df$category <- factor(df$category)
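To try this without a file on disk, read.csv() accepts inline data through its text argument. The sketch below uses made-up CSV content and sets stringsAsFactors explicitly, so the result does not depend on the R version:

```r
# Made-up CSV content for illustration
csv_text <- "name,category\nAlice,a\nBob,b\nCara,a"

# Strings stay as character vectors
df_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df_chr$category)
# [1] "character"

# Strings become factors
df_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(df_fct$category)
# [1] "factor"
levels(df_fct$category)
# [1] "a" "b"
```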

Modifying Factor Levels

Replace or consolidate levels after creation:

# Create sample data
survey <- factor(c("yes", "no", "yes", "maybe", "yes"))

# Rename levels (new names follow the existing order: "maybe", "no", "yes")
levels(survey) <- c("Unsure", "No", "Yes")
survey
# [1] Yes    No     Yes    Unsure Yes
# Levels: Unsure No Yes

# Combine levels
survey2 <- factor(c("low", "medium", "high", "medium", "low"))
levels(survey2) <- list(Low = "low", High = c("medium", "high"))
survey2
# [1] Low  High High High Low 
# Levels: Low High
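A related pitfall: a factor only accepts values that are already among its levels. The sketch below, using an illustrative responses factor, shows the resulting NA and one way to add a level first:

```r
responses <- factor(c("yes", "no", "yes"))

# Assigning a value that is not a level yields NA (R also emits a warning)
attempt <- responses
suppressWarnings(attempt[2] <- "maybe")
is.na(attempt[2])
# [1] TRUE

# Add the level first, then assign
levels(responses) <- c(levels(responses), "maybe")
responses[2] <- "maybe"
responses
# [1] yes   maybe yes  
# Levels: no yes maybe
```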

Best Practices for Strings and Factors

Follow these guidelines for cleaner, more maintainable code:

  1. Keep strings as characters when reading data; this is the default since R 4.0.0, and older versions need stringsAsFactors = FALSE
  2. Create factors with explicit levels when order matters
  3. Use forcats package for advanced factor manipulation (part of tidyverse)
  4. Check data types with class(), is.factor(), and is.character()
  5. Convert factors to characters before any string operations

# Good practice example
library(dplyr)

# Create a tibble with proper factor handling
df <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  department = factor(c("Sales", "Engineering", "Sales"), 
                      levels = c("Sales", "Engineering", "Marketing"))
)

# Reorder for visualization
df %>% 
  mutate(department = forcats::fct_relevel(department, "Engineering", after = 0))
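If forcats is not available, base R can do a similar reordering. This is a sketch of an alternative, not a drop-in replacement for the pipeline above; it uses relevel() from the stats package and a standalone department factor:

```r
department <- factor(c("Sales", "Engineering", "Sales"),
                     levels = c("Sales", "Engineering", "Marketing"))

# relevel() moves one level to the front (the reference level in models)
levels(relevel(department, ref = "Engineering"))
# [1] "Engineering" "Sales"       "Marketing"

# For a full reordering, rebuild the factor with the levels in the desired order
levels(factor(department, levels = c("Marketing", "Engineering", "Sales")))
# [1] "Marketing"   "Engineering" "Sales"
```

In both cases the values themselves are unchanged; only the order of the levels differs.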

Summary

You’ve learned how to create and manipulate strings in R, including concatenation, substring extraction, and pattern matching. You’ve also seen how to create, order, and convert factors, R’s type for categorical data. Key takeaways:

  • Strings are text data created with quotes (" or ')
  • Use paste() for concatenation, nchar() for length
  • grep(), grepl(), and gsub() handle pattern matching
  • Factors store categorical data efficiently as integers with level labels
  • Always specify factor levels explicitly when order matters
  • Convert factors to characters before string operations

These skills form the foundation for text processing and categorical data analysis in R. In the next tutorial, you’ll learn about error handling to make your R code more robust.