Strings and Factors in R
Strings and factors are fundamental data types in R that you’ll use constantly in data analysis. Strings store text data, while factors handle categorical variables, which are essential for statistical modeling and data visualization. This tutorial covers how to create, inspect, modify, and convert both types using base R.
Creating Strings in R
Strings in R are created using quotation marks. You can use either double quotes (") or single quotes ('):
# Double quotes
name <- "Alice"
print(name)
# Single quotes
city <- 'London'
print(city)
# String with quotes inside
quote <- "She said \"hello\""
print(quote)
Both forms produce the same string; the only practical difference is which quote character has to be escaped inside it. Double quotes are more common in R code, especially when working with the tidyverse ecosystem.
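As a small sketch of the escaping point above (the variable names are illustrative, not from the original), the same text can be written with either quote style, and print() shows the escaped form while writeLines() shows the characters the string actually contains:

```r
# Two spellings of the same string; only the escaping differs
escaped <- "She said \"hello\""
unescaped <- 'She said "hello"'
identical(escaped, unescaped)
# [1] TRUE

print(escaped)       # print() shows the backslash escapes
# [1] "She said \"hello\""
writeLines(escaped)  # writeLines() shows the raw text
# She said "hello"
```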
Checking String Length
The nchar() function returns the number of characters in a string:
text <- "Hello, World!"
nchar(text)
# [1] 13
# Works with vectors too
words <- c("apple", "banana", "cherry")
nchar(words)
# [1] 5 6 6
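Characters and bytes are not always the same thing. The sketch below assumes a UTF-8 session and uses a Unicode escape so the source stays plain ASCII; the variable name is illustrative.

```r
# "\u00e9" is the Unicode escape for an accented e
cafe <- "caf\u00e9"
nchar(cafe)                  # counts characters (the default)
# [1] 4
nchar(cafe, type = "bytes")  # counts bytes; 5 assumes UTF-8 storage
# [1] 5

# A missing string has no length
nchar(NA_character_)
# [1] NA
```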
String Manipulation with Base R
Base R provides several functions for working with strings. The stringr package offers a more consistent interface, but the base functions need no extra dependencies and appear throughout existing R code, so they are worth knowing.
Concatenating Strings
Use paste() to combine strings:
first_name <- "John"
last_name <- "Doe"
# Default: separated by space
paste(first_name, last_name)
# [1] "John Doe"
# Custom separator
paste(first_name, last_name, sep = "_")
# [1] "John_Doe"
# paste0: no separator
paste0(first_name, last_name)
# [1] "JohnDoe"
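paste() and paste0() are also vectorized, and the collapse argument joins a whole vector into one string. A brief sketch, with illustrative variable names:

```r
fruits <- c("apple", "banana", "cherry")

# Vectorized: the prefix is recycled across each element
labels <- paste0("item_", 1:3)
labels
# [1] "item_1" "item_2" "item_3"

# collapse joins the elements of a vector into a single string
joined <- paste(fruits, collapse = ", ")
joined
# [1] "apple, banana, cherry"
```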
Extracting Substrings
The substr() function extracts portions of a string:
text <- "R programming"
substr(text, 1, 9)
# [1] "R program"
# Replace in place
substr(text, 1, 1) <- "r"
text
# [1] "r programming"
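A few further behaviors of substr() are worth seeing in one place. This is a minimal sketch with illustrative variable names: substr() works on vectors, substring() defaults its end position to the end of the string, and replacement never changes a string's length.

```r
words <- c("apple", "banana", "cherry")

# substr() is vectorized over its input
prefixes <- substr(words, 1, 3)
prefixes
# [1] "app" "ban" "che"

# substring() runs to the end of the string when no end is given
rest <- substring("R programming", 3)
rest
# [1] "programming"

# Replacement keeps the original length: extra characters are dropped
code <- "abcdef"
substr(code, 1, 2) <- "XYZ"
code
# [1] "XYcdef"
```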
Case Conversion
Convert between uppercase and lowercase:
text <- "Hello World"
toupper(text)
# [1] "HELLO WORLD"
tolower(text)
# [1] "hello world"
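Two common uses of case conversion are case-insensitive comparison and capitalizing a first letter. The sketch below shows both; capitalize() is a hypothetical helper written for this example, not a base R function.

```r
# Case-insensitive comparison: normalize both sides first
input <- "LONDON"
city <- "London"
tolower(input) == tolower(city)
# [1] TRUE

# Hypothetical helper: uppercase the first character, keep the rest
capitalize <- function(x) {
  paste0(toupper(substr(x, 1, 1)), substring(x, 2))
}
capitalize(c("alice", "bob"))
# [1] "Alice" "Bob"
```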
Finding and Replacing Patterns
The grep() family searches for patterns:
colors <- c("red", "blue", "green", "redapple")
# Find matches (returns indices)
grep("red", colors)
# [1] 1 4
# Returns TRUE/FALSE
grepl("red", colors)
# [1] TRUE FALSE FALSE TRUE
# Replace patterns
gsub("red", "RED", colors)
# [1] "RED" "blue" "green" "REDapple"
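The same family has a few options that come up often. The following sketch reuses the colors vector from above and shows value = TRUE, anchors, the difference between sub() and gsub(), and fixed = TRUE for literal matching:

```r
colors <- c("red", "blue", "green", "redapple")

# value = TRUE returns the matching elements instead of their indices
matches <- grep("red", colors, value = TRUE)
matches
# [1] "red" "redapple"

# Anchors restrict the match to the whole string
grepl("^red$", colors)
# [1] TRUE FALSE FALSE FALSE

# sub() replaces only the first match; gsub() replaces all of them
sub("a", "A", "banana")
# [1] "bAnana"
gsub("a", "A", "banana")
# [1] "bAnAnA"

# fixed = TRUE treats the pattern as literal text, not a regular expression
grepl(".", "abc", fixed = TRUE)
# [1] FALSE
grepl(".", "abc")
# [1] TRUE
```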
Introduction to Factors
Factors are R’s way of storing categorical data. They’re stored as integers internally, with corresponding level labels. Modeling functions such as lm() use the levels to build indicator variables, and plotting functions use them to order categories.
Creating a Factor
Convert a character vector to a factor using factor():
# Simple factor
gender <- factor(c("male", "female", "male", "female"))
print(gender)
# [1] male female male female
# Levels: female male
# Check it's a factor
class(gender)
# [1] "factor"
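To see the integer storage mentioned earlier, you can inspect a factor's underlying codes. A minimal sketch using the same gender factor:

```r
gender <- factor(c("male", "female", "male", "female"))

# The underlying storage type is integer
typeof(gender)
# [1] "integer"

# Each element is an index into the levels
codes <- as.integer(gender)
codes
# [1] 2 1 2 1
levels(gender)
# [1] "female" "male"

# Indexing the levels by the codes recovers the labels
levels(gender)[codes]
# [1] "male" "female" "male" "female"
```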
Understanding Factor Levels
Levels are the unique values in your categorical data:
colors <- factor(c("red", "blue", "green", "blue", "red"))
# Get the levels
levels(colors)
# [1] "blue" "green" "red"
# Number of levels
nlevels(colors)
# [1] 3
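Levels belong to the factor, not to the values currently present, so subsetting can leave unused levels behind. The sketch below (variable names are illustrative) shows this and how droplevels() removes them:

```r
colors <- factor(c("red", "blue", "green", "blue", "red"))

# Count observations per level
counts <- table(colors)
counts[["blue"]]
# [1] 2

# Subsetting keeps all original levels, even unused ones
no_green <- colors[colors != "green"]
nlevels(no_green)
# [1] 3

# droplevels() removes levels that no longer occur
trimmed <- droplevels(no_green)
levels(trimmed)
# [1] "blue" "red"
```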
Specifying Level Order
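By default, levels are sorted alphabetically according to the current locale. Use the levels argument to control the order:
# Default: sorted (locale-dependent)
education <- factor(c("high school", "bachelor", "master", "PhD"))
levels(education)
# [1] "bachelor" "high school" "master" "PhD"
# (in the C locale, uppercase sorts first: "PhD" "bachelor" "high school" "master")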
# Custom order (important for modeling and visualization)
education <- factor(c("high school", "bachelor", "master", "PhD"),
levels = c("high school", "bachelor", "master", "PhD"))
levels(education)
# [1] "high school" "bachelor" "master" "PhD"
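When the order carries meaning, as it does for education, you can additionally mark the factor as ordered so that comparisons work. This is a sketch of that option, not something the example above requires:

```r
education <- factor(
  c("high school", "bachelor", "master", "PhD"),
  levels = c("high school", "bachelor", "master", "PhD"),
  ordered = TRUE
)

is.ordered(education)
# [1] TRUE

# Comparisons follow the level order, not alphabetical order
at_least_master <- education >= "master"
at_least_master
# [1] FALSE FALSE TRUE TRUE

# max() returns the highest level present
as.character(max(education))
# [1] "PhD"
```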
Converting Between Strings and Factors
Understanding how to convert between data types is essential:
String to Factor
# Create character vector
colors <- c("red", "blue", "green", "blue", "red")
# Convert to factor
color_factor <- factor(colors)
color_factor
# [1] red blue green blue red
# Levels: blue green red
Factor to String
# Convert factor back to character
as.character(color_factor)
# [1] "red" "blue" "green" "blue" "red"
# Important: as.numeric() on a factor returns the internal integer codes,
# not the labels:
as.numeric(color_factor)
# [1] 3 1 2 1 3 (these are level positions, not values!)
# as.numeric(as.character(f)) recovers numbers only when the labels are
# themselves numeric text such as "10" or "20". Here the labels are words,
# so the result is all NA, with a coercion warning:
as.numeric(as.character(color_factor))
# [1] NA NA NA NA NA (because "red" can't become numeric)
# For text labels, as.character() is what you usually want:
as.character(color_factor)
# [1] "red" "blue" "green" "blue" "red"
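The as.numeric() pitfall matters most when the factor's labels look like numbers. A minimal sketch, using a made-up scores factor:

```r
scores <- factor(c("10", "20", "10", "30"))

# Wrong: returns the level positions
as.numeric(scores)
# [1] 1 2 1 3

# Right: go through the labels
via_character <- as.numeric(as.character(scores))
via_character
# [1] 10 20 10 30

# Equivalent: convert the levels once, then index by the factor
via_levels <- as.numeric(levels(scores))[scores]
via_levels
# [1] 10 20 10 30
```

The second form converts each distinct label only once, which can matter for long vectors with few levels.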
Working with Factors in Data Frames
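Before R 4.0.0, read.csv() and read.table() converted string columns to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so strings stay as character vectors unless you ask for factors. You can set the behavior explicitly:
# Keep strings as character vectors (explicit, so it behaves the same on any R version)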
df <- read.csv("data.csv", stringsAsFactors = FALSE)
# Or convert specific columns later
df$category <- factor(df$category)
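To try this without a file on disk, read.csv() accepts inline data through its text argument. The CSV content below is made up for illustration:

```r
csv_text <- "name,category\nA,x\nB,y\nC,x"

# Strings stay as character vectors
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_chr$category)
# [1] "character"

# Strings become factors
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fct$category)
# [1] "factor"
levels(as_fct$category)
# [1] "x" "y"
```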
Modifying Factor Levels
Replace or consolidate levels after creation:
# Create sample data
survey <- factor(c("yes", "no", "yes", "maybe", "yes"))
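# Rename levels by name (the default level order is "maybe", "no", "yes",
# so assigning a plain vector by position would mislabel the values)
levels(survey) <- list(No = "no", Yes = "yes", Unsure = "maybe")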
survey
# [1] Yes No Yes Unsure Yes
# Levels: No Yes Unsure
# Combine levels
survey2 <- factor(c("low", "medium", "high", "medium", "low"))
levels(survey2) <- list(Low = "low", High = c("medium", "high"))
survey2
# [1] Low High High High Low
# Levels: Low High
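Assigning a plain character vector to levels() relabels by position, so it depends on the current level order. One way to avoid that dependency, sketched here with illustrative data, is to select the level to rename by its current name:

```r
survey <- factor(c("yes", "no", "yes", "maybe", "yes"))
levels(survey)
# [1] "maybe" "no" "yes"

# Rename one level by matching its current label
levels(survey)[levels(survey) == "maybe"] <- "unsure"
levels(survey)
# [1] "unsure" "no" "yes"

# The observations follow the relabeling
as.character(survey)
# [1] "yes" "no" "yes" "unsure" "yes"
```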
Best Practices for Strings and Factors
Follow these guidelines for cleaner, more maintainable code:
- Since R 4.0.0, stringsAsFactors = FALSE is the default when reading data; pass it explicitly if your code must also run on older versions
- Create factors with explicit levels when order matters
- Use the forcats package for advanced factor manipulation (part of tidyverse)
- Check data types with class(), is.factor(), and is.character()
- Convert factors to characters before any string operations
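The last two guidelines can be sketched together. Some string functions, nchar() among them, reject factors outright, so checking the type and converting first avoids the error. The dept factor below is illustrative:

```r
dept <- factor(c("Sales", "Engineering", "Sales"))

# Check the type before applying string operations
is.factor(dept)
# [1] TRUE
is.character(dept)
# [1] FALSE

# nchar() signals an error when given a factor; capture it to show that
outcome <- tryCatch(nchar(dept), error = function(e) "error")
outcome
# [1] "error"

# Convert to character first
lengths_ok <- nchar(as.character(dept))
lengths_ok
# [1] 5 11 5
```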
# Good practice example
library(dplyr)
# Create a tibble with proper factor handling
df <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  department = factor(c("Sales", "Engineering", "Sales"),
                      levels = c("Sales", "Engineering", "Marketing"))
)
# Reorder for visualization
df %>%
  mutate(department = forcats::fct_relevel(department, "Engineering", after = 0))
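If you would rather not depend on the tidyverse for this step, base R's relevel() moves one level to the front. This is an alternative sketch rather than a replacement for the forcats call above:

```r
department <- factor(
  c("Sales", "Engineering", "Sales"),
  levels = c("Sales", "Engineering", "Marketing")
)

# Move "Engineering" to the first position; the other levels keep their order
reordered <- relevel(department, ref = "Engineering")
levels(reordered)
# [1] "Engineering" "Sales" "Marketing"

# The observations themselves are unchanged
as.character(reordered)
# [1] "Sales" "Engineering" "Sales"
```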
Summary
You’ve learned how to create and manipulate strings in R, including concatenation, substring extraction, and pattern matching. You’ve also mastered factors — R’s powerful way of handling categorical data. Key takeaways:
- Strings are text data created with quotes (" or ')
- Use paste() for concatenation, nchar() for length
- grep(), grepl(), and gsub() handle pattern matching
- Factors store categorical data efficiently as integers with level labels
- Always specify factor levels explicitly when order matters
- Convert factors to characters before string operations
These skills form the foundation for text processing and categorical data analysis in R. In the next tutorial, you’ll learn about error handling to make your R code more robust.