Text Manipulation with stringr
Introduction
When you need to clean, transform, or search through text in R, the stringr package is the tool to reach for. It ships as part of the tidyverse, so when you run library(tidyverse) you already have it loaded. stringr wraps the powerful stringi package with a simpler, more consistent API that fits naturally into tidyverse workflows.
The main thing to know about stringr is its naming convention: every function starts with str_. The first argument is always the input string, and most functions accept regular expression patterns as well as plain text. This consistency means once you learn one function, the others feel familiar.
Basic String Operations
Concatenating with str_c()
str_c() combines character vectors into single strings. It behaves like paste0() but follows tidyverse conventions: a missing input yields NA rather than having the text "NA" pasted into the result.
str_c("Hello", "World", sep = " ")
# [1] "Hello World"
str_c("file", 1:3, sep = "_")
# [1] "file_1" "file_2" "file_3"
str_c(c("a", "b"), collapse = "-")
# [1] "a-b"
The sep argument controls what goes between the arguments as they are combined element by element (a length-one input such as "file" is recycled). The collapse argument then joins the resulting elements into a single string, using the separator you provide. This distinction matters: sep is element-wise, collapse is for final aggregation.
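To see the two arguments working together, here is a minimal sketch that reuses the file example above. The variable names labels and joined are only for illustration.

```r
library(stringr)

# sep is applied first, element by element.
labels <- str_c("file", 1:3, sep = "_")
labels
# [1] "file_1" "file_2" "file_3"

# collapse then joins those elements into one string.
joined <- str_c("file", 1:3, sep = "_", collapse = ", ")
joined
# [1] "file_1, file_2, file_3"
```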
Measuring Length with str_length()
str_length() returns the number of characters in each string. Missing values stay missing: str_length(NA) returns NA, whereas base R’s nchar(NA) returns 2, because a bare logical NA is coerced to the text "NA" before counting.
str_length(c("hello", "world", NA))
# [1] 5 5 NA
This behavior matters when you are working with real-world data where missing values are common. nchar() gives different answers for NA depending on the input type and its type and keepNA arguments, so a count of 2 can slip through where you expected NA.
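A quick side-by-side makes the difference concrete. This is a small sketch; base_len and tidy_len are illustrative names.

```r
library(stringr)

# A bare NA is logical; nchar() counts it as the two characters "NA".
base_len <- nchar(NA)
base_len
# [1] 2

# str_length() keeps the value missing.
tidy_len <- str_length(NA)
tidy_len
# [1] NA
```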
Extracting Substrings with str_sub()
str_sub() extracts or replaces portions of a string using start and end positions. Negative indices count backwards from the end of the string.
str_sub("stringr", 1, 3)
# [1] "str"
str_sub("stringr", -3, -1)
# [1] "ngr"
# Replacement works by assigning into a substring of a variable
# (assigning to a string literal is an error)
x <- "stringr"
str_sub(x, 1, 1) <- "S"
x
# [1] "Stringr"
The replacement syntax is handy when you need to modify part of a string in place, such as correcting a typo or standardizing a format.
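The replacement form is vectorized as well. The sketch below uses made-up product codes (the codes vector is hypothetical) and upper-cases the two-letter prefix of each one.

```r
library(stringr)

# Hypothetical product codes whose two-letter prefix should be upper case.
codes <- c("ab-123", "cd-456")

# Assigning into str_sub() rewrites characters 1 to 2 of every element.
str_sub(codes, 1, 2) <- str_to_upper(str_sub(codes, 1, 2))
codes
# [1] "AB-123" "CD-456"
```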
Changing Case
Case conversion functions make it easy to standardize text for comparison or display.
str_to_upper("Hello World")
# [1] "HELLO WORLD"
str_to_lower("Hello World")
# [1] "hello world"
str_to_title("hello world from r")
# [1] "Hello World From R"
The locale argument lets you handle language-specific rules. The Turkish locale is the classic example — it has different case mappings for the letter “i”, which matters if you are working with Turkish text.
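As a short illustration of the Turkish case, the sketch below compares the default rules with locale = "tr". The variable names are illustrative only.

```r
library(stringr)

# Default (English) rules: "i" upper-cases to a plain "I".
upper_en <- str_to_upper("i")

# Turkish rules: "i" upper-cases to dotted "İ" (U+0130),
# and "I" lower-cases to dotless "ı" (U+0131).
upper_tr <- str_to_upper("i", locale = "tr")
lower_tr <- str_to_lower("I", locale = "tr")

c(upper_en, upper_tr, lower_tr)
# [1] "I" "İ" "ı"
```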
Trimming and Padding
Removing Whitespace with str_trim()
Raw data imports often contain leading or trailing whitespace that can cause matching problems. str_trim() removes it.
str_trim(" hello ")
# [1] "hello"
str_trim(" hello ", side = "left")
# [1] "hello "
Adding Padding with str_pad()
str_pad() adds characters to reach a minimum width, which is useful for formatting tables or fixed-width output.
str_pad("hello", width = 10, side = "left", pad = " ")
# [1] "     hello"
str_pad("hello", width = 10, side = "right", pad = "-")
# [1] "hello-----"
str_pad(1:3, width = 3, side = "left", pad = "0")
# [1] "001" "002" "003"
These two functions are opposites in a sense — str_trim() removes unwanted characters, while str_pad() adds them. Both are essential when preparing data for export or matching against fixed-width formats.
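The two often appear together. Here is a minimal sketch that assumes a column of messy numeric IDs (raw_ids is invented for the example) which should end up as zero-padded, five-character strings.

```r
library(stringr)

# Hypothetical IDs with stray whitespace from an import.
raw_ids <- c(" 7", "42 ", " 105 ")

# Trim first, then pad on the left with zeros to a fixed width.
clean_ids <- str_pad(str_trim(raw_ids), width = 5, side = "left", pad = "0")
clean_ids
# [1] "00007" "00042" "00105"
```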
Detecting and Matching Patterns
Pattern matching is where stringr really shows its value. Functions in this group accept regular expressions, making them far more powerful than simple exact-match operations.
Detecting Patterns with str_detect()
str_detect() returns TRUE or FALSE for each string depending on whether the pattern matches, and NA where the input is missing. That logical vector plugs directly into dplyr::filter(), which keeps the TRUE rows and drops both FALSE and NA.
emails <- c("alice@example.com", "bob@test.org", "invalid-email", NA)
str_detect(emails, "@")
# [1] TRUE TRUE FALSE NA
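Used inside dplyr::filter(), the same check might look like the sketch below. The contacts data frame and its id column are assumptions made for the example.

```r
library(stringr)
library(dplyr)

# Hypothetical contact list built from the emails above.
contacts <- data.frame(
  id = 1:4,
  email = c("alice@example.com", "bob@test.org", "invalid-email", NA)
)

# filter() keeps rows where the condition is TRUE;
# rows where it is FALSE or NA are dropped.
with_at <- filter(contacts, str_detect(email, "@"))
with_at$id
# [1] 1 2
```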
Counting Occurrences with str_count()
str_count() tells you how many times a pattern appears in each string.
str_count(c("banana", "apple", "cherry"), "a")
# [1] 3 1 0
Extracting Matches with str_extract()
str_extract() pulls out the first match from each string. When you need every match, use str_extract_all(), which returns a list with one character vector per input string.
str_extract(c("abc123def", "hello456"), "[0-9]+")
# [1] "123" "456"
str_extract_all(c("abc123def", "a1b2c3"), "[a-z]")
# [[1]] "a" "b" "c" "d" "e" "f"
# [[2]] "a" "b" "c"
Replacing with str_replace() and str_replace_all()
str_replace() swaps the first match, while str_replace_all() replaces every match.
str_replace("aaa bbb", "a", "X")
# [1] "Xaa bbb"
str_replace_all("aaa bbb", "a", "X")
# [1] "XXX bbb"
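Because the pattern is a regular expression, the replacement can also refer back to captured groups. The sketch below reorders a date this way; the date value is made up for illustration.

```r
library(stringr)

iso_date <- "2024-01-15"

# \\1, \\2 and \\3 refer to the three parenthesised groups in the pattern.
eu_date <- str_replace(iso_date, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
eu_date
# [1] "15/01/2024"
```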
Splitting with str_split()
str_split() breaks strings at each match and returns a list of character vectors.
str_split("a-b-c-d", "-")
# [[1]] "a" "b" "c" "d"
str_split("one two three", " ", n = 2)
# [[1]] "one" "two three"
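The n argument caps the number of pieces returned, not the number of splits: with n = 2 the string is split once and the remainder is kept whole in the last piece. This is useful when you only need the first few parts.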
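When every string should yield the same number of pieces, simplify = TRUE returns a character matrix instead of a list. A small sketch with invented inputs:

```r
library(stringr)

paths <- c("a-b-c", "d-e")

# At most two pieces per string, returned as a matrix.
pieces <- str_split(paths, "-", n = 2, simplify = TRUE)
pieces
#      [,1] [,2]
# [1,] "a"  "b-c"
# [2,] "d"  "e"
```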
String Interpolation with str_glue()
str_glue() inserts R expressions directly into strings using {...} placeholders, similar to f-strings in Python.
name <- "Alice"
score <- 95
str_glue("Hello {name}, your score is {score}.")
# Hello Alice, your score is 95.
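The placeholders are not limited to bare variable names; any R expression works. In the sketch below, max_score is a hypothetical value introduced only for the example.

```r
library(stringr)

name <- "Alice"
score <- 95
max_score <- 120  # hypothetical maximum, not part of the example above

# Each {...} is evaluated as R code before being inserted.
msg <- str_glue("{str_to_upper(name)} scored {round(100 * score / max_score)}%.")
msg
# ALICE scored 79%.
```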
str_glue_data() takes a data frame (or list) as its first argument and fills the placeholders from its columns.
df <- data.frame(first = c("John", "Jane"), last = c("Doe", "Smith"))
dplyr::mutate(df, full_name = str_glue_data(df, "{first} {last}"))
# first last full_name
# 1 John Doe John Doe
# 2 Jane Smith Jane Smith
Here the data frame is passed to str_glue_data() explicitly. In a magrittr pipeline such as df %>% mutate(...), the . placeholder stands for the whole data frame on the left of the pipe, not for an individual row; you still get one string per row because the placeholders are evaluated on entire columns.
Pattern Modifiers
By default, stringr treats patterns as regular expressions. Three modifier functions change this behavior.
fixed() matches literally, ignoring any regex metacharacters. This is faster for simple strings.
str_extract(c("a.b", "a*b"), fixed("."))
# [1] "." NA
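The difference from the default regex interpretation shows up as soon as the pattern contains a metacharacter. A minimal comparison, with x as an illustrative input:

```r
library(stringr)

x <- c("a.b", "axb")

# As a regex, "." matches any character, so both strings match.
as_regex <- str_detect(x, ".")
as_regex
# [1] TRUE TRUE

# With fixed(), only a literal dot matches.
as_fixed <- str_detect(x, fixed("."))
as_fixed
# [1]  TRUE FALSE
```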
boundary() matches on word, line, or character boundaries, which is useful for precise tokenization.
str_extract_all("hello world! how are you?", boundary("word"))
# [[1]] "hello" "world" "how" "are" "you"
regex() gives you the standard regex behavior with options like ignore_case = TRUE.
str_detect("Hello", regex("hello", ignore_case = TRUE))
# [1] TRUE
stringr uses ICU regular expressions under the hood, which support Unicode properties like \p{L} for letters and \p{N} for numbers. Inside an R string literal the backslash has to be doubled, as in "\\p{L}".
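As a brief sketch of these Unicode properties in use (the sample text is invented, and "\u00ef" is the letter ï written as an escape):

```r
library(stringr)

text <- "R2-D2 na\u00efve"

# \\p{L}+ matches runs of letters, including accented ones.
letters_only <- str_extract_all(text, "\\p{L}+")[[1]]
letters_only
# [1] "R"     "D"     "naïve"

# \\p{N} matches any numeric character.
digit_count <- str_count(text, "\\p{N}")
digit_count
# [1] 2
```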
A Complete Example
Here is a realistic pipeline that brings several stringr functions together using the pipe operator and mutate().
library(tidyverse)

df <- tibble(
  name = c(" alice smith ", "BOB JONES", "Carol Davis"),
  email = c("alice@example.com", "bob@test.org", NA)
)

df %>%
  mutate(
    name_clean = str_trim(name),
    name_title = str_to_title(str_to_lower(name_clean)),
    has_email = str_detect(email, "@"),
    domain = str_extract(email, "(?<=@).+"),
    # first letter of the name plus the first character after the space
    initials = str_c(str_sub(name_title, 1, 1), str_extract(name_title, "(?<= )."), sep = "."),
    # str_glue() sees the columns created earlier in the same mutate()
    formatted = str_glue("{name_title} <{email}>")
  )
# A tibble: 3 × 8 (shown on a wide console; narrower consoles truncate columns)
#   name            email             name_clean  name_title  has_email domain      initials formatted
#   <chr>           <chr>             <chr>       <chr>       <lgl>     <chr>       <chr>    <glue>
# 1 " alice smith " alice@example.com alice smith Alice Smith TRUE      example.com A.S      Alice Smith <alice@example.com>
# 2 "BOB JONES"     bob@test.org      BOB JONES   Bob Jones   TRUE      test.org    B.J      Bob Jones <bob@test.org>
# 3 "Carol Davis"   NA                Carol Davis Carol Davis NA        NA          C.D      Carol Davis <NA>
This example shows a common pattern: cleaning a name field, normalizing case, checking for the presence of an email address, extracting the domain, building initials, and finally formatting a display name with the email. All of this happens inside one mutate() call, which is readable because each step builds on the previous one.
Handling Missing Values
One of stringr’s practical advantages over base R string functions is consistent NA handling. Almost all stringr functions propagate NA values without error, which keeps your pipelines running smoothly when data is incomplete. The notable exception is str_glue(), which inserts the text "NA" for a missing value.
str_c("hello", NA)
# [1] NA
str_length(c("hello", NA))
# [1] 5 NA
str_detect(c("apple", NA), "a")
# [1] TRUE NA
In base R, the result depends on the function: paste0("hello", NA) yields "helloNA" and nchar(NA) yields 2, so you have to handle missing data explicitly. The stringr functions shown above sidestep this by returning NA whenever the input is NA.
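When you do want a visible placeholder instead of a propagated NA, stringr provides str_replace_na(). The sketch below uses a made-up ids vector and an arbitrary "unknown" label.

```r
library(stringr)

ids <- c("7", NA)

# Default: the missing value propagates through str_c().
propagated <- str_c("ID-", ids)
propagated
# [1] "ID-7" NA

# str_replace_na() turns NA into ordinary text first.
labelled <- str_c("ID-", str_replace_na(ids, "unknown"))
labelled
# [1] "ID-7"       "ID-unknown"
```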
Conclusion
The stringr package gives you a consistent, readable set of tools for working with text in R. Its functions follow the str_ naming convention, handle NA values uniformly, and integrate smoothly with mutate() pipelines. You learned how to concatenate strings with str_c(), measure length with str_length(), extract substrings with str_sub(), convert case, trim and pad strings, detect and replace patterns with regular expressions, and interpolate values with str_glue(). These operations cover the vast majority of everyday text manipulation tasks in data analysis.
See Also
- tidyverse-intro — Installing and loading the tidyverse, of which stringr is a core member
- r-strings-and-factors — Base R string functions and when to prefer them over stringr
- functions-and-control-flow — Writing custom functions and control structures that use stringr inside pipelines