Text Manipulation with stringr

March 29, 2026 · 7 min read · Updated March 29, 2026 · beginner

r tidyverse stringr text-processing

Introduction

When you need to clean, transform, or search through text in R, the stringr package is the tool to reach for. It ships as part of the tidyverse, so when you run library(tidyverse) you already have it loaded. stringr wraps the powerful stringi package with a simpler, more consistent API that fits naturally into tidyverse workflows.

The main thing to know about stringr is its naming convention: every function starts with str_. With a few exceptions such as str_c() and str_glue(), the first argument is the input string, and most functions accept regular expression patterns as well as plain text. This consistency means once you learn one function, the others feel familiar.

Basic String Operations

Concatenating with `str_c()`

str_c() combines character vectors into single strings. It behaves like paste0() but follows tidyverse conventions: a missing input yields NA in the result instead of the literal text "NA".

str_c("Hello", "World", sep = " ")
# [1] "Hello World"

str_c("file", 1:3, sep = "_")
# [1] "file_1" "file_2" "file_3"

str_c(c("a", "b"), collapse = "-")
# [1] "a-b"

The sep argument controls what goes between the pieces that are combined element by element; a length-one input such as "file" is recycled to match the longer vector. The collapse argument then joins all resulting elements into a single string using the separator you provide. In short, sep works element-wise and collapse performs the final aggregation.
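To make the distinction concrete, here is a small sketch that uses both arguments in one call and contrasts the NA behavior with paste0(). The vector names are illustrative and not part of any dataset used elsewhere in this article.

```r
library(stringr)

# Hypothetical example vectors
prefix <- c("x", "y", "z")
index  <- 1:3

# sep joins element-wise, then collapse joins the results into one string
joined <- str_c(prefix, index, sep = "_", collapse = ", ")
joined
# [1] "x_1, y_2, z_3"

# str_c() propagates NA, while paste0() turns it into the text "NA"
with_na_stringr <- str_c("id_", c("a", NA))
with_na_base    <- paste0("id_", c("a", NA))
with_na_stringr
# [1] "id_a" NA
with_na_base
# [1] "id_a"  "id_NA"
```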

Measuring Length with `str_length()`

str_length() returns the number of characters in each string. Unlike base R’s nchar(), which returns 2 for a bare NA because the value is coerced to the text "NA", str_length() always returns NA for missing input.

str_length(c("hello", "world", NA))
# [1] 5 5 NA

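This behavior matters when you are working with real-world data where missing values are common. nchar() does return NA for a missing element inside a character vector, but nchar(NA) on its own returns 2, and nchar() refuses factors altogether. str_length() gives the same answer in all of these cases.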
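The difference is easiest to see side by side. The following sketch compares the two functions on a bare NA and then shows str_length() on a factor; the fruit vector is a made-up example.

```r
library(stringr)

# A bare NA is logical, so nchar() counts the two characters of "NA"
base_bare <- nchar(NA)
base_bare
# [1] 2

# str_length() returns a missing length instead
stringr_bare <- str_length(NA)
stringr_bare
# [1] NA

# nchar() rejects factors with an error, while str_length() measures the labels
fruit <- factor(c("apple", "fig"))
stringr_factor <- str_length(fruit)
stringr_factor
# [1] 5 3
```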

Extracting Substrings with `str_sub()`

str_sub() extracts or replaces portions of a string using start and end positions. Negative indices count backwards from the end of the string.

str_sub("stringr", 1, 3)
# [1] "str"

str_sub("stringr", -3, -1)
# [1] "ngr"

# Replacement works by assigning to a substring of a variable
x <- "stringr"
str_sub(x, 1, 1) <- "S"
x
# [1] "Stringr"

The replacement syntax is handy when you need to modify part of a string in place, such as correcting a typo or standardizing a format.
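The replacement form is vectorized, so it can standardize a whole column at once. This is a minimal sketch with hypothetical product codes; note that the target must be a variable, not a string literal.

```r
library(stringr)

# Hypothetical product codes
codes <- c("ab-001", "cd-002", "ef-003")

# Replace the first two characters of every element with their upper-case form
str_sub(codes, 1, 2) <- str_to_upper(str_sub(codes, 1, 2))
codes
# [1] "AB-001" "CD-002" "EF-003"

# Negative positions work for replacement too: overwrite the last character
str_sub(codes, -1, -1) <- "X"
codes
# [1] "AB-00X" "CD-00X" "EF-00X"
```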

Changing Case

Case conversion functions make it easy to standardize text for comparison or display.

str_to_upper("Hello World")
# [1] "HELLO WORLD"

str_to_lower("Hello World")
# [1] "hello world"

str_to_title("hello world from r")
# [1] "Hello World From R"

The locale argument lets you handle language-specific rules. The Turkish locale is the classic example — it has different case mappings for the letter “i”, which matters if you are working with Turkish text.
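A short sketch of the Turkish case, assuming a UTF-8 session so the special characters print as shown:

```r
library(stringr)

# Default (English) rules: "i" maps to the ASCII capital "I"
upper_en <- str_to_upper("i")

# Turkish rules: "i" maps to the dotted capital "İ" (U+0130)
upper_tr <- str_to_upper("i", locale = "tr")

# And the ASCII capital "I" lowers to the dotless "ı" (U+0131)
lower_tr <- str_to_lower("I", locale = "tr")

c(upper_en, upper_tr, lower_tr)
# [1] "I" "İ" "ı"
```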

Trimming and Padding

Removing Whitespace with `str_trim()`

Raw data imports often contain leading or trailing whitespace that can cause matching problems. str_trim() removes it.

str_trim("  hello  ")
# [1] "hello"

str_trim("  hello  ", side = "left")
# [1] "hello  "

Adding Padding with `str_pad()`

str_pad() adds characters to reach a minimum width, which is useful for formatting tables or fixed-width output.

str_pad("hello", width = 10, side = "left", pad = " ")
# [1] "     hello"

str_pad("hello", width = 10, side = "right", pad = "-")
# [1] "hello-----"

str_pad(1:3, width = 3, side = "left", pad = "0")
# [1] "001" "002" "003"

These two functions are opposites in a sense: str_trim() strips unwanted whitespace, while str_pad() adds characters. Both are useful when preparing data for export or matching against fixed-width formats.
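The two are often used together. As a sketch, the snippet below trims stray whitespace from some hypothetical IDs and then left-pads them to a fixed width.

```r
library(stringr)

# Hypothetical raw IDs with stray whitespace
raw_ids <- c(" 7", "42 ", " 123 ")

# Trim first, then left-pad with zeros to a fixed width of 5
clean_ids <- str_pad(str_trim(raw_ids), width = 5, side = "left", pad = "0")
clean_ids
# [1] "00007" "00042" "00123"

# str_pad() never truncates: strings longer than width come back unchanged
too_long <- str_pad("1234567", width = 5, pad = "0")
too_long
# [1] "1234567"
```

Trimming before padding matters here: padding " 7" directly would count the leading space toward the width.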

Detecting and Matching Patterns

Pattern matching is where stringr really shows its value. Functions in this group accept regular expressions, making them far more powerful than simple exact-match operations.

Detecting Patterns with `str_detect()`

str_detect() returns TRUE or FALSE for each string depending on whether the pattern matches, and NA where the input is missing. This makes it a natural fit for dplyr::filter(), which keeps only the rows where the condition is TRUE.

emails <- c("alice@example.com", "bob@test.org", "invalid-email", NA)
str_detect(emails, "@")
# [1]  TRUE  TRUE FALSE    NA
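Here is how that looks inside filter(). This is a sketch on a hypothetical contacts table; the column names are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical contact table
contacts <- tibble(
  person = c("alice", "bob", "carol", "dave"),
  email  = c("alice@example.com", "bob@test.org", "invalid-email", NA)
)

# filter() keeps rows where the condition is TRUE, so FALSE and NA rows drop out
valid <- contacts %>% filter(str_detect(email, "@"))
valid$person
# [1] "alice" "bob"

# negate = TRUE inverts the match; a missing email still yields NA and is dropped
invalid <- contacts %>% filter(str_detect(email, "@", negate = TRUE))
invalid$person
# [1] "carol"
```

Note that "dave" appears in neither result, because a missing email is neither a match nor a non-match.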

Counting Occurrences with `str_count()`

str_count() tells you how many times a pattern appears in each string.

str_count(c("banana", "apple", "cherry"), "a")
# [1] 3 1 0

Extracting Matches with `str_extract()`

str_extract() pulls out the first match from each string. Use str_extract_all() when you need every match, which returns a list.

str_extract(c("abc123def", "hello456"), "[0-9]+")
# [1] "123" "456"

str_extract_all(c("abc123def", "a1b2c3"), "[a-z]")
# [[1]] "a" "b" "c" "d" "e" "f"
# [[2]] "a" "b" "c"

Replacing with `str_replace()` and `str_replace_all()`

str_replace() swaps the first match, while str_replace_all() replaces every match.

str_replace("aaa bbb", "a", "X")
# [1] "Xaa bbb"

str_replace_all("aaa bbb", "a", "X")
# [1] "XXX bbb"
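Replacements can also refer back to what was matched. The sketch below shows two variations: capture groups referenced as \\1 and \\2, and a named vector that applies several replacements in one call. The input strings are invented examples.

```r
library(stringr)

# Swap "last, first" into "first last" using capture groups \\1 and \\2
swapped <- str_replace("Smith, Alice", "(\\w+), (\\w+)", "\\2 \\1")
swapped
# [1] "Alice Smith"

# A named vector applies several pattern = replacement pairs in one call
cleaned <- str_replace_all("1 cat and 2 dogs", c("1" = "one", "2" = "two"))
cleaned
# [1] "one cat and two dogs"
```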

Splitting with `str_split()`

str_split() breaks strings at each match and returns a list of character vectors.

str_split("a-b-c-d", "-")
# [[1]] "a" "b" "c" "d"

str_split("one two three", " ", n = 2)
# [[1]] "one" "two three"

The n argument caps the number of pieces returned, so n = 2 splits only at the first match and leaves the rest of the string intact. This is useful when you only need the first few parts.
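A typical use is splitting "key=value" text where the value may itself contain the separator. The settings vector below is a hypothetical example; simplify = TRUE is an optional argument that returns a matrix instead of a list.

```r
library(stringr)

# Hypothetical "key=value" settings; the second value contains "=" itself
settings <- c("mode=fast", "filter=a=b")

# n = 2 returns at most two pieces, so only the first "=" splits
parts <- str_split(settings, "=", n = 2)
parts[[2]]
# [1] "filter" "a=b"

# simplify = TRUE returns a character matrix instead of a list
mat <- str_split(settings, "=", n = 2, simplify = TRUE)
mat[, 1]
# [1] "mode"   "filter"
```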

String Interpolation with `str_glue()`

str_glue() inserts R expressions directly into strings using {...} placeholders, similar to f-strings in Python.

name <- "Alice"
score <- 95
str_glue("Hello {name}, your score is {score}.")
# Hello Alice, your score is 95.

str_glue_data() reads the placeholder values from the columns of a data frame, which makes it convenient inside mutate().

df <- data.frame(first = c("John", "Jane"), last = c("Doe", "Smith"))
dplyr::mutate(df, full_name = str_glue_data(df, "{first} {last}"))
#   first  last  full_name
# 1  John   Doe   John Doe
# 2  Jane Smith Jane Smith

The first argument of str_glue_data() is the data frame whose columns fill the placeholders. When you write the call with the magrittr pipe (df %>% mutate(...)), you can pass . instead; it stands for the whole piped-in data frame, not a single row.
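Because str_glue() looks up names in the environment it is called from, it can also be used directly inside mutate(), where the columns are in scope. The following is a minimal sketch with a hypothetical people table.

```r
library(dplyr)
library(stringr)

# Hypothetical table of names
people <- tibble(first = c("John", "Jane"), last = c("Doe", "Smith"))

# Placeholders resolve to the columns of the data frame being mutated
labelled <- people %>%
  mutate(full_name = str_glue("{last}, {first}"))

# The new column has class "glue"; as.character() gives a plain vector
as.character(labelled$full_name)
# [1] "Doe, John"   "Smith, Jane"
```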

Pattern Modifiers

By default, stringr treats patterns as regular expressions. Wrapping a pattern in a modifier function changes or fine-tunes that behavior. Three of them are covered here.

fixed() matches literally, ignoring any regex metacharacters. This is faster for simple strings.

str_extract(c("a.b", "a*b"), fixed("."))
# [1] "." NA
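The practical effect of fixed() shows up whenever a pattern contains a metacharacter. This sketch compares the default regex interpretation, fixed(), and an escaped regex on two invented version strings.

```r
library(stringr)

# Hypothetical version strings
versions <- c("v1.2", "v1x2")

# As a regex, "." matches any character, so both strings match "1.2"
as_regex <- str_detect(versions, "1.2")
as_regex
# [1] TRUE TRUE

# fixed() treats "." literally, so only the first string matches
as_fixed <- str_detect(versions, fixed("1.2"))
as_fixed
# [1]  TRUE FALSE

# Escaping the dot gives the same result while staying in regex mode
as_escaped <- str_detect(versions, "1\\.2")
as_escaped
# [1]  TRUE FALSE
```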

boundary() matches on word, line, or character boundaries, which is useful for precise tokenization.

str_extract_all("hello world! how are you?", boundary("word"))
# [[1]] "hello" "world" "how" "are" "you"

regex() gives you the standard regex behavior with options like ignore_case = TRUE.

str_detect("Hello", regex("hello", ignore_case = TRUE))
# [1] TRUE

stringr uses ICU regular expressions under the hood, which support Unicode properties like \p{L} for letters and \p{N} for numbers.
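As a sketch of what the Unicode properties buy you, the example below extracts words and numbers from a string containing accented letters. The text is made up and written with \u escapes so it does not depend on the file encoding; the printed output assumes a UTF-8 session.

```r
library(stringr)

# "Zoë paid 25€ in São Paulo", written with escapes for portability
text <- "Zo\u00eb paid 25\u20ac in S\u00e3o Paulo"

# \\p{L}+ matches runs of letters in any script, including accented ones
words <- str_extract_all(text, "\\p{L}+")[[1]]
words
# [1] "Zoë"   "paid"  "in"    "São"   "Paulo"

# \\p{N}+ matches runs of numeric characters
numbers <- str_extract_all(text, "\\p{N}+")[[1]]
numbers
# [1] "25"
```

An ASCII-only class such as [A-Za-z]+ would split "Zoë" and "São" at the accented letters.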

A Complete Example

Here is a realistic pipeline that brings several stringr functions together using the pipe operator and mutate().

library(tidyverse)

df <- tibble(
  name  = c("  alice smith  ", "BOB JONES", "Carol Davis"),
  email = c("alice@example.com", "bob@test.org", NA)
)

df %>%
  mutate(
    name_clean  = str_trim(name),
    name_title  = str_to_title(str_to_lower(name_clean)),
    has_email   = str_detect(email, "@"),
    domain      = str_extract(email, "(?<=@).+"),
    # First letter of the name plus the first letter that follows a space
    initials    = str_c(str_sub(name_title, 1, 1), str_extract(name_title, "(?<= )\\w"), sep = "."),
    # str_glue() sees the columns created earlier in this mutate() call
    formatted   = str_glue("{name_title} <{email}>")
  )
# Output shown at full console width:
# A tibble: 3 × 8
#   name              email             name_clean  name_title  has_email domain      initials formatted
#   <chr>             <chr>             <chr>       <chr>       <lgl>     <chr>       <chr>    <glue>
# 1 "  alice smith  " alice@example.com alice smith Alice Smith TRUE      example.com A.S      Alice Smith <alice@example.com>
# 2 "BOB JONES"       bob@test.org      BOB JONES   Bob Jones   TRUE      test.org    B.J      Bob Jones <bob@test.org>
# 3 "Carol Davis"     NA                Carol Davis Carol Davis NA        NA          C.D      Carol Davis <NA>

This example shows a common pattern: cleaning a name field, normalizing case, checking for the presence of an email address, extracting the domain, building initials, and finally formatting a display name with the email. All of this happens inside one mutate() call, which is readable because each step builds on the previous one.

Handling Missing Values

One of stringr’s practical advantages over base R string functions is consistent NA handling. Most stringr functions propagate NA values without error, which keeps your pipelines predictable when data is incomplete. The main exception is str_glue(), which renders a missing value as the text "NA" by default.

str_c("hello", NA)
# [1] NA

str_length(c("hello", NA))
# [1] 5 NA

str_detect(c("apple", NA), "a")
# [1] TRUE   NA

In base R the results are less uniform: nchar(NA) returns 2 and paste0("hello", NA) returns "helloNA", so you often have to handle missing data explicitly. stringr sidesteps this by returning NA when the input is NA.
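When you do want a missing value to become text, str_replace_na() makes that conversion explicit. The sketch below uses a hypothetical vector of middle names.

```r
library(stringr)

# Hypothetical middle names with one missing entry
middle <- c("Ann", NA, "Lee")

# NA propagates through str_c(), so the second result is missing
propagated <- str_c("J. ", middle)
propagated
# [1] "J. Ann" NA       "J. Lee"

# str_replace_na() converts NA to a chosen string before combining
filled <- str_c("J. ", str_replace_na(middle, replacement = ""))
filled
# [1] "J. Ann" "J. "    "J. Lee"
```

Without the replacement argument, str_replace_na() substitutes the text "NA".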

Conclusion

The stringr package gives you a consistent, readable set of tools for working with text in R. Its functions follow the str_ naming convention, handle NA values uniformly, and integrate smoothly with mutate() pipelines. You learned how to concatenate strings with str_c(), measure length with str_length(), extract substrings with str_sub(), convert case, trim and pad strings, detect and replace patterns with regular expressions, and interpolate values with str_glue(). These operations cover the vast majority of everyday text manipulation tasks in data analysis.
