String Manipulation with stringr

String manipulation in R becomes predictable with stringr, part of the tidyverse ecosystem. If you have ever struggled with R’s base string functions — paste(), substr(), grep(), gsub() — stringr replaces them with a consistent, verb-first API where every function name starts with str_ and the string argument always comes first.

This guide covers the most useful stringr functions for everyday data work.

Installing stringr

Install stringr directly from CRAN, or load it as part of the full tidyverse meta-package. Both approaches give you the same stringr functions — loading the tidyverse simply imports all core packages at once for convenience.

# Install and load stringr alone
install.packages("stringr")
library(stringr)

# Or load the entire tidyverse, which includes stringr
library(tidyverse)

Creating strings

str_c(): combining strings

str_c() combines multiple strings element-wise. Unlike paste(), whose default separator is a space, str_c() inserts nothing between the pieces by default (like paste0()); specify sep if you want a separator. Missing values propagate: if any input is NA, the corresponding result is NA. str_c() has no na.rm argument, so wrap the inputs in str_replace_na() if you want missing values treated as the literal string "NA".

str_c("Hello", " ", "World")
# [1] "Hello World"

str_c("x", 1:3, sep = "_")
# [1] "x_1" "x_2" "x_3"

When you need to collapse an entire vector into a single string, str_flatten() is the dedicated tool. It joins all elements with a separator you specify, which is ideal for building comma-separated lists or constructing SQL IN clauses from a vector of identifiers.

str_flatten(c("a", "b", "c"), collapse = ", ")
# [1] "a, b, c"
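A minimal sketch of how missing values behave when combining strings. The vector name parts is made up for illustration, and the na.rm argument of str_flatten() assumes stringr 1.5.0 or later.

```r
library(stringr)

# Hypothetical input with one missing value
parts <- c("a", NA, "c")

# NA is contagious: any NA input makes the corresponding result NA
with_na <- str_c("id_", parts)
print(with_na)
# [1] "id_a" NA     "id_c"

# str_replace_na() turns NA into the literal string "NA" before combining
as_text <- str_c("id_", str_replace_na(parts))
print(as_text)
# [1] "id_a"  "id_NA" "id_c"

# str_flatten() can drop missing values instead (assumes stringr >= 1.5.0)
flat <- str_flatten(parts, collapse = ", ", na.rm = TRUE)
print(flat)
# [1] "a, c"
```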

str_dup(): repeating strings

String repetition is straightforward: str_dup() duplicates a string the specified number of times. This is useful for generating padding characters, creating visual separators in console output, or constructing test data with a known number of repetitions.

str_dup("ha", 3)
# [1] "hahaha"

Subsetting strings

str_length(): counting characters

Before extracting substrings, you often need to know how long a string is. str_length() counts Unicode characters — not bytes — which matters when working with non-ASCII text, emoji, or multi-byte encodings where nchar() with the wrong encoding setting can give misleading results.

str_length(c("apple", "banana", "cherry"))
# [1] 5 6 6

str_sub(): extracting parts

str_sub() extracts a substring by position with support for negative indices that count from the end. This is the stringr equivalent of base R’s substr() but more flexible — str_sub(x, -3, -1) extracts the last three characters without needing to compute nchar(x) first.

x <- "abcdef"
str_sub(x, 1, 3)
# [1] "abc"

str_sub(x, -3, -1)
# [1] "def"

str_sub() also supports assignment on the left-hand side, letting you replace a specific character range. This mirrors substr()<- in base R, with the added benefits of negative indices for end-relative positions and replacements that need not match the length of the range they replace. As with any R assignment, x is rebound to the modified string; other copies of the original value are unaffected.

x <- "apple"
str_sub(x, 1, 1) <- "A"
x
# [1] "Apple"
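The two extras of left-hand-side assignment, negative indices and replacements of a different length, can be sketched as follows. The variable names path and code are purely illustrative.

```r
library(stringr)

# Replace the last three characters using end-relative indices
path <- "report.txt"
str_sub(path, -3, -1) <- "csv"
print(path)
# [1] "report.csv"

# The replacement may be longer (or shorter) than the range it replaces
code <- "ab-123"
str_sub(code, 1, 2) <- "xyz"
print(code)
# [1] "xyz-123"
```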

str_extract(): pattern extraction

Position-based extraction works for known offsets, but most real-world string parsing requires matching patterns. str_extract() pulls out the first substring that matches a regular expression, returning NA for any element that contains no match. For strings with multiple matches, str_extract_all() captures every occurrence and returns a list of character vectors, one list element per input string.
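A short illustration of the difference between the two functions, using a made-up vector called notes:

```r
library(stringr)

# Hypothetical free-text notes; the second one contains no digits
notes <- c("order 12 and 7", "no digits", "room 404")

# First match per string; NA where nothing matches
first_num <- str_extract(notes, "\\d+")
print(first_num)
# [1] "12"  NA    "404"

# Every match per string, returned as a list (one element per input)
all_nums <- str_extract_all(notes, "\\d+")
print(all_nums[[1]])
# [1] "12" "7"
print(length(all_nums[[2]]))
# [1] 0
```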

Pattern detection

str_detect(): finding patterns

str_detect() returns TRUE if a pattern exists in a string:

fruits <- c("apple", "banana", "cherry", "apricot")
str_detect(fruits, "^a")
# [1]  TRUE FALSE FALSE  TRUE

str_detect() returns a logical vector that integrates naturally with other R operations. Use sum() to count matches, mean() to compute the proportion of matches, or pass the logical vector directly to dplyr’s filter() for subsetting rows based on text content.

# Count strings starting with 'a'
sum(str_detect(fruits, "^a"))
# [1] 2

str_starts() and str_ends()

For checking string boundaries without writing regex anchors, str_starts() and str_ends() provide direct, readable alternatives. They accept the same pattern types as other stringr functions, so the pattern is still a regular expression by default; wrap it in fixed() when you want a literal match.

str_starts(fruits, "a")
# [1]  TRUE FALSE FALSE  TRUE

str_ends(fruits, "e")
# [1]  TRUE FALSE FALSE FALSE
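Because the pattern is a regular expression unless stated otherwise, a literal dot in a file extension is a common trap. The sketch below uses an invented vector of file names to contrast the two interpretations.

```r
library(stringr)

# Hypothetical file names; the second has no real ".csv" extension
files <- c("a.csv", "abcsv", "b.csv")

# As a regex, "." matches any character, so "abcsv" also matches
regex_match <- str_ends(files, ".csv")
print(regex_match)
# [1] TRUE TRUE TRUE

# fixed() treats the pattern as literal text
literal_match <- str_ends(files, fixed(".csv"))
print(literal_match)
# [1]  TRUE FALSE  TRUE
```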

String replacement

str_replace(): substituting patterns

String replacement comes in two variants: str_replace() changes only the first match in each string, while str_replace_all() replaces every occurrence. Choose based on whether you want surgical edits on the first hit or a global find-and-replace across the entire text.

str_replace("apple pie", "pie", "tart")
# [1] "apple tart"

str_replace_all("aaa", "a", "b")
# [1] "bbb"

When your goal is deletion rather than substitution, str_remove() and str_remove_all() are convenient shorthands that replace matches with the empty string. They produce the same result as calling str_replace() with replacement = "" but communicate intent more clearly in your code.

str_remove_all("a-b-c-d", "-")
# [1] "abcd"

Splitting strings

str_split(): breaking apart

str_split() breaks a string into pieces at each occurrence of a delimiter pattern, returning a list of character vectors. The list structure accommodates strings that split into different numbers of pieces — each input element gets its own vector in the output list.

str_split("a,b,c", ",")
# [[1]]
# [1] "a" "b" "c"

When you know every string will split into the same number of pieces — for instance, a column of dates in “YYYY-MM-DD” format — add simplify = TRUE to return a character matrix instead of a list. The matrix form drops directly into data frame column assignments.

str_split("a,b,c", ",", simplify = TRUE)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c"

For assembling strings from templates rather than breaking them apart, str_glue() evaluates R expressions inside curly braces and substitutes the results. This is the stringr equivalent of Python f-strings and avoids the repetitive quoting and concatenation of paste0(). Any valid R expression can go inside the braces.

name <- "Alice"
age <- 30
str_glue("My name is {name} and I am {age} years old.")
# My name is Alice and I am 30 years old.

Whitespace handling

str_trim(): removing extra spaces

Whitespace inconsistencies are among the most common data quality issues. str_trim() removes leading and trailing spaces, while str_squish() also collapses multiple internal spaces into single spaces — the latter is ideal for cleaning free-text fields from forms and surveys.

str_trim("  hello  ")
# [1] "hello"

str_squish("  hello   world  ")
# [1] "hello world"

str_pad(): adding padding

Padding adds characters to reach a target width, commonly used for formatting identifiers or aligning output. The side argument controls whether padding appears on the left, right, or both sides of the string. Zero-padding numeric identifiers to a fixed width is a frequent use case for str_pad().

str_pad("apple", width = 10, side = "left", pad = " ")
# [1] "     apple"

str_pad("5", width = 2, pad = "0")
# [1] "05"

Case manipulation

str_to_upper(), str_to_lower(), str_to_title()

Case conversion functions transform text to consistent capitalization. str_to_upper() converts all letters to uppercase, str_to_lower() to lowercase, and str_to_title() capitalizes the first letter of each word while lowercasing the rest. All three accept a locale argument that controls language-specific rules. For example, a lowercase “i” uppercases to “I” under the default English locale but to the dotted “İ” under the Turkish locale ("tr"), while the German “ß” uppercases to “SS”.

str_to_upper("Hello World")
# [1] "HELLO WORLD"

str_to_lower("Hello World")
# [1] "hello world"

str_to_title("hello world")
# [1] "Hello World"
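To make the locale effect concrete, here is a small sketch. Non-ASCII characters are written as Unicode escapes so the code does not depend on the file's encoding; the variable names are illustrative only.

```r
library(stringr)

# German sharp s ("\u00df") expands to two letters when uppercased
german <- str_to_upper("stra\u00dfe")
print(german)
# [1] "STRASSE"

# Default (English) rules: i -> I
english_i <- str_to_upper("i")

# Turkish rules: i -> dotted capital I ("\u0130")
turkish_i <- str_to_upper("i", locale = "tr")

print(english_i == "I")
# [1] TRUE
print(turkish_i == "\u0130")
# [1] TRUE
```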

Sorting strings

str_order() and str_sort()

str_order() returns the integer indices that would sort a character vector, making it useful inside dplyr::slice() or for reordering. str_sort() returns the sorted vector directly. Both accept a locale argument for locale-aware collation, which affects how accented characters and digraphs are ordered.

x <- c("banana", "Apple", "cherry")
str_sort(x)
# [1] "Apple"  "banana" "cherry"

str_sort(x, locale = "en")
# [1] "Apple"  "banana" "cherry"

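The locale argument matters mainly for non-English text. For example, Swedish collation sorts “ä” after “z”, whereas English collation places it next to “a”.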
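A small sketch of locale-aware sorting, assuming the ICU collation data shipped with stringi includes Swedish ("sv"), which is normally the case. The letter ä is written as the escape "\u00e4".

```r
library(stringr)

# Three single-letter strings: z, a-umlaut, a
x <- c("z", "\u00e4", "a")

# English collation: a-umlaut sorts right after a
sorted_en <- str_sort(x, locale = "en")

# Swedish collation: a-umlaut is a separate letter that comes after z
sorted_sv <- str_sort(x, locale = "sv")

# str_order() gives the indices that produce the Swedish order
order_sv <- str_order(x, locale = "sv")
print(order_sv)
# [1] 3 1 2
```

Indexing with the result of str_order(), as in x[order_sv], reproduces the output of str_sort() for the same locale.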

Common patterns

Email extraction

Extracting email addresses from text is a common validation task. The regex matches the standard email format: a local part containing alphanumeric characters and certain special characters, followed by an @ symbol, a domain name, a literal dot, and a top-level domain of at least two letters.

emails <- c("john@email.com", "jane.doe@company.org", "invalid")
str_extract(emails, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
# [1] "john@email.com"       "jane.doe@company.org" NA

Phone number formatting

Phone numbers stored as raw digits can be formatted for display using capture groups in the pattern and back-references in the replacement. The regex groups the area code, exchange, and subscriber number separately, and the replacement string inserts parentheses, spaces, and a hyphen in the conventional positions.

phone <- "5551234567"
str_replace(phone, "(\\d{3})(\\d{3})(\\d{4})", "(\\1) \\2-\\3")
# [1] "(555) 123-4567"

Extracting numbers from text

Numeric values embedded in prose need to be pulled out before they can be used in calculations. The optional leading minus sign in the pattern handles negative numbers, while \\d+ captures one or more consecutive digits. For decimal values, extending the pattern to -?\\d+\\.?\\d* captures both integers and floating-point numbers.

text <- "The temperature is 25 degrees"
str_extract(text, "-?\\d+")
# [1] "25"
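The decimal-aware pattern mentioned above can be tried on a few made-up readings. Converting the result with as.numeric() is an extra step added here for illustration.

```r
library(stringr)

# Hypothetical sentences containing a negative decimal, an integer, and no number
readings <- c("Low of -3.5 overnight", "High of 25 degrees", "no reading")

# Optional minus sign, digits, optional decimal point, optional further digits
raw_values <- str_extract(readings, "-?\\d+\\.?\\d*")
print(raw_values)
# [1] "-3.5" "25"   NA

# Extracted text is still character; convert before doing arithmetic
values <- as.numeric(raw_values)
print(values)
# [1] -3.5 25.0   NA
```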

When to use stringr

stringr is ideal for most string manipulation tasks. The function names are intuitive: str_ prefix, then a verb (detect, extract, replace, split, etc.).

For very large text data, or for features stringr does not expose, consider calling stringi directly, since stringr is built on it. Both packages use the same ICU regular expression engine, so patterns carry over unchanged.

Pattern matching overview

stringr provides consistent wrappers around stringi’s pattern-matching functions, with a predictable argument order (string always first, pattern second) and uniform return types. str_detect() returns a logical vector; str_extract() returns a character vector; str_match() returns a character matrix with one column for the full match and one per capture group. All functions are vectorized over both the string and the pattern arguments.

Fixed vs regex patterns

By default, stringr interprets patterns as regular expressions. Wrap in fixed() for literal matching: str_detect(x, fixed("$1.00")) matches the literal dollar sign without escaping. coll() matches using locale-sensitive collation rules, important for case-insensitive matching of non-ASCII text. regex() is the default and accepts flags like ignore_case = TRUE and multiline = TRUE.

Modifying strings

str_replace() replaces the first match; str_replace_all() replaces all matches. str_remove() is shorthand for str_replace(x, pattern, ""). str_to_lower(), str_to_upper(), and str_to_title() change case. str_wrap() wraps long strings at a specified width, useful for plot labels. str_glue() provides glue-style string interpolation: str_glue("Hello, {name}!") evaluates name from the current environment.

String splitting

str_split() splits strings on a pattern and returns a list of character vectors. str_split_fixed() returns a matrix with a fixed number of columns, useful when you know the exact number of pieces. For splitting on a single character, base R’s strsplit() also works and may be marginally faster, though the difference rarely matters in practice. To split a string into individual characters, use str_split(x, "").

Collapsing and joining

str_flatten(x, collapse = ", ") joins a vector of strings with a separator. str_c(x, collapse = " ") does the same. For building strings from template parts, str_glue("{first} {last}") evaluates expressions in curly braces from the current environment, equivalent to Python f-strings. str_glue_data(df, "{name}: {value}") evaluates against a data frame’s columns.

Unicode and encoding

stringr functions use ICU (International Components for Unicode) for character-level operations, so they work correctly across encodings. str_length() counts Unicode code points, not bytes. For multi-byte encoded strings (Chinese, Arabic, emoji), nchar(x, type = "bytes") and nchar(x, type = "chars") give different counts. stringi::stri_enc_detect() identifies the encoding of unknown-encoding strings.

Core string operations

stringr provides a consistent API for string manipulation, with all functions beginning with str_ and taking the string as the first argument (enabling pipe usage). The package wraps stringi, which provides Unicode-aware operations through the ICU library.

Case transformation: str_to_upper(), str_to_lower(), str_to_title(), str_to_sentence(). All are locale-aware; for example, str_to_upper("ß", locale = "de") produces “SS”. Length: str_length(x) counts Unicode characters, not bytes.

Trimming and padding: str_trim(x) removes leading and trailing whitespace; str_squish(x) also collapses internal whitespace to single spaces. str_pad(x, width = 10, side = "left") pads to a minimum width. str_trunc(x, width = 50, side = "right") truncates long strings, adding "..." by default.

Pattern operations

str_detect(x, pattern) returns a logical vector. str_which(x, pattern) returns integer indices. str_count(x, pattern) counts non-overlapping matches per string. str_locate(x, pattern) returns a matrix of match start/end positions.

Replacement: str_replace(x, pattern, replacement) replaces the first match; str_replace_all() replaces all. Back-references in the replacement: "\\1" refers to the first capture group (the backslash must be doubled inside an R string). str_remove(x, pattern) and str_remove_all() are shortcuts for replacing with "".

Extraction: str_extract(x, pattern) returns the first match (NA if none); str_extract_all() returns all matches as a list. str_match(x, pattern) returns a matrix including capture groups.
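The following sketch runs the less familiar functions from this section on a small made-up vector, to show the shape of each return value.

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

# Integer positions of the elements that match
hits <- str_which(fruits, "an")
print(hits)
# [1] 2

# Number of non-overlapping matches in each element
a_counts <- str_count(fruits, "a")
print(a_counts)
# [1] 1 3 0

# Start/end of the first match; rows are NA where there is no match
loc <- str_locate(fruits, "an")
print(loc[2, ])

# Full match in column 1, capture groups in the following columns
pair <- str_match("key=value", "(\\w+)=(\\w+)")
print(pair[1, 2])
# [1] "key"
print(pair[1, 3])
# [1] "value"
```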

Fixed and regex patterns

By default, stringr patterns are regular expressions. fixed("literal.string") treats the pattern as a literal, bypassing regex interpretation, which is useful for strings containing regex metacharacters such as ., (, and *. coll("text", locale = "sv") compares using locale-specific collation rules; add ignore_case = TRUE for locale-aware case-insensitive matching.

For case-insensitive regex, wrap the pattern in regex("pattern", ignore_case = TRUE). regex("pattern", multiline = TRUE) makes ^ and $ match the start and end of each line.
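A compact sketch of the three behaviours described in this section: literal matching, case-insensitive matching, and multiline anchors. The example strings are invented.

```r
library(stringr)

price <- "Price: $1.00"

# As a regex, "$" is an end-of-string anchor, so this cannot match
as_regex <- str_detect(price, "$1.00")
print(as_regex)
# [1] FALSE

# fixed() matches the characters literally
as_fixed <- str_detect(price, fixed("$1.00"))
print(as_fixed)
# [1] TRUE

# ignore_case makes the regex case-insensitive
any_case <- str_detect("HELLO world", regex("hello", ignore_case = TRUE))
print(any_case)
# [1] TRUE

# Without multiline, ^ matches only at the very start of the string
txt <- "first\nsecond"
single <- str_extract_all(txt, "^\\w+")[[1]]
print(single)
# [1] "first"

# With multiline = TRUE, ^ matches at the start of every line
multi <- str_extract_all(txt, regex("^\\w+", multiline = TRUE))[[1]]
print(multi)
# [1] "first"  "second"
```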

String splitting and joining

str_split(x, pattern) returns a list of character vectors. str_split_fixed(x, pattern, n = 2) returns a character matrix with exactly n columns, more convenient for data frame operations. str_split_1(x, pattern) is for splitting a single string and returns a plain character vector.

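str_c(x, y, sep = "") concatenates element-wise, with NA propagation. str_flatten(x, collapse = ", ") collapses a vector to one string. str_flatten_comma(x) is a shortcut for comma-separated lists; given a last separator such as ", and ", it applies the Oxford comma and handles the two-element case ("a and b") correctly.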

str_glue("Hello {name}!") is glue-style interpolation. str_glue_data(df, "{col1} and {col2}") uses data frame columns as the variable environment.
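The splitting and joining helpers above can be sketched together. This assumes stringr 1.5.0 or later, where str_split_1() and str_flatten_comma() are available; the data frame df and its columns are hypothetical.

```r
library(stringr)

# Fixed number of pieces per string -> character matrix, one row per input
dates <- c("2024-01-15", "2023-12-31")
date_parts <- str_split_fixed(dates, "-", n = 3)
print(date_parts)

# A single string -> plain character vector (no list wrapper)
pieces <- str_split_1("a,b,c", ",")
print(pieces)
# [1] "a" "b" "c"

# Comma-separated list with an Oxford comma before the last item
three <- str_flatten_comma(c("a", "b", "c"), last = ", and ")
print(three)
# [1] "a, b, and c"

# With only two items the comma is dropped
two <- str_flatten_comma(c("a", "b"), last = ", and ")
print(two)
# [1] "a and b"

# Interpolate data frame columns row by row
df <- data.frame(name = c("x", "y"), value = c(1, 2))
labels <- as.character(str_glue_data(df, "{name}: {value}"))
print(labels)
# [1] "x: 1" "y: 2"
```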

Substring operations

str_sub(x, start, end) extracts a substring by position. Negative indices count from the end: str_sub(x, -3, -1) extracts the last three characters. str_sub(x, 1, 3) <- "new" replaces that range via left-side assignment. This is more flexible than substr() because it accepts negative indices and replacements whose length differs from the range being replaced.

str_starts(x, pattern) and str_ends(x, pattern) check the beginning or end of strings using a fixed or regex pattern. They save you from writing ^ and $ anchors by hand, which keeps the intent obvious when the pattern is long.

str_wrap(x, width = 80) wraps long strings to a specified line width, inserting newlines. Useful for preparing long text for plots, reports, or terminal output.

String operations in data cleaning

String data cleaning is one of the most common tasks in data preparation. Raw data from surveys, web scraping, database exports, and manual entry typically contains formatting inconsistencies, extraneous whitespace, inconsistent capitalization, and encoding issues. stringr provides a consistent toolkit for addressing these problems.

A systematic cleaning workflow handles the most common issues in a standard order. Start with encoding — ensure all strings are in a consistent encoding (UTF-8) before doing anything else. Then handle whitespace — trim leading and trailing spaces, collapse multiple internal spaces. Then normalize capitalization — convert to lowercase for comparisons, or apply title case for display. Then handle special characters — remove or replace characters that cause problems downstream.

The order matters because some operations interact. Fixing the encoding first ensures that later case and regex operations see the characters you expect. Cleaning whitespace and case before comparing values ensures that entries such as " Apple" and "apple " end up identical. Squishing whitespace before splitting on spaces ensures that the results do not contain empty strings from leading, trailing, or repeated spaces.
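One way to express this workflow is a small helper function. The name clean_text() and the choice to keep only letters, digits, and spaces are assumptions made for this sketch, not a prescribed recipe; adjust the character class to your data.

```r
library(stringr)

# Hypothetical helper applying the cleaning steps in the order described
clean_text <- function(x) {
  x <- enc2utf8(x)                             # 1. encoding: convert to UTF-8 (base R)
  x <- str_squish(x)                           # 2. whitespace: trim and collapse
  x <- str_to_lower(x)                         # 3. capitalization: lowercase for comparison
  x <- str_remove_all(x, "[^\\p{L}\\p{N} ]")   # 4. special characters: keep letters, digits, spaces
  str_squish(x)                                # removal may leave doubled spaces; collapse again
}

# Made-up messy input, including a tab and a missing value
raw <- c("  Hello,   World! ", "R\tis  GREAT...", NA)
cleaned <- clean_text(raw)
print(cleaned)
# [1] "hello world" "r is great"  NA
```

Missing values pass through every step unchanged, so the function can be applied to a whole column without special-casing NA.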

Performance at scale

For data frames with millions of rows, string operations are usually fast in R because stringr (via stringi) uses optimized C code and operates on the full vector at once. However, some patterns can be slow on large datasets.

Complex regular expressions with heavy backtracking can degrade to super-linear, in the worst case exponential, time on certain inputs. If you notice unexpectedly slow pattern matching, time the regex on strings of increasing length to see how the cost grows. Switching to a fixed-string match when no regex features are needed can be substantially faster: str_detect(x, fixed("pattern")) skips the regex engine entirely, unlike str_detect(x, "pattern").
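A rough way to check the difference on your own machine is to time both variants on the same input. The vector below is synthetic, and the timings will vary by system, so no expected numbers are shown; the point of the sketch is only that both calls return the same answer.

```r
library(stringr)

# Synthetic data: 300,000 short strings, two thirds of which contain a dot
x <- rep(c("alpha.beta", "gamma", "delta.epsilon"), times = 1e5)

# Regex version: the dot must be escaped to be matched literally
regex_time <- system.time(regex_hits <- str_detect(x, "\\."))

# Fixed version: the dot is matched as plain text
fixed_time <- system.time(fixed_hits <- str_detect(x, fixed(".")))

# Compare elapsed times (machine-dependent) and confirm identical results
print(regex_time["elapsed"])
print(fixed_time["elapsed"])
print(identical(regex_hits, fixed_hits))
```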

String operations that return new character vectors (str_replace(), str_c(), str_sub()) allocate memory for their results. Growing a string piece by piece inside a loop therefore copies the accumulated result on every iteration. Because stringr functions are vectorized, it is usually much faster to build all the pieces in one vectorized call (or collect them in a list) and join them once at the end with str_flatten() or str_c() with collapse.

The stringi package, which underlies stringr, provides lower-level access when you need performance or features not exposed through stringr. Functions such as stri_sub_replace_all(), which performs multiple non-overlapping substring replacements in one call, can be faster than a series of separate str_replace() calls.

Summary

Function          Purpose
str_c()           Combine strings
str_length()      Count characters
str_sub()         Extract by position
str_extract()     Extract by pattern
str_detect()      Check if pattern exists
str_replace()     Substitute patterns
str_split()       Split into pieces
str_trim()        Remove whitespace
str_to_upper()    Change case

Master these functions, and you’ll handle the vast majority of string manipulation tasks in R.
