Regular Expressions in R

· 4 min read · Updated March 10, 2026 · intermediate

Regular expressions let you find, extract, and transform text using patterns instead of exact matches. This guide covers both base R functions and the tidyverse stringr package for practical text processing.

Why Regular Expressions Matter

Text data rarely comes clean. Names, emails, dates, and codes all contain inconsistencies that make simple string matching useless. Regular expressions solve this by letting you describe what a pattern looks like rather than what the exact text should be.

R provides two main approaches: base R functions like grep() and gregexpr(), and the stringr package from the tidyverse. Both work well. stringr tends to be more readable, while base R needs no extra packages.

The Basics: Matching Characters

The simplest regex pattern matches literal characters. The pattern "hello" matches any string that contains the text "hello".

# Base R: grep returns indices of matches
text <- c("apple", "banana", "cherry", "apple pie")
grep("apple", text)
# [1] 1 4

# grepl returns TRUE/FALSE
grepl("apple", text)
# [1]  TRUE FALSE FALSE  TRUE

Special Characters

Some characters have special meaning in regex. The dot . matches any single character:

# . matches any single character
text <- c("cat", "bat", "hat", "matter")
grep("^.at$", text, value = TRUE)
# [1] "cat" "bat" "hat"

The anchors ^ and $ match the start and end of a string. Using both restricts the match to the entire string.
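
As a small sketch of how each anchor behaves on its own, the base R calls below use a made-up `words` vector (an assumption for illustration, not data from the examples above). The last call shows one way to match a literal dot, since an unescaped `.` would match any character.

```r
# Illustrative vector, invented for this sketch
words <- c("cat", "catalog", "bobcat", "scatter")

# ^ anchors the pattern to the start of the string
grepl("^cat", words)
# [1]  TRUE  TRUE FALSE FALSE

# $ anchors the pattern to the end of the string
grepl("cat$", words)
# [1]  TRUE FALSE  TRUE FALSE

# Both anchors together: the whole string must be "cat"
grepl("^cat$", words)
# [1]  TRUE FALSE FALSE FALSE

# A literal dot must be escaped, otherwise . matches any character
grepl("\\.", c("file.txt", "README"))
# [1]  TRUE FALSE
```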

Character Classes

Square brackets [] define a set of characters to match:

# [aeiou] matches any lowercase vowel
text <- c("cat", "dog", "bird", "myth")
grep("[aeiou]", text, value = TRUE)
# [1] "cat"  "dog"  "bird"

# [0-9] matches any digit
# [a-z] matches any lowercase letter
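
The two range classes mentioned in the comments can be tried out with a short base R sketch. The `codes` vector is invented for illustration. R's documentation notes that range interpretation can depend on the locale, so the POSIX class `[[:digit:]]` is shown as a possible alternative.

```r
# Illustrative vector, invented for this sketch
codes <- c("A1", "b22", "xyz", "42")

# [0-9]: strings containing at least one digit
with_digit <- grep("[0-9]", codes, value = TRUE)
with_digit
# [1] "A1"  "b22" "42"

# ^[a-z]+$: strings made up only of lowercase letters
only_lower <- grep("^[a-z]+$", codes, value = TRUE)
only_lower
# [1] "xyz"

# Ranges can be combined: one letter followed by one or more digits
letter_digits <- grep("^[A-Za-z][0-9]+$", codes, value = TRUE)
letter_digits
# [1] "A1"  "b22"

# POSIX class as a locale-independent way to say "digit"
grepl("[[:digit:]]", codes)
# [1]  TRUE  TRUE FALSE  TRUE
```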

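Negation with [^...] matches any single character except the ones listed. On its own, [^a] matches every string that contains at least one character other than 'a', so it does not filter out words containing 'a'. To keep only words with no 'a' at all, apply the negated class to the whole string:

# Match words that contain no 'a' at all
text <- c("cat", "bat", "dog")
grep("^[^a]+$", text, value = TRUE)
# [1] "dog"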

Quantifiers: How Many Times to Match

After specifying what to match, you need to specify how many times.

Quantifier   Meaning
*            0 or more times
+            1 or more times
?            0 or 1 time
{n}          Exactly n times
{n,}         n or more times
{n,m}        Between n and m times

# Match phone numbers like 555-1234
text <- c("555-1234", "123-4567", "5555-1234", "1234")
# Anchors stop longer strings such as "5555-1234" from matching in part
pattern <- "^[0-9]{3}-[0-9]{4}$"

grep(pattern, text, value = TRUE)
# [1] "555-1234" "123-4567"
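
To see several quantifiers side by side, here is a minimal base R sketch. The `spellings` vector is made up for illustration; each pattern is anchored so the quantifier alone decides which strings match.

```r
# Illustrative vector, invented for this sketch
spellings <- c("color", "colour", "colouur", "colr")

# ?: the "u" may appear 0 or 1 time
opt <- grep("^colou?r$", spellings, value = TRUE)
opt
# [1] "color"  "colour"

# *: the "u" may appear 0 or more times
star <- grep("^colou*r$", spellings, value = TRUE)
star
# [1] "color"   "colour"  "colouur"

# +: the "u" must appear at least once
plus <- grep("^colou+r$", spellings, value = TRUE)
plus
# [1] "colour"  "colouur"

# {2,}: the "u" must appear at least twice
two_up <- grep("^colou{2,}r$", spellings, value = TRUE)
two_up
# [1] "colouur"
```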

Groups and Backreferences

Parentheses () create capture groups. You can reference them later with backreferences \\1, \\2, and so on:

# Find repeated words
text <- c("the cat sat on the mat", "hello hello world", "no repeats here")
pattern <- "\\b(\\w+) \\1\\b"

grep(pattern, text, value = TRUE)
# [1] "hello hello world"

This pattern captures a word with (\\w+), then matches a space, then references the captured word with \\1.
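
Backreferences also work in the replacement string of sub() and gsub(). The sketch below assumes a hypothetical `people` vector in "Surname, Given" form and swaps the two captured groups.

```r
# Hypothetical input in "Surname, Given" form
people <- c("Lovelace, Ada", "Turing, Alan")

# \\1 holds the surname, \\2 the given name; the replacement swaps them
swapped <- sub("^(\\w+), (\\w+)$", "\\2 \\1", people)
swapped
# [1] "Ada Lovelace" "Alan Turing"
```

Strings that do not fit the pattern are returned unchanged, which makes this kind of rewrite reasonably safe to run over a mixed column.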

Using stringr

The stringr package provides cleaner functions with a consistent interface:

library(stringr)

text <- c("apple", "banana", "cherry", "APPLE")

# Detect a pattern
str_detect(text, "apple")
# [1]  TRUE FALSE FALSE FALSE

# Find all matches
str_extract_all(text, "a")
# [[1]] "a"
# [[2]] "a" "a" "a"
# [[3]] character(0)
# [[4]] character(0)

# Replace matches
str_replace(text, "a", "X")
# [1] "Xpple"   "bXnana"  "cherry"  "APPLE"

# Split by pattern
str_split("one,two,three", ",")
# [[1]] "one" "two" "three"

stringr Pattern Flags

stringr functions take the pattern as their second argument. To set matching options, wrap the pattern in regex():

# Case-insensitive matching
str_detect("APPLE", regex("apple", ignore_case = TRUE))
# [1] TRUE

# Multiline mode (^ and $ match line boundaries)
str_extract_all("line1\nline2", regex("^line", multiline = TRUE))
# [[1]] "line" "line"
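
For comparison, the same two options can be approximated in base R. This is a sketch rather than a full mapping of stringr's flags: `ignore.case` is an argument of grepl(), and the inline flags `(?i)` and `(?m)` need `perl = TRUE`. The `fruit` and `lines` objects are invented for illustration.

```r
# Illustrative inputs, invented for this sketch
fruit <- c("APPLE", "Apple pie", "banana")
lines <- "line1\nline2"

# Case-insensitive matching through an argument
ci_arg <- grepl("apple", fruit, ignore.case = TRUE)
ci_arg
# [1]  TRUE  TRUE FALSE

# Case-insensitive matching through an inline flag (PCRE only)
ci_inline <- grepl("(?i)apple", fruit, perl = TRUE)
ci_inline
# [1]  TRUE  TRUE FALSE

# Multiline mode: (?m) lets ^ match at the start of every line
starts <- regmatches(lines, gregexpr("(?m)^line", lines, perl = TRUE))[[1]]
starts
# [1] "line" "line"
```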

Practical Examples

Extracting Email Addresses

emails <- c("john@example.com", "jane.doe@company.org", "invalid-email", 
            "test@sub.domain.co.uk")

pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
str_extract(emails, pattern)
# [1] "john@example.com"    "jane.doe@company.org" NA                   
# [4] "test@sub.domain.co.uk"
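
When addresses are buried in free text rather than stored one per element, base R's gregexpr() and regmatches() can pull out every match. The sketch below reuses the email pattern from above; the `note` string is a made-up example.

```r
# Made-up free text containing two addresses
note <- "Contact john@example.com or jane.doe@company.org for details"
pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# gregexpr() locates all matches; regmatches() extracts them
found <- regmatches(note, gregexpr(pattern, note))[[1]]
found
# [1] "john@example.com"     "jane.doe@company.org"
```

A pattern this simple is a practical filter, not a full validator of email syntax.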

Cleaning Data

# Remove leading and trailing whitespace
dirty <- c("  hello  ", "world  ", "  both  ")
str_trim(dirty)
# [1] "hello" "world" "both"

# Collapse multiple spaces
messy <- "too   many    spaces"
str_squish(messy)
# [1] "too many spaces"

Validating Formats

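# Validate dates (YYYY-MM-DD)
dates <- c("2024-01-15", "2024-13-01", "2024-01-01", "01-15-2024")
pattern <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$"

str_detect(dates, pattern)
# [1]  TRUE FALSE  TRUE FALSE

# str_view() highlights the matches for visual inspection
# (str_view_all() is deprecated as of stringr 1.5.0)
str_view(dates, pattern)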
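
The pattern above checks only the shape of the string, so an impossible date such as 2024-02-31 still passes. One possible way to close that gap, sketched here in base R with a made-up `candidates` vector, is to combine the regex with as.Date(), which returns NA when the string does not parse as a real date in the given format.

```r
# Made-up inputs: a valid date, an impossible day, an impossible month
candidates <- c("2024-01-15", "2024-02-31", "2024-13-01")
pattern <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$"

# Step 1: format check with the regex
format_ok <- grepl(pattern, candidates)
format_ok
# [1]  TRUE  TRUE FALSE

# Step 2: calendar check; impossible dates parse to NA
parsed <- as.Date(candidates, format = "%Y-%m-%d")
valid <- format_ok & !is.na(parsed)
valid
# [1]  TRUE FALSE FALSE
```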

When to Use Base R vs stringr

Base R functions work without dependencies. Use them in packages or scripts where minimizing dependencies matters:

# Base R approach - no dependencies
result <- gsub("old", "new", text)

stringr is better for exploratory analysis and data pipelines where readability matters:

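# stringr - more readable
# str_replace_all() is the gsub() equivalent; str_replace() only
# replaces the first match, like sub()
result <- str_replace_all(text, "old", "new")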

For complex patterns, both approaches eventually reach similar complexity. Choose based on your project’s dependencies.
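
One detail worth checking when moving between the two families is first-match versus all-match replacement. In base R, sub() replaces only the first match and gsub() replaces every match; stringr's str_replace() and str_replace_all() follow the same split. A minimal base R sketch, using a made-up `phrase`:

```r
# Made-up input with two occurrences of "old"
phrase <- "old habits, old code"

# sub(): first match only
first_only <- sub("old", "new", phrase)
first_only
# [1] "new habits, old code"

# gsub(): every match
all_matches <- gsub("old", "new", phrase)
all_matches
# [1] "new habits, new code"
```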

Common Pitfalls

Forgetting to Escape

In R strings, backslashes must be escaped. To match a literal backslash in text, use \\\\ in your pattern:

# Match a literal backslash
str_detect("path\\to\\file", "\\\\")
# [1] TRUE
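
Two alternatives can reduce the backslash count, sketched here in base R. Raw strings require R 4.0.0 or later, and `fixed = TRUE` turns off regex interpretation entirely. The `path` value is a made-up example.

```r
# Made-up path; the string itself contains single backslashes
path <- "C:\\temp\\file.txt"

# Regular literal: four backslashes in source become the regex \\,
# which matches one literal backslash
escaped <- grepl("\\\\", path)
escaped
# [1] TRUE

# Raw string (R >= 4.0.0): the regex is written as the engine sees it
raw <- grepl(r"(\\)", path)
raw
# [1] TRUE

# fixed = TRUE: the pattern is a plain substring, here a single backslash
literal <- grepl("\\", path, fixed = TRUE)
literal
# [1] TRUE

# fixed = TRUE also makes the dot literal
grepl(".", c("file.txt", "README"), fixed = TRUE)
# [1]  TRUE FALSE
```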

Greedy Matching

Quantifiers are greedy by default. They match as much as possible:

# This matches the entire string because .+ takes as much as it can
str_extract("<tag1> content <tag2>", "<.+>")
# [1] "<tag1> content <tag2>"

# Use ? for lazy matching
str_extract("<tag1> content <tag2>", "<.+?>")
# [1] "<tag1>"
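
Another common way to avoid over-matching, besides the lazy quantifier, is a negated character class that cannot run past the closing delimiter. The base R sketch below uses a made-up `html` string to contrast the two behaviours.

```r
# Made-up input with four tags
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: a single match from the first "<" to the last ">"
greedy <- regmatches(html, regexpr("<.+>", html))
greedy
# [1] "<b>bold</b> and <i>italic</i>"

# Negated class: [^>]+ cannot cross a ">", so each tag matches separately
tags <- regmatches(html, gregexpr("<[^>]+>", html))[[1]]
tags
# [1] "<b>"  "</b>" "<i>"  "</i>"
```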

See Also

  • stringr package documentation for all string functions
  • The rebus package for building complex patterns piece by piece
  • stringi for ICU-based international string operations