Regular Expressions in R
Regular expressions let you find, extract, and transform text using patterns instead of exact matches. This guide covers both base R functions and the tidyverse stringr package for practical text processing.
Why Regular Expressions Matter
Text data rarely comes clean. Names, emails, dates, and codes all contain inconsistencies that make simple string matching useless. Regular expressions solve this by letting you describe what a pattern looks like rather than what the exact text should be.
R provides two main approaches: base R functions like grep() and gregexpr(), and the stringr package from the tidyverse. Both work well: stringr has a more consistent, readable interface, while base R needs no extra packages.
The Basics: Matching Characters
The simplest regex pattern matches literal characters. The pattern "hello" matches any string that contains the text “hello”.
# Base R: grep returns indices of matches
text <- c("apple", "banana", "cherry", "apple pie")
grep("apple", text)
# [1] 1 4
# grepl returns TRUE/FALSE
grepl("apple", text)
# [1] TRUE FALSE FALSE TRUE
Special Characters
Some characters have special meaning in regex. The dot . matches any single character:
# . matches any single character, so ^.at$ matches any three-character string ending in "at"
text <- c("cat", "bat", "hat", "matter")
grep("^.at$", text, value = TRUE)
# [1] "cat" "bat" "hat"
The anchors ^ and $ match the start and end of a string. Using both restricts the match to the entire string.
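As a small sketch of how the anchors change what counts as a match, the following compares the same literal pattern with and without them. The vector `words` and the result names are made up for illustration.

```r
words <- c("cat", "concat", "catalog", "bobcat")

anywhere <- grepl("cat", words)    # no anchors: "cat" may appear anywhere
starts   <- grepl("^cat", words)   # ^ pins the match to the start of the string
ends     <- grepl("cat$", words)   # $ pins the match to the end of the string
whole    <- grepl("^cat$", words)  # both: the entire string must be "cat"

anywhere
# [1] TRUE TRUE TRUE TRUE
starts
# [1] TRUE FALSE TRUE FALSE
ends
# [1] TRUE TRUE FALSE TRUE
whole
# [1] TRUE FALSE FALSE FALSE
```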
Character Classes
Square brackets [] define a set of characters to match:
# [aeiou] matches any vowel
text <- c("cat", "dog", "bird", "gym")
grep("[aeiou]", text, value = TRUE)
# [1] "cat" "dog" "bird"
# [0-9] matches any digit
# [a-z] matches any lowercase letter
Negation with [^...] matches anything except the specified characters:
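# [^a] matches any one character other than 'a'. Every word below contains
# at least one such character, so all three match, even those containing 'a'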
text <- c("cat", "bat", "dog")
grep("[^a]", text, value = TRUE)
# [1] "cat" "bat" "dog"
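A negated class only needs one character that is not in the set, so it does not by itself mean "strings without that character". One way to express that stricter condition, sketched here with an illustrative vector `words`, is to anchor the class so it must cover the whole string, or to invert an ordinary match with grep()'s invert argument.

```r
words <- c("cat", "bat", "dog", "aaa")

# Unanchored: one non-"a" character anywhere is enough
loose <- grep("[^a]", words, value = TRUE)

# Anchored: every character from start to end must be something other than "a"
strict <- grep("^[^a]+$", words, value = TRUE)

# Alternative: match "a" and keep the elements that do NOT match
inverted <- grep("a", words, value = TRUE, invert = TRUE)

loose
# [1] "cat" "bat" "dog"
strict
# [1] "dog"
inverted
# [1] "dog"
```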
Quantifiers: How Many Times to Match
After specifying what to match, you need to specify how many times.
| Quantifier | Meaning |
|---|---|
| * | 0 or more times |
| + | 1 or more times |
| ? | 0 or 1 time |
| {n} | Exactly n times |
| {n,} | n or more times |
| {n,m} | Between n and m times |
# Match phone numbers like 555-1234
text <- c("555-1234", "123-4567", "5555-1234", "1234")
pattern <- "^[0-9]{3}-[0-9]{4}$"
grep(pattern, text, value = TRUE)
# [1] "555-1234" "123-4567"
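To see how the individual quantifiers differ, this sketch applies each one to the same letter (the "u" in "colour") across a few made-up spellings. The vector `words` and the result names are illustrative only.

```r
words <- c("color", "colour", "colouur", "colr")

optional   <- grepl("^colou?r$", words)     # ? : zero or one "u"
any_number <- grepl("^colou*r$", words)     # * : zero or more "u"
at_least_1 <- grepl("^colou+r$", words)     # + : one or more "u"
at_least_2 <- grepl("^colou{2,}r$", words)  # {2,} : two or more "u"

optional
# [1] TRUE TRUE FALSE FALSE
any_number
# [1] TRUE TRUE TRUE FALSE
at_least_1
# [1] FALSE TRUE TRUE FALSE
at_least_2
# [1] FALSE FALSE TRUE FALSE
```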
Groups and Backreferences
Parentheses () create capture groups. You can reference them later with backreferences \\1, \\2, and so on:
# Find repeated words
text <- c("the cat sat on the mat", "hello hello world", "no repeats here")
pattern <- "\\b(\\w+) \\1\\b"
grep(pattern, text, value = TRUE)
# [1] "hello hello world"
This pattern captures a word with (\\w+), then matches a space, then references the captured word with \\1.
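Backreferences can also be used in the replacement string of sub() and gsub(). The sketch below collapses a doubled word and reorders the parts of a date. The input values are invented, and perl = TRUE is a choice made here for predictable handling of \\b and \\d rather than a requirement.

```r
text <- c("hello hello world", "the the cat", "no repeats here")

# Replace "word word" with a single copy of the captured word
deduped <- gsub("\\b(\\w+) \\1\\b", "\\1", text, perl = TRUE)
deduped
# [1] "hello world" "the cat" "no repeats here"

# Capture year, month, and day, then emit them in a different order
reordered <- sub("^(\\d{4})-(\\d{2})-(\\d{2})$", "\\3/\\2/\\1",
                 "2024-01-15", perl = TRUE)
reordered
# [1] "15/01/2024"
```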
Using stringr
The stringr package provides cleaner functions with a consistent interface:
library(stringr)
text <- c("apple", "banana", "cherry", "APPLE")
# Detect a pattern
str_detect(text, "apple")
# [1] TRUE FALSE FALSE FALSE
# Find all matches
str_extract_all(text, "a")
# [[1]] "a"
# [[2]] "a" "a" "a"
# [[3]] character(0)
# [[4]] character(0)
# Replace matches
str_replace(text, "a", "X")
# [1] "Xpple" "bXnana" "cherry" "APPLE"
# Split by pattern
str_split("one,two,three", ",")
# [[1]] "one" "two" "three"
stringr Pattern Flags
stringr functions take the pattern as their second argument. Wrap the pattern in regex() to set optional flags:
# Case-insensitive matching
str_detect("APPLE", regex("apple", ignore_case = TRUE))
# [1] TRUE
# Multiline mode (^ and $ match line boundaries)
str_extract_all("line1\nline2", regex("^line", multiline = TRUE))
# [[1]] "line" "line"
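For comparison, a rough base R counterpart of these flags might look like the following. It assumes the PCRE engine (perl = TRUE) for the inline (?i) and (?m) modifiers. The variable names are illustrative.

```r
upper <- "APPLE"

# Case-insensitive matching via an argument or an inline PCRE modifier
ci_arg    <- grepl("apple", upper, ignore.case = TRUE)
ci_inline <- grepl("(?i)apple", upper, perl = TRUE)

two_lines <- "line1\nline2"

# Without (?m), ^ only matches at the very start of the string
single <- regmatches(two_lines, gregexpr("^line", two_lines, perl = TRUE))[[1]]

# With (?m), ^ also matches right after each newline
multi <- regmatches(two_lines, gregexpr("(?m)^line", two_lines, perl = TRUE))[[1]]

ci_arg
# [1] TRUE
ci_inline
# [1] TRUE
single
# [1] "line"
multi
# [1] "line" "line"
```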
Practical Examples
Extracting Email Addresses
emails <- c("john@example.com", "jane.doe@company.org", "invalid-email",
"test@sub.domain.co.uk")
pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
str_extract(emails, pattern)
# [1] "john@example.com" "jane.doe@company.org" NA
# [4] "test@sub.domain.co.uk"
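The same pattern can pull several addresses out of free text. This is a minimal base R sketch using gregexpr() with regmatches(). The sentence in `note` is invented, and the pattern is the simplified one from above, not a full validator for email syntax.

```r
pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
note <- "Contact john@example.com or jane.doe@company.org for details."

# gregexpr() locates every match; regmatches() extracts the matched text
found <- regmatches(note, gregexpr(pattern, note))[[1]]
found
# [1] "john@example.com" "jane.doe@company.org"
```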
Cleaning Data
# Remove extra whitespace
dirty <- c(" hello ", "world ", " both ")
str_trim(dirty)
# [1] "hello" "world" "both"
# Collapse multiple spaces
messy <- "too   many    spaces"
str_squish(messy)
# [1] "too many spaces"
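If stringr is not available, roughly the same cleanup can be sketched in base R with trimws() and gsub(). The inputs below are illustrative.

```r
dirty <- c("  hello  ", "world ", "  both  ")

# trimws() removes leading and trailing whitespace, like str_trim()
trimmed <- trimws(dirty)
trimmed
# [1] "hello" "world" "both"

messy <- "  too   many    spaces "

# Trim the ends, then collapse each run of whitespace to a single space,
# which approximates str_squish()
squished <- gsub("\\s+", " ", trimws(messy))
squished
# [1] "too many spaces"
```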
Validating Formats
# Validate dates (YYYY-MM-DD)
dates <- c("2024-01-15", "2024-13-01", "2024-01-01", "01-15-2024")
pattern <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$"
str_detect(dates, pattern)
# [1] TRUE FALSE TRUE FALSE
# To inspect matches visually, use str_view(dates, pattern);
# str_view_all() is deprecated as of stringr 1.5.0
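A format check like this accepts strings that look like dates but are not real calendar days, such as February 29 in a non-leap year. One possible way to tighten it, sketched here in base R with invented inputs, is to combine the regex with as.Date(), which returns NA when a string cannot be parsed as a valid date in the given format.

```r
dates <- c("2024-02-29", "2023-02-29", "2024-13-01", "01-15-2024")
pattern <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$"

# Step 1: does the string have the YYYY-MM-DD shape?
format_ok <- grepl(pattern, dates)
format_ok
# [1] TRUE TRUE FALSE FALSE

# Step 2: is it an actual calendar date? Invalid dates parse to NA.
parsed <- as.Date(dates, format = "%Y-%m-%d")
is_real <- format_ok & !is.na(parsed)
is_real
# [1] TRUE FALSE FALSE FALSE
```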
When to Use Base R vs stringr
Base R functions work without dependencies. Use them in packages or scripts where minimizing dependencies matters:
# Base R approach - no dependencies
result <- gsub("old", "new", text)
stringr is better for exploratory analysis and data pipelines where readability matters:
# stringr - more readable (str_replace_all() mirrors gsub(); str_replace() mirrors sub())
result <- str_replace_all(text, "old", "new")
For complex patterns, most of the difficulty lies in the regex itself, so both approaches end up similarly complex. Choose based on your project’s dependencies.
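When moving between the two, it helps to keep the first-match versus all-matches distinction straight. The base R sketch below uses an invented vector `text`. The comments name the stringr functions that behave the same way.

```r
text <- c("old old", "very old", "new")

# sub() replaces only the first match in each string, like str_replace()
first_only <- sub("old", "new", text)
first_only
# [1] "new old" "very new" "new"

# gsub() replaces every match in each string, like str_replace_all()
all_matches <- gsub("old", "new", text)
all_matches
# [1] "new new" "very new" "new"
```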
Common Pitfalls
Forgetting to Escape
In R string literals, each backslash must itself be escaped. A regex needs \\ to match one literal backslash, so the R string for that pattern is "\\\\":
# Match a literal backslash
str_detect("path\\to\\file", "\\\\")
# [1] TRUE
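Two ways to reduce the backslash count, sketched in base R: raw strings (available from R 4.0.0) and fixed = TRUE, which turns off regex interpretation altogether. The `path` value is illustrative.

```r
path <- "C:\\Users\\me"  # the actual text is C:\Users\me

# Classic form: four backslashes in code become the two-character regex \\
escaped <- grepl("\\\\", path)

# Raw string: r"(...)" takes backslashes literally, so the regex is written as-is
raw <- grepl(r"(\\)", path)

# The two pattern strings are identical
same_pattern <- identical("\\\\", r"(\\)")

# fixed = TRUE searches for the literal text, here a single backslash
literal <- grepl("\\", path, fixed = TRUE)

c(escaped, raw, same_pattern, literal)
# [1] TRUE TRUE TRUE TRUE
```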
Greedy Matching
Quantifiers are greedy by default. They match as much as possible:
# This matches the entire string because .+ takes as much as it can
str_extract("<tag1> content <tag2>", "<.+>")
# [1] "<tag1> content <tag2>"
# Use ? for lazy matching
str_extract("<tag1> content <tag2>", "<.+?>")
# [1] "<tag1>"
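A negated character class is a common alternative to a lazy quantifier: instead of "as little as possible", it says "anything up to the next closing bracket". The base R sketch below shows greedy, lazy, and negated-class matching side by side. The input string is illustrative, and perl = TRUE is a choice made here for the lazy form.

```r
html <- "<tag1> content <tag2>"

# Greedy: .+ runs to the last ">" in the string
greedy <- regmatches(html, regexpr("<.+>", html))

# Lazy: .+? stops at the first ">" it can
lazy <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))

# Negated class: [^>]+ cannot cross a ">", so each tag is matched separately
tags <- regmatches(html, gregexpr("<[^>]+>", html))[[1]]

greedy
# [1] "<tag1> content <tag2>"
lazy
# [1] "<tag1>"
tags
# [1] "<tag1>" "<tag2>"
```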
See Also
- stringr package documentation for all string functions
- The rebus package for building complex patterns piece by piece
- stringi for ICU-based international string operations