Regular Expressions in R
Regular expressions let you find, extract, and transform text using patterns instead of exact matches. This guide covers both base R functions and the tidyverse stringr package for practical text processing.
Why regular expressions matter
Text data rarely comes clean. Names, emails, dates, and codes all contain inconsistencies that make simple string matching useless. Regular expressions solve this by letting you describe what a pattern looks like rather than what the exact text should be.
R provides two main approaches: base R functions like grep() and gregexpr(), and the stringr package from the tidyverse. Both work well: stringr has a more consistent, readable interface, while base R needs no extra dependencies.
The basics: matching characters
The simplest regex pattern matches literal characters. The pattern "hello" matches any string that contains the text “hello”.
# Base R: grep returns indices of matches
text <- c("apple", "banana", "cherry", "apple pie")
grep("apple", text)
# [1] 1 4
# grepl returns TRUE/FALSE
grepl("apple", text)
# [1] TRUE FALSE FALSE TRUE
Special characters
Some characters have special meaning in regex. The dot . matches any single character:
# . matches any single character, so ".at" means "any character, then at"
text <- c("cat", "bat", "hat", "matter")
grep("^.at$", text, value = TRUE)
# [1] "cat" "bat" "hat"
The anchors ^ and $ match the start and end of a string. Using both restricts the match to the entire string.
Character classes
Square brackets [] define a set of characters to match:
# [aeiou] matches any vowel
text <- c("cat", "dog", "bird", "gym")
grep("[aeiou]", text, value = TRUE)
# [1] "cat" "dog" "bird"
# [0-9] matches any digit
# [a-z] matches any lowercase letter
Negation with [^...] matches anything except the specified characters:
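# [^a] matches any single character other than 'a', so a word
# matches as soon as it contains at least one non-'a' character
text <- c("cat", "bat", "dog")
grep("[^a]", text, value = TRUE)
# [1] "cat" "bat" "dog"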
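Because [^a] only needs one non-'a' character somewhere in the string, it does not by itself select words that lack 'a'. A minimal sketch of two base R ways to do that, either anchoring the negated class or inverting a positive match (the vector name `words` is illustrative):

```r
words <- c("cat", "bat", "dog")

# Every character from start to end must be something other than 'a'
no_a_anchored <- grep("^[^a]+$", words, value = TRUE)

# Alternatively, match 'a' and keep the elements that do NOT match
no_a_inverted <- grep("a", words, value = TRUE, invert = TRUE)

print(no_a_anchored)
# [1] "dog"
print(no_a_inverted)
# [1] "dog"
```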
Quantifiers: how many times to match
After specifying what to match, you need to specify how many times.
| Quantifier | Meaning |
|---|---|
| * | 0 or more times |
| + | 1 or more times |
| ? | 0 or 1 time |
| {n} | Exactly n times |
| {n,} | n or more times |
| {n,m} | Between n and m times |
# Match phone numbers like 555-1234
text <- c("555-1234", "123-4567", "5555-1234", "1234")
# ^ and $ anchor the pattern so that "5555-1234" is rejected
pattern <- "^[0-9]{3}-[0-9]{4}$"
grep(pattern, text, value = TRUE)
# [1] "555-1234" "123-4567"
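To see how the quantifiers in the table differ, here is a small sketch that applies ?, +, and {n,} to the same letter. The sample vector `words` and the result names are made up for illustration:

```r
words <- c("color", "colour", "colouur", "colr")

# ? makes the preceding 'u' optional (0 or 1 time)
optional_u <- grepl("^colou?r$", words)

# + requires at least one 'u'
one_or_more_u <- grepl("^colou+r$", words)

# {2,} requires two or more 'u' characters
two_plus_u <- grepl("^colou{2,}r$", words)

print(optional_u)
# [1]  TRUE  TRUE FALSE FALSE
print(one_or_more_u)
# [1] FALSE  TRUE  TRUE FALSE
print(two_plus_u)
# [1] FALSE FALSE  TRUE FALSE
```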
Groups and backreferences
Parentheses () create capture groups. You can reference them later with backreferences \\1, \\2, and so on:
# Find repeated words
text <- c("the cat sat on the mat", "hello hello world", "no repeats here")
pattern <- "\\b(\\w+) \\1\\b"
grep(pattern, text, value = TRUE)
# [1] "hello hello world"
This pattern captures a word with (\\w+), then matches a space, then references the captured word with \\1.
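Backreferences are also available in the replacement string of sub() and gsub(). The sketch below reorders two captured groups and collapses a doubled word. The input vectors are invented for illustration, and perl = TRUE in the second call is a cautious choice rather than a requirement:

```r
people <- c("Lovelace, Ada", "Turing, Alan")

# Group 1 captures the surname, group 2 the given name;
# the replacement writes them back in the opposite order
swapped <- sub("^(\\w+), (\\w+)$", "\\2 \\1", people)
print(swapped)
# [1] "Ada Lovelace" "Alan Turing"

# Replace a doubled word with a single copy of the captured word
deduped <- gsub("\\b(\\w+) \\1\\b", "\\1", "hello hello world", perl = TRUE)
print(deduped)
# [1] "hello world"
```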
Using stringr
The stringr package provides cleaner functions with a consistent interface:
library(stringr)
text <- c("apple", "banana", "cherry", "APPLE")
# Detect a pattern
str_detect(text, "apple")
# [1] TRUE FALSE FALSE FALSE
# Find all matches
str_extract_all(text, "a")
# [[1]] "a"
# [[2]] "a" "a" "a"
# [[3]] character(0)
# [[4]] character(0)
# Replace matches
str_replace(text, "a", "X")
# [1] "Xpple" "bXnana" "cherry" "APPLE"
# Split by pattern
str_split("one,two,three", ",")
# [[1]] "one" "two" "three"
stringr pattern flags
stringr functions take the pattern as an argument; wrap it in regex() to set optional flags:
# Case-insensitive matching
str_detect("APPLE", regex("apple", ignore_case = TRUE))
# [1] TRUE
# Multiline mode (^ and $ match line boundaries)
str_extract_all("line1\nline2", regex("^line", multiline = TRUE))
# [[1]] "line" "line"
Practical examples
Extracting email addresses
emails <- c("john@example.com", "jane.doe@company.org", "invalid-email",
"test@sub.domain.co.uk")
pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
str_extract(emails, pattern)
# [1] "john@example.com" "jane.doe@company.org" NA
# [4] "test@sub.domain.co.uk"
Cleaning data
# Remove extra whitespace
dirty <- c(" hello ", "world ", " both ")
str_trim(dirty)
# [1] "hello" "world" "both"
# Collapse multiple spaces
messy <- "too    many   spaces"
str_squish(messy)
# [1] "too many spaces"
Validating formats
# Validate dates (YYYY-MM-DD)
dates <- c("2024-01-15", "2024-13-01", "2024-01-01", "01-15-2024")
pattern <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$"
str_detect(dates, pattern)
# [1] TRUE FALSE TRUE FALSE
# str_view(dates, pattern) highlights the matching part of each valid date
Common pitfalls
Forgetting to escape
In R strings, backslashes must be escaped. To match a literal backslash in text, use \\\\ in your pattern:
# Match a literal backslash
str_detect("path\\to\\file", "\\\\")
# [1] TRUE
Greedy matching
Quantifiers are greedy by default. They match as much as possible:
# This matches the entire string because .+ takes as much as it can
str_extract("<tag1> content <tag2>", "<.+>")
# [1] "<tag1> content <tag2>"
# Use ? for lazy matching
str_extract("<tag1> content <tag2>", "<.+?>")
# [1] "<tag1>"
Character classes and anchors
Character classes match sets of characters. [aeiou] matches any vowel; [^aeiou] matches anything except a vowel. POSIX classes like [:alpha:], [:digit:], and [:space:] work inside [] (written [[:alpha:]], for example) and adapt to the locale. ^ anchors to the start of the string; $ anchors to the end. \b marks a word boundary: \bword\b matches “word” as a whole word but not inside “password”.
Quantifiers
* matches zero or more; + matches one or more; ? matches zero or one. {n} matches exactly n times; {n,m} matches n to m times. By default, quantifiers are greedy: they match as much as possible. Append ? to make them lazy: .*? matches as little as possible, which is essential for extracting content between delimiters without consuming too much.
Performance
Complex regex patterns on large strings can be slow. Literal patterns without special characters generally match faster than patterns containing .*. Anchoring a pattern with ^ lets the engine give up after trying a single starting position instead of retrying at every offset, which cuts wasted backtracking. For fixed-string matching (no regex), use fixed("literal") in stringr or fixed = TRUE in base R functions; this bypasses the regex engine in favor of a plain substring search.
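Besides speed, fixed = TRUE changes what the pattern means, because metacharacters lose their special role. A short sketch of the difference, using an invented vector `x`:

```r
x <- c("version 1.0", "version 1x0", "a.b", "ab")

# As a regex, "." matches any character, so "1.0" also matches "1x0"
regex_hits <- grepl("1.0", x)

# With fixed = TRUE the pattern is treated as a literal substring
fixed_hits <- grepl("1.0", x, fixed = TRUE)

print(regex_hits)
# [1]  TRUE  TRUE FALSE FALSE
print(fixed_hits)
# [1]  TRUE FALSE FALSE FALSE
```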
Regex flavors in R
R exposes three regex engines. Base R functions like grep(), sub(), and gsub() use TRE, an implementation of POSIX extended regular expressions, by default (perl = FALSE); it supports fewer features. With perl = TRUE they switch to PCRE (Perl-Compatible Regular Expressions), which is generally faster and more powerful. stringr uses neither: it is built on stringi, which uses the ICU regex engine. ICU syntax is close to PCRE and handles Unicode properties correctly.
For most code, use perl = TRUE in base R functions or use stringr functions. PCRE and ICU both support lookahead/lookbehind, non-greedy matching (*?, +?), and Unicode character classes (\p{L} for any letter); the default TRE engine has no lookaround.
regexpr(pattern, x, perl = TRUE) returns the position and length of the first match in each string. gregexpr() returns all matches. regmatches(x, regexpr(pattern, x)) extracts the matched substrings. These are the base R alternatives to stringr::str_extract().
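A brief sketch of how these three functions fit together. The input vector `x` is invented for illustration:

```r
x <- c("order 66 and 77", "no digits", "id 5")

# regexpr(): position of the first match per string, -1 when there is none
first <- regexpr("[0-9]+", x, perl = TRUE)
first_pos <- as.integer(first)
first_len <- attr(first, "match.length")

# regmatches() with regexpr(): matched substrings, non-matching strings dropped
first_text <- regmatches(x, first)

# regmatches() with gregexpr(): a list holding every match for each string
all_text <- regmatches(x, gregexpr("[0-9]+", x, perl = TRUE))

print(first_pos)
# [1]  7 -1  4
print(first_text)
# [1] "66" "5"
print(lengths(all_text))
# [1] 2 0 1
```

Note that regmatches() with regexpr() returns a shorter vector than its input when some strings do not match, whereas stringr::str_extract() returns NA in those positions.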
Core syntax
Character classes: [aeiou] matches any vowel; [^aeiou] matches any non-vowel; [a-z] matches lowercase letters; [0-9] or \d matches digits; \w matches word characters (letters, digits, underscore); \s matches whitespace.
Quantifiers: * (zero or more), + (one or more), ? (zero or one), {n} (exactly n), {n,} (at least n), {n,m} (between n and m). By default, quantifiers are greedy — they match as much as possible. Append ? for non-greedy: .*? matches as little as possible.
Anchors: ^ (start of string or line), $ (end), \b (word boundary), \B (non-word boundary). In multiline mode ((?m) flag), ^ and $ match at line boundaries.
Groups and alternation: (pattern) captures a group; (?:pattern) is a non-capturing group. (cat|dog) matches “cat” or “dog”. Back-references in replacement: \1 (written "\\1" in R code) refers to the first capture group in sub() and gsub().
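A small sketch combining alternation, a non-capturing group, and the (?m) multiline flag. It uses perl = TRUE throughout, and the sample data is invented:

```r
animals <- c("cat", "dogs", "catdog", "bird")

# (?:cat|dog) groups the alternatives without capturing; s? allows a plural
is_pet <- grepl("^(?:cat|dog)s?$", animals, perl = TRUE)
print(is_pet)
# [1]  TRUE  TRUE FALSE FALSE

# With (?m), ^ matches at the start of every line, not just the string
lines <- "alpha\nbeta\ngamma"
initials <- regmatches(lines, gregexpr("(?m)^\\w", lines, perl = TRUE))[[1]]
print(initials)
# [1] "a" "b" "g"
```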
Lookahead and lookbehind
Lookahead (?=...) and lookbehind (?<=...) match positions without consuming characters. "price: (?=\\d+)" matches “price: ” only when followed by digits. The digits are not included in the match. This is useful for splitting on a position rather than a character.
Negative lookahead (?!...) and negative lookbehind (?<!...) match positions where the pattern does not follow or precede. "\\d+(?!px)" matches digits not followed by “px”. Because of backtracking it can still match the leading digits of such a number, so add \\d to the lookahead when that matters.
In R with perl = TRUE: gsub("(?<=\\d),(?=\\d)", "", "1,234,567", perl = TRUE) removes commas between digits (for parsing formatted numbers).
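The following sketch runs the comma-removal example and shows a backtracking subtlety with negative lookahead. The vector `sizes` and the stricter pattern are illustrative choices, not the only way to write this:

```r
# Remove commas that sit between two digits
no_commas <- gsub("(?<=\\d),(?=\\d)", "", "1,234,567", perl = TRUE)
print(no_commas)
# [1] "1234567"

sizes <- c("100px", "100em", "12 px")

# Naive version: \d+ can give back a digit, so "100px" still yields "10"
naive <- regmatches(sizes, regexpr("\\d+(?!px)", sizes, perl = TRUE))
print(naive)
# [1] "10"  "100" "12"

# Stricter version: the next character may be neither a digit nor start "px"
strict <- grepl("\\d+(?!\\d|px)", sizes, perl = TRUE)
print(strict)
# [1] FALSE  TRUE  TRUE
```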
Common patterns
Email address (simplified): "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$". This is a reasonable sanity check, not an RFC 5321-compliant validator.
ISO date: "^\\d{4}-\\d{2}-\\d{2}$" checks the YYYY-MM-DD shape only; it still accepts impossible values such as month 13.
URL: "https?://[\\w./-]+" matches HTTP/HTTPS URLs made of word characters, dots, slashes, and hyphens. Use perl = TRUE or stringr here, since \w inside brackets is not portable across engines.
Repeated words: "\\b(\\w+) \\1\\b" matches doubled words like “the the”. The back-reference \\1 requires the same text as captured in group 1.
Extract numbers from text: gregexpr("-?\\d+\\.?\\d*", x, perl = TRUE) matches integers and decimals including negative values. regmatches(x, gregexpr(...)) extracts them.
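A minimal sketch of the number-extraction pattern in use, written with the doubled backslashes that R string literals require. The input vector `x` is invented:

```r
x <- c("temp -3.5C, wind 12 km/h", "none")

# gregexpr() locates every number; regmatches() pulls out the text
found <- regmatches(x, gregexpr("-?\\d+\\.?\\d*", x, perl = TRUE))

# Convert each element's matches from character to numeric
numbers <- lapply(found, as.numeric)

print(numbers[[1]])
# [1] -3.5 12.0
print(length(numbers[[2]]))
# [1] 0
```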
Testing and debugging regex
regexr.com or regex101.com provide interactive regex testers where you can test patterns against sample text and see match highlights and explanations. Copy the final pattern to R and add the necessary escaping.
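stringr::str_view(x, pattern) displays the matches highlighted, which is extremely useful for verifying a pattern against real data during development. In stringr 1.5.0 and later, str_view() shows every match and prints to the console, and str_view_all() is deprecated; older versions render the result in the RStudio viewer.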
Unit test your patterns against known inputs: testthat::expect_true(grepl(pattern, "valid@email.com")) and expect_false(grepl(pattern, "not-an-email")).
Regex as a pattern language
Regular expressions are a notation for describing patterns in text. A regex pattern is interpreted by a regex engine that searches text for strings matching the description. The syntax is compact and dense — a few characters can describe complex patterns that would require many lines of conditional logic. The tradeoff is readability: regex patterns can be hard to understand at a glance, especially when they combine multiple features.
Base R string functions use the TRE engine by default and PCRE (Perl-Compatible Regular Expressions) when called with perl = TRUE; stringr uses ICU. PCRE and ICU support features beyond basic regex: named capture groups, non-greedy quantifiers, lookahead and lookbehind assertions, and Unicode properties. Most patterns written for other Perl-style engines, such as those in Python or JavaScript, also work in R, though details like named-group syntax can differ.
Quantifiers and greedy matching
Quantifiers specify how many times a pattern element can repeat. The star matches zero or more; plus matches one or more; question mark matches zero or one. Curly braces specify exact counts: {3} for exactly three, {2,5} for two to five. By default, quantifiers are greedy — they match as many characters as possible while still allowing the overall pattern to match.
Adding a question mark after a quantifier makes it non-greedy (lazy): it matches as few characters as possible. The difference matters for patterns that match content between delimiters. A greedy match for quoted strings would match from the first quote to the last quote in the text, consuming multiple quoted strings. A non-greedy match stops at the first closing delimiter, correctly extracting one quoted string.
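The quoted-string case can be sketched as follows, using base R with perl = TRUE. The sample sentence is invented, and the negated-class variant at the end is a common alternative rather than something the text above prescribes:

```r
x <- 'say "hello" and "goodbye"'

# Greedy: .* runs from the first quote to the last quote
greedy <- regmatches(x, regexpr('".*"', x, perl = TRUE))

# Lazy: .*? stops at the first closing quote, so each string is separate
lazy <- regmatches(x, gregexpr('".*?"', x, perl = TRUE))[[1]]

# Alternative without lazy quantifiers: forbid quotes inside the match
negated <- regmatches(x, gregexpr('"[^"]*"', x, perl = TRUE))[[1]]

print(greedy)
print(lazy)
print(negated)
```

Here `greedy` holds one string spanning both quoted parts, while `lazy` and `negated` each hold the two quoted strings separately.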
Capture groups
Parentheses in a regex pattern create a capture group. When the pattern matches, the text matching the content of each group is captured separately from the full match. Capture groups enable extracting specific parts of a match — not just whether the pattern matched but what specific substrings matched the components of the pattern.
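Named capture groups, written (?P<name>...) or (?<name>...) in PCRE, label a group so that it can be referred to by name instead of by position.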
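In base R, one way to read named groups back is through the attributes that regexpr() attaches to its result when perl = TRUE. This is a sketch under that assumption; the input string and the group names year, month, and day are made up for illustration:

```r
x <- "released 2024-03-15"
pattern <- "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})"

m <- regexpr(pattern, x, perl = TRUE)

# With perl = TRUE, regexpr() records where each capture group matched
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")

# Cut each group out of the input and label it with its group name
parts <- substring(x, starts, starts + lens - 1)
names(parts) <- attr(m, "capture.names")

print(parts)
```

The result is a named character vector, so `parts["year"]` retrieves the year without counting parentheses.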
Testing regex patterns
Testing regex patterns against example inputs before using them in production code prevents bugs from patterns that match the wrong things. The regexr.com website and similar tools provide interactive regex testing with highlighting. In R, testing a pattern against a small vector of expected matches and non-matches verifies behavior before applying to the full dataset.
Edge cases in regex: patterns that work on most inputs may fail on inputs with empty strings, special Unicode characters, or the exact boundary between matching and non-matching. Always include boundary cases in the test set: the shortest valid input, the longest valid input, inputs that should not match but are similar to inputs that should, and empty strings.
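One lightweight way to apply this is a table of inputs paired with the expected result, checked in a single pass. The sketch below uses only base R and the simplified email pattern; the test inputs are invented, and the same table could be fed to testthat expectations:

```r
pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

# Inputs chosen to include near misses and the empty string
inputs <- c("valid@email.com", "a@b.co", "not-an-email",
            "@missing-local.org", "trailing@dot.", "")
expected <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

actual <- grepl(pattern, inputs, perl = TRUE)

# Collect any inputs where the pattern disagrees with the expectation
failures <- inputs[actual != expected]
stopifnot(length(failures) == 0)
```

When a case fails, `failures` names the offending inputs, which is usually enough to see which part of the pattern is too loose or too strict.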
When to use base R vs stringr
Base R functions work without dependencies. Use them in packages or scripts where minimizing dependencies matters:
# Base R approach - no dependencies
result <- gsub("old", "new", text)
stringr is better for exploratory analysis and data pipelines where readability matters:
# stringr - more readable
result <- str_replace_all(text, "old", "new")
For complex patterns, both approaches eventually reach similar complexity. Choose based on your project’s dependencies.
See also
- The stringr package documentation for all string functions
- The rebus package for building complex patterns piece by piece
- stringi for ICU-based international string operations