rguides

sub()

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE)

sub() is a base R function for finding and replacing text patterns in character strings. It replaces only the first occurrence of a pattern in each element of a character vector; its companion gsub() replaces every occurrence. Both support regular expressions, fixed (literal) matching, and case-insensitive matching.

Syntax

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pattern | character | (none) | The pattern to search for: a regular expression by default, or a literal string if fixed = TRUE |
| replacement | character | (none) | The replacement string. In regex mode, backreferences (\1 through \9) refer to capture groups in the pattern |
| x | character | (none) | The character vector to search; each element is processed independently |
| ignore.case | logical | FALSE | If TRUE, the search is case-insensitive |
| perl | logical | FALSE | If TRUE, use Perl-compatible regular expressions (PCRE) |
| fixed | logical | FALSE | If TRUE, treat pattern as a literal string rather than a regex |

Examples

Basic replacement (first match only)

text <- "The cat sat on the cat mat"
sub("cat", "dog", text)
# [1] "The dog sat on the cat mat"

phone <- "555-123-4567"
sub("-", ":", phone)
# [1] "555:123-4567"

The first example shows sub() replacing "cat" with "dog" in a sentence that contains the word twice — notice that only the leftmost occurrence changes, leaving the second "cat" untouched. The phone number example demonstrates the same behavior with a non-word delimiter: only the first hyphen becomes a colon. This single-match behavior is what distinguishes sub() from gsub(), and it is the right choice when you know only the first instance of a pattern is meaningful, such as stripping a leading prefix or replacing only the initial delimiter in a path.
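As a small sketch of the prefix-stripping case mentioned above, the snippet below removes a leading "id_" from some made-up identifiers (the vector `ids` and its values are illustrative assumptions, not data from any real source). Anchoring the pattern with ^ makes the intent explicit: only a prefix at the very start is removed.

```r
# Hypothetical identifiers; the second one repeats "id_" later in the string
ids <- c("id_001", "id_002_id_", "plain")

# ^ anchors the match to the start, so only a leading "id_" is removed
stripped <- sub("^id_", "", ids)
print(stripped)
# [1] "001"     "002_id_" "plain"
```

Elements without the prefix, such as "plain", are returned unchanged.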

Case-insensitive matching

text <- "R is GREAT and r is great"
sub("r", "X", text, ignore.case = TRUE)
# [1] "X is GREAT and r is great"

Setting ignore.case = TRUE makes the pattern match both uppercase and lowercase forms without requiring you to convert the input string first. The example above replaces only the first "r" or "R" encountered — the initial capital R becomes X, but the lowercase "r" later in the string is left alone because sub() stops after the first match. This option is particularly useful when matching user input where capitalization is unpredictable, and you want to normalize only the leading instance of a term.
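The following sketch applies the same option to a vector of inconsistently capitalized inputs. The log-style strings and the "WARN" replacement are invented for illustration only.

```r
# Hypothetical messages with unpredictable capitalization
inputs <- c("ERROR: disk full", "error: timeout", "Error: bad input")

# Replace a leading "error" in any letter case with a single normalized label
normalized <- sub("^error", "WARN", inputs, ignore.case = TRUE)
print(normalized)
# [1] "WARN: disk full" "WARN: timeout"   "WARN: bad input"
```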

Using backreferences

names <- c("John Doe", "Jane Smith", "Bob Wilson")
sub("(\\w+) (\\w+)", "\\2, \\1", names)
# [1] "Doe, John"   "Smith, Jane" "Wilson, Bob"

# Note: sub() is vectorized, so the first match in every element is transformed

Backreferences let you rearrange captured groups within the replacement string. The pattern (\\w+) (\\w+) captures two words separated by a space, and the replacement \\2, \\1 swaps their order and inserts a comma. Because sub() is vectorized over x, it replaces the first match in each element, so all three names are reformatted. The first-match-only rule applies within a single element: for a string such as "John Doe Jane Smith", only "John Doe" would be rewritten and the result would be "Doe, John Jane Smith". This is useful when each string holds one record whose leading fields need reordering while the remainder of the string stays intact.
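To make the scope of "first match" concrete, here is a minimal sketch with several matches per element. The key=value strings are made up for the example. sub() rewrites the leftmost match within each element and leaves any later matches in that element alone.

```r
# Each element contains two key=value pairs (hypothetical data)
pairs <- c("a=1 b=2", "x=9 y=8")

# Swap key and value for the first pair in each element only
swapped <- sub("(\\w+)=(\\w+)", "\\2=\\1", pairs)
print(swapped)
# [1] "1=a b=2" "9=x y=8"
```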

Fixed matching (literal strings)

filename <- "file.old.old"
sub(".old", ".new", filename, fixed = TRUE)
# [1] "file.new.old"

When fixed = TRUE, the pattern is matched as a literal string rather than a regular expression. In the example above, ".old" matches the literal substring .old (a period followed by the letters o-l-d), and only the first occurrence is replaced. Without fixed = TRUE, the dot matches any character. For this particular input the result happens to be the same, but on a string such as "bold.old" the regex would match "bold" first and produce an unintended result. Fixed matching also skips regex compilation and matching, which can make a measurable difference when sub() is applied to a very long vector or a large data frame column.
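The difference between literal and regex matching shows up as soon as an input has some other character in front of "old". The sketch below uses two invented filenames to compare the two modes, plus the escaped-dot regex as an alternative.

```r
filenames <- c("bold.old", "file.old.old")

# Regex mode: "." matches any character, so "bold" is the first match in element 1
regex_result <- sub(".old", ".new", filenames)
print(regex_result)
# [1] ".new.old"     "file.new.old"

# Literal mode: only the exact substring ".old" matches
fixed_result <- sub(".old", ".new", filenames, fixed = TRUE)
print(fixed_result)
# [1] "bold.new"     "file.new.old"

# Escaping the dot gives the same result as fixed = TRUE here
escaped_result <- sub("\\.old", ".new", filenames)
```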

Common patterns

Replace first instance in formatted text

text <- "Price: $100, Discount: 10%"
sub("\\$", "USD ", text)
# [1] "Price: USD 100, Discount: 10%"

sub() vs gsub()

sub() replaces only the first match in each string element. gsub() replaces every match. For most data-cleaning tasks where you want to strip or replace all occurrences, gsub() is the right choice. Use sub() when the first match has special significance — for example, splitting on only the first delimiter, or replacing only the leading whitespace.

Backreferences work the same as in gsub(): sub("(\\w+) .*", "\\1", "hello world foo") captures "hello" and discards the rest. This pattern extracts the first word from a phrase; replacing the space with another delimiter, such as /, extracts the first component of a relative path in the same way.

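With fixed = TRUE, sub() replaces the first literal occurrence of the pattern. The effect is the same as splitting the string at that occurrence and pasting the two pieces back together around the replacement, so it can serve as a lightweight alternative to strsplit() followed by paste() when only the first delimiter matters.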
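One way to check this equivalence is to build the result by hand with regexpr() and substr() and compare. The snippet is a sketch under the assumption of simple key=value entries; the data and variable names are hypothetical.

```r
entries <- c("host=example.org", "query=a=b", "flag")

# Replace only the first literal "=" in each element
relabelled <- sub("=", ": ", entries, fixed = TRUE)
print(relabelled)
# [1] "host: example.org" "query: a=b"        "flag"

# Manual equivalent: find the first "=", then paste the pieces back together
pos <- as.integer(regexpr("=", entries, fixed = TRUE))  # -1 where there is no match
manual <- ifelse(
  pos > 0,
  paste0(substr(entries, 1, pos - 1), ": ", substr(entries, pos + 1, nchar(entries))),
  entries
)
```

Both approaches give the same vector; the sub() call is simply shorter and handles the no-match case without extra logic.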

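The stringr equivalent is str_replace(string, pattern, replacement), which also replaces only the first match in each element. Note that stringr uses the ICU regular expression engine (via stringi) rather than TRE or PCRE, so some patterns behave slightly differently.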

One practical distinction between sub() and gsub() is in path and URL manipulation. To strip a query string from a URL, sub("\\?.*$", "", url) is correct — you want to remove from the first ? to the end, and there is only ever one query string per URL. Using gsub() here would work identically but is misleading; it implies multiple replacements are possible when they are not. Choosing sub() communicates intent to future readers.
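A short sketch of the query-string case, using placeholder example.com URLs. It also confirms that gsub() returns the same result for this pattern, since the first match already extends to the end of the string.

```r
urls <- c(
  "https://example.com/page?id=1&ref=a?b",  # contains a second "?" inside the query
  "https://example.com/plain"               # no query string at all
)

# Remove everything from the first "?" to the end of the string
clean <- sub("\\?.*$", "", urls)
print(clean)
# [1] "https://example.com/page"  "https://example.com/plain"

# gsub() gives the same answer here; sub() just states the intent more clearly
same_as_gsub <- identical(clean, gsub("\\?.*$", "", urls))
```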

# sub() vs gsub() on data frame columns
df <- data.frame(
  path = c("/usr/local/bin", "/home/user/docs", "/var/log/app"),
  stringsAsFactors = FALSE
)

# Strip only the leading slash (sub)
df$relative <- sub("^/", "", df$path)
df$relative
# [1] "usr/local/bin" "home/user/docs" "var/log/app"

# Replace all slashes with backslashes (gsub)
df$windows <- gsub("/", "\\\\", df$path)
df$windows
# [1] "\\usr\\local\\bin" "\\home\\user\\docs" "\\var\\log\\app"

When choosing between the two functions, ask whether a string can contain more than one match and, if it can, whether replacing only the leftmost one is intentional or an oversight. If every occurrence should change, use gsub().

Backreferences in the replacement string are written \1, \2, and so on up to \9. Because a backslash must itself be escaped inside an R string literal, they appear as "\\1", "\\2" in code, and regex escapes such as \w are likewise written "\\w". For example, sub("(\\w+) (\\w+)", "\\2 \\1", "first last") returns "last first". The perl = TRUE argument enables PCRE features including lookaheads, lookbehinds, and named capture groups ((?P<name>...), with \k<name> for backreferences inside the pattern), which are not available in the default TRE regex engine.
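As an illustration of what perl = TRUE adds, the sketch below uses a lookahead and a lookbehind. The price strings are invented, and "N" is just a placeholder replacement.

```r
prices <- c("cost 100 USD and 200 EUR", "300 EUR")

# Lookahead: match digits only when they are followed by " EUR";
# the " EUR" text itself is not consumed, so it stays in the result
eur_masked <- sub("\\d+(?= EUR)", "N", prices, perl = TRUE)
print(eur_masked)
# [1] "cost 100 USD and N EUR" "N EUR"

# Lookbehind: match digits only when they are preceded by a dollar sign
usd_masked <- sub("(?<=\\$)\\d+", "N", "pay $50 or 60", perl = TRUE)
print(usd_masked)
# [1] "pay $N or 60"
```

In the first element the leftmost run of digits (100) is skipped because it is not followed by " EUR", so the first match that sub() replaces is 200.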

See also