substring()
substring() extracts or replaces substrings of character vectors by start and stop position. It behaves like substr() but recycles its arguments, so a single call can return several substrings, and it can appear on the left-hand side of an assignment to overwrite characters.
Syntax
substring(text, first, last = 1000000L)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
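| text | character | (required) | A character vector; other types are coerced with as.character() |
| first | integer | (required) | Starting position (1-indexed); values below 1 are treated as 1 |
| last | integer | 1000000L | Ending position, inclusive; the default reaches the end of any string shorter than one million characters |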
Examples
Basic usage
x <- "Hello World"
substring(x, 1, 5)
# [1] "Hello"
substring(x, 7)
# [1] "World"
# Negative positions do NOT count from the end; first < 1 is treated as 1
substring(x, -4)
# [1] "Hello World"
# To take the last 5 characters, compute the start with nchar()
substring(x, nchar(x) - 4)
# [1] "World"
substring() uses 1-based indexing, so position 1 refers to the first character of the string. When you omit the last argument, the function extracts from the starting position to the end of the string: substring(x, 7) returns everything from position 7 onward. Negative indices do not count backwards from the end. Any first below 1 is treated as 1, so substring(x, -4) returns the whole string. To extract a suffix, compute the start from nchar(x), or use str_sub() from stringr, which does accept negative positions.
Replacing substrings
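x <- "Hello World"
substring(x, 1, 5) <- "HELLO"
x
# [1] "HELLO World"
# A shorter replacement overwrites only as many characters as it supplies
substring(x, 1, 5) <- "Hi"
x
# [1] "HiLLO World"
Unlike most base R string functions, which only return a new value, substring() supports assignment on the left-hand side of <-. The expression substring(x, 1, 5) <- "HELLO" overwrites characters 1 through 5 of x. The replacement never changes the length of the string: if the value is shorter than the target range, only that many characters are overwritten, and if it is longer, the excess is dropped. substr() supports the same assignment form with the same rules. stringr offers str_sub(x, start, end) <- value, which, unlike the base functions, can grow or shrink the string.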
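Because the assignment form overwrites characters without changing the length of the string, swapping in text of a different length takes another approach. One possible sketch, using only base R, pastes the new text onto the part of the string that should be kept:

```r
x <- "Hello World"

# Replace characters 1-5 with a shorter string by joining the
# replacement to the untouched remainder (position 6 onward).
replaced <- paste0("Hi", substring(x, 6))
replaced
# [1] "Hi World"
```

The variable name replaced is only illustrative. The same idea extends to replacing a range in the middle of a string by pasting the prefix, the new text, and the suffix together.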
Working with vectors
words <- c("apple", "banana", "cherry")
substring(words, 2, 4)
# [1] "ppl" "ana" "her"
# From position to end
substring(words, 3)
# [1] "ple" "nana" "erry"
When you pass a vector as the first argument to substring(), the function extracts the specified range from each element independently. In the first example, substring(words, 2, 4) takes the second through fourth characters of "apple", "banana", and "cherry", returning three results. Omitting last extracts from the starting position to the end of each string. This element-wise operation makes substring() a natural fit for cleaning columns in a data frame where every row follows the same positional structure.
Vectorized first/last arguments
text <- "ABCDEF"
# Different start positions for each element
substring(text, c(1, 2, 3))
# [1] "ABCDEF" "BCDEF" "CDEF"
# Different lengths via last parameter
substring(text, 1, c(2, 4, 6))
# [1] "AB" "ABCD" "ABCDEF"
A key feature that distinguishes substring() from substr() is how it recycles arguments: the result is as long as the longest of text, first, and last, and shorter arguments are reused to match. When text is a single string and first is a vector, the string is reused for every starting position, so substring("ABCDEF", c(1, 2, 3)) returns three substrings, each beginning at a different offset. When both first and last are vectors of the same length, each pair defines a separate range, allowing you to extract multiple fields from a single string in one call instead of running the function repeatedly in a loop.
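As a brief illustration of paired first and last vectors, the sketch below extracts three ranges of different widths and then uses the same mechanism to split a string into single characters. The variable names are arbitrary.

```r
text <- "ABCDEF"

# Each (first, last) pair defines one range: 1-1, 2-3, and 4-6
pieces <- substring(text, c(1, 2, 4), c(1, 3, 6))
pieces
# [1] "A"   "BC"  "DEF"

# Using the same positions for first and last yields one character each
positions <- seq_len(nchar(text))
chars <- substring(text, positions, positions)
chars
# [1] "A" "B" "C" "D" "E" "F"
```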
Common patterns
Get file extension
filename <- "report.pdf"
ext_start <- nchar(filename) - 2
substring(filename, ext_start)
# [1] "pdf"
Computing the starting position dynamically with nchar() is a practical pattern when the substring offset depends on the string's own length. In the file extension example, nchar(filename) - 2 is the position of the first character of a three-character extension, so substring() extracts "pdf" and skips the dot. This approach works only when you know the extension length in advance. For variable-length extensions, locate the last dot with regexpr() first and use the position just after it as the starting index.
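For extensions of unknown length, one way to apply the regexpr() idea is sketched below. The pattern and the choice to return an empty string for names without a dot are assumptions made for this example, not behavior prescribed by substring().

```r
filenames <- c("report.pdf", "archive.tar.gz", "notes.md", "README")

# Position of the last dot in each name; regexpr() returns -1 when no dot
# is found. as.integer() drops the match attributes we do not need here.
dot_pos <- as.integer(regexpr("\\.[^.]*$", filenames))

# Take everything after the dot, or "" when the name has no extension
ext <- ifelse(dot_pos > 0, substring(filenames, dot_pos + 1), "")
ext
# [1] "pdf" "gz"  "md"  ""
```

The explicit check on dot_pos matters: without it, a missing dot would give a start position of 0, which substring() treats as 1 and so returns the whole filename. For everyday use, tools::file_ext() in base R covers the common cases.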
Last n characters
text <- "example.txt"
n <- 4
substring(text, nchar(text) - n + 1)
# [1] ".txt"
substring() vs substr()
substring() and substr() both extract substrings by position, and both support assignment: substr(x, 2, 3) <- "XX" and substring(x, 2, 3) <- "XX" each overwrite characters 2 and 3 of x. The key difference is recycling. substring() returns a result as long as the longest of text, first, and last, so a single string can yield several substrings. substr() always returns one result per element of its first argument, so when that argument is a single string only the first start and stop values are used. substr() also requires both start and stop, whereas last is optional in substring().
Use substring() when extracting multiple ranges from the same string, such as tokenizing fixed-width records, or when you want everything from a position to the end of the string. Use substr() for simple single-range extraction.
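The difference in recycling is easiest to see by passing the same vector arguments to both functions. A minimal comparison, with illustrative variable names:

```r
text <- "ABCDEF"
starts <- c(1, 3, 5)
stops <- c(2, 4, 6)

# substr() returns one result per element of its first argument,
# so only the first start/stop pair is used here
one <- substr(text, starts, stops)
one
# [1] "AB"

# substring() recycles text to match the longest argument
many <- substring(text, starts, stops)
many
# [1] "AB" "CD" "EF"
```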
When the stop index exceeds the string length, both functions return the string up to the end without error. When the start index exceeds the string length, "" is returned. These boundary behaviors are predictable and match how most text extraction tools work.
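The boundary cases can be checked directly. The short sketch below collects a few of them in a named vector; the element names are only labels for this example.

```r
x <- "abc"

bounds <- c(
  stop_past_end    = substring(x, 2, 10), # "bc": clamped to the end
  start_past_end   = substring(x, 5),     # "":   nothing to extract
  start_after_stop = substring(x, 3, 2),  # "":   empty range
  start_below_one  = substring(x, -1, 2)  # "ab": first < 1 is treated as 1
)

# A missing input stays missing rather than becoming ""
missing_input <- substring(NA, 1, 2)
is.na(missing_input)
# [1] TRUE
```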
The stringr equivalent is str_sub(x, start, end), which also supports negative indices for counting from the end of the string. For example, str_sub(x, -3, -1) extracts the last three characters without needing nchar().
A common application is parsing fixed-width records where fields occur at known column positions. For example, if a 20-character string always has the date in positions 1–8 and the amount in positions 10–17, substring(x, c(1, 10), c(8, 17)) extracts both fields in a single call, returning a two-element character vector. This vectorization over first and last makes substring() more convenient than substr() for multi-field extraction from fixed-format data. The result is a flat character vector with one element per position pair, not a list, so name the elements or assign them to separate variables after extracting.
# Fixed-width record parsing with substring()
record <- "20240315NYC150.00"
date <- substring(record, 1, 8) # "20240315"
city <- substring(record, 9, 11) # "NYC"
amount <- substring(record, 12) # "150.00"
c(date = date, city = city, amount = amount)
#       date       city     amount
# "20240315"      "NYC"   "150.00"
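The same record can also be parsed in one vectorized call. The sketch below assumes the field layout used above; because substring() does not carry names over from first, the field names are attached afterwards with setNames().

```r
record <- "20240315NYC150.00"

# Field layout: names live on the start positions for convenience
starts <- c(date = 1, city = 9, amount = 12)
stops <- c(8, 11, nchar(record))

# One call extracts all three fields; setNames() labels the result
fields <- setNames(substring(record, starts, stops), names(starts))

# Convert individual fields to more useful types
amount <- as.numeric(fields[["amount"]])                   # 150
parsed_date <- as.Date(fields[["date"]], format = "%Y%m%d") # 2024-03-15
```

Keeping the positions in vectors means a change to the record layout touches only starts and stops, not the extraction code.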
substring() returns "" rather than NA when first is greater than the string length, and substr() does the same. The two functions also agree when last runs past the end of the string: both substr("abc", 2, 10) and substring("abc", 2, 10) return "bc", clamping to the actual length. Neither function raises an error on out-of-bounds indices. Use str_sub() from stringr if you need negative indexing to count from the end.