
strsplit()

strsplit() splits a character vector into substrings by finding matches of a delimiter pattern. It returns a list where each element contains the substrings from the corresponding element of the input.

Syntax

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

Parameters

Parameter  Type       Default     Description
x          character  (required)  A character vector to split
split      character  (required)  The pattern to split on (a regular expression unless fixed = TRUE)
fixed      logical    FALSE       If TRUE, match split literally rather than as a regex
perl       logical    FALSE       If TRUE, use Perl-compatible regular expressions
useBytes   logical    FALSE       If TRUE, match bytes rather than characters

Examples

Basic usage

# Split by space
text <- "hello world"
strsplit(text, " ")
# [[1]]
# [1] "hello" "world"

# Split by comma
colors <- "red,green,blue,orange"
strsplit(colors, ",")
# [[1]]
# [1] "red"    "green"  "blue"   "orange"

The return value of strsplit() is always a list, with one element per input string. Each list element is itself a character vector containing the pieces produced by splitting at each match of the pattern. When you pass a single string, you still get a list — you must use [[1]] to extract the character vector of pieces. This design ensures the return type is consistent regardless of how many input elements you provide.
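
The list structure is easier to see with more than one input string. The following is a minimal sketch; the variable names (fruits, pieces, single) are illustrative only.

```r
# Two input strings produce a list with two elements
fruits <- c("apple,banana", "cherry")
pieces <- strsplit(fruits, ",")

length(pieces)   # one list element per input string
pieces[[1]]      # character vector of pieces for the first input
pieces[[2]]      # an input without the delimiter yields a length-1 vector

# A single string still comes back wrapped in a list
single <- strsplit("a,b,c", ",")
is.list(single)
single[[1]]      # extract the character vector of pieces
```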

Splitting into characters

# Split each character
word <- "abcdef"
strsplit(word, "")
# [[1]]
# [1] "a" "b" "c" "d" "e" "f"

Splitting on an empty string is a convenient way to decompose a string into its individual characters. When split = "", strsplit() splits between every pair of adjacent characters, so each character becomes a separate element of the result. This is useful for tasks like counting character frequency, checking for the presence of specific symbols, or implementing character-level transformations without loading external packages for tokenization.
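
As a rough sketch of the character-frequency use case, the split result can be passed to table(). The input word and variable names here are made up for illustration.

```r
# Decompose a word into single characters
word <- "mississippi"
chars <- strsplit(word, "")[[1]]

# Count how often each character occurs
freq <- table(chars)
freq["s"]          # number of times "s" appears

# Check whether a specific symbol is present
"z" %in% chars
```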

Using fixed = TRUE for literal matching

# Split on a literal period (not regex)
paths <- c("file.txt", "data.csv", "script.R")
strsplit(paths, ".", fixed = TRUE)
# [[1]]
# [1] "file" "txt"
# [[2]]
# [1] "data" "csv"
# [[3]]
# [1] "script" "R"

The fixed = TRUE argument is essential when your delimiter contains characters that have special meaning in regular expressions. Without it, strsplit("file.txt", ".") treats . as a regex that matches any single character, so every character is consumed as a delimiter and the result contains only empty strings. With fixed = TRUE, the period is treated as a literal dot, and the string splits at the actual . character. Literal matching can also be faster, because a plain substring search avoids the overhead of regex matching.
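
The difference can be checked directly. This is a small sketch comparing the regex interpretation of "." with two ways of matching it literally; the variable names are illustrative.

```r
path <- "file.txt"

# "." as a regex matches every character, so only empty pieces remain
regex_split <- strsplit(path, ".")[[1]]

# Literal matching via fixed = TRUE
fixed_split <- strsplit(path, ".", fixed = TRUE)[[1]]

# Escaping the dot in the regex has the same effect
escaped_split <- strsplit(path, "\\.")[[1]]

all(regex_split == "")                  # every piece is an empty string
identical(fixed_split, escaped_split)   # both split at the real dot
```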

Using regular expressions

# Split on whitespace (one or more spaces)
sentence <- "The   quick brown    fox"
strsplit(sentence, "\\s+")
# [[1]]
# [1] "The"   "quick" "brown" "fox"

# Split on digits
text <- "abc123def456"
strsplit(text, "[0-9]+")
# [[1]]
# [1] "abc" "def"

Regular expression splitting handles patterns that go beyond simple literal delimiters. The \\s+ pattern matches one or more whitespace characters, so irregular spacing is handled gracefully: two spaces, three spaces, or a tab all count as a single split point. The [0-9]+ pattern demonstrates splitting on runs of digits, which is useful for separating numeric identifiers from text labels. When the pattern matches at the very end of a string, strsplit() does not produce a trailing empty string, which is why the second example returns only "abc" and "def". A match at the very beginning of a non-empty string, by contrast, yields a leading "" element.
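
A short sketch of how strsplit() treats delimiter matches at the start and at the end of a string, plus mixed whitespace. The input strings are invented for illustration.

```r
# A match at the start of the string leaves a leading empty piece
leading <- strsplit("123abc456def", "[0-9]+")[[1]]

# A match at the end of the string does not add a trailing empty piece
trailing <- strsplit("abc123def456", "[0-9]+")[[1]]

# Spaces and tabs in any combination count as one split point
mixed <- strsplit("a \t b\t\tc", "\\s+")[[1]]

leading
trailing
mixed
```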

Common patterns

First element only

# Get first split element
parts <- strsplit("first,second,third", ",")[[1]]
parts[1]
# [1] "first"

Extracting the first element after splitting is a concise way to get the initial segment of a delimited string. Note that strsplit() still computes every piece; the indexing simply discards the rest. This pattern is common when parsing structured identifiers like "category/subcategory/item" where you only need the top-level category. Because strsplit() returns a list, you must use [[1]] before indexing with [1]: the double bracket extracts the character vector from the list, and the single bracket selects the first element of that vector.
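
Applied to the identifier format mentioned above, the pattern might look like the following sketch. The ids vector and the use of vapply() for several inputs are illustrative additions.

```r
# Single identifier: take the top-level category
id <- "category/subcategory/item"
top_level <- strsplit(id, "/", fixed = TRUE)[[1]][1]
top_level

# Several identifiers: take the first piece of each list element
ids <- c("books/fiction/novel", "music/jazz/album")
top_levels <- vapply(strsplit(ids, "/", fixed = TRUE),
                     function(p) p[1], character(1))
top_levels
```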

Combine with sapply

# Split multiple strings and get lengths
words <- c("a b c", "d e f g", "h i")
sapply(strsplit(words, " "), length)
# [1] 3 4 2

Pairing strsplit() with sapply() is the standard R idiom for applying a summary function to each split result. Because strsplit() returns a list where each element has a potentially different length, sapply() gracefully handles the ragged structure — it calls length() on each list element and returns a clean integer vector. This pattern extends naturally to other summary functions: use sapply(strsplit(x, " "), function(v) v[1]) to extract the first word from every string, or combine with table() to build a frequency distribution of token counts across a corpus.
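
The two extensions mentioned above can be sketched as follows. The sentences vector is made-up sample data, and lengths() is used here as a base R shorthand for sapply(x, length).

```r
sentences <- c("the quick fox", "hello world", "one two three")
tokens <- strsplit(sentences, " ", fixed = TRUE)

# First word of every string
first_words <- sapply(tokens, function(v) v[1])
first_words

# Number of tokens per string, then how often each count occurs
token_counts <- lengths(tokens)
count_distribution <- table(token_counts)
count_distribution
```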

Data cleaning with strsplit and paste

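# Remove extra whitespace
dirty <- "  hello   world  "
# trimws() comes first: a match at the start of the string would otherwise
# leave a leading "" piece, and paste() would turn it into a leading space
clean <- paste(strsplit(trimws(dirty), "\\s+")[[1]], collapse = " ")
clean
# [1] "hello world"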

strsplit() output structure

strsplit() always returns a list, even when given a single string, because each input element can produce a different number of pieces. To get the pieces from a single-string split, use [[1]] to extract the first (and only) list element.

When the split pattern does not match, the element is returned unchanged as a length-1 character vector; it is not an error. When the split pattern matches the entire string, the result is a single empty string (""). An empty input string ("") yields character(0), an empty character vector.
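
These edge cases are quick to verify at the console. A minimal sketch with illustrative variable names:

```r
# Pattern does not occur: the input comes back unchanged
no_match <- strsplit("abc", ",")[[1]]

# Pattern consumes the whole string: a single empty string remains
whole_match <- strsplit("abc", "abc")[[1]]

# Empty input string: an empty character vector
empty_input <- strsplit("", ",")[[1]]

no_match
whole_match
length(empty_input)
```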

For comma-delimited strings, strsplit(x, ",") is the right approach for variable-length fields. For fixed-width fields, substring() or substr() is cleaner. For tabular text files, read.table() or readr::read_delim() are faster and more convenient than manual splitting.

The stringr equivalent is str_split(x, pattern), which returns a list by default, or str_split_fixed(x, pattern, n) which returns a fixed-column matrix — useful when you know exactly how many pieces to expect.

When processing structured strings like "key:value" pairs, strsplit(x, ":", fixed = TRUE) is the right tool. If every string has the same structure, str_split_fixed(x, ":", 2) returns a two-column matrix which is easier to work with than a list. For CSV-like data within a single column, consider read.csv(text = x) as an alternative that handles quoting correctly.
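
A base R sketch of the "key:value" case, assuming every string contains exactly one colon. The sample pairs and the lookup vector are illustrative.

```r
pairs <- c("name:Ada", "role:engineer", "city:London")
parts <- strsplit(pairs, ":", fixed = TRUE)

# First piece of each pair is the key, second piece is the value
keys <- vapply(parts, function(p) p[1], character(1))
values <- vapply(parts, function(p) p[2], character(1))

# Named character vector for lookups by key
lookup <- setNames(values, keys)
lookup[["role"]]
```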

When you know all inputs split into exactly the same number of pieces, use do.call(rbind, strsplit(x, split)) to convert the list to a matrix, or matrix(unlist(strsplit(x, split)), ncol = n, byrow = TRUE). The tidyr::separate() function provides a cleaner interface when splitting a data frame column into multiple columns with a fixed number of parts; recent tidyr versions recommend separate_wider_delim() for the same job.
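
Both conversions can be sketched on a small example. The records vector is invented, and the stopifnot() guard is an added safety check rather than part of the idiom itself.

```r
records <- c("a-1-x", "b-2-y", "c-3-z")
parts <- strsplit(records, "-", fixed = TRUE)

# Guard: both conversions assume every input has the same number of pieces
stopifnot(all(lengths(parts) == 3))

# Bind the pieces row by row into a character matrix
m1 <- do.call(rbind, parts)

# Equivalent: flatten, then refill row-wise
m2 <- matrix(unlist(parts), ncol = 3, byrow = TRUE)

m1
all(m1 == m2)
```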

strsplit() treats split as a regex pattern by default. To split on a literal string that contains regex metacharacters (., +, (, etc.), either pass fixed = TRUE or escape the metacharacters with a double backslash; with the stringr equivalent str_split(), wrap the pattern in fixed(). Forgetting to escape metacharacters is the most common source of unexpected split behavior.
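
As an illustration with another metacharacter, the pipe symbol, here are three ways to match it literally. The fields string is a made-up example.

```r
fields <- "alpha|beta|gamma"

# Literal match via fixed = TRUE
literal <- strsplit(fields, "|", fixed = TRUE)[[1]]

# Regex with the pipe escaped by a double backslash
escaped <- strsplit(fields, "\\|")[[1]]

# Regex with the pipe inside a character class
bracketed <- strsplit(fields, "[|]")[[1]]

literal
identical(literal, escaped) && identical(literal, bracketed)
```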

See also