
stringr::str_length()

str_length(string)

The str_length() function from stringr returns the number of characters in each string of a character vector. Unlike base R’s nchar(), which returns 2 for a bare NA, it returns NA for every missing input. It counts Unicode code points rather than bytes, which makes it more predictable for modern text processing workflows.

Syntax

str_length(string)

Parameters

Parameter   Type        Description
string      character   A character vector

The function accepts a single character vector as its only argument and returns an integer vector of the same length. Each element in the output represents the count of Unicode characters in the corresponding input string, making it straightforward to compute lengths for individual strings or entire columns in a data frame.

Examples

Basic usage

library(stringr)

# Get length of individual strings
str_length("hello")
# [1] 5

# Vectorized over multiple strings
str_length(c("apple", "banana", "cherry"))
# [1] 5 6 6

The function is fully vectorised, so passing a character vector with multiple elements returns an integer vector of the same length with each element’s character count. This makes str_length() efficient for computing lengths across entire columns of a data frame without needing an explicit loop.

str_length() preserves NA values in the output. Base R’s nchar() does the same for a missing string inside a character vector, but nchar(NA) on a bare logical NA returns 2 because the value is read as the string “NA”. This consistency with the rest of the tidyverse makes str_length() the safer choice when your data contains missing values that should stay missing.

Handling NA values

# NA is preserved
str_length(c("hello", NA, "world"))
# [1]  5 NA  5

# Compare with base R: a missing string is also NA,
# but a bare logical NA is counted as the string "NA"
nchar(c("hello", NA, "world"))
# [1]  5 NA  5
nchar(NA)
# [1] 2

Empty strings and whitespace strings each have different length counts, and understanding these differences helps when you need to detect blank entries in a dataset. str_length() treats every character equally, counting spaces, tabs, and newlines just as it counts letters and digits.

Working with empty and whitespace strings

# Empty string has zero length
str_length("")
# [1] 0

# Whitespace characters are counted
str_length("  ")
# [1] 2

# Newlines and tabs count as single characters
str_length("a\nb")
# [1] 3
str_length("a\tb")
# [1] 3

Whitespace characters like spaces, tabs, and newlines all increase the length count, so str_length("  ") returns 2 for two spaces. This matters when you are checking for blank responses in survey data: a field containing only whitespace has a nonzero length, so testing str_length(x) == 0 alone will not catch it.
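One way to separate these cases is to trim whitespace with str_trim() before measuring length. The sketch below uses a made-up responses vector, not data from any real survey:

```r
library(stringr)

# Hypothetical survey responses: text, empty, whitespace-only, and missing
responses <- c("yes", "", "   ", "\t\n", NA)

# Raw lengths: whitespace counts toward the total
raw_len <- str_length(responses)

# Lengths after trimming leading and trailing whitespace
trimmed_len <- str_length(str_trim(responses))

# TRUE for empty or whitespace-only entries; NA stays NA
is_blank <- trimmed_len == 0
```

Here raw_len keeps the whitespace-only entries distinct from the empty string, while is_blank flags both as blank and leaves the missing response as NA.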

Using with dplyr pipelines

library(dplyr)

df <- data.frame(
  word = c("apple", "banana", "cherry", "date", "elderberry")
)

df %>%
  mutate(char_count = str_length(word))
#          word char_count
# 1      apple          5
# 2     banana          6
# 3     cherry          6
# 4        date          4
# 5 elderberry         10

str_length() integrates naturally with dplyr::mutate() to create new columns that hold character counts for each row. This pattern is useful when you need to filter, sort, or group records based on the length of a text field within a data pipeline.

Counting characters in sentences

sentences <- c(
  "The quick brown fox.",
  "Hello, world!",
  "R is great."
)

# Get lengths of each sentence
str_length(sentences)
# [1] 20 13 11

# Find the longest sentence
longest <- sentences[which.max(str_length(sentences))]
longest
# [1] "The quick brown fox."

When you have a collection of strings like sentences or titles, str_length() combined with which.max() quickly identifies the longest entry. This technique is handy when you need to find the most verbose item in a set, such as the longest product description or the most detailed survey response.
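Note that which.max() reports only the first position of the maximum. If several strings can tie for longest, comparing against max() keeps all of them. A minimal sketch with a hypothetical titles vector:

```r
library(stringr)

# Hypothetical titles; two of them share the maximum length
titles <- c("alpha", "gamma", "pi")

lens <- str_length(titles)

# which.max() returns only the first index of the maximum
first_longest <- titles[which.max(lens)]

# Comparing against max() keeps every entry tied for longest
all_longest <- titles[lens == max(lens)]
```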

UTF-8 and special characters

# Emoji count as single characters
str_length("🔥")
# [1] 1

# Combined characters
str_length("é")        # Single codepoint
# [1] 1

str_length("e\u0301")  # e + combining accent
# [1] 2

# Non-Latin scripts
str_length("日本語")
# [1] 3

str_length() counts Unicode code points, so a single-code-point emoji and each CJK character count as one unit, which usually matches how users perceive character length. The match is not exact: a precomposed character like é is one code point, while the visually identical decomposed sequence (a base letter followed by a separate combining accent) is two, so the two forms report different lengths.
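When the perceived character count is what you need, two approaches are worth considering: counting character boundaries with str_count() and boundary("character"), or normalising to NFC with stringi::stri_trans_nfc() before measuring. This is a sketch of both, assuming the stringi package is available (stringr depends on it):

```r
library(stringr)

decomposed <- "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Code point count
code_points <- str_length(decomposed)                           # 2

# Count of user-perceived characters via character boundaries
graphemes <- str_count(decomposed, boundary("character"))       # 1

# NFC normalisation merges the pair into one precomposed code point
nfc_length <- str_length(stringi::stri_trans_nfc(decomposed))   # 1
```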

Filtering by string length

library(dplyr)

words <- data.frame(
  term = c("a", "at", "cat", "cats", "caterpillar")
)

# Keep only words with 3-5 characters
words %>%
  filter(between(str_length(term), 3, 5))
#         term
# 1        cat
# 2       cats

Common patterns

  • With dplyr::mutate: Add character count columns
  • With dplyr::filter: Filter rows by string length
  • With dplyr::arrange: Sort by string length
  • With max() / min(): Find longest or shortest strings
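The arrange pattern can be sketched as follows; the words data frame here is made up for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical data frame for illustration
words <- data.frame(
  term = c("caterpillar", "at", "cats", "a", "cat")
)

# Shortest terms first
shortest_first <- words %>%
  arrange(str_length(term))

# Longest terms first
longest_first <- words %>%
  arrange(desc(str_length(term)))
```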

Computing summary statistics on character lengths helps you understand the distribution of string sizes in your dataset before applying transformations. The min(), max(), and mean() functions work directly on the integer vector returned by str_length(), which allows you to quickly spot unusually short or long entries that might need cleaning.

Summary statistics

text_data <- c("short", "medium length", "this is a longer string", "tiny")

# Character length statistics
c(
  min = min(str_length(text_data)),
  max = max(str_length(text_data)),
  mean = mean(str_length(text_data))
)
#   min   max  mean 
#  4.00 23.00 11.25 

Once you know the length characteristics of your strings, you can combine str_length() with other stringr functions to enforce consistent formatting. For example, padding all strings to the length of the longest one creates uniform output that aligns neatly in tables or fixed-width files.

Combined with other stringr functions

# Get only strings longer than 5 characters
words <- c("a", "at", "cat", "cats", "caterpillar")
words[str_length(words) > 5]
# [1] "caterpillar"

# Pad all strings to the length of the longest one
str_pad(words, width = max(str_length(words)), side = "right")
# [1] "a          " "at         " "cat        " "cats       " "caterpillar"

Using str_length() in practice

str_length() counts the number of characters in each string. It is equivalent to base R’s nchar() but integrates with the stringr interface. For ASCII text, character count equals byte count. For UTF-8 strings with multi-byte characters (accented letters, emoji, CJK characters), str_length() returns the number of Unicode code points, not bytes. str_length("café") returns 4, not 5.

str_length() returns NA for NA inputs, consistent with other stringr functions. nchar() also returns NA for a missing string in a character vector by default, but nchar(NA) on a bare logical NA returns 2, and its keepNA argument changes how missing strings are counted. For counting bytes instead of characters, use nchar(x, type = "bytes"); there is no direct stringr equivalent.

The result of str_length() is commonly used to filter strings by length: filter(df, str_length(text) > 10) keeps rows with text longer than 10 characters. It is also used for padding and truncating: str_pad(x, width = max(str_length(x))) pads all strings to the length of the longest one.
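For the truncating side, one possible approach is stringr's str_trunc(), which shortens strings that exceed a given width. The labels vector and the width of 10 below are arbitrary choices for illustration:

```r
library(stringr)

# Hypothetical labels of uneven length
labels <- c("short", "medium length", "this is a longer string")

# Shorten anything over 10 characters; str_trunc() appends "..." by default
# and counts the ellipsis toward the width
truncated <- str_trunc(labels, width = 10)

# Pad the results so every label is exactly 10 characters wide
fixed_width <- str_pad(truncated, width = 10, side = "right")
```

After both steps, str_length(fixed_width) is 10 for every element, which suits fixed-width output.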

str_length(character(0)) returns integer(0), an empty integer vector, not NA, since it is operating on an empty input, not a missing value. This distinction matters in downstream computations that check length() vs. is.na().
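A short sketch of why the distinction matters: an empty result has nothing for is.na() to flag, and summary functions such as max() need a guard. The fallback to NA_integer_ is just one possible convention:

```r
library(stringr)

empty <- character(0)
lens <- str_length(empty)

n_elements <- length(lens)   # 0: there are no elements at all
has_missing <- anyNA(lens)   # FALSE: nothing is missing, the input is just empty

# max() on a zero-length vector warns and returns -Inf, so check length first
longest <- if (length(lens) > 0) max(lens) else NA_integer_
```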

nchar() vs str_length()

str_length() and base R’s nchar() give the same result for most inputs. The differences show up with missing values: str_length() always returns NA for NA input, consistent with other stringr functions, whereas nchar(NA) returns 2 because a bare logical NA is read as the string “NA”. A missing element of a character vector, such as NA_character_, does yield NA from nchar() with the default type = "chars", and the keepNA argument controls that behavior. For code in a tidyverse pipeline, str_length() is the natural choice; in base R without stringr, nchar() works correctly as long as missing values are stored as character NA.
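The following side-by-side comparison illustrates the NA handling of the two functions under default arguments:

```r
library(stringr)

# A bare NA is logical; base R counts it as the two-character string "NA"
base_bare <- nchar(NA)                            # 2

# A missing string inside a character vector stays missing
base_vec <- nchar(c("hello", NA, "world"))        # 5 NA 5

# str_length() returns NA in both situations
stringr_bare <- str_length(NA)                    # NA
stringr_vec <- str_length(c("hello", NA, "world"))  # 5 NA 5
```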

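Both functions count Unicode code points by default. For most text, including precomposed accented letters, CJK characters, and single-code-point emoji, the code point count matches the character count as a human perceives it; combining sequences and multi-code-point emoji are the exceptions. The byte count, which matters for database field lengths or binary protocols, is obtained with nchar(x, type = "bytes"). Wrapping the input in enc2utf8() does not turn str_length() into a byte counter; it still returns code points.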
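As a rough illustration of the gap between code points and bytes, the sketch below compares the two counts for an ASCII word, an accented word, and an emoji. The enc2utf8() call is there only so that the byte count refers to UTF-8 bytes:

```r
library(stringr)

x <- c("cafe", "caf\u00e9", "\U0001F525")   # ASCII, accented, emoji

# Code points per string
chars <- str_length(x)                        # 4 4 1

# Bytes per string in UTF-8
bytes <- nchar(enc2utf8(x), type = "bytes")   # 4 5 4
```

If you already use stringi, stringi::stri_numbytes() is another way to get byte counts.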

str_length() counts Unicode code points, not bytes. For ASCII text these are equivalent, but for multi-byte characters (emoji, Chinese, Arabic) str_length() returns the number of code points while nchar(x, type = "bytes") returns the byte count. Both str_length() and nchar() return NA for a missing string in a character vector; the allowNA argument of nchar() concerns invalid multibyte strings, not missing values. Use str_length(str_trim(x)) == 0 to detect blank or whitespace-only strings.

See also

  • stringr::str_detect(), Detect patterns in strings
  • stringr::str_pad(), Pad strings to a specified length