
read_csv

read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"), skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", trim_ws = TRUE, progress = show_progress(), show_col_types = should_show_types())

Description

read_csv() reads a comma-separated values (CSV) file and returns the data as a tibble. It parses column types automatically using the first guess_max rows (default 1000), and never converts character columns to factors.

read_csv() is a special case of read_delim() with delim = "," fixed.

read_csv() is part of the readr package. Install readr on its own or as part of the tidyverse:

install.packages("readr")     # readr only
install.packages("tidyverse") # full tidyverse

Arguments

file

Path to a CSV file, URL, connection, or raw vector. Supports automatic decompression for .gz, .bz2, .xz, and .zip suffixes. Remote URLs are downloaded before parsing.

To read literal inline data, wrap the string with I():

read_csv(I("x,y\n1,2\n3,4"))
# # A tibble: 2 × 2
#       x     y
#   <dbl> <dbl>
# 1     1     2
# 2     3     4

Pass multiple paths as a character vector to read and row-bind several files at once:

read_csv(c("file1.csv", "file2.csv"))

col_names

Either TRUE (default), FALSE, or a character vector.

  • TRUE — first row supplies column names.
  • FALSE — generate names X1, X2, ....
  • Character vector — use these values as column names; the first row becomes data.

# With col_names = FALSE the header row is read as data, so both columns are character
read_csv(I("a,b\n1,2"), col_names = FALSE)
# # A tibble: 2 × 2
#   X1    X2
#   <chr> <chr>
# 1 a     b
# 2 1     2

col_types

Column type specification. NULL (default) infers types from the first guess_max rows. Pass a cols() specification or a string shorthand to override.

String shorthand:

Letter    Type
l         col_logical()
i         col_integer()
d         col_double()
n         col_number()
c         col_character()
f         col_factor() (levels are inferred from the data unless supplied)
D         col_date()
T         col_datetime()
t         col_time()
?         col_guess()
_ or -    col_skip()

col_factor() and col_skip() are never inferred — you must specify them explicitly. col_guess() is the fallback: it tells readr to infer the type when you’ve specified other columns but want the rest auto-detected.

# String shorthand: double, character, skip
read_csv(I("x,y,z\n1,a,TRUE\n2,b,FALSE"), col_types = "dc_")
# # A tibble: 2 × 2
#       x y
#   <dbl> <chr>
# 1     1 a
# 2     2 b

# cols() specification with explicit types
read_csv(
  I("x,y\n1,a\n2,b"),
  col_types = cols(y = col_factor(levels = c("a", "b")))
)

# Override some columns, guess the rest
read_csv(
  I("x,y,z\n1,a,TRUE\n2,b,FALSE"),
  col_types = cols(x = col_double(), .default = col_guess())
)
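
To confirm what a specification actually produced, it can help to inspect the class of each column. The following is a minimal sketch, assuming readr is installed; the inline data and the variable name col_classes are illustrative only.

```r
library(readr)

# Explicit integer and factor columns; every other column is guessed
df <- read_csv(
  I("x,y,z\n1,a,TRUE\n2,b,FALSE\n"),
  col_types = cols(
    x = col_integer(),
    y = col_factor(levels = c("a", "b")),
    .default = col_guess()
  )
)

# First class of each column, as a named character vector
col_classes <- vapply(df, function(col) class(col)[1], character(1))
col_classes
```

Here z is the only guessed column; x and y take the types given in cols().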

col_select

Select which columns to read using tidyselect syntax. Supports names, numeric indexes, starts_with(), last_col(), and more.

df <- read_csv(
  I("chicken,eggs_laid,weight\nFoghorn,0,2.1\nLittle,3,1.8"),
  col_select = c(chicken, eggs_laid)
)
df
# # A tibble: 2 × 2
#   chicken eggs_laid
#   <chr>       <dbl>
# 1 Foghorn         0
# 2 Little          3

Rename during selection with c(new_name = old_name, ...):

read_csv(
  I("x,y\n1,a\n2,b"),
  col_select = c(new_x = x, y)
)
# # A tibble: 2 × 2
#   new_x y
#   <dbl> <chr>
# 1     1 a
# 2     2 b

id

Supply a string to add a column recording the source file path of each record. Particularly useful when reading multiple files at once:

combined <- read_csv(c("file1.csv", "file2.csv"), id = "source")
# # A tibble: 4 × 3
#   source        x     y
#   <chr>     <dbl> <dbl>
# 1 file1.csv     1     2
# 2 file1.csv     3     4
# 3 file2.csv     5     6
# 4 file2.csv     7     8
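
The multi-file read can be tried without any existing files. This sketch writes two small CSV files to a temporary directory first; the directory name, file contents, and the basename() step for display are assumptions made for illustration.

```r
library(readr)

# Write two small files to a temporary directory (hypothetical data)
dir <- tempfile("csv_demo_")
dir.create(dir)
paths <- file.path(dir, c("file1.csv", "file2.csv"))
writeLines(c("x,y", "1,2", "3,4"), paths[1])
writeLines(c("x,y", "5,6", "7,8"), paths[2])

# Read both files in one call and record each row's source path
combined <- read_csv(paths, id = "source", show_col_types = FALSE)

# Shorten the full path to the file name for display
combined$source <- basename(combined$source)
combined
```

The id column holds the path exactly as it was passed in, so it is shortened here only to keep the printed result readable.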

locale

Controls date format, time format, decimal mark, grouping mark, time zone, and encoding. Use locale() to customize. The default default_locale() is US-centric.

# Read a CSV with European decimal notation.
# The value must be quoted, because "," is also the field delimiter in read_csv().
read_csv(I('x\n"1,5"'), locale = locale(decimal_mark = ","))
# # A tibble: 1 × 1
#       x
#   <dbl>
# 1   1.5

# Read a file with non-UTF-8 encoding
read_csv("data.csv", locale = locale(encoding = "latin1"))
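
When the file itself uses ";" as the delimiter and "," as the decimal mark, read_csv2() is usually the simpler choice, since it applies both conventions without a custom locale. A minimal sketch with illustrative inline data:

```r
library(readr)

# Semicolon-delimited data with "," as the decimal mark
eu <- read_csv2(I("x;y\n1,5;2,25\n"), show_col_types = FALSE)
eu
```

Both columns come back as doubles, so no quoting of the decimal values is needed in this format.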

na

Character vector of strings to interpret as missing values. Default is c("", "NA"). Set na = character() to disable missing-value conversion entirely.

read_csv(I("x\n1\nNA\n"), na = c("", "NA"))          # x is <dbl>: 1, NA
read_csv(I("x\n1\nNA\n"), na = character())          # x is <chr>: "1", "NA"
read_csv(I("x\n1\nN/A\n"), na = c("", "NA", "N/A"))  # x is <dbl>: 1, NA

trim_ws

Logical, defaults to TRUE. Strips leading and trailing whitespace from each field before parsing. Note that read_delim() defaults to FALSE — watch for this difference when switching between functions.
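
The effect of trim_ws is easiest to see on character columns. This is a small sketch with illustrative inline data; col_types = "cc" is set only to keep both columns as text.

```r
library(readr)

csv <- I("x,y\n a , b \n")

# trim_ws = TRUE is the read_csv() default
trimmed <- read_csv(csv, col_types = "cc")
untrimmed <- read_csv(csv, col_types = "cc", trim_ws = FALSE)

trimmed$x    # surrounding whitespace removed
untrimmed$x  # surrounding whitespace kept
```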

skip

Number of lines to skip before reading data. If comment is supplied, commented lines are ignored after skipping. Default is 0.

read_csv(I("header\nx\n1\n2"), skip = 1)
# # A tibble: 2 × 1
#       x
#   <dbl>
# 1     1
# 2     2

n_max

Maximum number of data rows to read. Inf (default) reads all rows. Useful for previewing large files:

read_csv(I("x\n1\n2\n3\n4\n5"), n_max = 2)
# # A tibble: 2 × 1
#       x
#   <dbl>
# 1     1
# 2     2

Note: guess_max is capped at n_max, so type inference uses at most the rows actually read.

guess_max

Maximum rows used for type inference. Default is min(1000, n_max). Increase if early rows are unrepresentative of the full column:

# Use more rows for guessing when the early rows do not represent the whole column,
# for example when a column looks numeric at first and contains text further down
read_csv("data.csv", guess_max = 5000)

name_repair

How to handle duplicate or invalid column names. Options:

  • "minimal" — keep names as-is (may contain duplicates).
  • "unique" (default) — make unique by appending ...1, ...2, etc.
  • "check_unique" — error if any duplicates exist.
  • "unique_quiet" — repair silently.
  • "universal" — make syntactically valid unique names.
  • Custom function — function(nms) c("name1", "name2", ...) returning repaired names.
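
The options can be compared side by side on a header with a duplicate and a non-syntactic name. This is a sketch with illustrative inline data; the custom repair function shown (upper-casing, then make.unique()) is only one possible choice.

```r
library(readr)

csv <- I("x,x,my col\n1,2,3\n")

# Helper for this example only: read the data and return the repaired names
repaired_names <- function(repair) {
  names(read_csv(csv, name_repair = repair, show_col_types = FALSE))
}

minimal   <- repaired_names("minimal")    # names kept as-is
unique_nm <- repaired_names("unique")     # duplicates get positional suffixes
universal <- repaired_names("universal")  # unique and syntactically valid
custom    <- repaired_names(function(nms) make.unique(toupper(nms), sep = "_"))

list(minimal = minimal, unique = unique_nm, universal = universal, custom = custom)
```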

quote, comment

  • quote — single character used to quote strings. Default "\"". Set quote = "" to disable quoting.
  • comment — string that marks a comment; any text after it on a line is ignored. Default "" disables comment handling.

read_csv(I('x\n1\n# comment\n2'), comment = "#")
# # A tibble: 2 × 1
#       x
#   <dbl>
# 1     1
# 2     2

skip_empty_rows

Logical, defaults to TRUE. When TRUE, blank rows are skipped entirely. When FALSE, blank rows are returned as NA across all columns.

num_threads, progress

  • num_threads — number of threads for parallel parsing. Default readr_threads(). readr normally detects newlines inside quoted fields and falls back to a single thread, but setting num_threads = 1 explicitly is safest for such files.
  • progress — display a progress bar. Default show_progress(), which is FALSE in non-interactive sessions (e.g., knitting).

show_col_types

  • NULL (the value should_show_types() returns unless the readr.show_col_types option is set) — print column types only when inferred (i.e., when col_types is not supplied).
  • TRUE — always print column types.
  • FALSE — never print column types.

read_csv(I("x\n1"), col_types = NULL, show_col_types = FALSE)  # silent inference
read_csv(I("x\n1"), col_types = "i", show_col_types = TRUE)    # shows types even though specified

lazy

Logical. Default should_read_lazy(), which is FALSE unless the readr.read_lazy option is set. When TRUE, values are read lazily via vroom, only when they are accessed. Writing back to the same file while a lazy handle is open can cause problems.

Value

Returns a tibble with one column per CSV field and one row per record. Character columns are never auto-converted to factors. Row names are never set.

If there are parsing problems, a warning alerts you to them. Retrieve all problems with problems(df). Throw an error on any problem with stop_for_problems(df):

# Force the column to double: with type guessing, "abc" would simply make x a character column
df <- read_csv(I("x\n1\nabc"), col_types = "d")
# Warning about parsing issues, pointing to problems()
problems(df)
# A tibble with one row per problem and the columns row, col, expected, actual, file

stop_for_problems(df)
# Stops with an error because df has parsing problems

Basic Usage

Read a file from disk:

df <- read_csv("data.csv")

Read a CSV from a URL:

df <- read_csv("https://example.com/data.csv")

Read multiple files, tagged with source:

combined <- read_csv(c("train.csv", "test.csv"), id = "split")

Column Type Specification

Always specify col_factor() and col_skip() explicitly — they are never inferred. Use col_guess() as the fallback when you want readr to infer the type for specific columns:

read_csv(
  I("id,category,score\n1,A,3.2\n2,B,4.1"),
  col_types = cols(
    id = col_integer(),
    category = col_factor(levels = c("A", "B", "C")),
    .default = col_guess()
  )
)

Handling Missing Values

Empty strings and "NA" are NA by default. Add custom values:

read_csv(I("x\n1\nN/A\nnull"), na = c("", "NA", "N/A", "null"))
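
A quick way to check that custom markers were picked up is to look at which values came back as NA and whether the column still parsed as numeric. A minimal sketch using the same illustrative data:

```r
library(readr)

df <- read_csv(
  I("x\n1\nN/A\nnull\n"),
  na = c("", "NA", "N/A", "null"),
  show_col_types = FALSE
)

is.na(df$x)  # which values were treated as missing
class(df$x)  # the remaining values still parse as numbers
```

If a marker is left out of na, it stays in the data as text and the whole column is read as character instead.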

Skipping and Limiting Rows

Combine skip and n_max to read a specific range:

# Skip 10 lines of preamble, then read the header row and 5 data rows
read_csv("data.csv", skip = 10, n_max = 5)
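
The same combination can be tried on inline data. In this sketch the two preamble lines and the weight column are made up for illustration; skip counts raw lines before the header, while n_max counts data rows after it.

```r
library(readr)

# Two preamble lines, then the header, then four data rows (hypothetical data)
csv <- I("exported by tool\nunits: kg\nweight\n10\n20\n30\n40\n")

df <- read_csv(csv, skip = 2, n_max = 2, show_col_types = FALSE)
df$weight
```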

Compared to Base R

Feature               read_csv()    read.csv()
Return type           tibble        data.frame
Strings to factors    never         default FALSE since R 4.0.0 (TRUE before)
Row names             never         optional
Type inference        automatic     limited
Speed                 faster        slower
Dependencies          readr         none

read_csv() is faster, returns a tibble, and never converts strings to factors. read.csv() requires no dependencies but has limited type inference, and before R 4.0.0 it converted strings to factors by default.

For unusual CSV formats, such as those with backslash escapes, alternative quote escaping, or unusual delimiters inside quoted fields, read_delim() exposes additional arguments (escape_backslash and escape_double) that read_csv() does not.

Common Problems

Type inference wrong for late-appearing values: Increase guess_max:

read_csv(I("x\n1\n"), guess_max = 2000)

“NA” in my data is being converted to a missing value: Remove "NA" from na. Use na = "" to treat only empty strings as missing, or na = character() if empty strings are not missing values either:

read_csv(I("x\nNA"), na = character())  # keeps "NA" as character

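Blank rows produce all-NA rows: This happens when skip_empty_rows = FALSE. Leave it at TRUE (the default) so blank lines are dropped while reading, or filter the all-NA rows out afterwards. Note that drop_na() removes every row containing any NA, not only rows that are entirely NA.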

Quote handling with embedded delimiters: If a field contains a comma inside quotes, ensure the quote character is " (default) and the comma is inside the quoted region.
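
A minimal sketch of the quoted-comma case, using illustrative inline data: the comma inside the quoted name is kept as part of the value rather than splitting the field.

```r
library(readr)

df <- read_csv(I('name,city\n"Smith, J",Paris\n'), show_col_types = FALSE)

df$name   # the embedded comma is part of the value
ncol(df)  # still two columns
```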

See Also

  • read_delim() — general delimiter; the underlying engine for read_csv()
  • read_csv2() — CSV with ; delimiter (European format)
  • write_csv() — write a tibble to a CSV file
  • str_sub() — string extraction with tidyverse interface, follows similar design principles
  • fct_reorder() — reordering factor levels, another tidyverse idiom that pairs well with read_csv() for data preparation