read_csv
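read_csv(file, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy())
Description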
read_csv() reads a comma-separated values (CSV) file and returns the data as a tibble. It parses column types automatically using the first guess_max rows (default 1000), and never converts character columns to factors.
read_csv() is a special case of read_delim() with the delimiter fixed to ",". It ships with the readr package, which can be installed on its own or as part of the tidyverse:
install.packages("readr") # readr only
install.packages("tidyverse") # full tidyverse
Arguments
file
Path to a CSV file, URL, connection, or raw vector. Supports automatic decompression for .gz, .bz2, .xz, and .zip suffixes. Remote URLs are downloaded before parsing, which means network latency adds to the read time and large files should be cached locally when possible.
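As a rough illustration of the automatic decompression described above, the sketch below writes a small gzip-compressed file to a temporary path and reads it back. The data frame and the temporary file are made up for the example, and it relies on readr's write_csv() compressing by file suffix.

```r
library(readr)

# Hypothetical example data; any data frame would do
scores <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))

# write_csv() compresses the output because the path ends in .gz
gz_path <- tempfile(fileext = ".csv.gz")
write_csv(scores, gz_path)

# read_csv() picks the decompression method from the same suffix
scores_back <- read_csv(gz_path, show_col_types = FALSE)
```

No extra argument is needed on either side; the suffix alone selects the compression.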
To read literal inline data, wrap the string with I():
read_csv(I("x,y\n1,2\n3,4"))
# # A tibble: 2 × 2
#       x     y
#   <dbl> <dbl>
# 1     1     2
# 2     3     4
Pass multiple paths as a character vector to read and row-bind several files at once. All files must share the same columns; otherwise the read fails. This pattern is common when processing daily exports or sharded datasets where each file represents a single day, batch, or partition.
read_csv(c("file1.csv", "file2.csv"))
col_names
Either TRUE (default), FALSE, or a character vector.
- TRUE: the first row supplies the column names.
- FALSE: names are generated as X1, X2, ....
- Character vector: these values are used as column names; the first row becomes data.
When the CSV file lacks a header row, setting col_names = FALSE tells read_csv() to treat every row as data and assign synthetic names. This is essential for CSV files generated by systems that don’t include column headers. Supplying a character vector is useful when the existing header names are inconsistent or need standardization before downstream analysis.
read_csv(I("1,2\n3,4"), col_names = FALSE)
# # A tibble: 2 × 2
#      X1    X2
#   <dbl> <dbl>
# 1     1     2
# 2     3     4
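The character-vector form can be sketched as follows. The column names sensor_id and value are invented for illustration; the second call shows one way to replace an inconsistent header by skipping it and supplying names instead.

```r
library(readr)

# No header row in the data: supply the names directly
readings <- read_csv(
  I("1,2\n3,4"),
  col_names = c("sensor_id", "value"),
  show_col_types = FALSE
)

# Header present but unwanted: skip it, then supply standardized names
renamed <- read_csv(
  I("Sensor ID,VALUE\n1,2\n3,4"),
  col_names = c("sensor_id", "value"),
  skip = 1,
  show_col_types = FALSE
)
```

Both calls should yield the same two-row tibble, since skip = 1 drops the original header before the supplied names are applied.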
col_types
Column type specification. NULL (default) infers types from the first guess_max rows. Pass a cols() specification or a string shorthand to override.
String shorthand:
| Letter | Type |
|---|---|
| l | col_logical() |
| i | col_integer() |
| d | col_double() |
| n | col_number() |
| c | col_character() |
| f | col_factor() (requires levels) |
| D | col_date() |
| T | col_datetime() |
| t | col_time() |
| ? | col_guess() |
| _ or - | skip column |
col_factor() and col_skip() are never inferred — you must specify them explicitly. col_guess() is the fallback for columns where you want auto-detection alongside explicitly-typed columns. The string shorthand provides a compact alternative: each character in the string corresponds to one column’s type.
# String shorthand: d=double, c=character, _=skip
read_csv(I("x,y,z\n1,a,TRUE\n2,b,FALSE"), col_types = "dc_")
# cols() for explicit types with factor levels
read_csv(
I("x,y\n1,a\n2,b"),
col_types = cols(y = col_factor(levels = c("a", "b")))
)
# Mix explicit types with .default for the rest
read_csv(
I("x,y,z\n1,a,TRUE\n2,b,FALSE"),
col_types = cols(x = col_double(), .default = col_guess())
)
col_select
Read only the columns you need with tidyselect syntax, reducing memory usage for wide files. Supports column names, numeric indexes, and helpers like starts_with(), ends_with(), and last_col(). Selecting a subset of columns also reduces parse time since unselected columns are never processed, making it a performance optimization for very wide CSV files where only a few columns are needed.
df <- read_csv(
I("chicken,eggs_laid,weight\nFoghorn,0,2.1\nLittle,3,1.8"),
col_select = c(chicken, eggs_laid)
)
df
# # A tibble: 2 × 2
#   chicken eggs_laid
#   <chr>       <dbl>
# 1 Foghorn         0
# 2 Little          3
Renaming columns during selection saves a separate rename() step. The c(new_name = old_name) syntax maps each new name to its corresponding old name, which is useful when upstream column names are cryptic or inconsistent with the naming conventions used in the rest of the pipeline.
read_csv(
I("x,y\n1,a\n2,b"),
col_select = c(new_x = x, y)
)
# # A tibble: 2 × 2
#   new_x y
#   <dbl> <chr>
# 1     1 a
# 2     2 b
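The selection helpers and numeric indexes mentioned for col_select can be combined in the same way as names. This is a minimal sketch with made-up column names (id, temp_am, temp_pm, notes).

```r
library(readr)

# Hypothetical wide input with two temperature columns
wide <- I("id,temp_am,temp_pm,notes\n1,10.5,15.2,ok\n2,11.0,16.8,ok")

# Keep id plus every column whose name starts with "temp_"
temps <- read_csv(
  wide,
  col_select = c(id, starts_with("temp_")),
  show_col_types = FALSE
)

# Select by position: the first and the last column
ends <- read_csv(wide, col_select = c(1, last_col()), show_col_types = FALSE)
```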
id
Supply a string to add a column recording the source file path of each record. This is particularly useful when reading multiple files at once, as the id column lets you trace each row back to its original file without keeping them in separate data frames.
combined <- read_csv(c("file1.csv", "file2.csv"), id = "source")
# # A tibble: 4 × 3
#   source        x     y
#   <chr>     <dbl> <dbl>
# 1 file1.csv     1     2
# 2 file1.csv     3     4
# 3 file2.csv     5     6
# 4 file2.csv     7     8
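For a version that runs without pre-existing files, the sketch below creates two small files in a temporary directory first. The file names day1.csv and day2.csv are hypothetical. Because the id column stores each path as it was passed in, basename() is used afterwards to keep only the file name.

```r
library(readr)

# Two hypothetical daily exports that share the same columns
export_dir <- tempfile("exports")
dir.create(export_dir)
paths <- file.path(export_dir, c("day1.csv", "day2.csv"))
writeLines(c("x,y", "1,2", "3,4"), paths[1])
writeLines(c("x,y", "5,6", "7,8"), paths[2])

# Read both files and record the origin of every row
combined <- read_csv(paths, id = "source", show_col_types = FALSE)

# Reduce the stored path to the bare file name
combined$source <- basename(combined$source)
```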
locale
Controls date format, time format, decimal mark, grouping mark, time zone, and encoding. Use locale() to customize these settings. The default default_locale() is US-centric, which means a decimal mark of . and a grouping mark of ,. For data produced in countries using European conventions, you need to override these defaults explicitly.
# Read a CSV with European decimal notation.
# The value must be quoted, because "," is still the field delimiter.
read_csv(I("x\n\"1,5\""), locale = locale(decimal_mark = ","))
# # A tibble: 1 × 1
#       x
#   <dbl>
# 1   1.5
# Read a file with non-UTF-8 encoding
read_csv("data.csv", locale = locale(encoding = "latin1"))
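Date formats and grouping marks can be overridden through the same locale() object. The sketch below assumes day/month/year dates and numbers written like 1.234,5; the column names when and amount are invented. The numbers are quoted because the comma remains the field delimiter in read_csv().

```r
library(readr)

# Assumed input conventions: dd/mm/yyyy dates, "." grouping, "," decimals
raw <- I('when,amount\n31/12/2023,"1.234,5"\n01/02/2024,"2.000,0"')

eu <- locale(
  date_format = "%d/%m/%Y",
  decimal_mark = ",",
  grouping_mark = "."
)

# col_date() with no format falls back to the locale's date_format;
# col_number() honors the locale's decimal and grouping marks
sales <- read_csv(
  raw,
  col_types = cols(when = col_date(), amount = col_number()),
  locale = eu
)
```

For files that use ; as the delimiter and , as the decimal mark throughout, read_csv2() is usually the simpler choice.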
na
Character vector of strings to interpret as missing values. Default is c("", "NA"). Set na = character() to disable missing-value conversion. Data sources often encode missing values in non-standard ways, such as "N/A", "null", "-999", "missing", or "NaN", and these must be listed in na explicitly. Adding such markers lets an otherwise numeric column be parsed as numeric instead of falling back to character or producing parsing failures.
read_csv(I("x\n1\nNA\n"), na = c("", "NA")) # x is <dbl>: 1, NA
read_csv(I("x\n1\nNA\n"), na = character()) # x is <chr>: "1", "NA"
read_csv(I("x\n1\nN/A\n"), na = c("", "NA", "N/A")) # x is <dbl>: 1, NA
trim_ws
Logical, defaults to TRUE. Strips leading and trailing whitespace from each field before parsing. Note that read_delim() defaults to FALSE; watch for this difference when switching between functions.
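The effect is easiest to see on a character column. A minimal sketch with made-up padded values:

```r
library(readr)

# Hypothetical input with padding around every field
padded <- I("name,code\n  alice  ,  A1  ")

# Default behavior: surrounding whitespace is removed
trimmed <- read_csv(padded, show_col_types = FALSE)

# trim_ws = FALSE keeps the fields exactly as written
untrimmed <- read_csv(padded, trim_ws = FALSE, show_col_types = FALSE)
```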
skip
Number of lines to skip before reading data. Default is 0. If comment is supplied, commented lines are ignored after skipping.
read_csv(I("header\nx\n1\n2"), skip = 1)
# # A tibble: 2 × 1
#       x
#   <dbl>
# 1     1
# 2     2
n_max
Maximum number of data rows to read. Inf (default) reads all rows. This is primarily useful for previewing large files before committing to a full read, or for working with the first N rows of a streaming data source. Because only n_max rows are parsed, a small value lets you inspect column types and data structure without loading gigabytes of data.
read_csv(I("x\n1\n2\n3\n4\n5"), n_max = 2)
# # A tibble: 2 × 1
#       x
#   <dbl>
# 1     1
# 2     2
Note: guess_max is capped at n_max, so type inference uses at most the rows actually read. This means a small n_max combined with default guess_max can produce incorrect type guesses if the first few rows are not representative of the full file.
guess_max
Maximum rows used for type inference. Default is min(1000, n_max). Increase this value if early rows are unrepresentative of the full column, as can happen when a column is empty or numeric for the first thousand rows but contains text later on. Whole numbers are already guessed as double, so integer-like values followed by fractional ones do not cause problems.
# Suppose the first 1000 rows are numeric, but row 1001 contains text
read_csv(I("x\n1\n2\n"), guess_max = 1001)
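A small, self-contained way to see the effect is to shrink guess_max below the size of a tiny input. This sketch uses a made-up four-row column whose last value is text; the warning from the first read is suppressed only to keep the example quiet.

```r
library(readr)

# Three numeric rows followed by one text value
late_text <- I("x\n1\n2\n3\nabc")

# Guessing from only 2 rows sees numbers, so "abc" fails to parse and becomes NA
narrow <- suppressWarnings(
  read_csv(late_text, guess_max = 2, show_col_types = FALSE)
)

# Guessing from every row sees the text, so the column is read as character
full <- read_csv(late_text, guess_max = Inf, show_col_types = FALSE)
```

Calling problems(narrow) afterwards lists the value that could not be parsed.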
name_repair
How to handle duplicate or invalid column names. Options:
- "minimal": keep names as-is (may contain duplicates).
- "unique" (default): make unique by appending ...1, ...2, etc.
- "check_unique": error if any duplicates exist.
- "unique_quiet": repair silently.
- "universal": make syntactically valid unique names.
- Custom function: function(nms) c("name1", "name2", ...) returning repaired names.
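The options above can be compared on a small input. This sketch uses invented headers; suppressMessages() only hides the renaming notice that "unique" and "universal" normally print, and the custom repair function is a hypothetical example.

```r
library(readr)

# "unique" appends a positional suffix to duplicated names
unique_names <- names(suppressMessages(
  read_csv(I("a,a\n1,2"), name_repair = "unique", show_col_types = FALSE)
))

# "universal" also makes names syntactically valid
universal_names <- names(suppressMessages(
  read_csv(I("my col,b\n1,2"), name_repair = "universal", show_col_types = FALSE)
))

# A function receives the original names and returns the repaired ones
custom_names <- names(read_csv(
  I("a,a\n1,2"),
  name_repair = function(nms) paste0(nms, "_", seq_along(nms)),
  show_col_types = FALSE
))
```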
quote, comment
- quote: quote character, default "\"" (a double quote). Set quote = "" to disable quoting.
- comment: string that marks comments; any text after it is ignored. Default "" disables comment handling.
read_csv(I('x\n1\n# comment\n2'), comment = "#")
# # A tibble: 2 × 1
# x
# <dbl>
# 1 1
# 2 2
skip_empty_rows
Logical, defaults to TRUE. When TRUE, blank rows are skipped entirely. When FALSE, blank rows are returned as NA across all columns.
num_threads, progress
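- num_threads: number of threads for parallel parsing. Default readr_threads(). The parser should fall back to a single thread when it detects newlines inside quoted fields, but setting num_threads = 1 explicitly is safest if you know the file contains them.
- progress: display a progress bar. Default show_progress(), which is FALSE in non-interactive sessions (e.g., knitting).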
show_col_types
- NULL (default): print column types only when inferred (i.e., when col_types is not supplied).
- TRUE: always print column types.
- FALSE: never print column types.
read_csv(I("x\n1"), col_types = NULL, show_col_types = FALSE) # silent inference
read_csv(I("x\n1"), col_types = "i", show_col_types = TRUE) # shows types even though specified
lazy
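Logical, defaults to should_read_lazy(), which returns FALSE unless the readr.read_lazy option is set. When TRUE, the file is read lazily via vroom, so values are parsed only when they are accessed. Writing back to the same file while a lazy handle is open can cause problems.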
Value
Returns a tibble with one column per CSV field and one row per record. Character columns are never auto-converted to factors. Row names are never set.
If there are parsing problems, a warning is raised and the affected values are set to NA. Retrieve all problems with problems(df). Throw an error on any problem with stop_for_problems(df):
df <- read_csv(I("x\n1\nabc"), col_types = "d")
# Warning: One or more parsing issues, call `problems()` on your data frame for details
problems(df)
# # A tibble: 1 × 5
#     row   col expected actual file
#   <int> <int> <chr>    <chr>  <chr>
# (one row: expected "a double", actual "abc")
stop_for_problems(df)
# Error: 1 parsing failure
Basic usage
The most common invocation reads a local CSV file from disk. read_csv() splits fields on commas, infers column types from the first 1000 rows, and returns a tibble. Column names are taken from the first row unless col_names is explicitly set.
df <- read_csv("data.csv")
Reading from a URL works the same way: the file is downloaded, parsed, and returned as a tibble. This is convenient for loading datasets hosted on GitHub, public data portals, or any web server that serves raw CSV content. The download happens synchronously, so large files may take time to fetch.
df <- read_csv("https://example.com/data.csv")
When you need to combine multiple CSV files that share the same schema, pass a character vector of paths. read_csv() reads each file in sequence and row-binds them into a single tibble. The id argument adds a column recording which file each row came from, which is useful for tracking provenance in combined datasets.
combined <- read_csv(c("train.csv", "test.csv"), id = "split")
Column type specification
Always specify col_factor() and col_skip() explicitly, since they are never inferred from the data. Use col_guess() as the fallback when you want readr to infer the type for specific columns while maintaining full control over others. The .default argument in cols() applies a default type to any column not explicitly named, which simplifies specifications for wide files.
read_csv(
I("id,category,score\n1,A,3.2\n2,B,4.1"),
col_types = cols(
id = col_integer(),
category = col_factor(levels = c("A", "B", "C")),
.default = col_guess()
)
)
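When only a handful of columns matter, readr's cols_only() is an alternative to cols(): it reads the named columns with the given types and drops everything else. A minimal sketch with made-up column names:

```r
library(readr)

# Only id and score are read; category and comment are dropped
scores <- read_csv(
  I("id,category,score,comment\n1,A,3.2,fine\n2,B,4.1,good"),
  col_types = cols_only(
    id = col_integer(),
    score = col_double()
  )
)
```

Because every retained column has an explicit type, nothing is left to inference.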
Handling missing values
Empty strings and the literal string "NA" are converted to NA by default. When your data uses other conventions for missing data, add those values to the na argument. Each value is matched exactly against the field's text before type parsing occurs, so variants such as "n/a" and "N/A" must be listed separately.
read_csv(I("x\n1\nN/A\nnull"), na = c("", "NA", "N/A", "null"))
Skipping and limiting rows
When you know that metadata or comment lines precede the actual data, skip jumps past them before parsing begins. Combining skip with n_max lets you extract a specific slice from the middle of a file, which is particularly useful for data files that have a multi-line header block followed by tabular records.
# Skip 10 header lines, read 5 data rows
read_csv("data.csv", skip = 10, n_max = 5)
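The same pattern can be tried on inline data. This sketch assumes a hypothetical export with two metadata lines ahead of the real header row.

```r
library(readr)

# Hypothetical export: two metadata lines, then a header row and data
export <- I("report: weights\nunits: kg\nid,weight\n1,2.5\n2,3.1\n3,4.0")

# skip = 2 jumps past the metadata, so line 3 supplies the column names;
# n_max = 2 then limits the result to the first two data rows
first_two <- read_csv(export, skip = 2, n_max = 2, show_col_types = FALSE)
```

Note that skip counts lines before the header, so the first line after the skipped block is still treated as column names unless col_names says otherwise.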
Compared to base R
| Feature | read_csv() | read.csv() |
|---|---|---|
| Return type | tibble | data.frame |
| Strings to factors | never | default FALSE since R 4.0.0 (TRUE before) |
| Row names | never | optional |
| Type inference | automatic | limited |
| Speed | faster | slower |
| Dependencies | readr | none |
read_csv() is faster and returns a tibble. read.csv() requires no dependencies but has more limited type inference, and before R 4.0.0 it converted strings to factors by default.
For unusual formats, such as backslash escapes, alternative quote escaping, or a delimiter other than a comma, read_delim() exposes arguments (delim, escape_backslash, escape_double) that read_csv() does not.
Common problems
Type inference wrong for late-appearing values: Increase guess_max:
read_csv(I("x\n1\n"), guess_max = 2000)
“NA” in my data is being converted to a missing value: Remove "NA" from na (for example, na = ""), or use na = character() if empty strings are not missing values either. This is a common issue when working with datasets that use “NA” as a genuine category label (such as “North America”) rather than a missing value indicator.
read_csv(I("x\nNA"), na = character()) # keeps "NA" as character
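Blank rows produce all-NA rows: This happens when skip_empty_rows = FALSE, which keeps blank lines as rows with NA in every column. Leave skip_empty_rows at its default of TRUE to drop them while reading, or remove them afterwards, for example with tidyr's drop_na() (which drops every row containing any NA).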
Quote handling with embedded delimiters: If a field contains a comma inside quotes, ensure the quote character is " (default) and the comma is inside the quoted region.
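A minimal sketch of a correctly quoted field, using invented names and cities:

```r
library(readr)

# The name field contains a comma, so it is wrapped in double quotes
people <- read_csv(
  I('name,city\n"Smith, J",Paris\n"Doe, A",Oslo'),
  show_col_types = FALSE
)
```

The result has two columns, with the comma preserved inside name. Without the quotes, each of those rows would split into three fields.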
See also
- read_delim(): general delimiter; the underlying engine for read_csv()
- read_csv2(): CSV with ; delimiter (European format)
- write_csv(): write a tibble to a CSV file
- str_sub(): string extraction with a tidyverse interface; follows similar design principles
- fct_reorder(): reordering factor levels, another tidyverse idiom that pairs well with read_csv() for data preparation
- Data Manipulation with dplyr: for further processing of imported data
- Reading and Writing CSV Files in R: a practical guide to CSV workflows in R