Reading and Writing CSV Files in R
CSV files are the bread and butter of data work. Almost every dataset you encounter will come as a CSV at some point, and you’ll need to export your cleaned data the same way. R gives you two main paths: base R functions that have been around forever, and the readr package from the tidyverse that handles edge cases better. This guide covers both.
Reading CSV files in base R
Base R includes read.csv() and read.csv2() for reading comma-separated and semicolon-separated files. These functions return a data frame, which is what you want most of the time.
# Basic reading - first row becomes column names
df <- read.csv("data/survey.csv")
# If your file has no header row
df <- read.csv("data/survey.csv", header = FALSE)
# Specify column classes upfront - saves memory and parsing time
df <- read.csv("data/survey.csv",
colClasses = c("character", "numeric", "integer", "character"))
# Handle missing values - these strings become NA
df <- read.csv("data/survey.csv", na.strings = c("NA", "", "N/A", "-9999"))
The header argument matters more than you might think. Some datasets ship without headers, especially exports from legacy systems. Others have the first row as metadata rather than column names. Check your file before you import.
colClasses is worth using even when you don’t have to. It forces R to read each column as the type you specify, which prevents the common headache of strings accidentally being converted to factors or numbers being read as strings. This also makes reading large files noticeably faster.
Reading CSV files with readr
The readr package (part of tidyverse) offers a modern alternative. It returns tibbles instead of data frames, handles encoding better, and gives you progress bars for large files.
library(readr)
# Basic reading - much faster than base R for large files
df <- read_csv("data/survey.csv")
# Explicit column types - recommended for reproducibility
df <- read_csv("data/survey.csv",
col_types = cols(
id = col_integer(),
name = col_character(),
score = col_double(),
date = col_date(format = "%Y-%m-%d")
))
# Skip lines at the top - useful when there's metadata before the data
df <- read_csv("data/survey.csv", skip = 3)
# Read only specific columns - saves memory on wide files
df <- read_csv("data/survey.csv",
col_select = c(id, name, score))
# Handle missing values cleanly
df <- read_csv("data/survey.csv", na = c("NA", "", "N/A", "-9999"))
The first argument is always the file path. readr can also read directly from compressed files without you needing to unzip them first.
Tibbles behave slightly differently from data frames. They don’t do partial matching on column names and they print more cleanly. Most of the time this is exactly what you want.
Writing CSV files
Exporting data works the same way - you have base R and readr options.
# Base R - write.csv includes row names by default, usually unwanted
write.csv(df, "output/survey_cleaned.csv", row.names = FALSE)
# Specify NA representation
write.csv(df, "output/survey_cleaned.csv",
row.names = FALSE,
na = "-9999")
# readr version - cleaner output by default
library(readr)
write_csv(df, "output/survey_cleaned.csv")
# Write to a specific encoding
write_csv(df, "output/survey_cleaned.csv",
na = "-9999",
quote = "needed")
Watch out for the row.names argument in base R. It adds an unnamed first column with row numbers, which breaks downstream processing. Always set it to FALSE unless you specifically need those row names.
The readr functions are more explicit about quoting behavior. They only quote when necessary, whereas base R sometimes over-quotes.
Handling edge cases
Real CSV files are messier than textbook examples. Here are the common problems and how to solve them.
# Files with different delimiters - readr handles these
read_csv2("data/european_sales.csv") # semicolon delimited
read_tsv("data/tab_separated.txt") # tab delimited
read_delim("data/mixed_delimiter.txt", delim = "|")
# Handle quoted fields with commas inside
# "Smith, John",50000 becomes two fields in some files
df <- read_csv("data/complex.csv",
quote = "\"",
escape_backslash = TRUE)
# Multiple lines in a single cell - readr handles this automatically
df <- read_csv("data/has_newlines.csv")
# Character encoding problems - specify encoding explicitly
df <- read_csv("data/german_data.csv",
locale = locale(encoding = "ISO-8859-1"))
# Files spread across multiple lines per record
df <- read_fwf("data/fixed_width.txt",
fwf_cols(id = 1, name = 10, value = 5),
col_types = "icn")
The tidyverse uses locale() to control encoding, date formats, and decimal markers. If you’re reading data from different countries, you’ll run into different conventions for dates and decimals.
Performance tips for large files
Reading million-row CSV files can bring R to its knees if you do it wrong. These techniques help.
# Read only first N rows - useful for quick inspection
sample <- read_csv("data/million_rows.csv", n_max = 1000)
# Increase guess_max when column types are misdetected
df <- read_csv("data/messy.csv",
guess_max = 10000)
# Progress bar for long reads
df <- read_csv("data/huge.csv",
show_col_types = TRUE)
# For truly massive files, consider data.table::fread
library(data.table)
df <- fread("data/million_rows.csv",
nrows = 100000,
skip = 1000000)
The guess_max parameter controls how many rows readr looks at to guess column types. If your file has unusual data early on mixed with normal data later, increase this value.
The data.table package’s fread() function is significantly faster than both base R and readr for large files. It’s worth learning if you regularly work with datasets over 100MB. It returns a data.table, which is a different beast from data frames or tibbles.
When to use each approach
Base R’s read.csv() works fine for small files and quick one-off analyses. It’s always available, which matters when you’re sharing code with people who don’t use tidyverse.
Use readr for most everyday work. The tibble output, explicit type specification, and better defaults make your code more reliable. The progress bars are genuinely useful for files over 10MB.
Switch to data.table::fread when speed matters and you’re comfortable with data.table syntax. It’s the fastest option by a significant margin for files over 50MB.
For writing, the choice matters less. write.csv() is fine for small exports. write_csv() is cleaner and faster for everything else.
Format-specific considerations
CSV files are text-based and universally supported but have limitations: they cannot store data types (all values are strings), they cannot store multiple sheets, and commas in values require quoting conventions. For data interchange, CSV is appropriate when the recipient does not use R. For internal storage, Parquet or RDS is more efficient.
readr::read_csv() handles quoted commas correctly: "value,with,commas" is read as a single field. TSV (tab-separated) avoids the quoting complication because tab characters rarely appear in data values. read_tsv() is the equivalent for TSV files.
Large file handling
For files too large to fit in memory, chunked reading with readr::read_csv_chunked() processes the file in pieces: read_csv_chunked("large.csv", callback = function(chunk, pos) process(chunk)). The callback receives each chunk as a tibble and a byte position. Aggregate results across chunks and combine at the end.
data.table::fread() is 5-10x faster than read_csv() for large files and handles various delimiters automatically. It also supports nrows and skip for partial reads. For columnar access patterns, converting to Parquet with arrow::write_parquet() and reading with arrow::open_dataset() is faster for repeated analysis.
Writing reliably
write_csv() uses UTF-8 encoding and quotes fields containing commas or newlines. write_excel_csv() (with BOM) ensures Excel opens the file with correct encoding. For European locales where the decimal separator is a comma, use write_csv2() which uses semicolons as the field separator.
Always write to a temporary file and move it into place atomically to prevent partial writes corrupting the output file: write_csv(df, tmp <- tempfile()); file.rename(tmp, "output.csv").
Performance comparison
For reading large CSV files, data.table::fread() is the fastest option in R, typically 5-10x faster than readr::read_csv() and 20-50x faster than read.csv(). For smaller files (under a few MB), the difference is imperceptible. fread() auto-detects delimiters, column types, and encodings without configuration.
Encoding issues
CSV files from non-English sources often use Latin-1 or Windows-1252 encoding. Both read.csv() and read_csv() default to UTF-8. Specify the encoding explicitly: read_csv("file.csv", locale = locale(encoding = "latin1")). The readr::guess_encoding() function detects the most likely encoding from a sample of the file.
Writing for reproducibility
When writing CSV files for data exchange, follow these conventions: use UTF-8 encoding, use ISO 8601 format for dates (YYYY-MM-DD), avoid special characters in column names, and include a header row. write_csv() from readr follows these conventions by default. Base R write.csv() adds row names by default — always pass row.names = FALSE unless row names carry information.
CSV as the universal format
CSV is the lingua franca of data exchange. Every spreadsheet application, every database, every statistical package, and every programming language can read and write CSV. Its universality comes from its simplicity: rows are separated by newlines, fields within rows are separated by a delimiter (usually a comma), and text fields are optionally quoted. This simplicity also creates ambiguity — embedded newlines, delimiter characters in field values, and inconsistent quoting all require careful handling.
R’s base read.csv and the tidyverse’s readr::read_csv differ in speed, default types, and behavior on edge cases. readr is faster for large files and does not convert string columns to factors automatically. Base read.csv is more conservative and produces standard data frames without extra class attributes. For most purposes readr is the better choice; for scripts that must run without installing tidyverse packages, base R works fine.
Handling messy CSV files
Real-world CSV files often have non-standard structures: metadata rows at the top before the actual data, multiple header rows, inconsistent delimiters, or missing values encoded as empty strings, “NA”, ”-”, or “NULL”. The skip argument handles leading metadata rows. The na argument specifies strings that should be read as NA. The col_types argument forces column types when readr’s inference makes wrong choices.
Files with semicolon delimiters, which are common in European software that uses commas as decimal separators, require read_csv2 (readr) or read.csv2 (base R) rather than the standard read_csv. The difference is just the default delimiter and decimal separator — the functions are otherwise identical. Using the wrong function on a semicolon-delimited file reads the entire row as a single column with no error, which is a common import mistake to diagnose.
Writing for other consumers
When exporting CSV for consumption by other tools, explicitly control the encoding and line endings. readr’s write_csv writes UTF-8 with Unix line endings, which is correct for most modern consumers. Excel on Windows sometimes prefers Windows-1252 encoding and Windows line endings — write_excel_csv adds a UTF-8 byte order mark that helps Excel recognize the encoding. Include the BOM only for files specifically intended for Excel; other consumers often handle the BOM poorly.
Quoting behavior matters for portability. Fields containing the delimiter character, newlines, or double quotes must be quoted. readr quotes all fields that need quoting and handles the escaping correctly. Custom data frames written with cat or manual string building frequently get quoting wrong. Always use a proper CSV writer function rather than assembling CSV strings manually.