R | CSV reading options

R has many options for reading CSV files; this comparison of several of them is meant to help you decide which one to use. These functions (or related ones in the same package) also handle other delimited file types, such as TSV files; unless otherwise noted, the information here applies to those file types as well.
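
As a minimal sketch of the point about other delimiters, the example below uses only the built-in utils readers. The file contents are made up for illustration, and the path is a temporary file rather than anything from a real project.

```r
# Write a small tab-separated file to a temporary location (illustrative data).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname", "1\talpha", "2\tbeta"), tsv_path)

# read.delim() is the tab-delimited counterpart of read.csv() in utils.
via_delim <- read.delim(tsv_path)

# The same file read through read.csv() with an explicit separator.
via_csv <- read.csv(tsv_path, sep = "\t")

print(via_delim)
```

The other packages covered here expose the delimiter in a similar way, for example the delim argument of vroom::vroom and the sep argument of data.table::fread.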

At a high level, vroom::vroom, data.table::fread, and arrow::read_csv_arrow are all generally comparable to each other; which one to prefer depends mainly on the table format you want the output in without any further conversion. The readers from utils and readr should be used only when there is no other choice (for example, because of package installation limitations), or when the dataset is small enough that differences in ingestion time don't matter.

Each entry below lists, in order: function, source, description, pros, cons, and takeaways.

Function: read.csv, read.csv2

Source: utils (built-in package)

Description: The built-in reading functions that ship with the core R distribution.

Pros:
- built into the core R distribution, with no dependencies on other packages

Cons:
- very slow compared to other options

Takeaways:
- only suitable for small tables
- use only when unable to install packages beyond core R
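
A minimal sketch of the two built-in readers, assuming small made-up files written to temporary paths. read.csv expects comma separators with "." as the decimal mark, while read.csv2 expects semicolon separators with "," as the decimal mark.

```r
# Comma-separated file with "." decimals (illustrative data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,price", "1,2.5", "2,3.75"), csv_path)

# Semicolon-separated file with "," decimals, holding the same values.
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("id;price", "1;2,5", "2;3,75"), csv2_path)

# Both calls return a plain data.frame.
prices <- read.csv(csv_path)
prices2 <- read.csv2(csv2_path)

str(prices)
```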

Function: read_csv

Source: readr (part of Tidyverse)

Description: Tidyverse's basic csv reading function; returns a tibble.

Pros:
- if using dplyr, no conversion to tibble needed

Cons:
- slow compared to other options

Takeaways:
- can use when performing dplyr analysis on small tables
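
A short sketch of read_csv returning a tibble, assuming readr is installed. The file and its column names are hypothetical; the compact col_types string ("icd" for integer, character, double) is used here mainly to keep the column types explicit.

```r
library(readr)

# Small illustrative file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,alpha,0.5", "2,beta,0.75"), csv_path)

# Declare the column types up front instead of relying on guessing.
scores <- read_csv(csv_path, col_types = "icd")

# The result is already a tibble, so it can be passed to dplyr verbs as-is.
class(scores)
```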

Function: vroom

Source: vroom (part of Tidyverse)

Description: Alternate Tidyverse csv reader; performs lazy loading (doesn't parse any data from files until the data is used), though this can be disabled with the altrep = FALSE option.

Pros:
- very fast, especially when lazy loading is enabled
- can read multiple files

Cons:
- with lazy reading, data parsing errors surface only when the data is actually used, which may result in unpredictable bugs

Takeaways:
- convenient if there are many source files to read into a single table
- useful when the full source data doesn't fit into memory, but the parts that need to be processed (e.g. a sample) do
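
The sketch below illustrates reading several files into one table and switching lazy loading off, assuming vroom is installed. The two part files are invented for the example and share the same columns, which is what combining them into a single table relies on.

```r
library(vroom)

# Two illustrative files with the same header, in a temporary directory.
parts_dir <- tempfile("parts_")
dir.create(parts_dir)
part_a <- file.path(parts_dir, "part_a.csv")
part_b <- file.path(parts_dir, "part_b.csv")
writeLines(c("id,value", "1,10", "2,20"), part_a)
writeLines(c("id,value", "3,30"), part_b)

# Passing a vector of paths stacks the files into a single tibble.
# Values are parsed lazily, as they are accessed.
combined <- vroom(c(part_a, part_b), delim = ",", col_types = "id")

# altrep = FALSE parses everything up front, so parsing problems
# show up at read time rather than later in the analysis.
eager <- vroom(c(part_a, part_b), delim = ",", col_types = "id", altrep = FALSE)

nrow(combined)
```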

Function: fread

Source: data.table

Description: The fast file reader that data.table uses; returns either data.table or data.frame.

Pros:
- very fast and still reads all data into memory

Cons:
- cannot scale to data that do not fit into memory

Takeaways:
- use if data.table is the data analysis format
- good general option for single tables that are larger than a small table but can still fit in memory
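
A brief sketch of the two return types, assuming data.table is installed. The file contents are hypothetical; the data.table argument of fread selects between a data.table (the default) and a plain data.frame.

```r
library(data.table)

# Small illustrative file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,10", "2,b,20", "3,a,30"), csv_path)

# Default: returns a data.table.
dt <- fread(csv_path)

# data.table = FALSE returns a plain data.frame instead.
df <- fread(csv_path, data.table = FALSE)

class(dt)
class(df)
```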

Function: read_csv_arrow

Source: arrow (the R package for Apache Arrow)

Description: Part of the Apache Arrow implementation in R; returns either an Arrow Table or a data.frame. Returning a Table keeps the data in Arrow's own format and defers conversion to R objects until the data is used, which gives lazy-loading behavior.

Pros:
- extremely fast initial load time when lazy reading is enabled
- comparable to fread when lazy loading is disabled

Cons:
- conversion to tibble or data.table needed if those formats are needed

Takeaways:
- use if you want to use Apache Arrow's Table as the analysis format
- still a good option comparable to vroom and fread otherwise
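
A minimal sketch of the two return types, assuming the arrow package is installed. The file is made up for illustration; the as_data_frame argument of read_csv_arrow controls whether an Arrow Table or a data.frame comes back.

```r
library(arrow)

# Small illustrative file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,10", "2,b,20", "3,a,30"), csv_path)

# as_data_frame = FALSE keeps the result as an Arrow Table.
tbl <- read_csv_arrow(csv_path, as_data_frame = FALSE)

# The default (as_data_frame = TRUE) converts to a data.frame on read.
df <- read_csv_arrow(csv_path)

# A Table can be converted later, once R-native columns are needed.
converted <- as.data.frame(tbl)

class(tbl)
```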

