R | CSV reading options

R has many options for reading CSV files; this comparison is meant to help you decide which one to use. These functions (or related ones in the same package) also handle other delimited file types, such as TSV files; unless otherwise noted, the information here applies to those file types too.

At a high level, `vroom::vroom`, `data.table::fread`, and `arrow::read_csv_arrow` are generally comparable to each other; which one to prefer depends mainly on the table format you want the output in without any conversion (`tibble`, `data.table`, or Arrow `Table`, respectively). The readers from `utils` and `readr` should only be used when there is no other choice (for example, due to package installation limitations) or when the dataset is small enough that differences in ingestion time don't matter.
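As a rough illustration of the calls being compared, the sketch below reads the same small file with each reader. The file contents and the variable names (`csv_path`, `results`, `shapes`) are made up for the example, and each optional reader is only called if its package happens to be installed, so the snippet should still run on a core R installation.

```r
# Write a small example CSV to a temporary file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(0.5, 1.5, 2.5)),
          csv_path, row.names = FALSE)

# Base R reader: always available, returns a data.frame.
results <- list(base = utils::read.csv(csv_path))

if (requireNamespace("vroom", quietly = TRUE)) {
  # Returns a tibble; suppressMessages hides the column-spec message.
  results$vroom <- suppressMessages(vroom::vroom(csv_path, delim = ","))
}
if (requireNamespace("data.table", quietly = TRUE)) {
  # Returns a data.table by default.
  results$fread <- data.table::fread(csv_path)
}
if (requireNamespace("arrow", quietly = TRUE)) {
  # as_data_frame = FALSE keeps the result as an Arrow Table.
  results$arrow <- arrow::read_csv_arrow(csv_path, as_data_frame = FALSE)
}

# Every reader should report the same shape for the same file.
shapes <- vapply(results, function(x) paste(dim(x), collapse = "x"),
                 character(1))
print(shapes)
```

On a file this small the readers are indistinguishable in speed; the point is only that the calls are near drop-in replacements for each other, differing mainly in the class of the returned object.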

| Function | Source | Description | Pros | Cons | Takeaways |
| --- | --- | --- | --- | --- | --- |
| `read.csv`, `read.csv2` | `utils` (built-in package) | The built-in reading functions shipped with the core R distribution | Built into the core R distribution; no dependencies on other packages | Very slow compared to other options | Only suitable for small tables; use only when unable to install packages beyond core R |
| `read_csv` | `readr` (part of Tidyverse) | Tidyverse's basic CSV reading function; returns a `tibble` | If using `dplyr`, no conversion to `tibble` needed | Slow compared to other options | Can use when performing `dplyr` analysis on small tables |
| `vroom` | `vroom` (part of Tidyverse) | Alternate Tidyverse CSV reader; performs lazy loading (doesn't parse any data from files until the data is used), though this can be disabled with the `altrep = FALSE` option | Very fast, especially when lazy loading is enabled; can read multiple files | With lazy loading, data parsing errors only surface when the data is actually used, which may result in unpredictable bugs | Convenient if there are many source files to read into a single table; useful when the full source data doesn't fit into memory but the parts that need to be processed (e.g. a sample) do |
| `fread` | `data.table` | The fast file reader that `data.table` uses; returns either a `data.table` or a `data.frame` | Very fast and still reads all data into memory | Cannot scale to data that does not fit into memory | Use if `data.table` is the data analysis format; good general option for single tables that are larger than a small table but still fit in memory |
| `read_csv_arrow` | `arrow` (from Apache Arrow) | Part of the Apache `arrow` implementation in R; returns either an `arrow` `Table` or a `data.frame`. Returning a `Table` causes lazy loading (data is not parsed until used). | Extremely fast initial load time when lazy loading is enabled; comparable to `fread` when lazy loading is disabled | Conversion to `tibble` or `data.table` needed if those formats are required | Use if you want Apache Arrow's `Table` as the analysis format; otherwise still a good option comparable to `vroom` and `fread` |
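Two of the options mentioned in the table can be sketched as follows: reading several files into one table with `vroom` (with lazy loading turned off via `altrep = FALSE`), and asking `fread` for a plain `data.frame`. The file contents and the names `paths`, `combined`, and `first_file` are hypothetical, and a base R fallback is included on the assumption that the packages may not be installed.

```r
# Two small CSV files with identical columns (hypothetical data).
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(id = 1:2, value = c(0.5, 1.5)),
          paths[1], row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(2.5, 3.5)),
          paths[2], row.names = FALSE)

# Fallback with base R: read each file, then stack the rows.
combined <- do.call(rbind, lapply(paths, utils::read.csv))

if (requireNamespace("vroom", quietly = TRUE)) {
  # vroom accepts a vector of paths and returns a single table.
  # altrep = FALSE disables lazy loading, so the data is parsed up front.
  combined <- suppressMessages(
    vroom::vroom(paths, delim = ",", altrep = FALSE)
  )
}

if (requireNamespace("data.table", quietly = TRUE)) {
  # data.table = FALSE makes fread return a plain data.frame.
  first_file <- data.table::fread(paths[1], data.table = FALSE)
  print(class(first_file))
}

print(nrow(combined))
```

Disabling lazy loading trades some of `vroom`'s speed for earlier visibility of parsing problems, which may be worth it in pipelines where a late failure is hard to trace.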

Resources:

[1] Benchmarks from August 2020: https://enpiar.com/talks/nyr-2020/benchmarks.html
- Note that this was created for a presentation [2] about Apache Arrow, so it may be biased in Arrow's favor
[2] https://enpiar.com/talks/nyr-2020/#29
