R | CSV reading options
R has many options for reading csv files; this page compares some of them to help you decide which one to use. These functions (or related ones in the same package) also handle other delimited file types, such as tsv files; unless otherwise noted, the information here applies to those file types as well.
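As a small sketch of the delimiter point, using only the built-in utils readers (the temporary file and its contents are made up for illustration):

```r
# Hypothetical sample data written to a temporary tab-separated file.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname", "1\talpha", "2\tbeta"), tsv_path)

# read.delim() is the tab-delimited counterpart of read.csv().
via_delim <- utils::read.delim(tsv_path)

# read.csv() also accepts a custom separator through `sep`.
via_csv <- utils::read.csv(tsv_path, sep = "\t")

# Both calls should produce the same data.frame.
same_result <- identical(via_delim, via_csv)
```

The other packages offer similar switches (for example a delimiter argument), but check each function's help page for the exact name.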
At a high level, vroom::vroom, data.table::fread, and arrow::read_csv_arrow are all broadly comparable; which one is preferable depends mainly on which table format you want the output in without any conversion. The readers from utils and readr should only be used when there is no other choice (due to package installation limitations), or when the dataset is small enough that differences in ingestion time don't matter.
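A minimal sketch of what "output format without conversion" can look like in practice. The sample file is hypothetical, and each optional reader is skipped if its package is not installed:

```r
# Hypothetical sample file, used only for illustration.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(0.5, 1.5, 2.5)),
          csv_path, row.names = FALSE)

# Base R: always available, returns a plain data.frame.
df_base <- utils::read.csv(csv_path)

if (requireNamespace("readr", quietly = TRUE)) {
  df_readr <- readr::read_csv(csv_path)               # returns a tibble
}
if (requireNamespace("vroom", quietly = TRUE)) {
  df_vroom <- vroom::vroom(csv_path, delim = ",")     # returns a tibble
}
if (requireNamespace("data.table", quietly = TRUE)) {
  df_dt <- data.table::fread(csv_path)                # returns a data.table
}
if (requireNamespace("arrow", quietly = TRUE)) {
  # as_data_frame = FALSE keeps the result as an Arrow Table
  # instead of converting it to an R data frame.
  tbl_arrow <- arrow::read_csv_arrow(csv_path, as_data_frame = FALSE)
}
```

Picking the reader whose native output matches the rest of your pipeline avoids a conversion step after loading.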
| Function | Source | Description | Pros | Cons | Takeaways |
| read.csv | utils | The built-in reading functions that ship with the core R distribution | - built into the core R distribution, with no dependencies on other packages | - very slow compared to other options | - only suitable for small tables |
| read_csv | readr | Tidyverse's basic csv reading function; returns a | - if using | - slow compared to other options | - can use when performing |
| vroom | vroom | Alternate Tidyverse csv reader; performs lazy loading (doesn't parse any data from files until the data is used), though this can be disabled with the | - very fast, especially when lazy loading is enabled | - if lazy reading, data parsing errors are only surfaced when the data is actually used, which may result in unpredictable bugs | - convenient if there are many source files to read into a single table |
| fread | data.table | The fast file reader that | - very fast and still reads all data into memory | - cannot scale to data that does not fit into memory | - use if |
| read_csv_arrow | arrow | Part of the Apache | - extremely fast initial load time when lazy reading is enabled | - conversion to | - use if you want to use Apache Arrow's |
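The vroom row can be illustrated with a short sketch. The two input files are hypothetical, and `altrep = FALSE` is assumed here to be the option meant for turning off lazy loading; the example is skipped if vroom is not installed:

```r
if (requireNamespace("vroom", quietly = TRUE)) {
  # Hypothetical input: two small CSV files with the same columns.
  paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
  write.csv(data.frame(id = 1:2, value = c("a", "b")),
            paths[1], row.names = FALSE)
  write.csv(data.frame(id = 3:4, value = c("c", "d")),
            paths[2], row.names = FALSE)

  # A vector of paths is read into one combined table.
  # altrep = FALSE disables lazy (ALTREP) columns, so all values are
  # parsed at read time and parsing problems show up here, not later.
  combined <- vroom::vroom(paths, delim = ",", altrep = FALSE)
  n_rows <- nrow(combined)
}
```

Leaving `altrep` at its default keeps the faster lazy behaviour, at the cost of parsing problems appearing only when the affected values are first used.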
Resources:
[1] Benchmarks from August 2020: https://enpiar.com/talks/nyr-2020/benchmarks.html
Note that this was created for a presentation [2] about Apache Arrow, so some bias toward Arrow may be present.