
R | Out-of-memory tabular data formats

When working with big data, the dataset in question often does not fit into machine memory. Even then, the dataset can still be manipulated using tools that avoid pulling it into memory all at once.

Note that most of these solutions must still pull the final output of any computation into memory, so take care that the output fits — typically by removing any data that is not needed by the remaining computations as early as possible in the pipeline.

NOTE: this table only contains information on out-of-memory tabular formats. For in-memory formats, which are far more commonly used, see [12].

Each option below is described by the following fields: in-memory format, disk format(s), source package, grammar, description, pros, cons, and takeaways.

tibble (vroom ALTREP)

- Disk format: delimited (e.g. csv); fixed-width file
- Source package: vroom [5]
- Grammar: dplyr
- Description:
  - uses an alternative representation (ALTREP [8]) of tibble in which table values are not parsed into memory ("materialized") until they are needed
  - unlike the other options, computation is NOT lazy and is evaluated eagerly
- Pros:
  - easy to use; the full set of dplyr verbs works as if it were a normal tibble
  - can process fixed-width files as the input/output format, which is rare
- Cons:
  - only works with delimited or fixed-width format files
  - vector materialization behavior is subtle and not officially documented [9], making memory management tricky
  - column data cannot be "de-materialized" out of memory once materialized
  - slower on some read-intensive operations (particularly aggregations) because data is not parsed until used
  - parsing errors are not surfaced until the affected values are accessed, which can be long after the initial read
- Takeaways:
  - use if you want an interface as close as possible to working with an in-memory tibble
  - only use when you can quickly reduce the original dataset to fit into memory, since out-of-memory vectors tend to be pulled into memory permanently
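As a sketch of the vroom workflow (the file name and column names below are hypothetical; reduce the data early so only the touched columns are ever materialized):

```r
library(vroom)
library(dplyr)

# vroom() indexes the file but, via ALTREP, defers parsing column
# values into memory until they are first accessed.
big <- vroom("big_file.csv")  # hypothetical file

# Standard dplyr verbs work as on a normal tibble; columns touched
# here ('id', 'value') are materialized and stay in memory.
small <- big %>%
  filter(value > 0) %>%
  group_by(id) %>%
  summarise(total = sum(value))
```

Because materialized vectors cannot be released back to disk, prefer selecting and filtering down to the needed subset in the first few steps.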

DBIConnection

- Disk format: various (depends on DBI backend)
- Source package: DBI backends [1]
- Grammar: dbplyr (dplyr subset); query grammar of the backend database
- Description:
  - an R interface to various database management systems [4], either remote or on local disk
- Pros:
  - choice of dbplyr or direct queries as the grammar
  - queries can run in the database without pulling the end result into working memory, so depending on the backend DBMS this can scale to even larger datasets than the other solutions
- Cons:
  - reading data into R requires first converting it to native R formats, which adds overhead
  - dbplyr has no reader of its own; source data must be loaded via a query in the backend DBMS or converted from an in-memory data.frame
  - dbplyr implements only a subset of dplyr, so some refactoring is needed when moving from other Tidyverse formats (e.g. tibble)
  - working outside the dbplyr grammar requires knowing the backend query language (a SQL variant)
- Takeaways:
  - use if you prefer to delegate computation to a separate DBMS rather than R itself, or expect the datasets you process to eventually grow enough to require one
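A minimal sketch of the dbplyr workflow, assuming a SQLite backend and a hypothetical `measurements` table (any DBI backend follows the same pattern):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to an on-disk SQLite database (hypothetical file).
con <- dbConnect(RSQLite::SQLite(), "data.sqlite")

# dbplyr builds SQL lazily; nothing is pulled into R yet.
query <- tbl(con, "measurements") %>%
  filter(value > 0) %>%              # hypothetical columns
  group_by(site) %>%
  summarise(total = sum(value, na.rm = TRUE))

show_query(query)          # inspect the generated SQL
result <- collect(query)   # only now is the result pulled into memory

dbDisconnect(con)
```

The computation runs inside the DBMS; `collect()` is the only step that must fit in R's memory.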

Dataset [10]

- Disk format: parquet; feather; delimited (e.g. csv)
- Source package: arrow (Apache Arrow) [7]
- Grammar: dplyr subset
- Description:
  - part of the R implementation of Apache Arrow
  - uses a multi-file sourcing system similar to vroom and disk.frame
- Pros:
  - supports a large number of standard disk formats, making for great interop with data systems outside of R
  - advanced partitioning and schema support when reading in data
- Cons:
  - supports only a limited subset of the dplyr grammar, with no alternative way to access the full grammar without converting to an in-memory format
- Takeaways:
  - a good option when interop with other data systems is important (especially outside the Julia-Python-R triumvirate)
  - only use when the supported dplyr subset is sufficient to perform all computation on the dataset, OR when the original dataset can be reduced enough to fit into memory and be converted to an in-memory format (e.g. tibble)
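A sketch of the arrow Dataset workflow (the directory and column names are hypothetical; `open_dataset()` scans a possibly partitioned directory of files without reading them):

```r
library(arrow)
library(dplyr)

# Open a multi-file parquet dataset without reading it into memory.
ds <- open_dataset("parquet_dir/")  # hypothetical directory

# Supported dplyr verbs are translated to Arrow compute operations;
# collect() triggers the actual scan and pulls the result into R.
res <- ds %>%
  filter(year == 2023) %>%    # hypothetical partition column
  group_by(region) %>%
  summarise(total = sum(value)) %>%
  collect()
```

If a needed verb is not supported by the Arrow grammar, the usual workaround is to reduce the data first, `collect()`, and finish in-memory.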

disk.frame

- Disk format: csv (input); zip of csvs (input); fst [3] (output)
- Source package: disk.frame [6]
- Grammar: dplyr subset (see [10]); data.table
- Description:
  - a format that divides a large table into chunks of fst files on disk, which are accessed as needed for computation
- Pros:
  - full support for the data.table grammar
- Cons:
  - only outputs to fst
  - does not fully support the dplyr grammar [11]
  - still a new package
  - developed and maintained by a single person, which increases the risk of it falling out of maintenance
- Takeaways:
  - use if you want to use the data.table grammar
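A sketch of the disk.frame workflow (file paths and column names are hypothetical; note that operations run chunk-wise across the fst chunks):

```r
library(disk.frame)
library(dplyr)

setup_disk.frame()  # start parallel workers for chunk processing

# Convert a csv into a chunked, fst-backed disk.frame on disk.
df <- csv_to_disk.frame("big_file.csv", outdir = "big_file.df")

# Supported dplyr verbs run per chunk; collect() combines the chunks
# into an in-memory data frame.
res <- df %>%
  filter(value > 0) %>%   # hypothetical column
  group_by(id) %>%
  summarise(total = sum(value)) %>%
  collect()
```

Because computation is chunk-wise, be careful with operations whose result depends on seeing all rows at once; consult the package documentation for which verbs are exact versus chunk-approximate.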

Resources:

Verification/Research: when re-verifying or updating this card, please look into the following topics if time allows:

  • [A] Do disk.frame and arrow::Dataset have the same delayed parsing drawback that vroom has?

  • [B] Need additional pros/cons of disk.frame and arrow::Dataset based on hands-on experience.

  • [C] Is disk.frame still being maintained? (Do not remove; this should be re-verified every time this card comes up for verification)
