
R | Out-of-memory tabular data formats

When working with big data, the dataset in question often does not fit into machine memory. Even then, the dataset can still be manipulated using tools that avoid pulling it into memory all at once.

Note that most of these solutions must still pull the final output of any computation into memory, so take care that the output fits — typically by removing any data that is not needed by the remaining computations as early as possible in the pipeline.

NOTE: this table only contains information on out-of-memory tabular formats. For in-memory formats, which are far more commonly used, see [12].

Each option below is described by the following fields: in-memory format, disk format(s), source package, grammar, description, pros, cons, and takeaways.

tibble (vroom ALTREP)

- Disk format: delimited (e.g. csv); fixed-width file
- Source package: vroom [5]
- Grammar: dplyr
- Description:
  - uses an alternative representation (ALTREP [8]) of tibble in which table values are not parsed into memory ("materialized") until they are needed
  - unlike the other options, computation is NOT lazy and is evaluated eagerly
- Pros:
  - easy to use; the full set of dplyr verbs works as if it were a normal tibble
  - can process fixed-width files as the input/output format, which is rare
- Cons:
  - only works with delimited or fixed-width format files
  - vector materialization behavior is subtle and not officially documented [9], making memory management tricky
  - column data cannot be "de-materialized" out of memory once materialized
  - slower on some read-intensive operations (particularly aggregations) because data is not parsed until used
  - parsing errors are not surfaced until the affected values are accessed, which can be long after the initial read
- Takeaways:
  - use if you want an interface as close as possible to working with an in-memory tibble
  - only use when you can quickly reduce the original dataset to fit into memory, since out-of-memory vectors tend to be pulled into memory permanently
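As a sketch of the vroom workflow (the file name and column names below are hypothetical; reduce the data early so only the touched columns are ever materialized):

```r
library(vroom)
library(dplyr)

# vroom() indexes the file but, via ALTREP, defers parsing column
# values into memory until they are first accessed.
big <- vroom("big_file.csv")  # hypothetical file

# Standard dplyr verbs work as on a normal tibble; columns touched
# here ('id', 'value') are materialized and stay in memory.
small <- big %>%
  filter(value > 0) %>%
  group_by(id) %>%
  summarise(total = sum(value))
```

Because materialized vectors cannot be released back to disk, prefer selecting and filtering down to the needed subset in the first few steps.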

DBIConnection

- Disk format: various (depends on DBI backend)
- Source package: DBI backends [1]
- Grammar: dbplyr (dplyr subset); query grammar of the backend database
- Description:
  - an R interface to various database management systems [4], either remote or on local disk
- Pros:
  - choice of dbplyr or direct queries as the grammar
  - queries can run in the database without pulling the end result into working memory, so depending on the backend DBMS this can scale to even larger datasets than the other solutions
- Cons:
  - reading data into R requires first converting it to native R formats, which adds overhead
  - dbplyr has no reader of its own; source data must be loaded via a query in the backend DBMS or converted from an in-memory data.frame
  - dbplyr implements only a subset of dplyr, so some refactoring is needed when moving from other Tidyverse formats (e.g. tibble)
  - working outside the dbplyr grammar requires knowing the backend query language (a SQL variant)
- Takeaways:
  - use if you prefer to delegate computation to a separate DBMS rather than R itself, or expect the datasets you process to eventually grow enough to require one
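A minimal sketch of the dbplyr workflow, assuming a SQLite backend and a hypothetical `measurements` table (any DBI backend follows the same pattern):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to an on-disk SQLite database (hypothetical file).
con <- dbConnect(RSQLite::SQLite(), "data.sqlite")

# dbplyr builds SQL lazily; nothing is pulled into R yet.
query <- tbl(con, "measurements") %>%
  filter(value > 0) %>%              # hypothetical columns
  group_by(site) %>%
  summarise(total = sum(value, na.rm = TRUE))

show_query(query)          # inspect the generated SQL
result <- collect(query)   # only now is the result pulled into memory

dbDisconnect(con)
```

The computation runs inside the DBMS; `collect()` is the only step that must fit in R's memory.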

Dataset [10]

- Disk format: parquet; feather; delimited (e.g. csv)
- Source package: arrow (Apache Arrow) [7]
- Grammar: dplyr subset
- Description:
  - part of the R implementation of Apache Arrow
  - uses a multi-file sourcing system similar to vroom and disk.frame
- Pros:
  - supports a large number of standard disk formats, making for great interop with data systems outside of R
  - advanced partitioning and schema support when reading in data
- Cons:
  - supports only a limited subset of the dplyr grammar, with no alternative way to access the full grammar without converting to an in-memory format
- Takeaways:
  - a good option when interop with other data systems is important (especially outside the Julia-Python-R triumvirate)
  - only use when the supported dplyr subset is sufficient to perform all computation on the dataset, OR when the original dataset can be reduced enough to fit into memory and be converted to an in-memory format (e.g. tibble)
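A sketch of the arrow Dataset workflow (the directory and column names are hypothetical; `open_dataset()` scans a possibly partitioned directory of files without reading them):

```r
library(arrow)
library(dplyr)

# Open a multi-file parquet dataset without reading it into memory.
ds <- open_dataset("parquet_dir/")  # hypothetical directory

# Supported dplyr verbs are translated to Arrow compute operations;
# collect() triggers the actual scan and pulls the result into R.
res <- ds %>%
  filter(year == 2023) %>%    # hypothetical partition column
  group_by(region) %>%
  summarise(total = sum(value)) %>%
  collect()
```

If a needed verb is not supported by the Arrow grammar, the usual workaround is to reduce the data first, `collect()`, and finish in-memory.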

disk.frame

- Disk format: csv (input); zip of csvs (input); fst [3] (output)
- Source package: disk.frame [6]
- Grammar: dplyr subset (see [10]); data.table
- Description:
  - a format that divides a large table into chunks of fst files on disk, which are accessed as needed for computation
- Pros:
  - full support for the data.table grammar
- Cons:
  - only outputs to fst
  - does not fully support the dplyr grammar [11]
  - still a new package
  - developed and maintained by a single person, which increases the risk of it falling out of maintenance
- Takeaways:
  - use if you want to use the data.table grammar
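A sketch of the disk.frame workflow (file paths and column names are hypothetical; note that operations run chunk-wise across the fst chunks):

```r
library(disk.frame)
library(dplyr)

setup_disk.frame()  # start parallel workers for chunk processing

# Convert a csv into a chunked, fst-backed disk.frame on disk.
df <- csv_to_disk.frame("big_file.csv", outdir = "big_file.df")

# Supported dplyr verbs run per chunk; collect() combines the chunks
# into an in-memory data frame.
res <- df %>%
  filter(value > 0) %>%   # hypothetical column
  group_by(id) %>%
  summarise(total = sum(value)) %>%
  collect()
```

Because computation is chunk-wise, be careful with operations whose result depends on seeing all rows at once; consult the package documentation for which verbs are exact versus chunk-approximate.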

Resources:

Verification/Research: when re-verifying or updating this card, please look into the following topics if time allows:

  • [A] Do disk.frame and arrow::Dataset have the same delayed parsing drawback that vroom has?

  • [B] Need additional pros/cons of disk.frame and arrow::Dataset based on hands-on experience.

  • [C] Is disk.frame still being maintained? (Do not remove; this should be re-verified every time this card comes up for verification)
