R | Out-of-memory tabular data formats
When working with big data, the datasets in question often do not fit into machine memory. Solutions still exist for manipulating such a dataset without pulling all of it into memory at once.
Note that most of these solutions must still pull the final output of any computation into memory, so care must be taken that the output fits, typically by removing any data that isn't part of the remaining computations as early as possible in the computing process.
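The general idea can be sketched in base R alone, without any of the packages in the table below. This is a minimal illustration, not how any of those packages work internally: it reads a delimited file in fixed-size chunks, drops unneeded rows and columns inside each chunk, and keeps only a small running aggregate in memory. The file contents, the column names, and `chunk_size` are made up for the example.

```r
# Write a small CSV to a temporary file so the sketch is self-contained.
# In practice this would be a file too large to load at once.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:10, group = rep(c("a", "b"), 5), value = (1:10) * 1.5),
  csv_path,
  row.names = FALSE
)

chunk_size <- 4L  # hypothetical value; tune to the memory available

con <- file(csv_path, open = "r")

# Read the header once so that every chunk gets the same column names.
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
col_names <- gsub('"', "", col_names, fixed = TRUE)

running_sum <- 0
rows_kept <- 0L
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break  # end of file reached

  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)

  # Discard rows and columns that later steps do not need, as early as possible.
  kept <- chunk$value[chunk$group == "a"]

  # Only the running aggregate outlives the chunk.
  running_sum <- running_sum + sum(kept)
  rows_kept <- rows_kept + length(kept)
}
close(con)

mean_a <- running_sum / rows_kept
```

At no point does the full table exist in memory; only one chunk and two scalars do. The final output (`mean_a`) is tiny, which is the situation the note above asks you to aim for.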
NOTE: this table only contains information on out-of-memory tabular formats. For in-memory formats, which are far more commonly used, see [12].
In-Memory Format | Disk Format | Source Package | Grammar | Description | Pros | Cons | Takeaways |
| - delimited (e.g. |
| - | - uses an alternative representation [8] of | - easy to use, can use full set of | - only works with delimited or fixed-width format files | - use if you want an interfacing method that is as close to what you can do with an in-memory |
| - various (depends on DBI backend) |
| - | - an R interface for using various database management systems [4], either remote or on local disk | - have choice of | - when reading data into R, data first needs to be converted to native R format, resulting in overhead | - use if you prefer to leverage a separate DBMS for computation on the data and not R itself, or expect that the datasets you're processing will eventually grow enough to require that |
| - |
| - | - part of the R implementation of Apache | - can use large number of standard disk formats, making for great interop with data systems outside of R | - limited subset of | - good option when interop with other data systems is important (esp. when outside of the Julia-Python-R triumvirate) |
| - |
| - | - a format that divides a large table into chunks of | - full support for | - only outputs to | - use if you want to use |
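As a rough illustration of the DBMS-backed row above, the sketch below lets the database do the filtering and aggregation so that only the small result is converted to a native R data frame. It assumes the DBI and RSQLite packages are installed; the in-memory SQLite database, the table name `cars`, and the use of the built-in `mtcars` data are stand-ins chosen for the example, not something the card prescribes.

```r
library(DBI)

# An in-memory SQLite database stands in for a real remote or on-disk DBMS.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load some example data; with real data this table would already exist
# in the database and would never be loaded into R in full.
dbWriteTable(con, "cars", mtcars)

# The WHERE and GROUP BY run inside the database engine.
# Only the aggregated rows are transferred and converted to an R data frame.
result <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS mean_mpg
   FROM cars
   WHERE hp > 100
   GROUP BY cyl
   ORDER BY cyl"
)

dbDisconnect(con)
```

Because the conversion overhead mentioned in the Cons column applies to whatever crosses into R, pushing the reduction into the query keeps that cost proportional to the result rather than to the table.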
Resources:
[2] https://cran.r-project.org/web/packages/vroom/vignettes/benchmarks.html#how-it-works
[11] https://diskframe.com/articles/intro-disk-frame.html#examples-of-not-fully-supported-dplyr-verbs
Verification/Research: when re-verifying or updating this card, please look into the following topics if time allows:
[A] Do disk.frame and arrow::Dataset have the same delayed parsing drawback that vroom has?
[B] Need additional pros/cons of disk.frame and arrow::Dataset based on hands-on experience.
[C] Is disk.frame still being maintained? (Do not remove; this should be re-verified every time this card comes up for verification)