Data | Apache Arrow
Apache Arrow [1] is a columnar in-memory format for tabular data that provides a fast, memory-efficient data processing paradigm unified across multiple languages.
The main selling points are:
The arrow library provides speedy (de)serialization of tabular data, similar to something like FST [2]. Besides the advantages for storing and retrieving data, this makes it possible to work with datasets that are too large to fit into local memory [6]. It is optimized for speed and memory usage for some common operations [4] [5].
It is specified consistently across multiple languages, which makes it easy and fast to interoperate between any of the languages it is implemented in.
Resources:
[3] R implementation: https://arrow.apache.org/docs/r/index.html
[4] "The arrow package works with most single-table dplyr verbs except those that compute aggregates, such as summarise() and mutate() after group_by()." Source: [3]
[5] "For dplyr queries on Table objects, if the arrow package detects an unimplemented function within a dplyr verb, it automatically calls collect() to return the data as an R data.frame before processing that dplyr verb. For queries on Dataset objects (which can be larger than memory), it raises an error if the function is unimplemented; you need to explicitly tell it to collect()." Source: [3]