Data | Apache Arrow

Apache Arrow [1] is a columnar in-memory format for tabular data that enables fast, memory-efficient data processing and is standardized across multiple languages.

The main selling points are:

The arrow library provides fast (de)serialization of tabular data, similar to FST [2]. Besides speeding up storage and retrieval, this makes it possible to work with datasets that are too large to fit into local memory [6].
It is optimized for speed and memory usage for some common operations [4] [5].
It is specified consistently across multiple languages, which makes it easy and fast to exchange data between any languages that implement it.

Resources:

[1] https://arrow.apache.org/faq/
[2] R | FST package (fast read/write to disk)
[3] R implementation: https://arrow.apache.org/docs/r/index.html
[4] "The arrow package works with most single-table dplyr verbs except those that compute aggregates, such as summarise() and mutate() after group_by()"
- source: [3]
[5] "For dplyr queries on Table objects, if the arrow package detects an unimplemented function within a dplyr verb, it automatically calls collect() to return the data as an R data.frame before processing that dplyr verb. For queries on Dataset objects (which can be larger than memory), it raises an error if the function is unimplemented; you need to explicitly tell it to collect()."
- source: [3]
[6] https://arrow.apache.org/docs/r/articles/dataset.html
