R | In-memory tabular data formats
R has many ways of representing tabular data. Here is a comparison of some of them.
NOTE: this table only contains information on in-memory tabular formats. For out-of-memory formats (i.e. where the data is on disk or is remote), see [6].
| Format | Source Package | Grammar | Description | Pros | Cons | Takeaways |
|---|---|---|---|---|---|---|
| | base R | base R | the base format that serves as the foundation for other formats | - has no package dependencies | - base data grammar is low-level and becomes hard to comprehend for more complex queries | - only use this when it's important for your code to not have package dependencies |
| | Tidyverse ( | | the most common format returned by functions in Tidyverse packages, in particular | - | - poor speed/memory performance; most noticeable when dealing with large volumes of data | - use this when performing single-shot exploratory data analysis on small tables |
| | | | an alternative to the format and grammar offered by | - very fast and memory efficient, and takes advantage of CPU parallelization | - data grammar is aimed towards programmers and can be difficult to decipher for non-programmers | - use whenever large volumes or many iterations of data analysis are required |
| | Tidyverse ( | | a Tidyverse format that uses the | - speed/memory usage is almost comparable to | - requires conversion into one of the other formats before data can be viewed | - use to take advantage of the speed of |
| | Apache Arrow project | | the R implementation of the cross-language Apache Arrow table format [5] | - optimized for speed/memory for a subset of | - speed/memory usage only optimized for some | - use when interop with other data languages (Python, Matlab, Julia, etc.) is needed (esp. if 2 or more are involved), OR if the operations that need to be high-performance can be expressed using the supported |
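
To make the grammar differences in the comparison above concrete, here is a minimal sketch of the same query (filter rows, then sum by group) written three ways. It assumes the formats being compared include base R's `data.frame`, the Tidyverse `tibble` queried with `dplyr` verbs, and `data.table`. The toy data and its column names (`group`, `value`) are made up for illustration. The package-based variants are skipped when the packages are not installed, and the native pipe `|>` needs R 4.1 or later.

```r
# Toy data; the column names are hypothetical and exist only for this sketch.
df <- data.frame(
  group = c("a", "b", "a", "b", "a"),
  value = c(1, 2, 3, 4, 5)
)

# Base R grammar: subset with `[`, then aggregate by group.
# No package dependencies.
base_result <- aggregate(value ~ group, data = df[df$value > 1, ], FUN = sum)

# dplyr grammar on a tibble (assumption: tibble and dplyr are installed).
if (requireNamespace("tibble", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  tidy_result <- tibble::as_tibble(df) |>
    dplyr::filter(value > 1) |>
    dplyr::group_by(group) |>
    dplyr::summarise(value = sum(value))
}

# data.table grammar: dt[i, j, by] does the filter, aggregation, and grouping
# in a single call (assumption: data.table is installed).
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  dt_result <- dt[value > 1, .(value = sum(value)), by = group]
}
```

All three variants return the same totals (8 for group `a`, 6 for group `b`). Only the row order may differ, because `data.table` keeps groups in order of first appearance while the other two sort them.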
Resources: