
R | In-memory tabular data formats

R has many ways of representing tabular data. Here is a comparison of some of them.

NOTE: this table covers only in-memory tabular formats. For out-of-memory formats (i.e., where the data is on disk or remote), see [6].

Format	Source Package	Grammar	Description	Pros	Cons	Takeaways
`data.frame` [1]	base R	base R	the base format that serves as the foundation for the other formats	- has no package dependencies - base data grammar is predictable and programmatic	- base data grammar is low-level and becomes hard to read for more complex queries	- use this only when your code must not have package dependencies
`tibble` [2]	Tidyverse (`tibble` package)	`dplyr`	the most common format returned by functions in Tidyverse packages, in particular `dplyr`	- `dplyr` data grammar resembles natural language, so it is easy for non-programmers to learn and easy for collaborators to read - well suited to exploratory data analysis	- comparatively poor speed and memory performance, most noticeable with large volumes of data - awkward to program with because of data masking - can be hard to debug, since the natural-language-style grammar hides the technical details of the data manipulation	- use this for one-off exploratory data analysis on small tables - do not use for large volumes or many iterations of data analysis - do not use in code that may serve as a backend for other code that processes large amounts of data
`data.table` [3]	`data.table` package	`data.table`	an alternative to the `tibble` format and the `dplyr` grammar	- very fast and memory efficient, and takes advantage of CPU parallelization - built directly on base R with no other dependencies - data grammar is programmatic and has a single, consistent form	- data grammar is aimed at programmers and can be difficult for non-programmers to decipher	- use whenever large volumes or many iterations of data analysis are required
`lazy_dt` [4]	Tidyverse (`dtplyr` package)	`dplyr`	a Tidyverse format that uses the `dplyr` grammar with `data.table` as the backend	- speed and memory usage are close to those of `data.table`	- must be converted into one of the other formats before the data can be viewed - slightly slower than `data.table` (roughly 10% extra time)	- use when you want the speed of `data.table` but prefer the `dplyr` grammar, and can accept the small slowdown
`Table` [5]	Apache Arrow project (`arrow` package)	`dplyr` (subset)	the R implementation of the cross-language Apache Arrow table format [5]	- optimized for speed and memory usage for a subset of `dplyr` verbs	- speed and memory usage are optimized only for some `dplyr` verbs [5]; mixing unsupported verbs into a sequence may make the whole operation slower than pure `dplyr`, because the data must first be converted into a `data.frame` [7]	- use when interop with other data languages (Python, MATLAB, Julia, etc.) is needed (especially if two or more are involved), OR when the operations that need to be high-performance can be expressed with the supported `dplyr` verbs
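
To make the grammar differences in the table concrete, here is a minimal sketch of one small query (keep rows with `value > 1`, then take the mean of `value` per `group`) written for each format. The toy data and its column names are invented for illustration. The sketch assumes R 4.1 or later for the native pipe `|>`, and each non-base section runs only if the relevant packages happen to be installed.

```r
# Toy data; the column names `group` and `value` are illustrative only.
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# data.frame with the base R grammar: subset rows, then aggregate by group.
base_result <- aggregate(value ~ group, data = df[df$value > 1, ], FUN = mean)
print(base_result)  # group means: a = 2, b = 4

# tibble with the dplyr grammar: a pipeline of verbs.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("tibble", quietly = TRUE)) {
  dplyr_result <- tibble::as_tibble(df) |>
    dplyr::filter(value > 1) |>
    dplyr::group_by(group) |>
    dplyr::summarise(value = mean(value))
  print(dplyr_result)
}

# data.table with its own grammar: DT[i, j, by].
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  dt_result <- dt[value > 1, .(value = mean(value)), by = group]
  print(dt_result)
}

# lazy_dt: dplyr verbs translated to data.table; nothing is computed
# until the result is converted into another format.
if (requireNamespace("dtplyr", quietly = TRUE)) {
  lazy_result <- dtplyr::lazy_dt(df) |>
    dplyr::filter(value > 1) |>
    dplyr::group_by(group) |>
    dplyr::summarise(value = mean(value)) |>
    as.data.frame()
  print(lazy_result)
}

# Arrow Table: supported dplyr verbs run in Arrow; collect() brings the
# result back into R. Row order of the result is not guaranteed.
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  arrow_result <- arrow::Table$create(df) |>
    dplyr::filter(value > 1) |>
    dplyr::group_by(group) |>
    dplyr::summarise(value = mean(value)) |>
    dplyr::collect()
  print(arrow_result)
}
```

Note that the `lazy_dt` and `Table` pipelines both end with an explicit conversion step (`as.data.frame()` and `collect()`), which matches the conversion cost listed in the table's Cons column.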

Resources:
