
R | In-memory tabular data formats

R has many ways of representing tabular data. Here is a comparison of some of them.

NOTE: this table only contains information on in-memory tabular formats. For out-of-memory formats (i.e. where the data is on disk or is remote), see [6].

| Format | Source package | Grammar | Description | Pros | Cons | Takeaways |
| --- | --- | --- | --- | --- | --- | --- |
| data.frame [1] | base R | base R | the base format that serves as the foundation for the other formats | no package dependencies; base R grammar is predictable and programmatic | base R grammar is low-level and becomes hard to read for more complex queries | use only when it is important that your code has no package dependencies |
| tibble [2] | Tidyverse (tibble package) | dplyr | the most common format returned by functions in Tidyverse packages, in particular dplyr | dplyr grammar reads like natural language, so it is easy for non-programmers to learn and for collaborators to read; well suited to exploratory data analysis | poor speed/memory performance, most noticeable with large volumes of data; troublesome to program with due to data masking; can be hard to debug, since the natural-language-style grammar obscures the technical details of data manipulation | use for single-shot exploratory data analysis on small tables; do not use for large volumes or many iterations of data analysis; do not use in code that may serve as a backend for other code that processes large amounts of data |
| data.table [3] | data.table package | data.table | an alternative to the format and grammar offered by dplyr | very fast and memory efficient, and takes advantage of CPU parallelization; built directly on base R with no other dependencies; grammar is programmatic and has a single, consistent form | grammar is aimed at programmers and can be difficult for non-programmers to decipher | use whenever large volumes or many iterations of data analysis are required |
| lazy_dt [4] | Tidyverse (dtplyr package) | dplyr | a Tidyverse format that uses the dplyr grammar with data.table as the backend | speed/memory usage is nearly comparable to data.table | must be converted into one of the other formats before the data can be viewed; slightly slower than data.table (roughly 10% extra time) | use when you want the speed of data.table but prefer the dplyr grammar, and can accept the small speed decrease |
| Table [5] | Apache Arrow project | dplyr (subset) | the R implementation of the cross-language Apache Arrow table format [5] | optimized for speed/memory for a subset of dplyr verbs | speed/memory usage is optimized only for some dplyr verbs [5]; mixing unsupported verbs into a sequence may make the whole operation slower than pure dplyr, because the data must first be converted to a data.frame [7] | use when interop with other data languages (Python, MATLAB, Julia, etc.) is needed (especially if two or more are involved), or when the operations that need to be high-performance can be expressed with the supported dplyr verbs |
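
As a rough illustration of how the grammars compared above differ, the sketch below runs one query (mean mpg per cylinder count for cars with more than 100 hp, on the built-in mtcars dataset) against the first four formats. The dataset, the query, and the result names (res_base, res_dplyr, res_dt, res_lazy) are chosen purely for illustration, and the native pipe `|>` assumes R 4.1 or later. Each package section is skipped if that package is not installed.

```r
# Illustrative query: mean mpg per cylinder count for cars with hp > 100.
# mtcars ships with base R; all variable names here are hypothetical.

# 1. data.frame with base R grammar
sub <- mtcars[mtcars$hp > 100, ]
res_base <- aggregate(mpg ~ cyl, data = sub, FUN = mean)
names(res_base)[2] <- "mean_mpg"

# 2. tibble with dplyr grammar
if (requireNamespace("dplyr", quietly = TRUE)) {
  res_dplyr <- mtcars |>
    dplyr::as_tibble() |>
    dplyr::filter(hp > 100) |>
    dplyr::group_by(cyl) |>
    dplyr::summarise(mean_mpg = mean(mpg))
}

# 3. data.table with data.table grammar: dt[filter, compute, grouping]
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(mtcars)
  res_dt <- dt[hp > 100, .(mean_mpg = mean(mpg)), keyby = cyl]
}

# 4. lazy_dt: dplyr grammar, translated to data.table behind the scenes
if (requireNamespace("dtplyr", quietly = TRUE)) {
  res_lazy <- dtplyr::lazy_dt(mtcars) |>
    dplyr::filter(hp > 100) |>
    dplyr::group_by(cyl) |>
    dplyr::summarise(mean_mpg = mean(mpg)) |>
    dplyr::as_tibble()  # conversion step: the query is only executed here
}

print(res_base)
```

In the lazy_dt version, the dplyr verbs only build up a data.table expression; the work happens at the final `as_tibble()` call, which is the conversion step listed under its cons.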

Resources:
