
R | vroom Memory Usage Behavior

The vroom package can, in theory, handle data that does not fit into available memory: it does not parse any values during the initial read, so its memory footprint is much smaller than that of readers that load everything into memory. However, because of some undocumented subtleties in how it pulls data into memory, it is not always straightforward to predict how much memory it will actually use, or what rules govern that usage.

vroom's Vector Behavior:

  1. vroom uses the ALTREP framework [1] for its vectors, which means that the underlying implementation of the vectors differs from the standard R implementation.

  2. Initially, a vroom vector does not hold the parsed values of the data, only the information needed to locate and parse those values from the source file it points to (e.g. an index and a data type).

  3. Once the vector's values are accessed (basically, when the vector is evaluated in an expression), the vector replaces that "how to find" information with the actual values. The vector is now said to be "materialized", and its values are fully in memory.

  4. It's possible to check whether a particular vroom vector is materialized or not, by using the .Internal(inspect()) function [2]. See below for examples.
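
The check in point 4 can be wrapped in a small helper so that it returns a value instead of only printing. This is a minimal sketch, not part of vroom: `materialization_state` is a hypothetical name, and it assumes the `materialized=T` / `materialized=F` flag appears on the first line of the inspect output, as in the printout on this card.

```r
# Hypothetical helper (not part of vroom's API): reads the `materialized`
# flag from the first line printed by .Internal(inspect(x)).
# Returns TRUE or FALSE when the flag is present, NA when it is not
# (e.g. for a regular R vector).
materialization_state <- function(x) {
  header <- utils::capture.output(.Internal(inspect(x)))[1]
  if (!grepl("materialized=", header, fixed = TRUE)) {
    return(NA)
  }
  grepl("materialized=T", header, fixed = TRUE)
}

# A plain numeric vector prints no `materialized` flag at all.
plain_state <- materialization_state(c(1.5, 2.5))
```

Pass it a single column, such as `y$X1` from the examples on this card, rather than the whole table: the first line of a table's inspect output describes the list of columns, not any one vector.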

Observations:

  1. Materialization is a vector-level, binary condition: a vroom vector has either all of its values in memory or none of them, and each of a table's column vectors materializes independently of the others.

  2. Materialized vectors obviously take up more memory than unmaterialized vectors, but are also faster since the values do not have to be retrieved from disk.

  3. Once a vector is materialized, it cannot be "un-materialized", at least not through vroom's public API.

  4. Retrieving a subset of the vector's values does NOT materialize the whole vector, even if the subset covers every element (see below for examples).

  5. Replacing any individual element of a vroom vector through assignment will immediately materialize the vector, then convert that vector into a regular R vector.

  6. dplyr verbs have varying and apparently undocumented effects on materialization. The full extent of this behavior is unknown, but the examples on this card show that group_by materializes the grouping column while filter does NOT materialize the filtering column.

Examples:

# generate table for examples
set.seed(42)
types <- "cidTD"
x <- vroom::gen_tbl(10, 5, col_types = types)
readr::write_tsv(x, "altrep.tsv")
y <- vroom::vroom("altrep.tsv", col_types = types)

.Internal(inspect(y$X1)) # `materialized=F` initially
y$X1 # evaluate the vector
.Internal(inspect(y$X1)) # `materialized=T` now because we evaluated `y$X1`
.Internal(inspect(y$X2)) # other vectors are not materialized until they're evaluated

# use X2 here, since X1 was already materialized above
y$X2[seq_len(nrow(y))] # "subset" the length of the entire vector
.Internal(inspect(y$X2)) # will NOT materialize even though we subsetted every element

# use X3 here for the same reason
x_3 <- y$X3 # assign to another variable
.Internal(inspect(x_3)) # still not materialized: assignment alone does not evaluate the values

y$X1[1] <- "foo" # will convert X1 into a regular, non-altrep vector
.Internal(inspect(y, 1))

z <- dplyr::filter(y, X1 == "icuMPaw") # perform some arbitrary filter
.Internal(inspect(z, 1)) # will NOT materialize any vectors

w <- dplyr::group_by(y, X1)
.Internal(inspect(w)) # WILL materialize; see [1]

Example Materialization Printout:

> .Internal(inspect(z, 1))
@7fafbe1a5ee8 19 VECSXP g0c4 [OBJ,NAM(7),ATT] (len=5, tl=0)
  @7fafb87dd200 16 STRSXP g0c0 [NAM(7)] vroom_chr (len=1, materialized=F)
  @7fafb87dd190 13 INTSXP g0c0 [NAM(7)] vroom_int (len=1, materialized=F)
  @7fafb87dd120 14 REALSXP g0c0 [NAM(7)] vroom_dbl (len=1, materialized=F)
  @7fafb87dc898 14 REALSXP g0c0 [OBJ,NAM(7),ATT] vroom_dttm (len=1, materialized=F)

Takeaways/Tips:

  • Subset your source data as early as possible in your computations, so that you don't have to deal with memory optimization at all.

  • Vector materialization in vroom is one-way: once the values are materialized in memory, the vector cannot return to its previous state.

  • If you have to use a vroom vector's values, but don't want to materialize it, you can "subset" the vector with all of its indices to get the values without triggering the materialization. This is a hack, though, and isn't guaranteed to behave the same way in future vroom versions, so it is likely not a good idea in code that needs to work in the long run.
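
One possible way to follow the first tip is to narrow the data at read time and then filter right away. The sketch below is an illustration under assumptions, not something taken from the examples above: the file, its columns, and the threshold are hypothetical, and it relies on vroom's `col_select` argument to keep only the columns that are needed.

```r
library(vroom)

# Hypothetical three-column file, written only for this sketch.
path <- tempfile(fileext = ".tsv")
write.table(
  data.frame(X1 = letters[1:5], X2 = 1:5, X3 = c(0.1, 0.2, 0.3, 0.4, 0.5)),
  path, sep = "\t", quote = FALSE, row.names = FALSE
)

# Keep only the columns the analysis needs; X3 is left out of the result.
small <- vroom(path, delim = "\t", col_types = "cid", col_select = c(X1, X2))

# Filter rows immediately so later steps work on a small table.
small <- small[small$X2 > 2, ]
```

With the table cut down this early, whether the remaining vectors are materialized matters much less, because they are small either way.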

Verification Notes:

  • When verifying this card, it's recommended to rerun the example code shown to make sure that the observations in the card still hold.

Resources:
