R | vroom Memory Usage Behavior
The vroom package can, in theory, be used for data that does not fit into working memory: it does not parse any data during the initial read phase, so its memory footprint is much smaller than that of fully in-memory readers. However, because of some undocumented subtleties in how it pulls data into memory, it is not always straightforward to predict how much memory it will actually use, or to work out the rules governing that usage.
vroom's Vector Behavior:
vroom uses the ALTREP framework [1] for its vectors, which means that the underlying implementation of the vectors differs from the standard R implementation.
In vroom's implementation, a vector initially does not hold the parsed values of the data, but only the information needed to locate and parse those values from the source file it points to (e.g. indexing and data type).
Once the vector's values are accessed (basically, when the vector is evaluated in an expression), the initial "how to find" information is replaced with the actual values. The vector is then said to be "materialized", and its values are fully in memory.
It's possible to check whether a particular vroom vector is materialized or not by using the .Internal(inspect()) function [2]. See below for examples.
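As a convenience, the check can be wrapped in a small function. The sketch below is a hypothetical helper (is_materialized is not part of vroom or base R); it assumes that the inspect output for a materialized vroom vector contains the text "materialized=T", as in the printouts on this card, and it cannot tell an unmaterialized vroom vector apart from a regular R vector, since neither prints that flag.

```r
# Hypothetical helper, not part of vroom: returns TRUE when the inspect()
# output for `x` contains the "materialized=T" flag printed by vroom vectors.
is_materialized <- function(x) {
  out <- utils::capture.output(.Internal(inspect(x)))
  any(grepl("materialized=T", out, fixed = TRUE))
}

# A regular R vector prints no such flag, so the helper reports FALSE for it.
plain_result <- is_materialized(c("a", "b"))
plain_result
```

For a vroom column, the call would look like is_materialized(y$X1). Because the helper relies on .Internal(), it is better suited to interactive exploration than to package code.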
Observations:
Materialization is a vector-level, binary condition: a vroom vector either has all of its values in memory or none of them, and a table's column vectors can be materialized independently of each other.
Materialized vectors take up more memory than unmaterialized vectors, but are also faster to work with, since the values do not have to be retrieved from disk.
Once a vector is materialized, it cannot be "un-materialized", at least not through vroom's public API.
Retrieving a subset of the vector's values does NOT materialize the whole vector, even if the subset covers the entire vector (see below for examples).
Replacing any individual element of a vroom vector through assignment will immediately materialize the vector, then convert it into a regular R vector.
dplyr verbs have varying and apparently undocumented effects on whether they materialize vectors; the full extent of this behavior is unknown, but the examples on this card show that group_by materializes the grouping column, while filter does NOT materialize the filtering column.
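The switch from an ALTREP vector to a regular one on assignment can also be seen without vroom. The sketch below uses base R's own compact integer sequences (an ALTREP class available since R 3.5) as an analogy; it does not exercise vroom itself, and the "compact" marker it searches for is specific to base R's printout, not to vroom's.

```r
# Analogy only: base R stores 1:10 as a compact ALTREP sequence.
x <- 1:10
before <- utils::capture.output(.Internal(inspect(x)))

# Replacing one element forces R to expand the sequence into a regular vector.
x[1] <- 5L
after <- utils::capture.output(.Internal(inspect(x)))

# The "compact" marker is present before the assignment and gone after it.
was_compact <- any(grepl("compact", before, fixed = TRUE))
still_compact <- any(grepl("compact", after, fixed = TRUE))
c(before = was_compact, after = still_compact)
```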
Examples:
# generate table for examples
set.seed(42)
types <- "cidTD"
x <- vroom::gen_tbl(10, 5, col_types = types)
readr::write_tsv(x, "altrep.tsv")
y <- vroom::vroom("altrep.tsv", col_types = types)
.Internal(inspect(y$X1)) # `materialized=F` initially
y$X1 # evaluate the vector
.Internal(inspect(y$X1)) # `materialized=T` now because we evaluated `y$X1`
.Internal(inspect(y$X2)) # other vectors are not materialized until they're evaluated
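y <- vroom::vroom("altrep.tsv", col_types = types) # re-read the file so that X1 starts out unmaterialized again
y$X1[seq_len(nrow(y))] # "subset" the entire length of the vector
.Internal(inspect(y$X1)) # will NOT materialize even though we subsetted every element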
x_1 <- y$X1 # assign to another variable
.Internal(inspect(x_1)) # still not materialized due to R's copy-on-modify semantics
y$X1[1] <- "foo" # will convert X1 into a regular, non-altrep vector
.Internal(inspect(y, 1)) # X1 should now be listed as a plain STRSXP, without the vroom_chr tag
z <- dplyr::filter(y, X1 == "icuMPaw") # perform some arbitrary filter
.Internal(inspect(z, 1)) # will NOT materialize any vectors
w <- dplyr::group_by(y, X1)
.Internal(inspect(w)) # WILL materialize; see [1]
Example Materialization Printout:
> .Internal(inspect(z, 1))
@7fafbe1a5ee8 19 VECSXP g0c4 [OBJ,NAM(7),ATT] (len=5, tl=0)
@7fafb87dd200 16 STRSXP g0c0 [NAM(7)] vroom_chr (len=1, materialized=F)
@7fafb87dd190 13 INTSXP g0c0 [NAM(7)] vroom_int (len=1, materialized=F)
@7fafb87dd120 14 REALSXP g0c0 [NAM(7)] vroom_dbl (len=1, materialized=F)
@7fafb87dc898 14 REALSXP g0c0 [OBJ,NAM(7),ATT] vroom_dttm (len=1, materialized=F)
Takeaways/Tips:
Try to subset your source data as early as possible in your computations, so that you don't have to deal with memory optimization at all.
Vector materialization in vroom is one-way: once the values are materialized in memory, there is no returning to the previous state.
If you have to use a vroom vector's values but don't want to materialize it, you can "subset" the vector with all of its indices to get the values without triggering materialization. This is something of a hack, though, and isn't guaranteed to behave the same way in future vroom versions, so it is likely not a good idea in any code that needs to work in the long run.
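One possible way to act on the first tip is to limit what is read in the first place, using the col_select and n_max arguments of vroom(). The sketch below is only an illustration: the temporary file, its contents, and the column names (id, name, score) are made up for the example, and whether this helps depends on the analysis needing only part of the file.

```r
# Illustration only: the file contents and column names are invented.
library(vroom)

path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tscore", "1\ta\t0.5", "2\tb\t1.5", "3\tc\t2.5"), path)

# Read only the columns and rows that the analysis needs.
small <- vroom(
  path,
  delim = "\t",
  col_select = c(id, score), # leave out the `name` column
  n_max = 2                  # read at most two data rows
)
dim(small)
```

Anything that is never read cannot be materialized later, so the remaining table is small whichever way its vectors are accessed.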
Verification Notes:
When verifying this card, it's recommended to rerun the example code shown to make sure that the observations in the card still hold.