In the programming world, laziness can often be a good thing: it is both a human quality that can motivate automation efforts, and a programming concept that avoids wasting resources such as memory. Now, when reading code or documentation, seeing the word “lazy” can be confusing because of its polysemy: it carries several meanings. In this post, we will enumerate the different possible definitions of “lazy” in R code.
Lazy as in lazy evaluation
You might know that R provides lazy evaluation: the arguments of a function are only evaluated if they are accessed. In short, you can pass anything as an argument value to a function without any problem as long as the function does not use that value.
For instance, the code below works despite evaluation not existing, because the definition of the do_something() function includes an ellipsis and because the lazy argument is not actually used.
do_something <- function(x, na.rm = TRUE, ...) {
  mean(x, na.rm = na.rm)
}
do_something(1:10, lazy = evaluation)
#> [1] 5.5
The contrary of lazy evaluation is eager evaluation.
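To make the contrast concrete, here is a minimal sketch in base R. The function names lazy_fn() and eager_fn() are made up for this illustration; force() is the base R function that evaluates an argument right away.

```r
# Hypothetical function that never touches its argument:
# the argument stays an unevaluated promise.
lazy_fn <- function(x) {
  "x was never touched"
}

# Hypothetical function that forces its argument immediately,
# which mimics eager evaluation.
eager_fn <- function(x) {
  force(x)
  "x was evaluated"
}

# The error passed as argument is never raised, because x is never accessed.
lazy_result <- lazy_fn(stop("this error is never raised"))

# Here the error is raised as soon as force(x) runs, so we catch its message.
eager_result <- tryCatch(
  eager_fn(stop("this error is raised")),
  error = function(e) conditionMessage(e)
)
```

With lazy_fn(), the call succeeds even though its argument would fail if evaluated; with eager_fn(), the failure surfaces right away.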
The Advanced R book by Hadley Wickham features a very clear introduction to lazy evaluation.
Note that the workhorse of lazy evaluation in base R is a thing called a promise that contains an expression (the recipe for getting a value), an environment (the ingredients that are around), and a value. The latter is only computed when accessed, and cached once computed.
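Promises are mostly invisible, but base R's delayedAssign() lets us create one by hand. In the sketch below, the names evaluations and lazy_value are invented for the example; it suggests that the expression only runs at first access and that the result is then cached.

```r
# Counter recording how many times the expression is actually evaluated.
evaluations <- 0

# delayedAssign() binds the name `lazy_value` to a promise:
# the expression is stored, not run.
delayedAssign("lazy_value", {
  evaluations <- evaluations + 1
  sqrt(16)
})

evaluations_before <- evaluations  # nothing has been computed yet

first_access <- lazy_value         # forces the promise
second_access <- lazy_value        # reuses the cached value

evaluations_after <- evaluations   # the expression ran only once
```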
What about {future}’s promises?
Maybe you have heard the word “promises” in R in the context of the future package by Henrik Bengtsson. It provides an implementation in R of futures, a programming concept. Its homepage states “In programming, a future is an abstraction for a value that may be available at some point in the future. The state of a future can either be unresolved or resolved.”
When using the {future} package, you create a future, which is associated with a promise: a placeholder for a value and then the value itself (so not the same definition of “promise” as the “promises” used by base R in the context of lazy evaluation). The value can be computed asynchronously, for instance in parallel. Therefore, the future package allows R programmers to take full advantage of their local computing resources: cores, clusters, etc.
To come back to laziness, by default a future is not lazy, it is eager. This means that it is computed immediately.
By default, the creation of a future below (eager_future) takes as much time as not wrapping the code in a future, because the computation is immediate. Setting lazy to TRUE makes the future creation much faster (lazy_future).
library("future")
bench::mark(
  no_future = is.numeric(runif(n = 10000000)),
  eager_future = future(is.numeric(runif(n = 10000000))),
  lazy_future = future(is.numeric(runif(n = 10000000)), lazy = TRUE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 no_future 235ms 236ms 4.19 76.3MB 2.80
#> 2 eager_future 241ms 242ms 4.06 83.1MB 4.06
#> 3 lazy_future 745µs 977µs 842. 13.8KB 10.0
If we do retrieve the value, the total time spent creating the future and getting its value is about the same in all three cases:
bench::mark(
  no_future = {is.numeric(runif(n = 10000000))},
  eager_future = {x <- future(is.numeric(runif(n = 10000000))); value(x)},
  lazy_future = {x <- future(is.numeric(runif(n = 10000000)), lazy = TRUE); value(x)},
  check = FALSE
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 no_future 236ms 239ms 4.20 76.3MB 4.20
#> 2 eager_future 241ms 241ms 4.11 76.6MB 4.11
#> 3 lazy_future 253ms 254ms 3.94 76.4MB 3.94
Therefore, the use of futures and the use of lazy evaluation are orthogonal concepts: you can use future with or without lazy evaluation. The future package is about how the value is computed (in parallel or sequentially for instance), lazy evaluation is about when the value is computed (right as it is defined, or only when it is needed).
Lazy as in lazy database operations
In the database world, queries can be lazy: the query is like a TODO list that is only executed (computed, evaluated) when you want to access the resulting table or result. Making the output tangible is called materialization.
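The “TODO list” idea can be sketched in a few lines of base R. The helpers lazy_query(), add_step() and materialize() below are hypothetical names invented for this illustration; they are not how dbplyr or any database works internally, only a toy model of recording steps first and running them later.

```r
# Hypothetical toy model of a lazy query: the data plus a list of pending steps.
lazy_query <- function(data) {
  list(data = data, steps = list())
}

# Recording a step only appends a function to the TODO list.
add_step <- function(query, step) {
  query$steps <- c(query$steps, list(step))
  query
}

# Materialization runs every recorded step, in order, on the data.
materialize <- function(query) {
  Reduce(function(data, step) step(data), query$steps, query$data)
}

query <- lazy_query(mtcars)
query <- add_step(query, function(df) df[df$cyl == 4, ])
query <- add_step(query, function(df) df[order(-df$mpg), c("mpg", "cyl")])

# At this stage nothing has been computed: we only hold two pending steps.
pending_steps <- length(query$steps)

# We explicitly request the result, so the steps are executed now.
result <- materialize(query)
```

A real back-end has one big advantage over this toy: since it sees the whole list of steps before running anything, it can reorder or merge them.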
This is vocabulary we can encounter when using:
- the dbplyr package maintained by Hadley Wickham, which is the dplyr back-end for databases. “All dplyr calls are evaluated lazily, generating SQL that is only sent to the database when you request the data.”
Slightly tweaked from the dbplyr README,
# load packages
library(dplyr, warn.conflicts = FALSE)
library("dbplyr")
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
# create the connection and refer to the table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
# create the query
summary <- mtcars2 %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(desc(mpg))
# the object is lazy, the value is not computed yet
# here is what summary looks like at this stage
summary
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.47.1 [:memory:]
#> # Ordered by: desc(mpg)
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
nrow(summary)
#> [1] NA
# we explicitly request the data, so now it's there
answer <- collect(summary)
nrow(answer)
#> [1] 3
- the dtplyr package also maintained by Hadley Wickham, which is a data.table back-end for dplyr. A “lazy” data.table object “captures the intent of dplyr verbs, only actually performing computation when requested” (with collect() for instance). The manual also explains that this allows dtplyr to make the code more performant by simplifying the data.table calls.
Slightly tweaked from the dtplyr README,
# load packages
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
# create a “lazy” data table that tracks the operations performed on it.
mtcars2 <- lazy_dt(mtcars)
# create the query
summary <- mtcars2 %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>%
  summarise(l100k = mean(l100k))
# the object is lazy, the value is not computed yet
summary
#> Source: local data table [3 x 2]
#> Call: `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
#> keyby = .(cyl)]
#>
#> cyl l100k
#> <dbl> <dbl>
#> 1 4 9.05
#> 2 6 12.0
#> 3 8 14.9
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
nrow(summary)
#> [1] NA
# we explicitly request the data, so now it's there
answer <- as_tibble(summary)
nrow(answer)
#> [1] 3
- the duckplyr package which deserves its own subsection as its objects are both lazy and eager.
duckplyr, lazy evaluation and prudence
The duckplyr package uses DuckDB under the hood but is also a drop-in replacement for dplyr. These two facts create a tension:
- When using dplyr, we are not used to explicitly collecting results: the data.frames are eager by default. Adding a collect() step by default would confuse users and make “drop-in replacement” an exaggeration. Therefore, duckplyr needs eagerness!
- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table. Therefore, duckplyr needs laziness!
As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to ALTREP, a powerful R feature that among other things supports deferred evaluation.
“ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed.” Hannes Mühleisen.
If the thing accessing the duckplyr data.frame is…
- not duckplyr, then a special callback is executed, allowing materialization of the data frame.
- duckplyr, then the operations continue to be lazy (until a call to collect.duckplyr_df() for instance).
Therefore, duckplyr can be both lazy (within itself) and not lazy (for the outside world). 🤪
Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all memory? Therefore, the duckplyr package has a safeguard called prudence to control automatic materialization (from duckplyr 1.0.0). It has three possible settings:
- lavish, automatic materialization.
mtcars |>
  duckplyr::as_duckdb_tibble() |>
  dplyr::mutate(mpg2 = mpg + 2) |>
  nrow()
#> [1] 32
- stingy, no automatic materialization ever.
mtcars |>
  duckplyr::as_duckdb_tibble(prudence = "stingy") |>
  dplyr::mutate(mpg2 = mpg + 2) |>
  nrow()
#> Error: Materialization would result in 1 rows, which exceeds the limit of 0. Use collect() or as_tibble() to materialize.
- thrifty, automatic materialization up to 1 million cells, so it is fine here.
mtcars |>
  duckplyr::as_duckdb_tibble(prudence = "thrifty") |>
  dplyr::mutate(mpg2 = mpg + 2) |>
  nrow()
#> [1] 32
By default,
- duckplyr frames created with, say, duckplyr::as_duckdb_tibble() are “lavish”,
- but duckplyr frames created with ingestion functions such as duckplyr::read_parquet_duckdb() (presumably large data) are “thrifty”.
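As a way to remember the three settings, here is a toy decision rule in base R. The function allows_materialization() is a hypothetical helper written for this post, not part of duckplyr, and treating exactly 1 million cells as allowed is an assumption; duckplyr's actual checks live inside the package.

```r
# Hypothetical helper: would automatic materialization be allowed
# for a result with n_rows rows and n_cols columns?
allows_materialization <- function(n_rows, n_cols,
                                   prudence = c("lavish", "thrifty", "stingy")) {
  prudence <- match.arg(prudence)
  switch(
    prudence,
    lavish = TRUE,                     # always materialize
    stingy = FALSE,                    # never materialize automatically
    thrifty = n_rows * n_cols <= 1e6   # assumed threshold: 1 million cells
  )
}

# mtcars plus one extra column is tiny, so "thrifty" lets it through.
small_thrifty <- allows_materialization(nrow(mtcars), ncol(mtcars) + 1, "thrifty")

# A million rows of 12 columns is well above the threshold.
large_thrifty <- allows_materialization(1e6, 12, "thrifty")

large_lavish <- allows_materialization(1e6, 12, "lavish")
small_stingy <- allows_materialization(nrow(mtcars), ncol(mtcars) + 1, "stingy")
```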
Lazy as in lazy loading of data in packages (LazyData)
If your R package exports data, and sets the LazyData field in DESCRIPTION to true, then the exported datasets are lazily loaded: they’re available without the use of data(), but they’re not actually taking up memory until they are accessed.
There are more details on LazyData in the R Packages book by Hadley Wickham and Jenny Bryan and in Writing R Extensions.
Note that internal data is always lazily loaded, and that data that is too big¹ cannot be lazily loaded.
Lazy as in frugal file modifications
The pkgdown::build_site() function, which creates a documentation website for an R package, features a lazy argument. “If TRUE, will only rebuild articles and reference pages if the source is newer than the destination.”
It is a much simpler concept of laziness: decide on the spot whether each page needs to be rebuilt.
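The “source newer than destination” rule can be sketched with file modification times in base R. The function needs_rebuild() and the temporary files are invented for this example; this is not pkgdown's code, only an illustration of the idea.

```r
# Hypothetical helper: rebuild if the destination is missing
# or older than its source.
needs_rebuild <- function(source, destination) {
  if (!file.exists(destination)) {
    return(TRUE)
  }
  file.mtime(source) > file.mtime(destination)
}

source_file <- tempfile(fileext = ".Rmd")
destination_file <- tempfile(fileext = ".html")
writeLines("# An article", source_file)

# No destination yet, so a build is needed.
before_first_build <- needs_rebuild(source_file, destination_file)

# Pretend the page was built one minute after the source was last edited.
writeLines("<h1>An article</h1>", destination_file)
now <- Sys.time()
Sys.setFileTime(destination_file, now)
Sys.setFileTime(source_file, now - 60)
after_build <- needs_rebuild(source_file, destination_file)

# Pretend the source was edited again after the build.
Sys.setFileTime(source_file, now + 60)
after_edit <- needs_rebuild(source_file, destination_file)
```

Only the first and last checks ask for a rebuild; the up-to-date page is left alone.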
The potools package, which provides tools for portability and internationalization of R packages, uses “lazy” with a similar meaning.
Lazy as in frugal package testing
The lazytest package by Kirill Müller saves you time by only re-running tests that failed during the last run:
- You run all tests once with lazytest::lazytest_local() instead of devtools::test(). The lazytest package records which tests failed.
- The next call to lazytest::lazytest_local() only runs the tests that had failed.
This way you can iterate on fixing tests until you get a clean run. At which stage it’s probably wise to run all tests again to check you didn’t break anything else in the meantime. 😉
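The workflow can be mimicked with a toy test runner in base R. Everything below (run_tests(), the tests list, the bug_fixed flag) is hypothetical and only illustrates the “re-run what failed” idea; it is not how lazytest is implemented.

```r
# Hypothetical runner: runs the selected tests and returns
# the names of those that failed.
run_tests <- function(tests, only = names(tests)) {
  passed <- vapply(tests[only], function(test) isTRUE(test()), logical(1))
  names(passed)[!passed]
}

# Flag standing in for a bug we have not fixed yet.
bug_fixed <- FALSE

tests <- list(
  addition = function() 1 + 1 == 2,
  feature = function() bug_fixed,
  strings = function() toupper("a") == "A"
)

# First run: all three tests are executed, one fails.
failed_first <- run_tests(tests)

# We fix the bug, then re-run only the test that had failed.
bug_fixed <- TRUE
failed_second <- run_tests(tests, only = failed_first)
```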
Lazy as in lazy quantifiers in regular expressions
In regular expressions you can use quantifiers to indicate how many times a pattern must appear: the pattern can be optional, appear several times, etc. You can also specify whether the tool should match as many repetitions as possible, or the fewest number of repetitions possible.
Matching the fewest number of repetitions possible is “lazy” (or stingy). Matching as many repetitions as possible is “eager” (or greedy).
string <- "aaaaaa"
# greedy! eager!
stringr::str_match(string, "a+")
#> [,1]
#> [1,] "aaaaaa"
# stingy! lazy!
stringr::str_match(string, "a+?")
#> [,1]
#> [1,] "a"
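The difference matters most when a delimiter appears several times. Here is a small sketch using base R only; the html string is made up for the example, and perl = TRUE is used so that the lazy quantifier is interpreted by the PCRE engine.

```r
html <- "<b>bold</b>"

# Greedy: .+ grabs as much as possible, up to the last ">".
greedy_match <- regmatches(html, regexpr("<.+>", html, perl = TRUE))

# Lazy: .+? stops at the first ">" it can.
lazy_match <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))
```

The greedy pattern matches the whole string, whereas the lazy one only matches the opening tag.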
Conclusion
In the context of lazy evaluation and lazy database operations we can think of lazy as a sort of parsimonious procrastination. For lazy database operations, the laziness is what supports optimization of the whole pipeline. In the case of frugal file modifications in pkgdown and potools or frugal testing with lazytest, lazy means an informed decision is made on the spot on whether a computation is needed. In the case of lazy quantifiers in regular expressions, lazy means stingy.
Overall, a user can expect “lazy” to mean “less waste”, but it is crucial that the documentation of the particular piece of software at hand clarifies the meaning and the potential trade-offs.
¹ “those which when serialized exceed 2GB, the limit for the format on 32-bit platforms” at the time of writing, in Writing R Extensions.