One principle of programming that’s often encountered is “DRY”, “Don’t Repeat Yourself”, which encourages e.g. the use of functions over duplicated (read: copy-pasted and slightly amended) code. You could also interpret it as “don’t let the machine needlessly repeat its calculations”. How about, for a function with the same inputs (or with no argument!), we only run it once, e.g. per R session, and save the results for later? In this post, we shall go over ways to cache results of R functions, so that you don’t need to burden machines and humans.
Caching: what is it and why use it?
Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.
Now, why use caching?
- It might help save time.
- It might help save other resources of users such as money: e.g. if the function calls a web API whose pricing depends on the number of hits. 😅
- It might be more polite. That’s similar to the second item but from the perspective of e.g. a web API you keep hitting when you could have saved the result. The polite package for polite webscraping caches results.
Caching can be about results of functions but also about user inputs that won’t change during the session and that you don’t want to ask for every time you need them (being polite!). It could be per-session caching but also persistent caching. As an example, reticulate asks you only once whether you want to install miniconda: if you say no, it stores your answer locally and does not ask again. (See the internal function miniconda_install_prompt().)
Tools for caching in R
Here’s a roundup of some ways to cache results of functions in R.
The memoise package
The memoise package by Jim Hester is easy to use. Say we want to cache a function that only sleeps.
.sleep <- function() {
  Sys.sleep(3)
  "Rested now!"
}
sleep <- memoise::memoise(.sleep)
system.time(sleep())
#>    user  system elapsed
#>   0.001   0.000   3.002
system.time(sleep())
#>    user  system elapsed
#>   0.037   0.000   0.037
The second call to sleep() is much quicker because, well, it does not call the .sleep() function so there’s no sleep.
The memoise package also lets you choose the duration of validity of the cache.
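As a minimal sketch of limiting how long cached results stay valid, assuming memoise 2.0 or later and the cachem package are installed: the cache below forgets entries older than 60 seconds, and memoise::forget() clears it by hand. The function .slow_square() and the counter n_runs are made up for illustration.

```r
# Counter used only to show how often the underlying function really runs
n_runs <- 0

.slow_square <- function(x) {
  n_runs <<- n_runs + 1
  x^2
}

# In-memory cache whose entries expire after 60 seconds
slow_square <- memoise::memoise(
  .slow_square,
  cache = cachem::cache_mem(max_age = 60)
)

a <- slow_square(4) # runs .slow_square()
b <- slow_square(4) # served from the cache

# Manually invalidate everything cached for this function
memoise::forget(slow_square)
d <- slow_square(4) # runs .slow_square() again
```

After these calls n_runs is 2: once for the first call, and once more after the cache was cleared.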
If you use the memoise package in a package, do not forget to add @importFrom memoise memoise in one of your R scripts (thanks Mark Padgham for this tip!), otherwise you will get an R CMD check NOTE. This is because the check looks for package usage in function bodies, whereas the call to memoise is at the top level.
Result: NOTE
Namespace in Imports field not imported from: ‘memoise’
All declared Imports should be used.
Function factory
Now what if you want simple memoization and no dependency on the memoise package? In that case you might be interested in creating some sort of function factory. See the example below, where who_am_i_impl() is the function factory. We use it to create the who_am_i() function, whose results are then stored for the session.
who_am_i_impl <- function() {
  name <- NULL
  function() {
    if (is.null(name)) {
      name <<- readline("I don't know you. Can you tell me your name ? ")
      message("Welcome ", name, "!")
    } else {
      message("I know you! You are ", name, ". Hello again!")
    }
    invisible(name)
  }
}
who_am_i <- who_am_i_impl()
who_am_i()
#> I don't know you. Can you tell me your name ? cderv
#> Welcome cderv!
who_am_i()
#> I know you! You are cderv. Hello again!
In the whoami package by Gábor Csárdi, there is an internal function lookup_gh_username() that calls the GitHub API. It is memoized without the memoise package.
- In .onLoad() the memoization function is called, overwriting lookup_gh_username() with its memoized version.
.onLoad <- function(libname, pkgname) {
  lookup_gh_username <<- memoize_first(lookup_gh_username)
}
- What’s the memoization function, you ask?
memoize_first <- function(fun) {
  fun
  cache <- list()
  dec <- function(arg, ...) {
    if (!is_string(arg)) return(fun(arg, ...))
    if (is.null(cache[[arg]])) cache[[arg]] <<- fun(arg, ...)
    cache[[arg]]
  }
  dec
}
This function is a function factory (a function creating a function). It takes any function and returns a closure that caches the function’s results in a list, keyed by the value of the first argument (here arg) when this is a string. If the memoized function is called again with the same first argument arg, then the result is retrieved from the list instead of the function being executed.
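To see that behaviour in action, here is a small self-contained sketch. The helper is_string() is internal to whoami, so the version below is only an assumed stand-in, and slow_upper() with its call counter is made up for the demonstration.

```r
# Assumed stand-in for the internal is_string() helper
is_string <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x)
}

memoize_first <- function(fun) {
  fun
  cache <- list()
  dec <- function(arg, ...) {
    if (!is_string(arg)) return(fun(arg, ...))
    if (is.null(cache[[arg]])) cache[[arg]] <<- fun(arg, ...)
    cache[[arg]]
  }
  dec
}

# Hypothetical function that counts how often it is really executed
n_calls <- 0
slow_upper <- function(x) {
  n_calls <<- n_calls + 1
  toupper(x)
}

fast_upper <- memoize_first(slow_upper)
first <- fast_upper("cderv")  # executes slow_upper()
second <- fast_upper("cderv") # retrieved from the list
```

Both calls return the same value while n_calls stays at 1. One thing to keep in mind with this design: a function that legitimately returns NULL is never considered cached, because the check relies on is.null().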
In the Advanced R book by Hadley Wickham such function factories are called stateful functions that “allow you to maintain state across function invocations”.
Saving results in an environment
This is well suited for package development where:
- You create an environment internal to your package where you store a value;
- You can then store in it any computed value for the current R session where the package is loaded.
Here’s an example with only base R functions:
cache_env <- new.env(parent = emptyenv())
who_am_i <- function() {
  if (is.null(cache_env$name)) {
    cache_env$name <- readline("I don't know you. Can you tell me your name ? ")
  }
  message("Welcome ", cache_env$name, "!")
  invisible(cache_env$name)
}
get_current_name <- function() {
  cache_env$name
}
who_am_i()
#> I don't know you. Can you tell me your name ? cderv
#> Welcome cderv!
who_am_i()
#> Welcome cderv!
This example would be simpler if using rlang::env_cache() by Lionel Henry (in the development version of rlang at the time of writing).
cache_env <- rlang::new_environment()
who_am_i <- function() {
  name <- rlang::env_cache(
    env = cache_env, nm = "name",
    default = readline("I don't know you. Can you tell me your name ? ")
  )
  message("Welcome ", name, "!")
  invisible(name)
}
get_current_name <- function() {
  cache_env$name
}
who_am_i()
#> I don't know you. Can you tell me your name ? cderv
#> Welcome cderv!
who_am_i()
#> Welcome cderv!
Storing on disk?
For persistent caching across R sessions you will need to store function results on disk. On that topic see also the R-hub blog post on persistent data and config for R packages.
Where to store results on disk? Best practice is to use the user data directory via the rappdirs package or, from R version 4.0, tools::R_user_dir(). You might also see some local caching, e.g. what httr::oauth2.0_token() does, in that case with editing of the .gitignore file as the cached result is a secret!
How to store results on disk? Text files are great for short string values. Writing compressed RDS files is also an option. In any case, cache storage should usually be small for internal use in the package (as opposed to the huge computation caching a package like targets supports).
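Here is a minimal sketch of persistent caching with base R only, assuming R 4.0 or later. The helper cached_on_disk() and the package name "mypkg" are hypothetical; the usage part writes to a temporary directory instead of the real user cache directory so that it leaves no trace.

```r
# Hypothetical helper: read a value from an RDS file if it exists,
# otherwise compute it, save it and return it.
cached_on_disk <- function(key, compute,
                           cache_dir = tools::R_user_dir("mypkg", which = "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path) # saveRDS() compresses by default
  value
}

# Usage, with a temporary directory standing in for the user cache directory
demo_dir <- tempfile("mypkg-cache-")
n_runs <- 0
expensive <- function() {
  n_runs <<- n_runs + 1
  "Rested now!"
}
first <- cached_on_disk("greeting", expensive, cache_dir = demo_dir)
second <- cached_on_disk("greeting", expensive, cache_dir = demo_dir)
```

The second call reads the file instead of running expensive(), and would also do so in a later R session pointing at the same directory.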
Packages for caching
For further tooling around caching beside the memoise package and base R functions, refer to these packages (and their reverse dependencies!):
- storr by Rich FitzJohn. Creates and manages simple key-value stores. These can use a variety of approaches for storing the data. This package implements the base methods and support for file system, in-memory and DBI-based database stores.
- R.cache by Henrik Bengtsson. Fast and Light-Weight Caching (Memoization) of Objects and Results to Speed Up Computations.
- cachem by Winston Chang. Key-value stores with automatic pruning. Caches can limit either their total size or the age of the oldest object (or both), automatically pruning objects to maintain the constraints.
Documenting caching
If your package uses caching,
- document that;
- and also provide ways to clear the cache (see e.g. opencage docs); this is especially crucial for persistent caching, since for session caching it would be fine to simply say the user has to restart the R session.
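What such cache-clearing helpers could look like is sketched below, for a cache kept in an environment and for one kept on disk. The function names and the package name "mypkg" are hypothetical, and the disk example uses a temporary directory for the demonstration.

```r
cache_env <- new.env(parent = emptyenv())

# Hypothetical helper emptying the session cache
clear_session_cache <- function() {
  rm(list = ls(cache_env, all.names = TRUE), envir = cache_env)
  invisible(TRUE)
}

# Hypothetical helper deleting the persistent cache directory
clear_disk_cache <- function(cache_dir = tools::R_user_dir("mypkg", which = "cache")) {
  unlink(cache_dir, recursive = TRUE)
  invisible(TRUE)
}

# Session cache: store a value, then clear it
cache_env$name <- "cderv"
clear_session_cache()
session_is_empty <- length(ls(cache_env, all.names = TRUE)) == 0

# Disk cache: create a file in a temporary directory, then clear it
demo_dir <- tempfile("mypkg-cache-")
dir.create(demo_dir)
saveRDS("cderv", file.path(demo_dir, "name.rds"))
clear_disk_cache(demo_dir)
disk_is_empty <- !dir.exists(demo_dir)
```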
When not to cache
We can’t end this post without a few words of caution.
Here are three cases when it’d be bad to cache:
- The gains in time and other resources are not worth the increased complexity. You decide what’s worth it. Think of future collaborators, some of whom might encounter caching for the first time.
- The results of a function with the same input might change. E.g. the function you call gives you the current time. Or you call a web API whose data is updated very regularly (although in that case rather than not caching you might want to look into the validity time of your cache).
- The function should not be called several times to begin with. I.e., do not use caching as a band-aid for bad code design.
E.g.
name <- "Beyonce"
capitalize_name <- function(name) {
  toupper(name)
}
say_hello <- function(name) {
  name <- tolower(name)
  sprintf("Hello %s", capitalize_name(name))
}
say_goodbye <- function(name) {
  name <- tolower(name)
  sprintf("Goodbye %s", capitalize_name(name))
}
say_hello(name)
say_goodbye(name)
You could cache capitalize_name() but it’d make more sense to call it only once before calling say_hello() and say_goodbye(). This is a very simple, non-optimal example, but the idea of not using caching is brought to you by recent experience of one of the authors. 😉 Amending the structure and logic of your code so that a function does not get called twice with the exact same input might help save time, and might make the code easier to follow, but if it actually makes the code more complicated, maybe don’t do that.
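One possible restructuring of the example above, sketched under the assumption that both greetings always need the same capitalized name. The variable capitalized and the changed signatures of say_hello() and say_goodbye() are introduced here for illustration only.

```r
capitalize_name <- function(name) {
  toupper(name)
}

# The greeting functions now expect an already capitalized name
say_hello <- function(capitalized_name) {
  sprintf("Hello %s", capitalized_name)
}
say_goodbye <- function(capitalized_name) {
  sprintf("Goodbye %s", capitalized_name)
}

name <- "Beyonce"
# capitalize_name() is called exactly once, so there is nothing to cache
capitalized <- capitalize_name(tolower(name))
hello <- say_hello(capitalized)
goodbye <- say_goodbye(capitalized)
```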
Note that as a package author, you do not know how the users will call a function. If you provide a package whose functions call a geocoding API whose data is updated only daily, you might hope your users do some sort of grouping before calling the API. But because you know they won’t necessarily do that, you can add caching!
Conclusion
In this post we summarized tools and tips on how to cache the results of functions of your R package.
We have not covered other types of caching relevant for R users: caching for R Markdown, caching for Shiny, caching in projects via the use of the targets package (or its superseded predecessor drake). Lots to explore based on your use case! 😉
Have you used caching in one of your packages or scripts? What tool did you use?