
Persistent config and data for R packages

2020/03/12


Does your R package work best with some configuration? You probably want it to be easily found by your package. Does your R package download huge datasets that don’t change much on the provider side? Maybe you want to save the corresponding data somewhere persistent so that things will go faster during the next R session. In this blog post we shall explain how an R package developer can go about using and setting persistent configuration and data on the user’s machine.

Preface: standard locations on the user’s machine

Throughout this post we’ll often refer to standard locations on the user’s machine. As explained by Gábor Csárdi in an R-pkg-devel email, “Applications can actually store user level configuration information, cached data, logs, etc. in the user’s home directory, and there is a standard way to do this [depending on the operating system].” R packages that are on CRAN cannot write to the home directory without getting confirmation from the user, but they can and should use standard locations. To find where those are, package developers can use the rappdirs package.

# Using a reference class object
rhub_app <- rappdirs::app_dir("rhub", "r-hub")
rhub_app$cache()
## [1] "/home/maelle/.cache/rhub"
# or functions
rappdirs::user_cache_dir("rhub")
## [1] "/home/maelle/.cache/rhub"
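Since creating such a directory means writing to the user's machine, a package might ask before doing so. Here is a minimal sketch of that idea in base R; `ensure_dir()` and its `consent` argument are hypothetical names, not part of rappdirs, and the example uses a temporary directory as a stand-in for a real cache location.

```r
# Hypothetical helper: create `path` only with the user's consent.
ensure_dir <- function(path, consent = NULL) {
  if (dir.exists(path)) {
    return(invisible(path))
  }
  if (is.null(consent)) {
    if (!interactive()) {
      stop("Refusing to create '", path, "' without explicit consent.",
           call. = FALSE)
    }
    # askYesNo() returns TRUE, FALSE, or NA (when the user cancels)
    consent <- utils::askYesNo(paste0("Create directory '", path, "'?"))
  }
  if (!isTRUE(consent)) {
    stop("Directory '", path, "' was not created.", call. = FALSE)
  }
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  invisible(path)
}

# Usage, with a throwaway location standing in for a real cache directory
demo_dir <- file.path(tempfile("blabla-"), "cache")
ensure_dir(demo_dir, consent = TRUE)
dir.exists(demo_dir)
```

In a real package the path would probably come from something like rappdirs::user_cache_dir(), and `consent = TRUE` would be passed only after the user agreed, for instance through a documented argument.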

On top of these non-R specific standard locations, we’ll also mention the standard homes of R options and environment variables, .Rprofile and .Renviron.

User preferences

As written in Android developer guidance and probably every customer service guide ever, “Everyone likes it when you remember their name”. Everyone probably likes it too when the barista at their favourite coffee shop remembers their usual order. As an R package developer, what can you do for your R package to correctly assess user preferences and settings?

Using options

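In R, options allow the user to set and examine a variety of global options which affect the way in which R computes and displays its results. For instance, for the usethis package, the usethis.quiet option can control whether usethis is chatty1. Users either call options() in the console or at the top of a script, in which case the setting lasts for the current session, or call it in a startup file, .Rprofile, which makes the setting persistent.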

Users can use a project-level or more global user-level .Rprofile. The use of a project-level .Rprofile overrides the user-level .Rprofile unless the project-level .Rprofile contains the following lines as mentioned in the blogdown book:

# in .Rprofile of the project
if (file.exists('~/.Rprofile')) {
  base::sys.source('~/.Rprofile', envir = environment())
}
# then set project options

For more startup tweaks, the user could adopt the startup package.

As a package developer, you can retrieve options in your code with getOption(), whose second argument is a fallback for when the option hasn’t been set by the user. Note that an option can be any R object.

options(blabla.foo = TRUE)
if (isTRUE(getOption("blabla.foo", FALSE))) {
  message("foo!")
}
## foo!
options(blabla.bar = mean)
getOption("blabla.bar")(c(1:7))
## [1] 4

The use of options in the .Rprofile startup file is great for workflow packages like usethis, blogdown, etc., but shouldn’t be used for, say, arguments influencing the results of a statistical function.
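To illustrate that distinction, here is a small sketch in which an option only controls how chatty a function is, never what it returns. The function `blabla_mean()` and the option `blabla.quiet` are made up for the example.

```r
# Hypothetical package function: the option is only the default of a
# user-interface argument and never changes the returned value.
blabla_mean <- function(x, quiet = getOption("blabla.quiet", FALSE)) {
  result <- mean(x)
  if (!isTRUE(quiet)) {
    message("Computed the mean of ", length(x), " values.")
  }
  result
}

# Set the option temporarily, then restore the previous state
old <- options(blabla.quiet = TRUE)
blabla_mean(1:7)
options(old)
```

Exposing the option as the default of an explicit argument keeps the behaviour visible in the function signature, so a user can still override it for a single call.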

Using environment variables

Environment variables, found via Sys.getenv() rather than getOption(), are often used for storing secrets (like GITHUB_PAT for the gh package) or the path to secrets on disk (like TWITTER_PAT for rtweet), but also for settings that are not secret (e.g. the browser to use for chromote).

Similar to using options() in the console or at the top of a script, the user could use Sys.setenv(). Obviously, secrets should not be written at the top of a script that’s public. To make environment variables persistent, they need to be stored in a startup file, .Renviron. Unlike .Rprofile, .Renviron does not contain R code but rather key-value pairs, whose values are then retrieved via Sys.getenv().
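To make the file format concrete, the sketch below writes a throwaway file in .Renviron format and loads it with readRenviron(). The variable names BLABLA_API_URL and BLABLA_TIMEOUT are hypothetical.

```r
# A file in .Renviron format holds one KEY=value pair per line.
renviron <- tempfile(fileext = ".Renviron")
writeLines(
  c("BLABLA_API_URL=https://example.org/api", "BLABLA_TIMEOUT=30"),
  renviron
)

# R reads .Renviron at startup; readRenviron() loads such a file on demand
readRenviron(renviron)

Sys.getenv("BLABLA_API_URL")
# Values are always character strings, so convert them when needed
as.integer(Sys.getenv("BLABLA_TIMEOUT"))
# `unset` provides a fallback for variables that are not defined
Sys.getenv("BLABLA_NOT_SET", unset = "default")
```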

As a package developer, you probably want to at least document how to set persistent variables or provide a link to such documentation; and you could even provide helper functions like what rtweet does.
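One low-effort way to document this is an error message that tells the user what to do. The following is only a sketch with made-up names (`blabla_token()`, BLABLA_TOKEN); it is not how rtweet does it.

```r
# Hypothetical accessor: return the token or explain how to set it.
blabla_token <- function(envvar = "BLABLA_TOKEN") {
  token <- Sys.getenv(envvar, unset = "")
  if (!nzchar(token)) {
    stop(
      "No token found. Add a line `", envvar, "=<your token>` to your ",
      ".Renviron file and restart R.",
      call. = FALSE
    )
  }
  token
}

# Usage: set the variable for this session only, then retrieve it
Sys.setenv(BLABLA_TOKEN = "not-a-real-token")
blabla_token()
```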

Using credential stores for secrets

Although API keys, say, are often stored in .Renviron, they could also be stored in a standard and more secure location that depends on the operating system. The keyring package lets you interact with such credential stores. You could either take it on as a dependency like e.g. gh, or recommend that the users of your package use keyring and add a line like

Sys.setenv(SUPERSECRETKEY = keyring::key_get("myservice"))

in their scripts.
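On the package side, supporting both approaches could look like the sketch below: the environment variable is checked first, and the credential store is only queried if the variable is empty and keyring is installed. `get_secret()` is a hypothetical helper; keyring::key_get() is the only keyring function used.

```r
# Hypothetical lookup: environment variable first, credential store second.
get_secret <- function(envvar, service) {
  value <- Sys.getenv(envvar, unset = "")
  if (nzchar(value)) {
    return(value)
  }
  if (!requireNamespace("keyring", quietly = TRUE)) {
    stop(
      "Set the ", envvar, " environment variable, or install the keyring ",
      "package and store a key for the service '", service, "'.",
      call. = FALSE
    )
  }
  # Queries the operating system's credential store
  keyring::key_get(service)
}

# Usage: with the variable set, the credential store is never touched
Sys.setenv(SUPERSECRETKEY = "not-a-real-secret")
get_secret("SUPERSECRETKEY", "myservice")
```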

Using a config file

The batchtools package expects its users to set up a config file somewhere if they don’t want to use the defaults. That somewhere can be one of several locations, as explained in the batchtools::findConfFile() manual page. Two of the possibilities are rappdirs::user_config_dir("batchtools", expand = FALSE) and rappdirs::site_config_dir("batchtools"), which refer to standard locations that differ depending on the operating system.
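The general idea of looking in several places and taking the first match can be sketched in a few lines of base R. This is not the batchtools implementation; `find_conf_file()` and the file name blabla.conf.R are assumptions made for the example.

```r
# Hypothetical lookup: return the first existing config file, or NA.
# `dirs` is ordered from most specific to most general.
find_conf_file <- function(dirs, filename = "blabla.conf.R") {
  candidates <- file.path(dirs, filename)
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    return(NA_character_)
  }
  found[[1]]
}

# Usage with temporary directories standing in for real locations.
# In a package, `dirs` could be something like
# c(getwd(), rappdirs::user_config_dir("blabla"), rappdirs::site_config_dir("blabla"))
project_dir <- tempfile("project-")
user_dir <- tempfile("user-")
dir.create(project_dir)
dir.create(user_dir)
file.create(file.path(user_dir, "blabla.conf.R"))

find_conf_file(c(project_dir, user_dir))
```

Putting the project directory first means a project-level file wins over a user-level one, which mirrors how .Rprofile files behave.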

The golem package offers its users the possibility to use a config file based on the config package.

A good default experience

Obviously, on top of letting users set their own preferences, you probably want your package functions to have sensible defaults. 😁

Asking or guessing?

For basic information such as username, email, GitHub username, the whoami package does pretty well.

whoami::whoami()
##                 username                 fullname            email_address 
##                 "maelle"          "Maëlle Salmon" "maelle.salmon@yahoo.se" 
##              gh_username 
##                 "maelle"
whoami::email_address()
## [1] "maelle.salmon@yahoo.se"

In particular, for the email address, if the R environment variable EMAIL isn’t set, whoami uses a call to git to find Git’s global configuration. Similarly, the gert package can find and return Git’s preferences via gert::git_config_global()2.
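A much simplified sketch of that kind of guessing is shown below. It is not whoami's actual code, and `guess_email()` is a hypothetical name; it only assumes that the Git command line tool, if present, is on the PATH.

```r
# Hypothetical guess: EMAIL environment variable first, Git config second.
guess_email <- function() {
  email <- Sys.getenv("EMAIL", unset = "")
  if (nzchar(email)) {
    return(email)
  }
  if (!nzchar(Sys.which("git"))) {
    return(NA_character_)
  }
  # Returns character(0), with a warning, when user.email is not configured
  out <- suppressWarnings(
    system2("git", c("config", "--global", "user.email"),
            stdout = TRUE, stderr = FALSE)
  )
  if (length(out) == 0 || !nzchar(out[[1]])) NA_character_ else out[[1]]
}

# Usage: the environment variable takes precedence over Git's configuration
Sys.setenv(EMAIL = "jane@example.org")
guess_email()
```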

In these cases where packages guess something, their guessing is based on the use of standard locations for such information on different operating systems. Unsurprisingly, in the next section, we’ll recommend using such standard locations when caching data.

Not so temporary files3

To quote the Android developer guidance again, “Persist as much relevant and fresh data as possible.”

A package that exemplifies doing so is getlandsat, which downloads “Landsat 8 data from AWS public data sets” from the web. The first time the user downloads an image, the result is cached, so next time no query needs to be made. A very nice aspect of getlandsat is that it provides cache management functions:

library("getlandsat")
# list files in cache
lsat_cache_list()
## [1] "/home/maelle/.cache/landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B3.TIF"
## [2] "/home/maelle/.cache/landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B4.TIF"
## [3] "/home/maelle/.cache/landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B7.TIF"
# List info for single files
lsat_cache_details(files = lsat_cache_list()[1])
## <landsat cached files>
##   directory: /home/maelle/.cache/landsat-pds
## 
##   file: /L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B3.TIF
##   size: 64.624 mb
lsat_cache_details(files = lsat_cache_list()[2])
## <landsat cached files>
##   directory: /home/maelle/.cache/landsat-pds
## 
##   file: /L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B4.TIF
##   size: 65.36 mb
# List info for all files
lsat_cache_details()
## <landsat cached files>
##   directory: /home/maelle/.cache/landsat-pds
## 
##   file: /L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B3.TIF
##   size: 64.624 mb
## 
##   file: /L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B4.TIF
##   size: 65.36 mb
## 
##   file: /L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B7.TIF
##   size: 62.974 mb
# delete files by name in cache
# lsat_cache_delete(files = lsat_cache_list()[1])

# delete all files in cache
# lsat_cache_delete_all()

The getlandsat package uses the rappdirs package we mentioned earlier.

lsat_path <- function() rappdirs::user_cache_dir("landsat-pds")
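The download-once pattern and its cache management helpers can be sketched as follows. This is not getlandsat's code: `cache_fetch()`, `cache_list()` and `cache_delete_all()` are hypothetical, and a fake download function replaces the real network call so that the example runs offline.

```r
# Hypothetical cache: `fetch` is a function that writes a file to `path`.
cache_fetch <- function(key, fetch, cache_dir) {
  path <- file.path(cache_dir, key)
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    fetch(path)
  }
  path
}

# List every cached file
cache_list <- function(cache_dir) {
  list.files(cache_dir, recursive = TRUE, full.names = TRUE)
}

# Delete every cached file
cache_delete_all <- function(cache_dir) {
  unlink(cache_list(cache_dir))
}

# Usage with a temporary directory; a real package might use
# rappdirs::user_cache_dir("blabla") and wrap utils::download.file() in `fetch`.
cache_dir <- tempfile("blabla-cache-")
n_fetches <- 0
fake_download <- function(path) {
  n_fetches <<- n_fetches + 1
  writeLines("pretend this is a large image", path)
}

cache_fetch("L8/scene.txt", fake_download, cache_dir)
# The second call finds the file on disk and does not fetch again
cache_fetch("L8/scene.txt", fake_download, cache_dir)
n_fetches
cache_list(cache_dir)
```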

When using rappdirs, keep caveats in mind.

If you hesitate between e.g. rappdirs::user_cache_dir() and rappdirs::user_data_dir(), a GitHub code search can show you what other packages chose.

rappdirs or not

To use an app directory from within your package you can use rappdirs as mentioned earlier, but also other tools.

  • Package developers might also like the hoardr package that basically creates an R6 object building on rappdirs with a few more methods (directory creation, deletion).
  • Some package authors “roll their own” like Henrik Bengtsson in R.cache.
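For the "roll your own" route, one possible sketch relies on tools::R_user_dir(), available from R 4.0.0 on, and falls back on a simplified Unix-style location on older versions. `pkg_cache_dir()` is a hypothetical helper and the fallback is an assumption for illustration, not a complete cross-platform solution.

```r
# Hypothetical helper returning a per-package cache directory.
pkg_cache_dir <- function(pkg) {
  if (getRversion() >= "4.0.0") {
    tools::R_user_dir(pkg, which = "cache")
  } else {
    # Simplified fallback that only follows the Unix convention
    base_dir <- Sys.getenv("XDG_CACHE_HOME", unset = file.path("~", ".cache"))
    file.path(base_dir, "R", pkg)
  }
}

# Usage: the function only computes a path, it does not create anything
pkg_cache_dir("blabla")
```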

More or less temporary solutions

This section presents solutions for caching results very temporarily, or less temporarily.

Caching results within an R session

To cache results within an R session, you could use a temporary directory for data. For any function call, you could use memoise, which supports, well, memoization, a technique that is best explained with an example.

time <- memoise::memoise(Sys.time)
time()
## [1] "2020-03-12 11:03:10 CET"
Sys.sleep(1)
time()
## [1] "2020-03-12 11:03:10 CET"

Only the first call to time() actually calls Sys.time(); after that, the result is saved for the entire session unless memoise::forget() is called. This is great for speeding up code and for not abusing internet resources, which is why the polite package wraps memoise.
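To see what happens under the hood, here is a toy version of memoization in base R. It is far simpler than what memoise does and `memoise_simple()` is a made-up name; results are stored in an environment, keyed by the deparsed arguments.

```r
# Toy memoization: cache results in an environment, keyed by the arguments.
memoise_simple <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Usage: count how often the underlying function really runs
calls <- 0
slow_square <- function(x) {
  calls <<- calls + 1
  x^2
}
fast_square <- memoise_simple(slow_square)

fast_square(4)
fast_square(4) # served from the cache
fast_square(5) # new argument, so the function runs again
calls
```

The cache lives in the closure of the memoised function, so it disappears with the R session, which is exactly the "very temporary" kind of caching this section is about.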

Providing a ready-to-use dataset in a non-CRAN package

If your package depends on a huge dataset that is the same for all users and too large for CRAN, you can use a setup like the one presented by Brooke Anderson and Dirk Eddelbuettel: the data is packaged up in a separate package that is not on CRAN and that the user installs, which saves the data on disk in a place where you can find it easily.5
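In the code package, functions that need the data can first check that the data package is installed and otherwise explain how to get it. The sketch below is written in the spirit of that setup rather than copied from it; the package name blabladata and the repository URL are placeholders.

```r
# Hypothetical accessor: locate the installed data package or fail helpfully.
data_pkg_path <- function(pkg = "blabladata") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      "The '", pkg, "' package is needed for this function. Install it with ",
      "install.packages('", pkg, "', repos = 'https://example.org/drat').",
      call. = FALSE
    )
  }
  # Directory where the package, and therefore its data, is installed
  system.file(package = pkg)
}

# Usage: "stats" stands in for a data package because it is always installed
data_pkg_path("stats")
```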

Conclusion

In this blog post we presented ways of saving configuration options and data in a not so temporary way in R packages. We mentioned R startup files (options in .Rprofile and secrets in .Renviron, the startup package); the rappdirs and hoardr packages as well as an exciting similar feature in R devel; the keyring package. Writing in the user home directory can be viewed as invasive (and can trigger CRAN archival), hence there is a need for a good package design (asking for confirmation; providing cache management functions like getlandsat does) and documentation for transparency. Do you use any form of caching on disk with a default location in one of your packages? Do you know where your rhub email token lives?6 😉

Many thanks to Christophe Dervieux for useful feedback on this post!


  1. Note that in tests usethis suppresses the chatty behaviour by the use of withr::local_options(list(usethis.quiet = TRUE)) ↩︎

  2. The gert package uses libgit2, not Git directly. ↩︎

  3. We’re using the very good email subject by Roy Mendelssohn on R-pkg-devel↩︎

  4. There’s actually an R package called backports which provides backports of functions which have been introduced in one of the base packages in R version 3.0.1 or later, maybe it’ll provide backports for tools::R_user_dir()↩︎

  5. If your package has a helper for downloading and saving the dataset locally, and you don’t control the dataset source (contrary to the aforementioned approach), you might want to register several URLs for that content, as explained in the README of the conceptual contenturi package↩︎

  6. In file.path(rappdirs::user_data_dir("rhub", "rhub"), "validated_emails.csv"), /home/maelle/.local/share/rhub/validated_emails.csv in my case. ↩︎