Among the usual fields of DESCRIPTION is the free-text URL
field, where package authors can store various links: to the development website, docs, upstream tool, etc. In this post, we shall explain why storing URLs in DESCRIPTION is important, where else you should add URLs, and what kinds of URLs are stored in CRAN packages these days.
Why put URLs in DESCRIPTION?
In the following we’ll assume your package has some sort of online development repository (GitHub? GitLab? R-Forge?) and a documentation website (handily created via pkgdown?). Adding URLs to your package’s online homes is extremely useful for several reasons.
As a side note: yes, you can store several URLs under URL, even if the field name is singular. See for instance rhub’s DESCRIPTION:

```
URL: https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/
```
- It will help your users find your package’s pretty documentation from the CRAN page, instead of just the less pretty PDF manual.
- Likewise, from the CRAN page your contributors can directly find where to submit patches.
- If your package has a package-level man page, and it should (e.g. as drafted by `usethis::use_package_doc()` and then generated by roxygen2), then after typing say `library("rhub")` and then `?rhub`, your users will find the useful links.
- Other tools such as helpdesk and the pkgsearch RStudio addin can help surface the URLs you store in DESCRIPTION.
- Indirectly, having a link to the docs website and development repo will increase their page rank (see useful comments in this discussion), so potential users and contributors find them more easily by simply searching for your package.
- Edit after Hugo Gruson’s comment: “It’s also worth noting that these URLs are used by pkgdown:
    - the GitHub URL is used to automatically find out the repo containing the source code, and to display a handy GitHub icon which links to the repo on the right of the top navbar (with the default theme);
    - the URL to the pkgdown website is used to crosslink to this site from other pkgdown websites, as explained in this vignette, creating a decentralized mesh for documentation, instead of relying on a centralized entity such as http://rdrr.io/.”
- Edit after Jim Hester’s tweet: “Another reason for URLs in DESCRIPTION: `remotes::install_dev()` uses them to find the dev repo!”
Quick tip: you can add GitHub URLs (URL and BugReports) to DESCRIPTION by running `usethis::use_github_links()`. 🚀
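To illustrate what that helper sets (with `owner/pkg` as a placeholder for a real GitHub repository), the resulting DESCRIPTION fields look like:

```
URL: https://github.com/owner/pkg
BugReports: https://github.com/owner/pkg/issues
```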
Where else put your URLs?
For the same reasons as above, you should make the most of all the places that can store your package’s URL(s). Have you put your package’s docs URL
- in the pkgdown config file, if that’s how you built it?
- in the GitHub repo website field (you need admin rights), or the equivalent on your development platform, e.g. GitLab?
Have you used any of your package’s URLs
- in your public messages about your package, e.g. as an answer to someone’s question?
- in the slides of your talks about the package?
Don’t miss any opportunity to point users and contributors in the right direction!
What URLs do people use in DESCRIPTION files of CRAN packages?
In the following, we shall parse the URL field of the CRAN packages database.
```r
db <- tools::CRAN_package_db()
db <- tibble::as_tibble(db[, c("Package", "URL")])
db <- dplyr::distinct(db)
```
There are 15317 packages on CRAN at the time of writing, 8042 of which have something written in the URL field. We can parse this data.
```r
db <- db[!is.na(db$URL), ]

library("magrittr")

# function from https://github.com/r-hub/pkgsearch/blob/26c4cc24b9296135b6238adc7631bc5250509486/R/addin.R#L490-L496
url_regex <- function() "(https?://[^\\s,;>]+)"

find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]
  if (length(res) == 0) {
    return(list(NULL))
  } else {
    list(unique(res))
  }
}
```
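To see what the helper returns, here is a quick check on an invented URL field mixing two links, a comma separator, and trailing free text (the helper definitions are repeated so the chunk is self-contained):

```r
url_regex <- function() "(https?://[^\\s,;>]+)"

find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]
  if (length(res) == 0) {
    return(list(NULL))
  } else {
    list(unique(res))
  }
}

# an invented URL field: two links separated by a comma, plus free text
field <- "https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/ (website)"
find_urls(field)[[1]]
## [1] "https://github.com/r-hub/rhub" "https://r-hub.github.io/rhub/"
```

The regex stops at whitespace, commas, and semicolons, which is why the comma and the trailing “(website)” are not swallowed into the matches.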
```r
db %>%
  dplyr::group_by(Package) %>%
  dplyr::mutate(actual_url = find_urls(URL)) %>%
  dplyr::ungroup() %>%
  tidyr::unnest(actual_url) %>%
  dplyr::group_by(Package, actual_url) %>%
  dplyr::mutate(url_parts = list(urltools::url_parse(actual_url))) %>%
  dplyr::ungroup() %>%
  tidyr::unnest(url_parts) %>%
  dplyr::mutate(scheme = trimws(scheme)) -> parsed_db
```
There are 7208 packages with at least one valid URL.
What are the packages with most links?
```r
mostlinks <- dplyr::count(parsed_db, Package, sort = TRUE)
mostlinks
## # A tibble: 7,208 x 2
##    Package           n
##    <chr>         <int>
##  1 RcppAlgos         7
##  2 BIFIEsurvey       5
##  3 BigQuic           5
##  4 dendextend        5
##  5 PGRdup            5
##  6 vwline            5
##  7 ammistability     4
##  8 augmentedRCBD     4
##  9 dcGOR             4
## 10 dialr             4
## # … with 7,198 more rows
```
The package with the most links in URL is RcppAlgos.
What is the most popular scheme, http or https?
```r
dplyr::count(parsed_db, scheme, sort = TRUE)
## # A tibble: 2 x 2
##   scheme     n
##   <chr>  <int>
## 1 https   5936
## 2 http    2492
```
A bit less than one third of the links use plain http.
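That fraction can be checked directly from the counts printed above:

```r
# scheme counts copied from the table above
schemes <- c(https = 5936, http = 2492)

# share of plain http links, rounded to two digits
unname(round(schemes["http"] / sum(schemes), 2))
## [1] 0.3
```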
Can we identify popular domains?
```r
dplyr::count(parsed_db, domain, sort = TRUE)
## # A tibble: 1,861 x 2
##    domain                    n
##    <chr>                 <int>
##  1 github.com             4670
##  2 www.r-project.org       164
##  3 cran.r-project.org      144
##  4 r-forge.r-project.org    82
##  5 bitbucket.org            67
##  6 sites.google.com         54
##  7 arxiv.org                52
##  8 gitlab.com               46
##  9 docs.ropensci.org        39
## 10 www.github.com           32
## # … with 1,851 more rows
```
GitHub seems to be the most popular development platform, at least in this sample of CRAN packages that indicate a URL. It is also possible that some developers run their own GitLab server under their own domain.
Many packages link to www.r-project.org, which is not very informative, or to their own CRAN page, which can be informative. Other relatively popular domains are sites.google.com and arxiv.org. There are probably links to venues for scientific publications other than arxiv.org. What about doi.org?
```r
dplyr::filter(parsed_db, domain %in% c("doi.org", "dx.doi.org")) %>%
  dplyr::select(Package, actual_url)
## # A tibble: 44 x 2
##    Package                actual_url
##    <chr>                  <chr>
##  1 abcrlda                https://dx.doi.org/10.1109/LSP.2019.2918485
##  2 adwave                 https://doi.org/10.1534/genetics.115.176842
##  3 ammistability          https://doi.org/10.5281/zenodo.1344756
##  4 anMC                   https://doi.org/10.1080/10618600.2017.1360781
##  5 ANOVAreplication       https://dx.doi.org/10.17605/OSF.IO/6H8X3
##  6 AssocAFC               https://doi.org/10.1093/bib/bbx107
##  7 augmentedRCBD          https://doi.org/10.5281/zenodo.1310011
##  8 CorrectOverloadedPeaks http://dx.doi.org/10.1021/acs.analchem.6b02515
##  9 dataMaid               https://doi.org/10.18637/jss.v090.i06
## 10 disclapmix             http://dx.doi.org/10.1016/j.jtbi.2013.03.009
## # … with 34 more rows
```
The “earlier but no longer preferred” dx.doi.org is still in use.
The rOpenSci docs server also makes an appearance.
Note that you could do a similar analysis of the BugReports field. We’ll leave that as an exercise to the reader. 😉
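As a starting point for that exercise, here is a minimal sketch on a toy data frame, using only base R (the package names and URLs below are invented; swap `toy` for `tools::CRAN_package_db()` to run it on the real data, where `urltools::url_parse()` would also do the domain extraction):

```r
# toy stand-in for tools::CRAN_package_db(): invented packages and URLs
toy <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  BugReports = c("https://github.com/owner/pkgA/issues",
                 NA,
                 "https://gitlab.com/owner/pkgC/-/issues"),
  stringsAsFactors = FALSE
)

# keep only packages that declare a BugReports URL
toy <- toy[!is.na(toy$BugReports), ]

# extract the domain part of each URL with a base R regex
toy$domain <- sub("^https?://([^/]+).*$", "\\1", toy$BugReports)
table(toy$domain)
```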
Conclusion
In this note, we explained why having URLs in the DESCRIPTION of your package can help users and contributors find the right venues for their needs, and we had a look at the URLs currently stored in the DESCRIPTIONs of CRAN packages, in particular discussing currently popular domains. How do you ensure the users of your package can find its best online home(s)? How do you look for the online home(s) of the packages you use?