
About URLs in DESCRIPTION

2019/12/10


Among the usual DESCRIPTION fields is the free-text URL field, where package authors can store various links: to the development website, the docs, an upstream tool, etc. In this post, we explain why storing URLs in DESCRIPTION is important, where else you should add URLs, and what kinds of URLs are stored in CRAN packages these days.

Why put URLs in DESCRIPTION?

In the following, we’ll assume your package has some sort of online development repository (GitHub? GitLab? R-Forge?) and a documentation website (handily created via pkgdown?). Adding the URLs of your package’s online homes to DESCRIPTION is extremely useful for several reasons.

As a side note: yes, you can store several URLs under URL, even if the field name is singular. See for instance rhub’s DESCRIPTION:

URL: https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/
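As a minimal sketch of how such a multi-URL field can be read back, the base R snippet below writes a throwaway DESCRIPTION file, reads the URL field with read.dcf(), and splits it on commas. The package name "demo" and the temporary file are assumptions made for illustration only.

```r
# Write a throwaway DESCRIPTION file; "demo" is a hypothetical package name.
desc_file <- tempfile("DESCRIPTION")
writeLines(
  c(
    "Package: demo",
    "Version: 0.0.1",
    "URL: https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/"
  ),
  desc_file
)

# read.dcf() returns a character matrix with one column per requested field.
url_field <- unname(read.dcf(desc_file, fields = "URL")[1, "URL"])

# Split on commas (and any surrounding whitespace) to get one URL per element.
urls <- strsplit(url_field, "\\s*,\\s*")[[1]]
urls
```

For an installed package, utils::packageDescription() gives access to the same field without touching the file directly.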

So, why put URLs in DESCRIPTION?

  • It will help your users find your package’s pretty documentation from the CRAN page, instead of just the less pretty PDF manual.

  • Likewise, from the CRAN page your contributors can directly find where to submit patches.

  • If your package has a package-level man page, and it should (e.g. as drafted by usethis::use_package_doc() and then generated by roxygen2), then after typing, say, library("rhub") and then ?rhub, your users will find the useful links.

  • Other tools such as helpdesk and the pkgsearch RStudio addin can help surface the URLs you store in DESCRIPTION.

  • Indirectly, having a link to the docs website and development repo will increase their page rank (see the useful comments in this discussion), so potential users and contributors find them more easily by simply searching for your package.

  • edit after Hugo Gruson’s comment “It’s also worth noting that these URLs are used by pkgdown:

    • the GitHub URL is used to automatically find out the repo containing the source code, and display a handy GitHub icon which links to the repo on the right of the top navbar (with the default theme).
    • the URL to the pkgdown website is used to crosslink to this site from other pkgdown websites, as explained in this vignette, creating a decentralized mesh for documentation, instead of relying on a centralized entity such as http://rdrr.io/.”
  • edit after Jim Hester’s tweet “Another reason for URLs in DESCRIPTION, remotes::install_dev() uses them to find the dev repo!”

Quick tip: you can add GitHub URLs (URL and BugReports) to DESCRIPTION by running usethis::use_github_links(). 🚀

Where else should you put your URLs?

For the same reasons as previously, you should make the most of all places that can store your package’s URL(s). Have you put your package’s docs URL

Have you used any of your package’s URLs

Don’t miss any opportunity to point users and contributors in the right direction!

What URLs do people use in DESCRIPTION files of CRAN packages?

In the following, we shall parse the URL field of the CRAN packages database.

db <- tools::CRAN_package_db()

db <- tibble::as_tibble(db[, c("Package", "URL")])
db <- dplyr::distinct(db)

There are 15317 packages on CRAN at the time of writing, of which 8042 have something written in the URL field. We can parse this data.

db <- db[!is.na(db$URL),]

library("magrittr")

# function from https://github.com/r-hub/pkgsearch/blob/26c4cc24b9296135b6238adc7631bc5250509486/R/addin.R#L490-L496

url_regex <- function() "(https?://[^\\s,;>]+)"

find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]

  if(length(res) == 0) {
    return(list(NULL))
  } else {
    list(unique(res))
  }
}

parsed_db <- db %>%
  # find_urls() handles one string at a time, hence the grouping by package
  dplyr::group_by(Package) %>%
  dplyr::mutate(actual_url = find_urls(URL)) %>%
  dplyr::ungroup() %>%
  # one row per extracted URL
  tidyr::unnest(actual_url) %>%
  dplyr::group_by(Package, actual_url) %>%
  dplyr::mutate(url_parts = list(urltools::url_parse(actual_url))) %>%
  dplyr::ungroup() %>%
  # spread the parsed parts (scheme, domain, etc.) into columns
  tidyr::unnest(url_parts) %>%
  dplyr::mutate(scheme = trimws(scheme))

There are 7208 packages with at least one valid URL.
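To make the extraction step more concrete, here is a small sketch that applies the same regex and helper to two example fields. The helper is repeated so the snippet runs on its own, and the second input string is a made-up example of a field without any http(s) link.

```r
# Same regex and helper as above, repeated so the snippet is self-contained.
url_regex <- function() "(https?://[^\\s,;>]+)"

find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]
  if (length(res) == 0) list(NULL) else list(unique(res))
}

# A field with two URLs, as in rhub's DESCRIPTION.
two_urls <- find_urls("https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/")[[1]]

# A hypothetical field with free text but no http(s) link.
no_url <- find_urls("see the package vignette")[[1]]

two_urls
is.null(no_url)
```

Fields without any match yield NULL, and tidyr::unnest() drops such rows by default, which is presumably why fewer packages remain after parsing than had a non-empty URL field.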

What are the packages with most links?

mostlinks <- dplyr::count(parsed_db, Package, sort = TRUE)
mostlinks
## # A tibble: 7,208 x 2
##    Package           n
##    <chr>         <int>
##  1 RcppAlgos         7
##  2 BIFIEsurvey       5
##  3 BigQuic           5
##  4 dendextend        5
##  5 PGRdup            5
##  6 vwline            5
##  7 ammistability     4
##  8 augmentedRCBD     4
##  9 dcGOR             4
## 10 dialr             4
## # … with 7,198 more rows

The package with the most links in URL is RcppAlgos.

What is the most popular scheme, http or https?

dplyr::count(parsed_db, scheme, sort = TRUE)
## # A tibble: 2 x 2
##   scheme     n
##   <chr>  <int>
## 1 https   5936
## 2 http    2492

A bit less than one third of the links use http.
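The proportion can be checked from the counts in the table above; a quick sketch with the two numbers copied by hand:

```r
# Counts copied from the scheme table above.
scheme_counts <- c(https = 5936, http = 2492)

# Share of each scheme among all parsed links, rounded to three decimals.
scheme_share <- round(scheme_counts / sum(scheme_counts), 3)
scheme_share
```

This gives a share of about 0.296 for http, against 0.704 for https.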

Can we identify popular domains?

dplyr::count(parsed_db, domain, sort = TRUE)
## # A tibble: 1,861 x 2
##    domain                    n
##    <chr>                 <int>
##  1 github.com             4670
##  2 www.r-project.org       164
##  3 cran.r-project.org      144
##  4 r-forge.r-project.org    82
##  5 bitbucket.org            67
##  6 sites.google.com         54
##  7 arxiv.org                52
##  8 gitlab.com               46
##  9 docs.ropensci.org        39
## 10 www.github.com           32
## # … with 1,851 more rows

GitHub seems to be the most popular development platform, at least from this sample of CRAN packages that indicate a URL. It is also possible that some developers set up their own GitLab server with its own domain. Many packages link to www.r-project.org, which is not very informative, or to their own CRAN page, which can be informative.
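The table lists github.com and www.github.com separately. One possible refinement, not part of the analysis above, would be to strip a leading "www." before counting. A minimal sketch on the two counts copied from the table:

```r
# Counts copied from the domain table above.
domains <- c("github.com" = 4670, "www.github.com" = 32)

# Drop a leading "www." so both spellings map to the same host.
normalized <- sub("^www\\.", "", names(domains))

# Sum the counts per normalized domain.
merged <- tapply(domains, normalized, sum)
merged
```

Applied to the full parsed_db, the same sub() call could be used on the domain column before dplyr::count().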

Other relatively popular domains are sites.google.com and arxiv.org. There are probably links to venues for scientific publications other than arxiv.org. What about doi.org?

dplyr::filter(parsed_db, domain %in% c("doi.org", "dx.doi.org")) %>%
  dplyr::select(Package, actual_url)
## # A tibble: 44 x 2
##    Package                actual_url                                    
##    <chr>                  <chr>                                         
##  1 abcrlda                https://dx.doi.org/10.1109/LSP.2019.2918485   
##  2 adwave                 https://doi.org/10.1534/genetics.115.176842   
##  3 ammistability          https://doi.org/10.5281/zenodo.1344756        
##  4 anMC                   https://doi.org/10.1080/10618600.2017.1360781 
##  5 ANOVAreplication       https://dx.doi.org/10.17605/OSF.IO/6H8X3      
##  6 AssocAFC               https://doi.org/10.1093/bib/bbx107            
##  7 augmentedRCBD          https://doi.org/10.5281/zenodo.1310011        
##  8 CorrectOverloadedPeaks http://dx.doi.org/10.1021/acs.analchem.6b02515
##  9 dataMaid               https://doi.org/10.18637/jss.v090.i06         
## 10 disclapmix             http://dx.doi.org/10.1016/j.jtbi.2013.03.009  
## # … with 34 more rows

The “earlier but no longer preferred” dx.doi.org is still in use.

The rOpenSci docs server also makes an appearance.

Note that you could do a similar analysis of the BugReports field. We’ll leave that as an exercise to the reader. 😉

Conclusion

In this note, we explained why having URLs in the DESCRIPTION of your package can help users and contributors find the right venues for their needs, and we had a look at the URLs currently stored in the DESCRIPTION files of CRAN packages, in particular discussing currently popular domains. How do you ensure the users of your package can find its best online home(s)? How do you look for online home(s) of the packages you use?