Custom handling of errors and warnings in R function calls

Let’s consider the following function in R, and assume we want to catch its errors and warnings for later processing:

f = function(a, ...) {
    if (a %% 3 == 0)
        warning("this is a warning")
    if (a %% 2 == 0)
        stop("this is an error")
    a
}

It’s pretty simple to catch either the errors or the warnings:

# this will return a "try-error" class if an error occurred
# however, we can't handle warnings
try(f(2))

# this will catch warning messages and process them
# however, it will not continue the function execution after,
#   i.e. result is NULL
result = tryCatch(f(3),
    error=function(e) message("process error"),
    warning=function(w) message("process warning"))

Also, if a call raises both a warning and an error, the tryCatch() block only processes the warning, because execution is aborted at the first condition it handles:

# no result, no error processing
result = tryCatch(f(6),
    error=function(e) message("process error"),
    warning=function(w) message("process warning"))

If we want to handle warnings and errors equally, and still get the function result when no error occurred, we need a more complicated construct (here the original function is renamed to fx, and we wrap it in a new f):

f = function(...) {
    context = new.env()

    context$result = withCallingHandlers(
        withRestarts(
            fx(...),
            muffleStop = function() NULL
        ),
        warning = function(w) {
            context$warnings = c(context$warnings,
                 list(as.character(conditionMessage(w))))
            invokeRestart("muffleWarning")
        },
        error = function(e) {
            context$error = as.character(conditionMessage(e))
            invokeRestart("muffleStop")
        }
    )

    as.list(context)
}

f(a=1)
f(a=3)
f(a=2)
f(a=6)

Each of these calls returns a list with the element result holding the return value of the function call, plus warnings and error elements if they occurred.
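
For instance, the call that triggers both conditions can be inspected like this (a minimal sketch of the expected structure; the exact element order of as.list() on an environment is not guaranteed):

res = f(a=6)
res$warnings    # list("this is a warning")
res$error       # "this is an error"
res$result      # NULL, because fx() was aborted via the muffleStop restart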

Shared HPC software installations

TL;DR: Keep your own installation on a computing cluster, don’t rely on others. But do use a package manager. And share the build scripts with the upstream repository, so everyone can easily maintain and extend their own tool collection.

I work at a research institute with considerable computing capabilities. Things you can’t run on your normal work machine you send to a high performance computing cluster. Administration is taken care of by a group of people specifically employed to make sure things run smoothly, so you just log in and run whatever computing task you require.

Well, almost.

Of course, there is some software needed in order to run your analyses. You installed this on your local machine a long time ago using your OS’s package manager. But is it available on the computing cluster?

Typical HPC setup

The system administrators were happy to add software available in the official distribution’s repositories, but not from third-party sources. Of course, this makes sense, because everything needs to be stable. And stable meant the system libraries were generally a couple of years out of date.

We were lucky to have a person who, next to his regular responsibilities, installed software that people needed on a semi-regular basis. Or people compiled different tools themselves. Every new tool was a new entry in $PATH, which grew continuously. And in $LD_LIBRARY_PATH too, of course, which resulted in broken dependencies all the time because something didn’t quite resolve right.

Manually compiling things often resulted in disabling optional functionality in order to get it done quickly (BLAS/LAPACK support for R? Then I’d need to install two more libraries). And it took a substantial amount of time to get anything running.

Package managers to the rescue

The obvious answer to the above mess is to use a package manager that takes care of dependencies automatically and provides a much larger set of available software or build scripts to draw from than is readily available on the system.

At the time (first half of 2014), however, there were not many options available. After looking for possible solutions for a couple of weeks I discovered the Gentoo Prefix project that provided a user-level package manager which could use (almost) all the build scripts from the distribution.

This was great! It provided a base installation that enabled a non-root user to use Portage in a specified directory, resolved dependencies automatically, and resolved library locations using RPATH instead of the $LD_LIBRARY_PATH mess.

emerge <whatever you want>

Magic.

The only downside was that the bootstrap and compilation options were a bit involved, so it probably wasn’t something everyone would want to take care of themselves. Hence I set this up at my institute and advocated that people use it. We had a bug tracker and software requests, and we advertised updates before performing them so users were not surprised by small changes. About 30 people relied on it, and it worked really well for the three years to come.

Don’t use shared software installations

We had minor hiccups on occasion and a major hiccup once (basically, the linker added symbols from a broken internationalization library that cascaded into breaking everything newly added or updated). But these were a small price to pay for a usable and convenient system.

One annoying thing is that libraries and tools sometimes change their API between updates, which would break a workflow you were currently using. But since the installation was shared, a change for anyone was a change for everyone.

Since then, however, things have evolved. Linuxbrew (the Linux counterpart of the OS X package manager Homebrew) and Conda were released. And they are substantially simpler to use than Gentoo.

So I don’t really see a reason to still impose API changes on every user unless they want to update their own installation. The new package managers are easy enough to use individually, and storage space is cheap, so why bother?

But please contribute your build scripts

There is, however, one thing that often goes underappreciated. If you need a tool that doesn’t yet exist in a repository and you have already put in the effort to install it manually, go one step further and add it as a build script for the package manager. Please. It helps so much if more people do that.

XML to tsv using XSLT

I’m working a lot with gene expression data, for instance from The Cancer Genome Atlas (TCGA). One big challenge is to investigate biological variation in different samples when it is potentially masked by variation in the way the samples were collected, stored, or processed.

When 20,000 genes are measured simultaneously, how do you know? The way you usually look at this is to perform some sort of dimensionality reduction on the data set and check whether the samples cluster by the batch they were collected or analysed in. TCGA assigns barcodes to its samples that provide a lot of this information, like the plate ID.

However, there is other information that is not in the barcode directly, like the date a sample was shipped for sequencing. The GDC Data Portal provides additional information in the Biospecimen Supplement files, but extracting it from these XML files is not trivial.

The information I was looking for was a table linking the assigned barcode to the shipment date, which was hidden in a structure like this:

<?xml version="1.0" encoding="UTF-8"?>
<bio:tcga_bcr ...>
  <bio:patient>
    <bio:samples>
      <bio:sample>
        <bio:portions>
          <bio:portion>
            <bio:analytes>
              <bio:analyte>
                <bio:aliquots>
                  <bio:aliquot>
  <bio:bcr_aliquot_barcode>TCGA-x</bio:bcr_aliquot_barcode>
  <bio:year_of_shipment>2011</bio:year_of_shipment>
                  </bio:aliquot>
                </bio:aliquots>
              </bio:analyte>
            </bio:analytes>
          </bio:portion>
        </bio:portions>
      </bio:sample>
    </bio:samples>
  </bio:patient>
</bio:tcga_bcr>

One way to convert XML-formatted information into a table is to use XSLT. There are nice tutorials on how to use it, but until recently I did not know how to handle XML namespaces, which the GDC files use.

The way to solve this is rather simple: look at the xmlns: attributes in the XML header that you are interested in, and add them to your xsl:stylesheet tag. Then it’s just a matter of referencing them correctly, looping through your tags, and printing out the values separated by tabs, with a newline at the end.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:bio="http://tcga.nci/bcr/xml/biospecimen/2.7"
    xmlns:admin="http://tcga.nci/bcr/xml/administration/2.7">
<xsl:output method="text" indent="no"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/">

  <xsl:for-each select="bio:tcga_bcr/bio:patient/bio:samples/bio:sample">
  <xsl:for-each select="bio:portions/bio:portion">
  <xsl:for-each select="bio:analytes/bio:analyte">
  <xsl:for-each select="bio:aliquots/bio:aliquot">

    <xsl:value-of select="bio:bcr_aliquot_barcode"/>
    <xsl:text>&#x9;</xsl:text>
    <xsl:value-of select="bio:year_of_shipment"/>
    <xsl:text>-</xsl:text>
    <xsl:value-of select="bio:month_of_shipment"/>
    <xsl:text>&#xa;</xsl:text>

  </xsl:for-each>
  </xsl:for-each>
  </xsl:for-each>
  </xsl:for-each>

</xsl:template>
</xsl:stylesheet>

This is of course much less error-prone than using a regular expression. And it produces a table for a single file.

The data I downloaded from the GDC contained 500 files, so I used the find command to execute the XSLT transformation on all of them:

tar -xf gdc_download_20170505.tar.gz
find gdc_download_20170505 -name "*.xml" -exec xsltproc my.xslt {} \;

In the end, I got a table of a few thousand barcodes and the date they were shipped as tab-separated values.

Barcode Date
TCGA-AA-A01C-01A-01D-A008-01 2010-03
TCGA-AA-A01C-01A-01D-A00D-09 2010-03
TCGA-AA-A01C-01A-01D-A00G-10 2010-04
TCGA-AA-A01C-01A-01D-A077-02 2010-09
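
If the output of the find command is redirected into a single file (for instance by appending > shipment_dates.tsv to it; that file name is made up for illustration), the table is easy to read back into R:

# read the collected barcode/date pairs; the XSLT output has no header line
shipments = read.delim("shipment_dates.tsv", header=FALSE,
    col.names=c("Barcode", "Date"), stringsAsFactors=FALSE)
head(shipments)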

A sensible .Rprofile

Here are some settings that I found useful for working with R. Note that I’m mainly using plain R scripts in an HPC environment, and not RStudio, most of the time.

TL;DR: have a look at the whole file.

Custom library path

R is not hard to install, but there are a couple of compile-time configuration steps to take care of:

  • Graphics backend support using cairo
  • BLAS support to speed up linear algebra
  • Profiling needs to be enabled in the configure options (this is the default)

For these reasons, it makes sense to have a central installation on an HPC cluster that gets the details right. However, this comes with some challenges.

In particular, the separation of user-readable system packages and user-writable personal packages does not always work.

This is why I found it easiest to specify an explicit library path.

.libPaths("~/.R")

This way, R will ignore the system-provided add-on packages and install all required packages in the above directory, where you have full write access.
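
A quick check that the path took effect (a minimal sketch; the package name is just an example):

.libPaths()                  # the expanded ~/.R directory should now be listed first
install.packages("dplyr")    # installs into ~/.R rather than the system library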

Custom options

Here are some tweaks for converting strings to factors, selecting a default CRAN mirror, and warning about partial matching of subsetting and function arguments.

options(
    stringsAsFactors = FALSE,

    menu.graphics = FALSE,
    repos = c(CRAN="http://mirrors.ebi.ac.uk/CRAN/"),

    warnPartialMatchArgs = TRUE,
    warnPartialMatchAttr = TRUE,
    warnPartialMatchDollar = TRUE
)

There is some controversy about how R is supposed to handle strings: convert characters to factors by default or not? I’ve been a strong proponent of disabling the conversion, for which I’ve been accused of not understanding how factors work. True, code is less portable when setting a non-default option, but factors mess things up enough that I don’t want to deal with them.

Another tweak is to not display the graphical menu for selecting a mirror when downloading from CRAN (this just delays what I actually wanted to do) and to set a default mirror instead.

A third is to display a warning when I either try to pass a named function argument that does not exactly match what the function is expecting (which helps avoid mixing up arguments and the confusing errors that follow), or when I partially match a name when subsetting a list or data.frame using the $ sign.
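
Here is what these warnings look like in practice (a minimal sketch with made-up names, assuming the options above are set):

f2 = function(verbose = FALSE) verbose
f2(verb = TRUE)      # warning: partial argument match of 'verb' to 'verbose'

lst = list(value = 1)
lst$val              # warning: partial match of 'val' to 'value'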

Traceback on non-interactive scripts

When scripts run non-interactively and fail, it is often challenging to find the source of the error or the function call in which R failed, because by default only the error message is reported, without additional context.

Using the lines below, you can get a traceback on errors when scripts fail, without impacting your interactive sessions.

if (!interactive())
    options(error = function() {
        traceback(2, max.lines=10)
        quit(save="no", status=1)
    })

The guard to exit R after printing the traceback is essential, because otherwise R will happily try to execute the rest of the script in an undefined state.

R caveats and pitfalls

Each year the EMBL PhD students in their 2nd year visit the EBI for a week-long course in bioinformatics.

For this, I gave a talk on approaches and pitfalls in data analysis using R. The goal was to show some common mistakes that cost me and others a lot of time when we started out with this language, so the participants could avoid them.

The slides can be found here on Google Drive.

I decided to focus on three parts: first simple indexing operations, then loops vs. apply (and vectorization), and finally dplyr.

Warm-up exercise using indexing

This was a warm-up exercise, because all the participants had already had an introductory R course the year before. We started by making a data.frame and two vectors to index.

mydf = data.frame(num=1:5, let=rev(letters[1:5]))
x = setNames(letters[1:5], rev(1:5))
y = setNames(1:5, rev(letters[1:5]))

The first bit was a reminder that a data.frame could be indexed as a matrix and as a list.

mydf[1]         # first column enclosed in a list
mydf[1,]        # first row, still as data.frame
mydf[[1]]       # first column as vector

The second bit was to show that while the integer 2 and the character "2" are considered equal, indexing with them does not produce the same results.

x[2]            # "b"
y[2]            # 2
2 == "2"        # TRUE
x["2"]          # "d"
y["2"]          # NA (because no "2" in the names)

And finally, the problem of indexing with a factor. Because each level is represented by an integer value internally, R falls back to using this (often invisible) value for indexing instead of the character representation of the level. Here, the counter-intuitive example is that the last line returns the number 4 even though the factor level is "d".

mydf$let[2]     # the factor level "d" (internally a 4)
y["d"]          # 2
y[mydf$let[2]]  # 4, because indexing uses the internal code 4, not the name "d"
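
If you do want to index by the level itself, one way (not from the slides, just a sketch) is to convert the factor to character first:

y[as.character(mydf$let[2])]  # "d" is now used as a name, so this gives 2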

Loops, apply, and vectorization

First, the task was to write a function that takes a number n and, for each number from 1 to n, multiplies it by 2 and adds 5. This is a very simple toy example, but it illustrates some common coding problems.

One of the solutions was the one below. This is not optimal, but I think attendees are more receptive to the reasons why if they chose to implement the problem in a similar manner (as it hints they might make these mistakes elsewhere).

mytest = function(n) {
    new_n = c()
    for (i in 1:n) {
        new_n[i] = i * 2 + 5
    }
    new_n
}

However, there is a better way to use loops. If we allocate the memory in advance, computation is a lot faster because R does not need to allocate new memory with each iteration.

new_n = rep(NA_real_, n)
for (i in seq_along(new_n))
    new_n[i] = i * 2 + 5

Another important point here is to use seq_along() and seq_len(), which return an empty vector if the length is 0, rather than c(1,0) as 1:n would. We also use NA_real_ instead of a plain NA so the vector does not have to be re-allocated from logical to numeric (H/T @HenrikBengtsson).

new_n = sapply(seq_len(n), function(x) x * 2 + 5)

Is sapply() faster than the loop? It turns out it is not, because it uses an R loop internally (as opposed to a C loop). But of course there are good reasons to use apply functions instead of loops: they are more expressive and emphasize the object over the action (of looping).

However, the real answer to this is of course vectorization:

new_n = seq_len(n) * 2 + 5

This is by far the fastest, easiest to read, and easiest to debug. If you can solve a problem using vectorized code, you should do it.

The only problem with this example is that the vectorized solution is a bit too obvious. I should swap it for something you actively have to think about how to vectorize, for instance by introducing multiple matrices and then combining them; see the sketch below.
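
A possible replacement exercise (hypothetical, not from the talk) could be to scale each column of a matrix by a different factor, where the vectorized form takes a moment of thought:

m = matrix(rnorm(20), nrow=5)    # a 5x4 matrix of random numbers
s = 1:4                          # one scaling factor per column

# loop version
out = matrix(NA_real_, nrow(m), ncol(m))
for (j in seq_len(ncol(m)))
    out[,j] = m[,j] * s[j]

# vectorized version: sweep() applies the factors along the columns
out2 = sweep(m, 2, s, `*`)
all.equal(out, out2)             # TRUE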

Handling data.frames the classic way vs. dplyr

The task here was to answer the following question on the nycflights13 dataset:

For flights between 4 and 12 hours, what are the 10 destination airports with highest average arrival delay?

So, in other words: filter for the field hour to be between 4 and 12, then for each dest airport calculate the mean of the field arr_delay. In base R, a possible solution is below:

fl = subset(flights, hour >= 4 & hour <= 12)
dest = unique(fl$dest)
delay = setNames(rep(NA, length(dest)), dest)
for (d in unique(fl$dest)) {
    avg = mean(fl[fl$dest == d,]$arr_delay, na.rm=TRUE)
    delay[d] = avg
}
head(sort(delay, decreasing=TRUE), 10)

Even though the question posed is simple compared to real research questions, the code needed to answer it is already hard to interpret: it takes some effort to work out from the code what the question was.

There is a more structured alternative: dplyr. This package defines the following verbs to select and transform data:

  • select() - filter for columns
  • filter() - filter for rows
  • group_by() - handle each group separately
  • summarize() - summarize each group
  • arrange() - order by field

By pure coincidence, the package vignette already uses the nycflights13 dataset, and the verbs above are already in the right order for the task (except select(), which is not needed). This, combined with the %>% operator, gives the following solution:

flights %>%
    filter(hour >= 4 & hour <= 12) %>%
    group_by(dest) %>%
    summarize(delay = mean(arr_delay, na.rm=TRUE)) %>%
    arrange(-delay) %>%
    head(10)

Much better.

Notice how similar the code sounds to the verbal instructions I gave above.

One thing to note is that, if anything goes wrong, the error messages produced by a pipe are not particularly helpful. To address this, @gaborcsardi wrote the R package tamper.