R caveats and pitfalls

Each year the EMBL PhD students in their 2nd year visit the EBI for a week-long course in bioinformatics.

For this, I gave a talk on approaches and pitfalls in data analysis using R. The goal was to show some common mistakes that cost me and others a lot of time when we started out with this language, so that the participants could avoid them.

The slides can be found here on Google Drive.

I decided to focus on three parts: first simple indexing operations, then loops vs. apply (and vectorization), and finally a showcase of dplyr.

Warm-up exercise using indexing

This was a warm-up exercise, because all the participants had already taken an introductory R course the year before. We started by making a data.frame and two vectors to index.

mydf = data.frame(num=1:5, let=rev(letters[1:5]))  # in R >= 4.0, add stringsAsFactors=TRUE so 'let' becomes a factor
x = setNames(letters[1:5], rev(1:5))               # values "a".."e", names "5".."1"
y = setNames(1:5, rev(letters[1:5]))               # values 1..5, names "e".."a"

The first bit was a reminder that a data.frame can be indexed both as a matrix and as a list.

mydf[1]         # first column, still a data.frame (list-style)
mydf[1,]        # first row, still a data.frame (matrix-style)
mydf[[1]]       # first column as a vector
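
A closely related pitfall worth knowing (my addition here, not part of the exercise): matrix-style indexing of a single column silently drops it to a vector unless you ask otherwise.

mydf[, 1]              # first column, dropped to a vector
mydf[, 1, drop=FALSE]  # first column, kept as a data.frame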

The second bit was to show that while the integer 2 and the character "2" compare as equal (== coerces the number to a character), indexing with them does not produce the same result.

x[2]            # "b"
y[2]            # 2
2 == "2"        # TRUE
x["2"]          # "d"
y["2"]          # NA (because no "2" in the names)

And finally, the problem of indexing with a factor. Because each level is internally represented by an integer value, R falls back to using this (often invisible) integer for indexing instead of the character representation of the level. Here, the counter-intuitive example is that the last line returns the number 4 even though the factor level is "d".

mydf$let[2]     # the factor level "d" (internally a 4, its position in the levels a..e)
y["d"]          # 2
y[mydf$let[2]]  # y[4], i.e. the number 4 -- not y["d"]
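
The usual workaround (my addition, not from the slides) is to convert the factor to a character vector first, so that indexing uses the names again:

y[as.character(mydf$let[2])]  # "d" as a string, so this gives 2 as expected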

Loops, apply, and vectorization

First, the task was to write a function that takes a number n and, for each number from 1 to n, multiplies it by 2 and adds 5. This is a very simple toy example, but it illustrates some common coding problems.

One of the solutions was the one below. This is not optimal, but I think attendees will be more receptive to the reasons why if they chose to implement the problem in a similar manner themselves (as it hints that they might make the same mistakes elsewhere).

mytest = function(n) {
    new_n = c()            # empty vector, grown in each iteration
    for (i in 1:n) {
        new_n[i] = i * 2 + 5
    }
    new_n
}

However, there is a better way to use loops. If we reserve the memory in advance, the computation is a lot faster because R does not need to allocate new memory with each iteration.

new_n = rep(NA_real_, n)
for (i in seq_along(new_n))
    new_n[i] = i * 2 + 5

Another important point here is to use seq_along() and seq_len() in order to get an empty vector if the length is 0, not c(1, 0) as 1:n would produce. We also use NA_real_ instead of a plain NA so the vector does not have to be re-allocated from logical to numeric (H/T @HenrikBengtsson).
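
To see the edge case this avoids, here is a minimal demonstration (not from the slides):

n = 0
1:n         # c(1, 0) -- the loop body would run twice
seq_len(n)  # integer(0) -- the loop body is skipped entirely

The same computation can also be written with sapply():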

new_n = sapply(seq_len(n), function(x) x * 2 + 5)

Is sapply() faster than the loop? It turns out it is not, because it uses an R loop internally (as opposed to a C loop). But of course there are good reasons to use the apply functions instead of loops: they are more expressive and emphasize the object over the action (of looping).

However, the real answer to this is of course vectorization:

new_n = seq_len(n) * 2 + 5

This is by far the fastest, easiest to read, and easiest to debug. If you can solve a problem using vectorized code, you should do it.
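
A rough way to check these claims (a timing sketch using base R; absolute numbers will vary by machine and R version):

n = 1e6
loop_fn = function(n) {
    out = rep(NA_real_, n)
    for (i in seq_len(n))
        out[i] = i * 2 + 5
    out
}
system.time(loop_fn(n))                                 # preallocated loop
system.time(sapply(seq_len(n), function(x) x * 2 + 5))  # R-level loop inside sapply
system.time(seq_len(n) * 2 + 5)                         # vectorized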

The only problem with this example is that the solution is a bit too obvious. I should swap it for something where you actively have to think about how to vectorize, for instance by introducing multiple matrices and then combining them, as sketched below.
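
A hypothetical replacement exercise along those lines (my sketch, not from the talk) could ask to turn a double loop over two matrices into a single vectorized expression:

A = matrix(1:6, nrow=2)
B = matrix(6:1, nrow=2)

# loop version: fill a result matrix element by element
res = matrix(NA_real_, nrow(A), ncol(A))
for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(A)))
        res[i,j] = A[i,j] * B[i,j] + 5

# vectorized version: arithmetic on matrices is element-wise
res = A * B + 5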

Handling data.frames the classic way vs. dplyr

The task here was to answer the following question on the nycflights13 dataset:

For flights between 4 and 12 hours, what are the 10 destination airports with highest average arrival delay?

So, in other words: filter the field hour to be between 4 and 12, then for each dest airport calculate the mean of the field arr_delay. In base R, a possible solution is below:

library(nycflights13)
fl = subset(flights, hour >= 4 & hour <= 12)
dest = unique(fl$dest)
delay = setNames(rep(NA_real_, length(dest)), dest)
for (d in dest) {
    avg = mean(fl[fl$dest == d,]$arr_delay, na.rm=TRUE)
    delay[d] = avg
}
head(sort(delay, decreasing=TRUE), 10)

Even though the question posed is simple compared to real research questions, the code needed to answer it is already hard to interpret: looking at it, it is difficult to make out what the original question was.

There is a more structured alternative: dplyr. This package defines the following verbs to select and transform data:

filter() - keep only rows that match a condition
select() - keep only some columns
group_by() - split the rows into groups
summarize() - collapse each group into summary values
arrange() - sort the rows

By pure coincidence, the package vignette already uses the nycflights13 dataset, and the verbs above are already in the right order for the task (except select, which is not needed here). This, combined with the %>% operator, gives the following solution:

library(dplyr)
flights %>%
    filter(hour >= 4 & hour <= 12) %>%
    group_by(dest) %>%
    summarize(delay = mean(arr_delay, na.rm=TRUE)) %>%
    arrange(-delay) %>%
    head(10)

Much better.

Notice how similar the code sounds to the verbal instructions I gave above.

One thing to note is that if anything goes wrong, the error messages produced by a pipe are not particularly helpful. To address this, @gaborcsardi wrote the R package tamper.
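
As an illustration (my example, not from the talk), a simple typo in the middle of a pipe produces an error about the unknown name, but traceback() mostly points into the pipe machinery rather than at the offending step:

flights %>%
    filter(hour >= 4 & hour <= 12) %>%
    group_by(dset) %>%    # typo: should be dest
    summarize(delay = mean(arr_delay, na.rm=TRUE))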