R caveats and pitfalls
24 Nov 2016Each year the EMBL PhD students in their 2nd year visit the EBI for a week-long course in bioinformatics.
For this, I gave a talk approaches and pitfalls in data analysis using R. The goal was to show some common mistakes that cost my and others a lot of time when we started out using this language so the participants could avoid them.
The slides can be found here on Google Drive.
I decided to focus on three parts, first simple indexing operations, then move
on to loops vs. apply
(and vectorization) and show off dplyr
in the end.
Warm-up exercise using indexing
This was a warm-up exercise because all the participants already had an
introductory R course the year before. We started making a data.frame
and two
vectors to index.
The first bit was a reminder that a data.frame
could be indexed as a matrix
and as a list.
The second bit was to show that while the integer 2
and the character "2"
are considered equal, indexing with them does not produce the same results.
And finally the problem of indexing with a factor. Because each level is
represented with an integer value internally, R falls back to use this (often
invisible) value for indexing instead of the character representation of the
level. Here, the counter-intuitive example is that the last line returns the
number 4
even though the factor level is "d"
.
Loops, apply
, and vectorization
First, the task was to write a function that takes a number n
and, for each
number of 1
to n
multiplies this by 2 and adds 5. This is a very simple toy
example but it illustrates some common coding problems.
One of the solutions was the one below. This is not optimal, but I think attendees will be more receptive to the reasons why if they chose to implement the problem in a similar manner (as it hints they might make these mistakes elsewhere).
However, there is a better way to use loops. If we reserve memory in advance, computation is a lot bigger because R does not need to reserve new memory with each iteration.
Another important part here is to use seq_along()
and seq_len()
in order to
return an empty vector if the length is 0, not c(1,0)
as would be with 1:n
.
We also use NA_real_
instead of a plain NA
to not re-allocate
memory from logical
to
numeric
(H/T @HenrikBengtsson).
Is sapply()
faster than the loop? Turns out, it is not because it uses an
R loop internally (as opposed to a C loop). But of course there are good
reasons to use apply instead of loops: they are more expressive and emphasize
the object over the action (of looping).
However, the real answer to this is of course vectorization:
This is by far the fastest, easiest to read, and easiest to debug. If you can solve a problem using vectorized code, you should do it.
Only problem with this example is that this solution is a bit too obvious. I should swap this example for something you actively have to think about how to vectorize, for instance by introducing multiple matrices and then combining them.
Handling data.frame
s the classic way vs. dplyr
The task here was the answer the following question on the nycflights13
dataset:
For flights between 4 and 12 hours, what are the 10 destination airports with highest average arrival delay?
So, in other words filter for the field hour
to be between 4
and 12
, then
for each dest
airport calculate the mean of the field arr_delay
. In base R,
a possible solution is below:
Even though the question posed is simple by comparison to research questions, the code needed to answer is already hard to interpret when looking at it in order to make out what the question was.
There is a more structured alternative: dplyr. This package defines the following verbs to select and transform data:
select()
- filter for columnsfilter()
- filter for rowsgroup_by()
- handle each variable separatelysummarize()
- summarize each variablearrange()
- order by field
By pure coincidence, the
package vignette
already uses the nycflights13
dataset and the verbs above are already in the
right order to use on the task (except select
, which is not needed). This,
combined with the %>%
operator gives the following solution:
Much better.
Notice how similar the code sounds to the verbal instructions I gave above.
One thing to note is that if anything goes wrong, the error messages produced by a pipe is not particularly helpful. To address this, @gaborcsardi wrote the R package tamper.