# R caveats and pitfalls

Each year the EMBL PhD students in their 2nd year visit the EBI for a week-long course in bioinformatics.

For this, I gave a talk on approaches and pitfalls in data analysis using R. The goal was to show some common mistakes that cost me and others a lot of time when we started out using this language, so that the participants could avoid them.

The slides can be found here on Google Drive.

I decided to focus on three parts: first simple indexing operations, then loops vs. `apply` (and vectorization), and finally a demonstration of `dplyr`.

### Warm-up exercise using indexing

This was a warm-up exercise because all the participants had already taken an introductory R course the year before. We started by creating a `data.frame` and two vectors to index.

The first bit was a reminder that a `data.frame` can be indexed both like a matrix and like a list.
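The code from the session is not reproduced here, so the following is only a minimal sketch of the idea. The `data.frame` and its column names are made up for illustration.

```r
# Toy data; column names and values are invented for this example
df <- data.frame(num = 1:4, chr = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# Matrix-style indexing: [row, column]
df[2, "chr"]   # a single cell
df[2, ]        # the second row, still a data.frame
df[, 1]        # the first column as a plain vector

# List-style indexing: each column is a list element
df[["chr"]]    # a column as a plain vector
df$num         # the same idea, by name
df["num"]      # single bracket keeps the data.frame
```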

The second bit was to show that while the number `2` and the character `"2"` are considered equal when compared with `==`, indexing with them does not produce the same results.
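A small sketch of this, assuming a named vector whose names deliberately disagree with the positions (the data is made up):

```r
# Names and positions are intentionally out of sync
x <- c("2" = "first", "3" = "second", "1" = "third")

2 == "2"   # TRUE: the number is coerced to character before comparing
x[2]       # indexes by position: the second element
x["2"]     # indexes by name: the element called "2", which is the first one
```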

The final point was the problem of indexing with a factor. Because each level is represented by an integer value internally, R falls back to using this (often invisible) value for indexing instead of the character representation of the level. Here, the counter-intuitive example is that the last line returns the number `4` even though the factor level is `"d"`.
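The original example is not preserved here. One way to reproduce the effect, assuming a made-up named vector and a factor whose level order differs from the order of the names, is:

```r
# Invented data: the names run d, c, b, a, but the factor levels run a, b, c, d
v <- c(d = 1, c = 2, b = 3, a = 4)
f <- factor("d", levels = c("a", "b", "c", "d"))

as.integer(f)        # 4: the internal code of level "d"
v["d"]               # 1: lookup by name
v[as.character(f)]   # 1: converting to character first indexes by label
v[f]                 # 4: R indexes with the integer code, not the label
```

Converting the factor with `as.character()` before indexing avoids the surprise.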

### Loops, `apply`, and vectorization

First, the task was to write a function that takes a number `n` and, for each number from `1` to `n`, multiplies it by 2 and adds 5. This is a very simple toy example, but it illustrates some common coding problems.

One of the solutions was the one below. This is not optimal, but I think attendees will be more receptive to the reasons why if they chose to implement the problem in a similar manner (as it hints they might make these mistakes elsewhere).
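The submitted solution itself is not shown here. A typical version with these problems, written as a sketch with a hypothetical function name, grows the result vector by one element per iteration:

```r
# Hypothetical name; the result is re-created and copied on every iteration
times2plus5_grow <- function(n) {
  result <- c()
  for (i in 1:n) {                    # 1:n misbehaves for n = 0 (it is c(1, 0))
    result <- c(result, i * 2 + 5)    # appending forces a new allocation
  }
  result
}

times2plus5_grow(3)   # 7 9 11
```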

However, there is a better way to use loops. If we reserve memory in advance, computation is a lot faster because R does not need to reserve new memory with each iteration.

Another important part here is to use `seq_along()` and `seq_len()`, which return an empty vector if the length is 0, whereas `1:n` would give `c(1, 0)`. We also use `NA_real_` instead of a plain `NA` so that R does not have to re-allocate the vector when converting it from `logical` to `numeric` (H/T @HenrikBengtsson).
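Put together, a pre-allocating loop could look like the sketch below (the function name is a placeholder, not taken from the slides):

```r
# Hypothetical name; the result vector is reserved once, then filled in place
times2plus5_prealloc <- function(n) {
  result <- rep(NA_real_, n)   # numeric vector of length n, allocated up front
  for (i in seq_len(n)) {      # seq_len(0) is empty, so the loop body is skipped
    result[i] <- i * 2 + 5
  }
  result
}

times2plus5_prealloc(3)   # 7 9 11
times2plus5_prealloc(0)   # numeric(0)
1:0                       # 1 0, which is why 1:n is risky
```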

Is `sapply()` faster than the loop? It turns out that it is not, because it still calls the R function once per element instead of doing the whole computation in compiled code. But of course there are good reasons to use the `apply` family instead of loops: these functions are more expressive and emphasize the object over the action (of looping).
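For comparison, the same toy task with `sapply()` might be sketched like this (again with a hypothetical function name):

```r
# Hypothetical name; applies an anonymous function to each index
times2plus5_sapply <- function(n) {
  sapply(seq_len(n), function(i) i * 2 + 5)
}

times2plus5_sapply(3)   # 7 9 11
```

One caveat of this sketch: for `n = 0`, `sapply()` returns an empty list rather than `numeric(0)`. Using `vapply()` with `numeric(1)` as the template avoids that.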

However, the real answer to this is of course vectorization:
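As a minimal sketch (the function name is a placeholder), arithmetic in R already works element-wise on whole vectors, so no explicit loop is needed:

```r
# Hypothetical name; the multiplication and addition apply to every element
times2plus5_vec <- function(n) {
  seq_len(n) * 2 + 5
}

times2plus5_vec(3)   # 7 9 11
times2plus5_vec(0)   # numeric(0)
```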

This is by far the fastest, easiest to read, and easiest to debug. If you can solve a problem using vectorized code, you should do it.

The only problem with this example is that the vectorized solution is a bit too obvious. I should swap it for something where you actively have to think about how to vectorize, for instance by introducing multiple matrices and then combining them.

### Handling `data.frame`s the classic way vs. `dplyr`

The task here was to answer the following question on the `nycflights13` dataset:

For flights between 4 and 12 hours, what are the 10 destination airports with highest average arrival delay?

In other words: filter for the field `hour` to be between `4` and `12`, then for each `dest` airport calculate the mean of the field `arr_delay`, and finally keep the 10 airports with the highest mean. In base R, a possible solution is below:
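The base R code from the talk is not reproduced here. The sketch below is one possible version, wrapped in a hypothetical helper `top_delays_base()` so that it can be tried on any data frame with `hour`, `dest` and `arr_delay` columns.

```r
# Hypothetical helper; expects columns hour, dest and arr_delay
top_delays_base <- function(flights, n = 10) {
  # which() drops rows where hour is NA instead of returning NA rows
  sel <- flights[which(flights$hour >= 4 & flights$hour <= 12), ]
  # the formula interface of aggregate() omits rows with missing arr_delay
  avg <- aggregate(arr_delay ~ dest, data = sel, FUN = mean)
  avg <- avg[order(avg$arr_delay, decreasing = TRUE), ]
  head(avg, n)
}
```

With the package installed, it would be called as `top_delays_base(nycflights13::flights)`.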

Even though the question is simple compared to real research questions, the code needed to answer it is already hard to interpret: it is difficult to work out from the code alone what the question was.

There is a more structured alternative: dplyr. This package defines the following verbs to select and transform data:

• `select()` - filter for columns
• `filter()` - filter for rows
• `group_by()` - handle each group (each value of a variable) separately
• `summarize()` - summarize each group
• `arrange()` - order by field

By pure coincidence, the package vignette already uses the `nycflights13` dataset and the verbs above are already in the right order to use on the task (except `select`, which is not needed). This, combined with the `%>%` operator gives the following solution:
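A sketch of such a pipeline is shown here. It is wrapped in a hypothetical helper `top_delays_dplyr()` so it works on any data frame with `hour`, `dest` and `arr_delay` columns, and it assumes the `dplyr` package is installed.

```r
library(dplyr)

# Hypothetical helper; the verbs appear in the order of the verbal instructions
top_delays_dplyr <- function(flights, n = 10) {
  flights %>%
    filter(hour >= 4, hour <= 12) %>%                        # keep hours 4 to 12
    group_by(dest) %>%                                       # one group per airport
    summarize(arr_delay = mean(arr_delay, na.rm = TRUE)) %>% # mean delay per group
    arrange(desc(arr_delay)) %>%                             # highest delay first
    head(n)                                                  # keep the top n
}
```

With the dataset installed, it would be called as `top_delays_dplyr(nycflights13::flights)`.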

Much better.

Notice how similar the code sounds to the verbal instructions I gave above.

One thing to note is that if anything goes wrong, the error messages produced by a pipe are not particularly helpful. To address this, @gaborcsardi wrote the R package tamper.