Let’s consider the following function in R, and assume we want to catch its
errors and warnings for later processing:
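The original function is not reproduced here; as a stand-in, assume something like the following sketch, which can return a result, raise a warning, or raise an error depending on its input (the name and behaviour are assumptions):

```r
risky <- function(x) {
    if (x < 0) {
        warning("x is negative, using its absolute value")
        x <- abs(x)
    }
    if (x > 100)
        stop("x is too large")
    sqrt(x)
}
```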
It’s pretty simple to catch either the errors or the warnings:
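For example, with the stand-in function above (a sketch; the handlers just return the condition messages):

```r
# Catch an error, or a warning, on its own
tryCatch(risky(200), error = function(e) conditionMessage(e))
tryCatch(risky(-1), warning = function(w) conditionMessage(w))
```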
Also, if we want to process both errors and warnings, a tryCatch() block with
both handlers only processes the warning:
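For instance, with both handlers in place, a call that first warns and then errors is unwound as soon as the warning is signalled, so only the warning handler ever runs and any eventual result is lost as well. A sketch with the stand-in function:

```r
# risky(-200) raises a warning first and would then raise an error,
# but tryCatch() stops at the first condition it catches: the warning
tryCatch(risky(-200),
         error   = function(e) paste("error:", conditionMessage(e)),
         warning = function(w) paste("warning:", conditionMessage(w)))
#> [1] "warning: x is negative, using its absolute value"
```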
If we want it to handle warnings and errors equally, and still get the
function result when no error occurred, we need a more complicated construct:
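A sketch of such a construct: withCallingHandlers() records warnings without aborting the evaluation, while the surrounding tryCatch() captures errors (the function name catch_all is an assumption):

```r
catch_all <- function(expr) {
    warnings <- list()
    error <- NULL
    result <- withCallingHandlers(
        tryCatch(expr, error = function(e) {
            error <<- e   # remember the error, return NULL as the result
            NULL
        }),
        warning = function(w) {
            warnings[[length(warnings) + 1]] <<- w  # collect the warning ...
            invokeRestart("muffleWarning")          # ... and keep going
        }
    )
    list(result = result, warnings = warnings, error = error)
}

catch_all(risky(4))      # result only
catch_all(risky(-1))     # result plus a warning
catch_all(risky(-200))   # a warning and an error, no result
```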
Each of these calls returns a list with the element result holding the result of
the function call, as well as warnings and error if they occurred.
TL;DR: Keep your own installation on a computing cluster, don’t rely on
others. But do use a package manager. And
share the
build scripts with the upstream repository, so everyone can easily maintain and
extend their own tool collection.
I work at a research institute with considerable computing capabilities. Things
you can’t run on your normal work machine you send to a high performance
computing cluster. Administration is taken care of by a group of people
specifically employed to make sure things run smoothly, so you just log in and
run whatever computing task you require.
Well, almost.
Of course, there is some software needed in order to run your analyses. You
installed this on your local machine a long time ago using your OS’s package
manager. But is it available on the computing cluster?
Typical HPC setup
The system administrators were happy to add software available in the official
distribution repositories, but not from third-party sources. Of course, this
makes sense because everything needed to be stable. And stable meant the system
libraries were generally out of date by a couple of years.
We were lucky to have a person who, next to his regular responsibilities,
installed software that people needed on a semi-regular basis. Or people
compiled different tools themselves. Every new tool was a new entry in $PATH,
which grew continuously. And so did $LD_LIBRARY_PATH, of course, which
resulted in broken dependencies all the time because something didn’t
quite resolve right.
Manually compiling things often resulted in disabling optional functionality in
order to get it done quickly (BLAS/LAPACK support for R? Then I’d need to
install two more libraries). And it took a substantial amount of time
to get anything running.
Package managers to the rescue
The obvious answer to the above mess is to use a package manager to take care
of dependencies automatically and provide a much larger set of available
software or build scripts to draw from than was readily available on the
system.
At the time (first half of 2014), however, there were not many options
available. After looking for possible solutions for a couple of weeks I
discovered the Gentoo Prefix
project that provided a
user-level package manager which could use (almost) all the build scripts from
the distribution.
This was great! It provided a base installation that enabled a non-root user
the use of portage in a
specified directory, resolved dependencies automatically, and resolved library
locations using RPATH instead of the $LD_LIBRARY_PATH mess.
Magic.
The only downside was that the bootstrap and compilation options were a bit
involved, so it probably wasn’t something everyone could take care of themselves. Hence I set
this up at my institute and
advocated that people use it. We had a bug
tracker, software
requests,
and advertised
updates
before we performed them so users were not surprised by small changes. About 30
people relied on it, and it worked really well for the next three years.
Don’t use shared software installations
We had minor
hiccups
on occasion and a major
hiccup once
(basically, the linker added symbols from a broken internationalization library
that cascaded into breaking everything newly added or updated). But these were
a small price to pay for a usable and convenient system.
One thing that was annoying was that libraries and tools sometimes change their
API between updates, which would break a workflow you were currently using.
But since the installation was shared, a change for anyone was a change for
everyone.
Since then, however, things have evolved. Linuxbrew
(the Linux counterpart to the OS X package manager Homebrew)
and Conda were released. And they were substantially
simpler to use than Gentoo.
So I don’t really see a reason to still put up with API changes for any user
unless they want to update their installation. The new package managers are
easy enough to use individually, and storage space is cheap, so why bother?
But please contribute your build scripts
There is, however, one thing to keep in mind that often goes underappreciated.
If you need a tool that doesn’t yet exist in a repository and you already put
the effort in to install it manually, go one step further and add it as a build
script for the package manager. Please. It helps so much if more people do that.
I’m working a lot with gene expression data, for instance from The Cancer
Genome Atlas (TCGA). One big challenge is to
investigate biological variation in different samples when it is potentially
masked by variation in the way the samples were collected, stored, or processed.
When 20,000 genes are measured simultaneously, how do you know? The way you
usually look at this is to perform some sort of dimensionality reduction on the
data set and check whether the samples cluster by the batch they were collected
or analysed in. The TCGA assigns barcodes to their samples that provide a lot
of information on this, like the plate ID.
However, there is other information that is not in the barcode directly, like
the date the samples were shipped for sequencing. The GDC Data
Portal provides additional information in the
Biospecimen Supplement files, but extracting this information from these XML
files is not trivial.
The information I was looking for was a table mapping the assigned barcode to
the shipment date, both of which were hidden in a structure like this:
One way to convert XML-formatted information into a table is to use
XSLT. There are nice
tutorials on how to use it,
but until recently I did not know how to handle the XML namespaces that the GDC
files contain.
The way to solve this is rather simple: look at all the xmlns: attributes in
the XML header that you are interested in, and add them to your
xsl:stylesheet tag. Then it’s just a matter of referencing them correctly,
looping through your tags and printing out the values in a tab-separated way
with a newline at the end.
This is of course much less error-prone than using a regular expression, and it
produces a table for a single file.
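The stylesheet itself is not reproduced here. As a rough sketch of the same per-file extraction done directly in R, the xml2 package can collect the namespace declarations for you via xml_ns(); note that the bio: prefix and the element names below are assumptions, not necessarily the actual GDC schema:

```r
library(xml2)

# Extract one barcode/shipment-date table from a single Biospecimen file
extract_shipment <- function(file) {
    doc <- read_xml(file)
    ns  <- xml_ns(doc)   # the xmlns: declarations from the XML header
    aliquots <- xml_find_all(doc, ".//bio:aliquot", ns)
    data.frame(
        barcode = xml_text(xml_find_first(aliquots, "./bio:bcr_aliquot_barcode", ns)),
        shipped = xml_text(xml_find_first(aliquots, "./bio:day_of_shipment", ns)),
        stringsAsFactors = FALSE
    )
}
```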
The data I downloaded from the GDC contained 500 files, so I used the find
command to execute the XSLT transformation on all of them:
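The find invocation is not shown here; as a sketch, an equivalent loop can also be done from within R, reusing the helper above (the download directory and output file name are assumptions):

```r
# Apply the extraction to every downloaded XML file and combine the results
files <- list.files("gdc_download", pattern = "\\.xml$",
                    recursive = TRUE, full.names = TRUE)
shipments <- do.call(rbind, lapply(files, extract_shipment))
write.table(shipments, "shipment_dates.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)
```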
In the end, I got a table of a few thousand barcodes and the date they were shipped as tab-separated values.
Here are some settings that I found useful for working with R. Note that I’m
mainly using basic R scripts in an HPC environment and not
RStudio most of the time.
Profiling needs to be enabled in the configure options (this is the default).
For these reasons, it can make sense to have a central install on an HPC
cluster to get the details right. However, that comes with some challenges.
In particular, the separation of user-readable system packages and
user-writable personal packages does not always
work.
This is why I found it easiest to specify an explicit library path.
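A minimal sketch of what that could look like in an ~/.Rprofile (the directory is an assumption; it needs to exist before .libPaths() will use it):

```r
# Use a personal package library: site-wide library trees are dropped,
# leaving only this directory and R's base library on the search path
dir.create("~/.R/library", recursive = TRUE, showWarnings = FALSE)
.libPaths("~/.R/library")
```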
This way, R will ignore the system-provided packages and install all required
packages in the above directory, where you have full write access.
Custom options
Here are some tweaks for converting strings to factors, selecting a default
CRAN mirror, and warning about partial matching of subsetting and function
arguments.
There is some controversy about how R is supposed to handle strings. Convert
characters to factors by default or not? I’ve been a strong proponent of
disabling the conversion, for which I’ve been accused of not understanding how
factors work. True, code becomes less portable by
setting a non-default option, but factors cause enough problems that I don’t
want to deal with them.
Another tweak is to not display the graphical menu to select a mirror when
downloading from CRAN (this just delays what I
actually wanted to do) and select a default mirror.
A third is to display a warning when I either pass a named function
argument that does not exactly match what the function expects (this helps
avoid mixing up arguments and getting confusing errors later), or when the $
operator only partially matches a name when subsetting a list or data.frame.
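Put together, a sketch of the corresponding entries in an ~/.Rprofile (the mirror is just an example; since R 4.0 strings are no longer converted to factors by default anyway):

```r
options(
    stringsAsFactors = FALSE,      # do not convert strings to factors (pre-R 4.0)
    menu.graphics = FALSE,         # no graphical menu for choosing a CRAN mirror
    repos = c(CRAN = "https://cloud.r-project.org"),  # default mirror
    warnPartialMatchArgs   = TRUE, # warn on partially matched function arguments
    warnPartialMatchDollar = TRUE, # warn on partial matches with $
    warnPartialMatchAttr   = TRUE  # warn on partial matches of attribute names
)
```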
Traceback on non-interactive scripts
When scripts run non-interactively and fail, it is often challenging to
find the source of the error or the function call in which R failed, because by
default only the error message is reported without additional context.
Using lines like the ones below, you can get a traceback when a script fails,
without impacting your interactive sessions.
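A sketch of such a setting, again for an ~/.Rprofile: it only takes effect when R runs non-interactively, prints the call stack, and then quits:

```r
if (!interactive())
    options(error = function() {
        traceback(2)                   # print the call stack, skipping the handler
        quit(save = "no", status = 1)  # the essential guard: stop the script here
    })
```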
The guard to exit R after printing the traceback is essential, because
otherwise it will happily try to execute the rest of the script in an
undefined state.
Each year the EMBL PhD students in their 2nd year visit
the EBI for a week-long course in bioinformatics.
For this, I gave a talk on approaches and pitfalls in data analysis using R. The
goal was to show some common mistakes that cost me and others a lot of time
when we started out using this language, so the participants could avoid them.
I decided to focus on three parts: first simple indexing operations, then
loops vs. apply (and vectorization), and finally showing off dplyr.
Warm-up exercise using indexing
This was a warm-up exercise, because all the participants had already taken an
introductory R course the year before. We started by making a data.frame and two
vectors to index.
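The exact objects from the course are not shown here; an illustrative set (names and values are assumptions) that the snippets below reuse:

```r
df  <- data.frame(x = 1:4, y = letters[1:4])        # a small data.frame
vec <- setNames(1:4, c("d", "b", "2", "a"))         # a named vector to index
fac <- factor("d", levels = c("a", "b", "2", "d"))  # "d" is the fourth level
```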
The first bit was a reminder that a data.frame can be indexed both as a matrix
and as a list.
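For example, with the assumed df from above:

```r
df[1, "y"]   # matrix-style: row 1, column "y"
df[["y"]]    # list-style: the whole column as a vector
df$y         # the same column, via the $ shorthand
```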
The second bit was to show that while the integer 2 and the character "2"
are considered equal, indexing with them does not produce the same results.
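With the assumed named vector from above:

```r
2 == "2"   # TRUE: the number is converted to a character before comparing
vec[2]     # positional indexing: the second element (the one named "b")
vec["2"]   # name-based indexing: the element whose name is "2" (the third)
```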
And finally, the problem of indexing with a factor. Because each level is
represented by an integer value internally, R falls back to using this (often
invisible) value for indexing instead of the character representation of the
level. Here, the counter-intuitive example is that the last line below returns
the number 4 even though the factor level is "d".
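A sketch with the assumed objects from above:

```r
vec["d"]         # indexing by name: the first element
as.integer(fac)  # the level "d" is stored internally as 4 ...
vec[fac]         # ... so this returns element number 4, with value 4
```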
Loops, apply, and vectorization
First, the task was to write a function that takes a number n and, for each
number from 1 to n, multiplies it by 2 and adds 5. This is a very simple toy
example, but it illustrates some common coding problems.
One of the solutions was the one below. This is not optimal, but I think
attendees are more receptive to the reasons why if they chose to implement
the problem in a similar manner themselves (as it hints they might make these
mistakes elsewhere).
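A sketch of that kind of solution (the attendees' exact code is not reproduced; the function name is an assumption):

```r
f_grow <- function(n) {
    result <- c()
    for (i in 1:n)
        result <- c(result, i * 2 + 5)  # grows (and copies) the vector each time
    result
}
```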
However, there is a better way to use loops. If we reserve memory in advance,
computation is a lot faster because R does not need to reserve new memory with
each iteration.
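A sketch of a pre-allocating version (again, the function name is an assumption):

```r
f_prealloc <- function(n) {
    result <- rep(NA_real_, n)   # reserve the memory once, as numeric
    for (i in seq_len(n))        # empty sequence when n is 0
        result[i] <- i * 2 + 5
    result
}
```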
Another important part here is to use seq_along() and seq_len() in order to
get an empty sequence if the length is 0, instead of c(1, 0) as 1:n would give.
We also use NA_real_ instead of a plain NA so the vector does not have to be
re-allocated from logical to numeric (H/T @HenrikBengtsson).
Is sapply() faster than the loop? It turns out it is not, because it uses an
R loop internally (as opposed to a C loop). But of course there are good
reasons to use apply instead of loops: they are more expressive and emphasize
the object over the action (of looping).
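For comparison, a sketch of the sapply() version:

```r
f_sapply <- function(n)
    sapply(seq_len(n), function(i) i * 2 + 5)
```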
However, the real answer to this is of course vectorization:
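A sketch of the vectorized version, which operates on the whole vector at once:

```r
f_vectorized <- function(n)
    seq_len(n) * 2 + 5
```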
This is by far the fastest, easiest to read, and easiest to debug. If you can
solve a problem using vectorized code, you should do it.
The only problem with this example is that the vectorized solution is a bit too
obvious. I should swap this example for something where you actively have to
think about how to vectorize, for instance by introducing multiple matrices and
then combining them.
Handling data.frames the classic way vs. dplyr
The task here was to answer the following question on the nycflights13
dataset:
For flights between 4 and 12 hours, what are the 10 destination airports with
highest average arrival delay?
So, in other words: filter for the field hour to be between 4 and 12, then
for each dest airport calculate the mean of the field arr_delay. In base R,
a possible solution is below:
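A sketch of such a base-R solution (the intermediate variable names are assumptions):

```r
library(nycflights13)

subs  <- flights[!is.na(flights$hour) & flights$hour >= 4 & flights$hour <= 12, ]
delay <- aggregate(subs$arr_delay, by = list(dest = subs$dest),
                   FUN = mean, na.rm = TRUE)
head(delay[order(delay$x, decreasing = TRUE), ], 10)  # top 10 destinations
```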
Even though the question posed is simple compared to actual research questions,
the code needed to answer it is already hard to interpret when you look at it
and try to work out what the question was.
There is a more structured alternative: dplyr.
This package defines the following verbs to select and transform data:
select() - filter for columns
filter() - filter for rows
group_by() - handle each group of rows separately
summarize() - summarize each group
arrange() - order by field
By pure coincidence, the
package vignette
already uses the nycflights13 dataset and the verbs above are already in the
right order to use on the task (except select, which is not needed). This,
combined with the %>% operator, gives the following solution:
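A sketch of that pipeline (the column names come from the nycflights13 dataset):

```r
library(dplyr)
library(nycflights13)

flights %>%
    filter(hour >= 4, hour <= 12) %>%
    group_by(dest) %>%
    summarize(delay = mean(arr_delay, na.rm = TRUE)) %>%
    arrange(desc(delay)) %>%
    head(10)
```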
Much better.
Notice how similar the code sounds to the verbal instructions I gave above.
One thing to note is that if anything goes wrong, the error messages produced
by a pipe are not particularly helpful. To address this,
@gaborcsardi wrote the R package
tamper.