This package will allow you to send function calls as jobs on a computing cluster with a minimal interface provided by the `Q` function.
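A minimal call can be sketched as follows (the snippet is guarded so it also runs where `clustermq` is not installed; with a configured scheduler, `Q` returns `list(2, 4, 6)` here):

```r
# a simple function to run on the workers
fx = function(x) x * 2

# queue the function call as cluster jobs via the scheduler
if (requireNamespace("clustermq", quietly = TRUE)) {
    clustermq::Q(fx, x = 1:3, n_jobs = 1)
}
```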
Computations are done entirely over the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. All calculations are load-balanced, i.e. workers that finish their jobs faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or if one worker runs on a machine with a high load.
Browse the vignettes here:
First, we need the ZeroMQ system library. This is probably already installed on your system. If not, your package manager will provide it:
```sh
# You can skip this step on Windows and macOS, the package binary has it
# On a computing cluster, we recommend using Conda or Linuxbrew
brew install zeromq              # Linuxbrew, Homebrew on macOS
conda install zeromq             # Conda, Miniconda
sudo apt-get install libzmq3-dev # Ubuntu
sudo yum install zeromq-devel    # Fedora
pacman -S zeromq                 # Arch Linux
```
Then install the `clustermq` package in R from CRAN:
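From an R session:

```r
install.packages('clustermq')
```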
Alternatively you can use the `remotes` package to install directly from GitHub:

```r
# install.packages('remotes')
remotes::install_github('mschubert/clustermq')
# remotes::install_github('mschubert/clustermq', ref="develop") # dev version
```
An HPC cluster’s scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what `clustermq` interfaces with to perform computations.
We currently support the following schedulers (either locally or via SSH):

* LSF
* SGE
* SLURM
* PBS
* Torque
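The scheduler to use is set via an R option, typically in your `~/.Rprofile` on the head node. A sketch assuming a SLURM cluster (the template path is a placeholder):

```r
# assumption: a SLURM cluster; use "sge", "lsf", "pbs" or "torque" accordingly
options(clustermq.scheduler = "slurm")

# optional: point to a custom submission template
# options(clustermq.template = "/path/to/file.tmpl")
```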
The most common arguments for `Q` are:

* `fun` - The function to call. This needs to be self-sufficient (because it will not have access to the `master` environment)
* `...` - All iterated arguments passed to the function. If there is more than one, all of them need to be named
* `const` - A named list of non-iterated arguments passed to `fun`
* `export` - A named list of objects to export to the worker environment
The documentation for other arguments can be accessed by typing `?Q`. Examples of using `const` and `export` would be:
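These can be sketched as follows (guarded so the snippet also runs where `clustermq` is not installed; with a configured scheduler, both calls return `list(12, 14, 16)`):

```r
# function taking an iterated argument x and a second argument y
fx = function(x, y) x * 2 + y

if (requireNamespace("clustermq", quietly = TRUE)) {
    # y is passed to every call as a constant argument
    clustermq::Q(fx, x = 1:3, const = list(y = 10), n_jobs = 1)

    # alternatively, export y into the worker environment
    clustermq::Q(function(x) x * 2 + y, x = 1:3, export = list(y = 10), n_jobs = 1)
}
```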
`clustermq` can also be used as a parallel `foreach` backend:

```r
library(foreach)
register_dopar_cmq(n_jobs=2, memory=1024) # see `?workers` for arguments
foreach(i=1:3) %dopar% sqrt(i) # this will be executed as jobs
```
More examples are available in the user guide.
There are some packages that provide high-level parallelization of R function calls on a computing cluster. We compared `clustermq` with `batchtools` for processing many short-running jobs, and found it to have approximately 1000x less overhead cost.
In short, use `clustermq` if you want:

* a load-balanced, scheduler-agnostic way to run many function calls in parallel
* fast processing without temporary files on network-mounted storage

Use `batchtools` if you:

* want a mature and well-tested package
* don't mind that arguments for every call are written to and read from disc
We use GitHub’s Issue Tracker to coordinate development of `clustermq`. Contributions are welcome and they come in many different forms, shapes, and sizes. These include, but are not limited to:
If you are looking for a place to start, issues suitable for new contributors are marked with the `good first issue` tag. Please discuss anything more complicated before putting a lot of work in; I’m happy to help you get started.
This project is part of my academic work, for which I will be evaluated on citations. If you would like me to be able to continue working on research support tools like `clustermq`, please cite the article when using it for publications:
M Schubert. clustermq enables efficient parallelisation of genomic analyses. Bioinformatics (2019). doi:10.1093/bioinformatics/btz284