From tasks to workflows
So far, we have parallelized computations across either CPU cores or Slurm tasks. We could do this because the computations we wanted to parallelize were largely homogeneous.
Real-world projects, however, involve many different processing steps, and only some of them lend themselves to arbitrary parallelization.
In this section, we will treat the breakpoint example as a workflow. The idea is to split the computation of the breakpoints (compute_breakpt.r) from plotting each individual resulting model (plot.r) and from generating a summary report (report.r).
We will try to run the breakpoint computation and plotting in parallel, and finish the project by producing a simple report of the data. We will also try to run all computations in a job on Sulis.
Shell scripting
In a job script with multiple cores, we can run multiple srun commands in parallel by putting an & sign at the end of a command (meaning the script will not wait until the command is done, but continue immediately) and using the wait command to pause until all parallel commands have finished.
This will look like the following:
#!/bin/sh
#SBATCH commands go here
RDS="bp1.rds bp2.rds bp3.rds ..."
for FILE in $RDS; do
    ( srun Rscript compute_breakpt.r $FILE && Rscript plot.r $FILE ) &
done
wait
srun Rscript report.r
Here, we additionally use && to chain commands: the second one runs only if the first completed successfully (because plotting the data only makes sense if we generated the data in the first place).
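The interplay of &, wait, and && can be tried out in plain shell, without Slurm. The following minimal sketch (with made-up echo messages) mirrors the structure of the job script above:

```shell
# Minimal sketch of the & / wait / && pattern (no Slurm needed):
# the parenthesized pipeline runs in the background, the main script
# continues immediately, and wait blocks until all background jobs finish.
( sleep 0.2 && echo "chained command: runs after the first one succeeded" ) &
echo "main script: continues immediately"
wait
echo "main script: all background work done"
```

Because of the sleep, the background message appears after the first foreground message but before the final one, showing that wait really does block.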
GNU parallel
GNU parallel is able to distribute a number of script calls in parallel, across available CPU cores or, in combination with srun, across available Slurm tasks.
It is extensively documented at https://sulis-hpc.github.io/advanced/ensemble/gnuparallel.html.
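As a sketch of how the shell loop from the previous section could be condensed into a single call (assuming GNU parallel is available on the cluster; file and script names follow the example above, and the srun options may need adjusting for your allocation):

```shell
#!/bin/sh
#SBATCH commands go here

# Run one compute-and-plot pipeline per input file; GNU parallel keeps
# at most $SLURM_NTASKS pipelines running at any one time.
parallel -j "$SLURM_NTASKS" \
    "srun --ntasks=1 Rscript compute_breakpt.r {} && Rscript plot.r {}" \
    ::: bp1.rds bp2.rds bp3.rds
srun Rscript report.r
```

Compared to the &/wait version, parallel also takes care of throttling, so you can list more files than you have tasks.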
GNU make
GNU make is a tool commonly used to compile and assemble software executables; it schedules the execution of interdependent steps across a number of parallel jobs. In contrast to GNU parallel, it can combine different kinds of commands within a common framework.
An example Makefile for processing our breakpoint example would look something like this:
R = Rscript
ALL_IDX = $(shell seq 1 10)
ALL_BP = $(ALL_IDX:%=bp%.rds)

report.pdf: $(ALL_BP)
	$(R) report.r $@ $^

# create a breakpoint result file
bp%.rds: data.rds
	$(R) compute_breakpt.r $@ $^

# convert a breakpoint result file to a plot file
bp%.pdf: bp%.rds
	$(R) plot.r $@ $^
The placeholder variables $^ and $@ refer to the prerequisites (right of the : in a rule) and the target (left of the : in a rule), respectively. Make passes them as command-line arguments to the R scripts, where they can be queried using the commandArgs() function.
You can type make -n to see which commands GNU make would execute without actually running them.
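This expansion can be seen in a self-contained demo (the Makefile and file names below are made up for illustration): we generate a one-rule Makefile and dry-run it with make -n.

```shell
# Write a one-rule Makefile (recipe lines must start with a tab) and use
# `make -n` to print the expanded command without executing it.
printf 'bp%%.rds: data.rds\n\tRscript compute_breakpt.r $@ $^\n' > Makefile.demo
touch data.rds
make -n -f Makefile.demo bp3.rds
# prints: Rscript compute_breakpt.r bp3.rds data.rds
```

For the requested target bp3.rds, $@ expands to bp3.rds and $^ to the prerequisite data.rds.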
snakemake
Snakemake is a Python package and command-line tool that lets you write rules to chain multiple computations together and run them in parallel. A typical Snakefile may look something like this:
rule report:
    input:
        rscript = "report.r",
        infiles = expand("bp{index}.rds", index=range(10))
    output:
        report = "report.pdf"
    shell:
        "Rscript {input.rscript} {output.report} {input.infiles}"

rule compute:
    input:
        rscript = "compute_breakpt.r",
        datafile = "data.rds"
    output:
        outfile = "bp{index}.rds"
    shell:
        "Rscript {input.rscript} {input.datafile} {output.outfile}"

rule plot:
    input:
        rscript = "plot.r",
        infile = "bp{index}.rds"
    output:
        plotfile = "bp{index}.pdf"
    shell:
        "Rscript {input.rscript} {input.infile} {output.plotfile}"
At first glance, Snakemake looks more complicated than GNU make, but it has some features that the latter does not. For instance, it supports multiple wildcards per file name, a feature that can be very useful in more complicated workflows.
You can type snakemake -np to see which commands Snakemake would execute without actually running them.
Note that if you want to use Snakemake regularly, there is a Slurm profile available, which is not part of the exercises below.
targets
The R package targets provides a workflow engine in pure R. It is well documented at https://books.ropensci.org/targets/.