1. SLURM Job Manager, smanage
Slurm Manage, for submitting and reporting on job arrays run on slurm
Today I want to introduce you to smanage.sh, a script created by @esurface from Harvard Research Computing. smanage manages jobs running on a SLURM compute cluster. It was developed in bash to take advantage of the output from the SLURM command line programs, namely sacct and sbatch.
After you launch a bunch of jobs, it's hard to keep track of them. How many completed? How many crashed and failed with out-of-memory errors? In a nutshell, smanage enables you to submit and track large batches of jobs beyond the MaxArraySize limit set by SLURM.
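If you are curious what that limit is on your cluster, you can ask SLURM directly; the value is site-specific:

scontrol show config | grep -i MaxArraySize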
Wait, but what is a job array anyway? If you are used to submitting jobs on a SLURM cluster, you are probably used to the standard sbatch command:
$ sbatch myanalysis.job
If you are like me, you've probably written some kind of Python/R/Bash or other script that loops through some set of variables and programmatically generates and/or submits job files. Here is an example of some of the nonsense that I (contributor @vsoch) went through in graduate school. If only I had known about job arrays!
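For flavor, the loop-and-submit pattern usually looks something like this - a rough sketch, with made-up script names and variables, not the actual code from graduate school:

#!/bin/bash
# A sketch of the manual approach: generate one job file per subject, then submit it
# (script names and variables are invented for illustration)
for subject in sub-01 sub-02 sub-03; do
    cat > "job_${subject}.sbatch" << EOF
#!/bin/bash
#SBATCH -J ${subject}
#SBATCH -t 0-1:00
python analysis.py --subject ${subject}
EOF
    sbatch "job_${subject}.sbatch"
done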
A job array lets you submit a ton of similar jobs using a single template script.
Actually, it’s just another SBATCH header. It looks like this:
# A job array with index values of 1, 2, 5, 19, 27:
#SBATCH --array=1,2,5,19,27
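The --array option also accepts ranges, step sizes, and a throttle on how many tasks run at once. The index values below are arbitrary examples:

# A simple range of indices
#SBATCH --array=1-100
# A range with a step size of 2 (1, 3, 5, ...)
#SBATCH --array=1-99:2
# A range, but never more than 10 tasks running at the same time
#SBATCH --array=1-100%10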
How would we use this? Let’s start with a simple example, and say that we have 100 text files to process. We have them in a folder, and they are labeled cookie1.txt through cookie100.txt. We could use arrays to process these files without any extraneous for loops:
#!/bin/bash
#SBATCH -J cookies-job # A single job name for the array
#SBATCH -n 1 # One Core
#SBATCH -N 1 # All cores on one machine
#SBATCH -p owners # Partition name
#SBATCH --mem 2000 # Memory (2GB)
#SBATCH -t 0-1:00 # Maximum execution time (D-HH:MM)
#SBATCH -o cookie_%A_%a.out # Standard output
#SBATCH -e cookie_%A_%a.err # Standard error
#SBATCH --array=1-100 # maps 1 to 100 to SLURM_ARRAY_TASK_ID below
/bin/bash "${SCRATCH}/cookies/cookie${SLURM_ARRAY_TASK_ID}.txt"
We would then submit that one job file, and 100 jobs would be run to process our cookie text files!
$ sbatch cookie-job.sbatch
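If you want to try this example yourself, you would first need the 100 dummy cookie files; a quick loop can create them (this assumes $SCRATCH is defined on your cluster, as it is on many systems):

mkdir -p "${SCRATCH}/cookies"
for i in $(seq 1 100); do
    echo "cookie number ${i}" > "${SCRATCH}/cookies/cookie${i}.txt"
done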
That’s the essence of a job array. It’s actually exactly as it sounds: an array of jobs.
Once you launch your jobs, you lose them to some extent because they are all individual jobs. There are technically command line ways to interact with and control them, but that’s yet another hard-to-learn thing, and wouldn’t it be nice if there was a tool to manage arrays for us?
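For reference, those command line ways look roughly like this; the job ID is a placeholder for whatever sbatch printed when you submitted:

# Cancel the entire job array
scancel 35386489
# Cancel only task 5 of the array
scancel 35386489_5
# Show accounting information for every task in the array
sacct -j 35386489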
This is the goal of smanage. Now that you understand the basics, let’s briefly look at usage. For more detailed examples, see the smanage repository. If you want to learn more about job arrays, check out this AskCI post.
You can install the script by downloading it and adding an alias for it. First, wget it:
wget https://raw.githubusercontent.com/esurface/smanage/master/smanage.sh
chmod u+x smanage.sh # makes it executable
Run the following line of code, or copy it into your ~/.bashrc file to make it permanent:
alias smanage='<pathto>/smanage.sh'
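For example, assuming you downloaded smanage.sh to your home directory, this would add the alias to your ~/.bashrc and load it into your current shell:

echo "alias smanage='$HOME/smanage.sh'" >> ~/.bashrc
source ~/.bashrc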
smanage has two basic modes, reporting and submitting; we focus on reporting below.
To really see how this works, we need to write a dummy job. Let’s write one that will simply sleep for a bit and exit. Here is our sbatch job file, called sleepy.sbatch:
#!/bin/bash
#SBATCH -J sleepy-job
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -p owners
#SBATCH --mem 1000
#SBATCH -t 0-0:03
#SBATCH --array=1-100 # maps 1 to 100 to SLURM_ARRAY_TASK_ID below
echo "This is job number ${SLURM_ARRAY_TASK_ID}"
sleep 120
We will launch this job like this:
sbatch sleepy.sbatch
and then use it to learn about the different commands for smanage below.
The reporting mode gives you a summary of your jobs. Here is what we see after launching sleepy.sbatch:
[vsochat@sh-ln06 login /scratch/users/vsochat/sleepy]$ smanage report
Finding jobs using: /usr/bin/sacct -XP --noheader
Found jobs: 35512888
0 COMPLETED jobs
0 FAILED jobs
0 TIMEOUT jobs
98 RUNNING jobs
0 PENDING jobs
2 jobs with untracked status
We could have also used old school squeue:
[vsochat@sh-ln06 login /scratch/users/vsochat/sleepy]$ squeue -u vsochat
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
35386489_[1-100] owners sleepy-j vsochat PD 0:00 1 (None)
How does the first command work? It parses and summarizes the output from sacct. Here is what the output looks like when I’ve launched a TON of jobs - note that we truncate the job numbers so it’s readable.
$ smanage report
Finding jobs using: /usr/bin/sacct -XP --noheader
Found 1016 jobs: 36094475,36123253,36160821,36160826,36160831,36160945,36160947,36160950,36161025 (and more)
569 COMPLETED jobs
0 FAILED jobs
15 RUNNING jobs
1 PENDING jobs
Now we can try the command and look specifically for our “sleepy-job” by name. Let’s submit the job again, so we get two sets of sleepy jobs.
$ sbatch sleepy.sbatch
$ smanage report --sacct --name=sleepy-job
Finding jobs using: /usr/bin/sacct -XP --noheader --name=sleepy-job
Found jobs: 35386489,35386615
100 COMPLETED jobs
0 FAILED jobs
0 TIMEOUT jobs
0 RUNNING jobs
51 PENDING jobs
Now we see that the first 100 completed, and the second batch has started up! Under the hood, this runs sacct with --name=sleepy-job. If you are familiar with sacct, any other sacct options can be added to this command. For example, to report the jobs that ran on a specific date, we might do this:
$ smanage report --sacct --name=sleepy-job --starttime=2019-01-07
You can also add the --verbose flag to print more detailed information about the jobs:
$ smanage --verbose report --sacct --name=sleepy-job --state=COMPLETED
The examples above outline the simple reporting feature of smanage, which is likely what you will find most useful. For advanced users, know that smanage also handles reporting errors and customized job submission. For these advanced examples, or to ask a question or get help with the tool, see the smanage repository.
Do you have questions or want to see another tutorial? Please reach out!
This series guides you through getting started with HPC cluster computing.
Slurm Manage, for submitting and reporting on job arrays run on slurm
A Quick Start to using Singularity on the Sherlock Cluster
Use the Containershare templates and containers on the Stanford Clusters
A custom built pytorch and Singularity image on the Sherlock cluster
Use Jupyter Notebooks via Singularity Containers on Sherlock with Port Forwarding
Use R via a Jupyter Notebook on Sherlock
Use Jupyter Notebooks (optionally with GPU) on Sherlock with Port Forwarding
Use Jupyter Notebooks on Sherlock with Port Forwarding
A native and container-based approach to using Keras for Machine learning with R
How to create and extract rar archives with Python and containers
Getting started with SLURM
Getting started with the Sherlock Cluster at Stanford University
Using Kerberos to authenticate to a set of resources