SLURM Job Manager, smanage

Today I want to introduce you to a script, smanage.sh created by @esurface from Harvard Research Computing. smanage manages jobs running on a slurm compute cluster. It was developed in bash to take advantage of the automatic output from the slurm programs available on the command line, namely sacct and sbatch.

What do I need this for?

After you launch a bunch of jobs, it’s hard to keep track of them. How many completed? How many crashed and failed in a horrific ending of memory exceed exceptions? In a nutshell, smanage enables you to submit and track large batches of jobs beyond the MaxArraySize limit set by slurm.

What is a job array?

Wait, but what is a job array anyway? If you are used to submitting jobs on a SLURM cluster, you are probably used to the standard sbatch command:

$ sbatch myanalysis.job

If you are like me, you’ve probably written some kind of Python/R/Bash or other script that loops through some set of variables and programmatically generates and/or submits job files. Here is an example. of some of the nonsense that I (contributor @vsoch) went through in graduate school. If only I had known about job arrays!

a job array lets you submit a ton of similar jobs using a template script.

Actually, it’s just another SBATCH header. It looks like this:

#A job array with index values of 1, 2, 5, 19, 27:
#SBATCH --array=1,2,5,19,27

How would we use this? Let’s start with a simple example, and say that we have 100 text files to process. We have them in a folder, and they are labeled cookie1.txt through cookie100.txt. We could use arrays to process these files without any extraneous for loops:

#!/bin/bash
#SBATCH -J cookies-job # A single job name for the array
#SBATCH -n 1 # One Core
#SBATCH -N 1 # All cores on one machine
#SBATCH -p owners # Partition name
#SBATCH --mem 2000 # Memory (2Gb)
#SBATCH -t 0-1:00 # Maximum execution time (D-HH:MM)
#SBATCH -o cookie_%A_%a.out # Standard output
#SBATCH -e cookie_%A_%a.err # Standard error
#SBATCH --array=1-100   # maps 1 to 100 to SLURM_ARRAY_TASK_ID below

/bin/bash "${SCRATCH}/cookies/cookie${SLURM_ARRAY_TASK_ID}".txt

We would then submit that one job file, and 100 jobs would be run to process our cookie text files!

$ sbatch cookie-job.sbatch

That’s the essense of a job array. It’s actually exactly as it sounds - an array of jobs.

Why a tool like smanage?

Once you launch your jobs, you lose them to some extent because they are all individual. jobs. There are technically command line ways to interact and control them, but it’s yet another hard-to-learn thing and (wouldn’t it be nice) if there was a tool to manage arrays for us?

This is the goal of smanage. Now that you understand, let’s briefly look at usage. For more detailed examples, see the smanage repository. If you want to learn more about Job arrays, check out this AskCI post.

Usage

You can install the script by adding an alias to the program. First, wget it.


wget https://raw.githubusercontent.com/esurface/smanage/master/smanage.sh
chmod u+x smanage.sh # makes it executable

Run the following line of code or copy it into the file ‘~/.bashrc’ to make it permanent:

alias smanage='<pathto>/smanage.sh'

smanage has two basic modes described below.

Step 1. Write a Dummy Job

To really see how this works, we need to write a dummy job. Let’s write one that will simply sleep for a bit, and exit. Here is our sbatch job file, called sleepy.sbatch.

#!/bin/bash
#SBATCH -J sleepy-job
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -p owners
#SBATCH --mem 1000
#SBATCH -t 0-0:03
#SBATCH --array=1-100   # maps 1 to 100 to SLURM_ARRAY_TASK_ID below

echo "This is job number ${SLURM_ARRAY_TASK_ID}"
sleep 120

We will launch this job like this:

sbatch sleepy.sbatch

and then use it to learn about the different commands for smanage below.

Report Mode (default)

The reporting mode gives you a summary of your jobs. Here is what we see after launching sleepy.sbatch

[vsochat@sh-ln06 login /scratch/users/vsochat/sleepy]$ smanage report
Finding jobs using: /usr/bin/sacct -XP --noheader
Found jobs: 35512888
0 COMPLETED jobs
0 FAILED jobs
0 TIMEOUT jobs
98 RUNNING jobs
0 PENDING jobs

2 jobs with untracked status

We could have also used old school squeue:

[vsochat@sh-ln06 login /scratch/users/vsochat/sleepy]$ squeue -u vsochat
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  35386489_[1-100]    owners sleepy-j  vsochat PD       0:00      1 (None)

How does the first commad work? It parses and summarized the output from sacct. Here is what the output looks like when I’ve launched a TON of jobs - note that we truncate the job numbers so it’s readable.

$ smanage report
Finding jobs using: /usr/bin/sacct -XP --noheader
Found 1016 jobs: 36094475,36123253,36160821,36160826,36160831,36160945,36160947,36160950,36161025 (and more)
569 COMPLETED jobs
0 FAILED jobs
15 RUNNING jobs
1 PENDING jobs

Now we can try the command and look specifically for our “sleepy-job” by name. Let’s run the command again, so we get two sets of sleepy jobs.

$ sbatch sleepy.job
$ smanage report --sacct --name=sleepy-job
Finding jobs using: /usr/bin/sacct -XP --noheader --name=sleepy-job
Found jobs: 35386489,35386615
100 COMPLETED jobs
0 FAILED jobs
0 TIMEOUT jobs
0 RUNNING jobs
51 PENDING jobs

Now we see the first 100 completed, and the second batch are started up! Under the hood, this is using ` ‘sacct –array’ named “sleepy-job”`. If you are familiar with sacct, any sacct commands can be added to this command. For example, to report the jobs ran on a specific date, we might do this:

$ smanage report --sacct --name=sleepy-job --starttime=2019-01-07

You can also add the ‘–verbose’ flag to add more useful information about the jobs.

$ smanage --verbose report --sacct --name=sleepy-job --verbose --state=COMPLETED

Reporting Errors and Submitting Jobs

The examples above outline the simple reporting feature of smanage, which is likely what you will find most useful. For advanced users, know that smanage also handles reporting errors and customized job submission. For these advanced examples, or to ask a question or get help with the tool, see the smanage repository.

Do you have questions or want to see another tutorial? Please reach out!