Today I had about 10,000 scripts I wanted to run on a SLURM cluster, and there is an upper limit on the number of jobs I’m allowed to run at once. I’m too lazy to wait and monitor the jobs, so I prepared the sbatch commands in a large file that looks something like this:
#!/bin/bash
sbatch /scratch/users/vsochat/zenodo-ml/slurm/jobs/run_806345.sh
sbatch /scratch/users/vsochat/zenodo-ml/slurm/jobs/run_1002155.sh
sbatch /scratch/users/vsochat/zenodo-ml/slurm/jobs/run_1245189.sh
sbatch /scratch/users/vsochat/zenodo-ml/slurm/jobs/run_835590.sh
...
Normally the number of jobs is within the limit, so I can just run this bash script and I’m fine:
$ /bin/bash run_jobs.sh
But today I was way over the limit. Had I done this, I would have just hit the limit and found a bunch of error messages when I returned from my afternoon dinosaur frolicking. Instead of rewriting the script to generate the commands on demand as space opens up (I’ve done this before), I decided to write a script to read in the commands, check the count in the queue, and submit jobs when there is an opening.
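Here is a minimal sketch of that read, check, and submit loop. It is illustrative only, not the original script. It assumes the commands sit one per line in a file like run_jobs.sh, and that `squeue -h -u "$USER"` prints one line per queued or running job. The names `count_jobs` and `submit_throttled`, the limit of 1000, and the 60 second polling interval are placeholders to adjust for your cluster.

```bash
#!/bin/bash
# Sketch: submit the commands listed in a file, keeping the queue below MAX_JOBS.
# MAX_JOBS and POLL_SECONDS are placeholder values, not cluster defaults.
MAX_JOBS="${MAX_JOBS:-1000}"
POLL_SECONDS="${POLL_SECONDS:-60}"

# Count this user's queued and running jobs (one squeue line per job, no header).
count_jobs() {
    squeue -h -u "$USER" | wc -l
}

# Read commands one per line and run each only when the queue has room.
submit_throttled() {
    local commands_file="$1" cmd
    while IFS= read -r cmd; do
        # Skip blank lines and comment lines (including the shebang).
        case "$cmd" in
            ''|'#'*) continue ;;
        esac
        # Wait until the number of jobs drops below the limit.
        while [ "$(count_jobs)" -ge "$MAX_JOBS" ]; do
            sleep "$POLL_SECONDS"
        done
        # Run the line as written, e.g. "sbatch /path/to/run_806345.sh".
        # stdin is redirected so the command cannot swallow the rest of the file.
        eval "$cmd" < /dev/null
    done < "$commands_file"
}

# Run only when a commands file is given.
if [ "$#" -gt 0 ]; then
    submit_throttled "$1"
fi
```

Saved as, say, spam_queue.sh (a made-up name), it could be left running with `MAX_JOBS=500 bash spam_queue.sh run_jobs.sh`. Polling with `sleep` is crude, but it keeps the original command file untouched.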
I can easily check which jobs still need to run because the output is organized by the identifier that is also captured in the script name. If I needed to run this again (and not redo runs), I would simply check for the existence of the output folder for that identifier, and skip the job if it’s found.
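That skip-if-done check could look something like the sketch below. The layout is an assumption: each job is taken to write to a folder named after its identifier under `OUTPUT_DIR`, which is a hypothetical path, and the helper names `job_id` and `pending_commands` are invented for illustration.

```bash
#!/bin/bash
# Sketch: keep only the commands whose output folder does not exist yet.
# OUTPUT_DIR is a hypothetical location; outputs are assumed to live in
# $OUTPUT_DIR/<identifier>.
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"

# Extract the identifier from a script path, e.g. ".../run_806345.sh" gives "806345".
job_id() {
    local name
    name="$(basename "$1" .sh)"
    echo "${name#run_}"
}

# Print the sbatch lines from a file whose output folder is missing.
pending_commands() {
    local commands_file="$1" cmd script
    while IFS= read -r cmd; do
        # Only consider sbatch lines; skip the shebang and anything else.
        case "$cmd" in
            sbatch\ *) ;;
            *) continue ;;
        esac
        # The last word on the line is the job script.
        script="${cmd##* }"
        if [ ! -d "$OUTPUT_DIR/$(job_id "$script")" ]; then
            echo "$cmd"
        fi
    done < "$commands_file"
}

# Run only when a commands file is given.
if [ "$#" -gt 0 ]; then
    pending_commands "$1"
fi
```

Saved under a made-up name such as pending.sh, running `bash pending.sh run_jobs.sh > remaining_jobs.sh` would write out only the jobs that still need to run.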
Yes, I’m spamming the queue, and I’m lazy. I’m really that terrible.
Sochat, Vanessa. "Spam the Queue." @vsoch (blog), 28 May 2018, https://vsoch.github.io/2018/spam-the-queue/ (accessed 20 Mar 23).