Getting started with the Sherlock Cluster at Stanford University
Practices for Reproducible Research
Small to medium scale jobs can be run on Sherlock, a local computing cluster at Stanford. This will serve as a "quick start" guide, as well as documentation for running jobs. To see the full documentation base, see here. You might also be interested in the related resources listed at the end of this post.
Sherlock is a large compute cluster at Stanford University. You can log in to Sherlock from a secure shell (SSH) and run scripts across many computational nodes. After you have requested an account, this quick start guide should get you familiar with working with Sherlock.
Sherlock uses Kerberos authentication, and you can read more and find instructions for setting it up by following the link. Basically, once you have this set up, you will be able to authenticate and ssh to Sherlock as follows:
$ kinit yourSUnet@stanford.edu
# see your authentications and check when your identification expires
$ klist
$ ssh -XY yourSUnet@sherlock.stanford.edu
If your login to Sherlock doesn't work, your Kerberos ticket has likely expired and you just need to re-initialize it:
$ kinit yourSUnet@stanford.edu
In a nutshell, Sherlock gives you access to a job manager to run jobs at scale (SLURM), the ability to run programs that require graphics or a display (with X11), data transfer options, and installation of custom software. A summary of these functions is given here, and more detailed tutorials are provided.
Okay, here is the organization of Sherlock in a nutshell. The places that you hit when you log in are the login nodes. They are literally a set of machines that exist just for this purpose. From the login nodes, you can submit jobs via the SLURM job manager. The submission is handled by a master node queue, which sends out jobs to the workers, which are other machines on the cluster.
+--------------+        +--------------+
|              |        |   Worker 2   |
|   Worker 1   |        |    (GPU)     |
|              |        |              |
+------^-------+        +------^-------+
       |                       |
       |      2. Run Job       |
       |                       |
+------+-----------------------+------+
|           Priority Queue            |
|   1. Job A   UserMark               |
|   2. Job B   UserFred               |
+------^------------------------------+
       |
       |   1. Submit Job (sbatch ...)
       |
+------+--------+       +---------------+
|               |       |               |
| Login Node 1  |       | Login Node 2  |
|               |       |               |
+---------------+       +---------------+
In the diagram above, you hit one of the two login nodes from your local machine. You then issue commands to interact with the SLURM priority queue (center node). It’s this master that then submits jobs to the various worker nodes on the cluster. Both the login nodes and the worker nodes have access to two important places - various “homes” that are backed up where you can store files and code, and “scratch” or working spaces for larger datasets.
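For example, once you are on a login node you can peek at the queue and at the worker partitions with standard SLURM commands (exactly what you see will depend on the cluster and your groups):
# show your jobs currently in the queue
$ squeue -u $USER

# show the partitions (groups of worker nodes) and their state
$ sinfo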
In your $HOME, you can edit your .bash_profile to add programs and other environment variables to your path, or have modules for software you use often loaded automatically (we will write another post on modules). You should be very careful about putting large content (whether data or programs) in your $HOME, because the space can fill up quickly. For quotas, technical specifications, and more detail on these systems, see this page. For more information on running jobs with SLURM, see this page.
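As a small sketch of the kind of thing you might add to your .bash_profile (the path and module name below are examples, not something that is set up for you):
# add software installed in your home to your PATH
export PATH=$HOME/software/bin:$PATH

# automatically load a module you use often (example module name)
module load python/3.6.1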
When you first log in, you are in your home folder. If you type ls, there will be nothing there. This is where you should create a nice organized file structure for your scripts, and any software that you install locally. If you type
$ echo $HOME
you will see the path with your SUnet id. At any point in time you can get your present working directory (pwd) if you type:
$ echo $PWD
# or
$ pwd
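Returning to the advice about keeping your home organized, a starting layout could be as simple as (the folder names are just a suggestion):
# a simple starting layout for scripts and locally installed software
$ mkdir -p $HOME/scripts $HOME/software/bin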
If you do:
$ echo $PI_HOME
you will see that you also have a shared group space under your PI’s name. The quota is slightly larger here, and it would be good for shared group files and software.
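One possible convention (entirely up to your group) is to keep shared installs and shared data in their own folders and make them group readable and writable:
# example folders for software and data shared with your lab
$ mkdir -p $PI_HOME/software $PI_HOME/data
$ chmod -R g+rwX $PI_HOME/software $PI_HOME/data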
You also have scratch space at $SCRATCH and $PI_SCRATCH, and the latter is shared with your group. $LOCAL_SCRATCH is used for storage of temporary files. These files are not backed up, and are deleted when a job finishes running.
How do I transfer files here?
In short, you can use secure copy (the scp command), Globus Online, a secure shell client, or our data transfer node. We have prepared a nice walkthrough of these options here.
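For example, with secure copy you can push a file from your local machine to your scratch space. The remote path below assumes the usual /scratch/users/<sunetid> layout; double check yours with echo $SCRATCH once you are logged in:
# run from your local machine: copy a dataset to your scratch space
$ scp mydata.tar.gz yourSUnet@sherlock.stanford.edu:/scratch/users/yourSUnet/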
Everything in $LOCAL_SCRATCH is deleted after the job is run. $SCRATCH is purged when necessary. If your lab has storage on $OAK, this is where you can reliably store data and it won't be purged.
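In practice that means copying anything you want to keep out of scratch before it disappears, for example at the end of a job script (this sketch assumes your lab has Oak storage and that $OAK is defined in your environment; the paths are placeholders):
# last lines of a job script: copy results off temporary storage before the job ends
cp -r $LOCAL_SCRATCH/results $OAK/$USER/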
How do I check disk quota?
$ lfs quota -u <sunetid> -h /scratch/ # or
$ lfs quota -g <pi_sunetid> -h /scratch/
How do I count files?
find . -type f | wc -l
How is the quota calculated?
Keep your space tidy! The data that are stored in your personal $SCRATCH or $HOME folders can also contribute to the quota in $PI_SCRATCH and $PI_HOME. For example:
- files owned by $USER count towards your user quota
- files owned by $USER:group count towards both quotas, user and group
If you belong to multiple groups, you can change the ownership of the files to any of the groups, and the quota attribution will change along with it.
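A short sketch of how to inspect and change group ownership (the directory name is a placeholder; use one of the groups printed by the groups command):
# see which groups you belong to
$ groups

# check which user and group own your files
$ ls -l $SCRATCH/mydata

# attribute a directory tree to a different group you belong to
$ chgrp -R other_group $SCRATCH/mydata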
How do I run jobs?
This is the most important thing you will do! There are two kinds of jobs: interactive jobs, where you get a worker node to use like it's your own computer, and batch jobs, where you submit a script for the SLURM scheduler to run.
See this guide for a detailed introduction and walk through.
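As a minimal sketch (the times, memory, and file names here are placeholders rather than Sherlock-specific values; see the guide above for the details), a batch job is a small script of #SBATCH directives handed to sbatch:
#!/bin/bash
# submit.sbatch -- a minimal batch job script
#SBATCH --job-name=hello
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --output=hello_%j.out

echo "Hello from $(hostname)"
Save that as submit.sbatch, then, from a login node:
$ sbatch submit.sbatch
$ squeue -u $USER
To instead get an interactive shell on a worker node:
$ srun --time=01:00:00 --pty bash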
Software comes by way of system installed software (modules) and software that you can bring (containers) or install in your user home. We will have a separate, detailed page on Software.
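For example, system software is usually found and loaded with the module command, and you can bring your own software as a Singularity container (the module name and version below are only illustrative):
# see what software modules are available, then load one
$ module avail
$ module load python/3.6.1

# pull a container to run your own software stack
$ singularity pull docker://ubuntu:18.04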
You might also be interested in these related resources from the same series, which guides you through getting started with HPC cluster computing:
- smanage, a SLURM job manager for submitting and reporting on job arrays run on SLURM
- A Quick Start to using Singularity on the Sherlock Cluster
- Use the Containershare templates and containers on the Stanford Clusters
- A custom built PyTorch and Singularity image on the Sherlock cluster
- Use Jupyter Notebooks via Singularity Containers on Sherlock with Port Forwarding
- Use R via a Jupyter Notebook on Sherlock
- Use Jupyter Notebooks (optionally with GPU) on Sherlock with Port Forwarding
- Use Jupyter Notebooks on Sherlock with Port Forwarding
- A native and container-based approach to using Keras for Machine learning with R
- How to create and extract rar archives with Python and containers
- Getting started with SLURM
- Getting started with the Sherlock Cluster at Stanford University (this post)
- Using Kerberos to authenticate to a set of resources