Sherlock Cluster

Small to medium scale jobs can be run on Sherlock, a local computing cluster at Stanford. This will serve as a “quick start” guide, as well as documentation for running jobs. To see the full documentation base, see the official Sherlock documentation.

What is Sherlock?

Sherlock is a large compute cluster at Stanford University. This means that you can log in to Sherlock over a secure shell (SSH) and run scripts across many computational nodes. After you have requested an account, this quick start guide should get you familiar with working with Sherlock.

1. Connect to Sherlock

Sherlock uses Kerberos authentication, and you can read more and find instructions for setting it up by following the link. Basically, once you have this set up, you will be able to authenticate and SSH to Sherlock as follows:

$ kinit yourSUnet@stanford.edu

# see your authentications and check when your identification expires
$ klist 
$ ssh -XY yourSUnet@sherlock.stanford.edu

If your login to Sherlock stops working, your Kerberos ticket has likely expired and you just need to re-initialize it:

$ kinit yourSUnet@stanford.edu

2. What can you do?

In a nutshell, Sherlock gives you access to a job manager to run jobs at scale (SLURM), the ability to run programs that require graphics or a display (with X11), data transfer options, and installation of custom software. A summary of these functions is given here, and more detailed tutorials are provided.

3. Important Locations

Okay, here is the organization of Sherlock in a nutshell. The places that you hit when you log in are the login nodes. They are literally a set of machines that exist just for this purpose. From the login nodes, you can submit jobs via the SLURM job manager. The submission is handled by a master node queue, which sends out jobs to the workers, which are other machines on the cluster.

                +------------+
                |            |
                |  Worker 2  |
+------------+  |  (GPU)     |
|            |  |            |
|  Worker 1  |  +------------+
|            |
+----^-------+   +----------------------+
     |           | Priority Queue       |
     +-----------+ 1. Job A UserMark    |
     2. Run Job  | 2. Job B UserFred    |
                 |                      |
                 +-^--------------------+
                   |
                   |
                   |
                   |  1. Submit Job (sbatch...)
                   |
     +-------------+-+  +---------------+
     |               |  |               |
     |               |  |               |
     |  Login Node 1 |  |  Login Node 2 |
     |               |  |               |
     |               |  |               |
     +---------------+  +---------------+

In the diagram above, you hit one of the two login nodes from your local machine. You then issue commands to interact with the SLURM priority queue (center node). It’s this master that then submits jobs to the various worker nodes on the cluster. Both the login nodes and the worker nodes have access to two important places: various “homes” that are backed up, where you can store files and code, and “scratch” or working spaces for larger datasets.

  • $HOME is your home folder, the entry point at login. It’s a high-speed NFS filesystem (Isilon) and is backed up for you. You can edit the .bash_profile here to add programs and other environment variables to your path, or have modules for software you use often loaded automatically (we will write another post on modules; a small sketch follows this list). You should be very careful about putting large content (whether data or programs) in your $HOME, because the space can fill up quickly.
  • $SCRATCH is a larger storage space where you are encouraged to put data (and larger files). This could include cached container images (e.g., Singularity) or other intermediate data files.
  • $PI_SCRATCH is a scratch space allocated to a PI, which usually is best used for shared data in a lab.
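For example, here is a minimal sketch of lines you might add to your .bash_profile; the directory and the module name/version are placeholders, so check module avail for what actually exists on the cluster:

# add locally installed programs under $HOME/software/bin to your path
export PATH=$HOME/software/bin:$PATH

# auto-load a module you use often (name and version are placeholders)
module load python/3.6.1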

For quotas, technical specifications, and more detail on these systems, see this page. For more information on running jobs with SLURM, see this page.

Filesystems

Getting Comfy In Home

When you first log in, you are in your home folder. If you type ls, there will be nothing there. This is where you should create a nice organized file structure for your scripts, and any software that you install locally. If you type

$ echo $HOME

you will see the path with your SUnet id. At any point in time you can get your present working directory (pwd) if you type:

$ echo $PWD

# or
$ pwd

If you do:

$ echo $PI_HOME

you will see that you also have a shared group space under your PI’s name. The quota is slightly larger here, and it would be good for shared group files and software.
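As mentioned above, it’s worth setting up an organized structure in your home folder early on. A purely illustrative layout:

$ mkdir -p $HOME/scripts $HOME/software $HOME/projects
$ ls $HOME
projects  scripts  software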

Getting Comfy with Scratch

  • We have both $SCRATCH and $PI_SCRATCH, and the latter is shared with your group.
  • $LOCAL_SCRATCH is used for storage of temporary files. These files are not backed up, and are deleted when a job finishes running.
  • It’s recommended to create a good organizational hierarchy for your personal and group data before putting it there.
  • It’s also recommended to discuss a data storage and archival strategy, which will be best appreciated down the road when your lab amasses a lot of it.

How do I transfer files here? In short, you can use secure copy (the scp command), Globus Online, a secure shell client, or our data transfer node. We have prepared a nice walkthrough of these options here.
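As a minimal sketch, you can copy a single file over with scp from your local machine; the remote path below is illustrative (your scratch space typically lives under /scratch/users/<sunetid>):

# run this from your local machine, not from Sherlock
$ scp mydata.tar.gz yourSUnet@sherlock.stanford.edu:/scratch/users/yourSUnet/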

Quotas

Everything in $LOCAL_SCRATCH is deleted after the job is run. $SCRATCH is purged when necessary. If your lab has storage on $OAK, this is where you can reliably store data and it won’t be purged.

How do I check my disk quota?

$ lfs quota -u <sunetid> -h /scratch/ # or
$ lfs quota -g <pi_sunetid> -h /scratch/

How do I count files?

$ find . -type f | wc -l

How is the quota calculated?

Keep your space tidy! The data that are stored in your personal $SCRATCH or $HOME folders can also contribute to the quota in $PI_SCRATCH and $PI_HOME. For example:

  • Files that belong to you as a user ($USER) count towards your user quota
  • Files that belong to your group count towards your group quota
  • Files that belong to $USER:group count towards both quotas, user and group

If you belong to multiple groups, you can change the ownership of the files to any of the groups, and the quota attribution will change along with it.
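For example, here is a quick sketch of re-assigning a directory to another of your groups so that it counts against that group’s quota; the group name and path are placeholders:

# see which groups you belong to
$ groups

# recursively change group ownership (group name and path are placeholders)
$ chgrp -R other_group $SCRATCH/shared_data

# confirm the new group ownership
$ ls -l $SCRATCH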

Jobs

How do I run jobs? This is the most important thing that you will do! There are two kinds of jobs:

  • interactive jobs: this means jumping onto a worker node yourself and issuing commands interactively. If you want to test or otherwise run commands manually, you should ask for an interactive node. You should not run things on the login nodes, because it hinders performance for other users, and your processes will be killed if they take up too much memory.
  • batch jobs: this means handing a submission script to the SLURM batch scheduler with instructions for running one or more jobs. We provide detailed instructions in the link above, and quick command references below.

See this guide for a detailed introduction and walk through.
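As a quick reference, here is a minimal sketch of a batch submission script; the partition, module, and script names are placeholders, so check sinfo and module avail for what is available to you:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=normal
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --output=myjob.%j.out

# load any software the job needs (module name is a placeholder)
module load python/3.6.1

# run the actual work
python myscript.py

Save this as (for example) myjob.sbatch, then submit and monitor it from a login node:

$ sbatch myjob.sbatch
$ squeue -u $USER

For an interactive session on a worker node, you can request one with srun (or Sherlock’s sdev wrapper):

$ srun --pty bash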

Software

Software comes by way of system-installed software (modules) and software that you can bring (containers) or install in your user home. We will have a separate, detailed page on Software.
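As a quick taste of modules, here is a minimal sketch; the module name and version are placeholders, and module avail will show what is actually installed:

$ module avail            # list available software
$ module load gcc/8.1.0   # load a module (name/version are placeholders)
$ module list             # show what is currently loaded
$ module purge            # unload everything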