Keras is a library that lets you create neural networks. Its selling point is that it wants to get you from 0 to trained model in a jiffy. For more detail, read about the integration with R. In this tutorial, we are going to step through using Keras (via R) on a high performance computing (HPC) cluster at Stanford, specifically the Sherlock 2 cluster.
This tutorial was made possible by community members.
If you are like me, you don’t want to have to do this twice! That is exactly the benefit of using a container: once it’s built and in a container registry, you don’t need to build it again. Today we are going to write a Docker recipe called a Dockerfile. The benefit of using Docker is that we can convert a Docker image directly to Singularity, and with Singularity we can use the container on a shared resource. This means that we kill two birds with one stone, and only need to maintain one recipe file. I’ll save the “how do I create a container recipe” for another post; let’s look at the keras-r container!
You should generally not run compute on a login node. If you are working on your local machine, you can skip this step. If you are on a shared resource, you should jump onto an interactive node. Here is how to do that, asking for a GPU node:
$ srun -p gpu --gres gpu:1 --pty bash
- srun is the SLURM executable to run parallel jobs.
- -p means partition, which is in reference to a group of worker nodes you have access to.
- --gres means “generic consumable resource,” which you can think of as a kind of attribute you are asking for on your node. The count of how many comes after the name, so gpu:1 means that you want one GPU.
- --pty means “pseudo terminal mode,” and without it we wouldn’t get an interactive node.

Asking for an interactive node, period, is important to do because assembling the container can take up enough memory that the process will be killed. It’s nothing personal, it’s just a friendly reminder that you are sharing the login nodes with other users!
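If the defaults aren’t enough for your build, srun accepts additional limits. Here is a hedged sketch asking for more time and memory; partition names and limits vary by cluster, and the values are just illustrations:
$ srun -p gpu --gres gpu:1 --time=02:00:00 --mem=8G --pty bash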
Another terrible situation you can avoid is filling up the default cache (in your $HOME) so that you use your entire quota, are unable to write files, and then get locked out of your own $HOME. This has happened to me before, and it’s terrible and humiliating because you have to ask research computing to bail you out. By default, Singularity would use $HOME/.singularity as a cache, so let’s tell it to use somewhere else (that doesn’t have the space limitation!).
$ SINGULARITY_CACHEDIR=$SCRATCH/.singularity
$ export SINGULARITY_CACHEDIR
For a one time definition, you can prepend the variable to a command:
$ SINGULARITY_CACHEDIR=$SCRATCH/.singularity singularity pull docker://ubuntu
I find it easiest to put the export lines in my $HOME/.bash_profile so that I don’t need to remember to do it. If you just added these lines to yours, just source $HOME/.bash_profile to make sure it’s active!
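Concretely, the relevant lines of your $HOME/.bash_profile would look like this:
# Use scratch for the Singularity cache instead of $HOME
SINGULARITY_CACHEDIR=$SCRATCH/.singularity
export SINGULARITY_CACHEDIR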
You also need to load the module for Singularity, of course, if it’s not already loaded or on your path:
$ module load singularity
# search for packages
# module spider singularity
This is the last step! Let’s pull the image. It’s here on Docker Hub, and we pull as follows:
# One time definition of cache
$ SINGULARITY_CACHEDIR=/scratch/users/vsochat/.singularity singularity pull docker://vanessa/keras-r
# After you've exported
$ singularity pull docker://vanessa/keras-r
WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull. Use Singularity Registry
WARNING: (shub://) to pull exactly equivalent images.
Docker image path: index.docker.io/vanessa/keras-r:latest
Cache folder set to /scratch/users/vsochat/.singularity/docker
[13/13] |===================================| 100.0%
Importing: base Singularity environment
...
WARNING: Building container as an unprivileged user. If you run this container as root
WARNING: it may be missing some functionality.
Building Singularity image...
Singularity container built: /scratch/users/vsochat/.singularity/keras-r.simg
Cleaning up...
Done. Container is at: /scratch/users/vsochat/.singularity/keras-r.simg
We can actually just run the image to get an interactive R session!
$ singularity run /scratch/users/vsochat/.singularity/keras-r.simg
R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library('keras')
>
R Version 3.5 is called “Joy in Playing.” What exactly are you playing with, R? :) I digress! You can have the same interaction with shell, and then start R on your own:
$ singularity shell /scratch/users/vsochat/.singularity/keras-r.simg
Singularity: Invoking an interactive shell within container...
Singularity keras-r.simg:~> which R
/usr/bin/R
Singularity keras-r.simg:~> R
...
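You don’t have to be interactive, either. You can execute a script directly with singularity exec; here myscript.R is a hypothetical script of yours, and we assume Rscript is on the container’s path (it typically is for a standard R install):
$ singularity exec /scratch/users/vsochat/.singularity/keras-r.simg Rscript myscript.R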
For a more substantial test, the container has the additional packages kerasformula and ggplot2:
> require(kerasformula)
> require(ggplot2)
> out = kms(mpg ~ ., mtcars) # use X11 to see epoch-by-epoch progress
> plot(out$history)
The above should run without error, and the plot produced is the following:
What is the intended use of your container? If it serves as a working environment, you can probably extend it most easily by installing packages that save to your $HOME on the cluster (where you have write access). If it is intended for a reproducible analysis, you can use the container for development but ultimately save all your changes to the final container for dissemination. We discuss both in more detail below.
While containers on the cluster are read only, one potentially error prone (but nice) feature is that you can maintain a library of packages (installed in your $HOME) just for your user, and since Singularity mounts your $HOME (where you have write access) it will feel just like installing locally. You should be careful, however, because these packages are not being installed into your container! If you just need a working environment to mess around in, this is OK. But if you are creating what you hope to be a reproducible environment, you really need to go back to the container recipe and add the packages to be installed (along with your scripts). If you need help with this step, please reach out.
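As a sketch of that working environment route, here is how you might install an extra package into your $HOME library using the container’s R. The package and CRAN mirror are just illustrations; the dir.create makes sure the user library exists so the non-interactive install doesn’t fail:
$ singularity exec /scratch/users/vsochat/.singularity/keras-r.simg R -e \
    'dir.create(Sys.getenv("R_LIBS_USER"), recursive=TRUE, showWarnings=FALSE); install.packages("dplyr", lib=Sys.getenv("R_LIBS_USER"), repos="https://cloud.r-project.org")'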
A development container is the same as a working container, but the goals are different. Your aim is to develop a recipe (and a container to share in a registry) that delivers a frozen version of your software. While this is somewhat possible given that you can install packages to your $HOME in a read only environment, it’s sometimes easier to develop with a writable image on your local machine. How to do this? You would create a container recipe that uses this keras-r container (or another of your choice) as a base, and add commands onto it. Here is how you would use it as a base in a Dockerfile:
FROM vanessa/keras-r
and in a Singularity recipe:
Bootstrap: docker
From: vanessa/keras-r
For each, you would start with this base in a Dockerfile or Singularity recipe on your local machine (a text file with the above content) and then create writable images that you can shell into to test commands. At the end of each iteration, when you are happy with local testing, you can move the container onto your shared resource and test it for the intended use case, which is likely running something at scale.
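On the cluster, that scale test usually means a batch job rather than an interactive node. Here is a minimal sketch of an sbatch submission script; the job name, resource limits, and the train.R script are all hypothetical placeholders:
#!/bin/bash
#SBATCH --job-name=keras-r
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --mem=8G

# Load singularity and run a (hypothetical) training script in the container
module load singularity
singularity exec $SCRATCH/.singularity/keras-r.simg Rscript train.R
You would save this to a file (e.g., keras-r.sbatch) and submit it with sbatch keras-r.sbatch.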
Given the recipe text above in a file called Singularity, to build a writable (sandbox) image, you can do:
$ sudo singularity build --sandbox sandbox Singularity
and then you would want to shell inside, also with --writable:
$ sudo singularity shell --writable sandbox
If you don’t have the Singularity file yet and want a quick development environment, you can do this:
$ sudo singularity build --sandbox sandbox docker://vanessa/keras-r
Notice that the only difference is the source of the bootstrap, which is a Docker identifier instead of a recipe file.
And when you like a command, add it to the %post section of your Singularity file.
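For instance, a recipe that extends the base with one extra R package might look like this; the package is just an illustration of a command you tested in the sandbox:
Bootstrap: docker
From: vanessa/keras-r

%post
    # commands you tested interactively go here
    R -e "install.packages('dplyr', repos='https://cloud.r-project.org')"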
When you are ready to give the entire thing a test run, build in squashfs (read only) format:
$ sudo singularity build my-keras.simg Singularity
My preference for development is to use Docker, because the build layers get cached and I don’t need to wait for long, complicated compile routines as I would with Singularity. To build a Docker image, given a Dockerfile with the text above:
$ docker build -t vanessa/keras-r .
If you need to disable the cache (for example, when a remote resource changes but the Dockerfile doesn’t, so the cache would otherwise be used) you can do this:
$ docker build --no-cache -t vanessa/keras-r .
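Once you are happy with the image, pushing it to Docker Hub (after a docker login) makes it available to pull with Singularity on the cluster, exactly as we did at the start:
$ docker push vanessa/keras-r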
And there you have it, a handy-dandy container for bringing around Keras, R, and other cool things! You can use it as is, as a working environment that you customize, or as a base for your own work.
If you aren’t into using containers, you can of course get this working natively. We need Python, R, and a few libraries. In the example below, we will use Lmod on Sherlock to manage software modules. This basically means you can type “module load” to fuss around with the path. Without a container, you need to have the libraries and software on the host, and yes, versions matter.
In the examples below, we are first going to get an interactive node. The reason is that we should generally not run things on the login nodes! You must get a GPU node or you will get a bunch of weird errors.
# Grab an interactive node, you get an hour default!
$ srun -p gpu --gres gpu:1 --pty bash
Let’s pretend we have forgotten how to find tensorflow. We can use spider to see what is available:
[vsochat@sh-08-37 ~]$ module spider tensorflow
----------------------------------------------------------------------------
py-tensorflow:
----------------------------------------------------------------------------
Description:
TensorFlow™ is an open source software library for numerical
computation using data flow graphs.
Versions:
py-tensorflow/1.4.0_py27
py-tensorflow/1.5.0_py27
py-tensorflow/1.5.0_py36
py-tensorflow/1.6.0_py27
py-tensorflow/1.6.0_py36
py-tensorflow/1.7.0_py27
py-tensorflow/1.8.0_py27
You can also use module avail tensorflow for more detailed output.
[vsochat@sh-08-37 ~]$ module avail tensorflow
--- math -- numerical libraries, statistics, deep-learning, computer science ---
py-tensorflow/1.4.0_py27 (g) py-tensorflow/1.6.0_py36 (g)
py-tensorflow/1.5.0_py27 (g) py-tensorflow/1.7.0_py27 (g)
py-tensorflow/1.5.0_py36 (g) py-tensorflow/1.8.0_py27 (g)
py-tensorflow/1.6.0_py27 (g,D)
Where:
D: Default Module
g: GPU support
Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".
We are interested in having GPU support, so it’s good to see that we have it. Note that in the above we can choose between Python 2 or 3, and if we just loaded py-tensorflow without a version string, the default we would get is Python 2.7 with tensorflow 1.6.0. Let’s make the assumption that the sys admin had some wise rationale for this to be the default (and the others are user requests?) and use it.
$ module load py-tensorflow
The following have been reloaded with a version change:
1) python/3.6.1 => python/2.7.13
It should show you that you are also loading Python 2.7; the module needs this version to work, so if you don’t see that, make sure you have a Python 2.7 somewhere! Peeking into the install base for py-tensorflow, we see that loading the module basically adds one of these to the path:
[vsochat@sh-08-37 ~]$ ls /share/software/user/open/py-tensorflow/
1.3.0_py27 1.4.0_py27 1.5.0_py27 1.6.0_py27 1.7.0_py27
1.3.0_py36 1.4.0_py36 1.5.0_py36 1.6.0_py36 1.8.0_py27
and here is how to check your python:
[vsochat@sh-101-58 ~]$ which python
/share/software/user/open/python/2.7.13/bin/python
The Keras API is a Python module, so we install it with pip. On Sherlock, we need the SSL library module loaded:
$ module load libressl/2.5.3
Install keras as a user:
$ pip install keras utils np_utils tensorflow --user
The --user flag will install the extra modules to my $USER home. If we didn’t do this, pip would try to install to the system site-packages and you would get a permissions error.
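Before moving on to R, it’s worth a quick sanity check that keras imports with the python we just loaded:
$ python -c "import keras; print(keras.__version__)"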
We need to now install the wrapper for Keras in R, along with other supporting packages. But first! We need… additional libraries! These will be needed to run R.
$ module load openblas/0.2.19
And of course we need to load R
$ module load R/3.4.0
$ which R
/share/software/user/open/R/3.4.0/bin/R
$ R
Oh darn, looks like we only have version 3.4.0, code name “You Stupid Darkness.” Darkness is good for sleeping, and it inspires insightful reflection. If you have issues with Darkness, you should complain to the sun. I’m pretty sure the sun would just engulf you in a fiery arm and go on with his day. Oup, I am digressing again…
Once in R, we can install keras, tensorflow, and reticulate. I’m adding the extra packages kerasformula and ggplot2 so we can do the same test as we did previously. To find the python on your path:
# Here is how I found the path
> system('which python')
/share/software/user/open/python/2.7.13/bin/python
Note the below is the same script that I use to generate the container!
# Install reticulate and point it at the cluster python
install.packages('reticulate')
reticulate::use_python('/share/software/user/open/python/2.7.13/bin/python')

# Install the keras R package from GitHub
install.packages('devtools')
devtools::install_github('rstudio/keras')

require(tensorflow)
require(reticulate)
require(keras)

# Make sure the tensorflow backend finds the same python
Sys.setenv(TENSORFLOW_PYTHON='/share/software/user/open/python/2.7.13/bin/python')
use_python('/share/software/user/open/python/2.7.13/bin/python')

# Sanity checks: where do tensorflow and keras resolve from?
py_discover_config('tensorflow')
py_discover_config('keras')
is_keras_available()

# Supporting packages for the tests below
packages = c("kerasformula",
             "kerasR",
             "ggplot2",
             "dplyr",
             "magrittr",
             "zeallot",
             "tfruns")

for (package in packages) {
    install.packages(package)
}
If you didn’t get a GPU node, you’d see an error like this:
> reticulate:::import('keras')
Error in py_module_import(module, convert = convert) :
ImportError: cannot import name np_utils
Detailed traceback:
File "/home/users/vsochat/.virtualenvs/r-tensorflow/lib/python2.7/site-packages/keras/__init__.py", line 3, in <module>
from . import utils
File "/home/users/vsochat/.virtualenvs/r-tensorflow/lib/python2.7/site-packages/keras/utils/__init__.py", line 2, in <module>
from . import np_utils
The real issue is that the tensorflow backend couldn’t import CUDA. You could verify this by trying to use tensorflow:
> library(tensorflow)
Error: package or namespace load failed for ‘tensorflow’:
.onLoad failed in loadNamespace() for 'tensorflow', details:
call: py_module_import(module, convert = convert)
error: ImportError: Traceback (most recent call last):
File "/share/software/user/open/py-tensorflow/1.6.0_py27/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/share/software/user/open/py-tensorflow/1.6.0_py27/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/share/software/user/open/py-tensorflow/1.6.0_py27/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
And this, ladies and gents, is why wrapping things in layers of other things is so dangerous! Here is the other issue you might hit, if you want to plot things:
In (function (display = "", width, height, pointsize, gamma, bg, :
unable to open connection to X11 display ''
>
You might try going back and adding --x11 to your node request.
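That is, exit and grab your interactive node again with the flag added:
$ srun -p gpu --gres gpu:1 --x11 --pty bash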
From personal experience, you can get different results depending on your internet connectivity, the mirror that you use, and even different system environment variables. This is why (in some cases) it’s nice to use a container. You might run into other bugs as well.
This part of R (installing and loading packages) is the most fragile step. Importantly, when you get an error message, please read the entire text output. Most of the time, an error message will tell you exactly what the issue is. If you aren’t sure what it means, try a Google search. If you are still lost, then contact research computing support.
Activate your new packages in the session:
> require(tensorflow)
> require(reticulate)
> require(keras)
Let’s now test as we did before!
> require(kerasformula)
> require(ggplot2)
> out = kms(mpg ~ ., mtcars) # use X11 to see epoch-by-epoch progress
> plot(out$history)
I’ll spare you the redundant picture! It looks the same.
The next time around, since you have everything installed, you mostly need to load things again. You can have a small script to do that. You would run this on an interactive node.
$ srun -p gpu --gres gpu:1 --x11 --pty bash
#!/bin/bash
module load py-tensorflow
module load R/3.4.0
module load openblas/0.2.19
exec R
# And then interact with keras
I hope that you can see the huge benefit of using a container here. With the container, you can go to any cluster that has Singularity installed and be up and running in just the amount of time it takes to pull the container. The native method, although it uses modules, is fragile because of the huge number of dependencies. You very likely will log in at some future date, find that it doesn’t reproduce, and not know why. A colleague on a different resource also can’t easily reproduce your software. As a strong example, just deriving this native tutorial took me a few days to get working based on the notes from my colleague, primarily because of subtle differences in the modules and libraries that were loaded. Save your future self the time: figure out your software base once, put it in a container, and be done!
This series guides you through getting started with HPC cluster computing.
Slurm Manage, for submitting and reporting on job arrays run on slurm
A Quick Start to using Singularity on the Sherlock Cluster
Use the Containershare templates and containers on the Stanford Clusters
A custom built pytorch and Singularity image on the Sherlock cluster
Use Jupyter Notebooks via Singularity Containers on Sherlock with Port Forwarding
Use R via a Jupyter Notebook on Sherlock
Use Jupyter Notebooks (optionally with GPU) on Sherlock with Port Forwarding
Use Jupyter Notebooks on Sherlock with Port Forwarding
A native and container-based approach to using Keras for Machine learning with R
How to create and extract rar archives with Python and containers
Getting started with SLURM
Getting started with the Sherlock Cluster at Stanford University
Using Kerberos to authenticate to a set of resources