Standard Container Integration Format

We are making progress in reproducibility by way of Linux Containers, and as a developer for Singularity I am throwing my latkes in the air! However, what is currently lacking is transparency and organization about what is inside containers. Right now, it’s easy to create a container, call it “do-a-thing” and then make some assumption that executing the container will (hopefully) do the thing. But what if there is more than one application provided? How do I programmatically sniff the container and know, perhaps statically, where the software and dependencies can be found to make the thing? How do I tween apart the scientist’s specific executables from standard libraries that are used?

How can I contribute?

We are developing a Standard Container Integration Format (SCI-F) that will make it easy for scientists to generate internally modular containers. Why? Because we need modularity on the level of the software inside the container. The current Singularity software encourages reproducibility on the level of the entire operating system. If it’s easy to generate programatically predictible and accessible containers via internal modularity, this makes integration and understanding of these containers possible. We can start to do things like parse application folders to know the exact files and environment variables in the container for that application. We can start to seriously consider algorithms that classify software in the container, and then make them searchable.

If you don’t want to read the rest, in a nutshell we want your feedback on the current draft. If you have an interesting use case, let’s work together to implement it, and then it will be written into the publication. For the rest of this post, I will walk through a working example of the current implementation.

The Singularity Recipe

Here we introduce new sections to a traditional Singularity recipe. The current version might look like this:


BootStrap: docker
From: ubuntu:latest

%help
This container has applications foo and bar

%runscript

if [ "$1" = "bar" ]; then
    exec bar
elseif [ "$1" = "foo" ]; then 
    exec foo
else
    echo "Invalid input $1"
fi

%post

   echo 'echo "RUNNINGBAR"' >> /opt/bar
   echo 'echo "RUNNINGFOO"' >> /opt/foo

%environment
PATH=$PATH:/opt
HELLOTHISIS=foobar
export PATH HELLOTHISIS

In the example above, we bootstrap an ubuntu docker image, and then create a runscript that expects the user to provide “foo” or “bar” to call one of two executables. Installation is done in %post, and to make the executables available, we add them to the path in the %environment section. The %help section alerts the user that the container provides these two applications. Do you see problems with this approach? Even with two simple executables that perhaps share all dependencies (and it would be bloaty to create two different containers) they are tangled inside the container. When I issue a command to the container, their environment variables are both sourced, and I have no way to distinguish a command to run, shell, or exec to be relevant to a particular application. Let’s see if we can do better:


BootStrap: docker
From: ubuntu:latest

%help
This container has applications foo and bar

%appinstall foo

    echo "INSTALLING FOO"
    touch filebar.exec

%appinstall bar

    echo "INSTALLING BAR"
    touch filebar.exec

%apphelp foo
This is the help for foo!

%applabels foo
HELLOTHISIS foo

%applabels bar
HELLOTHISIS bar

%appenv foo
HELLOTHISIS=foo
export HELLOTHISIS

%apprun foo
echo "RUNNING FOO"

%post
echo "POST"

In the example above, you will notice the overall container still has a section for %post, %help, and could also have a %runscript (although I’ve removed it.) Importantly, I have defined separate installation, help, labels, and environment sections for each of apps “foo” and “bar.” I’m not required to have all sections for each. I don’t need to tell the recipe where to install my applications, because it knows. This means that I can issue commands that are finer grained, and targeted for a specific application. A user that discovers my container can list installed apps, and then issue commands to them without needing to know install paths or specific exec commands. Let’s look at some examples of that.

Interaction

A powerful feature of container software applications is allowing for programmatic accessibility to a specific application within a container. With SCI-F, simply by specifying separate sections of the build file, as shown above, each application can be interacted with directly, and predictibly. For each of the Singularity software’s main commands, run, exec, shell, inspect and test, the same commands can be easily run for an application.

Listing Applications

If I wanted to see all applications provided by a container, I could use singularity apps:

singularity apps container.img
total 8.0K
4.0K bar
4.0K foo

For each application, I am also told the size, and the total size of all applications in the container.

Application Run

To run a specific application, I can use apprun:


singularity apprun container.img foo
RUNNING FOO

This ability to run an application means that the application has its own runscript, defined in the build recipe with %apprun foo. In the case that an application doesn’t have a runscript, the default action is taken, shelling into the container:


singularity apprun container.img bar
No Singularity runscript found, executing /bin/sh into bar
Singularity>

Note that unlike a traditional shell command, we are shelling into the base location for the application, in this case at /scif/apps/bar as running the command makes the assumption the user wants to interact with this software.

Application Shell, Exec, Test

As was shown above, a user can specify the shell command targeted to a specific application, and shell directly to its base:


singularity shell --app foo container.img
Singularity: Invoking an interactive shell into application...

Singularity container.img:/scif/apps/foo>

Notice that the command syntax has changed a bit. Instead of having a separate command like appshell, it is more intuitive to ask for a shell to variable --app The exec command works similarly, it will run a command relative to this base:


singularity exec --app bar container.img ls
            filebar.exec  scif

In the above we see the application metadata folder, scif, and some executable in its base. Finally, an application with tests can be tested:

singularity test --app bar container.img

Importantly, for all of these commands, in addition to the base container environment being sourced, any variables specified for the application’s custom environment are also sourced. If I don’t need the environment variables for application “bar” when I’m running foo, they shouldn’t be sourced.

Application Inspect

In the case that a user wants to inspect a particular application for a runscript, test, or labels, that is possible on the level of the application:


singularity inspect --app foo container.img
{
    "SINGULARITY_APP_NAME": "foo",
    "HELLOTHISIS": "foo"
}

The above shows the default output, the labels specific to the application foo.

Container Organization is Really Important

Please give feedback on the draft or install the current implementation to test it out. Feedback is so important! If you have a well-scoped project that would be suited for a single container with multiple applications, please reach out and I’ll help with the development. All contributors will be co-authors on the paper. And you can trust that if you contribute your ideas, you will be heard. That’s the beauty of open source development.

Importantly, a researcher that doesn’t want to use this tool doesn’t have to - the additions to the Singularity recipe file don’t need to be used, and the container will be produced as it is now.

Containers are Like Supermarkets…

However, think about the analogy of your local supermarket. Given that it’s your local market, you go there many times and you can very reliably find the produce section. What happens when you go into a different store? If you are like me, you probably do a few circles before finding the right section. You might waste some time getting lost, need to ask for help, or (worst case) not find what you are looking for. To make matters worse, now imagine that the outside of the store doesn’t even indicate what kind of products are inside. Yuck. Instacart, anyone?

It’s really not so different with containers, but unfortunately we might be closer to the analogy with unlabeled stores because containers are (still) mostly black boxes. I know what you are thinking - we have the container name! And tags! And a list of metadata fields! These labels on the level of the entire container are better than nothing, but to assess content our algorithms still need to look over everything and then decide if it’s interesting (scientific software) or not (system library). We still see two containers called “Tensorflow Foods” and can’t be assured about what kind of things are lurking inside, and where they are. And what about when we dig deeper, and we not only are interested in software, but parameters used at runtime, resources used at runtime, and the outputs? Can we associate different features of the software, user scripts, or host container to some result? Developing a framework to organize the components that we aim to assess is not the answer, but it’s the start to being able to ask and answer these questions.

I’m ok with wandering around my local container with my analysis shopping cart, but I think we can do better.