In this example, we will start with a basic folder of data and generate a data container for it.
A data container is generally an operating-system-less container that is optimized to provide data, either for query/search, or binding for analysis. The qualities of the data container should be:
The generation is fairly simple! It comes down to a three-step multistage build, and then we interact with it. This tutorial will show you the basic steps to perform the multistage build using a simple Dockerfile along with the data folder.
Let’s break the Dockerfile down into its components. This first section will install the cdb software, add the data, and generate a GoLang script to compile, which will generate an in-memory database.
FROM bitnami/minideb:stretch as generator
# docker build -t data-container .
ENV PATH /opt/conda/bin:${PATH}
ENV LANG C.UTF-8
RUN /bin/bash -c "install_packages wget git ca-certificates && \
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
rm Miniconda3-latest-Linux-x86_64.sh"
# install cdb (update version if needed)
RUN pip install cdb==0.0.1
# Add the data to /data (you can change this)
WORKDIR /data
COPY ./data .
RUN cdb generate /data --out /entrypoint.go
Next we want to build that file, entrypoint.go, and also carry the data forward:
FROM golang:1.13-alpine3.10 as builder
COPY --from=generator /entrypoint.go /entrypoint.go
COPY --from=generator /data /data
# Dependencies
RUN apk add git && \
go get github.com/vsoch/containerdb && \
GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o /entrypoint -i /entrypoint.go
Finally, we want to add just the executable and data to a scratch container (meaning it doesn’t have an operating system):
FROM scratch
LABEL MAINTAINER @vsoch
COPY --from=builder /data /data
COPY --from=builder /entrypoint /entrypoint
ENTRYPOINT ["/entrypoint"]
And that’s it! Take a look at the entire Dockerfile if you are interested.
Let’s build it!
$ docker build -t data-container .
We then have a simple way to do the following:
metadata
If we just run the container, we get a listing of all metadata alongside the key.
$ docker run data-container
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
list
We can also just list data files with -ls:
$ docker run data-container -ls
/data/avocado.txt
/data/tomato.txt
orderby
Or we can list ordered by one of the metadata items:
$ docker run data-container -metric size
Order by size
/data/tomato.txt: {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
/data/avocado.txt: {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
search
Or we can search based on the value of a specific metric:
$ docker run data-container -metric size -search 8
/data/tomato.txt 8
$ docker run data-container -metric sha256 -search 8
/data/avocado.txt 327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4
/data/tomato.txt 3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816
get
Or we can get a particular file’s metadata by its name:
$ docker run data-container -get /data/avocado.txt
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
or a partial match:
$ docker run data-container -get /data/
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
start
The start command is intended to keep the container running when we are using it with an orchestrator.
$ docker run data-container -start
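The flags demonstrated above (-ls, -metric, -search, -get, -start) could be wired up with Go's standard flag package. Here is a hedged sketch of such an interface, with a tiny hard-coded map standing in for the generated in-memory database; it is not cdb's actual generated code.

```go
package main

import (
	"flag"
	"fmt"
	"sort"
	"strings"
)

// db stands in for the in-memory database that cdb compiles into the
// entrypoint; the real one is generated from the files under /data.
var db = map[string]int{
	"/data/avocado.txt": 9,
	"/data/tomato.txt":  8,
}

// list returns all paths, sorted for stable output.
func list() []string {
	paths := make([]string, 0, len(db))
	for p := range db {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	return paths
}

// orderBySize returns paths ordered by the size metric, smallest first.
func orderBySize() []string {
	paths := list()
	sort.Slice(paths, func(i, j int) bool { return db[paths[i]] < db[paths[j]] })
	return paths
}

// get returns paths matching an exact name or a prefix.
func get(prefix string) []string {
	var matches []string
	for _, p := range list() {
		if strings.HasPrefix(p, prefix) {
			matches = append(matches, p)
		}
	}
	return matches
}

func main() {
	ls := flag.Bool("ls", false, "list data files")
	metric := flag.String("metric", "", "order by a metadata key")
	getPath := flag.String("get", "", "show files matching a path or prefix")
	flag.Parse()

	switch {
	case *ls:
		for _, p := range list() {
			fmt.Println(p)
		}
	case *metric == "size":
		fmt.Println("Order by size")
		for _, p := range orderBySize() {
			fmt.Printf("%s: {\"size\": %d}\n", p, db[p])
		}
	case *getPath != "":
		for _, p := range get(*getPath) {
			fmt.Printf("%s {\"size\": %d}\n", p, db[p])
		}
	}
}
```

Because everything is compiled into one static binary, this style of entrypoint works even in a scratch container with no shell.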
It’s more likely that you’ll want to interact with files in the container via some analysis, or more generally, another container. Let’s put together a quick docker-compose.yml to do exactly that.
version: "3"
services:
  base:
    restart: always
    image: busybox
    entrypoint: ["tail", "-f", "/dev/null"]
    volumes:
      - data-volume:/data
  data:
    restart: always
    image: data-container
    command: ["-start"]
    volumes:
      - data-volume:/data
volumes:
  data-volume:
Notice that the command for the data-container is -start, which is important to keep it running. After building our data-container, we can then bring these containers up:
$ docker-compose up -d
Starting docker-simple_base_1 ... done
Recreating docker-simple_data_1 ... done
$ docker-compose ps
Name Command State Ports
---------------------------------------------------------
docker-simple_base_1 tail -f /dev/null Up
docker-simple_data_1 /entrypoint -start Up
We can then shell inside and see our data!
$ docker exec -it docker-simple_base_1 sh
/ # ls /data/
avocado.txt tomato.txt
The metadata is still available for query by interacting with the data-container entrypoint:
$ docker exec docker-simple_data_1 /entrypoint -ls
/data/avocado.txt
/data/tomato.txt
Depending on your use case, you could easily make this available inside the other container.
This is a very simple example of building a small data container to query and
show metadata for two files, and then bind that data to another orchestration
setup. Although this example is simple, the idea is powerful because we
are able to keep data and interact with it without needing an operating system.
Combined with other metadata or data organizational standards, this could be
a really cool approach to develop data containers optimized to interact
with a particular file structure or workflow. How will that work in particular?
It’s really up to you! The cdb software can take custom functions for generation of metadata and templates for generating the GoLang script to compile, so the possibilities are very open.
Please contribute to the effort! I’ll be slowly adding examples as I create them.
Welcome to cdb, the container database software that will help you to create a simple data container! This interface will give you a little background on the project.
In this example, we will start with a basic folder of data and generate a data container for it. Singularity is especially powerful here because our container will be read only, meaning that we know for sure it won’t be changed as long as we maintain the same file.
A data container is generally an operating-system-less container that is optimized to provide data, either for query/search, or binding for analysis. The qualities of the data container should be:
The generation is fairly simple! It comes down to a three-step multistage build, and then we interact with it. This tutorial will show you the basic steps to perform the multistage build using a simple Singularity recipe along with the data folder.
Let’s break the Singularity recipe down into its components. This first section will install the cdb software, add the data, and generate a GoLang script to compile, which will generate an in-memory database.
Bootstrap: docker
From: bitnami/minideb:stretch
Stage: generator

# sudo singularity build data-container Singularity

%setup
    mkdir -p ${SINGULARITY_ROOTFS}/data

%files
    ./data /data

%post
    export PATH=/opt/conda/bin:${PATH}
    export LANG=C.UTF-8
    /bin/bash -c "install_packages wget git ca-certificates && \
        wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
        bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
        rm Miniconda3-latest-Linux-x86_64.sh"
    pip install cdb==0.0.1
    cdb generate /data --out /entrypoint.go
Next we want to build that file, entrypoint.go, and also carry the data forward:
Bootstrap: docker
From: golang:1.13-alpine3.10
Stage: builder

%files from generator
    /entrypoint.go /entrypoint.go
    /data /data

%post
    apk add git && \
        go get github.com/vsoch/containerdb && \
        GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o /entrypoint -i /entrypoint.go
Finally, we want to add just the executable and data to a scratch container (meaning it doesn’t have an operating system):
Bootstrap: scratch
Stage: final

%files from builder
    /entrypoint /entrypoint
    /data /data
We will then interact with the /entrypoint binary. Notice that we could also do a little hack at the end: if we add a %runscript, then it’s going to try and interact with some /bin/sh and spit out an error. So instead, we could just create the entrypoint executable to be /bin/sh!
%setup
    mkdir -p ${SINGULARITY_ROOTFS}/bin

%files from builder
    /entrypoint /bin/sh
    /data /data
I chose not to do this, because I don’t mind running singularity exec instead. I think we would also need a way to pass the arguments to /bin/sh (I haven’t tested this yet; care to try it out?). Also note that you need a fairly recent version of Singularity to have support for scratch. And that’s it! Take a look at the entire Singularity recipe if you are interested.
Let’s build it!
$ sudo singularity build data-container Singularity
We then have a simple way to do the following:
metadata
If we just run the container, we get a listing of all metadata alongside the key. (I’m not sure how to silence the warnings, but I’m sure there is a way.)
$ singularity exec data-container /entrypoint
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
list
We can also just list data files with -ls:
$ singularity exec data-container /entrypoint -ls
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
/data/avocado.txt
/data/tomato.txt
orderby
Or we can list ordered by one of the metadata items:
$ singularity exec data-container /entrypoint -metric size
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
Order by size
/data/tomato.txt: {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
/data/avocado.txt: {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
search
Or we can search based on the value of a specific metric:
$ singularity exec data-container /entrypoint -metric size -search 8
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
/data/tomato.txt 8
$ singularity exec data-container /entrypoint -metric sha256 -search 8
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
/data/avocado.txt 327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4
/data/tomato.txt 3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816
get
Or we can get a particular file’s metadata by its name:
$ singularity exec data-container /entrypoint -get /data/avocado.txt
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
or a partial match:
$ singularity exec data-container /entrypoint -get /data/
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
start
The start command is intended to keep the container running, if we are using it with an orchestrator.
$ singularity exec data-container /entrypoint -start
We haven’t created the singularity-compose recipes yet; likely we would need to have a custom %startscript to not use /bin/sh, but instead to target the entrypoint. Please take a shot or suggest ideas!
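One possible direction, sketched here but untested, would be a %startscript in the final stage that execs the compiled binary directly, assuming the runtime can launch it without a shell:

```
%startscript
    exec /entrypoint -start
```

Whether this works in a scratch container without /bin/sh is exactly the open question above, so treat this as a starting point for experiments rather than a working recipe.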
This is a very simple example of building a small data container to query and
show metadata for two files. Although this example is simple, the idea is powerful because we
are able to keep data and interact with it without needing an operating system.
Combined with other metadata or data organizational standards, this could be
a really cool approach to develop data containers optimized to interact
with a particular file structure or workflow. How will that work in particular?
It’s really up to you! The cdb software can take custom functions for generation of metadata and templates for generating the GoLang script to compile, so the possibilities are very open.
Please contribute to the effort! I’ll be slowly adding examples as I create them.