The Dockerfiles dataset comes in two versions: v1.0.0, with approximately 130,000 Dockerfiles extracted in early summer 2018 across a sampling of search prefixes, and v2.0.0, with approximately 100,000 Dockerfiles extracted in January 2020.
# Version 1.0.0
$ find data -type f -name Dockerfile | wc -l
129519
# Version 2.0.0
$ find data -type f -name Dockerfile | wc -l
99826
The Dockerfiles come from public images hosted on Docker Hub, and are thus freely available for download and parsing.
The files are currently provided in their raw format, each named Dockerfile and organized by Docker Hub username. For example, here is the top level of folders under “data” in the repository:
data
├── 0
├── 1
├── 2
├── 3
├── 4
├── 5
├── 6
├── 7
├── 8
├── 9
├── a
├── b
├── c
...
├── w
├── x
├── y
└── z
36 directories, 0 files
and within each, we have folders that represent Docker Hub usernames:
data/a
├── a13r
├── a13xx
├── a1exanderjung
...
├── azuresdk
├── azzanatsu
└── azzra
Each Docker Hub username then has subfolders named for its containers, and each subfolder contains a Dockerfile (no pun intended).
data/a/a13r
├── waecm-2018-group-16-bsp-1-backend
│   └── Dockerfile
├── waecm-2018-group-16-bsp-1-frontend
│   └── Dockerfile
└── waecm-2018-group-16-bsp-1-revproxy
    └── Dockerfile
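The three folder levels above (prefix, username, container) map directly onto a path pattern, so the tree can be walked with a few lines of Python. This is only a minimal sketch: the names `DockerfileEntry` and `iter_dockerfiles` are made up for illustration and are not part of the dataset, and it assumes the repository has been cloned so that a `data` folder exists in the working directory.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class DockerfileEntry(NamedTuple):
    """One Dockerfile plus the folder names that identify it (hypothetical helper type)."""
    prefix: str     # single-character top-level folder, e.g. "a"
    username: str   # Docker Hub username, e.g. "a13r"
    container: str  # container name, e.g. "waecm-2018-group-16-bsp-1-backend"
    path: Path      # full path to the Dockerfile


def iter_dockerfiles(data_dir: str = "data") -> Iterator[DockerfileEntry]:
    """Yield one entry per data/<prefix>/<username>/<container>/Dockerfile."""
    root = Path(data_dir)
    # The glob pattern mirrors the three folder levels shown in the listings.
    for path in sorted(root.glob("*/*/*/Dockerfile")):
        prefix, username, container = path.relative_to(root).parts[:3]
        yield DockerfileEntry(prefix, username, container, path)


if __name__ == "__main__":
    entries = list(iter_dockerfiles("data"))
    print(f"{len(entries)} Dockerfiles found")
    for entry in entries[:3]:
        print(entry.username, entry.container, entry.path)
```

Yielding entries one at a time, rather than building a list up front, keeps memory use flat even with over a hundred thousand files.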
Since this dataset (despite the huge number of files!) still fits in a GitHub repository, the files are provided as is under version control, and don’t require any special downloading beyond cloning the repo or downloading an archive.
git clone https://github.com/vsoch/dockerfiles
# Version 1.0.0
wget https://github.com/vsoch/dockerfiles/archive/1.0.0.zip
wget https://github.com/vsoch/dockerfiles/archive/1.0.0.tar.gz
# Version 2.0.0
wget https://github.com/vsoch/dockerfiles/archive/2.0.0.zip
wget https://github.com/vsoch/dockerfiles/archive/2.0.0.tar.gz
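Once the files are on disk, parsing them can start small. As one possible illustration, the sketch below tallies the base images named in `FROM` instructions. The function names `base_images` and `count_base_images` are hypothetical, the `data` folder is assumed to come from a clone or an extracted archive, and the parsing is deliberately simplified: it ignores line continuations and does not resolve `ARG` substitutions such as `FROM ${BASE}`.

```python
from collections import Counter
from pathlib import Path
from typing import List


def base_images(dockerfile_text: str) -> List[str]:
    """Return the image named in each FROM instruction, in order of appearance."""
    images = []
    for line in dockerfile_text.splitlines():
        tokens = line.strip().split()
        # Instruction names are case-insensitive; comment lines start with "#".
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        # Skip optional flags such as --platform=linux/amd64.
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if args:
            images.append(args[0])
    return images


def count_base_images(data_dir: str = "data") -> Counter:
    """Count base images across data/<prefix>/<username>/<container>/Dockerfile."""
    counts: Counter = Counter()
    for path in Path(data_dir).glob("*/*/*/Dockerfile"):
        # errors="replace" because scraped files are not guaranteed to be valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(base_images(text))
    return counts


if __name__ == "__main__":
    for image, n in count_base_images("data").most_common(10):
        print(n, image)
```

Counting `image:tag` strings as written keeps the sketch simple; grouping tags under one image name would be a natural next step.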
Many of the same questions about signatures of software can be tested with, or are generally relevant for, this dataset. Additionally, we might ask the following:
Thanks for reading! If you have other questions, or want help for your project, please don’t hesitate to reach out.