The Zenodo-ML dataset is a collection of just under 10K records from the Zenodo service for generation of digital object identifiers (DOIs) for software and associated digital resources. In human terms, this means that someone writes a codebase for their software, and links it to Zenodo so others can find and cite it. For this dataset, it means that we can find these codebases, and do the following:
Here are some analysis ideas to get you started! These are questions that we at Research Computing are interested in and working on, but we can’t do it alone. We want to empower you to help!
First, let me give you a high level view of how this data can be understood. Software, whether compiled or not, is a collection of scripts. A script is a stack of lines, and you might be tempted to relate it to a document or a page in a book. Actually, I think we can conceptualize scripts more like images. A script is like an image in that it is a grid of pixels, each of which has some pixel value. In medical imaging we may think of these values in some range of grayscale, and with any kind of photograph we might imagine having different color channels. A script is no different - the empty spaces are akin to values of zero, and the various characters akin to different pixel values. While it might be tempting to use methods like Word2Vec to assess code in terms of lines, I believe that the larger context of the previous and next lines is important too.
This project aimed to:
Here are some ideas for questions that this dataset could help answer:
The dataset was generated by querying the Zenodo API for records, downloading each record's archived code (a compressed archive, falling back to the original GitHub repository), and then running a script to process the scripts into 80x80 images, generate a file hierarchy tree, and save each repository into several data formats for your use. This process was run in parallel on a SLURM cluster at Stanford (Sherlock), and it would also be possible to run locally (albeit it would take much longer!). In both cases, if you want to re-generate the dataset or update it with newer records, the process is completely reproducible with the code that is provided in the GitHub repository. This page will review assumptions and details about the processing, and instructions for obtaining and using the dataset.
The original data lives on the Sherlock cluster, so if you are a Stanford affiliate you can ask Research Computing directly (see ? in bottom right of page!). If not, you can download a squashfs filesystem to work with the data. This is over half a terabyte of data compressed into a 60GB squashfs archive, which you can mount with FUSE. I haven't gotten this working in a container yet, but I'll show you how to mount it on your own machine.
Once downloaded, check the file size and md5 sum.
$ ls -l zenodo-ml.sqsh
-rw-r--r-- 1 vanessa vanessa 20658548736 Jun 1 01:31 zenodo-ml.sqsh
$ md5sum zenodo-ml.sqsh
bff9f8ca3632fa7372f0b9e440b85c5a zenodo-ml.sqsh
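If you would rather check the download from Python, here is a minimal sketch of the same checksum comparison. The helper name `md5sum` and the chunk size are my own choices for illustration; the expected value is copied from the output above.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes,
        # so the whole archive never has to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum copied from the md5sum output shown above
EXPECTED_MD5 = "bff9f8ca3632fa7372f0b9e440b85c5a"

# Usage (commented out because it needs the downloaded archive):
# assert md5sum("zenodo-ml.sqsh") == EXPECTED_MD5
```

Reading in chunks matters here because the archive is many gigabytes.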
Once you have downloaded the data, you need to mount it to a folder on your computer. The mount point (/tmp/data below) must already exist. If you have sudo, this is fairly easy to do!
$ sudo mount zenodo-ml.sqsh /tmp/data
If you have a FUSE filesystem then you can use squashfuse to mount the volume without needing sudo. First, here are the dependencies and installation steps for squashfuse.
#!/bin/bash
apt-get update && apt-get install -y fuse libfuse2 git zlib1g-dev \
    autoconf libtool make gcc pkg-config xz-utils \
    libtool libfuse-dev liblzma-dev squashfs-tools
git clone https://github.com/vasi/squashfuse
cd squashfuse
libtoolize --force
aclocal
autoheader
automake --force-missing --add-missing
autoconf
./configure --with-xz=/usr/lib/ --prefix=/usr/local
make
# make install
Note that you don't have to run make install; you can run squashfuse from the folder that you built it in. Once it is built (or installed), issue a similar command as before, but with the squashfuse executable instead.
$ squashfuse zenodo-ml.sqsh /tmp/data
You can unmount with one of the following, depending on how you mounted the archive. For a squashfuse mount:
$ fusermount -u /tmp/data
For a sudo mount:
$ sudo umount /tmp/data
This isn’t a completely reproducible way to share the data, and it certainly will take some time to download. We will update the instructions here when we figure out how to serve this dataset more efficiently. This is about half a terabyte for 10K code repositories, so it’s non-trivial.
Once the filesystem is mounted, you can explore it like a traditional filesystem. Let's take a look at the contents of one of the subfolders under the mount point:
$ tree /tmp/data/1065022/
/tmp/data/1065022/
├── images_1065022.h5
├── images_1065022.pkl
└── metadata_1065022.pkl

0 directories, 3 files
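To walk the whole mount instead of one folder, something like the following could work. This is only a sketch: the helper `find_image_pickles` is hypothetical, and it assumes that every subfolder follows the same layout as the one shown above (a folder containing a single images_*.pkl file).

```python
import glob
import os


def find_image_pickles(root):
    """Map each subfolder name under root to the path of its images_*.pkl."""
    found = {}
    # Assumed layout: <root>/<folder>/images_<folder>.pkl
    pattern = os.path.join(root, "*", "images_*.pkl")
    for path in sorted(glob.glob(pattern)):
        folder = os.path.basename(os.path.dirname(path))
        found[folder] = path
    return found


# Usage (needs the mounted filesystem):
# pickles = find_image_pickles("/tmp/data")
# print(len(pickles), "folders with an images pickle")
```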
The filenames speak for themselves! The last two are Python pickles, but since pickles are fragile
across Python versions, the data is also provided in h5 format. We are providing pickles regardless
because we will provide a container with the proper Python versions for loading them, and the more
complicated h5 format is a backup of sorts. Of course, the entire generation can be reproduced if
these data structures fail. The file images_*.pkl contains a dictionary whose keys are the files in
the repository and whose values are lists of file segments. A file segment is an 80x80 section of
the file (the key) that has had its characters converted to ordinals. You can do this conversion in
Python as follows:
# Character to ordinal (number)
char = 'a'
number = ord(char)
print(number)
# 97

# Ordinal back to character
chr(number)
# 'a'
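To make the idea of a file segment concrete, here is a rough sketch of how text could be cut into 80x80 grids of ordinals. This is not the preprocessing code from the repository: the function `text_to_segments` is hypothetical, and truncating long lines at 80 characters and padding with spaces are assumptions on my part. It uses plain lists; wrapping a grid in numpy.array would give a 2D array.

```python
def text_to_segments(text, size=80, pad=" "):
    """Split text into size x size grids of character ordinals.

    Assumptions (not confirmed against the original preprocessing):
    lines longer than `size` are truncated, and short lines and short
    final segments are padded with `pad`.
    """
    lines = text.splitlines() or [""]
    segments = []
    for start in range(0, len(lines), size):
        block = lines[start:start + size]
        # Pad the block so that it has exactly `size` rows
        block += [""] * (size - len(block))
        grid = [[ord(c) for c in line[:size].ljust(size, pad)]
                for line in block]
        segments.append(grid)
    return segments


example = text_to_segments("# My README\nprint('hello')\n")
# One segment, 80 rows, 80 columns
print(len(example), len(example[0]), len(example[0][0]))
# First four ordinals of the first row
print(example[0][0][:4])
```

A file with more than 80 lines yields more than one grid, which is why each key in the pickle maps to a list.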
Here is how you would load and look at an image.
import os
import pickle

image_pkl = os.path.abspath('/tmp/data/1065022/images_1065022.pkl')

# Open in binary mode and close the file handle when done
with open(image_pkl, 'rb') as filey:
    images = pickle.load(filey)
Remember, this entire pickle is for just one repository that is found in a record from Zenodo! If you look at the images “keys” you will see that each one corresponds to a file in the repository.
for script in images.keys():
    print(script)

afmcl-MUSE-gas-velocities-5d4087c/README.md
afmcl-MUSE-gas-velocities-5d4087c/musevel.py
afmcl-MUSE-gas-velocities-5d4087c/musevel_run.py
It follows, then, that if we index images for a particular key, we are going to find images! Specifically, we will find a giant list of 80x80 images, where each image is a 2D numpy array with characters converted to ordinal (as we showed above).
images['afmcl-MUSE-gas-velocities-5d4087c/README.md']
[array([[ 35,  32,  77, ..., 120, 101, 108],
        [ 32,  32,  32, ...,  32,  32,  32],
        [ 32,  32,  32, ...,  32,  32,  32],
        ...,
        [ 84, 104, 101, ...,  40,  91,  78],
        [ 32,  32,  32, ...,  32,  32,  32],
        [ 80, 108, 101, ...,  32,  32,  32]])]
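If you want to read an image as text again, you can reverse the conversion with chr. The helper below is a small sketch (the name `segment_to_text` is mine). It assumes that trailing spaces on each row are padding and strips them, which may also remove trailing whitespace that was in the original file.

```python
def segment_to_text(segment):
    """Convert a 2D grid of ordinals back into text, one line per row."""
    lines = []
    for row in segment:
        # int() accepts both Python ints and numpy integer types
        line = "".join(chr(int(value)) for value in row)
        # Assumption: trailing whitespace is padding, so drop it
        lines.append(line.rstrip())
    return "\n".join(lines)


# Usage with the pickle loaded above:
# first = images['afmcl-MUSE-gas-velocities-5d4087c/README.md'][0]
# print(segment_to_text(first))
```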
The above example has only one image because the file is under 80 lines long.
This is a tiny repository, and all of its scripts are under 80 lines! However, if we look at scripts in other repos, we find longer files, and thus more than one 80x80 image per file:
image_pkl = os.path.abspath('/tmp/data/1122533/images_1122533.pkl')
with open(image_pkl, 'rb') as filey:
    images = pickle.load(filey)

# Number of 80x80 segments for this one file
len(images['kourami-0.9.6/resources/Notes_on_DRB5_gen.txt'])
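To see at a glance which files in a repository are long, you could count the segments for every key. This is a minimal sketch with a hypothetical helper name, and it assumes only what is described above: `images` is a dictionary mapping file names to lists of segments.

```python
def count_segments(images):
    """Return {file name: number of segments}, largest first."""
    counts = {name: len(segments) for name, segments in images.items()}
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered)


# Usage with a loaded pickle:
# for name, count in count_segments(images).items():
#     print(count, name)
```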
The above is pretty annoying, so we've provided loading functions to make it easy to get started, along with code in the repository (https://github.com/vsoch/zenodo-ml/tree/master/preprocess) that uses the loaded data for various kinds of interesting analyses. Below, you will find code and (sometimes) a writeup with images alongside. This is a work in progress, so please contribute here or to the repository if you want to add content! There are many interesting questions that we can answer with this data!
1.count_extensions.py (code) (writeup)
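As a rough idea of what counting extensions can look like for a single repository, here is a sketch that works on the keys of a loaded images dictionary. It is not the count_extensions.py script from the repository, and the "(none)" label for files without an extension is my own convention.

```python
import os
from collections import Counter


def count_extensions(images):
    """Count file extensions among the keys of an images dictionary."""
    counts = Counter()
    for name in images:
        extension = os.path.splitext(name)[1].lower()
        # Files such as "Makefile" have no extension
        counts[extension or "(none)"] += 1
    return counts


# Usage with a loaded pickle:
# print(count_extensions(images).most_common())
```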
Thanks for reading! If you have other questions, or want help for your project, please don’t hesitate to reach out.