Zenodo Machine Learning


31 May 2018

Summary

The Zenodo-ML dataset is a collection of just under 10K records from Zenodo, a service that generates digital object identifiers (DOIs) for software and associated digital resources. In human terms, this means that someone writes a codebase for their software and links it to Zenodo so others can find and cite it. For this dataset, it means that we can find these codebases and study them.

What can I learn from this dataset?

Here are some analysis ideas to get you started! These are questions that we at Research Computing are interested in and working on, but we can’t do it alone. We want to empower you to help!

Software in the Context of Image Analysis

First, let me give you a high-level view of how this data can be understood. Software, whether compiled or not, is a collection of scripts. A script is a stack of lines, and you might be tempted to relate it to a document or a page in a book. Actually, I think we can conceptualize scripts more like images. A script is like an image in that it is a grid of pixels, each of which has some pixel value. In medical imaging we may think of these values in some range of grayscale, and with any kind of photograph we might imagine having different color channels. A script is no different: the empty spaces are akin to values of zero, and the various characters are akin to different pixel values. While it might be tempting to use methods like Word2Vec to assess code in terms of lines, I believe that the larger context of the previous and next lines is important too.
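To make the "script as image" idea concrete, here is a minimal sketch of how text could be cut into square grids of character ordinals. It is not the project's preprocessing code. The function name `script_to_images` is hypothetical, and truncating long lines and padding with spaces (ordinal 32) are assumptions made for illustration.

```python
def script_to_images(text, size=80, pad=" "):
    """Split text into size x size grids of character ordinals."""
    # Assumption: long lines are truncated and short lines padded with spaces.
    lines = [line[:size].ljust(size, pad) for line in text.splitlines()]
    images = []
    for start in range(0, len(lines), size):
        block = lines[start:start + size]
        # Pad the final block with blank rows so every grid is square.
        block += [pad * size] * (size - len(block))
        images.append([[ord(char) for char in row] for row in block])
    return images


images = script_to_images("import os\nprint(os.getcwd())\n")
print(len(images), len(images[0]), len(images[0][0]))  # 1 80 80
print(images[0][0][:6])  # ordinals for "import"
```

Each grid keeps neighboring lines together, which is the context that a line-by-line method would lose.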

This project aimed to:

  1. Identify a corpus of scripts and metadata (Zenodo “software” bucket)
  2. Preprocess to derive images, metadata, and structure trees from the data
  3. Generate features of the scripts, or make associations between metadata and script content with deep learning.

Interesting Questions

Here are some ideas for questions that this dataset could help answer:

Generation

The dataset was generated by querying the Zenodo API for records, downloading each record’s archived code (a compressed archive, falling back to the original GitHub repository), and then running a script to process the scripts into 80x80 images, generate a file hierarchy tree, and save each repository into several data formats for your use. This process was run in parallel on a SLURM cluster at Stanford (Sherlock), and it would also be possible to run locally, although it would take much longer! In both cases, if you want to re-generate the dataset or update it with newer records, the process is completely reproducible with the code that is provided in the GitHub repository. This page will review assumptions and details about the processing, and instructions for obtaining and using the dataset.

Assumptions

  1. We use an “image size” of 80 by 80, under the assumption that the typical editor / programming language prefers lines of max length 80 (see Python’s PEP 8 specification) and most machine learning algorithms prefer square images.
  2. We filter the files down to those less than or equal to 100,000 bytes (100 KB, or 0.1 MB). This still leads to having on the order of a few thousand images (each 80x80) for one small script.
  3. We filter down the Zenodo repos to the first 10K within the set of the bucket called “software.”
  4. I filtered out repos related to “gcube,” which were strangely common.
  5. We take a greedy approach in parsing files: if a single file produces some special error, we skip it in favor of continued processing of the rest.
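Assumptions 2 and 5 can be sketched together as a small file walker. This is an illustration only, not the code in the repository. The name `read_small_files` is hypothetical, and treating files that fail UTF-8 decoding as "special errors" is an assumption.

```python
import os

MAX_BYTES = 100000  # size cutoff from assumption 2


def read_small_files(root, max_bytes=MAX_BYTES):
    """Yield (relative_path, text) for each small, readable file under root."""
    for folder, _, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            try:
                if os.path.getsize(path) > max_bytes:
                    continue  # too large, skip
                with open(path, "r", encoding="utf-8") as handle:
                    text = handle.read()
            except (OSError, UnicodeDecodeError) as error:
                # Greedy approach: report the problem file and keep going.
                print("Skipping %s: %s" % (path, error))
                continue
            yield os.path.relpath(path, root), text
```

Calling `dict(read_small_files("/path/to/repo"))` on an extracted repository would then give a mapping from file path to text, with oversized and unreadable files left out.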




Tutorial

Download Data

The original data lives on the Sherlock cluster, so if you are a Stanford affiliate you can ask Research Computing directly (see ? in bottom right of page!). If not, you can download a squashfs filesystem to work with the data. This is over half a terabyte of data compressed into a 60GB squashfs archive, which you can mount with FUSE. I haven’t gotten this working in a container yet, but I’ll show you how to mount it on your own machine.

wget https://storage.googleapis.com/dinosaur-datasets/zenodo-ml/zenodo-ml.sqsh

Once downloaded, check the file size and md5 sum.


$ ls -l zenodo-ml.sqsh
-rw-r--r-- 1 vanessa vanessa 20658548736 Jun  1 01:31 zenodo-ml.sqsh

$ md5sum zenodo-ml.sqsh
bff9f8ca3632fa7372f0b9e440b85c5a  zenodo-ml.sqsh
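If md5sum is not available on your system, the same check can be done from Python. This is a minimal sketch using the standard library. The helper name `md5_of_file` is hypothetical, and the expected value is the md5 sum listed above.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED = "bff9f8ca3632fa7372f0b9e440b85c5a"  # md5 sum listed above

# Usage, once the download has finished:
# assert md5_of_file("zenodo-ml.sqsh") == EXPECTED
```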

Mount with Sudo

Once you have downloaded the data, you need to mount it to a folder on your computer. If you have sudo, this is fairly easy to do!

$ mkdir -p /tmp/data
$ sudo mount -o loop zenodo-ml.sqsh /tmp/data

Mount without Sudo

If you have a FUSE filesystem then you can use squashfuse to mount the volume without needing sudo. First, here are the dependencies and installation steps for squashfuse.

#!/bin/bash

apt-get update && apt-get install -y fuse libfuse2 git zlib1g-dev \
                      autoconf libtool make gcc pkg-config xz-utils \
                      libfuse-dev liblzma-dev squashfs-tools

git clone https://github.com/vasi/squashfuse
cd squashfuse
libtoolize --force
aclocal
autoheader
automake --force-missing --add-missing
autoconf
./configure --with-xz=/usr/lib/ --prefix=/usr/local
make
# make install

Note that you don’t have to run make install; you can run squashfuse from the folder where you built it. Once it is built, issue a command similar to the one before, but with the squashfuse executable instead. The mount point (here /tmp/data) must already exist.

$ squashfuse zenodo-ml.sqsh /tmp/data

Unmount without Sudo

Unmount with fusermount -u if you mounted with squashfuse, or with umount if you mounted with sudo.

$ fusermount -u /tmp/data   # if mounted with squashfuse
$ sudo umount /tmp/data     # if mounted with sudo

This isn’t a completely reproducible way to share the data, and it certainly will take some time to download. We will update the instructions here when we figure out how to serve this dataset more efficiently. This is about half a terabyte for 10K code repositories, so it’s non-trivial.

Loading Data

Once the filesystem is mounted, you can explore it like a traditional filesystem. Let’s take a look at the contents of one of the subfolders under the mount point:

tree /tmp/data/1065022/

/tmp/data/1065022/
├── images_1065022.h5
├── images_1065022.pkl
└── metadata_1065022.pkl

0 directories, 3 files

The filenames speak for themselves! The last two are Python pickles, but since pickles are fragile across Python versions, the data is also provided in h5 format. We provide pickles anyway because we will provide a container with the proper Python version for loading them; the more complicated h5 format is a backup of sorts, and of course the entire generation can be reproduced if these data structures fail. The file images_*.pkl contains a dictionary whose keys are files in the repository and whose values are lists of file segments. A file segment is an 80x80 section of the file (the key) that has had its characters converted to ordinals. In Python, the conversion looks like this:

# Character to ordinal (number)
char = 'a'
number = ord(char)
print(number)
# 97

# Ordinal back to character
print(chr(number))
# a

Here is how you would load and look at an image.

import pickle
import os

image_pkl = os.path.abspath('/tmp/data/1065022/images_1065022.pkl')

# Open in binary mode and close the file once the pickle is loaded
with open(image_pkl, 'rb') as handle:
    images = pickle.load(handle)

Remember, this entire pickle is for just one repository that is found in a record from Zenodo! If you look at the images “keys” you will see that each one corresponds to a file in the repository.

In [9]: for script in images.keys():
   ...:     print(script)
   ...:     
   ...:     
afmcl-MUSE-gas-velocities-5d4087c/README.md
afmcl-MUSE-gas-velocities-5d4087c/musevel.py
afmcl-MUSE-gas-velocities-5d4087c/musevel_run.py

It follows, then, that if we index images for a particular key, we are going to find images! Specifically, we will find a giant list of 80x80 images, where each image is a 2D numpy array with characters converted to ordinal (as we showed above).

images['afmcl-MUSE-gas-velocities-5d4087c/README.md']
[array([[ 35,  32,  77, ..., 120, 101, 108],
        [ 32,  32,  32, ...,  32,  32,  32],
        [ 32,  32,  32, ...,  32,  32,  32],
        ..., 
        [ 84, 104, 101, ...,  40,  91,  78],
        [ 32,  32,  32, ...,  32,  32,  32],
        [ 80, 108, 101, ...,  32,  32,  32]])]

The above example only has one image, because the file was under 80 lines long.

len(images['afmcl-MUSE-gas-velocities-5d4087c/README.md'])
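Since each image is just a grid of ordinals, it can be turned back into readable text with chr. The following is a minimal sketch. The helper name `image_to_text` is hypothetical, and stripping trailing spaces assumes they are padding rather than content.

```python
def image_to_text(image):
    """Convert one grid of ordinals back into lines of text."""
    lines = ("".join(chr(int(value)) for value in row) for row in image)
    # Assumption: trailing spaces are padding, so strip them.
    return "\n".join(line.rstrip() for line in lines)


# A tiny stand-in for a real segment (real ones are 80x80).
example = [[35, 32, 72, 105], [32, 32, 32, 32], [111, 107, 32, 32]]
print(image_to_text(example))
```

With the pickle loaded as above, `image_to_text(images['afmcl-MUSE-gas-velocities-5d4087c/README.md'][0])` should print the start of the README.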

This is a tiny repository and all the scripts are under 80 lines! However, if we look at scripts in other repos, we find longer files, and thus more than one 80x80 image per file:

image_pkl = os.path.abspath('/tmp/data/1122533/images_1122533.pkl')
with open(image_pkl, 'rb') as handle:
    images = pickle.load(handle)

# Number of 80x80 images for this file
len(images['kourami-0.9.6/resources/Notes_on_DRB5_gen.txt'])

Helper Scripts

The above is pretty annoying, so we’ve provided loading functions to make it easy to get started, along with code in the repository (https://github.com/vsoch/zenodo-ml/tree/master/preprocess) that uses the loaded data for various kinds of interesting analyses. Below, you will find code and (sometimes) a writeup with images alongside. This is a work in progress, so please contribute here or to the repository if you want to add content! There are many interesting questions that we can answer with this data!

Extension Counts

1.count_extensions.py (code) (writeup)
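As a rough idea of what an extension count looks like, here is a minimal sketch that tallies file extensions from the keys of one loaded images dictionary. It is not the count_extensions.py script itself. The function name is hypothetical, and the example keys are copied from the repository listing shown earlier, with the image lists left empty.

```python
import os
from collections import Counter


def count_extensions(images):
    """Count file extensions among the keys of a loaded images dictionary."""
    counts = Counter()
    for path in images:
        extension = os.path.splitext(path)[1].lower()
        # Files without an extension (such as Makefile) get their own bucket.
        counts[extension or "(none)"] += 1
    return counts


# Keys copied from the example repository shown earlier; values omitted.
images = {
    "afmcl-MUSE-gas-velocities-5d4087c/README.md": [],
    "afmcl-MUSE-gas-velocities-5d4087c/musevel.py": [],
    "afmcl-MUSE-gas-velocities-5d4087c/musevel_run.py": [],
}
print(count_extensions(images).most_common())  # [('.py', 2), ('.md', 1)]
```

Summing these counters over every repository folder would give counts for the whole dataset.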

Other questions?

Thanks for reading! If you have other questions, or want help for your project, please don’t hesitate to reach out.