Dinosaur Datasets Open Source Discovery

Dinosaur Datasets

Open source, creative datasets for discovery in science.

This is an open source series of organized, high-quality datasets, ready to go for machine learning use! For each dataset in the dinosaur dataset series, we parse the data for you and show you how to use it, so that you can get straight to doing awesome research. We know that our users spend huge amounts of time simply obtaining and organizing their data, so our goal is to anticipate interesting research questions, generate general datasets that might answer them, and provide them to you.

What is this initiative about?

This initiative aims to counter a reality of academic research: the need to publish, and to publish "front page headline" work, leaves little incentive to put time and effort into organizing and disseminating the interesting data that drives such discovery. When interesting data does exist, its access is usually restricted, or there simply isn't the time or resources required to organize, document, and disseminate it. While this model is optimal for the individual, it's not optimal for discovery. It is not logical to covet a dataset for a small group of individuals when the dataset could be shared freely with thousands for speedy discovery. The dinosaur datasets hosted and provided by Stanford Research Computing will turn this model upside-down. Instead of treating data curation and sharing as an afterthought, we are making it our first priority.

What kind of datasets will I find here?

We will provide datasets that are generated specifically to be creative, interesting, and open source. Each dataset will:

- come with complete documentation for usage and example tutorials.
- carry a digital object identifier and version so that if you use it in your work, others will be able to reproduce it.
- be provided in standardized formats to plug into your tooling of choice.

These datasets will be generated entirely from open source resources, sometimes by combining more than one resource in an interesting way. This means that you should not expect to find any personal health information or similar.

How can I contribute?

If you have a messy dataset that you'd like to open source, we welcome you to reach out! Sharing a dataset can mean something as small as a record here and a shared file on Dropbox, or as large as a data paper and an API to serve it. We will help you to transform your dataset into such a resource, and even do some "proof of concept" analyses (yes, we have researchers on our team too!).

How do I get started?

Instructions for use are provided with each dataset, along with suggestions for the kinds of questions you might answer using the data. You can expect more detail as we develop our first datasets. For now, please post questions and issues here, and we look forward to sharing our first dinosaur dataset with you!


Datasets

Hospital Chargemasters

Parsed descriptions and charges for over 100 U.S. hospitals

Check it out

Wikipedia Equation Embeddings

~63K Equation Embeddings for Wikipedia Statistics and Math Articles

Check it out

Singularity Hub Container Guts

Extraction of ~265 Singularity Container Guts

Check it out

Dockerfiles

Approximately 130K Dockerfiles

Check it out

Zenodo Machine Learning

Code from ~10K Zenodo software repositories

Check it out