Reproducible Impossible

Is reproducibility possible? The effort to “freeze everything so we can do it again” exactly the same way mirrors how science is practiced, not how the natural world works. Let me explain, starting with how we understand data.

Evolution of Data?

We understand data as CSV files, images, or biological metrics captured at a point in time. In this scoped representation of our world, reproducibility is logical and tangible. We measure our world like this because it’s the most logical thing to do, and we have to scope the scale at which we measure to one that is reasonable. If you ask me to collect features of a human, I’m very unlikely to think about the constantly changing signaling of every pathway, or the movement of every gut microbiome fish. I’m likely going to give you some information from a standard blood test, a height and weight, and some behavioral and physical characteristics. If I’m just interested in studying gut bacteria, I will zoom in and scope what I collect to, ahem, that area, but I’d bet you I still don’t have a stream of change over time. I have a few timepoints from limited samples at best.

It’s overwhelming to think of the sheer size of the constant feed of “living data” that really does exist. Even our sensory systems are optimized to ignore the majority of it, or else we would be unable to function. Now consider whether it’s even possible to build a capture device that truly takes in everything. I think we would run out of resources almost immediately.

I’m not saying that our current representation of data is wrong. It reflects our limitations and our need to encapsulate phenomena into things that can be measured. I do think, however, that we must try to push our thinking away from static data frames and timepoints and toward infinite streams of living, changing signals. How do you plug that into a CSV file or data frame? You don’t. We invent new methods that optimize processing of real-time streams.
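To make the contrast with a static data frame concrete, here is a minimal sketch of what processing a stream can look like. It is only an illustration, not a method from this post: it assumes a single numeric signal and uses Welford's online algorithm to keep a running mean and variance without ever storing the stream. The names `RunningStats`, `summarize`, and `heart_rate_feed` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Running mean and variance that never stores the stream itself."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's online update: one pass, constant memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until two values have arrived.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream: Iterable[float]) -> Iterator[RunningStats]:
    """Yield the (same, updated) summary object after every measurement."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
        yield stats


def heart_rate_feed() -> Iterator[float]:
    # Hypothetical stand-in for a live sensor; a real feed would never end.
    yield from [62.0, 64.0, 61.0, 90.0, 63.0]


if __name__ == "__main__":
    for stats in summarize(heart_rate_feed()):
        print(stats.count, round(stats.mean, 2), round(stats.variance, 2))
```

The point of the design is that the summary is always current and the memory cost does not grow with the length of the stream, which is the opposite trade-off from loading a frozen table and computing over it once.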

Evolution of Technology

The problem that I see with a general effort to save every little library, programming language, and scientific pipeline is the fact that some things need to die. The crappy code, and in fact most of our work, is pretty terrible and will definitely be replaced with something better. This is the human need - to constantly re-invent and re-optimize and follow the desire for innovation (I would argue that following this desire, and the general idea of “technological innovation,” doesn’t necessarily mean better quality of life, but I’ll save that for another discussion). In some reality where every software package and thing were perfectly frozen and maintained, we would essentially just have more noise to sift through, and technological evolution would need to discover new means to evolve, or just evolve more slowly and with more noise. It’s no different from human evolution. If you think about it, natural selection is almost non-existent in our world. Let’s think about eyesight. Without glasses, every person out there who needs them would very likely be eaten by a more important animal, and thus not propagate their genes. Now all of us blind mole rats are jaunting around and prospering. Now think about the more serious medical conditions that people survive thanks to advances in medicine. In the same way that natural selection is broken by our advances in medicine, it’s also somewhat broken under some theoretical reality in which every technology must be perfectly preserved.

Imagine if we applied this same logic to organisms. Reproducibility means you clone something and it’s exactly the same. This sort of makes natural selection obsolete and does away with any interesting changes that result from random genetic variation. For what goal? Would we want the organism to live on much longer than it would have otherwise? If we had the resources in terms of land and environments for all things to thrive, then why not. It would be interesting to run by a dodo bird or a woolly mammoth, or even better, a dinosaur! But what are the costs of ensuring the survival of everything? It means that another organism suffers directly or indirectly. I definitely would be eaten by a dinosaur. Whether we are aware of it or not, the whole “survival of the fittest” rule needs to work. Pandas are a great example. They are relatively stupid and probably would go extinct on their own, but their sheer cuteness has humans trying to continue the species. It might be for the generation of YouTube videos of pandas acting stupid. Some other species might benefit if they went away, and the world would change. As humans we think it’s up to us to decide these things. I disagree. There are more species than we know that have existed and then gone away, and that’s the way that it works. It will happen to our species eventually, because stars eventually explode. If pandas don’t want to have sex, it’s not going to work out, so let them eat bamboo in peace. If an analysis is sort of crappy, don’t spend so much time and energy trying to save and capture it. The best ones will demonstrate utility, and be used and live on regardless.

Reproducibility Impossible?

To play devil’s advocate, let’s pretend that it were possible to preserve the recipe for a panda, and then “run” it as it was originally conceived to make a baby panda. In scientific programming this means handing a container with dependencies and a dataset to your colleague and having them run it. Even then, reproducibility is unlikely to be maintained in the long term, because there is always going to be one more level up of an “external environment” that we cannot control. The bases of our technology are always changing. For example, no matter how advanced science is, an organism “frozen” to function during some kind of ice age condition is probably not going to work the same when the ice melts, if it even lives. No matter how great your Linux container is, in 100 years when we use some other form of energy to communicate with robots, your old crusty container isn’t so usable.

This brings up two ideas. Firstly, the time frame of 100 years is important. If I said ten years, it’s pretty reasonable to expect that most things will still work. This calls for work toward a better understanding of the lifespan of a technology or software. I don’t understand why more work doesn’t go into the qualities and attributes of programs and ideas that are long lived. I want to see work done that explores software from the past decade, and shows us how the time frame is important.

Secondly, given these time frames, we might instead suggest that reproducibility is some kind of metric that is short lived, and eventually expires. I don’t want to know if work is reproducible, because for most of it that cannot be guaranteed. I want to know if its fitness and utility are good enough to stand the test of time. Might we call this durability? If the work passes tests for survival of the fittest, it evolves into something else. If not, then it stops being used and is lost in the empty stygian abyss we call the internet.

Changes in Technology

From the above, it’s clear that our understanding and evaluation of data and software is changing, and will continue to change. While I think that reproducibility is not possible, given the likelihood that software and technology have utility and duration over a few decades, it’s a reasonable goal to want for the short term, and it doesn’t matter whether it’s possible if it is helpful even in imperfection. For the long term, however, the term that inspires me more is this idea of evolving durability, impact, and footprint. In the same way that we evolve, software does too, and the only certainty is that software of the future will be different from today. This doesn’t mean that all work is lost or useless, because the best from the present turns into the base of the future, and the things that aren’t so great die out. We learn from our successes and our mistakes, and we need natural selection in computational science to continually strive to do better.

Future Predictions

The great thing about the fast-paced development of new technologies is that, even if they aren’t intended to further science and academia, they usually trickle down to us and (eventually) have an impact. Based on the above, and from the perspective of a research software engineer (RSE), I can foresee the following:

Living data: We might someday have a world where researchers can focus on asking questions, and research software engineers deliver them data.
Communication: exchange between cluster resources, clouds, and programs is essential. There must be a much bigger focus on standards for inputs, outputs, and general communication.
Publication: models that are slow and capture a limited, frozen point must be replaced.

Changes in Data

I am excited by the idea of a future in which data is open and freely available, and not hard to come by. In that future, producing technology that can reliably deliver a data feed is as respected an effort as running a model over it.

Changes in Publication

The current model of publication dates to a time when people picked up (actual) journals and read them. I hope to see the current publication process of a “result” replaced with the idea of publication of a hypothesis or idea. If you can define a question, a way to access it, and a definition of what kind of data can answer it, we can feed in discovered data to continually update our knowledge about the world.
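As a rough sketch of what “publishing a hypothesis” and then feeding it data could look like, consider the toy below. Everything in it is an assumption made for illustration: the post does not propose a statistical method, so this swaps in a simple Beta-Bernoulli update, treats each piece of discovered data as either supporting or not supporting the hypothesis, and starts from a uniform prior. The class `LivingHypothesis` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LivingHypothesis:
    """A published question whose belief is revised as data arrives."""
    question: str
    supporting: float = 1.0     # Beta prior alpha (uniform prior, an assumption)
    contradicting: float = 1.0  # Beta prior beta (uniform prior, an assumption)

    def observe(self, supports: bool) -> None:
        # Conjugate Beta-Bernoulli update: each observation adds one count.
        if supports:
            self.supporting += 1
        else:
            self.contradicting += 1

    @property
    def belief(self) -> float:
        # Posterior mean probability that a new observation supports the idea.
        return self.supporting / (self.supporting + self.contradicting)


if __name__ == "__main__":
    # Hypothetical question and hypothetical observations, for illustration only.
    idea = LivingHypothesis("Does the signal rise after the intervention?")
    for supports in [True, True, False, True]:
        idea.observe(supports)
        print(round(idea.belief, 3))
```

The published object here is the question plus its update rule, not a frozen result, so the current belief is simply whatever the data seen so far supports.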

Changes in Culture

Our current research culture defines a model that assumes one or more snapshots in time moving through a frozen pipeline and finishing with a result that is the tiniest snapshot of the actual world. It’s painful to realize the hoops that researchers must jump through to share these tiny results, and then that doing the same experiment twice on different samples is almost unheard of. We must change our model of the world, and the questions that we ask of it, to be more about signals than static points in time or knowledge. The need for data as signals will follow, and then the way to publish and share those signals. Once this AI obsession dies down, I’m much more excited for the day (someday) when there is an obsession over the things that I obsess about - data structures, tools, and the connections between them. As an engineer, I want to build these things that have never existed before. We don’t really care about reproducibility - we care about durability, impact, and the short-lived fulfillment that each of our efforts might help someone else in the tiniest way.

Party on, party dinosaurs.

Suggested Citation:

Sochat, Vanessa. "Reproducibie Impossible." @vsoch (blog), 28 Nov 2017, https://vsoch.github.io/2017/reproducible-impossible/ (accessed 01 Jul 25).
