
Science’s Reproducibility Crisis: How FOSS Hackers Can Help

Most scientists agree: many published results are hard to reproduce. Can open source practices help fix the problem?

Software Heritage protects the software of yesterday, today, and tomorrow. | National Archives at College Park, Public domain, via Wikimedia Commons

A few weeks ago, I came across a post on the website AI as Normal Technology about the sad state of scientific research. Titled “Could AI Slow Science?”, it specifically mentions the lack of open source attitudes as one way in which that could happen. In a section headed “Science is not ready for software, let alone AI”, the writers argue that:

  • Most scientists are notoriously poor software engineers, who by and large either ignore, or apply haphazardly, basic practices like automated testing, version control, and following programming design guidelines.
  • More and more often, scientific papers are merely summaries of research that happened inside some software, because it wasn’t possible otherwise. In spite of this reality, scientists and researchers very seldom release the software they used, and standard peer review procedures typically don’t evaluate the software, but focus only on the published results.
  • This greatly contributes to what has been called the reproducibility crisis of science: if you can’t rerun the same software a scientist used, you can’t confirm their results, or build on them.
  • AI makes things worse, because AI itself, even when trained on fully known datasets, is hardly reproducible, operating in statistics-based ways that not even its developers fully understand.

Software Reproducibility Needs Real Archives, Not Platforms

Is open source literacy really as scarce in science as the article illustrates, and if so, how can we fix that? To find answers I reached out to Stefano Zacchiroli, a former Debian project leader and co-founder of Software Heritage, the international nonprofit that works with UNESCO to collect, preserve, and share publicly available software source code.

The first thing Zacchiroli pointed out is that these days society as a whole needs software reproducibility, not just scientists, because practically all advances in science and industry happen through software. In a saner world, reproducibility would be a mandatory component of the supply chain of every product that is or includes software.

As far as science is concerned, he said, this means that researchers must publish and preserve all the software they use in ways that really allow others to run it exactly as they did.

Unfortunately, according to Zacchiroli, very few people (including many FOSS developers) understand that project repositories on platforms like GitHub cannot be a solution. Links to such pages add little or nothing to reproducibility, even when they do specify which version of that software was used, which seldom happens in scientific papers. The problem, of course, is that not only can entire projects vanish from a platform at any moment, but even the commits it hosts may change or disappear by the time someone needs to study or recompile them.

A much bigger problem, he said, is that by definition, true reproducibility of research — or anything else done with software — needs much more than the specific program used to perform that research. In order to really reproduce some experiment or analysis, you must be sure you are running exactly the same software as the original researchers, in exactly the same environment in which it was originally used.

Strictly speaking, this means that even the computer you’d run the software on should be identical to the one of the original researchers, but since that is almost always impossible, let’s concentrate on software.

Having the same environment doesn’t just mean running exactly the same versions of everything, down to the libraries and kernel: you should also be certain that you do have the right versions, which in turn means being able to rebuild all those pieces of software from their original, unaltered sources. In case you missed it, this is one more argument for why all scientists should only use open source software. The problem is, how can they do all this, and what would be needed?
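To make the "same versions of everything" requirement a bit more concrete, here is a minimal sketch of how a researcher working in Python might record the environment an analysis ran in. It is only an illustration under my own assumptions: the function names `environment_manifest` and `manifest_digest` are hypothetical, and a list of version numbers is a first step at best, since it does not by itself guarantee that those versions can be rebuilt from unaltered sources.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_manifest() -> dict:
    """Collect interpreter, OS, and installed package versions (illustrative)."""
    packages = sorted(
        {
            (dist.metadata.get("Name"), str(dist.metadata.get("Version")))
            for dist in metadata.distributions()
            if dist.metadata.get("Name")  # skip broken installs without a name
        }
    )
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "release": platform.release(),  # the kernel release on Linux
        "machine": platform.machine(),
        "packages": [{"name": n, "version": v} for n, v in packages],
    }


def manifest_digest(manifest: dict) -> str:
    """Hash a canonical JSON form so two manifests can be compared quickly."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    manifest = environment_manifest()
    print(json.dumps(manifest, indent=2))
    print("digest:", manifest_digest(manifest))
```

Publishing such a manifest next to a paper would at least tell readers which versions to look for; whether they can still obtain and rebuild them is the part that needs an archive.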

Enter Software Heritage

Some years ago, Zacchiroli and others started the Debian Sources project as an online service and archive that provides access to all the source code for every package included in the official Debian Linux distribution. Later, when they realized that no general purpose version of that project existed, they founded Software Heritage as an attempt to archive and share all software that is publicly available in source code form.

Before explaining how Software Heritage works, Zacchiroli was very keen to point out that, as important as reproducibility in science is, the project's goals and impact go beyond that.

On one hand, open source archives are, together with open data and open access, the third pillar of truly open knowledge, and Software Heritage is made to order to preserve the software part of human culture for future generations, which is why it is supported by UNESCO. On the other hand, there are thousands of proprietary products these days running at least some open source libraries. Being able to certify the full supply chain of all those components is a huge bonus for every industry, not to mention that it could even solve prior art disputes, and Software Heritage offers just that.

As Zacchiroli put it, really complete, really reliable software bills of materials for every industry that uses any open source software in its products are exactly what Software Heritage implemented, before the term SBOM existed.

Finally, Zacchiroli showed me a side of the organization I had completely missed before, namely that it lets functional package managers for Linux like Nix or Guix do things that would otherwise be impossible: Guix, for example, can retrieve source code directly from Software Heritage if, for any reason, its original, official repository isn’t available.
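The general idea behind that kind of fallback can be sketched in a few lines of Python. This is not how Guix implements it; it is a simplified illustration in which the fetchers are hypothetical callables (one standing in for the upstream repository, one for an archive) and the content is accepted only if it matches a checksum recorded in advance.

```python
import hashlib
from typing import Callable, Iterable

Fetcher = Callable[[], bytes]  # hypothetical: any function returning the source bytes


def fetch_with_fallback(fetchers: Iterable[Fetcher], expected_sha256: str) -> bytes:
    """Try each source in order; accept only content matching the checksum."""
    errors = []
    for fetch in fetchers:
        try:
            data = fetch()
        except OSError as exc:  # network failures, missing files, and so on
            errors.append(exc)
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        errors.append(ValueError("checksum mismatch"))
    raise RuntimeError(f"no source delivered the expected content: {errors}")


def upstream() -> bytes:
    # Stand-in for an official repository that has disappeared.
    raise ConnectionError("upstream repository is gone")


def archive() -> bytes:
    # Stand-in for an archive that still holds the original bytes.
    return b"print('hello')\n"


if __name__ == "__main__":
    checksum = hashlib.sha256(b"print('hello')\n").hexdigest()
    source = fetch_with_fallback([upstream, archive], checksum)
    print(source.decode())
```

The checksum is what makes the fallback trustworthy: it doesn't matter where the bytes come from, as long as they are provably the same bytes.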

How Does All This Happen?

A detailed explanation would be well beyond the scope of this article, and is available anyway on Software Heritage’s website and in a paper by Zacchiroli and others. Here, we can just say that every piece of source code that enters the archive gets a unique identifier that allows everybody using the archive to:

  • certify the origin of each source file
  • declare and know all the dependencies between packages hosted in Software Heritage
  • check the full development history of every program or library
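For the curious, those identifiers are called SWHIDs (SoftWare Hash IDentifiers), and they are derived from the content itself rather than assigned by a registry. As far as I understand the specification, the identifier of a single file uses the same hash Git computes for a blob, so it can be reproduced with a few lines of Python. The function name `content_swhid` is my own, and the sketch covers individual files only, not directories, revisions, or releases, which have their own identifier types.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content (same hash as a Git blob)."""
    # Git hashes the header "blob <length>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
    # swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the identifier depends only on the bytes, anyone holding a copy of a file can check that it is the one a paper or a bill of materials refers to, without having to trust whoever handed it over.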

Using that data, it’s possible to attach to every paper or product that used any open source software something that Zacchiroli calls a “replication package”: that is, a bundle of source code and metadata that lets everybody else reproduce an experiment, or check how some product works and was developed, in the most accurate way possible.

A Call to Action and Cooperation for Scientists and Hackers

No matter how good an archive is, it does little good if its use doesn’t become part of the daily routine.

Getting there will undoubtedly be hard for many scientists and researchers, and they will need all the help they can get. That’s why Zacchiroli concluded our conversation with a call to action, which I warmly approve, for all FOSS hackers and advocates:

  • First, lead by example: make sure that all your code will always remain available, by adding it to Software Heritage.
  • Second, adopt a researcher! Reach out to the closest university and offer your help to improve the quality of their research by making it reproducible.
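For the first item, Software Heritage offers a "Save Code Now" form on its website, and the same request can be made through its Web API. The sketch below assumes the endpoint layout `/api/1/origin/save/<visit type>/url/<origin URL>/` as I understand it from the public API documentation, so check the current documentation before relying on it. The function names and the example repository URL are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the Save Code Now endpoint for a repository (assumed layout)."""
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


def request_archival(origin_url: str, visit_type: str = "git") -> dict:
    """Ask the archive to visit the repository; performs a real network call."""
    request = urllib.request.Request(
        save_request_url(origin_url, visit_type), method="POST"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical repository URL; replace it with your own project.
    print(save_request_url("https://example.org/me/my-analysis.git"))
```

Only `request_archival` touches the network, so the URL can be inspected first. Requests may be rate limited or queued for review, so an immediate snapshot is not guaranteed.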

Of course, if you are already doing that, let us know in the comments.