
Bringing Open Source to Scientific Research

From my non-tech part-time job at a local university library, I already knew that academia is behind the curve when it comes to IT. For starters, there’s the overreliance on Windows. Then there’s the use of poorly designed proprietary products when perfectly acceptable GPL solutions exist, not to mention the look of scorn and fright coming from the IT people whenever the term “free and open source” is uttered within their hearing.

Although I already knew there was a problem, I didn’t know how deep it ran until I spoke with GitHub’s Arfon Smith. It seems that academia’s inability to catch up with the twenty-first century even puts careers in jeopardy, especially in the sciences.

GitHub’s Arfon Smith
“…an early career post-doctoral researcher I know has a Python package that has about 100,000 downloads per month by his peers and others,” Smith explained. “To a tenure committee at a university, none of this matters — what matters is how many papers he writes and so he’s currently running the risk of not securing a permanent job, even though the work he does is of massive value to the research community.”

In the academic world it’s still “publish or perish,” and being published online usually doesn’t count for much. The tenure committees still pretty much define “publish” as something bound in paper and sent by snail mail.

Arfon Smith is a scientist with a resume longer than both of my arms. This resume includes such bullet points as co-founding Zooniverse and building DNA sequencing pipelines at the Wellcome Trust Sanger Institute. He’s been at GitHub since last October, where he uses his firsthand knowledge of the scientific process to help research scientists leverage the organization’s resources. When I spent about an hour on the phone with him a few weeks back, he tried to bring me up to speed on some of the problems with academia, and the reality of scientific research in these postmodern times.

In spite of his best efforts, I’m not certain I grok the situation in its entirety. Such is life. Smith didn’t appear to be discouraged by that, however, noting the subject to be “very nuanced.” He was concerned that I was, at least, “having fun.”

Research scientists write their own programs in order to do their work, which is something I didn’t know. I figured that when scientists need to run some numbers they just call in someone from IT, explain what needs to be done and put the IT dude or dudette on it. Evidently, that’s not how it works. When scientists need to run some numbers in a newfangled way for some newfangled results, they have to write the software themselves. In other words, marine biologists must not only be skilled in the subject of marine biology, they must have computer science skills as well.

Talk about having a lot on one’s plate.

One of the problems: the research biologists don’t get much credit for all this work unless it appears in a peer-reviewed journal. Not only that, it’s not necessarily easy for these biologists to share their work so that others can attempt to reproduce the results. Sure, papers can be posted online detailing a project’s methodology and results, but others will also need the computer applications used by the project in order to verify those results. It also helps if the code is open source, so others can make sure that the findings aren’t the product of poor programming or the like.

Unfortunately, some researchers don’t understand the importance of making the code they use to reach their conclusions available. What’s worse, even if they do understand, the administrators at most universities do not, and the publishers of the all-important scientific journals have no method in place for reviewing the code used in scientific research. The latter is particularly troublesome, because the publishing of peer-reviewed journals is a huge business and not likely to go away any time soon.

“Hard numbers are difficult to find, but all science publishing is something around $10 billion in revenue per year,” Smith explained. “The profit margins are also often very high, something around 30 to 50 percent.”

Needless to say, with that kind of money at stake the publishers are more than just a little reluctant to make changes, lest they derail the gravy train.

“Are they to blame?” asked Smith. “Certainly, at some level, but they’re also businesses with a revenue model to protect. I think if there was a good way to review software for academic use then, in all likelihood, these publishers would work out a way to do this. In fact, I’ve been talking with some very large publishers about exactly this.”

There is also work being done to get folks at the university end, both administrators and researchers, on track. For example, a five-year initiative was launched late last year with $37.8 million in funding from the Alfred P. Sloan and the Gordon and Betty Moore foundations. This initiative involves three universities (New York University; the University of California, Berkeley; and the University of Washington) pursuing three main goals aimed at remedying many of the issues currently hampering research:

  • Develop meaningful and sustained interactions and collaborations between researchers with backgrounds in specific subjects (such as astrophysics, genetics, economics), and in the methodology fields (such as computer science, statistics and applied mathematics), with the specific aim of recognizing what it takes to move each of the sciences forward;
  • Establish career paths that are long-term and sustainable, using alternative metrics and reward structures to retain a new generation of scientists whose research focuses on the multi-disciplinary analysis of massive, noisy, and complex scientific data and the development of the tools and techniques that enable this analysis; and
  • Build on current academic and industrial efforts to work towards an ecosystem of analytical tools and research practices that is sustainable, reusable, extensible, learnable, easy to translate across research areas and enables researchers to spend more time focusing on their science.

Smith told me it’s too soon to tell how successful this program will be, but added, “I’ve spoken with some involved and they’re pretty pleased with the progress….”

Organizations funding scientific research projects are also taking note of the problem and taking advantage of the power of the dollar to help encourage openness. For example, Wellcome Trust in the UK and the National Institutes of Health (NIH) in the U.S. have open-access requirements attached to the money they disburse, and recently they’ve been holding researchers to these requirements. In April, the scientific journal Nature reported that Wellcome Trust had withheld 63 grant payments when papers tied to the funds were not open-access, and that the NIH had delayed a number of continuing grant awards over noncompliance with its open-access policy.

Research Science in Academia
By GeneralDowd [Public domain], courtesy Wikimedia Commons.
This is helping to move things in the right direction. There are also other avenues open to researchers who need to make their work more available.

“Right now some people share on their personal websites on academic web domains; this is fairly common,” Smith explained. “Increasingly we’re seeing people share on GitHub and other code hosting platforms. Last year we at GitHub worked with Zenodo, a data archiver, to make it really easy for an academic to get a Digital Object Identifier (DOI) ‘citation code’ for their GitHub repository. This, in theory, makes it easier to track references to this DOI and therefore for researchers to gain credit for their work. I’m told by the Zenodo folks that this tool has been used more than 1,000 times in the last six months, which is great.”

Smith is certain that these efforts will benefit not just the sciences, but other areas of the university as well. “This is part of a wider shift in actually properly measuring a researcher’s impact in their research domain. While there might not be a huge amount of software in English departments (although there is some, for sure), if we move beyond just rewarding a researcher for publishing papers in one part of the academy, then this change will be good for all academics.”

Later this month, Arfon Smith will be addressing these issues, as well as discussing other ways that academia can benefit from adopting open source ideas, at the All Things Open conference. The conference will be held on October 22-23 at the Raleigh Convention Center in downtown Raleigh, N.C. Smith’s presentation, “What Academia Can Learn From Open Source,” is scheduled for October 22 at 11:15 a.m.

Said Smith, “Open source has a collaborative model that academia can really learn from.”


  1. Ben Lloyd, October 6, 2014

    Great article!

    There is also some work being done to use Docker to improve scientific reproducibility. Docker would allow researchers to create a package that includes all needed software, libraries, and dependencies. This would allow anyone to quickly replicate the testing environment and share their own modifications.

  2. Golodh, October 7, 2014

    First of all … a researcher’s job isn’t to write software packages (FOSS or no-FOSS), however well received, but to contribute to knowledge. And we still have no better way of measuring that than through publications in peer-reviewed journals.

    Secondly, downloads are a difficult way of measuring contribution and therefore usually, and appropriately, carry little to no weight.

    Oh, and that sneer about snail-mail, that’s just coincidental. It so happens that most reputable peer-reviewed journals are still linked with paper editions (despite the fact that almost all journals are also available electronically).

    If someone’s software package is so good, he/she can publish a description of how it works and what it does in a peer-reviewed paper and ask for a mention by those who use it.

    If this software package isn’t good enough for such a mention then it’s not a researcher’s job to write it but a technician’s.

    Now the author has two options: either quit and become a technician and a full-time software writer, or spend his time in such a way as to measurably contribute new knowledge.

    Don’t get me wrong: technicians are quite under-rated in research. It’s not for nothing that most commercial research labs cushion their researchers from hard work in areas they’re not good at (e.g. software development or instrument-making) by adding technicians to the lab. It improves the lab’s output considerably.

    Last but not least, reproducible research is on the up, for excellent reasons, and publication of source code used to arrive at results is part of it just as much as publishing the data is. I think that tenure boards will take such additional publications into account.

    Which is all very good, but publication of code and data is an adjunct to the publication of a scholarly article, not a substitute for it.
