
Ten simple rules for the open development of scientific software

Here is some advice for scientists developing open-source software published on the PLOS Computational Biology site in early December. "The sustainability of software after publication is probably the biggest problem faced by researchers who develop it, and it is here that participating in open development from the outset can make the biggest impact. Grant-based funding is often exhausted shortly after new software is released, and without support, in-house maintenance of the software and the systems it depends on becomes a struggle. As a consequence, the software will cease to work or become unavailable for download fairly quickly, which may contravene archival policies stipulated by your journal or funding body. A collaborative and open project allows you to spread the resource and maintenance load to minimize these risks, and significantly contributes to the sustainability of your software."

Ten simple rules for the open development of scientific software

Posted Dec 29, 2012 19:07 UTC (Sat) by oever (subscriber, #987) [Link]

Reproducibility in software is a big challenge. Churn in FOSS libraries is large. Making sure software runs exactly as it did originally is very hard. This is due to the way software development and distribution currently works.

Linux distributions mostly provide only one version of a library at a time. Since a program can depend on many libraries, somewhere along the stack some code will differ from the original code almost immediately. If the code is run on a different distribution, it will be running with a different version of the libraries that the software depends on, if only because the build flags are different.

We can pretend that this should not make a difference, but it does. Software stacks are so complex nowadays that it's very hard to predict how software will behave.
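
A minimal illustration of how small the cause of a difference can be, assuming IEEE 754 double-precision arithmetic (which Python floats use on common platforms). Floating-point addition is not associative, so a library build that merely reorders a sum can change the last digits of a result:

```python
import math

values = [0.1, 0.2, 0.3]

# Two groupings that are mathematically identical.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The rounding happens at different points, so the results are not bit-identical.
print(left_to_right == right_to_left)

# They still agree to within a small relative tolerance.
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-12))
```

This only shows that evaluation order matters; whether a given compiler flag or library version actually reorders operations depends on the stack in question.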

The closest that scientific software can get to being reproducible is to provide the software compiled for a virtual machine with all the required libraries also implemented in that virtual machine and distributed in source and binary form with the software.

The best candidate for this is currently Java, where the entire set of code, libraries and required data and configuration files can be put in a single jar file. Java has a stable instruction set and is available on many platforms. The behavior of e.g. floating point operations is standardized and independent of compiler flags, unlike in natively compiled code.

Ten simple rules for the open development of scientific software

Posted Dec 29, 2012 19:13 UTC (Sat) by oever (subscriber, #987) [Link]

After writing that long comment, I completely forgot to mention CDE and NixOS.

CDE uses ptrace to capture all files that are touched when running a program and places them in an archive that can be run on a different Linux machine. Hence, the software will run with the same library versions as when the original developer ran the software. The kernel and, I believe, libc will differ, but that is all.

Nix is a packaging system that is cross-distribution and should give the same version and compilation flags for a particular library if the checksum that captures the dependencies is the same.
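
The idea of a checksum that captures the dependencies can be sketched as follows. This is not Nix's actual hashing scheme; the function name, the package names and the fields included are assumptions chosen purely for illustration. The point is that each package's fingerprint covers the fingerprints of its dependencies, so a change anywhere below shows up at the top:

```python
import hashlib
import json

def dependency_fingerprint(name, version, build_flags, dependencies=()):
    """Return a hex digest covering a package and everything it depends on.

    Hypothetical helper: 'dependencies' is a sequence of fingerprints
    returned by earlier calls to this same function.
    """
    description = {
        "name": name,
        "version": version,
        "build_flags": sorted(build_flags),
        "dependencies": sorted(dependencies),
    }
    canonical = json.dumps(description, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Same library, same flags: the application fingerprints match.
zlib = dependency_fingerprint("zlib", "1.2.7", ["-O2"])
app_a = dependency_fingerprint("analysis", "1.0", ["-O2"], [zlib])
app_b = dependency_fingerprint("analysis", "1.0", ["-O2"], [zlib])

# Same library version built with a different flag: the fingerprint changes.
zlib_o3 = dependency_fingerprint("zlib", "1.2.7", ["-O3"])
app_c = dependency_fingerprint("analysis", "1.0", ["-O2"], [zlib_o3])

print(app_a == app_b)
print(app_a == app_c)
```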

Ten simple rules for the open development of scientific software

Posted Dec 29, 2012 20:37 UTC (Sat) by macc (subscriber, #510) [Link]

The rules are not about presentation fluff.

If a computational library produces different results in different versions, that is either a fixed bug or a newly introduced bug ;-)


Ten simple rules for the open development of scientific software

Posted Jan 14, 2013 18:54 UTC (Mon) by davide.del.vento (guest, #59196) [Link]

Thanks for mentioning these two projects!

Ten simple rules for the open development of scientific software

Posted Dec 29, 2012 21:10 UTC (Sat) by JMB (guest, #74439) [Link]

From my point of view this is off topic.
Reproducibility has to be checked - with different CPU architectures,
operating systems etc.
(And I don't want to comment on the Java thing ... ouch.)

Here the main goal seems to be to attract people writing software
for sciences to open their code.
Having been in the physical sciences in the 1990s, I cannot imagine anyone
coding for scientific purposes being unaware of the benefits.
MIT, GNU / GPL, transparency, reproducibility by other researchers:
this has been scientific history or best practice for a long time - not
a present goal (CERN, ESO, ... you name it).
If I had written such an article in 1998 people would have laughed
at me - which would have been justified.
We had the source code of the data analysis packages and changed them
if necessary - and recompiled them on current Linux systems.
But we used open source totally (preferring SW under GPL, of course),
which is a big advantage.
I have heard of several scientific institutes using proprietary
software - this is a problem like the patents in this domain.
This may be a current problem - but this is not the point of
the article if I got it right.

So the comment seems to be no stranger than the main article.
But maybe things got so screwed (some institutes forced scientists
to claim patents - maybe there was a force to not open up their
source code either?) that one really has to reinvent the wheel?
Time for a new year ... ;-)

JMB

Ten simple rules for the open development of scientific software

Posted Dec 29, 2012 21:45 UTC (Sat) by oever (subscriber, #987) [Link]

The whole point of publishing software along with articles is so that others may easily check the published results. The PLOS article notes that "few papers are accompanied by open software". From experience as researcher and developer in physical chemistry ('98-'03), bioinformatics ('04-'07) and X-Ray crystallography ('07-'09) I can say that this is valid.

Scientists love to use FOSS stacks, but mostly do not publish their own code. This is justified by saying that others could implement the described algorithms and achieve the same results that way. This is true when the algorithms have been documented completely and resources are infinite. Making a second implementation would indeed be a good check, but also takes a lot of work. Incentive for recreating the software that would yield no new publishable material is low.

Journals should and sometimes do require that source code is published with articles.

Of course there are exceptions to the rule. Tartini is a nice desktop application for analyzing musical performance. NSGT toolbox is a library for transforming audio from the time domain to the logarithmic frequency domain (FFT transforms to linear frequency domain).

Ten simple rules for the open development of scientific software

Posted Dec 29, 2012 22:17 UTC (Sat) by dskoll (subscriber, #1630) [Link]

Your link to Tartini looked interesting, but when I tried compiling it on Debian Squeeze, it failed miserably. It looks like Tartini illustrates the author's original point: A lot of academic software is not easily portable to any machine other than the author's workstation, it doesn't use standard tools like autoconf, and it bit-rots.

Too bad, because Tartini looks really cool...

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 0:07 UTC (Sun) by oever (subscriber, #987) [Link]

The fact that Tartini does not compile under current Debian shows that an application that was working fine with a previous version of the libraries stopped being compilable, let alone usable, because Debian has changed so much in just a few years that Tartini does not compile. Is that Tartini's fault or Debian's?

Should we expect researchers to keep software up to date with changes in compilers and available libraries? Tartini uses Qt4, a perfectly fine library, as are Qt3 and Qt2. Yet, software that relies on Qt2 has a hard time working on a current Linux system.

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 0:30 UTC (Sun) by dskoll (subscriber, #1630) [Link]

> The fact that Tartini does not compile under current Debian shows that an application that was working fine with a previous version of the libraries stopped being compilable, let alone usable, because Debian has changed so much in just a few years that Tartini does not compile. Is that Tartini's fault or Debian's?

Clearly, Tartini's. Debian is about the least bleeding-edge you can get with Linux. The build files for Tartini contain hard-coded paths to specific directories like /home/inferno/research/pitch/lib
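
For illustration only, and in Python rather than in Tartini's own build files: one common way to avoid a hard-coded directory is to take it from the environment and fall back to a path relative to the project. The variable name `PITCH_LIB_DIR` and the default `lib` are made up for this sketch:

```python
import os
from pathlib import Path

def find_library_dir(env_var="PITCH_LIB_DIR", default="lib"):
    """Pick the library directory from the environment, else a relative default.

    Both the environment variable name and the default are hypothetical.
    """
    return Path(os.environ.get(env_var, default))

print(find_library_dir())
```

Anyone building on another machine can then set the variable instead of editing the build files.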

I've worked in both academia and industry and know that academic software is not often built with the thought of actually distributing it or maintaining it in mind. It's just an unfortunate fact.

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 2:49 UTC (Sun) by yarikoptic (subscriber, #36795) [Link]

> ..., Debian, the least bleeding-edge you can get with Linux

I beg your pardon... Debian is not only Debian stable -- there is also testing, unstable and even experimental. With unstable+experimental you can be about as close to the bleeding edge as possible while still maintaining a usable and relatively stable system.

But this example is indeed a very nice illustration that source code by itself, although a huge step forward, is not all that is needed for proper dissemination of scientific methods, since building and deploying the "code" can be quite involved at times. Had the authors created proper Debian packages and uploaded them to Debian unstable (the entry point for new packages into Debian), it would have delivered many of the benefits others have mentioned:

- the "code" could immediately be used by users of Debian (and thus of its >130 derivatives),
- its hardware platform agnosticism would be verified by building across the >10 hardware platforms Debian supports,
- if there were unit tests run at build time, at least some aspects of hardware platform "reproducibility" would also come "for free",
- the longevity of such "code" would be measured in years, thanks to later inclusion and maintenance in Debian stable.
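
As a rough sketch of the kind of build-time unit test meant in the list above (plain Python `unittest`, not Debian packaging machinery; the `mean` function and test names are invented for the example), a test can pin a known value to a tolerance and check that the result does not depend on input order:

```python
import math
import unittest

def mean(values):
    """Arithmetic mean using math.fsum, which returns a correctly rounded sum."""
    return math.fsum(values) / len(values)

class MeanTest(unittest.TestCase):
    def test_known_value(self):
        # Compare to a tolerance rather than bit-for-bit.
        self.assertAlmostEqual(mean([0.1, 0.2, 0.3]), 0.2, places=12)

    def test_order_independent(self):
        # Large cancelling terms: a naive running sum would lose the small ones.
        data = [1e16, 1.0, -1e16, 1.0]
        self.assertEqual(mean(data), mean(sorted(data)))
```

Saved as a module, this can be run with `python -m unittest`; a package build that runs it on every supported architecture would flag platform-dependent results early.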

Want to read more on our (neuro.debian.net) position/experience -- you are welcome to read
http://www.frontiersin.org/Neuroinformatics/10.3389/fninf...
Open is not enough. Let’s take the next step: an integrated, community-driven computing platform for neuroscience

Ten simple rules for the open development of scientific software

Posted Jan 4, 2013 20:16 UTC (Fri) by pboddie (subscriber, #50784) [Link]

What you and others are saying is that what's missing is the software engineering. People can write code to consume and produce data in order to demonstrate something, get published, and so on, but if others are to benefit from that code in any convenient way, there's the usual amount of software engineering required to achieve this.

Some might dispute whether sharing the code is necessary, but if any algorithm is going to be described in detail - and I doubt that they are described in sufficient detail, especially in disciplines other than pure computer science - then it would be better if the code were available, better still if it could be conveniently used in order to rule out coincidental hardware- or infrastructure-related effects, and even better still if it were well-structured and well-documented. Once again, software engineering is the missing ingredient.

Unfortunately, the funding in many environments probably doesn't cover anything beyond getting something working and getting a paper out the door (and thus attracting more funding). After all, there's always another Web service to use or another bundle of Java class files to stuff into the JVM to massage one's data and produce a "result", and nobody's asking for money, so what's the problem? Right? That's probably the prevailing attitude that needs changing.

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 0:46 UTC (Sun) by paulj (subscriber, #341) [Link]

And this kind of thing illustrates that what is *really* needed is to fully describe, in the most natural, concise but precise language the author can manage, the essential methodology of the experiment in the paper. Releasing the software does NOT substitute for that, in terms of increasing the reproducibility of the experiment.

Even seasoned software engineers will find it difficult to distribute software that will just run on a wide variety of machines - unless they do so as something that will boot on something that is close to a universal machine (e.g. x86 VMs). Even then, it's far from guaranteed.

Ten simple rules for the open development of scientific software

Posted Jan 4, 2013 0:40 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]

I was privileged to do my graduate research in a culture (UC Berkeley EECS department) that did rock-solid open source development and released a whole lot of software that was built upon by other groups. I agree that research software should be released, ideally open source, and if the university legal department sets up roadblocks, at least it should be made available on a restricted-use basis. However, it's a mistake to over-emphasize the software, and there may be advantages in having other groups re-implement the algorithms rather than just use the same code.

If Research Group A publishes a paper and releases software, Research Group B can run the software and observe the same result. But this doesn't mean that the result is correct; the software might be wrong. Similarly, a claim that algorithm A is superior to algorithm B can be confounded by the fact that implementation A is merely better than implementation B: a bug in B's implementation may have led to worse performance than could have been achieved.
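
A small sketch of what an independent re-implementation buys, using sample variance as a stand-in for "the algorithm" (the choice of example and the function names are assumptions, not anything from the article). Two implementations written differently should agree to within rounding; if they do not, at least one has a bug:

```python
import math

def variance_two_pass(values):
    """Sample variance: compute the mean first, then the squared deviations."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.fsum((x - mean) ** 2 for x in values) / (n - 1)

def variance_welford(values):
    """Sample variance via Welford's single-pass update."""
    mean = 0.0
    m2 = 0.0
    for count, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / (len(values) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
a = variance_two_pass(data)
b = variance_welford(data)

# Agreement between independent implementations is evidence, not proof.
print(math.isclose(a, b, rel_tol=1e-9))
```

Rerunning the original group's code can only reproduce its bugs; a cross-check like this can expose them.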

Ten simple rules for the open development of scientific software

Posted Jan 4, 2013 10:44 UTC (Fri) by dark (subscriber, #8483) [Link]

I'd find this argument more convincing if it didn't also apply to publishing the data.

It's enough if scientific papers just describe the experimental protocol and their conclusions. It's a mistake to over-emphasize publishing the data; after all, research groups who are interested in verifying the result should run their own experiment instead of re-analysing the same data.

The flaw in the argument here is that if there are mistakes in the original group's analysis then they are exposed by publishing the data along with the conclusions, just like mistakes in software implementation would be exposed by publishing it. Forcing other groups to re-do the work and then guess why their results are different will instead hide these problems.

Publishing experimental data along with the conclusions drawn from it is considered essential; publishing the software used should be considered essential for the same reasons. In both cases, it makes sense to provide only a summary if there's no space for all of it (as in a print article); in that case, showing the implementation of the crucial parts of the algorithm would suffice. We can take the command-line parsing on faith :)

Ten simple rules for the open development of scientific software

Posted Jan 24, 2013 20:41 UTC (Thu) by raalkml (guest, #72852) [Link]

FWIW, it wasn't particularly hard to fix for Debian testing.
If someone is still interested, I could send the fixes I had to do your way.

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 0:00 UTC (Sun) by man_ls (subscriber, #15091) [Link]

> Reproducibility in software is a big challenge. Churn in FOSS libraries is large. Making sure software runs exactly as it did originally is very hard. This is due to the way software development and distribution currently works.
Your comment is a good argument to keep scientific software as simple as possible. In essence there should be a few data files and a few source code files in a standard language that generate a binary; both sets of files should be under version control. The binary then takes the data files and produces a new file with results, which can then be plotted or manipulated as desired. This is especially important for all charts, graphics and tabulated data in an article.

There are a few issues in this scheme:

  • Input data should be fully documented: origin, conditions and other constraints.
  • The language used should also be specified. It should be a well-known standard, and ideally a language that maintains backwards compatibility so results can be reproduced in the future. (This rules out e.g. Python.)
  • If binary generation goes beyond compiling a few source files, a standard mechanism such as make should be used.
  • Use of libraries should be reduced to a minimum, and the version for each one should be specified in the documentation (or added to version control).
But it should be workable.
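
A minimal sketch of the bookkeeping side of this scheme, written in Python purely for brevity (the function names and the record layout are assumptions): record a checksum of each input file together with the interpreter and library versions, so that the documentation of a result says exactly what produced it.

```python
import hashlib
import json
import platform
import sys

def sha256_of_file(path):
    """Checksum a data file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def provenance(input_paths, libraries):
    """Build a JSON-serializable record of inputs and software versions.

    'libraries' maps library names to version strings supplied by the caller.
    """
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "inputs": {path: sha256_of_file(path) for path in input_paths},
        "libraries": dict(libraries),
    }
```

The record returned by `provenance` can be written next to the results with `json.dump` and committed to version control along with the data and source files.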

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 11:59 UTC (Sun) by macc (subscriber, #510) [Link]

It is always a good thing to separate computation and presentation.
Use human readable formats.
Realise your problem solving in (combinable) modules.
(netpbm, though not really scientific software, is a perfect example.)

Make your solution scriptable!
Use human readable configuration files.
Incorporate the configuration in the results file.

GUI is fluff.
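
A sketch of "incorporate the configuration in the results file", assuming a tab-separated text format with comment lines at the top (the layout and function names are invented for this example):

```python
import json

def write_results(path, config, rows):
    """Write a human-readable results file that carries its own configuration."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("# config: " + json.dumps(config, sort_keys=True) + "\n")
        out.write("# columns: x\ty\n")
        for x, y in rows:
            out.write(f"{x}\t{y}\n")

def read_config(path):
    """Recover the configuration that produced a results file."""
    with open(path, encoding="utf-8") as source:
        first = source.readline()
    prefix = "# config: "
    if not first.startswith(prefix):
        raise ValueError("results file has no embedded configuration")
    return json.loads(first[len(prefix):])
```

Because the file is plain text, it can be inspected with a pager and piped into other tools, and the run that produced it can be repeated from the file alone.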

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 5:59 UTC (Sun) by heijo (guest, #88363) [Link]

Just always releasing source code with the paper would be a huge improvement; the rest is merely a convenience.

It's always infuriating to read a CS paper that contains an "experimental results" section, clearly indicating they implemented their idea, and not have the code.

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 9:29 UTC (Sun) by boudewijn (subscriber, #14185) [Link]

I got hit by this in the early days of working on Krita. There were a number of papers and dissertations that were quite interesting. Mostly, there was no source code, though it was already amazing I could download the papers.

There was Bill Baxter's work on paint simulation which now has ended up in Microsoft's Freshpaint. No source code, and while the videos looked interesting, no way to reproduce from the dissertation and papers alone, at least not for me. (http://www.billbaxter.com)

I managed to get Tunde Cockshott's Wet & Sticky code, under the GPL, and managed to build it, which was amazing. Unfortunately, the actual results of that application were a bit disappointing since it kept crashing, but then, that was work from a different age. (http://www.valdyas.org/fading/index.cgi/books/hacking/wet...)

Clara Chan's Chinese brush simulation was easy enough to port to Qt, and was a good basis for work, so that was an excellent resource (http://www.valdyas.org/fading/index.cgi/hacking/krita/wac..., original paper seems to have disappeared).

The MoXi paper in contrast was also on Chinese brushes, but no source code, so useless.

Ten simple rules for the open development of scientific software

Posted Dec 30, 2012 19:07 UTC (Sun) by brooksmoses (subscriber, #88422) [Link]

Having done scientific programming -- and, in fact, having come to programming by way of doing a Ph.D. in computational fluid dynamics -- I would say that this article is to some extent trying to solve a difficult problem by applying platitudes. Here are some of the reasons that I ran into why the problem is difficult:

* Many scientific programs are written by scientists -- and, more significantly, by newbie science grad students -- who have no training and little experience in writing readable and reusable software. I learned things like "write your program in small chunks that can be independently tested" by doing it wrong and learning from my mistakes. My mentors were all scientists, not programmers; they had more experience but no more training, and nobody even thought of looking at my code with the same red pen they used on my papers. And certainly there were no classes in writing good programs to go along with the classes in writing good papers! (Personally I should also credit the people of comp.lang.fortran and the GCC development list; most of what I learned about good coding style I learned from paying attention to discussions there -- but that's unusual.)

* Many scientific programs are written to solve an immediate single need to demonstrate a particular result. I remember a case where there was a filter shape that ideally would have come from an input file, but I was up against a hard conference deadline and hardcoding it was faster and simpler. Likewise, the code for my final dissertation work involved a horrible mess of Intel Fortran for Windows, a custom-built Cygwin G++, and hardcoded Windows-to-Cygwin directory name conversions in makefiles -- but it worked on my computer, and it got the job done and let me collect a lot of necessary data in short order. When you're making a one-time-use tool, there is very little added value in making it pretty, or making it reusable.

* Publishing software openly does not magically create a community that will spread the maintenance load. As a programmer I have worked on a significant piece of library software (Sourcery VSIPL++) with an open standard interface, a small standardization community, a commercial user base, and for quite some time a sponsored GPL implementation with an associated openly-archived mailing list. I believe we had a number of users of the GPL implementation, but in all the years we had it, we heard from very few of them, and received no patches or other maintenance help. The reality is that most scientific software is not useful to enough people to sustain a development community, and even where it is, sustaining a community requires substantial ongoing work.

* As a general rule of thumb that I've learned from commercial work on open-source projects (now that I'm a professional programmer), the amount of time that it takes to push a patch upstream to an open-source project and get it accepted is about the same as the amount of time that it takes to write something decent that works in the first place -- and that's by people who are good at open-source community politics, which itself is a learned skill. That will apply to people contributing to existing scientific projects, and it's probably a lower bound for the amount of work required to document the internals of a new piece of software and make it readable and portable so that it's something that other people could use.

The problem of making scientific software open source in a really useful way (rather than just a useless unreadable code dump) is difficult to solve because it fundamentally requires a lot more effort, and because it requires a significant amount of training of new scientists. At a rough estimate, it requires at minimum a full semester-long class and ongoing mentoring by experienced programmers, and about a doubling of the effort and time required for writing programs. With science Ph.D. times already extending well beyond what's sustainable, this implies a notable reduction in the amount of work that can be done in a Ph.D., and requires bringing additional programming expertise into scientific departments to teach the classes and provide the mentoring and perhaps reduce some of the programming load on the scientists.

(In support of that: For scientific laboratories, we have machine-shop classes for students and we have laboratory technicians and machinists to help the researchers build the laboratory equipment, because we have accepted that this is skilled labor that is best done with a lot of assistance by professionals. Why don't we do the same with software?)

This is a hard problem because there are real, and significant, costs involved in solving it -- both in funding, and in graduate student time. Platitudes, even well-explained ones, will not pay those costs. We need to be having the hard conversations about what the costs are, why they are important to pay -- and whether they truly are! -- and how university research departments can pay them.

Ten simple rules for the open development of scientific software

Posted Dec 31, 2012 18:14 UTC (Mon) by neiljerram (subscriber, #12005) [Link]

I agree with all that. I also did a PhD in fluid dynamics, and I have programs that I used for my thesis results, and that could be made available and might one day prove useful to someone else.

But they're very specific, combine functionality in strange ways - e.g. mixing computation and display - and partly duplicate function that I'm sure has since been done better in other free software programs.

Hence the upshot is that it's really difficult, even for me as a person with some track record in free software, to see how this code could usefully be made available.

The question of reproducibility of research, and whether journals ought therefore to require associated code to be published, seems to me to be separate from the general desire to share code. But if the former was widely required, it would probably also facilitate the latter.

Ten simple rules for the open development of scientific software

Posted Dec 31, 2012 21:00 UTC (Mon) by Trelane (subscriber, #56877) [Link]

This assumes that the current model, namely develop in private and throw over the wall, is the model to go for, instead of active collaboration between similar research groups. Such a model would be much more sustainable since presumably the PIs or senior post docs would have been involved with the overarching project long enough that they are truly co-maintainers of the project.

Unfortunately, I suspect that, due to competition over increasingly scarce funding coupled with the lack of interest in the software compared to the papers produced by the software, the situation is not going to improve in the foreseeable future. Plus the PIs are used to fiefdoms within their domain, not collaboration with similar groups.

--a former solid-state physics post-doc.
(A PI I worked with stated that there was (paraphrasing from memory) no line in grant applications or progress reports for lines of code written. Rather, it's all papers. Seeing as how I didn't have nearly enough of those, I decided it to be in my family's best interest to move into high-performance software in an industry setting. :)

Ten simple rules for the open development of scientific software

Posted Dec 31, 2012 21:34 UTC (Mon) by brooksmoses (subscriber, #88422) [Link]

I would disagree that a lot of my points rely on that assumption. My conclusion does rely on it to some extent -- you are in essence arguing that writing reusable-quality code and building a community around it produces benefits in terms of reduced programming effort from being able to reuse the work of one's collaborators, and this partly offsets the extra costs; I had neglected that potential offset. And that's a very valid point.

However, I think that my general argument still holds. A model of active collaboration on code development between similar research groups will require that the research groups develop high-quality code, and that means (a) a lot of additional effort in making the code reusable by others, and (b) the graduate students who are writing the code need to have training, mentoring, and code review that they currently aren't getting -- and which the current structure generally doesn't have anybody with skills or time to provide. And, (c) you also get politics of maintainership when people have different ideas of where the code should go and what level of code quality is acceptable, which means you end up spending time and effort dealing with the politics. Maybe the benefits of collaboration can offset that for shared foundational work in some cases, but that's quite a lot of extra work that needs to be offset -- and the need for universities to invest in people who can provide programming mentorship is still there.

(I've seen software written by PIs. I've worked with software written by PIs. I learned a lot about how not to write code by working with software written by PIs. Senior postdocs, maybe, but are you selecting and training for programming skills or research skills? These days, the programming skills only seem to come along accidentally.)

There's also the point that, even when you have a shared foundation, there's a lot of one-time-use code that gets written to support a single experiment, because every experiment is (by definition) doing something new. That's still going to be in the "develop in private" model simply because only one person ever needs it!

Ten simple rules for the open development of scientific software

Posted Jan 14, 2013 16:57 UTC (Mon) by davide.del.vento (guest, #59196) [Link]

Good points.

However, from my own experience, the world is changing for the better. Not as fast as we'd like and not at the same speed in all fields, but thankfully it is.

For example, I was at the American Meteorological Society and was surprised by how well attended this talk was and how much good feedback it received.

I myself started organizing the SEA Scientific Software Engineering conference last year, which was also very well received by the community. See the talks we had last year and consider going or submitting a talk for this year.

Pardon the shameless promotion of "my event", but my point is "yes, it's difficult, yes it requires time, yes there is resistance, but something is happening and changing that". Wasn't it the same for basically any human achievement? :-)

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds