Reproducibility in software is a big challenge. Churn in FOSS libraries is large. Making sure software runs exactly as it did originally is very hard. This is due to the way software development and distribution currently works.
Linux distributions mostly provide only one version of a library at a time. Since a program can depend on many libraries, some code somewhere in the stack will differ from the original almost immediately. If the code is run on a different distribution, it will run against different versions of the libraries it depends on, if only because the build flags differ.
We can pretend that this should not make a difference, but it does. Software stacks are now so complex that it is very hard to predict how software will behave.
The closest that scientific software can get to being reproducible is to provide the software compiled for a virtual machine with all the required libraries also implemented in that virtual machine and distributed in source and binary form with the software.
The best candidate for this is currently Java, where the entire set of code, libraries and required data and configuration files can be put in a single jar file. Java has a stable instruction set and is available on many platforms. The behavior of, for example, floating-point operations is standardized and does not depend on compiler flags, as it does for natively compiled code.
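To make the point about build flags concrete, here is a minimal sketch of why they can change numerical results: floating-point addition is not associative, so a flag that lets the compiler reorder arithmetic (GCC's -ffast-math is one such flag) can change the answer. Python is used only for illustration, and the values are arbitrary.

```python
# Floating-point addition is not associative: the grouping of the
# operands changes the rounding, and therefore the final result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # the order written in the source
regrouped = a + (b + c)       # an order a reordering optimizer might pick

print(repr(left_to_right), repr(regrouped))
print("identical:", left_to_right == regrouped)
```

The two sums differ in the last digit, which is harmless here but can grow in long-running numerical code.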
Ten simple rules for the open development of scientific software
Posted Dec 29, 2012 19:13 UTC (Sat) by oever (subscriber, #987)
After writing that long comment, I completely forgot to mention CDE and NixOS.
CDE uses ptrace to capture all files that are touched when running a program and places them in an archive that can be run on a different Linux machine. Hence, the software will run with the same library versions as when the original developer ran the software. The kernel and, I believe, libc will differ, but that is all.
Nix is a packaging system that is cross-distribution and should give the same version and compilation flags for a particular library if the checksum that captures the dependencies is the same.
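The idea of a checksum that captures the dependencies can be sketched roughly as follows. This is not Nix's actual algorithm; the function `build_id` and the package names, versions and flags are hypothetical, chosen only to show that changing a flag in one library changes the identifier of everything built on top of it.

```python
import hashlib
import json


def build_id(name, version, flags, dep_ids):
    """Hash a build recipe together with the ids of its dependencies."""
    recipe = {
        "name": name,
        "version": version,
        "flags": list(flags),     # flag order is kept as given
        "deps": sorted(dep_ids),  # dependency order should not matter
    }
    blob = json.dumps(recipe, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Hypothetical library built with two different optimization flags.
lib_o2 = build_id("somelib", "1.0", ["-O2"], [])
lib_o3 = build_id("somelib", "1.0", ["-O3"], [])

# The same application recipe, linked against each build of the library.
app_a = build_id("app", "0.1", ["-O2"], [lib_o2])
app_b = build_id("app", "0.1", ["-O2"], [lib_o3])

print("library ids differ:", lib_o2 != lib_o3)
print("application ids differ:", app_a != app_b)
```

Two machines that compute the same identifier can then assume they are talking about the same version and flags all the way down the dependency tree.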
Ten simple rules for the open development of scientific software
Posted Dec 29, 2012 20:37 UTC (Sat) by macc (subscriber, #510)
The rules are not about presentation fluff.
If a computational library produces different results in different versions, that is either a fixed bug or a newly introduced one ;-)
Ten simple rules for the open development of scientific software
Posted Jan 14, 2013 18:54 UTC (Mon) by davide.del.vento (guest, #59196)
Thanks for mentioning these two projects!
Ten simple rules for the open development of scientific software
Posted Dec 29, 2012 21:10 UTC (Sat) by JMB (guest, #74439)
From my point of view this is off topic.
Reproducibility has to be checked - with different CPU architectures,
operating systems etc.
(And I don't want to comment on the Java thing ... ouch.)
Here the main goal seems to be to attract people writing software
for sciences to open their code.
Being in physical sciences in the 1990s, I cannot imagine anyone
coding for scientific purposes not being aware of the benefits.
MIT, GNU / GPL, transparency, reproducibility by other researchers:
this has been scientific history or best practice for a long time - not
a present goal (CERN, ESO, ... you name it).
If I had written such an article in 1998 people would have laughed
at me - which would have been justified.
We had the source code of the data analysis packages and changed them
if necessary - and recompiled them on current Linux systems.
But we used open source totally (preferring SW under GPL, of course),
which is a big advantage.
I have heard of several scientific institutes using proprietary
software - this is a problem like the patents in this domain.
This may be a current problem - but this is not the point of
the article if I got it right.
So the comment seems to be no stranger than the main article.
But maybe things got so screwed (some institutes forced scientists
to claim patents - maybe there was a force to not open up their
source code either?) that one really has to reinvent the wheel?
Time for a new year ... ;-)
JMB
Ten simple rules for the open development of scientific software
Posted Dec 29, 2012 21:45 UTC (Sat) by oever (subscriber, #987)
The whole point of publishing software along with articles is so that others may easily check the published results. The PLOS article notes that "few papers are accompanied by open software". From my experience as a researcher and developer in physical chemistry ('98-'03), bioinformatics ('04-'07) and X-ray crystallography ('07-'09), I can say that this is accurate.
Scientists love to use FOSS stacks, but mostly do not publish their own code. This is justified by saying that others could implement the described algorithms and achieve the same results that way. This is true when the algorithms have been documented completely and resources are infinite. Making a second implementation would indeed be a good check, but it also takes a lot of work. The incentive to recreate software that would yield no new publishable material is low.
Journals should and sometimes do require that source code is published with articles.
Of course there are exceptions to the rule. Tartini is a nice desktop application for analyzing musical performance. NSGT toolbox is a library for transforming audio from the time domain to the logarithmic frequency domain (FFT transforms to linear frequency domain).
Ten simple rules for the open development of scientific software
Posted Dec 29, 2012 22:17 UTC (Sat) by dskoll (subscriber, #1630)
Your link to Tartini looked interesting, but when I tried compiling it on Debian Squeeze, it failed miserably. It looks like Tartini illustrates the author's original point: A lot of academic software is not easily portable to any machine other than the author's workstation, it doesn't use standard tools like autoconf, and it bit-rots.
Too bad, because Tartini looks really cool...
Ten simple rules for the open development of scientific software
Posted Dec 30, 2012 0:07 UTC (Sun) by oever (subscriber, #987)
The fact that Tartini does not compile under current Debian shows that an application that was working fine with a previous version of the libraries stopped being compilable, let alone usable, because Debian has changed so much in just a few years that Tartini does not compile. Is that Tartini's fault or Debian's?
Should we expect researchers to keep software up to date with changes in compilers and available libraries? Tartini uses Qt4, a perfectly fine library, as are Qt3 and Qt2. Yet software that relies on Qt2 has a hard time working on a current Linux system.
Ten simple rules for the open development of scientific software
Posted Dec 30, 2012 0:30 UTC (Sun) by dskoll (subscriber, #1630)
> The fact that Tartini does not compile under current Debian shows that an application that was working fine with a previous version of the libraries stopped being compilable, let alone usable, because Debian has changed so much in just a few years that Tartini does not compile. Is that Tartini's fault or Debian's?
Clearly, Tartini's. Debian is about the least bleeding-edge you can get with Linux. The build files for Tartini contain hard-coded paths to specific directories like /home/inferno/research/pitch/lib
I've worked in both academia and industry and know that academic software is not often built with the thought of actually distributing it or maintaining it in mind. It's just an unfortunate fact.
Ten simple rules for the open development of scientific software
Posted Dec 30, 2012 2:49 UTC (Sun) by yarikoptic (subscriber, #36795)
> ..., Debian, the least bleeding-edge you can get with Linux
I beg your pardon... Debian is not only Debian stable -- there are also testing, unstable and even experimental. With unstable+experimental you can be as close to the bleeding edge as possible while still maintaining a usable and relatively stable system.
But this example is indeed a very nice way to point out that source code itself, although a huge step forward, is not all that is needed for proper dissemination of scientific methods, since building and deploying the "code" can be quite involved at times. Had the authors created proper Debian packages and uploaded them to Debian unstable (the entry point for new packages into Debian), it would have delivered many of the benefits others have mentioned:
- the "code" could immediately be used by users of Debian (and thus of its >130 derivatives),
- its hardware platform agnosticism would be verified by building across the >10 platforms Debian supports,
- if unit tests were run at build time, at least some aspects of hardware platform "reproducibility" would also come "for free",
- longevity of such "code" would be measured in years due to inclusion/maintenance in Debian stable later on.
If you want to read more on our (neuro.debian.net) position/experience, you are welcome to read http://www.frontiersin.org/Neuroinformatics/10.3389/fninf...
Open is not enough. Let’s take the next step: an integrated, community-driven computing platform for neuroscience
Ten simple rules for the open development of scientific software
Posted Jan 4, 2013 20:16 UTC (Fri) by pboddie (subscriber, #50784)
What you and others are saying is that what's missing is the software engineering. People can write code to consume and produce data in order to demonstrate something, get published, and so on, but if others are to benefit from that code in any convenient way, there's the usual amount of software engineering required to achieve this.
Some might dispute whether sharing the code is necessary, but if any algorithm is going to be described in detail - and I doubt that they are described in sufficient detail, especially in disciplines other than pure computer science - then it would be better if the code were available, better still if it could be conveniently used in order to rule out coincidental hardware- or infrastructure-related effects, and even better still if it were well-structured and well-documented. Once again, software engineering is the missing ingredient.
Unfortunately, the funding in many environments probably doesn't cover anything beyond getting something working and getting a paper out the door (and thus attracting more funding). After all, there's always another Web service to use or another bundle of Java class files to stuff into the JVM to massage one's data and produce a "result", and nobody's asking for money, so what's the problem? Right? That's probably the prevailing attitude that needs changing.
Ten simple rules for the open development of scientific software
Posted Dec 30, 2012 0:46 UTC (Sun) by paulj (subscriber, #341)
And this kind of thing illustrates that what is *really* needed is to fully describe, in the most natural, concise but precise language the author can manage, the essential methodology of the experiment in the paper. Releasing the software does NOT substitute for that, in terms of increasing the reproducibility of the experiment.
Even seasoned software engineers will find it difficult to distribute software that will just run on a wide variety of machines - unless they do so as something that will boot on something that is close to a universal machine (e.g. x86 VMs). Even then, it's far from guaranteed.
Ten simple rules for the open development of scientific software
Posted Jan 4, 2013 0:40 UTC (Fri) by JoeBuck (subscriber, #2330)
I was privileged to do my graduate research in a culture (UC Berkeley EECS department) that did rock-solid open source development and released a whole lot of software that was built upon by other groups. I agree that research software should be released, ideally open source, and if the university legal department sets up roadblocks, at least it should be made available on a restricted-use basis. However, it's a mistake to over-emphasize the software, and there may be advantages in having other groups re-implement the algorithms rather than just use the same code.
If Research Group A publishes a paper and releases software, Research Group B can run the software and observe the same result. But this doesn't mean that the result is correct; the software might be wrong. Similarly, a claim that algorithm A is superior to algorithm B can be confounded by implementation quality: perhaps implementation A is simply better than implementation B, and a bug in B's implementation led to worse performance than could have been achieved.
Ten simple rules for the open development of scientific software
Posted Jan 4, 2013 10:44 UTC (Fri) by dark (subscriber, #8483)
I'd find this argument more convincing if it didn't also apply to publishing the data.
It's enough if scientific papers just describe the experimental protocol and their conclusions. It's a mistake to over-emphasize publishing the data; after all, research groups who are interested in verifying the result should run their own experiment instead of re-analysing the same data.
The flaw in the argument here is that if there are mistakes in the original group's analysis then they are exposed by publishing the data along with the conclusions, just like mistakes in software implementation would be exposed by publishing it. Forcing other groups to re-do the work and then guess why their results are different will instead hide these problems.
Publishing experimental data along with the conclusions drawn from it is considered essential; publishing the software used should be considered essential for the same reasons. In both cases, it makes sense to provide only a summary if there's no space for all of it (as in a print article); in that case, showing the implementation of the crucial parts of the algorithm would suffice. We can take the command-line parsing on faith :)
Ten simple rules for the open development of scientific software
Posted Jan 24, 2013 20:41 UTC (Thu) by raalkml (guest, #72852)
FWIW, it wasn't particularly hard to fix for Debian testing.
If someone is still interested, I could send you the fixes I had to make.
Ten simple rules for the open development of scientific software
Posted Dec 30, 2012 0:00 UTC (Sun) by man_ls (subscriber, #15091)
> Reproducibility in software is a big challenge. Churn in FOSS libraries is large. Making sure software runs exactly as it did originally is very hard. This is due to the way software development and distribution currently works.
Your comment is a good argument to keep scientific software as simple as possible. In essence there should be a few data files and a few source code files in a standard language that generate a binary; both sets of files should be under version control. The binary then takes the data files and produces a new file with results, which can then be plotted or manipulated as desired. This is especially important for all charts, graphics and tabulated data in an article.
There are a few issues in this scheme:
- Input data should be fully documented: origin, conditions and other constraints.
- The language used should also be specified. It should be a well-known standard, and ideally a language that maintains backwards compatibility so results can be reproduced in the future. (This rules out e.g. Python.)
- If binary generation goes beyond compiling a few source files, a standard mechanism such as make should be used.
- Use of libraries should be reduced to a minimum, and the version for each one should be specified in the documentation (or added to version control).
But it should be workable.
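As a rough sketch of the last point, the versions in use can be recorded by the program itself instead of by hand. The helper `environment_record` is hypothetical and "numpy" is only an example package name; the same idea applies in any language, whatever one thinks of the choice made here.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Return interpreter, platform and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; record that explicitly
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# "numpy" is only an example; list whatever the analysis actually imports.
record = environment_record(["numpy"])
print(json.dumps(record, indent=2, sort_keys=True))
```

The printed record can be committed next to the source and data files, so the documentation of library versions cannot drift from what was actually used.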
Ten simple rules for the open development of scientific software
Posted Dec 30, 2012 11:59 UTC (Sun) by macc (subscriber, #510)
It is always a good thing to separate computation and presentation.
Use human readable formats.
Realise your problem solving in (combinable) modules.
(netpbm, though not really scientific software, is a perfect example.)
Make your solution scriptable!
Use human readable configuration files.
Incorporate the configuration in the results file.
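The last two points could look something like this minimal sketch. The configuration keys ("scale", "inputs") and the function `run` are made up for illustration; the point is only that the human-readable results carry the exact configuration that produced them.

```python
import json


def run(config):
    """Toy computation: scale each input value by the configured factor."""
    values = [config["scale"] * x for x in config["inputs"]]
    # Store the configuration next to the results it produced.
    return {"config": config, "results": values}


config = {"scale": 2.0, "inputs": [1, 2, 3]}

# Human-readable output; presentation (plots, tables) is a separate step.
report = json.dumps(run(config), indent=2, sort_keys=True)
print(report)
```

Anyone who later finds the results file can see which settings were used without having to hunt for a separate configuration file.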