A scientific basis for Open Source Software

Posted May 19, 2012 15:17 UTC (Sat) by richo123 (guest, #24309)
Parent article: A scientific basis for Open Source Software

I think the argument that it is taxpayers' money that paid for the software is more persuasive.

I am a research scientist who does a large amount of source code development. Sharing that code is only infrequently useful to either the public or other scientists.

The reproducibility argument is a weak one. The methodologies are described in any paper worth its salt so the software could be (and often should be) rewritten independently. Indeed such a rewrite is a better test of the original results since large codes are hard to rigorously assess if you did not write them. Since the algorithm is published the software SHOULD be reproducible.

On the other hand it really helps the scientific community in general if codes are shared. It can often save a lot of time.


A scientific basis for Open Source Software

Posted May 19, 2012 18:48 UTC (Sat) by pboddie (subscriber, #50784) [Link]

> I am a research scientist who does a large amount of source code development. Sharing that code is only infrequently useful to either the public or other scientists.

That is your modest assessment, however. A lot of software, if exposed to the right audience, can benefit substantially from the accompanying exposure and improvements even if the audience is unfamiliar with the problem domain. And people in various domains can often benefit from techniques employed in other domains.

> The reproducibility argument is a weak one. The methodologies are described in any paper worth its salt so the software could be (and often should be) rewritten independently. Indeed such a rewrite is a better test of the original results since large codes are hard to rigorously assess if you did not write them. Since the algorithm is published the software SHOULD be reproducible.

I'm not convinced that I've seen a paper outside the computer science domain that fully describes a non-trivial algorithm, although I'll freely admit that I don't read that many papers. My impression is that authors want you to get in contact to find out more and to "collaborate" with them - that appears to be easier than getting a complete algorithm description published.

Several factors frustrate reproducibility and transparency, including but not limited to competition for funding, politics, publication requirements (both the need to publish and the restrictions around publication), and the temptation for institutions to "monetise" research.

But I agree that independent reimplementations of software systems can be useful in assessing the quality of results and in decoupling a methodology from implementation artifacts. However, I can personally attest that it takes time away from more rewarding work and is arguably a luxury unless one is the sort of researcher that is on such good terms with the funding agencies that one gets money for just about any project regardless of its merits.

A scientific basis for Open Source Software

Posted May 19, 2012 18:58 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

>I'm not convinced that I've seen a paper outside the computer science domain that fully describes a non-trivial algorithm, although I'll freely admit that I don't read that many papers.

That happens all the time in HEP (High Energy Physics) and in computational biology. Also quite often publications describe modifications of existing methods rather than totally new and unique methods. So it's fairly easy to replicate them.

A scientific basis for Open Source Software

Posted May 21, 2012 13:49 UTC (Mon) by dgm (subscriber, #49227) [Link]

Also in Mathematics, where many of the algorithms we use daily come from. Do not play with Big Boys' stuff if you don't want a serious headache ;-)

A scientific basis for Open Source Software

Posted May 19, 2012 20:25 UTC (Sat) by richo123 (guest, #24309) [Link]

I agree that the papers do not spell out the algorithm; however, any professional scientist worth their salt can fill in any gaps. If the paper is evasive or not transparent (not common in my experience) you can always have recourse to the journal correspondence section or to private communication.

As to not having time etc. to reproduce, that is likely true; however, in practice reproducibility is only an issue for quite pivotal results (for example, the superluminal neutrino). In general, if a similar result is not reproduced by others in the course of community research, it is forgotten. Science often works by filtering out results that are not broadly similar to what a bunch of others have found. It isn't all that confrontational in that respect.

A scientific basis for Open Source Software

Posted May 19, 2012 21:49 UTC (Sat) by pboddie (subscriber, #50784) [Link]

> I agree that the papers do not spell out the algorithm; however, any professional scientist worth their salt can fill in any gaps.

The devil is often in the details, though.

> If the paper is evasive or not transparent (not common in my experience) you can always have recourse to the journal correspondence section or to private communication.

True, but it seems like an unnecessary overhead when the authors could have just published their source code. There are researchers who are quite happy to share their sources unconditionally, so not doing so just seems like adding an extra barrier between people for the sake of it.

Of course, there are factors that discourage people from releasing their sources, such as dissatisfaction about the quality or polish of the work, the lack of readiness of a system for immediate deployment (and other engineering issues), concerns over a maintenance burden, and so on. I've personally heard some of these used to justify not sharing the code running various widely-used services within a particular domain: the service maintainers would rather you used lots of bandwidth and their hardware than take the burden of potentially supporting others deploying their software.

The sad thing is that many scientists just don't seem to care if a service goes away if another similar one pops up in another place. They are quite happy to relinquish control over the process if they get data they can put in a paper. We actually need truly open services as well as software products.

A scientific basis for Open Source Software

Posted May 20, 2012 18:06 UTC (Sun) by Del- (guest, #72641) [Link]

>I agree that the papers do not spell out the algorithm; however, any professional scientist worth their salt can fill in any gaps.

I am afraid this is not even close to reality. Sure, there are many papers with more or less trivial implementations where you can defend your statement, but then you are ignoring a major chunk of today's academic research in science.

Often the code bases are large proprietary beasts; other times they are major code bases built over time at the university, only available to select people. Often we are talking about code bases where you certainly *do not* just fill in the gaps, simply because it would be a monumental undertaking, and you would still get somewhat different results because you did not implement the method identically.
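A toy illustration of that last point, not drawn from any code base mentioned in this thread: the sketch below implements the sample variance twice, using two formulas that are algebraically identical. The function names and the data are hypothetical, chosen only to make the effect visible. On measurements with a small spread on top of a large offset, the two "identical" methods disagree, purely because of floating-point rounding.

```python
# Two textbook formulas for the sample variance. They are algebraically
# identical, so a paper could describe either one simply as "the variance".

def one_pass_variance(xs):
    """Sum-of-squares shortcut: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x  # squares near 1e18 lose their low-order digits
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Compute the mean first, then sum the squared deviations from it."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Hypothetical measurements: deviations of -6, -3, 3, 6 around 1e9 + 10,
# so the exact sample variance is 90 / 3 = 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("one pass:", one_pass_variance(data))
print("two pass:", two_pass_variance(data))
```

With IEEE double precision the two-pass version returns exactly 30.0 here, while the one-pass version does not, because the squared values are rounded before they are subtracted. Without access to the source, a reader of the paper could not tell which of the two was actually run.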

This is a problem we are really struggling with these days, and as one informed poster mentioned, the GPLv3 is our best shot at making things better for the future. It is so bad that the lack of common code bases between academic communities brings advances to a grinding halt. Compute-intensive tasks that require complex codes tend to progress very slowly.

I am thrilled to see this topic on the agenda, and I hope everybody realises that research involving implementations is not worthy academia unless all codes are provided with at least a GPLv3 (or alternatively a less restrictive) license. Moreover, researchers should be encouraged to build on already established codes instead of reinventing the wheel. This is the only way science can prosper, Newton and Leibniz understood this perfectly three hundred years ago. It is about time today's humans do too.

A scientific basis for Open Source Software

Posted May 20, 2012 20:39 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I have to say, I sympathize with the other side of the argument. It would be great to have source and have everything properly licensed with the GPL, but for scientific research, repeatability demands re-implementations. If you can't reproduce the "science" without copying some magic code rather than understanding and reimplementing it, then I'm not sure you are doing Science. If you are just copying, then any bugs in the analysis are going to be propagated when the analysis is double-checked, making any double-checking useless. Maybe having source available makes an audit of the methods used easier, but is it sufficiently easier than reimplementation, or does that level of auditing ultimately just break even with reimplementation, especially if the computer is merely automating basic statistical analysis?

I think the subject is worth debating. Open source scientific software isn't an unmitigated win although it may be the best way forward.

A scientific basis for Open Source Software

Posted May 20, 2012 21:18 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

But if you can't see the code to compare against, how do you tell whether a difference is caused by a code bug (on either side), a problem with experimental procedure, or a genuinely different result?

To the extent that releasing the code creates a monoculture, it's bad (although a monoculture that can be looked at is far better than one that can't be).

But is the risk of this really so high that anyone looking at the issue should be required to code from scratch?

Having this as the requirement means that there is no casual review of the results; it becomes a major undertaking (almost on the scale of the original research project) to try to duplicate the results or try a slight variation.

There is room for debate, but I have a hard time believing that the cases where people blindly use other researchers' work are going to be that much more severe than they are today, and I especially have trouble believing that this is not going to be overwhelmed by the benefits that come from the code being looked at by others.

A scientific basis for Open Source Software

Posted May 20, 2012 21:22 UTC (Sun) by mrjk (subscriber, #48482) [Link]

You want to force other people to replicate human-decades of labor to redo work that could be checked in key areas with a few weeks of effort? We are talking about huge, multi-million-line code bases with very complex logic in many places.

There is a reason numerical analysis libraries were and are re-used for decades. People know them backwards and forwards, and understand their weaknesses and strengths.

Why not just give a qualitative overview of your breakthrough? If people can't reproduce it from that, it's not really science...

This is all a matter of efficiency of effort. The whole point of science as a human activity is really to build on the work of others and have more and more confidence that you can trust models because you and others check various parts of them in detail over time. Without getting to that point with the models embedded in our software, we are reducing the effectiveness of science by a significant amount.

We'll have to re-implement those models in an open way anyway to allow people the insight to have good confidence in them so why not do it the first time?

A scientific basis for Open Source Software

Posted May 20, 2012 21:44 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Think about tens of millions of man-hours lost if people try to (fruitlessly) use theories built on faulty assumptions.

And we're not speaking about codebases millions of lines long. It's quite rare for scientists to write large amounts of code, in fact (they generally suck at it).

A scientific basis for Open Source Software

Posted May 21, 2012 18:13 UTC (Mon) by Del- (guest, #72641) [Link]

>Think about tens of millions of man-hours lost if people try to (fruitlessly) use theories built on faulty assumptions.

That happens every day because there is little code sharing in academia. It is the other way around, you know. When codes are shared, the general quality assurance level increases. Moreover, it allows codes to increase their complexity, reaching farther than earlier efforts did. This should be rather trivial to observe for anybody with knowledge of science today.

>And we're not speaking about codebases millions of lines long. It's quite rare for scientists to write large amounts of code,

First up, the fact that many scientists stick to more or less trivial implementations when they obviously could reach much further with proper code bases available only strengthens the point. Code needs to be shared, and it needs to be shared in such a way that people can build on each other's work. Scientists who need implementations as part of their research need to give as much priority to the implementation as they do to writing the papers. Secondly, I am wondering what kind of experience you have in this. I can easily think of numerous academic code bases that comprise monumental implementations. Here is a nice selection at your convenience:
http://www.dune-project.org/
http://www.mcs.anl.gov/petsc/
http://www.openfoam.com/
http://www.reproducibility.org/wiki/Main_Page
http://fenicsproject.org/
http://www.gnu.org/software/octave/

Good luck with re-implementing any of them.

A scientific basis for Open Source Software

Posted May 21, 2012 19:16 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

You've listed mostly tool or tool-related software. Of course, it should be open.

But tools are rarely the focus of a research paper.

A scientific basis for Open Source Software

Posted May 21, 2012 20:01 UTC (Mon) by daglwn (subscriber, #65432) [Link]

The tools are what generate the published results. Without the tools one can't reproduce the results. Reimplementation is not practical.

No code needed

Posted May 21, 2012 15:10 UTC (Mon) by southey (subscriber, #9466) [Link]

I very much agree that any person trained in the area should be able to independently verify any result without requiring any code from the authors. If you cannot do that then I do not see that you have a right to complain about the code availability.

Code licenses are really a small issue, as often the author may send you the code (or not). Usually it is other aspects that are more problematic. One is the user support (documentation and running the code), as the authors have no time or money for that - hence my first comment. Probably under that is also code quality - some code is so well written that you can find what you want, other code is more complex (but not incorrect). Often, it is far easier to write your own than to try to modify existing code.

Most of the applications have very specific code bases that are not suitable for distribution. Sure, there are community efforts (just see what the Scientific Linux distro provides) that provide the basic libraries, yet you still must know how to use them. It is very easy to say provide the code but it just isn't that simple. You need to find a dedicated person to help when the code does not compile (especially porting to x86-64 platforms or from one platform to another). Even if you have money, finding a person with suitable training (i.e., knows the area AND programming) is very difficult. Furthermore, I doubt that the return on that investment is more than correctly training a person.

Finally, there is one of the most important components, competitive edge. Grant money is essential and I am NOT going to help someone using my code beat me to the same grant!

(Actually I consider having the data used way more critical than the code!)

No code needed

Posted May 21, 2012 15:55 UTC (Mon) by pboddie (subscriber, #50784) [Link]

The problems you're describing have everything to do with the sustainability of an activity, which in this case is about a piece of research that is supposed to inform further research. If the level of engineering is more or less "it works for me", both in the environment that produced some work and in any environment that wishes to build on it, then the code is likely to be no more than a curiosity, particularly if all people are going to do is just run it and get it to do something before it crashes.

> Finally, there is one of the most important components, competitive edge. Grant money is essential and I am NOT going to help someone using my code beat me to the same grant!

And this is precisely why the sustainability situation is so hopeless. It's all "We got our result, on to the next publication!" and just hope that somebody else absorbs the cost of picking up any pieces worth keeping.

Meanwhile, as I write this, an academic somewhere on the planet is probably seeing open source for the first time and wondering if it's an extraterrestrial artifact: "What? They do sharing like this? How wonderful/perverse!"

No code needed

Posted May 24, 2012 8:35 UTC (Thu) by man_ls (subscriber, #15091) [Link]

> Finally, there is one of the most important components, competitive edge. Grant money is essential and I am NOT going to help someone using my code beat me to the same grant!

Shameful. Instead of advancing the state of the art we are back to Alchemy, but with software as the secret ingredient that nobody else must have. Just replace "code" with "formula" or "reaction" in the above.

For this reason alone, public research grants should mandate publication of the code bases under free licenses. The days when a few equations were enough to reproduce someone else's results are long gone in too many fields.

No code needed

Posted May 24, 2012 14:49 UTC (Thu) by raven667 (subscriber, #5198) [Link]

These motivations are understandable and they are regrettable. In any event, a scientific paper should describe all of the analysis in sufficient detail that it can be reproduced. I wouldn't say source code would be required, but certainly sufficient detail to re-implement any tools would be required. For sufficiently complex analysis, maybe the source code would be the best documentation. One worry I have is the same as for electronic elections: what does it really mean if you just rerun the same tools and they spit out a number? Any errors in the analysis will be faithfully reproduced, which I think would impede scientific understanding.

No code needed

Posted May 24, 2012 15:17 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

It seems to me that in a study of this kind there are two aspects to reproducibility: the raw data, and the analysis performed by the software. By including both the raw data and the actual software used in the original study, you make it possible to check each part separately. Without the original software, it's difficult to say whether any differences in the processed results are due to problems with the original software, problems with the reimplementation, or differences in the input.

Having the original software for comparison also makes it easier to guarantee that the results can be reproduced with a different _style_ of implementation; otherwise, not knowing how the original software was implemented, you might end up recreating it the same way, with the same built-in flaws. If the software is included you can deliberately choose a different approach.
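A minimal sketch of that separation, assuming both the original software and the raw data have been published. Every name here (`locate_discrepancy`, `original_mean`, `reimplemented_mean`, the two data sets) is hypothetical, and the "analysis" is just a mean with a deliberately planted bug in the reimplementation. The idea is to vary one thing at a time: same data with different code, then same code with different data.

```python
def locate_discrepancy(original_code, reimplementation,
                       original_data, new_data, tol=1e-9):
    """Vary one factor at a time to see where a disagreement comes from."""
    published = original_code(original_data)
    # Same input, different software: isolates implementation differences.
    same_data_other_code = reimplementation(original_data)
    # Same software, different input: isolates differences in the data.
    same_code_other_data = original_code(new_data)
    return {
        "software_differs": abs(published - same_data_other_code) > tol,
        "data_differs": abs(published - same_code_other_data) > tol,
    }


def original_mean(xs):
    """Stand-in for the published analysis."""
    return sum(xs) / len(xs)


def reimplemented_mean(xs):
    """Stand-in for an independent rewrite, with a planted off-by-one bug."""
    return sum(xs) / (len(xs) - 1)


raw = [1.0, 2.0, 3.0, 4.0]          # hypothetical published raw data
recollected = [1.0, 2.0, 3.0, 4.0]  # hypothetical independently gathered data

report = locate_discrepancy(original_mean, reimplemented_mean, raw, recollected)
print(report)
```

Here the report flags the software and clears the data. Without `original_code` neither comparison can be run at all, which is the situation described above: a mismatch is observed but cannot be attributed to anything.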

No code needed

Posted May 24, 2012 15:35 UTC (Thu) by man_ls (subscriber, #15091) [Link]

There is also gradual improvement of results, which has been a tenet of science for many centuries. A first researcher publishes their basic results, a second researcher publishes their enhancements, the next one publishes a refinement in certain conditions... In these days of computer simulations it becomes essential to have both data and software, as you say, and improve on them gradually. Otherwise research papers become just a lot of hand-waving around estimations and algorithms.

A scientific basis for Open Source Software

Posted May 20, 2012 19:56 UTC (Sun) by dps (subscriber, #5725) [Link]

Have you read "Matrix multiplication via arithmetic progressions"? This paper currently holds the record for the lowest exponent for n*n matrix multiplication. The authors demonstrate the existence of, but do not describe, an algorithm and I doubt anybody has ever implemented it.

The limitations of my knowledge of bilinear forms leave me no choice but to take the claimed result on faith, relying on the absence of any dispute about it. I suspect most people have neither the time nor a large enough dense matrix to make an implementation worthwhile.

A scientific basis for Open Source Software

Posted May 21, 2012 16:08 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> you can always have recourse to the journal correspondence section or to
> private communication.

Good luck with that. My experience is that authors either do not have the time to adequately answer and resolve questions or are afraid to do so because they know the models stink.

A scientific basis for Open Source Software

Posted May 21, 2012 17:14 UTC (Mon) by deater (subscriber, #11746) [Link]

>> you can always have recourse to the journal correspondence section or to
>> private communication.

> Good luck with that. My experience is that authors either do not have the time to
> adequately answer and resolve questions or are afraid to do so because they know
> the models stink

I want to completely agree with this. I've had the experience of asking for source *when the final published paper said it would be available* and still had them refuse on the grounds that it "wasn't ready yet". Two years after publication.

As for the correspondence section, in many computer related fields people publish in conferences. Good luck getting anything out of a conference after it is finished. I've found papers that have had actively inaccurate information, but there's no one to complain to, no mechanism for getting a correction published, and the original authors don't care because hey they got another publication.

A scientific basis for Open Source Software

Posted May 21, 2012 18:10 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

By two years after the paper is published, the people are on to different projects and generally not very interested in working on something they 'finished' two years ago for no additional money or credit.

A scientific basis for Open Source Software

Posted May 21, 2012 20:05 UTC (Mon) by daglwn (subscriber, #65432) [Link]

And that's exactly the problem. We must require source code with all publications. The purpose of publications is to expand knowledge. The whole reason we have references is so that others may look at our work and build upon it. Withholding source is completely contrary to the whole purpose of publication.

Except that most research groups don't see publications that way. Publications are one of two things: a way to graduate or a way to obtain tenure. That's a very different set of goals, with very different values and activities motivated by them. Reproducibility is not among them.

A scientific basis for Open Source Software

Posted May 21, 2012 20:21 UTC (Mon) by zooko (subscriber, #2589) [Link]

My wife is a researcher in computational linguistics. I've seen her waste months of her precious time trying to figure out why published (and widely cited) results don't match up with their published algorithm. These are the kinds of questions that could be easily answered by examining source code and/or by experimenting on the executable, but are more or less impossible to answer without access to either. After watching this happen up close (I help her on occasion with programming work for her research, so I understand some of what she does), I became convinced that reviewers should start rejecting any paper that comes without full executable source code, on the basis that the results in the paper are not reproducible.

Now, maybe other fields than computational linguistics rely on simpler or more declaratively specifiable computation, but I doubt it. My guess is that if you pick a recent, widely cited paper from most modern fields of science and attempt to reproduce it, it will take many months of programming effort on your part, and more than likely you'll fail -- the ultimate results your attempted reproduction emits will not be identical to the original, and you'll be unable to determine why they are different.

A scientific basis for Open Source Software

Posted May 21, 2012 23:30 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> I've seen her waste months of her precious time trying to figure out why
> published (and widely cited) results don't match up with their published
> algorithm. These are the kinds of questions that could be easily answered
> by examining source code and/or by experimenting on the executable, but
> are more or less impossible to answer without access to either.

Spot on. I also spent months and months of time trying unsuccessfully to reproduce results. The only difference from your wife's situation is that I at least got a dissertation chapter out of it. :)

A scientific basis for Open Source Software

Posted May 21, 2012 16:06 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> I'm not convinced that I've seen a paper outside the computer science
> domain that fully describes a non-trivial algorithm, although I'll freely
> admit that I don't read that many papers.

I've never read a paper IN the computer/software engineering field that described algorithms and models in sufficient detail to reproduce the results. It's really quite sad. I worked with some higher-ups at IBM for a while who told me they reject 99% of the research papers outright and rarely reproduce the results of the 1% of ideas they do try.

A scientific basis for Open Source Software

Posted May 20, 2012 0:05 UTC (Sun) by viro (subscriber, #7872) [Link]

You've got to be kidding. Results are produced by implementation, not by algorithm. I.e. by the algorithm plus the bugs present in said implementation. "We have reimplemented the algorithm and results differed from those given in the paper in the following respects" is a _lot_ weaker and less useful than "reviewing the implementation in appendix B of the paper reveals the following bugs and results are affected by those in the following respects".

While we are at it, rigorously assessing one's *own* code is much harder than doing that to code written by somebody else. There's a reason why mutual code review is useful...

If one doesn't want to sink down to the level of "soft sciences" (or alchemy, for that matter[1]), description of methods and materials should be detailed enough to make it realistically possible to investigate how the results depend on the details of those. Otherwise the results are unfalsifiable, with everything that follows from that.

[1] aka "your failure to reproduce my results only proves that you are less spiritually advanced than I am and need to work harder to elevate your soul to my level" - and no, that's not a parody. The discipline really had been infected by that from its inception and it had taken Boyle et al. to dump that "spiritual" garbage. At which point it had become chemistry...

A scientific basis for Open Source Software

Posted May 20, 2012 20:59 UTC (Sun) by szoth (guest, #14825) [Link]

LWN needs a like button, I really enjoyed this comment :)

A scientific basis for Open Source Software

Posted May 24, 2012 8:39 UTC (Thu) by man_ls (subscriber, #15091) [Link]

Me too, even though Viro beat me to using Alchemy as a rhetorical argument (and on a more solid basis).

A scientific basis for Open Source Software

Posted May 21, 2012 16:05 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> The reproducibility argument is a weak one. The methodologies are
> described in any paper worth its salt so the software could be (and often
> should be) rewritten independently. Indeed such a rewrite is a better test
> of the original results since large codes are hard to rigorously assess if
> you did not write them.

Ideally yes. Unfortunately this doesn't happen, ESPECIALLY in computer science and engineering research. Half my dissertation was about how completely unreproducible results are. Small variations in assumptions resulted in large changes in outcomes. In my case changes to either or both of the hardware model and the compiler algorithm running on it gave wildly different results.

We think we have a good intuition of how software and hardware work. We don't. It is impossible to reproduce the models presented in papers precisely because describing them to an adequate degree would essentially be a replication of the source code.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds