
A scientific basis for Open Source Software

Martin Davis of the JTS Topology Suite project points readers to an article in Nature arguing that open source software should be a standard requirement for peer-reviewed science. "The paper raises the argument for open source software to a higher plane, that of being a necessary component of scientific proof. It points out that the increasing use of computational science as a basis for scientific discovery implies that open source must become a standard requirement for documentation. Apparently some journals such as Science already require source code to be supplied along with submissions of articles."

A scientific basis for Open Source Software

Posted May 18, 2012 21:41 UTC (Fri) by theophrastus (guest, #80847) [Link]

More power to them!

I've felt this needed to be the policy standard at public (tax payer funded) universities since a biochem group I was a member of was granted a "free" license for some of Schrodinger's molecular tool-sets (associated with their Glide™ molecular binding libraries). I asked at various meetings "how can our results be published if we aren't allowed to know the mathematics behind their essential 'scoring function'"? I was waved-off with "it's enough that we publish that we used a particular version of their software" -- which was a proprietary blackbox. Our publications thus became free advertising - as was always their plan.

I was then afterwards deemed a trouble-maker as the university continued its slide toward "corporate cooperativity". "I suppose you'll be wanting next that we convert away from MSWord documents?", I was asked. "well yes... but let's work on the smaller infections first" [wink]

A scientific basis for Open Source Software

Posted May 20, 2012 20:10 UTC (Sun) by engla (guest, #47454) [Link]

Ask them if Word is integral to your research. I don't think it is, like the scoring function is.

A scientific basis for Open Source Software

Posted May 18, 2012 22:39 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]

The scientific argument is that independent researchers must be able to determine how the results were produced, so they need to be able to see the source code and ideally include an analysis of the code as part of the review process if the results depend on it. But this doesn't require that every provision of the DFSG be honored; for example, many researchers make their software available "for research purposes only", hoping to commercialize it later.

That said, for research that gets taxpayer funds the taxpayer shouldn't have to pay twice (once to fund the development, again if they want to use the results), and we wouldn't want reviewers, who are peers in the same field, to be prevented from developing similar but improved software.

A scientific basis for Open Source Software

Posted May 18, 2012 23:31 UTC (Fri) by sfeam (subscriber, #2841) [Link]

I have developed several scientific code packages with support from the NIH (US National Institutes of Health). There is a requirement in the boilerplate of a typical NIH grant award that the resulting programs are treated as open source. Now the NIH definition of "open source" probably does not meet the criteria of the OSF, but it does satisfy the scientific concern that other researchers using the programs can inspect the source code for themselves to verify or understand how it works.

If the software is later commercialized, this can introduce complications. But, at least in my field, the norm for commercial software developed academically is to charge what the market will bear to pharmaceutical companies but make it available at no cost, usually with restrictions on further redistribution, to other academic groups. Schroedinger was mentioned in an earlier comment, and provides a case in point. They charge a nice fee for their stuff commercially, but at least some of their packages are also distributed as standard RPMs in various linux distros. To drag in mention of a separate thread, this is/was one of the nice features of Mandrake/Mandriva. They offered a nice set of chemistry-related packages, although in recent years this wasn't kept up as well as one might wish.

A scientific basis for Open Source Software

Posted May 18, 2012 23:35 UTC (Fri) by atai (subscriber, #10977) [Link]

What is OSF?

A scientific basis for Open Source Software

Posted May 18, 2012 23:43 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

Seems pretty obvious that OSI is being mistakenly called OSF

A scientific basis for Open Source Software

Posted May 18, 2012 23:55 UTC (Fri) by sfeam (subscriber, #2841) [Link]

My brain/fingers are stuck back in the era of DEC/OSF. But yes, I meant the OSI.

A scientific basis for Open Source Software

Posted May 19, 2012 16:42 UTC (Sat) by gmaxwell (subscriber, #30048) [Link]

> But this doesn't require that every provision of the DFSG be honored

Though it may require much more of it than is obvious at first guess. For example, sometimes the only viable path to show that something is broken is to fix it and then let others validate for themselves that the fixed version works better. A license which allows you to look but not touch isn't sufficient.

Ultimately, I think the real underlying requirement is that the software itself also be part of the continuing scientific dialog, but this isn't reasonably possible without pretty much the full DFSG. Consider: without the freedom to use the results of your modifications commercially, you'll spend time reinventing the wheel instead of furthering the art from the existing tools. But sure, access to the code is an essential improvement.

A scientific basis for Open Source Software

Posted May 19, 2012 22:09 UTC (Sat) by bjartur (guest, #67801) [Link]

Technically, you could address your former concern by allowing the distribution and application of patches. Actually, some supposedly free software is distributed under licenses that prohibit modification but allow distribution of patches. Gnuplot (not GNU) and amiwm, for example.

But yes, substantial advancement of the state of the art, and in particular merging software from multiple projects, requires more freedom.

A scientific basis for Open Source Software

Posted May 19, 2012 14:37 UTC (Sat) by magi (subscriber, #4051) [Link]

I have thought that this is a good idea for a long time. However, the other problem will be to convince the scientist to release their code as open source. I don't think that commercialisation is such a big issue (at least in the field where I am working), but the bigger issue is one of control:
  • this is my research and code; how can I make sure no-one is publishing results before I have finished with it?
  • this is a complicated model, how can I make sure that others understand its limitations and do not publish results that are based on misuses of the model?
Another item on my wish list is that research councils should recognise (open source) software as creditable products like journal papers.

A scientific basis for Open Source Software

Posted May 19, 2012 17:43 UTC (Sat) by JoeBuck (subscriber, #2330) [Link]

In my experience, the researcher is usually happy to release the software as open source, but university management often interferes, because they hope to obtain money from licensing anything that turns out to be valuable. Some universities in the US treat software developed by researchers as a profit center.

One workaround could be to use GPLv3 and let the licensing people sell alternative licensing.

A scientific basis for Open Source Software

Posted May 19, 2012 20:40 UTC (Sat) by emk (guest, #1128) [Link]

Yup. Or just speak to the university lawyers earlier in the process.

Shrug and say, "Look, this might be useful to some researcher somewhere, but it would be an epic failure as commercial technology transfer. It's just half-baked, and the entire market is probably a half-dozen other broke academics. If we sunk in $50–150K of startup costs and a year of our professional lives, we might scrape up $5K of sales if those academics busted their budgets. Frankly, we're better off spending the time writing more grants."

"If you want to maintain some commercial rights on the off chance that anybody, anywhere ever cares, we can go ahead and slap a strict 'share and share alike' license (such as the GPLv3) on it, and reserve the right to license it under alternate terms if somebody wants to pay us."

I've successfully made this pitch at two major research universities. In my experience, if you have the backing of your PI, your lawyers will probably go along.

A scientific basis for Open Source Software

Posted May 20, 2012 9:28 UTC (Sun) by danieldk (guest, #27876) [Link]

As much as I dislike the GPL, it has an additional benefit in science: with some grants, you have to transfer copyrights to all software that is produced, *except* when you can't. In such cases, if you can argue that you need to expand an *existing* GPL-licensed project, you can often get a waiver (for copyright transfer) for deliverables that involve that particular software.

In other words: licensing scientific software under a strong copyleft license could make life easier for other scientists who want to expand your work.

A scientific basis for Open Source Software

Posted May 21, 2012 15:53 UTC (Mon) by paulj (subscriber, #341) [Link]

The GPL is no obstacle to transferring the copyright. What do you think the FSF has long demanded for contributions to many GNU projects?

A scientific basis for Open Source Software

Posted May 19, 2012 15:17 UTC (Sat) by richo123 (guest, #24309) [Link]

I think the argument that it is taxpayers' money that paid for the software is more persuasive.

I am a research scientist who does a large amount of source code development. Sharing that code is only infrequently useful to either the public or other scientists.

The reproducibility argument is a weak one. The methodologies are described in any paper worth its salt so the software could be (and often should be) rewritten independently. Indeed such a rewrite is a better test of the original results since large codes are hard to rigorously assess if you did not write them. Since the algorithm is published the software SHOULD be reproducible.

On the other hand it really helps the scientific community in general if codes are shared. It can often save a lot of time.

A scientific basis for Open Source Software

Posted May 19, 2012 18:48 UTC (Sat) by pboddie (subscriber, #50784) [Link]

> I am a research scientist who does a large amount of source code development. Sharing that code is only infrequently useful to either the public or other scientists.

That is your modest assessment, however. A lot of software, if exposed to the right audience, can benefit substantially from the accompanying exposure and improvements even if the audience is unfamiliar with the problem domain. And people in various domains can often benefit from techniques employed in other domains.

> The reproducibility argument is a weak one. The methodologies are described in any paper worth its salt so the software could be (and often should be) rewritten independently. Indeed such a rewrite is a better test of the original results since large codes are hard to rigorously assess if you did not write them. Since the algorithm is published the software SHOULD be reproducible.

I'm not convinced that I've seen a paper outside the computer science domain that fully describes a non-trivial algorithm, although I'll freely admit that I don't read that many papers. My impression is that authors want you to get in contact to find out more and to "collaborate" with them - that appears to be easier than getting a complete algorithm description published.

Several factors exist that frustrate reproducibility and transparency, not limited to competition for funding, politics, publication requirements (both the need to publish and the restrictions around publication), and the temptation to "monetise" research by institutions.

But I agree that independent reimplementations of software systems can be useful in assessing the quality of results and in decoupling a methodology from implementation artifacts. However, I can personally attest that it takes time away from more rewarding work and is arguably a luxury unless one is the sort of researcher that is on such good terms with the funding agencies that one gets money for just about any project regardless of its merits.

A scientific basis for Open Source Software

Posted May 19, 2012 18:58 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

>I'm not convinced that I've seen a paper outside the computer science domain that fully describes a non-trivial algorithm, although I'll freely admit that I don't read that many papers.

That happens all the time in HEP (High Energy Physics) and in computational biology. Also quite often publications describe modifications of existing methods rather than totally new and unique methods. So it's fairly easy to replicate them.

A scientific basis for Open Source Software

Posted May 21, 2012 13:49 UTC (Mon) by dgm (subscriber, #49227) [Link]

Also in Mathematics, where many of the algorithms we use daily come from. Do not play with Big Boys' stuff if you don't want a serious headache ;-)

A scientific basis for Open Source Software

Posted May 19, 2012 20:25 UTC (Sat) by richo123 (guest, #24309) [Link]

I agree that the papers do not spell out the algorithm however any professional scientist worth their salt can fill in any gaps. If the paper is evasive or not transparent (not common in my experience) you can always have recourse to the journal correspondence section or to private communication.

As to not having time etc. to reproduce, that is likely true; however, in practise reproducibility is only an issue for quite pivotal results (for example, the superluminal neutrino). In general, if a similar result is not reproduced by others in the course of community research, it is forgotten. Science often works by filtering out results that are not broadly similar to what a bunch of others have found. It isn't all that confrontational in that respect.

A scientific basis for Open Source Software

Posted May 19, 2012 21:49 UTC (Sat) by pboddie (subscriber, #50784) [Link]

> I agree that the papers do not spell out the algorithm however any professional scientist worth their salt can fill in any gaps.

The devil is often in the details, though.

> If the paper is evasive or not transparent (not common in my experience) you can always have recourse to the journal correspondence section or to private communication.

True, but it seems like an unnecessary overhead when the authors could have just published their source code. There are researchers who are quite happy to share their sources unconditionally, so not doing so just seems like adding an extra barrier between people for the sake of it.

Of course, there are factors that discourage people from releasing their sources, such as dissatisfaction about the quality or polish of the work, the lack of readiness of a system for immediate deployment (and other engineering issues), concerns over a maintenance burden, and so on. I've personally heard some of these used to justify not sharing the code running various widely-used services within a particular domain: the service maintainers would rather you used lots of bandwidth and their hardware than take the burden of potentially supporting others deploying their software.

The sad thing is that many scientists just don't seem to care if a service goes away if another similar one pops up in another place. They are quite happy to relinquish control over the process if they get data they can put in a paper. We actually need truly open services as well as software products.

A scientific basis for Open Source Software

Posted May 20, 2012 18:06 UTC (Sun) by Del- (guest, #72641) [Link]

>I agree that the papers do not spell out the algorithm however any professional scientist worth their salt can fill in any gaps.

I am afraid this is not even close to reality. Sure, there are many papers with more or less trivial implementations where you can defend your statement, but then you are ignoring a major chunk of today's academic research in science.

Often the code bases are large proprietary beasts, other times it is major code bases built over time at the university, only available to select people. Often we are talking about code bases where you certainly *do not* just fill in the gaps, simply because it would be a monumental undertaking, and you still would get somewhat different results because you did not implement the method identically.

This is a problem we are really struggling with these days, and as one informed poster mentioned the GPLv3 is our best shot at making things better for the future. It is so bad that lack of common code bases between academic communities brings advances to a grinding halt. Compute intensive tasks that require complex codes tend to progress very slowly.

I am thrilled to see this topic on the agenda, and I hope everybody realises that research involving implementations is not worthy academia unless all codes are provided with at least a GPLv3 (or alternatively a less restrictive) license. Moreover, researchers should be encouraged to build on already established codes instead of reinventing the wheel. This is the only way science can prosper; Newton and Leibniz understood this perfectly three hundred years ago. It is about time today's humans do too.

A scientific basis for Open Source Software

Posted May 20, 2012 20:39 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I have to say, I sympathize with the other side of the argument. It would be great to have source and have everything properly licensed with the GPL, but for scientific research, repeatability demands re-implementations. If you can't reproduce the "science" without copying some magic code rather than understanding and reimplementing it, then I'm not sure you are doing Science. If you are just copying, then any bugs in the analysis are going to be propagated when the analysis is double-checked, making any double-checking useless. Maybe having source available makes an audit of the methods used easier, but is it sufficiently easier than reimplementation, or is that level of auditing ultimately breaking even with reimplementation, esp. if the computer is just automating basic statistical analysis?

I think the subject is worth debating. Open source scientific software isn't an unmitigated win although it may be the best way forward.

A scientific basis for Open Source Software

Posted May 20, 2012 21:18 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

but if you can't see the code to compare the code, is the cause of any difference a result of a code bug (on either side), a problem with experimental procedure, or really a different result?

to the extent that releasing the code creates a monoculture, it's bad (although a monoculture that can be looked at is far better than one that can't be).

But is the risk of this really so high that anyone looking at the issue should be required to code from scratch?

Having this as the requirement means that there is no casual review of the results, it becomes a major undertaking (almost to the scale of the original research project) to try and duplicate the results or try a slight variation.

There is room for debate, but I have a hard time believing that the cases where people blindly use other researchers work are going to be that much more severe than they are today, and I especially have trouble believing that this is not going to be overwhelmed by the benefits that come from the code being looked at by others.

A scientific basis for Open Source Software

Posted May 20, 2012 21:22 UTC (Sun) by mrjk (subscriber, #48482) [Link]

You want to force other people to replicate human-decades of labor to redo work that could be checked in key areas with a few weeks of effort? We are talking huge, million-line code bases with very complex logic in many places.

There is a reason numerical analysis libraries were and are re-used for decades. People know them backwards and forwards, and understand their weaknesses and strengths.

Why not just give a qualitative overview of your breakthrough, if people can't reproduce it from that, it's not really science...

This is all a matter of efficiency of effort. The whole point of science as a human activity is really to build on the work of others and have more and more confidence you can trust models because you and others check various parts of them in detail over time. Without getting to that point in the models embedded in our software we are reducing the effectiveness of science by a significant amount.

We'll have to re-implement those models in an open way anyway to allow people the insight to have good confidence in them so why not do it the first time?

A scientific basis for Open Source Software

Posted May 20, 2012 21:44 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Think about tens of millions of man-hours lost if people try to (fruitlessly) use theories built on faulty assumptions.

And we're not speaking about codebases millions of lines long. It's quite rare for scientists to write large amounts of code, in fact (they generally suck at it).

A scientific basis for Open Source Software

Posted May 21, 2012 18:13 UTC (Mon) by Del- (guest, #72641) [Link]

>Think about tens of millions of man-hours lost if people try to (fruitlessly) use theories built on faulty assumptions.

That happens every day because there is little code sharing in academia. It is the other way around, you know. When codes are shared, the general quality assurance level increases. Moreover, it allows codes to increase their complexity, reaching farther than earlier efforts did. This should be rather trivial to observe for anybody with knowledge of science today.

>And we're not speaking about codebases millions of lines long. It's quite rare for scientists to write large amounts of code,

First up, the fact that many scientists stick to more or less trivial implementations when they obviously could reach much further with proper code bases available only strengthens the point. Code needs to be shared, and it needs to be shared in such a way that one may build on each other. Scientists needing implementations as part of their research need to give as much priority to the implementation as they do to writing the papers. Secondly, I am wondering what kind of experience you have in this. I can easily think of numerous academic code bases that comprise monumental implementations. Here is a nice selection at your convenience:
http://www.dune-project.org/
http://www.mcs.anl.gov/petsc/
http://www.openfoam.com/
http://www.reproducibility.org/wiki/Main_Page
http://fenicsproject.org/
http://www.gnu.org/software/octave/

Good luck with re-implementing any of them.

A scientific basis for Open Source Software

Posted May 21, 2012 19:16 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

You've listed mostly tool or tool-related software. Of course, it should be open.

But tools are rarely the focus of a research paper.

A scientific basis for Open Source Software

Posted May 21, 2012 20:01 UTC (Mon) by daglwn (subscriber, #65432) [Link]

The tools are what generate the published results. Without the tools one can't reproduce the results. Reimplementation is not practical.

No code needed

Posted May 21, 2012 15:10 UTC (Mon) by southey (subscriber, #9466) [Link]

I very much agree that any person trained in the area should be able to independently verify any result without requiring any code from the authors. If you can not do that then I do not see that you have a right to complain about the code availability.

Code licenses are really a small issue, as often the author may send you the code (or not). Usually it is other aspects that are more problematic. One is the user support (documentation and running the code), as the authors have no time or money for that - hence my first comment. Probably under that is also code quality - some code is so well written that you can find what you want, other code is more complex (but not incorrect). Often, it is far easier to write your own than try to modify existing code.

Most of the applications have very specific code bases that are not suitable for distribution. Sure, there are community efforts (just see what the Scientific Linux distro provides) that provide the basic libraries, yet you still must know how to use them. It is very easy to say provide the code but it just isn't that simple. You need to find a dedicated person to help when the code does not compile (especially porting to x86-64 platforms or from one platform to another). Even if you have money, finding a person with suitable training (i.e., knows the area AND programming) is very difficult. Furthermore, I doubt that the return on that investment is more than correctly training a person.

Finally, there is one of the most important components, competitive edge. Grant money is essential and I am NOT going to help someone using my code beat me to the same grant!

(Actually I consider having the data used way more critical than the code!)

No code needed

Posted May 21, 2012 15:55 UTC (Mon) by pboddie (subscriber, #50784) [Link]

The problems you're describing have everything to do with the sustainability of an activity, which in this case is about a piece of research that is supposed to inform further research. If the level of engineering is more or less "it works for me", both in the environment that produced some work and in any environment that wishes to build on it, then the code is likely to be no more than a curiosity, particularly if all people are going to do is just run it and get it to do something before it crashes.

> Finally, there is one of the most important components, competitive edge. Grant money is essential and I am NOT going to help someone using my code beat me to the same grant!

And this is precisely why the sustainability situation is so hopeless. It's all "We got our result, on to the next publication!" and just hope that somebody else absorbs the cost of picking up any pieces worth keeping.

Meanwhile, as I write this, an academic somewhere on the planet is probably seeing open source for the first time and wondering if it's an extraterrestrial artifact: "What? They do sharing like this? How wonderful/perverse!"

No code needed

Posted May 24, 2012 8:35 UTC (Thu) by man_ls (subscriber, #15091) [Link]

> Finally, there is one of the most important components, competitive edge. Grant money is essential and I am NOT going to help someone using my code beat me to the same grant!

Shameful. Instead of advancing the state of the art we are back to Alchemy, but with software as the secret ingredient that nobody else must have. Just replace "code" with "formula" or "reaction" in the above.

Only for this reason public research grants should mandate publication of the code bases under free licenses. The days where a few equations were enough to reproduce someone else's results are long gone in too many fields.

No code needed

Posted May 24, 2012 14:49 UTC (Thu) by raven667 (subscriber, #5198) [Link]

These motivations are understandable and they are regrettable. In any event a scientific paper should detail all of the analysis in sufficient detail that it can be reproduced. I wouldn't say source code would be required but certainly sufficient detail to re-implement any tools would be required. For sufficient analysis complexity maybe the source code would be the best documentation. One worry I have is the same as for electronic elections: what does it really mean if you just rerun the same tools and it spits out a number? Any errors in the analysis will be faithfully reproduced, which I think would impede scientific understanding.

No code needed

Posted May 24, 2012 15:17 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

It seems to me that in a study of this kind there are two aspects to reproducibility: the raw data, and the analysis performed by the software. By including both the raw data and the actual software used in the original study, you make it possible to check each part separately. Without the original software, it's difficult to say whether any differences in the processed results are due to problems with the original software, problems with the reimplementation, or differences in the input.

Having the original software for comparison also makes it easier to guarantee that the results can be reproduced with a different _style_ of implementation; otherwise, not knowing how the original software was implemented, you might end up recreating it the same way, with the same built-in flaws. If the software is included you can deliberately choose a different approach.
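The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data values and all function names are hypothetical, and a simple mean/variance computation stands in for a real analysis. The "original" uses a two-pass formula, while the independent reimplementation deliberately uses a different style (Welford's one-pass update), and both are run on the same raw data.

```python
import math

def reference_mean_variance(samples):
    """Hypothetical stand-in for the original analysis: two-pass mean and sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, var

def reimplemented_mean_variance(samples):
    """Independent reimplementation in a different style: Welford's one-pass update."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for count, x in enumerate(samples, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2 / (len(samples) - 1)

def results_agree(a, b, rel_tol=1e-9):
    """Compare two result tuples element by element within a relative tolerance."""
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))

# Hypothetical raw data, shared by both implementations.
raw_data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
original = reference_mean_variance(raw_data)
independent = reimplemented_mean_variance(raw_data)
print(results_agree(original, independent))
```

With the data and the original code both published, a disagreement here would point at one of the two implementations rather than at the input.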

No code needed

Posted May 24, 2012 15:35 UTC (Thu) by man_ls (subscriber, #15091) [Link]

There is also gradual improvement of results, which has been a tenet of science for many centuries. A first researcher publishes their basic results, a second researcher publishes their enhancements, the next one publishes a refinement in certain conditions... In these days of computer simulations it becomes essential to have both data and software, as you say, and improve on them gradually. Otherwise research papers become just a lot of hand-waving around estimations and algorithms.

A scientific basis for Open Source Software

Posted May 20, 2012 19:56 UTC (Sun) by dps (subscriber, #5725) [Link]

Have you read "Matrix multiplication via arithmetic progressions"? This paper currently holds the record for the lowest exponent for n*n matrix multiplication. The authors demonstrate the existence of, but do not describe, an algorithm and I doubt anybody has ever implemented it.

The limitations of my knowledge of bilinear forms make it impossible for me not to take the result claimed on faith and the absence of any dispute about the result. I suspect most people have neither the time nor a large enough dense matrix to make an implementation worthwhile.

A scientific basis for Open Source Software

Posted May 21, 2012 16:08 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> you can always have recourse to the journal correspondence section or to
> private communication.

Good luck with that. My experience is that authors either do not have the time to adequately answer and resolve questions or are afraid to do so because they know the models stink.

A scientific basis for Open Source Software

Posted May 21, 2012 17:14 UTC (Mon) by deater (subscriber, #11746) [Link]

>> you can always have recourse to the journal correspondence section or to
>> private communication.

> Good luck with that. My experience is that authors either do not have the time to
> adequately answer and resolve questions or are afraid to do so because they know
> the models stink

I want to completely agree with this. I've had the experience of asking for source *when the final published paper said it would be available* and still had them refuse on the grounds that it "wasn't ready yet". Two years after publication.

As for the correspondence section, in many computer related fields people publish in conferences. Good luck getting anything out of a conference after it is finished. I've found papers that have had actively inaccurate information, but there's no one to complain to, no mechanism for getting a correction published, and the original authors don't care because hey they got another publication.

A scientific basis for Open Source Software

Posted May 21, 2012 18:10 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

by two years after the paper is published, the people are on to different projects and generally not very interested in working on something they 'finished' two years ago for no additional money or credit.

A scientific basis for Open Source Software

Posted May 21, 2012 20:05 UTC (Mon) by daglwn (subscriber, #65432) [Link]

And that's exactly the problem. We must require source code with all publications. The purpose of publications is to expand knowledge. The whole reason we have references is so that others may look at our work and build upon it. Withholding source is completely contrary to the whole purpose of publication.

Except that most research groups don't see publications that way. Publications are one of two things: a way to graduate or a way to obtain tenure. That's a very different set of goals, with very different values and activities motivated by them. Reproducibility is not among them.

A scientific basis for Open Source Software

Posted May 21, 2012 20:21 UTC (Mon) by zooko (subscriber, #2589) [Link]

My wife is a researcher in computational linguistics. I've seen her waste months of her precious time trying to figure out why published (and widely cited) results don't match up with their published algorithm. These are the kinds of questions that could be easily answered by examining source code and/or by experimenting on the executable, but are more or less impossible to answer without access to either. After watching this happen up close (I help her on occasion with programming work for her research, so I understand some of what she does), I became convinced that reviewers should start rejecting any paper that comes without full executable source code, on the basis that the results in the paper are not reproducible.

Now, maybe other fields than computational linguistics rely on simpler or more declaratively specifiable computation, but I doubt it. My guess is that if you pick a recent, widely cited paper from most modern fields of science and attempt to reproduce it, it will take many months of programming effort on your part, and more than likely you'll fail -- the ultimate results your attempted reproduction emits will not be identical to the original, and you'll be unable to determine why they are different.

A scientific basis for Open Source Software

Posted May 21, 2012 23:30 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> I've seen her waste months of her precious time trying to figure out why
> published (and widely cited) results don't match up with their published
> algorithm. These are the kinds of questions that could be easily answered
> by examining source code and/or by experimenting on the executable, but
> are more or less impossible to answer without access to either.

Spot on. I also spent months and months trying unsuccessfully to reproduce results. The only difference from your wife's situation is that I at least got a dissertation chapter out of it. :)

A scientific basis for Open Source Software

Posted May 21, 2012 16:06 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> I'm not convinced that I've seen a paper outside the computer science
> domain that fully describes a non-trivial algorithm, although I'll freely
> admit that I don't read that many papers.

I've never read a paper IN the computer/software engineering field that described algorithms and models in sufficient detail to reproduce the results. It's really quite sad. I worked with some higher-ups at IBM for a while who told me they reject 99% of the research papers outright and rarely reproduce the results of the 1% of ideas they do try.

A scientific basis for Open Source Software

Posted May 20, 2012 0:05 UTC (Sun) by viro (subscriber, #7872) [Link]

You've got to be kidding. Results are produced by implementation, not by algorithm. I.e. by the algorithm plus the bugs present in said implementation. "We have reimplemented the algorithm and results differed from those given in the paper in the following respects" is a _lot_ weaker and less useful than "reviewing the implementation in appendix B of the paper reveals the following bugs and results are affected by those in the following respects".

While we are at it, rigorously assessing one's *own* code is much harder than doing that to code written by somebody else. There's a reason why mutual code review is useful...

If one doesn't want to sink down to the level of "soft sciences" (or alchemy, for that matter[1]), description of methods and materials should be detailed enough to make it realistically possible to investigate how results depend on the details of those. Otherwise the results are unfalsifiable, with everything that follows from that.

[1] aka "your failure to reproduce my results only proves that you are less spiritually advanced than I am and need to work harder to elevate your soul to my level" - and no, that's not a parody. The discipline really had been infected by that from its inception and it had taken Boyle et al. to dump that "spiritual" garbage. At which point it had become chemistry...

A scientific basis for Open Source Software

Posted May 20, 2012 20:59 UTC (Sun) by szoth (guest, #14825) [Link]

LWN needs a like button, I really enjoyed this comment :)

A scientific basis for Open Source Software

Posted May 24, 2012 8:39 UTC (Thu) by man_ls (subscriber, #15091) [Link]

Me too, even though Viro beat me to using Alchemy as a rhetorical argument (and on a more solid basis).

A scientific basis for Open Source Software

Posted May 21, 2012 16:05 UTC (Mon) by daglwn (subscriber, #65432) [Link]

> The reproducibility argument is a weak one. The methodologies are
> described in any paper worth its salt so the software could be (and often
> should be) rewritten independently. Indeed such a rewrite is a better test
> of the original results since large codes are hard to rigorously assess if
> you did not write them.

Ideally yes. Unfortunately this doesn't happen, ESPECIALLY in computer science and engineering research. Half my dissertation was about how completely unreproducible results are. Small variations in assumptions resulted in large changes in outcomes. In my case changes to either or both of the hardware model and the compiler algorithm running on it gave wildly different results.

We think we have a good intuition of how software and hardware work. We don't. It is impossible to reproduce the models presented in papers precisely because describing them to an adequate degree would essentially be a replication of the source code.

Academic license discounts

Posted May 19, 2012 18:17 UTC (Sat) by SimonO (subscriber, #56318) [Link]

I doubt there are many negative feelings here regarding applying the open source way to science, but I'd like to mention the prevalent use of educational discounts, which has the terrible side effect of getting students hooked on specific software which they will then "need" when starting in a commercial job.

In the Netherlands, this way of discounting to the point of giving away to students of whatever level has turned the country into a 99% Windows place.
Similarly, the use of MATLAB in science, which is more or less affordable in the academic world, makes it attractive to build software on top of it in that world, but hard to include other, less fortunately situated people in the work on that software (e.g. http://fieldtrip.fcdonders.nl/).

So this is just the first step; they should further require that no proprietary software is necessary to reproduce the results. Such a rule would probably motivate (i.e. get money for) the people from FieldTrip to port it to Python or Octave, whichever is easiest.

/Simon

Academic license discounts

Posted May 20, 2012 18:31 UTC (Sun) by theophrastus (guest, #80847) [Link]

hear-hear! (and i watched it all happen like a ballet between a business suit and tweed jacket)

my own sad attempt to challenge this trend was to ask (for example) the shiny new post-doc who was enthusing over the graphing/presentation abilities of excel/powerpoint ("now you can add *sound-effects* to our graphs!"): "when you graduate, and establish your wondrous biotech company (as they all were planning) wouldn't it be advantageous not to have to buy business licenses for every seat?" about half of them waved me off mumbling something about vast steel towers built of IPO funds; but a minority may've seen the advantages.... or not.

my favorite line, which i had the young man sign was: "i don't need to know the science, as long as i get the publication"

A scientific basis for Open Source Software

Posted May 20, 2012 18:55 UTC (Sun) by david.a.wheeler (guest, #72896) [Link]

The failure to release source code holds back research.

"The Evolution from LIMMAT to NANOSAT" by Armin Biere (Technical Report #444, Dept. Computer Science, ETH Zurich, CH-8092 Zurich, Switzerland), 15 April 2004 notes that when they examined previous work, "From the publications alone, without access to the source code, various details were still unclear... what we did not realize, and which hardly could be deduced from the literature, was [an optimization] employed in GRASP and CHAFF [was critically important]... Only [when CHAFF's source code became available did] our unfortunate design decision became clear... The lesson learned is, that important details are often omitted in publications and can only be extracted from source code. It can be argued, that making source code of SAT solvers available is as important to the advancement of the field as publications".

Very simply, if "we the people" paid for it, then I believe "we the people" should normally get it. I.e., any software developed using public funds should be released by default to the people under an OSS license. Sure, there are exceptions, but they should be justified as exceptions. That's phrased in the language of the U.S. Constitution, but any democracy still has the basic notion that the government should be working for the benefit of its citizens.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds