
From lab to libre software: how can academic software research become open source?

October 25, 2017

This article was contributed by Andy Oram

Academics generate enormous amounts of software, some of which inspires commercial innovations in networking and other areas. But little academic software gets released to the public and even less enters common use. Is some vast "dark matter" being overlooked in the academic community? Would the world benefit from academics turning more of their software into free and open projects?

I asked myself these questions a few months ago when Red Hat, at its opening of a new innovation center in Boston's high-tech Fort Point neighborhood, announced a unique partnership with the goal of tapping academia. Red Hat is joining with Boston-area computer science departments—starting with Boston University—to identify promising software developed in academic projects and to turn it into viable free-software projects. Because all software released by Red Hat is under free licenses, the partnership suggests a new channel by which academic software could find wider use.

This article looks at some successful academic projects that have entered mainstream use—ranging across computer history from the Berkeley Software Distribution (BSD) to Jupyter notebooks—and examines the factors that might help make that transition work. The projects I covered suggest the following rules of thumb:

  • Academics, working in the reward system of academia, are not likely to carry through the conversion of software from a research project to a viable product of interest to a broad community.
  • Funding, usually from government agencies or foundations, is key to the creation of high-quality products that can be widely adopted. This funding can help support the conversion to a useful free-software project.
  • It helps a lot if the target users share some of the values and technical knowledge of the project leaders.
  • Infrastructure software tends to succeed more than application-level projects, perhaps because it has a broader appeal.

We'll start off with some observations from free software advocates about why it's so hard to derive production-ready software from academic research.

Software success ≠ academic success

Academics are obsessed with publication. And almost universally, the academic publishers are interested only in research findings, such as: "Packets of a certain size maximize transmission throughput under such-and-such conditions." The publications do not include related data or source code, which are considered, at best, as ancillary and, all too often, as junk. (Things are changing a bit here, especially for government-funded projects, but mostly in the demand for open data, not open source code.)

The irony—and even tragedy—of this disregard for the infrastructure that makes their findings possible is that academics skimp on software quality measures, such as testing. Quite often, bugs in the software cause researchers to publish incorrect results.

In order to get code widely adopted by people outside the academic setting where it is invented, professors or students must first develop the code conscientiously to ensure that it's robust, extendable, and correctly solves the problem at hand. Then they must iron out the idiosyncrasies of the project for which the code was developed, generalizing it for a broader range of domains and purposes. They have to maintain a repository, solicit contributions, and vet those contributions. Ultimately, someone has to run a community that can debate changes and choose new directions. None of those time-consuming tasks has anything to do with publication or tenure. Academics are pressured to get interesting results, put out a paper, and move on to the next project.

James Vasile, a programmer and consultant in the open-source space, notes that academics are biased toward secrecy in their code, as in their research. Woe to them if they release code early in their research that helps competing scientists reach conclusions earlier and get published first. To prevent this career killer, they hold on to the code until they publish their paper or present their conference session. That could be years after they wrote the code, which is pretty late to create a public repository and develop a community around it.

Generally, Vasile told me, it's more likely for academics to share tools that enable their research as infrastructure, but that aren't the main goal of their research. The possibility of turning infrastructure into open source has parallels in commercial firms in a phenomenon I labeled "closed core" six years ago. Companies often want to keep their essential business software secret, much like academics want to hold the code for their experiments close to the vest.

Vasile mentioned several other barriers that make it hard to develop open-source projects from academic code. Academics work slowly on code, not at the pace of a professional team. When organizations try to help through donations, universities take a big bite—often half—out of the funds. Finally, many of the tasks required to make software robust and useful are not academically interesting.

Marshall Kirk McKusick, one of the early developers and maintainers of BSD, furnished the additional insight that students usually haven't had the time to develop the skills of maintainable, extendable coding, so their work is quick-and-dirty and unsuitable for reuse. It may also contain useless stub code for features that were never fully developed and that probably never will be.

Additional barriers to freeing code were pointed out by Jeffrey Spies, co-founder and CTO of the Center for Open Science. Researchers rarely think about the traits of software that make it suitable for widespread adoption, such as maintainability or documentation. And even those who make the code available in a public repository have little incentive to foster a community around the software.

High-quality software development requires hiring high-quality software developers, which is difficult in academic settings. Good developers want to work in an environment that appreciates their contributions, so academic environments that discount the importance of software quality are unlikely to attract them.

Vasile, McKusick, and Spies presented daunting prospects for successful deployment of academic code. But some projects manage to surmount the hurdles and become free-software successes. Let's look at a few, and try to tease out what helped them succeed.

BSD

BSD, which came out of the University of California, Berkeley, is the crowning success of academically conceived software. McKusick provided a history of BSD's growth and adoption for the O'Reilly Media book Open Sources: Voices from the Open Source Revolution.

One factor that probably allowed BSD to gain wide adoption was its audience of system administrators, who had the skills to install a complicated and sophisticated piece of software on bare metal. Many in the community also submitted contributions to the code. For instance, improvements by the community made 4.2BSD networking more efficient and robust than the code that was originally contributed by Bolt, Beranek, and Newman, which had achieved fame by developing the foundations of the Internet for ARPA (the original name of DARPA). McKusick told me that hundreds of contributors were involved in developing BSD.

Was BSD denied access to sufficient capital? It seems to have been mostly a project of the Berkeley computer science department, although McKusick's article cites funding from DARPA during the critical transition from 3BSD to 4BSD. I see no record of Sun Microsystems, which used BSD as the basis of its SunOS operating system, ever giving money to the project, although it contributed a good deal of code and bug fixes.

Spark

We move now to another project that began at UC Berkeley, but in a very different time and context. A successor to the ground-breaking Hadoop, the Apache Spark cluster-computing engine is part of most "big data" strategies now. Among the projects I've researched for this article, Spark is probably closest to the kind of project that Red Hat will sponsor.

I spoke to one of the early organizers of the Spark project, Patrick Wendell, who left UC Berkeley along with some other team members to found the company Databricks, where he is now VP of Engineering. Wendell told me that Spark was a brainchild of an atypical research group at Berkeley called the AMPLab, where five or six faculty work with about 35 students at any one time on big data processing tools. The AMPLab had both public and private sponsorship, and researchers there were expected to produce software of use to a wide industry audience—as Wendell said, it's "baked into their philosophy." Although projects don't have to be released as free software, many researchers do so to gain the benefits of wide adoption and contributions from the field. For instance, the Berkeley team donated Spark to the Apache Software Foundation in 2013 and built a community of developers outside the AMPLab.

Hence, Wendell said, academics in the AMPLab can have impacts in ways that go beyond publishing papers. They measure success by the broad adoption of their work, not only by insights that get into conferences or journals. He agreed that building community and fixing bugs were not the most efficient path to publication, but for some academics that's fine. Working in the AMPLab does not preclude academic success either—for instance, the Spark project has generated lots of academic publications. Matei Zaharia, founder of the project, took a sabbatical to co-found Databricks but then returned to academia, where he is an assistant professor in the Stanford CS department.

PostgreSQL

This database-management system, perhaps as much as any free software, demonstrates how much can be achieved by developers in an open community. The long and complex history of the project, summarized on a project web page, involved a couple of failed commercialization efforts. But the code of the current PostgreSQL seems to be derived entirely from academic and community efforts, where a Berkeley database project called Ingres inspired another called Postgres, the genesis of modern PostgreSQL. The original developers were part of the same constellation of Berkeley researchers responsible for BSD.

But, as the history page notes: "In 1996, Postgres95 departed from academia". And in a podcast interview, Bruce Momjian suggested that not much of the work on modern PostgreSQL was done at Berkeley, and that PostgreSQL was really a community project from 1996 onward (2:42 into the podcast). He also highlighted the role played by college professors in the PostgreSQL community.

Paradoxically, the academicians don't seem to contribute much to the code, a recalcitrance that Momjian attributes to their lack of interest in practical use (6:30 into the podcast), echoing my conversations with Vasile and Spies. Major funding seems to have come late in the project. Recently, according to Momjian, a number of "big players" have offered support, including IBM, Amazon, and Microsoft (12:57 into the podcast).

Jupyter

Supple enough to be valuable to educators, conference presenters, and general authors alike in the computer field, Jupyter emerged from academic researchers at Cal Poly State University, San Luis Obispo and UC Berkeley. It has brought information presentation into the modern era of multimedia, interactivity, and collaboration. Originally designed to display and run Python code (and called IPython), it was eventually extended so that other computer programming languages could be supported, and its name was changed to Jupyter (still keeping "py" in the name to honor its Python roots and implementation). Jupyter is a central tool in use at my own employer, O’Reilly Media, as was described in a video keynote; it has many other users as well.

In an interview with one of Jupyter's earliest developers, Brian E. Granger, I learned that the Python origins of the project were crucial for historical reasons. Scientists, after years of using proprietary tools such as Mathematica and MATLAB, were turning to powerful Python libraries such as SciPy, NumPy, and the many modules that rely on them. According to Granger, these libraries were developed in the early 2000s but were not ready for production use until later in the decade. Two other advantages enhanced their popularity: being cost-free and being easy to mix with other Python libraries for other tasks. Once the Python libraries became fixtures of many fields in science and engineering—particularly the new field that came to be known as data science—their users were open to the interactive educational tools offered by IPython.
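
To make the "easy to mix" point concrete, here is a minimal, hypothetical sketch (not taken from Granger or the Jupyter project; the data and the model are invented for illustration) of the kind of NumPy-plus-SciPy cell a scientist might run interactively in IPython or a Jupyter notebook:

    # A toy workflow: generate noisy synthetic data with NumPy, then fit it
    # with SciPy; the kind of two-library mixing a single notebook cell makes easy.
    import numpy as np
    from scipy import optimize

    x = np.linspace(0, 5, 100)                                # sample points
    y = 2.5 * np.exp(-1.3 * x) + 0.1 * np.random.randn(100)   # noisy exponential decay

    def model(t, a, k):
        return a * np.exp(-k * t)                             # model: a * exp(-k*t)

    params, _ = optimize.curve_fit(model, x, y)               # least-squares fit
    print("fitted amplitude and decay rate:", params)

In a notebook, the same cell could also plot the fitted curve inline with Matplotlib, which is much of what made the format attractive for teaching and presentations.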

It wasn't hard for people outside academia to appreciate IPython. Everybody in the field teaches a course sometimes, or just gives a conference presentation. IPython, and then Jupyter, cut hours from the time it took to put one's code and text into a spiffy presentation form. The project solved several scientific needs at once: repeating experiments in a reliable way, reproducibility of results by other researchers, and teaching or giving talks.

Granger makes no bones about the importance of funding for the success of his project. It benefited quite early from support by Joshua M. Greenberg of the Sloan Foundation, and now it is additionally funded by the Moore Foundation and the Helmsley Charitable Trust. The project also has numerous sponsors and institutional partners and gets significant code contributions from about 25 full-time developers.

Conclusion

Each research project that experienced success in the larger software world has found its own path forward. The examples I cited in this article are by no means the end of the story. For instance, the co-founder of the R statistical language, Ross Ihaka, suggested (in a paper) that the developers maintained R for years in a "relatively closed process" and stumbled by necessity onto basic open-source practices such as establishing mailing lists and a group of core committers. This project looks like another example of academic software that could be quickly understood and adopted because the target audience closely resembled the developers and was technically adept.

The Mosaic browser, another historic project, started as a government-funded project of the National Center for Supercomputing Applications (NCSA) at University of Illinois Urbana-Champaign. The triumph of Mosaic was short-lived, however, because the leader of the Mosaic team, Marc Andreessen, soon started the Netscape company and created a far superior browser based on Mosaic's principles.

I haven't covered the Linux kernel or GNU project here, because (in addition to them already being famous) they weren't academic projects, even though Linus Torvalds and Richard Stallman happened to be associated with universities when they launched the projects.

Combining what I heard from project leaders and from the other free-software leaders I interviewed, I suggest that an effort like the Red Hat one I mentioned at the beginning of the article would have the best chance of succeeding by following a few overarching principles. First, choose a project whose value can be quickly understood and embraced by its intended users. Bring in outside experts to evaluate the code for quality to make sure it's worth using; if not, it may make sense to launch a new code base with similar goals. The code must also be easy to generalize and extend. Finally, the project should be taken out of the academic environment as soon as possible (with a payoff to the university, if necessary) and assigned to a project leader who has experience building communities around projects and in recruiting companies or individuals to develop code and all the other infrastructure a free-software project needs.

I'll end with some optimism. Professors and students have routinely turned their ideas into proprietary software. But, given the ease of coding these days, and the resulting commoditization of software, some of these academics are likely to consider making the software free. Apache Spark, discussed earlier, is one example. Another is MapD, a major database project that benefited from advice by Michael Stonebraker, one of the field's leading researchers and entrepreneurs. This company open sourced its core product and has been funded to the tune of 25 million dollars. Fledgling projects can now turn to organizations such as the Apache Software Foundation and the Software Freedom Conservancy for organizational advice. In a decade or so, we may know much more about what motivates researchers to open their code, and how they can do so successfully.

[This article is also available in a Portuguese translation by homeyou.]


From lab to libre software: how can academic software research become open source?

Posted Oct 25, 2017 14:48 UTC (Wed) by pj (subscriber, #4506) [Link]

I found this a while ago and watch the software that comes out; it's an interesting slice of research software:

http://joss.theoj.org/

From lab to libre software: how can academic software research become open source?

Posted Oct 25, 2017 19:49 UTC (Wed) by thomas.poulsen (subscriber, #22480) [Link] (1 responses)

As mentioned in the article, the R-project (aka GNU S) has fostered a community around free software for statistical computing. A lot of that gets published in the Journal of Statistical Software as a way to get (academic) credit for software development in this field.

R-project is not an example

Posted Oct 26, 2017 16:40 UTC (Thu) by southey (guest, #9466) [Link]

R-project is just an open source port of S/S+: "R can be considered as a different implementation of S" where S is the S programming language developed by Bell Labs. This resulted in most of the S/S+ community moving to R.

From lab to libre software: how can academic software research become open source?

Posted Oct 25, 2017 20:13 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (19 responses)

Academics and researchers have no interest in sharing software if it is the object of their research.

However, they do have a huge interest in sharing their tools (even though most do not understand this). Sharing tools does not endanger their (academic or private) exclusivity on research results, and does not require sharing the data those tools process. Tools are reused from study to study (they are a long-term investment). Public tools help in collaborating with other researchers and in recruiting students already familiar with the researchers' way of working. As the first users of their tools, academics will care about prosaic stuff that seems less important to them when they deliver software for someone else.

Another example (non-US, private R&D with shared tools): https://code-aster.org/

From lab to libre software: how can academic software research become open source?

Posted Oct 25, 2017 23:17 UTC (Wed) by gdt (subscriber, #6284) [Link] (3 responses)

they do have a huge interest in sharing their tools

TeX being one of the earliest and one of the most influential examples. Not only does that underlie the publishing systems of many publishers in the numerical sciences, but the algorithms Knuth invented for TeX are widely used (eg, the formula modes of pretty much all word processors).

Academics and researchers have no interest in sharing software if it is the object of their research

The "crisis of replication" and the "open access" movement has lead to research funding bodies becoming keen on making a paper's data and analysis programs available. At least with a license which allows inspection. Of course that doesn't mean that the program is available in a timely way or in a way which makes it easy to spin into a programming project, but it's an improvement on the previous situation where having a printout of the program archived in the university registry was best practice.

The programming community has done a poor job of inducting and welcoming scientists. You get a sense of that with this recommendation: "the project should be taken out of the academic environment as soon as possible". An alternative path would be to build near-professional programming skills in scientists; they already need to acquire many other skills to a near-professional level.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 0:35 UTC (Thu) by Tara_Li (guest, #26706) [Link] (2 responses)

> The programming community has done a poor job of inducting and welcoming scientists. You get a sense of that with this recommendation: "the project should be taken out of the academic environment as soon as possible". An alternative path would be to build near-professional programming skills in scientists; they already need to acquire many other skills to a near-professional level.

How about a third path, with programmers making themselves more available (somehow) to non-CS academics to get better code to them, and to everyone else - with a better chance of the programmer being able to say "Oh, yeah, we do this kind of thing all the time", bringing cross-fertilization from other fields. It'd be funny if there was some kind of subtle similarity between genetic drift and the sorting out of collision events of interest in the LHC.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 0:57 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

One way programmers can become more available is when researchers finally accept the necessity of hiring them. The researchers I know who have taken that step seem to have benefited tremendously from having people who really know their stuff doing the programming. If your research is computer-intensive enough that you're worried about the need to share your programs, you really ought to have a serious programmer involved in writing them. Even if the group doesn't do enough to justify having one full-time on staff, it might make sense to have some kind of research programming group that's available as a centralized core service.

From lab to libre software: how can academic software research become open source?

Posted Nov 2, 2017 6:30 UTC (Thu) by einar (guest, #98134) [Link]

> One way programmers can become more available is when researchers finally accept the necessity of hiring them.

But at least in my country, academia does *not* pay enough for a programmer to be hired. I get funny stares when I say that if we need someone to do pure programming, we need to pay market rates.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 13:16 UTC (Thu) by deater (subscriber, #11746) [Link] (14 responses)

> Academics and researchers have no interest in sharing
> software if it is the object of their research.

That is an inaccurate and, frankly, pretty insulting generalization.

There are a lot of researchers and academics who share their software, even in the face of the huge incentives not to. They might be in the minority, but the story is exactly the same outside of academia. It's not like the majority of corporations are sharing their software.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 13:29 UTC (Thu) by deater (subscriber, #11746) [Link]

And a bit of a followup: there are a lot of misunderstandings between the groups. A lot of kernel/Linux devels seem to think academics are locked up in ivory towers, but believe it or not a lot of academics think the kernel people are locked up in their own isolated tower.

The area I am most familiar with is the perf subsystem. The academic/supercomputer researchers are *still* bitter about the kernel politics involved with getting perf merged and really want nothing to do with it. They feel like the kernel perf devs only care about debugging kernel stuff and that they have no understanding of nor care about supercomputing issues.

And now, since perf typically requires root to run anyway due to security reasons, the academic researchers see no reason to bother with perf and have built entire ecosystems around tools that program the performance counters directly (direct MSR writes) and completely bypass the perf_event subsystem to accomplish what they want, as they feel like the perf developers are unresponsive to their needs.

A huge, unnecessary duplication of effort. Both sides are free software though. It hasn't really helped the issue.

And this is not a case of an issue that throwing some grant money around or making up a few new obscure journals is going to help.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 17:13 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (10 responses)

Ok, I'll reformulate. They have no interest in sharing the object of their research from a research point of view. That has the risk of others finishing up before them and reaping awards and funding, or wandering into application aspects.

As human beings, they have the same incentives as every one else: leave their mark, make the world a little better for everyone, and so on.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 18:44 UTC (Thu) by sfeam (subscriber, #2841) [Link] (6 responses)

There seem to be two discussions working at cross-purposes here, perhaps because of a subtle distinction between "academic software research" and "academic research software". I have decades of history and publication as an academic researcher with a focus on the development of software for structural biology. Of course we have an interest and incentive in sharing the object of our research. But the software is not by itself the object of our research, it is a tool which allows us to conduct that research. Toolmakers are sadly under-appreciated by government funding agencies, but good tools are recognized and appreciated by the target research community and I have never found that publication and consequent academic recognition was hindered by developing the code as open source. The GPL is not a good match for many academic software projects, but that's a separate discussion. So no, for the case of "academic research software" I think even your reformulated statement misses the mark. I'll go further and say that in my field shared software has a much better track record and success rate than equivalent competing projects that are kept purely in-house by the originating research group. Because of this, research proposals that do not incorporate plans for making the software tools public are dinged both by reviewers and by funding agencies.

From lab to libre software: how can academic software research become open source?

Posted Nov 2, 2017 6:29 UTC (Thu) by einar (guest, #98134) [Link] (5 responses)

> Of course we have an interest and incentive in sharing the object of our research.

Is this true for smaller institutions? In my field (bioinformatics), unless you're in a large institution that can handle these things (and often will cripple them with ridiculous licensing, but that's another story), I've seen tools kept as extremely guarded secrets. Because you *might* (but often, never) publish them one day.
(That said, I actively engage in communities that understand that collaboration, even on software, is important).

Also, you briefly touched on an important point for fields where software itself is not the object of research: you will never (or very rarely) get funding for maintaining software. So it's up to the Ph.D. students and the post-docs to create software, if needed, only to abandon it after publication or because they move elsewhere. A sad state of affairs.

From lab to libre software: how can academic software research become open source?

Posted Nov 3, 2017 16:22 UTC (Fri) by pboddie (guest, #50784) [Link] (4 responses)

Is this true for smaller institutions? In my field (bioinformatics), unless you're in a large institution that can handle these things (and often will cripple them with ridiculous licensing, but that's another story), I've seen tools kept as extremely guarded secrets. Because you *might* (but often, never) publish them one day.

It disappoints me slightly that the article focused largely on what might be regarded as computer science research or artefacts thereof, whereas there are huge challenges in other disciplines in the delivery of "sustainable science". Interestingly, Jupyter (silly name, I think) attempts to tackle this challenge.

As to whether software is publishable, I think that especially in disciplines like bioinformatics (where I have some experience), software is frequently and disappointingly seen as being a disposable means to an end. It seemed to me that people were quite happy to surf around looking for a Web service that would tell them what they wanted without any curiosity about how that particular tool was made; if it was backed by a publication they'd be reassured, but then that publication might not have any software or data attached to it, perhaps inviting inquirers to make contact to "collaborate".

I did see some interest in the tools I was writing for the group in which I worked. People certainly saw the need to take advantage of things that process existing datasets, and the different public databases seemed to realise that people wanted software to work with the data, even though "Web services" were also touted as an option, which really isn't viable at all for anything more than ad-hoc queries unless the database supplier likes being hammered with HTTP requests (another idiotic habit we had to deal with, in one case involving some US research group whose IP addresses I eventually ended up banning).

Fortunately, my boss kept the rights to the software we developed and released it as Free Software, thus undermining the stupid and greedy "commercialisation" doctrine and organs of the university that were busy inhibiting other people's non-software work. In theory it lives on, but then the matter of sustaining development once everyone has moved elsewhere becomes a problem.

From lab to libre software: how can academic software research become open source?

Posted Nov 3, 2017 17:58 UTC (Fri) by sfeam (subscriber, #2841) [Link] (3 responses)

The situation is much better than you make it appear. Sure there's junk bioinformatics software, just as there's junk wherever Sturgeon's Law applies, which is nearly everywhere. And if you rely on randomly found web services you should indeed be concerned about their quality. But contrary to what you imply, the key software tools are published, reviewed, shared, and often accompanied by test data sets. Decent documentation is admittedly often a sore point. With specific regard to bioinformatics software provided as a web service, I suggest you look at the special issues devoted annually to exactly this class of software in the journal Nucleic Acids Research [*]. Their requirements for publication, documentation, validation suites, demonstrated community use, etc are all very solid.

[*] Yeah the name of the journal does not make it the obvious place to look for such a focus, but it has become a first rank quality touchstone for web-based bioinformatics software.

From lab to libre software: how can academic software research become open source?

Posted Nov 5, 2017 16:45 UTC (Sun) by pboddie (guest, #50784) [Link] (2 responses)

I'm not a true bioinformatician, merely a software engineer who got involved with people doing bioinformatics, but while I agree that many of the tools and services I had to work with were largely robust in terms of methods and transparency of operation, I would have reservations about whether the audience of those tools and services have the means or even the inclination to review what those tools and services do.

Maybe I should clarify my remarks about the audience for such things, though. While I might expect bioinformaticians to be equipped to evaluate software and services, largely because such people should be familiar with the computational and engineering aspects of such work, there are plenty of other people who use things like Web services to get "answers". Admittedly, this doesn't affect the "bread and butter" services like databases, but is more of a concern with services performing some level of analysis. (I remember a newly-pitched literature-mining service using "what causes cancer" as its Google-style front page example.)

Maybe things are actually great and I never realised it, or maybe I was working more in niches where people were more cavalier about what constitutes the validation of a particular tool, so that instead of just publishing their software and data and allowing people to get their hands dirty, there had to be a back-and-forth to get the code, not necessarily constructed using great engineering practices.

I will say that I was encouraged by the software tool use by some institutions, even though some of their choices weren't always to my liking. But then I must also say that even though our group made all our code and data available, there really wasn't much interest in reproducing what we did. For the most part people wanted us to do everything for them, and I certainly got more confirmation that software development is hardly valued in the field: people will drop big bucks for sequencing equipment but expect the analysis to happen for free by random people elsewhere.

From lab to libre software: how can academic software research become open source?

Posted Nov 5, 2017 19:18 UTC (Sun) by sfeam (subscriber, #2841) [Link] (1 responses)

You have shifted to highlighting a separate problem that is indeed serious. A very real problem when developing tools for sophisticated data analysis is that the more you make them easy to use, e.g. providing a web interface, the more likely it is that you attract users who do not understand when the tool is or is not appropriate. This is true regardless of how well engineered, documented, or reviewed the software behind that easy-to-use web interface may be.

To the extent it is even possible to address this problem, I think it must come through better education of the target user group. That is partly what I had in mind when I said that good documentation is a recurring sore point. It is too often hard for non-experts, even in the same field, to understand under which set of conditions their data is best analysed by method A rather than method B, or tool C rather than tool D. The wrong choice may lead to an erroneous scientific conclusion even when both A and B are perfectly valid methods and tools C and D are correctly implemented, each in their own respective domain of applicability.

Hiring more programmers during development is not going to solve this at all. The open/closed status of the source code is also irrelevant to this end of the problem. An example of something that does help, though it only works if you can attract outside funding to set it up, is to hold well-publicized and "fun" contests that pit competing tools against each other. For example the CASP competitions that pit competing approaches to predicting what 3D shape is formed by a protein produced from a particular DNA sequence. The larger community tends to remember, and use, the winning tools even if they don't understand the details of why they performed better. It is notable that if you look at the CASP winners, many of them use open source toolkits and libraries. And those shared code bases are improved by feedback from competition between contributing groups. We need more of this.

From lab to libre software: how can academic software research become open source?

Posted Nov 6, 2017 0:25 UTC (Mon) by pboddie (guest, #50784) [Link]

You have shifted to highlighting a separate problem that is indeed serious.

Don't worry: I'm willing to discuss all problems here!

A very real problem when developing tools for sophisticated data analysis is that the more you make them easy to use, e.g. providing a web interface, the more likely it is that you attract users who do not understand when the tool is or is not appropriate. This is true regardless of how well engineered, documented, or reviewed the software behind that easy-to-use web interface may be.

Yes, but there is arguably more of a demand for attractive, "easy to use" Web services rather than tools. Experiences may vary with regard to what is publishable or not and what the expectations of the reviewers are.

On the former topic, I have my name on a publication about a database that I doubt my previous boss would have regarded interesting enough for publication, but for a publication venue for my then boss it passed the threshold. That is the difference between more bioinformatics-related journals, where the computational techniques would be emphasised, and biology-related journals who probably want a greater emphasis on, say, experimental techniques or theory. (For all I know. What I did perceive, however, was that in the evaluation of research, if you have people who don't "rate" bioinformatics journals because they aren't amongst the ones they know, the research achievements don't get properly recognised.)

On the second topic, the publication in question got remarks about the user interface from the reviewers. It was clear that they wanted something slick and attractive, although decades after the introduction of usability research, people still don't understand that this is largely an iterative process that you really don't want to do in the confines of an article review. Fortunately for everyone concerned, being a relatively simple database, there wasn't much of a trade-off between "looks great" and "obscures what the tool does". We also worked with a group who put quite a bit of emphasis on the look and feel of their Web front-end to my colleagues' work. Again, for certain audiences (and potentially the ones you need to educate), it seems that good-looking things can be seen as more publishable, sometimes deservedly so (they introduce useful visualisations), other times arguably not.

I agree that audience education is essential, wondering if I didn't state or imply that in what I wrote. I also had experience of competitions between tools which were useful to the extent that you could see what other people's tools were supposedly capable of, but I might also suggest that they were distractions in various respects: you can end up focusing on limited datasets, tuning for potentially atypical data, and still not really learning what people were doing.

I remember one participant in a meeting around one of these competitions saying that he rather doubted that various people employing certain machine learning approaches really understood what they were doing. Another doubted that by making opaque tools we were gaining any insight into the problems to be solved (which is also a hot topic with regard to "AI" these days). To an extent, I got the impression that some of these competitions were profile-sustaining activities for certain research groups, and if the code was freely available then people would get many of the benefits anyway.

My remarks about paying for development weren't made in the context of improving the application of the scientific method, but rather an observation about the status of developers in certain parts of academia. I also have to dispute your assertions about code availability somewhat, not to be contrary, but I had actual experiences of methods and code differing when I was able to review them both. Of course, if no-one looks at the code, and my impression was that the audience was under-resourced and unlikely to look at it, then making everything available doesn't solve all the problems.

From lab to libre software: how can academic software research become open source?

Posted Oct 27, 2017 14:54 UTC (Fri) by deater (subscriber, #11746) [Link] (2 responses)

> Ok, I'll reformulate. They have no interest in sharing
> the object of their research from a research point of
> view. That has the risk of others finishing up before
> them and reaping awards and funding, or wandering
> into application aspects.

Citation needed? Who is this "they" you are referring to?

There is certainly a subset of academics who do this, but there is also a large number who release their work immediately.

And there has been a big push by funding agencies to force the release of all data and code within a reasonable time window to allow for first publication.

From lab to libre software: how can academic software research become open source?

Posted Nov 7, 2017 10:51 UTC (Tue) by aggelos (subscriber, #41752) [Link] (1 responses)

Ok, I'll reformulate. They have no interest in sharing the object of their research from a research point of view. That has the risk of others finishing up before them and reaping awards and funding, or wandering into application aspects.
Citation needed? Who is this "they" you are referring to?

Oh wow. I guess that's a fair question. This survey comes to mind. I don't think there's any shortage of systems or security papers not supported by source code; a point with which you seem to agree below. Just adding the citation you requested.

Keep in mind that the above survey does not avoid publication bias. Namely, if reviewers are more likely to reject a paper because they can spot obvious bugs, unstated limitations, etc. in the source, whereas they would accept a similar paper that gives a ponies-and-rainbows description of its implementation, then there are submissions (though not publications) with source that are not accounted for. In my experience, this sort of bias against papers with code is a significant worry for researchers.

That said, I have heard of reviewers requesting code the last couple of years (though this mainly works when the paper is rejected and subsequently resubmitted to the same conference). I'm not at all sure that this perceived bias exists. For all we know, the bias could be in the other direction (i.e. in favor of papers that do publish source).

There is certainly a subset of academics who do this, but there is also a large number who release their work immediately.
And there has been a big push by funding agencies to force the release of all data and code within a reasonable time window to allow for first publication.

Glad to hear that. My turn to ask for a citation now, as I'd like to learn more about those incentives to release code and data.

The existence of academics who release code despite significant disincentives need not draw attention away from the existence of said disincentives. Nor shift the focus to individual failings, of course.

From lab to libre software: how can academic software research become open source?

Posted Nov 7, 2017 18:10 UTC (Tue) by sfeam (subscriber, #2841) [Link]

And there has been a big push by funding agencies to force the release of all data and code within a reasonable time window to allow for first publication.
-- Glad to hear that. My turn to ask for a citation now, as I'd like to learn more about those incentives to release code and data.

Again I will respond specifically with regard to "academic research software", i.e. software tools developed for use in research as opposed to software that is itself the subject of the research. I am mostly familiar with research funded by the US National Institutes of Health. Here is text from the over-arching guideline published in the 1999 Federal Register. Note that software falls under the umbrella categories "research tool", "material", or "unique research resource", terms used throughout the document. One relevant section of the text reads:

  • Recipients are expected to ensure that unique research resources arising from NIH-funded research are made available to the scientific research community. The majority of transfers to not-for-profit entities should be implemented under terms no more restrictive than the UBMTA. In particular, Recipients are expected to use the Simple Letter Agreement provided below, or another document with no more restrictive terms, to readily transfer unpatented tools developed with NIH funds to other Recipients for use in NIH-funded projects. If the materials are patented or licensed to an exclusive provider, other arrangements may be used, but commercialization option rights, royalty reach-through, or product reach-through rights back to the provider are inappropriate.
That particular text only mandates access by other NIH-funded researchers, but in practice that means everyone from your closest collaborators to your fiercest competitors. Furthermore my experience with the NIH peer-review system both as an applicant and as a reviewer is that in software-heavy proposals, failure to state an intention to share your software tools counts as a black mark, while a well-documented plan and previous history in disseminating your software can boost the "impact" score, which is critical for funding.

I am less familiar with the parallel requirements for funding by other US federal agencies, but here is text from a guideline by the NSF (National Science Foundation): Dissemination and Sharing of Research Results.

  • c. Investigators and grantees are encouraged to share software and inventions created under the grant or otherwise make them or their products widely available and usable.

From lab to libre software: how can academic software research become open source?

Posted Nov 5, 2017 12:12 UTC (Sun) by CycoJ (guest, #70454) [Link] (1 responses)

I have to agree that this is somewhat insulting. I would in fact argue that academia shares a much higher proportion of its software than other industries where producing free software is not the main aim.

While I agree with many of the points brought forward, I do see problems with the proposal (the principles proposed in the conclusion) that is being put forward here. One of the main points seems to be to bring in "professionals" and take the project out of academia and have them led by people with experience in "community building". This sounds to me somewhat like "push the people who started the work out and let the professionals handle it". If Red Hat proposed something like this for other volunteer-led free-software projects there would be an outcry. Just because the academics don't have the time and funding to fully develop and maintain a project does not mean that they don't feel strongly about "their baby".

Instead of trying to wrestle control away from the academics who started the work, maybe tackle the point that is the biggest problem: funding for professional software developers. So Red Hat (or a similar initiative) could give grants (monetary, developer time, other advice and support) to academic projects to develop them into full-blown free-software projects, while leaving the academic in control. This would also foster building bridges (and a community) between the free-software and academic communities, and academics who have been part of such a program could clearly see and advocate for the benefits of creating open source projects. Contrary to what is stated in the article, in my experience most academics have an interest in sharing their work; it's part of the scientific process (on which the free software movement was arguably modelled), they just don't have the resources to do it.

From lab to libre software: how can academic software research become open source?

Posted Nov 6, 2017 19:16 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> One of the main points seems to be to bring in "professionals" and take the project out of academia and have them led by people with experience in "community building". This sounds to me somewhat like "push the people who started the work out and let the professionals handle it".

This focus on "control" when it comes to open source projects seems negative and foolish, leading to distracting petty squabbles rather than focused team building. If someone forks your projects and takes it in a different direction than you want, you still own your own code and can do with it whatever you like, you haven't lost any control, you just don't get to dictate your wishes to others.

> Instead of trying to wrestle control away from the academics who started the work, maybe tackle the point that is the biggest problem. Funding for professional software developers.

Why would this necessarily be a fight about control between software professionals and academics involving "wrestling", instead of a team where the academic brings their domain-specific knowledge and a general idea of how to break a problem down into small enough steps for a computer to help, and the software engineer/scientist brings project management, change control, deployment, testing, team building, and algorithm/data-structure engineering, so the end result is fast, efficient, and useful to other people?

> Leaving the academic in control. This would also foster building bridges

I don't think so, bridges are two-way and require trust, an insistence on control and hierarchy is not demonstrating trust or respect, it's treating the professional developer as a tool rather than a partner or community member. If you had funding to hire a professional then you could give them orders as your direct employee but if you are expecting a community of volunteers then you have to treat them as peers, not as minions. If you don't have the resources to pay professional developers directly then you have to give them a reason to care and a reason to follow your lead, because they don't have to do either.

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 4:21 UTC (Thu) by Shewmaker (guest, #1126) [Link]

The Center for Research in Open Source Software (CROSS) was inspired by the story of Sage Weil, a student at UCSC who turned his academic prototype for the Ceph file system into a multi-million dollar company (now owned by Red Hat).

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 8:30 UTC (Thu) by bavay (subscriber, #60804) [Link] (1 responses)

At the institute where I work, we have released three models as open source and I am their maintainer (https://models.slf.ch). This is a task that entails lots of frustration, lots of sacrifices and very little recognition. First, it's "paper above all", so "normal" researchers don't really care about the long-term maintainability of the software they write; it just has to enable publishing the next paper. Then software is looked upon as some kind of ugly child that does not have the nobility of measurements. This means that even when talking about research infrastructure, most people (colleagues, management, etc.) don't consider software. Moreover, most people assume that your computer model is badly written, ridden with bugs, totally unreliable and impossible for anybody else to use (so mostly worthless outside of the paper that described it).

Then, the research community does not even want this to change: investing time in designing a proper structure, setting up coding guidelines, and trying to enforce them is seen as a waste of time, and me (as a maintainer) pestering people to think before coding, clean up their code, etc. is seen as an annoyance that does not have to produce changes. A lot of our contributors learned programming directly in our models, stubbornly refused to clean up their code (they just want their master/phd/paper out) and considered that I would have to clean up myself after they left anyway (so they get the credit for the code, while I get the long-term maintenance and the bugs). This means that the maintainer has to rewrite every contribution, does not get credited for it (the paper is already published), does not necessarily have tremendous support for doing so (why rewrite something that was good enough to get a paper out?) and destroys his/her scientific career in the process (because one can not rewrite tons of spaghetti code and publish many papers at the same time).

How could this change? I am dreaming of the "publication" of software with detailed reviews, as is done for papers. A tough process that would value well-written scientific software and that could be accepted as equal to a traditional publication (a piece of software, like a paper, is an expression of ideas and logical reasoning; there is no reason why the two should be so different). This could also help establish "reference" implementations for a wide range of algorithms (so people don't end up always (badly) re-implementing the solution to a well-known scientific problem). I did not know about "joss", but this is definitely a step in the right direction!

Mathias

From lab to libre software: how can academic software research become open source?

Posted Oct 26, 2017 13:18 UTC (Thu) by fenncruz (subscriber, #81417) [Link]

It's not all doom and gloom, though I get where you are coming from. In astronomy things are changing slowly; there are several big projects that get funded for software "development" (with the science side almost being a secondary component, at least when writing the grant); have a look at the NSF's SI grants, which fund software infrastructure. Also the journals are slowly changing: ApJ (a top-tier journal for astronomy) now accepts code papers that just describe the code without needing to do new "physics" with them (see http://journals.aas.org/policy/software.html). In fact I work on an open-source astronomy code, with several software papers, and because it is open and other people get to use the code, these code papers get a lot of citations and have become my most highly cited papers.

The other useful thing that is changing is citing smaller projects: I maintain a plotting library that on its own is never getting a paper, but after doing a release on GitHub it gets picked up by Zenodo (https://zenodo.org/) and gets a DOI, so I can cite it in my papers. ApJ now has a section you can add to your paper mentioning (and citing) the software you use, including ancillary things like Python/NumPy/SciPy etc., not just the "main" code you used.

Though I do still find that, when interviewing for jobs, people question the value of my software development (even if I have fixed bugs for them and they depend on the code). Maybe software developers need to be seen more like instrument builders: doing something that is essential and should be rewarded, but not necessarily going to generate huge papers on its own.

From lab to libre software: how can academic software research become open source?

Posted Nov 2, 2017 22:23 UTC (Thu) by mfidelman (guest, #119412) [Link]

Well let's not forget the web, and particularly Mosaic & the NCSA Daemon that became Apache. CERN & NCSA are certainly academic facilities.

Or visualization stuff like D3 (originated at Stanford).

From lab to libre software: how can academic software research become open source?

Posted Nov 8, 2017 1:50 UTC (Wed) by partain (subscriber, #6389) [Link]

The Glasgow Haskell Compiler (25+ years old) has been an academic project that always intended to be open source and to have real-world use.

