Development
Measuring the scientific-software community
The 2016 Community Leadership Summit in Austin, Texas featured a FLOSS Metrics side-track focusing on the tools and processes used to track and understand participation in open-source projects. The subject is of interest to those doing community-management work, since questions recur about how best to measure participation and influence beyond naive metrics like lines of code or commit frequency. But, as James Howison's session on scientific-software usage demonstrated, a well-organized open-source project with public mailing lists is far easier to analyze than certain other software niches.
Howison is a researcher at the University of Texas at Austin who has
been studying the software ecosystem that exists almost entirely
within academic circles. The root issue is that scientists and other
researchers often write custom software to support their work
(examples would include bespoke statistical packages in the R language
and software that models some physical process of interest) but
rarely, if ever, does that software get published on a typical
open-source project-hosting service. Instead, it tends to be uploaded
as a tar archive to the researcher's personal page on the university
web site, where it sits until some other researcher discovers it
through a citation and downloads it to run on another project.
This presents a challenge to anyone wanting to study how such software is used across the scientific community as a whole. For starters, it can be quite difficult to identify the software packages of interest, when the only indicator may be a citation in a research paper. Howison showed several examples of less-than-helpful citation styles that may be encountered. Some mention a URL where they downloaded software written by another researcher; some just mention the name of the package or the university where it originated. One crowd-pleasing citation said only "this was tested using software written in the Java language."
Howison has been collecting a data set of software cited in biology research. One of the challenges unique to the research-software niche is that the individual modules themselves are, as a rule, not compiled or linked against each other in what would be termed a "dependency relationship" in the traditional package-management sense. Instead, they may be run as separate jobs that act on the same data, not linked at run time but connected only by scripts or job queues, or they may be connected simply because they model parts of the same larger research project. Thus, even when modules make their way onto a public repository site like the Comprehensive R Archive Network (CRAN), there is another problem: determining when specific modules are used in combination with others. Howison and others who study the scientific-software ecosystem call such packages "complementary" modules.
Several efforts exist to discover and map complementary software usage. Howison works on the Scientific Software Network Map, which he demonstrated for the audience. It relies on user-submitted data from cooperating institutions, such as the logs from high-performance computing facilities about which packages (currently for R only) are run as part of the same job. Wherever possible, the packages are mapped to known public software sources, and the system creates a directed graph showing how individual packages are used in combination with others. The goal is to allow researchers to collaborate better, reporting bugs and feature requests that will improve their software for other teams that make use of it.
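The core bookkeeping behind such a map is straightforward to sketch. The fragment below is a minimal illustration in Python, not the map's actual implementation: it counts how often pairs of packages appear in the same job, which is the raw signal from which a co-usage graph can be built. The log format and package names are invented for this example; the real system consumes logs from cooperating HPC facilities and produces a directed graph.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical job log: each entry lists the R packages run as part of
# one HPC job. This structure is an assumption for illustration only.
job_logs = [
    ["ggplot2", "dplyr"],
    ["dplyr", "ape"],
    ["ggplot2", "dplyr", "ape"],
]

def cousage_graph(jobs):
    """Count how often each pair of packages appears in the same job."""
    edges = defaultdict(int)
    for packages in jobs:
        # Sort and deduplicate so each unordered pair is counted once per job.
        for a, b in combinations(sorted(set(packages)), 2):
            edges[(a, b)] += 1
    return dict(edges)

graph = cousage_graph(job_logs)
```

In a system like the Network Map, edge weights of this sort would then be attached to packages mapped back to known public sources such as CRAN, so that heavily co-used packages stand out.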
A similar effort is Depsy, which maps the usage of Python and R packages. Depsy provides information about dependency chains several levels deep, based on usage data extracted from research paper citations. It also factors in dependencies from the Python Package Index and CRAN, as well as data from GitHub repositories where available. Howison noted that Depsy has proven useful to researchers working on grant applications and tenure portfolios who have a need to show that their work is widely used.
Another tricky facet of assessing the scientific-software ecosystem is measuring the installed base of a program, which, he noted, is by no means a problem unique to scientific software. There are several approaches available, such as instrumenting the software to ping the originating server whenever it is run, but that option raises serious privacy concerns. The approach Howison has been working on instead digs deep into download statistics, which has revealed some perhaps surprising information.
A typical download graph over time looks like a heartbeat, he said:
there is a big spike whenever a new release or some major publicity
event occurs, then the number of downloads tapers back down to a
relatively flat level. Conventional wisdom is that the heartbeat
does not indicate the number of installed users, because the download
spike includes a large number of experimenters, people downloading out
of curiosity, and others who do not continue to run the program. But
Howison's research indicates that, with sufficiently high-resolution
download data, the "static" installed base correlates with the
number of downloads that come during the initial spike and taper-off,
but only when the average number of daily downloads from the
post-spike period is subtracted.
That is to say, the active-installed-base users download a new release in the first wave, but one must wait and see what the new "baseline" download level after the release is, and adjust downward to remove its effect. So the installed base corresponds to the area under the spike of the download curve, but above the post-spike baseline level.
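That calculation is simple enough to sketch. The following Python fragment uses invented numbers and hand-picked window boundaries, not any of Howison's actual data; it subtracts the post-spike baseline from each day of the spike and sums what remains.

```python
# Sketch of the area-under-the-spike estimate described above.
# The counts and window indices are invented for illustration; real
# data would need hourly resolution or better, per Howison.
daily_downloads = [
    500, 480, 520,          # pre-release level
    4000, 2500, 1500, 900,  # release spike and taper-off
    510, 490, 500, 505,     # post-spike baseline
]

def installed_base_estimate(downloads, spike_start, spike_end):
    """Sum the downloads during the spike, less the post-spike baseline."""
    post = downloads[spike_end:]
    baseline = sum(post) / len(post)
    spike = downloads[spike_start:spike_end]
    return sum(d - baseline for d in spike)

estimate = installed_base_estimate(daily_downloads, 3, 7)
```

With hourly data the same arithmetic applies; the hard part in practice is deciding where the spike ends and the new baseline begins, which is why resolution matters.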
That measurement makes some degree of intuitive sense, but Howison cautioned that attempting to assess the size of an installed user base is highly dependent on good, high-resolution data: hourly statistics or better. And, unfortunately, that high-resolution data is hard to come by. He had the most success examining download statistics from packages hosted on SourceForge, which provides high-resolution statistics. GitHub and other newer project-hosting services are, evidently, falling short on this front.
As a practical matter, knowing the active installed base for an open-source project is valuable for a number of reasons. Corporate sponsors of projects always want to know what the open-source equivalent to "sales" is, but many projects have discovered that raw download numbers do not translate particularly well. All of the techniques Howison described could be used to help open-source projects better assess where their code is in use and by whom; the questions are particularly tricky in the realm of academic research, but certainly of value to developers in general.
Brief items
Quotes of the week
Posts about "this functionality is broken on systemd but not OpenRC" are great, but "Unix Philosophy"? What did that get us? A bunch of failing vendors in the 1990s, and the inevitability of Windows NT.
The "philosophers" of Unix let themselves be rounded up and made irrelevant.
If you want to build a better init system, build a better init system. But philosophy-based OS advocacy is a failure. Designing working software based on philosophy is like writing real network software based on the OSI 7-layer burrito model.
Python 3.6.0a1 is now available
The first alpha of what will become Python 3.6 is now available. A total of four alpha releases are planned; users should be aware, however, that Python 3.6 remains under heavy development at this stage, so caution is advised.
Docker 1.11: The first runtime built on containerd and based on OCI technology
Docker Engine 1.11 has been released, built on runC and containerd. "runC is the first implementation of the Open Containers Runtime specification and the default executor bundled with Docker Engine. Thanks to the open specification, future versions of Engine will allow you to specify different executors, thus enabling the ecosystem of alternative execution backends without any changes to Docker itself. By separating out this piece, an ecosystem partner can build their own compliant executor to the specification, and make it available to the user community at any time – without being dependent on the Engine release schedule or wait to be reviewed and merged into the codebase."
U-Boot v2016.05 has been released
U-Boot version 2016.05 is now available, a week later than initially expected. The highlight in this round is EFI support on ARM. "U-Boot can now run EFI applications on ARM and ARMv8. The main use case here has been GRUB and that works. This is enabled by default on ARM and ARMv8 and a few platforms have turned it off due to not being useful for their needs or space savings." Other changes of note include further progress on the device model and improvements in testing.
Flatpak (previously xdg-app) 0.6.0 is available
The application-level containerization tool previously known as xdg-app has made a new release, version 0.6. In conjunction with that release, the project has been renamed "Flatpak." Lead developer Alexander Larsson notes that existing repositories should keep working after the change, but that users will need to reinstall any system-wide runtimes and apps—as well as to refamiliarize themselves with the new naming convention for the command-line tools, D-Bus interface names, and so forth.
Newsletters and articles
Development newsletters from the past week
- What's cooking in git.git (May 11)
- What's cooking in git.git (May 13)
- What's cooking in git.git (May 17)
- OCaml Weekly News (May 17)
- OpenStack Developer Digest (May 13)
- Perl Weekly (May 16)
- Python Weekly (May 12)
- Ruby Weekly (May 12)
- This Week in Rust (May 16)
- Tahoe-LAFS Weekly News (May 11)
- Wikimedia Tech News (May 16)
Schaller: H264 in Fedora Workstation
At his blog, Christian Schaller discusses the details of the OpenH264 media codec from Cisco, which is now available in Fedora. In particular, he notes that the codec only handles the H.264 "Baseline" profile. "So as you might guess from the name Baseline, the Baseline profile is pretty much at the bottom of the H264 profile list and thus any file encoded with another profile of H264 will not work with it. The profile you need for most online videos is the High profile. If you encode a file using OpenH264 though it will work with any decoder that can do Baseline or higher, which is basically every one of them." Wim Taymans of GStreamer is looking at improving the codec with Cisco's OpenH264 team.
Page editor: Nathan Willis