Measuring the scientific-software community
The 2016 Community Leadership Summit in Austin, Texas featured a FLOSS Metrics side-track focusing on the tools and processes used to track and understand participation in open-source projects. The subject is of interest to those doing community-management work, since questions recur about how best to measure participation and influence beyond naive metrics like lines of code or commit frequency. But, as James Howison's session on scientific-software usage demonstrated, a well-organized open-source project with public mailing lists is far easier to analyze than certain other software niches.
Howison is a researcher at the University of Texas, Austin who has
been studying the software ecosystem that exists almost entirely
within academic circles. The root issue is that scientists and other
researchers often write custom software to support their work
(examples would include bespoke statistical packages in the R language
and software that models some physical process of interest) but
rarely, if ever, does that software get published on a typical
open-source project-hosting service. Instead, it tends to be uploaded
as a tar archive to the researcher's personal page on the university
web site, where it sits until some other researcher discovers it
through a citation and downloads it to run on another project.
This presents a challenge to anyone wanting to study how such software is used across the scientific community as a whole. For starters, it can be quite difficult to identify the software packages of interest when the only indicator may be a citation in a research paper. Howison showed several examples of less-than-helpful citation styles that may be encountered. Some papers mention only a URL from which the authors downloaded software written by another researcher; others mention just the name of the package or the university where it originated. One crowd-pleasing citation said only "this was tested using software written in the Java language."
Howison has been collecting a data set of software cited in biology research. One of the challenges unique to the research-software niche is that the individual modules are, as a rule, not compiled or linked against each other in what would be termed a "dependency relationship" in the traditional package-management sense. Instead, they may be run as separate jobs that act on the same data, connected only by scripts or job queues, or they may be related only because they model parts of the same larger research project. Thus, even when modules make their way onto a public repository site like the Comprehensive R Archive Network (CRAN), a problem remains: determining when specific modules are used in combination with others. Howison and others who study the scientific-software ecosystem call modules that are used together in this way "complementary."
Several efforts exist to discover and map complementary software usage. Howison works on the Scientific Software Network Map, which he demonstrated for the audience. It relies on user-submitted data from cooperating institutions, such as the logs from high-performance computing facilities about which packages (currently for R only) are run as part of the same job. Wherever possible, the packages are mapped to known public software sources, and the system creates a directed graph showing how individual packages are used in combination with others. The goal is to allow researchers to collaborate better, reporting bugs and feature requests that will improve their software for other teams that make use of it.
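The core idea of mapping complementary usage from job logs can be illustrated with a minimal sketch. The log format here (one list of package names per job) and the package names are assumptions made for illustration; the article does not describe the format the Scientific Software Network Map actually consumes. For simplicity, the sketch counts undirected package pairs rather than building the directed graph mentioned above.

```python
from collections import Counter
from itertools import combinations


def co_usage_counts(jobs):
    """Count how often each unordered pair of packages appears in the same job."""
    counts = Counter()
    for packages in jobs:
        # Deduplicate and sort so every pair has one canonical ordering.
        for a, b in combinations(sorted(set(packages)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical job logs: each inner list holds the R packages loaded by one job.
jobs = [
    ["ggplot2", "dplyr", "lme4"],
    ["dplyr", "ggplot2"],
    ["lme4", "Matrix"],
]

edges = co_usage_counts(jobs)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} <-> {b}: used together in {weight} job(s)")
```

The resulting pair counts can serve as edge weights in a graph, where heavier edges suggest packages that are more often used in combination.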
A similar effort is Depsy, which maps the usage of Python and R packages. Depsy provides information about dependency chains several levels deep, based on usage data extracted from research paper citations. It also factors in dependencies from the Python Package Index and CRAN, as well as data from GitHub repositories where available. Howison noted that Depsy has proven useful to researchers working on grant applications and tenure portfolios who have a need to show that their work is widely used.
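To make "dependency chains several levels deep" concrete, here is a small sketch that walks a dependency graph breadth-first and records how many levels away each package sits from a starting package. The graph and its package names are hypothetical, and this is not a description of how Depsy itself computes its results.

```python
from collections import deque


def dependency_depths(graph, root):
    """Return {package: shortest number of dependency hops from root}."""
    depths = {}
    queue = deque([(root, 0)])
    while queue:
        package, depth = queue.popleft()
        if package in depths:
            # Already reached by a path that is no longer than this one.
            continue
        depths[package] = depth
        for dependency in graph.get(package, ()):
            if dependency not in depths:
                queue.append((dependency, depth + 1))
    return depths


# Hypothetical dependency graph: each package maps to the packages it requires.
graph = {
    "analysis_tool": ["stats_lib", "plot_lib"],
    "stats_lib": ["array_lib"],
    "plot_lib": ["array_lib"],
    "array_lib": [],
}

depths = dependency_depths(graph, "analysis_tool")
print(depths)
```

Running the same traversal over a reversed graph (each package mapped to the packages that require it) would instead show how far a package's influence reaches, which is closer to what someone demonstrating impact would want.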
Another tricky facet of assessing the scientific-software ecosystem is measuring the installed base of a program, which, Howison noted, is not a problem unique to scientific software in the least. There are several approaches available, such as instrumenting software to ping the originating server whenever it is used, but that option raises serious privacy concerns. The approach Howison has been working on digs deep into download statistics instead, which has revealed some perhaps surprising information.
A typical download graph over time looks like a heartbeat, he said:
there is a big spike whenever a new release or some major publicity
event occurs, then the number of downloads tapers back down to a
relatively flat level. Conventional wisdom is that the heartbeat
does not indicate the number of installed users, because the download
spike includes a large number of experimenters, people downloading out
of curiosity, and others who do not continue to run the program. But
Howison's research indicates that, with sufficiently high-resolution
download data, the "static" installed base does correlate with the
number of downloads that come during the initial spike and taper-off,
but only after the average daily download rate from the post-spike
period has been subtracted.
That is to say, users in the active installed base download a new release in the first wave, but one must wait to see where the new "baseline" download level settles after the release, then adjust downward to remove its effect. So the installed base corresponds to the area under the spike of the download curve but above the post-spike baseline level.
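The "area under the spike but above the baseline" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the fixed window lengths, and the sample numbers are all hypothetical, and the article does not spell out how Howison delimits the spike or the post-spike period.

```python
def estimate_installed_base(downloads, spike_len, baseline_len):
    """Sum the downloads during the release spike that exceed the post-spike baseline.

    downloads    -- download counts per period, starting at the release
    spike_len    -- number of periods treated as the spike (an assumption)
    baseline_len -- number of following periods averaged to get the baseline
    """
    spike = downloads[:spike_len]
    post_spike = downloads[spike_len:spike_len + baseline_len]
    if not spike or not post_spike:
        raise ValueError("need data for both the spike and the post-spike period")

    baseline = sum(post_spike) / len(post_spike)
    # Clamp at zero so a dip below the baseline does not subtract from the total
    # (a design choice of this sketch, not something stated in the talk).
    return sum(max(0.0, count - baseline) for count in spike)


# Hypothetical per-period download counts following a release.
downloads = [500, 300, 150, 60, 50, 40, 50, 60]
estimate = estimate_installed_base(downloads, spike_len=3, baseline_len=5)
print(estimate)  # 794.0
```

Here the post-spike baseline averages 52 downloads per period, so the estimate is (500 - 52) + (300 - 52) + (150 - 52) = 794. With coarse data the spike and baseline blur together, which is consistent with the emphasis on high-resolution statistics.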
That measurement makes some degree of intuitive sense, but Howison cautioned that attempting to assess the size of an installed user base is highly dependent on good, high-resolution data: hourly statistics or better. And, unfortunately, that high-resolution data is hard to come by. He had the most success examining download statistics from packages hosted on SourceForge, which provides high-resolution statistics. GitHub and other newer project-hosting services are, evidently, falling short on this front.
As a practical matter, knowing the active installed base for
an open-source project is valuable for a number of reasons. Corporate
sponsors of projects always want to know what the open-source
equivalent to "sales" is, but many projects have discovered that raw
download numbers do not translate particularly well. All of the
techniques Howison described could be used to help open-source
projects better assess where their code is in use and by whom; the
questions are particularly tricky in the realm of academic research,
but the answers are certainly of value to developers in general.
