Debsources as a platform
Debsources is a project that provides a web-based interface into the source code of every package in the Debian software archive—not a small task by any means. But, as Stefano Zacchiroli and Matthieu Caneill explained in their DebConf 2015 session, Debsources is far more than a source-code browsing tool. It provides a searchable viewport into 20 years of free-software history, which makes it viable as a platform for many varieties of research and experimentation.
Big data
Debsources was first developed at the Initiative de Recherche et Innovation sur le Logiciel Libre (IRILL), Zacchiroli began. Initially, the project implemented a web application for browsing the full repository of Debian source packages. The packages indexed cover the stable, unstable, and experimental archives for every Debian release from 1998's "hamm" up through today's current experimental archive, plus all of the backports. For each package, the Debsources database includes every update that has been pushed to the official archive. Thus, while it does not capture every commit made to a package, it does include every upload made by the Debian project.
The Debsources browsing tool lets users navigate to specific files, in graphical or text-mode web browsers, and provides syntax highlighting for more than 100 languages. For users needing to explore a particular package, he said, this is far faster than using apt-get source to download and unpack the source code locally, where it must then be explored through an editor or other application.
But the developers did not stop at implementing a browsable archive. They implemented full-text search across the entire database, with support for searching on package names, file hashes (using SHA-256), and functional symbols in the source (e.g., functions, classes, and variables). The symbol-searching functionality is implemented using the ctags utility, and it supports searching by both ctags indices and by regular expressions. Every time a new file is uploaded to the Debian archive, ctags is run automatically to add the changes to Debsources.
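To make the symbol-search idea concrete, here is a minimal sketch of regular-expression search over a ctags index. It is not the Debsources implementation; it only assumes the standard tab-separated "tags" file format that ctags emits. The names Tag, parse_tags, and search, as well as the sample entries, are hypothetical and chosen for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class Tag(NamedTuple):
    name: str  # symbol name (function, class, variable, ...)
    path: str  # file in which the symbol is defined
    kind: str  # ctags kind, e.g. "f" for function or "c" for class


def parse_tags(lines: Iterable[str]) -> List[Tag]:
    """Parse lines in the tab-separated ctags 'tags' file format."""
    tags = []
    for line in lines:
        line = line.rstrip("\n")
        # Lines starting with "!_TAG_" carry metadata, not symbols.
        if not line or line.startswith("!_TAG_"):
            continue
        # Extension fields follow the ';"' marker. Splitting there first
        # keeps tab characters inside the search pattern from confusing us.
        head, _, ext = line.partition(';"\t')
        parts = head.split("\t", 2)
        if len(parts) < 3:
            continue
        kind = ext.split("\t")[0] if ext else ""
        if kind.startswith("kind:"):
            kind = kind[len("kind:"):]
        tags.append(Tag(parts[0], parts[1], kind))
    return tags


def search(tags: Iterable[Tag], pattern: str) -> List[Tag]:
    """Return every tag whose symbol name matches the regular expression."""
    regex = re.compile(pattern)
    return [tag for tag in tags if regex.search(tag.name)]


# Hypothetical index entries, for illustration only.
SAMPLE = [
    '!_TAG_FILE_FORMAT\t2\t/extended format/',
    'main\tsrc/hello.c\t/^int main(void)$/;"\tf',
    'parse_args\tsrc/args.c\t/^static int parse_args(int argc, char **argv)$/;"\tf',
    'Parser\tlib/parser.py\t/^class Parser:$/;"\tc',
]

matches = search(parse_tags(SAMPLE), r"^[Pp]arse")
```

A real service would keep the parsed symbols in a database rather than in memory, but the lookup has the same shape: index symbols once per upload, then match names on demand.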
Here again, Zacchiroli explained the distinction between what Debsources does and the functionality already offered by an existing tool—in this case, codesearch.debian.net. The codesearch database, he said, is geared toward bug fixing for upcoming Debian releases; it only indexes the current "unstable" archive, and it is not updated on every push. The Debsources web application is also written to facilitate collaboration: on the site, users can generate and share links to specific lines in a file and can highlight or annotate lines. That allows users to reference and comment on potential bugs at a granular level.
Debsources is also integrated with the codesearch site and with Debian's tracker.debian.org package tracker. Debsources and codesearch share the same regular-expression search engine, with the results being automatically redirected to the site from which the search was performed. On the package tracker, each package page includes links to Debsources marked as "browse source code." Integration with additional parts of the Debian infrastructure is still to come. In addition, all of the features that are exposed in the web interface are also available in a JSON-based API, so even more developers can make use of the Debsources service.
The massive collection of package data and source code is interesting from a statistical perspective as well as a practical one, Zacchiroli said. A wide array of metrics is available at sources.debian.net/stats, including disk space consumed, lines of code, number of files, and number of ctags symbols for every release. Altogether, "sid" currently takes up 228GB across 11.7 million files and over one billion lines of code. Of those lines, about 439 million are in C.
Zacchiroli also discussed some ancillary features of Debsources that make it potentially interesting for other uses. Because it tracks SHA-256 hashes of each file, the database can easily identify duplicate files anywhere in the archive. On each file's page, the user interface includes a link that will bring up every incidence of a duplicate file in the archive. This makes it easy to see, for instance, that there are 4,309 copies of the GPLv3 COPYING file.
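The duplicate-detection idea can be sketched in a few lines: hash every file with SHA-256 and group paths by digest. This is an illustration under simple assumptions (an in-memory mapping of paths to contents), not the Debsources code; find_duplicates and the sample archive are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def sha256_of(content: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def find_duplicates(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each digest shared by two or more files to the sorted paths."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[sha256_of(content)].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in by_digest.items()
        if len(paths) > 1
    }


# Hypothetical archive contents, for illustration only.
ARCHIVE = {
    "pkg-a/COPYING": b"license text\n",
    "pkg-b/COPYING": b"license text\n",
    "pkg-a/main.c": b"int main(void) { return 0; }\n",
}

duplicates = find_duplicates(ARCHIVE)
```

Because identical content always yields the same digest, a single indexed digest column is enough to answer "where else does this file appear?" for any file in the archive.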
Ongoing developments
Caneill then took the microphone and discussed recent work, both by existing Debian contributors and by Outreachy or Google Summer of Code (GSoC) interns. The new features include a detailed directory-listings format that includes file sizes and permissions (much like the output of ls -l), plus an in-browser file editor. The editor is implemented as a browser plugin (for Chromium and Firefox/Iceweasel); it lets the user edit any file in Debsources and output the changes as a diff that is ready to send to the package maintainer.
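The "edit, then export a diff" workflow can be approximated with Python's standard difflib module. The sketch below is not the plugin's code; make_patch and the a/ and b/ path prefixes are assumptions chosen to resemble the unified diffs that maintainers commonly receive.

```python
import difflib


def make_patch(original: str, edited: str, path: str) -> str:
    """Return a unified diff turning `original` into `edited`."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        edited.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical file contents, for illustration only.
ORIGINAL = 'int main(void) {\n    puts("helo");\n    return 0;\n}\n'
EDITED = 'int main(void) {\n    puts("hello");\n    return 0;\n}\n'

patch = make_patch(ORIGINAL, EDITED, "src/hello.c")
```

The resulting text can be saved to a file and applied with a standard patch tool, which is what makes a diff a convenient unit to hand to a maintainer.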
Behind the scenes, he said, the interns have done a lot of refactoring of the Debsources application that should make it easier for contributors to add still more functionality. The codebase is a lot more modular now, which has other benefits, too. For example, the file-updating code has been rewritten to be asynchronous at each stage (adding or updating a package, computing the statistics, etc.), which helps performance. The charting module has been rewritten to produce nicer-looking graphs, and Python 3 support was added.
Another new feature—still in development—is the "copyright information" application, which is used to scan and track copyright information in the Debian archive. Some (though not all) packages include machine-readable copyright statements, which the application tracks and computes statistics from. In addition, the application generates a Software Package Data Exchange (SPDX) file for each copyright statement that it finds, and will display it in the Debsources web interface. That application was developed by GSoC student Orestis Ioannou, who is also working on a patch-tracking application that will integrate with Debsources.
Moving forward, Caneill said, the roadmap includes a number of other features: automatically running static-analysis tools, providing more live statistics (such as on license and patch information), and linking every binary package to the corresponding source package in Debsources (which is not currently easy, because a binary package might originate from a variety of source packages with "Provides:" or "Replaces:" rules, for instance). There are also some technical hurdles that still need to be overcome, he said, like being able to unpack and index tarballs within tarballs.
The team also wants to implement file-level deduplication to conserve disk space. That includes not just deduplicating the 4,309 copies of COPYING in the current "unstable" archive, but also deduplicating files over time. There are quite a few files that do not change in any given upload, so storing duplicates of them is an unnecessary use of disk space. The current database uses 1.1TB, which is not enormous on its own, but one year ago it only required 800GB.
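One common way to get file-level deduplication, both across packages and across uploads, is a content-addressed store: each distinct content is kept once under its SHA-256 digest, and every upload only records which digest each path points to. The talk did not describe the planned design, so the sketch below is one possible approach rather than the team's; BlobStore and the sample uploads are hypothetical.

```python
import hashlib
from typing import Dict


class BlobStore:
    """Keep each distinct file content once, keyed by SHA-256 digest."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}  # digest -> content
        self.manifests: Dict[str, Dict[str, str]] = {}  # upload -> {path: digest}
        self.logical_bytes = 0  # size if every file were stored separately

    def add_upload(self, upload: str, files: Dict[str, bytes]) -> None:
        """Record an upload, storing only content not already present."""
        manifest = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(digest, content)
            manifest[path] = digest
            self.logical_bytes += len(content)
        self.manifests[upload] = manifest

    def read(self, upload: str, path: str) -> bytes:
        """Return the content of `path` as it was in the given upload."""
        return self.blobs[self.manifests[upload][path]]

    def stored_bytes(self) -> int:
        """Bytes actually held after deduplication."""
        return sum(len(blob) for blob in self.blobs.values())


# Two hypothetical uploads of one package; only main.c changes.
store = BlobStore()
store.add_upload("hello_1.0-1", {"COPYING": b"license text\n", "main.c": b"v1\n"})
store.add_upload("hello_1.0-2", {"COPYING": b"license text\n", "main.c": b"v2\n"})
```

In this example the unchanged COPYING file is stored once even though it appears in both uploads, so stored_bytes() is smaller than logical_bytes while every version of every file remains readable.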
Future research
Zacchiroli and Caneill closed out the session by discussing how Debsources is viable as a research platform. It includes twenty years of history for tens of thousands of packages. That makes it possible to statistically analyze, for example, how programming language popularity has evolved over the years or how file sizes have changed on a per-language basis. In response to an audience question, the pair added that statistics about build systems, packaging choices, and other factors could be generated as well. The pair have written two papers analyzing the source in the archive, both of which have been presented at academic conferences. They have also been contacted by an outside researcher, although his research has not yet been published.
The audience asked quite a few questions in the time remaining. One attendee wondered if the team had encountered any hash collisions among all of the SHA-256 hashes computed; the pair replied that they had not found any, but that it would be fun if they did. Another asked if there was interest in including any Debian derivatives in Debsources. The pair replied that they have a tracking bug open and hope to implement it, but that file-level deduplication needs to be implemented first, "or else it would explode." The Debian project, it seems, is already finding Debsources to be a valuable addition to the project infrastructure, both for tracking statistics and integrating with other tools; given the breadth and depth of the data set it includes, many other projects may find it valuable as well.
[The author would like to thank the Debian project for travel assistance to attend DebConf 2015.]
