Distributions
Better code searching for Debian
The Debian project works with a rather large source code collection; recent estimates place its total volume at well over 130 gigabytes for the unstable tree. Finding some means to efficiently search through that massive library of code has been the focus of work by Michael Stapelberg, who maintains the Debian Code Search engine. Stapelberg recently unveiled a major upgrade to Debian Code Search that lets users run substantially more focused search queries and returns results significantly faster.
Stapelberg launched Debian Code Search in November 2012. The search engine indexes the most recent snapshot of Debian unstable (sid), which covers around 130 GiB across more than 17,000 separate packages. In 2013, Stapelberg estimated that this coverage added up to around 74% of the total contents of unstable, stable, and testing, since not every package differs between the three distributions. In addition, the engine only indexes Debian's free-software packages, not the contrib and non-free trees.
In the early days, however, Debian Code Search ran on Stapelberg's private server, which was hardly an ideal arrangement for long-term use. In mid-2013, though, the project was promoted to an official Debian service and migrated to an OpenStack cloud instance donated for the purpose by Rackspace. Stapelberg subsequently set out to speed up the search engine and to implement outstanding feature requests.
The principal concerns about the original Debian Code Search were that there was no way to group search results by package and that queries were set to time out after 60 seconds (which, naturally, limited whether some queries could ever be used to get meaningful results). Implementing fixes for both of those issues, though, mandated some redesign of the search engine architecture. On December 3, Stapelberg announced the immediate availability of the revamped engine, which he called Debian Code Search Instant.
The new search engine, which has replaced the old one entirely, splits the search index across six servers (instead of one), resulting in considerably faster searches with lower latency. The 60-second timeout has been removed, and the top results are displayed almost immediately for any query (hence the name "Instant"), even if the full query takes a long time to complete.
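The fan-out across several index servers can be sketched roughly as follows. This is only an illustration of the general idea, not the Debian Code Search implementation: the names `SHARDS`, `search_shard`, and `instant_search` are hypothetical, and small in-memory dictionaries stand in for the index servers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for index shards: each maps a file path to its text.
SHARDS = [
    {"pkg-a/main.c": "int main(void) { return 0; }"},
    {"pkg-b/util.c": "static int helper(void) { return 1; }"},
    {"pkg-c/README": "no code here"},
]

def search_shard(shard, term):
    """Return the paths in one shard whose text contains the term."""
    return [path for path, text in shard.items() if term in text]

def instant_search(term, shards=SHARDS):
    """Query all shards in parallel and yield matches as each shard finishes."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(search_shard, shard, term) for shard in shards]
        # as_completed hands back whichever shard answers first, so early
        # results can be shown before the slowest shard is done.
        for future in as_completed(futures):
            yield from future.result()
```

Because results are yielded as each shard completes, a caller can start displaying matches before the whole query has finished.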
The new engine also maintains a separate search index for each package; these per-package indexes are then merged into one overall index. That allows search queries to be restricted by package (and allows the results to be grouped or filtered on a per-package basis), and it also allows the system to refresh the index whenever any individual package is updated. The refresh is triggered automatically by Debian's FedMsg notification system, the real-world results of which Stapelberg explained in his announcement:
The time between uploading a package and being able to find it in Debian Code Search therefore now ranges from a couple of minutes to about an hour, instead of about a week!
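To see why per-package indexes make both filtering and fast refreshes straightforward, consider the following minimal sketch. The `PackageIndexes` class and its methods are hypothetical, the "index" is just a dictionary of file contents, and the step that merges the per-package indexes into one overall index is left out.

```python
class PackageIndexes:
    """Keeps one small index per package so a single package can be refreshed."""

    def __init__(self):
        self._by_package = {}  # package name -> {path: text}

    def refresh(self, package, files):
        """Replace the index for one package, e.g. after a new upload."""
        self._by_package[package] = dict(files)

    def search(self, term, package=None):
        """Return {package: [paths]} for matches, optionally for one package only."""
        if package is None:
            selected = self._by_package
        else:
            selected = {package: self._by_package.get(package, {})}
        grouped = {}
        for name, files in selected.items():
            hits = sorted(path for path, text in files.items() if term in text)
            if hits:
                grouped[name] = hits
        return grouped
```

Refreshing one package touches only that package's entry, which is the property that lets an upload notification trigger a small, quick update rather than a full re-index.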
On December 23, Stapelberg posted an update on Debian Code Search Instant, in which he analyzed the first few weeks' worth of performance data looking for places where the engine performed poorly. As it turns out, the new engine was crashing on queries where the search term consists only of extremely common trigrams (or three-letter substrings).
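For readers unfamiliar with trigram indexes, the general technique looks something like the sketch below. It is a simplified illustration with hypothetical function names, not code from Debian Code Search. Each trigram maps to the set of files containing it, and a query intersects those sets to find candidate files.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(documents):
    """Map each trigram to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index

def candidates(index, term):
    """Intersect the posting sets for every trigram in the term.

    Candidates still need a full-text check; the index only narrows the field.
    """
    grams = trigrams(term)
    if not grams:
        return set()
    postings = [index.get(gram, set()) for gram in grams]
    return set.intersection(*postings)
```

When every trigram in a query is extremely common, each posting set is huge and the intersection barely narrows anything, so the engine ends up handling an enormous number of results.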
There were several factors leading to the crashes: trying to keep all of the results in memory, storing the full path for each result in order to ensure that results were returned in the same order every time, and storing the full package name for each result in order to enable the group-by-package feature. All three bottlenecks were eventually resolved: the engine now writes results to temporary files rather than trying to keep them all in memory, uses hashes in place of full paths to retain a stable order, and uses pointers to package names rather than storing the package name for every result.
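Two of those memory savings can be illustrated with a short sketch. Everything here is an assumption made for the example: the announcement as described above does not say which hash is used, so a truncated SHA-1 digest stands in for it, and the hypothetical `PackageTable` uses small integers where the real engine uses pointers.

```python
import hashlib

def path_key(path):
    """Fixed-size sort key derived from the path (8 bytes, not the full string)."""
    return hashlib.sha1(path.encode("utf-8")).digest()[:8]

class PackageTable:
    """Stores each package name once and hands out small integer references."""

    def __init__(self):
        self._ids = {}
        self.names = []

    def ref(self, name):
        """Return the reference for a name, adding the name on first use."""
        if name not in self._ids:
            self._ids[name] = len(self.names)
            self.names.append(name)
        return self._ids[name]

def compact_results(hits, table):
    """Turn (package, path) hits into (hash key, package ref) pairs, stably ordered."""
    return sorted((path_key(path), table.ref(package)) for package, path in hits)
```

Sorting by the hash gives the same order on every run regardless of the order in which hits arrive, while each result carries only a few bytes instead of two full strings.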
That set of changes prevented the engine from crashing on queries for common words, but such queries still required an excessive amount of time: Stapelberg's post describes one query that took 20 minutes to complete. After some profiling work, he determined that the culprit was the search engine front-end, which was generating up to several thousand result pages for every request—even though most users would only look at the first few. Making the page-generation an on-demand process cut the typical query's time down considerably.
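The difference between eager and on-demand page generation can be shown with a small sketch. The `LazyPages` class and the `PAGE_SIZE` value are hypothetical, and the "rendering" is reduced to joining lines of text.

```python
PAGE_SIZE = 10  # hypothetical number of results per page

class LazyPages:
    """Renders a result page only when it is first requested."""

    def __init__(self, results):
        self._results = list(results)
        self._cache = {}
        self.rendered = 0  # counts how many pages were actually generated

    def page(self, number):
        """Return page `number` (zero-based), rendering it on first access."""
        if number not in self._cache:
            start = number * PAGE_SIZE
            chunk = self._results[start:start + PAGE_SIZE]
            self._cache[number] = "\n".join(chunk)
            self.rendered += 1
        return self._cache[number]
```

A query with thousands of result pages then costs only as much rendering work as the handful of pages a user actually opens.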
One final optimization involved investigating why the disk and network I/O bandwidth was far below expectations. Switching from JSON to Cap'n Proto for the serialization protocol linking the front- and back-ends and implementing a buffered reader on the network connection helped quite a bit—the 20-minute query mentioned earlier could be shrunk down to five—but there was still room for improvement.
Ultimately, Stapelberg patched the system to delay converting the search results to JSON format until the last possible moment, right before the result page is sent to the user. That brought the total time required by the long query down from 20 minutes to 2.5, which might still seem like a long wait for a search result, were it not for the fact that the example query Stapelberg used for the entire process was the word "arse." Fortunately, he concluded, the speed-ups needed to reduce the elapsed time for curse words happen to benefit everyone else as well, and the average query duration has gone down. Search terms that only include common trigrams may produce too many results to be particularly useful, but they still serve as a decent worst-case scenario that can reveal the search engine's upper bounds. Stapelberg did not post numbers for the performance of Debian Code Search Instant on average queries, but he did provide several graphs that reveal how much the optimization effort has sped up the overall performance.
The upshot is that Debian now has a fast search engine that indexes its entire source-code repository, supports complex queries (including regular expressions), and can filter results by package. That is likely to prove a valuable resource not just for Debian, but for other distributions and large free-software projects as well.
Brief items
Distribution quotes of the week
I am worried that the rhetoric of mediation and consensus leaves little room for justice (by which I mean the remedy of power inequalities). We should be challenging the actually existing power relations. That is what the TC is for.
Devuan progress report
The people behind the Devuan project have released a progress report. Devuan is a fork of Debian without systemd. A repository has been set up at GitLab. "This is the most recent achievement on infrastructure development: last night the first devuan-baseconf package was built correctly through our continuous integration infrastructure, pulling directly from our source repository."
Distribution News
Fedora
Fedora 21 for AArch64 and IBM System z
The Fedora ARM team has released Fedora 21 for AArch64. This release includes a bootable DVD, net installation media, and an installation tree. Fedora 21 for the IBM System z (s390x) has also been released. Currently Fedora for s390x is available as the Server flavor, although there are plans to add the Cloud flavor in the future.
Ubuntu family
Vivid Vervet Alpha 1 Released
The first alpha version of Vivid Vervet (15.04) is available for Kubuntu, Lubuntu, Ubuntu GNOME, UbuntuKylin, and Ubuntu Cloud.
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly, Issue 590 (December 22)
- Ubuntu Weekly Newsletter, Issue 397 (December 21)
Prokop: Ten years of Grml
Michael Prokop looks at ten years of leading the Grml project. "Over the years we moved from private self-hosted infrastructure to company-sponsored systems, migrated from Subversion (brr) to Mercurial (2006) to Git (2008). Our Zsh-related work became widely known as grml-zshrc. jenkins.grml.org managed to become a continuous integration/deployment/delivery home e.g. for the dpkg, fai, initramfs-tools, screen and zsh Debian packages. The underlying software for creating Debian packages in a CI/CD way became its own project known as jenkins-debian-glue in August 2011. In 2006 I started grml-debootstrap, which grew into a reliable method for installing plain Debian (nowadays even supporting installation as VM, and one of my customers does tens of deployments per day with grml-debootstrap in a fully automated fashion). So one of the biggest achievements of Grml is – from my point of view – that it managed to grow several active and successful sub-projects under its umbrella."
Page editor: Rebecca Sobol