Distributions

Better code searching for Debian

By Nathan Willis
December 24, 2014

The Debian project works with a rather large source code collection; recent estimates place its total volume at well over 130 gigabytes for the unstable tree. Finding some means to efficiently search through that massive library of code has been the focus of work by Michael Stapelberg, who maintains the Debian Code Search engine. Stapelberg recently unveiled a major upgrade to Debian Code Search that lets users run substantially more focused search queries and returns results significantly faster.

Stapelberg launched Debian Code Search in November 2012. The search engine indexes the most recent snapshot of Debian unstable (sid), which covers around 130 GiB across more than 17,000 separate packages. In 2013, Stapelberg estimated that this coverage added up to around 74% of the total contents of unstable, stable, and testing, since not every package differs between the three distributions. In addition, the engine only indexes Debian's free-software packages, not the contrib and nonfree trees.

In the early days, Debian Code Search ran on Stapelberg's private server, which was hardly ideal for long-term usage. In mid-2013, though, the project was promoted to an official Debian service, and migrated to an OpenStack cloud instance donated for the purpose by Rackspace. Stapelberg subsequently set out to speed up the search engine, as well as to implement outstanding feature requests.

The principal concerns about the original Debian Code Search were that there was no way to group search results by package and that queries were set to time out after 60 seconds (which, naturally, meant that some queries could never return meaningful results). Fixing both of those issues, though, required some redesign of the search engine architecture. On December 3, Stapelberg announced the immediate availability of the revamped engine, which he called Debian Code Search Instant.

The new search engine, which has replaced the old one entirely, splits the search index across six servers (instead of one), resulting in considerably faster searches with lower latency. The 60-second timeout has been removed, and the top results are displayed almost immediately for any query (hence the name "Instant"), even if the full query takes a long time to complete.
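To illustrate how results can start appearing before a sharded query has finished, here is a minimal Python sketch. It is not taken from Debian Code Search: the shards are plain in-memory lists of lines, the names `search_shard` and `stream_results` are hypothetical, and ranking is not modeled at all. It only shows the idea of yielding each shard's matches as soon as that shard is done.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_shard(shard, term):
    """Return every line in one shard that contains the term.

    A shard is modeled as a list of strings (an assumption for this sketch).
    """
    return [line for line in shard if term in line]


def stream_results(shards, term):
    """Yield matches shard by shard, in whatever order the shards finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = [pool.submit(search_shard, shard, term) for shard in shards]
        for future in as_completed(futures):
            # Matches from a finished shard are handed to the caller
            # right away; slower shards do not hold them back.
            yield from future.result()
```

A caller can iterate over `stream_results(...)` and display matches as they arrive, rather than waiting for the slowest shard.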

The new engine also maintains a separate search index for each package; those indexes are then merged into one overall index. That allows search queries to be restricted by package (and allows the results to be grouped or filtered on a per-package basis), but it also allows the system to refresh the index whenever any individual package is updated. The refresh is triggered automatically by Debian's FedMsg notification system, the real-world results of which Stapelberg explained in his announcement:

In the new architecture, we store an index for each source package and then merge these into one big index shard. This currently takes about 4 minutes with the code I wrote, but I’m sure this can be made even faster if necessary. So, whenever new packages are uploaded to the Debian archive, we can just index the new version and trigger a merge. We get notifications about new package uploads from FedMsg. Packages that are not seen on FedMsg for some reason are backfilled every hour.

The time between uploading a package and being able to find it in Debian Code Search therefore now ranges from a couple of minutes to about an hour, instead of about a week!
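The index-per-package scheme described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: each index is modeled as an in-memory dictionary mapping a trigram to a list of file paths, whereas the real engine stores its indexes on disk in its own format, and the function names `merge_indexes` and `reindex_package` are hypothetical.

```python
from collections import defaultdict


def merge_indexes(per_package):
    """Merge per-package indexes into one overall index.

    per_package maps a package name to that package's index, which in
    turn maps a trigram to a list of file paths (assumed layout).
    """
    merged = defaultdict(list)
    for package in sorted(per_package):
        for trigram, paths in per_package[package].items():
            # Prefix each path with its package so results can later be
            # grouped or filtered per package.
            merged[trigram].extend(f"{package}/{path}" for path in paths)
    return {trigram: sorted(paths) for trigram, paths in merged.items()}


def reindex_package(per_package, package, new_index):
    """Swap in a freshly built index for one package and merge again."""
    updated = dict(per_package)
    updated[package] = new_index
    return updated, merge_indexes(updated)
```

The point of the design is that an upload only requires indexing the one package that changed; every other per-package index is reused as-is in the merge.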

On December 23, Stapelberg posted an update on Debian Code Search Instant, in which he analyzed the first few weeks' worth of performance data looking for places where the engine performed poorly. As it turns out, the new engine was crashing on queries where the search term consists only of extremely common trigrams (that is, three-character substrings).
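As a rough illustration of why common trigrams are a problem, the following Python sketch splits a search term into trigrams and intersects their posting lists to find candidate files. The dictionary-based index and the names `trigrams` and `candidate_files` are assumptions made for this example, not the engine's actual data structures.

```python
def trigrams(term):
    """Return the three-character substrings of term, in order."""
    return [term[i:i + 3] for i in range(len(term) - 2)]


def candidate_files(index, term):
    """Return the files that contain every trigram of term.

    index maps a trigram to the files containing it (assumed layout).
    Candidates still have to be checked against the real query, since
    containing all the trigrams does not guarantee a match. Terms
    shorter than three characters yield no trigrams; this sketch simply
    returns an empty set for them.
    """
    postings = [set(index.get(trigram, ())) for trigram in trigrams(term)]
    if not postings:
        return set()
    return set.intersection(*postings)
```

When every trigram in the term is very common, each posting list is huge and so is their intersection, which means the engine has to examine (and hold results for) an enormous number of files.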

There were several factors leading to the crashes: trying to keep all of the results in memory, storing the full path for each result in order to ensure that results were returned in the same order every time, and storing the full package name for each result in order to enable the group-by-package feature. All three bottlenecks were eventually resolved: the engine now writes results to temporary files rather than trying to keep them all in memory, uses hashes in place of full paths to retain a stable order, and uses pointers to package names rather than storing the package name for every result.
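Two of those fixes can be sketched in Python. The sketch is hypothetical throughout: the article does not say which hash function the engine uses or how its package-name references are laid out, so SHA-1 truncated to eight bytes and a simple integer-ID table stand in for them here, and the names `path_key`, `PackageTable`, and `make_result` are invented for illustration.

```python
import hashlib


def path_key(path):
    """Return a fixed-size sort key for a path (8 bytes, not the full string).

    The choice of SHA-1 is an assumption; any stable hash would do.
    Distinct paths could collide, which this sketch ignores.
    """
    return hashlib.sha1(path.encode("utf-8")).digest()[:8]


class PackageTable:
    """Store each package name once and hand out small integer IDs."""

    def __init__(self):
        self.names = []
        self.ids = {}

    def intern(self, name):
        if name not in self.ids:
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]


def make_result(table, package, path, line):
    """Build a compact result record: (path hash, package ID, line number)."""
    return (path_key(path), table.intern(package), line)
```

Sorting such records by their hash key gives the same order no matter which order the results arrived in, and `table.names[package_id]` recovers the package name when results are grouped for display.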

That set of changes prevented the engine from crashing on queries for common words, but such queries still required an excessive amount of time: Stapelberg's post describes one query that took 20 minutes to complete. After some profiling work, he determined that the culprit was the search engine front-end, which was generating up to several thousand result pages for every request—even though most users would only look at the first few. Making the page-generation an on-demand process cut the typical query's time down considerably.
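The on-demand approach can be illustrated with a small Python sketch. The class name `ResultPages`, the page size, and the plain-text rendering are all assumptions for the example; the real front-end produces its own page format.

```python
class ResultPages:
    """Render result pages only when they are actually requested."""

    def __init__(self, results, per_page=10):
        self.results = results      # list of (path, line_number) pairs
        self.per_page = per_page
        self.rendered = {}          # cache of pages built so far

    def page_count(self):
        """Number of pages needed, rounding up."""
        return -(-len(self.results) // self.per_page)

    def get(self, number):
        """Return page `number` (zero-based), rendering it on first use."""
        if number not in self.rendered:
            start = number * self.per_page
            chunk = self.results[start:start + self.per_page]
            self.rendered[number] = "\n".join(
                f"{path}:{line}" for path, line in chunk
            )
        return self.rendered[number]
```

With this structure, a query that produces thousands of pages costs only as much rendering work as the handful of pages a user actually opens.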

One final optimization involved investigating why the disk and network I/O bandwidth was far below expectations. Switching from JSON to Cap'n Proto for the serialization protocol linking the front- and back-ends and implementing a buffered reader on the network connection helped quite a bit—the 20-minute query mentioned earlier could be shrunk down to five—but there was still room for improvement.
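The effect of a buffered reader can be demonstrated with Python's standard `io` module. This sketch says nothing about Cap'n Proto or the engine's actual network code; `CountingRaw` is a hypothetical stand-in for a socket that counts how many low-level reads are issued, and the four-byte record size is arbitrary.

```python
import io


class CountingRaw(io.RawIOBase):
    """An unbuffered byte source that counts low-level read calls."""

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Each call here stands in for one read on the connection.
        self.calls += 1
        chunk = self._source.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)


def read_records(stream, count, size=4):
    """Read `count` fixed-size records from the stream."""
    return [stream.read(size) for _ in range(count)]
```

Reading many small records directly from a `CountingRaw` issues one low-level read per record, while wrapping it in `io.BufferedReader(raw, buffer_size=4096)` fetches large chunks and serves the small reads from memory, so far fewer low-level reads are needed for the same data.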

Ultimately, Stapelberg patched the system to delay converting the search results to JSON format until the last possible moment, right before the result page is sent to the user. That brought the total time required by the long query down from 20 minutes to 2.5, which might still seem like a long wait for a search result, were it not for the fact that the example query Stapelberg used for the entire process was the word "arse." Fortunately, he concluded, the speed-ups needed to reduce the elapsed time for curse words happen to benefit everyone else as well, and the average query duration has gone down. Search terms that only include common trigrams may produce too many results to be particularly useful, but they still serve as a decent worst-case scenario that can reveal the search engine's upper bounds. Stapelberg did not post numbers for the performance of Debian Code Search Instant on average queries, but he did provide several graphs that reveal how much the optimization effort has sped up the overall performance.

The upshot is that Debian now has a fast search engine that indexes its entire source-code repository, supports complex queries (including regular expressions), and can filter results by package. That is likely to prove a valuable resource not just for Debian, but for other distributions and large free-software projects as well.

Brief items

Distribution quotes of the week

In summary, if we want to look for more consensus-seeking in our decisionmaking, and better negotiation, we should strengthen and encourage the TC. We should not undermine it, and not criticise the TC for acting vigorously. Being a little humbler, when we don our respective maintainer hats, would be a good thing.

I am worried that the rhetoric of mediation and consensus leaves little room for justice (by which I mean the remedy of power inequalities). We should be challenging the actually existing power relations. That is what the TC is for.

-- Ian Jackson

Anyway: People predicting doomsday scenarios for Debian do it because they are not familiar with how deep the project runs in us, how important it is socially, almost at a family level, to us that have been long involved in it. Debian is stronger than a technical or political discussion, no matter how harsh it is.

-- Gunnar Wolf

Devuan progress report

The people behind the Devuan project have released a progress report. Devuan is a fork of Debian without systemd. A repository has been set up at GitLab. "This is the most recent achievement on infrastructure development: last night the first devuan-baseconf package was built correctly through our continuous integration infrastructure, pulling directly from our source repository."

Distribution News

Fedora

Fedora 21 for AArch64 and IBM System z

The Fedora ARM team has released Fedora 21 for AArch64. This release includes a bootable DVD, net installation media, and an installation tree.

Fedora 21 for the IBM System z (s390x) has also been released. Currently Fedora for s390x is available as the Server flavor, although there are plans to add the Cloud flavor in the future.

Ubuntu family

Vivid Vervet Alpha 1 Released

The first alpha version of Vivid Vervet (15.04) is available for Kubuntu, Lubuntu, Ubuntu GNOME, UbuntuKylin, and Ubuntu Cloud.

Newsletters and articles of interest

Distribution newsletters

DistroWatch Weekly, Issue 590 (December 22)
Ubuntu Weekly Newsletter, Issue 397 (December 21)

Prokop: Ten years of Grml

Michael Prokop looks at ten years of leading the Grml project. "Over the years we moved from private self-hosted infrastructure to company-sponsored systems, migrated from Subversion (brr) to Mercurial (2006) to Git (2008). Our Zsh-related work became widely known as grml-zshrc. jenkins.grml.org managed to become a continuous integration/deployment/delivery home e.g. for the dpkg, fai, initramfs-tools, screen and zsh Debian packages. The underlying software for creating Debian packages in a CI/CD way became its own project known as jenkins-debian-glue in August 2011. In 2006 I started grml-debootstrap, which grew into a reliable method for installing plain Debian (nowadays even supporting installation as VM, and one of my customers does tens of deployments per day with grml-debootstrap in a fully automated fashion). So one of the biggest achievements of Grml is – from my point of view – that it managed to grow several active and successful sub-projects under its umbrella."

Page editor: Rebecca Sobol