
Better code searching for Debian

By Nathan Willis
December 24, 2014

The Debian project works with a rather large source-code collection; recent estimates place its total volume at well over 130 gigabytes for the unstable tree. Finding a way to search that massive library of code efficiently has been the focus of Michael Stapelberg, who maintains the Debian Code Search engine. Stapelberg recently unveiled a major upgrade to Debian Code Search that lets users run substantially more focused search queries and returns results significantly faster.

Stapelberg launched Debian Code Search in November 2012. The search engine indexes the most recent snapshot of Debian unstable (sid), which covers around 130 GiB across more than 17,000 separate packages. In 2013, Stapelberg estimated that this coverage added up to around 74% of the total contents of unstable, stable, and testing, since not every package differs between the three distributions. In addition, the engine only indexes Debian's free-software packages, not the contrib and non-free trees.

In the early days, Debian Code Search ran on Stapelberg's private server, which was hardly an ideal arrangement for the long term. In mid-2013, the project was promoted to an official Debian service and migrated to an OpenStack cloud instance donated for the purpose by Rackspace. Stapelberg subsequently set out to improve the engine's performance and to implement outstanding feature requests.

The principal complaints about the original Debian Code Search were that there was no way to group search results by package and that queries timed out after 60 seconds (which meant that some expensive queries could never return meaningful results). Fixing both issues, though, required redesigning parts of the search-engine architecture. On December 3, Stapelberg announced the immediate availability of the revamped engine, which he called Debian Code Search Instant.

The new search engine, which has replaced the old one entirely, splits the search index across six servers instead of one, resulting in considerably faster searches at lower latency. The 60-second timeout is gone; instead, the top results are displayed almost immediately for any query (hence the name "Instant"), even if the full query takes a long time to complete.

The new engine also maintains a separate search index for each package; these per-package indexes are then merged into one overall index. That design allows search queries to be restricted by package (and allows the results to be grouped or filtered on a per-package basis), and it also allows the system to refresh the index whenever any individual package is updated. The refresh is triggered automatically by Debian's FedMsg notification system; Stapelberg explained the real-world results in his announcement:

    In the new architecture, we store an index for each source package and then merge these into one big index shard. This currently takes about 4 minutes with the code I wrote, but I’m sure this can be made even faster if necessary. So, whenever new packages are uploaded to the Debian archive, we can just index the new version and trigger a merge. We get notifications about new package uploads from FedMsg. Packages that are not seen on FedMsg for some reason are backfilled every hour.

    The time between uploading a package and being able to find it in Debian Code Search therefore now ranges from a couple of minutes to about an hour, instead of about a week!
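
Merging per-package indexes into one shard is, at its core, a merge of sorted lists. Here is a toy version of that loop in C; the real engine's on-disk data structures are far more involved, and none of these names come from the actual Debian Code Search source:

#include <stdio.h>

/* Combine two sorted posting lists into one, dropping duplicates. */
static size_t merge(const int *a, size_t na,
                    const int *b, size_t nb, int *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na && j < nb) {
		if (a[i] < b[j])
			out[n++] = a[i++];
		else if (b[j] < a[i])
			out[n++] = b[j++];
		else {			/* same ID in both lists */
			out[n++] = a[i++];
			j++;
		}
	}
	while (i < na) out[n++] = a[i++];
	while (j < nb) out[n++] = b[j++];
	return n;
}

int main(void)
{
	int a[] = {1, 4, 9}, b[] = {2, 4, 10}, out[6];
	size_t n = merge(a, 3, b, 3, out);

	for (size_t k = 0; k < n; k++)
		printf("%d ", out[k]);	/* 1 2 4 9 10 */
	printf("\n");
	return 0;
}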

On December 23, Stapelberg posted an update on Debian Code Search Instant in which he analyzed the first few weeks' worth of performance data, looking for places where the engine performed poorly. As it turned out, the new engine was crashing on queries where the search term consisted only of extremely common trigrams (three-letter substrings).
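
For context, a trigram index records, for each three-byte substring, the list of files containing it; answering a query starts by extracting the search term's trigrams and intersecting those lists to find candidate files. A minimal sketch of the extraction step (purely illustrative, not taken from the engine's source):

#include <stdio.h>
#include <string.h>

/* Print every trigram (three-byte substring) of a query string. */
static void print_trigrams(const char *s)
{
	size_t len = strlen(s);

	for (size_t i = 0; i + 3 <= len; i++)
		printf("%.3s\n", s + i);
}

int main(void)
{
	print_trigrams("strcpy");	/* str, trc, rcp, cpy */
	print_trigrams("arse");		/* ars, rse: both very common */
	return 0;
}

A term like "arse" yields only two, extremely common trigrams, so the candidate list covers a huge fraction of the archive; that is exactly what made such queries pathological.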

There were several factors leading to the crashes: trying to keep all of the results in memory, storing the full path for each result in order to ensure that results were returned in the same order every time, and storing the full package name for each result in order to enable the group-by-package feature. All three bottlenecks were eventually resolved: the engine now writes results to temporary files rather than trying to keep them all in memory, uses hashes in place of full paths to retain a stable order, and uses pointers to package names rather than storing the package name for every result.
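
The stable-ordering fix is straightforward to picture: rather than keeping each result's full path, the engine keeps a compact hash of it and sorts on that. A sketch of the idea using 64-bit FNV-1a, chosen here purely for illustration (the post does not say which hash function the engine actually uses):

#include <stdint.h>
#include <stdio.h>

/* 64-bit FNV-1a: a compact, deterministic sort key for a path. */
static uint64_t fnv1a_64(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

int main(void)
{
	/* Sorting results by this key gives the same order on every
	 * run, which is all that stable ordering requires. */
	printf("%016llx\n",
	       (unsigned long long)fnv1a_64("coreutils/src/cp.c"));
	return 0;
}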

That set of changes prevented the engine from crashing on queries for common words, but such queries still took an excessive amount of time: Stapelberg's post describes one query that took 20 minutes to complete. After some profiling work, he determined that the culprit was the search engine's front end, which was generating up to several thousand result pages for every request, even though most users would only look at the first few. Making page generation an on-demand process cut the typical query time down considerably.

One final optimization involved investigating why disk and network I/O bandwidth was far below expectations. Switching the serialization protocol linking the front end and back ends from JSON to Cap'n Proto, and implementing a buffered reader on the network connection, helped quite a bit: the 20-minute query mentioned earlier shrank to five minutes. But there was still room for improvement.
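
The buffered-reader half of that fix is a general technique: serve many small reads from one large userspace buffer rather than issuing a system call for each. Debian Code Search is written in Go, so the real fix lives in Go code; the same idea sketched in C, for illustration only:

#include <stdio.h>

/* Wrap a raw socket descriptor in a fully buffered stdio stream, so
 * that small reads are served from a 64 KiB buffer instead of one
 * read(2) system call apiece. */
static FILE *buffered_socket(int sockfd)
{
	FILE *stream = fdopen(sockfd, "r");

	if (stream != NULL)
		setvbuf(stream, NULL, _IOFBF, 64 * 1024);
	return stream;
}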

Ultimately, Stapelberg patched the system to delay converting the search results to JSON until the last possible moment, right before the result page is sent to the user. That brought the total time required by the long query down from 20 minutes to 2.5. The latter figure might still seem like a long wait for a search result, were it not for the fact that the example query Stapelberg used throughout was the word "arse." Fortunately, he concluded, the speed-ups needed to reduce the elapsed time for curse words happen to benefit everyone else as well, and the average duration of every query has gone down. Search terms that consist only of common trigrams may produce too many results to be particularly useful, but they still serve as a decent worst-case scenario that reveals the search engine's upper bounds. Stapelberg did not post numbers for Debian Code Search Instant's performance on average queries, but he did provide several graphs that show how much the optimization effort has sped up overall performance.

The upshot is that Debian now has a fast search engine that indexes its source-code collection, supports complex queries (including regular expressions), and can filter results by package. That is likely to prove a valuable resource not just for Debian, but for other distributions and large free-software projects as well.



Better code searching for Debian

Posted Dec 25, 2014 4:15 UTC (Thu) by pabs (subscriber, #43278) [Link]

Debian services are only "official" if they are running on hardware controlled by the Debian sysadmins and served from a debian.org domain; codesearch isn't yet "official".

Better code searching for Debian

Posted Dec 25, 2014 15:18 UTC (Thu) by amarao (guest, #87073) [Link] (7 responses)

28 thousand pages of results for '\bstrcpy'. Yeah! More strcpy in every project!

Better code searching for Debian

Posted Dec 26, 2014 4:15 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (6 responses)

A bug for you! A bug for you! Bugs for everyone!</oprah>

Better code searching for Debian

Posted Dec 29, 2014 23:56 UTC (Mon) by debacle (subscriber, #7114) [Link] (5 responses)

Let us all LD_PRELOAD this function to everything:
#include <stdlib.h>

char *strcpy(char *dest, const char *src)
{
	(void)dest; (void)src;	/* ignore the arguments entirely */
	system("logger BUG!");	/* just shame the caller via syslog */
	return NULL;
}

Better code searching for Debian

Posted Dec 31, 2014 0:13 UTC (Wed) by lsl (subscriber, #86508) [Link] (4 responses)

Why would you want to do that? Unlike with gets(3), there's nothing fundamentally broken about strcpy. Cargo-culting weird absolute rules ("never use strcpy") is not going to prevent security issues.

Better code searching for Debian

Posted Dec 31, 2014 11:52 UTC (Wed) by JGR (subscriber, #93631) [Link] (2 responses)

Things go very badly wrong if the input string is too long, to the point that using strcpy without first calling strlen and performing error checking is effectively an error in itself. That is cumbersome boilerplate that people would rather do without (and do), which leads to security bugs.

I agree that LD_PRELOADing something like the above isn't helpful, but "never use strcpy directly; use strlcpy or some other wrapper function that takes an output-buffer length, does a strlen, and does suitable error checking (or calls abort) first" is perfectly sound advice.
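
A minimal sketch of such a wrapper; the name and the abort-on-overflow policy are illustrative choices, not taken from any particular library:

#include <assert.h>
#include <string.h>

/* Copy src into dst (of size dstsize), aborting rather than
 * overflowing if it does not fit, terminating NUL included. */
static void checked_copy(char *dst, size_t dstsize, const char *src)
{
	size_t need = strlen(src) + 1;

	assert(need <= dstsize);
	memcpy(dst, src, need);
}

In release builds a real runtime check would be preferable to assert(), which compiles away when NDEBUG is defined.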

Better code searching for Debian

Posted Jan 8, 2015 21:59 UTC (Thu) by dfsmith (guest, #20302) [Link] (1 responses)

Note that strlcpy() has an information leak that strncpy() doesn't have. (Fill remaining buffer with '\0': guaranteed for strncpy(), not mentioned for strlcpy().)
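
The difference is easy to demonstrate; the 'X' bytes below stand in for whatever stale data was previously in the buffer:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[8];

	memset(buf, 'X', sizeof buf);		/* simulate stale data */
	strncpy(buf, "hi", sizeof buf);		/* pads bytes 2..7 with '\0' */
	printf("last byte: %d\n", buf[7]);	/* prints 0 */
	/* strlcpy(buf, "hi", sizeof buf) would leave bytes 3..7 as 'X'. */
	return 0;
}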

Better code searching for Debian

Posted Jan 8, 2015 22:43 UTC (Thu) by zlynx (guest, #2285) [Link]

That isn't an information leak. It's a feature.

I know it's a feature because I switched some code from strncpy to strlcpy after profiling showed that writing the zeros into an 8KB URL string buffer was wasting a LOT of time.

If you are worried about information leaks then you need to be using a security library and it needs to be written in ASM because compilers are too likely to optimize all the security features away. For example, unless a value is volatile the compiler can remove all writes to it if nothing reads it before destroying it.
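
That dead-store elimination is exactly why secure-wipe helpers are written with volatile, as in this minimal sketch (C11's optional Annex K added memset_s for the same reason):

#include <stddef.h>

/* Zero a buffer through a volatile pointer so the compiler cannot
 * prove the stores are dead and eliminate them. */
static void secure_zero(void *p, size_t n)
{
	volatile unsigned char *vp = p;

	while (n--)
		*vp++ = 0;
}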

Better code searching for Debian

Posted Dec 31, 2014 14:10 UTC (Wed) by debacle (subscriber, #7114) [Link]

The LD_PRELOAD was a joke, of course. Anyway, I avoid strcpy even in cases where I know that the destination can take the copy. strcpy and strncpy just make it too easy to err fatally, e.g. much later when refactoring code. I use either g_strlcpy from GLib or strlcpy from BSD instead, even if those have their own problems. BTW, the discussion about adding strlcpy to glibc started in 2000, right? In a few hours it will be 2015; is it in there now?


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds