
A high-level search interface for Debian packages

November 17, 2010

This article was contributed by Raphaël Hertzog

The Debian archive is known to be one of the largest software collections available in the free software world. With more than 16,000 source packages and 30,000 binary packages, users sometimes have trouble finding packages that are relevant to them. Debian developer Enrico Zini has been working on infrastructure to solve this problem. During the recent mini-debconf Paris, Enrico gave a talk presenting what he has been working on in the last few years, which "hasn't gotten yet the attention it deserves".

Enrico is known in the Debian community for the introduction of debtags, a system used to classify all packages using facets. Each facet describes a specific kind of property: type of user-interface, programming language it's written in, type of document manipulated, purpose of the software, etc. His most recent work builds on that. It is available in Debian and Ubuntu in the apt-xapian-index package. Its purpose is to allow advanced queries over the database of available packages.

Users of apt-xapian-index

He started by presenting some early users of the infrastructure. The most widely known is Ubuntu's software center. Its search feature provides results almost instantly thanks to apt-xapian-index. But it is a very simple interface that doesn't exploit many of the advanced features provided by apt-xapian-index.

[GoPlay!]

Another early adopter, making use of some more advanced features, is GoPlay!. It's a graphical user interface to find games. It makes use of debtags to classify games so that you can browse, for example, all 3D action/arcade games related to cars. GoPlay! has even been extended to be a more generic debtags-based package browser and the package now also provides GoLearn!, GoAdmin!, GoNet!, GoOffice!, GoSafe!, and GoWeb!.

Fuss-launcher is an application launcher and not a package browser, but by using apt-xapian-index, it's able to reuse information provided at the package level to make it easier to find installed applications. Package descriptions tend to be more verbose than those embedded in .desktop files. Enrico also showed another nice feature to the audience: if you drag a document onto its window, it will show you a list of applications that can open it.

Last but not least, apt-xapian-index provides a command line search tool that is vastly superior to the traditional apt-cache search: it's axi-cache search (axi stands for apt-xapian-index). Enrico compared the output of a search on the letter "r". While apt-cache spits out a seemingly endless list of packages containing this letter somewhere in their description, axi-cache only listed packages related to GNU R. He also demonstrated the contextual tab completion. It makes it easy to use debtags and to refine your search. Once you have typed a first keyword, the tab-completion for the second one only contains keywords or debtags that are actually able to provide more restrictive results. Advanced queries with logical operations (AND, OR, NOT, XOR) are also supported.
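To make the boolean queries and the "only suggest terms that narrow the results" behaviour concrete, here is a minimal sketch in plain Python. It is not how axi-cache or Xapian is implemented: the tiny package list, the `build_index`, `matching` and `refinements` helpers, and the whitespace tokenization are all assumptions made for illustration.

```python
# Toy inverted index: term -> set of package names.
# The package descriptions below are shortened, illustrative data.
PACKAGES = {
    "r-base": "GNU R statistical computing system",
    "r-cran-boot": "GNU R package for bootstrap statistical functions",
    "gnuplot": "command-line driven interactive plotting program",
    "octave": "GNU Octave language for numerical computing",
}


def build_index(packages):
    """Map each lowercased description word to the packages containing it."""
    index = {}
    for name, description in packages.items():
        for term in description.lower().split():
            index.setdefault(term, set()).add(name)
    return index


INDEX = build_index(PACKAGES)


def matching(term):
    """Packages whose description contains the given word."""
    return INDEX.get(term.lower(), set())


# Logical operators map directly onto set operations.
both = matching("gnu") & matching("statistical")               # AND
either = matching("plotting") | matching("numerical")          # OR
gnu_not_r = matching("gnu") - matching("r")                    # AND NOT
exactly_one = matching("computing") ^ matching("statistical")  # XOR


def refinements(results):
    """Terms that would narrow `results` further without emptying it."""
    return sorted(
        term
        for term, pkgs in INDEX.items()
        if 0 < len(pkgs & results) < len(results)
    )


if __name__ == "__main__":
    print(sorted(both))
    print(refinements(matching("gnu")))
```

A completion helper built along these lines would never offer "gnu" again after a search for "gnu" (it cannot narrow the result), nor "plotting" (it would leave nothing).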

Features of the backend

Enrico then dived into the internals. Xapian's search engine is at the root of this infrastructure. He likes it because it's a simple library (i.e. no daemon) and it has nice Python bindings. While apt-xapian-index's core work is to index the descriptions of all the packages, it actually stores much more and can be easily extended with plugins (written in Python).

For instance, the information stored encompasses:

  • words appearing in the description of the packages (including the translated descriptions if the user uses a non-English locale);

  • their origin;

  • their section;

  • their size and installed size;

  • the time they were first seen;

  • icons, categories, descriptions from the .desktop files they contain (through app-install-data);

  • aliases for names of some popular applications that are not available on Linux (for instance "excel" maps to the debtag office::spreadsheet).

He already has plans to store more: adding popularity contest data (see wishlist bugs #602180 and #602182) will make it possible to sort query results in a useful way. The most widely used applications are good choices when it comes to community support, and they are likely of better quality due to the larger user base. Adding timestamps of the last installation/upgrade/removal will make it easier to pinpoint a regression to a specific package update.

The generated index is world-readable and can be used from any application provided it can use the Xapian library—which is written in C++ but has bindings for Perl, Python, PHP, Java, Tcl, C#, and Ruby.

Call for experimentation

Enrico believes that many useful applications have yet to be invented on top of apt-xapian-index's features. He's calling for experimentation and asking for new ideas. The only practical limit that he has encountered is the size of the index: currently it varies between 50 MB (Debian unstable without translation) and 70 MB (Debian stable/testing/unstable with one translation). He would like it not to grow beyond 100 MB since it's installed by default (due to aptitude recommending it) and he's not comfortable with the idea of using more than 20% of the disk footprint of a basic install just for this service. That's why the index was configured to not store the position of the terms: it's thus not possible to find packages whose description contains the word "statistical" immediately followed by the word "computing". You can however find those which have both terms somewhere in their description.
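The trade-off between index size and phrase search can be illustrated with a small sketch. This is plain Python rather than Xapian, and the two descriptions, the package name "hypothetical-pkg", and all function names are invented for the example. An index that only records which packages contain a term can answer "both words appear", while answering "one word directly follows the other" requires storing positions as well.

```python
# Illustrative descriptions; "hypothetical-pkg" is an invented package.
DESCRIPTIONS = {
    "r-base": "gnu r statistical computing system",
    "hypothetical-pkg": "computing cluster tools with a statistical report",
}


def index_without_positions(docs):
    """term -> set of packages (smaller, no word order information)."""
    index = {}
    for name, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(name)
    return index


def index_with_positions(docs):
    """term -> {package: [positions]} (larger, keeps word order)."""
    index = {}
    for name, text in docs.items():
        for position, term in enumerate(text.split()):
            index.setdefault(term, {}).setdefault(name, []).append(position)
    return index


def all_terms(index, terms):
    """Packages containing every term, using the position-free index."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def phrase(index, first, second):
    """Packages where `second` immediately follows `first` (needs positions)."""
    hits = set()
    for name, positions in index.get(first, {}).items():
        following = index.get(second, {}).get(name, [])
        if any(position + 1 in following for position in positions):
            hits.add(name)
    return hits


flat = index_without_positions(DESCRIPTIONS)
positional = index_with_positions(DESCRIPTIONS)

if __name__ == "__main__":
    print(sorted(all_terms(flat, ["statistical", "computing"])))
    print(sorted(phrase(positional, "statistical", "computing")))
```

With the position-free index, both packages match the two-word query; only the positional index can tell that the words are adjacent in just one of them.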

Enrico wondered if apt-xapian-index offers too much freedom. That could explain why few people experimented with it despite his numerous blog posts with code samples and information on how to get started using it. But it's not difficult to imagine use cases for this data. It could be used to extend tools like rc-alert or wnpp-alert, for example. They provide a long list of Debian packages that are looking for some help and are installed on the machine. With apt-xapian-index, it would be possible to restrict the results to the set of packages written in a specific programming language or for a particular desktop environment.
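As a rough sketch of that idea, the snippet below filters a list of packages needing help down to those carrying a given set of debtags. It does not call rc-alert, wnpp-alert, or apt-xapian-index: the package names, the tag assignments, and the `restrict_by_tags` helper are all hypothetical, and a real tool would presumably read the tags from the index rather than from a hard-coded dictionary.

```python
# Hypothetical output of a tool listing installed packages that need help.
NEEDING_HELP = ["pkg-a", "pkg-b", "pkg-c"]

# Hypothetical debtags per package (tag names are used for illustration).
TAGS = {
    "pkg-a": {"implemented-in::python", "uitoolkit::gtk"},
    "pkg-b": {"implemented-in::c", "uitoolkit::gtk"},
    "pkg-c": {"implemented-in::python"},
}


def restrict_by_tags(packages, tags_by_package, required):
    """Keep only packages that carry every tag in `required`, in order."""
    required = set(required)
    return [
        name
        for name in packages
        if required <= tags_by_package.get(name, set())
    ]


python_only = restrict_by_tags(NEEDING_HELP, TAGS, ["implemented-in::python"])
python_gtk = restrict_by_tags(
    NEEDING_HELP, TAGS, ["implemented-in::python", "uitoolkit::gtk"]
)

if __name__ == "__main__":
    print(python_only)
    print(python_gtk)
```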

The more likely explanation is that too few people know about the tool. There are many more itches to scratch where apt-xapian-index's features could be very useful, and my guess is that Enrico's wishes will eventually come true.



A high-level search interface for Debian packages

Posted Nov 18, 2010 13:27 UTC (Thu) by Seegras (guest, #20463) [Link] (5 responses)

Actually, one of the things that kept me from noticing it is the name: Xapian.

It sounds as if it is something complicated (X.400 API Association maybe?); it sounds "Enterprise" (which I again associate with: "Can do everything, but you have to program it yourself, and by the way, it uses an awkward language/API/config-language").

Might this have been a problem with other users too?

Cheers
Seegras

A high-level search interface for Debian packages

Posted Nov 18, 2010 16:02 UTC (Thu) by nye (subscriber, #51576) [Link] (2 responses)

I'd actually seen the package name and assumed for some reason that it was something related to the Apache project - one of those gargantuan java packages that don't integrate with anything native and can't be configured correctly by mere mortals.

This description, on the other hand, sounds like an actually useful piece of software.

So no, it's not just you.

A high-level search interface for Debian packages

Posted Nov 18, 2010 18:53 UTC (Thu) by nix (subscriber, #2304) [Link]

I too thought it was something to do with the monster which is xalan-j.

A high-level search interface for Debian packages

Posted Nov 18, 2010 19:50 UTC (Thu) by ccurtis (guest, #49713) [Link]

My first impression is always that it's a Ximian specific tool.

A high-level search interface for Debian packages

Posted Nov 18, 2010 22:58 UTC (Thu) by cowsandmilk (guest, #55475) [Link]

considering how much coverage lwn gave notmuch ( http://notmuchmail.org/ ), I'm surprised you hadn't heard of xapian

A high-level search interface for Debian packages

Posted Nov 19, 2010 22:14 UTC (Fri) by tack (guest, #12542) [Link]

I've done a bit of work with Xapian. It's really a fantastic, well designed, light and fast full-text search engine. The API is intuitive and it has that Just Works quality that seems rare these days.

A high-level search interface for Debian packages

Posted Nov 19, 2010 10:31 UTC (Fri) by job (guest, #670) [Link]

Most of the full text search interfaces for software packages have been useless. They take a lot of memory and when you search for simple terms you either get a hundred false positives or no hits at all.

That does not mean the idea is bad, just that the problem lies in its execution. Make it nimble and make sure you have everyday use cases, and that you do it well. It may not be a very rock star programming problem to solve, but as Google has shown, search can be a very useful user interface.

A high-level search interface for Debian packages

Posted Nov 25, 2010 19:26 UTC (Thu) by smemrobo (guest, #61746) [Link] (1 responses)

I always use packages.debian.org/name_of_package and filter of the GUI... it's a great tool...

A high-level search interface for Debian packages

Posted Dec 10, 2010 12:37 UTC (Fri) by ojwb (guest, #67322) [Link]

The search on packages.debian.org is also powered by Xapian.


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds