The Debian archive is known to be one of the largest software
collections available in the free software world. With more than
16,000 source packages and 30,000 binary packages, users sometimes have trouble
finding packages that are relevant to them. Debian developer Enrico
Zini has been working on infrastructure to solve this problem.
During the recent mini-debconf Paris, Enrico gave a talk presenting what he
has been working on in the last few years, which "hasn't gotten yet the
attention it deserves".
Enrico is
known in the Debian community for the introduction of debtags, a system
used to classify all packages using facets.
Each facet describes a specific kind of property: type of user-interface,
programming language it's written in, type of document manipulated,
purpose of the software, etc. His most recent work builds on that. It is
available in Debian and Ubuntu in the apt-xapian-index package. Its purpose
is to allow advanced queries over the database of available packages.
Users of apt-xapian-index
Enrico started by presenting some early users of the infrastructure. The most
widely known is Ubuntu's software center. Its search feature provides
results almost instantly thanks to apt-xapian-index, but its very simple
interface doesn't exploit many of the advanced features that
apt-xapian-index provides.
Another early adopter, making use of some more advanced features, is
GoPlay!. It's a
graphical user interface to find games. It makes use of debtags to
classify games so that you can browse, for example, all 3D action/arcade
games related to cars. GoPlay! has even been extended into a more generic
debtags-based package browser, and the package now also provides
GoLearn!, GoAdmin!, GoNet!, GoOffice!, GoSafe!, and GoWeb!.
Fuss-launcher is
an application launcher and not a package browser, but by using
apt-xapian-index, it's able to reuse information provided at the package
level to make it easier to find installed applications. Package
descriptions tend to be more verbose than those embedded in .desktop files.
Enrico also showed another nice feature to the audience: if you drag a
document onto its window, it will show you a list of applications that can
open it.
Last but not least, apt-xapian-index provides a command-line search tool
that is vastly superior to the traditional apt-cache search: axi-cache search
(axi stands for apt-xapian-index). Enrico compared the output of both for a
search on the letter "r". While apt-cache spits out a seemingly endless list of
packages containing this letter somewhere in the description, axi-cache only
listed packages related to GNU R. He also demonstrated the contextual tab
completion. It makes it easy to use debtags and to refine your search. Once
you have typed a first keyword, the tab-completion for the second one only
contains keywords or debtags that are actually able to provide more
restrictive results. Advanced queries with logical operations (AND, OR,
NOT, XOR) are also supported.
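The idea behind contextual completion can be illustrated with a toy model. This is only a sketch in plain Python, not axi-cache's actual implementation: the package data, the `matching` and `completions` helpers, and the tag assignments are all made up for illustration. A candidate term is offered only if adding it leaves a result set that is non-empty and strictly smaller than the current one.

```python
# Toy data: package name -> keywords and debtags (hypothetical assignments).
PACKAGES = {
    "r-base": {"statistics", "language", "field::statistics"},
    "gimp": {"image", "editor", "use::editing"},
    "inkscape": {"image", "vector", "use::editing"},
    "vim": {"text", "editor", "use::editing"},
}


def matching(terms):
    """Return the packages that carry every term in `terms` (logical AND)."""
    return {name for name, tags in PACKAGES.items() if terms <= tags}


def completions(typed):
    """Suggest only the terms that would narrow the current result set."""
    current = matching(typed)
    # Candidate terms are those attached to packages that still match.
    candidates = set().union(*(PACKAGES[name] for name in current)) - typed
    return sorted(
        term
        for term in candidates
        if 0 < len(matching(typed | {term})) < len(current)
    )


print(completions({"editor"}))  # "use::editing" is skipped: it narrows nothing
```

In the same toy model, the other logical operations map onto Python set operators applied to the results of `matching`: `|` for OR, `-` for NOT, and `^` for XOR.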
Features of the backend
Enrico then dived into the internals. Xapian's search engine is at the
root of this infrastructure. He likes it because it's a simple library
(i.e. no daemon) and it has nice Python bindings. While apt-xapian-index's
core work is to index the descriptions of all the packages, it actually
stores much more and can be easily extended with plugins (written in
Python).
For instance, the information stored encompasses:
words appearing in the description of the packages (including the translated
descriptions if the user uses a non-English locale);
their origin;
their section;
their size and installed size;
the time they were first seen;
icons, categories, descriptions from the .desktop files they
contain (through app-install-data);
aliases for names of some popular applications that are not
available on Linux (for instance "excel" maps to the debtag
office::spreadsheet).
He already has plans to store more: adding popularity contest data (see wishlist
bugs #602180 and #602182) will make it possible to
sort query results in a useful way. The most widely used applications are
good choices when it comes to community support, and they are likely of
better quality due to the larger user base. Adding timestamps of the last
installation, upgrade, and removal will make it easier to pinpoint a
regression to a specific package update.
The generated index is world-readable and can be used from any
application provided it can use the Xapian library—which is written
in C++ but has bindings for Perl, Python, PHP, Java, Tcl, C#, and
Ruby.
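As a rough idea of what using the index from Python might look like, here is a minimal sketch based on Xapian's Python bindings. Several details are assumptions rather than facts taken from the talk: the index location (/var/lib/apt-xapian-index/index), the "XT" prefix for debtags terms, and the package name being stored as the document data. The helper names `tag_terms` and `search_by_tags` are hypothetical, and the documentation shipped with the package should be checked before relying on any of this.

```python
def tag_terms(tags, prefix="XT"):
    """Turn debtags such as 'game::arcade' into prefixed index terms.

    The "XT" prefix is an assumption about how the index stores tags.
    """
    return [prefix + tag for tag in tags]


def search_by_tags(tags, index_path="/var/lib/apt-xapian-index/index", limit=10):
    """Return up to `limit` package names carrying all of the given debtags."""
    import xapian  # Xapian's Python bindings must be installed

    db = xapian.Database(index_path)  # opens the index read-only
    query = xapian.Query(xapian.Query.OP_AND, tag_terms(tags))
    enquire = xapian.Enquire(db)
    enquire.set_query(query)

    names = []
    for match in enquire.get_mset(0, limit):
        data = match.document.get_data()  # assumed to hold the package name
        names.append(data.decode("utf-8") if isinstance(data, bytes) else data)
    return names


# Example call on a machine that has the index and the bindings:
#   search_by_tags(["game::arcade", "interface::3d"])
```

The import is kept inside `search_by_tags` so that the term-building helper can be tried out even on a machine without Xapian installed.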
Call for experimentation
Enrico believes that many useful applications have yet to be invented
on top of apt-xapian-index's features. He's calling for experimentation
and asking for new ideas. The only practical limit that he has encountered
is the size of the index: currently it varies between 50 MB (Debian
unstable without translation) and 70 MB (Debian stable/testing/unstable
with one translation). He would like it to not grow over 100 MB since it's
installed by default (due to aptitude recommending it) and he's not
comfortable with the idea of using more than 20% of the disk footprint
of a basic install just for this service. That's why the index was
configured to not store the position of the terms: it's thus not possible
to find packages whose description contains the word "statistical"
immediately followed by the word "computing". You can however find
those which have both terms somewhere in their description.
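The trade-off can be shown with a toy inverted index in plain Python. This is a simplified model, not how Xapian stores its data, and the two package descriptions are invented. Without term positions, the index can only say which documents contain a term; answering a phrase query requires keeping every offset, which is what makes a positional index larger.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_without_positions(docs):
    """Map each term to the set of documents containing it (no offsets)."""
    index = {}
    for name, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(name)
    return index


def index_with_positions(docs):
    """Map each term to {document: [offsets]}, which takes more space."""
    index = {}
    for name, text in docs.items():
        for offset, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(name, []).append(offset)
    return index


def all_terms(index, terms):
    """Documents containing every term, in any order."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def phrase(pos_index, first, second):
    """Documents where `second` immediately follows `first`."""
    hits = set()
    for name, offsets in pos_index.get(first, {}).items():
        following = set(pos_index.get(second, {}).get(name, []))
        if any(offset + 1 in following for offset in offsets):
            hits.add(name)
    return hits


# Invented descriptions for two hypothetical packages.
DOCS = {
    "pkg-a": "GNU R statistical computing system",
    "pkg-b": "computing library with statistical helpers",
}

print(all_terms(index_without_positions(DOCS), ["statistical", "computing"]))
print(phrase(index_with_positions(DOCS), "statistical", "computing"))
```

Both packages match the AND query, but only the positional index can tell that "computing" directly follows "statistical" in just one of them.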
Enrico wondered if apt-xapian-index offers too much freedom.
That could explain why few people have experimented with it despite his
numerous blog posts with code samples and information on how to get
started using it.
But it's not difficult to imagine use cases for this data. It could be used
to extend tools like rc-alert or wnpp-alert, for example. They provide a long
list of Debian packages that are looking for some help and are
installed on the machine. With apt-xapian-index, it would be possible to
restrict the results to the set of packages written in a specific programming
language or for a particular desktop environment.
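Such a filter could be sketched as follows. Everything here is hypothetical: the package names, the tag assignments, and the `restrict` helper are invented for illustration, and a real tool would take the list from rc-alert or wnpp-alert and the tags from the index rather than from hard-coded dictionaries.

```python
# Hypothetical list of installed packages that are looking for help.
NEEDS_HELP = ["foo-editor", "libbar-perl", "baz-applet"]

# Hypothetical debtags for those packages.
DEBTAGS = {
    "foo-editor": {"implemented-in::python", "uitoolkit::gtk"},
    "libbar-perl": {"implemented-in::perl", "role::shared-lib"},
    "baz-applet": {"implemented-in::python", "suite::gnome"},
}


def restrict(packages, tags_by_package, wanted):
    """Keep only the packages carrying every tag in `wanted`, in order."""
    wanted = set(wanted)
    return [p for p in packages if wanted <= tags_by_package.get(p, set())]


print(restrict(NEEDS_HELP, DEBTAGS, ["implemented-in::python"]))
```

A Python developer would then see only the two Python packages from the list, instead of every package that needs help.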
The more likely explanation is that too few people know about
the tool. There are many more itches to scratch where
apt-xapian-index's features could be very useful, and my guess is
that Enrico's wishes will eventually come true.