Nepomuk: sharing application metadata

November 11, 2009

This article was contributed by Ben Martin

The Nepomuk project has the potential to unlock data from its originating application so that it can be used by other applications on the desktop. If Nepomuk becomes pervasive, history logs, bookmarks, file metadata, email, instant messages, photo tags, and other metadata will be shared between various desktop applications. Why should music metadata like track length or artist and song title be locked away in an index created and used exclusively by a music playing application?

Consider a download assistant such as kget. The subversion branch of kget recently got the ability to store its download history using Nepomuk. kget could already save the transfer history in XML or SQLite. The advantage of using Nepomuk is that other desktop tools can easily see where a file was downloaded from and when; the information is no longer locked inside kget. With Nepomuk, other applications don't need to know where the SQLite file is, or find and parse an XML file. All of a sudden, the file manager can let you know where a file came from so you can easily return for newer versions, or a desktop search can reveal all the files downloaded last year from http://example.com.

To allow data to be stored, exchanged, and understood by many applications, Nepomuk uses the same underlying technology that the Semantic Web is designed around. The Semantic Web tries to separate the data from the presentation in a way that allows both humans and computers to inspect and digest the data. At the base of the Semantic Web is the Resource Description Framework (RDF), which aims to allow metadata to be exchanged in an unambiguous, machine-processing-friendly format.

There are many who dismiss the Semantic Web as an ivory tower pipe dream. Various concerns are cited as reasons that RDF will not be adopted: it takes extra time to generate RDF data, it allows for automated comparisons which will make companies uncomfortable, there will be no agreement between companies on which schemas to use, and so on.

Nepomuk and RDF have huge potential on the Free and Open Source Software (FOSS) desktop because application developers have no vested interest in locking their data away, and due to the nature of free software, one can patch RDF and/or Nepomuk support into projects. The last concern, that projects will design their own schemas, is still present for FOSS, but, luckily, schema mismatches in and of themselves are not a show stopper for RDF adoption. By definition, once data is in RDF it can be processed automatically by a computer, so the machine, rather than the human, can work around schema differences.

RDF tries to capture information in the form of triples. The classic examples are relationships and ownerships, for example: "Mary knows Mark" and "dog has tail". To avoid name clashes for things described in RDF, longer URL style identifiers are used for the three pieces of information. To get back to smaller text strings for these URLs, prefixes are used in the style of XML namespaces. For example, foaf:name could be used for a human name which expands to the URL http://xmlns.com/foaf/0.1/name. This way, individual things can still be described concisely, but they should also have globally understood meaning. A foaf:name is a person's name, whereas a toolshed:name might name a screwdriver.
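
The prefix mechanism described above can be sketched in a few lines of standard C++. This is only an illustration of the idea, not code from Nepomuk or Soprano: the PrefixMap type and the expandPrefixedName() function are hypothetical names made up for this example, and only the foaf namespace URL comes from the text.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical prefix table: maps a short prefix to its namespace URL.
using PrefixMap = std::map<std::string, std::string>;

// Expand a prefixed name such as "foaf:name" into a full URL.
// Returns std::nullopt when there is no colon or the prefix is unknown.
std::optional<std::string> expandPrefixedName(const PrefixMap &prefixes,
                                              const std::string &name)
{
    const std::string::size_type colon = name.find(':');
    if (colon == std::string::npos)
        return std::nullopt;

    const auto it = prefixes.find(name.substr(0, colon));
    if (it == prefixes.end())
        return std::nullopt;

    // Namespace URL followed by the local part after the colon.
    return it->second + name.substr(colon + 1);
}
```

With a table that maps foaf to http://xmlns.com/foaf/0.1/, expanding foaf:name gives the URL quoted above, while toolshed:name yields no value until a toolshed namespace is registered.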

Below is an example of using sopranocmd, the command line tool from the Soprano library that Nepomuk builds on, to add triples to an RDF store and list them:

    $ sopranocmd --backend redland add \
        "<http://onto.libferris.com/things/1234>" "<http://onto.libferris.com/price>" "30"
    $ sopranocmd --backend redland add \
        "<http://onto.libferris.com/things/1234>" "<http://onto.libferris.com/title>" \
        "super crazy magical item"

    $ sopranocmd --backend redland list
    <http://onto.libferris.com/things/1234> <http://onto.libferris.com/price> \
        "30"^^<http://www.w3.org/2001/XMLSchema#int> (empty)
    <http://onto.libferris.com/things/1234> <http://onto.libferris.com/title> \
        "super crazy magical item"^^<http://www.w3.org/2001/XMLSchema#string> (empty)
    Total results: 2
    Execution time: 00:00:00.1

While an RDF repository can be used to just store, update, and query these triples, a schema can also be imposed so that applications know what to expect. For example, a schema can state that foaf:homepage is a link to a web site and is subject to certain constraints. Examples of constraints include the type of data stored (integer, date, etc.), how many times a property can appear (only one homepage), and so on.
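
As a rough illustration of what such a constraint means in practice, the sketch below checks the "how many times a property can appear" rule against a list of triples. It is a minimal stand-in written with the standard library only; the Triple struct and both function names are assumptions made for this example and do not correspond to any Nepomuk or Soprano API. Type constraints are left out to keep it short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory triple; real stores use full URIs and typed literals.
struct Triple {
    std::string subject;
    std::string predicate;
    std::string object;
};

// Count how many values a subject has for a given property.
std::size_t countValues(const std::vector<Triple> &store,
                        const std::string &subject,
                        const std::string &predicate)
{
    std::size_t n = 0;
    for (const Triple &t : store)
        if (t.subject == subject && t.predicate == predicate)
            ++n;
    return n;
}

// True when the subject has at most maxCount values for the property,
// for example at most one foaf:homepage.
bool satisfiesMaxCardinality(const std::vector<Triple> &store,
                             const std::string &subject,
                             const std::string &predicate,
                             std::size_t maxCount)
{
    return countValues(store, subject, predicate) <= maxCount;
}
```

A store holding a single foaf:homepage triple for a person passes the check with a maximum of one; adding a second homepage for the same person makes it fail.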

The SPARQL query language can be used to join together the triples and select the information of interest. While SPARQL uses familiar SQL-like keywords such as SELECT, WHERE, ORDER BY, and LIMIT, joining triples is a bit different than in SQL. For example, the query below grabs the price and title for "something". We don't particularly care what the something is, as long as the same something has a title and a price of less than 30.5.

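    # The ns: namespace below is a placeholder; substitute the real one.
    PREFIX  dc: <http://purl.org/dc/elements/1.1/>
    PREFIX  ns: <http://example.org/ns#>
    SELECT  ?title ?price
    WHERE   { ?x ns:price ?price .
              FILTER (?price < 30.5)
              ?x dc:title ?title . }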
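
To make the join concrete, here is a hedged sketch of what evaluating that query amounts to over a plain list of triples: find every price triple below the limit, then look for a title triple that shares the same subject. It uses only the C++ standard library; the Triple struct and titlesCheaperThan() are hypothetical names for this example, and a real SPARQL engine works quite differently internally.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory triple with the object kept as plain text.
struct Triple {
    std::string subject;
    std::string predicate;
    std::string object;
};

// Return (title, price) pairs for every subject that has both a price
// below maxPrice and a title. The shared subject plays the role of ?x.
std::vector<std::pair<std::string, double>>
titlesCheaperThan(const std::vector<Triple> &store,
                  const std::string &pricePredicate,
                  const std::string &titlePredicate,
                  double maxPrice)
{
    std::vector<std::pair<std::string, double>> rows;
    for (const Triple &p : store) {
        if (p.predicate != pricePredicate)
            continue;
        // Assumes the price literal is numeric; std::stod throws otherwise.
        const double price = std::stod(p.object);
        if (!(price < maxPrice))
            continue;                                // FILTER (?price < ...)
        for (const Triple &t : store)
            if (t.predicate == titlePredicate && t.subject == p.subject)
                rows.emplace_back(t.object, price);  // same ?x in both patterns
    }
    return rows;
}
```

Run against the two triples added with sopranocmd earlier, using the libferris price and title URLs as the two predicates and 30.5 as the limit, this returns the single row for "super crazy magical item" with a price of 30.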

With all this talk of RDF, triples, and ivory towers, one might think that using Nepomuk and RDF will be painful and have an extremely steep learning curve. Below are a few examples of using Nepomuk in a KDE application to quell those fears. Nepomuk makes using RDF simple because it provides a code generator that produces native C++ classes for interacting with the RDF store:

    Nepomuk::File f( "/home/foo/bar.txt" );
    f.setAnnotation( "This is just a test file that contains nothing of interest." );

The above is much neater than thinking in terms of the triples shown below, which might be stored to represent it. In this case X will really be a persistent unique identifier used to identify the file, similar to the device number and inode in the kernel. The type, file, etc. will of course be longer URIs in the real RDF store.

    X type        file
    X url         "/home/foo/bar.txt"
    X annotation  "This is just a..."
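
The relationship between the object-style call and the triples can be sketched as follows. This is not the real Nepomuk::File class; FileResource is a hypothetical stand-in written with the standard library, the short predicate names mirror the simplified triples above, and treating the annotation as single-valued is an assumption made for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory triple using the short names from the text.
struct Triple {
    std::string subject;
    std::string predicate;
    std::string object;
};

// Hypothetical wrapper, NOT Nepomuk::File: it only shows how a setter
// on a C++ object can translate into triples in a store.
class FileResource {
public:
    FileResource(std::vector<Triple> &store, std::string id, const std::string &path)
        : m_store(store), m_id(std::move(id))
    {
        m_store.push_back({m_id, "type", "file"});
        m_store.push_back({m_id, "url", path});
    }

    void setAnnotation(const std::string &text)
    {
        // Assumption: one annotation per file, so overwrite an existing one.
        for (Triple &t : m_store) {
            if (t.subject == m_id && t.predicate == "annotation") {
                t.object = text;
                return;
            }
        }
        m_store.push_back({m_id, "annotation", text});
    }

private:
    std::vector<Triple> &m_store;  // store outlives the wrapper
    std::string m_id;              // stands in for the persistent identifier X
};
```

Constructing a FileResource for /home/foo/bar.txt with the identifier X and calling setAnnotation() once leaves exactly the three triples listed above in the store.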

The above example, which uses setAnnotation(), takes advantage of a schema for annotating and tagging files which comes with Nepomuk itself. The kget program mentioned earlier in the article is a good example of not using a standard schema. In the sources of kget, the transferhistorystore.cpp file manages the XML, SQLite, and Nepomuk representations of download history. At the end of the transferhistorystore.cpp file, there is the following code:

    void NepomukStore::saveItem(const TransferHistoryItem &item)
    {
      Nepomuk::HistoryItem historyItem(item.source());
      historyItem.setDestination(item.dest());
      historyItem.setSource(item.source());
      historyItem.setState(item.state());
      historyItem.setSize(item.size());
      historyItem.setDateTime(item.dateTime());
    }

    void NepomukStore::deleteItem(const TransferHistoryItem &item)
    {
      Nepomuk::HistoryItem historyItem(item.source());
      historyItem.remove(); 
      ...

The HistoryItem class is generated by Nepomuk using the custom schema file kget_history.trig, part of which is repeated below. While the schema language that kget_history.trig uses may be unfamiliar, it should still be clear that there is an ndho:HistoryItem which has properties of various types with various restrictions on them, such as a destination property which can appear zero or one times and is a string. Given the schema file below, Nepomuk can generate the C++ class Nepomuk::HistoryItem needed to allow the above C++ code to compile.

    <http://nepomuk.kde.org/ontologies/2008/10/06/ndho> {

	ndho:HistoryItem
	a rdfs:Class ;
	rdfs:comment "A kget history item." ;
	rdfs:label "application" ;
	rdfs:subClassOf rdfs:Resource .

	ndho:destination
	a rdf:Property ;
	rdfs:comment "Destination of the download." ;
	rdfs:domain ndho:HistoryItem ;
	rdfs:label "source" ;
	rdfs:range xsd:string ;
	nrl:maxCardinality "1" . 
	...

At the base of the Nepomuk project is the Soprano library and command line tools which depend only on QtCore, making them a useful RDF library for use on both desktop and mobile platforms. The Nepomuk libraries build on Soprano to make writing KDE applications using RDF simple. One of the great things about the design of Soprano is that there are multiple backends which can store and query RDF. So there can be a memory mapped implementation for a mobile device, or a full-blown database server for a LAN, and applications still use the same API.
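
The "same API, different backends" design can be illustrated with a small abstract interface. The sketch below is an assumption-laden toy in standard C++: TripleBackend, InMemoryBackend, recordDownload(), and the downloadedFrom predicate are all hypothetical names for this example, and Soprano's real backend interface is different.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory triple.
struct Triple {
    std::string subject;
    std::string predicate;
    std::string object;
};

// Hypothetical storage interface; Soprano's real backend API differs.
class TripleBackend {
public:
    virtual ~TripleBackend() = default;
    virtual void add(const Triple &t) = 0;
    virtual std::vector<Triple> list() const = 0;
};

// Simplest possible backend: everything lives in a vector in memory.
class InMemoryBackend : public TripleBackend {
public:
    void add(const Triple &t) override { m_triples.push_back(t); }
    std::vector<Triple> list() const override { return m_triples; }

private:
    std::vector<Triple> m_triples;
};

// Application code depends only on the interface, so another backend
// (for example one talking to a database server) could be swapped in.
// Returns the number of triples in the store after the insert.
std::size_t recordDownload(TripleBackend &backend,
                           const std::string &file,
                           const std::string &sourceUrl)
{
    backend.add({file, "downloadedFrom", sourceUrl});
    return backend.list().size();
}
```

Because recordDownload() only sees a TripleBackend reference, the calling code would not change if the in-memory store were replaced by a backend with the same two methods that persists to disk or talks to a server.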

For a long time Soprano has had two main backends: Redland and Sesame2. The former is a C library for RDF and the latter a Java implementation. While Sesame2 is written in Java, it can deliver better query performance than Redland. This left KDE4 in the predicament that it required Java to achieve good RDF performance. To solve this issue, the new Virtuoso backend was created and is getting to the point where it is stable. As I discovered recently, the main impediment to developing a backend for Soprano is implementing SPARQL.

Adoption remains the major hurdle for Nepomuk and Soprano. With the host of persistence options available, the first thing that comes to an application developer's mind when wanting to store and retrieve data might be flat files, MySQL, SQLite, Berkeley DB, or some generic relational database library. However, when storing data that might be of interest to other applications, using Nepomuk or Soprano has the potential to unlock an application's data. As can be seen above, the main thing to learn is a bit about the schema language, and then native C++ objects can be used to interact with Nepomuk from an application.



Nepomuk: sharing application metadata

Posted Nov 12, 2009 8:36 UTC (Thu) by spaetz (subscriber, #32870) [Link]

If I get it right, then Nepomuk tries to achieve similar things as CouchDB which is an Apache project and which Ubuntu seems to be heavily pushing now. Is that about right?

One problem I always have when I hear a talk of Nepomuk/Tracker developers is that they have evolved such a geek jargon of their own, that if you don't know what SPARQL is from the very beginning, you have a hard time understanding anything Nepomuk can do for you. An accessible description of what Nepomuk can do for a simple-minded application developer such as me, would be nice...

Nepomuk: sharing application metadata

Posted Nov 12, 2009 10:55 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

Warning: guesses follow. It sounds to me like a database of attributes for files on the filesystem. So for example a music player opening the file Music.mp3 might save the attributes "Artist" == "Talented person" and "Track" == "Wonderful symphony". Other applications which knew about the attribute names used by the music player could ask Nepomuk what attributes the file has, and find out the artist and track name without having to delve into the file's format. (Nepomuk might also have a built-in tool that would automatically extract that information from mp3s).

I presume that this information is matched to inodes, not (only) to file names, or moving and renaming files would be a recipe for meta-data-loss...

Nepomuk: sharing application metadata

Posted Nov 12, 2009 11:35 UTC (Thu) by spaetz (subscriber, #32870) [Link]

So in essence, extended file attributes with an indexer/query system. I see.

Nepomuk: sharing application metadata

Posted Nov 12, 2009 11:54 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

If my understanding is correct :) And of course, with the additional difference that the attributes are "name" "verb" "value" triples, and not just "name" "value" pairs like in my examples. Although in many cases the "verb" part will probably just be "is" or "equals".

I wonder whether Nepomuk can store this information in the filesystem for filesystems supporting extended attributes? I think BeOS did this (effectively implementing all this new stuff years ago).

Nepomuk: sharing application metadata

Posted Nov 12, 2009 16:36 UTC (Thu) by nix (subscriber, #2304) [Link]

Using EAs would work great *if* everything you wanted to collect data on
was certainly writable by you. It often isn't, and EAs don't have their
own timestamps: often you don't want your backup program backing
everything up again merely because an indexer iterated over it!

Nepomuk: sharing application metadata

Posted Nov 14, 2009 18:17 UTC (Sat) by Thalience (subscriber, #4217) [Link]

BeOS provided fast extended attributes on files, and optional kernel-maintained indexes on them. It has taken some time, but ext4 and other modern linux filesystems provide reasonable performance on ext attrs. The in-kernel indexer was nifty, but somewhat inflexible.

They also provided conventions for the names and formats of commonly used attributes. For example, the "length" attribute of audio files should be an integer number of seconds, not a string or a floating point number of minutes. I think that this was the most important aspect of making the feature usable on the desktop.

The combination of inotify/tracker (or nepomuk) could do all the same cool things (live queries etc), if people could agree to all use it the same way. Having one clearly defined api to support made it a no-brainer for 3rd party application developers.

The Be community was small enough that converging on a single standard wasn't difficult. The Be weekly newsletter would announce a convention, Be would release first-party apps that supported/expected it, and... done. 3rd party developers either got with the program, or users scorned them.

I turned off the indexer support on my gnome desktop once it became clear that the Nautilus developers were not interested in providing integration with it. Hope the KDE project does better.

Nepomuk: sharing application metadata

Posted Nov 12, 2009 14:46 UTC (Thu) by sebas (subscriber, #51660) [Link]

You can use Nepomuk for that, but it's more extensive.

You're only referring to files here, while files are just one example.
Metadata (what you're referring to as extended file attributes) can also
be attached to more abstract objects, such as a contact, for example.

Those abstract objects are described using ontologies, which is, so to
say, a standardized format for metadata for a specific object.

Also, resources (for example a file or a contact in Nepomuk) don't have to
be local, you can just in the same way attach metadata to webpages you
visited, or to a certain activity or project you're working on.

Right now, Nepomuk is already used for tagging and rating across
applications, that is "attaching a string (tag) or a score to a file",
more use cases are coming up. Mandriva is probably the most advanced in
terms of the semantic desktop, have a look at this page to find out more:
http://doc4.mandriva.org/bin/view/labs/Nepomuk-mdv2010-RC

Nepomuk: sharing application metadata

Posted Nov 12, 2009 19:50 UTC (Thu) by oak (guest, #2786) [Link]

> if you don't know what SPARQL is from the very beginning

Google on "SPARQL" tells that it's a RDF Query Language and an official
W3C Recommendation. RDF is a directed, labeled graph data format for
representing information in the Web. I agree that it's geek jargon, but
so are the other W3C standards like CSS and XHTML. :-)

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds