Drupal Government Days: Drupal and the semantic web

April 20, 2011

This article was contributed by Koen Vervloesem

Drupal 7 is the first mainstream content management system with out-of-the-box support for users and developers to share their data in a machine-readable and interoperable way on the semantic web. At the Drupal Government Days in Brussels, there were a few talks about the features in Drupal — both in its core and in extra modules — to present and interlink data on the semantic web.

In his talk "Riding the semantic web with Drupal", Matthias Vandermaesen, senior developer for the Belgian Drupal web development company Krimson, gave both an introduction to the semantic web and an explanation of the Drupal features in this domain. The problem with the "old" web is that it is just a collection of interlinked web pages, according to Vandermaesen: "HTML only describes the structure of documents, and it interlinks documents, not data. The data described by HTML documents is human-understandable but not really machine-readable."

The semantic web, on the other hand, is all about interlinking data in a machine-readable way, and Linked Data, a subtopic of the semantic web, is a way to expose, share, and connect pieces of data using URIs (Uniform Resource Identifiers) and RDF (Resource Description Framework). This provides an open, low-barrier framework in which browsers and search engines can connect related information from different sources. All entities in a Linked Data dataset, and the relationships between them, are described by RDF statements. RDF provides a generic, graph-based data model to structure and link data. Each RDF statement comes in the form of a triple: subject - predicate - object. The subject and predicate are identified by URIs, while the object can either be a URI or a literal value such as a string or a number.
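The triple model is simple enough to sketch in a few lines of Python. In this toy example (all URIs and the `match()` helper are invented for illustration, not part of any RDF library), a graph is just a set of triples and a query is a pattern with wildcards:

```python
# A toy RDF graph: a set of (subject, predicate, object) triples.
# All URIs are invented for illustration.
graph = {
    ("http://example.org/film/42",
     "http://purl.org/dc/elements/1.1/title",
     "The Trouble with Bob"),
    ("http://example.org/film/42",
     "http://example.org/vocab/director",
     "http://example.org/person/7"),
    ("http://example.org/person/7",
     "http://xmlns.com/foaf/0.1/name",
     "Alice Example"),
}

def match(triples, s=None, p=None, o=None):
    """Return every triple matching the pattern; None is a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Everything the graph says about the film:
for triple in match(graph, s="http://example.org/film/42"):
    print(triple)
```

A real triplestore does essentially this pattern matching, but with indexes and a query language (SPARQL) instead of a linear scan.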

The semantic web is not some vague future vision; it's already here, Vandermaesen emphasized. He talked about some "cool stuff" that the semantic web makes possible. For instance, search engines like Google already enrich their search results with relevant information expressed in RDFa or microformats markup: if you search for a movie, Google shows some extra information under the reference to the movie's IMDb page, such as the rating, the number of people who have rated it, the director, and the main actors. Google shows these so-called "rich snippets" in its results page for many other types of structured data, such as recipes. Moreover, many social networking web sites like LinkedIn, Twitter, and Facebook (with its Open Graph Protocol) already mark up their profiles with RDFa.

But how do we "get on" the semantic web? This is actually quite simple, according to Vandermaesen: just use the right technologies to work with machine-understandable data, like RDF and RDFa, OWL (Web Ontology Language), XML, and SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language). There are two common ways to publish RDF. The first one is to use a triplestore, which is a database much like a relational database, but with data following the RDF model. A triplestore is optimized for the storage and retrieval of RDF triples. Well-known triplestores are Jena, Redland, Soprano, and Virtuoso.

The other way to publish RDF is to embed it in XHTML, in the form of RDFa. This W3C recommendation specifies a set of attributes that can be used to carry metadata in an XHTML document. In essence, RDFa maps RDF triples to XHTML attributes. For instance, a predicate of a triple is expressed as the contents of the property attribute in an element, and the object of the same triple is expressed as the contents of the element itself. For example, using the Dublin Core vocabulary:

    <div xmlns:dc="http://purl.org/dc/elements/1.1/">
        <h2 property="dc:title">The trouble with Bob</h2>
    </div>

One of the benefits of RDFa is that publishers don't have to maintain two separate representations of the same content (HTML for humans, RDF for computers); by adding the right HTML attributes, they publish content that is simultaneously human-readable and machine-understandable.
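The triple encoded by the snippet above can be recovered mechanically. The following Python sketch is a toy extractor, not a conforming RDFa parser: the subject URI is an assumption (in real RDFa the subject defaults to the URL of the containing document), and the prefixed predicate is left unexpanded:

```python
import xml.etree.ElementTree as ET

# The RDFa snippet from the article.
xhtml = """<div xmlns:dc="http://purl.org/dc/elements/1.1/">
    <h2 property="dc:title">The trouble with Bob</h2>
</div>"""

# Assumed subject: the URL of the page carrying the markup.
subject = "http://example.org/page"

triples = []
for element in ET.fromstring(xhtml).iter():
    predicate = element.get("property")  # e.g. "dc:title"
    if predicate is not None:
        # The object is the text content of the element.
        triples.append((subject, predicate, element.text))

print(triples)
```

A real RDFa processor would additionally expand `dc:title` against the `xmlns:dc` declaration to the full Dublin Core URI.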

Thanks to this machine-readable data, it's quite easy to connect various data sources. Vandermaesen gave some examples: you could add IMDb ratings to the movies in the schedule of your local movie theater, and you could link public transport timetables to Google Maps. This shows one of the key features of the semantic web: data is not confined to a single place, but can be mixed and matched from different sources. "With the semantic web, the web becomes a federated graph, or (as Tim Berners-Lee calls it) a Giant Global Graph", he said.

RDFa in Drupal

"Drupal 7 makes it really easy to automatically publish your data in RDFa," Vandermaesen said, "and search engines such as Google will automatically pick up this machine-readable data to enrich your search results." Indeed, any Drupal 7 site automatically exposes some basic information about pages and articles with RDFa. For instance, the author of a Drupal article or page will be marked up by default with the property sioc:has_creator (SIOC is the Semantically-Interlinked Online Communities vocabulary). Other vocabularies that are supported by default are FOAF (Friend of a Friend), Dublin Core, and SKOS (Simple Knowledge Organization System). Drupal developers can also customize their RDFa output: if they create a new content type, they can define a custom RDF mapping in their code. A recent article on IBM developerWorks by Lin Clark walks the reader through the necessary steps for this.

But apart from RDFa support in the core, there are a couple of extra modules that let Drupal developers really tap into the potential of the semantic web. One of them is the (still experimental) SPARQL Views module, created by Lin Clark and sponsored by Google Summer of Code and the European Commission. With this module, developers can query RDF data with SPARQL (SPARQL is to RDF documents what SQL is to a relational database) and bring the data into Drupal views. This way, you can import knowledge from different sources and display it on your Drupal site in tabular form, with almost no code to write. "Thanks to SPARQL Views, any Drupal web site can integrate Wikipedia info by using the right SPARQL queries to DBpedia," Vandermaesen explained. At his company Krimson, he used (and contributed to) SPARQL Views in a research project sponsored by the Flemish government, with the goal of creating a common platform to facilitate the open and transparent exchange of data between large repositories of digitized audiovisual heritage.
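To give an idea of what such a DBpedia query looks like, here is a sketch that builds (but does not send) an HTTP request to DBpedia's public SPARQL endpoint. The query asks for a film's director; the endpoint URL and the `dbo:director` property follow DBpedia's conventions, but treat the exact request parameters as an assumption about a typical endpoint:

```python
from urllib.parse import urlencode

# A SPARQL query asking DBpedia for the director of a film.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?director WHERE {
    <http://dbpedia.org/resource/Blade_Runner> dbo:director ?director .
}
"""

# SPARQL endpoints are queried over plain HTTP; this only builds the
# request URL (actually sending it would need network access).
endpoint = "http://dbpedia.org/sparql"
url = endpoint + "?" + urlencode({"query": query,
                                  "format": "application/sparql-results+json"})
print(url)
```

SPARQL Views essentially automates this step: it sends such queries for you and maps the result rows onto fields in a Drupal view.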

Linked Open Data

In his presentation "Linked Open Data funding in the EU", Stefano Bertolo, a scientific project officer working at the European Commission, gave an overview of the projects the European Union is currently funding to support linked data technologies. He also maintained that governments are likely to become the first beneficiaries of advances in this domain, thanks to Drupal:

Linked Open Data, which is Linked Data open for anyone to use, is really taking off and Drupal is ready for it. There's a massive amount of information you can re-use in your Drupal installation, and this re-usability is the most important aspect of the semantic web. Just like a typical software developer re-uses a lot of software libraries for generic tasks, the semantic web allows you to re-use a lot of generic data. That's why the European Commission has been investing in Linked Open Data technology. Drupal and Linked Data have much to offer to each other, especially in the domain of publishing government data.

Bertolo mentioned three Linked Open Data projects funded by the European Commission. One is OKKAM, a project that ran from January 2008 to June 2010. Its name refers to the philosophical principle of Occam's razor, "Entities should not be multiplied beyond necessity", of which the OKKAM project wants to be a 21st-century equivalent: "Entity identifiers should not be multiplied beyond necessity." What this means is that OKKAM offers an open service on the web that assigns a single, globally unique identifier to any entity named on the (semantic) web. This Entity Name System currently holds about 7.5 million entities, such as Barack Obama, the European Union, and Linus Torvalds. Once you have found the entity you need in the OKKAM search engine, you can re-use its ID in all your RDF triples to refer to the entity unambiguously.

Another deliverable of the OKKAM project is sig.ma, a data aggregator for the semantic web. When you search for a keyword, sig.ma combines all the information it can find in the "web of data" and presents it in tabular form. A spin-off company was recently founded based on the results of the research project.

The second European-funded project Bertolo talked about was LOD2, a large-scale project with many deliverables. The project aims to contribute high-quality interlinked versions of public semantic web data sets, and it will develop new technologies to raise the performance of RDF triplestores to be on par with relational databases. This is a huge challenge, because a graph-based data model like RDF allows many degrees of freedom, which makes it hard to optimize: there is no strict database schema to exploit. The LOD2 project will also develop new algorithms and tools for data cleaning, linking, and merging. For instance, these tools could make it possible to diagnose and repair semantic inconsistencies. Bertolo gave an example: "Let's say that a database lists that a person has had car insurance since 1967 while the same database lists the person's age as 18 years. Syntactically, there are no errors in the database, but semantically we should be able to diagnose the inconsistency here."
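Bertolo's example boils down to a rule over two facts. The sketch below is a toy version of such a consistency check (the function name and rule are invented for illustration, not part of LOD2): a policy cannot predate the holder's birth, so a policy year earlier than the earliest plausible birth year flags the record.

```python
def insurance_record_consistent(age, insured_since_year, current_year=2011):
    """Toy rule: a policy cannot predate the holder's birth.

    A person aged `age` in `current_year` was born no earlier than
    current_year - age - 1, so any policy starting before that year
    is semantically inconsistent with the recorded age.
    """
    earliest_birth_year = current_year - age - 1
    return insured_since_year >= earliest_birth_year

# The record from Bertolo's example: aged 18, insured since 1967.
print(insurance_record_consistent(18, 1967))
```

Real diagnosis tools would express such rules in an ontology language like OWL and let a reasoner find the contradictions, rather than hard-coding each rule.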

A third project by the European Commission is Linked Open Data Around the Clock. Bertolo explains its goal: "The value of a network highly depends on the number of links, and currently the links across Linked Open Data datasets are not enough. The mission of the Linked Open Data Around the Clock project is to interlink these much more, to give people more bang for their RDF buck. Our objective is to have 500 million links in two years." As a testbed, the project started with publishing datasets produced by the European Commission, the European Parliament, and other European institutions as Linked Data on the Web and interlinking them with other governmental data.

Drupal paving the way

At the moment, the semantic web is still struggling with a chicken-and-egg problem: many semantic web tools are still experimental and not easy to use for end users, and publishers still have trouble finding a good business model to publish their data as RDF when their competitors don't do so. However, with out-of-the-box RDFa support in Drupal 7, the open source CMS could pave the way for a more widespread adoption of semantic web technologies: Drupal founder Dries Buytaert claims that his CMS is already powering more than 1 percent of all websites in the world. If Drupal keeps growing its market share, the CMS could help to bring Linked Open Data to the masses, and we could soon have millions of web sites with RDFa data on the web.



Drupal Government Days: Drupal and the semantic web

Posted Apr 21, 2011 6:40 UTC (Thu) by Tara_Li (subscriber, #26706)

"Let's say that a database lists that a person has had a car insurance since 1967 while the same database lists the person's age as 18 years. Syntactically, there are no errors in the database, but semantically we should be able to diagnose the inconsistency here."

And someone will have to go through and fix the inconsistency - yeah, right. The software will be written to make some kind of semi-educated guess from its Artificial Stupidity engine, and change one of the data to make the database consistent. And all of a sudden, the IRS is knocking on some poor kid's dorm room, asking where his 1040s for 1980-2000 are, since he obviously had enough money to pay for car insurance. Reality will not matter - the computer says, the computer is right, pay up.

Drupal Government Days: Drupal and the semantic web

Posted Apr 21, 2011 13:17 UTC (Thu) by dmk (subscriber, #50141)

We have these problems already with digital maps... errors in any map providers source material will lead to delivery problems in the real world 'cause, well, people are nowadays using navigation systems... this means, no fed-ex, no pizza, no amazon, no nothing anymore.... until all people have updated their navigations software. And yeah... that will happen... in opposite world..

Drupal Government Days: Drupal and the semantic web

Posted Apr 21, 2011 16:25 UTC (Thu) by mrfredsmoothie (guest, #3100)

Don't confuse the capability to do inference with the response to the knowledge inferred.

For example, if you own both the system that holds the insurance info and the system that holds the user info, you could use that information to reject the inconsistent data in the first place. What you do with it is almost beside the point. The value of having this capability in a way which still allows your systems to be loosely coupled should be fairly obvious.

Drupal Government Days: Drupal and the semantic web

Posted Apr 21, 2011 18:10 UTC (Thu) by Tara_Li (subscriber, #26706)

Only as long as someone actually does it. It's like the wonderful world of tagging - great, as long as someone actually *enters* all of the tags on the data, rationalizes and fixes typos, links synonyms in some manner. What people *WILL* do seldom matches what they *should* do.

This "semantic web" stuff is well and good - I just don't think it's really "all that". After all, someone has to actually get around to asking the computer, in the example provided, to check age against length of time holding the policy.

Drupal Government Days: Drupal and the semantic web

Posted Apr 21, 2011 18:30 UTC (Thu) by vonbrand (subscriber, #4458)

Exactly. The big advantage Google had over Altavista and other early search engines was that they didn't rely on human-entered data descriptions, just raw computer power (and let the user sort it out). Still mostly the same, AFAICS.

Drupal Government Days: Drupal and the semantic web

Posted Apr 22, 2011 3:16 UTC (Fri) by raven667 (subscriber, #5198)

Judging from some of the examples, it seems organizations like Google are the primary consumers of this data: they retrieve it, index it, store it, and answer questions using it. They have the data and the raw compute power to do interesting things with it, should they find a market for the answers.

RDFa exists in regular HTML as well

Posted Apr 21, 2011 17:46 UTC (Thu) by Simetrical (guest, #53439)

Nitpick:

"The other way to publish RDF is to embed it in XHTML, in the form of RDFa. This W3C recommendation specifies a set of attributes that can be used to carry metadata in an XHTML document. In essence, RDFa maps RDF triples to XHTML attributes."

RDFa can be output in HTML 4 and 5, not just XHTML. The relevant standard is here:

http://dev.w3.org/html5/rdfa/

A competing standard for outputting semantic info in HTML is microdata:

http://dev.w3.org/html5/md/

It's newer and has a lot less buy-in than RDFa, but its proponents consider it a lot simpler.

Spam

Posted May 17, 2011 18:31 UTC (Tue) by job (guest, #670)

When I think about the enormous resources that go into simple comment spamming on the web, it scares me to think what they could do on the semantic web. I'm sure there's a solution to this, I'm just not sure it will turn out feasible.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds