April 20, 2011
This article was contributed by Koen Vervloesem
Drupal 7 is the first mainstream content management system with
out-of-the-box support for users and developers to share their data in a
machine-readable and interoperable way on the semantic web. At the Drupal Government Days in Brussels, there were
a few talks about the features in Drupal — both in its core and in
extra modules — to present and interlink data on the semantic web.
In his talk "Riding the semantic web with Drupal", Matthias Vandermaesen, senior developer for the Belgian Drupal web development company Krimson, gave both an introduction to the semantic web and an explanation of the Drupal features in this domain. The problem with the "old" web is that it is just a collection of interlinked web pages, according to Vandermaesen: "HTML only describes the structure of documents, and it interlinks documents, not data. The data described by HTML documents is human-understandable but not really machine-readable."
The semantic web, on the other hand, is all about interlinking data in a machine-readable way. Linked Data, a subtopic of the semantic web, is a way to expose, share, and connect pieces of data using URIs (Uniform Resource Identifiers) and RDF (the Resource Description Framework). This provides an open framework with a low barrier to entry, in which browsers and search engines can connect related information from different sources. All entities in a Linked Data dataset, and the relationships between them, are described by RDF statements. RDF provides a generic, graph-based data model to structure and link data. Each RDF statement takes the form of a triple: subject - predicate - object. The subject and predicate are identified by URIs, while the object can either be a URI or a literal value such as a string or a number.
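To make this concrete (a made-up example, not from the talk; the example.org URIs are placeholders), here are two such triples written in the Turtle notation, using the FOAF vocabulary that comes up again later in this article:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Subject                            Predicate   Object (a literal value)
<http://example.org/people/alice>    foaf:name   "Alice" .

# Subject                            Predicate   Object (another URI)
<http://example.org/people/alice>    foaf:knows  <http://example.org/people/bob> .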
The semantic web is not some vague future vision; it's already here, Vandermaesen emphasized. He talked about some "cool stuff" that the semantic web makes possible. For instance, search engines like Google already enrich their search results with relevant information that is expressed in RDFa or microformats markup: if you search for a movie, Google shows some extra information under the link to the movie's IMDb page, such as the rating, the number of people who have rated it, the director, and the main actors. Google shows these so-called "rich snippets" in its results pages for many other types of structured data, such as recipes. Moreover, many social networking web sites like LinkedIn, Twitter, and Facebook (with its Open Graph Protocol) already mark up their profiles with RDFa.
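The markup behind such a movie snippet could look roughly like the following sketch; it is written from memory against Google's rdf.data-vocabulary.org Review vocabulary, so the class and property names should be treated as indicative, and the title and rating are invented:
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemreviewed">The Trouble with Bob</span>
  Rating: <span property="v:rating">8.5</span> out of 10
</div>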
But how do we "get on" the semantic web? This is actually quite simple,
according to Vandermaesen: just use the right technologies to work with
machine-understandable data, like RDF and RDFa, OWL (Web Ontology Language), XML,
and SPARQL (a
recursive acronym for SPARQL Protocol and RDF Query Language). There are two common ways to publish RDF. The first one is to use a triplestore, which is a database much like a relational database, but with data following the RDF model. A triplestore is optimized for the storage and retrieval of RDF triples. Well-known triplestores are Jena, Redland, Soprano, and Virtuoso.
The other way to publish RDF is to embed it in XHTML, in the form of
RDFa. This W3C recommendation specifies a set of attributes that can be
used to carry metadata in an XHTML document. In essence, RDFa maps RDF
triples to XHTML attributes. For instance, a predicate of a triple is
expressed as the contents of the property attribute in an element,
and the object of the same triple is expressed as the contents of the
element itself. For example, using the Dublin Core vocabulary:
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h2 property="dc:title">The trouble with Bob</h2>
</div>
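Run through an RDFa-aware parser, and assuming this snippet appears in a page at the hypothetical address http://example.org/reviews/bob, the markup above yields exactly one triple (shown here in the N-Triples notation, with the predicate URI written out in full):
# The subject is the page itself, since no "about" attribute overrides it.
<http://example.org/reviews/bob>  <http://purl.org/dc/elements/1.1/title>  "The trouble with Bob" .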
One of the benefits of RDFa is that publishers don't have to implement two ways to offer the same content (HTML for humans, RDF for computers), but can publish the same content simultaneously in a human-readable and machine-understandable way by adding the right HTML attributes.
Thanks to this machine-readable data, it's quite easy to connect various data sources. Vandermaesen gave some examples: you could add IMDb ratings to the movies in the schedule of your local movie theatre, or link public transport timetables to Google Maps. This shows one of the key features of the semantic web: data is not contained in a single place, but can be mixed and matched from different sources. "With the semantic web, the web becomes a federated graph, or (as Tim Berners-Lee calls it) a Giant Global Graph," he said.
RDFa in Drupal
"Drupal 7 makes it really easy to automatically publish your data in RDFa," Vandermaesen said, "and search engines such as Google will automatically pick up this machine-readable data to enrich your search results." Indeed, any Drupal 7 site automatically exposes some basic information about pages and articles with RDFa. For instance, the author of a Drupal article or page will be marked up by default with the property sioc:has_creator (SIOC is the Semantically-Interlinked Online Communities vocabulary). Other vocabularies that are supported by default are FOAF (Friend of a Friend), Dublin Core, and SKOS (Simple Knowledge Organization System). Drupal developers can also customize their RDFa output: if they create a new content type, they can define a custom RDF mapping in their code. A recent article on IBM developerWorks by Lin Clark walks the reader through the necessary steps for this.
But apart from RDFa support in the core, there are a couple of extra modules that let Drupal developers really tap into the potential of the semantic web. One of them is the (still experimental) SPARQL Views module, created by Lin Clark and sponsored by Google Summer of Code and the European Commission. With this module, developers can query RDF data with SPARQL (SPARQL is to RDF documents what SQL is to a relational database) and bring the results into Drupal views. This way, you can import knowledge from different sources and display it on your Drupal site in tabular form, with almost no code to write. "Thanks to SPARQL Views, any Drupal web site can integrate Wikipedia info by using the right SPARQL queries to DBpedia," Vandermaesen explained. At his company Krimson, he used (and contributed to) SPARQL Views in a research project sponsored by the Flemish government, with the goal of creating a common platform for exchanging data in an open and transparent fashion between large repositories of digitized audiovisual heritage.
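To give an idea of what such a query looks like (this sketch is not from the talk, and the dbo:abstract property depends on the DBpedia ontology release), the following SPARQL asks DBpedia for the English abstract of its Drupal entry; SPARQL Views would then map the ?abstract variable onto a field in a view:
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?abstract WHERE {
  # The DBpedia resource corresponding to the Wikipedia article on Drupal.
  <http://dbpedia.org/resource/Drupal>  dbo:abstract  ?abstract .
  # Keep only the English-language abstract.
  FILTER ( lang(?abstract) = "en" )
}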
Linked Open Data
In his presentation "Linked Open Data funding in the EU", Stefano Bertolo, a scientific project officer working at the European Commission, gave an overview of the projects the European Union is currently funding to support linked data technologies. He also maintained that governments are likely to become the first beneficiaries of advances in this domain, thanks to Drupal:
Linked Open Data, which is Linked Data open for anyone to use, is really taking off and Drupal is ready for it. There's a massive amount of information you can re-use in your Drupal installation, and this re-usability is the most important aspect of the semantic web. Just like a typical software developer re-uses a lot of software libraries for generic tasks, the semantic web allows you to re-use a lot of generic data. That's why the European Commission has been investing in Linked Open Data technology. Drupal and Linked Data have much to offer to each other, especially in the domain of publishing government data.
Bertolo mentioned three Linked Open Data projects funded by the European Commission. One is OKKAM, a project that ran from January 2008 to June 2010. Its name refers to the philosophical principle of Occam's razor, "Entities should not be multiplied beyond necessity", to which the OKKAM project wants to be a 21st-century equivalent: "Entity identifiers should not be multiplied beyond necessity." What this means is that OKKAM offers an open service on the web that gives a single, globally unique identifier to any entity named on the (semantic) web. This Entity Name System currently holds about 7.5 million entities, such as Barack Obama, the European Union, or Linus Torvalds. When you have found the entity you need in the OKKAM search engine, you can re-use its ID in all your RDF triples to refer unambiguously to that entity.
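In practice that means using the OKKAM-assigned URI as the subject (or object) of your own statements. A rough sketch in Turtle; the identifier below is an invented placeholder rather than a real OKKAM ID, and the exact URI pattern of the service is assumed here:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# The subject is the (placeholder) URI handed out by the Entity Name System;
# everyone who re-uses this URI is unambiguously talking about the same entity.
<http://www.okkam.org/ens/id-PLACEHOLDER>  foaf:name  "Linus Torvalds" .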
Another deliverable of the OKKAM project is sig.ma, a data aggregator for the semantic web. When you search for a keyword, sig.ma combines all the information it can find in the "web of data" and presents it in tabular form. Recently, a spin-off company was founded based on the results of the research project.
The second European-funded project Bertolo talked about was LOD2, a large-scale project with many deliverables. The project aims to contribute high-quality interlinked versions of public semantic web data sets, and it will develop new technologies to bring the performance of RDF triplestores on par with that of relational databases. This is a huge challenge, because a graph-based data model like RDF allows a great deal of freedom, which is difficult to optimize for since there is no strict database schema. The LOD2 project will also develop new algorithms and tools for data cleaning, linking, and merging. For instance, these tools could make it possible to diagnose and repair semantic inconsistencies. Bertolo gave an example: "Let's say that a database lists that a person has had car insurance since 1967, while the same database lists the person's age as 18 years. Syntactically, there are no errors in the database, but semantically we should be able to diagnose the inconsistency here."
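A data-cleaning tool could express such a check as a query. The sketch below is not from the presentation and uses invented predicates (ex:insuredSince, ex:age) purely to illustrate the idea:
PREFIX ex: <http://example.org/schema#>

# Find people whose insurance started before they could have been born:
# an 18-year-old in 2011 was born around 1993, so a policy from 1967
# signals a semantic inconsistency even though both values are syntactically valid.
SELECT ?person WHERE {
  ?person  ex:insuredSince  ?since ;
           ex:age           ?age .
  FILTER ( ?since < (2011 - ?age) )
}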
A third project funded by the European Commission is Linked Open Data Around the Clock. Bertolo explained its goal: "The value of a network highly depends on the number of links, and currently there are not enough links across Linked Open Data datasets. The mission of the Linked Open Data Around the Clock project is to interlink these datasets much more densely, to give people more bang for their RDF buck. Our objective is to have 500 million links in two years." As a testbed, the project started by publishing datasets produced by the European Commission, the European Parliament, and other European institutions as Linked Data on the web and interlinking them with other governmental data.
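Much of that interlinking comes down to publishing triples stating that two URIs in different datasets denote the same thing. A sketch, with made-up resource URIs chosen only for illustration:
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Declare that an (invented) European Parliament identifier and a DBpedia
# resource describe the same real-world entity, so clients can merge their data.
<http://data.europarl.example/mep/12345>  owl:sameAs  <http://dbpedia.org/resource/Example_Politician> .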
Drupal paving the way
At the moment, the semantic web is still struggling with a
chicken-and-egg problem: many semantic web tools are still experimental and
not easy to use for end users, and publishers still have trouble finding a
good business model to publish their data as RDF when their competitors
don't do so. However, with out-of-the-box RDFa support in Drupal 7, the
open source CMS could pave the way for a more widespread adoption of
semantic web technologies: Drupal founder Dries Buytaert claims that his
CMS is already powering more than 1 percent of all websites in the
world. If Drupal keeps growing its market share, the CMS could help bring Linked Open Data to the masses, and we could soon see millions of web sites publishing RDFa data.