
Semantic MediaWiki: Toward smarter wikis

July 13, 2011

This article was contributed by Koen Vervloesem

Interactive Knowledge Stack (IKS) is an open source project focused on building an open and flexible technology platform for semantically enhanced Content Management Systems (CMS). It is a collaboration between academia, industry, and open source developers, co-funded with €6.58 million by the European Union. The goal is to enrich content management systems with semantic content in order to let the users benefit from more intelligent extraction and linking of their content. This could solve part of the chicken-and-egg problem for the semantic web that arises because end users don't have easy-to-use semantic web tools.

At the recent IKS workshop in Paris, one of the keynote speakers was Mark Greaves, who spoke about the possibilities of the semantic web in the wiki setting. His speech looked at the limits of traditional wikis, the promise of semantic wikis, and the birth of Semantic MediaWiki (SMW), an extension to MediaWiki, the wiki software that powers Wikipedia.

Wikis have become a powerful instrument for crowdsourcing, but they're not the only types of content management systems that tap into the potential of the crowd. Greaves, who is working as Director of Knowledge Systems at Paul Allen's asset management company Vulcan, emphasized that bulletin boards, forums, and newsgroups are the antecedents of wikis and even the beginnings of social networks. Now we have many websites that crowdsource their content from their users:

Users write reviews on Amazon, give recommendations on Amazon and Netflix, add tags to content on Flickr and YouTube, and so on. So we see in a lot of cases that users can help build your website, not only the content but even the structure. For instance, the system administrators of Wikimedia only have to run the servers; all the rest is done by volunteers.

A critical property of wikis is consensus, which comes from collaboration and custom policies. For instance, one of the core content policies of Wikipedia is that each article should be written from a neutral point of view (NPOV). This discourages authors from writing from their own point of view, which helps them reach consensus with authors who see the topic differently. MediaWiki also has features that facilitate reaching consensus, such as talk pages and change tracking.

But traditional wikis have their limits, as most knowledge is locked inside text and cannot be queried in a smart way. Wikipedia has an answer for this with thousands of lists, for instance lists of countries (which is itself a list of lists). But these are all manually maintained, each of them ordered by a different property, like birth rate, literacy rate, population, income equality, and so on. So Greaves asked the logical question: "Why don't we give Wikipedia authors a way to add structure to their content?"

Semantic MediaWiki

That's where semantic wikis come in, and according to Greaves they hold a lot of promise:

Semantic wikis augment traditional wikis with database capabilities, but with one crucial difference: traditional databases are schema-first, while semantic wikis are schema-last: the database schema is developed and maintained in the wiki by the authors themselves. So with semantic wikis we get a lot more flexibility and we have the means to reach social consensus over data. Unfortunately, consensus over data is very hard, and one of the prime reasons is that data modeling is a highly specialized skill.

One project working to add semantics to wiki systems is Semantic MediaWiki (SMW), a GPL licensed extension to MediaWiki that allows annotating semantic data within wiki pages. This means that a MediaWiki wiki that incorporates the extension is turned into a semantic wiki: content that has been enriched with semantic information can be used in specialized searches, used for aggregation of pages, displayed in alternate formats like maps, calendars, or graphs, and exported to formats like RDF (Resource Description Framework) and CSV (Comma-Separated Values).

How does this work?

Some examples will make it clear what SMW adds. For instance, on the normal Wikipedia page of France, there's a link to its capital city, Paris:

    ... the capital city is [[Paris]] ...

The [[Paris]] code is a link to a wiki page about Paris, but there's no information encoded about the specific relationship between France and Paris.

In contrast to this classical approach, the semantic web is all about interlinking data in a machine-readable way. The core technology under the hood is RDF, which is used to describe entities and their relationships. Each RDF statement comes in the form of a triple: subject - predicate - object. Each subject and predicate is identified by a URI, while an object can be represented by a URI or be a literal value such as a string or a number. So, a Semantic MediaWiki version of the sentence about Paris could be:

    ... the capital city is [[Has capital::Paris]] ...

The [[Has capital::Paris]] code not only adds a link to a wiki page about Paris, it also specifies the nature of the relationship between France and Paris: France has Paris as its capital. Or to translate it into an RDF triple: "France" (which is implicit as it is the topic of the current page) is the subject, "has capital" is the predicate, and "Paris" is the object.
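To make the mapping from markup to triple concrete, here is a minimal Python sketch. It is not how SMW itself parses pages: the regular expression, the `Triple` type, and the `extract_triples` function are names assumed for illustration, and the pattern handles only the simple `[[Property::Value]]` form shown above.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Matches only the simple [[Property::Value]] form (illustrative, not SMW's
# real grammar). A plain link such as [[Paris]] has no "::" and is skipped.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")


def extract_triples(page_title: str, wikitext: str) -> List[Triple]:
    """Return one triple per annotation; the page title is the implicit subject."""
    return [
        Triple(page_title, match.group(1).strip(), match.group(2).strip())
        for match in ANNOTATION.finditer(wikitext)
    ]


text = "... the capital city is [[Has capital::Paris]] ..."
triples = extract_triples("France", text)
print(triples)
```

Run against the France example, this yields a single triple with "France" as subject, "Has capital" as predicate, and "Paris" as object, while an untyped `[[Paris]]` link produces no triple at all.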

This is an example where the object can be represented by a URI, but there are also other examples where the object is represented by a literal value such as a number:

    ... its population is [[Has population::65,821,885]] ...

When this code is on the page about France, it represents an RDF triple with "France" as its subject, "has population" as its predicate, and "65,821,885" as its object. These typed links (with the predicate as the type) give SMW an out-of-the-box mechanism to automatically generate lists. With SMW's inline queries feature, it's easy to re-use this structured information to generate lists and tables which are automatically updated and cached. For instance, users can easily generate a page with a list of all countries ordered by their population, or a list of all countries with a population greater than 20 million, or a table of all countries with their capitals, and so on.
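The lists described above can be sketched in a few lines of Python. This only illustrates the idea and is not SMW's inline query feature: the in-memory `facts` list and the helper names `property_values` and `countries_by_population` are assumptions, and only France's population comes from the text; the figures for Germany and Belgium are rough placeholder values.

```python
from typing import Dict, List, Tuple

# (subject, predicate, object) facts, as they might be collected from
# annotated pages. Only France's population is taken from the article;
# the other two figures are rough illustrative values.
facts: List[Tuple[str, str, str]] = [
    ("France", "Has capital", "Paris"),
    ("France", "Has population", "65,821,885"),
    ("Germany", "Has capital", "Berlin"),
    ("Germany", "Has population", "81,800,000"),
    ("Belgium", "Has capital", "Brussels"),
    ("Belgium", "Has population", "11,000,000"),
]


def property_values(predicate: str) -> Dict[str, str]:
    """Map each subject to its value for the given predicate."""
    return {s: o for s, p, o in facts if p == predicate}


def countries_by_population(minimum: int = 0) -> List[Tuple[str, int]]:
    """List (country, population) pairs above a threshold, largest first."""
    populations = {
        country: int(value.replace(",", ""))
        for country, value in property_values("Has population").items()
    }
    selected = [(c, n) for c, n in populations.items() if n > minimum]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# A table of countries with more than 20 million inhabitants and their capitals.
capitals = property_values("Has capital")
for country, population in countries_by_population(minimum=20_000_000):
    print(f"{country}\t{capitals[country]}\t{population:,}")
```

Because the list is computed from the facts rather than typed in by hand, changing one population value on one page would change every list that depends on it.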

Automatically-generated lists are not the only possibility when you start adding semantic links. You can also display the information in various formats, you can have different language versions of a wiki using the same data, you can integrate and mash-up your wiki's data and export it for external re-use, and more.
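As a rough illustration of what exporting for external re-use can look like, the following Python sketch writes two facts as CSV and as simplified N-Triples-style lines. It does not reproduce SMW's actual export: the `example.org` base URI, the `to_csv`, `to_ntriples`, and `uri` helpers, and the flag marking whether an object is a page are all assumptions, and literal escaping is left out.

```python
import csv
import io

# Hypothetical base URI; a real wiki would use its own address.
BASE = "http://example.org/wiki/"

# (subject, predicate, object, object_is_page)
facts = [
    ("France", "Has capital", "Paris", True),         # object is another page
    ("France", "Has population", "65821885", False),  # object is a literal
]


def to_csv(rows) -> str:
    """Serialize the facts as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    for subject, predicate, obj, _ in rows:
        writer.writerow([subject, predicate, obj])
    return buffer.getvalue()


def uri(name: str) -> str:
    """Turn a page or property name into a bracketed URI under BASE."""
    return "<" + BASE + name.replace(" ", "_") + ">"


def to_ntriples(rows) -> str:
    """Serialize the facts as one 'subject predicate object .' line each."""
    lines = []
    for subject, predicate, obj, is_page in rows:
        target = uri(obj) if is_page else '"' + obj + '"'
        lines.append(f"{uri(subject)} {uri(predicate)} {target} .")
    return "\n".join(lines) + "\n"


print(to_csv(facts))
print(to_ntriples(facts))
```

The same facts feed both formats, which is the point of keeping the data structured: the output format becomes a presentation choice rather than something authors maintain by hand.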

Ecosystem

The development of Semantic MediaWiki was initially funded by the EU project SEKT (Semantically Enabled Knowledge Technologies), and after this supported in part by the University of Karlsruhe in Germany. The first release was version 0.1 in 2005. In 2007, Vulcan started sponsoring the German company Ontoprise to develop a commercial version of the extension, Semantic MediaWiki+ (SMW+).

According to Greaves, there are 50 open source MediaWiki extensions that use the semantic information provided by SMW. For example, there's Halo, funded by Vulcan, which facilitates creation, retrieval, navigation, and organization of semantic data with some intuitive graphical user interfaces; Semantic Drilldown, which provides a faceted browser interface for viewing semantic data by filtering; and Semantic Result Formats, which provides a large number of display formats, including maps, calendars, graphs, and charts.

If you want to install SMW on your own wiki, there's an extensive administrator manual with installation instructions and a list of the configuration options. For users who will be entering the semantic markup, the project also has a user manual.

Some semantic wikis

Semantic MediaWiki is already used by over 300 public active wikis around the world. Greaves called these semantic wiki applications "the icing on the cake", as they really show the flexibility of adding semantics to a wiki. Some notable examples are Open Energy Information, a crowdsourced wiki with information about energy resources, including real-time data and visualizations; SKYbrary, a wiki created by several European aviation organizations to create a comprehensive source of aviation safety information; Familypedia, a wiki on family history and genealogy; SNPedia, a wiki investigating human genetics; Oh Internet, a wiki to track internet memes; and Ultrapedia, a search engine for OCR'd books.

Many organizations also use SMW internally, including Pfizer, Johnson & Johnson Pharmaceutical Research and Development, and the U.S. Department of Defense. Greaves added that Vulcan is eating its own dog food:

We have created a lightweight project management tool with Semantic MediaWiki+. Our developers even change the ontology of this Scrum wiki [Scrum is an iterative, incremental framework for project management] once a month, which proves that the added flexibility of a schema-last database is welcome.

Towards a semantic Wikipedia

Some academics have already proposed using SMW on Wikipedia to tackle the problem of the many lists that have to be created manually, but according to Wikimedia Foundation Deputy Director Erik Möller it's still unclear whether SMW is up to the task of supporting a web site on the scale of Wikipedia. So while Semantic MediaWiki already powers a lot of web sites and is quite user-friendly, it remains to be seen whether it will eventually bring semantics to the ultimate wiki, Wikipedia.

The SMW project has a fairly detailed roadmap. Some of the interesting tasks are an improvement of the usability of the semantic search features (part of Google Summer of Code 2011), a light version of SMW without query capabilities, improvements for the Semantic Drilldown extension, and so on. It's already quite usable, as many of the active SMW wikis show, but to really reach the vision of the semantic web and be able to link various semantic wikis and other content management systems, Semantic MediaWiki needs to become as easy to use as Wikipedia.



Semantic MediaWiki: Toward smarter wikis

Posted Jul 14, 2011 3:00 UTC (Thu) by Baylink (subscriber, #755) [Link]

> ... the capital city is [[Has capital::Paris]] ...
> ... its population is [[Has population::65,821,885]] ...

And that's a perfect example that not only is data modeling hard, so is syntax.

Either Paris is *not* the name of a page, and the syntax of the first example is wrong, or 65,821,885 *is* the name of a page... or the second example is wrong.

In short, you can't *just* do this as a marker on a wikilink, or else you have no way to mark inline non-link text that is suddenly also data.

So, you either need *two different ways* to tag semantic data (as a marker inside a standard wikilink *and* as some other syntax on unlinked text -- and believe me, mediawikitext syntax is messy enough right now)...

*or*, you need two *layers* of marker, the traditional way to mark wikilinks and a separate way to mark semantic text... which is *equally* confusing to editors in the first example -- and do you put the semantic marks inside or outside the wikilink marks? (The problem is even more difficult than it was for me to punctuate that paragraph readably. ;-)

(And note that I'm not even mentioning that we're out of paired markers here...)

This is a great idea... but coming up with an end-user implementation that doesn't cause everyone to tear out their hair is going to be non-trivial in the extreme. (For those who haven't read the Jargon file recently, that translates as "very *very* close to theoretically impossible" :-}).

Semantic MediaWiki: Toward smarter wikis

Posted Jul 14, 2011 5:04 UTC (Thu) by BrucePerens (guest, #2510) [Link]

The article calls them "typed links".

Semantic MediaWiki: Toward smarter wikis

Posted Jul 14, 2011 10:00 UTC (Thu) by skierpage (guest, #70911) [Link]

Amazingly, they thought of that. At one point there was different syntax for annotating links and values, but you've always been able to specify the type of a property using the special property [[Has type::Number]] (for population, other types include Date, String, URL, Geographic coordinate). See http://semantic-mediawiki.org/wiki/Help:Properties_and_ty...

SMW annotations make statements about wiki pages, and much of its workings, including user properties like [[Property:Has capital]], live in wiki pages. It's oddly powerful.

DBpedia

Posted Jul 14, 2011 10:11 UTC (Thu) by skierpage (guest, #70911) [Link]

" it remains to be seen whether it will eventually bring semantics to the ultimate wiki, Wikipedia."
That would be nice, meanwhile Wikipedia's heavy use of templates means there is already a ton of structured data in articles, thus:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.

Semantic Wikipedia

Posted Jul 14, 2011 22:30 UTC (Thu) by Simetrical (guest, #53439) [Link]

FWIW, the problem with deploying SMW on Wikimedia sites like Wikipedia has always been that it's a big codebase (tens of thousands of lines), which shares few to no active developers with MediaWiki proper, and which has never had a thorough review by core MediaWiki developers for security or performance. It would need a great deal of resources to review, and it's certain that large parts would have to be rewritten or disabled to meet Wikipedia's performance requirements and MediaWiki coding standards. From the perspective of the people making decisions on this sort of thing for Wikimedia, it would probably be less effort to rewrite from scratch.

If Erik Moeller said it was "unclear" whether SMW is up to the task of running on Wikipedia, he was either being polite or didn't ask core developers who have looked at it. It's not. I don't say this to be negative -- it's an awesome project, and its functionality is absolutely make-or-break for countless small to medium MediaWiki installs. But it's not possible for a project of this scale to be usable on a site as large as Wikipedia unless it was written that way to begin with, and (like almost all software) it wasn't. Even if it were, the kind of general-purpose data-munging that makes some SMW-related extensions so valuable is just not possible to do on very large datasets.

So I can pretty confidently say that awesome though it may be, SMW is not going to be enabled on Wikipedia at any time in the foreseeable future.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds