Every now and then, one finds oneself in a place where the near-ubiquitous Internet connectivity of today is absent, unusably slow, or prohibitively expensive. Some network functions (like email) may be worth the hassle and expense, while others (like streaming media) are not. Somewhere in between, though, lies reference data, which would be nice to cache locally for offline access, if it were technically feasible. To that end, some "open content" projects, such as OpenStreetMap, make configuring offline access relatively painless, but many others do not. For Wikipedia and the related Wikimedia projects (Wiktionary, Wikivoyage, etc.), the combination of an exceptionally large data set, constant editing, and multiple languages makes for a more challenging target, and a niche has developed for offline Wikipedia access software.
Of course, the "correct" solution to providing offline Wikipedia access would arguably be to run a mirror of the real site, which it is certainly possible to do. But, even then, mirrors start with a hefty Wikipedia database dump that requires considerable storage space: around 44GB for the basic text of the English Wikipedia site, without the "talk" or individual user pages. The media content is larger still; around 40TB are currently in Wikimedia's Commons, of which roughly 37TB is still images. Moreover, the database-import method does not allow a mirror to keep up with ongoing edits, although doing so would consume considerable system resources anyway.
On the other hand, in many cases, Wikipedia's usefulness as a general-purpose reference does not depend on having the absolute newest version of each article. Wikimedia makes periodic database dumps, which can suffice for weeks or even months at a time, depending on the subject. It is probably no surprise, then, that the most popular offline-Wikipedia tools focus on turning these periodic database releases into an approximation of the live site. Many also take a number of steps to conserve space, usually by storing a compressed version of the data, but in some cases by omitting major sections of the content as well. There are two actively developed open-source tools for desktop Linux systems at present: XOWA and Kiwix. Both support storing compressed, searchable archives of multiple Wikimedia sites, although they differ on quite a few of the details.
Kiwix uses the openZIM file format for its content storage. The Wikipedia database dump is converted into static HTML beforehand, then compressed into the ZIM format. The basic ZIM format includes a metadata index that supports searching article titles, but to enable full-text search, the file must be indexed. The Kiwix project offers both indexed and unindexed archives for download; the indexed files are (naturally) larger, and they also come bundled with the Windows build of Kiwix. The ZIM format is designed with this usage in mind; its development is spearheaded by Switzerland's Wikimedia CH.
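For readers who want to verify that a downloaded file really is a ZIM archive before handing it to Kiwix, a minimal sketch follows. It assumes only that a ZIM file begins with the 32-bit little-endian magic number 72173914 that the openZIM format description gives; the function names are hypothetical and are not part of any Kiwix or openZIM API.

```python
import struct

# Magic number assumed from the openZIM format description:
# 72173914 == 0x044D495A, stored little-endian as the bytes b"ZIM\x04".
ZIM_MAGIC = 72173914


def is_zim_header(data):
    """Return True if `data` (bytes) begins with the ZIM magic number."""
    if len(data) < 4:
        return False
    (magic,) = struct.unpack("<I", data[:4])
    return magic == ZIM_MAGIC


def looks_like_zim(path):
    """Read the first four bytes of the file at `path` and check the magic."""
    with open(path, "rb") as f:
        return is_zim_header(f.read(4))
```

This checks only the first four bytes; it says nothing about whether the rest of the archive is intact, which is what Kiwix's own integrity checker is for.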
As far as content availability is concerned, the Kiwix project periodically updates its official ZIM releases for Wikipedia only—albeit in multiple languages (69 at present, not counting image-free variants available for a handful of the larger editions). In addition, volunteers produce ZIM files for other sites, at the moment including Wikivoyage, Wikiquote, Wiktionary, and Project Gutenberg, with TED and other efforts still in the works.
Kiwix itself is a GPLv3-licensed, standalone graphical application that most closely resembles a "help browser" or e-book reader. The content displayed is HTML, of course, but the user interface is limited to the content installed in the local "library." Users can search for new ZIM content from within the application as well as check for updates to the installed files.
Interestingly enough, there are many more ZIM archives listed within Kiwix's available-files browser than there are listed on the project's web site; why any particular offering is listed in the application is not clear, since some of the options appear to be personal vanity-publishing works. Searching and browsing installed archives is simple and fast; type-ahead search suggestions are available and one can bookmark individual pages. There are also built-in tools for checking the integrity of downloaded archives and exporting pages to PDF.
In broad strokes, XOWA offers much the same experience as Kiwix: one installs a browser-like standalone application (AGPL-licensed, in this case), for which individual offline-site archives must be manually installed. Like Kiwix, XOWA can download and install content from its own, official archives. But while Kiwix archives contain indexed, pre-generated HTML, XOWA archives include XML from the original database dumps (stored in SQLite files), which is then dynamically rendered into HTML whenever a new page is opened.
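The difference between pre-generated HTML and render-on-open can be illustrated with a toy example. The sketch below is not XOWA's schema or parser: the `page` table, its columns, and the two-rule `render()` function are all hypothetical, chosen only to show markup being pulled from SQLite and converted to HTML at the moment a page is requested.

```python
import html
import re
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical one-table layout."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page (title TEXT PRIMARY KEY, wikitext TEXT)")
    db.execute(
        "INSERT INTO page VALUES (?, ?)",
        ("Kiwix", "'''Kiwix''' reads [[ZIM]] files."),
    )
    db.commit()
    return db


def render(wikitext):
    """Convert a tiny subset of wiki markup (bold, plain links) to HTML."""
    out = html.escape(wikitext, quote=False)
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"\[\[([^\]|]+)\]\]", r'<a href="\1">\1</a>', out)
    return "<p>" + out + "</p>"


def open_page(db, title):
    """Fetch the stored markup for `title` and render it on demand."""
    row = db.execute(
        "SELECT wikitext FROM page WHERE title = ?", (title,)
    ).fetchone()
    return None if row is None else render(row[0])
```

Real wiki markup includes templates, tables, and parser functions, so a real renderer does far more work per page than this one, which is the cost a static-HTML archive pays once at build time instead.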
In theory, the XML in the Wikipedia database dumps is the original Wiki markup of the articles, so it should be more compact than the equivalent rendered HTML. In practice, though, such a comparison is less simple. The latest Kiwix ZIM file for the English Wikipedia is 42GB with images, 12GB without, whereas the latest XOWA releases are 89.6GB with images and 14.6GB without. But XOWA also makes a point of the fact that it includes not only the basic articles, but also the "Category," "Portal," and "Help" namespaces, as well as multiple sizes of the included images.
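A few lines of arithmetic on the figures quoted above show where the gap comes from. The sketch uses only the four sizes given in the text; the dictionary layout and function names are illustrative.

```python
# Archive sizes in GB for the English Wikipedia, as quoted in the text.
sizes_gb = {
    "Kiwix": {"with_images": 42.0, "text_only": 12.0},
    "XOWA": {"with_images": 89.6, "text_only": 14.6},
}


def xowa_to_kiwix_ratio(kind):
    """How many times larger the XOWA archive is than the Kiwix one."""
    return sizes_gb["XOWA"][kind] / sizes_gb["Kiwix"][kind]


def image_overhead_gb(tool):
    """Space attributable to images: full archive minus text-only archive."""
    s = sizes_gb[tool]
    return s["with_images"] - s["text_only"]


for kind in ("text_only", "with_images"):
    print(f"{kind}: XOWA is {xowa_to_kiwix_ratio(kind):.2f}x the size of Kiwix")
for tool in sizes_gb:
    print(f"{tool}: about {image_overhead_gb(tool):.1f} GB of images")
```

The text-only archives differ by roughly a fifth, while the image-bearing ones differ by more than a factor of two, which is consistent with XOWA's multiple image sizes accounting for most of the difference.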
When comparing the two approaches, it is also important to note that XOWA is specifically designed for use with Wikimedia database dumps, a choice that has both pros and cons. In the pro column, virtually any compatible database dump can be used with the application; XOWA offers Wikipedia for 30 languages and a much larger selection of the related sites (Wiktionary, Wikivoyage, Wikiquote, Wikisource, Wikibooks, Wikiversity, and Wikinews, which are bundled together for most languages). XOWA's releases also tend to be more up-to-date; at present none is older than a few months, while some of the less-popular Kiwix archives are several years old.
The downsides, though, start with the fact that only Wikimedia-compatible content is supported. Thus, there is no Project Gutenberg archive available, nor could your favorite Linux news site generate a handy offline article archive should it feel compelled to do so. But perhaps more troubling is the fact that XOWA archives do not support full-text searching. Lookup by title is supported, but that may not always be sufficient for research.
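The practical gap between title lookup and full-text search is easy to demonstrate. The following sketch uses a hypothetical three-article collection and a simple inverted index; it is not how either Kiwix or XOWA implements searching, only an illustration of what each kind of query can and cannot find.

```python
import re
from collections import defaultdict

# Hypothetical sample articles: title -> body text.
ARTICLES = {
    "Kiwix": "Kiwix is an offline reader that stores content in ZIM files.",
    "XOWA": "XOWA renders wiki markup from database dumps stored in SQLite.",
    "SQLite": "SQLite is an embedded relational database engine.",
}


def title_lookup(prefix):
    """Return titles starting with `prefix` (case-insensitive)."""
    p = prefix.lower()
    return sorted(t for t in ARTICLES if t.lower().startswith(p))


def build_index(articles):
    """Map every word in every article body to the titles containing it."""
    index = defaultdict(set)
    for title, body in articles.items():
        for word in re.findall(r"[a-z0-9]+", body.lower()):
            index[word].add(title)
    return index


def full_text_search(index, word):
    """Return titles of articles whose body contains `word`."""
    return sorted(index.get(word.lower(), set()))


index = build_index(ARTICLES)
print(title_lookup("zim"))             # no article is titled "ZIM..."
print(full_text_search(index, "zim"))  # but one article mentions it
```

A title-only search for "zim" finds nothing here, while the full-text index leads to the Kiwix article; the price is the extra index, which is why the indexed archives are larger.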
The browsing experience of the XOWA application is similar to Kiwix; both HTML renderers use Mozilla's XULRunner. XOWA also supports bookmarking pages and library maintenance. XOWA gains a point for allowing the user to seamlessly jump between installed wikis; a Wikipedia link to a Wiktionary page works automatically in XOWA, while a Kiwix user must return to the "library" screen and manually open up a second archive in order to change sites.
On the other hand, XOWA does not support printing or PDF export, and there is a noticeable lag between clicking on a link and seeing the page load. The status bar at the bottom of the window is informative enough to indicate that the delay is due to XOWA's JTidy-based parser; it reports the loading of the page content as well as each template and navigation element used. The parser can also still trip up in its XML-to-HTML conversion. If one is concerned about the accuracy of the conversion, of course, Kiwix's pre-generated HTML offers no guarantees either, but at least its results are static and will not crash on an odd bit of Wiki-markup syntax.
Ultimately, though, if the question is whether XOWA or Kiwix generates pages more like those one sees in the web browser from the live Wikimedia site, neither standalone application is perfect. But users may chafe at the very need to run a separate application to read Wikipedia to begin with. Fortunately, both projects are also pursuing another option: serving up their content with an embedded web server, which permits users to access the offline archives from any browser they choose.
XOWA's server can be started with:
java -jar /xowa/xowa_linux.jar --app_mode http_server --http_server_port 8080
Kiwix's server (which, like Kiwix, is written in C++) can be started from the command line with:
kiwix-serve --port=8000 wikipedia.zim
or launched from the application's "Tools" menu. A nice touch for those experimenting with both is that Kiwix defaults to TCP port 8000, XOWA to port 8080. The XOWA project also offers a Firefox extension that directs xowa: URIs to the local XOWA web server process.
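When experimenting with both servers at once, it can be handy to check which of them is actually up. The sketch below simply attempts a TCP connection to the two ports mentioned above on the local machine; the function names and the labels in `DEFAULT_PORTS` are illustrative, and a successful connection only shows that something is listening on the port, not that it is Kiwix or XOWA.

```python
import socket

# Ports taken from the commands above; labels are for display only.
DEFAULT_PORTS = {"Kiwix (kiwix-serve)": 8000, "XOWA (http_server)": 8080}


def is_listening(host, port, timeout=0.5):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host="127.0.0.1", ports=DEFAULT_PORTS):
    """Check each named port and return {name: True/False}."""
    return {name: is_listening(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in probe().items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```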
Moving forward, it will be interesting to watch how both projects are affected by changes to Wikimedia's infrastructure. The XOWA internal documentation notes that Wikipedia is, at some point, planning to implement diff-style database update releases in addition to its full-database dumps. Incremental updates are one of the factors that make OpenStreetMap so usable in offline mode, and Wikipedia's lack of such updates is the biggest source of pain in Kiwix and XOWA usage: waiting for those multi-gigabyte downloads to finish.
As unsatisfying as it may seem, neither application emerges as the clear winner for someone inspired to head off to a rustic cabin in the mountains and read Wikipedia at length. At its most basic, the trade-off would seem to be Kiwix's support for non-Wikimedia sites and its full-text search versus XOWA's cross-wiki link support and more predictable update process. Either will likely serve the casual user well.
Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license