|Please consider subscribing to LWN|
Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.
The Economist is concerned that our "digital heritage" may be lost because the formats (or media) may be unreadable in, say, 20 years time. The problem is complicated by digital rights management (DRM), of course, and the magazine is spot on with suggestions that circumventing those restrictions is needed to protect that heritage. But in calls for more regulation (not a usual Economist stance) the magazine misses one of the most important ways that digital formats can be future-proofed: free and open data standards.
DRM is certainly a problem, but a bigger problem may well be the formats that much of digital data is stored in. The vast majority of that data is not stored in DRM-encumbered formats, it is, instead, stored in "secret" data formats. Proprietary software vendors are rather fond of creating their own formats, updating them with some frequency, and allowing older versions to (surprise!) become unsupported. If users of those formats are not paying attention, documents and other data from just a few years ago can sometimes become unreadable.
There are few advantages to users from closed formats, but there are several for the vendors involved, of course. Lock-in and the income stream from what become "forced" upgrades are two of the biggest reasons that vendors continue with their "secret sauce" formats. But it is rather surprising that users, businesses and governments in particular, haven't rebelled. How did we get to a point where we will pay for the "privilege" of having a vendor take our data and lock it up such that we have to pay them, again and again, to access it?
There is a cost associated with documenting a data format, so the proprietary vendors would undoubtedly cite that as leading to higher purchase prices. But that's largely disingenuous. In many cases, there are existing formats (e.g. ODF, PNG, SVG, HTML, EPUB, ...) that could be used, or new ones that could be developed. The easiest way to "document" a format is to release code—not binaries—that can read it, but that defeats much of the purpose for using the proprietary formats in the first place so it's not something that most vendors are willing to do.
Obviously, free software fits the bill nicely here. Not only is code available to read the format, but the code that writes the format is there as well. While documentation that specifies all of the different values, flags, corner cases, and so on, would be welcome, being able to look at the code that actually does the work will ensure that data saved in that format can be read for years (centuries?) to come. As long as the bits that make up the data can be retrieved from the storage medium and that quantum computers running Ubuntu 37.04 ("Magnificent Mastodon") can still be programmed, the data will still be accessible. There may even be a few C/C++ programmers still around who can be lured out of retirement to help—if they aren't all busy solving the 2038 problem, anyway.
More seriously, though, maintaining access to digital data will require some attention. Storage device technology continues to evolve, and there are limits on the lifetime of the media itself. CDs, DVDs, hard drives, tapes, flash, and so on all will need refreshing from time to time. Moving archives from one medium to another is costly enough, why add potentially lossy format conversions and the cost of upgrading software to read the data—if said software is even still available.
Proprietary vendors come and go; their formats right along with them. Trying to read a Microsoft Word document from 20 years ago is likely to be an exercise in frustration, but trying to read a Windows 3.0 WordStar document will be far worse. There are ways to do so, of course, but they are painful—if one can even track down a 3.5" floppy drive (not to mention 5.25"). If the original software is still available somewhere (e.g. Ebay, backup floppies, ...) then it may be possible to use emulators to run the original program, but that still may not help with getting the data into a supported format.
Amusingly, free software often supports older formats far longer than the vendors do. While the results are often imperfect, reverse engineering proprietary data formats is a time-honored tradition in our communities. Once that's been done, there's little reason not to keep supporting the old format. That's not to say that older formats don't fall off the list at times, but the code is still out there for those who need it.
As internet services come and go, there will also be issues with preserving data from those sources. Much of it is stored in free software databases, though that may make little difference if there is no access to the raw data. In addition, the database schema and how it relates articles, comments, status updates, wall postings, and so on, is probably not available either. If some day Facebook, Google+, Twitter, Picasa, or any of the other proprietary services goes away—perhaps with little or no warning—that data may well be lost to the ages too. Some might argue that the majority of it should be lost, but some of it certainly qualifies as part of our digital heritage.
Beyond the social networks and their ilk, there are a huge number of news and information sites with relevant data locked away on their servers. Data from things like the New York Times (or Wall Street Journal), Boing Boing and other blogs, the article from The Economist linked above, the articles and comments here at LWN, and thousands (perhaps millions) more, are all things that one might like to preserve. The Internet Archive can only do so much.
Solutions for data from internet sites are tricky, since the data is closely held by the services and there are serious privacy considerations for some of it. But some way to archive some of that data is needed. By the time the service or site itself is on the ropes, it may well be too late.
Users should think long and hard before they lock up their long-term data in closed formats. While yesterday's email may not be all that important (maybe), that unfinished novel, last will and testament, or financial records from the 80s may well be. Beyond that, shareholders and taxpayers should be pressuring businesses and governments to store their documents in open formats. In the best case scenario, it will just cost more money to deal with old, closed-format data; in the worst case, after enough time passes, there may be no economically plausible way to retrieve it. That is something worth avoiding.
Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds