By Jake Edge
May 9, 2012
The Economist is concerned that our
"digital heritage" may be lost because the formats (or media) may be
unreadable in, say, 20 years' time. The problem is complicated by digital
rights management (DRM), of course, and the magazine is spot on with
suggestions that circumventing those restrictions is needed to protect that
heritage. But in calling for more regulation (not a usual Economist
stance) the magazine misses one of the most important ways that digital
formats can be future-proofed: free and open data standards.
DRM is certainly a problem, but a bigger problem may well be the formats
that much digital data is stored in. The vast majority of that data
is not stored in DRM-encumbered formats; it is, instead, stored in "secret" data
formats.
Proprietary software vendors are
rather fond of creating their own formats, updating them with some
frequency, and allowing older versions to (surprise!) become unsupported.
If users of those formats are not paying attention, documents and other
data from just a few years ago can sometimes become unreadable.
There are few advantages to users from closed formats, but there are
several for the vendors involved, of course. Lock-in and the income stream
from what become "forced" upgrades are two of the biggest reasons that
vendors continue with their "secret sauce" formats. But it is rather
surprising that users, businesses and governments in particular, haven't
rebelled. How did we get to a point where we will pay for the "privilege"
of having a vendor take our data and lock it up such that we have to pay
them, again and again, to access it?
There is a cost associated with documenting a data format, so the
proprietary vendors would undoubtedly cite that as leading to higher
purchase prices. But that's largely disingenuous. In many cases, there
are existing formats (e.g. ODF, PNG, SVG, HTML, EPUB, ...) that could be
used, or new ones that
could be developed. The easiest way to "document" a format is to release
code—not binaries—that can read it, but that defeats
much of the purpose for using the
proprietary formats in the first place so it's not something that most
vendors are willing to do.
Obviously, free software fits the bill nicely here. Not only is code
available to read the format, but the code that writes the format is there
as well. While documentation that specifies all of the different values,
flags, corner cases, and so on, would be welcome, being able to look at the
code that actually does the work will ensure that data saved in that format
can be read for years (centuries?) to come. As long as the bits that make
up the data can be retrieved from the storage medium and quantum
computers running Ubuntu 37.04 ("Magnificent Mastodon") can still be
programmed, the data will still be accessible. There may even be a few
C/C++ programmers still around who can be lured out of retirement to help—if they aren't all busy solving the 2038 problem, anyway.
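To make the point about readable code concrete, here is a minimal sketch in Python of a reader for the first few bytes of a PNG file, one of the open formats mentioned above. It relies only on the published layout of the PNG signature and the IHDR chunk. The function names (read_png_header, build_png_header) are hypothetical and chosen for illustration; this is not code from any existing library, and a real decoder does a great deal more.

```python
import struct
import zlib

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Parse the signature and IHDR chunk from the start of a PNG byte string.

    Hypothetical helper for illustration only.
    """
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file: bad signature")
    if len(data) < 33:  # 8 signature + 8 chunk header + 13 data + 4 CRC
        raise ValueError("truncated PNG header")

    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")

    body = data[16:29]
    (stored_crc,) = struct.unpack(">I", data[29:33])
    # The CRC covers the chunk type and the chunk data, not the length field.
    if zlib.crc32(chunk_type + body) != stored_crc:
        raise ValueError("IHDR checksum mismatch")

    fields = struct.unpack(">IIBBBBB", body)
    names = ("width", "height", "bit_depth", "color_type",
             "compression", "filter_method", "interlace")
    return dict(zip(names, fields))


def build_png_header(width: int, height: int) -> bytes:
    """Build a signature plus IHDR chunk (8-bit RGB) to exercise the reader."""
    body = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + body)
    return (PNG_SIGNATURE + struct.pack(">I4s", 13, b"IHDR")
            + body + struct.pack(">I", crc))


if __name__ == "__main__":
    print(read_png_header(build_png_header(640, 480)))
```

Nothing here depends on a vendor's goodwill: the byte layout is public, so a reader like this could be rewritten in whatever language is in use decades from now.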
More seriously, though, maintaining access to digital data will require
some attention. Storage device technology continues to evolve, and there
are limits on the lifetime of the media itself. CDs, DVDs, hard drives,
tapes, flash, and so on all will need refreshing from time to time. Moving
archives from one medium to another is costly enough; why add potentially
lossy format
conversions and the cost of upgrading software to read the data (if
said software is even still available)?
Proprietary vendors come and go; their formats right along with them.
Trying to read a Microsoft Word document from 20 years ago is likely to be
an exercise in frustration, but trying to read a Windows 3.0 WordStar
document will be far worse. There are ways to do so, of course, but they
are painful—if one can even track down a 3.5" floppy drive (not to
mention 5.25"). If the original software is still available somewhere
(e.g. eBay, backup floppies, ...) then it may be possible to use emulators
to run the original program, but that still may not help with getting the
data into a supported format.
Amusingly, free software often supports older formats far longer than the
vendors do. While the results are often imperfect, reverse engineering
proprietary data formats is a time-honored tradition in our communities.
Once that's been done, there's little reason not to keep supporting the old
format. That's not to say that older formats don't fall off the list at
times, but the code is still out there for those who need it.
As internet services come and go, there will also be issues with preserving
data from those sources. Much of it is stored in free software
databases, though that may make little difference if there is no access to
the raw data. In addition, the database schema and how it relates articles,
comments, status updates, wall postings, and so on, is probably not
available either.
If some day Facebook, Google+, Twitter, Picasa, or any of the other
proprietary services goes away—perhaps with little or no
warning—that data may well be lost to the ages too. Some might argue
that the majority of it should be lost, but some of it certainly
qualifies
as part of our digital heritage.
Beyond the social networks and their ilk, there are a huge number of news
and information sites with relevant data locked away on their servers.
Data from things like the New York Times (or Wall Street Journal),
Boing Boing and other blogs, the article from The Economist linked above, the
articles and comments here at LWN, and thousands (perhaps millions) more,
are all things that one might like to preserve. The Internet Archive can only do so much.
Solutions for data from internet sites are tricky, since the data is
closely
held by the services and there are serious privacy considerations for some
of it. But some way to archive some of that data is needed. By the time
the service or site itself is on the ropes, it may well be too late.
Users should think long and hard before they lock up their long-term data
in closed formats. While yesterday's email may not be all that important
(maybe), that unfinished novel, last will and testament, or financial
records from the 80s may well be. Beyond that, shareholders and taxpayers
should be pressuring businesses and governments to store their documents in
open formats. In the best-case scenario, it will just cost more money to
deal with old, closed-format data; in the worst case, after enough time
passes, there may be no economically plausible way to retrieve it. That is
something worth avoiding.