
That Which Survives (TuxDeluxe)

Jeremy Allison writes about the impermanence of proprietary data formats. "I think proprietary record formats will present a problem for historians. Perhaps not in the short-term, but certainly in the medium to long term (and remember I'm talking about hundreds if not thousands of years now). Imagine that some historian in 500 years time discovers Vice President Cheney's "undisclosed location" and finds his secret laptop computer. "Finally," the historian thinks, "we will know who advised this administration about energy policy!" as he swims back to the surface of the ocean above the Washington monument. Unfortunately it turns out the data was written in the "Word-mangler for Windows 2002" format, for which no specifications were ever published, and which was deliberately designed to be difficult for the competition to read."

That Which Survives (TuxDeluxe)

Posted Jul 3, 2007 16:13 UTC (Tue) by pcampe (guest, #28223) [Link]

500-1000 years from now, they will have so much computational power and so many advances in algorithm theory that they will read the supposedly intact laptop within seconds. The real showstopper is that the VP's laptop will be unreadable, if not physically broken apart.

That Which Survives (TuxDeluxe)

Posted Jul 3, 2007 16:26 UTC (Tue) by richo123 (guest, #24309) [Link]

Nah, the real showstopper is that Dead-eye Dick does not own a laptop.

That Which Survives (TuxDeluxe)

Posted Jul 3, 2007 16:43 UTC (Tue) by JoeBuck (subscriber, #2330) [Link]

You're assuming that progress marches uniformly in one direction. Consider the situation in Europe 500 years after the fall of the Western Roman Empire; Europe had actually marched backward technologically.

Also, there are still ancient written languages that we cannot decipher, for example the Indus script, from the Indus Valley civilization of Pakistan/India around 2500 BC.

But decoding Windows format is the least of an archaeologist's problems. Given a physical medium, you've got to be able to figure out how to get the bits off, and even if they haven't degraded, it's a tough job. If I give you an 8-inch floppy or a DEC RK05 disk (circa 1980), good luck reading it.

Deep time

Posted Jul 3, 2007 22:20 UTC (Tue) by tony (guest, #3654) [Link]

Check out Gregory Benford's Deep Time. He discusses these very problems, in great (and strangely interesting) detail.

How do you communicate across time, especially when we're not writing on stone, or even relatively-permanent paper?

That Which Survives (TuxDeluxe)

Posted Jul 4, 2007 2:25 UTC (Wed) by vmlinuz (guest, #24) [Link]

Sickening though it is to see the focus on Microsoft in this article, the point about virtualisation is interesting - obviously it's a lot less efficient than just using open formats and platforms now, but for the 30, 40, 50+ years of existing nasty proprietary-format data, it may be at least a somewhat-reliable solution. Particularly if you can nest your virtualisation/emulation...

Not a word about DRM?

Posted Jul 8, 2007 10:30 UTC (Sun) by khim (subscriber, #9252) [Link]

>it may be at least a somewhat-reliable solution

It may be a somewhat-reliable solution for data created 10 or 20 years ago, but not for data created today! Why? DRM. If the DRM works as intended, then virtualisation will not work and we enter a digital dark age pretty fast. If the DRM does not work and virtualisation can circumvent it, then what's the point?

It's kind of strange to see any article where the "digital dark age" is discussed and DRM is not even mentioned. Where proprietary formats are problematic for archivists, DRM is a total disaster: the problems with proprietary formats stem from ignorance, while the same problems in DRM stem from deliberate design!

That Which Survives (TuxDeluxe)

Posted Jul 4, 2007 8:47 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

As Jeremy probably knows, the real product to which he's given the pseudonym "Word-mangler for Windows" actually stores text as, well, text*. So if there are dozens of documents on a laptop that contain the string "Harold Francis", you don't need the source code of Word-mangler to realise that Harold Francis is the man you're looking for.

For the bulk of documents actually sitting in storage that might one day be interesting to historians much the same is true. It will be slightly inconvenient to extract the contents, they may not be able to replicate the exact intended appearance on screen, but the text survives.
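
A minimal sketch of that kind of extraction, in Python, assuming only that the text sits in the file either as plain 8-bit ASCII or as UTF-16LE (each ASCII byte followed by a zero byte). The function name `extract_text_runs` and the `min_len` threshold are made up for illustration; this is a crude `strings`-style scan, not a parser for any real word-processor format.

```python
import re

def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII found as 8-bit text or as UTF-16LE text."""
    runs = []
    # 8-bit text: min_len or more printable ASCII bytes in a row.
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        runs.append(match.group().decode("ascii"))
    # UTF-16LE text: a printable ASCII byte followed by a zero byte, repeated.
    for match in re.finditer(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data):
        runs.append(match.group().decode("utf-16-le"))
    return runs

# A fake "document": binary junk wrapped around text in both encodings.
blob = (
    b"\x00\x01\xd0\xcf"
    + b"Harold Francis"
    + b"\xff\xfe\x00"
    + "energy policy".encode("utf-16-le")
    + b"\x00\x03"
)
print(extract_text_runs(blob))
```

Formatting, metadata and non-ASCII text are all lost by a scan like this, which is the "slightly inconvenient" part.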

Similarly for images: on the surface, claims that some format is opaque and proprietary may seem justified. But to someone who (like these hypothetical future data historians) spends all day up to their neck in a hex editor, such formats are nearly always completely transparent: a few bytes of header on uncompressed raster data, or a non-JFIF way to write JPEG files. You can write (and in the past I have written) software to automatically detect such things.
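
As a rough illustration of that sort of detection (not the software mentioned above), here is a Python sketch that looks for the JPEG start-of-image marker and checks whether a JFIF APP0 header follows. The function names are hypothetical, and a real tool would need to handle many more cases, such as raw raster data behind a small header.

```python
def sniff_jpeg(data: bytes) -> str:
    """Classify a blob as a JFIF JPEG, some other JPEG, or not a JPEG at offset 0."""
    # Every JPEG stream starts with SOI (FF D8) followed by another FF marker.
    if not data.startswith(b"\xff\xd8\xff"):
        return "not-jpeg"
    # JFIF adds an APP0 segment (FF E0), a 2-byte length, then "JFIF\0".
    if data[2:4] == b"\xff\xe0" and data[6:11] == b"JFIF\x00":
        return "jpeg-jfif"
    return "jpeg-non-jfif"

def find_embedded_jpegs(data: bytes) -> list[int]:
    """Return the offsets of candidate JPEG streams buried inside another container."""
    offsets = []
    pos = data.find(b"\xff\xd8\xff")
    while pos != -1:
        offsets.append(pos)
        pos = data.find(b"\xff\xd8\xff", pos + 1)
    return offsets

jfif_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01"
bare_header = b"\xff\xd8\xff\xdb\x00\x43"        # SOI followed directly by a DQT segment
wrapped = b"HDR!HDR!" + bare_header              # made-up 8-byte proprietary wrapper

print(sniff_jpeg(jfif_header), sniff_jpeg(bare_header), find_embedded_jpegs(wrapped))
```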

The problem is vastly over-stated for data. The real trouble for that historian of the 26th century is that a 500-year-old laptop is just a bunch of rust and sand. If Cheney's secret documents still exist in 2507, it will be because of a comprehensive document retention and backup system that has copied the data, uninterpreted, from one medium to the next as they wear out or become obsolete.

* Text was standardised as early as the 1960s. Archivists new to computer data like to say that "nothing" lasts in the computer industry for more than a decade or so, but ASCII is included in ISO 8859-1, which in turn is included in Unicode / ISO 10646, the same number that meant '=' in ASCII data from the 1960s represents '=' in this Unicode text I'm writing now.
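
The footnote's claim is easy to check in Python. This small sketch decodes the byte 0x3D, the code ASCII assigned to '=', under ASCII, ISO 8859-1 and UTF-8, and then does the same for all 128 seven-bit codes. The helper name `decode_everywhere` is invented for the example.

```python
EQUALS = bytes([0x3D])  # the code ASCII assigned to '='

def decode_everywhere(raw: bytes) -> dict[str, str]:
    """Decode the same bytes as ASCII, ISO 8859-1 and UTF-8."""
    return {name: raw.decode(name) for name in ("ascii", "iso-8859-1", "utf-8")}

print(decode_everywhere(EQUALS))

# The whole 7-bit range decodes identically under all three encodings.
seven_bit = bytes(range(128))
print(len(set(decode_everywhere(seven_bit).values())) == 1)
```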

That Which Survives (TuxDeluxe)

Posted Jul 4, 2007 11:52 UTC (Wed) by arcticwolf (guest, #8341) [Link]

>Text was standardised as early as the 1960s. Archivists new to computer data like to say that "nothing" lasts in the computer industry for more than a decade or so, but ASCII is included in ISO 8859-1, which in turn is included in Unicode / ISO 10646, the same number that meant '=' in ASCII data from the 1960s represents '=' in this Unicode text I'm writing now.

Funny that you should say that; I've just been reading a number of threads about EBCDIC-related problems again on perl5-porters recently...

That Which Survives (TuxDeluxe)

Posted Jul 4, 2007 15:07 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Right, but that exactly figures. Where would you expect to find people with unmaintainable mission critical Perl scripts? The exact same places that are still buying new z/OS machines from IBM for their EBCDIC data in 2007. In another couple of decades they'll hear about the 1980s desktop revolution and their tiny minds will explode.

The rational way for a scripting language like Perl to deal with EBCDIC in the 21st century is to convert to and from Unicode at the border, but because Perl's Unicode support is so hopeless already that would actually make the problem worse not better.
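
To make "convert at the border" concrete, here is a sketch in Python rather than Perl. It assumes the data uses code page 037 (one of several EBCDIC variants, available in Python as the `cp037` codec); the function name and the upper-casing step are stand-ins for whatever the script really does.

```python
def process_ebcdic_record(record: bytes, codec: str = "cp037") -> bytes:
    """Decode EBCDIC on the way in, work in Unicode, encode EBCDIC on the way out."""
    text = record.decode(codec)    # border in: EBCDIC bytes -> Unicode string
    result = text.upper()          # stand-in for the real processing, done on Unicode
    return result.encode(codec)    # border out: Unicode string -> EBCDIC bytes

incoming = "hello".encode("cp037")  # what an EBCDIC system would hand over
outgoing = process_ebcdic_record(incoming)
print(outgoing.decode("cp037"))
```

Everything between the two borders sees only Unicode strings, so none of the processing code needs to know EBCDIC exists.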

That Which Survives (TuxDeluxe)

Posted Jul 5, 2007 3:41 UTC (Thu) by jordanb (guest, #45668) [Link]

If you treat EBCDIC as a substitution cipher it becomes trivially easy to decode even if you don't have a translation table.
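
A sketch of why that works, in Python, again assuming code page 037 as the EBCDIC variant. The first function checks that the encoding is a one-to-one mapping of byte values to characters, which is all a simple substitution cipher is. The second shows the first step of a frequency attack: in ordinary English text the most common symbol is the space. Both function names are invented for the example.

```python
from collections import Counter

def is_simple_substitution(codec: str) -> bool:
    """True if every byte value maps to its own distinct character."""
    decoded = bytes(range(256)).decode(codec)
    return len(decoded) == 256 and len(set(decoded)) == 256

def most_common_byte(data: bytes) -> int:
    """Return the byte value that occurs most often in data."""
    return Counter(data).most_common(1)[0][0]

sample = "the quick brown fox jumps over the lazy dog and the cat".encode("cp037")

print(is_simple_substitution("cp037"))
# With no translation table, guess that the most frequent byte is the space.
print(hex(most_common_byte(sample)))
```

From there the usual letter-frequency and word-pattern tricks recover the rest of the table, given enough text.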

The problem with MS Word isn't the encoding, but all the other stuff wrapped around it.

But even that's not a major issue because the text can easily be extracted from a MS Word file, sans formatting.

I have a lot of trouble seeing a problem here to be honest. Document restoration has always been kind of an expensive process. A video stored in MPEG on a hard drive isn't a great deal more volatile than a film stored on fire-prone celluloid from the 1920s. They can do amazing things with that celluloid (for a price), and when recovering data from hard drives gets to be a big deal they'll be able to do amazing things there too.

That Which Survives (TuxDeluxe)

Posted Jul 7, 2007 20:17 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

>I have a lot of trouble seeing a problem here to be honest. Document restoration has always been kind of an expensive process. ...

If document restoration in 500 years is as expensive as it is today, that's still a problem. Many of the films you're talking about are gone forever, and it would be nice if they weren't. Some cost two million dollars to restore and it would be nice if it cost nothing.

Archaeology was a bad example, since it is traditionally expensive. Deciphering an ancient Word document and creating a device to interpret one would be fairly normal archaeology and would make a great Nova installment. A better example is a historian 25 years from now who can't use the document unless he can access its information within a few days and a few hundred dollars. After all, he doesn't even know if he'll find any useful information in it.

And that is a novel problem because for most of history, if you wanted to extract information from a 25 year old document, it was trivial. Today, we can read papers from two hundred years ago with little more effort than we use to read one from last week.

I'm not sure the secrecy or obfuscation of the document format is really interesting, because those will be comparatively minor stumbling blocks for historians. But the closedness of the format makes it harder for society to keep the format alive so that programs that read it are readily available to historians 25 years from now.

That Which Survives (TuxDeluxe)

Posted Jul 5, 2007 16:56 UTC (Thu) by njs (guest, #40338) [Link]

>It will be slightly inconvenient to extract the contents, they may not be able to replicate the exact intended appearance on screen, but the text survives.

Text is not necessarily the only important thing. Metadata, for instance, is very important, and may even be more important than the contents. (Imagine wading through petabyte archives of random files, unable to do simple things like "sort by creation date". Then even if you find something, how do you interpret it without context?) This is made worse because of that document retention system you mention, where by the time the historian sees the file it's been shuffled through a dozen different systems and media, all with their own chance to lose track of where the document came from.

Some of this can be handled the same way you describe, of course, and honestly .doc is more interoperable than most other formats -- thanks to the work that *we*, the FLOSS community, have done, not MS. But it is far, far, far from a trivial problem.

BBC article on Microsoft and the British Library

Posted Jul 4, 2007 19:45 UTC (Wed) by alanjwylie (subscriber, #4794) [Link]

And here's a BBC news article on the subject of old file formats, in which
"Adam Farquhar, head of e-architecture at the British Library, praised Microsoft for its adoption of more open standards"
http://news.bbc.co.uk/1/hi/technology/6265976.stm

The British Library are the people who consp^H^H^Hoperated with Microsoft
to use the Windows only "Turning the Pages" software to view a Leonardo da Vinci manuscript.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds