Who owns your data?

By Jake Edge
May 9, 2012

The Economist is concerned that our "digital heritage" may be lost because the formats (or media) may be unreadable in, say, 20 years time. The problem is complicated by digital rights management (DRM), of course, and the magazine is spot on with suggestions that circumventing those restrictions is needed to protect that heritage. But in calls for more regulation (not a usual Economist stance) the magazine misses one of the most important ways that digital formats can be future-proofed: free and open data standards.

DRM is certainly a problem, but a bigger problem may well be the formats that much of digital data is stored in. The vast majority of that data is not stored in DRM-encumbered formats, it is, instead, stored in "secret" data formats. Proprietary software vendors are rather fond of creating their own formats, updating them with some frequency, and allowing older versions to (surprise!) become unsupported. If users of those formats are not paying attention, documents and other data from just a few years ago can sometimes become unreadable.

There are few advantages to users from closed formats, but there are several for the vendors involved, of course. Lock-in and the income stream from what become "forced" upgrades are two of the biggest reasons that vendors continue with their "secret sauce" formats. But it is rather surprising that users, businesses and governments in particular, haven't rebelled. How did we get to a point where we will pay for the "privilege" of having a vendor take our data and lock it up such that we have to pay them, again and again, to access it?

There is a cost associated with documenting a data format, so the proprietary vendors would undoubtedly cite that as leading to higher purchase prices. But that's largely disingenuous. In many cases, there are existing formats (e.g. ODF, PNG, SVG, HTML, EPUB, ...) that could be used, or new ones that could be developed. The easiest way to "document" a format is to release code—not binaries—that can read it, but that defeats much of the purpose for using the proprietary formats in the first place so it's not something that most vendors are willing to do.

Obviously, free software fits the bill nicely here. Not only is code available to read the format, but the code that writes the format is there as well. While documentation that specifies all of the different values, flags, corner cases, and so on, would be welcome, being able to look at the code that actually does the work will ensure that data saved in that format can be read for years (centuries?) to come. As long as the bits that make up the data can be retrieved from the storage medium and that quantum computers running Ubuntu 37.04 ("Magnificent Mastodon") can still be programmed, the data will still be accessible. There may even be a few C/C++ programmers still around who can be lured out of retirement to help—if they aren't all busy solving the 2038 problem, anyway.

More seriously, though, maintaining access to digital data will require some attention. Storage device technology continues to evolve, and there are limits on the lifetime of the media itself. CDs, DVDs, hard drives, tapes, flash, and so on all will need refreshing from time to time. Moving archives from one medium to another is costly enough, why add potentially lossy format conversions and the cost of upgrading software to read the data—if said software is even still available.

Proprietary vendors come and go; their formats right along with them. Trying to read a Microsoft Word document from 20 years ago is likely to be an exercise in frustration, but trying to read a Windows 3.0 WordStar document will be far worse. There are ways to do so, of course, but they are painful—if one can even track down a 3.5" floppy drive (not to mention 5.25"). If the original software is still available somewhere (e.g. Ebay, backup floppies, ...) then it may be possible to use emulators to run the original program, but that still may not help with getting the data into a supported format.

Amusingly, free software often supports older formats far longer than the vendors do. While the results are often imperfect, reverse engineering proprietary data formats is a time-honored tradition in our communities. Once that's been done, there's little reason not to keep supporting the old format. That's not to say that older formats don't fall off the list at times, but the code is still out there for those who need it.

As internet services come and go, there will also be issues with preserving data from those sources. Much of it is stored in free software databases, though that may make little difference if there is no access to the raw data. In addition, the database schema and how it relates articles, comments, status updates, wall postings, and so on, is probably not available either. If some day Facebook, Google+, Twitter, Picasa, or any of the other proprietary services goes away—perhaps with little or no warning—that data may well be lost to the ages too. Some might argue that the majority of it should be lost, but some of it certainly qualifies as part of our digital heritage.

Beyond the social networks and their ilk, there are a huge number of news and information sites with relevant data locked away on their servers. Data from things like the New York Times (or Wall Street Journal), Boing Boing and other blogs, the article from The Economist linked above, the articles and comments here at LWN, and thousands (perhaps millions) more, are all things that one might like to preserve. The Internet Archive can only do so much.

Solutions for data from internet sites are tricky, since the data is closely held by the services and there are serious privacy considerations for some of it. But some way to archive some of that data is needed. By the time the service or site itself is on the ropes, it may well be too late.

Users should think long and hard before they lock up their long-term data in closed formats. While yesterday's email may not be all that important (maybe), that unfinished novel, last will and testament, or financial records from the 80s may well be. Beyond that, shareholders and taxpayers should be pressuring businesses and governments to store their documents in open formats. In the best case scenario, it will just cost more money to deal with old, closed-format data; in the worst case, after enough time passes, there may be no economically plausible way to retrieve it. That is something worth avoiding.

Who owns your data?

Posted May 10, 2012 2:36 UTC (Thu) by Comet (subscriber, #11646) [Link] (1 responses)

Since The Economist is British, I'll point to the British Library's Digital Preservation programme, which is tackling this issue: http://www.bl.uk/dp

They have archived copies of the Microsoft document format specifications; much as we might dislike it, the content they need to preserve is the content created by most of the populace. But folks trying to establish parity for other formats should probably reach out to the preservation officers of the BL to get their specifications archived too.

Who owns your data?

Posted May 10, 2012 13:55 UTC (Thu) by pboddie (guest, #50784) [Link]

They have archived copies of the Microsoft document format specifications; much as we might dislike it, the content they need to preserve is the content created by most of the populace.

Although welcome, this raises additional issues. Given this apparent safety net, people are now likely to say "Great, we're covered!" And then they will carry on churning out proprietary format content. But we are not covered.

Firstly, we don't even know if the specifications are complete or accurate. This is Microsoft we're talking about, so although it is possible that these published specifications have had some auditing as part of a regulatory action in the European Union, we can't be sure that they are usable until someone produces a separate implementation.

Secondly, people will happily start producing content in later versions of those formats which aren't covered by publicly available specifications. Again, we're talking about Microsoft, so any remedy for trouble they have managed to get themselves into will only last as long as the company is under scrutiny. Then, it's back to business as usual. Meanwhile, nobody in wider society will have been educated about the pitfalls of such proprietary formats and systems.

Thirdly, the cost of preservation under such initiatives may well be borne by the people whose data is now imprisoned in such formats, instead of the people responsible for devising the format in the first place. In various environments, there are actually standards for archiving, although I can well imagine that those responsible for enforcing such standards have been transfixed by the sparkle of new gadgetry, the soothing tones of the sales pitch, and the quick hand-over of an awkward problem to a reassuring vendor. Public institutions and the public in general should not have to make up the shortfall in the vendors' lack of investment.

Finally, standards compliance is awkward enough even when standards are open and documented. One can argue that a Free Software reference implementation might encourage overdependence on a particular technology and its peculiarities, potentially undermining any underdocumented standard, but this can really only be fixed when you have a functioning community and multiple Free Software implementations: then, ambiguities and inconsistencies are brought to the surface and publicly dealt with.

Sustainable computing and knowledge management requires a degree of redundancy. Mentions of the celebrated case of the BBC Domesday Project often omit the fact that efforts were made to properly document the technologies involved - it is usually assumed that nobody had bothered, which is not the case - but had that project been able to take advantage of widely supported, genuinely open standards, misplacing documentation would have had a substantially smaller impact on preservation activities.

Indeed, with open formats and appropriate licensing of the content, the output of the project might have been continuously preserved, meaning that the content and the means of deploying it would have adapted incrementally as technology progressed. That's a much more attractive outcome than sealing some notes in a box and hoping that future archaeologists can figure them out.

Who owns your data?

Posted May 10, 2012 3:40 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

Multimedia seems to be at least a minor exception to this. Most of the important formats were created by industry consortia, so they're fairly well documented and widely available. That also limits the extent to which companies can tamper with the formats in an attempt at user control, since they have to retain compatibility with well established standards. The biggest exception I can think of are the huge range of proprietary raw image formats created by digital camera companies, and even there we have projects like dcraw that have effectively documented the formats in the form of functioning decoding code.

Who owns your data?

Posted May 10, 2012 11:15 UTC (Thu) by robbe (guest, #16131) [Link] (2 responses)

You are talking about a recent development.

May I remind you of realaudio, indeo, cinepak, etc.? Videos of this time (1990s) were generally too crappy to remember, but a lot of actually useful audio recordings are still locked up in RA format.

Who owns your data?

Posted May 11, 2012 21:27 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link]

Didn't Real Audio eventually release an Open Source version of their player?

Who owns your data?

Posted May 12, 2012 15:51 UTC (Sat) by jengelh (guest, #33263) [Link]

The Fury3 game CD has videos in Cinepak format, and today's MPlayer still recognizes and plays them.

Who owns your data?

Posted May 10, 2012 4:19 UTC (Thu) by djfoobarmatt (guest, #6446) [Link]

In the past I worked with a digital repository project that made use of JHove (http://hul.harvard.edu/jhove/index.html) for verifying file formats and flagging documents that used proprietary extensions to open formats (such as some types of PDF documents). The project seems to be dormant now but it's an interesting part of the digital sustainability landscape and prompted a lot of thinking about the kinds of documents we could accept when trying to guarantee that the files could be accessed in 5/10/100 years.

Who owns your data?

Posted May 10, 2012 5:42 UTC (Thu) by eru (subscriber, #2753) [Link] (1 responses)

Trying to read a Microsoft Word document from 20 years ago is likely to be an exercise in frustration,

At work I have sometimes had reasons to do precisely that, and found OpenOffice manages it better than modern MS Office. Only the layout may be a bit off. Actually the old Word format is simple enough that even strings(1) can be used to recover the plain text parts.

if one can even track down a 3.5" floppy drive (not to mention 5.25").

That's one reason I'm keeping a couple of those in my basement (along with a couple of old computers). Actually 5.25" formatted as 360k (DSDD) is surprisingly durable. I once did a transfer job for an author who wanted to access some old manuscripts (?, can you call them that?) written on a MS-DOS machine with Wordstar and kept in boxes of 5.25" disks for 10 years, and found just two or three files that failed to be read.

Who owns your data?

Posted May 10, 2012 14:43 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

The Floppies with Prince of Persia source code for the Apple II were also successfully read after 20 years! I don't recall 1.44 MB floppies to be particularly durable though.

On text documents

Posted May 10, 2012 9:23 UTC (Thu) by philipstorry (subscriber, #45926) [Link] (25 responses)

Proprietary vendors come and go; their formats right along with them. Trying to read a Microsoft Word document from 20 years ago is likely to be an exercise in frustration, but trying to read a Windows 3.0 WordStar document will be far worse.

Assuming that we can't find a specification for the WordStar file in question, it will be an inconvenience. It would probably be easier to find the application, install it on a VM, and re-save the file to an acceptable intermediate format than to do any kind of reverse engineering.

But if file fidelity is important, then the original software may be the only option anyway. I've tried to open old Word 2.x/6.x for Windows documents with recent versions of Word - and if there's any complex formatting, it's pretty much a waste of time.

There's a naive assumption here that software with the same brand name (if it survives the years) is always going to be backwards compatible. Not only is that not born out with my own experience today, but I suspect it will only get worse.

Ultimately, if you want to still be able to access it in the future with decent fidelity, I see only three options.

Plain Text (as plain as you can get, but if you need unicode that'll probably work too)
PDF/A (it seems like a reasonable bet)
The exact format you're using now, and a VM image you update/migrate yearly

Yes, having to test (and upgrade to later versions if necessary) a VM image every year will be a pain. But it's probably the only reliable way.

If text documents are this much of a hassle, despite being the largest type of file by count, imagine how painful the other formats are going to be!

On text documents

Posted May 10, 2012 17:36 UTC (Thu) by iabervon (subscriber, #722) [Link] (10 responses)

Text documents are actually harder than a lot of things, in that the instructions to the system can be much more complicated. Audio ultimately amounts to moving some speakers, and video to coloring some dots, but text documents have a lot of information about font choice and positioning rules, as well as information on how the glyphs go in sequence. I expect to be able to cut-and-paste a paragraph out of a text document and put in in a document with a different font and a different width, and have line breaks put in appropriate places and my exponents and subscripts turn up as exponents and subscripts, and I expect to get the paragraph as a whole and not get text from the adjacent column or the page number (even if the paragraph is split across pages). This sort of information is not available as part of the content of most other sorts of file, even with the original software, so there's nothing to degrade with version changes.

On text documents

Posted May 10, 2012 23:42 UTC (Thu) by mrons (subscriber, #1751) [Link] (9 responses)

Of course if you had used TeX 20 years ago, the document would look exactly the same today, even down to the line-breaks, and be in an easily editable form.

On text documents

Posted May 11, 2012 2:34 UTC (Fri) by iabervon (subscriber, #722) [Link] (5 responses)

20 years ago, I only had a dot-matrix printer, and TeX isn't really set up to deal with extremely limited positioning granularity. Certainly everything I've written in the last 15 years has been in TeX unless it's been in HTML or something which renders to HTML. But TeX can actually be kind of problematic: you have to modify the file in order to avoid generating recto and verso pages, which are inappropriate for e-readers (or, really, any presentation form which doesn't involve dead trees). And TeX documents actually often make a lot of assumptions about the form of the result which means that there isn't machine-readable available to produce other presentation reasonably.

(Not to mention that building TeX requires implementations of at least two language dialects (WEB and \ph) which aren't used for anything else on any modern system; it's easier to make an emulator for the computers that Wordperfect ran on than to make a compiler able to build TeX, although people have done both.)

On text documents

Posted May 11, 2012 4:33 UTC (Fri) by eru (subscriber, #2753) [Link] (2 responses)

Huh? Most Linux distributions provide a TeX package. I believe it is build using a portable C implementation of WEB (web2c), which is a source to source translator. So just C is required for that part. Browsing the READMEs of a recent TeX for Linux implementation (http://www.tug.org/svn/texlive/trunk/Build/) there certainly are also other dependencies for building and auxiliary programs, but that is stuff that typical Linux implementations already provide. Of course bootstrapping TeX for a very different computer and OS from scratch would be a lot of work, but at least it is possible, thanks to the good documentation of teX and its source.

On text documents

Posted May 11, 2012 16:03 UTC (Fri) by iabervon (subscriber, #722) [Link] (1 responses)

web2c (plus a C compiler) is an implementation of WEB and \ph, and is of comparable portability and complexity to dosbox, which will run your old word processors. TeX's source does contain a lot of documentation about the expected Pascal dialect and the preprocessor; but there's even more documentation about the x86 and DOS. Your old DOS programs don't come with an extensive description of the platform they run on, but they're also not the only things that use that platform, so they don't have to.

On text documents

Posted May 11, 2012 17:07 UTC (Fri) by eru (subscriber, #2753) [Link]

Still have to disagree here. I'm pretty sure I could port web2c to a new platform in an evening or two, provide it has a decent ANSI C compiler (which is now a very common piece of infrastructure and can legitimately be assumed). Porting DOSBOX would be a much larger task, unless the new target is very similar to some of the existing ones. Yes, there is more documentation about x86 and DOS, because a lot more is needed to describe the complicated and ugly interface, and it is still incomplete...
I have found bugs in DOSBOX, which I currently use to support some legacy cross-compilation tools at my workplace. Also used DOSEMU+FreeDOS for the same task, and found it has some different bugs... I could work around the problems for the limited set of programs that were needed. But the fact is the only thing that is completely MS-DOS compatible for all programs still is the original MS-DOS.

the computers that Wordperfect ran on

Posted May 22, 2012 21:15 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Although the CURRENT WordPerfect file format is about 18 years old (it was created, as a result of MS breaking the old 5 format) in 1994.

To the best of my knowledge, WordPerfect files are both backwards AND forwards compatible between v6.0 (released in 1994 as I said) and the latest version.

So incompatibility like this is a deliberate or accidental vendor choice, not something that is inevitable ...

Cheers,
Wol

the computers that Wordperfect ran on

Posted May 26, 2012 19:06 UTC (Sat) by mirabilos (subscriber, #84359) [Link]

What influence has MS had on Wordperfect, which, like Wordstar, was something entirely different from MS Word and Winword.

On text documents

Posted May 11, 2012 4:49 UTC (Fri) by eru (subscriber, #2753) [Link]

Of course if you had used TeX 20 years ago, the document would look exactly the same today, even down to the line-breaks, and be in an easily editable form.

I mostly agree from personal experience. I have some large LaTeX documents that were started that long ago, and which I still maintain now and then. Not quite pure LaTeX, because they contain diagrams that were done with xfig (but that also is still available, and quite good for simple diagrams). Some changes in LaTeX (mainly the transition to 3.x) required minor changes to the source, but these were limited just to the macro settings at the beginning of the document. Also I started to use some PostScript-related font packages for much improved PDF output, which slightly changed final layout. But the bulk of the text has not needed any changes attributable only to the formatting tool evolution. Supposing I had not been maintaining the documents for 20 years, suddenly getting them formatted with current versions of the tools might be slightly more work, but not much.

On text documents

Posted May 11, 2012 5:07 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Except that you've probably used some form of TeX macro library (MikTeX, LaTeX, etc), not raw TeX. In which case you have to hunt down all the dependencies and pray that they work.

On text documents

Posted May 12, 2012 15:56 UTC (Sat) by jengelh (guest, #33263) [Link]

RTF was an option to save in using Microsoft products back then, and shares the readability of TeX (in principle—WYSIWYG editors like Word had a tendency to not collapse redundant formatting statements, so that font name/color info was repeated for like every paragraph and bullet point).

On text documents

Posted May 12, 2012 19:20 UTC (Sat) by giraffedata (guest, #1954) [Link] (12 responses)

Ultimately, if you want to still be able to access it in the future with decent fidelity, I see only three options.
...

The exact format you're using now, and a VM image you update/migrate yearly

Yes, having to test (and upgrade to later versions if necessary) a VM image every year will be a pain. But it's probably the only reliable way.

Do you mean every year forever, even long after you're dead, or just every year while you're creating documents?

On text documents

Posted May 13, 2012 18:59 UTC (Sun) by philipstorry (subscriber, #45926) [Link] (11 responses)

After you're dead, it'll be a bit difficult. ;-)

But I mean every year that you want to be able to retrieve the documents, you should make sure your VM works, migrate it to new storage if necessary, and (if it's needed) upgrade it to work with the version of VM software you're using.

Otherwise, in a decade's time, you'll probably end up firing your VM up, only to find that the image is no longer a supported version and doesn't run anymore.

On text documents

Posted May 13, 2012 19:57 UTC (Sun) by giraffedata (guest, #1954) [Link] (10 responses)

OK, well I think that misses the point of the article, which talks about "heritage." Keeping your own active data usable is one thing, but a more complex concern is storing data for many generations and having it be usable by society at large at a point when it's considered history.

For that, something that requires a significant amount of effort to keep the data vital would probably be more costly than just discarding the data, so people are looking for ways just to stick something in a corner for 50 years, largely forget about it, and still have a decent chance of being able to use it.

Updating all your document reading tools each year to be compatible with this year's environment is an example of something so costly we assume it won't be done. In fact, I think updating the documents regularly would be more practical.

On text documents

Posted May 13, 2012 20:00 UTC (Sun) by philipstorry (subscriber, #45926) [Link] (9 responses)

Whilst I was speaking about personal solutions, it scales to heritage as well.

You only need one VM for all your data, because the access method would be to present some storage with the files you want to the VM.
(Granted, there may be a point where the lack of USB or CD support on hardware may mean that you have to present it with a disk image, but it's still fairly trivial.)

I don't think updating all my documents each year would be practical. The idea that it would be practical for a heritage model seems ridiculous.

The VM is probably the best method we will have to ensure fidelity. It's the least amount of work for the best return.

On text documents

Posted May 13, 2012 20:53 UTC (Sun) by giraffedata (guest, #1954) [Link] (8 responses)

The problem is that some day, your old VM won't run on the new VM host, so you have to update the VM operating system, and you old Word won't run on the new VM operating system.

You acknowledged the concern that the new VM host might not be able to read a CD, but your solution of a disk image (a VM host file that the VM sees as a disk drive, I presume) has the same problem. Not only do you have to store the disk image file on some medium from which the new VM host can read bits, but the new VM host has to be able to interpret those bits as virtual disk content.

I don't think updating all my documents each year would be practical. The idea that it would be practical for a heritage model seems ridiculous.

Agreed.

The VM is probably the best method we will have to ensure fidelity

It still looks to me less likely to succeed than updating the documents.

If we solve the problem (i.e. improve our data heritage), I think it will be like the Economist proposes: with agreements among ourselves to maintain archive formats.

On text documents

Posted May 13, 2012 22:04 UTC (Sun) by apoelstra (subscriber, #75205) [Link] (4 responses)

When your OS no longer supports the VM you want, you should just run an old version in a VM, so your document will be on a VM-within-a-VM. Then eventually you'll need a VM-within-a-VM-within-a-VM, and so on...

On text documents

Posted May 13, 2012 22:29 UTC (Sun) by giraffedata (guest, #1954) [Link] (3 responses)

When your OS no longer supports the VM you want, you should just run an old version in a VM, so your document will be on a VM-within-a-VM. Then eventually you'll need a VM-within-a-VM-within-a-VM, and so on...

I can't tell what you're describing. Can you phrase this without the word "support" so it's more precise?

I'm also unclear on what "the VM you want" is and whether when you say OS, you're talking about a particular instance or a class such as "Fedora".

On text documents

Posted May 13, 2012 22:54 UTC (Sun) by apoelstra (subscriber, #75205) [Link] (2 responses)

I was being somewhat tongue-in-cheek, but here is an example of what I meant:

Suppose your documents live in Wordstar for windows 3.1. So you keep Windows 3.1/MS-DOS 5 on a VM for a while. But one day you wake up to Windows 7, and it's 64-bit only, and won't your old DOS-supporting VM software anymore.

(I don't know if this is actually a problem. It's just an example.)

So you go ahead and install XP in a VM under Windows 7. On XP, you run a VM containing DOS, on which you run Wordstar.

Some years later, XP won't run on a VM since it's 2045 and nobody has heard of BIOS anymore. So you have a VM running Win7, which runs a VM running XP, which runs a VM running DOS, which finally runs Wordstar.

Then 25 years later, your VM software doesn't work, so you add another layer...

On text documents

Posted May 14, 2012 13:00 UTC (Mon) by philipstorry (subscriber, #45926) [Link] (1 responses)

I see your point, but I think we can circumvent some of this.

I think that at some point - probably host-architecture bound - we have to switch from VM-as-supervisor to straight emulation.

I've mentioned this in another reply, so apologies if you've read it already - but basically, when your VM solution finally stops supporting your version of the client OS then it's time to look at a switch to emulating the entire machine, QEMU style.

The advantage of that is that the emulation is much more likely to last longer, albeit be somewhat slower to run.

The Intel/AMD 64-bit chips are (I believe) incapable of running 16-bit code when in 64-bit mode. They can run 32-bit, just not 16-bit.
So we're already at the point where VM systems are unable to run some old OSes or apps without resorting to emulation behind the scenes.

Rather than rely on that assumed emulation, I think we should build in a stage where we simple say "all 16-bit code is emulated", and prepare for the idea that 128-bit processors in a decade or two might mean we have to add 32-bit code to the "emulated by default" pile.

That stops us from having to do VMs within VMs, as you describe. (And if the chip won't run the code, and there's no emulator, I'm not sure VMs within VMs will work anyway.)

On text documents

Posted May 14, 2012 13:19 UTC (Mon) by paulj (subscriber, #341) [Link]

I think you've missed their point. Switching to, say, QEMU and software-virtualisation doesn't solve things. Eventually one day QEMU will no longer be maintained. Some time after that, the systems on which QEMU runs (software and hardware wise) will be obsolete. You will then have to run a VM of the system on which QEMU runs, in order to be able to run QEMU, in order to run the VM that contains the software you need to run in order to view the document you're interested in. Even that new system will eventually one day become obsolete, necessitating another layer of VMs. You end up with VM turtles all the way down.

Further, you are assuming that for every such system that becomes obsolete there will be a VM on the newer system to run the older system. This is far from guaranteed. If that assumption ever fails, access to that document is lost after that time. For that assumption to always be true, every system needs to be sufficiently well documented by its maker that someone in the future will be able to emulate it. I.e. it assumes your chain of VMs will never become dependent on a monolithically proprietary system.

So, given that your VM system also relies on open specifications, wouldn't it be much better & simpler to just work towards ensuring documents are stored in openly specified formats? That seems far more future proof to me..

On text documents

Posted May 14, 2012 12:36 UTC (Mon) by philipstorry (subscriber, #45926) [Link] (2 responses)

I suspect that at some point, there will be a necessary move from VM-as-supervisor to full emulation.

For example, if you're using VirtualBox, then at some point the version of Windows may no longer be supported as a client OS. That's the time to shift the image to something like QEMU.

I should point out I wasn't envisaging the idea of a VM disk image as long-term storage, more as a transport medium. If there's genuinely no other way to get the data into the VM to be used, then simply giving it a fake hard disk is the ideal method - use a more modern VM to save the data to the disk, shut that down and then present it to the VM that has the software you need.

I envisage the data itself being seperate to the VMs themselves in all of this - the VMs should be small "access points". The disk image idea is just a way to get data into them temporarily.

So, to be clear, we have a two parts to the solution - your storage, which you can do what you want with. Keep multiple copies, keep checking the medium is good (via md5sum or similar), and so forth. And the access system, which is a VM you check once a year. And if it needs to be updated/transitioned to emulation, at least you know and can deal with tha.

On a very large scale, this divides the work between two teams - a storage team maintain the actual archives, and an apps team who maintain the access.

Of course, this is only if we want full fidelity. If we're OK with bad reformatting by a later version of the program, then we don't need the second team at all. :-)

On text documents

Posted May 14, 2012 14:25 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

>For example, if you're using VirtualBox, then at some point the version of Windows may no longer be supported as a client OS. That's the time to shift the image to something like QEMU.

Why? There's no innate reason for that. Emulators can be rewritten and/or forward ported. Besides, x86 is highly documented and known. I wouldn't be surprised if it would still be used in 1000 years.

On text documents

Posted May 14, 2012 16:08 UTC (Mon) by giraffedata (guest, #1954) [Link]

Emulators can be rewritten and/or forward ported.

Remember the parameters of the problem. We're not talking in this thread about what society could do here; we're talking about a strategy one person could use to make his data live forever. (If we branch out into the larger question, then we can consider things like making laws that people have to make emulators available to other people).

The fear is that people won't care enough about old documents to make the substantial investment in that forward porting. We see backward compatibility broken all the time, so it's a valid concern.

Given that, a QEMU platform is surely a better guess at something the next Windows will run on than a VirtualBox platform. (If VirtualBox VMs become far more common hosts of Windows than x86 hardware, the opposite will be true).

A system based on a chain of virtualization, which relies on there always being N-1 compatibility (the world will never switch to a new platform that can't run the previous one as a guest) also could work, but I think there's a good chance that compatibility chain will be broken in the natural course of things.

On text documents

Posted May 14, 2012 19:41 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link]

Yes, having to test (and upgrade to later versions if necessary) a VM image every year will be a pain. But it's probably the only reliable way.

I disagree. As I see it, there are only two ways of preserving a file: in an editable form that's intended to be updated further or in an archival form that's intended to preserve the file as close to its existing form as possible. If you intend to edit the file further, you can't guarantee that you'll be able to preserve its existing formatting anyway, so you might as well migrate it to a modern, well documented format like ODF while trying to preserve the existing formatting as well as possible. If you're trying to preserve it as a finished, archival document, you're best off translating it into a format like Postscript or PDF that is properly designed to preserve formatting at the expense of being editable.

What you really don't want to do is to rely on a brittle solution like running old software in a VM. It may be able to preserve fidelity a little bit better than the alternatives, but that only works as long as you have a working VM. Going from perfect fidelity to nothing is not a graceful failure! Rather than worrying about maintaining a working VM indefinitely, you'd be much better off spending your effort on a virtual printer for your existing VM that would let you export all your documents to PDF.

What it means in practice.

Posted May 10, 2012 10:48 UTC (Thu) by stevan (guest, #4342) [Link]

Here in Scotland, I have become involved in the creation of a community digital archive. It is fully free-software based (postgresql and dspace,) KVM-virtualised and all ancillary aspects, like presentation of the archival objects is done via free software too. We view virtualisation as an aspect of longevity too and the choice of dspace was done after checking that scripts were available to migrate if necessary. We mandate that only non-proprietary formats are used in the Archive, and we include standard (non-Adobe-encumbered) pdf in that definition. We have had a request from historians to ensure audio is stored lossless (that's OK - flac) but when it comes to things like video, especially compressed video, it becomes more problematic. It's quite straight forward to mandate vorbis or similar, in the absence of knowing where webm is heading. The reason is trting to guess what is likely to be demanded of the file in the future and to what extent current open standards are going to be transferrable in future. You could argue that file storing everything uncompressed is the answer, but it's not a wildly practical one. The same thing applies to images. A 6cm x 6cm negative from 50 years ago still contains a huge amount of information, but original digital images vary.

I'm not complaining - I think both free software and free and open formats are by no means lacking, and intellectually they are the way to go, but when it comes to applying them it's quite difficult to think practically into the longer term.

And when I say "we," above, I mean "I," as many of these nuances are lost on users of the Archive. It is necessary to have a story to tell to explain why docx gets a swift blow from the digital lead piping when it reaches the Archive.

Digital Restrictions Management

Posted May 10, 2012 11:45 UTC (Thu) by robbe (guest, #16131) [Link]

Contrary to Jake I actually see DRM as the main driver to invent new proprietary formats. Sure, lazyness can induce programmers to just shove their data structures on disk, but outside of the embedded space, I see that less and less often. Actually, this prime virtue of all programmers works in the other way too: there are readily available libraries that help to export pdf, odf, mpeg, etc -- why not just rely on these and be done?

That's where DRM comes in. Since it is basically an arm's race there is always motivation to crank out new schemes and formats.

The most problematic restrictions managment requires an online server to open a document. When this server inevitably goes away, the only hope of future historians is that they can easily crack our puny crypto on their (quantum?)computers.

Who owns your data?

Posted May 10, 2012 14:13 UTC (Thu) by nsheed (subscriber, #5151) [Link]

Speaking from personal experience, you have to be very careful even when saying specific formats are good/bad.

As an example, TIFF of all things has proved to be an ongoing source of pain due to the joys of a) odd JPEG usage in older files, b) explicitly allowing for vendor extensions in the specification (annotations & highlighting that appear/disappear depending on the viewing app).

So far most of the issues of this type are work-aroundable, the issue is every time we hit a new scenario it takes time to investigate/find workarounds (if possible)/go back to the source (again if possible - discovery of an issue may be months/years after file creation).

Even free software can age less than gracefully

Posted May 10, 2012 20:47 UTC (Thu) by roblatham (guest, #1579) [Link]

The story is not limited to proprietary formats. I found a 10 year old Gnucash file generated with gnucash-1.6, but today's Gnucash does not know how to read that file.

In the end, i installed debian sarge in a chroot (thanks archive.debian.org!) so I could run gnucash 1.8 without building the 20 little dependencies.

Who owns your data?

Posted May 11, 2012 4:03 UTC (Fri) by ringerc (subscriber, #3071) [Link] (1 responses)

The newspaper I work for is suffering from this at the moment.

We have a vast library of material in QuarkXPress 3.3 and QuarkXPress 4.0 format. Quark has never been what you'd call an "open" company; this is, after all, the company whose CEO has said that "all customers are liars, thieves and bastards".

Quark upgrades are expensive. Old versions of Quark don't work on newer OSes, and Quark doesn't fix even simple bugs in old versions. New versions of Quark don't import documents from old versions all that reliably, especially where things like font format changes are involved. More importantly, if you move to a non-Quark product, you lose access to all your historical work, because you have to keep on paying Quark upgrade fees to retain access to it on updated systems.

We landed up keeping an old Mac around to open Quark docs, and another slightly-less-old machine that has an importer plugin for InDesign that lets us open old Quark docs, convert them to a slightly less old InDesign format, save that, and open it in our current versions of InDesign.

Of course, InDesign has exactly the same problems as Quark; it's a locked down format under Adobe's total control. The problem continues to grow as we produce more work.

While everything is in PDF format too, that's not much good if we need to edit it - and there simply are no good open standard desktop publishing formats. OpenDocument is very poorly suited to DTP's layout-oriented approach, detailed typography, etc. Scribus's format isn't specified formally, is painful to work with, evolves continuously, and may as well be closed because nothing else supports it. There isn't anything else out there.

My point: Sometimes we'd like to avoid closed formats, but there aren't any alternatives to choose. The newspaper's technical debt keeps on growing, and there's not much I can do about it, as we're way too small to have the resources to create a competing format and support for it.

Who owns your data?

Posted May 11, 2012 10:37 UTC (Fri) by ebirdie (guest, #512) [Link]

Sounds very familiar although I worked for a company, which produced materials for learning and teaching like books. I once argued that the Quark-files the company had worth of in thousands and thousands paid work hours from script writers to layout designers, all that possibly reusable, weren't worth any work hours while upgrading up away from previous files but only to have the latest and greatest software. Well the files' ownership changed to a big corporation later, for which I didn't give a penny.

However Quark-files make an exception. The end product of publishing still finds its way to paper many times, on paper it gets distributed and can't be in such a stranglehold of file formats, DRM, cloud services etc as digital information. As the famous clause goes "information wants to be free" it is still too easy to forget that the freedom has independence coded into it. And here everyone knows, what free code is, but it seems like it hasn't yet produced as free information as paper still does - although there is arguably more restrictions printed on paper nowadays.

Seeing the current challenges and threads in future in maintaining free information, I'm glad that paper as medium was invented first in humankind. At least the paper offers some reference. It seems to be good business to reinvent everything digitally, so digital information is doomed. It is much cheaper and effortless to me give space to my books and carry them while moving (except I'll do everything in my power to not move anymore) than do the required work to maintain digital information usable and accessible.

Who owns your data?

Posted May 12, 2012 15:59 UTC (Sat) by jengelh (guest, #33263) [Link]

>As internet services come and go, there will also be issues with preserving data from those sources. Much of it is stored in free software databases, though that may make little difference if there is no access to the raw data.

Sometimes we would be glad if Facebook, Google, et al lost all their user profiling data about us in an instant because of that.

Who owns your data?

Posted May 18, 2012 1:57 UTC (Fri) by steffen780 (guest, #68142) [Link] (3 responses)

Seeing how LWN was specifically mentioned.. is there a way to download LWN archives? It doesn't have to be current to the day, nor even up to the point where articles are free, but would it be possible to make yearly archives of e.g. everything up to the year before the last completed year, ie. currently up to and including 2010? Bonus points if it includes the comments :)
Even more bonus points if it's under a free or CC license (I say "or" because for this I'd consider NC-ND perfectly acceptable, though I'm not a huge fan of that one).

Alternatively, could we get an official ok for using a script/tool to (slowly) run through the archives and download everything? Feels naughty to just download it all.

Who owns your data?

Posted May 18, 2012 2:26 UTC (Fri) by apoelstra (subscriber, #75205) [Link]

I'd be interested in this as well. (And yes, it'd be naughty to just download it all -- the admins would probably suspect you were an attacker and block your IP ;)).

Who owns your data?

Posted May 18, 2012 13:43 UTC (Fri) by corbet (editor, #1) [Link] (1 responses)

Please don't play "download the whole LWN site." We have enough people doing that as it is for no real reason that I can figure out.

We have various schemes for improving access to the archives. A lot of things are on hold at the moment, unfortunately, but stay tuned, we'll get there.

Who owns your data?

Posted May 18, 2012 14:52 UTC (Fri) by jackb (guest, #41909) [Link]

We have enough people doing that as it is for no real reason that I can figure out.

It may be related to brand management businesses.