On text documents

Posted May 13, 2012 19:57 UTC (Sun) by giraffedata (subscriber, #1954)
In reply to: On text documents by philipstorry
Parent article: Who owns your data?

OK, well I think that misses the point of the article, which talks about "heritage." Keeping your own active data usable is one thing, but a more complex concern is storing data for many generations and having it be usable by society at large at a point when it's considered history.

For that, something that requires a significant amount of effort to keep the data vital would probably be more costly than just discarding the data, so people are looking for ways just to stick something in a corner for 50 years, largely forget about it, and still have a decent chance of being able to use it.

Updating all your document reading tools each year to be compatible with this year's environment is an example of something so costly we assume it won't be done. In fact, I think updating the documents regularly would be more practical.


On text documents

Posted May 13, 2012 20:00 UTC (Sun) by philipstorry (subscriber, #45926) [Link]

Whilst I was speaking about personal solutions, it scales to heritage as well.

You only need one VM for all your data, because the access method would be to present some storage with the files you want to the VM.
(Granted, there may be a point where the lack of USB or CD support on hardware may mean that you have to present it with a disk image, but it's still fairly trivial.)

I don't think updating all my documents each year would be practical. The idea that it would be practical for a heritage model seems ridiculous.

The VM is probably the best method we will have to ensure fidelity. It's the least amount of work for the best return.

On text documents

Posted May 13, 2012 20:53 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

The problem is that some day, your old VM won't run on the new VM host, so you have to update the VM operating system, and your old Word won't run on the new VM operating system.

You acknowledged the concern that the new VM host might not be able to read a CD, but your solution of a disk image (a VM host file that the VM sees as a disk drive, I presume) has the same problem. Not only do you have to store the disk image file on some medium from which the new VM host can read bits, but the new VM host has to be able to interpret those bits as virtual disk content.

> I don't think updating all my documents each year would be practical. The idea that it would be practical for a heritage model seems ridiculous.

Agreed.

> The VM is probably the best method we will have to ensure fidelity

It still looks to me less likely to succeed than updating the documents.

If we solve the problem (i.e. improve our data heritage), I think it will be like the Economist proposes: with agreements among ourselves to maintain archive formats.

On text documents

Posted May 13, 2012 22:04 UTC (Sun) by apoelstra (subscriber, #75205) [Link]

When your OS no longer supports the VM you want, you should just run an old version in a VM, so your document will be on a VM-within-a-VM. Then eventually you'll need a VM-within-a-VM-within-a-VM, and so on...

On text documents

Posted May 13, 2012 22:29 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

> When your OS no longer supports the VM you want, you should just run an old version in a VM, so your document will be on a VM-within-a-VM. Then eventually you'll need a VM-within-a-VM-within-a-VM, and so on...

I can't tell what you're describing. Can you phrase this without the word "support" so it's more precise?

I'm also unclear on what "the VM you want" is and whether when you say OS, you're talking about a particular instance or a class such as "Fedora".

On text documents

Posted May 13, 2012 22:54 UTC (Sun) by apoelstra (subscriber, #75205) [Link]

I was being somewhat tongue-in-cheek, but here is an example of what I meant:

Suppose your documents live in Wordstar for Windows 3.1. So you keep Windows 3.1/MS-DOS 5 on a VM for a while. But one day you wake up to Windows 7, and it's 64-bit only, and won't run your old DOS-supporting VM software anymore.

(I don't know if this is actually a problem. It's just an example.)

So you go ahead and install XP in a VM under Windows 7. On XP, you run a VM containing DOS, on which you run Wordstar.

Some years later, XP won't run on a VM since it's 2045 and nobody has heard of BIOS anymore. So you have a VM running Win7, which runs a VM running XP, which runs a VM running DOS, which finally runs Wordstar.

Then 25 years later, your VM software doesn't work, so you add another layer...

On text documents

Posted May 14, 2012 13:00 UTC (Mon) by philipstorry (subscriber, #45926) [Link]

I see your point, but I think we can circumvent some of this.

I think that at some point - probably host-architecture bound - we have to switch from VM-as-supervisor to straight emulation.

I've mentioned this in another reply, so apologies if you've read it already - but basically, when your VM solution finally stops supporting your version of the client OS then it's time to look at a switch to emulating the entire machine, QEMU style.

The advantage is that emulation is likely to last much longer, albeit running somewhat slower.

The Intel/AMD 64-bit chips are (I believe) incapable of running 16-bit code when in 64-bit mode. They can run 32-bit, just not 16-bit.
So we're already at the point where VM systems are unable to run some old OSes or apps without resorting to emulation behind the scenes.

Rather than rely on that assumed emulation, I think we should build in a stage where we simply say "all 16-bit code is emulated", and prepare for the idea that 128-bit processors in a decade or two might mean we have to add 32-bit code to the "emulated by default" pile.

That stops us from having to do VMs within VMs, as you describe. (And if the chip won't run the code, and there's no emulator, I'm not sure VMs within VMs will work anyway.)

On text documents

Posted May 14, 2012 13:19 UTC (Mon) by paulj (subscriber, #341) [Link]

I think you've missed their point. Switching to, say, QEMU and software virtualisation doesn't solve things. One day QEMU will no longer be maintained. Some time after that, the systems on which QEMU runs (software- and hardware-wise) will be obsolete. You will then have to run a VM of the system on which QEMU runs, in order to be able to run QEMU, in order to run the VM that contains the software you need to view the document you're interested in. Even that new system will eventually become obsolete, necessitating another layer of VMs. You end up with VM turtles all the way down.

Further, you are assuming that for every such system that becomes obsolete there will be a VM on the newer system to run the older system. This is far from guaranteed. If that assumption ever fails, access to that document is lost after that time. For that assumption to always be true, every system needs to be sufficiently well documented by its maker that someone in the future will be able to emulate it. I.e. it assumes your chain of VMs will never become dependent on a monolithically proprietary system.

So, given that your VM system also relies on open specifications, wouldn't it be much better and simpler to just work towards ensuring documents are stored in openly specified formats? That seems far more future-proof to me.

On text documents

Posted May 14, 2012 12:36 UTC (Mon) by philipstorry (subscriber, #45926) [Link]

I suspect that at some point, there will be a necessary move from VM-as-supervisor to full emulation.

For example, if you're using VirtualBox, then at some point the version of Windows may no longer be supported as a client OS. That's the time to shift the image to something like QEMU.

I should point out I wasn't envisaging the idea of a VM disk image as long-term storage, more as a transport medium. If there's genuinely no other way to get the data into the VM to be used, then simply giving it a fake hard disk is the ideal method - use a more modern VM to save the data to the disk, shut that down and then present it to the VM that has the software you need.

I envisage the data itself being separate from the VMs themselves in all of this - the VMs should be small "access points". The disk image idea is just a way to get data into them temporarily.

So, to be clear, we have two parts to the solution. The first is your storage, which you can do what you want with: keep multiple copies, keep checking the medium is good (via md5sum or similar), and so forth. The second is the access system, which is a VM you check once a year. And if it needs to be updated or transitioned to emulation, at least you know and can deal with that.
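
As a rough illustration of the "keep checking the medium is good" step, here is a minimal Python sketch. It is only one way this could be done, not a description of any particular archive tool. The function names (`file_digest`, `build_manifest`, `verify_manifest`) are made up for this example, and SHA-256 stands in for md5sum as the "or similar"; passing `algorithm="md5"` would give the md5sum equivalent where the local Python build allows it.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of one file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, algorithm: str = "sha256") -> dict[str, str]:
    """Map every file under root (by relative POSIX-style path) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str], algorithm: str = "sha256") -> list[str]:
    """Return the relative paths that are missing or whose digest no longer matches."""
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_digest(target, algorithm) != expected:
            damaged.append(rel_path)
    return damaged
```

The idea would be to run `build_manifest` once when the archive is written, store the manifest alongside each copy, and run `verify_manifest` at the yearly check; any path it returns is a candidate for restoring from one of the other copies.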

On a very large scale, this divides the work between two teams: a storage team that maintains the actual archives, and an apps team that maintains the access.

Of course, this is only if we want full fidelity. If we're OK with bad reformatting by a later version of the program, then we don't need the second team at all. :-)

On text documents

Posted May 14, 2012 14:25 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

>For example, if you're using VirtualBox, then at some point the version of Windows may no longer be supported as a client OS. That's the time to shift the image to something like QEMU.

Why? There's no innate reason for that. Emulators can be rewritten and/or forward ported. Besides, x86 is highly documented and known. I wouldn't be surprised if it were still in use in 1000 years.

On text documents

Posted May 14, 2012 16:08 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

> Emulators can be rewritten and/or forward ported.

Remember the parameters of the problem. We're not talking in this thread about what society could do here; we're talking about a strategy one person could use to make his data live forever. (If we branch out into the larger question, then we can consider things like making laws that people have to make emulators available to other people).

The fear is that people won't care enough about old documents to make the substantial investment in that forward porting. We see backward compatibility broken all the time, so it's a valid concern.

Given that, a QEMU platform is surely a better guess at something the next Windows will run on than a VirtualBox platform. (If VirtualBox VMs become far more common hosts of Windows than x86 hardware, the opposite will be true).

A system based on a chain of virtualization, which relies on there always being N-1 compatibility (the world will never switch to a new platform that can't run the previous one as a guest) also could work, but I think there's a good chance that compatibility chain will be broken in the natural course of things.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds