LWN: Comments on "Who owns your data?" https://lwn.net/Articles/496418/ This is a special feed containing comments posted to the individual LWN article titled "Who owns your data?". en-us Fri, 31 Oct 2025 16:38:37 +0000 Fri, 31 Oct 2025 16:38:37 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net the computers that Wordperfect ran on https://lwn.net/Articles/498937/ https://lwn.net/Articles/498937/ mirabilos <div class="FormattedComment"> What influence has MS had on Wordperfect, which, like Wordstar, was something entirely different from MS Word and Winword?<br> </div> Sat, 26 May 2012 19:06:38 +0000 the computers that Wordperfect ran on https://lwn.net/Articles/498275/ https://lwn.net/Articles/498275/ Wol <div class="FormattedComment"> The CURRENT WordPerfect file format is about 18 years old: it was created in 1994, as a result of MS breaking the old 5 format.<br> <p> To the best of my knowledge, WordPerfect files are both backwards AND forwards compatible between v6.0 (released in 1994 as I said) and the latest version.<br> <p> So incompatibility like this is a deliberate or accidental vendor choice, not something that is inevitable ...<br> <p> Cheers,<br> Wol<br> </div> Tue, 22 May 2012 21:15:02 +0000 Who owns your data? https://lwn.net/Articles/497735/ https://lwn.net/Articles/497735/ jackb <blockquote>We have enough people doing that as it is for no real reason that I can figure out.</blockquote>It may be related to <a href="http://market-ticker.org/akcs-www?post=205336">brand management businesses</a>. Fri, 18 May 2012 14:52:52 +0000 Who owns your data? https://lwn.net/Articles/497728/ https://lwn.net/Articles/497728/ corbet Please don't play "download the whole LWN site." We have enough people doing that as it is for no real reason that I can figure out. <p> We have various schemes for improving access to the archives. A lot of things are on hold at the moment, unfortunately, but stay tuned, we'll get there. 
Fri, 18 May 2012 13:43:55 +0000 Who owns your data? https://lwn.net/Articles/497670/ https://lwn.net/Articles/497670/ apoelstra <div class="FormattedComment"> I'd be interested in this as well. (And yes, it'd be naughty to just download it all -- the admins would probably suspect you were an attacker and block your IP ;)).<br> </div> Fri, 18 May 2012 02:26:35 +0000 Who owns your data? https://lwn.net/Articles/497665/ https://lwn.net/Articles/497665/ steffen780 <div class="FormattedComment"> Seeing how LWN was specifically mentioned.. is there a way to download LWN archives? It doesn't have to be current to the day, nor even up to the point where articles are free, but would it be possible to make yearly archives of e.g. everything up to the year before the last completed year, ie. currently up to and including 2010? Bonus points if it includes the comments :)<br> Even more bonus points if it's under a free or CC license (I say "or" because for this I'd consider NC-ND perfectly acceptable, though I'm not a huge fan of that one).<br> <p> Alternatively, could we get an official ok for using a script/tool to (slowly) run through the archives and download everything? Feels naughty to just download it all.<br> </div> Fri, 18 May 2012 01:57:08 +0000 On text documents https://lwn.net/Articles/497119/ https://lwn.net/Articles/497119/ rgmoore <blockquote>Yes, having to test (and upgrade to later versions if necessary) a VM image every year will be a pain. But it's probably the only reliable way.</blockquote> <p>I disagree. As I see it, there are only two ways of preserving a file: in an editable form that's intended to be updated further or in an archival form that's intended to preserve the file as close to its existing form as possible. 
If you intend to edit the file further, you can't guarantee that you'll be able to preserve its existing formatting anyway, so you might as well migrate it to a modern, well-documented format like ODF while trying to preserve the existing formatting as well as possible. If you're trying to preserve it as a finished, archival document, you're best off translating it into a format like PostScript or PDF that is properly designed to preserve formatting at the expense of being editable. <p>What you really don't want to do is to rely on a brittle solution like running old software in a VM. It may be able to preserve fidelity a little bit better than the alternatives, but that only works as long as you have a working VM. Going from perfect fidelity to nothing is not a graceful failure! Rather than worrying about maintaining a working VM indefinitely, you'd be much better off spending your effort on a virtual printer for your existing VM that would let you export all your documents to PDF. Mon, 14 May 2012 19:41:24 +0000 On text documents https://lwn.net/Articles/497089/ https://lwn.net/Articles/497089/ giraffedata <blockquote> Emulators can be rewritten and/or forward ported. </blockquote> <p> Remember the parameters of the problem. We're not talking in this thread about what society <em>could</em> do here; we're talking about a strategy one person could use to make his data live forever. (If we branch out into the larger question, then we can consider things like making laws that people have to make emulators available to other people). <p> The fear is that people won't care enough about old documents to make the substantial investment in that forward porting. We see backward compatibility broken all the time, so it's a valid concern. <p> Given that, a QEMU platform is surely a better guess at something the next Windows will run on than a VirtualBox platform. (If VirtualBox VMs become far more common hosts of Windows than x86 hardware, the opposite will be true). 
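rgmoore's archival alternative above, exporting documents to PDF rather than keeping a VM alive forever, can be sketched with LibreOffice's headless converter. This is a minimal sketch, not a definitive workflow: it assumes the soffice binary is installed, and the docs/ and pdf/ directory names are placeholders.

```shell
# Hedged sketch: batch-export legacy word-processor files to PDF for
# archival. The soffice binary and the directory names are assumptions.
mkdir -p pdf
for f in docs/*.doc docs/*.wpd; do
  [ -e "$f" ] || continue   # skip the literal glob when nothing matches
  soffice --headless --convert-to pdf --outdir pdf "$f"
done
```

The same invocation handles any input format LibreOffice can open, which is exactly the "virtual printer" role rgmoore describes.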
<p> A system based on a chain of virtualization, which relies on there always being N-1 compatibility (the world will never switch to a new platform that can't run the previous one as a guest) also could work, but I think there's a good chance that compatibility chain will be broken in the natural course of things. Mon, 14 May 2012 16:08:24 +0000 On text documents https://lwn.net/Articles/497072/ https://lwn.net/Articles/497072/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt;For example, if you're using VirtualBox, then at some point the version of Windows may no longer be supported as a client OS. That's the time to shift the image to something like QEMU.</font><br> <p> Why? There's no innate reason for that. Emulators can be rewritten and/or forward ported. Besides, x86 is highly documented and known. I wouldn't be surprised if it would still be used in 1000 years.<br> <p> </div> Mon, 14 May 2012 14:25:27 +0000 On text documents https://lwn.net/Articles/497066/ https://lwn.net/Articles/497066/ paulj <div class="FormattedComment"> I think you've missed their point. Switching to, say, QEMU and software-virtualisation doesn't solve things. Eventually one day QEMU will no longer be maintained. Some time after that, the systems on which QEMU runs (software and hardware wise) will be obsolete. You will then have to run a VM of the system on which QEMU runs, in order to be able to run QEMU, in order to run the VM that contains the software you need to run in order to view the document you're interested in. Even that new system will eventually one day become obsolete, necessitating another layer of VMs. You end up with VM turtles all the way down.<br> <p> Further, you are assuming that for every such system that becomes obsolete there will be a VM on the newer system to run the older system. This is far from guaranteed. If that assumption ever fails, access to that document is lost after that time. 
For that assumption to always be true, every system needs to be sufficiently well documented by its maker that someone in the future will be able to emulate it. I.e. it assumes your chain of VMs will never become dependent on a monolithically proprietary system.<br> <p> So, given that your VM system also relies on open specifications, wouldn't it be much better &amp; simpler to just work towards ensuring documents are stored in openly specified formats? That seems far more future-proof to me...<br> </div> Mon, 14 May 2012 13:19:43 +0000 On text documents https://lwn.net/Articles/497064/ https://lwn.net/Articles/497064/ philipstorry <div class="FormattedComment"> I see your point, but I think we can circumvent some of this.<br> <p> I think that at some point - probably host-architecture bound - we have to switch from VM-as-supervisor to straight emulation.<br> <p> I've mentioned this in another reply, so apologies if you've read it already - but basically, when your VM solution finally stops supporting your version of the client OS, then it's time to look at a switch to emulating the entire machine, QEMU style.<br> <p> The advantage of that is that the emulation is much more likely to last longer, albeit somewhat slower to run.<br> <p> The Intel/AMD 64-bit chips are (I believe) incapable of running 16-bit code when in 64-bit mode. They can run 32-bit, just not 16-bit.<br> So we're already at the point where VM systems are unable to run some old OSes or apps without resorting to emulation behind the scenes.<br> <p> Rather than rely on that assumed emulation, I think we should build in a stage where we simply say "all 16-bit code is emulated", and prepare for the idea that 128-bit processors in a decade or two might mean we have to add 32-bit code to the "emulated by default" pile.<br> <p> That stops us from having to do VMs within VMs, as you describe. 
(And if the chip won't run the code, and there's no emulator, I'm not sure VMs within VMs will work anyway.)<br> <p> </div> Mon, 14 May 2012 13:00:16 +0000 On text documents https://lwn.net/Articles/497062/ https://lwn.net/Articles/497062/ philipstorry <div class="FormattedComment"> I suspect that at some point, there will be a necessary move from VM-as-supervisor to full emulation.<br> <p> For example, if you're using VirtualBox, then at some point the version of Windows may no longer be supported as a client OS. That's the time to shift the image to something like QEMU.<br> <p> I should point out I wasn't envisaging the idea of a VM disk image as long-term storage, more as a transport medium. If there's genuinely no other way to get the data into the VM to be used, then simply giving it a fake hard disk is the ideal method - use a more modern VM to save the data to the disk, shut that down and then present it to the VM that has the software you need.<br> <p> I envisage the data itself being separate from the VMs themselves in all of this - the VMs should be small "access points". The disk image idea is just a way to get data into them temporarily.<br> <p> So, to be clear, we have two parts to the solution - your storage, which you can do what you want with. Keep multiple copies, keep checking the medium is good (via md5sum or similar), and so forth. And the access system, which is a VM you check once a year. And if it needs to be updated/transitioned to emulation, at least you know and can deal with that.<br> <p> On a very large scale, this divides the work between two teams - a storage team maintains the actual archives, and an apps team maintains the access.<br> <p> Of course, this is only if we want full fidelity. If we're OK with bad reformatting by a later version of the program, then we don't need the second team at all. 
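The periodic media check described above (md5sum or similar) amounts to keeping a checksum manifest next to the archive and re-verifying it on every yearly check. A minimal sketch with GNU coreutils' sha256sum, where archive/ and the sample file are placeholder names:

```shell
# Build a placeholder archive with one sample file, record a checksum
# manifest for it, then re-verify. On a real yearly check only the
# last command runs, and it exits non-zero on any bit-rot.
mkdir -p archive && printf 'draft from 1994\n' > archive/thesis.wpd
( cd archive && find . -type f ! -name MANIFEST.sha256 -print0 \
    | xargs -0 sha256sum > MANIFEST.sha256 )
( cd archive && sha256sum --check --quiet MANIFEST.sha256 ) \
  && echo "archive intact"
```

Because the manifest lives inside the archive directory, it migrates to new storage along with the data it protects.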
:-)<br> <p> </div> Mon, 14 May 2012 12:36:50 +0000 On text documents https://lwn.net/Articles/497036/ https://lwn.net/Articles/497036/ apoelstra <div class="FormattedComment"> I was being somewhat tongue-in-cheek, but here is an example of what I meant:<br> <p> Suppose your documents live in Wordstar for Windows 3.1. So you keep Windows 3.1/MS-DOS 5 on a VM for a while. But one day you wake up to Windows 7, and it's 64-bit only, and won't run your old DOS-supporting VM software anymore.<br> <p> (I don't know if this is actually a problem. It's just an example.)<br> <p> So you go ahead and install XP in a VM under Windows 7. On XP, you run a VM containing DOS, on which you run Wordstar.<br> <p> Some years later, XP won't run on a VM since it's 2045 and nobody has heard of BIOS anymore. So you have a VM running Win7, which runs a VM running XP, which runs a VM running DOS, which finally runs Wordstar.<br> <p> Then 25 years later, your VM software doesn't work, so you add another layer...<br> <p> </div> Sun, 13 May 2012 22:54:54 +0000 On text documents https://lwn.net/Articles/497035/ https://lwn.net/Articles/497035/ giraffedata <blockquote> When your OS no longer supports the VM you want, you should just run an old version in a VM, so your document will be on a VM-within-a-VM. Then eventually you'll need a VM-within-a-VM-within-a-VM, and so on... </blockquote> <p> I can't tell what you're describing. Can you phrase this without the word "support" so it's more precise? <p> I'm also unclear on what "the VM you want" is and whether when you say OS, you're talking about a particular instance or a class such as "Fedora". Sun, 13 May 2012 22:29:55 +0000 On text documents https://lwn.net/Articles/497034/ https://lwn.net/Articles/497034/ apoelstra <div class="FormattedComment"> When your OS no longer supports the VM you want, you should just run an old version in a VM, so your document will be on a VM-within-a-VM. 
Then eventually you'll need a VM-within-a-VM-within-a-VM, and so on...<br> </div> Sun, 13 May 2012 22:04:56 +0000 On text documents https://lwn.net/Articles/497027/ https://lwn.net/Articles/497027/ giraffedata <p> The problem is that some day, your old VM won't run on the new VM host, so you have to update the VM operating system, and your old Word won't run on the new VM operating system. <p> You acknowledged the concern that the new VM host might not be able to read a CD, but your solution of a disk image (a VM host file that the VM sees as a disk drive, I presume) has the same problem. Not only do you have to store the disk image file on some medium from which the new VM host can read bits, but the new VM host has to be able to interpret those bits as virtual disk content. <p> <blockquote> I don't think updating all my documents each year would be practical. The idea that it would be practical for a heritage model seems ridiculous. </blockquote> <p> Agreed. <blockquote> The VM is probably the best method we will have to ensure fidelity </blockquote> <p> It still looks to me less likely to succeed than updating the documents. <p> If we solve the problem (i.e. improve our data heritage), I think it will be like the Economist proposes: with agreements among ourselves to maintain archive formats. Sun, 13 May 2012 20:53:14 +0000 On text documents https://lwn.net/Articles/497025/ https://lwn.net/Articles/497025/ philipstorry <div class="FormattedComment"> Whilst I was speaking about personal solutions, it scales to heritage as well.<br> <p> You only need one VM for all your data, because the access method would be to present some storage with the files you want to the VM.<br> (Granted, there may be a point where the lack of USB or CD support on hardware may mean that you have to present it with a disk image, but it's still fairly trivial.)<br> <p> I don't think updating all my documents each year would be practical. 
The idea that it would be practical for a heritage model seems ridiculous.<br> <p> The VM is probably the best method we will have to ensure fidelity. It's the least amount of work for the best return.<br> <p> </div> Sun, 13 May 2012 20:00:47 +0000 On text documents https://lwn.net/Articles/497023/ https://lwn.net/Articles/497023/ giraffedata <p> OK, well I think that misses the point of the article, which talks about "heritage." Keeping your own active data usable is one thing, but a more complex concern is storing data for many generations and having it be usable by society at large at a point when it's considered history. <p> For that, something that requires a significant amount of effort to keep the data vital would probably be more costly than just discarding the data, so people are looking for ways just to stick something in a corner for 50 years, largely forget about it, and still have a decent chance of being able to use it. <p> Updating all your document reading tools each year to be compatible with this year's environment is an example of something so costly we assume it won't be done. In fact, I think updating the documents regularly would be more practical. Sun, 13 May 2012 19:57:04 +0000 On text documents https://lwn.net/Articles/497022/ https://lwn.net/Articles/497022/ philipstorry <div class="FormattedComment"> After you're dead, it'll be a bit difficult. 
;-)<br> <p> But I mean every year that you want to be able to retrieve the documents, you should make sure your VM works, migrate it to new storage if necessary, and (if it's needed) upgrade it to work with the version of VM software you're using.<br> <p> Otherwise, in a decade's time, you'll probably end up firing your VM up, only to find that the image is no longer a supported version and doesn't run anymore.<br> <p> </div> Sun, 13 May 2012 18:59:39 +0000 On text documents https://lwn.net/Articles/496974/ https://lwn.net/Articles/496974/ giraffedata <blockquote> Ultimately, if you want to still be able to access it in the future with decent fidelity, I see only three options. <p> ... <ul> <li>The exact format you're using now, and a VM image you update/migrate yearly </ul> <p> Yes, having to test (and upgrade to later versions if necessary) a VM image every year will be a pain. But it's probably the only reliable way. </blockquote> <p> Do you mean every year forever, even long after you're dead, or just every year while you're creating documents? Sat, 12 May 2012 19:20:31 +0000 Who owns your data? https://lwn.net/Articles/496968/ https://lwn.net/Articles/496968/ jengelh <div class="FormattedComment"> <font class="QuotedText">&gt;As internet services come and go, there will also be issues with preserving data from those sources. 
Much of it is stored in free software databases, though that may make little difference if there is no access to the raw data.</font><br> <p> Sometimes we would be glad if Facebook, Google, et al lost all their user profiling data about us in an instant because of that.<br> </div> Sat, 12 May 2012 15:59:27 +0000 On text documents https://lwn.net/Articles/496967/ https://lwn.net/Articles/496967/ jengelh <div class="FormattedComment"> RTF was an option for saving from Microsoft products back then, and shares the readability of TeX (in principle—WYSIWYG editors like Word had a tendency to not collapse redundant formatting statements, so that font name/color info was repeated for like every paragraph and bullet point).<br> </div> Sat, 12 May 2012 15:56:21 +0000 Who owns your data? https://lwn.net/Articles/496966/ https://lwn.net/Articles/496966/ jengelh <div class="FormattedComment"> The Fury3 game CD has videos in Cinepak format, and today's MPlayer still recognizes and plays them.<br> </div> Sat, 12 May 2012 15:51:14 +0000 Who owns your data? https://lwn.net/Articles/496932/ https://lwn.net/Articles/496932/ rgmoore <p>Didn't Real Audio eventually release an Open Source version of their player? Fri, 11 May 2012 21:27:23 +0000 On text documents https://lwn.net/Articles/496894/ https://lwn.net/Articles/496894/ eru Still have to disagree here. I'm pretty sure I could port web2c to a new platform in an evening or two, provided it has a decent ANSI C compiler (which is now a very common piece of infrastructure and can legitimately be assumed). Porting DOSBOX would be a much larger task, unless the new target is very similar to some of the existing ones. Yes, there is more documentation about x86 and DOS, because a lot more is needed to describe the complicated and ugly interface, and it is still incomplete...<br> I have found bugs in DOSBOX, which I currently use to support some legacy cross-compilation tools at my workplace. 
Also used DOSEMU+FreeDOS for the same task, and found it has some different bugs... I could work around the problems for the limited set of programs that were needed. But the fact is the only thing that is completely MS-DOS compatible for all programs still is the original MS-DOS. Fri, 11 May 2012 17:07:34 +0000 On text documents https://lwn.net/Articles/496883/ https://lwn.net/Articles/496883/ iabervon <div class="FormattedComment"> web2c (plus a C compiler) is an implementation of WEB and \ph, and is of comparable portability and complexity to dosbox, which will run your old word processors. TeX's source does contain a lot of documentation about the expected Pascal dialect and the preprocessor; but there's even more documentation about the x86 and DOS. Your old DOS programs don't come with an extensive description of the platform they run on, but they're also not the only things that use that platform, so they don't have to.<br> <p> </div> Fri, 11 May 2012 16:03:58 +0000 Who owns your data? https://lwn.net/Articles/496851/ https://lwn.net/Articles/496851/ ebirdie <div class="FormattedComment"> Sounds very familiar, although I worked for a company which produced learning and teaching materials, such as books. I once argued that the company's Quark files, which represented thousands upon thousands of paid work hours (from script writers to layout designers) and were possibly all reusable, weren't considered worth any work hours when upgrading away from the old files; the only goal was to have the latest and greatest software. Well, the files' ownership changed to a big corporation later, for which I didn't give a penny.<br> <p> However, Quark files are an exception. The end product of publishing still finds its way to paper many times; on paper it gets distributed, and it can't be held in such a stranglehold of file formats, DRM, cloud services and the like as digital information can. 
As the famous phrase goes, "information wants to be free", yet it is still too easy to forget that this freedom has independence coded into it. Everyone here knows what free code is, but it seems it hasn't yet produced information as free as paper still does, although there are arguably more restrictions printed on paper nowadays.<br> <p> Seeing the current and future challenges and threats in maintaining free information, I'm glad that paper was the medium humankind invented first. At least paper offers some reference point. It seems to be good business to reinvent everything digitally, so digital information is doomed. It is much cheaper and less effort for me to give space to my books and to carry them when moving (except that I'll do everything in my power not to move anymore) than to do the work required to keep digital information usable and accessible.<br> </div> Fri, 11 May 2012 10:37:37 +0000 On text documents https://lwn.net/Articles/496825/ https://lwn.net/Articles/496825/ Cyberax <div class="FormattedComment"> Except that you've probably used some form of TeX macro library (MikTeX, LaTeX, etc), not raw TeX. In which case you have to hunt down all the dependencies and pray that they work.<br> </div> Fri, 11 May 2012 05:07:54 +0000 On text documents https://lwn.net/Articles/496823/ https://lwn.net/Articles/496823/ eru <i>Of course if you had used TeX 20 years ago, the document would look exactly the same today, even down to the line-breaks, and be in an easily editable form.</i> <p> I mostly agree from personal experience. I have some large LaTeX documents that were started that long ago, and which I still maintain now and then. Not quite pure LaTeX, because they contain diagrams that were done with xfig (but that also is still available, and quite good for simple diagrams). Some changes in LaTeX (mainly the transition to 3.x) required minor changes to the source, but these were limited just to the macro settings at the beginning of the document. 
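The "macro settings at the beginning of the document" eru mentions are the LaTeX preamble, which is typically the only region that needs touching across tool upgrades. A purely illustrative sketch; the specific package choices here are assumptions for the sake of example, not eru's actual setup:

```latex
% Illustrative preamble only; package choices are assumptions.
\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}   % modern font encoding (a typical migration step)
\usepackage{mathptmx}      % PostScript fonts for improved PDF output
\usepackage{graphicx}      % include diagrams exported from xfig
\begin{document}
% ... twenty-year-old body text, unchanged ...
\end{document}
```

The body text below `\begin{document}` can stay untouched for decades while only these few preamble lines track the evolving toolchain.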
Also I started to use some PostScript-related font packages for much improved PDF output, which slightly changed final layout. But the bulk of the text has not needed any changes attributable only to the formatting tool evolution. Supposing I had not been maintaining the documents for 20 years, suddenly getting them formatted with current versions of the tools might be slightly more work, but not much. Fri, 11 May 2012 04:49:20 +0000 On text documents https://lwn.net/Articles/496822/ https://lwn.net/Articles/496822/ eru <i>(Not to mention that building TeX requires implementations of at least two language dialects (WEB and \ph) which aren't used for anything else on any modern system; it's easier to make an emulator for the computers that Wordperfect ran on than to make a compiler able to build TeX, although people have done both.)</i> <p> Huh? Most Linux distributions provide a TeX package. I believe it is built using a portable C implementation of WEB (web2c), which is a source-to-source translator. So just C is required for that part. Browsing the READMEs of a recent TeX for Linux implementation (http://www.tug.org/svn/texlive/trunk/Build/) there certainly are also other dependencies for building and auxiliary programs, but that is stuff that typical Linux implementations already provide. Of course bootstrapping TeX for a very different computer and OS from scratch would be a lot of work, but at least it is possible, thanks to the good documentation of TeX and its source. Fri, 11 May 2012 04:33:32 +0000 Who owns your data? https://lwn.net/Articles/496821/ https://lwn.net/Articles/496821/ ringerc <div class="FormattedComment"> The newspaper I work for is suffering from this at the moment.<br> <p> We have a vast library of material in QuarkXPress 3.3 and QuarkXPress 4.0 format. 
Quark has never been what you'd call an "open" company; this is, after all, the company whose CEO has said that "all customers are liars, thieves and bastards".<br> <p> Quark upgrades are expensive. Old versions of Quark don't work on newer OSes, and Quark doesn't fix even simple bugs in old versions. New versions of Quark don't import documents from old versions all that reliably, especially where things like font format changes are involved. More importantly, if you move to a non-Quark product, you lose access to all your historical work, because you have to keep on paying Quark upgrade fees to retain access to it on updated systems.<br> <p> We landed up keeping an old Mac around to open Quark docs, and another slightly-less-old machine that has an importer plugin for InDesign that lets us open old Quark docs, convert them to a slightly less old InDesign format, save that, and open it in our current versions of InDesign.<br> <p> Of course, InDesign has exactly the same problems as Quark; it's a locked down format under Adobe's total control. The problem continues to grow as we produce more work.<br> <p> While everything is in PDF format too, that's not much good if we need to edit it - and there simply are no good open standard desktop publishing formats. OpenDocument is very poorly suited to DTP's layout-oriented approach, detailed typography, etc. Scribus's format isn't specified formally, is painful to work with, evolves continuously, and may as well be closed because nothing else supports it. There isn't anything else out there.<br> <p> My point: Sometimes we'd like to avoid closed formats, but there aren't any alternatives to choose. 
The newspaper's technical debt keeps on growing, and there's not much I can do about it, as we're way too small to have the resources to create a competing format and support for it.<br> </div> Fri, 11 May 2012 04:03:27 +0000 On text documents https://lwn.net/Articles/496813/ https://lwn.net/Articles/496813/ iabervon <div class="FormattedComment"> 20 years ago, I only had a dot-matrix printer, and TeX isn't really set up to deal with extremely limited positioning granularity. Certainly everything I've written in the last 15 years has been in TeX unless it's been in HTML or something which renders to HTML. But TeX can actually be kind of problematic: you have to modify the file in order to avoid generating recto and verso pages, which are inappropriate for e-readers (or, really, any presentation form which doesn't involve dead trees). And TeX documents actually often make a lot of assumptions about the form of the result, which means that there isn't machine-readable information available to produce other presentations reasonably.<br> <p> (Not to mention that building TeX requires implementations of at least two language dialects (WEB and \ph) which aren't used for anything else on any modern system; it's easier to make an emulator for the computers that Wordperfect ran on than to make a compiler able to build TeX, although people have done both.)<br> </div> Fri, 11 May 2012 02:34:10 +0000 On text documents https://lwn.net/Articles/496809/ https://lwn.net/Articles/496809/ mrons <div class="FormattedComment"> Of course if you had used TeX 20 years ago, the document would look exactly the same today, even down to the line-breaks, and be in an easily editable form.<br> </div> Thu, 10 May 2012 23:42:17 +0000 Even free software can age less than gracefully https://lwn.net/Articles/496789/ https://lwn.net/Articles/496789/ roblatham <div class="FormattedComment"> The story is not limited to proprietary formats. 
I found a 10-year-old Gnucash file generated with gnucash-1.6, but today's Gnucash does not know how to read that file.<br> <p> In the end, I installed Debian sarge in a chroot (thanks archive.debian.org!) so I could run gnucash 1.8 without building the 20 little dependencies. <br> </div> Thu, 10 May 2012 20:47:41 +0000 On text documents https://lwn.net/Articles/496726/ https://lwn.net/Articles/496726/ iabervon <div class="FormattedComment"> Text documents are actually harder than a lot of things, in that the instructions to the system can be much more complicated. Audio ultimately amounts to moving some speakers, and video to coloring some dots, but text documents have a lot of information about font choice and positioning rules, as well as information on how the glyphs go in sequence. I expect to be able to cut-and-paste a paragraph out of a text document and put it in a document with a different font and a different width, and have line breaks put in appropriate places and my exponents and subscripts turn up as exponents and subscripts, and I expect to get the paragraph as a whole and not get text from the adjacent column or the page number (even if the paragraph is split across pages). This sort of information is not available as part of the content of most other sorts of file, even with the original software, so there's nothing to degrade with version changes.<br> </div> Thu, 10 May 2012 17:36:41 +0000 Who owns your data? https://lwn.net/Articles/496711/ https://lwn.net/Articles/496711/ pbonzini The <a href="http://jordanmechner.com/blog/2012/04/source/">Floppies with Prince of Persia source code</a> for the Apple II were also successfully read after 20 years! I don't recall 1.44 MB floppies being particularly durable, though. Thu, 10 May 2012 14:43:51 +0000 Who owns your data? 
https://lwn.net/Articles/496706/ https://lwn.net/Articles/496706/ nsheed <div class="FormattedComment"> Speaking from personal experience, you have to be very careful even when saying specific formats are good/bad.<br> <p> As an example, TIFF of all things has proved to be an ongoing source of pain due to the joys of a) odd JPEG usage in older files, b) explicitly allowing for vendor extensions in the specification (annotations &amp; highlighting that appear/disappear depending on the viewing app).<br> <p> So far most of the issues of this type are work-aroundable, the issue is every time we hit a new scenario it takes time to investigate/find workarounds (if possible)/go back to the source (again if possible - discovery of an issue may be months/years after file creation).<br> <p> </div> Thu, 10 May 2012 14:13:28 +0000 Who owns your data? https://lwn.net/Articles/496701/ https://lwn.net/Articles/496701/ pboddie <blockquote>They have archived copies of the Microsoft document format specifications; much as we might dislike it, the content they need to preserve is the content created by most of the populace.</blockquote> <p>Although welcome, this raises additional issues. Given this apparent safety net, people are now likely to say "Great, we're covered!" And then they will carry on churning out proprietary format content. But we are <em>not</em> covered.</p> <p>Firstly, we don't even know if the specifications are complete or accurate. This is Microsoft we're talking about, so although it is possible that these published specifications have had some auditing as part of a regulatory action in the European Union, we can't be sure that they are usable until someone produces a separate implementation.</p> <p>Secondly, people will happily start producing content in later versions of those formats which aren't covered by publicly available specifications. 
Again, we're talking about Microsoft, so any remedy for trouble they have managed to get themselves into will only last as long as the company is under scrutiny. Then, it's back to business as usual. Meanwhile, nobody in wider society will have been educated about the pitfalls of such proprietary formats and systems.</p> <p>Thirdly, the cost of preservation under such initiatives may well be borne by the people whose data is now imprisoned in such formats, instead of the people responsible for devising the format in the first place. In various environments, there are actually standards for archiving, although I can well imagine that those responsible for enforcing such standards have been transfixed by the sparkle of new gadgetry, the soothing tones of the sales pitch, and the quick hand-over of an awkward problem to a reassuring vendor. Public institutions and the public in general should not have to make up the shortfall left by the vendors' lack of investment.</p> <p>Finally, standards compliance is awkward enough even when standards are open and documented. One can argue that a Free Software reference implementation might encourage overdependence on a particular technology and its peculiarities, potentially undermining any underdocumented standard, but this can really only be fixed when you have a functioning community and multiple Free Software implementations: then, ambiguities and inconsistencies are brought to the surface and publicly dealt with.</p> <p>Sustainable computing and knowledge management require a degree of redundancy.
Mentions of the celebrated case of the BBC Domesday Project often omit the fact that efforts were made to properly document the technologies involved - it is usually assumed that nobody had bothered, which is not the case - but had that project been able to take advantage of widely supported, genuinely open standards, misplacing documentation would have had a substantially smaller impact on preservation activities.</p> <p>Indeed, with open formats and appropriate licensing of the content, the output of the project might have been <em>continuously</em> preserved, meaning that the content and the means of deploying it would have adapted incrementally as technology progressed. That's a much more attractive outcome than sealing some notes in a box and hoping that future archaeologists can figure them out.</p> Thu, 10 May 2012 13:55:40 +0000 Digital Restrictions Management https://lwn.net/Articles/496683/ https://lwn.net/Articles/496683/ robbe <div class="FormattedComment"> Contrary to Jake, I actually see DRM as the main driver behind the invention of new proprietary formats. Sure, laziness can induce programmers to just shove their data structures on disk, but outside of the embedded space, I see that less and less often. Actually, this prime virtue of all programmers works the other way too: there are readily available libraries that help to export PDF, ODF, MPEG, etc. -- why not just rely on these and be done?<br> <p> That's where DRM comes in. Since it is basically an arms race, there is always motivation to crank out new schemes and formats.<br> <p> The most problematic restrictions management requires an online server to open a document. When this server inevitably goes away, the only hope of future historians is that they can easily crack our puny crypto on their (quantum?) computers.<br> </div> Thu, 10 May 2012 11:45:18 +0000 Who owns your data?
https://lwn.net/Articles/496681/ https://lwn.net/Articles/496681/ robbe <div class="FormattedComment"> You are talking about a recent development.<br> <p> May I remind you of RealAudio, Indeo, Cinepak, etc.? Videos of that time (the 1990s) were generally too crappy to remember, but a lot of actually useful audio recordings are still locked up in the RA format.<br> <p> </div> Thu, 10 May 2012 11:15:31 +0000