LWN.net Logo

Not again

Not again

Posted Dec 29, 2010 13:04 UTC (Wed) by man_ls (subscriber, #15091)
In reply to: Not again by Nick
Parent article: Ext4 filesystem hits Android, no need to fear data loss (ars technica)

It's not that developers don't want durability: quite often they do, for example after saving a file to disk. But in those instances it is safe and clean to provide it by using fsync(). I don't care if my application pauses for a few seconds after saving a file to disk, but at that point I want durability. I don't want it to stop every minute when it saves an emergency backup file, but neither do I want it to corrupt said emergency file; at those moments I want atomicity.

I was probably not clear enough when speaking about "developers", since they are not here the determinant force here but merely a vehicle for users needs. Users want robust filesystems which do not corrupt their data; application developers just want to do their stuff without being bothered by the underlying filesystem. It is filesystem developers which need to provide for those seemingly disjoint requirements by providing atomicity always (again, merely a vehicle for robust filesystems) and durability sometimes (only when asked for explicitly).

If POSIX does not require atomicity in renames and appends, then are filesystem developers free to corrupt user files and claim to be POSIX-compliant? Ask XFS developers, which did corrupt files once in a while for a few years and lost most of their users in the process, what good this compliance does. If ext4 did this then the whole Linux ecosystem would suffer in the process as it is the default filesystem.

Implementing filesystems on a database-like layer is what Microsoft Longhorn set out to do, one of the reasons why it was delayed some six years, and also why the final Windows Vista was so utterly bad: a lot of resources were spent in what was, simply put, bad engineering. I would have said "in retrospect" but for many people it was obviously a mistake from the beginning. There have been a few other contenders in the filesystem-on-a-databse category and IIRC all failed.


(Log in to post comments)

Not again

Posted Dec 30, 2010 12:07 UTC (Thu) by Nick (guest, #15060) [Link]

Then you use the normal posix API to write your emergency backup files. I still fail to see what is missing here.

Write the file (to a new name, obviously, don't overwrite the old one until the new one is durable), and optionally kick off an asynchronous writeback of that file. At some point in future when durability is required (eg. if you guarantee some window or interval on your backed up data), then you would fsync (which should be mostly noop if the asynchronous writeback already completed). And then perform some checkpoint such as a lock file or sqlite transaction or something that indicates the file is now durable.

It's really not a big deal to implement some simple transactional-like recovery protocol on top of the POSIX API, and it has been done in these ways for a long time. It seems like a new wave of programmers has just forgotten about this.

Or, if you want to do slow, synchronous operations without blocking the whole app, that's trivial too -- just use processes or threads to do those slow operations.

Not again

Posted Dec 30, 2010 13:43 UTC (Thu) by man_ls (subscriber, #15091) [Link]

What is missing is that you don't really want all the files you write to be durable, even if you think you do: many of them are just temporary files which won't perhaps be needed in the future, and for the rest the write operation can be delayed a few seconds with no ill effects. You just want your renames to be atomic: either keep the old file or overwrite it with the new one. And the same with appends. It's not really important which version (old or new) is left on disk, but do never leave corrupted files (empty files or files with unitialized sectors).

You can paper over these inefficiencies with new APIs; you can require all developers to use them before you make guarantees about not corrupting data; or you can force them to use threads or processes. But those are not very elegant ways to deal with the problem. I prefer to have a filesystem which silently deals with the problem without corrupting my data, like we did with ext3 in the good old times.

But once again, please read the original articles where these issues were truly beaten to death.

Not again

Posted Dec 31, 2010 1:00 UTC (Fri) by Nick (guest, #15060) [Link]

If you don't need files to be durable, then don't fsync them!

If there is a crash, then you don't have to bother with atomicity, just delete the files (which is possible because you didn't require durability).

These aren't inefficiencies, they are actually efficiencies, because they expose an API that is simple and efficient to implement for the memory management and filesystems. You can build arbitrarily more complex protocols on top of that.

And yet again here we are

Posted Dec 31, 2010 14:08 UTC (Fri) by man_ls (subscriber, #15091) [Link]

What I want is not a durable file, but an atomic rename. Not an empty file, but either the old version or the new.

Apparently the atomic rename is too confusing, or I am not explaining the use case clearly. Think about atomic appends to files then. When you append some sectors atomically, you want either the old file without the new sectors or the file with the new sectors at the end. What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.

An atomic append would guarantee that the new sectors would either be present with the expected contents or not present at all, but never contain random garbage. We don't need an fsync after every sector, nor would it solve the problem: after the write but before the fsync the filesystem would be in an incoherent state anyway, albeit for a short time.

And yet again here we are

Posted Jan 5, 2011 12:03 UTC (Wed) by nye (guest, #51576) [Link]

>What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.

To be fair there was more wrong with XFS than that - it could also corrupt *entirely unrelated* files. Personally I stopped using it when a power cut trashed /etc/passwd (and presumably a load of other files that were less obvious) despite nothing having it open at the time.

Not again

Posted Dec 31, 2010 16:08 UTC (Fri) by etienne (subscriber, #25256) [Link]

How about, when you want to save the new configuration file, not trying to replace the old config file - just keep it in case the new config file is incomplete, still not created or corrupted (after crash)?
Should work with any filesystem.

Not again

Posted Dec 31, 2010 16:21 UTC (Fri) by man_ls (subscriber, #15091) [Link]

And, the next time? Would you keep all the old files? You can also keep two of them, but how do you know which one of them was the last one written? What if one of them is half-written? That way leads to madness. Just to avoid a feature which has worked on ext3 (and probably all other popular POSIX filesystems) for decades, and which other OSs are trying to copy.

Not again

Posted Dec 31, 2010 19:05 UTC (Fri) by etienne (subscriber, #25256) [Link]

> Just to avoid a feature which has worked on ext3 for decades

Well, for decades you did not have hard drives with internal command queueing.
To have better performances you need to keep the queue full.
Because you cannot tell the hard drive that updating this sector is more important than that one, that information is probably not managed at all in the driver.
Moreover, you asked for this behaviour by wanting only metadata journaling of the filesystem, explicitely wanting a coherant filesystem (i.e. no fsck after crash) even if it means data inside files may be corrupted.
You can run with data journaling, but people/distributions thinks it does not worth the performance hit.

Once again, reenacting history

Posted Dec 31, 2010 20:21 UTC (Fri) by man_ls (subscriber, #15091) [Link]

It didn't happen that way. I did explicitly use a journaling filesystem thinking that "journaling" meant that it did not lose data in the event of a crash. When I found out that only metadata was guaranteed to be consistent, I thought "What a sham". Then I (and millions of other people) immediately switched to another filesystem: ext3 with data=ordered, which did offer better guarantees (data journaling) at the cost of performance. Who wants performance when your files are being corrupted?

The funny thing is that XFS developers eventually realized their folly and solved the atomicity issues, but now people don't trust them with their data anymore.

Not again

Posted Dec 31, 2010 22:14 UTC (Fri) by neilbrown (subscriber, #359) [Link]

1/ ext3 is the only filesystem I know of that forces and fsync before committing a rename.
2/ ext3 is about 10 years old, so it hasn't been around for "decades" unless you mean "0.9 decades".
3/ When rename was first introduced into Unix in the BSD, it was atomic in the sense that even in the event of a crash there would always be a file with the destination name, either the original or the new. This is in contrast to the previous behaviour. which required:
- create "file.tmp"
- unlink "file"
- link "file.tmp" to "file"
- unlink "file.tmp"

which can easily leave nothing called "file". This is all that "atomic rename" means, or at least all it meant before ext3 gave rename unfortunate (though useful) semantics.

Though I cannot know the intention of the author of that post you linked to, there is no prima-face reason to believe they mean anything more than the atomicity of names (not of contents) that rename has always had in Unix.

(and half-written files are easy to detect by writing a checksum at the end. If you suffix each file with a timestamp it is easy to know which is the most recent. And file older than a few minutes will be safe-on-disk so you are always free to clean up any file older than the youngest file that is older than a few minutes)

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds