Not again

Posted Dec 30, 2010 13:43 UTC (Thu) by man_ls (subscriber, #15091)
In reply to: Not again by Nick
Parent article: Ext4 filesystem hits Android, no need to fear data loss (ars technica)

What is missing is that you don't really want all the files you write to be durable, even if you think you do: many of them are just temporary files which may never be needed again, and for the rest the write operation can be delayed a few seconds with no ill effects. You just want your renames to be atomic: either keep the old file or overwrite it with the new one. And the same with appends. It's not really important which version (old or new) is left on disk, but never leave corrupted files (empty files or files with uninitialized sectors).

You can paper over these inefficiencies with new APIs; you can require all developers to use them before you make guarantees about not corrupting data; or you can force them to use threads or processes. But those are not very elegant ways to deal with the problem. I prefer to have a filesystem which silently deals with the problem without corrupting my data, like we did with ext3 in the good old times.

But once again, please read the original articles where these issues were truly beaten to death.



Not again

Posted Dec 31, 2010 1:00 UTC (Fri) by Nick (guest, #15060)

If you don't need files to be durable, then don't fsync them!

If there is a crash, then you don't have to bother with atomicity, just delete the files (which is possible because you didn't require durability).

These aren't inefficiencies, they are actually efficiencies, because they expose an API that is simple and efficient to implement for the memory management and filesystems. You can build arbitrarily more complex protocols on top of that.

And yet again here we are

Posted Dec 31, 2010 14:08 UTC (Fri) by man_ls (subscriber, #15091)

What I want is not a durable file, but an atomic rename. Not an empty file, but either the old version or the new.
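
The usual way an application asks for "either the old version or the new" can be sketched as follows. This is only an illustration, not anything posted in this thread: the helper name `replace_file` and its `durable` flag are hypothetical. It writes the new contents to a temporary file in the same directory and then renames it over the target. Without the fsync, whether the new data reaches the disk before the rename is committed depends on the filesystem, which is exactly what is being argued about here.

```python
import os
import tempfile


def replace_file(path, data, durable=False):
    """Write data to a temporary file next to path, then rename it over path.

    Hypothetical helper for illustration. With durable=False no fsync is
    issued, so crash behaviour depends on the filesystem's ordering rules.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be a single directory operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                # Force the file contents to stable storage before renaming.
                os.fsync(tmp.fileno())
        # Replace the destination name in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling `replace_file("app.conf", b"...")` is the no-fsync variant; passing `durable=True` adds the fsync that the other side of this thread recommends.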

Apparently the atomic rename is too confusing, or I am not explaining the use case clearly. Think about atomic appends to files then. When you append some sectors atomically, you want either the old file without the new sectors or the file with the new sectors at the end. What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.

An atomic append would guarantee that the new sectors would either be present with the expected contents or not present at all, but never contain random garbage. We don't need an fsync after every sector, nor would it solve the problem: after the write but before the fsync the filesystem would be in an incoherent state anyway, albeit for a short time.

And yet again here we are

Posted Jan 5, 2011 12:03 UTC (Wed) by nye (guest, #51576)

>What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.

To be fair there was more wrong with XFS than that - it could also corrupt *entirely unrelated* files. Personally I stopped using it when a power cut trashed /etc/passwd (and presumably a load of other files that were less obvious) despite nothing having it open at the time.

Not again

Posted Dec 31, 2010 16:08 UTC (Fri) by etienne (subscriber, #25256)

How about, when you want to save the new configuration file, not trying to replace the old config file - just keep it in case the new config file is incomplete, still not created or corrupted (after crash)?
Should work with any filesystem.

Not again

Posted Dec 31, 2010 16:21 UTC (Fri) by man_ls (subscriber, #15091)

And, the next time? Would you keep all the old files? You can also keep two of them, but how do you know which one of them was the last one written? What if one of them is half-written? That way leads to madness. Just to avoid a feature which has worked on ext3 (and probably all other popular POSIX filesystems) for decades, and which other OSs are trying to copy.

Not again

Posted Dec 31, 2010 19:05 UTC (Fri) by etienne (subscriber, #25256)

> Just to avoid a feature which has worked on ext3 for decades

Well, for decades you did not have hard drives with internal command queueing.
To get better performance you need to keep the queue full.
Because you cannot tell the hard drive that updating this sector is more important than that one, that information is probably not managed at all in the driver.
Moreover, you asked for this behaviour by wanting only metadata journaling of the filesystem, explicitly wanting a coherent filesystem (i.e. no fsck after crash) even if it means data inside files may be corrupted.
You can run with data journaling, but people/distributions think it is not worth the performance hit.

Once again, reenacting history

Posted Dec 31, 2010 20:21 UTC (Fri) by man_ls (subscriber, #15091)

It didn't happen that way. I did explicitly use a journaling filesystem thinking that "journaling" meant that it did not lose data in the event of a crash. When I found out that only metadata was guaranteed to be consistent, I thought "What a sham". Then I (and millions of other people) immediately switched to another filesystem: ext3 with data=ordered, which did offer better guarantees (data journaling) at the cost of performance. Who wants performance when your files are being corrupted?

The funny thing is that XFS developers eventually realized their folly and solved the atomicity issues, but now people don't trust them with their data anymore.

Not again

Posted Dec 31, 2010 22:14 UTC (Fri) by neilbrown (subscriber, #359)

1/ ext3 is the only filesystem I know of that forces an fsync before committing a rename.
2/ ext3 is about 10 years old, so it hasn't been around for "decades" unless you mean "0.9 decades".
3/ When rename was first introduced into Unix in BSD, it was atomic in the sense that even in the event of a crash there would always be a file with the destination name, either the original or the new. This is in contrast to the previous behaviour, which required:
- create "file.tmp"
- unlink "file"
- link "file.tmp" to "file"
- unlink "file.tmp"

which can easily leave nothing called "file". This is all that "atomic rename" means, or at least all it meant before ext3 gave rename unfortunate (though useful) semantics.
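
As a rough sketch of the difference (the function names are made up for illustration, and "file.tmp" is assumed to have been created already), the two approaches look like this on a POSIX system:

```python
import os


def replace_by_link(tmp, dest):
    """Pre-rename sequence: unlink the old name, link the new one in."""
    os.unlink(dest)     # a crash right after this leaves nothing called dest
    os.link(tmp, dest)  # dest now names the new file
    os.unlink(tmp)      # drop the temporary name


def replace_by_rename(tmp, dest):
    """Single rename: dest always names either the old or the new file."""
    # On POSIX systems os.rename replaces an existing destination file.
    os.rename(tmp, dest)
```

Note that both versions only say something about names. Neither one says anything about whether the contents of the new file have reached the disk.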

Though I cannot know the intention of the author of that post you linked to, there is no prima facie reason to believe they mean anything more than the atomicity of names (not of contents) that rename has always had in Unix.

(And half-written files are easy to detect by writing a checksum at the end. If you suffix each file with a timestamp it is easy to know which is the most recent. Any file older than a few minutes will be safe on disk, so you are always free to clean up any file older than the youngest file that is older than a few minutes.)
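
A minimal sketch of that scheme, under assumptions of my own: the checksum is a SHA-256 digest appended to the payload, and files are named "<base>.<integer timestamp>". The names `seal`, `unseal` and `newest_valid` are hypothetical.

```python
import hashlib
import os

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(payload: bytes) -> bytes:
    """Return the payload with its SHA-256 digest appended."""
    return payload + hashlib.sha256(payload).digest()


def unseal(blob: bytes):
    """Return the payload if the trailing digest matches, else None."""
    if len(blob) < DIGEST_SIZE:
        return None  # too short to even hold a digest: half-written
    payload, digest = blob[:-DIGEST_SIZE], blob[-DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        return None  # truncated or corrupted
    return payload


def newest_valid(directory, base):
    """Find the most recent intact file named '<base>.<timestamp>'.

    Returns (name, payload), or None if no intact file exists.
    """
    prefix = base + "."
    candidates = []
    for name in os.listdir(directory):
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdecimal():
            candidates.append((int(suffix), name))
    # Try the newest timestamp first and fall back to older files.
    for _, name in sorted(candidates, reverse=True):
        with open(os.path.join(directory, name), "rb") as f:
            payload = unseal(f.read())
        if payload is not None:
            return name, payload
    return None
```

Cleaning up older files is left out here; it would follow the age rule described above.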

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds