
Better than POSIX?


Posted Mar 17, 2009 15:56 UTC (Tue) by ssam (subscriber, #46587)
Parent article: Better than POSIX?

what seems unanswered is if you can overwrite an existing file, in such a way that you either have the old or the new version, without forcing the physical write to disk.

i think a lot of people are willing to risk recent changes to a file to get performance gains and power saving. but no one wants to risk completely losing a file just because you wrote to it recently.

in the sequence
1) open foo.new
2) write to foo.new
3) close foo.new
4) move foo.new to foo
all that needs to happen is that step 4 does not hit the disk before step 2 has.

it seems that step 2 gets delayed, because that gives performance/power-usage gains, but step 4 happens quickly.
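
The four steps above can be sketched in Python as follows. This is only an illustration of the sequence under discussion, not a recommended pattern: the function name `replace_without_sync` and the `.new` suffix are made up for the example, and nothing here forces data to disk, so the ordering of steps 2 and 4 on disk is left entirely to the filesystem.

```python
import os


def replace_without_sync(path, data):
    """Write data to path + '.new', then rename it over path.

    No fsync is issued, so after a crash the rename may have reached
    the disk before the file contents did (the case discussed above).
    """
    tmp = path + ".new"
    with open(tmp, "wb") as f:   # 1) open foo.new
        f.write(data)            # 2) write to foo.new
    # 3) foo.new is closed when the with block ends
    os.replace(tmp, path)        # 4) move foo.new to foo (rename(2) on POSIX)
```

In a running system this always leaves either the old or the new contents visible under `path`; the open question in this thread is what survives a crash.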



Better than POSIX?

Posted Mar 17, 2009 19:51 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

no, currently there is no API in existence for any filesystem (including ext3) that will _guarantee_ that you have either the old or the new filesystem on disk after a crash with no possibility of garbage (even excluding things that damage the drives themselves)

the probability of having problems varies from filesystem to filesystem, from mount option to mount option within a single filesystem, and is even dependent on kernel tuning parameters that can be set in /etc/sysctl.conf

if you want that guarantee you need to force a write to disk (and make sure that your disk drive doesn't cache or reorder the writes)
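
A minimal sketch of what "force a write to disk" usually looks like from userspace, assuming a POSIX system where a directory can be opened and passed to fsync (true on Linux). The function name `durable_replace` and the `.new` suffix are hypothetical, and as noted above this still depends on the drive not caching or reordering the writes.

```python
import os


def durable_replace(path, data):
    """Replace path with data, forcing each step to disk in order."""
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                # push Python's buffer to the kernel
        os.fsync(f.fileno())     # ask the kernel to write the file data out
    os.replace(tmp, path)        # rename only after the data was synced

    # Sync the containing directory so the rename itself is written out.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is the one this thread is about: every call waits for the disk, which is exactly the forced physical write the original question hoped to avoid.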

it would be nice if there were a write barrier API that let you insert an ordering requirement from userspace without forcing a write to disk, but at this point in time I don't believe that such an API exists.

people were relatively happy with the ~5 second vulnerability window in ext3, but are getting bitten by the ~60 second vulnerability window that ext4 defaulted to.

Ok, that's a reasonable problem, and adjusting ext4 to use a smaller window (possibly even by default) is a reasonable thing. the idea that Ted is working on to have ext4 honor the 'how much time am I willing to lose' parameter that was introduced for laptop mode is a good idea in that it's clear what you are adjusting, and it lets you tie all similar types of things to the one parameter (as opposed to configuring different aspects in different places)

but shortening this time will cost performance. I doubt that the ext4 developers selected the perfect time (I doubt that there is a single perfect time to select), and having it as an easy tunable will let people plot out what the performance/reliability curve looks like. it may be that going from 5 seconds to 10 seconds gains 80% of the performance that going from 5 seconds to 60 seconds gives you (for common desktop workloads ;-) and distros may choose to move the default there.
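
For anyone who wants to see where their own system currently sits, here is a rough sketch that reads the kernel's writeback timing knobs from /proc/sys/vm. The file names `dirty_expire_centisecs`, `dirty_writeback_centisecs` and `laptop_mode` are real Linux sysctls, but treating them as the parameter Ted has in mind is an assumption on my part; the function name `read_writeback_knobs` is made up for the example.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")
KNOBS = ("dirty_expire_centisecs", "dirty_writeback_centisecs", "laptop_mode")


def read_writeback_knobs(vm_dir=VM_DIR):
    """Return {knob name: integer value, or None if it cannot be read}."""
    values = {}
    for name in KNOBS:
        try:
            values[name] = int((Path(vm_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file (non-Linux system) or unexpected contents.
            values[name] = None
    return values
```

The two `*_centisecs` values are in hundredths of a second, so a value of 3000 corresponds to 30 seconds.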

the problem in this situation is that people are mistakenly believing that this vulnerability didn't exist before. it did, it just wasn't as large a window, so fewer people were hitting it.
Also they didn't have any other experiences to compare it to. If ext3 had also lost data in similar situations, people would be used to 'if the system crashes and your app didn't do fsync you will lose data modified in the last couple of minutes' and consider that reasonable.

Better than POSIX?

Posted Mar 17, 2009 21:16 UTC (Tue) by foom (subscriber, #14868) [Link]

The problem isn't that the time-window is bigger, it's that even if you crashed within the 5sec
window in ext3, you are almost guaranteed to end up with the old version of the file.

If you crashed within a 60sec window in (pre-fix) ext4, you were almost guaranteed to end up with
a 0-byte file. That's a rather stark difference.

Of course there's no absolute guarantees...if your kernel is crashing particularly hard, the filesystem
driver could go berserk and write 99 red balloons all over the disk. Nothing in the design of the FS
could prevent that...but that's *extraordinarily* rare.

Better than POSIX?

Posted Mar 19, 2009 10:44 UTC (Thu) by Los__D (guest, #15263) [Link]

berserk and write 99 red balloons all over the disk

What a fantastic idea for an easter egg... :)

"What, bug? No no no, it was just a joke. Your data is... Well, buried in balloons."

Better than POSIX?

Posted Mar 17, 2009 21:44 UTC (Tue) by man_ls (subscriber, #15091) [Link]

people were relatively happy with the ~5 second vulnerability window in ext3, but are getting bit by the ~60 second vulnerability window that ext4 defaulted to.
Well, if the 5-second window had left my systems vulnerable to data corruption (in the form of zero-length files) I know I would not have been happy. After all, that 5-second window happens every 5 seconds, so you are virtually guaranteed to run into the problem if it exists, whatever the window size.

I remember my horror after finding out that XFS had lost my data, and I read about XFS devs hiding behind the standard "journalling filesystems only make guarantees about the metadata". "The metadata? fsck the metadata! What I want is my wretched data back!" Now it is all coming back in waves.

Better than POSIX?

Posted Mar 18, 2009 2:40 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

that 5 second window did leave you vulnerable to data corruption (in many ways). you just got lucky

Better than POSIX?

Posted Mar 18, 2009 4:12 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

It wasn't really as bad as it sounds. ext3 also doesn't have delayed allocation. Someone on the Ubuntu bug list posted some testcases and really couldn't make ext3 fail at all. (ext4 fell flat.)

Better than POSIX?

Posted Mar 18, 2009 7:28 UTC (Wed) by man_ls (subscriber, #15091) [Link]

I know my data is vulnerable in many ways. I would just like the most common cases to be addressed. It is not luck. According to the discussions, everyone who has had a crash with (first) XFS or (now) ext4 is getting inconsistent states, while no heavy-duty ext3 users have reportedly seen this kind of corruption. Maybe others, yes, but no zero-length files -- which is the issue under discussion.

Better than POSIX?

Posted Mar 19, 2009 18:33 UTC (Thu) by anton (guest, #25547) [Link]

no, currently there is no API in existence for any filesystem (including ext3) that will _guarantee_ that you have either the old or the new filesystem on disk after a crash with no possibility of garbage (even excluding things that damage the drives themselves)
I am pretty sure that there are file systems around that give such guarantees. This was certainly one of the goals of LinLogFS and LLFS; ok, these filesystems have not come out of the development stage (yet), but I would be surprised if there were no others (ZFS?).

Concerning an API, I don't see a need for an additional API, it's just a matter of whether the file system gives consistency guarantees in case of a crash when using the existing file API, and if so, which ones.

Do you really NEED such an API?

Posted Mar 17, 2009 21:30 UTC (Tue) by khim (subscriber, #9252) [Link]

Ted gave an interesting answer to this request, which I really like: why would your program need such an API? Your system needs such an API - and it's not that hard to implement...

Yes, it'll probably violate POSIX's letter but it'll give much better results in practice.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds