That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.
In the replace-with-rename pattern, it's wrong.
Garrett: ext4, application expectations and power management
Posted Mar 16, 2009 14:59 UTC (Mon) by jamesh (guest, #1159)
Right. There doesn't seem to be a way to do this without requiring that the data be written to disk right now. In these cases, the application is fine with delayed writes -- they just want the ordering of the write and the rename to be preserved.
> That's not wrong. It's just wrong if the application requires that the
> data be on disk after crash, which is what everyone is bitching about.
That isn't what the applications require though. The behaviour they are after is for the rename to be recorded only if the associated writes are also recorded.
It is acceptable if the rename is lost by a system crash. What is not acceptable is for the rename to occur but not the write.
If the application wants to be sure that the data has been flushed before the rename, then yes, it should call fsync().
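For reference, the replace-with-rename pattern under discussion can be sketched roughly as follows. This is a minimal illustration, not code from any of the applications involved: the function name `replace_file` and the `durable` flag are made up for the example. With `durable=False` it relies on the filesystem ordering the data write before the rename, which POSIX does not promise; with `durable=True` it pays for an fsync() before the rename.

```python
import os
import tempfile


def replace_file(path, data, durable=False):
    """Replace `path` with `data` via a temporary file and a rename.

    Hypothetical helper for illustration. `durable=True` forces the new
    data to disk before the rename; `durable=False` leaves the ordering
    of write and rename up to the filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()  # push Python's buffer down to the kernel
            if durable:
                os.fsync(tmp.fileno())  # ask the kernel to write it to disk
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The behaviour being asked for in this thread is roughly the `durable=False` path, with the added promise that a crash never leaves the rename recorded without the data.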
Posted Mar 16, 2009 15:03 UTC (Mon) by drag (subscriber, #31333)
Well they want either the old data or new data to be in a file system after recovering from a crash. Not files full of zeroes...
People are willing to put up with missing X number of seconds of work from the vast majority of applications they are using.
It's actually rare that people want data immediately written to disk. Stuff they want saved very carefully and immediately is generally going to be user-generated data (what you're editing with Emacs) and not automatically generated data (my application remembering the position of icons in my windows).
Forcing an immediate commit to disk seems to be a much bigger hammer than what is wanted. They just want the OS not to corrupt files if it can be helped.
If fsync() is the only way to keep the OS from randomly blowing away files on my hard drive, then so be it. It just seems like there should be a better way.
Posted Mar 17, 2009 10:09 UTC (Tue) by malor (subscriber, #2973)
Atomic rename is not the same thing as fsync. Telling application authors that they have to use fsync is yet another example of, when something is hard to do in Linux, telling the user that what he or she wants is wrong and stupid. This pattern goes way, way back.
Once upon a time, in the early days of Linux, I commented on Slashdot that ext2 was a bad filesystem, and would lose data if the computer crashed or lost power. I was informed, by numerous people, that the data loss was my fault because the computer wasn't on a UPS, and that I should 'simply' have manually run a disk editor and restored a backup superblock to recover the corrupted files. Seriously: lost data, they claimed, was my fault because I didn't understand the layout of ext2 well enough to fire up a hex editor when it crashed.
Well, sometime in the next year or two, journaling showed up, and suddenly everyone was all about how wonderful it was, how horrible ext2 was in comparison, and how no sane person would use ext2 in production. But when I'd said that, when there was no other option, I was wrong and stupid for wanting reliability in my filesystem.
I see this argument the same way; by accident, the ext3 writers provided a very useful feature. Atomic rename isn't fsync; it's much lighter weight. People are not wrong and stupid for wanting it, but because it's hard, that's practically the first thing out of people's mouths. "You can't do that on ext4. That's not the POSIX semantics, and you're foolish to expect this behavior."
I disagree vehemently. It's a very good feature, and even if it "isn't the POSIX standard", you guys should bring this behavior forward. Doing it via the regular rename operation might be a good choice, because it's backwards-compatible with the original accidental feature. Or, perhaps you'll instead want to add an explicit atomic rename operation, so that filesystems like xfs won't surprise users unpleasantly. That would require more pain on the part of application developers, but would make the guarantee explicit instead of implicit, which is probably better from a design perspective.
But telling people to use fsync instead of atomic rename, and that they're wrong and stupid for wanting a feature that's hard to do, is just a tired repetition of a very old game indeed.
Posted Mar 17, 2009 15:37 UTC (Tue) by smoogen (subscriber, #97)
In the past, the fsync sort of happened every 5 seconds, so you never really spun down your disk. It was the reason why people considered ext3 a slow filesystem compared to xfs, etc. One can get better performance, but at the price of reliability.
Posted Mar 18, 2009 7:54 UTC (Wed) by malor (subscriber, #2973)
Even if disk spinups were once every five minutes instead of every five seconds, you would still get that behavior; all the data blocks of a given file would be written to disk before that file was renamed over another one.
This means that you're guaranteed to always have either the old data OR the new data. You don't know which you have, after a kernel crash or power failure, but you have one or the other. And this happens without needing to do an fsync, which is a different logical thing, and which absolutely requires a drive spinup. This sync-and-rename functionality is much lighter weight, and can happen pretty much anytime. It doesn't add to the power burden of using the disk, but still guarantees a form of data integrity that many applications find very useful.
Either good old data OR good new data is not the same as fsync. Telling programmers to use fsync is forcing them to use the hammer that's convenient, instead of the screwdriver that would better solve the problem.
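The "old data OR new data" guarantee can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how ext3 or ext4 is implemented: it pretends there are exactly two pending updates (the data blocks of a hypothetical `config.tmp` and its rename over `config`) and that a crash can land before, between, or after them. The function and file names are invented for the example.

```python
def crash_outcomes(ordered):
    """Contents seen at 'config' after a crash at each possible point.

    Toy model only: one data update and one rename update are pending,
    and the crash can hit before, between, or after them.
    """
    data_write = "data"
    rename = "rename"
    # ordered=True: data blocks reach disk before the rename is committed.
    # ordered=False: the rename may be committed first.
    schedule = [data_write, rename] if ordered else [rename, data_write]

    outcomes = set()
    for completed in range(len(schedule) + 1):
        # On-disk state before any pending update lands: the new file
        # exists but none of its data has been written out yet.
        names = {"config": "inode_old", "config.tmp": "inode_new"}
        contents = {"inode_old": "old", "inode_new": ""}
        for op in schedule[:completed]:
            if op == data_write:
                contents["inode_new"] = "new"
            else:
                names["config"] = names.pop("config.tmp")
        # What a reader finds under the name 'config' after recovery.
        outcomes.add(contents[names["config"]])
    return outcomes
```

In this model `crash_outcomes(True)` contains only the old and the new contents, while `crash_outcomes(False)` also contains the empty file, which is the failure people are complaining about. Note that neither case needs the data on disk at any particular time; only the order matters.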
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds