Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:45 UTC (Mon) by k8to (guest, #15413)
In reply to: Garrett: ext4, application expectations and power management by jamesh
Parent article: Garrett: ext4, application expectations and power management

They are not honouring the requirement for them to express that the data be on the disk when the rename is applied.

That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.

In the replace-with-rename pattern, it's wrong.



Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:59 UTC (Mon) by jamesh (guest, #1159) [Link]

> They are not honouring the requirement for them to express that the data
> be on the disk when the rename is applied.

Right. There doesn't seem to be a way to do this without requiring that the data be written to disk right now. In these cases, the application is fine with delayed writes -- they just want the ordering of the write and the rename to be preserved.

> That's not wrong. It's just wrong if the application requires that the
> data be on disk after crash, which is what everyone is bitching about.

That isn't what the applications require though. The behaviour they are after is for the rename to be recorded only if the associated writes are also recorded.

It is acceptable if the rename is lost by a system crash. What is not acceptable is for the rename to occur but not the write.

If the application wanted to be sure that the data had been flushed before the rename, then yes, it should call fsync().
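For reference, the replace-with-rename pattern under discussion can be sketched roughly as follows. This is a minimal illustration, not code from any of the applications mentioned: the name `replace_file` and its `durable` flag are made up for the example. With `durable=True` it calls fsync() on the temporary file before the rename; with `durable=False` it relies on the filesystem to order the data write ahead of the rename, which is exactly the behaviour being argued about here.

```python
import os
import tempfile


def replace_file(path, data, durable=True):
    """Replace `path` with `data` by writing a temp file and renaming it.

    Hypothetical helper for illustration only. `durable=True` forces the
    new data to disk with fsync() before the rename is issued.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()                  # Python buffer -> kernel page cache
            if durable:
                os.fsync(tmp.fileno())   # page cache -> disk (may spin the drive up)
        os.replace(tmp_path, path)       # rename(2) on POSIX: atomic switch-over
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling `replace_file("settings.conf", b"...", durable=False)` is roughly what the applications in question do today. Note that even the `durable=True` variant does not fsync() the containing directory, so the rename itself may still be lost in a crash, which is the acceptable case described above.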

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:03 UTC (Mon) by drag (guest, #31333) [Link]

> That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.

Well, they want either the old data or the new data to be in the file system after recovering from a crash. Not files full of zeroes...

People are willing to put up with missing X number of seconds of work from the vast majority of applications they are using.

It's actually rare that people want data immediately written to disk. Stuff they want saved very carefully and immediately is generally going to be user-generated data (what you're editing with Emacs) and not automatically generated data (my application remembering the position of icons in my windows).

Forcing a commit immediately to disk seems to be a much bigger hammer than what is wanted. They just want the OS not to corrupt files if it can be helped.

If fsync() is the only way to keep the OS from randomly blowing away files on my hard drive, then so be it. It just seems like there should be a better way.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 10:09 UTC (Tue) by malor (guest, #2973) [Link] (2 responses)

What the author is arguing, and I agree with him, is that applications need a method to guarantee that the data on disk is always good, whatever version it is, but without the penalty of a full fsync. That may not matter _that_ much on a server or desktop, but on a laptop, that means the drive absolutely has to spin up from sleep, or can't sleep in the first place. This is a substantial battery hit. I don't have any easy way to test it, but hard drive spinups are expensive as hell (and slow), so it wouldn't shock me if this ext4 behavior change singlehandedly wiped out a good chunk of the work done to improve kernel power usage on laptops.

Atomic rename is not the same thing as fsync. Telling application authors that they have to use fsync is yet another example of, when something is hard to do in Linux, telling the user that what he or she wants is wrong and stupid. This pattern goes way, way back.

Once upon a time, in the early days of Linux, I commented on Slashdot that ext2 was a bad filesystem, and would lose data if the computer crashed or lost power. I was informed, by numerous people, that the data loss was my fault because the computer wasn't on a UPS, and that I should 'simply' have manually run a disk editor and restored a backup superblock to recover the corrupted files. Seriously: lost data, they claimed, was my fault because I didn't understand the layout of ext2 well enough to fire up a hex editor when it crashed.

Well, sometime in the next year or two, journaling showed up, and suddenly everyone was all about how wonderful it was, how horrible ext2 was in comparison, and how no sane person would use ext2 in production. But when I'd said that, when there was no other option, I was wrong and stupid for wanting reliability in my filesystem.

I see this argument the same way; by accident, the ext3 writers provided a very useful feature. Atomic rename isn't fsync; it's much lighter weight. People are not wrong and stupid for wanting it, but because it's hard, that's practically the first thing out of people's mouths. "You can't do that on ext4. That's not the POSIX semantics, and you're foolish to expect this behavior."

I disagree vehemently. It's a very good feature, and even if it "isn't the Posix standard", you guys should bring this behavior forward. Doing it via the regular rename operation might be a good choice, because it's backwards-compatible with the original accidental feature. Or, perhaps you'll instead want to add an explicit atomic rename operation, so that filesystems like xfs won't surprise users unpleasantly. That would require more pain on the part of application developers, but would make the guarantee explicit instead of implicit, which is probably better from a design perspective.

But telling people to use fsync instead of atomic rename, and that they're wrong and stupid for wanting a feature that's hard to do, is just a tired repetition of a very old game indeed.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 15:37 UTC (Tue) by smoogen (subscriber, #97) [Link] (1 response)

As far as I can tell... the only way you are going to get what you want is an fsync() or a battery-backed cache. Disk drives are limited to writing or reading and are pretty much a 'linear' device in that regard.

In the past, the fsync sort of happened every 5 seconds, so you never really spun down your disk. It was the reason why people considered ext3 a slow filesystem compared to xfs, etc. One can get better performance, but at the price of reliability.

Garrett: ext4, application expectations and power management

Posted Mar 18, 2009 7:54 UTC (Wed) by malor (guest, #2973) [Link]

It's not the 5-second thing. Rather, something about how ext3 orders writes means that, purely by accident, a rename of a file will always be done after the data blocks of the file have been written to disk. I have no idea why this happens, and it obviously wasn't an intended feature, but that's how it actually works out in practice. The fact that xfs doesn't do this, in fact, is one of the reasons it's considered unreliable by people who've used it on the desktop.

Even if disk spinups were once every five minutes instead of every five seconds, you would still get that behavior; all the data blocks of a given file would be written to disk before that file was renamed over another one.

This means that you're guaranteed to always have either the old data OR the new data. You don't know which you have, after a kernel crash or power failure, but you have one or the other. And this happens without needing to do an fsync, which is a different logical thing, and which absolutely requires a drive spinup. This sync-and-rename functionality is much lighter weight, and can happen pretty much anytime. It doesn't add to the power burden of using the disk, but still guarantees a form of data integrity that many applications find very useful.
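To make the "old OR new, never something in between" property concrete, here is a small sketch. It only shows what a reader of the target path sees while the system is running; it cannot simulate a crash, and whether the same holds after a crash without fsync is exactly the filesystem-dependent part. The function name `observe_replace` is invented for this illustration.

```python
import os
import tempfile


def observe_replace(path, new_data):
    """Replace `path` with `new_data`, recording what a reader sees at each step.

    Hypothetical demonstration helper; returns the list of observed contents.
    """
    seen = []

    def read_target():
        with open(path, "rb") as f:
            return f.read()

    seen.append(read_target())                 # before anything happens: old data
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    half = len(new_data) // 2
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(new_data[:half])             # temp file is only half written
        tmp.flush()
        seen.append(read_target())             # target is still the old data
        tmp.write(new_data[half:])
    seen.append(read_target())                 # temp complete, not yet renamed: old data
    os.replace(tmp_path, path)                 # the single switch-over point
    seen.append(read_target())                 # after the rename: new data
    return seen
```

Every entry in the returned list is either the complete old contents or the complete new contents; the half-written state only ever exists under the temporary name.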

Either good old data OR good new data is not the same as fsync. Telling programmers to use fsync is forcing them to use the hammer that's convenient, instead of the screwdriver that would better solve the problem.

