Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 3:34 UTC (Mon) by drag (guest, #31333)
In reply to: Garrett: ext4, application expectations and power management by Nick
Parent article: Garrett: ext4, application expectations and power management

What is so wrong with a file system honoring the order of operations?

I mean, if an application does a write and then a rename, why not wait to commit the rename to disk until after the write is committed?

Nobody cares whether the data is flushed to the drive immediately on a rename; just that the data is on the disk by the time the rename is on the disk. That way, if the system crashes, your old copy of the data is still valid.



Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:07 UTC (Mon) by k8to (guest, #15413) [Link]

Because write is not a write to disk, and rename is not a rename to disk. They do occur in order in a perceptual way.

That they do not occur in order on disk is what you would want for the usual case.

This is a situation where the APIs should be enhanced so that the application can tell the system what it needs.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 5:02 UTC (Mon) by dlang (guest, #313) [Link] (2 responses)

because providing the ordering that you want would kill performance. it would mean that you could not reorder I/O from the order that the various programs happened to ask for it to something that the storage system can do more efficiently. it would mean that the storage system would (in most cases) not be able to combine separate I/O operations into a smaller number of them.

and as a result, it would also cause the drives to wear out faster, as they seek across the entire drive more.

you may think that you want that sort of guarantee, but you really don't. if you did, then the 5 second window that ext3 has would be completely unacceptable to you as well.

Partial Ordering and Disk I/O

Posted Mar 16, 2009 13:21 UTC (Mon) by Pc5Y9sbv (guest, #41328) [Link]

I wish someone deeply familiar with file system design would give a detailed answer to this question. I am a computer scientist and software architect but don't have practical experience writing or optimizing general purpose file systems. I would, however, love to see pointers to more detailed reading.

But intuitively, I don't think it is as bad as you state. To honor the POSIX ordering all the way to disk would introduce a partial order on write operations, easily imagined as a queue-like structure composed of a DAG of requests sequenced by write barrier relationships. Each set of siblings and descendants may be reordered, and this need only be maintained in system RAM and mapped to write barriers in the final queued I/O layer to disk. The kernel I/O scheduling would make some of the ordering decisions in mapping the DAG into a stream with write barriers, and leave the rest up to the disk controller. (Examples of mapping the DAG to the stream include deciding how bands of unordered writes from two different streams would be merged into the same band of the final stream, where that band is a set of writes between two write barriers, versus staggered out at different rates to adjust throughput of different streams.)

The sources of this partial ordering information could be explicit syscall/API extensions for write-barriers, but could also be heuristics for cases like that under discussion: maintain ordering with respect to batches of inode-file content writes and inode-linking metadata writes, and related atomic actions like separate relinks of the same file inode or directory inode. This would cover the broad range of "make file content available under a name" crash-recovery semantics and then some...

Coming from a scientific computing background, I suspect most more complex file writing scenarios, such as shared write access from multiple processes, would already have taken into account more elaborate rollback and recovery strategies for the file content in the case of crashes.
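The DAG-to-barrier mapping described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how any kernel I/O scheduler works: `bands_from_dag` and `to_stream` are hypothetical names, writes are plain string ids, and the literal "BARRIER" token stands in for a real write barrier.

```python
from collections import defaultdict


def bands_from_dag(writes, must_precede):
    """Group writes into bands; writes inside one band may be reordered freely.

    writes: list of write ids (hypothetical labels).
    must_precede: list of (earlier, later) ordering constraints.
    """
    successors = defaultdict(list)
    indegree = {w: 0 for w in writes}
    for earlier, later in must_precede:
        successors[earlier].append(later)
        indegree[later] += 1

    # Start with every write that has no predecessor.
    ready = [w for w, degree in indegree.items() if degree == 0]
    bands = []
    while ready:
        band, ready = ready, []
        bands.append(band)
        # Release writes whose predecessors are all in earlier bands.
        for w in band:
            for nxt in successors[w]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)

    if sum(len(band) for band in bands) != len(indegree):
        raise ValueError("ordering constraints contain a cycle")
    return bands


def to_stream(bands):
    """Flatten bands into one stream with a barrier between bands."""
    stream = []
    for index, band in enumerate(bands):
        if index:
            stream.append("BARRIER")
        stream.extend(band)
    return stream
```

For example, `to_stream(bands_from_dag(["data1", "data2", "rename", "other"], [("data1", "rename"), ("data2", "rename")]))` puts the two data writes and the unrelated write in the first band and the rename after the barrier. Flattening to a single stream orders the rename after the unrelated write as well, which is stricter than the DAG requires; that is one of the scheduling trade-offs mentioned above.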

What is so wrong with a file system honoring the order of operations?

Posted Mar 20, 2009 19:06 UTC (Fri) by anton (subscriber, #25547) [Link]

> because providing the ordering that you want would kill performance. it would mean that you could not reorder I/O from the order that the various programs happened to ask for it to something that the storage system can do more efficiently. it would mean that the storage system would (in most cases) not be able to combine separate I/O operations into a smaller number of them.

No (to each of these statements). A file system could combine many operations into one large batch, write out the batch in any order and with as few I/O operations as it (or the drive) likes, then commit the whole batch by writing one commit block. That would be efficient. Of course this means that no old block must be overwritten before the commit block is written, but that can be achieved by using a journal or a copy-on-write file system.

And yes, I want that guarantee, I really do, and I don't care if the file system loses 5 seconds or 30 seconds of operations, in case of a crash, but I do care if what it gives me is a state that never logically existed before the crash.
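As a rough illustration of the batch-plus-commit-block idea, here is a toy Python model. It is a sketch under simplifying assumptions, not a description of any real journal format: `ToyJournal` is a hypothetical class, the "disk" is an in-memory list of records, and a crash is simulated by skipping the commit record.

```python
class ToyJournal:
    """Toy model: a batch survives a crash only if its commit record was written."""

    def __init__(self):
        self.records = []  # records that reached the simulated disk, in arrival order

    def write_batch(self, updates, crash_before_commit=False):
        # Data records within a batch may land in any order; reverse-sorted
        # order is used here just to show the order does not matter.
        for name in sorted(updates, reverse=True):
            self.records.append(("data", name, updates[name]))
        if crash_before_commit:
            return  # simulated crash: the commit record never reaches the disk
        self.records.append(("commit", None, None))

    def recover(self):
        """Replay only the batches that are followed by a commit record."""
        state = {}
        pending = {}
        for kind, name, value in self.records:
            if kind == "data":
                pending[name] = value
            else:
                state.update(pending)
                pending = {}
        # Anything still pending belongs to an uncommitted batch and is dropped.
        return state
```

In this model, a batch containing both new file contents and the rename either appears in full after recovery or not at all, so recovery never yields a state that did not logically exist.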

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 8:47 UTC (Mon) by Nick (guest, #15060) [Link] (7 responses)

> What is so wrong with a file system honoring the order of operations?

There is nothing wrong with it, what is wrong is an application ignoring the documented
standards and assuming it will "honour" some semantics that they happen to think are
reasonable.

Historically the reason why they don't do this is performance. POSIX as far as I can see encoded
existing semantics in this regard, rather than a case of some particular OS or filesystem
developers making some legal interpretation of the document that goes against the spirit of it.

> I mean if a application does a write then rename, why not wait to commit the rename to
> disk until after the write is committed?

You could, but that's not a trivial thing to do for a lot of filesystems (without resorting to an
fsync), and it would also cost performance for apps that don't want it.

> Nobody is caring if the data is flushed to the drive immediately on a rename; just that the
> data is on the disk by the time the rename is on the disk. That way if the system crashes
> then your old copy of the data is still valid.

The way to do that is with fsync. If some filesystem happens to honour flush on rename, you are
still going to need fsync in order to have a correct and portable app, unfortunately. If you just
want the ordering but not the synchronous write that fsync gives, then you need to propose a
new syscall API for this (which would degenerate to fsync if a particular filesystem can't handle it
nicely).
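A minimal sketch of the portable fsync-then-rename pattern, written in Python for brevity. `replace_file_durably` is a hypothetical helper name, and the final directory fsync is an extra step that assumes a POSIX system; it is not something the comments above spell out.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Write data to a temp file, fsync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the data to disk
        os.replace(tmp_path, path)  # rename over the old name
    except BaseException:
        # Remove the temp file if the rename did not happen.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: fsync on the directory makes the rename itself durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This is the "big hammer" version: the fsync forces a synchronous write, which is exactly the cost that an ordering-only API would avoid. Note that `tempfile.mkstemp` creates the file readable and writable only by its owner, so a real application may also need to copy the old file's permissions.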

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:25 UTC (Mon) by jamesh (guest, #1159) [Link] (6 responses)

> There is nothing wrong with it, what is wrong is an application ignoring
> the documented standards and assuming it will "honour" some semantics
> that they happen to think are reasonable.

Which standard is the application not honouring? The POSIX standard leaves behaviour over system crashes undefined so they can't rely on that one.

Absent some other standard to define the behaviour on crash, applications are left to assume that the implementation defined behaviour is sane.

Given the POSIX defined behaviour of rename() when the system isn't crashing and real world behaviour of ext3, zfs, etc, having the filesystem attempt to preserve the atomic "old content or new content" behaviour seems desirable.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:45 UTC (Mon) by k8to (guest, #15413) [Link] (5 responses)

They are not honouring the requirement for them to express that the data be on the disk when the rename is applied.

That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.

In the replace-with-rename pattern, it's wrong.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:59 UTC (Mon) by jamesh (guest, #1159) [Link]

> They are not honouring the requirement for them to express that the data
> be on the disk when the rename is applied.

Right. There doesn't seem to be a way to do this without requiring that the data be written to disk right now. In these cases, the application is fine with delayed writes -- they just want the ordering of the write and the rename to be preserved.

> That's not wrong. It's just wrong if the application requires that the
> data be on disk after crash, which is what everyone is bitching about.

That isn't what the applications require though. The behaviour they are after is for the rename to be recorded only if the associated writes are also recorded.

It is acceptable if the rename is lost by a system crash. What is not acceptable is for the rename to occur but not the write.

If the application wanted to be sure that the data had been flushed before the rename, then yes, it should call fsync().

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:03 UTC (Mon) by drag (guest, #31333) [Link]

> That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.

Well they want either the old data or new data to be in a file system after recovering from a crash. Not files full of zeroes...

People are willing to put up with missing X number of seconds of work from the vast majority of applications they are using.

It's actually rare that people want data immediately written to disk. Stuff they want saved very carefully and immediately is generally going to be user-generated data (what you're editing with Emacs) and not automatically generated data (my application remembering the position of icons in my windows).

Forcing a commit immediately to disk seems to be a much bigger hammer than what is wanted. They just want the OS not to corrupt files if it can be helped.

If fsync() is the only way to keep the OS from randomly blowing away files on my hard drive, then so be it. It just seems like there should be a better way.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 10:09 UTC (Tue) by malor (guest, #2973) [Link] (2 responses)

What the author is arguing, and I agree with him, is that applications need a method to guarantee that the data on disk is always good, whatever version it is, but without the penalty of a full fsync. That may not matter _that_ much on a server or desktop, but on a laptop, that means the drive absolutely has to spin up from sleep, or can't sleep in the first place. This is a substantial battery hit. I don't have any easy way to test it, but hard drive spinups are expensive as hell (and slow), so it wouldn't shock me if this ext4 behavior change singlehandedly wiped out a good chunk of the work done to improve kernel power usage on laptops.

Atomic rename is not the same thing as fsync. Telling application authors that they have to use fsync is yet another example of, when something is hard to do in Linux, telling the user that what he or she wants is wrong and stupid. This pattern goes way, way back.

Once upon a time, in the early days of Linux, I commented on Slashdot that ext2 was a bad filesystem, and would lose data if the computer crashed or lost power. I was informed, by numerous people, that the data loss was my fault because the computer wasn't on a UPS, and that I should 'simply' have manually run a disk editor and restored a backup superblock to recover the corrupted files. Seriously: lost data, they claimed, was my fault because I didn't understand the layout of ext2 well enough to fire up a hex editor when it crashed.

Well, sometime in the next year or two, journaling showed up, and suddenly everyone was all about how wonderful it was, how horrible ext2 was in comparison, and how no sane person would use ext2 in production. But when I'd said that, when there was no other option, I was wrong and stupid for wanting reliability in my filesystem.

I see this argument the same way; by accident, the ext3 writers provided a very useful feature. Atomic rename isn't fsync; it's much lighter weight. People are not wrong and stupid for wanting it, but because it's hard, that's practically the first thing out of people's mouths. "You can't do that on ext4. That's not the POSIX semantics, and you're foolish to expect this behavior."

I disagree vehemently. It's a very good feature, and even if it "isn't the Posix standard", you guys should bring this behavior forward. Doing it via the regular rename operation might be a good choice, because it's backwards-compatible with the original accidental feature. Or, perhaps you'll instead want to add an explicit atomic rename operation, so that filesystems like xfs won't surprise users unpleasantly. That would require more pain on the part of application developers, but would make the guarantee explicit instead of implicit, which is probably better from a design perspective.

But telling people to use fsync instead of atomic rename, and that they're wrong and stupid for wanting a feature that's hard to do, is just a tired repetition of a very old game indeed.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 15:37 UTC (Tue) by smoogen (subscriber, #97) [Link] (1 responses)

As far as I can tell... the only way you are going to get what you want is an fsync() or a battery-backed cache. Disk drives are limited to writing or reading and are pretty much a 'linear' device in that regard.

In the past, the fsync sort of happened every 5 seconds so you never really spun down your disk. It was the reason why people considered ext3 a slow filesystem compared to xfs, etc etc. One can get better performance, but at the price of reliability.

Garrett: ext4, application expectations and power management

Posted Mar 18, 2009 7:54 UTC (Wed) by malor (guest, #2973) [Link]

It's not the 5-second thing. Rather, something about how ext3 orders writes means that, purely by accident, a rename of a file will always be done after the data blocks of the file have been written to disk. I have no idea why this happens, and it obviously wasn't an intended feature, but that's how it actually works out in practice. The fact that xfs doesn't do this, in fact, is one of the reasons it's considered unreliable by people who've used it on the desktop.

Even if disk spinups were once every five minutes instead of every five seconds, you would still get that behavior; all the data blocks of a given file would be written to disk before that file was renamed over another one.

This means that you're guaranteed to always have either the old data OR the new data. You don't know which you have, after a kernel crash or power failure, but you have one or the other. And this happens without needing to do an fsync, which is a different logical thing, and which absolutely requires a drive spinup. This sync-and-rename functionality is much lighter weight, and can happen pretty much anytime. It doesn't add to the power burden of using the disk, but still guarantees a form of data integrity that many applications find very useful.

Either good old data OR good new data is not the same as fsync. Telling programmers to use fsync is forcing them to use the hammer that's convenient, instead of the screwdriver that would better solve the problem.
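The "old data OR new data" guarantee can be made concrete with a toy enumeration in Python. This is only a model under stated assumptions: each of the two operations (data write, rename) is treated as either fully on disk or not at all at the moment of the crash, and `visible_after_crash` and `crash_outcomes` are hypothetical names.

```python
from itertools import product


def visible_after_crash(data_on_disk, rename_on_disk):
    """What the target file name shows after recovery in this toy model."""
    if not rename_on_disk:
        return "old"  # the name still points at the old file
    return "new" if data_on_disk else "empty"


def crash_outcomes(rename_waits_for_data):
    """Collect every outcome reachable across all possible crash points."""
    outcomes = set()
    for data_on_disk, rename_on_disk in product([False, True], repeat=2):
        if rename_waits_for_data and rename_on_disk and not data_on_disk:
            continue  # ordering rules out "rename persisted, data not"
        outcomes.add(visible_after_crash(data_on_disk, rename_on_disk))
    return outcomes
```

With the ordering constraint, `crash_outcomes(True)` contains only "old" and "new". Without it, `crash_outcomes(False)` also contains "empty", which corresponds to the zero-length files complained about earlier in the thread. Neither case requires the data to be on disk at any particular time, which is the difference from fsync.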

