
Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 5:02 UTC (Mon) by dlang (guest, #313)
In reply to: Garrett: ext4, application expectations and power management by drag
Parent article: Garrett: ext4, application expectations and power management

because providing the ordering that you want would kill performance. it would mean that you could not reorder I/O from the order that the various programs happened to ask for it to something that the storage system can do more efficiently. it would mean that the storage system would (in most cases) not be able to combine separate I/O operations into a smaller number of them.

and as a result, it would also cause the drives to wear out faster, as they would seek across the entire drive more.

you may think that you want that sort of guarantee, but you really don't. if you did, then the 5 second window that ext3 has would be completely unacceptable to you as well.


Partial Ordering and Disk I/O

Posted Mar 16, 2009 13:21 UTC (Mon) by Pc5Y9sbv (guest, #41328)

I wish someone deeply familiar with file system design would give a detailed answer to this question. I am a computer scientist and software architect but don't have practical experience writing or optimizing general purpose file systems. I would, however, love to see pointers to more detailed reading.

But intuitively, I don't think it is as bad as you state. To honor the POSIX ordering all the way to disk would introduce a partial order on write operations, easily imagined as a queue-like structure composed of a DAG of requests sequenced by write barrier relationships. Each set of siblings and descendants may be reordered, and this need only be maintained in system RAM and mapped to write barriers in the final queued I/O layer to disk. The kernel I/O scheduling would make some of the ordering decisions in mapping the DAG into a stream with write barriers, and leave the rest up to the disk controller. (Examples of mapping the DAG to the stream include deciding how bands of unordered writes from two different streams would be merged into the same band of the final stream, where that band is a set of writes between two write barriers, versus staggered out at different rates to adjust throughput of different streams.)
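As a rough illustration of mapping such a DAG onto a stream with write barriers, here is a minimal Python sketch. The names `bands_from_dag` and `to_stream` and the string write IDs are hypothetical, invented for this example; it is not how any real kernel I/O scheduler is written. It peels the DAG into "bands" of writes whose predecessors have all been issued, so writes within a band may be reordered freely and a barrier separates consecutive bands.

```python
from collections import defaultdict


def bands_from_dag(writes, must_precede):
    """Group writes into bands separated by write barriers.

    writes: iterable of hashable write ids (hypothetical labels).
    must_precede: iterable of (earlier, later) ordering constraints.
    Writes inside one band have no ordering constraints among them.
    """
    writes = list(writes)
    successors = defaultdict(list)
    pending = {w: 0 for w in writes}  # unissued predecessors per write
    for earlier, later in must_precede:
        successors[earlier].append(later)
        pending[later] += 1

    bands = []
    ready = [w for w in writes if pending[w] == 0]
    while ready:
        bands.append(sorted(ready))  # any order within a band is legal
        next_ready = []
        for w in ready:
            for s in successors[w]:
                pending[s] -= 1
                if pending[s] == 0:
                    next_ready.append(s)
        ready = next_ready

    if sum(len(b) for b in bands) != len(writes):
        raise ValueError("ordering constraints contain a cycle")
    return bands


def to_stream(bands):
    """Flatten bands into one request stream with barrier markers."""
    stream = []
    for i, band in enumerate(bands):
        if i:
            stream.append("BARRIER")
        stream.extend(band)
    return stream


# Two independent streams: each writes file content, then relinks a name.
bands = bands_from_dag(
    ["a1", "a2", "ar", "b1", "br"],
    [("a1", "ar"), ("a2", "ar"), ("b1", "br")],
)
print(to_stream(bands))  # ['a1', 'a2', 'b1', 'BARRIER', 'ar', 'br']
```

In this sketch the content writes of both streams merge into one band and both relinks into the next; a scheduler could equally stagger the two streams across more bands, which is the throughput trade-off mentioned above.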

The sources of this partial ordering information could be explicit syscall/API extensions for write-barriers, but could also be heuristics for cases like that under discussion: maintain ordering with respect to batches of inode-file content writes and inode-linking metadata writes, and related atomic actions like separate relinks of the same file inode or directory inode. This would cover the broad range of "make file content available under a name" crash-recovery semantics and then some...
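For reference, the "make file content available under a name" case looks roughly like this from the application side. This is a minimal sketch using only Python standard-library calls; the helper name `replace_file` is made up for the example. The `os.fsync` call is the explicit ordering point that exists today. Without it, the application is relying on the file system to keep the content writes ordered before the rename, which is the kind of heuristic described above.

```python
import os
import tempfile


def replace_file(path, data):
    """Write data to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # same directory, same file system
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)         # inode-file content write
            f.flush()
            os.fsync(f.fileno())  # force content to disk before the relink
        os.replace(tmp, path)     # inode-linking metadata write
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

After a crash, a reader of `path` should then see either the complete old content or the complete new content, provided the content reaches disk before the rename does.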

Coming from a scientific computing background, I suspect most more complex file writing scenarios, such as shared write access from multiple processes, would already have taken into account more elaborate rollback and recovery strategies for the file content in the case of crashes.

What is so wrong with a file system honoring the order of operations?

Posted Mar 20, 2009 19:06 UTC (Fri) by anton (subscriber, #25547)

"because providing the ordering that you want would kill performance. it would mean that you could not reorder I/O from the order that the various programs happened to ask for it to something that the storage system can do more efficiently. it would mean that the storage system would (in most cases) not be able to combine separate I/O operations into a smaller number of them."

No (to each of these statements). A file system could combine many operations into one large batch, write out the batch in any order and with as few I/O operations as it (or the drive) likes, then commit the whole batch by writing one commit block. That would be efficient. Of course this means that no old block must be overwritten before the commit block is written, but that can be achieved by using a journal or a copy-on-write file system.
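A toy Python model of the batch-plus-commit-block idea, under loud assumptions: the journal is a plain list, `write_batch` and `recover` are names invented for this sketch, and a "crash" is simulated by stopping after a given number of records. It does not model any real file system's on-disk format.

```python
import random

COMMIT = object()  # sentinel standing in for the commit block


def write_batch(journal, batch, crash_after=None):
    """Append a batch's blocks in arbitrary order, then one commit record.

    crash_after: number of records written before a simulated crash
    (None means no crash). Returns True if the commit record was written.
    """
    records = list(batch.items())
    random.shuffle(records)  # blocks within a batch may land in any order
    records.append((COMMIT, None))
    for n, record in enumerate(records):
        if crash_after is not None and n >= crash_after:
            return False
        journal.append(record)
    return True


def recover(journal):
    """Replay only batches that are followed by a commit record."""
    state, pending = {}, {}
    for key, value in journal:
        if key is COMMIT:
            state.update(pending)  # the whole batch becomes visible at once
            pending = {}
        else:
            pending[key] = value
    return state  # an uncommitted tail in `pending` is discarded


journal = []
write_batch(journal, {"data": "v1", "name": "file -> v1"})
write_batch(journal, {"data": "v2", "name": "file -> v2"}, crash_after=1)
print(recover(journal))  # {'data': 'v1', 'name': 'file -> v1'}
```

However the blocks of the second batch were ordered, recovery yields the state as of the last commit: some recent operations are lost, but the result is a state that logically existed before the crash.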

And yes, I want that guarantee, I really do, and I don't care if the file system loses 5 seconds or 30 seconds of operations, in case of a crash, but I do care if what it gives me is a state that never logically existed before the crash.

