Partial Ordering and Disk I/O
Posted Mar 16, 2009 13:21 UTC (Mon) by Pc5Y9sbv (guest, #41328)
In reply to: Garrett: ext4, application expectations and power management by dlang
Parent article: Garrett: ext4, application expectations and power management
But intuitively, I don't think it is as bad as you state. To honor the POSIX ordering all the way to disk would introduce a partial order on write operations, easily imagined as a queue-like structure composed of a DAG of requests sequenced by write barrier relationships. Within each set of siblings and descendants, writes that no barrier relates may be reordered, and this structure need only be maintained in system RAM and mapped to write barriers in the final queued I/O layer to disk. The kernel I/O scheduling would make some of the ordering decisions in mapping the DAG into a stream with write barriers, and leave the rest up to the disk controller. (Examples of mapping the DAG to the stream include deciding whether bands of unordered writes from two different streams would be merged into the same band of the final stream, where a band is the set of writes between two write barriers, or staggered out at different rates to adjust the throughput of the different streams.)
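To make the DAG-to-stream mapping concrete, here is a rough Python sketch. It is only an illustration, not how any kernel does it: the names `bands_with_barriers`, `to_stream`, and `BARRIER` are made up, and it picks the simplest possible policy, which is to merge every write whose prerequisites are satisfied into one band.

```python
BARRIER = "BARRIER"  # hypothetical marker standing in for a write barrier


def bands_with_barriers(deps):
    """Group writes into bands separated by implied write barriers.

    deps maps each write id to the set of write ids that must reach
    disk before it. Writes inside one band may be reordered freely.
    """
    pending = {w: set(before) for w, before in deps.items()}
    # Include writes that only ever appear as prerequisites.
    for before in deps.values():
        for w in before:
            pending.setdefault(w, set())

    bands = []
    while pending:
        # Every write with no unsatisfied prerequisite joins this band.
        ready = sorted(w for w, before in pending.items() if not before)
        if not ready:
            raise ValueError("cycle in ordering constraints")
        bands.append(ready)
        for w in ready:
            del pending[w]
        for before in pending.values():
            before.difference_update(ready)
    return bands


def to_stream(bands):
    """Flatten bands into a single stream with a barrier between bands."""
    stream = []
    for i, band in enumerate(bands):
        if i:
            stream.append(BARRIER)
        stream.extend(band)
    return stream


# Two independent "write content, then rename" streams, plus a final
# directory update that depends on both renames.
example = {
    "data_a": set(),
    "data_b": set(),
    "rename_a": {"data_a"},
    "rename_b": {"data_b"},
    "dir_sync": {"rename_a", "rename_b"},
}
print(to_stream(bands_with_barriers(example)))
```

This merges the two streams into shared bands, which is the first policy in the parenthetical above. A scheduler that wanted to stagger the streams would emit more, smaller bands from the same DAG.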
The sources of this partial ordering information could be explicit syscall/API extensions for write barriers, but could also be heuristics for cases like the one under discussion: maintain ordering between batches of inode file-content writes and the inode-linking metadata writes that follow them, and between related atomic actions such as separate relinks of the same file inode or directory inode. This would cover the broad range of "make file content available under a name" crash-recovery semantics and then some...
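As a sketch of what such a heuristic might look like, the Python below derives ordering constraints from a log of operations. Everything here is an assumption for illustration: the `("write", inode)` / `("link", inode)` tuples, the function name `infer_ordering`, and the two rules themselves (a link waits for the content writes to that inode issued before it, and successive links of the same inode stay in order). Nothing orders a later content write after an earlier link; a real heuristic might want that too.

```python
def infer_ordering(ops):
    """Infer which earlier operations must reach disk before each one.

    ops is a list of (kind, inode) tuples in issue order, where kind is
    "write" (file content) or "link" (metadata that names the inode).
    Returns a dict mapping each op index to a set of earlier op indices.
    """
    deps = {}
    unlinked_writes = {}  # inode -> content writes since its last link
    last_link = {}        # inode -> index of its most recent link

    for i, (kind, inode) in enumerate(ops):
        deps[i] = set()
        if kind == "write":
            unlinked_writes.setdefault(inode, []).append(i)
        elif kind == "link":
            # The name must not appear on disk before the content does.
            deps[i].update(unlinked_writes.pop(inode, []))
            # Separate relinks of the same inode keep their issue order.
            if inode in last_link:
                deps[i].add(last_link[inode])
            last_link[inode] = i
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return deps


ops = [
    ("write", 7),  # 0: content for inode 7
    ("write", 7),  # 1: more content for inode 7
    ("write", 9),  # 2: unrelated inode, unconstrained
    ("link", 7),   # 3: must follow ops 0 and 1
    ("link", 7),   # 4: relink, must follow op 3
]
print(infer_ordering(ops))
```

Writes to different inodes get no constraints between them, so the I/O scheduler and the disk stay free to reorder most of the traffic.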
Coming from a scientific computing background, I suspect that most of the more complex file-writing scenarios, such as shared write access from multiple processes, already rely on more elaborate rollback and recovery strategies for the file content in the case of crashes.
