Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 18:29 UTC (Fri) by anton (subscriber, #25547)
In reply to: Ts'o: Delayed allocation and the zero-length file problem by masoncl
Parent article: Ts'o: Delayed allocation and the zero-length file problem
> But, the important discussion isn't if I can sneak in a good implementation for popular but incorrect API usage. The important discussion is, what is the API today and what should it really be?

The API in the non-crash case is defined by POSIX, so I translate this as: what guarantees should the file system give in case of a crash?
One ideal is to perform all operations synchronously. That's very expensive, however.
The next-cheaper ideal is to preserve the order of the operations (in-order semantics), i.e., after crash recovery you will find the file system in one of the states that it logically had before the crash; the file system may lose the operations that happened after some point in time, but it will be just as consistent as it was at that point in time. If the file system gives this guarantee, any application that is written to be safe against being killed will also have consistent (but not necessarily up-to-date) data in case of a system crash.
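To make this concrete, here is a sketch of the classic kill-safe update pattern (write to a temporary file, then rename over the original). The file names are made up for illustration. Under in-order crash semantics, this pattern is also crash-safe without any fsync(), because recovery can never see the rename without the data writes that preceded it:

```python
import os

def atomic_update(path, data):
    # Kill-safe pattern: write the new contents to a temporary file,
    # then rename() it over the original. rename() is atomic, so a
    # reader always sees either the complete old file or the complete
    # new file, never a mix, even if this process is killed mid-way.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        # Without in-order crash semantics, an os.fsync(f.fileno())
        # would be needed here for the pattern to survive a *system*
        # crash; with in-order semantics, recovery cannot replay the
        # rename ahead of the data writes, so it is unnecessary.
    os.rename(tmp, path)

atomic_update("config.txt", "new contents\n")
with open("config.txt") as f:
    print(f.read())  # -> new contents
```

The point of the example: the application only has to be correct against being killed; the ordering guarantee then extends that correctness to system crashes for free.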
This guarantee can be implemented relatively cheaply in a copy-on-write file system, so I really would like Btrfs to give that guarantee, and give it in its default mode (otherwise things like ext3's data=journal debacle will happen).
How to implement this guarantee? When you decide to do another commit, just remember the then-current logical state of the file system (i.e., which blocks have to be written out), then write them out, then do a barrier, and finally the root block. There are some complications: e.g., you have to deal with some processes being in the middle of some operation at the time; and if a later operation wants to change a block before it is written out, you have to make a new working copy of that block (in addition to the one waiting to be written out), but that's just a variation on the usual copy-on-write routine.
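The commit sequence described above can be sketched with a toy in-memory model (this is an illustration of the idea, not Btrfs code; all names are invented). The key points are that the snapshot freezes the logical state at commit time, later writes go to fresh copies, and the root pointer is published only after everything it references has been written behind a barrier:

```python
import threading

class CowFS:
    # Toy model of an in-order commit in a copy-on-write file system.
    # "Blocks" are dict entries; self.root is the committed tree that
    # crash recovery would find.
    def __init__(self):
        self.lock = threading.Lock()
        self.working = {}       # mutable in-memory state
        self.disk_blocks = {}   # blocks already written out
        self.root = None        # committed root: the crash-recovery state

    def write(self, key, value):
        with self.lock:
            # Copy-on-write: modifying a dict entry never disturbs a
            # snapshot taken earlier, because the snapshot holds its
            # own copy of the mapping.
            self.working[key] = value

    def commit(self):
        # 1. Remember the then-current logical state.
        with self.lock:
            snapshot = dict(self.working)
        # 2. Write those blocks out; concurrent write() calls change
        #    self.working, not the snapshot, so it stays stable.
        for k, v in snapshot.items():
            self.disk_blocks[k] = v
        # 3. A write barrier would go here (flush the device cache)...
        # 4. ...and only then is the new root published, making the
        #    snapshot the state that recovery will find.
        self.root = snapshot

fs = CowFS()
fs.write("a", 1)
fs.commit()
fs.write("a", 2)   # logically after the commit's snapshot
print(fs.root)     # -> {'a': 1}: a consistent, possibly stale state
```

A crash between steps 2 and 4 leaves the old root in place, so recovery finds the previous consistent state; the ordering guarantee falls out of never publishing a root before the state it points to.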
You would also have to decide how fsync() interacts with this guarantee: can fsync()ed operations run ahead of the rest (unlike what you normally guarantee), or do you simply perform a full sync whenever an fsync() is requested?
The benefits of providing such a guarantee would be:
- Many applications that work well when killed would work well on Btrfs even upon a crash.
- It would be a unique selling point for Btrfs. Other popular Linux file systems don't guarantee anything at all, and their maintainers only grudgingly address the worst shortcomings when there's a large enough outcry, while complaining about "incorrect API usage" by applications; some play fast and loose in other ways (e.g., by not using barriers). Many users value their data more than these maintainers do and would hopefully flock to a file system that actually gives crash consistency guarantees.
