Ts'o: Delayed allocation and the zero-length file problem
Posted Mar 20, 2009 18:29 UTC (Fri) by anton
In reply to: Ts'o: Delayed allocation and the zero-length file problem
Parent article: Ts'o: Delayed allocation and the zero-length file problem
But, the important discussion isn't if I can sneak in a good
implementation for popular but incorrect API usage. The important
discussion is, what is the API today and what should it really be?
The API in the non-crash case is defined by POSIX, so I translate this
as: What guarantees in case of a crash should the file system give?
One ideal is to perform all operations synchronously. That's very
expensive, however.
The next-cheaper ideal is to preserve the order of the operations
(in-order semantics), i.e., after crash recovery you will find the file
system in one of the states that it logically had before the crash;
the file system may lose the operations that happened after some point
in time, but it will be just as consistent as it was at that point in
time. If the file system gives this guarantee, any application that
is written to be safe against being killed will also have consistent (but
not necessarily up-to-date) data in case of a system crash.
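To make the guarantee concrete, here is a small toy model in Python. It is my own sketch, not code from any real file system: the names apply_op and possible_crash_states are made up for illustration, and each operation is treated as atomic, which is a simplification. The file system is a dict from file name to contents, and the states that recovery may produce under the order-preserving guarantee are exactly the states after each prefix of the operation sequence.

```python
def apply_op(state, op):
    """Return a new state (dict: file name -> contents) with one operation applied."""
    new = dict(state)
    kind = op[0]
    if kind == "write":        # ("write", name, data): create or overwrite a file
        _, name, data = op
        new[name] = data
    elif kind == "rename":     # ("rename", src, dst): replace dst with src
        _, src, dst = op
        new[dst] = new.pop(src)
    elif kind == "unlink":     # ("unlink", name): remove a file
        new.pop(op[1])
    else:
        raise ValueError(f"unknown operation: {kind}")
    return new


def possible_crash_states(initial, ops):
    """States recovery may produce if the order of operations is preserved:
    the state after each prefix of ops, including the empty prefix."""
    states = [dict(initial)]
    for op in ops:
        states.append(apply_op(states[-1], op))
    return states


# An application that is safe against being killed: write the new version
# to a temporary file, then rename it over the old one.
initial = {"config": "old"}
ops = [("write", "config.tmp", "new"), ("rename", "config.tmp", "config")]

for state in possible_crash_states(initial, ops):
    # In every state recovery can produce, "config" is complete: old or new.
    assert state["config"] in ("old", "new")
```

A file system that persists the rename before the data can come back in a state that is not in this list, and that is the zero-length file.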
This guarantee can be implemented relatively cheaply in a
copy-on-write file system, so I really would like Btrfs to give that
guarantee, and give it for its default mode (otherwise things like
ext3's data=journal debacle will happen).
How to implement this guarantee? When you decide to do another
commit, just remember the then-current logical state of the file
system (i.e., which blocks have to be written out), then write them
out, then do a barrier, and finally the root block. There are some
complications: e.g., you have to deal with some processes being in the
middle of some operation at the time; and if a later operation wants
to change a block before it is written out, you have to make a new
working copy of that block (in addition to the one waiting to be
written out), but that's just a variation on the usual copy-on-write
scheme.
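The commit sequence described above can be sketched as a toy model in Python. This is only an illustration under my own assumptions, not how Btrfs or any other file system is implemented: the class ToyCowFS and all its method names are hypothetical, the "disk" is a dict, the barrier is only a comment, and the complication of processes being in the middle of an operation is ignored.

```python
class ToyCowFS:
    def __init__(self):
        self.working = {}     # in-memory logical state: block number -> data
        self.dirty = set()    # blocks modified since the last commit began
        self.pending = None   # frozen copy of the blocks being committed
        self.disk = {}        # stable storage: location -> data
        self.root = {}        # on-disk root block: block number -> location
        self.next_loc = 0     # next unused location on the toy disk

    def write(self, block, data):
        # Modifies only the working copy; a frozen copy in self.pending
        # (if any) is left untouched.
        self.working[block] = data
        self.dirty.add(block)

    def begin_commit(self):
        # Remember the then-current logical state of the dirty blocks.
        self.pending = {b: self.working[b] for b in self.dirty}
        self.dirty.clear()

    def write_out_pending(self):
        # Copy-on-write: every block goes to a fresh location, so the
        # blocks reachable from the old root are never overwritten.
        new_root = dict(self.root)
        for block, data in self.pending.items():
            self.disk[self.next_loc] = data
            new_root[block] = self.next_loc
            self.next_loc += 1
        return new_root

    def finish_commit(self, new_root):
        # A real implementation would issue a write barrier here, so that
        # all data blocks are stable before the root block is written.
        self.root = new_root
        self.pending = None

    def recover(self):
        # State seen after a crash: whatever the on-disk root points to.
        return {b: self.disk[loc] for b, loc in self.root.items()}


fs = ToyCowFS()
fs.write(0, "A1")
fs.begin_commit()
fs.write(0, "A2")                  # later operation; does not disturb the commit
new_root = fs.write_out_pending()
assert fs.recover() == {}          # crash before the root write: old state
fs.finish_commit(new_root)
assert fs.recover() == {0: "A1"}   # the state as of begin_commit, not "A2"
```

In this model a crash at any point leaves either the old root or the new root on disk, and both point to a complete, consistent set of blocks.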
You would also have to decide how to deal with fsync() when you
give this guarantee: Can fsync()ed operations run ahead of the rest
(unlike what you normally guarantee), or do you just perform a sync
when an fsync() is requested?
The benefits of providing such a guarantee would be:
- Many applications that work well when killed would work well on
Btrfs even upon a crash.
- It would be a unique selling point for Btrfs. Other popular Linux
file systems don't guarantee anything at all, and their maintainers
only grudgingly address the worst shortcomings when there's a large
enough outcry while complaining about "incorrect API usage" by
applications, and some play fast and loose in other ways (e.g., by not
using barriers). Many users value their data more than these
maintainers and would hopefully flock to a filesystem that actually
gives crash consistency guarantees.
If you don't even give crash consistency guarantees, I don't really
see a point in having the checksums that are one of the main features
of Btrfs. I have seen many crashes, including some where the file
system lost data, but I have never seen hardware go bad in a way where
checksums would help.