ext4 and data loss
Posted Mar 12, 2009 18:12 UTC (Thu) by davecb
In reply to: ext4 and data loss
Parent article: ext4 and data loss
On a system that predates POSIX and/or logging filesystems, you will get the behavior you expect: this is exactly the Unix V6 behavior. The data blocks will be written out, then the inode's length field will be updated, then the (atomic) rename will complete and the file will be replaced.
POSIX doesn't guarantee that: it allows people experimenting with delaying or reordering for performance reasons to weaken the guarantees.
Research filesystems tried both, and found that one could get considerable performance advantages by reordering the writes to be in elevator order, and delaying them until there was enough data to coalesce adjacent writes. Some of this is now broadly available as SCSI's "tag queueing".
Alas, if a write failed, the on-disk
data was now inconsistent, and one could end up with a disk of garbage.
A former colleague, then at UofT, found he
could reorder and coalesce with great benefit, so long as he inserted "barriers" into the sequence where there were correctness-critical orderings.
Those had to remain, but most of the performance could be kept, with a write cache and a delay of a few seconds.
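A toy sketch of the idea, in Python, may help. It is not my colleague's actual algorithm, only a minimal model under stated assumptions: each request is a single block number, a hypothetical BARRIER marker separates correctness-critical groups, and writes are sorted and coalesced only within a group, never across a barrier.

```python
BARRIER = "barrier"  # hypothetical marker for a correctness-critical ordering point


def coalesce(blocks):
    """Sort block numbers (elevator order) and merge adjacent ones into (start, length) runs."""
    runs = []
    for block in sorted(set(blocks)):  # duplicate writes to a block collapse into one
        if runs and runs[-1][0] + runs[-1][1] == block:
            # This block directly follows the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((block, 1))
    return runs


def schedule(requests):
    """Return the writes in issue order, reordering freely only between barriers."""
    issued, batch = [], []
    for req in requests:
        if req == BARRIER:
            # Everything queued before the barrier must be issued before anything after it.
            issued.extend(coalesce(batch))
            batch = []
        else:
            batch.append(req)
    issued.extend(coalesce(batch))
    return issued
```

With the requests 7, 3, 4, BARRIER, 1, 2, 9, this issues blocks 3-4 and 7 first, and only then 1-2 and 9, even though pure elevator order would have put 1 and 2 at the front.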
Now we're working with journaled filesystems, which reduce the cost of preserving order even more, but have separated metadata from data updates. This introduced a new opportunity to inadvertently order updates in ways that broke the older, but unpublished, correctness criteria.
Some journaled filesystems guarantee that
the sequence you (and I) use is correctness-preserving. ZFS is one of these. Others, including ext3 and ext4, leave a window in which a crash will render the filesystem inconsistent. Ext3 has a small window, and for
unknown reasons, ext4 has one as wide as the
I'm of the opinion both could have arbitrarily small risk periods, and with a persistent write cache or journal, both can avoid all risk.
However, changing the algorithm to one which
is correctness-preserving would arguably be a better answer.
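For reference, here is a minimal sketch of the application-side sequence, assuming it is the usual "write the new contents to a temporary file, then rename it over the old one" pattern. The explicit fsync calls are my addition: they are the portable way to ask for the ordering that the filesystem may otherwise not provide. The function name and the ".tmp" suffix are made up for illustration, and the directory fsync assumes a POSIX-like system such as Linux.

```python
import os


def atomic_replace(path, data):
    """Replace the file at `path` with `data` (bytes): write temp file, fsync, rename."""
    tmp = path + ".tmp"  # hypothetical temp-file name, in the same directory as `path`
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Force the data blocks to stable storage before the rename becomes visible.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically switch the name over to the new contents.
    os.replace(tmp, path)

    # Flush the directory entry too, so the rename itself survives a crash.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Without the first fsync, a filesystem that delays data writes is free to commit the rename before the data, which is the window described above.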