Posted Dec 1, 2009 22:48 UTC (Tue) by paulj
In reply to: Delayed reaction
Parent article: Journal-guided RAID resync
My basic, potentially a bit wrong, understanding of ZFS is that:
a) It uses COW, and blocks are arranged in a Merkle tree. An updated block isn't properly part of the filesystem until its referencing meta-data block has been updated with its hash. This means that ultimately there is only one block you absolutely must update atomically - the root "uber-block" (and even there, you can arrange to have redundant uber-blocks and you can arrange that old uber-block(s) remain valid until you're confident the new uber-block is properly written out).
Even if hardware lies massively and re-arranges the writes and manages to lose/corrupt the child data block write, while still writing the parent, you'll at least be able to detect the inconsistency, generally.
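To make (a) concrete, here is a toy sketch of the idea, not ZFS's actual on-disk format. The names ToyStore, commit and read_current are made up for illustration, and the real tree has many levels rather than one metadata block. Blocks are only ever appended, each pointer carries the hash of the block it points to, and the uber-block pointer is switched last.

```python
import hashlib

class ToyStore:
    """Hypothetical append-only block store: nothing is overwritten in place."""
    def __init__(self):
        self.blocks = []        # block address = index into this list
        self.uberblock = None   # (address, hash) pointer to the current root

    def write(self, payload: bytes):
        # Always allocate a fresh block (copy-on-write), return a checksummed pointer.
        self.blocks.append(payload)
        return len(self.blocks) - 1, hashlib.sha256(payload).hexdigest()

    def read(self, ptr):
        # Verify the block against the hash stored in its parent's pointer.
        addr, expected = ptr
        payload = self.blocks[addr]
        if hashlib.sha256(payload).hexdigest() != expected:
            raise IOError(f"checksum mismatch at block {addr}")
        return payload

def commit(store: ToyStore, data: bytes):
    data_ptr = store.write(data)
    # The parent metadata block records the child's address and hash.
    meta_ptr = store.write(f"{data_ptr[0]}:{data_ptr[1]}".encode())
    # Only this final pointer switch has to be atomic.
    store.uberblock = meta_ptr

def read_current(store: ToyStore) -> bytes:
    addr, digest = store.read(store.uberblock).decode().split(":")
    return store.read((int(addr), digest))
```

If the data block write is lost or corrupted but the parent makes it to disk, read_current raises instead of returning garbage, and the previous uber-block still points at an intact older tree.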
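b) Because ZFS integrates RAID and FS, it can do variable-sized stripes, so every write is a full-stripe write. If you want to write data, there's no requirement to read in other data (the old data and old parity of a partially-updated stripe) in order to calculate parity. You only have to write the data and its parity.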
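A rough sketch of (b), assuming single XOR parity and a tiny 4-byte sector. The function names are hypothetical, and this is not RAID-Z's real layout, which also has to map stripes onto disks. The point is just that parity is computed purely from the data being written, and the stripe is as wide as the write needs it to be.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def make_stripe(data: bytes, sector: int = 4):
    """Split one write into sector-sized chunks plus one parity chunk.

    The stripe width depends on len(data); no existing on-disk data is read.
    """
    chunks = [data[i:i + sector].ljust(sector, b"\0")
              for i in range(0, len(data), sector)]
    return chunks, xor_blocks(chunks)

def rebuild(chunks, parity, lost: int) -> bytes:
    """Recover the chunk at index `lost` from the survivors and the parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return xor_blocks(survivors + [parity])
```

For example, make_stripe(b"abcdefghij") gives three data chunks plus parity, while make_stripe(b"abc") gives one data chunk plus parity. Classic RAID-5 would instead have to read the old data and old parity for a small write (new parity = old parity XOR old data XOR new data), and that read-modify-write is where the write hole comes from.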
Combine the two and you have avoided the RAID-5 "write hole" AND gained performance (presuming your data hash function is quite cheap relative to I/O - which tends to be true on any half-recent hardware). See Jeff Bonwick's RAID-Z blog entry.