so you mean that sometimes it uses 16K stripes across the disks, and in the same filesystem it will use 64K stripes across the disks? (picking a couple of sizes out of thin air here)
how can it decide when to use what, and what data needs to be stored on what stripe size?
given all the people who are claiming that ZFS is the best performing filesystem ever, I don't believe that it is COW and transactional across multiple drives. on rotating media where seeks are expensive these features and performance just don't coexist.
the problem is that when ZFS wants to update a RAID 5 stripe it has two writes to do, to two different drives. unless it does a full sync and flush of the drive write buffer for each write (which will kill performance), it has no way of knowing which block on disk gets updated first.
if the system crashes after one block is updated and before the second block is updated, how can it know which one is correct?
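A toy sketch of the failure being described, assuming a two-data-block stripe with XOR parity. The block contents and variable names are made up for illustration; this models generic RAID 5 behaviour, not any particular implementation.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A consistent stripe: two data blocks plus parity, one block per drive.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Updating d0 in place needs two writes: the new data and the new parity.
new_d0 = b"CCCC"
d0_on_disk = new_d0       # first write reaches the disk
parity_on_disk = parity   # crash: the parity write never happens

# After the crash the stripe no longer adds up, and nothing on disk
# records whether the data block or the parity block is the stale one.
stripe_consistent = xor_blocks(d0_on_disk, d1) == parity_on_disk

# If the drive holding d1 now fails, rebuilding it from the surviving
# blocks yields something that was never written.
rebuilt_d1 = xor_blocks(d0_on_disk, parity_on_disk)
```

Here `stripe_consistent` ends up false and `rebuilt_d1` differs from the original `d1`, which is the ambiguity the question is about.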
Posted Dec 5, 2009 1:29 UTC (Sat) by dlang (✭ supporter ✭, #313)
note that this says it is transactional when synchronous writes are desired. that is not the normal operation
second, it says that it uses an intent bitmap to implement this protection. this is one of the methods described in the article to address this problem
so ZFS didn't sidestep this issue, it implemented the same method that MD offers (which is usually disabled for performance reasons): write intent bitmaps.
I stand corrected on the COW issue, but by doing COW you have two major problems
1. every write to a file ends up updating many more places on the filesystem (the file block gets moved and re-written, then the metadata that points to that file gets moved and re-written, then the metadata that points to that metadata..... until the root directory gets re-written). this is one of the failures of btree filesystems that made them unusable on rotating media.
wikipedia states that ZFS works around this by buffering the writes to consolidate them, but that is the same strategy that ext3 uses, and the need to then write all that buffered data in the face of a sync call is why fsync is so horrible on ext3
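To make the write amplification in point 1 concrete, here is a minimal sketch assuming a single chain of blocks from root to data. The `Block` class, the level names and `cow_update` are hypothetical, not taken from any real filesystem.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Block:
    name: str
    payload: str = ""
    child: Optional["Block"] = None  # one child per level keeps the sketch short

def cow_update(node: Block, new_payload: str, written: list) -> Block:
    """Copy-on-write update of the leaf: every ancestor is rewritten too."""
    if node.child is None:
        new = Block(node.name, new_payload)
    else:
        new = Block(node.name, node.payload,
                    cow_update(node.child, new_payload, written))
    written.append(new.name)  # record each block that had to be written
    return new

old_root = Block("root", child=Block("indirect",
                 child=Block("inode", child=Block("data", "v1"))))
written = []
new_root = cow_update(old_root, "v2", written)
# written is now ["data", "inode", "indirect", "root"]:
# one logical write turned into four block writes.
```

Buffering helps because several updates that share ancestors can be folded into one rewrite of those ancestors, which is the consolidation mentioned above.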
Delayed reaction
Posted Dec 5, 2009 12:22 UTC (Sat) by paulj (subscriber, #341)
Note that the ZFS Intent Log doesn't affect the consistency of the filesystem. The fs is consistent without the intent log, thanks to the COW, Merkle tree arrangement. So it's not solving any consistency problems, i.e.:
ZFS is transactional, with or without the ZIL
ZFS is always consistent on disk, with or without the ZIL
The ZIL is there to solve the problem of supporting performant synchronous writes. ZFS would still support consistent, synchronous writes without the ZIL; they'd just be very slow.
Posted Dec 5, 2009 10:20 UTC (Sat) by paulj (subscriber, #341)
My understanding is that COW is fundamental to how ZFS works (refer back to my previous comment on Merkle trees and how data and meta-data reference each other). I haven't read the code to verify this, but it's what all the papers and presentations on ZFS say.
I think that also already answers your question as to how ZFS can know which one is correct. I'm not sure there's any value in, again, explaining how ZFS consistency (in theory) comes down to the update of a single "uber-block".
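For anyone who does want it spelled out, a toy sketch of the idea, assuming an in-memory dict as the "disk" and a single root pointer standing in for the uber-block. None of the names or the layout reflect the real ZFS on-disk format; it only illustrates checksummed pointers plus a single final pointer switch.

```python
import ast
import hashlib

disk = {}         # address -> bytes; existing blocks are never overwritten
uberblock = None  # (address, checksum) of the current root block

def write_block(data: bytes) -> tuple:
    """Write to a fresh address and return a (address, checksum) pointer."""
    addr = len(disk)
    disk[addr] = data
    return addr, hashlib.sha256(data).hexdigest()

def read_verified(ptr: tuple) -> bytes:
    """Follow a pointer and verify the block against the parent's checksum."""
    addr, checksum = ptr
    data = disk[addr]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("checksum mismatch")
    return data

def commit(file_data: bytes, crash_before_uberblock: bool = False) -> None:
    global uberblock
    data_ptr = write_block(file_data)
    root_ptr = write_block(repr(data_ptr).encode())  # root holds child pointer
    if crash_before_uberblock:
        return            # new blocks exist on disk but nothing points at them
    uberblock = root_ptr  # the single update that makes the new tree live

def read_file() -> bytes:
    root = read_verified(uberblock)
    return read_verified(ast.literal_eval(root.decode()))

commit(b"version 1")
commit(b"version 2", crash_before_uberblock=True)
after_crash = read_file()   # still the old, complete tree
commit(b"version 3")
after_commit = read_file()
```

A crash at any point before the last assignment leaves the old root in place, so the reader sees either the whole old tree or the whole new one, never a mixture.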