At least this discussion gives some publicity to the nasty shortcomings of RAID-*.
Posted Dec 1, 2009 3:08 UTC (Tue) by dlang (✭ supporter ✭, #313)
note that checksums don't help here as all drives have valid checksums.
Posted Dec 1, 2009 22:48 UTC (Tue) by paulj (subscriber, #341)
My basic, potentially a bit wrong, understanding of ZFS is that:
a) It uses COW, and blocks are arranged in a Merkle tree. An updated block isn't properly part of the ZFS until its referencing meta-data block has been updated with its hash. This means that ultimately there is only one block you absolutely must update atomically - the root "uber-block" (and even there, you can arrange to have redundant uber-blocks and you can arrange that old uber-block(s) remain valid until you're confident the uber-block is properly written out).
Even if hardware lies massively and re-arranges the writes and manages to lose/corrupt the child data block write, while still writing the parent, you'll at least be able to detect the inconsistency, generally.
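The detection claim in (a) can be illustrated with a toy Python sketch (made-up block contents, nothing ZFS-specific): the parent records a hash of its child, so a child write that was lost or garbled no longer matches what the parent expects.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Parent metadata block stores the hash of its child data block.
child = b"new data block"
parent = {"child_hash": h(child)}

def child_is_consistent(parent, child):
    # After a crash, a child whose contents don't match the hash
    # recorded in its parent is detectably stale or corrupt.
    return parent["child_hash"] == h(child)

# Lost/garbled child write: the stored hash no longer matches.
torn_child = b"old or garbled contents"
assert child_is_consistent(parent, child)
assert not child_is_consistent(parent, torn_child)
```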
b) Because ZFS integrates RAID and FS, it can do variable-sized stripes. So if you want to write data, there's no requirement to read-in other data in order to calculate parity. You only have to write the data.
Combine the 2 and you have avoided the RAID-5 "write hole" AND gained performance (presuming your data hash function is quite cheap relative to I/O - which tends to be true on any half-recent hardware). See Jeff Bonwick's RAID-Z blog entry.
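For (b), here is a minimal sketch of a variable-width full-stripe write with XOR parity (the block layout is invented for illustration): because the stripe is exactly as wide as the write, parity is computed from the new data alone, with no read-back of existing blocks.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR all equal-sized data blocks together to get the parity block."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def full_stripe_write(data, block_size=4):
    """Split a write into blocks and emit one full stripe: data + parity.
    The stripe width matches the write, so no existing on-disk data has
    to be read back to recompute parity."""
    blocks = [data[i:i + block_size].ljust(block_size, b"\0")
              for i in range(0, len(data), block_size)]
    return blocks + [xor_parity(blocks)]

stripe = full_stripe_write(b"12345678")
# Parity lets any single lost block be reconstructed by XOR of the rest.
lost = stripe[0]
recovered = xor_parity(stripe[1:])
assert recovered == lost
```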
Posted Dec 4, 2009 1:26 UTC (Fri) by dlang (✭ supporter ✭, #313)
I wasn't aware that ZFS could mix raid types/configurations within a single filesystem. that seems very strange to me, are you sure that it can do this? I thought that when you created the ZFS filesystem you told it the raid type and configuration to use.
the basic problem is that without doing a sync, the OS has no way of knowing when the data written to different drives actually hits the media and is safe (and this can happen in different orders for different writes, even to the same spot), thus the need to do 'raid journaling' or similar.
I don't believe that ZFS does COW, doing that would fragment the data very rapidly, and that is death to performance for drives that need to seek.
I don't believe that your two points are valid in themselves, and even if they were I don't see how they combine to solve this problem.
Posted Dec 4, 2009 19:10 UTC (Fri) by paulj (subscriber, #341)
There are quite a few reasonably authoritative sources that say ZFS is COW and transactional. So there's no reason to think it isn't, though I haven't read the code.
I didn't say ZFS mixed RAID types in a single FS. I said it used variable sized stripes. ZFS always writes full-stripes, so it never has to read-back data to finish the write. Combined with the Merkle tree arrangement, it means ordinary writes do not leave the RAID or FS inconsistent, even for short windows of time.
Block layer RAID-5 and separate FS have never been able to achieve that, hence the talk now of introducing more interfaces to make them more closely coupled (and doesn't btrfs have its own device management layer, a la ZFS?).
There's lots of stuff on the net about this.. ;)
Posted Dec 4, 2009 22:56 UTC (Fri) by dlang (✭ supporter ✭, #313)
how can it decide when to use what, and what data needs to be stored on what stripe size?
given all the people who are claiming that ZFS is the best performing filesystem ever, I don't believe that it is COW and transactional across multiple drives. on rotating media where seeks are expensive these features and performance just don't coexist.
the problem is that when ZFS wants to update a raid 5 strip it has two writes to do, to two different drives. unless it does a full sync and flush of the drive write buffer for each write (which will kill performance), it has no way of knowing which block on disk gets updated first.
if the system crashes after one block is updated and before the second block is updated, how can it know which one is correct?
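The inconsistency being described here is the classic RAID-5 write hole, and it is easy to reproduce in a toy model (XOR parity over two data "drives", all contents invented): after a crash between the two dependent writes, the stripe no longer satisfies its parity invariant, and nothing in the stripe says which half is stale.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Two data drives and one parity drive, initially consistent.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)
assert xor(d0, d1) == parity          # parity invariant holds

# In-place update of d0 needs two dependent writes: new d0, new parity.
new_d0 = b"CCCC"
new_parity = xor(new_d0, d1)

# Simulated crash after the data write but before the parity write:
d0 = new_d0
# The stripe is now inconsistent, and the array alone cannot tell
# whether d0 or parity is the stale half.
assert xor(d0, d1) != parity
```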
Posted Dec 4, 2009 23:20 UTC (Fri) by foom (subscriber, #14868)
You could, perhaps, go read about it instead of saying what it can't possibly be.
Here's one place to start:
Posted Dec 5, 2009 1:29 UTC (Sat) by dlang (✭ supporter ✭, #313)
second, it says that it uses an intent bitmap to implement this protection. this is one of the methods described in the article to address this problem
so ZFS didn't sidestep this issue, it implemented the same method that MD offers (which is usually disabled for performance reasons), write intent bitmaps.
I stand corrected on the COW issue, but by doing COW you have two major problems
1. every write to a file ends up updating many more places on the filesystem (the file block gets moved and re-written, then the metadata that points to that file gets moved and re-written, then the metadata that points to that metadata..... until the root directory gets re-written) this is one of the failures of btree filesystems that made them unusable on rotating media.
wikipedia states that ZFS works around this by buffering the writes to consolidate them, but that is the same strategy that ext3 uses, and the need to then write all that buffered data in the face of a sync call is why fsync is so horrible on ext3
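The write amplification in point 1 can be sketched with a toy COW tree (invented node names, truncated SHA-256 standing in for the block checksum): one logical leaf update becomes one physical write per level, because each parent must be rewritten to point at (and hash) its rewritten child.

```python
import hashlib

def h(b):
    return hashlib.sha256(b).hexdigest()[:8]

def cow_update(ancestors, new_leaf):
    """Return the list of freshly written blocks for one leaf update.
    Nothing is overwritten in place: a new leaf is written, then a new
    copy of every ancestor up to the root, bottom-up."""
    written = [("leaf", new_leaf, h(new_leaf))]
    child_hash = h(new_leaf)
    for name in ancestors:                      # bottom-up toward the root
        body = (name + ":" + child_hash).encode()
        written.append((name, body, h(body)))
        child_hash = h(body)
    return written

writes = cow_update(["indirect", "inode", "uberblock"], b"one changed block")
assert len(writes) == 4   # 1 logical write became 4 physical writes
```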
Posted Dec 5, 2009 12:22 UTC (Sat) by paulj (subscriber, #341)
Note that the ZFS Intent Log doesn't affect the consistency of the filesystem. The fs is consistent without the intent log, thanks to the COW, Merkle tree arrangement. So it's not solving any consistency problems, i.e.:
The ZIL is there to solve the problem of supporting performant synchronous writes. ZFS would still support consistent, synchronous writes without the ZIL, just they'd be very slow.
I think it'd be useful if you looked at this ZFS presentation, which is where I'm getting most of my info from. The OSOL ZFS community also has a source overview.
Posted Dec 5, 2009 12:31 UTC (Sat) by paulj (subscriber, #341)
Posted Dec 5, 2009 10:20 UTC (Sat) by paulj (subscriber, #341)
I think that also already answers your question as to how ZFS can know which one is correct. I'm not sure there's any value in, again, explaining how ZFS consistency (in theory) comes down to the update of a single "uber-block".
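A toy model of the uber-block scheme mentioned earlier (fields and selection logic invented for illustration, not the actual ZFS on-disk format): with redundant, checksummed copies tagged by a transaction-group number, mount simply picks the newest valid copy, so a torn uber-block write just falls back to an older consistent tree.

```python
import hashlib

def checksum(payload):
    return hashlib.sha256(payload).hexdigest()

# Redundant uber-block copies, each carrying a transaction-group
# number (txg) and a self-checksum over its payload.
uberblocks = [
    {"txg": 41, "payload": b"root@41", "sum": checksum(b"root@41")},
    {"txg": 42, "payload": b"root@42", "sum": checksum(b"root@42")},
    {"txg": 43, "payload": b"root@43", "sum": "garbled"},  # torn write
]

def best_root(copies):
    """Pick the valid copy with the highest txg; torn writes are
    filtered out by the checksum, leaving an older consistent root."""
    valid = [u for u in copies if u["sum"] == checksum(u["payload"])]
    return max(valid, key=lambda u: u["txg"])

assert best_root(uberblocks)["txg"] == 42
```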
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds