Delayed reaction

Posted Dec 1, 2009 22:48 UTC (Tue) by paulj (subscriber, #341)
In reply to: Delayed reaction by dlang
Parent article: Journal-guided RAID resync

My basic, potentially a bit wrong, understanding of ZFS is that:

a) It uses COW, and blocks are arranged in a Merkle tree. An updated block isn't properly part of the filesystem until its referencing meta-data block has been updated with its hash. This means that ultimately there is only one block you absolutely must update atomically - the root "uber-block" (and even there, you can arrange to have redundant uber-blocks and you can arrange that old uber-block(s) remain valid until you're confident the new uber-block is properly written out).

Even if hardware lies massively and re-arranges the writes and manages to lose/corrupt the child data block write, while still writing the parent, you'll at least be able to detect the inconsistency, generally.

b) Because ZFS integrates RAID and FS, it can do variable-sized stripes. So if you want to write data, there's no requirement to read-in other data in order to calculate parity. You only have to write the data.

Combine the 2 and you have avoided the RAID-5 "write hole" AND gained performance (presuming your data hash function is quite cheap relative to I/O - which tends to be true on any half-recent hardware). See Jeff Bonwick's RAID-Z blog entry.
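To make point (a) concrete, here is a minimal Python sketch of a copy-on-write store with a two-level hash tree. It is a toy model under my own assumptions, not ZFS's actual on-disk format: the names `Store`, `commit`, `update` and `read_all` are made up for illustration, and SHA-256 stands in for whatever checksum the filesystem really uses.

```python
import hashlib
import json


def block_hash(data):
    return hashlib.sha256(data).hexdigest()


class Store:
    """Toy block store: blocks are only ever appended, never overwritten (COW)."""

    def __init__(self):
        self.blocks = {}      # address -> bytes
        self.next_addr = 0
        self.uber = None      # (address, hash) of the live root block

    def write(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr, block_hash(data)

    def read(self, addr, expected_hash):
        data = self.blocks[addr]
        if block_hash(data) != expected_hash:
            raise IOError("checksum mismatch at block %d" % addr)
        return data


def commit(store, leaves):
    """Write the leaves, then a root that points at them, then flip the uber-block."""
    pointers = [store.write(leaf) for leaf in leaves]
    store.uber = store.write(json.dumps(pointers).encode())


def update(store, index, new_data):
    """Replace one leaf without touching any existing block."""
    pointers = json.loads(store.read(*store.uber))
    pointers[index] = store.write(new_data)          # new copy of the leaf
    store.uber = store.write(json.dumps(pointers).encode())  # the one atomic step


def read_all(store):
    pointers = json.loads(store.read(*store.uber))
    return [store.read(addr, digest) for addr, digest in pointers]


store = Store()
commit(store, [b"alpha", b"beta"])
old_uber = store.uber
update(store, 1, b"BETA")
print(read_all(store))      # new tree
store.uber = old_uber       # pretend the new uber-block never made it to disk
print(read_all(store))      # old tree is still complete and verifiable
```

In this sketch, assigning `store.uber` is the only step that changes which tree is live. Anything written before that is unreachable until the flip, and a child block that was lost or corrupted shows up as a checksum mismatch when its parent is followed.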


Delayed reaction

Posted Dec 4, 2009 1:26 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

normal raid 5 implementations do not require that you read all the blocks to update one, they only require that you know the old and new contents of the block that you are modifying and the old contents of the parity block. so that isn't a ZFS advantage
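a rough sketch of that small-write update, in Python. this is just the XOR arithmetic, not any real implementation's code; the function names are made up for illustration.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """Parity computed the long way, from every data block in the stripe."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = xor(parity, block)
    return parity


def rmw_parity(old_parity, old_data, new_data):
    """Read-modify-write: new_parity = old_parity XOR old_data XOR new_data."""
    return xor(xor(old_parity, old_data), new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = full_parity(stripe)

new_block = b"\xff\x00"
new_parity = rmw_parity(parity, stripe[1], new_block)
stripe[1] = new_block

# Both routes agree, but the shortcut only needed two old blocks.
assert new_parity == full_parity(stripe)
```

note that the shortcut still costs two reads (old data, old parity) before the two writes, unless those blocks are already cached.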

I wasn't aware that ZFS could mix raid types/configurations within a single filesystem. that seems very strange to me, are you sure that it can do this? I thought that when you created the ZFS filesystem you told it the raid type and configuration to use.

the basic problem is that without doing a sync, the OS has no way of knowing when the data written to different drives actually hits the media and is safe (and this can happen in different orders for different writes, even to the same spot), thus the need to do 'raid journaling' or similar.

I don't believe that ZFS does COW, doing that would fragment the data very rapidly, and that is death to performance for drives that need to seek.

I don't believe that your two points are valid in themselves, and even if they were I don't see how they combine to solve this problem.

Delayed reaction

Posted Dec 4, 2009 19:10 UTC (Fri) by paulj (subscriber, #341) [Link]

I've only had a quick glance, so I could be completely wrong and just looking for anything consistent with my pre-existing notions, but Linux RAID-5 seems to schedule old blocks to be read in when a block is modified, in order to update the parity. E.g. look at handle_stripe, handle_stripe_dirtying and places where the R5_Wantread bit is set. ICBW though.

There are quite a few reasonably authoritative sources that say ZFS is COW and transactional. So there's no reason to think it isn't, though I haven't read the code.

I didn't say ZFS mixed RAID types in a single filesystem. I said it used variable-sized stripes. ZFS always writes full stripes, so it never has to read back data to finish the write. Combined with the Merkle tree arrangement, it means ordinary writes do not leave the RAID or FS inconsistent, even for short windows of time.
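A rough Python sketch of what I mean by variable-sized full stripes. This is a simplified model of the idea, not the real RAID-Z layout: `full_stripe_write` and `xor_blocks` are hypothetical names, and I assume single parity with one block per disk.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR across a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_write(data_blocks, n_disks):
    """Split a write into stripes of at most n_disks - 1 data blocks plus parity.

    Parity is computed from the new data alone, so nothing is read back
    from disk. A short write simply produces a narrower stripe.
    """
    width = n_disks - 1
    stripes = []
    for start in range(0, len(data_blocks), width):
        chunk = data_blocks[start:start + width]
        stripes.append((chunk, xor_blocks(chunk)))
    return stripes


# Five blocks on four disks: one stripe of 3+1 and one narrower stripe of 2+1.
for data, parity in full_stripe_write(
        [b"\x01", b"\x02", b"\x04", b"\x08", b"\x10"], n_disks=4):
    print(len(data), "data blocks, parity", parity.hex())
```

Since every stripe is written whole to fresh locations, there is no existing parity block to bring up to date, which is where the read-modify-write cycle comes from in the fixed-stripe case.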

Block layer RAID-5 and a separate FS have never been able to achieve that, hence the talk now of introducing more interfaces to make them more closely coupled (and doesn't btrfs have its own device management layer, à la ZFS?).

There's lots of stuff on the net about this.. ;)

Delayed reaction

Posted Dec 4, 2009 22:56 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

so you mean that sometimes it uses 16k stripes across the disks, and in the same filesystem will use 64K stripes across the disks? (picking a couple sizes out of thin air here)

how can it decide when to use what, and what data needs to be stored on what stripe size?

given all the people who are claiming that ZFS is the best performing filesystem ever, I don't believe that it is COW and transactional across multiple drives. on rotating media where seeks are expensive these features and performance just don't coexist.

the problem is that when ZFS wants to update a raid 5 stripe it has two writes to do, to two different drives. unless it does a full sync and flush of the drive write buffer for each write (which will kill performance), it has no way of knowing which block on disk gets updated first.

if the system crashes after one block is updated and before the second block is updated, how can it know which one is correct?
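the failure I have in mind, as a toy Python sketch of an in-place update on a stripe with two data blocks and one parity block (the variable names are made up, this is not any real implementation):

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


# A consistent stripe: parity = d0 XOR d1.
d0, d1 = b"\x0f", b"\xf0"
disks = {"d0": d0, "d1": d1, "parity": xor(d0, d1)}

# In-place update of d0 reaches the media...
disks["d0"] = b"\x01"
# ...and the system crashes before the matching parity write does.

# Later the drive holding d1 dies and we rebuild it from what is left.
rebuilt_d1 = xor(disks["d0"], disks["parity"])

print(rebuilt_d1.hex(), "expected", d1.hex())
assert rebuilt_d1 != d1   # d1 was never written, yet it comes back wrong
```

nothing in the stripe itself says whether the data block or the parity block is the stale one.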

Delayed reaction

Posted Dec 4, 2009 23:20 UTC (Fri) by foom (subscriber, #14868) [Link]

> I don't believe that it is COW and transactional across multiple drives.

You could, perhaps, go read about it instead of saying what it can't possibly be.

Here's one place to start:

http://en.wikipedia.org/wiki/ZFS#Copy-on-write_transactio...

Delayed reaction

Posted Dec 5, 2009 1:29 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

note that this says it is transactional when synchronous writes are desired. that is not the normal operation

second, it says that it uses an intent bitmap to implement this protection. this is one of the methods described in the article to address this problem

so ZFS didn't sidestep this issue, it implemented the same method that MD offers (which is usually disabled for performance reasons), write intent bitmaps.

I stand corrected on the COW issue, but by doing COW you have two major problems

1. every write to a file ends up updating many more places on the filesystem (the file block gets moved and re-written, then the metadata that points to that file gets moved and re-written, then the metadata that points to that metadata..... until the root directory gets re-written) this is one of the failures of btree filesystems that made them unusable on rotating media.

wikipedia states that ZFS works around this by buffering the writes to consolidate them, but that is the same strategy that ext3 uses, and the need to then write all that buffered data in the face of a sync call is why fsync is so horrible on ext3
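a toy Python model of that cost, and of what batching buys. the tree shape (fixed fanout, fixed depth) and the function name are assumptions for illustration, not ZFS's real structure.

```python
def blocks_rewritten(dirty_leaves, fanout, depth):
    """Count the distinct blocks one COW commit has to write.

    Leaves are numbered from 0; depth is the number of pointer levels
    above the leaves. Every dirty leaf drags each of its ancestors along,
    but an ancestor shared by several dirty leaves is written only once.
    """
    touched = set()
    for leaf in dirty_leaves:
        index = leaf
        for level in range(depth + 1):
            touched.add((level, index))
            index //= fanout
    return len(touched)


one_at_a_time = sum(blocks_rewritten([leaf], 16, 3) for leaf in range(10))
batched = blocks_rewritten(range(10), 16, 3)
print(one_at_a_time, batched)
```

committing ten neighbouring leaves one at a time rewrites the whole path each time, while one batched commit shares the ancestors. the flip side is the one above: a sync call forces that batch out early.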

Delayed reaction

Posted Dec 5, 2009 12:22 UTC (Sat) by paulj (subscriber, #341) [Link]

Note that the ZFS Intent Log doesn't affect the consistency of the filesystem. The fs is consistent without the intent log, thanks to the COW, Merkle tree arrangement. So it's not solving any consistency problems, i.e.:

  • ZFS is transactional, with or without the ZIL
  • ZFS is always consistent on disk, with or without the ZIL

The ZIL is there to solve the problem of supporting performant synchronous writes. ZFS would still support consistent, synchronous writes without the ZIL, just they'd be very slow.

I think it'd be useful if you looked at this ZFS presentation, which is where I'm getting most of my info from. The OSOL ZFS community also has a source overview.

Delayed reaction

Posted Dec 5, 2009 12:31 UTC (Sat) by paulj (subscriber, #341) [Link]

A good blog entry on the ZIL.

Delayed reaction

Posted Dec 5, 2009 10:20 UTC (Sat) by paulj (subscriber, #341) [Link]

My understanding is that COW is fundamental to how ZFS works (refer back to my previous comment on Merkle trees and how data and meta-data reference each other). I haven't read the code to verify this, but it's what all the papers and presentations on ZFS say.

I think that also already answers your question as to how ZFS can know which one is correct. I'm not sure there's any value in, again, explaining how ZFS consistency (in theory) comes down to the update of a single "uber-block".
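For what it's worth, the "which one is correct" decision can be sketched in a few lines of Python. This is my own simplified illustration, not the real ZFS layout: I assume a small set of uber-block slots, each carrying a transaction number and a checksum over its own contents, and the names `make_uber` and `pick_active` are made up.

```python
import hashlib
import json


def make_uber(txg, root_ptr):
    """Build an uber-block record with a checksum over its own body."""
    body = json.dumps({"txg": txg, "root": root_ptr}, sort_keys=True).encode()
    return {"body": body, "sum": hashlib.sha256(body).hexdigest()}


def pick_active(slots):
    """Return the newest uber-block whose checksum verifies."""
    valid = [s for s in slots
             if hashlib.sha256(s["body"]).hexdigest() == s["sum"]]
    if not valid:
        raise IOError("no valid uber-block found")
    newest = max(valid, key=lambda s: json.loads(s["body"])["txg"])
    return json.loads(newest["body"])


slots = [make_uber(7, 100), make_uber(8, 140)]

# Transaction 9 crashed mid-write: its uber-block is torn on disk.
torn = make_uber(9, 180)
torn["body"] = torn["body"][:5]
slots.append(torn)

print(pick_active(slots))   # falls back to transaction 8
```

A torn or missing newest uber-block just means the previous transaction is the live one, and everything reachable from its root is verified by hash on the way down.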

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds