RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 18:24 UTC (Mon) by butlerm (guest, #13312)
Parent article: RAID 5/6 code merged into Btrfs

I have a question. Is there a plan to store write intent information somewhere so that parity information can be rebuilt for partially completed stripe writes after a system crash? MD uses a write intent bitmap for this.

In an ordinary filesystem the journal or intent log would be an excellent place for this, but I understand Btrfs doesn't use one.
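For context, a minimal sketch of the write-path side of such a bitmap, roughly in the spirit of what MD does. The names, the per-stripe granularity, and the synchronous flushes are simplifications for illustration, not MD's actual code:

    /* Toy write-intent bitmap: mark a stripe dirty before touching it,
     * and clear the bit only after all data and parity writes complete.
     * A crash leaves the bit set, so recovery knows which stripes need
     * their parity recomputed. */
    #include <stdint.h>

    #define NSTRIPES 1024

    static uint8_t intent_bitmap[NSTRIPES / 8];

    /* hypothetical stand-ins for the real I/O paths */
    static void persist_bitmap(void) { /* would flush the bitmap to stable storage */ }
    static void write_stripe_blocks(uint64_t s) { (void)s; /* data + parity writes */ }

    void raid_write(uint64_t stripe)
    {
        intent_bitmap[stripe / 8] |= (uint8_t)(1u << (stripe % 8));
        persist_bitmap();            /* must be durable before the stripe I/O */
        write_stripe_blocks(stripe); /* a crash here leaves the bit set */
        intent_bitmap[stripe / 8] &= (uint8_t)~(1u << (stripe % 8));
        persist_bitmap();            /* MD does this lazily and in batches */
    }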



RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 18:57 UTC (Mon) by drag (subscriber, #31333)

Wouldn't Btrfs's 'COW' design simply mean that partial writes are just discarded? Barring any foolishness in the hardware, the filesystem should always remain in a consistent state regardless of crashes. Maybe you just have to roll back a commit or two to get a good state.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:57 UTC (Mon) by masoncl (subscriber, #47138)

Almost. Since the implementation allows us to share stripes between different extents, you might have a shiny new extent going into the same stripe as an old extent.

In this case you need to protect the parity for the old extent just in case we crash while we're rewriting the parity blocks.
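To spell out why that matters: a partial-stripe update is the classic RAID5 read-modify-write, and between the new data hitting the disk and the recomputed parity hitting the disk the stripe is not self-consistent. A sketch of the parity math only (no real I/O, not btrfs code):

    /* new_parity = old_parity XOR old_data XOR new_data.
     * If we crash after the new data block is on disk but before this
     * recomputed parity lands, the on-disk parity no longer protects the
     * untouched (old extent's) blocks in the same stripe. */
    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096

    void rmw_parity(uint8_t *parity, const uint8_t *old_data,
                    const uint8_t *new_data)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }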

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 23:51 UTC (Mon) by butlerm (guest, #13312)

> Wouldn't Btrfs's 'COW' design simply mean that partial writes are just discarded?

That only works if you always write a full stripe, which is generally not the case. ZFS uses variable stripe sizes to achieve this for data writes, but the minimum stripe size tends to be rather large, depending on how many disks you have in your RAID set.

If you spread every filesystem block across all disks the way ZFS does, random read performance suffers dramatically. Every disk has to participate in every uncached data read. Minimum FS block sizes go up with the number of disks, and so on.
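A quick back-of-the-envelope illustration of that point (the numbers are invented for the example):

    /* With every block striped across all data disks, each uncached random
     * read occupies every spindle, so the array delivers roughly the
     * random-read IOPS of a single disk no matter how wide it is. */
    #include <stdio.h>

    int main(void)
    {
        int data_disks  = 8;          /* hypothetical RAID set       */
        int record_size = 128 * 1024; /* bytes per filesystem block  */
        int disk_iops   = 150;        /* ballpark for one 7200 rpm drive */

        printf("per-disk chunk per read: %d bytes\n", record_size / data_disks);
        printf("array random-read IOPS:  about %d, not %d\n",
               disk_iops, disk_iops * data_disks);
        return 0;
    }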

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:10 UTC (Mon) by Jonno (subscriber, #49613)

Due to the btrfs design a write intent bitmap isn't necessary. Checksums make it possible to figure out which drive is at fault without one, you just have to do a scrub after a crash.

Additionally, btrfs already keeps track of the last five transactions it committed, so it should be possible to automatically scrub just those, but I don't know if that is planned, or if the devs have something even smarter in mind.
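A rough sketch of what such a targeted post-crash scrub could look like, assuming the filesystem can enumerate the stripes touched by those recent transactions (hypothetical helper names, not btrfs code):

    /* For each recently written stripe: verify the data blocks against
     * their checksums, rebuild any block that fails from the remaining
     * members, then rewrite parity so it is consistent again. */
    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical I/O helpers, stubbed for the sketch */
    static bool csum_ok(uint64_t stripe, int member) { (void)stripe; (void)member; return true; }
    static void rebuild_member(uint64_t stripe, int member) { (void)stripe; (void)member; }
    static void rewrite_parity(uint64_t stripe) { (void)stripe; }

    void scrub_recent_stripes(const uint64_t *stripes, int n, int members)
    {
        for (int i = 0; i < n; i++) {
            for (int m = 0; m < members; m++)
                if (!csum_ok(stripes[i], m))
                    rebuild_member(stripes[i], m);
            rewrite_parity(stripes[i]);   /* parity was possibly stale */
        }
    }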

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:59 UTC (Mon) by masoncl (subscriber, #47138)

It's true that crcs allow us to figure out if the data on the drives is correct. But, if you crash while updating the parity and you lose one of the drives (not unusual in a power failure), you need to be able to rebuild the data from parity.

If the parity isn't consistent with the rest of the stripe, the rebuild isn't possible.

-chris
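A tiny self-contained demonstration of that failure mode, using one byte per "block" and simple XOR parity:

    /* Parity goes stale because a data block was rewritten without the
     * parity update completing; reconstructing a different, lost block
     * from that stale parity yields garbage. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t d[3]   = { 0x11, 0x22, 0x33 };
        uint8_t parity = d[0] ^ d[1] ^ d[2];   /* consistent: 0x00 */

        d[0] = 0xAA;       /* new data lands on disk ...           */
                           /* ... crash before parity is rewritten */

        /* now the disk holding d[2] is lost; try to rebuild it */
        uint8_t rebuilt = d[0] ^ d[1] ^ parity;

        printf("wanted 0x33, rebuilt 0x%02x\n", rebuilt);  /* prints 0x88 */
        return 0;
    }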

RAID 5/6 code merged into Btrfs

Posted Feb 6, 2013 15:27 UTC (Wed) by Jonno (subscriber, #49613)

> If the parity isn't consistent with the rest of the stripe, the rebuild isn't possible.

True, but a write-intent bitmap wouldn't help with that either: all it does is tell you which drive(s), if any, are out of date and need to be rebuilt, and that information won't help if you have lost a drive (or two for raid6) and can't rebuild anything.

RAID 5/6 code merged into Btrfs

Posted Feb 6, 2013 18:26 UTC (Wed) by butlerm (guest, #13312)

The purpose of a write intent bitmap is not to recover a failed drive, it is to recover from a lost write. In the event of a power failure or system crash, one or more of the writes may be lost (or partially completed), leaving the stripe parity in an inconsistent state.

Correct parity (sufficient to recover from a subsequent drive failure) can be trivially regenerated using the contents of the write intent bitmap. The data in the blocks actually being written may still be incomplete, of course, but that doesn't matter for the purpose of protecting the data on the other blocks in the same stripe.
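A sketch of that recovery pass, kept deliberately simple (fixed stripe geometry, stubbed I/O helpers; not MD's actual resync code): walk the bitmap and, for each dirty stripe, recompute parity from whatever data is currently on the disks.

    #include <stdint.h>
    #include <string.h>

    #define NSTRIPES   1024
    #define DATA_DISKS 3
    #define BLOCK_SIZE 4096

    static uint8_t intent_bitmap[NSTRIPES / 8];   /* as left behind by the write path */

    /* hypothetical I/O helpers, stubbed for the sketch */
    static void read_data_block(uint64_t s, int m, uint8_t *buf)
    { (void)s; (void)m; memset(buf, 0, BLOCK_SIZE); }
    static void write_parity_block(uint64_t s, const uint8_t *p)
    { (void)s; (void)p; }

    void resync_dirty_stripes(void)
    {
        uint8_t buf[BLOCK_SIZE], parity[BLOCK_SIZE];

        for (uint64_t s = 0; s < NSTRIPES; s++) {
            if (!(intent_bitmap[s / 8] & (1u << (s % 8))))
                continue;                    /* stripe was not in flight */

            memset(parity, 0, BLOCK_SIZE);
            for (int m = 0; m < DATA_DISKS; m++) {
                read_data_block(s, m, buf);  /* old or new data, either is fine */
                for (int i = 0; i < BLOCK_SIZE; i++)
                    parity[i] ^= buf[i];
            }
            write_parity_block(s, parity);   /* stripe is consistent again */
            intent_bitmap[s / 8] &= (uint8_t)~(1u << (s % 8));
        }
    }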

If a drive fails and the system crashes at the same time a stripe update is in progress, it is of course entirely possible that unrelated parts of the stripe being updated become unrecoverable for lack of consistent parity information. You can see the attraction of the ZFS full-stripe minimum block size policy.

RAID 5/6 code merged into Btrfs

Posted Feb 4, 2013 19:55 UTC (Mon) by masoncl (subscriber, #47138)

An intent log is similar to what I'll end up using to prevent bad parity after a crash. That's the part I'm still hacking on.

If we're doing a full stripe write that came from a COW operation, we don't need the extra logging because none of the blocks in the stripe are fully allocated until after the IO is complete.
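A rough sketch of that distinction (hypothetical structure and helper names, not the actual btrfs code): only the partial-stripe, read-modify-write case needs its parity protected before the I/O is issued.

    #include <stdbool.h>
    #include <stdint.h>

    struct stripe_write {
        uint64_t stripe;
        bool     full_stripe;     /* every data block in the stripe rewritten  */
        bool     newly_allocated; /* all blocks are fresh COW allocations      */
    };

    /* hypothetical helpers, stubbed for the sketch */
    static void log_stripe_intent(uint64_t s) { (void)s; }
    static void submit_stripe_io(const struct stripe_write *w) { (void)w; }

    void raid56_submit(const struct stripe_write *w)
    {
        /* A full-stripe write of blocks nothing references yet can simply be
         * discarded after a crash, so no logging is needed; a partial update
         * shares the stripe with live data and must be logged first. */
        if (!(w->full_stripe && w->newly_allocated))
            log_stripe_intent(w->stripe);
        submit_stripe_io(w);
    }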

