
Bringing bcachefs to the mainline


Posted May 24, 2022 15:22 UTC (Tue) by Wol (subscriber, #4433)
In reply to: Bringing bcachefs to the mainline by xanni
Parent article: Bringing bcachefs to the mainline

How does that work then? If you've got raid-5, and the checksum reports "this file is corrupt", how do you recover the original file? If all you've got is raid-5 then it's mathematically impossible.

If you've got raid-1, the checksum identifies which copy is corrupt and therefore which copy is correct. If you've got raid-6, then you can solve the equations to get your data back. But raid-5? Sorry, unless that checksum tells you which disk block is corrupt, you're stuffed.

Cheers,
Wol



Bringing bcachefs to the mainline

Posted May 24, 2022 16:02 UTC (Tue) by xanni (subscriber, #361) [Link] (3 responses)

On a three-device array, RAID5 allows you to recover the data with any 2 of the 3 blocks for each block of the file. On a four-device array, RAID6 allows you to use any 2 of the 4 blocks, and is designed to address the issue of a second failure during the recovery from a single failure, since recovery from a full drive failure can take quite a while. If you lose an entire drive with block-level RAID5, you can replace it and recover all data online with zero downtime. If you regularly scrub any level of btrfs RAID, you can repair corrupted blocks with zero downtime.

Bringing bcachefs to the mainline

Posted May 24, 2022 16:15 UTC (Tue) by xanni (subscriber, #361) [Link] (2 responses)

I haven't looked at the BTRFS implementation to confirm, but I believe it simply keeps a checksum for each file block, so it's easy to tell which disk blocks are valid: any combination of two RAID5 or RAID6 blocks that doesn't recover a block with the correct checksum includes a corrupted disk block, in which case try one of the other combinations. If none are valid, you have more corrupt disk blocks than your redundancy level.
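
To make the idea concrete, here is a minimal sketch of "try the combinations until the checksum matches" for a single-parity stripe. It is not btrfs code: the XOR parity layout, the function names, and the use of zlib.crc32 as a stand-in for the filesystem's checksum are all assumptions made for illustration.

```python
import zlib
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def find_valid_block(index, data, parity, csum):
    """Return a version of data[index] whose checksum matches csum, or None.

    Hypothetical helper: candidate 1 is the block as read from disk,
    candidate 2 is the block rebuilt from parity and the other data blocks.
    """
    others = [blk for i, blk in enumerate(data) if i != index]
    candidates = [data[index], xor_blocks(parity, *others)]
    for candidate in candidates:
        if zlib.crc32(candidate) == csum:
            return candidate
    return None  # more corruption than the redundancy can absorb


# Two data blocks plus one parity block, as on a three-device RAID5.
d0, d1 = b"hello wo", b"rld!!!!!"
parity = xor_blocks(d0, d1)
csum0 = zlib.crc32(d0)

# Corrupt d0 on "disk". Parity alone only says that something is wrong;
# the per-block checksum is what says which candidate is right.
damaged = [b"hellX wo", d1]
parity_mismatch = xor_blocks(*damaged) != parity
recovered = find_valid_block(0, damaged, parity, csum0)
```

With parity only, the mismatch could equally be blamed on d0, d1, or the parity block; the stored checksum is what breaks the tie.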

Bringing bcachefs to the mainline

Posted May 24, 2022 20:28 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Yup, if it keeps a check-sum per DISK block, fine. But if the check-sum is *file*-based, how does it know which *disk* is corrupt?

Cheers,
Wol

Bringing bcachefs to the mainline

Posted May 24, 2022 21:18 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

It doesn't. For RAID1* mirroring btrfs simply reads each copy until the csum matches. For parity RAID[56], it assumes the blocks that have bad csums are bad, and reconstructs them in the normal parity raid way by reading the other blocks from the stripe and recomputing the bad blocks. If that doesn't provide a block with the right csum, or there are additional csum errors when reading other data blocks, then the block is gone and read returns EIO.
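
The read paths described above can be sketched roughly as follows. This is an illustration only, not the btrfs implementation: it assumes single XOR parity, assumes every data block in the stripe has a csum, uses zlib.crc32 as a stand-in checksum, and the function names are made up.

```python
import errno
import zlib


def read_mirrored(copies, csum):
    """Mirroring: read each copy in turn until one matches the csum."""
    for copy in copies:
        if zlib.crc32(copy) == csum:
            return copy
    raise OSError(errno.EIO, "no copy matches the csum")


def read_parity(index, data, parity, csums):
    """Single-parity stripe: trust the csum, rebuild the bad block from the rest."""
    if zlib.crc32(data[index]) == csums[index]:
        return data[index]
    rebuilt = bytearray(parity)
    for i, block in enumerate(data):
        if i == index:
            continue
        # A second csum failure in the stripe exceeds single-parity redundancy.
        if zlib.crc32(block) != csums[i]:
            raise OSError(errno.EIO, "additional csum error in stripe")
        for pos, byte in enumerate(block):
            rebuilt[pos] ^= byte
    if zlib.crc32(rebuilt) != csums[index]:
        raise OSError(errno.EIO, "rebuilt block fails csum")
    return bytes(rebuilt)
```

Note that neither function ever asks "which disk is corrupt?"; the csum decides whether a block is usable, and the redundancy is only used to produce another candidate.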

Strictly speaking, the csum is on the extent, not the file, which only matters when things like snapshots and dedupe make lots of files reference the same physical disk blocks, or compression transforms the data before storing it on disk. There's a single authoritative csum that covers all replicas of that block, whether they are verbatim copies or computed from other blocks. That csum is itself stored in a tree with csums on the pages.

There are no csums on the parity blocks, so btrfs's on-disk format cannot correctly identify the corrupted disk in RAID[56] if the parity block is corrupted and some of the data blocks in the stripe have no csums (either free space or nodatasum files). It's possible to determine that parity doesn't match the data and the populated data blocks are correct, but not whether the corrupted device is the one holding the parity block or one of the devices holding the unpopulated data blocks.
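
A small sketch of that ambiguity, under the same illustrative assumptions as before (single XOR parity, zlib.crc32 as a stand-in csum, hypothetical names, not btrfs code). A csum of None marks a data block with no csum, such as free space or a nodatasum file.

```python
import zlib


def diagnose(data, parity, csums):
    """Classify a stripe using only what the on-disk information allows."""
    for block, csum in zip(data, csums):
        if csum is not None and zlib.crc32(block) != csum:
            return "data"  # a csummed data block is provably bad
    computed = bytes(len(parity))  # all-zero block of the same length
    for block in data:
        computed = bytes(a ^ b for a, b in zip(computed, block))
    if computed == parity:
        return "consistent"
    if any(csum is None for csum in csums):
        return "ambiguous"  # parity or an unsummed block, cannot tell which
    return "parity"  # every data block verified, so parity must be bad


d0, d1 = b"abcd", b"wxyz"
parity = bytes(a ^ b for a, b in zip(d0, d1))
csums = [zlib.crc32(d0), None]  # d1 has no csum

bad_parity = diagnose([d0, d1], b"\x00" * 4, csums)
bad_unsummed = diagnose([d0, b"wxyZ"], parity, csums)
```

Both scenarios produce the same diagnosis, which is the point: the populated block is known good and the parity is known not to match, but nothing on disk says which of the two remaining devices to blame.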

There are some fairly significant missing pieces in the btrfs RAID[56] implementation: scrub neither detects faults in nor corrects parity blocks, and neither do RMW stripe updates (which are sort of a bug in and of themselves), and half a dozen other bugs that pop up in degraded mode.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds