LWN: Comments on "Bringing bcachefs to the mainline"
https://lwn.net/Articles/895266/
This is a special feed containing comments posted to the individual LWN article titled "Bringing bcachefs to the mainline".

Bringing bcachefs to the mainline
https://lwn.net/Articles/952013/
Posted by kreijack on Mon, 20 Nov 2023 18:15:32 +0000

> > Overstreet said that bcachefs has the same basic problem that Btrfs does with regard to inode numbers, subvolumes, and NFS

> It's not just NFS. I couldn't wait for bcachefs to hit mainline, but now I've realized that an _unprivileged_ user can do this:

Both Btrfs and bcachefs (but also a basic LVM/dm snapshot) share the same problem: a snapshot necessarily contains files with the same inode numbers as the originals.

However, Btrfs creates a different fsid for each subvolume (see statfs(2), f_fsid).

[code]
ghigo@venice:/var/btrfs/@test-subvol$ btrfs sub cre vol1
Create subvolume './vol1'
ghigo@venice:/var/btrfs/@test-subvol$ mkdir vol1/dir1
ghigo@venice:/var/btrfs/@test-subvol$ btrfs sub snap vol1 vol1/snp1
Create a snapshot of 'vol1' in 'vol1/snp1'
ghigo@venice:/var/btrfs/@test-subvol$ ls -li vol1/
total 0
257 drwxr-xr-x 1 ghigo ghigo 0 2023-11-20 18:59 dir1
256 drwxr-xr-x 1 ghigo ghigo 8 2023-11-20 18:59 snp1
ghigo@venice:/var/btrfs/@test-subvol$ find .
.
./vol1
./vol1/dir1
./vol1/snp1
./vol1/snp1/dir1

ghigo@venice:/var/btrfs/@test-subvol$ stat -f vol1/.
File: "vol1/."
ID: 727f02496c886f2e Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 26214400 Free: 4149320 Available: 3830236
Inodes: Total: 0 Free: 0
ghigo@venice:/var/btrfs/@test-subvol$ stat -f vol1/dir1/
File: "vol1/dir1/"
ID: 727f02496c886f2e Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 26214400 Free: 4149320 Available: 3830236
Inodes: Total: 0 Free: 0
ghigo@venice:/var/btrfs/@test-subvol$ stat -f vol1/snp1/dir1/
File: "vol1/snp1/dir1/"
ID: 727f02496c886f21 Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 26214400 Free: 4149320 Available: 3830236
Inodes: Total: 0 Free: 0
[/code]
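For illustration: a backup or indexing tool can use this per-subvolume fsid to notice when it has crossed into a snapshot, even though inode numbers repeat. A minimal sketch with GNU coreutils stat, reusing the paths from the example above:

[code]
#!/bin/sh
# Compare the filesystem ID (statfs f_fsid) reported for a directory inside
# the subvolume and for the same directory inside the snapshot; on btrfs
# the ID changes at the subvolume boundary even though inode numbers repeat.
orig_id=$(stat -f -c %i vol1/dir1)
snap_id=$(stat -f -c %i vol1/snp1/dir1)

if [ "$orig_id" != "$snap_id" ]; then
    echo "crossed a subvolume boundary: $orig_id vs $snap_id"
fi
[/code]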
Bringing bcachefs to the mainline
https://lwn.net/Articles/951935/
Posted by donald.buczek on Mon, 20 Nov 2023 14:34:59 +0000

> Overstreet said that bcachefs has the same basic problem that Btrfs does with regard to inode numbers, subvolumes, and NFS

It's not just NFS. I couldn't wait for bcachefs to hit mainline, but now I've realized that an _unprivileged_ user can do this:

[code]
buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
buczek@dose:/scratch/local3$ mkdir vol1/dir1
buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
buczek@dose:/scratch/local3$ ls -li vol1/
total 0
1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
buczek@dose:/scratch/local3$ ls -li vol1/snp1/
total 0
1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
buczek@dose:/scratch/local3$ find .
.
./vol1
find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
./vol1/dir1
buczek@dose:/scratch/local3$ ls -lR
.:
total 0
drwxrwxr-x 3 buczek buczek 0 Nov 20 15:03 vol1

./vol1:
total 0
drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1

./vol1/dir1:
total 0
ls: ./vol1/snp1: not listing already-listed directory
buczek@dose:/scratch/local3$
[/code]

Multiple files with the same inode number on the same filesystem would break too many tools, backup for example.

Bringing bcachefs to the mainline
https://lwn.net/Articles/899222/
Posted by kena on Tue, 28 Jun 2022 13:12:40 +0000

I'll admit, I don't come to LWN for humor, but this made me literally laugh out loud. I mean, "fsck -y /dev/foo" -- what could possibly go wrong?

bcachefs needs scrub
https://lwn.net/Articles/896533/
Posted by LinAdmin on Sat, 28 May 2022 13:39:37 +0000

It looks like the creator of bcachefs does not care to implement scrubbing.
I got no answer when I offered to help with testing once this important feature is ready.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896156/
Posted by zblaxell on Tue, 24 May 2022 21:18:07 +0000

It doesn't. For RAID1* mirroring, btrfs simply reads each copy until the csum matches. For parity RAID[56], it assumes the blocks that have bad csums are bad, and reconstructs them in the normal parity-RAID way by reading the other blocks from the stripe and recomputing the bad blocks. If that doesn't provide a block with the right csum, or there are additional csum errors when reading other data blocks, then the block is gone and read returns EIO.

Strictly speaking, the csum is on the extent, not the file, which only matters when things like snapshots and dedupe make lots of files reference the same physical disk blocks, or compression transforms the data before storing it on disk. There's a single authoritative csum that covers all replicas of that block, whether they are verbatim copies or computed from other blocks. That csum is itself stored in a tree with csums on the pages.

There are no csums on the parity blocks, so btrfs's on-disk format cannot correctly identify the corrupted disk in RAID[56] if the parity block is corrupted and some of the data blocks in the stripe have no csums (either free space or nodatasum files). It's possible to determine that parity doesn't match the data and that the populated data blocks are correct, but not whether the corrupted device is the one holding the parity block or one of the devices holding the unpopulated data blocks.

There are some fairly significant missing pieces in the btrfs RAID[56] implementation: scrub neither detects faults in nor corrects parity blocks, and neither do RMW stripe updates (which are sort of a bug in and of themselves), plus half a dozen other bugs that pop up in degraded mode.
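For illustration: the csum-driven repair zblaxell describes is what a periodic scrub exercises, and the per-device error counters show what it found. A minimal sketch of the usual commands; the mount point is hypothetical:

[code]
# Run a scrub in the foreground (-B) so errors are reported when it finishes;
# bad copies are rewritten from a replica whose checksum matches
btrfs scrub start -B /mnt/data

# Show a summary of the last/current scrub
btrfs scrub status /mnt/data

# Per-device counters of read, write and checksum errors seen so far
btrfs device stats /mnt/data
[/code]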
Bringing bcachefs to the mainline
https://lwn.net/Articles/896155/
Posted by Wol on Tue, 24 May 2022 20:28:44 +0000

Yup, if it keeps a checksum per DISK block, fine. But if the checksum is *file*-based, how does it know which *disk* is corrupt?

Cheers,
Wol

Bringing bcachefs to the mainline
https://lwn.net/Articles/896148/
Posted by atnot on Tue, 24 May 2022 20:05:05 +0000

> Although at some point, the network storage devices, whether they are sharing out a block or blob service, need to run on something and manage the storage, and who is writing that code?

Afaict, there are two reasons storage folks generally skip the kernel. The first is that the UNIX filesystem API semantics are a poor fit for what they are doing; the second is that the code isn't capable of running in a distributed manner.

So for blob storage it's generally going to be almost entirely in user space, with no disk-level redundancy at all. See e.g. Ceph, Minio, Backblaze.

EMC/NetApp/vSAN all have, to my knowledge, their own proprietary disk layouts. VMware has its own kernel; not sure about the others. The block devices they present are all also redundant across multiple machines, so dm-raid alone wouldn't quite cut it there. You can use Ceph for block storage too, but that also skips the kernel.

So in general, this is why I say I find it hard to see a place for filesystem-level parity RAID in the near future. It basically amounts to a layering violation in today's virtualized infrastructure. But who knows, things might change again.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896147/
Posted by xanni on Tue, 24 May 2022 19:05:03 +0000

Many years ago I worked for an ISP that had a hardware RAID controller fail with a firmware bug that caused it to write bad data to all copies on all redundant storage devices... in both data centres, in Adelaide and Sydney. We had an engineer from the vendor in the US on a flight to Australia the same day, and had to spend several days restoring all our customers' data from tapes.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896146/
Posted by raven667 on Tue, 24 May 2022 18:53:17 +0000

Although at some point, the network storage devices, whether they are sharing out a block or blob service, need to run on something and manage the storage, and who is writing that code? Even on a hardware RAID controller, is the actual RAID card itself just an embedded Linux system? It's turtles all the way down: do all the vendors of this kind of hardware write their own proprietary in-house RAID and filesystems, or do some use the built-in Linux support and innovate in the higher-layer management, by actually using those building blocks to their fullest potential?
Bringing bcachefs to the mainline
https://lwn.net/Articles/896138/
Posted by xanni on Tue, 24 May 2022 16:15:05 +0000

I haven't looked at the BTRFS implementation to confirm, but I believe it simply keeps a checksum for each file block, so it's easy to tell which disk blocks are valid: any combination of two RAID5 or RAID6 blocks that doesn't recover a block with the correct checksum includes a corrupted disk block, in which case you try one of the other combinations. If none are valid, you have more corrupt disk blocks than your redundancy level.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896133/
Posted by xanni on Tue, 24 May 2022 16:02:54 +0000

RAID5 (with three drives) allows you to recover the data with any 2 of the 3 blocks for each block of the file. RAID6 allows you to use any 2 of the 4 blocks, and is designed to address the issue of a second failure during the recovery from a single failure, since recovery from a full drive failure can take quite a while. If you lose an entire drive with block-level RAID5, you can replace it and recover all data online with zero downtime. If you regularly scrub any level of btrfs RAID, you can repair corrupted blocks with zero downtime.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896127/
Posted by Wol on Tue, 24 May 2022 15:22:40 +0000

How does that work then? If you've got raid-5, and the checksum reports "this file is corrupt", how do you recover the original file? If all you've got is raid-5 then it's mathematically impossible.

If you've got raid-1, the checksum identifies which copy is corrupt and therefore which copy is correct. If you've got raid-6, then you can solve the equations to get your data back. But raid-5? Sorry, unless that checksum tells you which disk block is corrupt, you're stuffed.

Cheers,
Wol

Bringing bcachefs to the mainline
https://lwn.net/Articles/896082/
Posted by xanni on Tue, 24 May 2022 12:49:00 +0000

> Just a shame that second bit of info doesn't let you recover the file ...

But it does. If you have any level of BTRFS redundancy, you can run "btrfs scrub" to replace any data whose checksum doesn't match using one of the other copies where it does match. I like to run it monthly. That's one of the big advantages of BTRFS.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896075/
Posted by Wol on Tue, 24 May 2022 11:22:33 +0000

> Filesystem based RAID is just … better than block based. Even btrfs raid5/6 should be less prone to data corruption than any of the block based solutions, simply because it has the ability to actually tell which possible version of the data is correct. With block based raid (5 esp) if you get some silent data corruption the raid controller/OS will just pick an option essentially at random, so it only helps you if the drive either tells you the read went bad or an entire drive fails.

Which is why I run dm-integrity underneath my raid.

That is the WHOLE POINT of raid - it's primarily to protect against disk failure. Raid 5 contains ONE additional data point, allowing you to recover from ONE unknown, e.g. "the disk failed, what were the contents?". It's useless if you have TWO unknowns - "oh shit! One of my disks is corrupt - which disk, and what were the original contents?" That's why you need raid 6 - you have TWO additional data points, allowing you to recover from those two unknowns. The problem, of course, is that if your data is corrupt it's expensive to check on every access ...

And that is also why I run dm-integrity - it catches the "which disk is corrupt?", leaving my raid-5 to deal with "and what were the original contents?" I've decided that, on a personal level, the time hit from the integrity check is okay.

(Oh, and just like ordinary raid, btrfs will be no help whatsoever if your disk is corrupt - it'll just tell you you've lost your file - which admittedly is a bit more than md-raid-5 does, but then btrfs stores that second bit of info, a file checksum. Just a shame that second bit of info doesn't let you recover the file ...)

Cheers,
Wol
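For illustration: a setup along the lines Wol describes puts dm-integrity under each RAID member, so that a silently corrupted sector comes back as a read error and md can reconstruct it from parity. A rough sketch using integritysetup (from cryptsetup) and mdadm; the device names are placeholders and the commands wipe them:

[code]
# Give each member device its own per-sector checksums (destroys existing data)
for dev in /dev/sdb /dev/sdc /dev/sdd; do
    integritysetup format "$dev"
    integritysetup open "$dev" "int-${dev##*/}"
done

# Build RAID5 on top of the integrity mappings; a checksum mismatch on a member
# now shows up as an I/O error there, and md repairs it from parity
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
    /dev/mapper/int-sdb /dev/mapper/int-sdc /dev/mapper/int-sdd
[/code]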
Bringing bcachefs to the mainline
https://lwn.net/Articles/896066/
Posted by atnot on Tue, 24 May 2022 07:46:51 +0000

> Also, btrfs' RAID-[56] has spent 10+ years getting to production quality, and still is at "should not be used in production, only for evaluation or testing"

I don't think this is accurate. My perception is that the RAID56 implementation has been more or less abandoned in its current unfinished state. This is not that surprising to me because, in general, OS-level parity RAID is kind of dead, at least amongst the people who could afford to put significant money behind developing it.

In a modern datacenter you're basically just going to have three types of storage: local scratchpad SSDs, network block devices, and blob storage services. The first is usually RAID10 for performance, the second solves redundancy at a lower level, and the third solves redundancy at a higher level. This puts RAID56 in an awkward spot where it's useful for many home users, and still decently well supported, but nobody else is really there to care about it anymore.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896059/
Posted by Sesse on Mon, 23 May 2022 22:09:49 +0000

If you have RAID-6 and a spurious bit flip (which generally needs to happen before it's written to disk, as ECC protects you well afterwards), you can tell which disk is bad.

Also, btrfs' RAID-[56] has spent 10+ years getting to production quality, and is still at "should not be used in production, only for evaluation or testing" (https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid56-status-and-recommended-practices, linked from the btrfs wiki at kernel.org), so if nothing else, it's amazingly hard to get right.

Bringing bcachefs to the mainline
https://lwn.net/Articles/896054/
Posted by bartoc on Mon, 23 May 2022 21:25:12 +0000

Filesystem based RAID is just … better than block based. Even btrfs raid5/6 should be less prone to data corruption than any of the block based solutions, simply because it has the ability to actually tell which possible version of the data is correct. With block based raid (5 esp) if you get some silent data corruption the raid controller/OS will just pick an option essentially at random, so it only helps you if the drive either tells you the read went bad or an entire drive fails.

Block based RAID can be useful if you have real raid controllers and a big storage array though, as that will reduce bandwidth usage.

bcachefs's approach to raid sounds extremely appealing, and it's not something block based raid can really do.
Bringing bcachefs to the mainline
https://lwn.net/Articles/895694/
Posted by zblaxell on Wed, 18 May 2022 19:35:31 +0000

ext4 is quantitatively less robust than ext3 in this scenario because ext4's metadata is more efficient, so you lose more metadata per byte of dropped write.

fsck can only delete stuff until you can mount the filesystem read-write again. Lose a few of the wrong blocks on ext4, and the big and interesting files that aren't already in your backups end up mostly deleted.

Writeback caches can get pretty big these days. Losing a few sectors can destroy the most interesting data, but lose a few hundred million sectors and you might as well go directly to mkfs + restore backups, because it will take less time than verifying everything by hand, or even using rsync with -c and --del options from your backups.

It's possible to set up multi-device arrays with writeback caches, but you have to be very careful to avoid having faults in the cache impact multiple fault-isolation domains in the backing storage. The simplest form of this is to build multi-drive arrays out of pairs of SSD and HDD, and treat an SSD failure as if the paired HDD failed at the same time. Another way to do it is to have redundant cache SSDs so that faults in the cache are isolated from the backing storage, but some faults (e.g. undetected SSD corruption) can't be easily isolated this way.

Bringing bcachefs to the mainline
https://lwn.net/Articles/895678/
Posted by developer122 on Wed, 18 May 2022 16:18:09 +0000

fsck is nice when it works, but I think everyone has Stockholm syndrome from over 45 years of using it.

Bringing bcachefs to the mainline
https://lwn.net/Articles/895675/
Posted by xanni on Wed, 18 May 2022 15:32:30 +0000

If the cache device fails, you can still fsck and mount the underlying device without the cache. You will only have lost whatever was pending writeback. Depending on the underlying filesystem, that may or may not be a major issue. ext4 is pretty robust.

Bringing bcachefs to the mainline
https://lwn.net/Articles/895671/
Posted by cmurf on Wed, 18 May 2022 15:26:29 +0000

There's a reason why writethrough is the default: it's safe. Writeback is only as safe as the reliability of the cache device. If it fails while using writeback mode, you lose the entire fs, and that's because it's likely that critical metadata writes only made it to the cache device, not the backing device. I think the reality is that, if your workload requires significant random write performance, you need to pay for big enough SSDs to accommodate the workload, rather than expecting you can get SSD random write performance all the way to the backing device. Where this really bites people is when they use a single cache device for multiple backing devices, e.g. in a RAID configuration. Lose the cache device while in writeback mode, and the entire array is toast.
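For illustration: with bcache, the trade-off cmurf describes is switched per backing device through sysfs. A minimal sketch, assuming the device is registered as bcache0:

[code]
# Show the available cache modes; the active one is shown in brackets
cat /sys/block/bcache0/bcache/cache_mode

# Switch to writeback for better random-write performance, accepting that
# losing the cache device before dirty data is flushed can lose the filesystem
echo writeback > /sys/block/bcache0/bcache/cache_mode

# Switch back to the safe default
echo writethrough > /sys/block/bcache0/bcache/cache_mode
[/code]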
Bringing bcachefs to the mainline
https://lwn.net/Articles/895672/
Posted by developer122 on Wed, 18 May 2022 15:20:59 +0000

It sounds like it would be better to solve those warts before upstreaming, rather than further cement them.

Bringing bcachefs to the mainline
https://lwn.net/Articles/895620/
Posted by xanni on Wed, 18 May 2022 12:39:48 +0000

This discussion may help: https://bbs.archlinux.org/viewtopic.php?id=250525

Bringing bcachefs to the mainline
https://lwn.net/Articles/895619/
Posted by Sesse on Wed, 18 May 2022 12:20:21 +0000

Well, 130 kB random reads all over the file? Should be well below the 4MB default limit.

Bringing bcachefs to the mainline
https://lwn.net/Articles/895618/
Posted by xanni on Wed, 18 May 2022 12:06:05 +0000

More specifically, take a look at https://www.kernel.org/doc/html/latest/admin-guide/bcache.html#troubleshooting-performance and consider decreasing /sys/block/bcache0/bcache/sequential_cutoff to see if that helps.

Bringing bcachefs to the mainline
https://lwn.net/Articles/895616/
Posted by xanni on Wed, 18 May 2022 11:59:04 +0000

Note that the default settings for bcache are to only cache random reads and writes, and to pass sequential reads and writes through without caching. You may want to adjust those default settings for your use case!

Bringing bcachefs to the mainline
https://lwn.net/Articles/895615/
Posted by Sesse on Wed, 18 May 2022 11:41:46 +0000

As a data point on the opposite side, I've tried using bcache with a 240GB SSD against a 2x3TB disk array to accelerate a seek-heavy video load, and it's been completely useless. The SSD is hardly ever used, even though the actual working set should be in the tens of gigabytes at any time.

While the article is interesting, I'm not entirely sure why I should be excited for bcachefs; how does it fare in benchmarks, for one? (I don't like to mix up my RAID/LVM and my filesystems in general, so I don't care about the ZFS/btrfs-like features.)

bcachefs and scrub
https://lwn.net/Articles/895612/
Posted by fratti on Wed, 18 May 2022 10:10:51 +0000

In [1], it's described that bcachefs does not yet do scrubbing. Is that document outdated and scrubbing is already implemented, or is this feature on the horizon? If the latter, will it arrive before mainline inclusion or after? I may be misunderstanding the importance of data scrubbing in the context of long-term data safety though, so if this isn't critical in ensuring an array keeps functioning for a decade or more without data loss, feel free to correct me.

I'm excited about the possibilities bcachefs opens up, but not quite adventurous enough to try to use an out-of-tree filesystem. The idea of just giving the system a bunch of block devices and saying "here's how durable I consider them, here's how many replicas I want" and having it figure it out on its own seems very appealing, as does being able to set replication and compression at per-file granularity.

[1]: https://bcachefs.org/bcachefs-principles-of-operation.pdf
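For illustration: the multi-device policy fratti describes is mostly expressed at format time. A rough sketch of what that can look like, based on the principles-of-operation document; the device names are placeholders, and the exact option spellings may differ between bcachefs-tools versions:

[code]
# One pool from mixed devices: keep two replicas of everything, write and
# cache on the SSD, let data settle onto the HDDs in the background
bcachefs format \
    --replicas=2 \
    --label=ssd.ssd1 /dev/nvme0n1 \
    --label=hdd.hdd1 /dev/sda \
    --label=hdd.hdd2 /dev/sdb \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd

# All member devices are named together at mount time
mount -t bcachefs /dev/nvme0n1:/dev/sda:/dev/sdb /mnt/pool
[/code]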
Bringing bcachefs to the mainline
https://lwn.net/Articles/895601/
Posted by xanni on Wed, 18 May 2022 07:38:12 +0000

I'm very excited by bcachefs, and while waiting for it I decided to start using bcache. I've been using it to back a (large, cheap) shingled magnetic recording hard drive with an SSD cache for my Steam games library, to improve performance especially for the larger games (Horizon Zero Dawn is nearly 80GB). It's been working great so far.
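For anyone wanting to try a setup like this, the bcache side is only a few commands. A minimal sketch with placeholder device names, including the sequential_cutoff tweak mentioned earlier in the thread, since a game library is mostly large sequential reads that the defaults would bypass:

[code]
# Format the backing device (the SMR drive) and the cache device (an SSD
# partition) in one go; this destroys existing data on both
make-bcache -B /dev/sdc -C /dev/nvme0n1p3

# By default, sequential I/O bypasses the cache; lowering the cutoff
# (0 disables the bypass entirely) lets large game files be cached too
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

# Put a filesystem on the composite device and mount it
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /mnt/games
[/code]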