
Bringing bcachefs to the mainline

By Jake Edge
May 17, 2022

LSFMM

Bcachefs is a longstanding out-of-tree filesystem that grew out of the bcache caching layer that has been in the kernel for nearly ten years. Based on a session led by Kent Overstreet at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), though, it would seem that bcachefs is likely to be heading upstream soon. He intends to start the process toward mainline inclusion over the next six months or so.

Overstreet is often asked what the target use cases for bcachefs are; "the answer is everything". His longstanding goal is to be "reliable and robust enough to be the XFS replacement". It has been a few years since he last gave an update at LSFMM, so he began by listing the features and changes that have been added.

[Kent Overstreet]

Support for reflinks, effectively copy-on-write (COW) links for files, has been added to bcachefs. After that support was added, Dave Chinner asked him about snapshots; he had been avoiding implementing snapshots but some reworking that he did on how bcachefs handles extents made it easier to do so. He added snapshot support and there are no scalability issues; he has done up to a million snapshots on test virtual machines without any problems. Snapshots in bcachefs have the same external interface as Btrfs (i.e. subvolumes), though the internal implementation is different.
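As a rough illustration (paths and file names here are hypothetical; the subvolume commands follow the bcachefs-tools syntax shown in the comments below), a reflink copy and a snapshot look something like this:

# copy-on-write clone of a file via reflink
cp --reflink=always big-image.raw big-image-clone.raw

# create a subvolume, then snapshot it
bcachefs subvolume create vol1
bcachefs subvolume snapshot vol1/snp1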

More recently, the bcachefs allocator has been rewritten. Bcache, which is the ancestor of bcachefs, had some "algorithmic scalability issues" because it was created in the days when SSDs were around 100GB in size. But he has bcachefs users on 50TB arrays; things that work fine for the smaller sizes do not scale well, he said. So he has been reworking various pieces of bcachefs to address those problems.

There are now persistent data structures for holding data that used to require the filesystem to periodically "walk the world" by scanning the filesystem structure. Backpointers have been added so that data blocks point to the file that contains them, which is important to accelerate the "copygc" operation. That operation does a form of garbage collection, but it (formerly) required scanning through the filesystem structure. He said that it is also important for supporting zoned storage devices, which is still a little ways off but is coming.

Merging

Overstreet wants to be able to propose bcachefs for upstream inclusion "but not go insane and still be able to write code when that happens". The to-do list is always expanding, but the "really big pain points" have mostly been dealt with at this point. There is good reason to believe that upstreaming is close, he said.

Amir Goldstein asked about where and how bcachefs is being used in production now. Overstreet said that he knows it is being used, but he does not know how many sites are using it. He generally finds out when someone asks him to look at a problem. Bcachefs is mostly used by video production companies that need to deal with multiple 4K streams for editing multi-camera setups, he said; they have been using it for several years now. Bcachefs was chosen because it had better performance than Btrfs for those workloads and, at the time, was the only filesystem with certain features that were needed.

Josef Bacik said that he looked at the to-do list and noted that it was mostly bcachefs-internal items. He said that the goal when bcachefs was discussed at LSFMM in 2018 was to get the interfaces to the rest of Linux into good shape, since that would be the focus of any mailing-list review. None of the other filesystem developers know much about the internals of bcachefs, so they would not be able to review that code directly. He wondered what was left to do before the upstream process could begin.

Overstreet said that the ioctl() interface was one of the things discussed, but it has not changed in a while. He is more concerned about ensuring that the on-disk format changes are settling down. He had been pushing out those kinds of changes fairly frequently, and the backpointer support requires another, but after that, he does not see any other changes of that sort on the horizon.

Bacik asked how much more work Overstreet wanted to do internally before he would be ready to start talking about merging bcachefs and what was holding it back. Bacik also wanted to know what Overstreet needed from other filesystem developers as part of that process. The biggest thing holding him back, Overstreet said, is that he wants to be able to respond to all of the bug reports that will arise when there are lots more users of bcachefs. So he wants to make sure that the bigger development projects get taken care of before he gets to that point.

He said that it is far faster for him to fix a bug when he finds it himself, rather than having to figure out a way to reproduce a problem that someone else has found. So he is hoping to get rid of as many bugs as he can before merging. That process has been improved greatly by the debugging support he added to bcachefs over the last few years; over the last six months, he said, that effort "has been paying off in a big way". For example, the allocator rewrite went smoothly because of those tools.

Much of that revolves around the printbuf mechanism that he recently proposed for the kernel. That work came out of his interest in getting better logging information for bcachefs. There are "pretty printers" for various bcachefs data structures and their output can be logged. He is now able to debug using grep, rather than a debugger, for many of the kinds of problems he encounters. He said that he would be talking more about that infrastructure in a memory-management session the next day.

Wart sharing

Chris Mason said that he had a question along the lines of those from Bacik, but "a lot more selfish". Btrfs has a lot of warts in how it interfaces with the virtual filesystem (VFS) layer, in part because its inode numbers are effectively huge, but also due to various ioctl() commands for features like reflink. He is looking forward to some other filesystem coming into Linux that is "sharing my warts"; that may lead to finding better ways to solve some of those problems, he said.

Overstreet said that bcachefs has the same basic problem that Btrfs does with regard to inode numbers, subvolumes, and NFS; he has not spent a lot of time thinking about it but would like to use the Btrfs solution once one is found. Mason said that every eight months or so, someone comes along to say that the problem is stupid and easy to fix, then the Btrfs developers have to show once again that the problem is stupid, but hard to fix. Bacik agreed that a second filesystem with some of the same kinds of problems will help; it is difficult to make certain kinds of changes because there "seems to be an allergic reaction" to interface changes that are only aimed at Btrfs problems.

Ted Ts'o had two suggestions for Overstreet: first, before adding a whole lot of new users, some kind of bcachefs repair facility is probably necessary. Overstreet said that part was all taken care of. Ts'o also said that having an automated test runner that exercised various different bcachefs configuration options would be useful; he has a test harness, and Luis Chamberlain has a different one, either of which would probably serve the needs of bcachefs. Bacik noted that there is a slot later in LSFMM to discuss some of that.
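For what it's worth, a minimal sketch of such a run using the fstests suite (device paths are hypothetical, and the FSTYP value assumes the suite knows about the filesystem under test; either harness mentioned above wraps something similar):

git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
cd xfstests-dev && make

cat > local.config <<EOF
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch
export FSTYP=bcachefs
EOF

./check -g quick    # run the "quick" group of tests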

Overstreet returned to the subject of debugging tools, as it is "the thing that excites me the most". The pretty-printer code is shared by both kernel and user space, which makes it easier to find problems, he said. grep is his tool of choice for finding problems, even for difficult things like deadlocks. He demonstrated some of the kinds of information he could extract using those facilities.

Mason suggested looking into integrating this work into the drgn kernel debugger, which was the subject of a session at LSFMM 2019. It is a Python-based, live and post-crash kernel debugger that is used extensively at Facebook; every investigation of a problem in production starts by poking around using the tool. Bacik agreed, noting that drgn allows writing programs that can step through data structures in running systems to track down a wide variety of filesystem (and other) problems. Overstreet said that he would be looking into it.

Overstreet pointed to the bcachefs: Principles of Operation document as a starting point for user documentation. It is up to 25 pages at this point, organized by feature, and will be getting fleshed out further soon.

While Overstreet's hesitance to push for merging bcachefs is understandable, Bacik said, he and others have some selfish reasons for wanting to see that happen. He said he did not want to rush things, but did Overstreet have a timeline? Overstreet said that he would like to see it happen within the next six months. Based on the recent bug reports, he thinks that is a realistic goal.

Goldstein wondered when the Rust rewrite would be coming. Overstreet said that there is already some user-space Rust code in the repository; as soon as Rust support lands in the kernel, he would like to make use of it. There are "so many little quality-of-life improvements in Rust", such as proper iterators rather than "crazy for-loop macros". Bacik said that many were waiting for that support in the kernel; Overstreet suggested that those who are waiting be a bit noisier to make it clear that there is demand for it. With that, time expired on the session, but it seems we may see bcachefs and Rust racing to see which can land in the kernel first.


Index entries for this article
Kernel: Filesystems/bcachefs
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2022



Bringing bcachefs to the mainline

Posted May 18, 2022 7:38 UTC (Wed) by xanni (subscriber, #361) [Link] (23 responses)

I'm very excited by bcachefs and while waiting for it decided to start using bcache. I've been using it to back a (large, cheap) shingled magnetic recording hard drive with an SSD cache for my Steam games library, to improve performance especially for the larger games (Horizon Zero Dawn is nearly 80GB.) It's been working great so far.

Bringing bcachefs to the mainline

Posted May 18, 2022 11:41 UTC (Wed) by Sesse (subscriber, #53779) [Link] (22 responses)

As a data point on the opposite side, I've tried using bcache with a 240GB SSD against a 2x3TB disk array to accelerate a seek-heavy video load, and it's been completely useless. The SSD is hardly ever used, even though the actual working set should be in the tens of gigabytes at any time.

While the article is interesting, I'm not entirely sure why I should be excited for bcachefs; how does it fare in benchmarks, for one? (I don't like to mix up my RAID/LVM and my filesystems in general, so I don't care about the ZFS/btrfs-like features.)

Bringing bcachefs to the mainline

Posted May 18, 2022 11:59 UTC (Wed) by xanni (subscriber, #361) [Link] (7 responses)

Note that the default settings for bcache are to only cache random reads and writes, and pass sequential reads and writes through without caching. You may want to adjust those default settings for your use case!

Bringing bcachefs to the mainline

Posted May 18, 2022 12:20 UTC (Wed) by Sesse (subscriber, #53779) [Link] (1 responses)

Well, 130 kB random reads all over the file? Should be well below the 4MB default limit.

Bringing bcachefs to the mainline

Posted May 18, 2022 12:39 UTC (Wed) by xanni (subscriber, #361) [Link]

Bringing bcachefs to the mainline

Posted May 18, 2022 15:26 UTC (Wed) by cmurf (subscriber, #112853) [Link] (4 responses)

There's a reason why writethrough is the default. It's safe. Writeback is only as safe as the reliability of the cache device. If it fails while using writeback mode, you lose the entire fs. And that's because it's likely that critical metadata writes only make it to the cache device, not the backing device. I think the reality is, if your workload requires significant random write performance, you need to pay for big enough SSDs to accommodate the workload, rather than expecting you can get SSD random write performance all the way to the backing device. Where this really bites people is when they use a single cache device for multiple backing devices, e.g. in a RAID configuration. Lose the cache device while in writeback mode, and the entire array is toast.
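For reference, a minimal sketch of switching the mode through sysfs (bcache0 is an assumption; the attribute is documented in the kernel's bcache admin guide):

cat /sys/block/bcache0/bcache/cache_mode                    # show the current mode
echo writethrough > /sys/block/bcache0/bcache/cache_mode    # safe default
echo writeback > /sys/block/bcache0/bcache/cache_mode       # faster, but only as safe as the SSD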

Bringing bcachefs to the mainline

Posted May 18, 2022 15:32 UTC (Wed) by xanni (subscriber, #361) [Link] (3 responses)

If the cache device fails, you can still fsck and mount the underlying device without the cache. You will only have lost whatever was pending writeback. Depending on the underlying filesystem, that may or may not be a major issue. ext4 is pretty robust.

Bringing bcachefs to the mainline

Posted May 18, 2022 16:18 UTC (Wed) by developer122 (guest, #152928) [Link] (1 responses)

fsck is nice when it works, but I think everyone has stockholm syndrome from over 45 years of using it.

Bringing bcachefs to the mainline

Posted Jun 28, 2022 13:12 UTC (Tue) by kena (guest, #2735) [Link]

I'll admit, I don't come to LWN for humor, but this made me literally laugh out loud. I mean, "fsck -y /dev/foo" -- what could possibly go wrong?

Bringing bcachefs to the mainline

Posted May 18, 2022 19:35 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

ext4 is quantitatively less robust than ext3 in this scenario because ext4's metadata is more efficient, so you lose more metadata per byte of dropped write.

fsck can only delete stuff until you can mount the filesystem read-write again. Lose a few of the wrong blocks on ext4, and the big and interesting files that aren't already in your backups end up mostly deleted.

Writeback caches can get pretty big these days. Losing a few sectors can destroy the most interesting data, but lose a few hundred million sectors and you might as well go directly to mkfs + restore backups, because it will take less time than verifying everything by hand, or even using rsync with -c and --del options from your backups.

It's possible to set up multi-device arrays with writeback caches, but you have to be very careful to avoid having faults in the cache impact multiple fault isolation domains in the backing storage. The simplest form of this is to build multi-drive arrays out of pairs of SSD and HDD, and treat a SSD failure as if the paired HDD failed at the same time. Another way to do it is to have redundant cache SSDs so that faults in the cache are isolated from the backing storage, but some faults (e.g. undetected SSD corruption) can't be easily isolated this way.
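A rough sketch of the first approach (device names are hypothetical; make-bcache comes from bcache-tools):

# pair each HDD with its own SSD cache
make-bcache -C /dev/nvme0n1p1 -B /dev/sda
make-bcache -C /dev/nvme1n1p1 -B /dev/sdb

# mirror across the resulting bcache devices, so an SSD failure is
# confined to the same isolation domain as its paired HDD
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/bcache0 /dev/bcache1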

Bringing bcachefs to the mainline

Posted May 18, 2022 12:06 UTC (Wed) by xanni (subscriber, #361) [Link]

More specifically, take a look at https://www.kernel.org/doc/html/latest/admin-guide/bcache... and consider decreasing /sys/block/bcache0/bcache/sequential_cutoff to see if that helps.
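A minimal example of checking and lowering that cutoff (bcache0 is an assumption; echoing 0 disables the sequential bypass entirely):

cat /sys/block/bcache0/bcache/sequential_cutoff     # default is 4.0M
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff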

Bringing bcachefs to the mainline

Posted May 23, 2022 21:25 UTC (Mon) by bartoc (guest, #124262) [Link] (12 responses)

Filesystem based RAID is just … better than block based. Even btrfs raid5/6 should be less prone to data corruption than any of the block based solutions, simply because it has the ability to actually tell which possible version of the data is correct. With block based raid (5 esp) if you get some silent data corruption the raid controller/OS will just pick an option essentially at random, so it only helps you if the drive either tells you the read went bad or an entire drive fails.

Block based RAID can be useful if you have real raid controllers and a big storage array though, as that will reduce bandwidth usage.

bcachefs’s approach to raid sounds extremely appealing, and it's not something block based raid can really do.

Bringing bcachefs to the mainline

Posted May 23, 2022 22:09 UTC (Mon) by Sesse (subscriber, #53779) [Link] (4 responses)

If you have RAID-6, and a spurious bit flip (which generally needs to happen before it's written to disk, as ECC protects you well afterwards), you can tell which disk is bad.

Also, btrfs' RAID-[56] has spent 10+ years getting to production quality, and still is at “should not be used in production, only for evaluation or testing” (https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#ra..., linked from the btrfs wiki at kernel.org), so if nothing else, it's amazingly hard to get right.

Bringing bcachefs to the mainline

Posted May 24, 2022 7:46 UTC (Tue) by atnot (subscriber, #124910) [Link] (3 responses)

> Also, btrfs' RAID-[56] has spent 10+ years getting to production quality, and still is at “should not be used in production, only for evaluation or testing”

I don't think this is accurate. My perception is that the RAID56 implementation has been more or less abandoned in its current unfinished state. This is not that surprising to me because, in general, OS-level parity RAID is kind of dead, at least amongst the people who could afford to put significant money behind developing it.

In a modern datacenter you're basically just going to have three types of storage: Local scratchpad SSDs, network block devices and blob storage services. The first is usually RAID10 for performance, the second solves redundancy at a lower level and the third solves redundancy at a higher level. This puts RAID56 in an awkward spot where it's useful for many home users, still decently well supported, but nobody else is really there to care about it anymore.

Bringing bcachefs to the mainline

Posted May 24, 2022 18:53 UTC (Tue) by raven667 (subscriber, #5198) [Link] (2 responses)

Although at some point, the network storage devices, whether they are sharing out a block or blob service, need to run on something and manage the storage, and who is writing that code? Even on a hardware raid controller, is the actual raid card itself just an embedded linux system? It's turtles all the way down, do all the vendors of this kind of hardware write their own proprietary in-house raid and filesystems or do some use the built-in linux support and innovate in the higher layer management, by actually using those building blocks to their fullest potential?

Bringing bcachefs to the mainline

Posted May 24, 2022 19:05 UTC (Tue) by xanni (subscriber, #361) [Link]

Many years ago I worked for an ISP that had a hardware RAID controller fail with a firmware bug that caused it to write bad data to all copies on all redundant storage devices... in both data centres in Adelaide and Sydney. We had an engineer from the vendor in the US on a flight to Australia the same day, and had to spend several days restoring all our customers' data from tapes.

Bringing bcachefs to the mainline

Posted May 24, 2022 20:05 UTC (Tue) by atnot (subscriber, #124910) [Link]

> Although at some point, the network storage devices, whether they are sharing out a block or blob service, need to run on something and manage the storage, and who is writing that code?

Afaict, there's two reasons storage folks generally skip the kernel. The first is that the UNIX filesystem API semantics are a poor fit for what they are doing, the second is that the code isn't capable of running in a distributed manner.

So for blob storage it's generally going to be almost entirely in user space, with no disk-level redundancy at all. See e.g. Ceph, Minio, Backblaze.

EMC/netapp/vSAN all have, to my knowledge, their own proprietary disk layouts. VMWare has their own kernel, not sure about the others. The block devices they present are all also redundant across multiple machines, so dm-raid alone wouldn't quite cut it there. You can use Ceph for block storage too, but that also skips the kernel.

So in general, this is why I say I find it hard to see a place for filesystem-level parity RAID in the near future. It basically amounts to a layering violation in today's virtualized infrastructure. But who knows, things might change again.

Bringing bcachefs to the mainline

Posted May 24, 2022 11:22 UTC (Tue) by Wol (subscriber, #4433) [Link] (6 responses)

> Filesystem based RAID is just … better than block based. Even btrfs raid5/6 should be less prone to data corruption than any of the block based solutions, simply because it has the ability to actually tell which possible version of the data is correct. With block based raid (5 esp) if you get some silent data corruption the raid controller/OS will just pick an option essentially at random, so it only helps you if the drive either tells you the read went bad or an entire drive fails.

Which is why I run dm-integrity underneath my raid.

That is the WHOLE POINT of raid - it's primarily to protect against disk failure. Raid 5 contains ONE additional data point, allowing you to recover from ONE unknown, eg "the disk failed, what were the contents?". It's useless if you have TWO unknowns - "oh shit! One of my disks is corrupt - which disk and what were the original contents?" That's why you need raid 6 - you have TWO additional data points allowing you to recover from those said two unknowns. The problem, of course, is if your data is corrupt it's expensive to check on every access ...

And which is why I run dm-integrity - that catches the "which disk is corrupt?", leaving my raid-5 to deal with "and what were the original contents?" I've decided that, on a personal level, the time hit from the integrity check is okay.
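Concretely, that stacking looks roughly like the following sketch (device names are hypothetical; integritysetup ships with cryptsetup):

# give each member device its own per-sector checksums
integritysetup format /dev/sda1 && integritysetup open /dev/sda1 int-sda1
integritysetup format /dev/sdb1 && integritysetup open /dev/sdb1 int-sdb1
integritysetup format /dev/sdc1 && integritysetup open /dev/sdc1 int-sdc1

# build the RAID-5 array on top of the integrity-protected devices
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/mapper/int-sda1 /dev/mapper/int-sdb1 /dev/mapper/int-sdc1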

(Oh, and just like ordinary raid, btrfs will be no help whatsoever if your disk is corrupt - it'll just tell you you've lost your file - which admittedly is a bit more than md-raid-5, but then btrfs stores that second bit of info, a file checksum. Just a shame that second bit of info doesn't let you recover the file ...)

Cheers,
Wol

Bringing bcachefs to the mainline

Posted May 24, 2022 12:49 UTC (Tue) by xanni (subscriber, #361) [Link] (5 responses)

> Just a shame that second bit of info doesn't let you recover the file ...

But it does. If you have any level of BTRFS redundancy, you can run "btrfs scrub" to replace any data whose checksum doesn't match with one of the other copies where it does match. I like to run it monthly. That's one of the big advantages of BTRFS.
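A minimal example (the mount point is hypothetical):

btrfs scrub start /mnt/data     # kick off a scrub in the background
btrfs scrub status /mnt/data    # check progress and any errors found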

Bringing bcachefs to the mainline

Posted May 24, 2022 15:22 UTC (Tue) by Wol (subscriber, #4433) [Link] (4 responses)

How does that work then? If you've got raid-5, and the checksum reports "this file is corrupt", how do you recover the original file? If all you've got is raid-5 then it's mathematically impossible.

If you've got raid-1, the checksum identifies which copy is corrupt and therefore which copy is correct. If you've got raid-6, then you can solve the equations to get your data back. But raid-5? Sorry, unless that checksum tells you which disk block is corrupt, you're stuffed.

Cheers,
Wol

Bringing bcachefs to the mainline

Posted May 24, 2022 16:02 UTC (Tue) by xanni (subscriber, #361) [Link] (3 responses)

RAID5 allows you to recover the data with any 2 of the 3 blocks for each block of the file. RAID6 allows you to use any 2 of the 4 blocks and is designed to address the issue of a second failure during the recovery from a single failure, since recovery from a full drive failure can take quite a while. If you lose an entire drive with block-level RAID5, you can replace it and recover all data online with zero downtime. If you regularly scrub any level of btrfs RAID, you can repair corrupted blocks with zero downtime.

Bringing bcachefs to the mainline

Posted May 24, 2022 16:15 UTC (Tue) by xanni (subscriber, #361) [Link] (2 responses)

I haven't looked at the BTRFS implementation to confirm, but I believe it simply keeps a checksum for each file block, so it's easy to tell which disk blocks are valid: any combination of two RAID5 or RAID6 blocks that don't recover a block with the correct checksum include a corrupted disk block, in which case try one of the other combinations. If none are valid, you have more corrupt disk blocks than your redundancy level.

Bringing bcachefs to the mainline

Posted May 24, 2022 20:28 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Yup, if it keeps a check-sum per DISK block, fine. But if the check-sum is *file*-based, how does it know which *disk* is corrupt?

Cheers,
Wol

Bringing bcachefs to the mainline

Posted May 24, 2022 21:18 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

It doesn't. For RAID1* mirroring btrfs simply reads each copy until the csum matches. For parity RAID[56], it assumes the blocks that have bad csums are bad, and reconstructs them in the normal parity raid way by reading the other blocks from the stripe and recomputing the bad blocks. If that doesn't provide a block with the right csum, or there are additional csum errors when reading other data blocks, then the block is gone and read returns EIO.

Strictly speaking, the csum is on the extent, not the file, which only matters when things like snapshots and dedupe make lots of files reference the same physical disk blocks, or compression transforms the data before storing it on disk. There's a single authoritative csum that covers all replicas of that block, whether they are verbatim copies or computed from other blocks. That csum is itself stored in a tree with csums on the pages.

There are no csums on the parity blocks, so btrfs's on-disk format cannot correctly identify the corrupted disk in RAID[56] if the parity block is corrupted and some of the data blocks in the stripe have no csums (either free space or nodatasum files). It's possible to determine that parity doesn't match the data and the populated data blocks are correct, but not whether the corrupted device is the one holding the parity block or one of the devices holding the unpopulated data blocks.

There's some fairly significant missing pieces in the btrfs RAID[56] implementation: scrub neither detects faults in nor corrects parity blocks, and neither do RMW stripe updates (which are sort of a bug in and of themselves), and half a dozen other bugs that pop up in degraded mode.

bcachefs and scrub

Posted May 18, 2022 10:10 UTC (Wed) by fratti (guest, #105722) [Link]

In [1], it's described that bcachefs does not yet do scrubbing. Is this document outdated and it's already implemented, or is this feature on the horizon? If the latter, will it arrive before mainline inclusion or after? I may be misunderstanding the importance of data scrubbing in the context of long-term data safety though, so if this isn't critical in ensuring an array keeps functioning for a decade or more without data loss, feel free to correct me.

I'm excited about the possibilities bcachefs opens, but not quite adventurous enough to try and use an out-of-tree filesystem. The idea of just giving the system a bunch of block devices and saying "here's how durable I consider them, here's how many replicas I want" and having it figure it out on its own seems very appealing, as well as being able to set replication and compression on a per-file granularity.

[1]: https://bcachefs.org/bcachefs-principles-of-operation.pdf

Bringing bcachefs to the mainline

Posted May 18, 2022 15:20 UTC (Wed) by developer122 (guest, #152928) [Link] (1 responses)

It sounds like it would be better to solve those warts before upstreaming, rather than cement them further.

bcachefs needs scrub

Posted May 28, 2022 13:39 UTC (Sat) by LinAdmin (guest, #158773) [Link]

It looks like the creator of bcachefs does not care to implement scrubbing.
I got no answer when I offered to help with testing once this important feature is ready.

Bringing bcachefs to the mainline

Posted Nov 20, 2023 14:34 UTC (Mon) by donald.buczek (subscriber, #112892) [Link] (1 responses)

> Overstreet said that bcachefs has the same basic problem that Btrfs does with regard to inode numbers, subvolumes, and NFS

It's not just NFS. I couldn't wait for bcachefs to hit mainline. But now I've realized that an _unprivileged_ user can do this:

buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
buczek@dose:/scratch/local3$ mkdir vol1/dir1
buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
buczek@dose:/scratch/local3$ ls -li vol1/
total 0
1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
buczek@dose:/scratch/local3$ ls -li vol1/snp1/
total 0
1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
buczek@dose:/scratch/local3$ find .
.
./vol1
find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
./vol1/dir1
buczek@dose:/scratch/local3$ ls -lR
.:
total 0
drwxrwxr-x 3 buczek buczek 0 Nov 20 15:03 vol1

./vol1:
total 0
drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1

./vol1/dir1:
total 0
ls: ./vol1/snp1: not listing already-listed directory
buczek@dose:/scratch/local3$ 
Multiple files with the same inode number on the same filesystem would break too many tools, for example backup.

Bringing bcachefs to the mainline

Posted Nov 20, 2023 18:15 UTC (Mon) by kreijack (guest, #43513) [Link]

> > Overstreet said that bcachefs has the same basic problem that Btrfs does with regard to inode numbers, subvolumes, and NFS

> It not just NFS. I couldn't wait for bcachefs to hit mainline. But now I've realized, that an _unprivileged_ user can do this:

Both BTRFS and BCacheFS (but also a basic LVM/dm snapshot) share the same problem of snapshots (necessarily) containing files with the same inode numbers.

However, BTRFS creates a different fsid for each subvolume (see statfs(2), f_fsid).

[code]
ghigo@venice:/var/btrfs/@test-subvol$ btrfs sub cre vol1
Create subvolume './vol1'
ghigo@venice:/var/btrfs/@test-subvol$ mkdir vol1/dir1
ghigo@venice:/var/btrfs/@test-subvol$ btrfs sub snap vol1 vol1/snp1
Create a snapshot of 'vol1' in 'vol1/snp1'
ghigo@venice:/var/btrfs/@test-subvol$ ls -li vol1/
total 0
257 drwxr-xr-x 1 ghigo ghigo 0 2023-11-20 18:59 dir1
256 drwxr-xr-x 1 ghigo ghigo 8 2023-11-20 18:59 snp1
ghigo@venice:/var/btrfs/@test-subvol$ find .
.
./vol1
./vol1/dir1
./vol1/snp1
./vol1/snp1/dir1

ghigo@venice:/var/btrfs/@test-subvol$ stat -f vol1/.
File: "vol1/."
ID: 727f02496c886f2e Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 26214400 Free: 4149320 Available: 3830236
Inodes: Total: 0 Free: 0
ghigo@venice:/var/btrfs/@test-subvol$ stat -f vol1/dir1/
File: "vol1/dir1/"
ID: 727f02496c886f2e Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 26214400 Free: 4149320 Available: 3830236
Inodes: Total: 0 Free: 0
ghigo@venice:/var/btrfs/@test-subvol$ stat -f vol1/snp1/dir1/
File: "vol1/snp1/dir1/"
ID: 727f02496c886f21 Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 26214400 Free: 4149320 Available: 3830236
Inodes: Total: 0 Free: 0

[/code]


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds