the self healing work continues in bcachefs
Posted Mar 29, 2025 5:09 UTC (Sat) by koverstreet (✭ supporter ✭, #4296)
Parent article: The first part of the 6.15 merge window
Posted Mar 30, 2025 22:59 UTC (Sun) by motk (subscriber, #51120)

Posted Mar 31, 2025 3:09 UTC (Mon) by koverstreet (✭ supporter ✭, #4296)
We regularly recover from extreme disaster scenarios today. I've been looking at a metadata dump where it looked like a head just skated across the platter, which created some very... particular alloc info inconsistencies - but that's been the only failure to repair in ~6 months, and I've seen logs of some good ones. So that's largely done.
Once the mount API extension happens, plus better communication between the mount helper and systemd/plymouth (because of course communicating things to the user has been getting more complicated), we'll even be able to tell the user "hey, your SSD crapped itself (X IO errors, toast btree nodes), please wait while we reconstruct btree roots/alloc/what have you, here's a progress bar"
And this stuff is pretty fast, too - the post-6.14 work dealt with backpointers check/repair. Even btree node scan, if we lose btree roots, is fast thanks to a small bitmap in the superblock.
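A minimal sketch of that bitmap idea in C (the names, region size, and layout here are hypothetical, not the actual bcachefs superblock format): the superblock keeps one bit per large device region that has ever held a btree node, so a scan for lost roots only reads the marked regions instead of the whole device.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical illustration: one bit per 1 GB region that has ever
     * held a btree node; 64 bits covers a 64 GB device in this sketch. */
    #define REGION_SHIFT 30 /* 1 GB regions */

    struct sb_btree_bitmap {
            uint64_t bits;
    };

    /* Called whenever a btree node is written at 'offset'. */
    static void mark_btree_node_region(struct sb_btree_bitmap *bm, uint64_t offset)
    {
            bm->bits |= 1ULL << (offset >> REGION_SHIFT);
    }

    /* During btree node scan, regions whose bit was never set are skipped. */
    static bool region_may_hold_btree_nodes(const struct sb_btree_bitmap *bm,
                                            uint64_t offset)
    {
            return bm->bits & (1ULL << (offset >> REGION_SHIFT));
    }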
Further off, post experimental, will be finishing off online fsck - and then we'll be able to recover from slightly absurd levels of damage in the background while your filesystem is RW. (People with huge arrays really want this).
Posted Mar 31, 2025 4:23 UTC (Mon) by jmalcolm (subscriber, #8876)

Posted Mar 31, 2025 4:57 UTC (Mon) by koverstreet (✭ supporter ✭, #4296)
Posted Apr 3, 2025 8:06 UTC (Thu) by DemiMarie (subscriber, #164188)

Posted Apr 3, 2025 13:52 UTC (Thu) by koverstreet (✭ supporter ✭, #4296)
Is self-healing always good?
Posted Apr 3, 2025 8:03 UTC (Thu) by DemiMarie (subscriber, #164188)

Is self-healing always wanted? My concerns are:

If the filesystem can’t tell if file X should be there or not, or is uncertain as to what its contents should be, I would prefer that all attempts to access X fail with something other than -ENOENT until and unless the administrator tells the filesystem to use its best guess of what the pre-corruption situation was, or X is overwritten by an operation that makes that state irrelevant. Silently returning wrong data is the worst possible outcome.

Posted Apr 3, 2025 13:59 UTC (Thu) by koverstreet (✭ supporter ✭, #4296)
There are cases where fsck will delete things, but for the most part that's only if we have another piece of metadata that says "this shouldn't exist".
e.g., an extent past the end of an inode: something went wrong with truncate.
If a reflink pointer points to a missing indirect extent, we just mark it as poisoned, so on future attempts to read from it we don't have to print out the same error. We can un-poison it if the indirect extent comes back; this guards against a transient lookup error in the reflink btree.
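Sketched in C with made-up names (the real bcachefs extent and error-reporting paths differ): a read through a poisoned reflink pointer fails with an error rather than returning wrong data, the error is only logged the first time, and the poison clears if the indirect extent reappears.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical reflink pointer: 'poisoned' records that the indirect
     * extent it points to was missing on a previous lookup. */
    struct reflink_ptr {
            uint64_t idx;  /* index into the reflink btree */
            bool poisoned;
    };

    /* Stand-in for walking the reflink btree. */
    extern bool indirect_extent_exists(uint64_t idx);

    static int read_via_reflink(struct reflink_ptr *p)
    {
            if (indirect_extent_exists(p->idx)) {
                    p->poisoned = false;  /* extent came back: un-poison */
                    return 0;             /* proceed with the read */
            }

            if (!p->poisoned) {
                    p->poisoned = true;   /* print the error only once */
                    /* report the missing indirect extent here */
            }
            return -EIO;                  /* fail loudly, never return junk */
    }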
For the snapshots btree, a key for a snapshot node that doesn't exist generally indicates a problem with snapshot deletion, and the key will be deleted. But we also track when a btree has lost data (topology error, IO error), and if the snapshots btree has lost data we'll instead try to reconstruct snapshot tree nodes (and also subvolume keys, etc.).
We can reconstruct inodes if the inodes btree has lost data (permissions, ownership, timestamps etc. will all be wrong, and i_size will be a bit off but you'll still have the correct file contents).
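A rough sketch of what that reconstruction could look like (hypothetical helpers, not the bcachefs implementation): a replacement inode is synthesized from the extents that still reference its inode number, so i_size is approximated by the end of the last surviving extent and everything else falls back to defaults.

    #include <stdint.h>

    /* Hypothetical minimal inode; real bcachefs inodes carry much more. */
    struct recon_inode {
            uint64_t inum;
            uint64_t i_size;
            uint32_t i_mode; /* guessed default: regular file, 0600 */
    };

    struct extent {
            uint64_t inum, offset, len;
    };

    /* Stand-in iterator over surviving extents for an inode number;
     * pass NULL to start, returns NULL when done. */
    extern const struct extent *next_extent_for(uint64_t inum,
                                                const struct extent *prev);

    static struct recon_inode reconstruct_inode(uint64_t inum)
    {
            struct recon_inode ino = { .inum = inum, .i_mode = 0100600 };
            const struct extent *e = NULL;

            /* i_size is at best the end of the last extent: a bit off,
             * as noted above, but the contents themselves survive. */
            while ((e = next_extent_for(inum, e)))
                    if (e->offset + e->len > ino.i_size)
                            ino.i_size = e->offset + e->len;

            return ino;
    }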
This topic is an area of future research, but for all practical purposes we're in good shape.
