Overstreet brings this on himself

Posted Jul 3, 2025 19:00 UTC (Thu) by koverstreet (✭ supporter ✭, #4296)
In reply to: Overstreet brings this on himself by Kluge
Parent article: Bcachefs may be headed out of the kernel

It's also not just about the bugfix itself - it's also about risk mitigation.

I often tell people: when you're looking at e.g. a syzbot bug, don't do the minimum to make the report go away.

Reproduce it locally, read through the log output with an eye towards the behavior of the whole system. Make sure that the behavior in response to the error makes sense (which is a lot more than just not crashing!), and see if you can find other things to improve. That could be logging (we can't debug if we can't see what's going on), repair, or even - like in this case - looking for ways to limit the impact of similar bugs in the future.

We've now had two separate bugs in two weeks where this new repair mode has proved useful.

The second one, for which I just determined the root cause this morning, involved a filesystem with a 400GB postgres database (and a whole bunch of other data) where the directory structure got trashed.

(Two different bugs in two weeks? I'd say getting the code out there quickly was justified; I've learned to trust my intuition.)

Proximate cause was a flaky USB controller and a crazy iscsi setup - which is exactly the sort of thing I love to see: I want people doing the craziest oddball crap they can imagine to break things _now_, before the experimental label gets lifted.

It turns out 6.16 broke btree node read retries for validate errors - not IO errors, we had tests for that. And there's a story as to why the tests were missing: the error injection we used to rely on from one subsystem was dropped and replaced with a supposedly equivalent error injection mechanism from a different subsystem - except the new one wasn't tested for anything except IO error injection, and the other functionality was completely broken.
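As a rough illustration of that gap, here is a toy sketch in Python. It is not bcachefs code, and every name in it (ReadStatus, read_replica, read_with_retry, the inject map) is hypothetical. It only shows the shape of the problem: a read that tries the remaining replicas has to do so for validate errors as well as IO errors, and the error injection used in tests has to be able to trigger both kinds.

```python
from enum import Enum, auto


class ReadStatus(Enum):
    OK = auto()
    IO_ERROR = auto()        # the device failed to return the data
    VALIDATE_ERROR = auto()  # data came back, but failed validation


def read_replica(idx, replicas, inject):
    """Hypothetical single-replica read.

    `inject` maps a replica index to a forced failure, standing in for
    an error injection mechanism that covers both failure kinds.
    """
    forced = inject.get(idx)
    if forced is not None:
        return forced, None
    return ReadStatus.OK, replicas[idx]


def read_with_retry(replicas, inject=None):
    """Try each replica in turn; return (data, log)."""
    inject = inject or {}
    log = []
    for idx in range(len(replicas)):
        status, data = read_replica(idx, replicas, inject)
        if status is ReadStatus.OK:
            log.append(f"replica {idx}: ok")
            return data, log
        # Both IO errors and validate errors must move on to the next
        # replica; retrying only on IO_ERROR is the kind of bug
        # described above.
        log.append(f"replica {idx}: {status.name}, trying next replica")
    return None, log
```

A test that only ever injects IO_ERROR would pass even if the VALIDATE_ERROR path never retried, so each failure kind needs its own injected test case. The log returned alongside the data is the toy equivalent of being able to see which replicas were actually tried.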

Ow. Testing is important.

But we had a lot of logging available to sift through to find out what went wrong, and one thing we're getting in 6.16 (which incidentally also was the patchset that introduced the regression) is much improved logging for data read and btree read errors - which made the missing retry from the rest of the replicas blindingly obvious.

And now I'm about to commit and push new tests for the relevant error path, and the user who hit the second bug is getting most of his stuff back thanks to a combination of journal rewind (that didn't repair everything, the journal didn't go back far enough - we didn't catch it early enough) and writing code to find files by hash/filetype (almost nothing was completely lost, it just ended up in lost+found).

And I'm writing new tests today.

TL;DR: defense in depth, risk mitigation wherever possible, and always have eyes on as much of the system's behavior as possible when things go wonky.