
Filesystems, testing, and stable trees

Posted Jun 15, 2022 11:05 UTC (Wed) by nix (subscriber, #2304)
Parent article: Filesystems, testing, and stable trees

> The key to having stable kernels with stable filesystems is being able to run fstests (formerly xfstests) on them.

This is important, but it would not spot problems of the class linked higher in this article (in which I have a personal interest, being the person that bug bit). AIUI, fstests creates filesystems and then tests them, all under the same kernel version: situations in which filesystems touched by earlier kernels (or, alternatively, made by older mkfses) have properties that newly-mkfsed ones do not will never be spotted. That is how the bug in question sneaked in: the new kernel could handle stuff *it* had written fine, just not stuff written by older kernels predating the problematic change.

And, of course, almost all filesystems will have been written by "older kernels" by this definition: nobody re-mkfses all their writable filesystems on every kernel upgrade!

(Maybe an fstests run mode in which the mkfses and some of the writes are done by a stable baseline kernel, and the rest by the kernel under test, might work; but where do you draw the boundary between which things should be done by the older kernel and which by the newer? Bugs of this nature might occur after any of those writes...)
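Such a split run does not exist in fstests today; as a minimal sketch, it might look something like the dry-run driver below. The phase names, the `TEST_DEV` variable, and the `/dev/vdb` default are all hypothetical; only fstests' `./check` script is real, and the `run` helper just echoes commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical two-phase fstests driver: the "old" phase runs under a
# stable baseline kernel, then the machine reboots into the kernel under
# test and runs the "new" phase against the filesystem the old kernel
# wrote.  Dry-run sketch: 'run' only echoes what would be executed.
set -eu

DEV=${TEST_DEV:-/dev/vdb}    # assumed test device, not an fstests default

run() { echo "+ $*"; }       # swap 'echo' for real execution when wired up

phase_old() {
    run mkfs.ext4 -F "$DEV"  # format and populate under the baseline kernel
    run ./check -g quick     # stress the fs so it carries old-kernel state
}

phase_new() {
    # Deliberately no re-mkfs: the point is exercising on-disk state that
    # the *old* kernel wrote, which a same-kernel run can never cover.
    run fsck.ext4 -fn "$DEV" # new kernel must at least parse the old layout
    run ./check -g quick     # rerun the tests on the inherited filesystem
}

"phase_${1:-old}"            # select the phase: old (default) or new
```

The open question from above remains: every write the old phase performs is a potential boundary, so any fixed split can still miss bugs that only trigger after a write the new kernel happened to do first.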


Filesystems, testing, and stable trees

Posted Jun 15, 2022 16:11 UTC (Wed) by Wol (subscriber, #4433)

This sounds like a problem that hit raid: a change sneaked in that accidentally altered the layout of an asymmetric raid-10. (Asymmetric as in different-sized disks.)

It was really nasty in that you had to run a kernel from the same side of the change as the one that created the raid: both a new kernel on an old raid and an old kernel on a new raid would lead to data corruption. What do you do? You can't revert the change, because any arrays created with a modified kernel would become inaccessible.

They ended up adding a flag to the raid metadata to identify old or new layout, and all new kernels now refuse to start an array if that flag is missing. It's easy to add, so new kernels are fine, people are unlikely to revert to old kernels, and asymmetric raids aren't that common ...

Yes, upgrading does show up a completely different class of bug ... :-)

Cheers,
Wol

Filesystems, testing, and stable trees

Posted Jun 17, 2022 14:58 UTC (Fri) by nix (subscriber, #2304)

That was much worse because it wasn't spotted for ages, so simply reverting the change wasn't possible: many people were already running arrays built with the new code. :/ I'm ever so glad I wasn't the person who had to fix that.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds