The more reliable a service is, the more alarm when it is unavailable

Posted Dec 17, 2023 1:39 UTC (Sun) by mcassaniti (subscriber, #83878)
Parent article: Ext4 data corruption hits the stable kernels

For a lot of us, our utility providers (water, gas, electricity) deliver a service that is almost never offline, even for scheduled maintenance. When one of these services is unavailable due to an emergency outage, the panic over when it will be back, and how long that takes, is noteworthy.

What does this have to do with stable kernels? The better we get as a community at providing stable anything, without regressions/errors/bugs, the greater the alarm when it goes wrong. The 99.99...% of the time the team gets these things right doesn't matter; all that gets noticed is what is left. The more 9's, the more the alarm.

I can't thank the community enough, across all projects, for the thankless work they put into making our software ecosystem great. I have a very stable and reliable desktop experience thanks to all of you. You're human and you'll get it wrong sometimes, but you'll fix it and move on. Which one matters more?



The more reliable a service is, the more alarm when it is unavailable

Posted Dec 17, 2023 12:38 UTC (Sun) by wtarreau (subscriber, #51152)

That's a very wise view, thanks for your positivity!

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 18, 2023 11:19 UTC (Mon) by farnz (subscriber, #17727)

On top of that, the better any project is at avoiding bugs, the less prepared people are for the effects of a bug. If data corruption due to kernel bugs is a monthly thing, you have better backups, more practice checking whether you need to restore from backup, and so on; if data corruption due to kernel bugs is a "doesn't happen" thing, you get lazy about backups, and you don't have experience checking backups against the primary copy.

Plus, if you have a common class of bug that recurs, people learn not to tickle that class of bug. I've seen, for example, users who always shut down their system via the menus and then power it up again, instead of using "reboot" options, because they got trained that "reboot" didn't reliably work on the systems they used, whereas power down visibly worked. They would never find a "reboot" bug, because they never rebooted per se; they always shut down, and then pressed the button to power the machine back up.

So, one reason the stable kernels appear less stable is simply that the kernel devs have become good at not having similar bugs over and over again - which means you never get trained to avoid that class of bug, and you aren't prepared for it when it happens, because it's a "once-in-a-lifetime" event, and not a "must be Tuesday" event.

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 18, 2023 13:20 UTC (Mon) by Wol (subscriber, #4433)

This is similar to Health & Safety - I would always teach people to be aware of hazards, and how to deal with said hazards.

Trying only to ensure that no H&S-relevant accidents ever happen, rather than preparing people for them, will actually make matters much worse. If everybody is trained in dealing with accidents, then firstly accidents are less likely, because people will be more aware, and secondly they will be less serious, because everybody knows what to do.

The trouble is when a minor incident runs away into a major accident because everyone is running around like headless chickens with no idea what to do.

Cheers,
Wol

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 19, 2023 22:21 UTC (Tue) by mirabilos (subscriber, #84359)

So… the alternative would be to follow Torvalds’ releases, assuming any stable-series Linux kernel’s fixes will have shown up in those first (other than the fixes inserted by distributors, of course).

But, how realistic is it to do that on, say, a Debian bullseye that was released with 5.10?

How tied to the kernel (if not to its exact version, then at least to its generation) are the various utilities that were compiled against 5.10's headers?

Sure, there is the compatibility guarantee not to break userspace, but what about stuff tightly bound to the kernel? Things like, oh I don't know, ifupdown/iproute2, udev, procfs/sysfs consumers in general, LUKS/LVM/dmsetup, WLAN, FUSE, X11? The thinkpad out-of-tree module and thinkfan, perhaps? I think people with systemd will have even worse trouble…

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 20, 2023 6:35 UTC (Wed) by donald.buczek (subscriber, #112892)

From experience: you won't have any trouble with incompatibilities between older userspace and newer kernels.

You will have trouble with (and need to resolve) misconfiguration when you use `make olddefconfig` to go from one mainline release to the next. That isn't bullet-proof, because config structure and dependencies sometimes change between releases. A silly, easy-to-resolve example: we just lost CONFIG_UNIX, because it went from tristate to boolean and "m" was converted to "n" by `make olddefconfig`. Without CONFIG_UNIX, systemd doesn't get far.
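If you want to catch that kind of silent demotion before rebooting, here is a minimal sketch of the check, assuming the usual .config line format; the file names below are only placeholders, and the kernel tree's own scripts/diffconfig does the same job:

    #!/usr/bin/env python3
    # Compare two kernel .config files and report symbols that changed value,
    # disappeared, or were newly added. A rough stand-in for scripts/diffconfig.
    import re
    import sys

    SET_RE = re.compile(r'^(CONFIG_\w+)=(.*)$')
    UNSET_RE = re.compile(r'^# (CONFIG_\w+) is not set$')

    def parse(path):
        """Map each CONFIG_* symbol to its value ('n' for "is not set" lines)."""
        opts = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                m = SET_RE.match(line)
                if m:
                    opts[m.group(1)] = m.group(2)
                    continue
                m = UNSET_RE.match(line)
                if m:
                    opts[m.group(1)] = 'n'
        return opts

    def report(old_path, new_path):
        old, new = parse(old_path), parse(new_path)
        for sym in sorted(old):
            if sym not in new:
                print(f"-{sym} (was {old[sym]}, gone from new config)")
            elif new[sym] != old[sym]:
                print(f" {sym} {old[sym]} -> {new[sym]}")
        for sym in sorted(set(new) - set(old)):
            print(f"+{sym} {new[sym]}")

    if __name__ == '__main__':
        report(sys.argv[1], sys.argv[2])

Running it as, say, python3 checkconfig.py .config.old .config after `make olddefconfig` would have flagged CONFIG_UNIX going from "m" to "n".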

You will hit bugs introduced by each new mainline release, at least if you roll your kernel out to a lot of different systems and use a lot of kernel features in non-standard ways. If you just have a single machine with a distro userspace, you will probably hit fewer bugs than we do.

OTOH, we hit bugs with stable releases as well, and I get the feeling the frequency is ever increasing. If you analyze a bug and figure out that the bad patch was auto-selected by some heuristic and wasn't needed at all, you start to doubt that "stable" is still worthy of its name. I'm really considering abandoning stable releases and following mainline instead. Probably not yet, but it's not unthinkable. That's not because mainline is good; it's because stable is getting worse. I wish there were less speed and more quality in the development of Linux.

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 20, 2023 14:09 UTC (Wed) by mirabilos (subscriber, #84359)

Oh okay, good point, so diff the old and new .config files first.

And yes, the "not because" point and the desire for deceleration are why I was asking in the first place.

I’ll see what the distro continues to give me for a while, but it’s good to have options. Thanks for sharing your experience.

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 20, 2023 11:32 UTC (Wed) by farnz (subscriber, #17727)

From experience, anything in-tree is fine as you upgrade the kernel (as long as you get the Kconfig correct for the new kernel), although it is not necessarily happy with a downgrade; I have taken systems from a 4.x kernel to a 5.x kernel with no userspace changes, including X11, iproute2, udev, wpa-supplicant, FUSE and others.

systemd is similarly great at coping with newer kernels than it was built for. The only things I've ever had problems with were out-of-tree modules (which needed updates for the new kernel) and a vendor binary that had to be run as root, expected to find what it wanted in /dev/kmem (or possibly /dev/mem; it's been a while), and didn't cope with the newer kernel.

I have, however, switched a Debian Stretch system from 4.9 to 6.1 without any difficulty - and I run Debian Stretch containers very happily on my 6.6.6 kernel.

The more reliable a service is, the more alarm when it is unavailable

Posted Dec 20, 2023 14:13 UTC (Wed) by mirabilos (subscriber, #84359)

Thanks to you too for sharing your experience.

No systemd here.

Containers aren’t sufficient… I run Debian sarge (3.1) chroots for building packages, as a common baseline, and I recently found the debian/eol Docker containers; while the slink (2.1) one doesn’t work without extra hassle (a privileged container), the potato (2.2) one does.

But systems that actually need to boot, set up networking including WLAN, etc. need a bit more kernel↔userspace compatibility, which is why I was asking. That stretch data point is *very* useful.

I think I’ll stick with Debian bullseye for a while (LTS, ELTS), but it’s good to have options and know about them.

The more reliable a service is, the more alarm when it is unavailable

Posted Jan 6, 2024 11:41 UTC (Sat) by sammythesnake (guest, #17693)

I'm curious what you might be using Debian 2.1 (Slink) for, given that it turns 25 this year and has had no security updates since last century(!)...

The more reliable a service is, the more alarm when it is unavailable

Posted Jan 6, 2024 21:36 UTC (Sat) by mirabilos (subscriber, #84359)

I run CI-style build testing (and testsuite running) for code of mine that’s supposed to be portable, across a wide array of OSes (if possible) and versions, to see if it really works everywhere.

You might say to forget about GNU/Linux systems that old, but I have found real portability issues by building things on strange systems, with strange compilers, etc., so I personally think there is value in it (beyond the obvious bragging credits).

Yes, I doubt I would find a slink in production. A lenny probably.

But e.g. someone dug out BOW (4.4BSD on Windows) recently, and with some minor changes, things work well there, and it’s so much better than e.g. Cygwin…

Others are cutting down systems or even writing new Unix-ish systems from scratch (Google Fuchsia, or the rust-y Maestro that recently made news, or even just classics like Haiku and Syllable OS and even Jehanne/Plan 9), and I like for stuff to run on those systems as well.

