White paper: Vendor Kernels, Bugs and Stability

Posted May 17, 2024 20:58 UTC (Fri) by jd (guest, #26381)
Parent article: White paper: Vendor Kernels, Bugs and Stability

I understand the value in keeping old kernels - anything that depends on a feature whose semantics have changed (even subtly) isn't good news for corporations, and new code risks new bugs.

However, I'm pretty sure we've reached the point where old vulnerabilities are costing businesses money, and there's also the cost of hiring developers to backport features.

Here's a what-if to consider. What if the distros and the corps spent exactly the same amount of money on a mix of QA engineers and developers for much more recent stable kernels?

In other words, actively find and fix the new bugs in kernels where many of the old vulnerabilities have already been corrected, rather than passively backport a selection of those fixes?

Yes, you would need much more QA, because you're having to find bugs that haven't yet been fixed by anyone. But my suspicion is that this would be a far more productive approach.

However, suspicion and reality don't usually agree. Security issues can be evaded, to a degree, with firewalls, and this approach may carry an economic penalty, since corps prefer known defects to unknown ones. You can mitigate known risks, after all.

Furthermore, such an approach depends on competent QA and my experience is that really good QA engineers are rare because if they're that good, they're unlikely to be attracted to QA.

But if long-term stable Linux distros aren't to acquire the same security taint as... certain other vendors I could mention, the status quo won't work and this is the most obvious alternative.



White paper: Vendor Kernels, Bugs and Stability

Posted May 17, 2024 21:05 UTC (Fri) by bluca (subscriber, #118303) [Link] (9 responses)

The problem is not bugs. The problem is when userspace backward compatibility is intentionally broken, which happens on every kernel release. That means you have to go and make all affected userspace compatible with both new and old kernels, and that's expensive, and you cannot do that every 3 months.

White paper: Vendor Kernels, Bugs and Stability

Posted May 17, 2024 22:37 UTC (Fri) by jra (subscriber, #55261) [Link] (8 responses)

> The problem is when userspace backward compatibility is intentionally broken, which happens on every kernel release.

I don't believe that is true. Can you point to any lkml discussions about intentionally breaking userspace (that don't end up with Linus being very cross :-) ?

The kernel community IMHO is extraordinarily careful not to break userspace compatibility.

Compatibility

Posted May 17, 2024 22:42 UTC (Fri) by corbet (editor, #1) [Link] (7 responses)

Unfortunately, Luca and company are not entirely wrong on this front. I still need to find some time to figure out what ultimately happened with this recent episode, for example. We're not as good at this as we like to think we are.

Compatibility

Posted May 17, 2024 22:48 UTC (Fri) by jra (subscriber, #55261) [Link] (6 responses)

Testing, testing, testing :-). Every time we do this in Samba we add a new regression test to make sure we at least never break that specific network feature again.
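
A minimal sketch of what such a test looks like (the helper function, its old bug, and the expected values here are all hypothetical, invented purely for illustration):

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper that, in an earlier version, mishandled empty
     * share names by returning NULL. */
    static const char *normalize_share_name(const char *name)
    {
        return (name == NULL || *name == '\0') ? "IPC$" : name;
    }

    int main(void)
    {
        /* Regression test for the (hypothetical) empty-name bug: the broken
         * version returned NULL here, tripping up clients relying on IPC$. */
        assert(strcmp(normalize_share_name(""), "IPC$") == 0);
        assert(strcmp(normalize_share_name(NULL), "IPC$") == 0);
        /* Normal names must still pass through untouched. */
        assert(strcmp(normalize_share_name("data"), "data") == 0);

        printf("regression tests passed\n");
        return 0;
    }

Once a test like this is in the tree, the behaviour it encodes can only be changed deliberately, never by accident.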

I think that, by the time upstream changes make it into Greg's stable kernel trees, any userspace breakage has already been found and fixed. I'm not aware of any userspace breakages there (although please correct me if I'm wrong, I'm still pretty new to this kernel stuff).

"You can never be too thin, too rich, have too much disk space or too many regression tests" :-).

Compatibility

Posted May 17, 2024 22:52 UTC (Fri) by bluca (subscriber, #118303) [Link]

See the first comment on this article for another, even more recent example.

Compatibility

Posted May 18, 2024 6:47 UTC (Sat) by marcH (subscriber, #57642) [Link]

Promises without tests are just empty words; if it's not tested then it does not work. Every piece of software is only as good as the tests it's passing.

Etc.

Compatibility

Posted May 19, 2024 4:40 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (3 responses)

One difficulty that makes this a reality is present in all software: some features happen by accident, never intended by the developers, but turn out to be convenient side effects. They also have the interesting property of not being tested, since nobody knows they exist. When users start to rely on them and the code changes, things can break.

This is commonly visible in setup scripts that rely on the contents of /proc or /sys, the output of lsmod, and so on. In other software, a major upgrade will sometimes add support for new options to a keyword and reveal that the previous version silently ignored such options, so a typo that was harmless before now raises an error.
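
To make the /proc example concrete (this snippet is invented, not taken from any real tool), here is the kind of fragile consumer I mean, in C: it hard-codes the current leading "name size refcount" fields of /proc/modules, so any change in that layout would break it silently rather than loudly:

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/modules", "r");
        char name[64];
        unsigned long size;
        int refcount;

        if (!f) {
            perror("/proc/modules");
            return 1;
        }

        /* Assumes "name size refcount ..." per line; a reordered or extra
         * leading field would turn this into garbage parsing, not an error. */
        while (fscanf(f, "%63s %lu %d %*[^\n]", name, &size, &refcount) == 3)
            printf("%-24s %8lu bytes, used by %d\n", name, size, refcount);

        fclose(f);
        return 0;
    }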

There's never a perfect solution to this. If users only used perfectly documented and intended features, systems and programs would be very poor and boring, and nothing would evolve. They would rightfully complain even more about the occasionally needed breakage due to architectural changes. If they use every possibility they stumble upon, they spark the addition of new features but face unintended changes more often, and developers have a harder time making their code evolve.

A sweet spot always has to be found, where developers document the intended use and users scratch a bit around what is officially supported, keeping the intended use in mind so as to limit the surprise in case of changes.

Overall I find that Linux is not bad at all on this front. You can often run very old software on a recent kernel and it still works fine. At the same time, I'm one of those who are extremely careful about major upgrades, because if I don't see the breakage now, I suspect it will happen behind my back at the moment I would have preferred not to face it. Nobody wants to discover that their microphone no longer works when joining a video conference, or that their 4G adapter fails to initialize while waiting for a train or a plane.

On a laptop I've very rarely experienced issues, except with broken or poorly supported hardware whose behavior would change in various ways across versions. On servers I'm much more cautious, because of subtle changes experienced over a long time in bonding/bridging/routing setups, and because iptables keeps gaining configuration granularity that reveals, in the field, that you're missing some new config option from time to time. And this is a perfect reason for sticking to a stable kernel when you have other things to do than retest everything.

Compatibility

Posted May 19, 2024 17:00 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

> There's never a perfect solution to this. If users only used perfectly documented and intended features, systems and programs would be very poor and boring, and nothing would evolve. They would rightfully complain even more about the occasionally needed breakage due to architectural changes. If they use every possibility they stumble upon, they spark the addition of new features but face unintended changes more often, and developers have a harder time making their code evolve.

I don't think you're right here. We'll never find out, of course, because developers (and open source collaborative projects are amongst the worst) are very bad at writing documentation.

> A sweet spot always has to be found, where developers document the intended use and users scratch a bit around what is officially supported, keeping the intended use in mind so as to limit the surprise in case of changes.

That sweet spot, as an absolute minimum, needs to include DESIGN documentation. This is where Linus' management style really fails (although given that he's cat-herd in chief I don't know any style that would succeed), because so much of Linux is written by people scratching itches, fixing bugs, and everyone is fixing their immediate problem.

Nobody steps back and asks "what is the purpose of the module I'm working on? What is sensible or stupid? How can I design it to be as *small*, *complete*, and *self-contained* as possible?".

Cheers,
Wol

Compatibility

Posted May 20, 2024 11:15 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

> needs to include DESIGN documentation

You're absolutely right on these points. I have asked myself in the past why this situation arises in so much software documentation, and came to the conclusion that, aside from a few exceptions, there's mostly no design as such, only improvements on top of something that already exists. The lack of a design phase implies a lack of design documentation. Sometimes someone spends so much time reverse-engineering something they work on that they take a lot of notes and end up producing a document describing the existing design. It does not necessarily justify the choices made, but it can be a good starting point to encourage new participants to update the doc. But that's not *that* common.

I'm not seeing any good solution to this, though :-/

Compatibility

Posted May 23, 2024 16:49 UTC (Thu) by anton (subscriber, #25547) [Link]

> A sweet spot always has to be found, where developers document the intended use and users scratch a bit around what is officially supported, keeping the intended use in mind so as to limit the surprise in case of changes.

That's what C compiler advocates use to defend miscompilation: "works as documented". And while this bad principle is used for declaring gcc bug reports invalid, the actual practice of gcc development seems to be better: They use a lot of tests to avoid breaking "relevant" code (including for cases outside the documented behaviour), and the irrelevant code rides in the slipstream of relevant code.

Fortunately, the kernel aspires to a better principle, the often-cited "we don't break user space". And given that we have no way to automatically check whether a program conforms to the documentation, that principle is much more practical (and the practice of gcc is also in that direction, but only for "relevant" code).

Concerning accidental features, a careful design of interfaces avoids those. E.g., one can catch unintended combinations of inputs and report them as errors. Or define and implement useful behaviour in such a case. Exposing some arbitrary accident of the implementation in such cases leads to the difficulties you point out. Given that Linux' interface to user space is also a security boundary, carefully defining interfaces and avoiding exposing implementation artifacts is a good idea anyway.
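
As a small, invented illustration of that point (the flag names and function below are made up, not from any real interface), a call can reject undefined flag bits up front instead of silently ignoring them:

    #include <errno.h>
    #include <stdio.h>

    /* Hypothetical flag bits for a hypothetical call. */
    #define MYAPI_FLAG_ASYNC    0x1u
    #define MYAPI_FLAG_VERBOSE  0x2u
    #define MYAPI_VALID_FLAGS   (MYAPI_FLAG_ASYNC | MYAPI_FLAG_VERBOSE)

    static int myapi_do_thing(unsigned int flags)
    {
        /* Bits that are undefined today are rejected, so they remain free
         * for future features instead of becoming accidental no-ops that
         * callers might come to rely on. */
        if (flags & ~MYAPI_VALID_FLAGS)
            return -EINVAL;

        printf("doing the thing (flags=0x%x)\n", flags);
        return 0;
    }

    int main(void)
    {
        printf("known flag:   %d\n", myapi_do_thing(MYAPI_FLAG_ASYNC));
        printf("unknown flag: %d\n", myapi_do_thing(0x8u));  /* rejected */
        return 0;
    }

Newer system calls such as openat2() follow this discipline: unknown flags are rejected today precisely so that they can mean something tomorrow.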

White paper: Vendor Kernels, Bugs and Stability

Posted May 17, 2024 21:23 UTC (Fri) by ronniesahlberg (guest, #171541) [Link] (2 responses)

"Here's a what-if to consider. What if the distros and the corps spent exactly the same amount of money on a mix of QA engineers and developers for much more recent stable kernels?"

I think this is where we need to go, and in a sense it would be a continuation of a collaborative process that has been happening for a while.
For example, in the past it was very common for different vendors to have their own private trees with their own private drivers, and it was always very difficult and expensive for them to move from one kernel release to another. Today most of those vendors collaborate and work upstream, making life much easier for the vendors themselves, as they can now focus on a common codebase instead of spending time maintaining their own incompatible code bases.

I think it would be good if we extended this to the packaging of stable vendor kernels as well. Instead of everyone maintaining their own private special kernels, each backporting their own selected small subset of important fixes, everyone could collaborate on a set of upstream stable kernels and ensure that they all get all the fixes that are needed.

White paper: Vendor Kernels, Bugs and Stability

Posted May 17, 2024 21:26 UTC (Fri) by ronniesahlberg (guest, #171541) [Link]

To clarify, my post above that talks about vendors in the past all having their own private trees is referencing embedded systems vendors and arm, not distribution vendors.

White paper: Vendor Kernels, Bugs and Stability

Posted May 20, 2024 12:20 UTC (Mon) by Wol (subscriber, #4433) [Link]

> "Here's a what-if to consider. What if the distros and the corps spent exactly the same amount of money on a mix of QA engineers and developers for much more recent stable kernels?"

"You can't QA quality into a product".

"10 minutes spent on design knocks 1hr off development".

"Every problem that slips through one phase takes 10 times longer to correct in the next".

As I moaned above: design, Design! DESIGN. If that money were spent on *documentation* and *design*, it would probably find far more bugs, far more quickly, and, what's more important, get them fixed (probably by other people :-)

The current setup encourages a kernel held together by baling wire, duct tape, and sealing wax ...

Cheers,
Wol

