
Linux Kernel Security Done Right (Google Security Blog)

Over on the Google Security Blog, Kees Cook describes his vision for approaches to assuring kernel security in a more collaborative way. He sees a number of areas where companies could work together to make it easier for everyone to use recent kernels rather than redundantly backporting fixes to older kernel versions. It will take more engineers working on things like testing and its infrastructure, security tool development, toolchain improvements for security, and boosting the number of kernel maintainers:
Long-term Linux robustness depends on developers, but especially on effective kernel maintainers. Although there is effort in the industry to train new developers, this has been traditionally justified only by the "feature driven" jobs they can get. But focusing only on product timelines ultimately leads Linux into the Tragedy of the Commons. Expanding the number of maintainers can avoid it. Luckily the "pipeline" for new maintainers is straightforward.

Maintainers are built not only from their depth of knowledge of a subsystem's technology, but also from their experience with mentorship of other developers and code review. Training new reviewers must become the norm, motivated by making upstream review part of the job. Today's reviewers become tomorrow's maintainers. If each major kernel subsystem gained four more dedicated maintainers, we could double productivity.




Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 9:09 UTC (Wed) by pabs (subscriber, #43278) [Link]

I wonder which companies he is suggesting could be donating engineer time for Linux but aren't.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 11:07 UTC (Wed) by smoogen (subscriber, #97) [Link] (7 responses)

While having more engineers is a worthy goal, if his goal is to get newer kernels used, then he needs to look for more policy makers and bureaucrats rather than engineers. Most of the pressure to keep 'old-stable' kernels on various systems comes from mountains of regulations which require companies to use specific software which has been 'accredited' against multiple standards.

Most of those regulations were written to deal with the last catastrophe, where people quit using their 'common sense' in one way or another, so they tend to encourage a backward-looking way of fixing things. [Yes, some of them are written to create barriers to entry or monopolies or whatever, but in general most exist because people did something 'stupid' multiple times (versus once).] So what is needed is getting the policies written with forward thinking about the next problem rather than the last one. [Also, much of this regulation isn't government but instead finding out your company can't get any insurance unless you can show your processes meet some new standard. The bigger your company, the more these industry standards come into play.]

The other area is that software is a complicated beast. You run version X.Y of the kernel and the software works (say your payroll system or your manufacturing looms). You run version X.(Y+m) of the kernel and something starts breaking. Then you play a long game of blame between the software you need to work and the kernel over whose fault it is. That is time and money you usually don't have, because your competition instead moved to X.Y.(Z+n) of the kernel, or didn't move at all, and is running fine. [Usually the problem isn't with either the software you need or the kernel. It is some middle piece which no one remembers but needs fixing.]

This is a different sort of bureaucracy, because you need to work out a lot of things across all the different software needed for various industries to work. It is also more people time than engineering time, because it is about getting a coalition of companies to work together.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 12:59 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (1 responses)

You don't even need to look to bureaucracy to see such problems in the field. Even at the engineer's level it's extremely common to see people not interested in updates, except within the same stable branch. Look at the previous thread right here about stable kernels, where it's easy to find cases such as "been hit by a regression once, no thanks". The whole chain from the engineer to the CEO needs to be convinced that regular updates are a necessity. But for this it's also important to make them understand that "regular" or "frequent" updates do not require systematically upgrading to the latest one each time it is out, but that every time there is an opportunity for an update, it ought to be evaluated and performed. Also, one must not think in terms of "what could this update bring me" but "do I have a really good and compelling reason for not applying it". This involves better tooling to perform automatic fetches, updates, rebases, oldconfig, builds, regtests, etc. from the latest kernels, but we must pass the message that this is mandatory; it's not an option.

Right now the benefit of updating is often not perceived, hence it is balanced against the risk of regression, and that's what must change. Updating is not a benefit, it's a necessity. The exception should be not updating, for a good reason.
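The tooling this comment calls for (automatic fetch, rebase, oldconfig, build, regtests) could be sketched roughly as follows. Everything here is hypothetical: the paths, the branch name, and the `run-regtests.sh` hook are placeholders, not real tools, and the sketch is dry-run by default so it only prints each step.

```shell
#!/bin/sh
# Hypothetical sketch of an automated kernel-update pipeline: fetch the
# latest stable kernel, rebase local patches, carry the config forward,
# build, and run regression tests.  KERNEL_DIR, BRANCH, and
# run-regtests.sh are placeholders.  Dry-run by default: each step is
# printed rather than executed unless DRY_RUN is cleared.
set -e

KERNEL_DIR="${KERNEL_DIR:-$HOME/src/linux-stable}"
BRANCH="${BRANCH:-linux-5.10.y}"
DRY_RUN="${DRY_RUN:-1}"

run() { echo "+ $*"; [ -n "$DRY_RUN" ] || "$@"; }

run git -C "$KERNEL_DIR" fetch origin              # automatic fetch
run git -C "$KERNEL_DIR" rebase "origin/$BRANCH"   # rebase local patches
run make -C "$KERNEL_DIR" olddefconfig             # carry .config forward
run make -C "$KERNEL_DIR" -j"$(nproc)"             # build
run ./run-regtests.sh "$KERNEL_DIR"                # regtests (placeholder)
```

The point of the wrapper is that each candidate update becomes a single cheap command to evaluate, which is what makes "evaluate every update" practical at all.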

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 14:03 UTC (Wed) by smoogen (subscriber, #97) [Link]

My apologies, I should have better described what I meant by 'bureaucracy'. I was short-handing the various people problems of 'I was bitten by this, so we aren't doing it that way again', which then either get officially written up as regulations or become a general culture of 'we stick to the old, true item'. Fixing that usually takes as many meetings, diagrams, and PowerPoint slides as getting all the automated testing tools in place. It also usually takes an outside source of 'truth', because humans tend to discount internal answers in favor of external ones. (AKA why did XYZ company pay 10 million to an outside consultant who basically said the same thing their own engineers had been saying for 5 years?)

That is the part that I think any security initiative will also need :/.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 13:08 UTC (Wed) by pmenzel (subscriber, #113811) [Link] (2 responses)

> The other area is where software is a complicated beast. You run version X.Y of the kernel and the software works (say your payroll system or your manufacturing looms). You run version X.(Y+m) of the kernel and something starts breaking. Then you play a long game of blame between the software you need to work and the kernel about whose fault it is.

Actually, Linux’s no-regression rule, “We do not break userspace.”, is very clear, making it Linux’s problem.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 14:06 UTC (Wed) by smoogen (subscriber, #97) [Link]

Yes, after you have spent time and effort proving that it actually is a regression rather than bad code. It is very rarely a one-and-done email; it is a lot of work getting the evidence together that the kernel is broken, versus a buggy compiler, library set, or a dozen other things it could also be. That is time and energy most end companies have no plans to spend, so they will either stick to a known working state or pay someone else to do that work for them while keeping them on a known working state.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 10, 2021 15:01 UTC (Tue) by geert (subscriber, #98403) [Link]

And running "git bisect" to find the Linux version that broke is usually much easier than trying to bisect a complex userspace stack.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 13:12 UTC (Wed) by jwb (guest, #15467) [Link] (1 responses)

It is not the regulations themselves, but the quasi-bureaucracy of consultants and charlatans who tell companies lies about the regulations. Some guy will tell you that your company must use x.y.z to comply with FIPS 140-2, but he is lying. The convenient existence proof is Google, which maintains and releases its own kernels, every week, and has as many regulatory certifications as you can name.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 4, 2021 14:12 UTC (Wed) by smoogen (subscriber, #97) [Link]

It depends on the specific regulations.

Some are written so that anyone can 'meet them in the field', and others are written so that you must run XYZ code as certified by ABC firm of auditors and published in DEF registry. My understanding was that if you need to meet the second set of certifications, you don't get the latest kernel/library set they publish; you instead get an old set which did meet those certificates and is patched with backports in whatever way the regulation allows for.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 5, 2021 2:35 UTC (Thu) by timrichardson (subscriber, #72836) [Link]

There are lots of Python libraries with 0.001% of the complexity of the kernel, with great test coverage, and written in a safer language. And in production, wise people pin the version. This is the real world, and the idea of running a continuously built kernel in production seems far-fetched, even with 100 new maintainers.

As for the metaphor: the US car industry was not reformed. It was out-competed by new entrants who started without legacy baggage and with a focus on quality as a point of difference. Maybe the vision is possible with Rust and modern software engineering, assuming the problem is actually real enough. How often does Linux catch fire, by the way? The only problem I have on my servers is OOM, and that's rare and seems to be getting fixed.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 5, 2021 7:06 UTC (Thu) by marcH (subscriber, #57642) [Link]

> More engineers for code review

How many companies allocate or track time for code reviews, or reward them? Maintainers are overwhelmed by code reviews because they're the only ones facing any consequences for the lack of them. All other engineers are running behind their underestimated schedules and have "more important things to do" than code reviews.

You get what you pay for.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 5, 2021 9:22 UTC (Thu) by ncm (guest, #165) [Link] (1 responses)

Most companies perceive software updates through the filter of (1) it still works, or (2) it broke. Not updating guarantees (1). Updating risks (2), meaning some unknown fraction of updates produce (2), while the rest are (1) and are seen as exactly as good as the previous version, not better.

In such an environment it is not hard to see why updates are considered nothing but trouble. Without a change in incentive structure, there is no reason to expect a change in behavior will even be possible in almost all organizations.

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 5, 2021 13:31 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Unfortunately, this leaves out the externalities of security breaches via holes that were fixed but never deployed. Until those start getting factored in (via fines or actually punitive damages… not piddling amounts like $7.5M [1]), updates are always going to look "more risky", because companies put blinders on to the risk of staying on the current version. We have safety standards for infrastructure such as bridges, buildings, roads, etc. I really don't see why we, as a society, completely ignore them for the infrastructure we've layered underneath everything else.

[1] https://arstechnica.com/gadgets/2021/08/google-class-acti...

Linux Kernel Security Done Right (Google Security Blog)

Posted Aug 5, 2021 12:28 UTC (Thu) by scientes (guest, #83068) [Link]

Gernot Heiser says that Linux's habit of giving drivers of dubious quality the ability to subvert the security of the entire system warrants a rewrite. I think he has a point: security against a determined local adversary (rather than just script kiddies) is impossible, despite the great skill of people like Kees Cook (with whom I once had the pleasure of eating Chinese food at a conference).


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds