
The strange story of the ARM Meltdown-fix backport

By Jonathan Corbet
March 15, 2018
Alex Shi's posting of a patch series backporting a set of Meltdown fixes for the arm64 architecture to the 4.9 kernel might seem like a normal exercise in making important security fixes available on older kernels. But this case raised a couple of interesting questions about why this backport should be accepted into the long-term-support kernels — and a couple of equally interesting answers, one of which was rather better received than the other.

The Meltdown vulnerability is most prominent in the x86 world, but it is not an Intel-only problem; some (but not all) 64-bit ARM processors suffer from it as well. The answer to Meltdown is the same in the ARM world as it is for x86 processors: kernel page-table isolation (KPTI), though the details of its implementation necessarily differ. The arm64 KPTI patches entered the mainline during the 4.16 merge window. ARM-based systems notoriously run older kernels, though, so it is natural to want to protect those kernels from these vulnerabilities as well.
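On kernels that carry the mitigation-reporting infrastructure, the state of the Meltdown fix can be checked from userspace through sysfs. A minimal sketch (the sysfs path and status strings are the kernel's standard ones, but the file only exists on kernels new enough to include the reporting code):

```python
from pathlib import Path

def meltdown_status(sysfs_root="/sys/devices/system/cpu/vulnerabilities"):
    """Return the kernel's reported Meltdown status string, or None
    if this kernel predates the reporting interface."""
    try:
        return (Path(sysfs_root) / "meltdown").read_text().strip()
    except OSError:
        return None

def kpti_enabled(status):
    """Interpret the status string: 'Mitigation: PTI' means page-table
    isolation is active; 'Not affected' or 'Vulnerable' means it is not."""
    return status is not None and "PTI" in status

if __name__ == "__main__":
    status = meltdown_status()
    print(status or "status not reported by this kernel")
```

Note that this only reports what the kernel believes about itself; it says nothing about whether a given backport actually blocks the attack, which is precisely the distinction at issue below.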

When Shi posted the 4.9 backport, stable-kernel maintainer Greg Kroah-Hartman responded with a pair of questions: why has a separate backport been done when the Android Common kernel tree already contains the Meltdown work, and what sort of testing has been done on this backport? In both cases, the answer illustrated some interesting aspects of how the ARM vendor ecosystem works.

Android Common and LTS kernels

The Android Common kernels are maintained by Google as part of the Android Open-Source Project; they are meant to serve as a base for vendors to use when creating their device-specific kernels. These kernels start with the long-term support (LTS) kernels, but then add a number of Android-specific features, including the energy-aware scheduling work, features that haven't made it into the mainline for a number of reasons, and more. They also contain backports of important features and fixes, including the Meltdown fixes.

The Meltdown-fix backport was quite a bit of work, and it has gone through extensive testing in the Android kernel. Kroah-Hartman worried that the new backport might not have all of the necessary pieces, or might not have been as extensively validated as the Android work; as such, it might not be something that should appear in the LTS kernels. The analogous effort for x86 should not be an example to follow, he said:

Yes, we did a horrid hack for the x86 backports (with the known issues that it has, and people seem to keep ignoring, which is crazy), and I would suggest NOT doing that same type of hack for ARM, but go grab a tree that we all know to work correctly if you are stuck with these old kernels!

The problem with this idea is that not every ARM system is running Android, and pulling from the Android kernel will not work for vendors whose kernels are closer to the mainline. As Mark Brown put it:

While that's a very large part of ARM ecosystem it's not all of it, there are also chip vendors and system integrators who have made deliberate choices to minimize out of tree code just as we've been encouraging them to.

Those vendors would like to have a long-term supported version of the Meltdown mitigations that does not require dragging in all of the other changes that accumulate in the Android kernels. As Brown pointed out, there are increasing numbers of vendors that are doing what the community has been asking for years and staying closer to the mainline. Not providing a proper backport of these important fixes could be seen as breaking the promise that the community has made: run the officially supported stable kernels and you will get the fixes for significant problems.

There is, thus, a reasonable argument to be made that a proper set of backports for the Meltdown fixes should find its way into the LTS kernels. One little problem remains, though: a proper backport should be known to actually work.

Testing deemed optional

Shi's response to Kroah-Hartman's question about testing was, in its entirety: "Oh, I have no A73/A75 cpu, so I can not reproduce meltdown bug." Reproducing the bug on the A73 would be a bit of a challenge, since that processor does not suffer from Meltdown, but the A75 does, so asking for testing results on that CPU does not seem entirely out of line. When Kroah-Hartman repeated his request for testing, though, Ard Biesheuvel responded:

If ARM Ltd. issues recommendations regarding what firmware PSCI methods to call when doing a context switch, or which barrier instruction to issue in certain circumstances, they do so because a certain class of hardware may require it in some cases. It is really not up to me to go find some exploit code on GitHub, run it before and after applying the patch and conclude that the problem is fixed. Instead, what I should do is confirm that the changes result in the recommended actions to be taken at the appropriate times.

Upon receipt of that message, Kroah-Hartman dropped the patch series entirely, complaining: "I can't believe we are having the argument of 'Test that your patches actually work'". He later added that if the developers working on the backport don't have both the hardware and the exploit code, "then someone is doing something seriously wrong". He urged them to complain to ARM Ltd. to get that problem fixed.

At that point, the conversation stopped. Whether the testing problem is on its way toward a solution has not been revealed. It does seem right that the fixes should be merged into the LTS kernels; otherwise the promises that the community has made regarding those kernels will start to look hollow. But the vendors depending on the LTS kernels also have a right to fixes that somebody has actually bothered to test; anybody who has worked in system software for any period of time knows that just checking for adherence to a specification is no guarantee of a working solution.

Index entries for this article
Kernel: Development model/Stable tree
Kernel: Security/Meltdown and Spectre
Security: Meltdown and Spectre



Yes, all software changes should be tested

Posted Mar 15, 2018 17:35 UTC (Thu) by david.a.wheeler (subscriber, #72896) (11 responses)

I completely agree with Greg K-H: "I can't believe we are having the argument of 'Test that your patches actually work'". If you develop a patch to do X, there should be tests to verify that it does it. There can always be arguments about how much testing is enough, but that's different than "should I test it?".

Yes, all software changes should be tested

Posted Mar 15, 2018 19:08 UTC (Thu) by sjfriedl (✭ supporter ✭, #10111)

I dunno, the core memory management part of Linux doesn't smell to me like one of those areas that admits of rare or unusual bugs that would be hard to track down.

Right? :-)

Yes, all software changes should be tested

Posted Mar 15, 2018 23:37 UTC (Thu) by ken (subscriber, #625) (2 responses)

Well, it depends what you mean by testing; compiled and never run is obviously not enough. But I have seen several CPU errata that I had to write fixes for, many of which I had no way to test; the only thing you really can do is look at the generated assembly and make sure the correct instruction is inserted in the correct place.

That is the nature of weird corner cases. Now, was there even a Meltdown test case published?

Yes, all software changes should be tested

Posted Mar 16, 2018 0:53 UTC (Fri) by rahvin (guest, #16953) (1 response)

There was for x86; I don't know if there was for ARM. I would assume that because the ARM company confirmed the exploit on these processors, they have working exploit code. As Greg said, if it's going to be patched there needs to be a test to verify that it works; ARM should step up and provide what's needed to demonstrate that the patch works.

Yes, all software changes should be tested

Posted Mar 16, 2018 13:13 UTC (Fri) by hkario (subscriber, #94864)

> I would assume because the ARM company confirmed the exploit on these processors that they have working exploit code.

I wouldn't say this shows that. It's not uncommon to simply reason about whether some bug in software is possible and fix it on that basis. I'd say it's likely that people who have the hardware's design documents can do exactly the same, based just on the public description of the vulnerability.

Yes, all software changes should be tested

Posted Mar 16, 2018 6:53 UTC (Fri) by epa (subscriber, #39769) (6 responses)

But if the alternative is not to merge the code at all, giving a 100% chance that it isn’t fixed...

As the developer says, there is a magic sequence of processor instructions that has to be done at particular points. That’s the spec the vendor has provided - just as they might make similar black-box pronouncements about buggy floating point division or other hardware bugs to work around. Taking this on trust has the same problems as taking any other vendor guarantee on trust. Lots of code that theoretically conforms to some spec turns out not to work in practice. But to do nothing instead?

It’s also bizarre to recommend using the Android tree after years of preaching about sticking to the mainline.

Yes, all software changes should be tested

Posted Mar 16, 2018 9:23 UTC (Fri) by tchernobog (guest, #73595) (5 responses)

Not fixed does not mean not working. A working vulnerable system is better than a potentially bricked non-vulnerable system.

Not testing code before merging it into an LTS kernel is inexcusable.

Yes, all software changes should be tested

Posted Mar 16, 2018 10:24 UTC (Fri) by ardbiesheuvel (subscriber, #89747) (4 responses)

Alex meticulously regression-tested his backport, so saying the code is untested is not entirely fair.

The debate is about whether it is sufficient to test whether a mitigation such as KPTI in fact does what is expected of it, i.e., unmap the kernel while running in userland, or whether it is mandatory to go all the way to the beginning and test whether unmapping the v4.9 kernel blocks Meltdown attacks just like unmapping the v4.16 kernel does.

Yes, all software changes should be tested

Posted Mar 17, 2018 19:21 UTC (Sat) by epa (subscriber, #39769) (3 responses)

Of course, the original patches to unmap the kernel from userspace were merged without any details about whether they mitigated Meltdown or anything else - they were just hurried into the mainline on the understanding that doing this was now a good idea because important people said so. It's odd to hold these ARM patches to a higher standard, demanding published exploit code before they can go in. Surely they should be merged on the same general principle that address space separation is now the thing to do?

Yes, all software changes should be tested

Posted Mar 18, 2018 14:03 UTC (Sun) by gregkh (subscriber, #8) (2 responses)

That's a very odd claim to make. All of the backported patches were merged to the stable trees after proper testing and validation that they did what they said they did by either myself, or other developers that I trust. If that hadn't happened, I would not have accepted them.

In this way, I am applying the exact same standard to these ARM patches as I have with the other architecture patches of this nature. For me to not apply that same standard would not be very fair, don't you think?

Yes, all software changes should be tested

Posted Mar 18, 2018 20:49 UTC (Sun) by epa (subscriber, #39769)

OK, I apologize. I was only going by what I had read on LWN and other sources. At the time, the address space separation was merged into the kernel but there was no mention of whether it mitigated the Meltdown attack. (The details of Meltdown only became public a few days afterwards.) As far as an outsider could tell, it was just merged because address space separation was generally thought to mitigate some theoretical attack that was likely to be possible, but without any test cases for a specific attack.

Yes, all software changes should be tested

Posted Mar 18, 2018 20:51 UTC (Sun) by epa (subscriber, #39769)

Ah, you are talking about the backports to the stable trees, while I was thinking of the initial landing of address space separation in the development branch.

The strange story of the ARM Meltdown-fix backport

Posted Mar 15, 2018 19:58 UTC (Thu) by flussence (guest, #85566) (3 responses)

It's bewildering to think ARM still exists when the majority of the stories I see involving the company or its licensees are to do with gross managerial incompetence like this, broken proprietary drivers, cruel office politics against FOSS devs, chronic GPL violations etc. Hopefully RISC-V will scare them straight.

The strange story of the ARM Meltdown-fix backport

Posted Mar 15, 2018 20:58 UTC (Thu) by pizza (subscriber, #46)

RISC-V will just be more of the same -- After all, versus ARM it's just a change of the CPU instruction set. All of the peripherals can be just as proprietary as before, the SoC makers can be just as intransigent and FOSS/GPL-hostile as before, and so forth. If anything, I expect RISC-V SoCs to be _worse_ as their makers rush to try to differentiate themselves in areas that ARM has long since standardized.

The strange story of the ARM Meltdown-fix backport

Posted Mar 16, 2018 2:09 UTC (Fri) by atelszewski (guest, #111673) (1 response)

Hi,

> It's bewildering to think ARM still exists

Remember that Arm isn't only the big and complex SoCs.
It's a big player in the microcontrollers area.
And as much as I hate what they do in the Linux ecosystem, I enjoy working with the Cortex-M{0,3} cores.

--
Best regards,
Andrzej Telszewski

The strange story of the ARM Meltdown-fix backport

Posted Mar 16, 2018 2:57 UTC (Fri) by pizza (subscriber, #46)

Keep in mind that Arm doesn't actually _make_ SoCs, be they large or small. They design and license out the CPU cores (along with GPUs and an assortment of other peripherals) but what pieces to use and how they're tied together is up to their customers/licensees.

Over time Arm has provided more and more building blocks, including reference designs, and have increasingly encouraged more standardization in how things are put together (The Cortex-M's CMSIS framework is a good example, along with the SBSA stuff for the ARMv8 servers) but it's still ultimately up to the licensee to put together and support an appropriate CSP/BSP. Because the many licensees typically end up differentiating themselves into corners, things tend to fall apart rapidly, resulting in the current less-than-ideal situation.

The strange story of the ARM Meltdown-fix backport

Posted Mar 16, 2018 0:07 UTC (Fri) by hjames (subscriber, #14925) (3 responses)

Perhaps Linux should relax some of the requirements for merging code from Android and the ARM vendors into the main trees so these issues start to disappear. If the code is good enough to be sold on millions of devices, I'm sure it's good enough to be merged, and then any required changes can happen over time.

The strange story of the ARM Meltdown-fix backport

Posted Mar 16, 2018 2:05 UTC (Fri) by viro (subscriber, #7872)

Poe's Law in action?

The strange story of the ARM Meltdown-fix backport

Posted Mar 23, 2018 18:16 UTC (Fri) by raven667 (subscriber, #5198)

> If the code is good enough to be sold on millions of devices, I'm sure it's good enough to be merged

Strangely enough, I can tell right away that this is unlikely to be true. Just because the code works OK in one physical device, regardless of how many copies of that device are manufactured, doesn't make it suitable for inclusion in millions of other, different devices; the code needs to work for a wide range of devices and usage, not just one.

The strange story of the ARM Meltdown-fix backport

Posted Mar 30, 2018 19:47 UTC (Fri) by flussence (guest, #85566)

> If the code is good enough to be sold on millions of devices, I'm sure it's good enough to be merged
Let a million CVEs bloom...


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds