
Testing kernels

By Jake Edge
September 19, 2017

Linux Plumbers Conference

New kernels are released regularly, but it is not entirely clear how much in-depth testing they are actually getting. Even the mainline kernel may not be getting enough of the right kind of testing. That was the topic for a "birds of a feather" (BoF) meeting at this year's Linux Plumbers Conference (LPC) held in mid-September in Los Angeles, CA. Dhaval Giani and Sasha Levin organized the BoF as a prelude to the Testing and Fuzzing microconference they were leading the next day.

There were representatives from most of the major Linux distributors present in the room. Giani started things off by asking how much testing is being done on the stable kernels by distributors. Are they simply testing their own kernels and the backports of security and other fixes that come from the stable kernels? Beyond the semi-joking suggestion that testing is left to users, most present thought that there was little or no testing (beyond simple build-and-boot testing) of the stable kernels.

[Sasha Levin]

Part of the problem is that it is difficult to know what to choose in order to test "the upstream kernel". The linux-next tree is a moving target, as are the stable trees (and the mainline itself). But there is value in finding bugs before they make it into a release, so some distributors are starting to test the -rc1 kernels. That way, bugs that are found can be fixed before the release, but it takes a lot of hours and machines to do that well. There is also the question of which kernel configurations to test.

The upstream testing that is done for the mainline and stable kernels is fairly limited. There is a lot of it being done, but it doesn't go all that deep into kernel functionality. It takes Red Hat a year to stabilize the kernel chosen for a RHEL release; roughly 300 engineers work on that task, meaning it takes the company 300 person-years to test and harden a kernel.

Boot testing is well covered by various upstream testing efforts, so newly released kernels will at least boot. The majority of bugs found in those (or any other) kernels are in the drivers; the only people testing the drivers are those who have the hardware. The core itself is believed to be pretty safe, and things like Ftrace, the scheduler, and memory management are used widely, so they get a fair amount of testing. Other, non-core or less popular functionality may not see much functional testing.

Red Hat has a large testing lab with something like 6000 machines of various sorts all over the world. It uses Beaker and tests lots of different kernel configurations. It currently runs tests on three RHEL kernels and two Fedora kernels, though there are plans to add the mainline releases. Different teams focus on drivers specific to their area of interest, so the storage teams test various storage devices, while the RDMA team tests that type of hardware.

The main problem is that it takes a lot of effort to analyze the bugs found with the tests. Any crashes that happen could simply be posted to the kernel mailing list as regressions, as was suggested, but even that takes some amount of triage and requires reducing the code to a reproducible test case.

Others who might want to test the drivers may be stymied by the lack of availability (or the cost) of the hardware. It also requires a good understanding of exactly what the device is supposed to do. Ideally, the driver writers would be testing the devices—generally they do—but even that is not a complete solution. Driver writers try to make sure the driver works for their use cases, but they have differing motivations depending on whether they are being paid to write it or simply doing so to support hardware they have, often with little or no documentation.

There are also performance regressions that need to be found in new kernels. That is a difficult problem to solve since "random performance testing" does not really help. There is a need to put together some guidelines for performance tests, so that an apples-to-apples comparison can be done.

The biggest problem for all of these testing efforts is resources. More people and more machines are needed in order to find and fix the bugs sooner.

Stable

The conversation then turned toward the stable kernels. There is a need to stop bugs from entering the kernel; if the mainline were perfect, there would be no need for the stable trees at all. Perfection is not possible, of course, but do the distributions even use the stable trees?

It turns out that Red Hat only cherry-picks fixes from the stable trees. Each minor release of RHEL has 8,000 to 10,000 patches on top of its kernel, all of which have come from upstream. The RHEL kernel team looks at the stable trees and the latest mainline kernel to find fixes that should be applied. The amount of testing done varies based on which subsystem the patch applies to; some subsystems have a good track record of providing working patches, others less so. Generally, Red Hat only builds the stable kernels to test them against the RHEL kernel to see if a bug came from upstream or was introduced in RHEL.

SUSE does build the stable kernels, but also cherry-picks patches for inclusion. Stable kernel testing could be added to its testing grid, but it is not clear what value that would have for the company. Ubuntu is similar; other than building the stable kernels, there is no formal testing of them.

So the distributions generally care that the fixes that go into stable are correct, but they are testing them in their own kernels. It was suggested that the Linux Foundation, in cooperation with Canonical, SUSE, Red Hat, and others, could organize a collaborative project that would provide a set of machines and a test suite to do testing for the stable series.

Linaro is currently working on a project for Google to test the stable kernels using the kernel self-tests (kselftest) and tests from the Linux Test Project (LTP). Those tests are run for every stable release. The self-tests do find bugs, but those who are writing self-tests are probably not the ones introducing the majority of the bugs. The self-tests are just the starting point, however; Linaro intends to add more tests.
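
For readers who have not run the self-tests, here is a minimal sketch of what such a harness might look like, assuming a local checkout of a stable kernel tree and the standard in-tree kselftest make targets; the paths and the target list are purely illustrative:

    #!/usr/bin/env python3
    """Sketch: run a subset of the in-tree kernel self-tests against a
    checked-out stable tree and archive the output per release. Assumes the
    standard "make -C tools/testing/selftests ... run_tests" interface; the
    tree location and target list are illustrative."""

    import subprocess
    import sys

    KERNEL_TREE = "/home/tester/linux-stable"   # hypothetical checkout
    TARGETS = ["timers", "net", "memfd"]        # illustrative subset

    def run_selftests(tree, targets):
        cmd = [
            "make", "-C", tree + "/tools/testing/selftests",
            "TARGETS=" + " ".join(targets),
            "run_tests",
        ]
        # Capture the TAP-style output so it can be kept alongside the
        # stable release it was run against.
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.returncode, result.stdout + result.stderr

    if __name__ == "__main__":
        rc, log = run_selftests(KERNEL_TREE, TARGETS)
        sys.stdout.write(log)
        sys.exit(rc)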

One of the difficulties is the huge number of configurations. When a stable kernel is released, it might have 100 patches, but any test suite may not exercise many, or even any, of those fixes. There is a real question of what it means to test a stable release.

The 0-Day kernel test service is also doing more than just build-and-boot tests, including performance testing. The kernelci.org project is doing build-and-boot testing on lots of different hardware, which is quite valuable, but it doesn't do any real functional testing. Things are certainly getting better, overall, and what is there now is "surprisingly better than nothing", one attendee joked.

The self-tests are typically written by kernel developers, but it takes time and effort to turn personal tests into something that can be used more widely. Drivers generally do not have self-tests because the driver writer didn't have time to add them; in many cases, that lack of time shows in the quality of the driver code as well. So the existing self-tests are likely to be in subsystems that are already well tested, but they have found bugs on architectures that are different from those the developers normally run. Various ARM bugs have been found that way.

LTP tests many things, but there is also plenty that it does not test. It is used by some distributions and has definitely found bugs, but there is a need for more (and better) test suites.

Benchmarks and fuzzing

Beyond that, there is also a need for more benchmarking to detect performance regressions. Mel Gorman's MMTests were mentioned as something that could be used as the basis of a "kbench" benchmarking suite. Some in the room seemed unfamiliar with that test suite, which helped point out the need for better documentation. A file listing test suites in the kernel documentation directory might be a start, but any benchmark is going to need in-depth documentation.

There was some thought that it would be nice to have a benchmark that boiled down to a single number that could be compared between systems (like the idea behind BogoMips). There was also a fair amount of skepticism about how possible that might turn out to be, however.

Fuzzing for stable kernels was also discussed. Fuzzing the upstream kernels is the best option, since fixes must be made there, but fuzzing can also find problems in backports for distribution kernels. It turns out that the syzkaller fuzzer generates small test cases to reproduce the problems that it finds; it was agreed that those should be added to the self-tests. Some of the bugs only manifest themselves under the Kernel Address Sanitizer (KASAN), but those tests can simply be skipped as "unsupported" on kernels that are not configured for KASAN.
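
As an illustration of the "skip as unsupported" idea, here is a minimal sketch of a wrapper that checks the running kernel's configuration before invoking a syzkaller-derived reproducer. The config-file locations, the reproducer name, and the skip exit code (the kselftest convention of 4) are assumptions about the test environment, not details from the discussion:

    #!/usr/bin/env python3
    """Sketch: skip a KASAN-dependent reproducer on kernels that were not
    built with CONFIG_KASAN, instead of reporting a failure."""

    import gzip
    import os
    import subprocess
    import sys

    KSFT_SKIP = 4                    # kselftest convention for a skipped test
    REPRODUCER = "./kasan_repro"     # hypothetical compiled syzkaller reproducer

    def kernel_config():
        # Prefer /proc/config.gz if the kernel exposes it; fall back to the
        # distribution's config file for the running kernel.
        if os.path.exists("/proc/config.gz"):
            with gzip.open("/proc/config.gz", "rt") as f:
                return f.read()
        with open("/boot/config-" + os.uname().release) as f:
            return f.read()

    if __name__ == "__main__":
        if "CONFIG_KASAN=y" not in kernel_config():
            print("KASAN not enabled in this kernel, skipping")
            sys.exit(KSFT_SKIP)
        # Run the reproducer; a KASAN splat would show up in the kernel log.
        sys.exit(subprocess.run([REPRODUCER]).returncode)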

More and more self-tests are being added to the mainline, but the stable kernels don't benefit from those. Some are running the latest self-tests with older kernels, but there was some thought that perhaps the self-tests themselves should be backported into the stable trees.

As the BoF wound down, Levin asked that distribution maintainers push the patches they are using to the stable trees. It is not uncommon to find a fix in a distribution kernel that should be in stable as well. He has been working on training a neural net to recognize stable-eligible patches, which elicited some laughter, but he said it is actually "going surprisingly well".

One person who was not at the BoF, but has a vested interest in what was being discussed, is stable maintainer Greg Kroah-Hartman. He got a chance to offer some of his opinions in the microconference, which opened with a short session where Levin replayed what was discussed in the BoF.

As Levin said, a number of problems were raised in the BoF without much, if any, resolution. Someone spoke up to suggest that more hardware be given to the kernelci.org project, but Kroah-Hartman would also like to see more functional testing. It may make sense for the Linaro and kernelci.org efforts to join forces, though, someone said.

Kroah-Hartman has no objection to the idea of backporting self-tests as long as they will run on the kernel in question. He agreed that it would be nice for distributions to be diligent about getting their fixes into the stable trees, but noted that Fedora and Debian are already doing a good job in that area. Distributions often try to get a fix to their users quickly, then do the work to get it fixed upstream, another participant said. Kroah-Hartman said that he will often leave a bug in stable if it is not fixed in the upstream kernel, both to be "bug compatible" and to provide some pressure for it to get fixed.

It is clear that more kernel testing could be done, but it is less clear exactly what form it should take or who will actually do it. With luck, some progress on that will be made in the near future, which is likely to lead to more bugs found sooner. Perfection is impossible, of course, but an overall reduction in kernel bugs is something we can all hope for.

[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Los Angeles for LPC.]




Is there a single performance index that's useful for engineers?

Posted Sep 19, 2017 10:54 UTC (Tue) by k3ninho (subscriber, #50375) [Link] (5 responses)

> There was some thought that it would be nice to have a benchmark that boiled down to a single number that could be compared between systems (like the idea behind BogoMips). There was also a fair amount of skepticism about how possible that might turn out to be, however.

I'd like to know how this was pitched, and how much expertise the people stating it had in measuring and evaluating performance.

If the answer is 'little experience', then we can apply Dunning-Kruger, note that 'how hard can it be?' always gets the answer 'much harder than you can imagine', and ignore it.

If it's something meaningful from people who want a number, I'd love to learn how you can put a single number on performance that doesn't need further qualification: imagine you have a suite of tests and aggregate the scoring, but have two systems which aggregate to the same final tally, one with stunning performance in a sole category of work and terrible scores elsewhere versus one with middling scores across the board. Just like there's an apples-to-oranges comparison hidden by that final tally, there are engineers who want different sorts of performance from their hardware -- one will want a high score on that particular sole category of work in her datacentre, while another will want good middle-tier performers.

Alternatively, consider the situation where you need to feed in fine-grained information to the scheduler about cache topology, time to refill caches and number of CPU clock cycles and instructions not completed when a cache is missed, plus what the balance of moving data to transforming data is for the work at hand. The model of a CPU's layout, and the board it's running in need to play a part in quantifying a system's performance, along with the information about the profile of the task of the test.

These things suggest that a single index for performance will need to be broken down into components by anyone needing to use it for serious performance assessment; anyone wanting bragging rights will be happy with a single score.

K3n.

Is there a single performance index that's useful for engineers?

Posted Sep 19, 2017 19:56 UTC (Tue) by james (subscriber, #1325) [Link] (1 responses)

Given the context (detecting performance regressions) they might not actually be concerned about comparing different systems. They just want to know if the number goes down on any one system.

Is there a single performance index that's useful for engineers?

Posted Sep 21, 2017 9:14 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

> Given the context (detecting performance regressions) they might not actually be concerned about comparing different systems.

I'd read it as being a new paragraph which started a wholly new idea in a section about 'Benchmarks and Fuzzing', losing the context of the previous paragraph's words about preventing regressions and being followed by a sentence that could apply to either regression-prevention or to an abstract kernel performance number. Thanks for making this more clear.

K3n.

Is there a single performance index that's useful for engineers?

Posted Sep 20, 2017 3:08 UTC (Wed) by nevets (subscriber, #11875) [Link] (2 responses)

Being the one that actually suggested this, I'll explain the thought behind it.

This was not about seeing what change has the bigger stick, but more of a focus on regressions. For example, I use hackbench to see how my changes affect the system. hackbench is really a hack. I don't take it too seriously, but when I screw something up, it tends to show that I screwed something up quite quickly. The idea is to have some kind of benchmark that is not for true comparisons between competing features, but more to see if some code was added that caused a major performance regression.

The kernel selftests are there to test if your code breaks something in the kernel, but we have nothing to show that we caused a performance regression. Reading benchmark reports, as you state, takes skill. The idea is to have many people running two kernels on the exact same setup and comparing the numbers to see if they changed drastically or not. If a number only changed within an acceptable standard deviation, then there should be nothing to worry about. But if you see a large spike in latency, or in the time to complete a micro-benchmark, then perhaps it should be reported and analyzed further to see if there is indeed a problem.

This is why I compared it to BogoMips. Those are truly bogus, but I did get better numbers when running on better hardware. And there were times that I screwed up something, and those BogoMips showed that there was a screw up somewhere.

I do agree with your worry. There will be those that take a single number and complain "Hey, this change made this number go up", with no clue that it made a huge difference someplace else that counters it. The main point is to find issues. A complaint like this may be annoying, but can be blown off with an explanation of why it happened. But it would be nice to have something to look at, see that it changed drastically, and perhaps have it point out that a change had an effect someplace you did not intend it to.
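
To make that comparison concrete, here is a minimal sketch of the kind of check being described in this comment: flag a result only when the shift between two kernels, run on the same machine with the same setup, is well outside the run-to-run noise. The sample timings and the three-sigma threshold are illustrative choices, not something prescribed in the discussion:

    #!/usr/bin/env python3
    """Sketch: compare repeated benchmark timings from two kernels and warn
    only when the change is far outside normal run-to-run variation."""

    import statistics

    def compare(baseline, candidate, sigmas=3.0):
        """Return a warning if the candidate mean is more than `sigmas`
        baseline standard deviations away from the baseline mean."""
        base_mean = statistics.mean(baseline)
        base_dev = statistics.stdev(baseline)
        cand_mean = statistics.mean(candidate)
        delta = cand_mean - base_mean
        if abs(delta) > sigmas * base_dev:
            return (f"possible regression: {base_mean:.3f}s -> {cand_mean:.3f}s "
                    f"({delta / base_mean:+.1%})")
        return None

    # Example: hackbench wall-clock times (seconds) from repeated runs on the
    # same hardware, first with the old kernel and then with the new one.
    old_kernel = [1.92, 1.95, 1.90, 1.93, 1.94]
    new_kernel = [2.41, 2.38, 2.44, 2.40, 2.39]

    print(compare(old_kernel, new_kernel) or "within normal run-to-run variation")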

Is there a single performance index that's useful for engineers?

Posted Sep 21, 2017 8:10 UTC (Thu) by ColinIanKing (guest, #57499) [Link]

One tool that I use is stress-ng; it can exercise various components of the kernel and also has a bogo-ops throughput metric that can be helpful for detecting performance regressions. See http://kernel.ubuntu.com/~cking/stress-ng/
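
As a rough illustration of that workflow, the sketch below runs stress-ng and records its end-of-run output so that runs on different kernels (same machine, same options) can be compared later. The particular stressors, run length, and option names are assumptions that should be checked against the stress-ng documentation for the installed version:

    #!/usr/bin/env python3
    """Sketch: run a fixed stress-ng workload and save its output, tagged
    with the kernel version, for later comparison between kernels."""

    import os
    import subprocess

    # Exercise a few kernel-facing subsystems and ask for a per-stressor
    # bogo-ops summary at the end of the run.
    cmd = [
        "stress-ng",
        "--cpu", "4",
        "--vm", "2",
        "--hdd", "1",
        "--timeout", "60s",
        "--metrics-brief",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    with open("stress-ng-" + os.uname().release + ".log", "w") as log:
        # Keep both output streams so the metrics summary is captured
        # regardless of where this stress-ng version writes it.
        log.write(result.stdout + result.stderr)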

Is there a single performance index that's useful for engineers?

Posted Sep 21, 2017 9:43 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

Thank you for that explanation.

K3n.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds