Kernel testing and regressions: an example

[Posted July 26, 2005 by corbet]

Kernel testing, or the lack thereof, is considered to be a significant part of the kernel quality problem. Recent kernels, while quite good in many regards, contain more bugs than they should because people have not gotten around to testing them before the final release. Many regressions are in device drivers, which present special testing problems: drivers can only be tested by people who have the relevant hardware. Core kernel code, however, is hardware independent and should be easier to test. But bugs can slip through in that code as well.

Consider, for example, the realtime rlimits feature, which can be used to enable otherwise unprivileged users to run processes with elevated priority. Andreas Steinmetz recently noticed that this feature does not work in the 2.6.13-rc3 kernel. This would seem to be just the sort of feedback the process needs: a user, testing a feature in a -rc kernel, found a bug and provided a patch to fix it. As a result, that particular bug will not be present in 2.6.13.

The only problem is that, as confirmed by Ingo Molnar, the bug is a little older than that. In fact, the realtime resource limit feature does not work at all in the stable 2.6.12 kernel, and nobody noticed until now. This is a feature which can be tested by just about anybody, but that work clearly had not been done. Given that nobody appears to be using this feature, Ingo is not confident that the fix can go into a 2.6.12 stable release; this one will have to wait for 2.6.13.

It should be said that testing realtime resource limits is not an entirely straightforward operation; setting that limit requires changes to the PAM library, C library, and the shells as well. Very few distributions - and no major ones - are shipping those changes at this time. Even so, unprivileged realtime scheduling is a feature that a number of people had been asking for. It is a little surprising that none of those people noticed that it failed to work in a major kernel release. Getting comprehensive testing coverage for the kernel is clearly still a problem - even before drivers are taken into account.
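For readers who want to try this themselves, here is a minimal sketch of such a test. It assumes a Linux system with Python 3, where the standard `resource` and `os` modules expose `RLIMIT_RTPRIO` and `sched_setscheduler()`. The helper names (`rtprio_limit`, `limit_permits`, `try_realtime`) are invented for this illustration, and the rule encoded in `limit_permits` reflects the documented intent of the feature rather than the behavior of any particular kernel.

```python
import os
import resource


def rtprio_limit():
    """Return the (soft, hard) RLIMIT_RTPRIO pair for this process."""
    return resource.getrlimit(resource.RLIMIT_RTPRIO)


def limit_permits(priority, soft_limit):
    """Intended rule (an assumption of this sketch): an unprivileged
    process may request realtime priorities up to its soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return 0 < priority <= soft_limit


def try_realtime(priority, policy=os.SCHED_FIFO):
    """Try to switch this process to a realtime policy.

    Returns True if the kernel accepted the request, False if it
    refused. EPERM is the expected refusal; any other OSError (for
    example from a sandbox that blocks the call) is also treated as
    a refusal.
    """
    old_policy = os.sched_getscheduler(0)
    old_param = os.sched_getparam(0)
    try:
        os.sched_setscheduler(0, policy, os.sched_param(priority))
        # Put the original scheduling settings back before returning.
        os.sched_setscheduler(0, old_policy, old_param)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    soft, hard = rtprio_limit()
    wanted = 10  # arbitrary realtime priority chosen for the test
    print("RLIMIT_RTPRIO soft=%d hard=%d" % (soft, hard))
    print("limit should permit priority %d: %s" % (wanted, limit_permits(wanted, soft)))
    print("kernel actually permitted it:   %s" % try_realtime(wanted))
```

Run as an ordinary user with a nonzero realtime limit, the last two lines should agree; a mismatch suggests that either the limit never reached the process or the kernel is not honoring it. Run as root, the request may succeed regardless of the limit, so that case says nothing about the rlimit code.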


Posted Jul 28, 2005 23:11 UTC (Thu) by rlrevell (guest, #23596)

This isn't the best example; the only reason this did not get noticed is that everyone who needs this feature is still using the rejected-from-mainline realtime LSM while waiting for the distros to catch up.

Posted Jul 28, 2005 23:54 UTC (Thu) by corbet (editor, #1)

I have to say that's kind of the point: by the time you get it from your distributor, it's a bit late to be testing it. Part of the process of making the kernel (or any other project) better is to test things before they get set into a stable release, and that is especially true for new features.

Posted Aug 3, 2005 14:51 UTC (Wed) by vlima (guest, #4405)

Updated pam and glibc that know about RLIMIT_RTPRIO and RLIMIT_NICE are available in Fedora's rawhide. (And a hacked pam for Fedora Core 4 from Planet CCRMA.)

I still don't get Ingo Molnar's statement that
"... RLIMIT_RTPRIO is completely non-functional in 2.6.12", though.

If it were completely non-functional, should this work?

$ chrt -r 20 bash
$ ps -eo rtprio,comm | grep bash
     - bash
    20 bash

Posted Jul 29, 2005 1:06 UTC (Fri) by eaversa (guest, #4929)

gee, no wonder my application didn't benefit. i went to all that trouble (i downloaded the enhanced PAM, i got special programs to set the RT, i even got jonathan corbet to tell me whether i was doing it right) and i STILL saw latency. and i thought it was my application. so maybe it wasn't TCP after all that caused the delay...

Posted Jul 31, 2005 0:36 UTC (Sun) by rlrevell (guest, #23596)

Well it's simple to test. If the latency goes away when you run your app as root, it's a problem with the RT rlimits. If you still see the latency running as root, the problem is somewhere else.
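A rough way to make that comparison concrete is sketched below. It is a stand-in, not a measurement of any real application: the hypothetical helper `sleep_overshoot` simply records how much longer each short sleep takes than requested. One could launch it under a realtime policy (for example with chrt) once as an ordinary user and once as root, then compare the worst case.

```python
import time


def sleep_overshoot(interval=0.001, samples=200):
    """Return, for each sample, how many seconds longer the sleep took
    than the requested interval. This is only a crude proxy for the
    wakeup latency a real application would see."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval)
        overshoots.append(time.perf_counter() - start - interval)
    return overshoots


if __name__ == "__main__":
    results = sleep_overshoot()
    print("worst overshoot:   %.3f ms" % (max(results) * 1000))
    print("average overshoot: %.3f ms" % (sum(results) / len(results) * 1000))
```

If the numbers look the same in both runs, the delay is probably coming from somewhere other than the scheduling priority.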

Posted Jul 29, 2005 10:57 UTC (Fri) by dhj (guest, #4655)

We did actually try rlimits for 2.6.12 while preparing a kernel package
for the 64 Studio distribution. We couldn't get it to work, so we went
with realtime-lsm instead. We didn't consider the possibility that the
rlimits code was broken; we just assumed it was our fault that it didn't
work. I guess next time we should be more vocal...

Posted Aug 2, 2005 15:05 UTC (Tue) by tialaramex (subscriber, #21167)

In any application, the developers are expected to test code before they check it in. Otherwise, why restrict who can check in, if you're not even sure the code will compile? I confess to having checked in blind when it was a one-line fix (e.g., dig back far enough in GIMP or GTK+), but for whole features?

But the Linux kernel doesn't work this way. For another example, look at my Source Specific Multicast bugs. The kernel had supposedly supported this feature for a while, in both IPv4 and IPv6 (they had the same bug, the incorrect logic was copied, still untested, when the IPv6 code was added) but when I tried to use it I found that it was broken, and worse, that it was broken in a way which left an easy kernel denial of service attack.

The person who wrote that code can't have tested it (any test code they tried would fail; mine certainly did). They may not even have tried to compile it. But it looked superficially OK and it didn't offend Linus, so it went straight into the kernel. In any other project that would be a serious procedural failure and heads would roll; in the Linux kernel it's business as usual. That's got to stop, and if Linus won't stop it, maybe the vendors, through people like davej, will have to.

Posted Aug 4, 2005 18:07 UTC (Thu) by lenz (guest, #31538)

It seems to me there is a simple solution:
Before a stable kernel is released, someone needs to verify that each patch performs its intended function. The kernel isn't released until each of the patch tests has been signed off. If no one cares to test a patch, it is pulled out before release. This would motivate those who care about a particular patch to test it.

This doesn't guarantee there won't be regressions but it does let you know that someone is at least looking at the specific changes.