LWN.net

Can user-space bugs be kernel regressions?

By Jake Edge
August 6, 2008

Adding new functionality to the kernel while maintaining the interfaces for user space is the standard kernel development practice. Sometimes, though, that can tickle bugs in user-space programs in unpleasant ways. When that happens, it is clearly a regression—something that worked before no longer does—but is it a kernel regression? In the end, it doesn't matter, it seems, because the kernel needs to change to keep the user-space program working, even at the expense of "ugliness".

Clearly for purely internal kernel functionality, there is no mandate for compatibility across kernel versions. But, when the user-space interface is involved, things get a bit trickier. A change that alters the way a documented interface works is essentially never done; user-space interfaces are maintained forever. When new functionality properly uses a documented interface, but breaks a user-space program, it gets murkier.

That situation came up recently when Andrew Morton noticed that the linux-next tree broke the X server on his laptop. The breakage was quickly traced to a bug in the Synaptics touchpad driver for X. The length it passed to an ioctl() for an array was computed as the number of bits, rather than bytes, that the array could hold. Thus the maximum buffer length passed was too large by a factor of eight.
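A minimal sketch of this class of mistake, in C, may help. None of the names below (`BITMAP_BITS`, `bytes_for_bits()`, `buggy_claimed_len()`, `correct_claimed_len()`) come from the synaptics driver or the kernel; they are hypothetical, and the 512-bit bitmap is an assumption chosen only to show the factor-of-eight gap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical bitmap size, in bits, chosen for illustration only. */
#define BITMAP_BITS 512

static_assert(BITMAP_BITS % CHAR_BIT == 0, "bitmap must be a whole number of bytes");

/* Bytes needed to hold `bits` bits, rounded up to a whole byte. */
size_t bytes_for_bits(size_t bits)
{
    return (bits + CHAR_BIT - 1) / CHAR_BIT;
}

/* The buffer itself is sized sensibly: one bit per flag. */
unsigned char bitmap[BITMAP_BITS / CHAR_BIT];

/* Buggy: reports the number of bits as if it were a length in bytes. */
size_t buggy_claimed_len(void)
{
    return BITMAP_BITS;
}

/* Correct: reports the byte length of the buffer actually declared. */
size_t correct_claimed_len(void)
{
    return sizeof(bitmap);
}
```

With these assumed numbers, the buggy caller advertises `CHAR_BIT` times more room than the array really has, which is harmless only for as long as the other side never tries to use it.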

As a solution, Dmitry Torokhov offered up a patch, not to kernel code, but to the synaptics X driver. That didn't sit particularly well with Morton and others, eventually leading to a pronouncement from Linus Torvalds:

If somebody has the commit that broke user space, that commit will be _reverted_ unless it's fixed. It's that simple. The rules are: we don't knowingly break user space.

Torokhov clearly felt that it was the driver, not his changes, that was at fault, which is entirely understandable because it's true. That doesn't alter the fact that new kernels would break existing, working configurations on laptops everywhere. The kernel change just fully used an existing, documented interface, as Torokhov explained:

It is not like we broke ABI here. The program (synaptics driver) had a grave bug. Older kernels happened to paper over the bug because they did not fill the whole buffer that was advertised as available. Now that we have more data to report the bug bit us.
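Torokhov's description can be modeled without touching a real device. In the sketch below, `report_bits()` is a hypothetical stand-in for the kernel side of the call: it copies no more than the length the caller claims. The 64-byte and 511-byte figures are the ones the article cites for this bug; the function names are invented for illustration, and the backing store is made large enough that the overrun is only counted, never actually performed on memory the program does not own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DECLARED_LEN 64   /* bytes the caller really declared (figure from the article) */
#define CLAIMED_LEN  511  /* bytes the caller told the kernel it had (figure from the article) */

/* Hypothetical stand-in for the kernel side: copy no more than the caller claims. */
static size_t report_bits(unsigned char *dst, size_t claimed_len,
                          const unsigned char *src, size_t src_len)
{
    size_t n = src_len < claimed_len ? src_len : claimed_len;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
    return n;
}

/*
 * Returns how many bytes would land past the end of the declared array
 * when the "kernel" has kernel_data_len bytes to report.
 * The backing store is deliberately CLAIMED_LEN bytes long so that the
 * overrun stays inside memory this function owns.
 */
size_t bytes_past_end(size_t kernel_data_len)
{
    unsigned char backing[CLAIMED_LEN] = {0};
    unsigned char data[CLAIMED_LEN];
    size_t written;

    if (kernel_data_len > sizeof(data))
        kernel_data_len = sizeof(data);
    memset(data, 0xff, sizeof(data));

    written = report_bits(backing, CLAIMED_LEN, data, kernel_data_len);
    return written > DECLARED_LEN ? written - DECLARED_LEN : 0;
}
```

As long as the reported data fits in 64 bytes, `bytes_past_end()` returns 0 and the bogus length goes unnoticed; once there is more to report, every extra byte is written beyond the array.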

Declaring an array of 64 bytes, but telling the kernel it can store up to 511 bytes into it is obviously a bug. But, as Morton points out:

It really really doesn't matter what the causes are or which piece of code is at fault or anything else like that.

What _does_ matter is that people's stuff will break. Apparently lots of people's. That's a problem. A _practical_ problem. Can we pleeeeeeze be practical and find some way of preventing it?

Since the code was in linux-next, it was targeted at the 2.6.28 kernel. In Torokhov's thinking, this would allow something approaching six months for distributions to update the synaptics driver. But that is a fundamental misunderstanding of how and when kernels are upgraded—it is not only by way of distributions. Introducing a change like this would result in many messages to linux-kernel from unhappy folks with broken X servers.

Kernel hackers purposely build and run kernels on a wide variety of hardware and distributions. That includes older distributions that no longer get updates, so they would be stuck with the buggy driver, and thus a non-working X server, essentially forever. Obviously, they could rebuild the synaptics driver (kernel hackers have been known to compile things other than kernels), but that isn't the point.

There are major benefits to also having lots of regular users update their kernels frequently. Trying to ensure that there won't be any unnecessary barriers to doing that can only help. Torvalds describes it this way:

And if we want to encourage people to upgrade their kernel very aggressively (and we absolutely do!), then that means that we have to also make sure it doesn't require them upgrading anything else.

Torvalds and Torokhov worked out a fix that preserved the old behavior for a specific passed-in buffer length, while allowing the new events to be delivered to any other users of the ioctl() that passed in the proper length. Torvalds commented: "Yeah, it's not pretty, but pragmatism before beauty."
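The shape of such a workaround might look roughly like the following C sketch. This is not the actual kernel patch: `bytes_to_copy()`, `LEGACY_BOGUS_LEN`, and `LEGACY_DATA_LEN` are hypothetical names, and the assumption that the legacy caller should receive at most 64 bytes is taken from the array size quoted in the article rather than from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* The length the buggy caller is known to pass (figure from the article). */
#define LEGACY_BOGUS_LEN 511
/* Assumed cap for that caller: the size of the array it really declared. */
#define LEGACY_DATA_LEN  64

static_assert(LEGACY_DATA_LEN <= LEGACY_BOGUS_LEN, "cap must not exceed the claimed length");

/*
 * Decide how many bytes to hand back to user space.
 * A caller passing exactly the known-bad length gets the old, smaller
 * amount; every other caller gets as much as it asked for and as the
 * kernel has available.
 */
size_t bytes_to_copy(size_t claimed_len, size_t available)
{
    size_t limit = claimed_len;

    if (claimed_len == LEGACY_BOGUS_LEN)
        limit = LEGACY_DATA_LEN;

    return available < limit ? available : limit;
}
```

The special case keys on one exact length, so a program that passes any other size, including a correct one larger than 64 bytes, still sees the new data. The cost is a magic number that has to stay in the code for as long as the old driver might still be in use.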

It is, to some extent, a gray area. Regressions are bad for any number of reasons, but maintaining hackarounds for buggy user-space programs has its own set of problems. The hope is that eventually the need for the workaround goes away so that it can be removed. It would seem difficult to determine when the last user of the old synaptics driver finally upgrades, so this code could be with us for a long time. Given the alternative, the price seems worth it.

Though Torvalds was absolute in condemning any known regression, even for programs that are clearly misusing an interface, there must be a line somewhere. If some obscure program, with few users, gets broken by the kernel doing something documented and reasonable, it is hard to imagine that this kind of workaround will be required. This particular problem was relatively easy to decide; the next might not be.



Can user-space bugs be kernel regressions?

Posted Aug 6, 2008 22:08 UTC (Wed) by Felix_the_Mac (guest, #32242) [Link]


What about putting in a new config option that defaults to the old behaviour?

Thereby people building the new kernel on older systems would not hit the bug but
distributions could ship their next release with the fix enabled so that in time the old,
buggy, behaviour could be removed. 

Can user-space bugs be kernel regressions?

Posted Aug 6, 2008 22:59 UTC (Wed) by bboissin (subscriber, #29506) [Link]

Are exploits considered valid applications ? ;)

"Upgrading my kernel makes my local root exploit not work anymore"

Why not do both?

Posted Aug 7, 2008 2:40 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

If the driver has a bug, fix it.
Then patch the kernel.
A hybrid approach, after the Paris Hilton model.

Why not do both?

Posted Aug 7, 2008 9:35 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

There's a third step: mark the kernel patch as Deprecated and schedule its eventual removal
for a point when the driver should be reasonably well-available in the wild.

K3n.

Why not do both?

Posted Aug 7, 2008 13:00 UTC (Thu) by bangert (subscriber, #28342) [Link]

exactly. was the driver fixed?

Why not do both?

Posted Aug 7, 2008 13:54 UTC (Thu) by jcristau (subscriber, #41237) [Link]

Can user-space bugs be kernel regressions?

Posted Aug 7, 2008 16:26 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

Should it not be standard practice that if a user passes an N-byte buffer and the kernel has
less than N bytes to store in it, the rest should nevertheless be written (for example
zeroed)? This would avoid the possibility of latent bugs such as this one biting years later.
Instead, they'd probably bite while the original code was being debugged, and even if not, the
bug would clearly be seen to be a latent memory-corruption bug in userspace and not anything
that's the kernel's fault.

Can user-space bugs be kernel regressions?

Posted Aug 7, 2008 20:30 UTC (Thu) by davecb (subscriber, #1574) [Link]

That might be an interesting thing to propose to the LKML
as a janitorial project. Of course, it would need to
default off for the interfaces we find broken right now (;-))
but over time it would close this ABI hole...

--dave (ex-ABI team guy) c-b

Can user-space bugs be kernel regressions?

Posted Aug 12, 2008 15:26 UTC (Tue) by evgeny (guest, #774) [Link]

Do all other OS'es the driver is used for have the same level (8-fold) of permissiveness?? Or
is Linux the only OS "user" of Xorg today?

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds