By Jake Edge
August 6, 2008
Adding new functionality to the kernel while maintaining the interfaces for
user space is the standard kernel development practice. Sometimes, though,
that can tickle bugs in user-space programs in unpleasant ways. When that
happens, it is clearly a regression—something that worked before no
longer does—but is it a kernel regression? In the end, it doesn't
matter, it seems, because the kernel needs to change to keep the user-space
program working, even at the expense of "ugliness".
Clearly, for purely internal kernel functionality, there is no
mandate for compatibility across kernel versions. But, when the user-space
interface is involved, things get a bit trickier. A change that
alters the way a documented interface works is essentially never done;
user-space interfaces are maintained forever.
When new functionality properly uses a documented interface, but breaks a
user-space program, it gets murkier.
That situation came up recently when Andrew Morton noticed that the linux-next tree broke the X
server on his laptop. The cause was quickly diagnosed as a bug in
the Synaptics touchpad driver for X. The length passed to an
ioctl() along with an array was given as the number of bits, rather than
bytes, that the array could hold. Thus the maximum buffer length passed was
off by a factor of eight.
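The mismatch can be illustrated with a small, self-contained sketch. It does not reproduce the synaptics driver source; every name prefixed with `demo_` or `DEMO_` is hypothetical. The 64-byte buffer and the advertised length of 511 are the figures that come up later in the discussion, while the 96-byte "newer kernel" bitmap is purely an assumed value for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical highest bit number; 511 is the figure quoted in the discussion. */
#define DEMO_KEY_MAX 511

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Number of unsigned longs needed to hold `bits` bits. */
#define DEMO_NLONGS(bits) ((((bits) - 1) / DEMO_BITS_PER_LONG) + 1)

/* Size in bytes of a bitmap that user space declares for DEMO_KEY_MAX bits. */
static size_t demo_buffer_bytes(void)
{
    unsigned long keybits[DEMO_NLONGS(DEMO_KEY_MAX)];

    /* Sanity check: the bitmap really can hold every bit. */
    assert(sizeof(keybits) * CHAR_BIT >= DEMO_KEY_MAX);
    return sizeof(keybits);
}

/*
 * Bytes a kernel would copy into the caller's buffer: never more than the
 * caller advertised, and never more than the kernel has to report.
 */
static size_t demo_bytes_copied(size_t advertised_len, size_t kernel_has)
{
    return advertised_len < kernel_has ? advertised_len : kernel_has;
}
```

With these helpers, `demo_bytes_copied(DEMO_KEY_MAX, 64)` stays within the 64-byte buffer, which is how an older kernel could mask the bug. `demo_bytes_copied(DEMO_KEY_MAX, 96)` does not, because the advertised length of 511 no longer limits the copy to the buffer's real size. Passing `demo_buffer_bytes()` as the length is safe in both cases.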
As a solution, Dmitry Torokhov offered up a patch, not to kernel
code, but to the synaptics X driver. That didn't sit
particularly well with Morton and others, eventually leading to a
pronouncement from Linus Torvalds:
If somebody has the commit that broke user space, that commit will be
_reverted_ unless it's fixed. It's that simple. The rules are: we don't
knowingly break user space.
Torokhov clearly felt that it was the driver, not his changes, that was at
fault, which is entirely understandable because it's true. That doesn't
alter the fact that new kernels would break existing, working
configurations on laptops everywhere. The kernel change just fully used an
existing, documented interface, as Torokhov explained:
It is not like we broke ABI here. The program (synaptics driver) had a
grave bug. Older kernels happened to paper over the bug because they
did not fill the whole buffer that was advertised as available. Now
that we have more data to report the bug bit us.
Declaring an array of 64 bytes, but telling the kernel it can store up to
511 bytes into it is obviously a bug.
But, as Morton points out:
It really really doesn't matter what the causes are or which piece of
code is at fault or anything else like that.
What _does_ matter is that people's stuff will break. Apparently lots
of people's. That's a problem. A _practical_ problem. Can we
pleeeeeeze be practical and find some way of preventing it?
Since the code was in linux-next, it was targeted at the 2.6.28 kernel.
In Torokhov's thinking, this would allow something approaching six months
for distributions to update the synaptics driver. But that is a fundamental
misunderstanding of how and when kernels are upgraded—it is not only
by way of distributions. Introducing a change like this would result in
many messages to linux-kernel from unhappy folks with broken X servers.
Kernel hackers purposely build and run kernels on a wide variety of
hardware and distributions. That includes older distributions that no
longer get updates, so their users would be stuck with the buggy driver, and
thus a non-working X server, essentially
forever. Obviously, they could rebuild the synaptics driver—kernel
hackers have been known to compile things other than kernels—but that
isn't the point.
There are major benefits to also having lots of regular users update their
kernels frequently. Trying to ensure that there won't be any unnecessary
barriers
to doing that can only help. Torvalds describes it this way:
And if we want to encourage people to upgrade their kernel very
aggressively (and we absolutely do!), then that means that we have to also
make sure it doesn't require them upgrading anything else.
Torvalds and Torokhov worked out a fix that preserved the old behavior for
a specific passed-in buffer length, while allowing the new events to be
delivered to any other users of the ioctl() that passed in the
proper length. Torvalds commented:
"Yeah, it's not pretty, but pragmatism before beauty."
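The general shape of such a workaround can be sketched as follows. This is not the code that was merged, only an illustration of special-casing one known-bad length. All names are hypothetical. The values 511 and 64 are the figures from Torokhov's description of the bug; the 96-byte size of the newer bitmap is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the bit count that the buggy driver passed as a byte count. */
#define DEMO_LEGACY_BUGGY_LEN 511
/* Hypothetical: the number of bytes older kernels actually copied out. */
#define DEMO_LEGACY_BYTES 64

static size_t demo_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Decide how many bytes to copy to user space.  A caller that passes the
 * one known-bad length gets the old behavior; every other caller gets as
 * much as it asked for, limited by what the kernel has to report.
 */
static size_t demo_effective_len(size_t requested_len, size_t kernel_bytes)
{
    if (requested_len == DEMO_LEGACY_BUGGY_LEN)
        return demo_min(DEMO_LEGACY_BYTES, kernel_bytes);
    return demo_min(requested_len, kernel_bytes);
}

/* Usage example: a 96-byte kernel bitmap is an assumed value. */
static void demo_self_check(void)
{
    assert(demo_effective_len(DEMO_LEGACY_BUGGY_LEN, 96) == 64); /* old driver */
    assert(demo_effective_len(96, 96) == 96);                    /* fixed caller */
}
```

The trade-off is visible in the sketch: a correct caller that happened to pass exactly the special-cased length would also get the truncated result, which is part of why this kind of fix is "not pretty".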
It is, to some extent, a gray area. Regressions are bad for any number of
reasons, but maintaining hackarounds for buggy user-space programs has its own
set of problems. The hope is that eventually the need for the workaround
goes away so that it can be removed. It would seem difficult to determine
when the last user of the old synaptics driver finally upgrades, so this
code could be with us for a long time. Given the alternative, the
price seems worth it.
Though Torvalds was absolute in condemning any known regression,
even for programs that are clearly misusing an interface, there must be a
line somewhere. If some obscure program, with few users, gets broken by
the kernel doing something documented and reasonable, it is hard to imagine
that this kind of workaround will be required. This particular problem was
relatively easy to decide; the next might not be.