Our systems run a complex mix of software which is the product of many
different development projects. It is inevitable that, occasionally, a
change to one part of the system will cause things to break elsewhere, at
least for some users. How we respond to these incidents has a significant
effect on the perceived quality of the platform as a whole and on its
usability. Two recent events demonstrate two different responses - but
not, necessarily, a clear correct path.
The two events in question are these:
- An optimization applied to glibc changed the implementation of
memcpy(), breaking a number of
programs in the process. In particular, the proprietary Flash
plugin, which, contrary to the specification, uses memcpy()
to copy overlapping regions, is no longer able to play clear audio for
some kinds of media.
- A change in the default protections for
/proc/kallsyms, merged for the 2.6.37 kernel, was found to
cause certain older distributions to fail to boot. The root cause is
apparently a bug in klogd, which does
not properly handle a failure to open the symbol file.
In summary, we have two changes, both of which were intended to improve the
behavior of the system - better performance, in the glibc case, and better
security for /proc/kallsyms. In each case, the change caused
code which was buggy - but which had been working - to break. What came
thereafter differed considerably, though.
In the glibc case, the problem has been experienced by users of
Fedora 14, which is one of the first distributions to ship the
new memcpy() implementation. Given that code using glibc has been
rendered non-working by this change, one might reasonably wonder if the
glibc developers have considered reverting it. As far as your editor can
tell, though, nobody has even asked them; the developers of that project
have built a reputation for a lack of sympathy in such situations. They
would almost certainly answer that the bug is in the users of
memcpy() who, for whatever reason, ignored the longstanding rule
that the source and destination arrays cannot overlap. It is those users
who should be fixed, not the C library.
The Fedora project, too, is in a position to revert the change. The idea
was discussed at length on the fedora-devel mailing list, but the project
has, so far, taken no such action. At this level, there is a clear tension
between those who want to provide the best possible user experience (which
includes a working Flash player) in the short term, and those who feel that
allowing this kind of regression to hold back a performance improvement is
bad for the user experience in the longer term. According to
the latter group, reverting the change would slow things down for working
programs and relieve the pressure on Adobe to fix its bug. It is better, they
say, for affected users to apply a workaround and complain to Adobe. That
view appears to have carried the day.
In the /proc/kallsyms case, the change was reverted; an explicit
choice was made to forgo a potential security improvement to avoid
breaking older distributions. This decision has been somewhat
controversial, both on the kernel mailing list and here on LWN. The affected distribution
(Ubuntu 9.04) is relatively old; its remaining users are unlikely to put
current kernels on it. So a number of voices were heard to say that, in
this case, it is better to have the security improvement than compatibility
with older distributions.
Linus was clear about his policy, though:
The rule is not "we don't break non-buggy user space" or "we don't
break reasonable user-space". The rule is simply "we don't break
user-space". Even if the breakage is totally incidental, that
doesn't help the _user_. It's still breakage.
The kernel's record with regard to this rule is, needless to say, not
perfect, but that record as a whole is quite good; that has served the
kernel well. It is usually possible
to run current kernels on very old distributions, allowing users to gain
new hardware support and features, or simply to help with testing. It
forms a sort of contract with the kernel's users which gives them some
assurance that new releases will not cause their systems to break. And,
importantly, it helps the kernel developers to keep overall kernel quality
high; if you do not allow once-working things to break, you can be at least
somewhat sure that the quality of the kernel is not declining over time.
Once you start allowing some cases to break, you can never be sure.
There is probably little chance of a kernel-style "no regressions" rule
being universally adopted. Even for the kernel, the interface to the
rest of the system is relatively narrow; the system as a whole has a much
larger range of things that can break. It is a challenge to keep new
kernel releases from causing problems with existing applications; for a
full distribution, it's perhaps an insurmountable challenge. That is part
of why companies pay a lot of money for distributions which almost never
make new releases.
Some kinds of regressions are also seen as being tolerable, if not actively
desirable. There has never been any real sympathy for broken proprietary
graphics drivers, for example. The proprietary nature of the Flash plugin
will not have helped in this case either; it is irritating to know exactly
how to fix a problem, but to be unable to actually apply that fix. Any
free program affected by this bug would, if anybody cared about it at all,
have been fixed long ago. Flash users, meanwhile, are still waiting for
Adobe to change a memcpy() call to memmove(). One could
certainly argue that holding Adobe responsible for its bug - and, at the
same time, demonstrating the problems that come with proprietary programs -
is the right thing to do.
On the other hand, one could argue that breaking Flash is a good way to
demonstrate to users that they should be using a different distribution -
or another operating system entirely. Your editor would suggest that
perfection with regard to regressions is not achievable, but it still
behooves us to try for it when we can. There is a lot to be said for
creating a sense of confidence that software updates are a safe thing to
apply. It will make it easier to run newer, better software, inspire users to
test new code, and, maybe, even bring some vendors closer to upstream. We
should make a point of keeping things from breaking, even when the bugs are
not our fault.