On breaking things
The two events in question are these:
- An optimization applied to glibc changed the implementation of memcpy(), breaking a number of programs in the process. In particular, the proprietary Flash plugin, which, contrary to the specification, uses memcpy() to copy overlapping regions, is no longer able to play clear audio for some kinds of media.
- A change in the default protections for /proc/kallsyms, merged for the 2.6.37 kernel, was found to cause certain older distributions to fail to boot. The root cause is apparently a bug in klogd, which does not properly handle a failure to open the symbol file.
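The klogd problem is, at its core, an unhandled open failure. As a rough illustration only (this is not klogd's actual code; the load_symbols() helper and its "run without symbols" policy are assumptions made for the example), a daemon that tolerates an unreadable symbol file might look something like this:

```c
#include <stdio.h>

/*
 * Hypothetical helper, not taken from klogd: try to open a symbol file
 * such as /proc/kallsyms. A NULL return from fopen() (permission denied,
 * for example) is treated as "carry on without symbol translation"
 * rather than as a fatal error.
 *
 * Returns 1 and stores the open stream in *out on success;
 * returns 0 and stores NULL in *out if the file cannot be opened.
 */
int load_symbols(const char *path, FILE **out)
{
    FILE *fp = fopen(path, "r");

    if (fp == NULL) {
        *out = NULL;
        return 0;       /* degraded mode: no symbols available */
    }

    *out = fp;
    return 1;           /* caller reads from *out and later calls fclose() */
}
```

A caller would check the return value and skip symbol lookup when it is zero, so a tightened permission on the file costs some diagnostic detail instead of a working boot.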
In summary, we have two changes, both of which were intended to improve the behavior of the system - better performance, in the glibc case, and better security for /proc/kallsyms. In each case, the change caused other code which was buggy - but which had been working - to break. What came thereafter differed considerably, though.
In the glibc case, the problem has been experienced by users of Fedora 14, which is one of the first distributions to ship the new memcpy() implementation. Given that code using glibc has been rendered non-working by this change, one might reasonably wonder if the glibc developers have considered reverting it. As far as your editor can tell, though, nobody has even asked them; the developers of that project have built a reputation for a lack of sympathy in such situations. They would almost certainly answer that the bug is in the callers of memcpy() who, for whatever reason, ignored the longstanding rule that the source and destination arrays must not overlap. It is those callers who should be fixed, not the C library.
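For readers who have not run into the overlap rule, here is a minimal sketch of the kind of code where it matters. The discard_front() helper is hypothetical, invented for this example, and has nothing to do with Flash's actual source; it slides the unconsumed tail of a buffer down to the front. The source and destination ranges can overlap here, so memcpy() would have undefined behavior, while memmove() is specified to handle the overlap correctly:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: drop the first `consumed` bytes of a buffer and
 * slide the remaining bytes down to the start. The two ranges overlap
 * whenever more bytes remain than were consumed, so memmove() is the
 * right call; memcpy() is only defined for non-overlapping ranges.
 *
 * Returns the number of bytes left at the front of the buffer.
 */
size_t discard_front(unsigned char *buf, size_t len, size_t consumed)
{
    if (consumed >= len)
        return 0;                               /* nothing left to keep */

    size_t remaining = len - consumed;
    memmove(buf, buf + consumed, remaining);    /* overlap-safe copy */
    return remaining;
}
```

Code written with memcpy() in place of memmove() may appear to work for years if the library happens to copy in a direction that suits the caller, which is how a change in the copying strategy can expose the bug.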
The Fedora project, too, is in a position to revert the change. The idea was discussed at length on the fedora-devel mailing list, but the project has, so far, taken no such action. At this level, there is a clear tension between those who want to provide the best possible user experience (which includes a working Flash player) in the short term, and those who feel that letting this kind of regression hold back a performance improvement harms the user experience in the longer term. According to the latter group, reverting the change would slow things down for working programs and relieve the pressure on Adobe to fix its bug. It is better, they say, for affected users to apply a workaround and complain to Adobe. That view appears to have carried the day.
In the /proc/kallsyms case, the change was reverted; an explicit choice was made to forgo a potential security improvement to avoid breaking older distributions. This decision has been somewhat controversial, both on the kernel mailing list and here on LWN. The affected distribution (Ubuntu 9.04) is relatively old; its remaining users are unlikely to put current kernels on it. So a number of voices were heard to say that, in this case, it is better to have the security improvement than compatibility with older distributions.
Linus was clear about his policy, though:
The kernel's record with regard to this rule is, needless to say, not perfect, but that record as a whole is quite good; that has served the kernel well. It is usually possible to run current kernels on very old distributions, allowing users to gain new hardware support and features, or simply to help with testing. It forms a sort of contract with the kernel's users which gives them some assurance that new releases will not cause their systems to break. And, importantly, it helps the kernel developers to keep overall kernel quality high; if you do not allow once-working things to break, you can be at least somewhat sure that the quality of the kernel is not declining over time. Once you start allowing some cases to break, you can never be sure.
There is probably little chance of a kernel-style "no regressions" rule being universally adopted. The kernel's interface to the rest of the system is relatively narrow; the system as a whole has a much larger range of things that can break. It is a challenge to keep new kernel releases from causing problems with existing applications; for a full distribution, it's perhaps an insurmountable challenge. That is part of why companies pay a lot of money for distributions which almost never make new releases.
Some kinds of regressions are also seen as being tolerable, if not actively desirable. There has never been any real sympathy for broken proprietary graphics drivers, for example. The proprietary nature of the Flash plugin will not have helped in this case either; it is irritating to know exactly how to fix a problem, but to be unable to actually apply that fix. Any free program affected by this bug would, if anybody cared about it at all, have been fixed long ago. Flash users, meanwhile, are still waiting for Adobe to change a memcpy() call to memmove(). One could certainly argue that holding Adobe responsible for its bug - and, at the same time, demonstrating the problems that come with proprietary programs - is the right thing to do.
On the other hand, one could argue that breaking Flash is a good way to demonstrate to users that they should be using a different distribution - or another operating system entirely. Your editor would suggest that perfection with regard to regressions is not achievable, but it still behooves us to try for it when we can. There is a lot to be said for creating a sense of confidence that software updates are a safe thing to apply. It will make it easier to run newer, better software, inspire users to test new code, and, maybe, even bring some vendors closer to upstream. We should make a point of keeping things from breaking, even when the bugs are not our fault.
