
On breaking things

By Jonathan Corbet
November 24, 2010

Our systems run a complex mix of software which is the product of many different development projects. It is inevitable that, occasionally, a change to one part of the system will cause things to break elsewhere, at least for some users. How we respond to these incidents has a significant effect on the perceived quality of the platform as a whole and on its usability. Two recent events demonstrate two different responses - but not, necessarily, a clear correct path.

The two events in question are these:

  • An optimization applied to glibc changed the implementation of memcpy(), breaking a number of programs in the process. In particular, the proprietary Flash plugin, which, contrary to the specification, uses memcpy() to copy overlapping regions, is no longer able to play clear audio for some kinds of media.

  • A change in the default protections for /proc/kallsyms, merged for the 2.6.37 kernel, was found to cause certain older distributions to fail to boot. The root cause is apparently a bug in klogd, which does not properly handle a failure to open the symbol file.

In summary, we have two changes, both of which were intended to improve the behavior of the system - better performance, in the glibc case, and better security for /proc/kallsyms. In each case, the change caused other code which was buggy - but which had been working - to break. What came thereafter differed considerably, though.

In the glibc case, the problem has been experienced by users of Fedora 14, which is one of the first distributions to ship the new memcpy() implementation. Given that code using glibc has been rendered non-working by this change, one might reasonably wonder if the glibc developers have considered reverting it. As far as your editor can tell, though, nobody has even asked them; the developers of that project have built a reputation for a lack of sympathy in such situations. They would almost certainly answer that the bug is in the users of memcpy() who, for whatever reason, ignored the longstanding rule that the source and destination arrays cannot overlap. It is those users who should be fixed, not the C library.

The Fedora project, too, is in a position to revert the change. The idea was discussed at length on the fedora-devel mailing list, but the project has, so far, taken no such action. At this level, there is a clear tension between those who want to provide the best possible user experience (which includes a working Flash player) in the short term, and those who feel that allowing this kind of regression to hold back a performance improvement is bad for the best possible user experience in the longer term. According to the latter group, reverting the change would slow things down for working programs and relieve the pressure on Adobe to fix its bug. It is better, they say, for affected users to apply a workaround and complain to Adobe. That view appears to have carried the day.

In the /proc/kallsyms case, the change was reverted; an explicit choice was made to forgo a potential security improvement to avoid breaking older distributions. This decision has been somewhat controversial, both on the kernel mailing list and here on LWN. The affected distribution (Ubuntu 9.04) is relatively old; its remaining users are unlikely to put current kernels on it. So a number of voices were heard to say that, in this case, it is better to have the security improvement than compatibility with older distributions.

Linus was clear about his policy, though:

The rule is not "we don't break non-buggy user space" or "we don't break reasonable user-space". The rule is simply "we don't break user-space". Even if the breakage is totally incidental, that doesn't help the _user_. It's still breakage.

The kernel's record with regard to this rule is, needless to say, not perfect, but that record as a whole is quite good; that has served the kernel well. It is usually possible to run current kernels on very old distributions, allowing users to gain new hardware support and features, or simply to help with testing. It forms a sort of contract with the kernel's users which gives them some assurance that new releases will not cause their systems to break. And, importantly, it helps the kernel developers to keep overall kernel quality high; if you do not allow once-working things to break, you can be at least somewhat sure that the quality of the kernel is not declining over time. Once you start allowing some cases to break, you can never be sure.

There is probably little chance of a kernel-style "no regressions" rule being universally adopted. Even in current kernels, the interface to the rest of the system is relatively narrow; the system as a whole has a much larger range of things that can break. It is a challenge to keep new kernel releases from causing problems with existing applications; for a full distribution, it's perhaps an insurmountable challenge. That is part of why companies pay a lot of money for distributions which almost never make new releases.

Some kinds of regressions are also seen as being tolerable, if not actively desirable. There has never been any real sympathy for broken proprietary graphics drivers, for example. The proprietary nature of the Flash plugin will not have helped in this case either; it is irritating to know exactly how to fix a problem, but to be unable to actually apply that fix. Any free program affected by this bug would, if anybody cared about it at all, have been fixed long ago. Flash users, meanwhile, are still waiting for Adobe to change a memcpy() call to memmove(). One could certainly argue that holding Adobe responsible for its bug - and, at the same time, demonstrating the problems that come with proprietary programs - is the right thing to do.

On the other hand, one could argue that breaking Flash is a good way to demonstrate to users that they should be using a different distribution - or another operating system entirely. Your editor would suggest that perfection with regard to regressions is not achievable, but it still behooves us to try for it when we can. There is a lot to be said for creating a sense of confidence that software updates are a safe thing to apply. It will make it easier to run newer, better software, inspire users to test new code, and, maybe, even bring some vendors closer to upstream. We should make a point of keeping things from breaking, even when the bugs are not our fault.



On breaking things

Posted Nov 24, 2010 22:28 UTC (Wed) by rwmj (subscriber, #5474) [Link]

Indirectly related:
http://lists.fedoraproject.org/pipermail/devel/2010-Novem...

Yes, people really do need to read the kernel, and probably the kernel symbols, as non-root, and it's likely a lot more stuff will break if you make these changes. It's security-through-obscurity anyway because distro kernels are public knowledge. Why not implement a kind of ASLR in the kernel instead?

On breaking things

Posted Nov 25, 2010 2:41 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

"Why not implement a kind of ASLR in the kernel instead?"

I expect that is exactly the sort of thing the removal of non-root kallsyms access was for in the first place. The file reports the address of every kernel symbol, after all, which would render ASLR largely ineffective to anything capable of reading it.

On another note, there is more to these cases than simply the way the breakage is being handled. The glibc change only broke programs which used the API contrary to its explicit interface requirement--every C programmer should know that you never use memcpy() when the buffers may overlap, as it's documented clearly in the C99 standard (http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf page 337), POSIX (http://www.opengroup.org/onlinepubs/000095399/functions/m...) and SUSv2 (http://opengroup.org/onlinepubs/007908799/xsh/memcpy.html). The kernel's /proc/kallsyms interface, on the other hand, was never documented as anything other than readable by everyone by default.

Also, Linus' comment to the contrary notwithstanding, it's not like there have never been user-visible API changes in the Linux kernel; for example, the ipchains interface no longer exists, and more recently a non-zero default was set for the /proc/sys/vm/mmap_min_addr parameter, which did in fact break some existing user-level software.

On breaking things

Posted Nov 25, 2010 7:33 UTC (Thu) by error27 (subscriber, #8346) [Link]

"The glibc change only broke programs which used the API contrary to its explicit interface requirement--every C programmer should know that you never use memcpy()"

The memcpy() change exposed a similar bug in glibc. So the Flash people maybe aren't idiots after all when compared to the glibc people.

Why use such misleading names for functions?

Posted Nov 25, 2010 9:40 UTC (Thu) by rvfh (subscriber, #31018) [Link]

I think having memcpy() and memmove() is a mistake.
* memcpy() should do the right thing, which is copy memory, safely.
* memmove() should not even exist, especially not with such a misleading name.

If we want an optimised memcpy(), I think we should rather create a new memcpy_nooverlap() (feel free to find a better name) function which clearly states its nature.

Last but not least, if memcpy() had always _really_ been unsafe, we would not have the problem today, as Adobe would have spotted and corrected the bug during development.

Why use such misleading names for functions?

Posted Nov 25, 2010 10:18 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

Any naive implementation of ISO C memcpy() is almost certain to have been unsafe in one direction. Either it starts at the end and works backward, making it work reliably for "dest > src, src + len > dest", or it starts at the beginning and works forward, making it work reliably for "dest < src, dest + len > src". (Of course, some bunch of demented jerks implementing a Deathmaster 9000 probably made it start at the middle and oscillate outwards.)

As for your proposed grand rename: You're quite right. Feel free to try and persuade the ISO C standard committee of your correctness.
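
The two naive strategies described above can be sketched as follows. This is an illustration only, not glibc's code; the names copy_forward and copy_backward are invented for this example. Each loop is safe for one direction of overlap and corrupts the data for the other.

```c
#include <stddef.h>

/* Copies from the first byte upward.
 * Safe when dest lies below src; wrong when dest starts inside src. */
void *copy_forward(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dest;
}

/* Copies from the last byte downward.
 * Safe when dest lies above src; wrong when src starts inside dest. */
void *copy_backward(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n-- > 0)
        d[n] = s[n];
    return dest;
}
```

With a buffer holding "abcdef", copy_backward(buf + 2, buf, 4) yields "ababcd" as intended, while copy_forward with the same arguments re-reads bytes it has already overwritten and yields "ababab".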

Why use such misleading names for functions?

Posted Nov 25, 2010 10:52 UTC (Thu) by xav (subscriber, #18536) [Link]

> Last but not least, if memcpy() had always _really_ been unsafe, we would not have the problem today, as Adobe would have spotted and corrected the bug during development.

Just one Valgrind run found the problem. So the big lesson here is that Adobe doesn't even run Valgrind or equivalent (e.g. Purify) on their code. They wouldn't have corrected the bug during development, even if memcpy() had done a printf() to warn them.

No wonder their software is full of security holes.

Why use such misleading names for functions?

Posted Nov 25, 2010 18:42 UTC (Thu) by RobSeace (subscriber, #4435) [Link]

> memcpy_nooverlap() (feel free to find a better name)

We've already got one: memmove()...

(And, no, it's NOT really a "misleading" name... As I said in the old thread,
when the regions overlap, indeed the source region will no longer contain the
data it once did prior to the memmove(), since it's now been overwritten, at
least in part... So, the data truly was MOVED from source to destination, not
merely COPIED...)

Why use such misleading names for functions?

Posted Nov 25, 2010 18:45 UTC (Thu) by RobSeace (subscriber, #4435) [Link]

D'oh! Of course, I misread, and of course memmove() is the name for memcpy_overlap(),
and your hypothetical memcpy_nooverlap() is just memcpy()... But, it has
literally ALWAYS been this way, and any C programmer worth half a damn knows
this is how memcpy() and memmove() work... It's widely documented in various
standards, and misusing memcpy() in place of memmove() will cause you lots of
trouble on various other libcs... Just because glibc has been forgiving of the
blatant misuse prior to now is no reason to continue to tolerate it...

Why use such misleading names for functions?

Posted Nov 25, 2010 21:36 UTC (Thu) by rvfh (subscriber, #31018) [Link]

> And, no, it's NOT really a "misleading" name...
> (...)
> the data truly was MOVED from source to destination (...)

You're correct, now that you say that, I understand why the name!

On breaking things

Posted Nov 24, 2010 22:59 UTC (Wed) by nirik (subscriber, #71) [Link]

re: the flash issue a few things to note:

* This only affects the 64bit flash plugin on some cpus.
* Adobe hasn't actually officially released a 64bit version, they are all beta or preview, AFAICT.

So, the simple workaround is to use the actually released 32bit flash plugin with nspluginwrapper for now.

See http://fedoraproject.org/wiki/Flash for more info on how to do so.

On breaking things

Posted Nov 25, 2010 23:40 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

it's also important to note that flash is not the only application that has this problem. As noted elsewhere in the comments, even glibc had bugs around this

On breaking things

Posted Nov 24, 2010 23:01 UTC (Wed) by jengelh (subscriber, #33263) [Link]

>On the other hand, one could argue that breaking Flash is a good way to demonstrate to users that they should be using a different distribution - or another operating system entirely.

Ha, those users came to Linux hoping to run proprietary stuff? They be taught better!

On breaking things

Posted Nov 25, 2010 15:10 UTC (Thu) by mrshiny (subscriber, #4266) [Link]

Damn straight. And they better not upgrade glibc without also upgrading every other package on their system too, since those older packages may have been developed against an older glibc; even a recompile doesn't fix this problem, you need to change every affected program, and update those to the latest, non-buggy version.

On breaking things

Posted Nov 25, 2010 19:10 UTC (Thu) by jengelh (subscriber, #33263) [Link]

Only for statically linked programs. Thankfully most proprietary programs' vendors have got the idea that dynamic linking against libc is a good idea.

On breaking things

Posted Nov 25, 2010 21:47 UTC (Thu) by mrshiny (subscriber, #4266) [Link]

No, only statically linked programs are NOT affected by the change in Glibc.

If you wrote a program which accidentally did a memcpy when it should have been memmove, and (due to implementation details in memcpy) this worked just fine for years, you might not find the bug. A user who installed your application, which worked fine for years, and never upgraded the app but upgraded glibc, will suddenly find that your app is broken. Maybe you found and fixed the bug in a later version of the app. Maybe not. The point is that the user, who knows nothing about memcpy, now has a broken app. And the app might only break under certain conditions. And it might result in any kind of error: crash, corrupted data, silent corruption of data, audio or video glitches, who knows. These users are being punished so that someone else potentially might have a slightly faster memcpy.

On breaking things

Posted Nov 25, 2010 21:52 UTC (Thu) by jengelh (subscriber, #33263) [Link]

Then I don't understand your comment about why you'd have to update all programs. Just the ones that use memcpy incorrectly.

>user, who knows nothing about memcpy, now has a broken app.

But false attribution of fault is nothing new. When a program/driver did a stupid thing in Windows 9x and led to a bluescreen, few would consider it to be a program/driver issue, and instead blamed Windows.

On breaking things

Posted Nov 26, 2010 4:27 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

Nobody is saying the apps aren't at fault.

The problem is that glibc changed the situation from a hypothetical bug to an actual bug.

And due to the nature of the bug it's impossible for the user to diagnose it.

And because this change isn't hidden from older binaries by version symbols, upgrading the library breaks the apps and the user may have no way of getting a fixed app.

My point is that users are being held hostage so that the glibc maintainers can say "meh, those stupid programmers at <wherever> should read the C99 standard". Thanks, that doesn't help me with my problem at all.

The windows developers have many features in place to provide backwards compatibility for broken apps. Yes, they need it more because source isn't available for most windows apps, but still.

At least in Linux I can roll my own LD_PRELOAD hack to fix this. Except, it's a pain in the butt to use, and I only know of one app that needs it right now (flash). Maybe there is another one, somewhere on my system, which is misbehaving in a way that will cause me to lose important data in a few days. I have no way of knowing.

Also this is the 2nd time in recent years that a change to memcpy broke apps on my system. Maybe the glibc people should change glibc so that it subtly breaks ALL apps that violate the C standard? So that instead of hundreds of hypothetical bugs we'd have hundreds of real bugs, happily munching the data on your hard disk? It's within their rights, I suppose.
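
For reference, the LD_PRELOAD hack mentioned above is usually only a few lines. A minimal sketch, assuming a GNU/Linux system with gcc; the file name memcpy_shim.c is made up for this example, and overriding a standard library function like this is a stopgap that leans on the dynamic linker's symbol interposition, not something the C standard blesses:

```c
/*
 * memcpy_shim.c (hypothetical name): route every dynamically resolved
 * memcpy() call to memmove(), which is defined for overlapping regions.
 *
 * Build: gcc -shared -fPIC -fno-builtin -o memcpy_shim.so memcpy_shim.c
 * Use:   LD_PRELOAD=/path/to/memcpy_shim.so some-program
 */
#include <stddef.h>

/* Declared by hand so that no header can turn it into a macro or inline. */
void *memmove(void *dest, const void *src, size_t n);

/* Same signature as the library function it replaces. */
void *memcpy(void *dest, const void *src, size_t n)
{
    return memmove(dest, src, n);
}
```

It only affects calls that go through the dynamic linker; copies that the compiler expanded inline when the program was built are untouched.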

On breaking things

Posted Nov 26, 2010 9:57 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

I believe that Ulrich Drepper's position is roughly that if a change to glibc's internal implementation of aspects of an ISO C function's behaviour that compliant ISO C programs are explicitly forbidden to rely on (e.g. whether memcpy() copies forward, backward, or oscillating outward from the middle; what isalpha() etc. do if you pass them OOB values) breaks an application, it's the application developer's fault and officially Not His Problem and if you try to make it his problem he will tell you exactly where to get off. Especially if it's a closed source application.
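
The isalpha() case mentioned above is an easy one to hit: the <ctype.h> functions are only defined for arguments representable as unsigned char (or EOF), so passing a plain char that happens to be negative is undefined. A minimal sketch of the conventional defensive cast; count_alpha is a name invented for this example:

```c
#include <ctype.h>
#include <stddef.h>

/* Count the alphabetic characters in a NUL-terminated string. */
size_t count_alpha(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* The cast keeps bytes >= 0x80 from becoming negative ints
         * on platforms where plain char is signed. */
        if (isalpha((unsigned char)*s))
            count++;
    }
    return count;
}
```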

On breaking things

Posted Nov 26, 2010 13:40 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

I think you are correct about his position. I just feel that it's not the right position for a library maintainer to take, especially the single most important library in the whole system.

The thing that bugs me is that there is a way to implement this change such that all newly-compiled apps get the improvement while older apps get the older behaviour. Sure, for Fedora that means that every single app which might have this bug is now vulnerable, but anything else will be fine. Lots of people have apps that they can't easily change. Many of those apps are even Free Software. Those users cannot reliably upgrade glibc, it seems.

On breaking things

Posted Nov 26, 2010 19:09 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

> I just feel that it's not the right position for a library maintainer to take, especially the single most important library in the whole system.

But let's not attribute more responsibility to Drepper than he really has. One of the reasons a distributor of free software has the privilege of defining what is Not His Problem is that anyone who disagrees is free to do better. The article makes this point in noting that it is Fedora, not the Glibc project, that is distributing a problematic library, and Fedora is accepting that responsibility and discussing whether to distribute the old or new memcpy behavior.

And, according to the article, the people arguing in favor of distributing the new memcpy behavior aren't doing so based on principle, like Drepper, but based on the belief that giving better performance to a wide range of users over the long term is better than making Flash work for some users in the short term.

On breaking things

Posted Nov 26, 2010 19:27 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

it's not just flash that was broken.

there's also the problem that the breakage can easily go unnoticed, and can corrupt the users data.

On breaking things

Posted Nov 26, 2010 19:48 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

If they used symbol versioning (or whatever it's called) they could have working Flash AND better performance in the long run.

On breaking things

Posted Nov 26, 2010 20:06 UTC (Fri) by oak (guest, #2786) [Link]

> My point is that users are being held hostage so that the glibc maintainers can say "meh, those stupid programmers at <wherever> should read the C99 standard".

This was already specified in ANSI C in the '80s, i.e. last century.

Memory debugging tools like Valgrind, duma etc. have been giving warnings about memcpy() calls with overlapping addresses at least for a decade.

If 10-20 years isn't enough for e.g. Adobe to test with freely available (or commercial) tools that their software is robust, portable and correctly implemented, I don't have very high hopes of it ever being what I (and apparently Steve Jobs) call "product quality" SW.

On breaking things

Posted Nov 26, 2010 20:22 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

Valgrind will only report the problem if the particular run of the program happens to produce overlapping regions.

If the work is something like defragmenting memory by moving things around, it's very possible for one particular run to not have overlaps, but another run to have overlaps. Any time you have the pointers calculated in the program, you have a case where they may or may not overlap on a particular run.

and while you are chastising Adobe for not having tested with Valgrind, make sure you chastise everyone else (including glibc maintainers) for the same thing.

this is unfortunately a very easy mistake to make, and unless a change like this is made, it's unlikely to come to light.
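
Since the overlap depends on values computed at run time, one cheap defence during development is to check at the call site. A minimal sketch, assuming a flat address space where converting pointers to uintptr_t gives comparable addresses; regions_overlap and copy_checked are names invented for this example:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if the n-byte regions starting at a and b share at least one byte. */
static bool regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;

    /* The regions overlap when the distance between the start
     * addresses is smaller than the length. */
    return pa < pb ? pb - pa < n : pa - pb < n;
}

/* Use memcpy() only when it is known to be valid, memmove() otherwise. */
void *copy_checked(void *dest, const void *src, size_t n)
{
    if (regions_overlap(dest, src, n))
        return memmove(dest, src, n);
    return memcpy(dest, src, n);
}
```

A debug build could log or abort in the overlapping branch instead of quietly falling back, which would surface the bug on the first run that happens to overlap.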

On breaking things

Posted Nov 26, 2010 20:46 UTC (Fri) by jengelh (subscriber, #33263) [Link]

Well, if as many users run into the problem as have, surely the chance that Adobe employees would encounter it at least once is reasonably high.

On breaking things

Posted Nov 26, 2010 21:01 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

Naturally, now that glibc has changed this bug from hypothetical to actual, the Adobe maintainers will have no problem at all reproducing it. But prior to this change nobody experienced the bug.

On breaking things

Posted Nov 26, 2010 21:43 UTC (Fri) by jengelh (subscriber, #33263) [Link]

Which would hint towards Adobe not having run Valgrind. Because, seriously, Youtube is not exactly new nor a small unimportant site.

On breaking things

Posted Nov 26, 2010 22:16 UTC (Fri) by oak (guest, #2786) [Link]

> If the work is something like defragmenting memory by moving things around, it's very possible for one particular run to not have overlaps, but another run to have overlaps.

One should of course understand that tools can give only positive proof of bugs' existence, not proof of them not existing. Things like running Valgrind (and static checkers) should be part of the development process, so that over the SW lifetime & changes one gets better coverage. It's not some one-off, instantly forgotten thing.

> and while you are chastising Adobe for not having tested with Valgrind, make sure you chastise everyone else (including glibc maintainers) for the same thing.

Sure.

Btw. If calls to a function are within the same library where the function is implemented, function wrappers don't catch that unless it also goes through .plt. And as to Glibc FORTIFY utility, I'm not sure whether Glibc enables that for itself...?

With Duma I've seen also another "issue" in Glibc, memmove() calling memcpy() because it has inherent knowledge about in which direction memcpy() works. Because Duma doesn't have that info, it complains (Duma has now a variable for this)...

On breaking things

Posted Nov 26, 2010 11:44 UTC (Fri) by cortana (subscriber, #24596) [Link]

> No, only statically linked programs are NOT affected by the change in Glibc.

Even ones that make use of NSS modules?

On breaking things

Posted Nov 26, 2010 13:34 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

I couldn't answer that question, I don't know enough about it. But by definition, statically-linked libraries don't get upgraded, so your old statically-linked apps with hidden bugs won't suddenly find that those bugs are now active once glibc is upgraded for the rest of the system.

On breaking things

Posted Nov 26, 2010 20:00 UTC (Fri) by oak (guest, #2786) [Link]

You can't do completely statically linked programs with Glibc unless they're really simple. There are several parts in Glibc (like NSS) which load code dynamically.

With e.g. C-libraries intended for embedded devices like uClibc, it's a bit easier.

On breaking things

Posted Nov 25, 2010 1:14 UTC (Thu) by Sufrostico (guest, #68053) [Link]

Everything depends on what kind of users you want. Freedom is not free! You need to make some effort to stay on that path.

But on the other hand, different opinions are also good for freedom.

On breaking things

Posted Nov 25, 2010 11:05 UTC (Thu) by dgm (subscriber, #49227) [Link]

This is not directly related to freedom, as you can see from there being two completely opposed ways to deal with the issue.

No, it's not about being picky with your users, but about what is best for the project long term. It's well known that Linus thinks that keeping a reputation of not breaking things (knowingly and arbitrarily) is good for the Kernel. The glibc people probably think that not allowing clearly buggy code to tie their hands is best.

Different options, and yet none is necessarily wrong, because they apply to different stuff.

On breaking things

Posted Nov 25, 2010 2:21 UTC (Thu) by lutchann (subscriber, #8872) [Link]

What is the oldest userland that the kernel developers care about keeping usable? Can I run 2.6.37 on Ubuntu 6.04? Debian Potato? RedHat 9? (Yes, I might need to install a new gcc to even build 2.6.37 on these distros...)

On breaking things

Posted Nov 25, 2010 9:49 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

I asked that exact question

On Fri, Nov 19, 2010 at 12:04:47PM -0800, Linus Torvalds wrote:
> On Fri, Nov 19, 2010 at 11:58 AM, <david@lang.hm> wrote:
> >
> > how far back do we need to maintain compatibility with userspace?
> >
> > Is this something that we can revisit in a few years and lock it down then?
>
> The rule is basically "we never break user space".
>
> But the "out" to that rule is that "if nobody notices, it's not
> broken". In a few years? Who knows?
>
> So breaking user space is a bit like trees falling in the forest. If
> there's nobody around to see it, did it really break?

so test -rc kernels and see if they break something for you.

once it's released there's far less pressure to revert something, but during the -rc series it doesn't take much to trigger a revert.

On breaking things

Posted Nov 25, 2010 3:47 UTC (Thu) by iabervon (subscriber, #722) [Link]

On the memcpy thing, I have to wonder whether it wouldn't be pretty much optimal on modern processors to implement both memcpy and memmove as the same function, using a pair of implementations, each of which works for a different direction of overlap, and just comparing the pointers to see which way the potential overlap is. If one direction is really substantially faster, you could also check whether there's any overlap and always use that one if there isn't.

That is, I expect that libraries are full of calls to memmove where the library doesn't know that the buffers don't overlap, and so the performance of memmove on non-overlapping buffers matters; and probably a few tests on things in registers aren't going to be noticeable next to a loop and memory accesses.
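
The scheme described above might look roughly like this. A sketch only, with byte loops standing in for the tuned inner loops a real C library would use; copy_any is a made-up name, and comparing addresses through uintptr_t assumes a flat address space:

```c
#include <stddef.h>
#include <stdint.h>

/* One entry point for both overlapping and non-overlapping copies. */
void *copy_any(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    uintptr_t da = (uintptr_t)dest;
    uintptr_t sa = (uintptr_t)src;

    if (da > sa && da - sa < n) {
        /* dest starts inside the source region: copy from the end
         * so that no source byte is overwritten before it is read. */
        while (n-- > 0)
            d[n] = s[n];
    } else {
        /* No overlap, or dest below src: the forward loop is safe
         * and is treated here as the preferred (fast) direction. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dest;
}
```

The extra cost over a plain forward copy is one subtraction and two comparisons on values that are already in registers.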

On breaking things

Posted Nov 25, 2010 20:59 UTC (Thu) by kleptog (subscriber, #1183) [Link]

Doing memory copies on buffers that overlap is exceedingly rare. As a library it's something you really don't need to worry about. If you're writing code and can't tell if two pointers could overlap, you're doing something wrong.

Another way to look at it is that programs consist of objects. Objects don't overlap because you allocate them that way (they may nest). About the only situation where you do a memory copy on overlapping buffers is when you're shifting objects within an array, and there it's *obvious* that's what you're doing.

It's really not a situation you get into by accident. Deoptimising the common case to save the 0.00000001% (which is an estimate on the high side) of memcpys that are broken is just silly. Valgrind detects this for you, this is really a non-issue.
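
The array-shifting case looks like this in practice. A minimal sketch; insert_at is a name invented for this example, and the caller is assumed to guarantee room for one more element:

```c
#include <stddef.h>
#include <string.h>

/* Insert value at index pos in an array currently holding len elements.
 * Assumes pos <= len and capacity for at least len + 1 elements.
 * Returns the new element count. */
size_t insert_at(int *arr, size_t len, size_t pos, int value)
{
    /* Source and destination overlap whenever more than one element
     * is shifted, so this has to be memmove(), not memcpy(). */
    memmove(&arr[pos + 1], &arr[pos], (len - pos) * sizeof arr[0]);
    arr[pos] = value;
    return len + 1;
}
```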

On breaking things

Posted Nov 26, 2010 15:19 UTC (Fri) by zlynx (subscriber, #2285) [Link]

Compare and branch is an expensive operation. In a lot of cases the memcpy would be finished by the time memmove has figured out which direction to copy in.

On breaking things

Posted Nov 26, 2010 20:21 UTC (Fri) by iabervon (subscriber, #722) [Link]

Is there some architecture where there's a way to do a loop without compare and branch and the cost of the compare and branch doesn't go away when followed by a load with offset? For in-order architectures I've seen, the test is no more expensive than the overhead of one loop iteration; for out-of-order architectures, the processor should know which path it's actually running shortly before it would know which address it's loading anyway. But I confess that the last time I looked at out-of-order architecture design was when the Pentium 4 was being designed, so I may be expecting the processor to analyze way too much.

On breaking things

Posted Nov 26, 2010 20:25 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

I also question the expense of the test. All the values needed are already in registers; I would expect that, because the CPU is so much faster than the RAM, the test can happen without noticeably affecting the overall memory copy time (since the memory copy needs to at least wait for the cache, if not for main memory).

On breaking things

Posted Nov 25, 2010 14:31 UTC (Thu) by TRS-80 (subscriber, #1804) [Link]

Another example is Ubuntu disabling ALSA OSS emulation (https://bugs.launchpad.net/bugs/579300), which broke a fair few apps, both closed and open source. Now, I wouldn't have a problem with this if they'd gone ahead with the original plan to get OSSp working. Sadly, they neglected to do this, but are refusing to revert the change.

On breaking things

Posted Dec 1, 2010 5:19 UTC (Wed) by RogerOdle (subscriber, #60791) [Link]

The documentation for memcpy says that memory regions should not overlap, but who reads the documentation until something breaks? What should be done now? Considering how widespread memcpy use is, this is not trivial. Who should determine what the memcpy functionality should be? At some point, enough people may use a function that inertia takes control away from the original writers. Should the writers change the functionality just to make it faster? What else could be done?

Solutions:

1. Go ahead and make the change. The problem is that memcpy works differently on different architectures. This is a bad thing. It results in portability problems.

2. Revert memcpy to the way it worked before and introduce new memcpy alternatives that are optimized. I would suggest one version for copying from higher-memory-to-lower (memcpyhl) and another for lower-memory-to-higher (memcpylh). The original memcpy may be modified so that it is safe for overlapped memory by selecting which optimised version to branch to. This would be compatible with existing applications and provide the faster algorithms for future applications.

3. Change the memcpy spec so that it says explicitly what way it will copy so that it can be made to work the same way on all architectures. This will not fix the current problem but, for the future, memcpy will always work the same way on every platform.

We have seen the effects the change has had on highly visible projects; it has surely affected many more projects. It is not enough to say that the writers had the right to make the change. They also have a responsibility to the people who trust that the behaviour of the library will be consistent. You have to be sensitive to the fact that the change will cost your users time and money. It is not a way to win friends.

On breaking things

Posted Dec 1, 2010 14:01 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

The documentation for memcpy says that memory regions should not overlap but who reads the documentation until something breaks?

People who've learned the hard way that reading the manual before you use something is a really good idea. This is, given the way the human brain works, not a lesson that can be reliably taught the easy way, though you may be able to learn it the medium-hard way (from someone else's mistakes, rather than your own).

Who should determine what the memcpy functionality should be?

Well, the accepted answer is not "random idiot programmers", but rather "the ISO and IEC technical committees responsible for maintaining ISO/IEC 9899, the international standard for the C programming language".

ISO/IEC 9899:1990 and its replacement ISO/IEC 9899:1999 both declare some things to be implementation-defined (meaning that the implementation must define and document the behaviour) and some things to be undefined (anything can happen - if your hardware supports nasal demons, triggering undefined behaviour is permitted to summon demons from your nose).

The behaviour of overlapping memcpy is formally undefined in the C standard; this allows creators of conforming implementations to implement it in the way that makes most sense for their platform.

The behaviour of overlapping memmove is clearly defined in the standard. After a memmove, the destination area contains a perfect copy of the source area's original contents, and any part of the source area overlapped by the destination area will be overwritten.
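A small sketch of the defined case, assuming a caller-supplied buffer with room for the shifted bytes; shift_right is a hypothetical helper name used only for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Shift the first `count` bytes of buf right by `shift` positions.
   Source and destination overlap whenever shift < count, so memmove
   (not memcpy) is the function the standard defines for this job.
   The caller must guarantee that buf holds at least count + shift bytes. */
void shift_right(char *buf, size_t count, size_t shift)
{
    memmove(buf + shift, buf, count);
}
```

Calling shift_right on a buffer holding "abcdefgh" with count 6 and shift 2 leaves "ababcdef": the destination is an exact copy of the original first six bytes, and the overlapped part of the source has been overwritten.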

On breaking things

Posted Dec 1, 2010 16:41 UTC (Wed) by RogerOdle (subscriber, #60791) [Link]

Implementation defined means that the property is undefined. Should something other than memcpy have been used? Sure, but it wasn't. Now we have a problem and we have to pick a solution to get us out of the current situation. You can argue about what should have been done, but one problem you cannot deny is that this issue has given the community a black eye. I am sure that some people get some thrill telling big companies to toe the line, but we should not let egos dictate our actions. The fact is that the behaviour of a core routine was modified without due consideration or concern for the consequences. There was no warning and no plan in place to deal with the consequences. This left people scrambling and worrying. Would it have been so difficult for the glibc people to produce a glibc version that worked the old way?

I think that this would be an acceptable way to change the behaviour of a function. If a function in libfoo-1.0.so is changed then release it with libfoo-1.0.t.so (t for transition) at the same time. It should be understood that the particular change will only be present for this release. That should give developers time to adjust to the changes while providing them with a solution that allows software in the field to continue to work. If it is simple enough that an environment variable can be set to select the library before an application is run then that is a simple solution for the end user. The present solution for Firefox/Flash is just too technical for most people.

I personally do not like functions like memcpy that blow up like this. memcpy should always have worked regardless of whether the memory regions overlap or not. This issue is not new. It has been a perpetual stumbling block since the beginning. People think that when something has a simple name and a simple task then it should just work. Putting use conditions on something as simple as "memory copy" is just awkward. Even if a developer doesn't violate memcpy requirements, someone else may use a library that uses memcpy which is just further complication.

On breaking things

Posted Dec 1, 2010 19:59 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

It doesn't take long, observing humans, to realize that they will routinely and repeatedly put off doing anything about something that is not yet a clear, present, and immediate problem for them, especially if they can come up with a good reason why someone else should do it instead, and even people who are aware of this phenomenon are guilty of it.

Tell them "your program will break when version x.(y+1) comes out and removes x.y's transitional compatibility with x.(y-1)" and they'll say "ok, I'll fix it when x.(y+1) comes out", then go into a blind panic when x.(y+1) really does come out and really does break their program.

On breaking things

Posted Dec 1, 2010 20:20 UTC (Wed) by RogerOdle (subscriber, #60791) [Link]

That is very true, but some people are forward-looking. It doesn't help that their source of income can be changed because a core component works differently after an upgrade. Upgrades are important for security and stability. They cannot be ignored. Planning for a change like memcpy is better than hitting people cold.

On breaking things

Posted Dec 2, 2010 9:04 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

People responsible enough to be forward-looking about these things should be expected to have read the documentation for the APIs they use, and to accept that if they used an API contrary to the established international standard, it's 100% their fault and their responsibility when their program breaks.

On breaking things

Posted Dec 2, 2010 9:10 UTC (Thu) by linuxrocks123 (guest, #34648) [Link]

No. Just no. This is like saying, "well, this particular case of a use after free worked before, so changing the implementation of malloc to break this program without warning is irresponsible." No. It's not. There is a clear delineation of responsibilities. That delineation is documented in the relevant standards. That delineation says that if you use memcpy with overlapping regions, you are doing it wrong. This has been the case SINCE THE DAWN OF C. Anyone using memcpy with overlapping regions wrote a buggy program. Anyone whose program is broken now has to fix it. Their programs are buggy; they need to fix the problem.

If you don't hold people accountable, you encourage bad behavior. I'm glad glibc has a maintainer willing to hold people accountable for their bugs.

That being said, if Fedora wants to roll back to the old version until the bug is fixed, that is their business, but I wouldn't recommend holding every other program back for Adobe Flash. Just ship a workaround library for Flash and be done with it.

---linuxrocks123

On breaking things

Posted Dec 4, 2010 23:11 UTC (Sat) by nix (subscriber, #2304) [Link]

Implementation defined means that the property is undefined.

They are distinct and always have been: implementation-defined means that an implementation can choose what to do at that point, undefined means that the program is no longer C and that anything might happen, even before the code invoking undefined behaviour is reached. Quite different.

Overlapping memcpy() is undefined. It is not implementation defined: implementations need do nothing sensible when it is executed, and if they do do something sensible, they are setting up a portability trap for everyone who implements anything for the first time on that platform. (Solaris is infamous for this, with a malloc() that allows double-free() and use-after-free() and God only knows what else without a murmur. Take your Solaris program and put it on a Linux platform, or a FreeBSD platform, or a Windows platform and *boom*, and everyone blames the Linux/FreeBSD/Windows system: but it is the Solaris system that was at fault, for being tolerant of bad code past sanity.)
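To make the distinction concrete, here is a minimal sketch of two implementation-defined properties that a program can legitimately ask about, because the implementation is required to pick an answer and document it. The function names are hypothetical; CHAR_MIN and CHAR_BIT come from the standard header limits.h.

```c
#include <limits.h>

/* Implementation-defined: whether plain char is signed. The platform
   must choose one answer and document it; limits.h exposes the choice. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Implementation-defined: the width of int, again a documented choice. */
int bits_per_int(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}

/* There is no equivalent query for overlapping memcpy(): the behaviour is
   undefined, so no answer exists for a program to look up or rely on. */
```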

Might I suggest that next time you learn something about the C standard before you try to argue about it? You only need to read the first dozen pages to get this distinction. This is a radical suggestion on the net, I know, but I think you'll find it worthwhile...

On breaking things

Posted Dec 4, 2010 23:48 UTC (Sat) by RogerOdle (subscriber, #60791) [Link]

Please do not lecture me on standards. I am an engineer and I live by them. I have written many specifications and they come back to bite you when you make a mistake so I take extra care not to make them in the first place. When you leave a hole in your specification like saying that it is implementation specific then two people are going to use it differently and one is going to blow up. I do know something about the C standard and in this case the standard leaves one wanting something better.

You cannot deny that the insufficiency of the memcpy standard has caused problems. It is easy to say that programmers should be mindful about copying overlapping memory but the reality is that they are not and this is the root of the problem. You can stay in your ivory tower where everything is perfect if you want but I want things to be easier. I want to get rid of the stumbling blocks like memcpy. Even if programmers know about memcpy, they are still going to forget and this issue is going to come up again. It has before and it will again.

I would argue that since memcpy is a core C function, that it should work all of the time no matter how the addresses overlap. If someone wants an optimized function then they should look elsewhere. No function should go into the core whose behaviour is "implementation defined". I should be able to use the core functions anywhere and get the same results everywhere. I mentioned before that things would be better if the C standard had memcpy alternatives for copying to lower addresses and for copying to higher addresses. Each of these could be optimized and memcpy could be modified to pick which one to use based on its arguments. If someone wants to optimize performance then they can call these functions directly. But memcpy would always work! If the C standard did this reasonable thing then we wouldn't be having this argument.

My point all along has been that memcpy has once again given us a black eye. If it isn't fixed this time then next year we will be having this same argument all over again. How about if GCC just does us all a favor and throws out a warning whenever memcpy is used?

On breaking things

Posted Dec 5, 2010 3:51 UTC (Sun) by magila (subscriber, #49627) [Link]

Implementation defined and undefined behavior exist for a good reason: without them C could not be as versatile or performant as it is across multiple platforms. Sure it can be a pain to deal with, but this is a trade-off which lies at the core of C's design philosophy.

If you don't want to deal with ensuring that memory regions don't overlap then you are certainly welcome to use memmove. Although even then, if you are inadvertently copying between overlapping regions there's a chance you'll still have a bug because you really didn't want to clobber the source region.
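For code that does want to keep memcpy and verify the no-overlap precondition itself, one possible approach is a checking wrapper, sketched below. The names regions_overlap and checked_memcpy are hypothetical, and the integer comparison assumes a flat address space; this is a debugging aid under those assumptions, not a description of anything glibc or GCC provides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if the n-byte regions starting at a and b share at least one byte.
   Assumption: addresses can be compared as integers (flat address space). */
bool regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t lo = pa < pb ? pa : pb;
    uintptr_t hi = pa < pb ? pb : pa;
    return n > 0 && hi - lo < n;
}

/* Refuse to copy overlapping regions: returns NULL and copies nothing
   in that case, otherwise behaves like memcpy and returns dst. */
void *checked_memcpy(void *dst, const void *src, size_t n)
{
    if (regions_overlap(dst, src, n))
        return NULL;
    return memcpy(dst, src, n);
}
```

A wrapper like this only reports the overlap; as noted above, the caller still has to decide whether clobbering the source was ever the intended behaviour.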

Frankly, it sounds like what you want isn't C at all. There are plenty of high level languages which will (attempt to) give you safe and consistent behavior across all supported platforms. Of course, to do this those languages accept a different set of trade-offs from C, but if you really don't want to have to deal with implementation defined or undefined behavior I expect you will find them a better match for your needs.

On breaking things

Posted Dec 5, 2010 20:57 UTC (Sun) by anselm (subscriber, #2796) [Link]

I would argue that since memcpy is a core C function, that it should work all of the time no matter how the addresses overlap.

This is wishful thinking. The fact that memcpy() is not guaranteed to work when the source and target ranges overlap has been documented, in an ISO standard no less, for 20 years now. C programmers ignore this at their own peril.

There is a lot to be said for the observation that memcpy() should never have been standardised that way, but that observation ought to have been made before ISO 9899-1990 was finalised. Even if the GNU libc programmers changed their version of memcpy() back to suit your preferences, the fact that ISO C disallows overlapping copies is only going to bite you again on the next C implementation you're going to port your code to.

On breaking things

Posted Dec 13, 2010 20:30 UTC (Mon) by nix (subscriber, #2304) [Link]

Exactly. The C Standard will never change in this respect, and even if it *did* it would be many decades before everyone could rely on it, and a source of portability nightmares until that day.

Language standards for major languages do not change that easily. (Look up the history of the && versus & precedence rules, and why & is wrong. That was back when there were only a few C installations, and they *still* held off changing it.)

On breaking things

Posted Dec 5, 2010 12:54 UTC (Sun) by paulj (subscriber, #341) [Link]

And GNU libc checks strings passed to printf for a %s placeholder for NULL, when the C standard says this is not allowed. So a lot of code that runs fine on Linux would blow up on Solaris. I think eventually the Sun engineers relented and held their nose and made Solaris libc similarly check.
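Portable code cannot count on that leniency, so a common defensive pattern is to substitute a placeholder before the pointer ever reaches printf. A minimal sketch, with hypothetical helper names printable and log_name; the "(null)" text simply mirrors what glibc prints:

```c
#include <stddef.h>
#include <stdio.h>

/* Return s, or a placeholder if s is NULL, so that "%s" never receives
   a null pointer (which the C standard leaves undefined). */
const char *printable(const char *s)
{
    return s != NULL ? s : "(null)";
}

/* Example use: format a possibly-missing name into a caller's buffer. */
int log_name(char *out, size_t outsz, const char *name)
{
    return snprintf(out, outsz, "name=%s", printable(name));
}
```

This guards only against NULL; it does nothing for wild pointers, which no formatting function can detect.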

So I don't think anyone has full claim to being pure as the snow...

On breaking things

Posted Dec 13, 2010 20:35 UTC (Mon) by nix (subscriber, #2304) [Link]

They did, and your statement is true enough (though, as usual, general logging functions still have to guard against unintended NULLs, because where you can get unintended NULLs you can also get wild pointers, and those will crash anything). But as a general principle, glibc is more paranoid than Solaris libc. (Everyone at work moans about this except for me. I celebrate it. It's caught a good few bugs, although less than the saviour of all tricky bugs, valgrind :) )

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds