Glibc change exposing bugs

Posted Nov 10, 2010 21:17 UTC (Wed) by lmb (subscriber, #39048)
In reply to: Glibc change exposing bugs by jwb
Parent article: Glibc change exposing bugs

memcpy detecting that the areas overlap? What's next, are you going to propose making POSIX threads (and syscall behaviour w/the same) programmer friendly? ;-)

Glibc change exposing bugs

Posted Nov 10, 2010 21:52 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (33 responses)

The memcpy() routine is used in a wide variety of places, sometimes implicitly, to copy small amounts of data. For example, implicit memcpy() calls may be inserted by the compiler whenever an initialized array is allocated on the stack. A test+branch based on the order of the operands may take just as long as copying the data. Ergo, adding even small amounts of additional error-checking to memcpy() may have a significant impact on performance.

The restrictions on memcpy() are hardly unique; *most* APIs do not tolerate overlapping memory regions. The memmove() routine is an exception. If you want a nice "safe" way to copy some data between buffers which may or may not overlap, and don't care so much about performance, just use memmove() everywhere.

While forward compatibility is a good thing in general, it is unreasonable for API developers to feel bound to support obvious *misuse* of their APIs which directly contradicts explicit API documentation, which is exactly what is happening here. Given that any broken applications can be trivially patched with a simple LD_PRELOAD, I see no reason not to permit this change to the internal implementation of memcpy() in glibc.

Glibc change exposing bugs

Posted Nov 10, 2010 22:18 UTC (Wed) by lmb (subscriber, #39048) [Link] (32 responses)

To be honest, I disagree. Yes, the programmers are misusing memcpy(), and relied on unspecified behavior. Yes, they should correct their code.

But the code worked so far - when users upgrade their glibc, suddenly their programs break, or possibly corrupt the user's data. How's that good?

Yes, it pushes users to complain to the developer (if they still can, and their e-mail/internet bits aren't affected), but it leaves them with a bitter aftertaste for the platform/ecosystem that forces developers to fix bugs at their users's expense.

The code should start with emitting a warning to the logs (once per program run, otherwise it becomes a DoS). The compiler could start warning if it detects the possibility of this happening (or coverity/valgrind etc all can). Possibly taunt developers publicly if you spot those messages in your logs.

But breaking underneath an unsuspecting user? Horribly, horribly wrong.

Glibc change exposing bugs

Posted Nov 10, 2010 22:34 UTC (Wed) by jwb (guest, #15467) [Link] (29 responses)

Your philosophy seems to be that if Flash never fixes this bug, we can never have the faster glibc memcpy. Why should free software be blocked by secret developments at Adobe?

Glibc change exposing bugs

Posted Nov 10, 2010 23:59 UTC (Wed) by lmb (subscriber, #39048) [Link] (24 responses)

You're focusing on Flash. I am not. The same applies to all buggy projects.

Knowingly introducing a change with consequences that aren't just mere crashes, but data corruption, for end users - if you cannot see how that is wrong, I have no idea how to help explain it.

Yes, of course, the performance achievement is worth having. However, not at this cost. Not without first adding some audit logging. Not without giving developers time to fix that. It is an incompatible change in the ABI.

I can read the man page as well as you can. Sure, the applications are buggy; that does not give one the right to corrupt user data. Such changes need to be phased in carefully; not in a "I am more righteous than you" style. It is bad enough when it happens by accident; intentionally, it is malpractice.

Glibc change exposing bugs

Posted Nov 11, 2010 0:05 UTC (Thu) by bojan (subscriber, #14302) [Link] (2 responses)

What else uses glibc-2.12.90 out there apart form Fedora 14 and rawhide? Next to nothing. So, the change is being rolled out carefully. Only limited number of people will be exposed to it - the ones willing to run the latest Fedora.

Glibc change exposing bugs

Posted Nov 11, 2010 16:35 UTC (Thu) by jedbrown (subscriber, #49919) [Link] (1 responses)

I assume you mean glibc-2.12 or glibc-2.12.1. 2.12 has been in Arch Linux since May, 2.12.1 since August. This issue has not affected me because the new memcpy is still forward on Core 2.

Glibc change exposing bugs

Posted Nov 12, 2010 0:02 UTC (Fri) by bojan (subscriber, #14302) [Link]

No, I mean 2.12.90:

glibc-2.12.90-18.x86_64
glibc-2.12.90-18.i686

This is an unreleased version of glibc. Fedora does this from time to time - ship early cuts of new glibc (this will be 2.13 one day).

Glibc change exposing bugs

Posted Nov 11, 2010 0:31 UTC (Thu) by jwb (guest, #15467) [Link] (14 responses)

"First adding some audit logging".

You are proposing that it would be wise to have a test and branch at every entry into memcpy? That is madness.

Glibc change exposing bugs

Posted Nov 11, 2010 1:52 UTC (Thu) by gus3 (guest, #61103) [Link] (13 responses)

No, it is simple integer math:

if ((p1 + length <= p2) || (p2 + length <= p1)) {
crash_and_burn();
}

It is not a sophisticated test, and the more noise it makes about buggy parameters, the sooner the calling code will get fixed.

Glibc change exposing bugs

Posted Nov 11, 2010 2:05 UTC (Thu) by gus3 (guest, #61103) [Link] (10 responses)

I got that test backwards. It should be:

if ((p1 + length >= p2) || (p2 + length >= p1)) {
crash_and_burn();
}

But goofing the test, doesn't mean the test isn't simple.

Glibc change exposing bugs

Posted Nov 11, 2010 2:12 UTC (Thu) by gus3 (guest, #61103) [Link] (9 responses)

Aaaaand I've still goofed it. I'm not taking into account... well, everything.

I see the actual test in my head, but I can't code it right now due to fatigue. But even with all necessary calculations, being integer math, it'll take no more than a few tens of cycles. Even on a register-starved x86, putting a couple temporary variables on the stack will only pollute the cache, before over-writing the temps anyway. It shouldn't take more than a microsecond to check for overlap.

Glibc change exposing bugs

Posted Nov 11, 2010 7:24 UTC (Thu) by nix (subscriber, #2304) [Link] (6 responses)

A few tens of cycles! That's much longer than an in-cache memcpy() on a small input takes: and most of its inputs are small.

There is absolutely no chance that the glibc devs would ever accept this except in the -lc_g version of the library (which nobody ever uses).

Glibc change exposing bugs

Posted Nov 12, 2010 23:04 UTC (Fri) by gus3 (guest, #61103) [Link] (5 responses)

It shouldn't be difficult to put it in as debugging code, but this isn't normal debugging. The GNU people should own up to having violated the documentation on their code. One year of making noise, or maybe even just six months, should be long enough for developers to clean up their code from this erroneous assumption.

Glibc change exposing bugs

Posted Nov 14, 2010 22:29 UTC (Sun) by nix (subscriber, #2304) [Link] (4 responses)

The GNU people should own up to having violated the documentation on their code.

What on earth? The relevant documentation for memcpy() is ISO C, incorporated by reference into all versions of POSIX.1. This clearly states "If copying takes place between objects that overlap, the behavior is undefined."

This isn't an obscure or hard-to-interpret part of the Standard. Undefined, bang, that's it. Perhaps you are operating under the misapprehension that the linux manpages project, a descriptive effort, not a prescriptive one, is in some way binding on glibc? It isn't. It really isn't. It isn't binding on anything.

Glibc change exposing bugs

Posted Nov 15, 2010 0:00 UTC (Mon) by promotion-account (guest, #70778) [Link] (3 responses)

What on earth? The relevant documentation for memcpy() is ISO C, incorporated by reference into all versions of POSIX.1.

Indeed.

Linux man-pages are only authoritative for the kernel system-calls (more precisely, their glibc thin layer). The rest of the APIs are only included for convenience: they are a secondary source to the primary source references residing in the 'CONFORMING TO' section.

Glibc change exposing bugs

Posted Nov 15, 2010 0:31 UTC (Mon) by nix (subscriber, #2304) [Link] (2 responses)

Linux man-pages are only authoritative for the kernel system-calls (more precisely, their glibc thin layer).

No, even those are descriptive. Perhaps the glibc texinfo documentation would be authoritative for that, if it was ever maintained by anyone. As it is, I think only Ulrich and Roland's brains are authoritative for glibc.

Glibc change exposing bugs

Posted Nov 15, 2010 1:27 UTC (Mon) by promotion-account (guest, #70778) [Link] (1 responses)

I remember finding a good number of system-call manpages discussions in LKML. Otherwise, where should we find documentation for things like futexes, netlink sockets, and the rest?

For good or bad, these manpages are the 'most primary' sources available for such topics, only beside the code.

But unfortunately these man-pages do not always exist. I once had to carefully study the bluez userspace code to know how to best interface with the kernel Bluetooth API (undocumented AF_BLUETOOTH sockets, undocumented netlink interfaces, etc).

Glibc change exposing bugs

Posted Nov 15, 2010 10:38 UTC (Mon) by nix (subscriber, #2304) [Link]

Yes, they are the most useful documentation we have, especially for things that do not have glibc wrappers. But even for the kernel they are descriptive, and for things for which the glibc wrappers are the primary implementation (like readdir()) or for which there is no kernel component, the manpages are completely after-the-fact. (As far as I can tell the glibc project no longer bothers to document anything at all. There are lots of utterly undocumented things in glibc's allegedly public interface.)

Glibc change exposing bugs

Posted Nov 11, 2010 18:52 UTC (Thu) by oak (guest, #2786) [Link] (1 responses)

memmove() has this check you're clamoring for... And if the given areas don't overlap, it calls memcpy().

Glibc change exposing bugs

Posted Nov 15, 2010 0:14 UTC (Mon) by promotion-account (guest, #70778) [Link]

memmove() has this check you're clamoring for... And if the given areas don't overlap, it calls memcpy().

Sometimes even if the areas do overlap, it calls memcpy(). This happens if the library has an internal knowledge about memcpy()'s copying direction.

A common example is having src > dst, copying is forward, and the CPU block transfer unit is smaller than or equal to (src - dst). x86-64 CPUs support copying up-to 8-byte blocks in one opcode (movsq), assuming no floating-point ops in use, which is usually the case with kernel code.

Glibc change exposing bugs

Posted Nov 12, 2010 6:34 UTC (Fri) by jmm82 (guest, #59425) [Link] (1 responses)

That is two tests and one branch.

Glibc change exposing bugs

Posted Nov 12, 2010 22:56 UTC (Fri) by gus3 (guest, #61103) [Link]

Given that the total calculations will be less than one memory page, or one cache line, and the tests can be marked in code as "likely to fail" (that is, no crash and burn), Intel's post-486 processors will already be fetching the normal (everything's OK) code by the time the final test calculations are done. The branch prediction will fail on the crash-and-burn case.

Glibc change exposing bugs

Posted Nov 11, 2010 11:13 UTC (Thu) by marcH (subscriber, #57642) [Link] (5 responses)

> It is an incompatible change in the ABI.

No it is not, unless "defined-undefined behaviour" has now become part of Interfaces.

By using the wrong name you are trying to sidestep all the nuances of this problem. Unfair tactics lowering your credibility.

Glibc change exposing bugs

Posted Nov 11, 2010 12:52 UTC (Thu) by nye (subscriber, #51576) [Link] (4 responses)

>No it is not, unless "defined-undefined behaviour" has now become part of Interfaces.

Deterministic observed behaviour, like it or not, will always be considered a part of the ABI.

This is why the kernel goes out of its way to preserve observe but undocumented behaviour, and one of the reasons Windows is wildly successful despite its numerous design flaws is that Microsoft agrees.

If a change breaks existing software, then it's a regression. Hand-wringing, finger-pointing, and bitter recriminations about 'proprietary crapware' are all irrelevant. Something worked. Now it doesn't.

From the comments on this it sounds like symbol versioning could be used to avoid this problem altogether, while still getting the benefit for newly built applications. Developers don't want to do this because they feel that it will benefit only proprietary software[0]. Of course the only people harmed by this attitude are end users.

This is just yet another case where open source software chooses politics over technical excellence, which is sad but entirely unsurprising.

[0] Disregarding the idea that one might want to use some open source software with a similar bug that hasn't yet been fixed - most developers seem to always want to run the latest bleeding edge version of everything, and don't understand that the rest of the world isn't like that and expects existing software not to break unexpectedly.

Glibc change exposing bugs

Posted Nov 11, 2010 18:27 UTC (Thu) by donwaugaman (subscriber, #4214) [Link]

> This is just yet another case where open source software chooses politics over technical excellence, which is sad but entirely unsurprising.

Oddly enough, I would consider "technical excellence" to mean fixing bugs in software that has them, in this case the Adobe Flash player, whereas "politics" means allowing poorly-written precedent to trump (and in this case penalize) better performance for programs written with an eye to the standard.

It's a shame there's no way to get Adobe to do an 's/memcpy/memmove/' on their codebase. But the fact that they won't let others do it has more to do with their politics (and opposition to software freedom) than about technical excellence.

Glibc change exposing bugs

Posted Nov 11, 2010 18:32 UTC (Thu) by xilun (guest, #50638) [Link] (2 responses)

> Deterministic observed behaviour, like it or not, will always be considered a part of the ABI.

Nonsence. The definition of the ABI is NOT any random characteristic that would please you by making others responsible for your own errors.

First if you want to program in a language (and its associated standard library) that is not full of undefined behaviors, then you don't program in C.

If you do program in C, *YOU* are responsible for respecting preconditions. The system will *NOT* magically fix your bugs for you. Glibc developpers are not responsible for the Flash software package, and this is not a free software or not free software problem; they are also not responsible for other random piece of free software, even crappy ones.

You are inversing roles, given the bad one to the developpers of the very high quality, standard compliant, piece of code the is glibc, and the good one to the constant notable piece of crap that is the flash player.

And even if you could be a dictator for the glibc project, please explain us what is the politic you would then impose to *concretely* solve the generalized unattended interactions problem between software components. Nobody has ever solved that. Even Microsoft, which you seems to cherish so much, does for *years* (maybe even decades now) ask third party developers to ship the MS libc the third party developer test his application with. So of course most of the time you don't have this problem of a suddenly changing libc under MS Windows, because each program has its own libc. Now what happens in case some version contains a security exploitable flaw? Security effort are duplicated in such an environment. (And this is just an example.)

Imagine a random application depends on a BUG of a particular version of the glibc. Because of that, you are asking for this bug to remain forever? Nonsense. This is what you call "technical excellence"? What a joke. It would mean freezing libraries forever, because that's the only way you can guarantee in the shared library model that the behavior of any random piece of crappy software won't change too much.

If you're a third party developer who only cares about making your proprietary app sort of working even when you write highly faulty unacceptable quality code, please: 1/ ship it with frozen version of the library it needs, like you do under windows anyway 2/ leave the developers of libraries that are trying to improve them alone, and especially do not report YOUR mistake on them.

I'm not sure nye meant all regressions need to be avoided.

Posted Nov 12, 2010 2:50 UTC (Fri) by gmatht (guest, #58961) [Link] (1 responses)

> So of course most of the time you don't have this problem of a suddenly changing libc under MS Windows, because each program has its own libc. Now what happens in case some version contains a security exploitable flaw? Security effort are duplicated in such an environment. (And this is just an example.)

This seems almost a strawman. Criticize Microsoft's policy if you want, but all nye suggested was using symbol versioning for this particular known-to-be-dangerous change. This wouldn't cause any security issues (and may even avoid some).

I agree that it isn't possible to avoid every regression. For example, newer software often has performance regressions on old hardware. However, this seems like a particularly serious regression, so if there was an easy way to stop old versions of software silently corrupting data it may be worth taking.

I'm not sure nye meant all regressions need to be avoided.

Posted Nov 12, 2010 11:13 UTC (Fri) by xilun (guest, #50638) [Link]

Symbol versioning is when the API/ABI changes. Here, it has not.

Glibc change exposing bugs

Posted Nov 11, 2010 0:00 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

Note that glibc broke the JVM back in glibc 2.3.x days, when it made its internal symbols private and suddenly the JVM was discovered to have been relying on internal libc symbols. That made it not *start* at all, and at that point it was very closed-source, so nobody could fix it. Inevitably there were calls to revert that change and make all the glibc symbols public again. Sensibly, the glibc developers didn't do that: instead, the idiots using internal libc symbols fixed their bugs.

The same will happen here.

Glibc change exposing bugs

Posted Nov 11, 2010 0:08 UTC (Thu) by lmb (subscriber, #39048) [Link] (2 responses)

A start-up failure is acceptable. A straight crash might be OK, if it is unavoidable (even though then I'd already worry). Data corruption - compared to previous behavior - is not.

Of course it will cause the code to be fixed. But that the maintainers of the core system library place "I am right" above users's data is a worrying insight.

Glibc change exposing bugs

Posted Nov 11, 2010 0:26 UTC (Thu) by bojan (subscriber, #14302) [Link]

> Data corruption - compared to previous behavior - is not.

Now you're making glibc maintainers responsible for other people's bugs. They are not.

If this same buggy program was linked against some other library that implements memcpy() similarly to the way latest glibc does, the data would be just as corrupt.

In essence, it is the program that is corrupting the data, not glibc. And it's doing so by clear misuse of a function.

> But that the maintainers of the core system library place "I am right" above users's data is a worrying insight.

I think that's a bit overly dramatic. Fedora 14 is a fresh release, currently carrying a non-released version of glibc. As such, users of it (which includes me) sometimes encounter things that are surprising at first. But the audience is limited and the impact is not earth shattering.

Glibc change exposing bugs

Posted Nov 11, 2010 10:53 UTC (Thu) by nix (subscriber, #2304) [Link]

The glibc maintainers are much harsher than that. Any change which is implementation-defined in POSIX or ISO C's library functions is fair game for them to change, and they have very harsh words for people who complain (and explicitly don't care about breaking closed-source code that depends on such assumptions). Expecting them to insert slowing-down hacks for actively *undefined* stuff, given their somewhat cavalier attitude to implementation-defined stuff, is peculiar. (Their attitude to stuff that *is* defined is appropriately rigid: thou shalt not break it.)

Glibc change exposing bugs

Posted Nov 10, 2010 22:55 UTC (Wed) by clugstj (subscriber, #4020) [Link]

It wasn't "unspecified behavior", the manpage, in the first paragraph, says to not do this!

Glibc change exposing bugs

Posted Nov 10, 2010 23:56 UTC (Wed) by nix (subscriber, #2304) [Link]

The code happened to work on Linux. It certainly wouldn't have worked on a lot of other OSes, even Unixes. It's undefined behaviour: it might be broken at any time, without warning. And now it has. Compiler optimizations could perfectly well have broken it instead: perhaps we should compile everything -O0 to remove the chance of that.

Emitting a warning to the logs is far too expensive: this stuff is so often called that the compiler sometimes open-codes it! Adding a conditional in there that isn't absolutely needed would have horrible effects on performance.

(And as for detecting it at compile time, well, sure! It requires whole-program optimization of every single program and all its shared libraries, and even then detecting it reliably reduces to solving the halting problem. This seems to be rather harder than just valgrinding the bloody thing and learning elementary C before you write it.)

(Sure, there will be actual bugs teased out by this: code that didn't expect to receive overlapping regions when it was written, but that now is. But, guess what? I bet those overlapping copies were causing other bugs, because it is surely rare for code to just memcpy() from region A to *unexpectedly*-overlapping region B and then never do anything with A again.)