LWN.net Logo

Glibc change exposing bugs

Glibc change exposing bugs

Posted Nov 10, 2010 20:03 UTC (Wed) by corbet (editor, #1)
In reply to: Glibc change exposing bugs by rodgerd
Parent article: Glibc change exposing bugs

That's a germane example, actually. "Poo flinging" notwithstanding, the kernel developers fixed things so that applications would not lose data even if they weren't following standard behavior. Not breaking things was seen as more important than doing something because the posted rules say you can.

I don't believe that Linus (or anybody else) is saying that the broken applications are not buggy. What I'm hearing is that those applications have worked for years and that people should think for a long time before introducing a change which breaks them. Thus, Linus asks: what's the benefit that justifies such a change? I think it's a reasonable question.


(Log in to post comments)

Glibc change exposing bugs

Posted Nov 10, 2010 20:10 UTC (Wed) by jwb (guest, #15467) [Link]

There are a huge variety of improvements to Linux which have broken or will break Flash Player, for example Flash abused the ALSA API until Pulse came along and exposed that abuse.

Glibc change exposing bugs

Posted Nov 10, 2010 20:38 UTC (Wed) by neilbrown (subscriber, #359) [Link]

This all sounds like a very strong recommendation in favour of Rusty Russell's Maxim of API development: APIs should be hard to misuse. memcpy, and apparently ALSA, are easy to misuse.

So implementing memcpy as memmove - which Linus says in the bugzilla threads is largely what the kernel does - sounds very sensible. memmove is much harder to misuse.

Glibc change exposing bugs

Posted Nov 13, 2010 1:07 UTC (Sat) by rriggs (subscriber, #11598) [Link]

memmove: safe, fast, verbose function name
memcpy: unsafe, at least as fast as memmove, one less character to type

Which one do you think your average C programmer will choose?

Which one do you think new programmers are taught to use (in schools that still teach C programming)?

Glibc change exposing bugs

Posted Nov 13, 2010 2:52 UTC (Sat) by neilbrown (subscriber, #359) [Link]

So we can save the world by creating a 'memmv' in glibc which aliases memmove? Brilliant!

Glibc change exposing bugs

Posted Nov 15, 2010 16:43 UTC (Mon) by renox (subscriber, #23785) [Link]

Too late!

And memcpy should also be named as mem_unsafe_copy, but yes if you tell developers to use safe function by default and to optimize only when they can show benchmarks that the optimisation will make a difference, then yes, you'd get probably better software (if a bit slower).

Glibc change exposing bugs

Posted Nov 25, 2010 15:13 UTC (Thu) by Spudd86 (guest, #51683) [Link]

It's not so much that ALSA is easy to misuse (although it probably is), it's that certain parts of it are impossible to emulate from userspace. An app that actually NEEDS those parts is NOT misusing the API when it uses them (for example, pulseaudio itself uses those bits).

The problem is that most apps don't actually need the those bits, so they just needlessly break software like pulseaudio (and also break on bluetooth audio too).

Pulseaudio does use those unemulatable APIs, but it also falls back if they don't work, and it has good reasons to use those APIs (so it can hand over large chunks of audio data, but still be able to decide it wants to change that same data later (if for example something else starts playing audio), this saves you power because pulse won't wake your CPU as much, but it also uses APIs that don't emulate well AND until pulse came along nobody ever tried to do that sort of thing so it broke)

Glibc change exposing bugs

Posted Nov 25, 2010 16:07 UTC (Thu) by foom (subscriber, #14868) [Link]

If the documentation wasn't so terrible, this probably wouldn't be a problem. It doesn't give *any* clue that, for example, a developer shouldn't use the mmap functions. In fact it makes it sound like you should use them, because they're zero-copy (and that's better, right?)

Glibc change exposing bugs

Posted Nov 25, 2010 16:39 UTC (Thu) by Spudd86 (guest, #51683) [Link]

Well yea, but Lennart Pottering does have a blog post where he says exactly what subset of ALSA's API you should restrict yourself to, perhaps someone should put that into the ALSA docs.

Glibc change exposing bugs

Posted Nov 10, 2010 20:56 UTC (Wed) by clugstj (subscriber, #4020) [Link]

Even if there isn't currently-demonstrable benefit, there could be in the future, so why not get the buggy code fixed now instead of later? Yes, it's a very fine line, but, just my opinion, I don't have a problem w/ GlibC not reverting the change.

Glibc change exposing bugs

Posted Nov 10, 2010 22:24 UTC (Wed) by lmb (subscriber, #39048) [Link]

Because the user whose data has just been corrupted or whose important business meeting presentation just crashed or whose mail has been eaten no longer cares, and has switched to a less hostile platform.

That behavior is undefined makes one only right as far as technicality is concerned; it does not imply that changing it silently is good software engineering practice, nor that it is right in terms of software providing a service to users.

Glibc change exposing bugs

Posted Nov 10, 2010 23:58 UTC (Wed) by nix (subscriber, #2304) [Link]

But the compiler makes undefined stuff break all the time, and the set of undefined stuff which is broken is changed by all sorts of things. Nobody complained when LTO came in, although it surely broke programs relying on numerous instances of undefined behaviour which had been harmless before due to wider optimization opportunities when optimizing across translation units. So why complain about this? Just because Flash was affected?

Glibc change exposing bugs

Posted Nov 11, 2010 2:29 UTC (Thu) by foom (subscriber, #14868) [Link]

Because changes in the compiler don't break already-installed working binaries. They break newly compiled versions of software. Presumably such newly-compiled software gets tested, and if there's a problem, the program is perhaps recompiled with an older version of the compiler until the issue is fixed.

Glibc change exposing bugs

Posted Nov 11, 2010 7:29 UTC (Thu) by nix (subscriber, #2304) [Link]

That's a very large assumption indeed. When glibc gets recompiled, is everything on the distro tested? When libpng gets recompiled (every other week), is everything that uses it tested? I doubt it.

Glibc change exposing bugs

Posted Nov 11, 2010 17:30 UTC (Thu) by foom (subscriber, #14868) [Link]

Shrug, yet still, even if it was only discovered sometime later...if there was a bug in the new libpng binary that only appeared because it was compiled with a new gcc, it's still a bug in the new libpng binary that can be fixed by uploading yet another new libpng binary.

Here we have a new bug in flash which appeared without a new version of the flash binary being uploaded. It's a substantively different situation.

Glibc change exposing bugs

Posted Nov 12, 2010 7:12 UTC (Fri) by hozelda (guest, #19341) [Link]

If you update your libpng then the corruption already happened just as if you update glibc. [But the odds grow problems will arise when you update glibc because of its vast use]

If you are that worried, you should work off stable versions or off a stable distributor that will manage this for you. You should not change key parts of the system if possible. glibc is a very key part. You should not update for optimizations, at least not without significant tests and only if you think it's worthwhile the gains. Stick to security updates or when a crucial problem has been solved.

Anyway, when an important "bug" like this comes up, projects should audit the code. In this case, the possible entry points to potential problems can be identified quickly for many projects (just search for memcpy).

The case of glibc involves well-defined standards. Most libraries do not have such carefully defined semantics, and we must rely on access to source code for the juicy bits.

OK, despite what I just said, if the gains here are not that useful, glibc should revert, at least for the time being. Reverting should not hurt those that adjusted already and will save those that have not. On the other hand, when will be the right time to change? Will people remember to fix this problem or will we just have a repeat later on? [Again, if the gains are negligible, the change in glibc should probably be avoided.]

Glibc change exposing bugs

Posted Nov 11, 2010 0:50 UTC (Thu) by MattPerry (guest, #46341) [Link]

> That behavior is undefined makes one only right as far as technicality
> is concerned;

But it is defined. The man page says not to use that function on overlapping regions. That applications ignored that and still functioned for so long is more a matter of good luck. That luck has run out due to their poor implementations and they should now be fixed.

Glibc change exposing bugs

Posted Nov 11, 2010 1:15 UTC (Thu) by Lovechild (guest, #3592) [Link]

Perhaps if somehow one could get it to emit a warning message instead of crashing that might work. For now it might be best done in testing environments such as being enabled perhaps during the development cycle of a distribution.

Glibc change exposing bugs

Posted Nov 11, 2010 2:41 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

the kernel developers fixed things so that applications would not lose data even if they weren't following standard behavior
What some filesystem developers propose applications do isn't defined by any standard. POSIX, SuS, and so on don't state what happens after a crash, fsync() or not. The argument was over what to do in certain circumstances outside any standard. The argument was must muddled because one said kept claiming that its brand of brain damage was endorsed by the standard. Fortunately, sanity prevailed. Calling fsync() after every rename would have inconvenienced application developers and decreased performance.

memcpy, on the other hand, is clearly described by the relevant standards. Application developers deserve what they get.

Glibc change exposing bugs

Posted Nov 11, 2010 8:07 UTC (Thu) by bojan (subscriber, #14302) [Link]

> The argument was must muddled because one said kept claiming that its brand of brain damage was endorsed by the standard.

He, he... Nice try :-)

Nothing could be further from the truth. The problem is that the standard doesn't _specify_ in which order things should happen on the underlying FS, which then gives implementers the ability to implement _any_ order (which they do). Relying on a _particular_ order (which is completely undocumented, of course) by application writers is the problem.

Suggestion about specification not dealing with crashes is irrelevant, because, once again, it doesn't specify _any_ behaviour. In other words, if you FS is hosed completely after a crash, that OK. If it's half hosed, that's OK too. If it's completely OK, that's OK as well. Obviously, the _interesting_ case is when it's completely OK, in which case the _implemented_ ordering actually makes a difference. And, once again, _any_ ordering is OK, because the standard specifies _none_.

The only difference between this and the memcpy() fiasco is that in the case of rename() folks may get an _impression_ that the operation is atomic on the FS level, because it is atomic as viewed from the processes currently running on the system. Of course, this is documented nowhere, but is a common misreading of the standard.

With memcpy() it is quite clear overlapping regions should be copied with memmove().

Glibc change exposing bugs

Posted Nov 11, 2010 8:32 UTC (Thu) by Mook (guest, #71173) [Link]

Funnily enough... that has to do with glibc too. In particular, its manual on rename(): http://www.gnu.org/s/libc/manual/html_node/Renaming-Files...

Yes, glibc's rename() API guarantees atomic renames. Since normal applications do not make syscalls directly, but call the libc API to do it on their behalf, they are not to blame.

Glibc change exposing bugs

Posted Nov 11, 2010 8:46 UTC (Thu) by bojan (subscriber, #14302) [Link]

And even more "funily", glibc doesn't deal with file system implementation (i.e. the persistence of the change) at all. In fact, that very page you pointed to states that strange things may indeed happen after a crash.

The atomicity of rename() refers to a view from the running system and not much else. But it has sure been misread a lot :-)

Glibc change exposing bugs

Posted Nov 11, 2010 9:06 UTC (Thu) by Mook (guest, #71173) [Link]

Hmm, odd; I parse "If there is a system crash during the operation, it is possible for both names to still exist; but newname will always be intact if it exists at all. " as "the file named by the destination will either not exist, or have some sort of sensible value, but not be truncated at zero bytes unless that was one of the two inputs".

Glibc change exposing bugs

Posted Nov 11, 2010 9:52 UTC (Thu) by bojan (subscriber, #14302) [Link]

You are confusing file names (i.e. what is recorded in the directory) with contents of files.

Glibc change exposing bugs

Posted Nov 11, 2010 13:49 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

"intact" seems to refer to the contents?

Glibc change exposing bugs

Posted Nov 11, 2010 23:05 UTC (Thu) by bojan (subscriber, #14302) [Link]

Suppose there are two entries in the directory, with oldname being renamed to newname, and each (obviously) pointing to an inode. If the system crashes during the rename, it is possible that both will survive (because the directory was not committed to disk yet).

What glibc docs are talking about is that rename() is not implemented by copying content of the oldname to newname. So, if there was newname before rename and the directory commit doesn't go through, the content of newname will not be changed. It is a pure directory operation. On the other hand, if the directory gets committed, there will be just newname there, pointing to whatever content oldname had. All of that is if your FS knows how to survive a crash - otherwise situation is not interesting (well, unless you're the sysadmin recovering the mess :-).

Now note the situation from the ext4 "problem". The oldname content was not fsync()-ed to disk before the rename(). Ergo, when the directory got committed, oldname became newname on disk, pointing to zero bytes, due to delayed allocation. This has nothing to do with the fact that on unsuccessful (i.e. not committed before the crash) rename(), both oldname and newname would remain in the directory.

Glibc change exposing bugs

Posted Nov 12, 2010 7:12 UTC (Fri) by Mook (guest, #71173) [Link]

Thank you for the clear explanation! It does clearly say that I'm wrong :)

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds