OLS: On how user space sucks
Dave set out to reduce the time it took his Fedora system to boot. In an attempt to figure out what was taking so long, he instrumented the kernel to log certain basic file operations. As it turned out, the boot process involved calling stat() 79,000 times, opening 27,000 files, and running 1,382 programs. That struck him as being just a little excessive; getting a system running shouldn't require that much work. So he looked further. Here are a few of the things he found:
- HAL was responsible for opening almost 2000 files. It will read
various XML files, then happily reopen and reread them multiple
times. The bulk of these files describe hardware which has never been
anywhere near the system in question. Clearly, this is an application
which could be a little smarter about how it does things.
- Similar issues were found with cups, which feels the need to open the
PPD files for every known printer. The result: 2500 stat()
calls and 400 opens. On a system with no attached printer.
- X.org, says Dave, is "awesome." It attempts to figure out where a
graphics adapter might be connected by attempting to open almost any
possible PCI device, including many which are clearly not present on
the system. X also is guilty of reopening library files many times.
- Gamin, which was written to get poll() loops out of
applications, spends its time sitting in a high-frequency
poll() loop. Evidently the real offender is in a lower-level
library, but it is the gamin executable which suffers. As Dave points
out, it can occasionally be worthwhile to run a utility like
strace on a program, even if there are no apparent bugs. One
might be surprised by the resulting output.
- Nautilus polls files related to the desktop menus every few seconds,
rather than using the inotify API which was added for just this
purpose.
- Font files are a problem in many applications - several applications
open them by the hundred. Some of those applications never present
any text on the screen.
- There were also various issues with excessive timer use. The kernel blinks the virtual console cursor, even if X is running and nobody will ever see it. X is a big offender, apparently because the gettimeofday() call is still too slow and maintaining time stamps with interval timers is faster.
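The inotify interface mentioned in the Nautilus item can be exercised directly. Here is a minimal, Linux-only sketch using ctypes; the watched directory and file name are invented for the example, and the IN_CREATE constant is copied from <sys/inotify.h>:

```python
# Watch a directory with inotify instead of polling it: the kernel
# queues an event when something changes, so the process can sleep.
import ctypes
import os
import select
import tempfile

IN_CREATE = 0x00000100  # value from <sys/inotify.h>

libc = ctypes.CDLL(None, use_errno=True)

watched = tempfile.mkdtemp()          # stand-in for a menu directory
fd = libc.inotify_init()
assert fd >= 0
wd = libc.inotify_add_watch(fd, watched.encode(), IN_CREATE)
assert wd >= 0

# Create a file; an IN_CREATE event is queued immediately, with no
# need for the watcher to re-read anything every few seconds.
open(os.path.join(watched, "new.desktop"), "w").close()

r, _, _ = select.select([fd], [], [], 1.0)
print("event ready:", bool(r))
os.close(fd)
```

The select() call returns as soon as the event is queued; a process doing this sits idle until something actually changes.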
There were more examples, and members of the audience had several more of their own. It was all great fun; Dave says he takes joy in collecting train wrecks.
The point of the session was not (just) to bash on particular applications, however. The real issue is that our systems are slower than they need to be because they are doing vast amounts of pointless work. This situation comes about in a number of ways; as applications become more complex and rely on more levels of libraries, it can be hard for a programmer to know just what is really going on. And, as has been understood for many years, programmers are very bad at guessing where the hot spots will be in their creations. That is why profiling tools so often yield surprising results.
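The point about profilers yielding surprises is easy to try at home; here is a small sketch with Python's cProfile (the toy functions are invented for the example, and the relative cost of the two phases is only visible in the profile):

```python
# Profile a toy two-phase program and print the summary line; which
# phase dominates is exactly the kind of thing programmers guess wrong.
import cProfile
import io
import pstats

def build():
    return [x * x for x in range(50_000)]

def format_rows(rows):
    return [f"{i:08d}: {r}" for i, r in enumerate(rows)]

pr = cProfile.Profile()
pr.enable()
formatted = format_rows(build())
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue().strip().splitlines()[0])
```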
Programs (and kernels) which do stupid things will always be with us. We
cannot fix them, however, if we do not go in and actually look for the
problems. Too many programmers, it seems, check in their changes once they
appear to work and do not take the time to watch how their programs
work. A bit more time spent watching our applications in operation might
lead to faster, less resource-hungry systems for all of us.
Index entries for this article
Conference: Linux Symposium/2006
Posted Jul 20, 2006 22:37 UTC (Thu)
by cventers (guest, #31465)
[Link] (7 responses)
I guess I'm not at all surprised that some applications behave so poorly... this is what tends to happen when you stack layers and layers of abstraction on each other.
One of the reasons I find programming _so_ entertaining is because I am challenged by the task of creating the most stable and incredibly efficient solution I can think of. I guess the same is not true of everyone :)
What this does tell me is that there is tons of low hanging fruit if we want to improve power efficiency and performance. So, great hackers, let's get to work!
Posted Jul 20, 2006 23:03 UTC (Thu)
by tomsi (subscriber, #2306)
[Link] (2 responses)
I'm glad we have free systems -- not only is it possible to spot these
problems, it's possible for people not originally involved in the
programs to step in and fix them.
Posted Jul 21, 2006 1:17 UTC (Fri)
by sepreece (guest, #19270)
[Link] (1 responses)
I don't know about Windows, but you could do similar analysis on Unix systems, since 1990-ish, despite their being closed source. Dave's report (which was, indeed, hilarious, and a great lesson for us all) was based on observing the apps from outside (watching their file operations in particular).
Posted Jul 21, 2006 3:40 UTC (Fri)
by cventers (guest, #31465)
[Link]
Right. The difference with free software, though, is that you don't have to wait for the vendor to fix it (or hope that they care enough). If something is really itching, you're 100% empowered to scratch it.
:)
Posted Jul 21, 2006 5:27 UTC (Fri)
by flewellyn (subscriber, #5047)
[Link] (3 responses)
With respect, I'm not at all sure that layers of abstraction are really necessary to create such
a situation. In fact, in my experience, well-abstracted systems where each conceptual layer is
well-defined and efficient in and of itself, tend to have less of these sorts of problems going on.
I've found that inefficiencies and bogosities of the sort described tend to crop up more in poorly
abstracted systems, or where the abstractions are "leaky", exposing too many internals to the
next level up. I know this because I have built some. As a custom web application developer, I have created a number of large, complex systems
that do very complicated things; in the process, I have had to reimplement and refactor many of
the earlier generations of these systems, because as an application developer in general, I am
prone to the very human desire to just get it working. While I try to make sure the applications
are properly abstracted, oftentimes the combination of deadlines and the exploratory nature of
the process means that I don't do this as well as I should. This results in system components, libraries, and abstraction layers which at times can be
brilliant coups of engineering, and at other (far too frequent) times turn out to be astounding
feats of perverse stupidity. In rereading old code, I think for every time I have said "I did that?
Damn, that's cool!", I have also said "What the HELL was I thinking?" Of course, sometimes the design goals are poorly specified, or very broad, or else there are
no efficient methods of doing something at the moment, and you just have to wing it with an
inferior, but at least functional, solution. That may account for some of the cases above. And
then there is the issue of programs whose very purpose is misguided. That may account for one
of the other cases given above.
Posted Jul 21, 2006 9:21 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
If you're not told how expensive some function call is, the only way to tell is to profile the hell out of it *on a system where n happens to be large* (so fontconfig sloth might not be obvious unless, like davej, you have many thousands of fonts), or to look deeply into the internal implementations of every function you ever use (no chance).
What this is really telling us is that we need better docs, I think.
That's one thing the C++ standard library got right, which I wish other interface designers would follow: treat the time and space complexities as *part of the interface*, document them, and *do not increase them*.
Posted Jul 21, 2006 13:10 UTC (Fri)
by oak (guest, #2786)
[Link]
> If you're not told how expensive some function call is, the only way to
> tell is to profile the hell out of it *on a system where n happens to be
> large* (so fontconfig sloth might not be obvious unless, like davej, you
> have many thousands of fonts)
Actually, when I installed Breezy on a P166 half a year ago, I found that simple Xlib based programs (xeyes, xclock...) actually took twice as long (10 secs) to start as e.g. Gnome calculator or Abiword, which were using fontconfig through Xft through Pango. So I would say that it's an improvement over what was before. :-)
When I straced the Xlib programs, most of the time seemed to be going to loading information about Asian bitmap based X fonts. It would have been nice if the other bitmap fonts (for example everything besides the cursor and fixed fonts) had been in a separate package that is installed only
Posted Jul 21, 2006 17:40 UTC (Fri)
by Tet (guest, #5433)
[Link]
Well, yes. But also better tools. SystemTap is a perfect example of a
"better tool" in this case. It lets you look at problems at a system-wide
level, rather than on a per-process basis, and you'd be amazed at some of
the things that show up, in places you're least expecting. It turns out that
on my desktop, ioctl() and poll() are called more than any other system
call, by an order of magnitude -- where intuitively (and based on experience
on previous systems), I'd have expected
gettimeofday(). SystemTap provides an easy way to track
down the culprit, too (in this case, the java_vm, even when no applet is
running -- but being closed source, there's sadly nothing that can be done
to fix it).
Posted Jul 21, 2006 0:40 UTC (Fri)
by Tara_Li (guest, #26706)
[Link] (2 responses)
Posted Jul 21, 2006 9:11 UTC (Fri)
by pebolle (guest, #35204)
[Link] (1 responses)
Posted Jul 27, 2006 17:15 UTC (Thu)
by lockhart (guest, #31615)
[Link]
Or see the individual paper at http://ols2006.108.redhat.com/
Posted Jul 21, 2006 1:11 UTC (Fri)
by yusufg (guest, #407)
[Link] (3 responses)
Dave, if you are reading this and you've filed bug reports, can you link to them please?
Posted Jul 21, 2006 3:44 UTC (Fri)
by cventers (guest, #31465)
[Link]
Having a super-robust and massively efficient kernel is only half the
Asking "where's the bug reports" has always seemed to me like a defensive
Posted Jul 21, 2006 12:32 UTC (Fri)
by arjan (subscriber, #36785)
[Link]
Posted Jul 21, 2006 21:39 UTC (Fri)
by davej (subscriber, #354)
[Link]
Lots of the examples I covered are already fixed, but there are still a number of outstanding issues. I'll be rerunning the tests some time soon, and see what's left that sticks out, but based on data I collected not so long back, we're now doing a *lot* better than we used to.
The initial tests the paper was based on were done on Fedora Core 5 test1, IIRC; I did some quick stats on an FC5 final release, and the number of reads/stats/execs was pretty much halved, even though we had added more functionality, a few extra daemons, etc.
I'll be looking at this stuff again as we get closer to FC6.
Posted Jul 21, 2006 6:50 UTC (Fri)
by ekj (guest, #1524)
[Link] (7 responses)
Case in point: Fedora, Mandriva and Ubuntu all install pcmcia by default, even when the computer is a stationary machine that does not even have a pcmcia-slot. They also enable the pcmcia startup-script in the normal runlevels. The script is over 300 lines long. It contains 57 conditionals (ifs, elses, cases) and, among other things, calls a script named laptop-detect that tries to guesstimate if we're on a laptop (by messing around in proc looking for a batteries-file, among other things).
Now, most computers don't turn into laptops overnight. It would be perfectly possible to do this detection *once* on installation, and thereafter simply not install pcmcia-stuff if we don't actually have that hardware. It would even be reasonable.
Yes, it's "convenient" to have new/changed hardware autodetected and auto-working on first boot after installation. I'm not sure it's worth it though. You could skip a *LOT* of startup if you simply assumed this boot was going to be exactly like the last one. There could be a big fat option in the boot-menu saying: "Configure new hardware" which would do what the bootup-scripts do *every* time now.
Posted Jul 21, 2006 8:41 UTC (Fri)
by Thalience (subscriber, #4217)
[Link] (2 responses)
I'd take reliable hardware auto-detection over a static configuration at any reasonable cost, just as a matter of personal preference.
Posted Jul 27, 2006 7:54 UTC (Thu)
by ekj (guest, #1524)
[Link] (1 responses)
But bootup-time (measured from when GRUB loads the kernel until the last startup-script finishes) improved from 1:48 to 1:32 simply by disabling kudzu (which does detection of new hardware).
Uninstalling packages that were auto-installed without question, and that support hardware I don't have, shaved another 10 seconds off that, to approximately 1:23.
1:23 and 1:48 ain't that hugely different, but it *does* mean a 30% increase in bootup-time for trying to detect hardware I don't have on every bootup.
Some hardware is regularly plugged in and out. It's reasonable (and good) to try to detect such hardware. But that should happen on the fly, and not as part of a bootup-script. After all, the user may very well plug in a usb-stick or whatever *after* logging in.
Other hardware (like for example a pcmcia-slot) is rather unlikely to suddenly appear. I'm guessing that 99.9% of the computers that don't have it when the distro is installed, will *never* have it.
Posted Jul 27, 2006 17:26 UTC (Thu)
by Thalience (subscriber, #4217)
[Link]
I still think, however, that ditching auto-detection would be to throw the baby out with the bathwater. By focusing more attention on hot-plug style auto-detection, we can have both fast startup and reliable, effortless hardware support.
As an aside, I've always thought that Kudzu was a very appropriate name for that particular piece of software. :)
Posted Jul 21, 2006 10:06 UTC (Fri)
by kleptog (subscriber, #1183)
[Link]
Nowadays most of the pcmcia stuff has been subsumed into udev, so this may not even be relevant anymore...
Posted Jul 24, 2006 23:27 UTC (Mon)
by cjwatson (subscriber, #7322)
[Link] (2 responses)
Per Olofsson simplified that init script a fair bit in Debian; Edgy has that simplification, and e.g. no longer calls laptop-detect.
It's also worth noting that 'case' in shell scripts doesn't spawn a subprocess, unlike 'if [ ... ]', so simply counting conditionals doesn't always give you a fair picture.
Posted Jul 25, 2006 3:59 UTC (Tue)
by Richard_J_Neill (subscriber, #23093)
[Link] (1 responses)
$ date; for ((i=0; i<10000; i++)) ; do if [ a = b ] ; then c=d ; fi ; done ;date
$ date; for ((i=0; i<10000; i++)) ; do if test a = b ; then c=d ; fi ; done ;date
$ date; for ((i=0; i<10000; i++)) ; do if /usr/bin/test a = b ; then c=d ; fi ; done ;date
It's a huge difference! The script with the builtins runs 60 times faster.
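The gap being measured here is mostly fork()+exec() overhead. A rough in-process analogue of the same comparison (iteration counts are arbitrary, and /usr/bin/test is assumed to be the coreutils binary):

```python
# Compare an in-process comparison (what a shell builtin amounts to)
# against spawning /usr/bin/test for every check.
import subprocess
import timeit

in_process = timeit.timeit(lambda: "a" == "b", number=10_000)
forked = timeit.timeit(
    lambda: subprocess.call(["/usr/bin/test", "a", "=", "b"]),
    number=100,
)

per_builtin = in_process / 10_000
per_fork = forked / 100
print(f"per comparison: {per_builtin:.2e}s in-process, {per_fork:.2e}s forked")
```

The forked version loses by several orders of magnitude per comparison, which is the same effect the shell loops above demonstrate.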
Posted Jul 29, 2006 15:43 UTC (Sat)
by kreutzm (guest, #4700)
[Link]
Posted Jul 21, 2006 7:37 UTC (Fri)
by nix (subscriber, #2304)
[Link] (1 responses)
The tradeoff might well have changed since then.
Posted Jul 21, 2006 9:43 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Jul 21, 2006 10:22 UTC (Fri)
by tcabot (subscriber, #6656)
[Link] (3 responses)
http://mail.gnome.org/archives/nautilus-list/2001-August/...
I find that developers (at least here in the US) tend to have very fast machines and upgrade them frequently. Do you think that this might be one of the causes, i.e. the developers might not notice these issues since their HW is so fast? If that's the case then I'd expect OLPC to provide a lot of useful feedback. Running in a resource-constrained environment tends to make efficiency problems more "itchy" than they are on monster HW.
Posted Jul 21, 2006 11:20 UTC (Fri)
by Viddy (guest, #33288)
[Link] (1 responses)
Wow. I'm not a particularly good coder, but my understanding is that if it polls like a large chunk of the desktop programs do, generally, you've done it wrong. I know of the sleep() and usleep() functions, and I'm pretty sure I could figure out callbacks.
It seems that the cups daemon responds with http headers several times per second to none other than... gnome-cups-icon. I mean, it's nice to see printers appearing when the cups daemon finds them on the network, but checking multiple times per second?
The update-notifier spits out craploads of file reads per second. I'm pretty sure that my ubuntu repositories don't change _that_ often :)
The upshot of this is that I'm now irritated enough to start downloading source and submitting patches.
The part that gets me is that neither of my machines are that slow, and yet, under a linux desktop, it feels like I'm trying to push mud around with my mouse. I want my desktop to feel snappy.
Posted Jul 28, 2006 17:45 UTC (Fri)
by sandmann (subscriber, #473)
[Link]
Uh, are you saying that using poll() is wrong?
Posted Jul 21, 2006 15:10 UTC (Fri)
by wilck (guest, #29844)
[Link]
I guess the same will happen to me with Dave's findings, unless LWN some time publishes a follow-up titled "user space doesn't suck no more" ...
Posted Jul 21, 2006 13:00 UTC (Fri)
by NAR (subscriber, #1313)
[Link] (24 responses)
Exactly when was the inotify API added? 2.6.x or 2.4.x?
Posted Jul 21, 2006 13:20 UTC (Fri)
by corbet (editor, #1)
[Link] (23 responses)
Posted Jul 21, 2006 13:44 UTC (Fri)
by pizza (subscriber, #46)
[Link] (22 responses)
Now what about the other platforms that Gnome has to run on? How long have they had inotify? Have they ever? Will it work the same way?
When one of your project goals is to be portable, you really do need to code to the least-common-denominator APIs. Special-case code paths add greatly to software complexity and make debugging more difficult.
Yes, userspace often does a lot of dumb things, but "not taking advantage of bleeding-edge kernel features" isn't usually one of them.
Posted Jul 21, 2006 13:53 UTC (Fri)
by arjan (subscriber, #36785)
[Link] (3 responses)
So there is quite reasonable infrastructure for this in gnome.. just it's not being used consistently
Posted Jul 21, 2006 17:02 UTC (Fri)
by sepreece (guest, #19270)
[Link] (2 responses)
Posted Jul 22, 2006 9:51 UTC (Sat)
by drag (guest, #31333)
[Link] (1 responses)
For instance with famd, if I had a mount point or something like that in my home directory then it would crap out if I tried to go more than 2 directories deep, and basically cause anything to do with gnome that concerns files (nautilus mostly) to break.
With gamin there is no problem.
I think that a huge part of the problem we have with performance on Linux desktop nowadays is that everybody was scrambling to get just the basics in place and everything more or less working.
Hal/Dbus/X.org/inotify(and it's userspace helpers)/desktop search stuff/udev.. etc etc. All of it is thrown together and made to 'make it just work'.
Now it seems that the push is going towards making 'make it work well'. Filling out the blanks, improving performance. That sort of thing.
Posted Jul 27, 2006 9:35 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Apparently its inability to send notifications to other copies of itself over the network is a *feature*, but given that you're using NFS or a similar fs in any case, I can't imagine what extra security threats could be opened by sending notifications around. (FAM could do this.)
Posted Jul 21, 2006 15:18 UTC (Fri)
by cventers (guest, #31465)
[Link] (17 responses)
It's totally possible to build platform-independent code (hell, the
We depend on the huge mess of scripts known as "autom4te" so much these
Posted Jul 21, 2006 16:04 UTC (Fri)
by nix (subscriber, #2304)
[Link] (4 responses)
However, this is perfectly doable.
Posted Jul 21, 2006 16:45 UTC (Fri)
by cventers (guest, #31465)
[Link] (1 responses)
But yes, just attempt inotify at startup. -ENOSYS? Ok, we'll try this
Posted Jul 21, 2006 18:31 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Hm. Looking at the sources, gamin has had an inotify backend since v0.0.8, Aug 26 2004, *long* before inotify hit the kernel proper. It is enabled by default.
Looks like this might be an out-and-out bug. I'll have a look this weekend and see if I can reproduce and fix it.
Posted Jul 23, 2006 17:51 UTC (Sun)
by NAR (subscriber, #1313)
[Link] (1 responses)
Yes, but I'm afraid this is way above the average application programmer's level.
Posted Jul 23, 2006 17:55 UTC (Sun)
by cventers (guest, #31465)
[Link]
Posted Jul 21, 2006 16:08 UTC (Fri)
by pizza (subscriber, #46)
[Link] (11 responses)
* Software outlives hardware, by several orders of magnitude. You really weaken your argument by trying to draw parallels there -- especially when modern distros *still* build userland for a stock i386.
* "least common denominator" gives you the greatest coverage with the least effort. Additional effort should be focused on where it does the most good, and that call is (hopefully) made by those who know the bigger picture and/or do the actual work. (I'd agree that inotify support is a promising candidate, but I'm just an armchair general)
* Different APIs can require radically different software architectures; it's not a matter of "writing an autom4te test"; someone has to actually write a non-trivial pile of non-trivial code, while leaving the existing path intact as a run-time fallback and maintaining complete backwards compatibility (source, binary, and behavioral) for the APIs that Gnome exports.
So while yes, the "least common denominator" argument sucks, it's not the suckiness of the argument itself, but rather the suckiness of the *reality* that the argument represents.
"Optimization without instrumentation is just mental masturbation"
Posted Jul 21, 2006 16:42 UTC (Fri)
by cventers (guest, #31465)
[Link] (8 responses)
The problem with the least common denominator argument isn't really the
Furthermore, the fact that different systems require different code to be
When you choose to support multiple systems, you should be ready to write
> "Optimization without instrumentation is just mental masturbation"
I've never much been a fan of that argument either, because it's often
Put another way: I would like to think that any reasonably talented
These arguments (the least common denominator and the no optimization
It seems like a perfectly acceptable bargain, and on some level it is. (I
I'm sure not all of Dave's identified misbehaviors were even apparent to
So I propose a new quote:
"Sensible optimizations give pleasure by default"
Posted Jul 21, 2006 19:18 UTC (Fri)
by pizza (subscriber, #46)
[Link] (7 responses)
Here's the bottom line -- we're not all "above average" programmers. Even when we know what "the right way" is, we usually don't have that luxury due to externally-imposed constraints.
"Cheap, fast, good. Pick two"
Posted Jul 21, 2006 20:14 UTC (Fri)
by cventers (guest, #31465)
[Link] (6 responses)
Really? I'm not sure I see how. It seems to me like you were listing
> Here's the bottom line -- we're not all "above average" programmers.
What does "average" have to do with it? It doesn't take oodles of talent
You allude to constraints but never mention what some of them might be.
> "Cheap, fast, good. Pick two"
Why pick just two? One of the greatest things about free software
This stuff isn't actually all that complicated. The problem is either
*A) No one had pointed out ways in which apps misbehave, so no one knew
So I think Dave's paper was spot-on. We should skip the 'apologizing'
Posted Jul 23, 2006 13:08 UTC (Sun)
by pizza (subscriber, #46)
[Link] (5 responses)
If you want your software to be developed "good and fast", then it's not going to be cheap. If you want it "fast and cheap" then it's not going to be all that good. If you want it "good and cheap" then it won't happen particularly quickly.
"Fast and cheap" is usually where software ends up when someone is directly footing the bill (and hence there is an upper bound on cost, aka budgets/deadlines, and "good" tends to suffer). "Good and cheap" is where F/OSS software traditionally lies, where the "it'll be done when it's done" attitude is the norm. Then we end up with the likes of NASA (or other life-critical situations), where the requirement of "good" is so important that it happens neither quickly nor cheaply.
The problem with the above generalization is that many larger F/OSS projects (including Gnome) actually fall into the first category, as the majority of the "work" is done by people required to do so, with formal goals, deadlines and budgets. F/OSS has gone up and been corporatized!
(Another glaring hole in this generalization is that "good" means different things to everyone -- In the end, only the one who is footing the bill gets to make that call -- but that is the nature of generalizations..)
Dave's ("spot on", as you put it) paper was a direct result of the idea embodied by the "no premature optimization" blurb that you took so much of an issue with. Without that instrumentation, this handful of bugs/mistakes wouldn't have likely come to light, and we wouldn't have been able to learn from them.
Posted Jul 23, 2006 17:10 UTC (Sun)
by cventers (guest, #31465)
[Link] (4 responses)
You could twist the definition of fast, cheap, and good enough to make
F/OSS is getting more and more industrialized, but depending on the
This is free software. The traditional rules of corporate development
Posted Jul 23, 2006 17:55 UTC (Sun)
by NAR (subscriber, #1313)
[Link] (3 responses)
I wouldn't call the 2.6 process "fast, cheap and good". It might be fast, but it's certainly not good (the last usable kernel for me was 2.6.14) and definitely not cheap - I'd like to know how many kernel developers are funded for their work on the kernel. I think it's not a particularly low number.
Posted Jul 23, 2006 18:00 UTC (Sun)
by cventers (guest, #31465)
[Link] (2 responses)
It's unfortunate that you've had problems since 2.6.14. What sort of
After having seen the survey conducted here on kernel quality, it would
Posted Jul 24, 2006 6:42 UTC (Mon)
by drag (guest, #31333)
[Link] (1 responses)
Lower latencies, more usable desktop. Better responsiveness. My hardware is supported out of the box on new kernels, which it wasn't for older ones. ALSA sound drivers are a huge improvement over OSS for me. With dmix I can have, get this, more than _one_ sound at a time and it doesn't sound like crap. Multimedia performance has improved.
(Of course I am still talking about the kernel here... its desktop scheduling options make life better.)
Stability has improved. Wireless support has improved. Udev makes things easier for me now that I just tell the computer what /dev files I want vs having to dig around and find the stupid major/minor numbers for everything.
Maybe if the other person were to post WHY the 2.6.15, 2.6.16 and 2.6.17 series kernels are unusable, they would have received more sympathy.
Posted Jul 24, 2006 12:27 UTC (Mon)
by NAR (subscriber, #1313)
[Link]
Your mileage may vary, but I never managed to boot my old 486 with a 2.6 kernel - fortunately it worked with 2.4. It didn't work well, the TCP connection tracking code kept tracking connections that were long gone, so the system ran out of memory, but it still worked. On the other hand, one of the two reasons I use 2.6 on my other computer is that with 2.6 I don't have to reboot between watching a DVD and burning a CD-R.
Stability has improved. Wireless support has improved. Udev makes things easier for me
Again your mileage may vary, but my computer locks up hard with every single 2.6 if I make a larger I/O operation while watching TV with xawtv - and this wouldn't make a useful bug report. I don't have wireless cards and never felt the need for dynamic /dev, so these features do not make me happy.
WHY 2.6.15, 2.6.16, 2.6.17 series kernels are unusable
Recording audio from TV doesn't work with mplayer. I've reported the bug and it's supposed to be in mplayer and supposed to be fixed, yet it still didn't work when I tried last time. So I stick with 2.6.14.
Posted Jul 21, 2006 17:42 UTC (Fri)
by vonbrand (subscriber, #4458)
[Link]
The userland is normally compiled for i386 instructions only, but scheduled (instruction selection and ordering) for i686. Code where full i686 (or whatever) makes a real difference is few and far between (and there you do get i686 packages).
Distributions (and their users!) do pay a hefty price if there are zillions of package versions by CPU type.
Posted Jul 22, 2006 6:07 UTC (Sat)
by dvdeug (guest, #10998)
[Link]
Posted Jul 21, 2006 13:31 UTC (Fri)
by oak (guest, #2786)
[Link] (1 responses)
A good example of this is following crash:
ATI driver crashed trying to probe something that didn't exist on ISA
Posted Jul 21, 2006 18:25 UTC (Fri)
by daniels (subscriber, #16193)
[Link]
(For what it's worth, I fixed the library-stat()ing in Ubuntu quite a while ago, but the patch only got half-merged into upstream because it broke a couple of things. But I fixed it again in Dave's talk, after realising that it was still sucking.)
Posted Jul 23, 2006 0:09 UTC (Sun)
by hein.zelle (guest, #33324)
[Link]
xfce4-panel: average 2% cpu usage. A strace of approximately 10 seconds shows 267 function calls to gettimeofday, and 141 to ioctl and poll.
firefox: average just over 2% cpu usage. A strace of approximately 10 seconds shows 594 calls to gettimeofday, 151 calls to poll, read and ioctl, 282 calls to futex. I have no idea why it calls gettimeofday 4 times in a row in every cycle.
Apparently cpu-unintensive polling is a more common problem than I thought it would be. Is using gettimeofday, ioctl and poll the common way to do this?
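For what it's worth, the usual alternative to a high-frequency poll() loop is to block on a real file descriptor until an event arrives. A small sketch with Python's selectors module (the socketpair stands in for whatever event source an application actually has):

```python
# Block on an fd instead of waking every few milliseconds: select()
# sleeps until data is actually available, so an idle app burns no CPU.
import selectors
import socket

sel = selectors.DefaultSelector()
rsock, wsock = socket.socketpair()
sel.register(rsock, selectors.EVENT_READ)

wsock.send(b"ping")                 # some other component produces an event
events = sel.select(timeout=5.0)    # returns as soon as data arrives
for key, _mask in events:
    data = key.fileobj.recv(4)
    print("woke up with", data)

sel.close()
rsock.close()
wsock.close()
```

With nothing pending, the same select() would simply sleep for the full timeout rather than spinning through gettimeofday()/poll() cycles.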
Posted Jul 24, 2006 21:01 UTC (Mon)
by aegl (subscriber, #37581)
[Link] (5 responses)
$ ls -l /bin/true
Infinite bloat!
But it is worse. Run strace on /bin/true, and you'll see it open and mmap a dozen locale files (and try and fail to open a dozen more).
The problems here seem to have arisen because someone decided to add "--version" and "--help" arguments. Aaargggghhhh!
Posted Jul 25, 2006 1:00 UTC (Tue)
by jonabbey (guest, #2736)
[Link] (2 responses)
Posted Jul 25, 2006 1:11 UTC (Tue)
by zlynx (guest, #2285)
[Link]
Here you go :)
Posted Aug 9, 2006 1:28 UTC (Wed)
by barrygould (guest, #4774)
[Link]
Barry
Posted Jul 25, 2006 1:40 UTC (Tue)
by joey (guest, #328)
[Link] (1 responses)
(true is also a builtin in bash and dash, but I prefer ":" for space-efficiency also.)
However, your version of gnu true doesn't seem to match mine, which opens only /usr/lib/locale/locale-archive, and which is faster than the zero-byte version.
Posted Jul 25, 2006 12:05 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted Jul 24, 2006 23:18 UTC (Mon)
by bluefoxicy (guest, #25366)
[Link]
You know all you people are gathering nice profiling data and all, I'd like to know all the stuff that normally gets scanned for and why it's relevant. Seems to me the lot of you are running strace and finding a djillion gettimeofday()s et al; wouldn't it be nice to have a small shell script that does this and spits out timing information?
Unfortunately oprofile tends to suck and eat up gigs of hard disk space in a few days, else it'd be interesting to have a method of automatic profiling and reporting. Of course we then sacrifice cycles to do runtime profiling, which is no good in a production environment; but I'd volunteer some CPU time to gather data like that, it's not that much of a drag for light usage like typical desktop stuff. I'm not sure of the security implications here ... runtime timing data may be useful for getting parts of passwords or encryption keys, if it's detailed enough; but that's black magic to me so far.
Posted Jul 25, 2006 1:49 UTC (Tue)
by joey (guest, #328)
[Link] (6 responses)
If you're unfortunate enough to navigate to /usr/bin using it, it will probably lock up for a good 2 minutes as it reads 10+ mb of data and makes tens of thousands of system calls. Compare with ls /usr/bin which takes about 0.2 seconds even in inefficient sort and colorise mode.
Posted Jul 27, 2006 11:13 UTC (Thu)
by jbh (guest, #494)
[Link] (1 responses)
I guess it wouldn't be too hard to work around either, but I really don't want to wade into the flame fest that surrounds the gtk file chooser. So... for now I mostly use opera, at least on slow machines.
Posted Jul 28, 2006 15:03 UTC (Fri)
by bjornen (guest, #38874)
[Link]
Ouch! opera 9.0-20060616.6 is one of the worst offenders on my computer (a
top's TIME+ column says it used 24 CPU seconds by the time it's done
And this on a fresh install, only 9 tabs opened, the window minimised and
There is some surprising blame and constructive criticism in Federico Mena
Posted Jul 27, 2006 14:21 UTC (Thu)
by emj (guest, #14307)
[Link] (2 responses)
Posted Jul 27, 2006 14:33 UTC (Thu)
by emj (guest, #14307)
[Link] (1 responses)
Posted Aug 6, 2006 16:07 UTC (Sun)
by Duncan (guest, #6647)
[Link]
KDE (3.5.4) does. It pops up a 3-section window, textbox file chooser on
Duncan
Posted Aug 3, 2006 9:17 UTC (Thu)
by kelvin (guest, #6694)
[Link]
This was all too true on older versions of GTK+, but the file selector (among other GTK-things) has been heavily profiled and improved in the latest versions. On this P4/3GHz Ubuntu install with GTK+ 2.8.20, a cold-cache opening of /usr/bin takes roughly 2 seconds (including display of the "undecipherable icons").
Posted Jul 27, 2006 16:34 UTC (Thu)
by gdt (subscriber, #6284)
[Link]
linux-2.6.17/Documentation/feature-removal-schedule.txt says:
What: mount/umount uevents
Also, try and detect if an unrelated process is running without polling (hint: *notify doesn't work on procfs).
Posted Nov 10, 2010 10:11 UTC (Wed)
by Randakar (guest, #27808)
[Link] (1 responses)
How much of this has been changed and/or fixed today?
Posted Nov 11, 2010 9:57 UTC (Thu)
by MKesper (subscriber, #38539)
[Link]
I'm glad we have free systems -- not only is it possible to spot these
problems, it's possible for people not originally involved in the
programs to step in and fix them.
poorly... this is what tends to happen when you stack layers and layers
of abstraction on each other.
challenged by the task of creating the most stable and incredibly
efficient solution I can think of. I guess the same is not true of
everyone :)
want to improve power efficiency and performance. So, great hackers,
let's get to work!
I don't know about Windows, but you could do similar analysis on Unix systems, since 1990-ish, despite their being closed source. Dave's report (which was, indeed, hilarious, and a great lesson for us all) was based on observing the apps from outside (watching their file operations in particular).
Right. The difference with free software, though, is that you don't have
to wait for the vendor to fix it (or hope that they care enough). If
something is really itching, you're 100% empowered to scratch it.
That's one thing the C++ standard library got right, which I wish other interface designers would follow: treat the time and space complexities as *part of the interface*, document them, and *do not increase them*.
> If you're not told how expensive some function call is, the only way to
> tell is to profile the hell out of it *on a system where n happens to be
> large* (so fontconfig sloth might not be obvious unless, like davej, you
> have many thousands of fonts),
that simple Xlib-based programs (xeyes, xclock...) actually took twice
as long (10 secs) to start as e.g. Gnome Calculator or AbiWord,
which were using fontconfig through Xft through Pango. So, I would
say that it's an improvement over what was before. :-)
loading information about Asian bitmap-based X fonts. It would have been
nice if the other bitmap fonts (for example, everything besides the cursor
and fixed fonts) had been in a separate package that is installed only
when needed.
What this is really telling us is that we need better docs, I think.
Any chance of a transcript popping up somewhere?
http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf (pages 441 through 449)
Pointers to the relevant bugs filed
I'm assuming Dave filed bug reports in the appropriate places -- or was this talk the 'bug report', and he's expecting others to file bugs?
Pardon, but I'm guessing from reading your comment that you might have
missed Dave's point. I don't think his point was to point out specific
problems with specific applications -- his point was to make it clear that
we should be paying more attention to such things in general.
battle. Without making similar improvements in user-space (or worse,
losing ground because of the sort of problems Dave brings up), the free
software desktop as a whole won't improve.
response. The correct response is to acknowledge that there is a problem
and fix it.
Most of the guilty people were actually in the room and took this bug report in public after admitting the humiliating fact that we write sucky code...
I don't have pointers to bugs, because in a lot of cases, I just mailed/IRC'd the relevant developers.
The startup scripts also have tons of stupidity. Many of them are literally hundreds of lines of shell script containing dozens of checks, even in cases where they ultimately do nothing.
Profiling before you optimize
Do you have profiles to support the idea that hardware autodetection is a big component of the startup time? It could be, but watch out for optimizing blind.
Not scientifically valid profiles, no.
Ok. Those are reasonable numbers to work with. Thanks for making the effort!
Last time I installed Debian, at the end of the installation it noticed I didn't have a laptop and offered to remove pcmcia-cs for me.
On Ubuntu, pcmciautils is installed by default because - aside from its init script - it's pretty small and lightweight (compared to the monster that was pcmcia-cs/cardmgr) and it simplifies the installer, debugging the resulting system, etc. if we just install it all the time.
"if [ a = b ] " and "if test a = b " are both shell-builtins. If you want the other one, you have to call /usr/bin/testTest, with "test" vs /usr/bin/test
Tue Jul 25 04:51:16 BST 2006
Tue Jul 25 04:51:16 BST 2006
Tue Jul 25 04:51:27 BST 2006
Tue Jul 25 04:51:27 BST 2006
Tue Jul 25 04:51:33 BST 2006
Tue Jul 25 04:51:50 BST 2006
Where do you see that "60x"? Also, I'd advise you to use time(1) next time ;-)
I think it's more that *when the smart scheduler was implemented* (XFree86 4.1?) gettimeofday() was considered too slow (for tight inner loops when processing multiple requests from clients all at once).
(Ah, I see davej mentioned this. Thanks for the transcript!)
This reminds me of the time a few years ago when Alan Cox ran Nautilus under strace. He didn't like what he saw:
Just for sh*ts and giggles, and because my laptop makes lots of fan noise when it uses lots of processor power and gets hot (it's a Compaq Evo n1020v P4 2.4GHz), I thought I might run strace on various processes running on a stock Ubuntu install to see why it didn't idle at 300MHz and buggerall CPU usage.
> Wow. I'm not a particularly good coder, but my understanding is that if it polls like a large chunk of the desktop programs do, generally, you've done it wrong. I know of the sleep() and usleep() functions, and I'm pretty sure I could figure out callbacks.
Ever since I read that article, I have wondered whether anybody ever cared to fix this.
Nautilus polls files related to the desktop menus every few seconds, rather than using the inotify API which was added for just this purpose.
inotify was merged for 2.6.13, one year ago.
inotify
So inotify has been in Linux 2.6 for about a year now; cool.
Gnome has a thing called "gamin" which abstracts the various inotify-like interfaces the different operating systems provide. At least IRIX provides a dnotify-like thing, as does Linux historically. And if the OS is unknown or has no such method, *gamin* falls back to polling.
Gamin was actually one of the things Dave complained about...
I am sure that it's worth complaining about... but it's a lot better than the 'famd' it replaced!
... except of course that if you're a poor sod whose home directory is mounted over the network (perhaps from a centralized RAIDed fileserver), then, oops, the damn thing falls back to polling (over a network!)
The "least common denominator" argument really sucks. I get that KDE, inotify
Gnome and X.org try to support as many of the UNIXes as they can. But I
refuse to accept that they should do so at the expense of the majority of
their users (who are using Linux).
toolkits both of our desktops run on are portable to operating
systems /without/ UNIX APIs), yet specialize on each platform. Take the
kernel as a great example -- we have a nice mechanism called
"alternatives" that detects processor model and counts, and then
re-writes parts of the kernel text on the fly in order to make it
maximally efficient. The developers could have instead shot for the
lowest common denominator (386) -- because the code would still certainly
work on everything else (provided that it's also built for SMP).
days in order to make our buildsystems work, but when I watch all the
crap flying by on every package I build, I realize that few of them
actually /need/ all those damn checks. Why don't we make better use of
the tools we have? autom4te can check for inotify; if it's present, don't
build a Gnome desktop that spams the kernel, CPU and memory bus every
second when there's no activity at all.
No, in practice you must build something which tests for inotify at runtime and falls back to dnotify or even polling. The reason: distributors won't want to build programs which fail to work when run on kernels as recent as 2.6.12 -- at least, not non-system-level programs.
Ah, good point. Well, at the very least, having build-time inotify
support would assist some of us (crazy Gentoo users that spend half our
lives watching a compiler) immediately and others later ;)
another way.
fam 2.7.0 uses dnotify in any case, if it's available. It may not be as nice as inotify but it's a hell of a lot better than polling.
> in practice you must build something which tests for inotify at runtime and falls back to dnotify or even polling. [...]
However, this is perfectly doable.
True, but this is something that should be part of the desktop
infrastructure, not a part of every application. Most application
programmers wouldn't have a lot of fun with Xlib either, but someone's got
to do it...
A few points --
There is a difference between a vendor choosing to make i386 releases and
programmers refusing to use the features of any more modern chip simply
because a few i386 boxes are still out there clocking their ops. One of
the great things about having open source code is that you can download
and build your own packages optimized just how you choose (indeed,
distributions like Gentoo even make it easy). You're doing well if your
code will build for old hardware but otherwise make use of new features.
suckiness of the reality that the argument represents, it's the fact that
it ever gets used as an excuse to write code in which "sub-optimal" is a
gross understatement.
optimal is a fact of life. It's why we have abstraction layers at all. If
every system was the same, operating systems either wouldn't exist or
they'd be a hell of a lot more simple, and that goes for everything from
the bottom of the stack up. It's very much a reality, as you put it.
multiple implementations of the same function. Writing to the least
common denominator -- and not ever specializing -- is a cop-out.
used to justify incredibly sloppy / inefficient code. The quote as it
stands is simply imprecise. There are /some/ optimizations which are
questionable enough that you very much want instrumentation before you
write large chunks of code, but the world just isn't black and white.
systems programmer would know that polling files several times a second
for something like menu entries, or assembling entire HTTP queries and
responses several times a second to communicate with a system tray icon,
is a bad idea -- something that could be optimized. No need for
instrumentation at all.
without instrumentation) really irritate me, because I started on a 386
and many common operations take more wall-clock time today than they did
back then. I'm now on a Pentium 4, for chrissakes, with a gigabyte of DDR
RAM. What has happened is that as the generations go on, some of us seem
to be trading in programmer time for CPU time (read: being lazy).
don't think any sane person expects you to write desktop apps in
assembler, even though if you somehow had the dedication and
concentration required you'd make something at least slightly faster).
The problem is that programmers are _being lazy_ and choosing points on
the "diminishing returns" curve that are well before returns start to
diminish.
the programmers in question. Many of them are probably 'bugs'. But when I
hear about applications hammering the filesystem many times per second,
or using HTTP as an IPC mechanism between a system tray icon and another
program, I worry that we've all gone just a little bit crazy.
Most of your response is tangential to the argument I submitted.
> Most of your response is tangential to the argument I submitted.
counterpoints to my complaint about programming to the least common
denominator, and I was systematically addressing them (including your
quote about optimization).
> Even when we know what "the right way" is, we usually don't have that
> luxury due to externally-imposed constraints.
to build a model capable of using different implementations. Sometimes,
it's even more trouble to try and come up with something generic!
development is that it's usually not the requirements-driven,
oh-my-the-deadline-is-yesterday-and-the-customer-is-complaining-style
development uncomfortably familiar to programmers working in the
corporate world. And if our projects are being run that way (which I
don't think they are), we should move further up the chain and ask why
we're adopting policies and procedures that impose external constraints
on our code quality.
there was a problem (glad we have this paper to enumerate some examples!)
*B) Developers did what they thought was 'good enough' and just didn't
realize that their implementation didn't meet their expectations
*C) We're less-than-average programmers and we can't figure this stuff
out for the life of us (doubt that, there's oodles of awesome free
software from all of the major projects out there, which demonstrates
competency)
step and move on to 'making it better'.
"Fast, cheap, good. Pick two" is a reflection of the reality that nothing is without cost.inotify
And finally, I would agree with you and chalk up the problems that Dave raised to (A) and (B), although they both are symptoms of (C) -- which is usually due to inexperience, not idiocy. Subsequently, with better awareness of (A) and (B), (C) is lessened as the programmer presumably will learn from their mistakes.
Most of what you say about "Fast, cheap, good. Pick two" is fine and good.
But all I'm really trying to say is that we, the F/OSS community, have the
capacity to do better. Look at the Linux 2.6 process - I would call
that "Fast, cheap, good". It's not perfect, but it's damn fast, it's still
F/OSS and it's still /very/ good.
the "Pick two" argument apply to any project. The problem I have
with "Pick two" and the earlier optimization quote is simply that most of
the time I've heard an engineer saying one, it's being invoked as an
excuse for shoddy design. And I've personally witnessed that when you
simply let a passion for your art drive your work, and sprinkle on a
little bit of experience in the environment you're working in, you can
deliver "fast, cheap, good" all at once.
project, the majority of the code still comes from people with that
passion -- people just scratching their itch. I hope our projects don't
erode into the same corporately-managed disasters as are so commonplace to
the proprietary software engineer. But since engineers have the power in
F/OSS, I think if we focus on passion and rejecting ideas like "fast,
cheap, good -- pick two," we'll be entirely successful in breaking the
traditional rules of development once again.
don't apply; please leave them at the door.
> Look at the Linux 2.6 process - I would call that "Fast, cheap, good". It's not perfect, but it's damn fast, it's still F/OSS and it's still /very/ good.
If 'cheap' is a function including n (the rate of change) rather than a
constant, then I think the kernel is about as 'cheap' as you can get.
problems are you having?
seem like most users are pleased (I'm one of them).
Each kernel gets better for me. The 2.4 series was better than 2.2, and 2.6 is better for me than 2.4.
My hardware is supported out of the box on new kernels, which it wasn't for older ones.
i386 userland?
As far as I know, none of the modern distros build for a stock i386. You can't; the C++ libraries depend on i486 opcodes to properly implement threading, and while there are i386 alternatives, they're slow and unreliable.
> X.org, says Dave, is "awesome." It attempts to figure out where a graphics
> adapter might be connected by attempting to open almost any possible PCI
> device, including many which are clearly not present on the system.
https://launchpad.net/distros/ubuntu/+source/xserver-xorg...
bus after finding the correct card from PCI bus. I think the problem
is the drivers, not the X server itself.
No, he was talking about how the code in hw/xfree86/os-support/bus/Pci.c and hw/xfree86/os-support/linux/lnx_pci.c walks the tree (it's really bad, and only very recently got better). The ATI situation is due to the driver being braindead (use Driver "radeon", not Driver "ati"), and has nothing to do with the code Dave is talking about.
Some more observations: xfce4-panel, firefox
Inspired by the above article, I went to check the first programs in the CPU-ordered top list on my system. Although I realize the point of the article was more generic, it may be an interesting exercise to try this on your own system. My system is running a recently updated Debian unstable with kernel 2.6.17-1-k7.
Infinite bloat
A long time ago, in a Unix version far, far away, /bin/true was an empty shell script (since it was executable, the shell would run it as a shell script when the kernel failed the exec(2); with nothing in the script, the shell returned a 0 exit code).
-rwxr-xr-x 1 root root 21969 2004-04-05 21:32 /bin/true
But invoking a complete /bin/sh process to evaluate the empty shell script file would have been worse, surely?
Not mine, but I remembered seeing it before.
http://www.muppetlabs.com/~breadbox/software/tiny/true.as...
True was changed from a shell script to an executable due to the fact that running a shell for a user that wasn't supposed to be able to log in (e.g. an ftp-only user) created a security hole (ctrl-c could get you a shell).
Shell coders who are really interested in being efficient don't use external commands like /bin/true anyway, when the shell builtin ":" will do the same thing.
That is all GNU libc-version-dependent, and happens before main() is entered.
gtk file selector dialog
My favorite example at the moment is the gtk file selector dialog, as seen in Firefox. Each time you change directories, it first uses getdents to get all the files in the directory, then stats each individual file, then _reads_ 4k of each file to determine the file type. That information is used to put undecipherable tiny useless icons next to the files indicating their file type.
Yes! 2 minutes to choose an application sounds about right (even if I know the exact path and write it in the Ctrl-L location dialog).
> So... for now I mostly use opera, at least on slow machines.
750MHz P4 laptop (dell i8k)).
loading. After this it continues to steal ~3% CPU by itself, increasing
XFree86's CPU usage to ~7%.
javascript, plugins, et al turned off.
Quintero's blog -
http://primates.ximian.com/~federico/news-2005-11.html#mo...
The best part is you can't just enter a command to execute and have Firefox look it up in the path... So this bug of the GTK file selector only shows up because of a misfeature in Firefox. ;-(
Here is the bug report in bugzilla; all you have to do is fix it.
Does anyone know if gnome/kde has an "open with application" dialog like the one Windows uses?
> Does [...] gnome/kde has a "open with application"
> dialog like the one windows uses
top (browse button to the right), a tree-view copy of the K-Menu in the
center, and two checkbox options on the bottom, run in terminal (with a
don't close after exit suboption that's dimmed out until run in terminal
is selected), and remember application association (similar to the
MSWormOS dialog option). Below that are the usual OK/Cancel buttons.
> If you're unfortunate enough to navigate to /usr/bin using it, it will probably lock up for a good 2 minutes as it reads 10+ MB of data and makes tens of thousands of system calls.
Sometimes userspace is made to suck, I mean poll
When: February 2007
Why: These events are not correct, and do not properly let userspace know
when a file system has been mounted or unmounted. Userspace should
poll the /proc/mounts file instead to detect this properly.
Yes, this would be something worthy of a follow-up!
Besides, here's the updated link to the slides of Dave Jones' presentation "Why userspace sucks" (Magicpoint format).