Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 6, 2010 21:48 UTC (Mon) by Sho (subscriber, #8956)
In reply to: Graesslin: Driver dilemma in KDE workspaces 4.5 by rahulsundaram
Parent article: Graesslin: Driver dilemma in KDE workspaces 4.5

> Kwin developers could have called for a public round of testing from end users so that results from a diverse set of hardware and drivers can be tested on before assuming that the strategy that they have planned will work.

There were two betas and three release candidates for KDE SC 4.5.0, so I don't really know why that wasn't sufficient. From what I've read, the problem seems to be that driver flakiness produces so much bug-report noise that it's hard to tell whether a report describes a one-off problem or a single problem that affects many people. So it becomes a question of how to produce better data to improve the metrics, I guess.



Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 6, 2010 21:53 UTC (Mon) by rahulsundaram (subscriber, #21946) [Link]

You answered your own question. A call for testing should be focused, and explicit guidance needs to be given on how to get started and report feedback. I am biased, but I quite like Fedora test days: while they are not perfect, they usually help find quite a number of issues and resolve them fairly quickly. A recent example:

http://lwn.net/Articles/403677/

Fedora does explicit test days for Xorg drivers and that could very well be adopted to solve some problems seen by KDE.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 1:21 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

The outcome of those tests is very consistent:

For most people it doesn't work. They fill out rows of N/As in the test charts, or they leave the test channel after wasting hours trying to follow broken instructions written by overworked QA staff who are expected to make up for the disinterest of the "maintainers" of the driver.

And I think I finally figured it all out, the problem is that the Wise Men involved perpetually think it will be finished tomorrow, so no need for scaffolding. If I am building something as fragile and dangerous as KMS, every other line is debug. At each step I am trying to figure out what's wrong now, and when something is wrong, I shout it from the rooftops, in the hope that someone will hear before the world ends. But my approach is defeatist, it asserts that we are trying to find the problem, which means there is a problem, and nobody wants to hear that.

Try it, report a critical bug in Rawhide. It locks solid on your machine. Yep, looks like the KMS driver. And then you'll get radio silence. They have no mechanism to debug the hang, even though such hangs are an ever present menace of low-level graphics chipset tinkering. They may tell you to keep trying new versions, maybe it'll fix itself. That's all they have.

I remember when Valgrind didn't exist. A certain species of programmer wouldn't fix memory stomping bugs in those days, nor leaks. They couldn't reproduce the problem on their PC, and they didn't have a tool that would just tell them where the problem was, so you'd just better accept that it's your lot in life to have crashes or leaks. They didn't want to buy the expensive (and frankly not very good) proprietary tools. So nothing to be done. Thankfully Julian Seward didn't agree. KMS needs a Julian Seward urgently.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 5:27 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link]

That is overly pessimistic, especially the part about accusing developers of disinterest. As far as impressions go, I don't have to count on those or on anecdotes. The Bugzilla stats and the reports sent to the test list show how many bug reports have been filed and fixed via these test days, and the Xorg test days have been pretty successful. Anyway, the point is about choosing a workflow that will work better than a general beta.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 10:20 UTC (Tue) by lmb (subscriber, #39048) [Link]

I don't think it is overly pessimistic. I'm afraid it is fairly accurate.

Trying to debug a graphics driver crash is non-trivial, and somewhat beyond the average Linux hacker. It certainly is beyond me; the hardware specs are closed, and the drivers and Xorg are not so good at diagnostic traces.

(Reporting the crash, and even getting one of the rare X backtraces just results in the very driver developers shrugging.)

And because of the way "direct" rendering works, a crashing application makes your entire hardware lock up. There's absolutely nil fault isolation.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 12:24 UTC (Tue) by nix (subscriber, #2304) [Link]

That tends to happen anyway. Graphics driver crashes have a nasty tendency to assert ownership of the PCI bus and then not let it go again. It doesn't matter what happened to the kernel when *that* happens: it's big-red-switch time.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 15:17 UTC (Tue) by lmb (subscriber, #39048) [Link]

So part of the problem seems to be literally hardware-related: software can't do fault isolation if the hardware doesn't. Wasn't this whole (IO-)MMU thing supposed to help with that?

What prevents an OS from properly isolating a given driver, anyway? How does a gap here square with "virtualization", where we expect to be able to isolate guests from each other fully?

Maybe this is something to get right in the next version of hardware specs? The drivers will still suck, but at least they'll be easier to debug ...

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 15:58 UTC (Tue) by michaeljt (subscriber, #39183) [Link]

> So part of the problem seems to be literally hardware-related: software can't do fault isolation if the hardware doesn't. Wasn't this whole (IO-)MMU thing supposed to help with that?
Yes, but they are still not widely deployed on the desktop. I don't know to what extent Linux uses them if they are present.

> What prevents an OS from properly isolating a given driver, anyway? How does a gap here square with "virtualization", where we expect to be able to isolate guests from each other fully?
Lack of an IO-MMU I suppose. Currently PCI hardware can't be safely directly accessed from a virtual machine (the VM can provide a virtual "wrapper" device around the real one though, which will be safe as long as it doesn't go too close to the bone). It was also one of the ideas behind micro-kernels, but apart from QNX no one ever produced a convincing one, and at least Linus thinks they are actually harder to do right than monolithic (I think he has a bit of experience with kernels).

> Maybe this is something to get right in the next version of hardware specs? The drivers will still suck, but at least they'll be easier to debug ...
See IO-MMU. Perhaps it won't suck too much.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 16:01 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

Parallel PCI has a rather nasty failure mode; I don't know if it's still there in PCIe.

If a PCI device goes into a state where its interface is still capable of recognising transactions on the bus but its internal bus/switch/whatever is irretrievably wedged, it will sit there issuing Target Retry every time you try to access it. PCI bus master logic tends to be designed to wait basically forever for the Target Retry state to go away, so as soon as the processor issues a read to that device, or completely fills the host controller's posted-write buffer, the system will lock up. (And even if it doesn't lock up, it will crash.)
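As an illustration only, here is a toy Python model of that failure mode. It is not real PCI logic: the class names, the retry limit and the simulation budget are all made up for the sketch. It just contrasts a master that retries indefinitely with one that gives up after a bounded number of retries.

```python
from enum import Enum


class Response(Enum):
    DATA = "data"
    RETRY = "retry"


class WedgedDevice:
    """Hypothetical target: the bus interface answers, the internals never do."""

    def read(self, address):
        return Response.RETRY, None


class HealthyDevice:
    """Hypothetical target that is busy for a few cycles, then returns data."""

    def __init__(self, busy_cycles):
        self.busy_cycles = busy_cycles

    def read(self, address):
        if self.busy_cycles > 0:
            self.busy_cycles -= 1
            return Response.RETRY, None
        return Response.DATA, 0xCAFE


def master_read(device, address, retry_limit=None, simulation_budget=10_000):
    """Retry a read until data arrives.

    retry_limit=None models a master that waits indefinitely. The
    simulation budget exists only so that this toy terminates.
    """
    attempts = 0
    while attempts < simulation_budget:
        response, value = device.read(address)
        attempts += 1
        if response is Response.DATA:
            return "ok", value, attempts
        if retry_limit is not None and attempts >= retry_limit:
            return "timeout", None, attempts
    return "hung", None, attempts
```

With a bounded limit the caller at least gets an error it can report; without one, the only way the toy ever returns is by exhausting its artificial budget.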

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 8, 2010 8:17 UTC (Wed) by cladisch (✭ supporter ✭, #50193) [Link]

I've seen something like this with a PCI device behind a PCIe/PCI bridge, where the north bridge issued a Machine Check Exception when the transaction timed out. (With a pure PCIe device, it would behave similarly.)

The only difference between an immediate lockup and an MCE is that the kernel tries to output an error message before panicking.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 11:31 UTC (Tue) by interalia (subscriber, #26615) [Link]

<pedantry>
The word meaning "to lack interest" is "uninterested". To be "disinterested" means to lack bias. For example, you want a judge to be interested in what is said during the case but disinterested so that justice occurs.

It's a pretty common trip-up though.
</pedantry>

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 22:41 UTC (Tue) by mp (subscriber, #5615) [Link]

Apparently for every bit of such pedantry there exists a link to Language Log. This time it is http://languagelog.ldc.upenn.edu/nll/?p=511

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 12:29 UTC (Tue) by renox (subscriber, #23785) [Link]

> They didn't want to buy the expensive (and frankly not very good) proprietary tools.

Uh? I found purify to be a good tool.

> So nothing to be done. Thankfully Julian Seward didn't agree. KMS needs a Julian Seward urgently.

Well, Valgrind's features were the same as those of the proprietary tools. Is there a proprietary tool that has the set of features you want for driver debugging?
AFAIK no, so IMHO this is a pipe dream.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 13:36 UTC (Tue) by nix (subscriber, #2304) [Link]

valgrind has a substantially richer feature set than e.g. purify ever did.

a Valgrind for KMS

Posted Sep 7, 2010 15:54 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

AFAIU Julian saw a problem and approached it in a new way. Previously the accepted wisdom was that a decent memory debugger must be invasive, either instrumenting source or at least being linked in to mask out the system allocator. Valgrind's radical alternative is Virtualisation. The result is categorically more useful than tools like Purify in some circumstances, and the fact that it's GPL'd and thus can be included with the OS makes those circumstances all the more likely.

The accepted wisdom about KMS bugs, already repeated in this thread, is that if a device on a PCI bus (graphics chips have rarely been on a shared bus in the last decade but let's go along with the idea) goes haywire whether due to a hardware flaw or a driver bug, you're out of luck, it's unrecoverable. But it seems to me that most likely the CPU is still in charge here, so the situation is desperate, but not immediately fatal, just like a double fault. Very often it is technically possible to get things back into a working state, or at least to stash information about the circumstances of the fault and reboot - but the current drivers are oblivious so we hang.
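To make the "stash information" idea concrete, here is a minimal sketch in Python under loud assumptions: the `FaultRecorder` class, its sequence numbers and the tick threshold are hypothetical and do not describe any existing KMS driver. It only shows the shape of the idea, which is to remember what was recently submitted so that a stalled device yields a report instead of silence.

```python
from collections import deque


class FaultRecorder:
    """Hypothetical helper: remember recent commands so a hang has context."""

    def __init__(self, history=4):
        # Only the most recent `history` submissions are kept.
        self.recent = deque(maxlen=history)
        self.last_completed = 0

    def submit(self, seqno, description):
        """Record a command handed to the device."""
        self.recent.append((seqno, description))

    def complete(self, seqno):
        """Record that the device reported finishing up to seqno."""
        self.last_completed = seqno

    def check(self, ticks_without_progress, threshold=3):
        """Return a report once the device has stalled for too long."""
        if ticks_without_progress < threshold:
            return None
        pending = [(s, d) for s, d in self.recent if s > self.last_completed]
        return {"last_completed": self.last_completed, "pending": pending}


recorder = FaultRecorder()
recorder.submit(1, "clear")
recorder.submit(2, "blit")
recorder.submit(3, "draw")
recorder.complete(1)
report = recorder.check(ticks_without_progress=3)
```

A report like this is what would get written somewhere that survives a reboot, so that it can be pasted into the ticket.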

Why mention Valgrind? Compare these alternatives:

1. "Sorry, I can't reproduce the bug you're describing. Maybe if you download the following sixteen projects as source, and rebuild them using [a memory debugging tool users can't reasonably be expected to own] you can collect enough information to fix it. Otherwise, sorry you're out of luck".

2. "Can't reproduce here. Do 'yum install valgrind'. Run the program with 'valgrind nameofprogram' and when it crashes paste the output into this ticket".

Guess which alternative leads to bugs getting fixed and which just teaches users not to even bother reporting them.

I'm not literally suggesting that there'd be a userspace binary that somehow debugs KMS drivers, but I am suggesting that the right smart person, possibly even the existing smart people who hack on these drivers given the inspiration, could make a dramatic difference to fix rate for KMS bugs by ensuring that they're debuggable. I am suggesting that this ought to be a very high priority, given the impact of unfixed KMS bugs (typically: the user has to run Windows or buy new hardware)

a Valgrind for KMS

Posted Sep 7, 2010 17:51 UTC (Tue) by nix (subscriber, #2304) [Link]

> Valgrind's radical alternative is Virtualisation.
Not unless you take severe liberties with the definition, it's not. Valgrind is a dynamic translator, a pluggable instrumentation engine, and a JITter. No virtualization as such involved (although of course the native CPU is not directly executing the valgrinded program anymore, so the CPU is virtualized in that sense, but nothing else is.)
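A toy sketch of what "translate, instrument, cache" can mean, in Python. The two-instruction language (`set`, `store`), the `Engine` class and the store hook are invented for illustration; real Valgrind works on machine code and is far more involved than this.

```python
def translate(block, on_store):
    """Turn a block of toy instructions into one callable, hooking every store."""
    steps = []
    for instruction in block:
        op = instruction[0]
        if op == "set":                      # ("set", register, constant)
            _, reg, value = instruction

            def step(state, reg=reg, value=value):
                state["regs"][reg] = value
        elif op == "store":                  # ("store", address, register)
            _, addr, reg = instruction

            def step(state, addr=addr, reg=reg):
                on_store(addr)               # instrumentation added by the translator
                state["mem"][addr] = state["regs"][reg]
        else:
            raise ValueError(f"unknown instruction: {op}")
        steps.append(step)

    def translated(state):
        for step in steps:
            step(state)

    return translated


class Engine:
    """Translate each block the first time it runs, then reuse the result."""

    def __init__(self, blocks, on_store):
        self.blocks = blocks
        self.on_store = on_store
        self.cache = {}
        self.translations = 0

    def run_block(self, name, state):
        if name not in self.cache:
            self.cache[name] = translate(self.blocks[name], self.on_store)
            self.translations += 1
        self.cache[name](state)


seen = []
engine = Engine({"entry": [("set", "r0", 7), ("store", 100, "r0")]}, seen.append)
state = {"regs": {}, "mem": {}}
engine.run_block("entry", state)
engine.run_block("entry", state)
```

The original block is never executed as written: only its instrumented translation runs, and the second call reuses the cached translation.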

> if a device on a PCI bus (graphics chips have rarely been on a shared bus in the last decade but let's go along with the idea) goes haywire whether due to a hardware flaw or a driver bug, you're out of luck, it's unrecoverable. But it seems to me that most likely the CPU is still in charge here
Er, if the PCI bus has been grabbed by something else, I think you're doomed. This is the bus arbitration protocol that's gone wrong: there is no way to break a lock, no meta-arbitration protocol we can appeal to. (At least I can't recall one. If anyone knows differently, please speak up!)

Equally, if the GPU has gone into an infinite loop and is not listening to the outside world anymore, even if the PCI bus is live you're not going to see anything on the screen. This is akin to an infinite loop in the kernel: all you can do is big-red-switch to get the system to listen to you again.

Does not work this way in Windows...

Posted Sep 7, 2010 21:07 UTC (Tue) by khim (guest, #9252) [Link]

The theory is sound, but the practice is different. The Windows ATI driver can restart the card if it's stuck (usually from overclocking and/or bad cooling).

I don't really know how it's done (perhaps there is some kind of hardware watchdog in the GPU?) but it works. So even if it's theoretically possible to create a hostile GPU that will hog the bus forever, the GPUs that actually exist were not designed to work this way.

Does not work this way in Windows...

Posted Sep 8, 2010 7:08 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

That parallel-PCI failure mode I described elsethread? I have encountered real existing hardware that could by incorrect operation be induced to cause that failure. Just because some hardware is sanely designed and well-implemented doesn't mean it all is.

Does not work this way in Windows...

Posted Sep 8, 2010 10:04 UTC (Wed) by lmb (subscriber, #39048) [Link]

Conceivably, it would be an improvement if 80%+ of errors were easier to report. If there are still some that can't be - such as your PCI lockup - that is, of course, suckish, but getting the others fixed would help considerably.

Because it does make a difference if my system crashes every day, every week, or once per month.

a Valgrind for KMS

Posted Sep 7, 2010 22:52 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

"but nothing else is".

I hope I didn't mislead anyone, I didn't intend to. You are correct "only" the CPU is virtual. A valgrinded program writes to a real disk, connects over a real network, prints to a real printer. But its JMP instruction does not change the instruction pointer on a real CPU. That's where the magic happens.

a Valgrind for KMS

Posted Sep 8, 2010 0:07 UTC (Wed) by nix (subscriber, #2304) [Link]

Well, not quite (though what you say isn't misleading, just simplified).

Because of the JITting, the JMP instruction *does* sometimes eventually end up as some sort of jump instruction in native assembler, and that *does* change the instruction pointer of the real CPU. (IIRC, sometimes valgrind may choose to run several JITted blocks together, so a JMP may be converted into nothing at all. But it's been a while since I looked at this.)

I'm not sure there's a term for the sort of on-the-fly there-and-back-again instrumented translation valgrind does. Paravirtualization, maybe, only that's already been used for something else. JIT is probably the best available term for it which isn't total compiler-hacker techspeak.

I wish there was a new valgrind paper that covered the fundamentals of the changes in it in the last five or so years. The earlier paper was one of the best things I read that year...

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 7, 2010 12:03 UTC (Tue) by alecs1 (subscriber, #46699) [Link]

More people would test if it was easy to test.

It is much much easier for me to run KDE 4.5 in Windows (I have it at work) than on my Debian box at home. Even more, I can install version 4.X for a user, and version 4.Y for another one. And all that with a few clicks, no special training.

The idea that software availability is less important than an "overblown" package system (with whatever security benefits come from libraries getting upgraded and whatnot) is a losing one on the desktop, particularly in this testing case.

Given the experience of wasted time, to get me testing you have to provide a program that does everything by itself (it should be an installer, rather than a script that compiles for 10 hours). I would have spent 3 hours testing 4.5 if it wasn't such a damn struggle to get 4.5.

Let's list the experiences here:
1. There's no Debian package, and I use Debian (mostly because of the huge number of packages that are easy to install and keep upgraded).
2. I also have an old Fedora installation. At some point I tried to upgrade it to Rawhide, because Rawhide had some new KDE (4.3 beta, if I remember correctly) which wouldn't come to Debian any time soon. Guess what, it couldn't be upgraded because of signing issues.
3. KDE also has a tutorial on how to compile KDE and get it working. Following it never got me the correct result; I had to look for help and use Unix knowledge to get it working.

Oh, and I have reported tens of bugs on KDE, and followed up on each of them. I can't be accused of not trying.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 8, 2010 12:13 UTC (Wed) by vonbrand (subscriber, #4458) [Link]

Re 2: Rawhide packages are usually not signed (too much turnover), just do "yum update --nopgp".

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 8, 2010 12:21 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link]

That would be

# yum update --nogpgcheck

kde-redhat.sf.net probably has the latest version available, backported for older releases. Safer than cherry picking from Rawhide.

Graesslin: Driver dilemma in KDE workspaces 4.5

Posted Sep 18, 2010 13:22 UTC (Sat) by jospoortvliet (subscriber, #33164) [Link]

Try using a liveCD?

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds