
LCA: Andrew Tanenbaum on creating reliable systems

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 4:20 UTC (Thu) by jwb (guest, #15467)
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

It sounds like the guy doesn't do that much with his computer. People who play games or edit broadcast video are using their computers right up to the very edge of their capabilities. Consider a GPU. Most GPUs cannot survive a stream of bad commands. If you send the wrong command you will deadlock the part and the computer will need to be reset. You could redesign the GPU to analyze the incoming command stream and reject bad commands, returning to a known-good state afterwards. Basically you want the GPU to be a device on a network with its own operating system. That would be neither cheap nor easy, and the resulting system would be considerably slower.
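Here is a rough sketch of the validation layer described above, in Python for brevity. The opcodes, the register limit, and the `Command`, `validate` and `submit` names are all hypothetical; real GPU command formats are vendor-specific and far more complex. The sketch shows the two costs the comment points to: every command is checked before it reaches the hardware, and the device state is staged on a copy so that a rejected stream leaves the known-good state untouched.

```python
from dataclasses import dataclass

# Hypothetical command set; real GPUs use vendor-specific binary formats.
VALID_OPCODES = {"SET_REG", "DRAW", "FENCE"}
MAX_REG = 255


@dataclass(frozen=True)
class Command:
    opcode: str
    arg: int


class CommandRejected(Exception):
    """A command that could wedge the (hypothetical) GPU."""


def validate(cmd: Command) -> None:
    if cmd.opcode not in VALID_OPCODES:
        raise CommandRejected(f"unknown opcode {cmd.opcode!r}")
    if cmd.opcode == "SET_REG" and not 0 <= cmd.arg <= MAX_REG:
        raise CommandRejected(f"register value {cmd.arg} out of range")
    if cmd.opcode == "DRAW" and cmd.arg <= 0:
        raise CommandRejected("draw with a non-positive vertex count")


def submit(stream: list[Command], known_good: dict) -> tuple[dict, bool]:
    """Apply a whole stream or none of it; on rejection, return the known-good state."""
    staged = dict(known_good)  # stage changes on a copy
    try:
        for cmd in stream:
            validate(cmd)  # per-command check: this is where the slowdown comes from
            if cmd.opcode == "SET_REG":
                staged["reg"] = cmd.arg
            elif cmd.opcode == "DRAW":
                staged["drawn"] = staged.get("drawn", 0) + cmd.arg
            # FENCE changes no state in this toy model
    except CommandRejected:
        return known_good, False
    return staged, True
```

For example, `submit([Command("SET_REG", 5), Command("DRAW", 3)], {"reg": 0, "drawn": 0})` returns `({"reg": 5, "drawn": 3}, True)`, while a stream containing an unknown opcode returns the original state and `False`.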

High performance software relies on tricks which are, for the most part, quite unsafe. Your 3D game only works because the GPU is allowed unchecked access to main memory. If you were to start being careful about that access, performance will suffer. The same is true of cluster supercomputing with remote DMA.

Perhaps Tanenbaum envisions two classes of computer users: those who are willing to absorb the performance hit (because they only run PINE) and those who demand all the capabilities technology can offer.



LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 6:54 UTC (Thu) by khim (subscriber, #9252)

If you send the wrong command you will deadlock the part and the computer will need to be reset.

The computer? Probably not. The GPU? Absolutely. And resetting the GPU can be done without any hardware redesign: ATI's Windows drivers already do it (I'm not so sure about the Linux ones).

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 19, 2007 4:40 UTC (Fri) by drag (subscriber, #31333)

Well, I asked about this on the Xorg mailing list, and the basic response was that when the drivers in Linux bork, X restarts. You can also use Ctrl-Alt-Backspace to break out of X when it locks up. This is basically what Vista has implemented, but in a more automatic manner.

If the driver fails in a manner that borks the hardware, then you're screwed regardless, in both operating systems.

That's my limited understanding. Details on the internet about what actually happens with the video card reset features in Vista are few and far between, so don't take this as the gospel truth.

The video drivers in Linux have always been in userspace, which is supposed to be a big new feature for Vista. The 'DRM' Linux kernel modules allow the 'DRI' drivers (drivername_dri.so) to control the hardware. The same has optionally been true of USB drivers.

It also seems like Microsoft is full of shit about a lot of other aspects of Vista that are touted as huge improvements.

For example, they tout their 'resolution independent UI'. This is supposed to make your UI work better with very high resolution displays. In reality, all it means is that you can change the DPI for the display (with a reboot, I believe).

Effectively, they just implemented a feature that has been available in other OSes for years and years, and now it's a big new selling point for them.

We need clients that can survive an X restart.

Posted Jan 19, 2007 18:43 UTC (Fri) by AJWM (guest, #15888)

> when the drivers in Linux bork, X restarts.

The problem is, that pretty much takes all the X clients with it, so you've lost your whole session anyway.

Now, arguably that's the fault of the client(s) rather than the X server, but I've yet to see an X client program written that could gracefully recover (i.e., to the point of picking up exactly where it left off) if its X display was yanked out from under it and restarted. It's been long enough since I programmed at the Xlib level that I don't recall how much state (of windows, etc.) is in the server vs. the client (via the X libraries), but I imagine that enough state _could_ be kept in the client (again, courtesy of Xlib) to redisplay to the new X server everything that was there when the old one borked.

It'd be an interesting and non-trivial programming exercise, but a massively useful one. (Nothing worse than having numerous windows open, having your X display lock up, and knowing that the only thing you can do is blow it all away even though the client programs are still perfectly OK - or will be until their display connection is killed.)
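A minimal sketch of the client-side bookkeeping imagined above, assuming a toy display connection rather than real Xlib or XCB: the client keeps its own record of every window it creates, so that when the server disappears it can open a fresh connection and recreate everything. `ToyDisplay`, `ServerGone` and `ResilientClient` are hypothetical names. In this toy, a dead server simply raises an exception on the next request; actually detecting that condition through Xlib is the hard part.

```python
class ServerGone(Exception):
    """Raised when the (toy) display server has died."""


class ToyDisplay:
    """Hypothetical stand-in for an X server connection."""

    def __init__(self) -> None:
        self.windows: dict[int, tuple[str, int, int]] = {}  # lost if the server dies
        self.alive = True
        self._next_id = 1

    def create_window(self, title: str, width: int, height: int) -> int:
        if not self.alive:
            raise ServerGone()
        wid = self._next_id
        self._next_id += 1
        self.windows[wid] = (title, width, height)
        return wid


class ResilientClient:
    """Keeps its own copy of window state so it can replay it onto a new server."""

    def __init__(self, connect) -> None:
        self.connect = connect  # factory returning a fresh display connection
        self.display = connect()
        self.specs: list[tuple[str, int, int]] = []  # client-side record

    def open_window(self, title: str, width: int, height: int) -> None:
        self.specs.append((title, width, height))
        try:
            self.display.create_window(title, width, height)
        except ServerGone:
            self.reconnect()

    def reconnect(self) -> None:
        # Window IDs from the old server are meaningless now; rebuild from our record.
        self.display = self.connect()
        for spec in self.specs:
            self.display.create_window(*spec)
```

Only the client's own description of each window survives the restart; nothing is recovered from the dead server.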

We need clients that can survive an X restart.

Posted Jan 19, 2007 21:27 UTC (Fri) by jwb (guest, #15467)

Actually, X11 is (or used to be) stateless. These days, with backing stores and compositing, that's not quite true. GTK+ has had for years the ability to disconnect from one display and reconnect to another, which also means that you can connect it to a dummy display while your real display restarts. However this toolkit capability has been long neglected by application writers.

We need clients that can survive an X restart.

Posted Jan 20, 2007 0:31 UTC (Sat) by drag (subscriber, #31333)

Well, I would think that compositing would make it easier to make things stateless, since windows and such are not rendered on the actual display but in off-screen buffers.

I'd think you'd just have to keep those off-screen buffers alive and when the main display comes back up then you "re-composite" them.

Maybe you need a different thread for the application management vs the part of the X server that does the actual rendering or something, I don't know.

We need clients that can survive an X restart.

Posted Jan 20, 2007 8:21 UTC (Sat) by cworth (subscriber, #27653)

> GTK+ has had for years the ability to disconnect from one display and
> reconnect to another, which also means that you can connect it to a dummy
> display while your real display restarts.

There is a missing piece here though. The GTK+ code can successfully
migrate an X connection through a client-initiated disconnect. But it
turns out that design flaws in Xlib make it impossible for a client
to cleanly recover from an X server that disappears out from under the
client.

I've actually looked into what it would take to retrofit Xlib to add
what's missing. It'd be possible, but it would require a programmer
with a stronger constitution than I have to wade through the Xlib
internals to make the fix. And then one would still need to fixup
GTK+ to properly respond to the new XServerDisconnected event that
would have to be added.

Meanwhile, a more realistic approach is to get toolkits to switch to
XCB which doesn't suffer from the same shortcoming as Xlib in this area.

> However this toolkit capability has been long neglected by application
> writers.

I agree that there are some interesting aspects of migrating applications
from one X server to another that applications aren't taking advantage of.

But for the idea of replacing an X server for an entire session---I'd much
rather that be something that did not require any application knowledge at all.
That's a much quicker route to making it work reliably for as many
applications as possible.

-Carl

We need clients that can survive an X restart.

Posted Jan 22, 2007 5:19 UTC (Mon) by elanthis (guest, #6227)

With XCB around, is it really necessary to retrofit Xlib with those changes? Either way, client apps will need to be updated, and XCB brings a lot of other benefits with it, no? Most apps use a toolkit, so once you get the major ones ported (including whatever today's popular Motif clone is, and Tk) you should be set.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 11:28 UTC (Thu) by nix (subscriber, #2304)

What's more, banning DMA has a *really* high price. Yes, bus-mastering DMA means that misprogrammed hardware can scribble over any memory it likes: but the cost of avoiding it is immense (certainly far more than 5% in e.g. I/O-bound loads).

What we really need is a better MMIO controller such that devices can have multiple privilege rings (or capability tokens); with that in place, it could be made *impossible* for devices to DMA into any memory other than the memory the CPU wants them to DMA into.

But as far as I know nobody has written such a controller, let alone put it in any sort of affordable hardware. I'd be overjoyed to be corrected.
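The restriction described above boils down to a per-device allow-list that the CPU controls and the device cannot change. Here is a toy model in Python, assuming page-granular grants; the `IOMMU` class and its `grant`, `revoke` and `check_dma` methods are hypothetical, and real hardware would walk its own translation tables rather than consult a Python set.

```python
PAGE_SIZE = 4096  # assumed page granularity


class DMAFault(Exception):
    """A device tried to DMA outside the memory granted to it."""


def _pages(addr: int, length: int) -> range:
    # Every page touched by the byte range [addr, addr + length).
    return range(addr // PAGE_SIZE, (addr + length - 1) // PAGE_SIZE + 1)


class IOMMU:
    """Toy model: the CPU grants each device a set of pages it may DMA into."""

    def __init__(self) -> None:
        self.grants: dict[str, set[int]] = {}

    def grant(self, device: str, addr: int, length: int) -> None:
        self.grants.setdefault(device, set()).update(_pages(addr, length))

    def revoke(self, device: str) -> None:
        self.grants.pop(device, None)

    def check_dma(self, device: str, addr: int, length: int) -> None:
        allowed = self.grants.get(device, set())
        for page in _pages(addr, length):
            if page not in allowed:
                raise DMAFault(f"{device}: page {page:#x} not granted")
```

Because the grants live in the IOMMU rather than in the device, a misprogrammed device gets a fault instead of silently scribbling over arbitrary memory.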

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 12:08 UTC (Thu) by Los__D (guest, #15263)

He talked about constraining DMA to the memory areas needed, not banning DMA... Whether the former is possible without the latter, I have no idea.

The ban was on mmap.

Dennis

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 14:19 UTC (Thu) by nix (subscriber, #2304)

Banning mmap() of hardware would be reasonable except that... anything a bug can do to a memory-mapped region, external hardware can do to you anyway through a bug in DMA programming.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 15:28 UTC (Thu) by gnb (subscriber, #5132)

So you need an IOMMU. They are arriving on server-grade x86 hardware, so
I assume they'll make their way into people's desktops eventually. And
eventually into sub-PC priced devices.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 15:41 UTC (Thu) by cventers (subscriber, #31465)

Even then, isn't it fairly trivial to hang the bus on common PC
architecture?

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 16:17 UTC (Thu) by nix (subscriber, #2304)

Certainly a lot of hardware has bugs/misfeatures whereby it can be convinced to grab the bus and never let it go: again, graphics cards are the most common crashers. Graphics card interfaces always seem to me to have been written by madmen, from state machines where if you don't do exactly the right thing the bus locks up, through write-only memory locations, to entire undocumented languages on modern cards...

I remain impressed that Dave Airlie and the other free software graphics cards retain their sanity. I'm sure I wouldn't.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 17:19 UTC (Thu) by nix (subscriber, #2304)

Um, the other free software graphics card *hackers*. As far as I know you can't buy Dave on the high street yet (and I'm not sure how fast he'd be able to do 3D rendering).

(I'll, um, blame it on the weather. I was warned that `high winds and heavy rain are forecast and this will disruption', so presumably as well as disrupting their grammar it's disrupted my posts.)

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 19, 2007 2:17 UTC (Fri) by vonbrand (subscriber, #4458)

Doing the "banning" right presumes faultless software (elsewhere). I don't see that that software will be any simpler (and thus more probably right) than the one futzing around. Looks to me like the sum total will be buggier.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 18:15 UTC (Thu) by bcd (subscriber, #11759)

High performance software relies on tricks which are, for the most part, quite unsafe. Your 3D game only works because the GPU is allowed unchecked access to main memory. If you were to start being careful about that access, performance will suffer.

There's always a performance/reliability tradeoff. But if you ask the users of *most* systems -- outside of the limited scope of 3D gaming -- you'll find that reliability is more important, as it should be. And we're not just talking about desktop PCs: it's also about the embedded devices that we're all putting more trust into these days.

It's hard to focus on performance first and reliability later -- I've tried this in my own software, and what usually happens is the bug fixes and redesigns to address instability blow away all of the performance gains you started with. It's much easier as a developer to get it right first, and worry about the performance second. Sure, it's a close second, but it's still second.

Tanenbaum's points should be first discussed and debated on their own merits, regardless of performance implications. Do these principles really guarantee higher reliability? Are microkernels the best way to implement these principles? Trying to address performance concerns at the same time only complicates things.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 22:30 UTC (Thu) by intgr (subscriber, #39733)

Your 3D game only works because the GPU is allowed unchecked access to main memory.

As far as I can tell, this particular statement is incorrect. All graphics cards produced since AGP became widespread use an IOMMU called the GART, which ought to stop the GPU from making memory requests to bad addresses in main memory.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 25, 2007 10:55 UTC (Thu) by nix (subscriber, #2304)

What about all the other devices relying on DMA for decent performance, like disk controllers? An IOMMU specific only to graphics cards strikes me as silly.

PINE and reliability is especially punny

Posted Feb 1, 2007 11:30 UTC (Thu) by gvy (guest, #11981)

I know lots of people who consider "UW", "WU" and other signs of "made at Washington University" to be a label of inherent insecurity, a kind of unreliability too.

So running a microkernel for reliability and then running PINE on it is, yeah, a nice joke :)

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds