LCA: Andrew Tanenbaum on creating reliable systems
What comes to mind when Andrew Tanenbaum's name is mentioned in Linux
circles is almost always the famous debate between him and Linus Torvalds
which happened early in the history of Linux.
Mr. Tanenbaum called Linux "obsolete," and made it clear that he would not
have been proud to have Mr. Torvalds as a student; Linus made some choice
comments of his own in return.
So it was pleasant to see Andrew Tanenbaum introduced in Sydney by none other than Linus Torvalds. According to Linus, Andrew introduced him to Unix by way of Minix. Minix also convinced Linus (wrongly, he says) that writing an operating system was not hard. The similarities between the two, he said, far outweigh any differences they may have had.
The talk began by quoting Myhrvold's laws: (1) software is a gas which expands to fill its container, and (2) software is getting slower faster than hardware is getting faster. Software bloat, he says, is a huge problem. He discussed the size of various Windows releases, ending up with Windows XP at 60 million lines. Nobody, he says, understands XP. That leads to situations where people - even those well educated in computer science - do not understand their systems and cannot fix them.
The way things should be, instead, is described by the "TV model." Generally, one buys a television, plugs it in, and it just works for ten years. The computer model, instead, goes something like this: buy the computer, plug it in, install the service packs, install the security patches, install the device drivers, install the anti-virus application, install the anti-spyware system, and reboot...
...and it doesn't work. So call the helpdesk, wait on hold, and be told to reinstall Windows. A recent article in the New York Times reported that 25% of computer users have become so upset with their systems that they have hit them.
So what we want to do is to build more reliable systems. The working definition of a reliable system is this: a typical heavy user never experiences a single failure, and does not know anybody who has ever experienced a failure. Some systems which can meet this definition now include televisions, stereos, DVD players, cellular phones (though some in the audience have had different experiences), and automobiles (at least, with regard to the software systems they run). Reliability is possible, and it is necessary: "Just ask Grandma."
As an aside, Mr. Tanenbaum asked whether Linux was more reliable than Windows. His answer was "probably," based mainly on the fact that the kernel is much smaller. Even so, a quick back-of-the-envelope calculation (a typical defect density of a few bugs per thousand lines of code, applied to a kernel of several million lines) led him to conclude that there must be about 10,000 bugs in the Linux kernel. So Linux has not yet achieved the level of reliability he is looking for.
Is reliability achievable? It was noted that there are systems which can survive hardware failures; RAID arrays and ECC memory were the examples given. TCP/IP can survive lost packets, and CDROMs can handle all kinds of read failures. What we need is a way to survive software failures too. We'll have succeeded, he says, when no computer comes equipped with a reset button.
It is time, says Mr. Tanenbaum, to rethink operating systems. Linux, for all its quality, is really just a better version of Multics, a system which dates from the 1960s. It is time to refocus, bearing in mind that the environment has changed. We have "nearly infinite" hardware, but we have filled it with software weighed down with tons of useless features. This software is slow, bloated, and buggy; it is a bad direction to have taken. To achieve the TV model we need to build software which is small, modular, and self-healing. In particular, it needs to be able to replace crashed modules on the fly.
So we get into Andrew Tanenbaum's notion of "intelligent design," as applied to software. The core rules are:
- Isolate components from each other so that they cannot interfere
with each other - or even communicate unless there is a reason to do
so.
- Stick to the "principle of least authority"; no component should have
more privilege than it needs to get its job done.
- The failure of one component should not cause others to fail.
- The health of components should be monitored; if one stops operating
properly, the system should know about it.
- One must be prepared to replace components in a running system.
There is a series of steps to take to apply these principles. The first is to move all loadable modules out of the kernel; these include drivers, filesystems, and more. Each should run as a separate process with limited authority. He pointed out that this is beginning to happen in Linux with the interest in user-space drivers - though it is not clear how far Linux will go in that direction.
Then it's time to isolate I/O devices. One key to reliability is to do away with memory-mapped I/O; it just brings too many race conditions and opportunities for trouble. Access to devices is through I/O ports, and that is strictly limited; device drivers can only work with the ports they have been specifically authorized to use. Finally, DMA operations should be constrained to memory areas which the driver has been authorized to access; this requires a higher level of support from the hardware, however.
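As a rough illustration of how that port authorization might work on the kernel side, consider the following minimal sketch; the structures and names here are hypothetical, not taken from Minix 3 itself:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-driver record of granted I/O port ranges. */
struct port_range {
    uint16_t base;   /* first I/O port the driver may touch */
    uint16_t count;  /* number of consecutive ports granted */
};

struct driver_priv {
    const struct port_range *ranges;  /* ranges granted at driver start-up */
    int nranges;
};

/* Allow an access only if the port falls inside one of the ranges this
 * driver was explicitly authorized to use; the default is denial. */
static bool port_access_allowed(const struct driver_priv *priv, uint16_t port)
{
    for (int i = 0; i < priv->nranges; i++) {
        const struct port_range *r = &priv->ranges[i];
        if (port >= r->base && port < r->base + r->count)
            return true;
    }
    return false;
}

int main(void)
{
    const struct port_range granted[] = { { 0x1F0, 8 } };  /* e.g. ATA command block */
    struct driver_priv drv = { granted, 1 };

    printf("0x1F3: %s\n", port_access_allowed(&drv, 0x1F3) ? "ok" : "denied");
    printf("0x3F8: %s\n", port_access_allowed(&drv, 0x3F8) ? "ok" : "denied");
    return 0;
}
```

The point is that a driver starts with no ports at all, and every access is checked against the short list it was granted.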
The third step is minimizing privileges to the greatest extent possible. Kernel calls should be limited to those which are needed to get a job done; device drivers, for example, should not be able to create new processes. Communication between processes should be limited to those which truly need to talk to each other. And, when dealing with communications, a faulty receiver should never be able to block the sender.
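One way to picture this sort of least-authority bookkeeping is a per-process table of permitted kernel calls and permitted message destinations. The sketch below is purely illustrative; the bitmap layout and the names are assumptions, not Minix data structures:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_PROCS 256

/* Hypothetical privilege record attached to each system process. */
struct proc_privileges {
    uint64_t allowed_kcalls;         /* bit i set => kernel call i permitted */
    uint8_t  ipc_to[MAX_PROCS / 8];  /* one bit per permitted IPC destination */
};

static bool kcall_allowed(const struct proc_privileges *p, unsigned call_nr)
{
    return call_nr < 64 && (p->allowed_kcalls & (1ULL << call_nr));
}

static bool ipc_allowed(const struct proc_privileges *p, unsigned dest)
{
    return dest < MAX_PROCS && (p->ipc_to[dest / 8] & (1u << (dest % 8)));
}

int main(void)
{
    struct proc_privileges drv = { 0 };           /* starts with no rights */
    drv.allowed_kcalls = (1ULL << 3) | (1ULL << 7);  /* e.g. port read, data copy */
    drv.ipc_to[4 / 8] |= 1u << (4 % 8);              /* may talk to process 4 only */

    printf("kcall 7:  %s\n", kcall_allowed(&drv, 7) ? "ok" : "denied");
    printf("ipc to 5: %s\n", ipc_allowed(&drv, 5) ? "ok" : "denied");
    return 0;
}
```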
Mr. Tanenbaum (with students) has set out to implement all of this in Minix. He has had trouble with people continually asking for new features, but he has been "keeping it simple waiting for the messiah." That remark was accompanied by a picture of Richard Stallman in full St. Ignucious attire. Minix 3 has been completely redesigned with reliability in mind; the current version does not have all of the features described, but 3.1.3 (due around March) will.
Minix is a microkernel system, so, at the bottom level, it has a very small
kernel. It handles interrupts, the core notion of processes, and the
system clock. There is a simple inter-process communication mechanism for
sending messages around the system. It is built on a request/reply
structure, so that the kernel always knows which requests have not yet been
acted upon.
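To make the request/reply structure concrete, here is a small self-contained sketch of a server main loop in that style; it is loosely modeled on Minix message passing, but the message layout and the receive()/reply() stand-ins are simplified for illustration:

```c
#include <stdio.h>

/* Simplified message; real Minix messages are fixed-size unions. */
struct message {
    int source;   /* process that sent the request */
    int type;     /* operation being requested */
    int value;    /* illustrative payload */
};

enum { REQ_ECHO = 1, REQ_QUIT = 2 };

/* Stand-in for the blocking receive primitive: it replays a canned
 * sequence of requests so that this sketch runs on its own. */
static void receive(struct message *m)
{
    static const struct message canned[] = {
        { .source = 7, .type = REQ_ECHO, .value = 42 },
        { .source = 7, .type = REQ_QUIT, .value = 0  },
    };
    static unsigned next = 0;
    *m = canned[next++];
}

/* Stand-in for the reply primitive. */
static void reply(int dest, const struct message *m)
{
    printf("reply to %d: type=%d value=%d\n", dest, m->type, m->value);
}

int main(void)
{
    struct message m;

    for (;;) {
        receive(&m);             /* block until a request arrives */
        switch (m.type) {
        case REQ_ECHO:
            break;               /* echo the request back unchanged */
        case REQ_QUIT:
            reply(m.source, &m);
            return 0;
        default:
            m.type = -1;         /* unknown request: flag an error */
        }
        reply(m.source, &m);     /* every request gets exactly one reply,
                                    so outstanding work is easy to track */
    }
}
```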
There is also a simple kernel API for device drivers; its calls include reading and writing I/O ports (drivers do not have direct access to the ports themselves), setting interrupt policies, and copying data to and from a process's virtual address space. For virtual address space access, the driver is constrained to a range of addresses explicitly authorized by the calling process.
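From the driver's side, using such an API might look like the sketch below. The call names are modeled on Minix 3 kernel calls (sys_inb, sys_outb, sys_safecopyto), but the prototypes shown are illustrative assumptions rather than the real declarations, and the fragment only links inside such a system:

```c
#include <stdint.h>

/* Illustrative prototypes for the kernel calls a driver may make;
 * the real signatures may differ. */
int sys_inb(uint16_t port, uint8_t *value);      /* read one byte from a port  */
int sys_outb(uint16_t port, uint8_t value);      /* write one byte to a port   */
int sys_safecopyto(int caller, int grant,        /* copy into the caller's     */
                   long offset,                  /* memory, but only within    */
                   void *buf, int bytes);        /* the range the caller granted */

#define STATUS_PORT 0x1F7   /* example: an ATA status register */
#define STATUS_BUSY 0x80

/* Poll the device, then hand the result back through the caller's grant. */
int read_status_to_caller(int caller, int grant)
{
    uint8_t status;

    if (sys_inb(STATUS_PORT, &status) != 0)
        return -1;                 /* kernel refused: port not granted */
    if (status & STATUS_BUSY)
        return -1;                 /* device not ready */

    /* This copy succeeds only if (grant, offset, bytes) lies inside the
     * region the calling process explicitly authorized. */
    return sys_safecopyto(caller, grant, 0, &status, sizeof(status));
}
```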
Everything else runs in user mode. Low-level user-mode processes include the device drivers, filesystems, a process server, a "reincarnation server," an information server, a data store, a network server (implementing TCP/IP), and more. The reincarnation server's job is to be the parent of all low-level system processes. It gets notified if any of them die, and occasionally pings them to be sure that they are still responsive. Should a process go away, a table of actions is consulted to see how the system should respond; often that response involves restarting the process.
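The core of the reincarnation server's logic is easy to sketch: a table mapping each system process to a failure action, consulted whenever a child dies. In the toy program below, the service names and the restart() helper are hypothetical:

```c
#include <stdio.h>
#include <string.h>

enum action { RESTART, LOG_ONLY };

/* Hypothetical action table: what to do when each service dies. */
struct service {
    const char *name;
    enum action on_failure;
    int alive;
};

static struct service table[] = {
    { "disk_driver", RESTART,  1 },
    { "net_server",  RESTART,  1 },
    { "info_server", LOG_ONLY, 1 },
};

static void restart(struct service *s)
{
    /* In a real system this would start a fresh copy of the process and
     * notify dependent servers (e.g. filesystems) about the new instance. */
    printf("restarting %s\n", s->name);
    s->alive = 1;
}

/* Called when the parent learns that a child has exited. */
static void handle_death(const char *name)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(table[i].name, name) == 0) {
            table[i].alive = 0;
            if (table[i].on_failure == RESTART)
                restart(&table[i]);
            else
                printf("%s died; logging only\n", name);
            return;
        }
    }
}

int main(void)
{
    handle_death("disk_driver");   /* simulate a driver crash */
    return 0;
}
```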
If, for example, a disk driver dies, the reincarnation server will start a new one. It will also tell the filesystem process(es) about the fact that there is a new disk driver; the filesystems can then restart any requests that had been outstanding at the time of the failure. Things pick up where they were before. Disks are relatively easy to handle this way; servers which maintain a higher level of internal or device state can be harder.
A key point is that most operating system failures in deployed systems tend to result from transient events. If a race condition leads to the demise of a device driver, that same race is unlikely to repeat after the driver is restarted. Algorithmic errors which are repeatable will get fixed eventually, but the transient problems can be much harder to track down. So the next best thing is to be able to restart failing code and expect that things will work better the second time.
There were a number of performance figures presented. Running disk benchmarks while occasionally killing the driver had the unsurprising result of hurting performance a bit - but the system continued to run. Another set of numbers made the claim that the performance impact of the microkernel architecture was on the order of 5-10%. It's worth noting that not everybody buys those numbers; there were not a whole lot of details on how they were generated.
In summary, Mr. Tanenbaum listed a number of goals for the Minix project. Minix may well be applicable for high-reliability systems, and for embedded applications as well. But, primarily, the purpose is to demonstrate that the creation of ultra-reliable systems is possible.
The talk did show that it is possible to code systems which can isolate
certain kinds of faults and attempt to recover from them. It was an
entertaining and well-presented discussion. Your editor has not, however,
noticed a surge of sympathy for the idea of moving Linux over to a
microkernel architecture. So it is not clear whether the ideas presented
in this talk will have an influence over how Linux is developed in the
future.
