2003 Kernel Summit: High Availability

This article is part of LWN's 2003 Kernel Developers' Summit coverage.
Lars Marowsky-Brée presented the near-term needs for the support of high-availability systems. At the top of the list was a simple item: write bug-free code. If the code always works, a lot of availability problems go away.

Failing that, the next thing to do is to provide better fault isolation in the kernel. Lars pointed out that there are over 1200 calls to panic() in the kernel. Many of these calls are unnecessary; the system could recover in some way and continue functioning. Supporting high availability means not bringing down the whole system if it is not necessary.
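
As a rough illustration of the "recover instead of panic" idea, here is a minimal user-space sketch in C. Everything in it (`struct demo_dev`, `demo_dev_init()`, `demo_failing_alloc()`) is hypothetical and not taken from any kernel source; it only shows the shape of the alternative: when a resource cannot be obtained, take the one affected device offline and hand an error code back to the caller rather than stopping the whole system.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-device state; not taken from any real driver. */
struct demo_dev {
    unsigned char *buf;
    size_t buf_len;
    int online;              /* 1 while the device is usable */
};

/*
 * Set up one device using the supplied allocator. On allocation
 * failure, mark only this device offline and return -ENOMEM to the
 * caller instead of halting everything.
 */
int demo_dev_init(struct demo_dev *dev, size_t len, void *(*alloc)(size_t))
{
    dev->buf = alloc(len);
    if (dev->buf == NULL) {
        dev->buf_len = 0;
        dev->online = 0;
        return -ENOMEM;
    }
    dev->buf_len = len;
    dev->online = 1;
    return 0;
}

/* Allocator that always fails, used to exercise the error path. */
void *demo_failing_alloc(size_t len)
{
    (void)len;
    return NULL;
}
```

A caller that sees `-ENOMEM` can log the failure and carry on serving the devices that did initialize. Whether a given panic() call in the kernel can really be converted this way depends on the code around it.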

When the system does have to go down, it should make the most of the downtime. High-availability customers don't like to be told, for example, that they need to reproduce the fault. Instead, the kernel should provide things like crash dumps and sensible log entries. The kexec feature was also mentioned as a way to get a system back on its feet more quickly after a crash or a kernel upgrade.

Other features needed for high availability include cluster filesystems and a cluster volume manager. In fact, it would be nice to have a full set of clustering infrastructure - including single system image support - in the kernel. Good multipath I/O support is also needed, as is a standard interface for reporting events to programs; /sbin/hotplug is a start, but some sort of in-memory daemon would be better.

But, in the end, the main problem is that the kernel crashes. Most of the problems, says Lars, are in the driver code. Driver faults are not sufficiently isolated from the rest of the kernel. But, as Linus pointed out, achieving that sort of isolation would be difficult. If a driver fails while handling an interrupt, for example, it is almost impossible to put the system back into a reasonable state. The solution is to fix the drivers, but there is only so far that one can go in that direction. In some cases, the hardware is so bad that fixing the driver is not very helpful. What needs to be done is to concentrate on a few drivers, corresponding to good hardware, and make sure they operate properly.


Calls to panic in the kernel.

Posted Jul 22, 2003 3:23 UTC (Tue) by StevenCole (guest, #3068) [Link]

> Lars pointed out that there are over 1200 calls to panic() in the kernel.

A panic which shouldn't happen gives programmers the opportunity to write a message which should (almost) never see the light of day. Some examples:

From drivers/message/fusion/mptlan.c:

panic("Damn it Jim! I'm a doctor, not a programmer! "
                "Oh, wait a sec, I am a programmer. "
                "And, who's Jim?!?!\n"
                "Arrgghh! We've done it again!\n");

From drivers/scsi/aha1542.c:

panic("Foooooooood fight!");

In the recent past, kernel/suspend.c had this:

if (pos%PAGE_SIZE) panic("Sorry, dave, I can't let you do that!\n");

Internet Explorer is broke

Posted Jul 22, 2003 8:15 UTC (Tue) by stuart (subscriber, #623) [Link]

I'm at work at the moment and having to use smelly Internet Explorer. Lars's surname appears to have a copyright 'c' (with a circle around it) in it. Is this intended, or just another bug in IE?

Cheers,
Stu.

Internet Explorer is broke

Posted Jul 22, 2003 8:44 UTC (Tue) by fiberbit (subscriber, #693) [Link]

It's just a small mixup of codepages: the page says it's in ISO-8859-1, whereas Lars's surname is written here in UTF-8. Your browser is not at fault.

driver and device reliability authentication and DRM monopolisation

Posted Jul 22, 2003 10:31 UTC (Tue) by copsewood (subscriber, #199) [Link]

This is a very difficult one for a free-software decentralised development model to get right, because it depends upon the trustability of specific hardware and software, both of which are constantly changing. Where you have a single development auditing authority, they can charge for checking and signing something, and of course in the area of centralised systems development such an authority can readily be identified i.e. Microsoft. Even this doesn't totally prevent availability of a driver cryptographically signed as audited and stress tested against particular hardware being used on less reliable cheaper clone hardware, or (which could be much worse) the digital signature being used to squash competition in an otherwise clonable hardware market.

Truly decentralised trust management is an inherently very difficult thing to achieve, which is part of the reason why people are so dependent upon centralised banks to control the thing we call money, despite high charges and terrible service creating a sloping playing field (called economy while significantly misusing the term). A potential danger in the Linux world is a single vendor firstly getting a reputation for signing drivers responsibly, and then progressively using DRM techniques to turn this and connected markets into a rent-seeking monopolistic hierarchy.

driver and device reliability authentication and DRM monopolisation

Posted Jul 22, 2003 13:54 UTC (Tue) by ggoebel (guest, #4487) [Link]

> Truly decentralised trust management is an inherently very
> difficult thing to achieve, which is part of the reason why
> people are so dependant upon centralised banks to control the
> thing we call money...

Actually, decentralized banking worked quite well before central governments eliminated it. And if you examine the collapse of the corrupt Second Bank of the US and the State Banking Era (1837-1862), you'd see that decentralized banking re-emerged when trust and efficiency were required.

Open Source gives much the same advantage: trust and efficiency through an open distributed meritocracy. Who cares if you can get signed drivers? What is the signature worth?

Decentralized trust management is in fact easy to achieve and scale. We do it every day. You do it when you ask a friend for a recommendation on a contractor or dentist, write a book review on Amazon, or read Consumer Reports before buying a dish washer.

It is based on reputations and chains of trust. Linus accepts patches from people he trusts or code he himself will examine and vouch for. And likewise down the chain of trust to the person submitting her first patch. And as a direct consequence, bugs get fixed and features added in direct proportion to how much it matters to the people involved. Show me a centralized control process that scales better.

Your warning of danger is misplaced. The primary way a monopoly can form under open source licensing is through merit. Which is why everyone still regularly attempts to merge their forks back in with Linus' kernel. The only other way to gain a monopoly is through fear, which I'd argue is what SCO/Caldera is attempting at the moment.

You should read your Machiavelli. Any single vendor who depends on goodwill to maintain a position of power can lose that goodwill as easily as they gained it. Sure, there is inertia to overcome, but even Linus could conceivably be knocked from his roost. And he might have been in the not too distant past, if he hadn't started using BitKeeper and improved his ability to merge patches faster.

driver and device reliability authentication and DRM monopolisation

Posted Jul 22, 2003 17:43 UTC (Tue) by iabervon (subscriber, #722) [Link]

I disagree; I think that it would be pretty easy to add support for signed code. Of course, you could just turn it off, but users who want to use signed code wouldn't do that. Signing could be done by anyone, and the users could be made aware in this way of who has tested the code on what; you don't really need decentralized trust in such an application, because customers who want this sort of thing are generally getting support from somebody, trust them, and will ask them to verify anything else they might use. (Or they might have a testing environment themselves, sign code which passes, and only allow signed code in production.)
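
A minimal sketch in C of the policy described above, under loud assumptions: `struct demo_module`, `demo_may_load()`, and the signer names are all hypothetical, and the `signature_valid` flag stands in for a real cryptographic verification that is not shown here. The sketch covers only the decision: enforcement can be switched off, and when it is on, a site loads just the code signed by somebody it has chosen to trust (its support vendor, or its own test lab).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record of who claims to have tested and signed a module. */
struct demo_module {
    const char *name;
    const char *signer;      /* NULL when the module is unsigned */
    int signature_valid;     /* outcome of a real crypto check done elsewhere */
};

/*
 * Site policy: with enforcement off, everything may load. With it on,
 * only modules carrying a valid signature from a locally trusted
 * signer may load. Returns 1 to allow, 0 to refuse.
 */
int demo_may_load(const struct demo_module *m, int enforce,
                  const char *const *trusted, size_t n_trusted)
{
    size_t i;

    if (!enforce)
        return 1;
    if (m->signer == NULL || !m->signature_valid)
        return 0;
    for (i = 0; i < n_trusted; i++)
        if (strcmp(m->signer, trusted[i]) == 0)
            return 1;
    return 0;
}
```

The trusted list is purely local configuration, which is the point of the comment: no global authority is needed, only the signers a particular customer already relies on.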

Isn't that what we use?

Posted Jul 31, 2003 16:20 UTC (Thu) by job (guest, #670) [Link]

But, we already have that!

Drivers for Linux must pass a quite rigorous quality test to become an official driver. Those are signed and distributed together, through the official channels.

2003 Kernel Summit: High Availability

Posted Jul 31, 2003 8:32 UTC (Thu) by ofranja (subscriber, #11084) [Link]

That's where micro-kernels have a lot of advantages: driver isolation is one of their principles. Maybe with initramfs, and with ACPI and other pieces of kernel code (knfsd, for example) turned into daemons, we'll see a more stable kernel. But that's just a start.

BTW, being able to kill -9 some badly written or buggy program is much better than hanging every program that tries to access /mnt/my_nfs_share and having to reboot the machine.

PS: Micro-kernels are fast, real, and work. Take a look at http://l4ka.org/ and enjoy. There's even a Linux 2.2 server available.
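
The kill -9 point above can be sketched in a few lines of C using ordinary POSIX calls (`fork()`, `kill()`, `waitpid()`). The worker and the helper name `demo_run_and_kill()` are hypothetical; the worker simply hangs forever, standing in for a misbehaving service that has been moved out of the kernel and into a process. This is only an illustration of process-level isolation, not a claim about how any particular micro-kernel or daemon is implemented.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker standing in for a hung user-space service. */
void demo_hung_worker(void)
{
    for (;;)
        pause();             /* sleep until a signal arrives, forever */
}

/*
 * Run the worker in a child process, kill it with SIGKILL (the signal
 * behind "kill -9"), and reap it. Returns the number of the signal
 * that ended the child, 0 if it exited on its own, or -1 on error.
 * The calling process keeps running throughout.
 */
int demo_run_and_kill(void (*worker)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        worker();            /* child: run the faulty service */
        _exit(0);
    }
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Because the hang lives in a separate process, the supervisor can remove it and carry on, which is exactly what is not possible when the same hang happens inside kernel code.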

2003 Kernel Summit: High Availability

Posted Jul 31, 2003 16:23 UTC (Thu) by job (guest, #670) [Link]

Did you read the article? If a driver crashes in an interrupt request, it can be
very hard to catch errors and reinitialize everything to (hopefully) restore the
hardware's state.

2003 Kernel Summit: High Availability

Posted Jul 31, 2003 21:40 UTC (Thu) by mmarq (guest, #2332) [Link]

Yes, you are right: being micro or monolithic doesn't make a difference for that matter, but... a kernel which has all of its "determinant and prone to errors" drivers "OUTSIDE" of the "kernel" is a major boost to any reliability point you might consider.
Linux has already done that for quite some time by allowing external modules... WHY NOT MAKE THAT THE GENERAL DEFAULT?
It only needs DKMS to evolve to a better state, and that way we can make "a hardware abstraction layer" that really works!! In my view this would be a kind of split driver model, following in the footsteps of today's USB structure: we can have a DRM5 (direct rendering) abstraction layer for video devices, an ALSA2 abstraction layer for sound devices, and a special in-kernel D-BUS abstraction layer for communication between devices and for stacking multiple devices into multipurpose devices...
And all that without ceasing to be a monolithic kernel.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds