
Not necessarily *getting* less reliable; rather, consistently not very reliable

Posted Jul 10, 2006 18:29 UTC (Mon) by Richard_J_Neill (subscriber, #23093)
Parent article: Survey: Linux kernel quality

I wouldn't necessarily say that newer kernels (2.6.16+) are worse than older ones (2.6.12-ish). BUT, none of the kernels is sufficiently stable.

I run or administer about 10 machines, mainly desktops. None of the desktops ever reaches an uptime of more than about 2 months. I always unplug USB devices with some trepidation (even USB mass storage), and quite often experience X server crashes which take down the whole system. This inevitably happens when I have left the machine unattended for a week and cannot physically reset it!
[A software watchdog and panic=60 don't help much.]

The servers are pretty solid in normal use, although not absolutely.

In my view, Linux is no longer sufficiently stable. [I had 500 days uptime on a 2.4.19 kernel in a server]. There are (at least) the following general problems:

1) When the kernel crashes, i.e. locks up completely and requires a reset, there is no way to get diagnostic info. Why can't we use, say, the floppy drive for debug output after a panic? What happened to the idea of running a second copy of the kernel designed to take control after a panic and dump diagnostics to a file?

2) There still exist unkillable processes, and filesystems that cannot be unmounted.
kill -9 should be able to terminate a process, even if it is in the "D" state. Otherwise, a reboot is required to solve the problem; worse, you can't reboot remotely, since the kernel hangs just before the end of the shutdown.

3) Unplugging a USB device which is still in use (e.g. a USB NIC or a USB sound card) is a nearly guaranteed way to get a crash. It shouldn't happen!
[In this example, I specifically exclude mounted filesystems on USB.]

I am led to believe that this is due to poor driver design; we have just commissioned a USB driver for a USB I/O device, and this can be repeatedly hot-(un)plugged even while it is active, without causing any trouble at all. So, why not sound or networking?

On a related note, when you try to unmount a filesystem, and get a "filesystem busy" error, or when you try to rmmod a module and get "module in use", there needs to be a way to find out what is using it, and, if desired, to kill the process.

4) X server crashes should never take down the entire system. But they often do, especially when using 3D acceleration. This applies to both non-free drivers (nvidia) and free drivers (e.g. xorg's ati driver for the r128).

5) Most importantly, people such as myself (technical users who are not kernel developers) make up the majority of the Linux community. We own most of the hardware, and experience most of the more subtle bugs. Yet, as a resource, we go untapped, since there is very little we can do to debug a problem with our hardware. This is a dreadful waste of most of the community! Is there any way to automate debugging/diagnostics so that we can be of more help?
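On point 2 above: it is at least possible to see which processes are stuck in the "D" (uninterruptible sleep) state before deciding whether a reboot is unavoidable. The following is only a rough sketch, assuming a Linux /proc where the state is the first field after the parenthesised command name in /proc/<pid>/stat. The function names parse_stat_state and find_d_state are made up for illustration, and this only lists such processes; it cannot kill them.

```python
import os


def parse_stat_state(stat_text):
    """Return (comm, state) parsed from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    left, right = stat_text.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def find_d_state(proc_root="/proc"):
    """Yield (pid, comm) for every process currently in state 'D'."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not in the expected form
        if state == "D":
            yield int(entry), comm


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process that shows up here briefly is normal (it is just waiting on I/O); it is only the ones that stay in the list across repeated runs that match the "unkillable" case described above.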

Regards,

Richard



Not necessarily *getting* less reliable; rather, consistently not very reliable

Posted Jul 10, 2006 18:50 UTC (Mon) by shirgall (guest, #24745)

http://sourceware.org/systemtap/

Not necessarily *getting* less reliable; rather, consistently not very reliable

Posted Jul 10, 2006 19:13 UTC (Mon) by arjan (subscriber, #36785)

3D acceleration runs partially in the kernel, and in the case of the binary crud, almost entirely. In addition, X is effectively a ring 3 kernel component; in just about all aspects it is part of the kernel: it does DMA, it programs PCI devices, etc. That means it poses the same risk to the stability of the system as the kernel does...

and with 3D one of the most common failure scenarios is that the 3D card locks up the PCI bus. Not a lot the kernel can do after that to get anything useful out ;)

Not necessarily *getting* less reliable; rather, consistently not very reliable

Posted Jul 11, 2006 0:10 UTC (Tue) by dlang (subscriber, #313)

you can find what is using a filesystem with lsof (just do an lsof | grep path and you should be able to find the processes)
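For the curious, the same kind of lookup can be approximated by hand by walking /proc. This is a minimal sketch, not a replacement for lsof: it assumes a Linux /proc, only looks at open file descriptors plus each process's cwd and root, ignores memory-mapped files, and will miss other users' processes unless run as root. The names is_under and users_of are made up for this example.

```python
import os


def is_under(path, mountpoint):
    """True if path is the mountpoint itself or lies somewhere below it."""
    mountpoint = mountpoint.rstrip("/") or "/"
    if mountpoint == "/":
        return path.startswith("/")
    # Compare on a path-component boundary so /mnt/usbstick != /mnt/usb.
    return path == mountpoint or path.startswith(mountpoint + "/")


def users_of(mountpoint, proc_root="/proc"):
    """Return {pid: sorted paths} for cwd, root and open files under mountpoint."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd"), os.path.join(base, "root")]
        try:
            fd_dir = os.path.join(base, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # not permitted, or the process has exited
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if is_under(target, mountpoint):
                found.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in found.items()}


if __name__ == "__main__":
    # "/mnt/usb" is a placeholder mount point, not taken from the thread.
    for pid, paths in users_of("/mnt/usb").items():
        print(pid, paths)
```

Matching on a component boundary avoids the false positives that a plain grep on the path can give when one mount point's name is a prefix of another's.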

since X is accessing the memory and PCI bus directly there are all sorts of ways that it can crash the system that the kernel cannot do anything about. blame X, not the kernel for those crashes.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds