Not necessarily *getting* less reliable; rather, consistently not very reliable
Posted Jul 10, 2006 18:29 UTC (Mon) by Richard_J_Neill
Parent article: Survey: Linux kernel quality
I wouldn't necessarily say that newer kernels (2.6.16+) are worse than older ones (2.6.12-ish), BUT none of them is sufficiently stable.
I run or administer about 10 machines, mainly desktops. None of the desktops ever has uptime exceeding about two months. I always unplug USB devices with some trepidation (even USB mass storage), and quite often experience X server crashes that take down the whole system. This inevitably happens when I have left the machine unattended for a week and cannot physically reset it!
[A software watchdog and panic=60 don't help much.]
The servers are pretty solid in normal use, although not absolutely.
In my view, Linux is no longer sufficiently stable. [I once had 500 days of uptime on a 2.4.19 kernel on a server.] There are (at least) the following general problems:
1) When the kernel crashes, i.e. locks up completely and requires a hard reset, there is no way to get diagnostic information. Why can't we use, say, the floppy drive to dump debug data after a panic? What happened to the idea of running a second copy of the kernel, designed to take control after a panic and dump diagnostics to a file?
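(For what it's worth, that second-kernel idea does exist in mainline now as kexec/kdump, merged around 2.6.13: a capture kernel is preloaded into reserved memory and boots after a panic, exposing the dead kernel's memory as /proc/vmcore. A rough sketch; the file paths, memory sizes, and kernel command line below are illustrative, not a tested recipe:)

```shell
# 1. Reserve memory for the capture kernel by adding this to the
#    normal kernel's boot command line (e.g. in the grub config):
#      crashkernel=64M@16M
#
# 2. Preload the capture kernel after boot (paths are illustrative):
kexec -p /boot/vmlinuz-kdump \
      --initrd=/boot/initrd-kdump.img \
      --append="root=/dev/sda1 single irqpoll"
#
# 3. After a panic, the capture kernel boots automatically; the old
#    kernel's memory is then readable as an ordinary core file:
cp /proc/vmcore /var/crash/vmcore-$(date +%s)
```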
2) There still exist unkillable processes and unmountable filesystems.
kill -9 should be able to terminate a process, even one stuck in the uninterruptible "D" state. Otherwise a reboot is required to solve the problem; worse, you can't reboot remotely, since the kernel hangs just before the end of the shutdown.
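(At least you can see *where* such a process is stuck. A small sketch using standard ps fields: SIGKILL is only acted on when the process leaves the kernel, which is why kill -9 appears to do nothing for D-state processes, but the wchan column shows the kernel function it is blocked in, which is useful in a bug report:)

```shell
# List processes in uninterruptible sleep ("D" state), plus the
# kernel function each is blocked in (wchan). Keeps the header line.
ps -eo pid,stat,wchan,comm | awk 'NR == 1 || $2 ~ /^D/'
```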
3) Unplugging a USB device which is still in use (e.g. a USB NIC or a USB sound card) is an almost guaranteed way to get a crash. It shouldn't happen!
[In this example, I specifically exclude mounted filesystems on USB.]
I am led to believe that this is due to poor driver design; we have just commissioned a USB driver for a USB I/O device, and it can be repeatedly hot-(un)plugged, even while active, without causing any trouble at all. So why not sound or networking?
On a related note, when you try to unmount a filesystem and get a "filesystem busy" error, or try to rmmod a module and get "module in use", there needs to be a way to find out what is using it and, if desired, to kill that process.
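(For the filesystem case, fuser -vm and lsof already answer "who is using it". Even without those tools, /proc has the information; a minimal sketch, where the function name and the /mnt/usb mount point are hypothetical:)

```shell
# Print every process holding something open under a mount point,
# by walking /proc directly (no fuser/lsof required).
holders() {
    mnt=$1
    for p in /proc/[0-9]*; do
        # Check each process's open fds, working directory and root.
        for f in "$p"/fd/* "$p"/cwd "$p"/root; do
            tgt=$(readlink "$f" 2>/dev/null) || continue
            case "$tgt" in
                "$mnt"/*|"$mnt") echo "PID ${p#/proc/}: $tgt" ;;
            esac
        done
    done
}

holders /mnt/usb    # hypothetical busy mount point
```

(The module case is harder: "Used by" in lsmod names other modules, not processes, since a module's reference count is not tied to a PID.)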
4) X server crashes should never take down the entire system, but they often do, especially when using 3D acceleration. This applies to both non-free drivers (nvidia) and free ones (e.g. X.org's ati driver for the r128).
5) Most importantly, the class of people such as myself (technical users who are not kernel developers) makes up the majority of the Linux community. We own most of the hardware and experience most of the subtler bugs. Yet, as a resource, we go untapped, since there is very little we can do to debug a problem with our hardware. This is a dreadful waste of most of the community! Is there any way to automate debugging/diagnostics so that we can be of more help?
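(Even without kernel knowledge, a user can at least capture a snapshot of system state right after an incident and attach it to a bug report. A minimal sketch; the output path is hypothetical, and every command used is standard on current distributions:)

```shell
# Collect basic post-incident diagnostics into one file.
REPORT=/tmp/diag-$(date +%Y%m%d-%H%M%S).txt   # hypothetical path
{
    echo "== kernel ==";               uname -a
    echo "== loaded modules ==";       cat /proc/modules
    echo "== last kernel messages =="; dmesg | tail -50
} > "$REPORT" 2>&1
echo "wrote $REPORT"
```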