Diskdump: a new crash dump system
Posted Jun 3, 2004 12:36 UTC (Thu) by danpb
Parent article: Diskdump: a new crash dump system
Red Hat kernels have included an alternative crash dump facility called 'netdump' for some time now. It dumps to a host on the network rather than to a local disk, precisely to avoid some of the issues with complex disk controllers and interrupts.
So what? There are two main problems that come up: failure to dump a memory image, and overwriting parts of file systems because the crash has damaged some data structures or code being used to do the dump. Do not laugh, the latter happens in real life; drivers, the SCSI layer, and other intermediate data structures or code are as common a place as any for bugs that cause a crash. A simple failure to dump the memory image is the more common of the two, and can be caused by a myriad of problems: failures in interrupt handling (interrupts being disabled at the time of the crash is a common one), locks taken and not released, and data structures left inconsistent at the time of the crash, any of which can cause the system to wait forever.
By contrast, network devices are simple and easy to modify to enable a non-interrupt-driven polled mode. Even if there is a bug in a network device driver, it will most likely not disable the crash dump over the network, because the code path used for network crash dump is highly restricted. The entire network stack can crash and network crash dump can still work, because the network crash dump code implements a separate, small but standards-compliant subset of the UDP protocol sufficient to perform the crash dump. Interrupts can be disabled, arbitrary locks can be held indefinitely, and the network crash dump will still function perfectly.
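To give a feel for how small a "sufficient subset of UDP" can be, here is a rough sketch in Python of building a valid UDP datagram by hand, with no help from the operating system's network stack. This is not netdump's code (which lives in the kernel and is written in C); the function names, addresses, port number, and payload are all made up for illustration. The header layout and checksum follow the standard UDP rules: an 8-byte header, and a ones'-complement checksum taken over an IPv4 pseudo-header plus the datagram.

```python
import struct

def ones_complement_sum16(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the UDP header and payload."""
    # Pseudo-header: source, destination, zero byte, protocol 17 (UDP), UDP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    csum = ~ones_complement_sum16(pseudo + udp_segment) & 0xFFFF
    return csum or 0xFFFF  # a computed zero is transmitted as all ones

def build_udp_datagram(src_ip: bytes, dst_ip: bytes,
                       src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Return UDP header + payload, ready to be wrapped in an IP packet."""
    length = 8 + len(payload)  # the UDP header is always 8 bytes
    # First pass with a zero checksum field, as the checksum rule requires.
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = udp_checksum(src_ip, dst_ip, header + payload)
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload

# Hypothetical example values: addresses, port, and payload are arbitrary.
src = bytes([192, 168, 0, 1])
dst = bytes([192, 168, 0, 2])
datagram = build_udp_datagram(src, dst, 6666, 6666, b"crash page")
```

A receiver can verify such a datagram by summing the pseudo-header and the datagram together; a correct checksum makes the total come out as 0xFFFF. The point is only that nothing here needs sockets, routing tables, or any other shared state that a crash might have corrupted.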