Diskdump: a new crash dump system

A standard feature of most commercial operating systems is a "crash dump" facility. If something goes wrong in the operating system kernel, the system saves its entire state to a file and reboots; the contents of that file can then be examined at leisure to try to figure out what went wrong. The Linux kernel, however, lacks this capability. There are a few possible reasons for this omission: the kernel never crashes (not quite true, unfortunately), kernel developers rarely want crash dumps for their own work, and there is a certain degree of unhappiness with all of the crash dump patches currently in circulation. The fact of the matter, however, is that a number of Linux vendors would like to have a good crash dump system in place so they can better support their customers.

A recent patch posted by Takao Indoh may provide that capability. The new "diskdump" system has taken a simpler approach to crash dumps that, with some fixes, may just get enough core hacker support to be considered for merging into the (presumably 2.7) mainline.

Diskdump works by taking absolute control of the system when a panic occurs. It shuts down all interrupts to keep the processor from getting distracted; it also freezes all other processors on SMP systems. It then checksums its own code, comparing against a value computed at initialization time; if the checksums fail to match, diskdump assumes that it has been corrupted as a result of whatever went wrong and refuses to run.
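
The self-check can be illustrated with a small userspace sketch in C. It is not diskdump's code: the article does not say which checksum the patch uses, so the rotate-and-XOR sum below, the `dump_code` buffer standing in for the dump routines' text, and the names `dump_init()` and `dump_code_intact()` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory region holding the dump code. */
static unsigned char dump_code[256];

/* Checksum recorded at initialization time. */
static uint32_t saved_checksum;

/* Rotate-and-XOR checksum; an assumption, the real algorithm is not given. */
static uint32_t region_checksum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;

    assert(p != NULL);
    for (size_t i = 0; i < len; i++)
        sum = ((sum << 1) | (sum >> 31)) ^ p[i];
    return sum;
}

/* Run once while the system is still healthy. */
static void dump_init(void)
{
    saved_checksum = region_checksum(dump_code, sizeof(dump_code));
}

/* Run at panic time: returns 1 if the code is unchanged, 0 otherwise. */
static int dump_code_intact(void)
{
    return region_checksum(dump_code, sizeof(dump_code)) == saved_checksum;
}
```

In this sketch a caller would invoke `dump_init()` once at setup time and refuse to dump if `dump_code_intact()` returns 0 after a panic.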

The next step involves finding a place to store the crash dump. Diskdump can be set up with multiple dump partitions. For each possibility, it queries the state of the driver, then reads and verifies the entire crash dump space. The diskdump authors are (rightly) fearful of overwriting important data while the system is in an unstable state, so diskdump requires that every block of the crash dump partition be initialized with a special pattern. If any blocks fail the test, that destination will not be used.
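
The partition check might look roughly like the following userspace sketch, which treats the partition as an in-memory array of blocks. The marker value `DUMP_PATTERN`, the block size, and every function name are hypothetical; the article says only that each block must carry "a special pattern" and does not describe it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DUMP_BLOCK_SIZE 4096u
#define DUMP_PATTERN    0x64756d70u  /* hypothetical marker value */

/* Fill one block with the marker, as a formatting tool might do. */
static void format_block(unsigned char *block)
{
    uint32_t word = DUMP_PATTERN;

    assert(block != NULL);
    for (size_t off = 0; off < DUMP_BLOCK_SIZE; off += sizeof(word))
        memcpy(block + off, &word, sizeof(word));
}

/* Returns 1 if every word of the block matches the marker. */
static int block_is_formatted(const unsigned char *block)
{
    uint32_t word;

    for (size_t off = 0; off < DUMP_BLOCK_SIZE; off += sizeof(word)) {
        memcpy(&word, block + off, sizeof(word));
        if (word != DUMP_PATTERN)
            return 0;
    }
    return 1;
}

/* Returns 1 only if all blocks pass; one bad block rejects the partition. */
static int partition_is_usable(const unsigned char *part, size_t nr_blocks)
{
    assert(part != NULL);
    for (size_t i = 0; i < nr_blocks; i++)
        if (!block_is_formatted(part + i * DUMP_BLOCK_SIZE))
            return 0;
    return 1;
}
```

Rejecting the whole destination on a single mismatch mirrors the conservative behavior described above: a block without the marker may hold real data, so nothing is written there.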

When a suitable location has been found, diskdump writes a header with the system state and panic information, followed by a memory image. At that point the system can be rebooted; once things are stable again, the "savecore" utility turns the memory image into a proper core dump and reinitializes the crash dump partition. All is then in readiness for debugging and, if need be, the next crash.

Diskdump needs some significant block driver modifications to be able to do its job. The driver must export a new set of operations:

    struct disk_dump_device_ops {
        int (*sanity_check)(struct disk_dump_device *);
        int (*quiesce)(struct disk_dump_device *);
        int (*shutdown)(struct disk_dump_device *);
        int (*rw_block)(struct disk_dump_partition *, int rw,
                        unsigned long block_nr, void *buf);
    };

The sanity_check() call checks to ensure that the device in question is ready to accept a crash dump. If that function finds that, for example, the device is offline or somebody, somewhere, is holding a spinlock for the device, the sanity check will fail and the dump will have to go somewhere else. A call to quiesce() follows, in case any preparation is needed. The current implementation (which only works with some SCSI devices) performs a full SCSI bus reset at this point. The actual I/O is done via rw_block(), which is expected to transfer one page per call. This I/O should be done without interrupts (which are, remember, disabled when the panic happens), so the typical implementation will work by polling the device. At the end, shutdown() is called to ensure that all blocks have been flushed to the media.
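
The calling sequence described here can be sketched in userspace with a RAM-backed mock device. Only the operations structure comes from the patch as quoted above; the layouts of `struct disk_dump_device` and `struct disk_dump_partition`, the `DUMP_WRITE` constant, the `ram_*` functions, `dump_memory()`, and the convention that 0 means success are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DUMP_PAGE_SIZE 4096u
#define DUMP_NR_BLOCKS 8u
#define DUMP_WRITE     1   /* hypothetical value for the rw argument */

struct disk_dump_device;
struct disk_dump_partition;

struct disk_dump_device_ops {
    int (*sanity_check)(struct disk_dump_device *);
    int (*quiesce)(struct disk_dump_device *);
    int (*shutdown)(struct disk_dump_device *);
    int (*rw_block)(struct disk_dump_partition *, int rw,
                    unsigned long block_nr, void *buf);
};

/* Hypothetical layouts; the article does not show the real ones. */
struct disk_dump_device {
    const struct disk_dump_device_ops *ops;
    int online;
    int flushed;
};

struct disk_dump_partition {
    struct disk_dump_device *dev;
    unsigned char store[DUMP_NR_BLOCKS][DUMP_PAGE_SIZE];
};

static int ram_sanity_check(struct disk_dump_device *dev)
{
    return dev->online ? 0 : -1;   /* an offline device fails the check */
}

static int ram_quiesce(struct disk_dump_device *dev)
{
    (void)dev;                     /* nothing to prepare for a RAM device */
    return 0;
}

static int ram_shutdown(struct disk_dump_device *dev)
{
    dev->flushed = 1;              /* record that the final flush happened */
    return 0;
}

static int ram_rw_block(struct disk_dump_partition *part, int rw,
                        unsigned long block_nr, void *buf)
{
    if (block_nr >= DUMP_NR_BLOCKS)
        return -1;
    if (rw == DUMP_WRITE)
        memcpy(part->store[block_nr], buf, DUMP_PAGE_SIZE);
    else
        memcpy(buf, part->store[block_nr], DUMP_PAGE_SIZE);
    return 0;
}

static const struct disk_dump_device_ops ram_ops = {
    ram_sanity_check, ram_quiesce, ram_shutdown, ram_rw_block
};

/* Check, quiesce, write one page per rw_block() call, then shut down. */
static int dump_memory(struct disk_dump_partition *part,
                       unsigned char *mem, unsigned long nr_pages)
{
    const struct disk_dump_device_ops *ops;

    assert(part != NULL && part->dev != NULL && mem != NULL);
    ops = part->dev->ops;
    if (ops->sanity_check(part->dev) != 0)
        return -1;
    if (ops->quiesce(part->dev) != 0)
        return -1;
    for (unsigned long i = 0; i < nr_pages; i++)
        if (ops->rw_block(part, DUMP_WRITE, i, mem + i * DUMP_PAGE_SIZE) != 0)
            return -1;
    return ops->shutdown(part->dev);
}
```

A real driver would replace the `memcpy()` calls with polled I/O against the hardware, since interrupts are off for the whole sequence.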

Perhaps the ugliest part of the patch - and the part which some developers have complained about - is the rerouting of timer and tasklet calls. Since all interrupts are disabled, the normal timer and software interrupt mechanisms will not function. Diskdump does not need those capabilities itself, but a number of disk drivers do. As a result, diskdump must, somehow, run tasklets and timers expected by the driver, but without running arbitrary code unrelated to the dump process. To this end, diskdump sets up its own private timer and tasklet lists which come into action once the system is locked down and the dump process begins.
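
As a rough illustration of the private-list idea, the sketch below keeps timers on a list of its own and runs the expired ones from a polling loop instead of from a timer interrupt. It is a userspace toy, not the patch's implementation: `struct dump_timer`, `dump_add_timer()`, `dump_run_timers()`, and the `mark_done()` callback are hypothetical names, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical timer record, loosely modeled on a kernel timer. */
struct dump_timer {
    unsigned long expires;             /* tick at which the timer fires */
    void (*function)(unsigned long);   /* driver callback */
    unsigned long data;                /* argument passed to the callback */
    struct dump_timer *next;
};

/* Private list: only timers queued during the dump ever appear here. */
static struct dump_timer *dump_timers;

static void dump_add_timer(struct dump_timer *t)
{
    assert(t != NULL && t->function != NULL);
    t->next = dump_timers;
    dump_timers = t;
}

/*
 * Called from the dump's polling loop. Unlinks and runs every timer whose
 * expiry is at or before 'now', and returns how many callbacks ran.
 */
static int dump_run_timers(unsigned long now)
{
    int ran = 0;
    struct dump_timer **link = &dump_timers;

    while (*link != NULL) {
        struct dump_timer *t = *link;

        if (t->expires <= now) {
            *link = t->next;           /* unlink before calling back */
            t->next = NULL;
            t->function(t->data);
            ran++;
        } else {
            link = &t->next;
        }
    }
    return ran;
}

/* Example driver callback: counts completed timeouts. */
static int completions;

static void mark_done(unsigned long data)
{
    completions += (int)data;
}
```

Because the list is separate from the kernel's own, timers queued by unrelated code before the crash never run during the dump, which is the property the paragraph above describes.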

Currently, all this works by modifying the drivers to call diskdump's functions rather than the core kernel variants. So, for example, instead of setting up a timer with add_timer(), a driver implementing dumps would call this little wrapper:

    static inline void diskdump_add_timer(struct timer_list *timer)
    {
        if (crashdump_mode())
            _diskdump_add_timer(timer);
        else
            add_timer(timer);
    }

But that function is only available if crash dumps are configured into the system, so some preprocessor macros are used to redefine add_timer() if need be. This solution is not going to make it into the mainline kernel, however. The preferred approach would appear to be integrating this functionality directly into the core timer and tasklet routines; that change will make the driver changes smaller, but at the cost of intruding into some of the core kernel code.
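
The macro redirection could be arranged along the following lines. This is a userspace mock, not the patch itself: the article does not show the actual macros, so the `CONFIG_DISKDUMP` symbol, the counters, the stub bodies, and `driver_start_timeout()` are assumptions; only `diskdump_add_timer()` follows the wrapper quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define CONFIG_DISKDUMP 1   /* hypothetical configuration symbol */

struct timer_list { unsigned long expires; };

/* Mock state so the two paths can be told apart. */
static int in_crashdump;
static int normal_adds;
static int dump_adds;

/* Stubs standing in for the core kernel and diskdump implementations. */
static void add_timer(struct timer_list *timer)
{
    (void)timer;
    normal_adds++;
}

static void _diskdump_add_timer(struct timer_list *timer)
{
    (void)timer;
    dump_adds++;
}

static int crashdump_mode(void)
{
    return in_crashdump;
}

/* The wrapper from the article; here add_timer still names the stub above. */
static inline void diskdump_add_timer(struct timer_list *timer)
{
    if (crashdump_mode())
        _diskdump_add_timer(timer);
    else
        add_timer(timer);
}

/* From this point on, driver code that says add_timer() gets the wrapper. */
#ifdef CONFIG_DISKDUMP
#define add_timer(timer) diskdump_add_timer(timer)
#endif

/* Driver source is unchanged: it still calls add_timer(). */
static void driver_start_timeout(struct timer_list *timer)
{
    assert(timer != NULL);
    add_timer(timer);
}
```

The ordering matters in this sketch: the wrapper is defined before the macro so that its own `add_timer()` call reaches the original function rather than recursing.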


Diskdump: a new crash dump system

Posted Jun 3, 2004 12:36 UTC (Thu) by danpb (subscriber, #4831) [Link]

Red Hat kernels have had an alternative crash dump facility called 'netdump' in them for some time now:

http://www.redhat.com/support/wpapers/redhat/netdump/

This dumps to a host on the network rather than local disk precisely to avoid some of the issues with complex disk controllers and interrupts.

<quote>
So what? There are two main problems that come up; failure to dump a memory image, and overwriting parts of file systems because the crash has damaged some data structures or code being used to do the dump. Do not laugh, the latter happens in real life; failures in drivers, the SCSI layer, or other intermediate data structures or code is as common a place as any for bugs that cause a crash. A simple failure to dump the memory image is the more common of the two, and can be caused by a myriad of problems, including failures in interrupt handling (for example, interrupts being disabled at the time of the crash; a common problem), locks taken and not released, and data structures that are inconsistent at the time of the crash causing the system to wait forever.

By contrast, network devices are simple, are easy to modify to enable a non-interrupt-driven polled mode, and even if there is a bug in a network device driver, it is entirely likely not to disable the crash dump over the network, because the code path used for network crash dump is highly restricted. The entire network stack can crash and network crash dump can still work, because the network crash dump code implements a separate small but standard-compliant subset of the UDP protocol sufficient to perform the crash dump. Interrupts can be disabled, arbitrary locks can be held indefinitely, and the network crash dump will still function perfectly.
</quote>

Diskdump: a new crash dump system

Posted Jun 14, 2004 9:44 UTC (Mon) by fillods (subscriber, #22226) [Link]

Better than a core dump, which captures the state of the system when the fault happened, a flight recorder (a kind of tracer) would be very helpful for telling what sequence of actions (i.e. function calls, exceptions, ...) led to this situation.

Provided the system is equipped with some non-volatile memory, the contents of the flight recorder could be stored continuously, making it possible to debug system freezes, which the new diskdump does not address (though it was not meant to).

Has any reader heard of such a tool?

Diskdump: a new crash dump system

Posted Mar 16, 2007 14:21 UTC (Fri) by leitao (subscriber, #42946) [Link]

Yes, a very good feature that enables you to make the dump easier to understand.
The best thing is that you could save the core dump in another partition, which is useful if you are debugging disk I/O.

Regards,
Breno Leitão

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds