Persistent storage for a kernel's "dying breath"

By Jake Edge
March 23, 2011

When Linux systems crash, there are various ways to find out what went wrong, but generally those rely on writing to log files on disk. For some systems, the disk may not be available, or may not be trustworthy after a crash, so a way to poke some data into a platform-specific place for use by a subsequent kernel boot would be useful. That's exactly what the pstore filesystem, which was just added during the current 2.6.39 merge window, will provide.

The idea for pstore came out of a conversation between Tony Luck and Thomas Gleixner at last year's Linux Plumbers Conference. Luck wanted to use the ACPI error record serialization table (ERST) to store crash information across a reboot. The ERST is a mechanism specified by the ACPI specification [PDF] (in section 17.4, page 519) that allows saving and retrieving hardware error information to and from a non-volatile location (like flash).

Rather than just doing something specific for the x86 architecture, he decided to create a more general framework so that other platforms could use whatever persistent storage they had available. It would be, as Luck put it, "a generic layer for persistent storage usable to pass tens or hundreds of kilobytes of data from the dying breath of a crashing kernel to its successor".

There have been a number of iterations of the code since Luck first posted it for comments back in November. Following a suggestion from Alan Cox, pstore moved from its original firmware driver with a sysfs interface to a more straightforward filesystem-based implementation.

The basic idea is that a platform can register the availability of a persistent storage location with a call to pstore_register() and pass a pointer to a struct pstore_info, which looks like:

    struct pstore_info {
        struct module   *owner;
        char            *name;
        struct mutex    buf_mutex;      /* serialize access to 'buf' */
        char            *buf;
        size_t          bufsize;
        size_t          (*read)(u64 *id, enum pstore_type_id *type,
                                struct timespec *time);
        u64             (*write)(enum pstore_type_id type, size_t size);
        int             (*erase)(u64 id);
    };

The platform driver needs to provide three I/O routines and a buffer. There is also a mutex present to protect against simultaneous access to the buffer. With that, pstore will implement a filesystem that can be accessed from the kernel—or from user space once it has been mounted. The underlying ERST storage is record oriented, and Luck posits that other platform storage areas will be also, so the I/O interface is record oriented as well.

In addition to the pstore framework, the ERST driver was modified to take advantage of pstore; that change was also merged, so there is an in-kernel user of pstore. The pstore_info buffer is allocated and managed by drivers/acpi/apei/erst.c, and is larger than the advertised bufsize to account for the record and section headers required by ERST. Users of the I/O interface either fill the buffer before calling pstore_info.write() or read the data from the buffer after a call to pstore_info.read().
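To make that record-oriented contract a little more concrete, here is a minimal user-space sketch; it is not kernel code. Every mock_* name and both sizes are invented for illustration, and, unlike the real callbacks, these functions take an explicit store pointer. It only shows the calling convention described above: the caller fills the shared buffer before a write, and finds the data in that buffer after a read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space mock only: hypothetical names, not the kernel definitions. */
enum mock_type { MOCK_TYPE_DMESG, MOCK_TYPE_MCE, MOCK_TYPE_UNKNOWN };

#define MOCK_BUFSIZE 64   /* assumed payload size advertised to callers */
#define MOCK_SLOTS   4    /* assumed number of records the store can hold */

struct mock_record {
    uint64_t id;              /* 0 marks a free slot */
    enum mock_type type;
    size_t size;
    char data[MOCK_BUFSIZE];
};

struct mock_store {
    char buf[MOCK_BUFSIZE];   /* shared transfer buffer, like pstore_info.buf */
    size_t bufsize;
    uint64_t next_id;
    size_t cursor;            /* next slot that mock_read() will look at */
    struct mock_record slots[MOCK_SLOTS];
};

void mock_init(struct mock_store *s)
{
    memset(s, 0, sizeof(*s));
    s->bufsize = MOCK_BUFSIZE;
    s->next_id = 1;
}

/* The caller fills s->buf first. Returns the new record ID, or 0 on failure. */
uint64_t mock_write(struct mock_store *s, enum mock_type type, size_t size)
{
    if (size == 0 || size > s->bufsize)
        return 0;
    for (size_t i = 0; i < MOCK_SLOTS; i++) {
        struct mock_record *r = &s->slots[i];
        if (r->id == 0) {
            r->id = s->next_id++;
            r->type = type;
            r->size = size;
            memcpy(r->data, s->buf, size);
            return r->id;
        }
    }
    return 0;   /* the store is full */
}

/* Copies the next stored record into s->buf and returns its size,
 * or 0 once every slot has been visited. */
size_t mock_read(struct mock_store *s, uint64_t *id, enum mock_type *type)
{
    assert(id != NULL && type != NULL);
    while (s->cursor < MOCK_SLOTS) {
        struct mock_record *r = &s->slots[s->cursor++];
        if (r->id != 0) {
            *id = r->id;
            *type = r->type;
            memcpy(s->buf, r->data, r->size);
            return r->size;
        }
    }
    return 0;
}

/* Frees the record with the given ID. Returns 0 on success, -1 if absent. */
int mock_erase(struct mock_store *s, uint64_t id)
{
    if (id == 0)
        return -1;
    for (size_t i = 0; i < MOCK_SLOTS; i++) {
        if (s->slots[i].id == id) {
            s->slots[i].id = 0;
            return 0;
        }
    }
    return -1;
}
```

A caller would copy its message into the store's buf, call mock_write() with the type and length, and keep the returned ID for a later mock_erase().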

Each item is stored with a type, either PSTORE_TYPE_DMESG for log messages (likely oops output), PSTORE_TYPE_MCE for hardware errors, or PSTORE_TYPE_UNKNOWN for other undefined types. When stored, each item gets a record ID associated with it, which gets returned from the pstore_info.write() call. That ID can then be used in read() and erase() operations, but it also appears in the filenames in the pstore filesystem.
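As a rough illustration of how a record's type and ID could end up in a filename of the form type-driver-id (dmesg-erst-1, for example), consider the user-space sketch below. The rec_* names are hypothetical, and the "mce" and "unknown" prefixes are assumptions; only the "dmesg" prefix is confirmed by the example in this article.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the three record types named in the article. */
enum rec_type { REC_DMESG, REC_MCE, REC_UNKNOWN };

/* Filename prefix for a type; "mce" and "unknown" are assumed spellings. */
static const char *rec_type_prefix(enum rec_type type)
{
    switch (type) {
    case REC_DMESG: return "dmesg";
    case REC_MCE:   return "mce";
    default:        return "unknown";
    }
}

/* Builds "<type>-<backend>-<id>" in out. Returns 0 on success, or -1 if the
 * backend name is unusable or the result does not fit in outsize bytes. */
int rec_file_name(char *out, size_t outsize, enum rec_type type,
                  const char *backend, uint64_t id)
{
    assert(out != NULL && backend != NULL);
    /* A backend name must be non-empty and cannot contain a path separator. */
    if (backend[0] == '\0' || strchr(backend, '/') != NULL)
        return -1;
    int n = snprintf(out, outsize, "%s-%s-%" PRIu64,
                     rec_type_prefix(type), backend, id);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

Calling rec_file_name() with REC_DMESG, "erst", and 1 produces the name used as the example in this article.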

The filesystem can be mounted using:

    # mount -t pstore - /dev/pstore

Files will appear there with names based on the type, the name of the storage driver, and the ID, so the first dmesg record for ERST would be /dev/pstore/dmesg-erst-1. The typical scenario would be for the filesystem to be mounted at boot time, then some user-space process would check for any files there, copy them to some more permanent place, and delete the files with rm. That will allow the storage facility driver to reclaim the space in order to be ready for other crashes or hardware errors.
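The boot-time scenario of checking for records, copying them elsewhere, and then deleting them could be sketched in user space roughly as follows. This is only an assumption about what such a helper might look like: harvest_records() and its helpers are hypothetical names, the two directories are supplied by the caller, and a record is unlinked only after its copy has been written successfully.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Joins dir and name into out. Returns 0, or -1 if the path does not fit. */
static int join_path(char *out, size_t outsize, const char *dir, const char *name)
{
    int n = snprintf(out, outsize, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}

/* Copies src to dst byte for byte. Returns 0 on success, -1 on any error. */
static int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    char chunk[4096];
    size_t n;
    int rc = 0;
    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        if (fwrite(chunk, 1, n, out) != n) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

/* Saves every regular file in pstore_dir to save_dir, then unlinks the
 * original so the backend can reclaim the space. Returns the number of
 * records saved and removed, or -1 if pstore_dir cannot be opened. */
int harvest_records(const char *pstore_dir, const char *save_dir)
{
    assert(pstore_dir != NULL && save_dir != NULL);
    DIR *d = opendir(pstore_dir);
    if (d == NULL)
        return -1;
    int saved = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        char src[4096], dst[4096];
        struct stat st;
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        if (join_path(src, sizeof src, pstore_dir, e->d_name) != 0 ||
            join_path(dst, sizeof dst, save_dir, e->d_name) != 0)
            continue;
        if (stat(src, &st) != 0 || !S_ISREG(st.st_mode))
            continue;
        if (copy_file(src, dst) != 0)
            continue;           /* keep the record if the copy failed */
        if (unlink(src) == 0)   /* same effect as rm on the record */
            saved++;
    }
    closedir(d);
    return saved;
}
```

An init script equivalent might call harvest_records("/dev/pstore", ...) with some log directory of its choosing once the filesystem has been mounted.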

By default, pstore will register a dump handler with kmsg_dump to write the last 10K bytes of data from the kernel log to the pstore device when there is a kernel oops or panic. The amount of data to store can be configured at mount time using the kmsg_bytes parameter.
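Saving "the last N bytes" of a circular log amounts to copying a tail that may wrap around the end of the buffer. The sketch below illustrates that arithmetic in user space; it is not the kernel's dump handler, and log_tail() along with all of its parameters are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the newest `want` bytes of a circular log into out.
 *   ring, ring_size: the circular buffer and its capacity
 *   head:            index where the next byte would be written
 *   used:            number of valid bytes currently in the ring
 * The copy is limited to `used` and to outsize. Returns the bytes copied. */
size_t log_tail(const char *ring, size_t ring_size, size_t head, size_t used,
                size_t want, char *out, size_t outsize)
{
    assert(ring != NULL && out != NULL);
    assert(ring_size > 0 && head < ring_size && used <= ring_size);

    size_t n = want;
    if (n > used)
        n = used;
    if (n > outsize)
        n = outsize;

    /* The oldest byte to copy sits n bytes behind head, modulo the size. */
    size_t start = (head + ring_size - n) % ring_size;

    /* Copy up to the end of the ring, then wrap to the beginning. */
    size_t first = ring_size - start;
    if (first > n)
        first = n;
    memcpy(out, ring + start, first);
    memcpy(out + first, ring, n - first);
    return n;
}
```

In this picture, a byte limit such as kmsg_bytes would play the role of the want argument, while outsize corresponds to the room available in the backend's buffer.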

Luck has also put out an RFC patch to disable dumping information into pstore for some kinds of kmsg_dump reasons (e.g. KMSG_DUMP_HALT or KMSG_DUMP_RESTART), but various other developers weren't so sure. Seiji Aguchi pointed to two use cases (1, 2) he has found for needing to store the tail of the kernel log messages in most of those cases. In addition, Artem Bityutskiy pointed out that having pstore decide which kmsg_dump reasons to handle "smells like policy in the kernel". Adding more options to control that behavior is certainly possible, but Luck seems to be of a mind to wait a bit before making any change.

There are other persistent storage methods for kernel log messages, notably drivers/mtd/mtdoops.c and drivers/char/ramoops.c. But those are targeted at the embedded space where NVRAM devices are prevalent or at platforms where RAM can be reserved that will not be cleared on a restart. Pstore is more flexible, as it can store more than just kernel logs, while the two *oops devices are wired into storing the output of kmsg_dump.

Now that pstore has been merged, others will likely start using it. David Miller has already indicated that he will use it for sparc64, where a region of memory can be set aside to persist across reboots. One would guess that other architectures that have hardware support for similar mechanisms will as well.

Persistent storage for a kernel's "dying breath"

Posted Mar 24, 2011 13:39 UTC (Thu) by rwmj (subscriber, #5474) [Link] (3 responses)

On ordinary PCs, can RAM be reserved that is preserved across a reboot?

I'd love this feature, if it was able to tell me why some box silently rebooted overnight.

Rich.

Persistent storage for a kernel's "dying breath"

Posted Mar 24, 2011 13:57 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (1 responses)

Not reliably, no. The only persistent storage that's guaranteed on a BIOS-based PC is the real time clock, and there's not really enough space there. EFI actually helps here(!), although right now we don't implement the bits of the spec that cover this.

Persistent storage for a kernel's "dying breath"

Posted Jan 9, 2021 15:38 UTC (Sat) by TIRTH007 (guest, #144058) [Link]

Can we use this pstore utility on a normal system (an embedded device) that is not a server?

Persistent storage for a kernel's "dying breath"

Posted Apr 1, 2011 19:20 UTC (Fri) by cbf123 (guest, #74020) [Link]

If the vendor provided storage for ERST, then pstore would work on a normal PC.

If no hardware storage is available, the next best option is to use "kexec -p" to set up a recovery kernel that uses kexec to take over on a panic. The init scripts can then be modified to detect when they're running under the recovery kernel, and they can just dump the original kernel memory to disk and then reboot.

Persistent storage for a kernel's "dying breath"

Posted Mar 24, 2011 13:52 UTC (Thu) by ebirdie (guest, #512) [Link] (1 responses)

The article reminded me of a very old article about writing oopses to a block device, such as swap. Back then, the conclusion was that it isn't wise, since the block device might be in a state of mayhem and there is data at stake. Still, I find it appealing that there could be a block driver for pstore, leaving it to the sysadmin to choose whether to use it. The block driver could be configured to use USB storage or any other storage that is auxiliary to the main data storage. I guess the main point is that the sysadmin must be very aware of the need to place pstore behind some subsystem other than the system's main storage.

crashdump to write logs

Posted Mar 26, 2011 11:28 UTC (Sat) by Tobu (subscriber, #24111) [Link]

A while ago LWN explained a proposal of using kexec to debug crashes. One prepares a crashdump kernel in some reserved memory area, and when the system crashes, it kexecs into the new kernel which can then write back a big core dump of the crashed kernel. Here are the kdump docs.

It seems like both kernels could be extended to dump and read logs in some RAM area, which doesn't require hardware support and could be a fallback when there is no persistent area for pstore. Those logs can then be read without requiring kernel debug symbols or a kernel hacker to make sense of the kdump image.

Persistent storage for a kernel's "dying breath"

Posted May 19, 2011 14:37 UTC (Thu) by kayabek (guest, #40330) [Link] (1 responses)

This idea is not new. Macs have had such a device for ages.

Persistent storage for a kernel's "dying breath"

Posted May 19, 2011 14:46 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

Modern Macs implement it as an EFI variable. Unsurprisingly, their implementation is incompatible with the EFI spec's implementation of the concept.

ACPI links

Posted Jul 10, 2024 20:31 UTC (Wed) by naesten (subscriber, #71199) [Link]

The PDF link has gone stale. Either see the Wayback Machine for the ACPI 4.0a PDF or the HTML version of ACPI 6.5 (the latest at this time).