
How to cope with hardware-poisoned page-cache pages

By Jonathan Corbet
May 5, 2022

LSFMM
"Hardware poisoning" is a mechanism for detecting and handling memory errors in a running system. When a particular range of memory ceases to remember correctly, it is "poisoned" and further accesses to it will generate errors. The kernel has had support for hardware poisoning for over a decade, but that doesn't mean it can't be improved. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit, Yang Shi discussed the challenges of dealing with hardware poisoning when it affects memory used for the page cache.

The page cache, of course, holds copies of pages from files in secondary storage. A page-cache page that is generating errors will no longer accurately reflect the data that is (or should be) in the file, and thus should not be used. If that page has not been modified since having been read from the backing store, the solution is easy: discard it and read the data again into memory that actually works. If the page is dirty (having been written to by the CPU), though, the situation is harder to deal with. Currently, Shi said, the page is dropped from the page cache and any data that was in it is lost. Processes will not be notified unless they have the affected page mapped into their address space.
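In rough outline, that current behavior looks something like the sketch below. It borrows names that do exist in the kernel (mapping_set_error(), truncate_inode_folio(), and the error_remove_page() address-space operation) but greatly simplifies the real logic in mm/memory-failure.c, and it assumes the filesystem provides an error_remove_page() operation:

    /* Condensed sketch of how a poisoned page-cache page is handled
     * today; this is not the kernel's exact code. */
    static int handle_poisoned_pagecache(struct address_space *mapping,
                                         struct folio *folio)
    {
            if (!folio_test_dirty(folio)) {
                    /* Clean: drop the page; a later read simply
                     * repopulates the cache from the backing store. */
                    return mapping->a_ops->error_remove_page(mapping,
                                                             &folio->page);
            }
            /*
             * Dirty: the only up-to-date copy lives in the failing
             * memory.  Record a writeback error (visible only to later
             * fsync()/msync() callers) and drop the page anyway; this
             * is the silent data loss Shi described.
             */
            mapping_set_error(mapping, -EIO);
            truncate_inode_folio(mapping, folio);
            return 0;
    }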

This behavior, Shi said, leads to silent data loss. Subsequent accesses to the page will yield incorrect data, with no indication to the user that there is a problem. That leads to problems that can be difficult to debug.

To solve the problem, he continued, the kernel should keep the poisoned page in the page cache rather than dropping it. The filesystem that owns the page will need to be informed of the problem and must not try to write the page back to secondary store. Some operations, such as truncation or hole-punching, can be allowed to work normally since the end result will be correct. But if the page is accessed in other ways, an error must be returned.

There are a few ways in which this behavior could be implemented. One would be to check the hardware-poison flag on every path that accesses a page-cache page; that would require a lot of code changes. An alternative would be to return NULL when looking up the page in the cache. The advantage here is that callers already have to be able to handle NULL return values, so there should be few surprises — except that the error returned to user space will be ENOMEM, which may be surprising or misleading. Finally, page-cache lookups could, instead, return EIO, which better indicates the nature of the real problem. That would be much more invasive, though, since callers will not be prepared for that return status.
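The second alternative might look something like the following sketch; filemap_get_folio() and the poison-flag test are real kernel interfaces, but this wrapper itself is hypothetical:

    /* Hypothetical lookup that hides poisoned pages from callers. */
    static struct folio *filemap_get_folio_unpoisoned(struct address_space *mapping,
                                                      pgoff_t index)
    {
            struct folio *folio = filemap_get_folio(mapping, index);

            if (folio && folio_test_hwpoison(folio)) {
                    folio_put(folio);
                    /* The caller sees a cache miss; user space
                     * ultimately gets ENOMEM rather than the more
                     * accurate EIO. */
                    return NULL;
            }
            return folio;
    }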

Matthew Wilcox jumped in to say that only the first alternative was actually viable. Poisoning is tracked per page, but the higher-level interfaces are being converted to folios, which can contain multiple pages. The uncorrupted parts of a folio should still be accessible, so page-cache lookups need to still work. Dan Williams said that, in the DAX code (which implements direct access to files in persistent memory), the approach taken is to inform the filesystem of the error and still remove the page from the page cache. That makes it possible to return errors to the user, he said; this might also be a good approach for errors in regular memory as well.
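Since the poison flag is recorded per page, a folio-aware path would have to scan a folio's component pages to learn which parts are still good; a sketch using the existing folio helpers:

    /* Returns true if any page of the folio carries the poison flag;
     * the good pages around it should remain readable. */
    static bool folio_contains_hwpoison(struct folio *folio)
    {
            long i;

            for (i = 0; i < folio_nr_pages(folio); i++)
                    if (PageHWPoison(folio_page(folio, i)))
                            return true;
            return false;
    }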

Ted Ts'o expressed his agreement with Williams, saying that if the information about a corrupt page exists only in memory, a crash will erase any knowledge of the problem; that, too, leads to silent data corruption. The proposed solution does a lot of work, he said, to return EIO only until the next reboot happens. Asking the filesystem to maintain this information is more work, but may be the better approach in the end. One way to make it easier, he said, would be to not worry about tracking corrupted pages individually; instead, the file could just be marked as having been corrupted somewhere.
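A minimal sketch of that coarse-grained approach, using a hypothetical S_DATA_CORRUPT inode flag (no such flag exists in the kernel today) that the owning filesystem would persist to disk:

    #define S_DATA_CORRUPT  0x20000000      /* hypothetical inode flag */

    /* Called from the memory-failure notification path. */
    static void mark_file_corrupted(struct inode *inode)
    {
            inode->i_flags |= S_DATA_CORRUPT;
            mark_inode_dirty(inode);        /* persisted across a crash */
    }

    /* Checked on the read paths; the whole file then fails with EIO. */
    static int file_corrupted(struct inode *inode)
    {
            return (inode->i_flags & S_DATA_CORRUPT) ? -EIO : 0;
    }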

Shi argued that memory failures are not particularly rare in large data-center environments, and that any of his approaches would be better than doing nothing. Also, he said, users may well care about which page in a file has been damaged, so just marking the file as a whole may not be sufficient.

Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, "the slate is wiped clean" and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.

Josef Bacik said that storing corruption information made the most sense to him; the implementation could mostly go into the generic filesystem code. When notified of problems, the filesystem code should mark the affected pages, refuse to return data from them, and take care to avoid writing them to backing store. But he suggested that a per-file flag might suffice; developers — in both user and kernel space — are not good at dealing with error cases, so this mechanism should be kept simple, especially at the beginning. Developers can "try to be fancy" later if it seems warranted.

David Hildenbrand objected that a per-file flag could get in the way of virtual machines running from images stored in a single file. A single error would prevent the whole thing from being used, essentially killing the virtual machine. Tracking individual pages is probably better for that use case. But Bacik reiterated that the community was destined to make mistakes, so the simple case should be done first.

As time ran out, Wilcox pointed out that filesystems could handle the case of writing to a corrupted page — if the entire page is being overwritten. In that case, the damaged data is gone and the file is, once again, in the state that the user intended. Goldwyn Rodrigues pointed out, though, that the situation is not so simple for copy-on-write filesystems, which may still have the damaged pages sitting around. Bacik said this case is exactly why he opposes fancy solutions.
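Wilcox's full-overwrite case might be expressed as a check like the one below in a hypothetical write path; the helper name is made up, though folio_pos() and folio_size() are real:

    /* Permit writes to a poisoned folio only when they replace its
     * entire contents; a partial write would blend good bytes with
     * garbage. */
    static int may_write_poisoned_folio(struct folio *folio,
                                        loff_t pos, size_t len)
    {
            if (pos == folio_pos(folio) && len == folio_size(folio))
                    return 0;       /* full overwrite: the bad data is gone */
            return -EIO;
    }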

Index entries for this article
Kernel: Fault tolerance
Kernel: HWPOISON
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2022



How to cope with hardware-poisoned page-cache pages

Posted May 5, 2022 13:32 UTC (Thu) by jlayton (subscriber, #31672) (5 responses)

> If the page is dirty (having been written to by the CPU), though, the situation is harder to deal with. Currently, Shi said, the page is dropped from the page cache and any data that was in it is lost. Processes will not be notified unless they have the affected page mapped into their address space.

Does it not call mapping_set_error() at that point? It seems like it should.

How to cope with hardware-poisoned page-cache pages

Posted May 5, 2022 17:26 UTC (Thu) by willy (subscriber, #9762) (4 responses)

generic_error_remove_page() is where you want to look. I sent a patch, but it's clearly not the complete answer:

https://lore.kernel.org/lkml/20210318183350.GT3420@casper...

How to cope with hardware-poisoned page-cache pages

Posted May 5, 2022 17:55 UTC (Thu) by jlayton (subscriber, #31672) (3 responses)

Thanks. It looks like most of the callers that end up in that function do call mapping_set_error() first if the page was still dirty. I'm not sure of the exact scenario that would lead to silent data corruption, so I'd be interested to understand how that can occur.

ISTM that while we would lose the data on the page in these situations, it wouldn't be silent. You'd get an error back on the next fsync/msync. If there are any gaps in that coverage though, we should fix them.

How to cope with hardware-poisoned page-cache pages

Posted May 5, 2022 19:37 UTC (Thu) by yang.shi (subscriber, #133088) (2 responses)

It does set AS_EIO, the first fsync does return error, but the read will return old data from disk since the page is truncated. No error is returned on the read path. Write syscall also succeeds.

For example, a simple test: create a file and write to it, then inject a memory error into one dirty page, then reread the range. All of the written data is lost; you get old data back (zeroes in this simple test).
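That test can be reproduced from user space with madvise(MADV_HWPOISON), which injects a real poison event on the page backing the mapping; it requires CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE. A stripped-down sketch, with error checking omitted for brevity:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];                     /* assumes a 4KB page size */
            int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

            memset(buf, 'A', sizeof(buf));
            write(fd, buf, sizeof(buf));        /* dirty one page-cache page */

            char *map = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
            madvise(map, sizeof(buf), MADV_HWPOISON);   /* inject the error */

            pread(fd, buf, sizeof(buf), 0);     /* succeeds: returns zeroes */
            printf("byte 0 after poison: 0x%02x\n", buf[0]);

            if (fsync(fd) < 0)                  /* only now does EIO surface */
                    perror("fsync");
            return 0;
    }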

How to cope with hardware-poisoned page-cache pages

Posted May 5, 2022 19:54 UTC (Thu) by jlayton (subscriber, #31672) (1 response)

> It does set AS_EIO, the first fsync does return error, but the read will return old data from disk since the page is truncated. No error is returned on the read path. Write syscall also succeeds.

Which is expected behavior. Once you call fsync and get back an error, any data written since the last fsync is now suspect -- some writes may have succeeded and some may not.

It's up to the application to make sense of the state (unfortunately). That's not a trivial task, but that's really all we can guarantee in the face of this sort of problem.

How to cope with hardware-poisoned page-cache pages

Posted May 6, 2022 23:19 UTC (Fri) by yang.shi (subscriber, #133088)

> Which is expected behavior. Once you call fsync and get back an error, any data written since the last fsync is now suspect -- some writes may have succeeded and some may not.

IIUC that means the data on disk is suspect, not the data in the page cache, right? That doesn't bother readers; they still consume consistent data. A memory error is different: it means the data in the page cache itself is suspect, so waiting for fsync() may already be too late.

> It's up to the application to make sense of the state (unfortunately). That's not a trivial task, but that's really all we can guarantee in the face of this sort of problem.

I agree it is not a trivial task, particularly if we want to handle this in page granularity.

How to cope with hardware-poisoned page-cache pages

Posted May 6, 2022 5:51 UTC (Fri) by epa (subscriber, #39769) (1 response)

If a writeback page is poisoned, it shouldn’t be written to disk (that would corrupt the original file), but neither should it be thrown away. It should be saved somewhere so any useful data can be recovered. After all, if some writes to a file succeed and some don’t because of poisoning, the file will be corrupt anyway. If it was important, you might want to apply part or all of the poisoned changes to a scratch copy of the file and compare them manually.

I guess in data centres full of ‘cattle’ nobody is going to manually fix up a disk image that was corrupted but for user data it seems the right thing to do. Like the old lost+found directory made by fsck.

How to cope with hardware-poisoned page-cache pages

Posted May 6, 2022 8:56 UTC (Fri) by taladar (subscriber, #68407)

Maybe a user-space handler program similar to the coredump mechanism would be a good idea for this. Something that could compare the poison version of the page with the old file version and display the differences to the user for resolution.

How to cope with hardware-poisoned page-cache pages

Posted May 6, 2022 8:16 UTC (Fri) by LtWorf (subscriber, #124958) (3 responses)

> A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.

I don't understand, at this point hasn't the data loss already happened?

How to cope with hardware-poisoned page-cache pages

Posted May 6, 2022 9:21 UTC (Fri) by Wol (subscriber, #4433)

> I don't understand, at this point hasn't the data loss already happened?

Not necessarily :-) It's complicated ...

For example, I run raid-5 over dm-integrity over hardware. Any data loss in pages waiting to be flushed BELOW the dm-integrity level, I don't care (that much). If I ran raid-6, that would be any failure below the raid level.

Like you I agree that Wilcox's point sounds wrong - until you think about it ...

Cheers,
Wol

How to cope with hardware-poisoned page-cache pages

Posted May 6, 2022 15:11 UTC (Fri) by flussence (guest, #85566)

I can explain this one from some real-world experience: my system has some borderline RAM which runs fine 99.99999% of the time at its rated top speed under heavy load... which obviously isn't enough. Two times in the past year, the btrfs checksumming code noticed something wrong slightly too late and panic-remounted the rootfs ro.

By then of course, the damage is done and I have to hard reboot and fiddle around with recovery tools from a USB disk. Moving the crash as early as possible would still be data loss, but more importantly it'd prevent garbage data (or worse, fs structures) being written.

How to cope with hardware-poisoned page-cache pages

Posted May 8, 2022 19:05 UTC (Sun) by NYKevin (subscriber, #129325)

The behavior in the event of a crash is indistinguishable from the behavior where the RAM works correctly, but the page fails to get written out before the crash happens (i.e. the data is lost either way), so arguably this is no worse than the (unfixable) status quo. That's assuming, of course, that the application does not call fsync(). If the application does call fsync(), you can at least return EIO, but then the application is a bit screwed because it's unclear exactly what data got clobbered or what the application should do to recover from it. Also, not everybody remembers to check the error code from fsync().

How to cope with hardware-poisoned page-cache pages

Posted May 8, 2022 19:55 UTC (Sun) by NYKevin (subscriber, #129325)

> David Hildenbrand objected that a per-file flag could get in the way of virtual machines running from images stored in a single file. A single error would prevent the whole thing from being used, essentially killing the virtual machine. Tracking individual pages is probably better for that use case. But Bacik reiterated that the community was destined to make mistakes, so the simple case should be done first.

Sure, VMs are the userspace iteration of this problem, but what about disk images that are just mounted directly in the kernel? If I run mount /path/to/image/file /mnt/somewhere, and /path/to/image/file gets this corrupted flag set, what happens next? If I have errors=remount-ro, can I still read things under /mnt/somewhere, or does that get locked out as well? Or do you just unmount the whole thing outright, and let me accidentally create random files and directories on the parent filesystem?

> Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, "the slate is wiped clean" and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.

IMHO Overstreet is right here. With my SRE hat on, I would *much* rather the kernel just panic if it detects bad physical memory (perhaps setting some kind of flag or metadata in durable storage on the boot disk, so that we can detect that this happened after the fact). If memory is bad, we want to boot into Memtest (or something functionally equivalent to Memtest), find the bad stick, yank it, and redeploy the system. We don't want the system to try and limp on with bad RAM until it manages to corrupt something important. In the meantime, we're quite capable of migrating whatever was running on the system to another node, and our orchestration system will do that automatically when the host stops responding, so this doesn't even require a human to get involved.

Obviously, what is suitable for a large data center need not be suitable for every use case (the above would probably be a terrible idea in e.g. an Android handset), but it's important to remember that quite a few of us are really far on the cattle end of the spectrum. The system keeping itself alive in a broken or semi-broken state is not always desirable or helpful.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds