LWN: Comments on "How to cope with hardware-poisoned page-cache pages" https://lwn.net/Articles/893565/ This is a special feed containing comments posted to the individual LWN article titled "How to cope with hardware-poisoned page-cache pages". en-us Wed, 24 Sep 2025 16:27:47 +0000 Wed, 24 Sep 2025 16:27:47 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894234/ https://lwn.net/Articles/894234/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; David Hildenbrand objected that a per-file flag could get in the way of virtual machines running from images stored in a single file. A single error would prevent the whole thing from being used, essentially killing the virtual machine. Tracking individual pages is probably better for that use case. But Bacik reiterated that the community was destined to make mistakes, so the simple case should be done first.</font><br> <p> Sure, VMs are the userspace iteration of this problem, but what about disk images that are just mounted directly in the kernel? If I run mount /path/to/image/file /mnt/somewhere, and /path/to/image/file gets this corrupted flag set, what happens next? If I have errors=remount-ro, can I still read things under /mnt/somewhere, or does that get locked out as well? Or do you just unmount the whole thing outright, and let me accidentally create random files and directories on the parent filesystem?<br> <p> <font class="QuotedText">&gt; Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, &quot;the slate is wiped clean&quot; and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this &quot;best case&quot; still involves data loss.</font><br> <p> IMHO Overstreet is right here. 
With my SRE hat on, I would *much* rather the kernel just panic if it detects bad physical memory (perhaps setting some kind of flag or metadata in durable storage on the boot disk, so that we can detect that this happened after the fact). If memory is bad, we want to boot into Memtest (or something functionally equivalent to Memtest), find the bad stick, yank it, and redeploy the system. We don&#x27;t want the system to try and limp on with bad RAM until it manages to corrupt something important. In the meantime, we&#x27;re quite capable of migrating whatever was running on the system to another node, and our orchestration system will do that automatically when the host stops responding, so this doesn&#x27;t even require a human to get involved.<br> <p> Obviously, what is suitable for a large data center need not be suitable for every use case (the above would probably be a terrible idea in e.g. an Android handset), but it&#x27;s important to remember that quite a few of us are really far on the cattle end of the spectrum. The system keeping itself alive in a broken or semi-broken state is not always desirable or helpful.<br> </div> Sun, 08 May 2022 19:55:09 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894233/ https://lwn.net/Articles/894233/ NYKevin <div class="FormattedComment"> The behavior in the event of a crash is indistinguishable from the behavior where the RAM works correctly, but the page fails to get written out before the crash happens (i.e. the data is lost either way), so arguably this is no worse than the (unfixable) status quo. That&#x27;s assuming, of course, that the application does not call fsync(). If the application does call fsync(), you can at least return EIO, but then the application is a bit screwed because it&#x27;s unclear exactly what data got clobbered or what the application should do to recover from it. 
Also, not everybody remembers to check the error code from fsync().<br> </div> Sun, 08 May 2022 19:05:33 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894172/ https://lwn.net/Articles/894172/ yang.shi <div class="FormattedComment"> <font class="QuotedText">&gt; Which is expected behavior. Once you call fsync and get back an error, any data written since the last fsync is now suspect -- some writes may have succeeded and some may not.</font><br> <p> IIUC, that means the data on disk is suspect rather than the data in the page cache, right? That doesn&#x27;t bother readers, who still see consistent data. A memory error is different: it means even the data in the page cache is suspect, so waiting for fsync() may already be too late.<br> <p> <font class="QuotedText">&gt; It&#x27;s up to the application to make sense of the state (unfortunately). That&#x27;s not a trivial task, but that&#x27;s really all we can guarantee in the face of this sort of problem.</font><br> <p> I agree it is not a trivial task, particularly if we want to handle this at page granularity.<br> </div> Fri, 06 May 2022 23:19:01 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894122/ https://lwn.net/Articles/894122/ flussence <div class="FormattedComment"> I can explain this one from some real-world experience: my system has some borderline RAM which runs fine 99.99999% of the time at its rated top speed under heavy load... which obviously isn&#x27;t enough. Twice in the past year, the btrfs checksumming code noticed something wrong slightly too late and panic-remounted the rootfs ro.<br> <p> By then of course, the damage is done and I have to hard reboot and fiddle around with recovery tools from a USB disk.
Moving the crash as early as possible would still be data loss, but more importantly it&#x27;d prevent garbage data (or worse, fs structures) from being written.<br> </div> Fri, 06 May 2022 15:11:29 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894116/ https://lwn.net/Articles/894116/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; I don&#x27;t understand, at this point hasn&#x27;t the data loss already happened?</font><br> <p> Not necessarily :-) It&#x27;s complicated ...<br> <p> For example, I run raid-5 over dm-integrity over hardware. Any data loss in pages waiting to be flushed BELOW the dm-integrity level, I don&#x27;t care (that much). If I ran raid-6, that would be any failure below the raid level.<br> <p> Like you, I agree that Wilcox&#x27;s point sounds wrong - until you think about it ...<br> <p> Cheers,<br> Wol<br> </div> Fri, 06 May 2022 09:21:30 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894115/ https://lwn.net/Articles/894115/ taladar <div class="FormattedComment"> Maybe a user-space handler program similar to the coredump mechanism would be a good idea for this. Something that could compare the poisoned version of the page with the old file version and display the differences to the user for resolution.<br> </div> Fri, 06 May 2022 08:56:49 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894114/ https://lwn.net/Articles/894114/ LtWorf <div class="FormattedComment"> <font class="QuotedText">&gt; A crash, he said, might be seen as the best case. Wilcox replied that this &quot;best case&quot; still involves data loss.
</font><br> <p> I don&#x27;t understand, at this point hasn&#x27;t the data loss already happened?<br> </div> Fri, 06 May 2022 08:16:08 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894108/ https://lwn.net/Articles/894108/ epa <div class="FormattedComment"> If a write-back page is poisoned, it shouldn’t be written to disk, corrupting the original file, but neither should it be thrown away. It should be saved somewhere so any useful data can be recovered. After all if some writes to a file succeed and some don’t because of poisoning, the file will be corrupt anyway. If it was important, you might want to apply part or all of the changes that were poisoned to a scratch copy of the file and manually compare them. <br> <p> I guess in data centres full of ‘cattle’ nobody is going to manually fix up a disk image that was corrupted but for user data it seems the right thing to do. Like the old lost+found directory made by fsck. <br> </div> Fri, 06 May 2022 05:51:57 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894090/ https://lwn.net/Articles/894090/ jlayton <div class="FormattedComment"> <font class="QuotedText">&gt; It does set AS_EIO, the first fsync does return error, but the read will return old data from disk since the page is truncated. No error is returned on the read path. Write syscall also succeeds.</font><br> <p> Which is expected behavior. Once you call fsync and get back an error, any data written since the last fsync is now suspect -- some writes may have succeeded and some may not.<br> <p> It&#x27;s up to the application to make sense of the state (unfortunately). 
That&#x27;s not a trivial task, but that&#x27;s really all we can guarantee in the face of this sort of problem.<br> </div> Thu, 05 May 2022 19:54:08 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894089/ https://lwn.net/Articles/894089/ yang.shi <div class="FormattedComment"> It does set AS_EIO and the first fsync does return an error, but a read will return the old data from disk since the page is truncated. No error is returned on the read path, and the write syscall also succeeds.<br> <p> For example, a simple test: create a file and write to it, then inject a memory error into one dirty page, then reread the range. All the written data is lost and you will get the old data back (0 in this simple test).<br> </div> Thu, 05 May 2022 19:37:44 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894079/ https://lwn.net/Articles/894079/ jlayton <div class="FormattedComment"> Thanks. It looks like most of the callers that end up in that function do call mapping_set_error() first if the page was still dirty. I&#x27;m not sure of the exact scenario that would lead to silent data corruption, so I&#x27;d be interested to understand how that can occur.<br> <p> ISTM that while we would lose the data on the page in these situations, it wouldn&#x27;t be silent. You&#x27;d get an error back on the next fsync/msync. If there are any gaps in that coverage though, we should fix them.<br> </div> Thu, 05 May 2022 17:55:15 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894077/ https://lwn.net/Articles/894077/ willy <div class="FormattedComment"> generic_error_remove_page() is where you want to look.
I sent a patch but it&#x27;s clearly not the complete answer<br> <p> <a href="https://lore.kernel.org/lkml/20210318183350.GT3420@casper.infradead.org/">https://lore.kernel.org/lkml/20210318183350.GT3420@casper...</a><br> </div> Thu, 05 May 2022 17:26:27 +0000 How to cope with hardware-poisoned page-cache pages https://lwn.net/Articles/894041/ https://lwn.net/Articles/894041/ jlayton <div class="FormattedComment"> <font class="QuotedText">&gt; If the page is dirty (having been written to by the CPU), though, the situation is harder to deal with. Currently, Shi said, the page is dropped from the page cache and any data that was in it is lost. Processes will not be notified unless they have the affected page mapped into their address space. </font><br> <p> Does it not call mapping_set_error() at that point? It seems like it should.<br> </div> Thu, 05 May 2022 13:32:53 +0000