How to cope with hardware-poisoned page-cache pages

Posted May 8, 2022 19:55 UTC (Sun) by NYKevin (subscriber, #129325)
Parent article: How to cope with hardware-poisoned page-cache pages

> David Hildenbrand objected that a per-file flag could get in the way of virtual machines running from images stored in a single file. A single error would prevent the whole thing from being used, essentially killing the virtual machine. Tracking individual pages is probably better for that use case. But Bacik reiterated that the community was destined to make mistakes, so the simple case should be done first.

Sure, VMs are the userspace iteration of this problem, but what about disk images that are just mounted directly in the kernel? If I run mount /path/to/image/file /mnt/somewhere, and /path/to/image/file gets this corrupted flag set, what happens next? If I have errors=remount-ro, can I still read things under /mnt/somewhere, or does that get locked out as well? Or do you just unmount the whole thing outright, and let me accidentally create random files and directories on the parent filesystem?

> Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, "the slate is wiped clean" and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.

IMHO Overstreet is right here. With my SRE hat on, I would *much* rather the kernel just panic if it detects bad physical memory (perhaps setting some kind of flag or metadata in durable storage on the boot disk, so that we can detect that this happened after the fact). If memory is bad, we want to boot into Memtest (or something functionally equivalent to Memtest), find the bad stick, yank it, and redeploy the system. We don't want the system to try and limp on with bad RAM until it manages to corrupt something important. In the meantime, we're quite capable of migrating whatever was running on the system to another node, and our orchestration system will do that automatically when the host stops responding, so this doesn't even require a human to get involved.

Obviously, what is suitable for a large data center need not be suitable for every use case (the above would probably be a terrible idea in e.g. an Android handset), but it's important to remember that quite a few of us are really far on the cattle end of the spectrum. The system keeping itself alive in a broken or semi-broken state is not always desirable or helpful.