How to cope with hardware-poisoned page-cache pages
The page cache, of course, holds copies of pages from files in secondary storage. A page-cache page that is generating errors will no longer accurately reflect the data that is (or should be) in the file, and thus should not be used. If that page has not been modified since having been read from the backing store, the solution is easy: discard it and read the data again into memory that actually works. If the page is dirty (having been written to by the CPU), though, the situation is harder to deal with. Currently, Shi said, the page is dropped from the page cache and any data that was in it is lost. Processes will not be notified unless they have the affected page mapped into their address space.
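The decision described above can be sketched roughly as follows; the names and structure here are invented for illustration and are not the kernel's actual memory-failure code.

    enum poison_action { REREAD_FROM_DISK, DROP_AND_LOSE_DATA };

    struct cached_page {
        int dirty;    /* modified since it was read from the file */
    };

    /* A clean page can simply be discarded and read back from the file into
     * healthy memory; a dirty page holds the only copy of the new data, yet
     * current kernels drop it anyway, silently losing that data. */
    static enum poison_action handle_poisoned_page(const struct cached_page *p)
    {
        if (!p->dirty)
            return REREAD_FROM_DISK;
        return DROP_AND_LOSE_DATA;
    }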
This behavior, Shi said, leads to silent data loss. Subsequent accesses to the page will yield incorrect data, with no indication to the user that there is a problem. That leads to problems that can be difficult to debug.
To solve the problem, he continued, the kernel should keep the poisoned page in the page cache rather than dropping it. The filesystem that owns the page will need to be informed of the problem and must not try to write the page back to secondary store. Some operations, such as truncation or hole-punching, can be allowed to work normally since the end result will be correct. But if the page is accessed in other ways, an error must be returned.
There are a few ways in which this behavior could be implemented. One would be to check the hardware-poison flag on every path that accesses a page-cache page; that would require a lot of code changes. An alternative would be to return NULL when looking up the page in the cache. The advantage here is that callers already have to be able to handle NULL return values, so there should be few surprises — except that the error returned to user space will be ENOMEM, which may be surprising or misleading. Finally, page-cache lookups could, instead, return EIO, which better indicates the nature of the real problem. That would be much more invasive, though, since callers will not be prepared for that return status.
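As a rough illustration of the second and third alternatives, consider a lookup helper that hides a poisoned page either by returning NULL or by reporting an explicit EIO error. The types and names below are hypothetical stand-ins, not actual kernel interfaces.

    #include <errno.h>
    #include <stddef.h>

    struct cache_page {
        int hwpoison;    /* set when the hardware has reported this page as bad */
        void *data;
    };

    /* Alternative two: pretend the page is not cached at all.  Callers already
     * cope with NULL, but the error that eventually reaches user space looks
     * like an allocation failure (ENOMEM) rather than an I/O problem. */
    static struct cache_page *lookup_hide_poison(struct cache_page *page)
    {
        if (page && page->hwpoison)
            return NULL;
        return page;
    }

    /* Alternative three: report the real problem.  User space sees EIO, but
     * every caller of the lookup path would have to learn about the new error
     * return, which is far more invasive. */
    static struct cache_page *lookup_report_poison(struct cache_page *page, int *err)
    {
        *err = 0;
        if (page && page->hwpoison) {
            *err = -EIO;
            return NULL;
        }
        return page;
    }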
Matthew Wilcox jumped in to say that only the first alternative was actually viable. Poisoning is tracked per page, but the higher-level interfaces are being converted to folios, which can contain multiple pages. The uncorrupted parts of a folio should still be accessible, so page-cache lookups need to still work. Dan Williams said that, in the DAX code (which implements direct access to files in persistent memory), the approach taken is to inform the filesystem of the error and still remove the page from the page cache. That makes it possible to return errors to the user, he said; this might be a good approach for errors in regular memory as well.
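A toy version of Wilcox's point might look like the following; the structures are simplified stand-ins for the real folio machinery, which records poison per page.

    #include <stdbool.h>
    #include <stddef.h>

    #define TOY_PAGE_SIZE 4096UL

    struct toy_folio {
        size_t nr_pages;
        bool poisoned[16];    /* one flag per constituent page */
    };

    /* Only accesses that touch the poisoned page should fail; the rest of the
     * folio still holds good data and must remain reachable, so the lookup
     * cannot simply return NULL (or EIO) for the whole folio. */
    static bool toy_folio_offset_ok(const struct toy_folio *folio, size_t offset)
    {
        size_t idx = offset / TOY_PAGE_SIZE;

        if (idx >= folio->nr_pages)
            return false;
        return !folio->poisoned[idx];
    }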
Ted Ts'o expressed his agreement with Williams, saying that if the information about a corrupt page exists only in memory, a crash will erase any knowledge of the problem; that, too, leads to silent data corruption. The proposed solution does a lot of work, he said, to return EIO only until the next reboot happens. Asking the filesystem to maintain this information is more work, but may be the better approach in the end. One way to make it easier, he said, would be to not worry about tracking corrupted pages individually; instead, the file could just be marked as having been corrupted somewhere.
Shi argued that memory failures are not particularly rare in large data-center environments, and that any of his approaches would be better than doing nothing. Also, he said, users may well care about which page in a file has been damaged, so just marking the file as a whole may not be sufficient.
Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, "the slate is wiped clean" and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.
Josef Bacik said that storing corruption information made the most sense to him; the implementation could mostly go into the generic filesystem code. When notified of problems, the filesystem code should mark the affected pages, refuse to return data from them, and take care to avoid writing them to backing store. But he suggested that a per-file flag might suffice; developers — in both user and kernel space — are not good at dealing with error cases, so this mechanism should be kept simple, especially at the beginning. Developers can "try to be fancy" later if it seems warranted.
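A minimal sketch of that "keep it simple" approach, using hypothetical structures rather than any existing kernel interface, could look like this:

    #include <errno.h>
    #include <stdbool.h>

    struct toy_inode {
        bool hw_corrupted;    /* set once any cached page of this file is poisoned */
    };

    /* Hypothetical hook called when the memory-failure code reports a poisoned
     * page belonging to this file. */
    static void toy_note_poison(struct toy_inode *inode)
    {
        inode->hw_corrupted = true;
    }

    /* Reads of a corrupted file fail with EIO rather than returning data that
     * may be garbage. */
    static int toy_check_read(const struct toy_inode *inode)
    {
        return inode->hw_corrupted ? -EIO : 0;
    }

    /* Writeback must skip the file entirely so that garbage never reaches the
     * backing store. */
    static bool toy_may_writeback(const struct toy_inode *inode)
    {
        return !inode->hw_corrupted;
    }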
David Hildenbrand objected that a per-file flag could get in the way of virtual machines running from images stored in a single file. A single error would prevent the whole thing from being used, essentially killing the virtual machine. Tracking individual pages is probably better for that use case. But Bacik reiterated that the community was destined to make mistakes, so the simple case should be done first.
As time ran out, Wilcox pointed out that filesystems could handle the case of writing to a corrupted page — if the entire page is being overwritten. In that case, the damaged data is gone and the file is, once again, in the state that the user intended. Goldwyn Rodrigues pointed out, though, that the situation is not so simple for copy-on-write filesystems, which may still have the damaged pages sitting around. Bacik said this case is exactly why he opposes fancy solutions.
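That closing exchange might be sketched as follows; the types are hypothetical, and a real implementation would also have to allocate a fresh page, since the poisoned physical page itself cannot be reused.

    #include <stdbool.h>
    #include <stddef.h>

    #define TOY_PAGE_SIZE 4096UL

    struct toy_page {
        bool poisoned;
    };

    /* A write covering the whole page replaces every byte of the damaged data,
     * so the logical contents are intact again and the poison marking can be
     * cleared.  A partial write would still mix new bytes with garbage, so the
     * marking must stay and the access should fail.  Copy-on-write filesystems
     * complicate this, since older copies of the damaged page may persist. */
    static bool toy_write_clears_poison(struct toy_page *page, size_t pos, size_t len)
    {
        if (pos == 0 && len == TOY_PAGE_SIZE) {
            page->poisoned = false;
            return true;
        }
        return false;
    }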
Index entries for this article:
    Kernel: Fault tolerance
    Kernel: HWPOISON
    Conference: Storage, Filesystem, Memory-Management and BPF Summit/2022
Posted May 5, 2022 13:32 UTC (Thu) by jlayton (subscriber, #31672)
Does it not call mapping_set_error() at that point? It seems like it should.
Posted May 5, 2022 17:26 UTC (Thu) by willy (subscriber, #9762)
https://lore.kernel.org/lkml/20210318183350.GT3420@casper...
Posted May 5, 2022 17:55 UTC (Thu) by jlayton (subscriber, #31672)
ISTM that while we would lose the data on the page in these situations, it wouldn't be silent. You'd get an error back on the next fsync/msync. If there are any gaps in that coverage though, we should fix them.
Posted May 5, 2022 19:37 UTC (Thu) by yang.shi (subscriber, #133088)
For example, a simple test: create a file and write to it, then inject a memory error into one page that is dirty, then reread the range. All of the written data is lost; you get the old data back (0 in this simple test).
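Such a test might be sketched roughly as below. It assumes MADV_HWPOISON error injection (which requires CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN), uses a made-up file path, and omits error handling; depending on the kernel, the reread may return stale data or fail with EIO.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/tmp/hwpoison-test", O_RDWR | O_CREAT | O_TRUNC, 0600);

        memset(buf, 'A', sizeof(buf));
        write(fd, buf, sizeof(buf));               /* dirty one page-cache page */

        void *map = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
        madvise(map, sizeof(buf), MADV_HWPOISON);  /* inject the memory error */

        memset(buf, 0, sizeof(buf));
        pread(fd, buf, sizeof(buf), 0);            /* reread the affected range */
        printf("first byte after injection: 0x%02x\n", (unsigned char)buf[0]);

        close(fd);
        return 0;
    }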
Posted May 5, 2022 19:54 UTC (Thu) by jlayton (subscriber, #31672)
Which is expected behavior. Once you call fsync and get back an error, any data written since the last fsync is now suspect -- some writes may have succeeded and some may not.
It's up to the application to make sense of the state (unfortunately). That's not a trivial task, but that's really all we can guarantee in the face of this sort of problem.
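A minimal illustration of that contract, as an ordinary userspace sketch not tied to the hwpoison case specifically:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/fsync-example", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        const char *rec = "important record\n";

        if (write(fd, rec, strlen(rec)) < 0)
            perror("write");

        if (fsync(fd) < 0) {
            /* An error here (EIO, for instance) means that some unknown subset
             * of the writes since the last successful fsync() may not be on
             * stable storage; the application must rebuild that data from its
             * own copy rather than trusting the file. */
            perror("fsync");
        }

        close(fd);
        return 0;
    }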
Posted May 6, 2022 23:19 UTC (Fri) by yang.shi (subscriber, #133088)
IIUC that means the data on disk is suspect, rather than the data in the page cache, right? That doesn't bother readers; they still consume consistent data. A memory error is different: it means the data in the page cache itself is suspect, so waiting for fsync() may already be too late.
> It's up to the application to make sense of the state (unfortunately). That's not a trivial task, but that's really all we can guarantee in the face of this sort of problem.
I agree it is not a trivial task, particularly if we want to handle this at page granularity.
Posted May 6, 2022 5:51 UTC (Fri) by epa (subscriber, #39769)
I guess in data centres full of ‘cattle’ nobody is going to manually fix up a disk image that was corrupted, but for user data it seems the right thing to do. Like the old lost+found directory made by fsck.
Posted May 6, 2022 8:56 UTC (Fri) by taladar (subscriber, #68407)
Posted May 6, 2022 8:16 UTC (Fri) by LtWorf (subscriber, #124958)
I don't understand, at this point hasn't the data loss already happened?
Posted May 6, 2022 9:21 UTC (Fri) by Wol (subscriber, #4433)
Not necessarily :-) It's complicated ...
For example, I run raid-5 over dm-integrity over hardware. Any data loss in pages waiting to be flushed BELOW the dm-integrity level, I don't care (that much). If I ran raid-6, that would be any failure below the raid level.
Like you, I agree that Wilcox's point sounds wrong - until you think about it ...
Cheers,
Wol
Posted May 6, 2022 15:11 UTC (Fri) by flussence (guest, #85566)
By then of course, the damage is done and I have to hard reboot and fiddle around with recovery tools from a USB disk. Moving the crash as early as possible would still be data loss, but more importantly it'd prevent garbage data (or worse, fs structures) being written.
Posted May 8, 2022 19:05 UTC (Sun) by NYKevin (subscriber, #129325)
Posted May 8, 2022 19:55 UTC (Sun) by NYKevin (subscriber, #129325)
Sure, VMs are the userspace iteration of this problem, but what about disk images that are just mounted directly in the kernel? If I run mount /path/to/image/file /mnt/somewhere, and /path/to/image/file gets this corrupted flag set, what happens next? If I have errors=remount-ro, can I still read things under /mnt/somewhere, or does that get locked out as well? Or do you just unmount the whole thing outright, and let me accidentally create random files and directories on the parent filesystem?
> Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, "the slate is wiped clean" and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.
IMHO Overstreet is right here. With my SRE hat on, I would *much* rather the kernel just panic if it detects bad physical memory (perhaps setting some kind of flag or metadata in durable storage on the boot disk, so that we can detect that this happened after the fact). If memory is bad, we want to boot into Memtest (or something functionally equivalent to Memtest), find the bad stick, yank it, and redeploy the system. We don't want the system to try and limp on with bad RAM until it manages to corrupt something important. In the meantime, we're quite capable of migrating whatever was running on the system to another node, and our orchestration system will do that automatically when the host stops responding, so this doesn't even require a human to get involved.
Obviously, what is suitable for a large data center need not be suitable for every use case (the above would probably be a terrible idea in e.g. an Android handset), but it's important to remember that quite a few of us are really far on the cattle end of the spectrum. The system keeping itself alive in a broken or semi-broken state is not always desirable or helpful.