HWPOISON
Posted Aug 27, 2009 22:35 UTC (Thu) by giraffedata (guest, #1954)
Parent article: HWPOISON
First, hardware detects an uncorrectable error from memory transfers into the system cache or on the system bus....
Later, when erroneous data is read by executing software, a machine check is initiated.
...
First, the offending instruction and process cannot be determined due to delays between the data error consumption and execution of the poison handler. These delays include asynchronous hardware reporting of the machine check event,
How can a machine check for accessing erroneous memory contents be asynchronous? An instruction to load some data from memory didn't get the data because it's been destroyed. How can the CPU continue executing and generate a machine check at some arbitrarily later time?
Posted Aug 28, 2009 4:31 UTC (Fri)
by roelofs (guest, #2599)
[Link] (15 responses)
Er, maybe I'm missing the thrust of your question, but I thought it was sort of straightforward: the hardware detects the problem as soon as memory is read--imagine a bad bit in a single byte out of a page or a cacheline's worth read--but the specific bad subset of that memory (the byte) may not be used until much later, or not at all.
Or are you asking about something much more subtle?
Greg
Posted Aug 28, 2009 7:10 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (14 responses)
Yes, that's the scenario in the sentences I excerpted from the article.
And they go on to say that the poison handler runs some time after the time that the specific bad subset is used. It refers to the specific bad subset being used as "data error consumption" and the instruction that uses it as the "offending instruction" and says you can't simply locate the offending instruction and thereby the memory location and the process that are affected by the bad memory, because of the delay.
Maybe the article is confusing multiple scenarios. I can definitely see a design where the machine check happens, and the OS deals with it, before the data error is consumed. But that's not the case the article describes.
Posted Aug 31, 2009 6:36 UTC (Mon)
by jzbiciak (guest, #5246)
[Link] (13 responses)
There are a couple things at play here:
Posted Aug 31, 2009 6:41 UTC (Mon)
by jzbiciak (guest, #5246)
[Link] (12 responses)
Posted Aug 31, 2009 16:02 UTC (Mon)
by dlang (guest, #313)
[Link] (11 responses)
the key to HWPOISON is that not all memory locations contain irreplaceable data. in some cases the memory may not be allocated (so when the program goes to use it, whatever contents are there are going to be erased anyway), in other cases the data exists elsewhere (clean disk buffer pages that can be re-read from disk, etc)
so instead of erroring out when memory corruption is detected, it only throws an error if something tries to make use of the corrupt data, and even then it throws an error that the OS can catch and deal with (since only the OS knows if the data can be replaced by something read from somewhere else)
Posted Aug 31, 2009 18:50 UTC (Mon)
by jzbiciak (guest, #5246)
[Link] (10 responses)
Background scrubbing works by reading memory locations, checking the ECC, and correcting correctable errors proactively before they become uncorrectable. If background scrubbing detects something uncorrectable, it can (and it seems like it ought to) signal a machine check.
Take a look here:
http://patchwork.kernel.org/patch/16897/
There is a notion of an "action optional" machine check. It's still a machine check, and it can be triggered by scrubbing. Quoting:
This code snippet on the linked page illustrates some of the "action optional" machine check exceptions:
Posted Aug 31, 2009 21:06 UTC (Mon)
by dlang (guest, #313)
[Link] (9 responses)
however, the win here is to not generate a machine check when corrupted memory is detected, but instead wait to see if it matters.
if a memory location is corrupted, but then written before it's read from, the fact that the memory location was corrupt doesn't matter, nothing ever tried to use the corrupted data.
this can be done in hardware, transparent to the OS. it will make systems less likely to crash at the cost of a little more record keeping in the hardware.
if a memory location is corrupted, but it happens to be in a page that is a clean cache, the OS can respond to the error by throwing away the cached page and retrieving a copy from disk.
since in modern systems a _large_ percentage of memory ends up being occupied by caches, making it so that errors in that memory just cause a momentary slowdown (read to the disk) instead of a system crash is also a significant win.
and finally, if both of the above fail (so the memory contents are irreplaceable) the OS can detect what program it was running on that CPU at the time the read took place, and kill just that program (and log that the program was killed due to hardware memory errors, not an application bug) rather than killing the entire system.
none of these protections guarantee that the system won't crash when cosmic rays hit the ram, but each of these steps makes it less likely to crash.
given common use cases, I wouldn't be surprised to find that these sorts of strategies make systems one or two orders of magnitude less likely to crash as a result of memory errors (although the gains in application reliability will not be as large, because some of the gain comes from killing applications instead of the entire system).
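The escalation described in this comment can be written down as a small decision function. This is only a sketch of the policy as stated here, not the kernel's code; the enum names and `choose_action` are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical page states for illustration; these are not kernel types. */
enum page_state {
    PAGE_FREE,          /* not allocated: contents get overwritten anyway */
    PAGE_CLEAN_CACHE,   /* an identical copy exists on disk */
    PAGE_IRREPLACEABLE  /* the only copy of the data lives in RAM */
};

/* Hypothetical responses, in increasing order of severity. */
enum recovery_action {
    ACTION_IGNORE,          /* nothing is lost; just stop using the page */
    ACTION_DROP_AND_REREAD, /* discard the page, re-read it from disk later */
    ACTION_DEFER,           /* bad data not consumed yet: wait and see */
    ACTION_KILL_PROCESS     /* bad data consumed: kill only that process */
};

/* consumed: true if a running program actually read the corrupt data. */
enum recovery_action choose_action(enum page_state state, bool consumed)
{
    switch (state) {
    case PAGE_FREE:
        return ACTION_IGNORE;
    case PAGE_CLEAN_CACHE:
        return ACTION_DROP_AND_REREAD;
    case PAGE_IRREPLACEABLE:
        return consumed ? ACTION_KILL_PROCESS : ACTION_DEFER;
    }
    /* Unknown state: fall back to the most conservative response. */
    return ACTION_KILL_PROCESS;
}
```

For example, `choose_action(PAGE_CLEAN_CACHE, true)` yields `ACTION_DROP_AND_REREAD`: the cost is one extra disk read rather than a crash.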
Posted Aug 31, 2009 23:34 UTC (Mon)
by jzbiciak (guest, #5246)
[Link]
You seem to be assuming "machine check" means "machine halt." It's just the name of the exception vector.
Posted Aug 31, 2009 23:40 UTC (Mon)
by jzbiciak (guest, #5246)
[Link] (7 responses)
I'll quote Andi Kleen's post (that I linked above) since I think it's abundantly clear:
Read that again: Background scrubbing gives a machine check. The machine check is action optional and it can do just as you suggest. It's still a machine check.
Posted Sep 1, 2009 17:59 UTC (Tue)
by dlang (guest, #313)
[Link] (6 responses)
instead of the background scrub triggering a machine check at that point in time it instead just marks the memory as corrupt (poisoned). the poisoned flag gets cleared if the memory is written to.
if something then tries to read the poisoned memory, a machine check happens at that point in time. if nothing ever reads it, no machine check happens at all.
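A toy sketch of this scheme, taking the intent to be that a read of still-poisoned memory is what raises the machine check. Everything here (`toy_memory`, `toy_read`, and so on) is invented for illustration; it models the idea in this comment, not any real hardware interface.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_WORDS 8

/* Toy memory: each word carries a "poisoned" flag next to its data. */
struct toy_memory {
    int  data[TOY_WORDS];
    bool poisoned[TOY_WORDS];
};

/* The scrubber found an uncorrectable error: mark the word, raise nothing. */
void toy_scrub_found_error(struct toy_memory *m, size_t i)
{
    m->poisoned[i] = true;
}

/* A write replaces the bad contents, so the poison flag is cleared. */
void toy_write(struct toy_memory *m, size_t i, int value)
{
    m->data[i] = value;
    m->poisoned[i] = false;
}

/* Returns false (standing in for a machine check) if the word is poisoned;
 * otherwise stores the value in *out and returns true. */
bool toy_read(const struct toy_memory *m, size_t i, int *out)
{
    if (m->poisoned[i])
        return false;
    *out = m->data[i];
    return true;
}
```

In this model a word that is poisoned and then overwritten never produces an error, which is the "wait to see if it matters" behavior.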
Posted Sep 1, 2009 20:24 UTC (Tue)
by jzbiciak (guest, #5246)
[Link] (2 responses)
That's not how I read this. See section 15.6, "Recovery of Uncorrected Recoverable Errors" and especially 15.6.3, "UCR Error Classification".
The first two error types are the "an error was detected, but the CPU hasn't consumed the errant data yet" error types. If you want to pick nits, the first one (UCNA) is not reported as a Machine Check Exception; rather it is reported as a Corrected Machine Check Error Interrupt (described in Section 15.5). My bad for being sloppy; it is a Machine Check Error, but it isn't a Machine Check Exception. The second recoverable error type (SRAO) is a Machine Check Exception, however. In any case, both are machine checks.
Now flip with me to page 15-34 and look at what SRAO errors are architecturally defined, there in section 15.9.3.1:
So there we have it. Recoverable, Action Optional Machine Checks due to scrubbing. Can it be any clearer?
In case you think this feature is old and was supplanted by something more recent, I urge you to flip back to 15-23 and read along here at the intro to Section 15.6:
If I'm not mistaken, that's the processor family this article was referring to. (This document is dated June 2009, so it's not like it's ancient.) Do you have different documentation that suggests otherwise?
Posted Sep 1, 2009 21:28 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
is HWPOISON a hardware-level feature or an OS-level feature?
if it's a hardware level feature (which is what I understood from the original article) then it wouldn't necessarily cause a machine check error ever.
if this is instead a difference in how the OS responds to a memory error I just completely misunderstood what's happening.
Posted Sep 1, 2009 23:59 UTC (Tue)
by jzbiciak (guest, #5246)
[Link]
A machine check error (whether delivered as an exception or an interrupt--the new MCA does both depending on the error type) is a message from the hardware to the software. In the most recent Intel architectures, they support a notion of "recoverable machine check," wherein the hardware tells the OS that no CPU state was corrupted when it noticed the problem. If you look at that PDF I linked, there are a number of status bits (including AR--Action Required) that indicate the severity of the error. There's a separate table in Intel's PDF that suggests the possible OS responses to a particular error.
Once the hardware delivers the message to the OS (via a machine check), the OS is then free to deal with the machine check however it pleases. For "Action Optional" machine checks that can happen asynchronously to program execution (such as due to scrubbing), the OS can queue up a handler to go deal with the affected page, either by poisoning it or unmapping it or what-have-you. That's the stuff Andi Kleen and co.'s patch does.
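To make the status-bit idea concrete, here is a rough sketch of how software might sort an error into the classes discussed in this thread (UCNA, SRAO, SRAR). The bit positions are the ones I understand the Intel manual to give for IA32_MCi_STATUS; the enum and `classify_status` are made up for illustration, and the sketch assumes the processor supports software error recovery so that the S and AR bits are meaningful.

```c
#include <stdint.h>

/* IA32_MCi_STATUS bits (defined locally; positions as I read the Intel manual). */
#define MCI_STATUS_VAL (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupted */
#define MCI_STATUS_S   (1ULL << 56)  /* signaled via machine check exception */
#define MCI_STATUS_AR  (1ULL << 55)  /* action required */

/* Hypothetical classification labels for this sketch. */
enum mce_class {
    MCE_NONE,       /* nothing logged */
    MCE_CORRECTED,  /* hardware already fixed it */
    MCE_UCNA,       /* uncorrected, no action required */
    MCE_SRAO,       /* software recoverable, action optional */
    MCE_SRAR,       /* software recoverable, action required */
    MCE_FATAL       /* CPU state corrupted; no recovery possible */
};

enum mce_class classify_status(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MCE_NONE;
    if (!(status & MCI_STATUS_UC))
        return MCE_CORRECTED;
    if (status & MCI_STATUS_PCC)
        return MCE_FATAL;
    if (!(status & MCI_STATUS_S))
        return MCE_UCNA;  /* reported without a machine check exception */
    return (status & MCI_STATUS_AR) ? MCE_SRAR : MCE_SRAO;
}
```

A scrubbing error would land in `MCE_SRAO`: the OS gets an exception, but no running instruction has consumed the bad data, so the handler can be queued rather than run immediately.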
Posted Sep 1, 2009 20:26 UTC (Tue)
by jzbiciak (guest, #5246)
[Link] (2 responses)
I guess what you're missing is who marks the memory as poisoned. The CPU sends a machine check to the OS. The OS marks the memory as poisoned, or otherwise discards the contents of the page if it was clean. The HWPOISON patch provides the OS handler and hooks to poison the page (or do whatever needs doing) when the machine check arrives.
Posted Sep 1, 2009 21:31 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
Posted Sep 1, 2009 23:47 UTC (Tue)
by jzbiciak (guest, #5246)
[Link]
Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about using a machine check
exception. The OS can then take appropriate action, like killing the
process with the corrupted data or logging the event properly to disk.
+
+ /* known AO MCACODs: handle by calling high level handler */
+ MASK(MCI_UC_SAR|0xfff0, MCI_UC_S|0xc0, AO,
+ "Action optional: memory scrubbing error", SER),
+ MASK(MCI_UC_SAR|MCACOD, MCI_UC_S|0x17a, AO,
+ "Action optional: last level cache writeback error", SER),
+
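The two table entries above read like mask/result pairs: an entry matches when the status word, ANDed with the first argument, equals the second. Assuming that reading, and assuming MCACOD is the low 16 bits of the status word, a stand-alone sketch of the matching could look like this. The `EX_*` names, `struct ao_rule`, and `match_ao_rule` are invented here and are not the patch's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Status bits, defined locally for this sketch. */
#define EX_UC     (1ULL << 61)  /* uncorrected */
#define EX_S      (1ULL << 56)  /* signaled */
#define EX_AR     (1ULL << 55)  /* action required */
#define EX_UC_S   (EX_UC | EX_S)
#define EX_UC_SAR (EX_UC | EX_S | EX_AR)
#define EX_MCACOD 0xffffULL     /* assumed: error code in the low 16 bits */

struct ao_rule {
    uint64_t mask;    /* which status bits to look at */
    uint64_t result;  /* value those bits must have */
    const char *msg;
};

static const struct ao_rule ao_rules[] = {
    /* UC and S set, AR clear, error code 0x00c0..0x00cf */
    { EX_UC_SAR | 0xfff0, EX_UC_S | 0x00c0,
      "Action optional: memory scrubbing error" },
    /* UC and S set, AR clear, error code exactly 0x017a */
    { EX_UC_SAR | EX_MCACOD, EX_UC_S | 0x017a,
      "Action optional: last level cache writeback error" },
};

/* Returns the index of the first matching rule, or -1 if none matches. */
int match_ao_rule(uint64_t status)
{
    for (size_t i = 0; i < sizeof(ao_rules) / sizeof(ao_rules[0]); i++) {
        if ((status & ao_rules[i].mask) == ao_rules[i].result)
            return (int)i;
    }
    return -1;
}
```

Note that including AR in the mask but not in the result is what makes these "action optional": the same error code with AR set does not match either entry.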
Newer Intel CPUs support a new class of machine checks called recoverable action optional.
Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about using a machine check
exception. The OS can then take appropriate action, like killing the
process with the corrupted data or logging the event properly to disk.
The following two SRAO errors are architecturally defined.
Recovery of uncorrected recoverable machine check errors is an enhancement in machine-check architecture. The first processor that supports this feature is 45nm Intel 64 processor with CPUID signature DisplayFamily_DisplayModel encoding of 06H_2EH. This allow system software to perform recovery action on certain class of uncorrected errors and continue