instead of the background scrub triggering a machine check at that point in time, it just marks the memory as corrupt (poisoned). the poisoned flag gets cleared if the memory is written to.
if anything ever tries to read the poisoned memory, a machine check happens at that point in time.
Posted Sep 1, 2009 20:24 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
That's not how I read this. See section 15.6, "Recovery of Uncorrected Recoverable Errors" and especially 15.6.3, "UCR Error Classification".
The first two error types are the "an error was detected, but the CPU hasn't consumed the errant data yet" error types. If you want to pick nits, the first one (UCNA) is not reported as a Machine Check Exception; rather it is reported as a Corrected Machine Check Error Interrupt (described in Section 15.5). My bad for being sloppy; it is a Machine Check Error, but it isn't a Machine Check Exception. The second recoverable error type (SRAO) is a Machine Check Exception, however.
In any case, both are machine checks.
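To make the classification concrete, here is a rough sketch of how the three recoverable types fall out of the status bits. The bit positions are my reading of the IA32_MCi_STATUS layout in the SDM and should be treated as assumptions to check against the manual; the function name and the returned labels are made up for illustration.

```python
# Bit positions in IA32_MCi_STATUS as I understand them from the SDM.
# These are assumptions: verify against the manual before relying on them.
VAL = 1 << 63  # register holds a valid error
UC = 1 << 61   # error was not corrected by hardware
EN = 1 << 60   # error reporting was enabled
PCC = 1 << 57  # processor context corrupt
S = 1 << 56    # signaled: delivered as a Machine Check Exception
AR = 1 << 55   # action required before execution can continue


def classify(status):
    """Return a short label for one MCi_STATUS value (hypothetical helper)."""
    if not status & VAL:
        return "none"
    if not status & UC:
        return "corrected"
    if status & PCC:
        return "fatal"
    # From here on: uncorrected, but CPU context is intact (a UCR error).
    if not status & S:
        return "UCNA"  # reported via the corrected-error interrupt, not an exception
    if status & AR:
        return "SRAR"  # software must recover before continuing
    return "SRAO"      # e.g. an error found by the memory scrubber
```

With these definitions, `classify(VAL | UC | EN | S)` lands in the SRAO bucket, which is the scrubbing case discussed here.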
Now flip with me to page 15-34 and look at which SRAO errors are architecturally defined, there in section 18.104.22.168:
The following two SRAO errors are architecturally defined.
UCR Errors detected by memory controller scrubbing; and
UCR Errors detected during L3 cache (L3) explicit writebacks.
So there we have it. Recoverable, Action Optional Machine Checks due to scrubbing. Can it be any clearer? In case you think this feature is old and was supplanted by something more recent, I urge you to flip back to 15-23 and read along here at the intro to Section 15.6:
Recovery of uncorrected recoverable machine check errors is an enhancement in machine-check architecture. The first processor that supports this feature is 45nm Intel 64 processor with CPUID signature DisplayFamily_DisplayModel encoding of 06H_2EH. This allow system software to perform recovery action on certain class of uncorrected errors and
If I'm not mistaken, that's the processor family this article was referring to. (This document is dated June 2009, so it's not like it's ancient.)
Do you have different documentation that suggests otherwise?
Posted Sep 1, 2009 21:28 UTC (Tue) by dlang (✭ supporter ✭, #313)
is HWPOISON a hardware-level feature or an OS-level feature?
if it's a hardware level feature (which is what I understood from the original article) then it wouldn't necessarily cause a machine check error ever.
if this is instead a difference in how the OS responds to a memory error I just completely misunderstood what's happening.
Posted Sep 1, 2009 23:59 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
A machine check error (whether delivered as an exception or an interrupt--the new MCA does both depending on the error type) is a message from the hardware to the software. In the most recent Intel architectures, they support a notion of "recoverable machine check," wherein the hardware tells the OS that no CPU state was corrupted when it noticed the problem. If you look at that PDF I linked, there are a number of status bits (including AR--Action Required) that indicate the severity of the error. There's a separate table in Intel's PDF that suggests the possible OS responses to a particular error.
Once the hardware delivers the message to the OS (via a machine check), the OS is then free to deal with the machine check however it pleases. For "Action Optional" machine checks that can happen asynchronously to program execution (such as due to scrubbing), the OS can queue up a handler to go deal with the affected page, either by poisoning it or unmapping it or what-have-you. That's the stuff Andi Kleen and co.'s patch does.
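As a toy model of that "queue it up and deal with it later" flow (not the actual kernel code; every name below is invented for illustration), the exception path only records the affected page, and a later pass decides whether to poison it or simply drop it. The assumption is that a clean page can be re-read from its backing store, while a dirty page holds the only copy of the data.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    pfn: int                # page frame number
    dirty: bool             # True if memory holds the only copy of the data
    mapped: bool = True
    poisoned: bool = False


pending = deque()  # page frame numbers reported by machine checks


def on_machine_check(pfn):
    """Toy exception path: do as little as possible, just record the page."""
    pending.append(pfn)


def drain(pages):
    """Toy deferred handler: runs later and decides what to do per page."""
    actions = []
    while pending:
        page = pages[pending.popleft()]
        page.mapped = False  # nobody should touch the bad frame again
        if page.dirty:
            page.poisoned = True  # data is lost; later readers must be told
            actions.append((page.pfn, "poison"))
        else:
            # Clean page: contents can be discarded and fetched again.
            actions.append((page.pfn, "discard"))
    return actions
```

Feeding it one dirty and one clean page yields a "poison" action for the first and a "discard" action for the second, and neither page stays mapped.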
Posted Sep 1, 2009 20:26 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
I guess what you're missing is who marks the memory as poisoned. The CPU sends a machine check to the OS. The OS marks the memory as poisoned, or otherwise discards the contents of the page if it was clean. The HWPOISON patch provides the OS handler and hooks to poison the page (or do whatever needs doing) when the machine check arrives.
Posted Sep 1, 2009 21:31 UTC (Tue) by dlang (✭ supporter ✭, #313)
Posted Sep 1, 2009 23:47 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds