Posted Aug 31, 2009 21:06 UTC (Mon) by dlang (✭ supporter ✭, #313)
In reply to: HWPOISON by jzbiciak
Parent article: HWPOISON
yes, that is how things traditionally worked.
however, the win here is to not generate a machine check when corrupted memory is detected, but instead wait to see if it matters.
if a memory location is corrupted but then written before it is read, the corruption doesn't matter: nothing ever tried to use the corrupted data.
this can be done in hardware, transparently to the OS. it will make systems less likely to crash, at the cost of a little more record keeping in the hardware.
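as a rough illustration of the "wait to see if it matters" idea, here is a toy model in Python. the class and method names are made up for this sketch and do not correspond to any real hardware or kernel interface; it only shows that detection records the fault, a write clears it, and only a read of still-bad data raises an error.

```python
class PoisonedRead(Exception):
    """Raised when something consumes a location still marked corrupt."""


class PoisonTrackingMemory:
    """Toy model only: hypothetical names, not a real hardware interface."""

    def __init__(self, size):
        self.cells = [0] * size
        self.poisoned = set()  # addresses known to hold bad data

    def mark_corrupt(self, addr):
        # Detection only records the fault; no error is raised yet.
        self.poisoned.add(addr)

    def write(self, addr, value):
        # Overwriting replaces the bad data, so the poison mark is dropped.
        self.cells[addr] = value
        self.poisoned.discard(addr)

    def read(self, addr):
        # Only an actual consumer of bad data triggers the error.
        if addr in self.poisoned:
            raise PoisonedRead(f"address {addr} is poisoned")
        return self.cells[addr]
```

in this model, marking an address corrupt and then writing to it leaves a later read working normally; the error only surfaces if the read comes first.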
if a memory location is corrupted, but it happens to be in a clean cached page (one whose contents still match what is on disk), the OS can respond to the error by throwing away the cached page and retrieving a fresh copy from disk.
since in modern systems a _large_ percentage of memory ends up being occupied by caches, making it so that errors in that memory just cause a momentary slowdown (a read from the disk) instead of a system crash is also a significant win.
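a minimal sketch of that clean-page case, again as a toy Python model rather than anything resembling the real page cache. `ToyPageCache` and `handle_memory_error` are hypothetical names, and a plain dict stands in for the disk.

```python
class ToyPageCache:
    """Toy model only: a dict stands in for the disk."""

    def __init__(self, disk):
        self.disk = disk      # page number -> bytes on "disk"
        self.pages = {}       # cached copies held in memory
        self.dirty = set()    # pages modified since they were read
        self.disk_reads = 0   # counts the slowdowns we pay

    def read(self, page_no):
        if page_no not in self.pages:
            # Cache miss: fetch the page from the backing store.
            self.pages[page_no] = self.disk[page_no]
            self.disk_reads += 1
        return self.pages[page_no]

    def write(self, page_no, data):
        # The in-memory copy now differs from the copy on disk.
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def handle_memory_error(self, page_no):
        """Return True if the error was absorbed by dropping a clean page."""
        if page_no in self.dirty:
            return False  # the only good copy lived in the bad memory
        self.pages.pop(page_no, None)  # the next read refetches from disk
        return True
```

the cost of an error in a clean page is one extra disk read; a dirty page cannot be recovered this way because the disk copy is stale.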
and finally, if both of the above fail (so the memory contents are irreplaceable), the OS can determine which program was running on that CPU at the time the read took place, and kill just that program (and log that the program was killed due to hardware memory errors, not an application bug) rather than killing the entire system.
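the three layers above can be put together as a simple decision ladder. this is only a sketch: the function, its parameters and the `Action` names are invented for illustration, and the final fallback (crashing when the bad data belongs to something other than a killable program) is my assumption about what is left over, not a description of the actual patch.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "corrupt data was never used"
    DROP_AND_REREAD = "discard clean cached page, reread from disk"
    KILL_PROCESS = "kill only the program that consumed the data"
    CRASH = "no safe recovery, take the whole system down"


def respond_to_corruption(overwritten_before_read,
                          is_clean_cache_page,
                          consumer_is_user_program):
    """Pick the least drastic response (hypothetical helper)."""
    if overwritten_before_read:
        return Action.NOTHING
    if is_clean_cache_page:
        return Action.DROP_AND_REREAD
    if consumer_is_user_program:
        return Action.KILL_PROCESS
    # Assumption: nothing else applies, so fall back to the old behavior.
    return Action.CRASH
```

each earlier branch that matches spares the system a more drastic outcome further down the ladder.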
none of these protections guarantee that the system won't crash when cosmic rays hit the ram, but each of these steps makes it less likely to crash.
given common use cases, I wouldn't be surprised to find that these sorts of strategies make systems an order of magnitude or two less likely to crash as a result of memory errors (although the gains in application reliability will not be as large, because some of the gain comes from killing applications instead of the entire system).