Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:06 UTC (Fri) by paragw (guest, #45306)
In reply to: Probable e1000e corruption culprit found (and 2.6.27.1 released) by nevets
Parent article: Probable e1000e corruption culprit found (and 2.6.27.1 released)

Thanks for explaining in detail - it is always very educational and interesting to learn from what follows failures.

On the hardware bug point - if we consider that it's just a bunch of memory that can be written to, won't it complicate the hardware interface if it has to validate what is being written? Or do other pieces of hardware provide some type of protection against such things? (Come to think of it - if the BIOS allows flashing and a kernel bug overwrites the BIOS - would that be the BIOS's fault?)

It's hard to see what the hardware bug is here and how the hardware could prevent it without unreasonable complexity - the fundamental problem is that we allow ioremap space to be shared. I would think that at least having an option of not sharing it, for people who don't use all of the 32-bit address space, would provide much more security than any other measure. Don't you think we need a CONFIG_SHARED_IOREMAP_SPACE=n, like now?



Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:25 UTC (Fri) by nevets (subscriber, #11875)

Hardware simply should not be permanently disabled simply by writing to some random register. Remember, this didn't cause the device to send out corrupted data. Simply writing into the NVM caused the card to turn into a brick.

The issue with cmpxchg is that it always writes data, even if the data it finds does not match. If the compare fails, it writes back the same data it read. Because the cmpxchg was performed on an IO address, it did not matter what it read: it would corrupt the data it found.
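As a rough illustration of the point above, here is a small user-space model in C. It is only a sketch: `struct mmio_reg`, `mmio_write()` and `model_cmpxchg()` are made-up names, not kernel or driver code, and the model merely mimics the "always writes" behaviour described in this comment by counting the write cycles a pretend register sees.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped register. */
struct mmio_reg {
    uint32_t value;
    unsigned writes;    /* write cycles the "device" has seen */
};

static uint32_t mmio_read(const struct mmio_reg *r)
{
    assert(r != NULL);
    return r->value;
}

static void mmio_write(struct mmio_reg *r, uint32_t v)
{
    assert(r != NULL);
    r->value = v;
    r->writes++;
}

/*
 * Model of the behaviour described above: the destination is written
 * whether or not the compare succeeds. On a mismatch the value that
 * was just read is written back.
 */
static bool model_cmpxchg(struct mmio_reg *r, uint32_t expected,
                          uint32_t desired)
{
    uint32_t old = mmio_read(r);
    bool match = (old == expected);

    mmio_write(r, match ? desired : old);
    return match;
}
```

In ordinary RAM the write-back on a failed compare is harmless. On a device register, where a write can have side effects regardless of the value written, the extra write cycle is the problem.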

The earlier workaround of making the NVM read-only was not just a workaround for this bug but a proper fix to the driver code as well. It will keep any other random bugs from bricking the card.

As for the CONFIG_SHARED_IOREMAP_SPACE, I don't think that is needed. That may have protected against the bug with ftrace, but it does not protect against other bugs writing into bad memory areas.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:38 UTC (Fri) by paragw (guest, #45306)

> As for the CONFIG_SHARED_IOREMAP_SPACE, I don't think that is needed. That may have protected against the bug with ftrace, but it does not protect against other bugs writing into bad memory areas.

Well, it could have saved some people's cards from bricking - who knows how many other cards can be bricked if bad stuff is written to some magic ioremap()able area. The difference between writing to other areas and writing to ioremap space is that, as we witnessed, the latter is fatal, while the former is always recoverable on boot.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 5:45 UTC (Fri) by Cato (subscriber, #7643)

A simple idea to safeguard the hardware interface is to ensure that immediately before writing any hardware registers (mapped to memory), the driver must also write a 'magic number' to a fixed location (and then clear it afterwards). This minimises the period during which a bug could stomp on the hardware, although it probably doesn't eliminate it unless the sequence can run without any interrupts.
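The magic-number idea might look roughly like the following user-space sketch in C. Everything here is hypothetical (`struct guarded_dev`, `WRITE_GUARD_MAGIC`, `reg_write_checked()`, `guarded_write()`); it is a model of the scheme as described, not real driver code, and it ignores interrupts and concurrency entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WRITE_GUARD_MAGIC 0x5AFEC0DEu   /* arbitrary made-up value */
#define NUM_REGS 4

/* Hypothetical device model: a guard word plus a few registers. */
struct guarded_dev {
    volatile uint32_t guard;
    uint32_t regs[NUM_REGS];
};

/* Refuses the write unless the guard word currently holds the magic. */
static int reg_write_checked(struct guarded_dev *d, unsigned idx, uint32_t v)
{
    assert(d != NULL);
    if (idx >= NUM_REGS || d->guard != WRITE_GUARD_MAGIC)
        return -1;
    d->regs[idx] = v;
    return 0;
}

/* Arm the guard, perform the write, then clear the guard again. */
static int guarded_write(struct guarded_dev *d, unsigned idx, uint32_t v)
{
    int rc;

    assert(d != NULL);
    d->guard = WRITE_GUARD_MAGIC;
    rc = reg_write_checked(d, idx, v);
    d->guard = 0;
    return rc;
}
```

Note that the check only runs for callers that go through `reg_write_checked()`. A stray store straight into `regs[]` never reaches it.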

I am not a kernel programmer, so the above may be implausible/impractical.

Magic number safeguard

Posted Oct 21, 2008 8:55 UTC (Tue) by i3839 (guest, #31386)

Such a protection mechanism, a specific sequence of actions that must be performed before doing anything potentially dangerous, should and can be implemented by the hardware, and is already done in e.g. some microcontrollers to protect flash/EPROM from accidental writes.

It can't be done in software in the kernel. Or rather, it can, but it is useless, because only functions that think they're going to do something dangerous do the checking, while in this case it was a regular CPU instruction that caused the corruption.
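To make the hardware-side variant concrete, here is a toy C model of a device that enforces an unlock sequence itself. The register layout, the key values 0x55 and 0xAA, and all names (`struct nvm_dev`, `dev_bus_write()`) are invented for illustration and do not describe any real part. The point is only that the check sits in the device, so every write is subject to it no matter which code path issued it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register map and key values. */
enum { REG_KEY = 0, REG_NVM = 1 };
#define KEY1 0x55u
#define KEY2 0xAAu

struct nvm_dev {
    uint32_t nvm;           /* the protected contents */
    unsigned unlock_step;   /* 0 = locked, 1 = first key seen, 2 = unlocked */
};

/* Every bus write, from any code path, ends up here in the "device". */
static void dev_bus_write(struct nvm_dev *d, unsigned reg, uint32_t val)
{
    assert(d != NULL);

    if (reg == REG_KEY) {
        if (d->unlock_step == 0 && val == KEY1)
            d->unlock_step = 1;
        else if (d->unlock_step == 1 && val == KEY2)
            d->unlock_step = 2;
        else
            d->unlock_step = 0;     /* wrong key or wrong order: relock */
        return;
    }

    /* An NVM write is accepted only right after the full key sequence. */
    if (reg == REG_NVM && d->unlock_step == 2)
        d->nvm = val;

    d->unlock_step = 0;             /* any non-key access relocks */
}
```

In this model a single stray write to `REG_NVM` is dropped, because it was not preceded by both keys in order, and the unlock is good for one write only.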
