e1000e and the joy of development kernels

By Jonathan Corbet
September 24, 2008

The 2.6.27-rc regression list posted on September 21 contains - deep within the list - an entry reading "e1000e: 2.6.27-rc1 corrupts EEPROM/NVM". One might be forgiven for missing it; the list of regressions is still (unfortunately) long, and there is nothing there to indicate that it is a notable problem. But it is: this particular bug goes beyond breaking networking; when it bites, it corrupts the EEPROM on the device, causing it to cease to function forevermore (or, at least, until the user can manage to flash the EEPROM with working code). This is a problem which is worth fixing.

As of this writing, though, nobody seems to know what the problem is. There was some confusion resulting from the fact that the related e1000 driver also suffered from an EEPROM corruption problem - but that turns out to have been an entirely different bug. The e1000 problem was fixed by putting a lock around accesses to the EEPROM, preventing corruption caused by concurrent access. But something else is going on with the e1000e.
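The e1000 fix described above comes down to serializing every EEPROM access. The following is a minimal sketch of that idea in Python, not the driver's actual C code; the `FakeEeprom` class, its `update_word` method, and the `hammer` helper are hypothetical names used only for illustration.

```python
import threading

class FakeEeprom:
    """Hypothetical stand-in for a device EEPROM; not the real driver code."""

    def __init__(self, size_words=64):
        self.words = [0] * size_words
        self._lock = threading.Lock()  # serializes every access

    def update_word(self, index, delta):
        # The read-modify-write sequence must not interleave with other
        # accessors, so the whole sequence runs with the lock held.
        with self._lock:
            value = self.words[index]
            self.words[index] = (value + delta) & 0xFFFF

def hammer(eeprom, index, count):
    """Apply `count` increments to one word from the calling thread."""
    for _ in range(count):
        eeprom.update_word(index, 1)

eeprom = FakeEeprom()
threads = [threading.Thread(target=hammer, args=(eeprom, 0, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(eeprom.words[0])  # 4000: no update is lost while the lock is held
```

Without the lock, two threads could read the same old value and one update would be lost; with real hardware, the analogous interleaving can leave the EEPROM contents inconsistent.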

Figuring out what that "something else" is appears to be a challenge. The problem is not readily reproducible, and there is this little problem that triggering the bug more than once requires the replacement of the affected hardware. It's not even clear which kernel versions are affected, though it appears that only the 2.6.27 development series shows the bug. There is some correlation between e1000e corruptions and graphics driver crashes, leading David Miller to pursue a hypothesis that the real culprit is changes to the X server, but that idea has not yet been proven. Other developers suspect a concurrency-related problem similar to the e1000 bug.

As of this writing, the bulk of what is known can be found in this advisory from Mandriva. Kernel developers are adding information to the kernel bugzilla entry as they find it.

It has been suggested that anybody running 2.6.27 on a potentially affected system might want to save a copy of the current EEPROM contents with a command like:

    ethtool -e eth0 > eth0.eeprom

(That assumes, of course, that the relevant device is eth0 on your system.) With the saved data, it should be possible to recover the device if the worst happens; without it, chances are that victims will have to return their systems to the vendor.
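For systems with more than one interface, the same backup can be scripted. The sketch below is only an illustration: it assumes `ethtool` is installed, that the script runs with enough privilege to read the EEPROM, and that interfaces are listed under `/sys/class/net`. The function names and the `.eeprom.bin` file naming are made up for this example. It uses ethtool's `raw on` option so that the saved file holds the binary contents rather than the hex dump.

```python
import subprocess
from pathlib import Path

def list_interfaces(sys_net="/sys/class/net"):
    """Return interface names found under sysfs, skipping loopback."""
    root = Path(sys_net)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.name != "lo")

def backup_eeprom(iface, out_dir="."):
    """Save the raw EEPROM of `iface`; return the file path, or None on failure."""
    try:
        result = subprocess.run(
            ["ethtool", "-e", iface, "raw", "on"],
            capture_output=True, check=True,
        )
        if not result.stdout:
            return None
        path = Path(out_dir) / f"{iface}.eeprom.bin"
        path.write_bytes(result.stdout)
    except (OSError, subprocess.CalledProcessError):
        # ethtool missing, insufficient privilege, no EEPROM access,
        # or the output directory is not writable.
        return None
    return path

if __name__ == "__main__":
    for name in list_interfaces():
        saved = backup_eeprom(name)
        print(name, "->", saved if saved else "not saved")
```

Keeping the copies somewhere other than the affected machine seems prudent, since a saved image is of little use if it cannot be reached when needed.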

In one sense, this bug demonstrates that the system works. It was caught while the kernel was still in the stabilization phase; one can be certain that it will be obliterated somehow before any stable 2.6.27 release comes out. On the other hand, the first report of this problem hit the net on August 8; the problem was known for over a month before distributors started responding to it and the all-out hunt for the cause began. That is a long time for any regression to persist, but it is especially long when one is dealing with a regression which has the ability to regress hardware back to a stone-age state.

The distributors have now responded; most of them have withdrawn kernels with the affected drivers. So far, nobody has posted tools to help affected users recover their hardware (suggestions to use ibautil should be ignored and forgotten about as soon as possible). Such a tool is forthcoming, but it would be hard to blame the relevant engineers for focusing on fixing the problem first. With any luck at all, the root cause will have been isolated by the time you read this.

There is one thing that will not have changed, though. Testers of unstable software - especially the kernel - have often been warned that said software can do all kinds of terrible things to their systems. It is easy to ignore those warnings; even -rc1 kernels actually work for most people, most of the time. But, as we have seen in this case, the potential for catastrophic bugs is real. Development code can brick your network adapter, scramble your filesystems, open up severe security holes, or save your documents as OOXML. When experimenting with unstable code - even if it has been neatly packaged by your distributor - it is always prudent to have good backups and an even better sense of humor.


e1000e and the joy of development kernels

Posted Sep 25, 2008 1:57 UTC (Thu) by jwb (guest, #15467) [Link]

This really is a hardware problem. Before the hardware guys cast these interfaces into stone, they need to go to the software people and ask "Is this interface stupid?" In the current instance, the software people would have assuredly said "Yes, that's moronic, please go back and do it differently." We could have avoided this whole unpleasant business with sane hardware/software interfaces.

e1000e and the joy of development kernels

Posted Sep 25, 2008 5:28 UTC (Thu) by Thalience (subscriber, #4217) [Link] (1 responses)

So far I've heard three different (but related) theories that seem plausible, as far as my limited knowledge of the hardware goes.

1) The e1000e driver leaves the EEPROM mmio area mapped read-write. Then a rogue pointer from another kernel subsystem leads to tickling the control registers in a way that corrupts the EEPROM, or overwrites the mapped EEPROM data directly.

2) Same thing but with the X server somehow creating its own rw mapping of the mmio area (since the kernel's mapping should not be valid for a user process). Then a rogue pointer into that area.....

3) A wild DMA into the mmio area. This may be the nastiest possibility, since DMA writes may not respect the permissions on the mapping (could write through a mapping that is read-only for the cpu) unless there is an IOMMU involved.

These all lead to the natural thoughts, "Why would you design hardware where this can happen?" and "Is this issue lurking in other drivers for devices with EEPROMs?"

e1000e and the joy of development kernels

Posted Sep 25, 2008 19:54 UTC (Thu) by iabervon (subscriber, #722) [Link]

My guess is that the X driver is getting too much mmio space mapped, and accidentally writing into whatever's next; the kernel panics regardless, but it's only particularly notable if the ethernet driver happens to have just done the special ritual to start writing to the eeprom (which it's doing to reprogram it correctly), and then the graphics driver happens to hit it.

Ah - that's what happened to my laptop ...

Posted Sep 25, 2008 6:03 UTC (Thu) by Per_Bothner (subscriber, #7375) [Link]

This is on my shiny new Lenovo T400. Pretty nice on the whole (love the battery-life, light weight, and especially the daylight-readable LED-backlit display, all at a fairly modest price), but it was disturbing to have my Ethernet interface dead - even on Windows. (I needed to run a development system because of the new Intel chipset - otherwise no WiFi and borderline X support.)
I bought a cheap Airlink 101 USB Ethernet Adapter, which worked out-of-the box (even on the older Fedora 9), but it's obviously less convenient than the builtin Ethernet port.

Thanks to LWN, at least I now know what is going on!

e1000e and the joy of development kernels

Posted Sep 25, 2008 10:41 UTC (Thu) by lab (guest, #51153) [Link] (1 responses)

...or save your documents as OOXML..

:-O)

e1000e and the joy of development kernels

Posted Sep 26, 2008 4:20 UTC (Fri) by wilreichert (guest, #17680) [Link]

Still not as scary as adding a ^M to the end of every line in your text files.

e1000e and the joy of development kernels

Posted Sep 25, 2008 13:57 UTC (Thu) by nix (subscriber, #2304) [Link]

Of course, even *stable* software can do all those things, too. There's
nothing magic about the stabilization process that zaps all these bugs: it
just flushes the common ones out into the open where they can be fixed.

automating eeprom backups

Posted Sep 25, 2008 14:23 UTC (Thu) by michaelkjohnson (subscriber, #41438) [Link]

Given that it's not actually known 100% for certain that this is limited to new kernels, the next update to Foresight Linux will automatically back up ethernet device eeprom data when possible even though it will run a 2.6.26 kernel. I'm not sure that Xorg 7.4 and a 2.6.26 kernel combination has been deployed widely enough to know whether this is a kernel regression or not, and since the next release of Foresight will include Xorg 7.4, we're taking what we hope is the safe route.

Perhaps other distributions would like to deploy a similar automated backup on update, at least until the problem is understood?

e1000e and the joy of development kernels

Posted Sep 25, 2008 16:42 UTC (Thu) by aegl (subscriber, #37581) [Link] (3 responses)

Why is it so hard to restore the EEPROM (given that it seems to be so easy to overwrite it)?

Most of the data should be the same from one system to another, so if you can find another system with the same rev of the e1000, you can copy from there. The exception is the MAC address. You can almost certainly find that by searching the /etc/... files that your distro set up for the NIC when it installed the system (on RHEL, look for HWADDR in /etc/sysconfig/network-scripts/ifcfg-eth0). Worst case would be that you make up a new one, being sure not to duplicate some other system on your local network (though a 48-bit number chosen at random is very, very unlikely to match an existing device).

Am I missing something?
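The two MAC address ideas in this comment can be sketched as follows. This is an illustration only: the sample configuration text and its address are invented, and the function names are hypothetical. One detail goes beyond the comment: a made-up address should normally have the "locally administered" bit set and the multicast bit clear in its first octet, which leaves 46 random bits rather than 48.

```python
import re
import secrets

def find_hwaddr(ifcfg_text):
    """Return the HWADDR value from ifcfg-style text, or None if absent."""
    match = re.search(
        r'^\s*HWADDR\s*=\s*"?([0-9A-Fa-f:]{17})"?\s*$',
        ifcfg_text,
        re.MULTILINE,
    )
    return match.group(1).upper() if match else None

def random_local_mac():
    """Make a random unicast, locally administered MAC address."""
    octets = bytearray(secrets.token_bytes(6))
    # Set the locally administered bit (0x02), clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02X}" for b in octets)

# Invented sample standing in for an ifcfg-eth0 file.
sample = "DEVICE=eth0\nHWADDR=00:1c:25:12:34:56\nONBOOT=yes\n"
print(find_hwaddr(sample))   # 00:1C:25:12:34:56
print(random_local_mac())    # different on every run
```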

e1000e and the joy of development kernels

Posted Sep 25, 2008 16:46 UTC (Thu) by corbet (editor, #1) [Link]

I don't know why it's so hard, but the word on the list is that Dave Airlie bricked his laptop (the whole machine, not just the adapter) trying. Not fun.

e1000e and the joy of development kernels

Posted Sep 25, 2008 17:21 UTC (Thu) by dw (subscriber, #12017) [Link]

I read somewhere that affected adaptors may not even show up on the PCI bus, which I (in my hardware-uninitiated state) imagine would make it hard to discover where to write the new EEPROM data to.

e1000e and the joy of development kernels

Posted Sep 25, 2008 19:43 UTC (Thu) by iabervon (subscriber, #722) [Link]

I'd guess that the EEPROM contains some information about how the chipset is connected together, to the rest of the machine, and to the ethernet jack, meaning that you have to find an identical device, not just another e1000e, and putting something plausible but wrong could make it do further damage. (In particular, I wouldn't be too surprised if the device checks the eeprom checksum, if that's okay, initializes the PHY, and waits for that before setting up PCI bus interactions; if the checksum is good but the PHY info is wrong, it'll wait forever.)
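On the checksum point: the Linux e1000 and e1000e drivers validate the NVM by summing its first 64 16-bit words and expecting the total to be 0xBABA. Whether the hardware itself performs the same check at power-up is the guess made above, not something this sketch confirms. The code below applies the driver-side rule to a raw image, assuming little-endian words; the function name and the synthetic test image are hypothetical.

```python
import struct

NVM_SUM = 0xBABA       # expected total, per the Linux e1000/e1000e drivers
CHECKSUM_WORDS = 0x40  # words 0x00..0x3F; the last one holds the checksum

def nvm_checksum_ok(raw):
    """Check whether the first 64 little-endian words sum to 0xBABA."""
    if len(raw) < CHECKSUM_WORDS * 2:
        return False
    words = struct.unpack_from(f"<{CHECKSUM_WORDS}H", raw)
    return (sum(words) & 0xFFFF) == NVM_SUM

# Synthetic image for illustration: 63 arbitrary words plus a matching checksum.
body = [0x1234] * (CHECKSUM_WORDS - 1)
checksum = (NVM_SUM - sum(body)) & 0xFFFF
image = struct.pack(f"<{CHECKSUM_WORDS}H", *body, checksum)

print(nvm_checksum_ok(image))          # True
print(nvm_checksum_ok(b"\xff" * 128))  # False
```

A passing checksum says nothing about whether the contents are right for a particular machine, which is the scenario this comment worries about.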

e1000e and the joy of development kernels

Posted Oct 3, 2008 18:15 UTC (Fri) by bicchi (guest, #40227) [Link] (1 responses)

Nice to know that with "ethtool -e eth0 > eth0.eeprom" I can back up the EEPROM data, but how do I restore it in case of failure? I am affected by this bug since I do have an e1000:

# lspci | grep 8256[67]
00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network Connection (rev 02)

# lsmod | grep e1000
e1000 137536 0

Perhaps the author should post in the article how to restore it?

e1000e and the joy of development kernels

Posted Oct 3, 2008 18:42 UTC (Fri) by dlang (guest, #313) [Link]

they are still working to develop the tool to be able to do the restore.

Note that it appears that not all e1000 cards are vulnerable, only the ones on laptops (and possibly not all of those, but I have seen comments that the PCI/PCIe cards are not vulnerable).