e1000e and the joy of development kernels
As of this writing, though, nobody seems to know what the problem is. There was some confusion resulting from the fact that the related e1000 driver also suffered from an EEPROM corruption problem - but that turns out to have been an entirely different bug. The e1000 problem was fixed by putting a lock around accesses to the EEPROM, preventing corruption caused by concurrent access. But something else is going on with the e1000e.
Figuring out what that "something else" is appears to be a challenge. The problem is not readily reproducible, and there is this little problem that triggering the bug more than once requires the replacement of the affected hardware. It's not even clear which kernel versions are affected, though it appears that only the 2.6.27 development series shows the bug. There is some correlation between e1000e corruptions and graphics driver crashes, leading David Miller to pursue a hypothesis that the real culprit is changes to the X server, but that idea has not yet been proven. Other developers suspect a concurrency-related problem similar to the e1000 bug.
As of this writing, the bulk of what is known can be found in this advisory from Mandriva. Kernel developers are adding information to the kernel bugzilla entry as they find it.
It has been suggested that anybody running 2.6.27 on a potentially affected system might want to save a copy of the current EEPROM contents with a command like:
ethtool -e eth0 > eth0.eeprom
(That assumes, of course, that the relevant device is eth0 on your system.) With the saved data, it should be possible to recover the device if the worst happens; without it, chances are that victims will have to return their systems to the vendor.
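For systems with more than one potentially affected adapter, the backup suggested above can be scripted. What follows is a minimal sketch, not a tested tool: it walks sysfs to find interfaces bound to a given driver and saves both the hex dump produced by the command above and a raw binary dump. The variables SYSFS_ROOT and BACKUP_DIR and all of the function names are assumptions introduced for illustration; the only ethtool features used are -e and its "raw on" option, and the dumps need to be taken as root.

```shell
#!/bin/sh
# Sketch only: save EEPROM dumps for interfaces bound to a given driver.
# SYSFS_ROOT and BACKUP_DIR are hypothetical knobs, not standard variables.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/eeprom}"

# Print the name of the kernel driver bound to interface $1 (nothing if none).
iface_driver() {
    link="$SYSFS_ROOT/class/net/$1/device/driver"
    [ -L "$link" ] && basename "$(readlink "$link")"
}

# Print every interface whose driver name is exactly $1, one per line.
ifaces_for_driver() {
    for path in "$SYSFS_ROOT"/class/net/*; do
        [ -e "$path" ] || continue
        name=$(basename "$path")
        [ "$(iface_driver "$name")" = "$1" ] && echo "$name"
    done
    return 0
}

# Save a hex dump and a raw binary dump of the EEPROM of interface $1.
backup_eeprom() {
    iface=$1
    mkdir -p "$BACKUP_DIR" || return 1
    ethtool -e "$iface" > "$BACKUP_DIR/$iface.eeprom" &&
        ethtool -e "$iface" raw on > "$BACKUP_DIR/$iface.eeprom.bin"
}
```

Run as root, something like `for i in $(ifaces_for_driver e1000e); do backup_eeprom "$i"; done` would save one pair of dumps per adapter. Copying the results off the machine is probably wise, given what else a development kernel might do to the local disk.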
In one sense, this bug demonstrates that the system works. It was caught while the kernel was still in the stabilization phase; one can be certain that it will be obliterated somehow before any stable 2.6.27 release comes out. On the other hand, the first report of this problem hit the net on August 8; the problem was known for over a month before distributors started responding to it and the all-out hunt for the cause began. That is a long time for any regression to persist, but it is especially long when one is dealing with a regression which has the ability to regress hardware back to a stone-age state.
The distributors have now responded; most of them have withdrawn kernels with the affected drivers. So far, nobody has posted tools to help affected users recover their hardware (suggestions to use ibautil should be ignored and forgotten about as soon as possible). Such a tool is forthcoming, but it would be hard to blame the relevant engineers for focusing on fixing the problem first. With any luck at all, the root cause will have been isolated by the time you read this.
There is one thing that will not have changed, though. Testers of
unstable software - especially the kernel - have often been warned that
said software can do all kinds of terrible things to their systems. It is
easy to ignore those warnings; even -rc1 kernels actually work for most
people, most of the time. But, as we have seen in this case, the
potential for catastrophic bugs is real. Development code can brick your
network adapter, scramble your filesystems, open up severe security holes,
or save your documents as OOXML. When experimenting with unstable code -
even if it has been neatly packaged by your distributor - it is always
prudent to have good backups and an even better sense of humor.
Index entries for this article | |
---|---|
Kernel | Device drivers/Network drivers |
Posted Sep 25, 2008 1:57 UTC (Thu)
by jwb (guest, #15467)
Posted Sep 25, 2008 5:28 UTC (Thu)
by Thalience (subscriber, #4217)
1) The e1000e driver leaves the EEPROM mmio area mapped read-write. Then a rogue pointer from another kernel subsystem leads to tickling the control registers in a way that corrupts the EEPROM, or overwrites the mapped EEPROM data directly.
2) Same thing but with the X server somehow creating its own rw mapping of the mmio area (since the kernel's mapping should not be valid for a user process). Then a rogue pointer into that area.....
3) A wild DMA into the mmio area. This may be the nastiest possibility, since DMA writes may not respect the permissions on the mapping (could write through a mapping that is read-only for the cpu) unless there is an IOMMU involved.
These all lead to the natural thoughts, "Why would you design hardware where this can happen?" and "Is this issue lurking in other drivers for devices with EEPROMs?"
Posted Sep 25, 2008 19:54 UTC (Thu)
by iabervon (subscriber, #722)
Posted Sep 25, 2008 6:03 UTC (Thu)
by Per_Bothner (subscriber, #7375)
Thanks to LWN, at least I now know what is going on!
Posted Sep 25, 2008 10:41 UTC (Thu)
by lab (guest, #51153)
:-O)
Posted Sep 26, 2008 4:20 UTC (Fri)
by wilreichert (guest, #17680)
Posted Sep 25, 2008 13:57 UTC (Thu)
by nix (subscriber, #2304)
Posted Sep 25, 2008 14:23 UTC (Thu)
by michaelkjohnson (subscriber, #41438)
Given that it's not actually known 100% for certain that this is limited to new kernels, the next update to Foresight Linux will automatically back up ethernet device eeprom data when possible even though it will run a 2.6.26 kernel. I'm not sure that Xorg 7.4 and a 2.6.26 kernel combination has been deployed widely enough to know whether this is a kernel regression or not, and since the next release of Foresight will include Xorg 7.4, we're taking what we hope is the safe route. Perhaps other distributions would like to deploy a similar automated backup on update, at least until the problem is understood?
Posted Sep 25, 2008 16:42 UTC (Thu)
by aegl (subscriber, #37581)
Most of the data should be the same from one system to another ... so if you can find another system with the same rev of the e1000, you can copy from there. The exception is the MAC address. You can almost certainly find that by searching the /etc/... files that your distro set up for the NIC when it installed the system (on RHEL look for HWADDR in /etc/sysconfig/network-scripts/ifcfg-eth0). Worst case would be that you make up a new one (being sure to not duplicate some other system on your local network ... but a 48-bit number chosen at random is very, very unlikely to match an existing device).
Am I missing something?
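A rough sketch of the two fallbacks described above, assuming a RHEL-style ifcfg file: look for a saved HWADDR first and, failing that, make up an address. The function names are invented for illustration. One addition that is not in the comment above: the generated address has the locally-administered bit set and the multicast bit cleared in its first octet, which is the usual convention for made-up MAC addresses.

```shell
#!/bin/sh
# Sketch only: recover or invent a MAC address for a NIC whose EEPROM is gone.

# Print the HWADDR value from the ifcfg-style file $1 (nothing if absent).
saved_hwaddr() {
    sed -n 's/^HWADDR=//p' "$1" 2>/dev/null | tr -d '"' | head -n 1
}

# Print a random 48-bit MAC address: unicast, locally administered.
random_mac() {
    # Six random bytes as two-digit hex words.
    set -- $(od -An -N6 -tx1 /dev/urandom)
    # Clear bit 0 (multicast) and set bit 1 (locally administered).
    first=$(( (0x$1 & 0xfe) | 0x02 ))
    printf '%02x:%s:%s:%s:%s:%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}
```

For example, `saved_hwaddr /etc/sysconfig/network-scripts/ifcfg-eth0` prints the recorded address if the distribution wrote one there; `random_mac` is the worst-case fallback.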
Posted Sep 25, 2008 16:46 UTC (Thu)
by corbet (editor, #1)
Posted Sep 25, 2008 17:21 UTC (Thu)
by dw (subscriber, #12017)
Posted Sep 25, 2008 19:43 UTC (Thu)
by iabervon (subscriber, #722)
Posted Oct 3, 2008 18:15 UTC (Fri)
by bicchi (guest, #40227)
# lspci | grep '8256[67]'
# lsmod | grep e1000
Perhaps the author should post in the article how to restore it?
Posted Oct 3, 2008 18:42 UTC (Fri)
by dlang (guest, #313)
Note that it appears that not all e1000 cards are vulnerable, only the ones on laptops (and possibly not all of those, but I have seen comments that the PCI/PCIE cards are not vulnerable).
Ah - that's what happened to my laptop ...
I bought a cheap Airlink 101 USB Ethernet Adapter, which worked out-of-the box (even on the older Fedora 9), but it's obviously less convenient than the builtin Ethernet port.
nothing magic about the stabilization process that zaps all these bugs: it
just flushes the common ones out into the open where they can be fixed.
automating eeprom backups
I don't know why it's so hard, but the word on the list is that Dave Airlie bricked his laptop (the whole machine, not just the adapter) trying. Not fun.
00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network Connection (rev 02)
e1000 137536 0