e1000e and the joy of development kernels
As of this writing, though, nobody seems to know what the problem is. There was some confusion resulting from the fact that the related e1000 driver also suffered from an EEPROM corruption problem - but that turns out to have been an entirely different bug. The e1000 problem was fixed by putting a lock around accesses to the EEPROM, preventing corruption caused by concurrent access. But something else is going on with the e1000e.
Figuring out what that "something else" is appears to be a challenge. The problem is not readily reproducible, and there is this little problem that triggering the bug more than once requires the replacement of the affected hardware. It's not even clear which kernel versions are affected, though it appears that only the 2.6.27 development series shows the bug. There is some correlation between e1000e corruptions and graphics driver crashes, leading David Miller to pursue a hypothesis that the real culprit is changes to the X server, but that idea has not yet been proven. Other developers suspect a concurrency-related problem similar to the e1000 bug.
As of this writing, the bulk of what is known can be found in this advisory from Mandriva. Kernel developers are adding information to the kernel bugzilla entry as they find it.
It has been suggested that anybody running 2.6.27 on a potentially affected system might want to save a copy of the current EEPROM contents with a command like:
ethtool -e eth0 > eth0.eeprom
(That assumes, of course, that the relevant device is eth0 on your system.) With the saved data, it should be possible to recover the device if the worst happens; without it, chances are that victims will have to return their systems to the vendor.
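On a system with more than one network interface, the same command can be wrapped in a small loop. What follows is only a sketch under stated assumptions, not a vetted recovery tool: it assumes that interfaces are listed under /sys/class/net and that "ethtool -i" names the driver on a "driver:" line. The function names and the ETHTOOL and NETDIR variables are invented here for illustration; they exist mainly so the paths can be overridden.

```shell
#!/bin/sh
# Sketch: save the EEPROM contents of every interface bound to e1000e.
# Assumptions (not from the article): interfaces appear under
# /sys/class/net, and "ethtool -i IFACE" prints a "driver: NAME" line.
# ETHTOOL and NETDIR are hypothetical knobs, overridable for testing.

ETHTOOL=${ETHTOOL:-ethtool}
NETDIR=${NETDIR:-/sys/class/net}

# Print the driver name that ethtool reports for the given interface.
iface_driver() {
    "$ETHTOOL" -i "$1" 2>/dev/null | sed -n 's/^driver: *//p'
}

# Dump the EEPROM of each e1000e interface to <outdir>/<iface>.eeprom.
backup_e1000e_eeproms() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    for path in "$NETDIR"/*; do
        [ -e "$path" ] || continue
        iface=${path##*/}
        # Skip interfaces driven by anything other than e1000e.
        [ "$(iface_driver "$iface")" = "e1000e" ] || continue
        # Same command as above, one output file per interface.
        "$ETHTOOL" -e "$iface" > "$outdir/$iface.eeprom" || return 1
        echo "saved $outdir/$iface.eeprom"
    done
}
```

After sourcing the file as root, something like "backup_e1000e_eeproms /root/eeprom-backup" would write one file per matching interface. Keeping a copy of those files on another machine seems prudent, since a system with a dead network adapter may be awkward to get data off of.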
In one sense, this bug demonstrates that the system works. It was caught while the kernel was still in the stabilization phase; one can be certain that it will be obliterated somehow before any stable 2.6.27 release comes out. On the other hand, the first report of this problem hit the net on August 8; the problem was known for over a month before distributors started responding to it and the all-out hunt for the cause began. That is a long time for any regression to persist, but it is especially long when one is dealing with a regression which has the ability to regress hardware back to a stone-age state.
The distributors have now responded; most of them have withdrawn kernels with the affected drivers. So far, nobody has posted tools to help affected users recover their hardware (suggestions to use ibautil should be ignored and forgotten about as soon as possible). Such a tool is forthcoming, but it would be hard to blame the relevant engineers for focusing on fixing the problem first. With any luck at all, the root cause will have been isolated by the time you read this.
There is one thing that will not have changed, though. Testers of unstable software - especially the kernel - have often been warned that said software can do all kinds of terrible things to their systems. It is easy to ignore those warnings; even -rc1 kernels actually work for most people, most of the time. But, as we have seen in this case, the potential for catastrophic bugs is real. Development code can brick your network adapter, scramble your filesystems, open up severe security holes, or save your documents as OOXML. When experimenting with unstable code - even if it has been neatly packaged by your distributor - it is always prudent to have good backups and an even better sense of humor.
