
The state of the e1000e bug

By Jonathan Corbet
October 1, 2008
Linus Torvalds sent out the 2.6.27-rc8 release on September 29 with this comment:

This one should be the last one: we're certainly not running out of regressions, but at the same time, at some point I just have to pick some point, and on the whole the regressions don't look _too_ scary.

This assertion raised a few eyebrows among those who are nervously watching the e1000e corruption bug. While the development community disagrees on all kinds of issues, there is a reasonably strong consensus that hardware-destroying bugs can be seen as "scary."

Given that, it would be nice to say that this particular regression has been tracked down and fixed, but that is not the case. As of this writing, nobody knows what is causing systems with 2.6.27-rc kernels to occasionally overwrite the EEPROM on e1000e network adapters. The progress that has been made, while discouragingly small, does narrow the problem down a bit:

  • There was an early hypothesis that the GEM graphics memory manager code might be responsible for the problem. There have been reports of corruption on distributions which do not package GEM, though, so GEM is no longer a suspect.

  • For similar reasons, the idea that the page attribute table (PAT) work could somehow be responsible has been discarded.

  • There has been a strong correlation between corrupted hardware and the presence of Intel graphics hardware. That has led to a lot of speculation that the X.org Intel driver may somehow be doing the actual corruption, though a separate bug in the e1000e driver may be enabling that to happen. But there is now a report of corruption on a system running NVIDIA graphics. If that report is truly the same problem, then the X.org hypothesis will be substantially weakened. (As an aside, it's worth pondering what would have happened if NVIDIA users had reported the problem first; the temptation to blame the proprietary NVIDIA driver could have been strong enough to delay action on the bug for some time.)

So the signs point toward a problem localized within the e1000e driver, but it is too early to draw that conclusion. This bug remains mysterious, and it could turn out to have surprising origins.

The nature of this bug makes it harder than usual to track down. It seems to depend on some sort of race condition, so it is hard to reproduce. And the way in which the bug makes itself known, by destroying hardware, greatly reduces the number of testers trying to reproduce it. People who can avoid the affected combination of hardware and software are doing so, and distributors shipping development kernels have disabled the e1000e driver. Dave Airlie's approach:

But I'm leaving this up to Intel, I don't think HP will take it too kindly if I keep returning my laptop.

must be fairly typical.

One gets the sense that a fairly hot fire has been ignited underneath a number of posteriors at Intel; its developers are active in the discussion and clearly want to get this one solved. One objective has been the creation of a utility which would return corrupted hardware to a functioning state, but that tool has been slow in coming. Restoring trashed e1000e adapters appears to be a hard problem, but this is one that Intel has to get right: if more testers are to be encouraged to risk corruption on the understanding that the recovery tool will fix them up again, that tool needs to actually work when the time comes. So it is hard to blame Intel for taking the time to ensure that the recovery tool will do its job, but, in the meantime, its absence is making testing harder.

Frans Pop raised an interesting long-term concern: even if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history. Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings. Rewriting the bug out of the mainline repository's history is possible with git, but it would create disruption for everybody working from a clone of the repository.
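
A careful tester could, in principle, carry a protective patch along through the bisection instead. What follows is a minimal sketch of that approach, assuming the eventual fix is known; the version tags are examples and $FIX is only a placeholder for that fix's commit ID:

    git bisect start v2.6.27-rc8 v2.6.26    # bad point first, then good
    git cherry-pick -n $FIX    # apply the protective patch, uncommitted
    # ... build and test the kernel as usual ...
    git reset --hard           # drop the temporary patch again
    git bisect good            # or "git bisect bad"; then repeat the cherry-pick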

Meanwhile, there could be some interesting consequences if the resolution of this problem takes much more time. It is hard to imagine that the 2.6.27 kernel could be released with a regression of this magnitude; let us say that the reaction in the mainstream press would not be kind. A 2.6.27 delay could force delays in a number of upcoming distribution releases. This kind of cascading delay would not look good; it would, instead, be reminiscent of the troubles encountered by certain proprietary software companies.

That said, the system is clearly working. Testers found the problem before the code was released in anything resembling a stable form. Developers are now chasing after the bug as quickly as they can. There will be no stable kernel or distribution releases which corrupt hardware. This situation is a pain, but it will soon be resolved and forgotten.



The state of the e1000e bug ... 2.6.27 fix available now

Posted Oct 2, 2008 1:39 UTC (Thu) by arjan (subscriber, #36785) [Link] (5 responses)

The thread in http://lkml.org/lkml/2008/10/1/368 has a patch that will prevent NVM corruption (and has done so in our extensive testing).
Linus has already merged this patch.

Now, there's something else going on where "something" is overwriting memory... but now that the NVM no longer gets corrupted, that is likely to be found very quickly (and it is very likely unrelated to e1000e itself).

The state of the e1000e bug ... 2.6.27 fix available now

Posted Oct 2, 2008 1:45 UTC (Thu) by corbet (editor, #1) [Link] (4 responses)

The patch is good stuff and will allow things to move ahead, but calling it a "fix" seems like wishful thinking. The patch interposes a barrier between the bug and its effects. That is very much a good thing to do, but it has only mitigated the symptoms of the bug, not "fixed" the bug. I sure hope that a real fix will be forthcoming before 2.6.27 comes out.

The state of the e1000e bug ... 2.6.27 fix available now

Posted Oct 2, 2008 2:05 UTC (Thu) by arjan (subscriber, #36785) [Link] (2 responses)

Yeah... it only fixes the e1000e part of the story. It doesn't fix the part that is causing the corruption in the first place.

The state of the e1000e bug ... 2.6.27 fix available now

Posted Oct 2, 2008 5:00 UTC (Thu) by smoogen (subscriber, #97) [Link] (1 response)

/me waits to find out that this was caused by the 'TCP' security bug that wipes out all stacks.

http://www.theregister.co.uk/2008/10/01/fundamental_net_v...

and it turns out that the bug is caused by the incoming packets from various 'testers' on the internet.

The state of the e1000e bug ... 2.6.27 fix available now

Posted Oct 2, 2008 17:17 UTC (Thu) by s0f4r (guest, #52284) [Link]

Unlikely, as the bug has been hit by several testers in isolated testing labs.

The state of the e1000e bug ... 2.6.27 fix available now

Posted Oct 2, 2008 19:28 UTC (Thu) by iabervon (subscriber, #722) [Link]

It seems pretty likely that the actual bug isn't in the kernel, though, and therefore holding up 2.6.27 might not be appropriate now that the latest kernel prevents this particular sort of userspace misbehavior from persistently messing up hardware. The current evidence doesn't exclude a scenario like this: some X driver, while probing the system for its hardware, maps the framebuffer too large and writes into it, spilling into whatever comes after it, which is generally either nothing or unwritable; that, in turn, correctly leads to the determination that the driver isn't appropriate. Some arbitrary device would thus get some arbitrary invalid I/O at a point where things are mostly idle, and it wouldn't get any particular attention unless it happened to do serious damage (i.e., something that would be noticed later). If the kernel gets things back to a state where nothing terrible happens due to the bug, and maybe even logs the event, that's enough for 2.6.27.

The state of the e1000e bug

Posted Oct 6, 2008 7:31 UTC (Mon) by jzbiciak (guest, #5246) [Link] (1 responses)

I'm not a git user, nor am I a kernel developer, but this caught my eye:

[E]ven if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history. Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings.

It seems like it would be useful to have a git bisect mode that allowed you to pin some changesets while otherwise warping you to the next kernel in a bisection sequence. In other words, you want to get "version XXX plus these N changesets." That seems like it might be a generically useful facility. It also seems like it'd let people bisect to find other bugs while holding certain things fixed, such as e1000e.

Does git provide for such a thing?

The state of the e1000e bug

Posted Oct 19, 2008 1:35 UTC (Sun) by Duncan (guest, #6647) [Link]

While you've almost certainly stopped watching for replies, perhaps this comment will help someone else coming across this later, perhaps from a Google search...

Fortunately I don't have an e1000e NIC, so this particular bug hasn't been a problem here (and, by the time I write this, it has been traced to the ftrace framework and fixed properly), but I did have a bug with this kernel and used git bisect on it, so the question does pertain, and it would have been of immediate interest if I did have the hardware.

In general, it's quite possible to revert any specific commit or set of commits while bisecting or otherwise testing with git. However, there are a couple of problems with trying to do it that way.

One, it's quite possible that additional commits will have been built upon the problem commit, so one could well end up reverting a decent-sized swath of commits, to the point where one couldn't really be said to be testing the kernel at that particular point anyway, potentially invalidating any conclusions the testing comes to. Perhaps not (indeed, probably not), but it's certainly a complicating consideration.

Two, as hindsight shows now that this bug has been traced, the bug can be in an area entirely unrelated (except via the bug) to where it actually shows up. It's sort of like a leaky roof: the hole in the roof may be several meters away from where the water drips through the ceiling! In this case, it was a bug in the new ftrace functionality, coupled with removing modules, that was eventually found to cause the problem. ftrace has been disabled in 2.6.27.1, but the point is that, until the problem is fully traced, there's no guarantee that one would pick the correct commits to revert while bisecting in any case. I've no idea how long ago the last e1000e commits were, but supposing they happened in this kernel, the instinctive thing to do would have been to revert all of them while doing the bisect. That wouldn't have helped in this case, and there was no way of knowing until later what /would/ have helped, since the ftrace work was otherwise entirely unrelated.

Thus a bisect with the supposedly offending changes reverted might both lead to the wrong conclusions and still not remove the danger of bricking the hardware.
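
For concreteness, the mechanics of carrying a revert through a bisection look roughly like this. This is only a sketch: the version tags are examples, and $SUSPECT is a placeholder for whatever commit one suspects, which here would instinctively (and wrongly) have been the e1000e ones:

    git bisect start v2.6.27-rc8 v2.6.26    # bad point first, then good
    git revert -n $SUSPECT     # back the suspect commit out, uncommitted
    # ... build and test the kernel ...
    git reset --hard           # discard the temporary revert
    git bisect good            # or "git bisect bad"; then repeat the revert
    # Note: the revert can conflict if later commits built on the
    # suspect one; that is problem one above.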

Unfortunately, the fact remains that testing unreleased kernels is risky. Indeed, conservative folks will likely want to stay a full kernel release back, not installing 2.6.26 until 2.6.27 is out at least, and only then installing whatever happens to be the latest 2.6.26.x stable release. Even distribution kernels were bitten by this, although obviously only the most bleeding-edge ones, those shipped as -rc testing kernels for users willing to risk their machines and try them, and this -rc series DID point out the very literal meaning of the "risk their machines" bit. It's certainly not for everyone, but as someone who does run -rc kernels (though only from about -rc3 on), it can be rewarding too: there's nothing quite like the feeling of being able to point to a particular -rc bug and say "but for me, that might have made it to release; I played my part in making this kernel a good one", especially for folks (like me) who might write sysadmin-level bash scripts, but little more.

Duncan


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds