The state of the e1000e bug
This assertion raised a few eyebrows among those who are nervously watching the e1000e corruption bug. While the development community disagrees on all kinds of issues, there is a reasonably strong consensus that hardware-destroying bugs can be seen as "scary."
Given that, it would be nice to say that this particular regression has been tracked down and fixed, but that is not the case. As of this writing, nobody knows what is causing systems with 2.6.27-rc kernels to occasionally overwrite the EEPROM on e1000e network adapters. The progress which has been made, while discouragingly small, does narrow down the problem a bit:
- There was an early hypothesis that the GEM graphics memory manager
  code might be responsible for the problem. There have been reports of
  corruption on distributions which do not package GEM, though, so GEM
  is no longer a suspect.
- For similar reasons, the idea that the page attribute table (PAT) work
could somehow be responsible has been discarded.
- There has been a strong correlation between corrupted hardware and the presence of Intel graphics hardware. That has led to a lot of speculation that the X.org Intel driver may somehow be doing the actual corruption, though a separate bug in the e1000e driver may be enabling that to happen. But there is now a report of corruption with a system running NVIDIA graphics. If that report is truly the same problem, then the X.org hypothesis will be substantially weakened. (As an aside, it's worth pondering what would have happened if NVIDIA users had reported the problem first; the temptation to blame the proprietary NVIDIA driver could have been strong enough to delay action on the bug for some time.)
So the signs point toward a problem localized within the e1000e driver, but it is too early to draw that conclusion. This bug remains mysterious, and it could turn out to have surprising origins.
The nature of this bug makes it harder than usual to track down. It seems to depend on some sort of race condition, so it is hard to reproduce. But the way in which the bug makes itself known has the effect of greatly reducing the number of testers trying to reproduce it. People who can avoid that combination of software are doing so, and distributors shipping development kernels have disabled the e1000e driver. Dave Airlie's approach must be fairly typical.
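Turning the driver off in a distribution's development kernel is a one-line configuration change; on a kernel built that way, checking the configuration shows the option simply left unset (the config file path below is illustrative):

    $ grep E1000E /boot/config-2.6.27-rc8
    # CONFIG_E1000E is not set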
One gets the sense that a fairly hot fire has been ignited underneath a number of posteriors at Intel; its developers are active in the discussion and clearly want to get this one solved. One objective has been the creation of a utility which would return corrupted hardware to a functioning state, but that tool has been slow in coming. Restoring trashed e1000e adapters appears to be a hard problem, but this is one that Intel has to get right. If more testers are to be encouraged to risk corruption with the idea that the recovery tool will fix them up again, that tool needs to actually work when the time comes. So it is hard to blame Intel for taking the time to ensure that the recovery tool will do its job, but, in the meantime, its absence is making testing harder.
Frans Pop raised an interesting long-term concern: even if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history. Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings. Rewriting the bug out of the mainline repository's history is possible with git, but it would create disruption for everybody working from a clone of the repository.
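Short of rewriting history, a tester who must bisect across the affected range can at least steer the bisection around it. A hedged sketch, in which DANGER_START..DANGER_END is a placeholder for the span of commits carrying the corruption risk:

    $ git bisect start
    $ git bisect bad v2.6.27-rc8     # kernel showing the unrelated bug being chased
    $ git bisect good v2.6.26        # last known-good release
    # mark every commit in the risky span as untestable
    $ git rev-list DANGER_START..DANGER_END | xargs git bisect skip

git then proposes nearby revisions instead, so the bisection can still finish (perhaps a little less precisely) without ever asking for one of the risky commits to be booted.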
Meanwhile, there could be some interesting consequences if the resolution of this problem takes much more time. It is hard to imagine that the 2.6.27 kernel could be released with a regression of this magnitude; suffice it to say that the reaction in the mainstream press would not be kind. A 2.6.27 delay could force delays in a number of upcoming distribution releases. This kind of cascading delay would not look good; it would, instead, be reminiscent of the troubles encountered by certain proprietary software companies.
That said, the system is clearly working. Testers found the problem before
the code was released in anything resembling a stable form. Developers are
now chasing after the bug as quickly as they can. There will be no stable
kernel or distribution releases which corrupt hardware. This situation is
a pain, but it will soon be resolved and forgotten.
The state of the e1000e bug ... 2.6.27 fix available now
Posted Oct 2, 2008 1:39 UTC (Thu) by arjan (subscriber, #36785)

Linus has already merged this patch. Now, there's something else going on where "something" is overwriting memory.... but now that the NVM no longer corrupts that is likely to be found very quickly.
The state of the e1000e bug ... 2.6.27 fix available now
Posted Oct 2, 2008 1:45 UTC (Thu) by corbet (editor, #1)

The patch is good stuff and will allow things to move ahead, but calling it a "fix" seems like wishful thinking. The patch interposes a barrier between the bug and its effects. That is very much a good thing to do, but it has only mitigated the symptoms of the bug, not "fixed" the bug. I sure hope that a real fix will be forthcoming before 2.6.27 comes out.
The state of the e1000e bug ... 2.6.27 fix available now
Posted Oct 2, 2008 2:05 UTC (Thu) by arjan (subscriber, #36785)

It doesn't fix the part that is causing the corruption in the first place (and very likely unrelated to e1000e itself).
The state of the e1000e bug ... 2.6.27 fix available now
Posted Oct 2, 2008 5:00 UTC (Thu) by smoogen (subscriber, #97)

http://www.theregister.co.uk/2008/10/01/fundamental_net_v...

and it turns out that the bug is caused by the incoming packets from various 'testers' on the internet.
The state of the e1000e bug
Posted Oct 6, 2008 7:31 UTC (Mon) by jzbiciak (guest, #5246)

> [E]ven if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history. Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings.
I'm not a git user, nor am I a kernel developer, but this caught my eye: It seems like it would be useful to have a git bisect mode that allowed you to pin some changesets while otherwise warping you to the next kernel in a bisection sequence. In other words, you want to get "version XXX plus these N changesets." That seems like it might be a generically useful facility. It also seems like it'd let people bisect to find other bugs while holding certain things in the present, such as e1000e. Does git provide for such a thing?
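Stock git can approximate that mode by hand: let bisect choose the revision, then graft the needed changesets on top before building. A hedged sketch, with FIX1 and FIX2 standing in for the commits to be held "in the present":

    $ git bisect start
    $ git bisect bad v2.6.27-rc8
    $ git bisect good v2.6.26
    # bisect checks out a candidate revision; note it, then apply the
    # pinned changesets on top of it
    $ rev=$(git rev-parse HEAD)
    $ git cherry-pick FIX1 FIX2
    # ... build, boot, and test the resulting kernel ...
    # judge the original candidate, not the cherry-picked tip
    $ git bisect good "$rev"     # or: git bisect bad "$rev"

With some care the same idea can be scripted via "git bisect run", as long as the script cleans up its cherry-picks before it exits.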
The state of the e1000e bug
Posted Oct 19, 2008 1:35 UTC (Sun) by Duncan (guest, #6647)
Hopefully this comment will help someone else coming across this later, perhaps from a google...

Fortunately I don't have an e1000e NIC so this particular bug hasn't been a problem here (and it has, by the time I write this, been traced to the ftrace framework and fixed properly), but I did have a bug with this kernel and used git bisect on it, so the question does pertain, and would have been of immediate interest if I did have the hardware.

In general, it's quite possible to revert any specific commit or set of commits while doing a bisect or otherwise testing using git. However, there are a couple of problems with trying to do it that way.

One, it's quite possible additional commits will have been built upon the problem commit, so one could well end up reverting a decent-size swath of commits, to the point that one couldn't really be said to be testing the kernel at that particular point anyway, potentially invalidating any conclusions the testing may come to. Perhaps not, indeed probably not, but it's a complicating consideration, certainly.

Two, as was the case with this bug now that it has been traced and can be looked at in hindsight, the bug can be in an area entirely unrelated (except via the bug) to where it actually shows up. (Parenthetical example: sort of like a leaky roof; the hole in the roof may be several meters away from where the water drips through the ceiling!) In this case, it was a bug in the new ftrace functionality, coupled with removing modules, that was eventually found to cause the problem. ftrace has been disabled in 2.6.27.1, but the point is that until the problem is fully traced, there's no guarantee that one would pick the correct commits to revert while bisecting in any case. I've no idea how long ago the last e1000e commits were, but supposing they happened in this kernel, the instinctive thing to do would be to revert all of them while doing the bisect. That wouldn't have helped in this case, and there was no way of knowing until later what /would/ help, since the ftrace stuff was otherwise entirely unrelated.

Thus a bisect with the supposedly offending changes reverted might both lead to the wrong conclusions and not remove the danger of bricking the hardware in any case.

Unfortunately, the fact remains that testing unreleased kernels is risky. Indeed, conservative folks will likely want to stay a full kernel release back, not installing 2.6.26 until 2.6.27 is out at least, and only then installing whatever happens to be the latest 2.6.26.x stable release. Even distribution kernels were bitten by this, although obviously it was just the most bleeding-edge ones, the ones shipped as -rc testing for those willing to risk their machines and try it, and this -rc series DID point out the very literal meaning of the "risk their machines" bit. It's certainly not for everyone, but as one who does run -rc kernels (though only from -rc3 or so) myself, it can be rewarding too -- there's nothing quite like the feeling of being able to point to a particular -rc bug and say "but for me, that may have made it to release; I played my part in making this kernel a good one", especially for folks (like me) who might do sysadmin-level bash scripts, but little more.

Duncan
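For the revert-based variant Duncan describes, the per-step dance looks much the same; a hedged sketch, with SUSPECT1 and SUSPECT2 as placeholder commit IDs for the changes being held out of the build:

    # at the revision bisect has checked out, drop the suspect commits
    $ rev=$(git rev-parse HEAD)
    $ git revert --no-edit SUSPECT1 SUSPECT2
    # ... build, boot, and test ...
    $ git bisect good "$rev"     # or: git bisect bad "$rev"
    # if the reverts refuse to apply cleanly this far back, give up on
    # this revision rather than testing a mangled tree
    $ git revert --abort
    $ git bisect skip

As the comment notes, though, this only helps if the reverted commits really are the guilty ones.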