LWN: Comments on "The state of the e1000e bug"

The state of the e1000e bug

Duncan — Sun, 19 Oct 2008 01:35:38 +0000

While you've almost certainly stopped watching for replies, perhaps this
comment will help someone else coming across this later, perhaps from a
Google search...

Fortunately I don't have an e1000e NIC, so this particular bug hasn't been
a problem here (and by the time I write this, it has been traced to the
ftrace framework and fixed properly). But I did have a bug with this
kernel and used git bisect on it, so the question does pertain, and would
have been of immediate interest if I did have the hardware.

In general, it's quite possible to revert any specific commit or set of
commits while doing a bisect or otherwise testing using git. However,
there are a couple of problems with doing it that way.

One, it's quite possible additional commits will have been built upon the
problem commit, so one could well end up reverting a decent-sized swath of
commits, to the point where one couldn't really be said to be testing the
kernel at that particular point anyway, potentially invalidating any
conclusions the testing may come to. Perhaps not, indeed, probably not,
but it's a complicating consideration, certainly.

Two, as hindsight shows for this bug now that it has been traced, the bug
can be in an area entirely unrelated (except via the bug) to where it
actually shows up. (It's sort of like a leaky roof: the hole in the roof
may be several meters away from where the water drips through the
ceiling!) In this case,
it was a bug in the new ftrace functionality, coupled with removing
modules, that was eventually found to cause the problem. ftrace has been
disabled in 2.6.27.1, but the point is that until the problem is fully
traced, there's no guarantee that one would pick the correct commits to
revert while bisecting in any case. I've no idea how long ago the last
e1000e commits were, but supposing they happened in this kernel, the
instinctive thing to do would be to revert all of them while doing the
bisect, but that wouldn't have helped in this case, and there was no way
of knowing until later what /would/ help, since the ftrace stuff was
otherwise entirely unrelated.

Thus a bisect with the supposedly offending changes reverted might both
lead to the wrong conclusions and fail to remove the danger of bricking
the hardware in any case.

Unfortunately, the fact remains that testing unreleased kernels is risky.
Indeed, conservative folks will likely want to stay a full kernel release
back, not installing 2.6.26 until 2.6.27 is out at least, and only then
installing whatever happens to be the latest 2.6.26.x stable release.
Even distribution kernels were bitten by this, although obviously it was
just the most bleeding-edge ones, the ones shipped as -rc testing for those
willing to risk their machines and try it, and this -rc series DID point
out the very literal meaning of the "risk their machines" bit. It's
certainly not for everyone, but as one who does run -rc kernels myself
(though only from -rc3 or so), I find it can be rewarding too -- there's
nothing quite like the feeling of being able to point to a particular -rc
bug and say "but for me, that may have made it to release; I played my
part in making this kernel a good one", especially for folks (like me)
who might do sysadmin-level bash scripts, but little more.

Duncan

The state of the e1000e bug

jzbiciak — Mon, 06 Oct 2008 07:31:53 +0000

I'm not a git user, nor am I a kernel developer, but this caught my eye:

[E]ven if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history. Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings.

It seems like it would be useful to have a git bisect mode that allowed you to pin some changesets while otherwise warping you to the next kernel in a bisection sequence. In other words, you want to get "version XXX plus these N changesets." That seems like it might be a generically useful facility. It also seems like it'd let people bisect to find other bugs while holding certain things in the present, such as e1000e.

Does git provide for such a thing?
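
The git-bisect documentation describes a related pattern: let `git bisect run` drive a script that temporarily applies a hot-fix at each step, tests, and then undoes the change. What follows is only a minimal sketch of that idea, not a built-in "pin" mode. The function names, the `pinned` commit list, and the test command are all hypothetical; the git commands (`git cherry-pick --no-commit`, `git reset --hard`) and the exit statuses (0 for good, 125 for skip, anything above 127 to abort) are the standard ones.

```python
import subprocess
import sys

# Exit statuses understood by `git bisect run`.
GOOD, BAD, SKIP, ABORT = 0, 1, 125, 255


def run_cmd(args):
    """Run a command (argv list) and return its exit status."""
    return subprocess.run(args).returncode


def bisect_step(pinned, test_cmd, run=run_cmd):
    """Test the current bisect revision with the pinned commits on top.

    pinned   -- commit ids to apply temporarily (hypothetical input)
    test_cmd -- argv list for the build-and-test command (hypothetical)
    run      -- callable taking an argv list and returning an exit status
    """
    try:
        for commit in pinned:
            # Apply the change to the working tree without committing it.
            if run(["git", "cherry-pick", "--no-commit", commit]) != 0:
                # The pinned change does not apply at this revision, so
                # the revision cannot be tested safely: ask bisect to skip.
                return SKIP
        status = run(test_cmd)
        if status == 0:
            return GOOD
        # Pass a skip request through; treat any other failure as bad.
        return SKIP if status == SKIP else BAD
    finally:
        # Always drop the temporary changes so bisect can move on.
        run(["git", "reset", "--hard"])


def main(argv, run=run_cmd):
    """Parse 'COMMIT... -- TEST_CMD...' and run one bisect step."""
    if "--" not in argv or argv[-1] == "--":
        print("usage: COMMIT... -- TEST_CMD...", file=sys.stderr)
        return ABORT  # a status above 127 makes `git bisect run` stop
    split = argv.index("--")
    return bisect_step(argv[:split], argv[split + 1:], run=run)

# To use this as a script, end the file with:
#     sys.exit(main(sys.argv[1:]))
```

Saved under a name such as `pinned_step.py` (hypothetical) with that final line added, it might be invoked as `git bisect run python3 pinned_step.py <fix-commit> -- make test`. The skip path matters: at revisions where the pinned change does not apply cleanly, the step reports "untestable" rather than guessing. The protection is also only as good as the choice of pinned commits, so it helps only once the real fix is known.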

The state of the e1000e bug ... 2.6.27 fix available now

iabervon — Thu, 02 Oct 2008 19:28:29 +0000

It seems pretty likely that the actual bug isn't in the kernel, though, and therefore holding up 2.6.27 might not be appropriate now that the latest kernel will prevent userspace misbehaving in a particular way from persistently messing up hardware. I think the current evidence doesn't exclude: some X driver, while probing the system for its hardware, maps the frame buffer too large and writes into it, spilling into whatever's after it, which is generally either nothing or unwritable, which in turn leads to determining correctly that that driver isn't appropriate. So some arbitrary device would get some arbitrary invalid I/O, at a point where things are mostly idle, and it wouldn't get any particular attention unless it happens to do serious damage (i.e., something that would be noticed later). If the kernel gets things back to a state where nothing terrible happens due to the bug, and maybe even some logging occurs, that's enough for 2.6.27.

The state of the e1000e bug ... 2.6.27 fix available now

s0f4r — Thu, 02 Oct 2008 17:17:34 +0000

Unlikely, as the bug has been hit by several testers in isolated testing labs.

The state of the e1000e bug ... 2.6.27 fix available now

smoogen — Thu, 02 Oct 2008 05:00:57 +0000

/me waits to find out that this was caused by the 'TCP' security bug that wipes out all stacks.

http://www.theregister.co.uk/2008/10/01/fundamental_net_v...

and it turns out that the bug is caused by the incoming packets from various 'testers' on the internet.

The state of the e1000e bug ... 2.6.27 fix available now

arjan — Thu, 02 Oct 2008 02:05:43 +0000

Yeah... it only fixes the e1000e part of the story.
It doesn't fix the part that is causing the corruption in the first place.

The state of the e1000e bug ... 2.6.27 fix available now

corbet — Thu, 02 Oct 2008 01:45:20 +0000

The patch is good stuff and will allow things to move ahead, but calling it a "fix" seems like wishful thinking. The patch interposes a barrier between the bug and its effects. That is very much a good thing to do, but it has only mitigated the symptoms of the bug, not "fixed" the bug. I sure hope that a real fix will be forthcoming before 2.6.27 comes out.

The state of the e1000e bug ... 2.6.27 fix available now

arjan — Thu, 02 Oct 2008 01:39:57 +0000

The thread in http://lkml.org/lkml/2008/10/1/368 has a patch that will prevent NVM corruption (and has done so in our extensive testing).
Linus has already merged this patch.

Now, there's something else going on where "something" is overwriting memory... but now that the NVM no longer gets corrupted, that is likely to be found very quickly
(and is very likely unrelated to e1000e itself).