By Jonathan Corbet
October 1, 2008
Linus Torvalds sent out the 2.6.27-rc8 release on September 29 with
this comment:
"This one should be the last one: we're certainly not running out of
regressions, but at the same time, at some point I just have to
pick some point, and on the whole the regressions don't look _too_
scary."
This assertion raised a few eyebrows among those who are nervously watching
the e1000e corruption bug. While the development community disagrees on
all kinds of issues, there is a reasonably strong consensus that
hardware-destroying bugs can be seen as "scary."
Given that, it would be nice to say that this particular regression has
been tracked down and fixed, but that is not the case. As of this writing,
nobody knows what is causing systems with 2.6.27-rc kernels to occasionally
overwrite the EEPROM on e1000e network adapters. The progress which has been
made, while discouragingly small, does narrow down the problem a bit:
- There was an early hypothesis that the GEM graphical memory manager
code might be responsible for the problem. There have been reports of
corruption on distributions which do not package GEM, though, so GEM
is no longer a suspect.
- For similar reasons, the idea that the page attribute table (PAT) work
could somehow be responsible has been discarded.
- There has been a strong correlation between corrupted hardware and the
presence of Intel graphics hardware. That has led to a lot of
speculation that the X.org Intel driver may somehow be doing the actual
corruption, though a separate bug in the e1000e driver may be enabling
that to happen. But there is now a report of corruption with a system
running NVIDIA graphics. If that report is truly the same problem,
then the X.org hypothesis will be substantially weakened. (As an
aside, it's worth pondering what would have happened if NVIDIA users
had reported the problem first; the temptation to blame the
proprietary NVIDIA driver could have been strong enough to delay
action on the bug for some time.)
So the signs point toward a problem localized within the e1000e driver, but
it is too early to draw that conclusion. This bug remains mysterious, and
it could turn out to have surprising origins.
The nature of this bug makes it harder than usual to track down. It seems
to be dependent on some sort of race condition, so it is hard to
reproduce. But the way in which the bug makes itself known has the effect
of greatly reducing the number of testers trying to reproduce it. People
who can avoid running 2.6.27-rc kernels on e1000e hardware are doing so, and distributors
shipping development kernels have disabled the e1000e driver. Dave
Airlie's approach:
"But I'm leaving this up to Intel, I don't think HP will take it too
kindly if I keep returning my laptop."
must be fairly typical.
One gets the sense that a fairly hot fire has been ignited underneath a
number of posteriors at Intel; its developers are active in the discussion
and clearly want to get this one solved. One objective has been the
creation of a utility which would return corrupted hardware to a
functioning state, but that tool has been slow in coming. Restoring
trashed e1000e adapters appears to be a hard problem, but this is one that
Intel has to get right. If more testers are to be encouraged to risk
corruption with the idea that the recovery tool will fix them up again,
that tool needs to actually work when the time comes. So it is hard to
blame Intel for taking the time to ensure that the recovery tool will do
its job, but, in the meantime, its absence is making testing harder.
Frans Pop raised an interesting long-term
concern: even if this bug is fixed tomorrow, it will be present in most of
the 2.6.27 history. Anybody bisecting the kernel in an attempt to track
down an unrelated bug risks being bitten by a zombie version of the e1000e
bug. There may be no way to deal with that threat other than the posting
of some big warnings. Rewriting the bug out of the mainline repository's
history is possible with git, but it would create disruption for everybody
working from a clone of the repository.
Meanwhile, there could be some interesting consequences if the resolution
of this problem takes much more time. It is hard to imagine that the 2.6.27
kernel could be released with a regression of this magnitude; let us say
that the reaction in the mainstream press would not be kind. A 2.6.27
delay could force delays in a number of upcoming distribution releases.
This kind of cascading delay would not look good; it would, instead, be
reminiscent of the troubles encountered by certain proprietary software
companies.
That said, the system is clearly working. Testers found the problem before
the code was released in anything resembling a stable form. Developers are
now chasing after the bug as quickly as they can. There will be no stable
kernel or distribution releases which corrupt hardware. This situation is
a pain, but it will soon be resolved and forgotten.