Kernel release status
The current 2.6 development kernel is 2.6.27-rc8,
released by Linus on
September 29. It is, he says, likely to be the last -rc release
before the 2.6.27 final, but it is not clear that he was thinking about the
e1000e problem (see below) at the time.
A handful of fixes have gone into the mainline repository since the
2.6.27-rc8 release.
There have been no stable 2.6 releases over the last week; in fact, the
last such release was 2.6.26.5 on
September 8.
Kernel development news
Quotes of the week
The userspace API you propose should however be taken out and shot,
then buried with a stake through its heart, holy water in its mouth
and its head cut off, at midnight in a pentacle at a crossroads in
the presence of a priest.
-- Alan Cox, who seems strangely appropriate for the priest role.
Btw, the _real_ bug is clearly in the hardware design that allows
you to brick those things without apparently even having a lock
bit.
I'm hoping Intel doesn't treat this as just a software bug. Some hw
designer should be thinking hard about which orifice they put their
head up in.
-- Linus Torvalds
What a low-rent cheeseball freshman maneuver. Too bad I don't
drink, or some kernel hackers would get an earful of fresh rumors
about the mental acuity of stap hackers at the bar tonight! Ok,
staprun, let me get a baby wipe there, I'm feeling parental.
-- Roland McGrath has fun with SystemTap
But I also have a UI that the kids can run to _see_ how much time
they have left, so that getting thrown off the machine doesn't come
as a total surprise. And yesterday Patricia asked why it has to be
that ugly. And I had to admit that her dad is just not very good at
UI's...
-- Linus Torvalds fails to impress his kids (Thanks to Nicolas Pitre).
The state of the e1000e bug
By Jonathan Corbet
October 1, 2008
Linus Torvalds sent out the 2.6.27-rc8 release on September 29 with
this comment:
This one should be the last one: we're certainly not running out of
regressions, but at the same time, at some point I just have to
pick some point, and on the whole the regressions don't look _too_
scary.
This assertion raised a few eyebrows among those who are nervously watching
the e1000e corruption bug. While the development community disagrees on
all kinds of issues, there is a reasonably strong consensus that
hardware-destroying bugs can be seen as "scary."
Given that, it would be nice to say that this particular regression has
been tracked down and fixed, but that is not the case. As of this writing,
nobody knows what is causing systems with 2.6.27-rc kernels to occasionally
overwrite the EEPROM on e1000e network adapters. The progress which has been
made, while discouragingly small, does narrow down the problem a bit:
- There was an early hypothesis that the GEM graphical memory manager
code might be responsible for the problem. There have been reports of
corruption on distributions which do not package GEM, though, so GEM
is no longer a suspect.
- For similar reasons, the idea that the page attribute table (PAT) work
could somehow be responsible has been discarded.
- There has been a strong correlation between corrupted hardware and the
presence of Intel graphics hardware. That has led to a lot of
speculation that the X.org Intel driver may somehow be doing the actual
corruption, though a separate bug in the e1000e driver may be enabling
that to happen. But there is now a report of corruption with a system
running NVIDIA graphics. If that report is truly the same problem,
then the X.org hypothesis will be substantially weakened. (As an
aside, it's worth pondering what would have happened if NVIDIA users
had reported the problem first; the temptation to blame the
proprietary NVIDIA driver could have been strong enough to delay
action on the bug for some time).
So the signs point toward a problem localized within the e1000e driver, but
it is too early to draw that conclusion. This bug remains mysterious, and
it could turn out to have surprising origins.
The nature of this bug makes it harder than usual to track down. It seems
to depend on some sort of race condition, so it is hard to
reproduce. Worse, the way in which the bug makes itself known, by destroying
hardware, greatly reduces the number of testers willing to try to reproduce it. People
who can avoid the affected combination of hardware and software are doing so, and distributors
shipping development kernels have disabled the e1000e driver. Dave
Airlie's approach:
But I'm leaving this up to Intel, I don't think HP will take it too
kindly if I keep returning my laptop.
must be fairly typical.
One gets the sense that a fairly hot fire has been ignited underneath a
number of posteriors at Intel; its developers are active in the discussion
and clearly want to get this one solved. One objective has been the
creation of a utility which would return corrupted hardware to a
functioning state, but that tool has been slow in coming. Restoring
trashed e1000e adapters appears to be a hard problem, but this is one that
Intel has to get right. If more testers are to be encouraged to risk
corruption with the idea that the recovery tool will fix them up again,
that tool needs to actually work when the time comes. So it is hard to
blame Intel for taking the time to ensure that the recovery tool will do
its job but, in the meantime, its absence is making testing harder.
Frans Pop raised an interesting long-term
concern: even if this bug is fixed tomorrow, it will be present in most of
the 2.6.27 history. Anybody bisecting the kernel in an attempt to track
down an unrelated bug risks being bitten by a zombie version of the e1000e
bug. There may be no way to deal with that threat other than the posting
of some big warnings. Rewriting the bug out of the mainline repository's
history is possible with git, but it would create disruption for everybody
working from a clone of the repository.
Meanwhile, there could be some interesting consequences if the resolution
of this
problem takes much more time. It is hard to imagine that the 2.6.27
kernel could be released with a regression of this magnitude; let us say
that the reaction in the mainstream press would not be kind. A 2.6.27
delay could force delays in a number of upcoming distribution releases.
This kind of cascading delay would not look good; it would, instead, be
reminiscent of the troubles encountered by certain proprietary software
companies.
That said, the system is clearly working. Testers found the problem before
the code was released in anything resembling a stable form. Developers are
now chasing after the bug as quickly as they can. There will be no stable
kernel or distribution releases which corrupt hardware. This situation is
a pain, but it will soon be resolved and forgotten.
Low-level tracing plumbing
By Jonathan Corbet
September 30, 2008
Kernel and user-space tracing were heavily discussed at both the kernel
summit and the Linux Plumbers Conference. Attendees did not emerge from
those discussions with any sort of comprehensive vision of how the tracing
problem will be solved; there is not, yet, a consensus on that point. But
one clear message did come out: we may end up with several different
tracing mechanisms in the kernel, but there is no patience for redundant
low-level tracing buffer implementations. All of the potential tracing
frameworks are going to have to find a way to live with a single mechanism
for collecting trace data and getting it to user space.
This conclusion may look like a way of diverting attention from the
intractable problems at the higher levels and, instead, focusing everybody
on something so low-level that the real issues disappear. There may be
some truth to that. It is also true, though, that there is no call for
duplicating the same sort of machinery across several different tracing
frameworks; coming up with a common solution to this part of the problem
can only lead to a better kernel
in the long run. But there is another objective here which is just as
important: having all the tracing frameworks using a single buffer allows
them to be used together. It is not hard to imagine a future tracing tool
integrating information gathered with simultaneous use of ftrace, LTTng,
SystemTap, and other tracing tools that have not been written yet. Having
all of those tools using the same low-level plumbing should make that
integration easier.
With that in mind, Steven Rostedt set out to create a new, unified tracing
buffer; as of this writing, that patch was already up to its tenth iteration. A casual perusal of the
patch might well leave a reader confused: it takes 2000 lines of relatively complex
code to implement what is, in the end, just a circular buffer.
This circular buffer is not even
suitable for use by tracing frameworks yet; a separate "tracing" layer is to
be added for that. The key point here is that, with tracing code,
efficiency is crucially important. One of the main use cases for tracing
is to debug performance problems in highly stressed production
environments. A heavyweight tracing mechanism will create an observer
effect which can obscure the situation which called for tracing in the
first place, disrupt the production use of the system, or both. To be
accepted, a tracing framework must have the smallest possible impact on the
system.
So the unified trace buffer patch applies just about every known trick to
limit its runtime cost. The circular buffer is actually a set of per-CPU
buffers, each of which allows lockless addition and consumption of events.
The event format is highly compact, and
every effort is made to avoid copying it, ever. Rather than maintain a
separate structure to track the contents of an individual page in the
buffer, the patch employs yet another overloaded variant of struct
page in the system memory map. (Your editor would not want to be the
next luckless developer who has to modify struct page and, in the
process, track down and fix all of the tricky
not-really-struct-page uses throughout the kernel). And so on.
The patch itself does a fairly good job of describing the trace buffer API;
that discussion will not be repeated here. It is worth taking a quick look
at the low-level event format, though:
struct ring_buffer_event {
    u32 type:2, len:3, time_delta:27;
    u32 array[];
};
This format was driven by the desire to keep the per-event overhead as
small as possible, so there is a single 32-bit word of header information.
Here, type is the type of the event, len is its length
(except when it's not, see below), time_delta is a time
offset value, and array contains the actual event data.
There are four types of events; one of them (RINGBUF_TYPE_PADDING)
is just a way of filling out empty space at the end of a page. Normal
events generated by the tracing system (RINGBUF_TYPE_DATA) have a
length given by the len field, which holds the length right-shifted by two
bits (that is, counted in four-byte words). With three bits available, the maximum event length is 28 bytes (32 bytes minus four for the
header word), which is not very long. For longer events, len is
set to zero and the first word of the array field contains the
real length.
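The length encoding described above can be sketched in user-space C. This is only an illustration built from the description in the text: the struct is a stand-in for the header word (with unsigned int in place of u32), and the helper names encode_len() and decode_len() are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the 32-bit header word quoted in the article. */
struct event_header {
    unsigned int type:2, len:3, time_delta:27;
};

#define LEN_SHIFT     2                    /* len counts four-byte words */
#define MAX_SHORT_LEN (7u << LEN_SHIFT)    /* 3 bits: at most 7 words = 28 bytes */

/*
 * Hypothetical helper: choose the len field for a payload of `bytes` bytes.
 * Returns 0 when the payload does not fit, meaning that the real length
 * has to be stored in the first word of the array instead.
 */
static unsigned int encode_len(size_t bytes)
{
    size_t rounded = (bytes + 3) & ~(size_t)3;    /* round up to whole words */

    if (rounded == 0 || rounded > MAX_SHORT_LEN)
        return 0;
    return (unsigned int)(rounded >> LEN_SHIFT);
}

/* Hypothetical helper: recover the payload length in bytes. */
static size_t decode_len(const struct event_header *hdr, const uint32_t *array)
{
    assert(hdr != NULL);
    if (hdr->len == 0)
        return array[0];    /* long event: length lives in array[0] */
    return (size_t)hdr->len << LEN_SHIFT;
}
```

With these assumptions, a 28-byte payload is encoded as len = 7, while a 29-byte payload falls back to len = 0 and an explicit length word.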
The other two event types have to do with time stamps. Over the course of
the discussion, it became clear that high-resolution timing information is
needed with all events, for two reasons. The recording of events into
per-CPU arrays, while essential for performance, does have the effect of
separating events which are related in time; the addition of precise
timekeeping will allow events to be collated in the proper order. That
collation could be handled through some sort of serial counter, but some
performance issues can only be understood by looking closely at the precise
timing of specific events. So events need to have real time data, at the highest
resolution which is practical.
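To illustrate why per-event timestamps make collation possible, here is a minimal sketch of a reader that merges several per-CPU streams into a single time-ordered sequence. The types and the next_event() function are hypothetical and assume that each per-CPU stream is already in time order and that timestamps are comparable across CPUs; nothing here is taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded event: an absolute timestamp plus an opaque id. */
struct trace_event {
    uint64_t ts;
    int      id;
};

/* One per-CPU stream, assumed to be sorted by timestamp already. */
struct cpu_stream {
    const struct trace_event *events;
    size_t count;
    size_t next;    /* index of the next unconsumed event */
};

/*
 * Consume the oldest pending event across all streams.
 * Returns the index of the CPU it came from, or -1 if every stream is empty.
 */
static int next_event(struct cpu_stream *cpus, size_t ncpus,
                      struct trace_event *out)
{
    int best = -1;
    size_t i;

    assert(cpus != NULL && out != NULL);
    for (i = 0; i < ncpus; i++) {
        if (cpus[i].next >= cpus[i].count)
            continue;    /* this CPU has nothing left */
        if (best < 0 ||
            cpus[i].events[cpus[i].next].ts <
            cpus[best].events[cpus[best].next].ts)
            best = (int)i;
    }
    if (best >= 0)
        *out = cpus[best].events[cpus[best].next++];
    return best;
}
```

Calling next_event() repeatedly yields the events in global time order; a serial counter would give the same ordering, but not the intervals between events.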
Just how that data will be recorded is still unclear, and may end up being
architecture dependent. Some systems may use timestamp counter data
directly, while others may be able to provide real times in nanoseconds.
Whatever format turns out to be used, there is no doubt that it will
require 64 bits of storage. But most of the time data is redundant between
any two events, so there is no real desire to add a full 64-bit time stamp
to every event in the stream. The compromise which was reached was to
store the amount of time which passes between one event and the next in the
27 bits allotted. Should the time delta be too large to fit in that space,
the trace buffer code will insert an artificial event (of type
RINGBUF_TYPE_TIME_EXTEND) to provide the necessary storage space.
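The delta scheme can be sketched as follows. This is a user-space illustration under stated assumptions: the names (ts_writer, stamp_event(), replay()) are hypothetical, the clock is assumed to be monotonic, and the way the artificial event carries the oversized delta is a guess for the purposes of the example rather than the patch's actual layout.

```c
#include <assert.h>
#include <stdint.h>

#define DELTA_BITS 27
#define DELTA_MAX  ((UINT64_C(1) << DELTA_BITS) - 1)

/* Hypothetical writer state: the timestamp of the last recorded event. */
struct ts_writer {
    uint64_t last_ts;
};

/* What the writer has to emit for one event. */
struct stamp {
    int      needs_extend;    /* nonzero: emit an artificial time event first */
    uint64_t extend_delta;    /* full delta carried by that artificial event */
    uint32_t time_delta;      /* value for the 27-bit header field */
};

static struct stamp stamp_event(struct ts_writer *w, uint64_t now)
{
    struct stamp s = { 0, 0, 0 };
    uint64_t delta;

    assert(now >= w->last_ts);    /* the sketch assumes a monotonic clock */
    delta = now - w->last_ts;
    if (delta > DELTA_MAX) {
        /* Too big for 27 bits: the artificial event carries the whole gap. */
        s.needs_extend = 1;
        s.extend_delta = delta;
    } else {
        s.time_delta = (uint32_t)delta;
    }
    w->last_ts = now;
    return s;
}

/* Reader side: rebuild the absolute timestamp from the previous one. */
static uint64_t replay(uint64_t prev_ts, const struct stamp *s)
{
    return prev_ts + s->extend_delta + s->time_delta;
}
```

In the common case only the 27-bit field is written; the 64-bit value is paid for only when two consecutive events are far apart.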
The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to
help keep the buffer timestamps in sync." This little bit of functionality
has not yet been implemented, though.
The rate of change of the trace buffer code appears to be slowing somewhat
as comments from various directions are addressed; it may be getting close
to its final form. Then it will be a matter of implementing the
higher-level protocols on top of it. In the meantime, though, the
attentive reader may be wondering: what about relayfs? The relay code has
been in the kernel for years, and it was intended to solve just this kind
of problem.
The most direct (if not most politic) answer to that question was probably posted by
Peter Zijlstra:
Dude, relayfs is such a bad performing mess that extending it seems
like a bad idea. Better to write something new and delete
everything relayfs related.
Deleting relayfs would not be that hard; there are only a couple of users,
currently. But relayfs developer Tom Zanussi is not convinced that the problems with
relayfs are severe enough to justify tossing it out and starting over. He
has posted a series of patches cleaning up
the relayfs API and addressing some of its performance problems. At this
point, though, it is not clear that anybody is really looking at that work;
it has not received much in the way of comments.
One way or the other, the kernel seems set to have a low-level trace buffer
implementation in place soon. That just leaves a few other little problems
to solve, including making dynamic tracing work, instrumenting the kernel
with static trace points, implementing user-space tracing, etc. Working
those issues out is likely to take a while, and it is likely to result in a
few different tracing solutions aimed at different needs. But we'll have
the low-level plumbing, and that's a start.
Moving the -staging tree
By Jake Edge
October 1, 2008
Greg Kroah-Hartman was tagged as the "maintainer of crap" at this year's Kernel Summit for his
willingness to shepherd drivers of lower quality into the mainline. He did
not shrink from that label when introducing a patch set that would merge some
of those drivers. In fact, he has embraced the label: as part of his
patch, he introduced the
TAINT_CRAP flag for use in tainting kernels that load these, well,
crappy drivers.
There has been an ongoing
struggle between those who want to see drivers get included as quickly
as possible versus those who want to see them approach or attain normal
kernel quality levels first. Kroah-Hartman started the -staging tree last June as a way
to increase the visibility, and thus the testing and bug fixing, of out-of-tree
drivers. Because drivers in that tree have been steadily
improving—to the point where several have graduated to the
mainline—the belief is that moving -staging itself into the mainline
kernel will result in even faster progress.
So, Kroah-Hartman has introduced a new directory (drivers/staging)
to hold these drivers, as well as a mechanism to automatically taint the
kernel if any of them get loaded. That will warn users when loading the
module—at least if they check their logs—and include that info
in any oops message that the kernel might produce. Kernel
hackers can then filter out problems depending on what
the taint is—problems in kernels tainted with binary-only drivers are
generally
actively ignored.
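The general idea of tainting and filtering can be sketched as a sticky bitmask. Everything in this example is hypothetical: the flag names and bit values are invented for illustration and do not reproduce the kernel's real taint numbering or the code in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags, for illustration only. */
enum demo_taint {
    DEMO_TAINT_PROPRIETARY = 1u << 0,    /* a binary-only module was loaded */
    DEMO_TAINT_STAGING     = 1u << 1     /* stands in for TAINT_CRAP */
};

/* Taint state; once a bit is set it stays set. */
static unsigned int demo_tainted;

/* Called when a module of the given class is loaded. */
static void demo_add_taint(unsigned int flag)
{
    assert(flag != 0);
    demo_tainted |= flag;
}

/*
 * A developer's filter: a report is worth looking at only if none of the
 * taint bits that this developer chooses to ignore are set.
 */
static bool demo_worth_debugging(unsigned int taint, unsigned int ignore_mask)
{
    return (taint & ignore_mask) == 0;
}
```

Because the mask is chosen by whoever reads the report, one developer can ignore staging-tainted oopses while another still looks at them.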
Getting those drivers into the mainline, though, will make it much easier
for folks who want to test them. In addition, clean-ups and fixes
for the drivers will go in as mainline patches, raising the
visibility of the developers working on them. The change should have very
minimal impact on other kernel users and developers. In particular,
developers will not
have to worry about reflecting API changes into drivers/staging as
Kroah-Hartman will keep them up-to-date.
The main complaint about the proposal has
been that it
duplicates the functionality or intent of the EXPERIMENTAL flag.
There was also some belief that tainting the kernel was unduly harsh, but
as Kroah-Hartman points out: "It
isn't costing
anything, and if a developer doesn't want to debug the kernel if such a
driver is loaded, this allows them to do this."
As part of the thread, Paul Mundt explains why
EXPERIMENTAL has no meaning in the kernel today:
EXPERIMENTAL today is pretty damn meaningless. What it tends to mean in
practice is that somethings needs some more testing, someone wants to be
able to pull out the EXPERIMENTAL card when someone enables their option
and their kernel blows up, the option/feature hasn't been around in the
kernel for that long, or someone has just been too lazy to remove the
flag (this last one probably covers about 90% of in-tree cases today).
Stuff that is actively broken (in case of your kernel blowing up, not
building, etc.) tends to be shoved under BROKEN instead.
Mundt goes on to show that almost all of the default configurations enable
CONFIG_EXPERIMENTAL, further reducing its meaning. It would
be nice to audit all of the uses and restore the meaning of the flag, but
that is beyond the scope of what Kroah-Hartman has set out to do. There
still would be a difference, though, even if EXPERIMENTAL were meaningful.
Mundt continues:
The other key difference is that even with experimental stuff in the
kernel, you will still get support, so it's not really a taintable
offense. Stuff in staging/ on the other hand while potentially not
actively hostile against the rest of the system, is still very much an
unknown, and therefore the only safe thing to do is to taint the system
and allow individual developers to make a choice regarding whether any
resulting oopses are worth looking at or not.
There are still some who are concerned about adding
less-than-kernel-quality code. Randy
Dunlap puts it this way: "I think that we
have enough quality problems without adding crap." But Linus Torvalds
has always been solidly in the "merge early" camp, so this proposal
seems likely to go in for 2.6.28. Besides, as
Stefan Richter notes:
OTOH many if not most of the -staging drivers are ones which are
already in use. Their users already deal with whatever quality problems
these drivers have, in addition to having to fight with the installation
hassles that are inherent to out-of-tree drivers.
In a fairly short span of time, merging drivers into the mainline has
gotten a whole lot easier. At one time, developers might have to work on a
driver for several development cycles before it reached a quality level
that would allow it to be merged. In the interim, the -staging tree
made things easier and more visible for testers and developers; soon that
visibility will rise substantially again.
Page editor: Jonathan Corbet