Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.27-rc8, released by Linus on September 29. It is, he says, likely to be the last -rc release before the 2.6.27 final, but it is not clear that he was thinking about the e1000e problem (see below) at the time. A handful of fixes have gone into the mainline repository since the 2.6.27-rc8 release.
There have been no stable 2.6 releases over the last week; in fact, the last such was 2.6.26.5 on September 8.
Kernel development news
Quotes of the week
I'm hoping Intel doesn't treat this as just a software bug. Some hw designer should be thinking hard about which orifice they put their head up in.
The state of the e1000e bug
Linus Torvalds sent out the 2.6.27-rc8 release on September 29 with this comment:
This assertion raised a few eyebrows among those who are nervously watching the e1000e corruption bug. While the development community disagrees on all kinds of issues, there is a reasonably strong consensus that hardware-destroying bugs can be seen as "scary."
Given that, it would be nice to say that this particular regression has been tracked down and fixed, but that is not the case. As of this writing, nobody knows what is causing systems with 2.6.27-rc kernels to occasionally overwrite the EEPROM on e1000e network adapters. The progress which has been made, while discouragingly small, does narrow down the problem a bit:
- There was an early hypothesis that the GEM graphical memory manager code might be responsible for the problem. There have been reports of corruption on distributions which do not package GEM, though, so GEM is no longer a suspect.
- For similar reasons, the idea that the page attribute table (PAT) work could somehow be responsible has been discarded.
- There has been a strong correlation between corrupted hardware and the presence of Intel graphics hardware. That has led to a lot of speculation that the X.org Intel driver may somehow be doing the actual corruption, though a separate bug in the e1000e driver may be enabling that to happen. But there is now a report of corruption with a system running NVIDIA graphics. If that report is truly the same problem, then the X.org hypothesis will be substantially weakened. (As an aside, it's worth pondering what would have happened if NVIDIA users had reported the problem first; the temptation to blame the proprietary NVIDIA driver could have been strong enough to delay action on the bug for some time).
So the signs point toward a problem localized within the e1000e driver, but it is too early to draw that conclusion. This bug remains mysterious, and it could turn out to have surprising origins.
The nature of this bug makes it harder than usual to track down. It seems to be dependent on some sort of race condition, so it is hard to reproduce. But the way in which the bug makes itself known has the effect of greatly reducing the number of testers trying to reproduce it. People who can avoid that combination of software are doing so, and distributors shipping development kernels have disabled the e1000e driver. Dave Airlie's approach:
must be fairly typical.
One gets the sense that a fairly hot fire has been ignited underneath a number of posteriors at Intel; its developers are active in the discussion and clearly want to get this one solved. One objective has been the creation of a utility which would return corrupted hardware to a functioning state, but that tool has been slow in coming. Restoring trashed e1000e adapters appears to be a hard problem, but this is one that Intel has to get right. If more testers are to be encouraged to risk corruption with the idea that the recovery tool will fix them up again, that tool needs to actually work when the time comes. So it is hard to blame Intel for taking the time to ensure that the recovery tool will do its job, but, in the meantime, its absence is making testing harder.
Frans Pop raised an interesting long-term concern: even if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history. Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings. Rewriting the bug out of the mainline repository's history is possible with git, but it would create disruption for everybody working from a clone of the repository.
Meanwhile, there could be some interesting consequences if the resolution of this problem takes much more time. It is hard to imagine that the 2.6.27 kernel could be released with a regression of this magnitude; let us say that the reaction in the mainstream press would not be kind. A 2.6.27 delay could force delays in a number of upcoming distribution releases. This kind of cascading delay would not look good; it would, instead, be reminiscent of the troubles encountered by certain proprietary software companies.
That said, the system is clearly working. Testers found the problem before the code was released in anything resembling a stable form. Developers are now chasing after the bug as quickly as they can. There will be no stable kernel or distribution releases which corrupt hardware. This situation is a pain, but it will soon be resolved and forgotten.
Low-level tracing plumbing
Kernel and user-space tracing were heavily discussed at both the kernel summit and the Linux Plumbers Conference. Attendees did not emerge from those discussions with any sort of comprehensive vision of how the tracing problem will be solved; there is not, yet, a consensus on that point. But one clear message did come out: we may end up with several different tracing mechanisms in the kernel, but there is no patience for redundant low-level tracing buffer implementations. All of the potential tracing frameworks are going to have to find a way to live with a single mechanism for collecting trace data and getting it to user space.
This conclusion may look like a way of diverting attention from the intractable problems at the higher levels and, instead, focusing everybody on something so low-level that the real issues disappear. There may be some truth to that. It is also true, though, that there is no call for duplicating the same sort of machinery across several different tracing frameworks; coming up with a common solution to this part of the problem can only lead to a better kernel in the long run. But there is another objective here which is just as important: having all the tracing frameworks using a single buffer allows them to be used together. It is not hard to imagine a future tracing tool integrating information gathered with simultaneous use of ftrace, LTTng, SystemTap, and other tracing tools that have not been written yet. Having all of those tools using the same low-level plumbing should make that integration easier.
With that in mind, Steven Rostedt set out to create a new, unified tracing buffer; as of this writing, that patch was already up to its tenth iteration. A casual perusal of the patch might well leave a reader confused; 2000 lines of relatively complex code to implement what is, in the end, just a circular buffer. This circular buffer is not even suitable for use by tracing frameworks yet; a separate "tracing" layer is to be added for that. The key point here is that, with tracing code, efficiency is crucially important. One of the main use cases for tracing is to debug performance problems in highly stressed production environments. A heavyweight tracing mechanism will create an observer effect which can obscure the situation which called for tracing in the first place, disrupt the production use of the system, or both. To be accepted, a tracing framework must have the smallest possible impact on the system.
So the unified trace buffer patch applies just about every known trick to limit its runtime cost. The circular buffer is actually a set of per-CPU buffers, each of which allows lockless addition and consumption of events. The event format is highly compact, and every effort is made to avoid copying it, ever. Rather than maintain a separate structure to track the contents of an individual page in the buffer, the patch employs yet another overloaded variant of struct page in the system memory map. (Your editor would not want to be the next luckless developer who has to modify struct page and, in the process, track down and fix all of the tricky not-really-struct-page uses throughout the kernel). And so on.
The patch itself does a fairly good job of describing the trace buffer API; that discussion will not be repeated here. It is worth taking a quick look at the low-level event format, though:
struct ring_buffer_event {
    u32 type:2, len:3, time_delta:27;
    u32 array[];
};
This format was driven by the desire to keep the per-event overhead as small as possible, so there is a single 32-bit word of header information. Here, type is the type of the event, len is its length (except when it's not, see below), time_delta is a time offset value, and array contains the actual event data.
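The claim that the header fits in a single 32-bit word can be checked in user space. What follows is a minimal sketch, not the patch's code: the u32 typedef stands in for the kernel type, and the helper names event_header_size() and max_time_delta() are hypothetical. It also relies on the compiler packing the three bit-fields into one word, which GCC and Clang do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the kernel's u32 type (assumption). */
typedef uint32_t u32;

/* Same field layout as the structure shown in the article. */
struct ring_buffer_event {
    u32 type:2, len:3, time_delta:27;
    u32 array[];
};

/* 2 + 3 + 27 = 32 bits, so the whole header should occupy one word. */
static_assert(sizeof(struct ring_buffer_event) == sizeof(u32),
              "event header is expected to be a single 32-bit word");

/* Hypothetical helper: per-event overhead in bytes before the payload. */
static size_t event_header_size(void)
{
    return offsetof(struct ring_buffer_event, array);
}

/* Hypothetical helper: largest value the 27-bit time_delta field can hold. */
static u32 max_time_delta(void)
{
    return (UINT32_C(1) << 27) - 1;
}
```

The flexible array member adds nothing to the size of the structure, so the per-event overhead is four bytes no matter how much data follows.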
There are four types of events; one of them (RINGBUF_TYPE_PADDING) is just a way of filling out empty space at the end of a page. Normal events generated by the tracing system (RINGBUF_TYPE_DATA) have a length given by the len field, which holds the length right-shifted by two bits; in other words, it counts four-byte units. With only three bits available, the maximum event length is 28 bytes (32 bytes minus four for the header word), which is not very long. For longer events, len is set to zero and the first word of the array field contains the real length.
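The two length encodings can be sketched as follows. This is an illustration under stated assumptions rather than the patch's implementation: event_alloc(), event_data_length(), and the two macros are hypothetical names, the length is treated as the size of the payload, and short lengths are rounded up to a multiple of four bytes so that they fit the shifted encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t u32;   /* user-space stand-in for the kernel type */

struct ring_buffer_event {
    u32 type:2, len:3, time_delta:27;
    u32 array[];
};

#define EVENT_LEN_SHIFT 2                       /* hypothetical name */
#define EVENT_MAX_SHORT (7u << EVENT_LEN_SHIFT) /* 3-bit field: 7 * 4 = 28 */

static_assert(EVENT_MAX_SHORT == 28, "short events top out at 28 bytes");

/* Allocate a zeroed event able to describe `bytes` bytes of payload. */
static struct ring_buffer_event *event_alloc(size_t bytes)
{
    /* Assumption: short lengths are rounded up to whole 4-byte units. */
    size_t rounded = (bytes + 3) & ~(size_t)3;
    int is_short = rounded > 0 && rounded <= EVENT_MAX_SHORT;
    /* A long event needs one extra word to hold its real length. */
    size_t total = sizeof(struct ring_buffer_event) + rounded +
                   (is_short ? 0 : sizeof(u32));
    struct ring_buffer_event *ev = calloc(1, total);

    if (!ev)
        return NULL;
    if (is_short) {
        ev->len = (u32)(rounded >> EVENT_LEN_SHIFT);
    } else {
        ev->len = 0;               /* zero means "look in array[0]" */
        ev->array[0] = (u32)bytes;
    }
    return ev;
}

/* Recover the length, whichever encoding was used. */
static size_t event_data_length(const struct ring_buffer_event *ev)
{
    if (ev->len)
        return (size_t)ev->len << EVENT_LEN_SHIFT;
    return ev->array[0];
}
```

A 10-byte request comes back as 12 bytes here because of the rounding assumption, while anything over 28 bytes falls through to the explicit length word.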
The other two event types have to do with time stamps. Over the course of the discussion, it became clear that high-resolution timing information is needed with all events, for two reasons. The recording of events into per-CPU arrays, while essential for performance, does have the effect of separating events which are related in time; the addition of precise timekeeping will allow events to be collated in the proper order. That collation could be handled through some sort of serial counter, but some performance issues can only be understood by looking closely at the precise timing of specific events. So events need to have real time data, at the highest resolution which is practical.
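To see why timestamps make collation possible, consider a reader that merges the per-CPU streams into a single time-ordered sequence. The sketch below is purely illustrative and is not taken from the patch: struct trace_event, struct cpu_stream, and next_event() are hypothetical, and each per-CPU stream is assumed to be already ordered by time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded event: an absolute timestamp plus its origin CPU. */
struct trace_event {
    uint64_t ts;
    int cpu;
};

/* Hypothetical view of one per-CPU buffer, already in time order. */
struct cpu_stream {
    const struct trace_event *events;
    size_t count;
    size_t next;    /* index of the next unread event */
};

/*
 * Hand back the earliest unread event across all streams.
 * Returns false once every stream has been drained.
 */
static bool next_event(struct cpu_stream *streams, size_t ncpus,
                       struct trace_event *out)
{
    struct cpu_stream *best = NULL;

    assert(streams != NULL && out != NULL);
    for (size_t i = 0; i < ncpus; i++) {
        struct cpu_stream *s = &streams[i];

        if (s->next >= s->count)
            continue;   /* this CPU has nothing left */
        if (!best || s->events[s->next].ts < best->events[best->next].ts)
            best = s;
    }
    if (!best)
        return false;
    *out = best->events[best->next++];
    return true;
}
```

A serial counter would serve equally well as the sort key here; the argument for real time values is that the merged output can then also answer questions about how long things took.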
Just how that data will be recorded is still unclear, and may end up being architecture dependent. Some systems may use timestamp counter data directly, while others may be able to provide real times in nanoseconds. Whatever format turns out to be used, there is no doubt that it will require 64 bits of storage. But most of the time data is redundant between any two events, so there is no real desire to add a full 64-bit time stamp to every event in the stream. The compromise which was reached was to store the amount of time which passes between one event and the next in the 27 bits allotted. Should the time delta be too large to fit in that space, the trace buffer code will insert an artificial event (of type RINGBUF_TYPE_TIME_EXTENT) to provide the necessary storage space.
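The delta scheme can be sketched in a few lines. This is a simplified model, not the patch's code: struct delta_encoding, encode_delta(), and decode_timestamp() are hypothetical, and the way the overflow is split (low 27 bits in the event header, the remaining upper bits in the extension event) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TIME_DELTA_BITS 27
#define TIME_DELTA_MAX  ((UINT64_C(1) << TIME_DELTA_BITS) - 1)

static_assert(TIME_DELTA_MAX == 134217727, "27 bits hold deltas up to 2^27 - 1");

/* Hypothetical result of encoding one inter-event gap. */
struct delta_encoding {
    bool needs_extension;   /* an extra time-extension event is required */
    uint32_t delta_low;     /* the 27 bits stored in the event header */
    uint64_t delta_high;    /* upper bits, carried by the extension event */
};

/* Encode the time between the previous event and this one. */
static struct delta_encoding encode_delta(uint64_t prev_ts, uint64_t ts)
{
    uint64_t delta = ts - prev_ts;
    struct delta_encoding enc;

    enc.needs_extension = delta > TIME_DELTA_MAX;
    enc.delta_low = (uint32_t)(delta & TIME_DELTA_MAX);
    enc.delta_high = delta >> TIME_DELTA_BITS;
    return enc;
}

/* Rebuild the full 64-bit timestamp on the reader side. */
static uint64_t decode_timestamp(uint64_t prev_ts, struct delta_encoding enc)
{
    return prev_ts + ((enc.delta_high << TIME_DELTA_BITS) | enc.delta_low);
}
```

If the time values are nanoseconds, 27 bits cover gaps of roughly 134 milliseconds, so on a busy system the extension event should rarely be needed.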
The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to help keep the buffer timestamps in sync." This little bit of functionality has not yet been implemented, though.
The rate of change of the trace buffer code appears to be slowing somewhat as comments from various directions are addressed; it may be getting close to its final form. Then it will be a matter of implementing the higher-level protocols on top of it. In the meantime, though, the attentive reader may be wondering: what about relayfs? The relay code has been in the kernel for years, and it was intended to solve just this kind of problem.
The most direct (if not most politic) answer to that question was probably posted by Peter Zijlstra:
Deleting relayfs would not be that hard; there are only a couple of users, currently. But relayfs developer Tom Zanussi is not convinced that the problems with relayfs are severe enough to justify tossing it out and starting over. He has posted a series of patches cleaning up the relayfs API and addressing some of its performance problems. At this point, though, it is not clear that anybody is really looking at that work; it has not received much in the way of comments.
One way or the other, the kernel seems set to have a low-level trace buffer implementation in place soon. That just leaves a few other little problems to solve, including making dynamic tracing work, instrumenting the kernel with static trace points, implementing user-space tracing, etc. Working those issues out is likely to take a while, and it is likely to result in a few different tracing solutions aimed at different needs. But we'll have the low-level plumbing, and that's a start.
Moving the -staging tree
Greg Kroah-Hartman was tagged as the "maintainer of crap" at this year's Kernel Summit for his willingness to shepherd drivers of lower quality into the mainline. He did not shrink from that label when introducing a patch set that would merge some of those drivers. In fact, he has embraced it: as part of his patch, he introduced the TAINT_CRAP flag for use in tainting kernels that load these, well, crappy drivers.
There has been an ongoing struggle between those who want to see drivers get included as quickly as possible and those who want to see them approach or attain normal kernel quality levels first. Kroah-Hartman started the -staging tree last June as a way to increase the visibility, and thus the testing and bug fixing, of out-of-tree drivers. Because drivers in that tree have been steadily improving, to the point where several have graduated to the mainline, the belief is that moving -staging itself into the mainline kernel will result in even faster progress.
So, Kroah-Hartman has introduced a new directory (drivers/staging) to hold these drivers, as well as a mechanism to automatically taint the kernel if any of them get loaded. That will warn users when loading the module—at least if they check their logs—and include that info in any oops message that kernel might produce. Kernel hackers can then filter out problems depending on what the taint is—problems in kernels tainted with binary-only drivers are generally actively ignored.
Getting those drivers into the mainline, though, will make it much easier for folks who want to test them. In addition, clean-ups and fixes for the drivers will go in as mainline patches, raising the visibility of the developers working on them. The change should have very minimal impact on other kernel users and developers. In particular, developers will not have to worry about reflecting API changes into drivers/staging as Kroah-Hartman will keep them up-to-date.
The main complaint about the proposal has been that it duplicates the functionality or intent of the EXPERIMENTAL flag. There was also some belief that tainting the kernel was unduly harsh, but as Kroah-Hartman points out: "It isn't costing anything, and if a developer doesn't want to debug the kernel if such a driver is loaded, this allows them to do this."
As part of the thread, Paul Mundt explains why EXPERIMENTAL has no meaning in the kernel today:
Mundt goes on to show that almost all of the default configurations enable CONFIG_EXPERIMENTAL, further reducing its meaning. It would be nice to audit all of the uses and restore the meaning of the flag, but that is beyond the scope of what Kroah-Hartman has set out to do. There still would be a difference, though, even if EXPERIMENTAL were meaningful. Mundt continues:
There are still some who are concerned about adding less-than-kernel-quality code. Randy Dunlap puts it this way: "I think that we have enough quality problems without adding crap." But Linus Torvalds has always been solidly in the "merge early" camp, so this proposal seems likely to go in for 2.6.28. Besides, as Stefan Richter notes:
In a fairly short span of time, merging drivers into the mainline has gotten a whole lot easier. At one time, developers might have to work on a driver for several development cycles before it reached a quality level that would allow it to be merged. In the interim, the -staging tree made things easier and more visible for testers and developers; soon that visibility will rise substantially again.
Page editor: Jonathan Corbet
