Kernel development
Brief items
Kernel release status
There is no development kernel as the 2.6.35 merge window is open. As of this writing, the flow of patches into the mainline has been relatively slow; see below for details.
2.6.34 was released on May 16. As usual, there's a lot of new stuff in this release, though it is slightly less feature-packed than some. Highlights include asynchronous power management, a lot of tracing enhancements, LogFS, and the Ceph distributed filesystem. As always, lots of details can be found in the excellent KernelNewbies summary.
There have been no stable updates over the last week.
Quotes of the week
Every pig we can
Our love will grow
As large as this land.
So I promise to help
Every pig in defeat
For if it weren't for pigs
There would be no bacon to eat.
As a software writer, I fully buy into that world view. The trouble is that when I go to dinner with hardware people, they seem to be awfully nice chaps ... almost exactly like me, in fact ...
Minutes from the Hardware Error Kernel Mini-Summit
A number of kernel developers held a minisummit dedicated to the collection and reporting of hardware errors at the Linux Foundation's Collaboration Summit. Mauro Carvalho Chehab has put together and posted a set of minutes from that meeting; click below for the full text. It's worth noting that some developers who were not present are not in full agreement with everything found here; follow the discussion thread for more information.
Augmented red-black trees
Red-black trees (rbtrees) are a highly-optimized data structure used in a number of places in the kernel. With an rbtree, a kernel programmer can quickly locate data structures corresponding to a specific value; all that is needed is to store data structures with the value of interest as the key. Some fuzzier sorts of matches can be hard to do with rbtrees; consider, for example, the case of finding the lowest-valued node which overlaps with a given range of values. Venkatesh Pallipadi recently encountered this problem while trying to improve the functioning of the page attribute table (PAT) support for the x86 architecture. Rather than give up on rbtrees, he chose to enhance that data structure to meet a wider range of needs.
Venkatesh's patch (which was one of the first things merged for 2.6.35) implements the concept of "augmented rbtrees." Such a tree works very much like an ordinary rbtree, with the exception that it keeps additional information in each node. That information, almost certainly, is a function of any child nodes in the tree - the maximum key value among all children, for example. Since users of rbtrees must write their own search functions anyway, they can easily take advantage of this extra information to optimize searches.
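To make the "lowest-valued node which overlaps with a given range" search concrete, here is a minimal user-space sketch. It assumes a plain, unbalanced binary search tree keyed on interval start, with each node caching the largest interval end found in its subtree. The names (struct ival_node, ival_insert(), lowest_overlap()) are invented for illustration; this is neither the kernel's rbtree API nor the PAT code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interval node; NOT the kernel's struct rb_node. */
struct ival_node {
    unsigned long start, end;        /* closed interval [start, end] */
    unsigned long subtree_max_end;   /* augmented: largest 'end' in this subtree */
    struct ival_node *left, *right;
};

/* Recompute a node's augmented value from itself and its children. */
static void ival_update(struct ival_node *n)
{
    unsigned long m = n->end;

    if (n->left && n->left->subtree_max_end > m)
        m = n->left->subtree_max_end;
    if (n->right && n->right->subtree_max_end > m)
        m = n->right->subtree_max_end;
    n->subtree_max_end = m;
}

/* Unbalanced BST insert keyed on 'start'; fixes up values on the way back. */
static struct ival_node *ival_insert(struct ival_node *root, struct ival_node *n)
{
    assert(n != NULL);
    if (!root) {
        n->left = n->right = NULL;
        n->subtree_max_end = n->end;
        return n;
    }
    if (n->start < root->start)
        root->left = ival_insert(root->left, n);
    else
        root->right = ival_insert(root->right, n);
    ival_update(root);
    return root;
}

/* Return the overlapping interval with the lowest start, or NULL. */
static struct ival_node *lowest_overlap(struct ival_node *n,
                                        unsigned long lo, unsigned long hi)
{
    while (n) {
        if (n->left && n->left->subtree_max_end >= lo)
            n = n->left;             /* something to the left reaches 'lo' */
        else if (n->start <= hi && n->end >= lo)
            return n;                /* nothing lower overlaps; this one does */
        else if (n->start > hi)
            return NULL;             /* everything to the right starts too late */
        else
            n = n->right;
    }
    return NULL;
}
```

The cached maximum is what lets the search discard an entire left subtree with one comparison; without it, every node with a lower start would have to be examined.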
Users of augmented rbtrees must define an augment_cb() callback with this prototype:
void (*augment_cb)(struct rb_node *node);
When the tree is initialized, the callback should be stored in its root node:
struct rb_root my_root = RB_AUGMENT_ROOT(my_augment_cb);
Thereafter, the augment_cb() callback will be invoked whenever the value of one (or both) of a node's children might have changed. The callback can then update the node's additional information to match the new tree topology. The callback will be invoked from insert and delete operations - anything which might change the tree - so rbtree users should ensure that nodes are in a consistent state before inserting them.
Callbacks are not called recursively up the tree. So if a change to a node's augmented value might ripple upward, the augment_cb() callback must work its way up the tree and make the requisite updates. Note that a recursive call on the parent node is probably not a good idea unless the tree is known to be extremely shallow.
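A callback that climbs the tree iteratively, rather than recursing on the parent, might look something like the following user-space sketch. It uses a made-up node type with an explicit parent pointer in place of struct rb_node, and link_child() is a toy stand-in for real tree insertion; only the shape of my_augment_cb() is meant to mirror the prototype above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a structure embedding struct rb_node. */
struct aug_node {
    unsigned long end;               /* this node's own value */
    unsigned long subtree_max_end;   /* augmented: largest 'end' in subtree */
    struct aug_node *parent, *left, *right;
};

/* Compute one node's augmented value from itself and its children. */
static unsigned long compute_max_end(const struct aug_node *n)
{
    unsigned long m = n->end;

    if (n->left && n->left->subtree_max_end > m)
        m = n->left->subtree_max_end;
    if (n->right && n->right->subtree_max_end > m)
        m = n->right->subtree_max_end;
    return m;
}

/* Same shape as augment_cb(): update the node, then walk up with a loop. */
static void my_augment_cb(struct aug_node *node)
{
    for (; node; node = node->parent)
        node->subtree_max_end = compute_max_end(node);
}

/* Toy replacement for insertion: attach a leaf, then fix up the path. */
static void link_child(struct aug_node *parent, struct aug_node *child,
                       int as_left)
{
    assert(parent && child);
    child->parent = parent;
    if (as_left)
        parent->left = child;
    else
        parent->right = child;
    my_augment_cb(child);            /* child first, then every ancestor */
}
```

This version always walks to the root. A real callback could probably stop as soon as a node's value turns out to be unchanged, since nothing above it would change either, but that optimization is left out here to keep the sketch obviously correct.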
As of this writing, the PAT code is the only in-tree user of this functionality, but others seem likely to appear now that this feature is globally available.
The SLEB allocator
Longtime LWN readers will know that the kernel does not have just one internal memory allocator. Instead, we have the longstanding "slab" allocator (perennially due to be removed someday), the SLUB allocator (intended for better scalability, but it hasn't been able to beat slab on every test), and the SLOB allocator (a space-efficient allocator for embedded use). There is also the SLQB allocator waiting in the wings, but it has been waiting there long enough that one may wonder if it will ever emerge from there.
All told, one might assume that we have enough allocators. Then again, there are quite a few letters still available in the SL*B namespace, so why not make another one?
Thus Christoph Lameter, author of SLUB, has come forward with the SLEB allocator, which is meant to be a mixture of the best of slab and SLUB. Unlike SLUB, SLEB retains the object queues used by slab, but it also adds a bitmap for object management as well. Also unlike SLUB, there is no storage of metadata in the objects themselves. That is a performance enhancement: if a cache-cold object is allocated or freed, SLEB will not bring it into the cache.
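The idea of tracking objects with a bitmap, rather than with metadata stored in the objects themselves, can be illustrated with a toy user-space allocator. To be clear, this is not SLEB's code and makes no claim about how SLEB lays out its data; the structure, sizes, and function names below are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OBJS_PER_SLAB 64   /* assumption: one 64-bit word covers the slab */
#define OBJ_SIZE      32   /* assumption: arbitrary object size */

/* Toy slab: allocation state lives in free_map, never inside objs[]. */
struct toy_slab {
    uint64_t free_map;                         /* bit i set => object i free */
    unsigned char objs[OBJS_PER_SLAB][OBJ_SIZE];
};

static void toy_slab_init(struct toy_slab *s)
{
    s->free_map = ~UINT64_C(0);                /* everything starts out free */
}

/* Hand out the lowest-numbered free object without touching its memory. */
static void *toy_alloc(struct toy_slab *s)
{
    for (unsigned int i = 0; i < OBJS_PER_SLAB; i++) {
        uint64_t bit = UINT64_C(1) << i;

        if (s->free_map & bit) {
            s->free_map &= ~bit;
            return s->objs[i];
        }
    }
    return NULL;                               /* slab is full */
}

/* Mark an object free again; only the bitmap is written. */
static void toy_free(struct toy_slab *s, void *p)
{
    size_t idx = (size_t)((unsigned char *)p - &s->objs[0][0]) / OBJ_SIZE;
    uint64_t bit;

    assert(idx < OBJS_PER_SLAB);
    bit = UINT64_C(1) << idx;
    assert(!(s->free_map & bit));              /* catch double frees */
    s->free_map |= bit;
}
```

Neither toy_alloc() nor toy_free() reads or writes the object's own memory, which is the property the article credits for keeping cache-cold objects out of the cache. An allocator that threads its free list through the free objects has to touch them on every operation.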
This code is very new; it apparently has not yet been trusted outside of a KVM virtual machine. The long benchmarking process that might lead to merging and, possibly, displacing one of the other allocators has not yet begun. But the code is there, and that's a start.
Kernel development news
2.6.35 merge window part 1
It's that time again: a new kernel development cycle has started and the merge window is currently open for new code. As of this writing, some 1100 non-merge changes have been incorporated into the mainline kernel. The most significant user-visible changes include:
- The performance monitoring subsystem supports the Intel "precise event based sampling" (PEBS) mode, in which the hardware directly records event information into a dedicated memory region. The perf subsystem also can now obtain performance information from old Pentium4 CPUs.
- The "perf kvm" tool, which allows the monitoring of virtualized guests from the host, has been merged.
- The dynamic probe code has better support for a number of basic integer types.
- The "fair sleepers," "sync wakeups," and "affine wakeup" scheduler feature flags have been removed. It seems that, at this point, the scheduler developers don't believe that things will work properly without those features, so they are always enabled.
- The SuperH architecture now has hotplug CPU support.
- New drivers:
  - Processors and boards: HP iPAQ rx1950 devices, Acer N35 systems, Samsung S3C2416-based systems, Marvell GuruPlug reference boards, Voipac PXA270 single-board computers, Aeronix Zipit Z2 systems, Cavium Networks CNS3xxx processors, Cavium Networks CNS3420 MPCore boards, taskit PortuxG20 and Stamp9G20 boards, ARM SPEAr3XX- and SPEAr6XX-based systems, Versatile Express CA9x4 processors, and ARM Ltd Versatile Express boards.
  - Miscellaneous: DaVinci DM365-based realtime clock devices.
Changes visible to kernel developers include:
- The "cpu_stop" (formerly cpuhog) mechanism has been merged. A cpu_stop allows kernel code to monopolize one or more processors for brief periods.
- Augmented rbtrees are now in the mainline kernel.
- The INIT_RCU_HEAD() macro is going away; it was never really needed for RCU functionality, and RCU debugging is moving to the object debugging infrastructure.
As can be seen above, the 2.6.35 merge window has gotten off to a bit of a slow start. By the old schedule, the window would remain open through the end of the month; there has been speculation that Linus will close it rather sooner than that this time around, though, to inconvenience maintainers who wait too long to get their pull requests in. One way or another, there should certainly be more changes to report on next week.
The trouble with the TSC
The time stamp counter (TSC) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC), which makes it a tempting target for applications that need fine-grained timestamps. Unfortunately, it is also rather unreliable, so the kernel jumps through hoops to decide whether to use it and to try to detect when it goes awry. An effort to export the kernel's knowledge about the reliability of the TSC has met strong resistance for a number of reasons, but the biggest is that the kernel developers don't think that applications should be accessing the counter directly.
Dan Magenheimer and Venkatesh Pallipadi proposed adding a /sys/devices/tsc directory with several entries corresponding to the kernel's internal TSC information, including the tsc_unstable flag, which governs whether the kernel uses the counter as a stable time source. Andi Kleen questioned the idea:
That is exactly what the patch is meant to do, Magenheimer said, because applications have no reliable way to determine whether the standard system calls will be "fast" or "slow":
Note also that even vsyscall with TSC as the clocksource will still be significantly slower than rdtsc, especially in the common case where a timestamp is directly stored and the delta between two timestamps is later evaluated; in the vsyscall case, each timestamp is a function call and a convert to nsec but in the TSC case, each timestamp is a single instruction.
Depending on the hardware, gettimeofday() and clock_gettime() may be implemented as vsyscalls (virtual system calls) rather than standard system calls, which eliminates the user space to kernel transition. Vsyscalls are code that is stored in a special memory region in user space (the vdso region) that may access kernel-maintained data, like clock ticks. Using vsyscalls, the calls are (relatively) fast, but on some hardware (or virtual machines) that requires kernel-space operations to get to a reliable counter, a vsyscall cannot be used, so the calls are slower. For applications that "need to obtain timestamp data tens or hundreds of thousands of times per second", the difference is significant.
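For reference, the "store a timestamp now, evaluate the delta later" pattern looks like this when done through the standard interface instead of RDTSC. This is only a sketch: the helper names are made up, and whether the clock_gettime() call is serviced in user space or falls back to a real system call depends on the hardware and kernel, which is exactly the uncertainty under discussion.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;                        /* quiet the compiler if NDEBUG is set */
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical helper: delta between two stored timestamps. */
static int64_t elapsed_ns(int64_t start, int64_t end)
{
    return end - start;
}
```

Each call here costs a function call plus a conversion to nanoseconds, which is the overhead Magenheimer contrasts with a single RDTSC instruction. In exchange, the kernel remains responsible for picking a clock source it trusts.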
But Magenheimer believes that if the kernel finds the TSC stable enough for its own timekeeping purposes, then that guarantees that it is usable by applications. Arjan van de Ven and Thomas Gleixner are quick to correct that misunderstanding. Van de Ven notes that the stability of the TSC can change under certain circumstances and there would be no way to notify the applications. His advice: "friends don't let friends use rdtsc in application code".
Gleixner goes into some detail about how the TSC can get out of whack: system management mode interrupts (SMIs) can fiddle with the TSC to hide their presence, multiple cores can have different values because of boot offsets and/or hotplugging, and multiple sockets can introduce differences due to separate clocks or drift in the clock signals due to temperature. There is, in short, nothing reliable about the TSC: "the stupid hardware is not reliable whether it has some 'I claim to be reliable tag' on it or not". Gleixner did offer a possible alternative, though:
What we can talk about is a vget_tsc_raw() interface along with a vconvert_tsc_delta() interface, where vget_tsc_raw() returns you an nasty error code for everything which is not usable.
Currently, there are unnamed "enterprise applications" that attempt to figure out whether they can use the TSC, and do so if they think it will work because of the uncertainty in the performance of gettimeofday() and friends. Magenheimer suggests that perhaps that information could be made available:
Magenheimer also wonders if the kernel developers are suffering from "hot stove" syndrome, in that they have been burned in the past and are reluctant to even consider changes. But Gleixner and van de Ven both point out that there is no hardware that can make the guarantees that Magenheimer wants. And Gleixner has the burn marks to prove it:
While the discussion had various interesting analogies including hanging ropes/knives and condoms versus abstention, it did not (yet) find a car analogy. It did, however, seem to find some common ground that information about whether the clock calls are implemented as vsyscalls or system calls should be exported. That is unlikely to satisfy those that have been "using vsyscalls for a while and still have a performance headache", who Magenheimer quotes, but there is nothing stopping applications from reading the TSC directly. Those applications just have to be prepared to handle any strange TSC behavior they encounter.
Ingo Molnar tries to clarify the reasons that the kernel can't export the reliability information: "The point is for the kernel to not be complicit in practices that are technically not reliable. [...] So the kernel wont 'signal' that something is safe to use if it is not safe to use." But he also sees some reason to hope:
I really mean it - and it might be possible - but we have not found it yet.
Peter Zijlstra has another solution to the problem. He would like to see the kernel move to eventually disable RDTSC from user space. By emulating the instruction and logging all uses of it (and the related RDTSCP), user-space programs that use it could be identified and changed:
Of course closed source stuff will have to deal with it themselves, but who cares about that anyway ;-)
Exporting the information about whether gettimeofday() is "slow" or not seems like a reasonable starting point. No patches to do that have emerged yet, but it is a fairly straightforward thing to do. Eventually, something like Gleixner's vget_tsc_raw() may also come about, though it won't satisfy those who are unhappy with the current vsyscall performance. Those applications will just have to read the TSC themselves and deal with whatever the hardware throws at them.
Blocking suspend blockers
When LWN last looked at suspend blockers in April, it appeared that this functionality was on a path to be merged into the mainline sometime soon. It may still be on that path, but an extended discussion has muddied the picture somewhat. It is a relatively small and obscure bit of code, but the fate of suspend blockers may have significant implications for how the kernel community deals with external projects in the future.
Suspend blockers, remember, are tied to the "opportunistic suspend" mode used by the Android system. In this mode, the kernel is placed into a sort of controlled narcolepsy; it will fall asleep (suspend the system) just about anytime that somebody is not actively prodding it. A suspend blocker is a form of prod which can be used to keep the system awake while some sort of important processing is going on. As long as there are suspend blockers outstanding, the system will not suspend.
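The rule that the system may suspend only when no blockers are outstanding can be modeled in a few lines of user-space C. This is a toy: the structure and function names are invented, it simply counts active blockers, and it should not be read as the Android API or as the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of opportunistic suspend; all names are hypothetical. */
struct toy_suspend_state {
    unsigned int active_blockers;    /* how many "prods" are outstanding */
};

/* Some code has important work underway: keep the system awake. */
static void toy_block(struct toy_suspend_state *s)
{
    s->active_blockers++;
}

/* That work is done: drop the prod. */
static void toy_unblock(struct toy_suspend_state *s)
{
    assert(s->active_blockers > 0);  /* unbalanced unblock is a bug */
    s->active_blockers--;
}

/* The system may fall asleep only when nobody is prodding it. */
static bool toy_may_suspend(const struct toy_suspend_state *s)
{
    return s->active_blockers == 0;
}
```

Note what the model leaves out: running processes play no part in the decision. Only an explicit blocker keeps the system awake, which is the behavior described in the paragraphs that follow.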
There are two aspects to this approach which sit well with the Android developers. One is that they are able to get better power performance (longer battery life) by suspending the entire system whenever nothing is going on. Using normal runtime power management does not give them the same results. The other key point is that opportunistic suspend can happen even when processes are running in user space. In the absence of a suspend blocker, any computation underway is not considered to be important enough to keep the system awake. This behavior is a form of defense against poorly-written applications which might, otherwise, drain a system's battery in a short period of time.
Suspend blockers have few enthusiastic supporters. Opportunistic suspend seems like a bit of a hack, and the need to put suspend blocker calls into drivers looks invasive. Even if this feature is configured out of most kernels, it looks like it could be a maintenance burden going forward. Even so, most of the developers involved - including almost everybody involved with Linux power management - have concluded that nobody has any better ideas. So, as Matthew Garrett put it:
Requests for alternatives have been posted a number of times in this discussion, but actual proposals have been rather rare. A number of the suspend blocker opponents seem more interested in changing the use case - mandating that all Android applications be well written, for example. The problem is that users will blame the device (rather than that new dog whistle application) if its battery fails to last long enough. So the Android developers must choose between somehow forcing good behavior on all application developers (perhaps losing the "open to all" feature that is at the core of the Android way of doing things) or creating a system robust enough to function with non-ideal applications installed.
The Android developers have taken the latter approach. In the process, they have made suspend blockers a key part of their platform. Many of the drivers which have been developed for Android have suspend blocker support built into them; they cannot be merged in their current form if the suspend blocker API is not available for them. So the current alternatives are to keep those drivers out, or to hack out the suspend blocker usage before merging them. In the former case, we have more out-of-tree code; in the latter case, we have in-tree drivers which are not actually used, tested, or maintained by anybody. Neither alternative looks good.
Merging suspend blockers would make it easier to get much of the rest of this code in; as Android developer Brian Swetland said:
This helps get us ever closer to being able to build a production-ready kernel for various android devices "out of the box" from the mainline tree and gets me ever closer to not being in the business of maintaining a bunch of SoC-specific android-something-2.6.# trees, which seriously is not a business I particularly want to be in.
("Wakelocks" are the old name for suspend blockers).
Google and Android have taken a lot of grief for their failure to work with upstream and get their code into the mainline kernel. There can be no doubt that their code could have been handled better; had the Android developers worked with the kernel community before shipping this functionality in millions of handsets, perhaps much of this trouble could have been avoided. But that history cannot be rewritten now; not even the secret git "plumbing" commands can make that happen. But we can try to improve the situation going forward.
The suspend blocker effort looks like a real attempt to do better. The code has not just been posted; it has been through rather more than the usual number of revisions as its developers have put considerable time into trying to address comments which have been made. A failure to merge it would be demoralizing. If the development community refuses this attempt to bring Android and the mainline closer, it risks creating an impression of bad faith at best. If we do not accept their code, we really should not complain about them maintaining it outside of the mainline.
As of this writing, the 2.6.35 merge window is open. What will happen with suspend blockers is anybody's guess. The power management developers are in favor of merging it, but some others have made a fair amount of noise and Linus has not made his feelings known. So it is hard to say whether this long story is about to come to a close or not.
Page editor: Jonathan Corbet