2.6.34 was released on May 16. As usual, there's a lot of new stuff in this release, though it is slightly less feature-packed than some. Highlights include asynchronous power management, a lot of tracing enhancements, LogFS, and the Ceph distributed filesystem. As always, lots of details can be found in the excellent KernelNewbies summary.
There have been no stable updates over the last week.
As a software writer, I fully buy into that world view. The trouble is that when I go to dinner with hardware people, they seem to be awfully nice chaps ... almost exactly like me, in fact ...
Venkatesh's patch (which was one of the first things merged for 2.6.35) implements the concept of "augmented rbtrees." Such a tree works very much like an ordinary rbtree, with the exception that it keeps additional information in each node. That information will almost certainly be a function of the node's children - the maximum key value among all children, for example. Since users of rbtrees must write their own search functions anyway, they can easily take advantage of this extra information to optimize searches.
Users of augmented rbtrees must define an augment_cb() callback with this prototype:
void (*augment_cb)(struct rb_node *node);
When the tree is initialized, the callback should be stored in its root structure:
struct rb_root my_root = RB_AUGMENT_ROOT(my_augment_cb);
Thereafter, the augment_cb() callback will be invoked whenever the value of one (or both) of a node's children might have changed. The callback can then update the node's additional information to match the new tree topology. The callback will be invoked from insert and delete operations - anything which might change the tree - so rbtree users should ensure that nodes are in a consistent state before inserting them.
Callbacks are not called recursively up the tree. So if a change to a node's augmented value might ripple upward, the augment_cb() callback must work its way up the tree and make the requisite updates. Note that a recursive call on the parent node is probably not a good idea unless the tree is known to be extremely shallow.
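The iterative upward walk described above can be sketched in user space. This is only an illustration under stated assumptions: struct aug_node, its fields, and compute_subtree_max() are hypothetical stand-ins, and the code does not use the kernel's struct rb_node or any rebalancing logic. It shows a callback that refreshes a "maximum key in this subtree" value with a loop instead of recursion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree node; NOT the kernel's struct rb_node. */
struct aug_node {
    struct aug_node *parent, *left, *right;
    long key;          /* this node's own key */
    long subtree_max;  /* augmented data: largest key in this node's subtree */
};

/* Recompute one node's augmented value from its own key and its children. */
static long compute_subtree_max(const struct aug_node *n)
{
    long max;

    assert(n != NULL);
    max = n->key;
    if (n->left && n->left->subtree_max > max)
        max = n->left->subtree_max;
    if (n->right && n->right->subtree_max > max)
        max = n->right->subtree_max;
    return max;
}

/*
 * Callback in the spirit of augment_cb(): fix up the given node, then
 * follow the parent pointers to the root with a loop, so that stack
 * usage stays constant no matter how deep the tree is.
 */
static void my_augment_cb(struct aug_node *node)
{
    while (node) {
        node->subtree_max = compute_subtree_max(node);
        node = node->parent;
    }
}
```

The sketch walks all the way to the root on every call. A real user might stop early once a node's value is found to be unchanged, but that optimization is only safe if every node below was already consistent.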
As of this writing, the PAT code is the only in-tree user of this functionality, but others seem likely to appear now that this feature is globally available.
The SLQB allocator is waiting in the wings, but it has been waiting there long enough that one may wonder if it will ever emerge.
All told, one might assume that we have enough allocators. Then again, there are quite a few letters still available in the SL*B namespace, so why not make another one?
Thus Christoph Lameter, author of SLUB, has come forward with the SLEB allocator, which is meant to be a mixture of the best of slab and SLUB. Unlike SLUB, SLEB retains the object queues used by slab, but it also adds a bitmap for object management. Also unlike SLUB, there is no storage of metadata in the objects themselves. That is a performance enhancement: if a cache-cold object is allocated or freed, SLEB will not bring it into the cache.
This code is very new; it apparently has not yet been tested outside of a KVM virtual machine. The long benchmarking process that might lead to merging and, possibly, displacing one of the other allocators has not yet begun. But the code is there, and that's a start.
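The idea of tracking objects with a bitmap rather than with metadata stored inside the objects can be illustrated with a toy example. This is not SLEB code and says nothing about how SLEB is actually implemented: struct toy_slab, the toy_slab_*() functions, and the sizes are all hypothetical. What it shows is that allocation and freeing only touch the bitmap, so the object's own memory is never read or written by the allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OBJS_PER_SLAB 64  /* hypothetical: one 64-bit word tracks one slab */
#define OBJ_SIZE      32  /* hypothetical object size in bytes */

struct toy_slab {
    uint64_t free_map;  /* bit i set => object i is free */
    unsigned char objects[OBJS_PER_SLAB][OBJ_SIZE];
};

static void toy_slab_init(struct toy_slab *s)
{
    s->free_map = UINT64_MAX;  /* every object starts out free */
}

/* Hand out the lowest-numbered free object, or NULL if the slab is full. */
static void *toy_slab_alloc(struct toy_slab *s)
{
    for (unsigned int i = 0; i < OBJS_PER_SLAB; i++) {
        uint64_t bit = UINT64_C(1) << i;

        if (s->free_map & bit) {
            s->free_map &= ~bit;
            return s->objects[i];  /* object contents are left untouched */
        }
    }
    return NULL;
}

/* Mark an object free again; only the bitmap is modified. */
static void toy_slab_free(struct toy_slab *s, void *obj)
{
    size_t idx = (size_t)((unsigned char *)obj - &s->objects[0][0]) / OBJ_SIZE;

    assert(idx < OBJS_PER_SLAB);
    assert(!(s->free_map & (UINT64_C(1) << idx)));  /* catch double frees */
    s->free_map |= UINT64_C(1) << idx;
}
```

An allocator that threads a free list through the objects themselves must write to each object as it is freed, which is what pulls cache-cold objects into the cache.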
Kernel development news
Changes visible to kernel developers include:
As can be seen above, the 2.6.35 merge window has gotten off to a bit of a slow start. By the old schedule, the window would remain open through the end of the month; there has been speculation that Linus will close it rather sooner than that this time around, though, to inconvenience maintainers who wait too long to get their pull requests in. One way or another, there should certainly be more changes to report on next week.
The time stamp counter (TSC) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC), which makes it a tempting target for applications that need fine-grained timestamps. Unfortunately, it is also rather unreliable, so the kernel jumps through hoops to decide whether to use it and to try to detect when it goes awry. An effort to export the kernel's knowledge about the reliability of the TSC has met strong resistance for a number of reasons, but the biggest is that the kernel developers don't think that applications should be accessing the counter directly.
Dan Magenheimer and Venkatesh Pallipadi proposed adding a /sys/devices/tsc directory with several entries corresponding to the kernel's internal TSC information, including the tsc_unstable flag, which governs whether the kernel uses the counter as a stable time source. Andi Kleen questioned the idea:
That is exactly what the patch is meant to do, Magenheimer said, because applications have no reliable way to determine whether the standard system calls will be "fast" or "slow":
Note also that even vsyscall with TSC as the clocksource will still be significantly slower than rdtsc, especially in the common case where a timestamp is directly stored and the delta between two timestamps is later evaluated; in the vsyscall case, each timestamp is a function call and a convert to nsec but in the TSC case, each timestamp is a single instruction.
Depending on the hardware, gettimeofday() and clock_gettime() may be implemented as vsyscalls—virtual system calls—rather than standard system calls, which eliminates the user space to kernel transition. Vsyscalls are code that is stored in a special memory region in user space (the vdso region) that may access kernel-maintained data, like clock ticks. Using vsyscalls, the calls are (relatively) fast, but on some hardware (or virtual machines) that requires kernel-space operations to get to a reliable counter, a vsyscall cannot be used, so the calls are slower. For applications that "need to obtain timestamp data tens or hundreds of thousands of times per second", the difference is significant.
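An application that wants to know whether its clock calls are "fast" or "slow" on a given system can, in the absence of exported information, simply measure them. The following is a rough sketch using only the POSIX clock_gettime() call; the helper names timespec_delta_ns() and clock_gettime_cost_ns() are made up for this example, and the numbers it produces will vary with hardware, kernel, and load.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds from start to end. */
static int64_t timespec_delta_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/*
 * Average cost, in nanoseconds, of one clock_gettime() call, measured
 * over the given number of back-to-back calls. Returns -1.0 on failure.
 */
static double clock_gettime_cost_ns(long calls)
{
    struct timespec start, end, scratch;

    if (calls <= 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    for (long i = 0; i < calls; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;
    return (double)timespec_delta_ns(start, end) / (double)calls;
}
```

A per-call cost of a few tens of nanoseconds suggests that the vsyscall path is in use, while a much higher figure suggests a real system call. That is a heuristic, not a guarantee.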
But Magenheimer believes that if the kernel finds the TSC stable enough for its own timekeeping purposes, then that guarantees that it is usable by applications. Arjan van de Ven and Thomas Gleixner are quick to correct that misunderstanding. Van de Ven notes that the stability of the TSC can change under certain circumstances and there would be no way to notify the applications. His advice: "friends don't let friends use rdtsc in application code".
Gleixner goes into some detail about how the TSC can get out of whack: system management mode interrupts (SMIs) can fiddle with the TSC to hide their presence, multiple cores can have different values because of boot offsets and/or hotplugging, and multiple sockets can introduce differences due to separate clocks or drift in the clock signals due to temperature. There is, in short, nothing reliable about the TSC: "the stupid hardware is not reliable whether it has some 'I claim to be reliable tag' on it or not". Gleixner did offer a possible alternative, though:
What we can talk about is a vget_tsc_raw() interface along with a vconvert_tsc_delta() interface, where vget_tsc_raw() returns you an nasty error code for everything which is not usable.
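Nothing in the discussion specifies what such interfaces would look like, so the following is purely a hypothetical sketch of the "store raw values, convert the delta later" pattern. The toy_* names and the calibration structure are invented for illustration. The conversion uses multiply-and-shift fixed-point scaling, a common way to turn cycle counts into nanoseconds without a division on every conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical calibration data: ns = (cycles * mult) >> shift. */
struct toy_tsc_scale {
    uint32_t mult;
    uint32_t shift;
};

/* Build a scale for a counter that ticks freq_khz thousand times per second. */
static struct toy_tsc_scale toy_tsc_scale_init(uint32_t freq_khz, uint32_t shift)
{
    struct toy_tsc_scale s;
    uint64_t mult;

    assert(freq_khz != 0 && shift < 32);
    /* 1,000,000 ns per millisecond divided by cycles per millisecond. */
    mult = (UINT64_C(1000000) << shift) / freq_khz;
    assert(mult <= UINT32_MAX);
    s.mult = (uint32_t)mult;
    s.shift = shift;
    return s;
}

/*
 * Convert the difference between two raw counter values to nanoseconds.
 * Unsigned subtraction tolerates a single counter wraparound; very large
 * deltas can overflow the 64-bit multiplication, which this toy ignores.
 */
static uint64_t toy_convert_tsc_delta(struct toy_tsc_scale s,
                                      uint64_t start, uint64_t end)
{
    uint64_t cycles = end - start;

    return (cycles * s.mult) >> s.shift;
}
```

With this split, taking a timestamp costs only a raw read, and the arithmetic is paid once per interval rather than once per timestamp.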
Currently, there are unnamed "enterprise applications" that attempt to figure out whether they can use the TSC, and do so if they think it will work because of the uncertainty in the performance of gettimeofday() and friends. Magenheimer suggests that perhaps that information could be made available:
Magenheimer also wonders if the kernel developers are suffering from "hot stove" syndrome, in that they have been burned in the past and are reluctant to even consider changes. But Gleixner and van de Ven both point out that there is no hardware that can make the guarantees that Magenheimer wants. And Gleixner has the burn marks to prove it:
While the discussion had various interesting analogies including hanging ropes/knives and condoms versus abstention, it did not (yet) find a car analogy. It did, however, seem to find some common ground that information about whether the clock calls are implemented as vsyscalls or system calls should be exported. That is unlikely to satisfy those that have been "using vsyscalls for a while and still have a performance headache", whom Magenheimer quotes, but there is nothing stopping applications from reading the TSC directly. Those applications just have to be prepared to handle any strange TSC behavior they encounter.
Ingo Molnar tries to clarify the reasons that the kernel can't export the reliability information: "The point is for the kernel to not be complicit in practices that are technically not reliable. [...] So the kernel wont 'signal' that something is safe to use if it is not safe to use." But he also sees some reason to hope:
I really mean it - and it might be possible - but we have not found it yet.
Peter Zijlstra has another solution to the problem. He would like to see the kernel move to eventually disable RDTSC from user space. By emulating the instruction and logging all uses of it (and the related RDTSCP), user-space programs that use it could be identified and changed:
Of course closed source stuff will have to deal with it themselves, but who cares about that anyway ;-)
Exporting the information about whether gettimeofday() is "slow" or not seems like a reasonable starting point. No patches to do that have emerged yet, but it is a fairly straightforward thing to do. Eventually, something like Gleixner's vget_tsc_raw() may also come about, though it won't satisfy those who are unhappy with the current vsyscall performance. Those applications will just have to read the TSC themselves and deal with whatever the hardware throws at them.
When we looked at suspend blockers in April, it appeared that this functionality was on a path to be merged into the mainline sometime soon. It may still be on that path, but an extended discussion has muddied the picture somewhat. It is a relatively small and obscure bit of code, but the fate of suspend blockers may have significant implications for how the kernel community deals with external projects in the future.
Suspend blockers, remember, are tied to the "opportunistic suspend" mode used by the Android system. In this mode, the kernel is placed into a sort of controlled narcolepsy; it will fall asleep (suspend the system) just about anytime that somebody is not actively prodding it. A suspend blocker is a form of prod which can be used to keep the system awake while some sort of important processing is going on. As long as there are suspend blockers outstanding, the system will not suspend.
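The rule described above amounts to a counter: the system may suspend only when no blockers are outstanding. The toy model below is not the Android suspend blocker API; struct toy_power_state and the toy_* functions are hypothetical names used only to make the rule concrete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; NOT the Android suspend blocker API. */
struct toy_power_state {
    unsigned int active_blockers;  /* number of outstanding blockers */
};

/* Some code has important work to do: keep the system awake. */
static void toy_suspend_block(struct toy_power_state *ps)
{
    ps->active_blockers++;
}

/* That work is finished: release the blocker. */
static void toy_suspend_unblock(struct toy_power_state *ps)
{
    assert(ps->active_blockers > 0);  /* catch unbalanced releases */
    ps->active_blockers--;
}

/*
 * Opportunistic suspend check: returns true if the system would go to
 * sleep now, which happens only when no blockers are outstanding.
 * Running user-space processes are deliberately not consulted.
 */
static bool toy_try_opportunistic_suspend(const struct toy_power_state *ps)
{
    return ps->active_blockers == 0;
}
```

In this model, a process that is busy computing but holds no blocker does nothing to prevent the suspend.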
There are two aspects to this approach which sit well with the Android developers. One is that they are able to get better power performance (longer battery life) by suspending the entire system whenever nothing is going on. Using normal runtime power management does not give them the same results. The other key point is that opportunistic suspend can happen even when processes are running in user space. In the absence of a suspend blocker, any computation underway is not considered to be important enough to keep the system awake. This behavior is a form of defense against poorly-written applications which might, otherwise, drain a system's battery in a short period of time.
Suspend blockers have few enthusiastic supporters. Opportunistic suspend seems like a bit of a hack, and the need to put suspend blocker calls into drivers looks invasive. Even if this feature is configured out of most kernels, it looks like it could be a maintenance burden going forward. Even so, most of the developers involved - including almost everybody involved with Linux power management - have concluded that nobody has any better ideas. So, as Matthew Garrett put it:
Requests for alternatives have been posted a number of times in this discussion, but actual proposals have been rather rare. A number of the suspend blocker opponents seem more interested in changing the use case - mandating that all Android applications be well written, for example. The problem is that users will blame the device (rather than that new dog whistle application) if its battery fails to last long enough. So the Android developers must choose between somehow forcing good behavior on all application developers (perhaps losing the "open to all" feature that is at the core of the Android way of doing things) or creating a system robust enough to function with non-ideal applications installed.
The Android developers have taken the latter approach. In the process, they have made suspend blockers a key part of their platform. Many of the drivers which have been developed for Android have suspend blocker support built into them; they cannot be merged in their current form if the suspend blocker API is not available for them. So the current alternatives are to keep those drivers out, or to hack out the suspend blocker usage before merging them. In the former case, we have more out-of-tree code; in the latter case, we have in-tree drivers which are not actually used, tested, or maintained by anybody. Neither alternative looks good.
Merging suspend blockers would make it easier to get much of the rest of this code in; as Android developer Brian Swetland said:
This helps get us ever closer to being able to build a production-ready kernel for various android devices "out of the box" from the mainline tree and gets me ever closer to not being in the business of maintaining a bunch of SoC-specific android-something-2.6.# trees, which seriously is not a business I particularly want to be in.
("Wakelocks" are the old name for suspend blockers).
Google and Android have taken a lot of grief for their failure to work with upstream and get their code into the mainline kernel. There can be no doubt that their code could have been handled better; had the Android developers worked with the kernel community before shipping this functionality in millions of handsets, perhaps much of this trouble could have been avoided. But that history cannot be rewritten now; not even the secret git "plumbing" commands can make that happen. But we can try to improve the situation going forward.
The suspend blocker effort looks like a real attempt to do better. The code has not just been posted; it has been through rather more than the usual number of revisions as its developers have put considerable time into trying to address comments which have been made. A failure to merge it would be demoralizing at best. If the development community refuses this attempt to bring Android and the mainline closer, it risks creating an impression of bad faith. If we do not accept their code, we really should not complain about them maintaining it outside of the mainline.
As of this writing, the 2.6.35 merge window is open. What will happen with suspend blockers is anybody's guess. The power management developers are in favor of merging it, but some others have made a fair amount of noise and Linus has not made his feelings known. So it is hard to say whether this long story is about to come to a close or not.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds