Kernel development

Brief items

Kernel release status

There is no development kernel as the 2.6.35 merge window is open. As of this writing, the flow of patches into the mainline has been relatively slow; see below for details.

2.6.34 was released on May 16. As usual, there's a lot of new stuff in this release, though it is slightly less feature-packed than some. Highlights include asynchronous power management, a lot of tracing enhancements, LogFS, and the Ceph distributed filesystem. As always, lots of details can be found in the excellent KernelNewbies summary.

There have been no stable updates over the last week.

Comments (none posted)

Quotes of the week

We do not trust BIOS tables, because BIOS writers are invariably totally incompetent crack-addicted monkeys. If they weren't, they wouldn't be BIOS writers. QED.

-- Linus Torvalds

If we continue to hug
Every pig we can
Our love will grow
As large as this land.

So I promise to help
Every pig in defeat
For if it weren't for pigs
There would be no bacon to eat.

-- Daniela Torvalds

Right, because Firmware writers are from the rugged unresponsive uplands of planet ignore-user-complaints-and-eat-them-for-breakfast-if-they-file-bugs and Software writers are from the emollient responsive groves of planet harmony. Obviously what would work for one wouldn't work for the other.

As a software writer, I fully buy into that world view. The trouble is that when I go to dinner with hardware people, they seem to be awfully nice chaps ... almost exactly like me, in fact ...

-- James Bottomley

If my phone is able to avoid losing almost all of its standby time without me having to care about whether my bouncing cow game was written by a complete fool or not, that means that my phone is *better* than one where I have to care. Would the world be better if said fool could be sent to reeducation camps before being allowed to write any more software? Probably, but sadly that doesn't seem to be something we can implement through code.

-- Matthew Garrett

Comments (7 posted)

Minutes from the Hardware Error Kernel Mini-Summit

A number of kernel developers held a minisummit dedicated to the collection and reporting of hardware errors at the Linux Foundation's Collaboration Summit. Mauro Carvalho Chehab has put together and posted a set of minutes from that meeting; click below for the full text. It's worth noting that some developers who were not present are not in full agreement with everything found here; follow the discussion thread for more information.

Full Story (comments: 1)

Augmented red-black trees

By Jonathan Corbet
May 18, 2010

Red-black trees (rbtrees) are a highly-optimized data structure used in a number of places in the kernel. With an rbtree, a kernel programmer can quickly locate data structures corresponding to a specific value; all that is needed is to store data structures with the value of interest as the key. Some fuzzier sorts of matches can be hard to do with rbtrees; consider, for example, the case of finding the lowest-valued node which overlaps with a given range of values. Venkatesh Pallipadi recently encountered this problem while trying to improve the functioning of the page attribute table (PAT) support for the x86 architecture. Rather than give up on rbtrees, he chose to enhance that data structure to meet a wider range of needs.

Venkatesh's patch (which was one of the first things merged for 2.6.35) implements the concept of "augmented rbtrees." Such a tree works very much like an ordinary rbtree, with the exception that it keeps additional information in each node. That information, almost certainly, is a function of any child nodes in the tree - the maximum key value among all children, for example. Since users of rbtrees must write their own search functions anyway, they can easily take advantage of this extra information to optimize searches.

Users of augmented rbtrees must define an augment_cb() callback with this prototype:

    void (*augment_cb)(struct rb_node *node);

When the tree is initialized, the callback should be stored in its root node:

    struct rb_root my_root = RB_AUGMENT_ROOT(my_augment_cb);

Thereafter, the augment_cb() callback will be invoked whenever the value of one (or both) of a node's children might have changed. The callback can then update the node's additional information to match the new tree topology. The callback will be invoked from insert and delete operations - anything which might change the tree - so rbtree users should ensure that nodes are in a consistent state before inserting them.

Callbacks are not called recursively up the tree. So if a change to a node's augmented value might ripple upward, the augment_cb() callback must work its way up the tree and make the requisite updates. Note that a recursive call on the parent node is probably not a good idea unless the tree is known to be extremely shallow.

As of this writing, the PAT code is the only in-tree user of this functionality, but others seem likely to appear now that this feature is globally available.

Comments (none posted)

The SLEB allocator

By Jonathan Corbet
May 19, 2010

Longtime LWN readers will know that the kernel does not have just one internal memory allocator. Instead, we have the longstanding "slab" allocator (perennially due to be removed someday), the SLUB allocator (intended for better scalability, but it hasn't been able to beat slab on every test), and the SLOB allocator (a space-efficient allocator for embedded use). There is also the SLQB allocator waiting in the wings, but it has been waiting there long enough that one may wonder if it will ever emerge from there.

All told, one might assume that we have enough allocators. Then again, there are quite a few letters still available in the SL*B namespace, so why not make another one?

Thus Christoph Lameter, author of SLUB, has come forward with the SLEB allocator, which is meant to be a mixture of the best of slab and SLUB. Unlike SLUB, SLEB retains the object queues used by slab, but it also adds a bitmap for object management as well. Also unlike SLUB, there is no storage of metadata in the objects themselves. That is a performance enhancement: if a cache-cold object is allocated or freed, SLEB will not bring it into the cache.

This code is very new; it apparently has not yet been trusted outside of a KVM virtual machine. The long benchmarking process that might lead to merging and, possibly, displacing one of the other allocators has not yet begun. But the code is there, and that's a start.

Comments (1 posted)

Kernel development news

2.6.35 merge window part 1

By Jonathan Corbet
May 19, 2010

It's that time again: a new kernel development cycle has started and the merge window is currently open for new code. As of this writing, some 1100 non-merge changes have been incorporated into the mainline kernel. The most significant user-visible changes include:

The performance monitoring subsystem supports the Intel "precise event based sampling" (PEBS) mode, in which the hardware directly records event information into a dedicated memory region. The perf subsystem also can now obtain performance information from old Pentium4 CPUs.
The "perf kvm" tool, which allows the monitoring of virtualized guests from the host, has been merged.
The dynamic probe code has better support for a number of basic integer types.
The "fair sleepers," "sync wakeups," and "affine wakeup" scheduler feature flags have been removed. It seems that, at this point, the scheduler developers don't believe that things will work properly without those features, so they are always enabled.
The SuperH architecture now has hotplug CPU support.
New drivers:
- Processors and boards: HP iPAQ rx1950 devices, Acer N35 systems, Samsung S3C2416-based systems, Marvell GuruPlug reference boards, Voipac PXA270 single-board computers, Aeronix Zipit Z2 systems, Cavium Networks CNS3xxx processors, Cavium Networks CNS3420 MPCore boards, taskit PortuxG20 and Stamp9G20 boards, ARM SPEAr3XX- and SPEAr6XX-based systems, Versatile Express CA9x4 processors, and ARM Ltd Versatile Express boards.
- Miscellaneous: DaVinci DM365-based realtime clock devices.

Changes visible to kernel developers include:

The "cpu_stop" (formerly cpuhog) mechanism has been merged. A cpu_stop allows kernel code to monopolize one or more processors for brief periods.
Augmented rbtrees are now in the mainline kernel.
The INIT_RCU_HEAD() macro is going away; it was never really needed for RCU functionality, and RCU debugging is moving to the object debugging infrastructure.

As can be seen above, the 2.6.35 merge window has gotten off to a bit of a slow start. By the old schedule, the window would remain open through the end of the month; there has been speculation that Linus will close it rather sooner than that this time around, though, to inconvenience maintainers who wait too long to get their pull requests in. One way or another, there should certainly be more changes to report on next week.

Comments (none posted)

The trouble with the TSC

By Jake Edge
May 19, 2010

The time stamp counter (TSC) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC), which makes it a tempting target for applications that need fine-grained timestamps. Unfortunately, it is also rather unreliable, so the kernel jumps through hoops to decide whether to use it and to try to detect when it goes awry. An effort to export the kernel's knowledge about the reliability of the TSC has met strong resistance for a number of reasons, but the biggest is that the kernel developers don't think that applications should be accessing the counter directly.

Dan Magenheimer and Venkatesh Pallipadi proposed adding a /sys/devices/tsc directory with several entries corresponding to the kernel's internal TSC information, including the tsc_unstable flag, which governs whether the kernel uses the counter as a stable time source. Andi Kleen questioned the idea:

Is this really a good idea? It will encourage the applications to use RDTSC directly, but there are all kinds of constraints on that. Even the kernel has a hard time with them, how likely is it that applications will get all that right?

That is exactly what the patch is meant to do, Magenheimer said, because applications have no reliable way to determine whether the standard system calls will be "fast" or "slow":

The problem is from an app point-of-view there is no vsyscall. There are two syscalls: gettimeofday and clock_gettime. Sometimes, if it gets lucky, they turn out to be very fast and sometimes it doesn't get lucky and they are VERY slow (resulting in a performance hit of 10% or more), depending on a number of factors completely out of the control of the app and even undetectable to the app.

Note also that even vsyscall with TSC as the clocksource will still be significantly slower than rdtsc, especially in the common case where a timestamp is directly stored and the delta between two timestamps is later evaluated; in the vsyscall case, each timestamp is a function call and a convert to nsec but in the TSC case, each timestamp is a single instruction.

Depending on the hardware, gettimeofday() and clock_gettime() may be implemented as vsyscalls—virtual system calls—rather than standard system calls, which eliminates the user space to kernel transition. Vsyscalls are code that is stored in a special memory region in user space (the vdso region) that may access kernel-maintained data, like clock ticks. Using vsyscalls, the calls are (relatively) fast, but on some hardware (or virtual machines) that requires kernel-space operations to get to a reliable counter, a vsyscall cannot be used, so the calls are slower. For applications that "need to obtain timestamp data tens or hundreds of thousands of times per second", the difference is significant.

But Magenheimer believes that if the kernel finds the TSC stable enough for its own timekeeping purposes, then that guarantees that it is usable by applications. Arjan van de Ven and Thomas Gleixner are quick to correct that misunderstanding. Van de Ven notes that the stability of the TSC can change under certain circumstances and there would be no way to notify the applications. His advice: "friends don't let friends use rdtsc in application code".

Gleixner goes into some detail about how the TSC can get out of whack, including system management mode interrupts (SMIs) fiddling with the TSC to hide their presence, that multiple cores can have different values because of boot offsets and/or hotplugging, and that multiple sockets can introduce differences due to separate clocks or drift in the clock signals due to temperature. There is, in short, nothing reliable about the TSC: "the stupid hardware is not reliable whether it has some 'I claim to be reliable tag' on it or not". Gleixner did offer a possible alternative, though:

[...] but as long as we do not have some really reliable hardware I'm going to NACK any exposure of the gory details to user space simply because I have to deal with the fallout of this.

What we can talk about is a vget_tsc_raw() interface along with a vconvert_tsc_delta() interface, where vget_tsc_raw() returns you an nasty error code for everything which is not usable.

Currently, there are unnamed "enterprise applications" that attempt to figure out whether they can use the TSC, and do so if they think it will work because of the uncertainty in the performance of gettimeofday() and friends. Magenheimer suggests that perhaps that information could be made available:

But the kernel doesn't expose a "gettimeofday performance sucks" flag either. If it did (or in the case of the patch, if tsc_reliable is zero) the application could at least choose to turn off the 10000-100000 timestamps/second and log a message saying "you are running on old hardware so you get fewer features".

Magenheimer also wonders if the kernel developers are suffering from "hot stove" syndrome, in that they have been burned in the past and are reluctant to even consider changes. But Gleixner and van de Ven both point out that there is no hardware that can make the guarantees that Magenheimer wants. And Gleixner has the burn marks to prove it:

I'm unfortunately forced to deal with the 500+ different variants of borked timers and that makes me very reluctant to believe anything what chip/board/bios vendors promise. It's not the one time hot stove experience, it's the constant exposure to the never ending supply of hot stoves, which makes me nervous.

While the discussion had various interesting analogies including hanging ropes/knives and condoms versus abstention, it did not (yet) find a car analogy. It did, however, seem to find some common ground that information about whether the clock calls are implemented as vsyscalls or system calls should be exported. That is unlikely to satisfy those that have been "using vsyscalls for a while and still have a performance headache", who Magenheimer quotes, but there is nothing stopping applications from reading the TSC directly. Those applications just have to be prepared to handle any strange TSC behavior they encounter.

Ingo Molnar tries to clarify the reasons that the kernel can't export the reliability information: "The point is for the kernel to not be complicit in practices that are technically not reliable. [...] So the kernel wont 'signal' that something is safe to use if it is not safe to use." But he also sees some reason to hope:

You could win the argument by coming up with a patch that changes gettimeofday to make use of the TSC in a reliable manner.

I really mean it - and it might be possible - but we have not found it yet.

Peter Zijlstra has another solution to the problem. He would like to see the kernel move to eventually disable RDTSC from user space. By emulating the instruction and logging all uses of it (and the related RDTSCP), user-space programs that use it could be identified and changed:

Once we get most of userspace running fine, we can switch it to generating faults.

Of course closed source stuff will have to deal with it themselves, but who cares about that anyway ;-)

Exporting the information about whether gettimeofday() is "slow" or not seems like a reasonable starting point. No patches to do that have emerged yet, but it is a fairly straightforward thing to do. Eventually, something like Gleixner's vget_tsc_raw() may also come about, though it won't satisfy those who are unhappy with the current vsyscall performance. Those applications will just have to read the TSC themselves and deal with whatever the hardware throws at them.

Comments (17 posted)

Blocking suspend blockers

By Jonathan Corbet
May 18, 2010

When LWN last looked at suspend blockers in April, it appeared that this functionality was on a path to be merged into the mainline sometime soon. It may still be on that path, but an extended discussion has muddied the picture somewhat. It is a relatively small and obscure bit of code, but the fate of suspend blockers may have significant implications on how the kernel community deals with external projects in the future.

Suspend blockers, remember, are tied to the "opportunistic suspend" mode used by the Android system. In this mode, the kernel is placed into a sort of controlled narcolepsy; it will fall asleep (suspend the system) just about anytime that somebody is not actively prodding it. A suspend blocker is a form of prod which can be used to keep the system awake while some sort of important processing is going on. As long as there are suspend blockers outstanding, the system will not suspend.

There are two aspects to this approach which sit well with the Android developers. One is that they are able to get better power performance (longer battery life) by suspending the entire system whenever nothing is going on. Using normal runtime power management does not give them the same results. The other key point is that opportunistic suspend can happen even when processes are running in user space. In the absence of a suspend blocker, any computation underway is not considered to be important enough to keep the system awake. This behavior is a form of defense against poorly-written applications which might, otherwise, drain a system's battery in a short period of time.

Suspend blockers have few enthusiastic supporters. Opportunistic suspend seems like a bit of a hack, and the need to put suspend blocker calls into drivers looks invasive. Even if this feature is configured out of most kernels, it looks like it could be a maintenance burden going forward. Even so, most of the developers involved - including almost everybody involved with Linux power management - have concluded that nobody has any better ideas. So, as Matthew Garrett put it:

Look, I don't want to sound like I have a one-track mind or anything, but all of these arguments would be significantly more compelling if someone would actually provide a concrete implementation proposal that deals with the set of use-cases that Google's implementation does and which doesn't make anyone cry. Otherwise the immeasurably most likely outcome is that this code gets merged and we get to live with it.

Requests for alternatives have been posted a number of times in this discussion, but actual proposals have been rather rare. A number of the suspend blocker opponents seem more interested in changing the use case - mandating that all Android applications be well written, for example. The problem is that users will blame the device (rather than that new dog whistle application) if its battery fails to last long enough. So the Android developers must choose between somehow forcing good behavior on all application developers (perhaps losing the "open to all" feature that is at the core of the Android way of doing things) or creating a system robust enough to function with non-ideal applications installed.

The Android developers have taken the latter approach. In the process, they have made suspend blockers a key part of their platform. Many of the drivers which have been developed for Android have suspend blocker support built into them; they cannot be merged in their current form if the suspend blocker API is not available for them. So the current alternatives are to keep those drivers out, or to hack out the suspend blocker usage before merging them. In the former case, we have more out-of-tree code; in the latter case, we have in-tree drivers which are not actually used, tested, or maintained by anybody. Neither alternative looks good.

Merging suspend blockers would make it easier to get much of the rest of this code in; as Android developer Brian Swetland said:

With wakelock support in the kernel, I'm able to maintain drivers that (provided they meet the normal style, correctness, etc requirements) that both can be submitted to mainline (yay!) and can ship on production hardware as-is (yay!). Porting other linux based environments to hardware like G1, N1, etc becomes that much easier too, which hopefully makes various folks happy.

This helps get us ever closer to being able to build a production-ready kernel for various android devices "out of the box" from the mainline tree and gets me ever closer to not being in the business of maintaining a bunch of SoC-specific android-something-2.6.# trees, which seriously is not a business I particularly want to be in.

("Wakelocks" are the old name for suspend blockers).

Google and Android have taken a lot of grief for their failure to work with upstream and get their code into the mainline kernel. There can be no doubt that their code could have been handled better; had the Android developers worked with the kernel community before shipping this functionality in millions of handsets, perhaps much of this trouble could have been avoided. But that history cannot be rewritten now; not even the secret git "plumbing" commands can make that happen. But we can try to improve the situation going forward.

The suspend blocker effort looks like a real attempt to do better. The code has not just been posted; it has been through rather more than the usual number of revisions as its developers have put considerable time into trying to address comments which have been made. A failure to merge it would be demoralizing at best. If the development community refuses this attempt to bring Android and the mainline closer, it risks creating an impression of bad faith at best. If we do not accept their code, we really should not complain about them maintaining it outside of the mainline.

As of this writing, the 2.6.35 merge window is open. What will happen with suspend blockers is anybody's guess. The power management developers are in favor of merging it, but some others have made a fair amount of noise and Linus has not made his feelings known. So it is hard to say whether this long story is about to come to a close or not.

Comments (48 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.34 ?

Architecture-specific

Yinghai Lu Use lmb with x86 ?

Ian Munsie ftrace syscalls, PowerPC: Fixes and PowerPC raw syscall tracepoint implementation ?

Core kernel code

Tejun Heo sched: prepare for cmwq ?

Arve Hjønnevåg Suspend block api (version 7) ?

Michel Lespinasse V2: rwsem changes + down_read_unfair() proposal ?

Suresh Siddha sched: change nohz idle load balancing logic to push model ?

Srikar Dronamraju Uprobes v4 ?

Mike Chan Enable cpu frequency and power tracking for cpuacct cgroup ?

Development tools

Frederic Weisbecker Unified lockup detector ?

Shaohui Zheng NUMA Hotplug emulator ?

Steven Rostedt [GIT PULL] tracing: shrinking trace events and updates ?

Jason Wessel kdb / early debug (1 of 2) for 2.6.35 ?

Jason Baron jump label v8 ?

Device drivers

Frederic Weisbecker V4l bkl ioctl pushdown ?

Arnd Bergmann BKL conversion in tty layer ?

felipe.balbi@nokia.com teach gpiolib about gpio debouncing ?

Robert Love FC subsystem update ?

Hans Verkuil [RFCv2] [RFC] New control handling framework ?

Huang Ying ACPI, APEI support ?

Michal Nazarewicz Several improvements to USB gadgets ?

David Dajun Chen MFD of DA9052 Linux device drivers (1/9) ?

Daniel Vetter [RFC] GPU fair-lru eviction ?

Henrik Rydberg input: multitouch: Introduce MT event slots (rev 3) ?

Filesystems and block I/O

Dave Chinner Per-superblock shrinkers ?

Aneesh Kumar K.V Generic name to handle and open by handle syscalls ?

Jeff Moyer ext3/4: enhance fsync performance when using CFQ ?

Memory management

Naoya Horiguchi HWPOISON for hugepage (v5) ?

Nitin Gupta ramzswap: Eliminate stale data from compressed memory (v2 resend) ?

Christoph Lameter [RFC] SLEB: The Enhanced Slab Allocator ?

Virtualization and containers

Stefano Stabellini PV on HVM Xen ?

Miscellaneous

Theodore Ts'o E2fsprogs 1.41.12 release ?

Stephen Hemminger iproute2 2.6.34 ?

Page editor: Jonathan Corbet
Next page: Distributions>>