
Kernel development

Brief items

Kernel release status

The 2.6.33 merge window is still open, so there is no published development kernel as of this writing. The 2.6.33-rc1 release, closing the merge window, can be expected almost any time now.

Stable kernel updates: two stable updates were released on December 14. Both contain a long list of fixes, many of them for the ext4 filesystem.


Quotes of the week

No mum just the creator of Linux making my life hard on a Friday. I'm sure Dad can find articles about it.
-- Dave Airlie

Damn, this is complicated crap. The analogous task in real life would be keeping a band of howler monkeys, each in their own tree, singing in unison while the lead vocalist jumps from tree to tree, and meanwhile, an unseen conductor keeps changing the tempo the piece is played at. Thankfully, there are no key changes, however, occasionally new trees sprout up at random and live ones fall over.
-- Zachary Amsden (thanks to Markus Armbruster)

Overdesigning is a SIN. It's the archetypal example of what I call "bad taste". I get really upset when a subsystem maintainer starts overdesigning things.
-- Linus Torvalds

Or maybe he's talking about ye olde readlocke, used widely for OS research throughout the middle ages. You still find that spelling in some really old CS literature.
-- Linus Torvalds


RCU mistakes

By Jonathan Corbet
December 15, 2009
Thomas Gleixner has set himself the task of getting rid of the messy rwlock called tasklist_lock; in many cases, the solution is to use read-copy-update (RCU) instead. In the process, he found some problems with how some code uses RCU. They merit a quick look, since these problems may occur elsewhere, and may reflect an outdated understanding of how RCU works.

The core idea behind RCU is to delay the freeing of obsoleted, globally-visible data until it is known that no users of that data exist. Traditionally, this has been accomplished by (1) requiring that all uses of RCU-protected data be in atomic code, and (2) not freeing any old data until every CPU in the system has scheduled at least once after that data was replaced by an updated copy. Since atomic code cannot schedule, this set of rules is sufficient to know that no references to the old data exist.

Needless to say, code working with RCU-protected data must have preemption disabled - otherwise the processor could schedule while a reference to that data still exists. So the rcu_read_lock() primitive has traditionally disabled preemption. Based on the code Thomas found, that seems to have led to the conclusion that disabling preemption is sufficient for code using RCU.

The problem is that newer forms of RCU use a more sophisticated batching mechanism to track references to RCU-protected data. This change was necessary to make RCU scale better, especially in situations (realtime, for example) where disabling preemption is undesirable. When using hierarchical (or "tree") RCU, code which simply disables preemption before accessing RCU-protected data will have ugly race conditions. So it's important to always use rcu_read_lock() when working with such data. Unfortunately, this is a hard rule to enforce in an automated way, so programmers will simply have to remember it.
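The difference can be pictured with a toy user-space model; it is not kernel code. It assumes, as described above, that the newer RCU implementations decide when old data may be freed by tracking readers which have announced themselves through rcu_read_lock(). The model_* names and the two counters are hypothetical stand-ins for kernel state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these counters stand in for kernel state. */
static int rcu_readers;     /* readers registered via the RCU read lock */
static int preempt_count;   /* preemption-disable nesting depth */

static void model_rcu_read_lock(void)
{
	rcu_readers++;
}

static void model_rcu_read_unlock(void)
{
	assert(rcu_readers > 0);
	rcu_readers--;
}

static void model_preempt_disable(void)
{
	preempt_count++;
}

static void model_preempt_enable(void)
{
	assert(preempt_count > 0);
	preempt_count--;
}

/*
 * In this model the grace period (after which old data is freed) looks
 * only at registered readers; it never consults preempt_count.
 */
static bool model_grace_period_can_end(void)
{
	return rcu_readers == 0;
}
```

In this model, a reader which only calls model_preempt_disable() leaves model_grace_period_can_end() returning true, so the data could be freed while still in use; a reader which calls model_rcu_read_lock() holds the grace period open until it unlocks.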


Power capping

By Jonathan Corbet
December 16, 2009
Salman Qazi hypothesizes a situation many of us have certainly found ourselves in:

Imagine being in a tent in Death Valley with a laptop. You are bored, and you want to watch a movie. However, you also want to do your best to make the battery last and watch as much of the movie as possible.

The proposed solution, as it happens, also works for another situation. Imagine you are Google, and you want to get the most out of each data center. One way to do that is to populate the site with more machines than the incoming power is able to handle, then moderate the power consumption of individual machines to keep the total below the limit.

In particular, the code that Google has works by forcing the processor to go idle for a given percentage of the time, where that percentage is set dynamically depending on the load on the machine and on the data center as a whole. If need be, a special-purpose realtime task will take over and idle the processor for the required time to keep the total computing time below the limit. There are some interesting heuristics for trying to force the idle cycles onto low-priority processes and for determining whose time slices the idle cycles are charged to.
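Since the code has not been posted, the following is only a guess at the arithmetic involved: given an accounting period and a required idle percentage, it computes how much idle time would have to be forced once the idle time which happened naturally is taken into account. The function name, its parameters, and the use of microseconds are all assumptions made for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: microseconds of idle time to inject during one
 * accounting period so that the CPU is idle for at least idle_pct
 * percent of that period.
 */
static unsigned long idle_to_inject_us(unsigned long period_us,
				       unsigned int idle_pct,
				       unsigned long natural_idle_us)
{
	unsigned long long required_us;

	assert(idle_pct <= 100);

	/* Widen before multiplying so that long periods cannot overflow. */
	required_us = (unsigned long long)period_us * idle_pct / 100;

	/* The CPU was already idle for long enough; inject nothing. */
	if (natural_idle_us >= required_us)
		return 0;

	return (unsigned long)(required_us - natural_idle_us);
}
```

With a 100ms period and a 25% idle requirement, a CPU which was naturally idle for 10ms would need another 15ms of forced idle time; one which was idle for 40ms would need none.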

This work sounds quite similar to the ACPI processor aggregator driver which was merged for 2.6.32 over scheduler maintainer Peter Zijlstra's objections. Peter has not yet spoken up on this patch, but, from the description, it sounds like it is closer to what he was requesting for this kind of functionality. It is hard to tell for sure, though; the actual code has not yet been posted. Hopefully that will follow soon, and this change can be evaluated for real.



By Jonathan Corbet
December 16, 2009
Nice new tracing tools notwithstanding, kernel developers still tend to reach for printk() when trying to figure out problems. But one need not work on kernel code for very long before running into an unpleasant fact: the most interesting stuff is often printed immediately before a crash, but, for many kinds of problems, the death of the system can prevent the output of those crucial lines. It's no fun to stare at a hung system, knowing that the information needed to find the problem is probably trapped in a buffer somewhere in that system's memory.

2.6.33 will contain a new mechanism designed to help get that last bit of information out of a dying system's clutches. The developer need only set up a new "kmsg dumper" along these lines:

    #include <linux/kmsg_dump.h>

    struct kmsg_dumper {
        void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
                     const char *s1, unsigned long l1,
                     const char *s2, unsigned long l2);
        struct list_head list;
        int registered;
    };

The dump() function will be called in the event of a crash; the arguments s1 and s2 will point to the data in the kernel's output buffer, with l1 and l2 giving the corresponding lengths. Two pointers are needed due to the circular nature of this buffer; s1 will point to the older set of messages.
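The need for two segments can be seen in a small user-space model of a circular log buffer; this is a sketch, not the kernel's implementation. The log_ring structure, the eight-byte LOG_SIZE, and the helper names are all hypothetical. Once the buffer has wrapped, the oldest data runs from the write position to the end of the buffer, and the newest data runs from the start of the buffer up to the write position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_SIZE 8	/* hypothetical; tiny so that wrapping is easy to see */

struct log_ring {
	char buf[LOG_SIZE];
	size_t head;	/* total number of bytes ever written */
};

/* Append a message, overwriting the oldest bytes once the ring is full. */
static void ring_write(struct log_ring *r, const char *msg)
{
	for (; *msg; msg++)
		r->buf[r->head++ % LOG_SIZE] = *msg;
}

/* Describe the contents as an older segment (s1) and a newer one (s2). */
static void ring_segments(const struct log_ring *r,
			  const char **s1, size_t *l1,
			  const char **s2, size_t *l2)
{
	size_t end = r->head % LOG_SIZE;

	if (r->head <= LOG_SIZE) {
		/* Never wrapped: everything sits in one piece. */
		*s1 = "";
		*l1 = 0;
		*s2 = r->buf;
		*l2 = r->head;
	} else {
		/* Wrapped: the oldest byte is at the write position. */
		*s1 = r->buf + end;
		*l1 = LOG_SIZE - end;
		*s2 = r->buf;
		*l2 = end;
	}
}

/* What a dump() callback might do: store s1, then s2, in order. */
static size_t linearize(char *out, size_t out_len,
			const char *s1, size_t l1,
			const char *s2, size_t l2)
{
	size_t n;

	assert(out_len > 0);
	if (l1 > out_len - 1)
		l1 = out_len - 1;
	memcpy(out, s1, l1);
	n = l1;
	if (l2 > out_len - 1 - n)
		l2 = out_len - 1 - n;
	memcpy(out + n, s2, l2);
	n += l2;
	out[n] = '\0';
	return n;
}
```

A dumper which writes s1 followed by s2 thus stores the messages in the order in which they were printed.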

Registering and unregistering this function is a matter of calling:

    int kmsg_dump_register(struct kmsg_dumper *dumper);
    int kmsg_dump_unregister(struct kmsg_dumper *dumper);

In the 2.6.33 kernel, the "mtdoops" module has been reworked to use this new mechanism to save crash data to a flash device.


A new set of per-CPU operations

By Jonathan Corbet
December 16, 2009
Per-CPU variables are a performance-improving technology. They allow processors to work with data without having to worry about locking or cache contention. One would want these operations to be well optimized, but, as it turns out, they can be improved; Tejun Heo and Christoph Lameter have done just that for 2.6.33. In the process, they have changed the way developers work with these variables.

There is a set of new operations:

    this_cpu_write(scalar, value);
    this_cpu_add(scalar, value);
    this_cpu_sub(scalar, value);
    this_cpu_and(scalar, value);
    this_cpu_or(scalar, value);
    this_cpu_xor(scalar, value);

In each case, scalar is either a per-CPU variable obtained with a new allocator or a static per-CPU variable as obtained from per_cpu_var(). All of them are atomic, in that the operation will not be interrupted part-way through on the current processor. It is not necessary to call put_cpu() after using these operations.
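The underlying idea can be sketched in user space with one array slot per processor. This is a toy model rather than the kernel's implementation: the model_* macros only imitate the shape of the new operations, model_cpu stands in for "the processor this code is running on," and nothing here demonstrates the atomicity guarantee described above.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4		/* hypothetical CPU count */

static int model_cpu;		/* stands in for the current processor */

/* Pretend that execution has moved to the given processor. */
static void model_run_on(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_cpu = cpu;
}

/* Toy equivalents of the new operations: touch only the current slot. */
#define model_this_cpu_write(var, val)	((var)[model_cpu] = (val))
#define model_this_cpu_add(var, val)	((var)[model_cpu] += (val))
#define model_this_cpu_sub(var, val)	((var)[model_cpu] -= (val))

static long model_events[MODEL_NR_CPUS];	/* one counter per CPU */

/* A reader wanting a global figure sums every processor's slot. */
static long model_events_total(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += model_events[cpu];
	return sum;
}
```

Each processor updates only its own slot, so the update path needs no lock and does not bounce a shared cache line; the cost moves to the (presumably rarer) reader which must add up all of the slots.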

See, for example, the VM statistics conversion for an example of how operations on per-CPU variables change under the new scheme.


Kernel development news

2.6.33 merge window part 2

By Jonathan Corbet
December 16, 2009
Since last week's summary, there have been over 4200 patches merged for the 2.6.33 development cycle. That makes a total of 8152 patches for this merge window, as of this writing.

User-visible changes include:

  • If there are any remaining reiserfs users out there: that filesystem has seen a major rework of its internal locking to eliminate use of the big kernel lock.

  • The Super-H architecture has gained perf events support for a number of system types.

  • The exofs filesystem (for object storage devices) now has multi-device mirror support.

  • There is a new "discard" mount option for ext4 filesystems, controlling whether ext4 issues TRIM commands for newly-freed space. It defaults to off due to fears about how well this feature will really work once hardware begins to support it.

  • It is now possible to configure a kernel without ext2 or ext3 support, but still mount filesystems with those formats using the ext4 code.

  • The Nouveau reverse-engineered NVIDIA driver has been merged, but without the accompanying firmware; see this article for more information.

  • The "ramzswap" device, formerly known as compcache, has been merged into the staging tree.

  • There is now support for the "BATMAN" mesh network protocol in the staging tree.

  • The "perf" tool now has a "diff" mode which will calculate the change in performance between two different runs and generate a report.

  • The semantics for the O_SYNC and O_DSYNC open-time flags have been rationalized, as described in this article.

  • The MD layer now supports barrier requests for all RAID types. The device mapper, too, has improved barrier support.

  • The snapshot merge target for the device mapper has been merged.

  • An extensive set of tracepoints has been added to the XFS filesystem, allowing fine-grained visibility into most aspects of its operation.

  • Memory pages shared with the kernel shared memory (KSM) mechanism are now swappable.

  • New hardware support:

    • Block devices: The VMware paravirtualized SCSI HBA device, LSI 3ware SAS/SATA-RAID controllers, PMC-Sierra SPC 8001 SAS/SATA based host adapters, Apple PowerMac/PowerBook internal 'MacIO' IDE controllers, Blackfin Secure Digital host controllers, TI DAVINCI multimedia card interfaces, and BCM Reference Board NAND flash controllers.

    • Miscellaneous: Dynapro serial touchscreens, Altera University Program PS/2 ports, Samsung S3C2410 touchscreens, National Semiconductor LM73 temperature sensors, Nuvoton NUC900 series SPI controllers, SuperH MSIOF SPI controllers, OMAP SPI 100K master controllers, ST-Ericsson AB4500 Mixed Signal Power management chips, Freescale MC13783 realtime clocks, Freescale MC13783 touchscreen devices, SHARP LQ035Q1DH02 TFT displays, and TI BQ32000 I2C realtime clocks.

    • Networking: RealTek RTL8192U Wireless LAN NICs, Agere Systems HERMES II Wireless PC Cards (Model 0110), and Analog Devices Blackfin on-chip CAN controllers.

    • Sound: AD525x digital potentiometers and Texas Instruments DAC7512 digital-to-analog converters.

    • Systems and processors: Neuros OSD 2.0 devices, Nintendo GameCubes, Freescale P1020RDB processors, Freescale p4080ds reference boards, Arcom/Eurotech ZEUS single-board SBC systems, ATNGW100 mkII Network Gateway boards, and Acvilon BF561 boards.

    • USB: Xilinx USB host controllers and OMAP34xx USBHOST 3 port EHCI controllers.

    • Video4Linux: OmniVision OV2610, OV3610, and OV96xx sensors, Sharp RJ54N1CB0C sensors, E3C EC168 DVB-T USB2.0 receivers, E3C EC100 DVB-T demodulators, Maxim MAX2165 silicon tuners, Aptina MT9T112 cameras, and DiBcom DiB0090 tuners.

Changes visible to kernel developers include:

  • The scsi_debug module can now emulate "thin provisioning" devices.

  • The detect() callback in struct i2c_driver has lost the unused kind parameter. Also, struct i2c_client_address_data is no more; address lists are represented with simple unsigned short arrays instead.

  • The spinlock renaming patch has been applied. Developers working near low-level code will see the new arch_spin_lock_t type being used with non-sleeping (even in the realtime tree) locks.

  • Video4Linux2 has a new subdevice API, called media-bus, intended to help in the negotiation of image formats between the sensor and the controller.

  • There is a new mechanism for grabbing and saving kernel messages on a system crash; see this article for more information.

  • The per-CPU variable allocator has been replaced, and there is a new set of operations for working with these variables; see this article for a brief introduction.

This merge window should close in the very near future, so the 2.6.33 kernel is, at this point, close to being feature-complete. Any final additions will be noted in next week's edition.


Redesigning asynchronous suspend/resume

By Jonathan Corbet
December 16, 2009
Your editor suspects that, were somebody to poll the community of Linux users, very few would state that they dislike the idea of having their systems suspend and resume more quickly. Rafael Wysocki has been working toward this goal for some time; his asynchronous suspend/resume patches were covered here back in August. This code has not encountered any real turbulence for a while, so one might well assume that Rafael's 2.6.33 pull request containing asynchronous suspend/resume would not be controversial. Such assumptions, however, fail to take into account the "last-minute Linus" effect.

The simple fact of the matter is that, like anybody else, Linus cannot possibly follow all of the projects under way at any given time; that makes it entirely possible for work on a specific project to proceed to a conclusion without ever drawing his attention. That will inevitably come to an end, though, when somebody sends a pull request asking that the work be merged into the mainline. It seems clear that some requests are scrutinized more closely than others, but some are looked at closely indeed. The power management request, as it turns out, was one of those.

Linus didn't like what he saw, to say the least. The code struck him as overly complex and possibly unsafe; he refused to pull it. In particular, he thought that far too much work went into trying to map out the device tree topology and all of the dependencies between devices. In the past, attempts to make things asynchronous based on just the apparent topology have run into trouble; why should it be different this time?

Having said that, Linus then went on to outline an alternative solution based mainly on the device tree. In so doing, he wanted to make it possible for most drivers to ignore the concept of asynchronous suspend and resume entirely. For much of the hardware on the system, the time required for either operation is so short that there is really little point in trying to do it in parallel. If a device can be suspended in a few milliseconds, one might as well just do it serially and avoid the complexity.

For the rest, Linus very much wanted the decision on whether to do things asynchronously to be made at the driver level. But the power management core still needs to know enough about asynchronous operation to wait until it is done; one cannot suspend a controller until all devices connected to it have, themselves, completed suspending. After some revisions, Linus's plan came down to something like this:

  • A reader/writer semaphore (rwsem) is associated with each node in the device tree. These semaphores allow an unlimited number of concurrent reader locks, but only one writer lock can exist at any given time, and writers must first wait for any readers to finish. At the beginning of the suspend process, no locks are taken.

  • The suspend process is initiated on all children of a given node. If suspend is done synchronously, it happens right away and no further action is required.

  • Should the driver decide to suspend its device asynchronously, it starts a thread to do that work. It also takes a read lock on the parent's rwsem.

  • When an asynchronous suspend for a specific device completes, the read lock is released.

  • The parent node acquires a write lock on its own rwsem before suspending the device. If any child nodes are suspending asynchronously, the write lock will block as a result of the outstanding read locks. Only when all read locks are released - meaning that all children are suspended - can the parent acquire its write lock and suspend.

For resume, the write lock is taken first, and all children take read locks on their parent before resuming the hardware. That will ensure that each parent device completes resuming before any of its child devices begin the process.
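The ordering which the suspend side of this plan enforces can be sketched with a single-threaded toy model. It is not kernel code and uses no real rwsem: a reader count stands in for read locks held on a node, and the "write lock" attempt simply fails where a real one would block. The dev_node structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_node {
	const char *name;
	struct dev_node *parent;
	int readers;		/* read locks held on this node by children */
	bool suspended;
};

/* A child starting an asynchronous suspend takes a read lock on its parent. */
static void async_suspend_begin(struct dev_node *child)
{
	if (child->parent)
		child->parent->readers++;
}

/* When the asynchronous suspend completes, the read lock is released. */
static void async_suspend_end(struct dev_node *child)
{
	child->suspended = true;
	if (child->parent) {
		assert(child->parent->readers > 0);
		child->parent->readers--;
	}
}

/*
 * The parent needs the write lock on its own node before suspending.
 * A real write lock would block here; the model reports failure instead.
 */
static bool try_suspend(struct dev_node *dev)
{
	if (dev->readers > 0)
		return false;
	dev->suspended = true;
	return true;
}
```

While a disk is still suspending asynchronously, try_suspend() on its controller fails; once the disk calls async_suspend_end(), the controller can proceed.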

This scheme has the benefit of simplicity. Getting it implemented took a few rounds of discussion, though, with Linus repeatedly asking developers to retain that simplicity and not try to make up new locking schemes. Things still changed along the way; as of this writing, the current suspend/resume patch set does not use Linus's plan as originally written. Among other things, Rafael, who did implement an rwsem-based solution, ran into problems with lockdep that Linus agreed were serious.

What has been implemented instead is a variant on that scheme based on completions. Every device node gets a completion structure, initially set to the "not complete" state. Additionally, any driver which implements asynchronous suspend/resume needs to call device_enable_async_suspend() to inform the power management core of that fact. It's now up to that core to create threads for asynchronous suspend/resume operations, and to invoke driver callbacks from those threads. Before suspending a specific device node, the power core will wait for completions for any child devices which have been marked for asynchronous callbacks. Once again, that ensures that all children have been suspended before the parent node is suspended.

Linus doesn't like the completion-based approach, but has indicated that he will be willing to take it. As of this writing, that has not yet happened, though.

Seen in one light, this episode highlights the sort of disregard for developer time which is occasionally seen in the kernel development process. It is not that uncommon for code which has seen a lot of work to end up being discarded or massively reworked. This model can seem quite wasteful, and there can be no doubt that it can be highly frustrating for the developers involved. But it is also a fundamental part of how quality control for the kernel works. The suspend/resume code was clearly improved by this last-minute redesign. One might say that it would have been better done some months ago, but what matters most for Linux users is that it happens at all.


The abrupt merging of Nouveau

By Jonathan Corbet
December 15, 2009
The merge window is normally a bit of a hectic time for subsystem maintainers. They have two weeks in which to pull together a well-formed tree containing all of the changes destined for the next kernel development cycle. Occasionally, though, last-minute snags can make the merge window even more busy than usual. The unexpected merging of the Nouveau driver is the result of one such snag - but it is a story with a happy ending for all.

Dave Airlie probably thought he had enough on his plate when he generated the DRM pull request for 2.6.33. This tree contained 203 commits touching 122 different files, and adding over 9,000 lines of code. One of the key features aimed at the kernel is the new "page flipping ioctl()," helpfully described in the commit message as "The ioctl takes an fb ID and a crtc ID and flips the crtc to the given fb at the next vblank." In English, it means that a specific video output can be quickly switched from one region of video memory to another, allowing for clean video changes without the "tearing" that results from display of a video buffer which is being changed.
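The "flip at the next vblank" behavior can be pictured with a toy model; it is not the DRM interface, and the structure and function names are invented for illustration. A flip request only records the new framebuffer; the switch itself happens when the vertical blanking interval arrives, so the display never scans out a buffer which changes underneath it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one video output (a "crtc"). */
struct model_crtc {
	unsigned int id;
	unsigned int current_fb;	/* framebuffer being scanned out */
	unsigned int pending_fb;	/* framebuffer queued by a flip */
	bool flip_pending;
};

/* Request a flip: nothing visible changes yet. */
static void model_page_flip(struct model_crtc *crtc, unsigned int fb_id)
{
	crtc->pending_fb = fb_id;
	crtc->flip_pending = true;
}

/* Vertical blank: the only point at which the scanout buffer changes. */
static void model_vblank(struct model_crtc *crtc)
{
	if (crtc->flip_pending) {
		crtc->current_fb = crtc->pending_fb;
		crtc->flip_pending = false;
	}
}
```

An application would draw the next frame into the off-screen buffer, request the flip, and then start drawing into the buffer which has just left the screen.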

Other changes for DRM this time around include support for Intel's "Ironlake" GPU and "Pineview" Atom processor, and a great deal of work supporting kernel mode setting on Radeon GPUs. Radeon, it seems, only lacks good power management support at this point; it will likely lose its "staging" designation before the end of this development cycle.

Linus was not impressed by any of that, though. Instead, he had one concern: the fact that the Nouveau driver - a reverse-engineered driver for NVIDIA chipsets - was not a part of the pull request. Nouveau had been discussed at the 2009 Kernel Summit, and it was generally agreed that this code should find its way into the mainline as soon as possible. 2.6.33 is the first merge window since the summit, and Linus clearly had expected some action on that front. When he didn't get it, he made his disappointment known.

One might wonder what the problem with Nouveau was. The world is full of out-of-tree Linux drivers; recent efforts have reduced their number considerably, but they still exist and Linus does not normally complain about them. Certainly Nouveau has a higher profile than most other out-of-tree drivers; it is the only hope for a free driver for a large percentage of available machines. But the real problem is that Fedora (at least) has been shipping this driver without doing enough (in Linus's opinion) to get it upstream. In Linus's words:

I'm pissed off at distribution people. For years now, distributions have talked about "upstream first", because of the disaster and fragmentation that was Linux-2.4. And most of them do it, and have been fairly good about it.

But not only is Fedora not following the rules, I know that Fedora people are actively making excuses about not following the rules. I know Red Hat actually employs (full-time or part-time I have no idea) some Nouveau developer, and by that point Red Hat should also man up and admit that they need to make "merge upstream" be a priority for them.

A number of reasons for the non-merging of Nouveau have been given, ranging from "not ready yet" and "unstable user-space API" to "we haven't found the time yet." The real blocker in recent times, though, has been the binary blob loaded into some NVIDIA GPUs by the driver. This chunk of code, known as the "voodoo" or "ctxprogs," was obtained by watching the proprietary drivers in action. Since nobody in the Nouveau project wrote this code, nobody has been willing to sign off on it; it's not at all clear that it can be legally distributed. Linus has not been impressed by this reason either, but the fact remains: developers take the Signed-off-by: line seriously and are not willing to attach it to something which might be legally questionable.

The obvious answer, one which has been applied in other situations, is to pull the firmware out of the driver and load it into the kernel at run time. And that is exactly what happened with Nouveau: Ben Skeggs put in an intensive effort to remove ctxprogs and use the firmware loading API to get it when the driver loads. Dave then put together the "DRM Nouveau pony tree" and requested that it be pulled for 2.6.33. Linus, of course, did exactly that.

Potential users will still have to get the "ctxprogs" from elsewhere. For whatever reason, pointers to "elsewhere" are hard to find, but your editor happens to know that the firmware can be found in the Nouveau git tree. Simply grabbing the right version and placing it in the local firmware directory should be sufficient.

All of this marks significant progress for Nouveau, but a dependence on firmware of dubious origin is likely to inhibit the adoption of this driver in the long term. So it was good to learn (via an LWN comment posting) that the contents of the ctxprogs blob are not quite as obscure as many of us had thought:

[W]e know a lot about ctxprogs these days, including their purpose [context switching], what they do [save/restore PGRAPH state], and most of their opcodes. There are still some unknowns that prevent us from writing new ctxprogs from scratch right now, but we're working on that and it *will* be resolved in the proper way. Which is throwing out nvidia's progs and writing our own prog generator.

It seems that things are moving quickly on this front too; on December 15, Ben announced the availability of a replacement firmware for NVIDIA GeForce 6/7 hardware. This is a first posting for this code; doubtless testers will encounter some problems. But it sounds very much like the hardest problems have been overcome, at least for this particular variant of the hardware. With luck, NVIDIA's firmware will not be needed for much longer. In the longer term, it might even turn out to be possible to program interesting functions into the hardware, extending its capabilities in surprising ways.

Once upon a time, Linux users had to be very careful about which hardware they bought. Over the years, most of those problems have gone away; it is now easy to find systems which are completely supported by free software. One of the biggest exceptions has been in the area of graphics. Vendors like Intel and ATI/AMD have made the decision that their hardware should be supported with free drivers (most of the time) and have invested resources to make that happen. NVIDIA has been rather less cooperative, and support for its hardware has suffered accordingly. It would appear that the driver problem is getting close to a solution, but we should never forget the effort which was required to get to this point. NVIDIA would be far more worthy of our future commercial support if it had not made that effort necessary.



Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds