Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.35-rc3, released on June 11. "So I've been hardnosed now for a week - perhaps overly so - and hopefully that means that 2.6.35-rc3 will be better than -rc2 was. Not only do we have a number of regressions handled, we don't have that silly memory corruptor that bit so many people with -rc2 and confused people with its many varied forms of bugs it seemed to take, depending on just what random memory it happened to corrupt." The short-form changelog is in the announcement, or see the full changelog for all the details. Linus is now evidently going offline for a little while, so the flow of changes into the mainline will slow down.

Stable updates: there have been no stable updates in the last week.

Quotes of the week

The kernel's whole approach to messaging is pretty haphazard and lame and sad. There have been various proposals to improve the usefulness and to rationally categorise things in ways which are more useful to operators, but nothing seems to ever get over the line.
-- Andrew Morton

I do fairly commonly see patches where the description can be summarised as "change lots and lots of stuff to no apparent end" and one does have to push and poke to squeeze out the thinking and the reasons. It's a useful exercise and will sometimes cause the originator to have a rethink, and sometimes reveals that it just wasn't a good change.
-- Andrew Morton

Finding a patch's kernel version with git

By Jake Edge
June 16, 2010

Back in May, Jan Kara posted a VFS patch that fixed a regression, and he sent the patch to the stable tree folks as well. Linus Torvalds noted that the regression had been introduced during the merge window, so the fix wasn't relevant for the stable tree. That led to a discussion about how to figure out which kernel version includes a particular patch. While the conversation is a month old, the advice is pretty much timeless.

Andrew Morton's method is rather sub-optimal: "I just keep lots of kernel trees around and poke about with `patch --dry-run'. PITA." Christoph Hellwig and James Bottomley both suggested git-describe <revid>, which will show the tag of the version a patch was applied to, or was pulled into if you use the --contains flag. As one might guess, though, Torvalds had some more elaborate suggestions. One can use git name-rev in much the same way as git-describe --contains, but a more "obscure" way to get the same kind of information is:

    git log --tags --source --author=viro --oneline fs/namei.c
which shows commits to fs/namei.c by Al Viro, along with the tagged version each commit was included in. On a recent kernel tree, the start of that output looks like:
    d83c49f v2.6.34 Fix the regression created by "set S_DEAD on unlink()..." commit
    3e297b6 v2.6.34-rc3 Restore LOOKUP_DIRECTORY hint handling in final lookup on op
    781b167 v2.6.34-rc2 Fix a dumb typo - use of & instead of &&
    1f36f77 v2.6.34-rc2 Switch !O_CREAT case to use of do_last()

While the specific example Torvalds gave might not be widely applicable, the basic idea behind it is. Using git-blame to track down the commit where a particular change was made is often useful, but the dates in the log can be misleading with regard to which kernel(s) the change ended up in. Using some combination of describe and log will make figuring those kinds of things out much easier.
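
As a quick illustration (a hypothetical session; the line range is arbitrary and the commit ID is taken from the output above), one might first find the commit that last touched some lines with git-blame, then map it to a release:

    git blame -L 100,110 fs/namei.c      # which commit last touched these lines?
    git describe --contains d83c49f      # first tag that contains that commit

Running git name-rev d83c49f would answer the same question in a slightly different format.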

The Managed Runtime Initiative

By Jonathan Corbet
June 16, 2010
The Managed Runtime Initiative has recently announced its existence. This group is dedicated to making "managed runtime" code (Java programs in particular) run faster on Linux systems. MRI's effort might not seem like a suitable topic for the Kernel Page, except for one thing: this group has just released thousands of lines of questionable code which it says it plans to push upstream.

The specific problem that the MRI people (actually Azul Systems employees) have set out to solve appears to be application pauses caused by garbage collection. Their solution is implemented at several levels, some of which are found in the kernel. For the curious, the patches can be found on the MRI download page, helpfully packaged as a tarball filled with source-RPM files. They have also thoughtfully included all of Red Hat's patches; look for files containing "az" to pick the new stuff out of the noise.

The first kernel patch adds an interface for loadable memory management modules. With this in place, loadable modules can create and claim their own VMAs which they manage. The Azul-supplied module creates a special device which provides a few dozen ioctl() operations for the management of memory within those VMAs. What is actually done by this module is on the obscure side; it involves dividing memory into "accounts" with names like "GC Pause Prevention." There appears to be code to provide transparent hugepage access to interested applications. There is also some sort of relaxed locking done within the special VMAs designed to improve scalability there.

Then, there is the pluggable scheduler patch, creating a new SCHED_ALT scheduling class which sits between CFS and the realtime classes. The actual scheduler module's purpose is described as:

The Azul scheduler is designed to provide a cpu resource guarantee on Linux: specifically that any process with 'committed' cpus and runnable threads available for those cpus will have its threads running on those cpus within 10ms.

It allows the partitioning of the system into "committed" and ordinary CPUs, with special applications getting priority access to the committed CPUs.

The MRI web page claims that "it is the initiative's goal to upstream those related contributions into existing and complementary OSS projects (e.g. kernel.org and openjdk.org)," but the kernel code has never, to your editor's knowledge, been seen on any kernel-related mailing list. It is heavy with #ifdefs, light on comments, and it adds exports for large numbers of low-level functions in the scheduler and VM code. Plus there is the little detail that the development community is unlikely to agree with this code's fundamental purpose. Pluggable schedulers have been rejected in the past; until now nobody has even dared to suggest pluggable memory management modules.

In other words, we have a bunch of hackish code which was developed in total isolation; one wonders how many customers it has been shipped to. If Azul Systems and the MRI are serious about wanting to upstream it, they might just want to start talking with the development community fairly soon. One expects that they might just have a few changes to make.

Kernel development news

Improving lost and spurious IRQ handling

By Jonathan Corbet
June 15, 2010
Interrupts are a device's way of telling the kernel that something interesting has happened. One of the key benefits of using interrupts is that they free the kernel from the need to poll a device to learn what its state is. Like any other part of a computer, though, interrupts can go wrong, leading to situations where the system is overwhelmed by a flood of spurious interrupts - or, instead, left waiting for an interrupt which will never arrive. The kernel has some defensive mechanisms in its generic interrupt layer for dealing with situations like these; Tejun Heo has now posted a patch series intended to improve those mechanisms. As it happens, the necessary response when interrupts go bad is returning to polling.

One problem which is familiar to driver authors is missing interrupts. A driver will typically set up an I/O operation, get it started, then wait until an interrupt indicating completion arrives. If that interrupt never shows up, the driver can end up waiting for a very long time. Missing interrupts can have a number of causes, including flaky devices or an interrupt routing problem somewhere in the system. Either way, if the driver author has not anticipated this situation and taken the appropriate measures - setting a timeout, for example - things will not end well.

Recovering from a lost interrupt only after a timeout will slow a device's performance considerably, though. That problem can be mitigated by polling the device state frequently, but rapid polling has its own costs. In an attempt to get the best of both approaches, Tejun's patch adds a new driver API:

    #include <linux/interrupt.h>

    struct irq_expect *init_irq_expect(unsigned int irq, void *dev_id);
    void expect_irq(struct irq_expect *exp);
    void unexpect_irq(struct irq_expect *exp, bool timedout);

A call to init_irq_expect() will allocate an opaque token to be used with the other two functions; it should be passed the interrupt number of interest and the same dev_id value as was used to allocate the interrupt initially. When the driver initiates an action which should result in a device interrupt, it should make a call to expect_irq(). When the operation is completed, unexpect_irq() should be called, with timedout indicating whether the operation timed out (the interrupt did not arrive). Note that it's not necessary for the driver to free the struct irq_expect structure; that will happen automatically when the interrupt is released.

A call to expect_irq() will initiate polling on the given interrupt line, where "polling" means making an occasional call to the device's interrupt handler. Initially, that polling is quite slow. If it turns out that the device is dropping interrupts (as indicated by the timedout parameter to unexpect_irq()), the polling frequency will be increased - up to once every millisecond. Working devices should interrupt before the slow poll period passes, so the result should be no real polling at all on reliable devices. If there is a problem with interrupt delivery, though, the kernel will automatically take responsibility for poking the interrupt handler when interrupts are expected.
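
To make the sequence concrete, here is a minimal sketch of how a driver might use these calls; the dev structure, mydev_start_io(), and the completion are hypothetical, and error handling is omitted:

    struct irq_expect *exp;
    bool timedout;

    /* once, after request_irq() has succeeded */
    exp = init_irq_expect(dev->irq, dev);

    /* before each operation that should end with an interrupt */
    expect_irq(exp);
    mydev_start_io(dev);

    /* wait_for_completion_timeout() returns zero on timeout */
    timedout = !wait_for_completion_timeout(&dev->done, HZ);
    unexpect_irq(exp, timedout);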

This interface works well if the driver knows when to expect interrupts, but not all devices work that way. For hardware which can interrupt at any time, there is an "IRQ watching" API instead:

    void watch_irq(unsigned int irq, void *dev_id);

This function will begin polling of the specified interrupt line; it will also initiate tracking of interrupt delivery status. If interrupts appear to be getting lost (as indicated by an IRQ_HANDLED return from a polled call to the handler), polling will continue at a higher frequency. Otherwise, interrupt delivery will eventually be deemed reliable and polling will be turned off.
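
Presumably (this is a guess from the description, with hypothetical names) a driver for such hardware would register its handler, then ask for the line to be watched:

    err = request_irq(dev->irq, mydev_interrupt, IRQF_SHARED, "mydev", dev);
    if (!err)
        watch_irq(dev->irq, dev);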

Tejun's patch also changes the way that the kernel responds to spurious interrupts - those which no driver is interested in. Current kernels count the number of interrupts on each line for which no handler returned IRQ_HANDLED; if 99,900 out of 100,000 interrupts are spurious, the kernel loses patience, disables the interrupt line forevermore, and starts polling the line instead. There is a real cost to this action, which is why the kernel allows spurious interrupts to get to such a high proportion of the total. Once the response is triggered, there is no going back, even if the spurious interrupts were the result of a brief hardware glitch.

With the adaptive polling mechanisms put into place to support the above features, the kernel is also able to take a more flexible approach to handling of spurious interrupts. 9,900 bad interrupts out of 10,000 are now enough to cause the spurious interrupt handling mechanism to kick in; as before, it disables the interrupt and begins polling. After a period, though, the new code will reenable the interrupt line, just to see what happens. If the source of spurious interrupts has stopped, the interrupt can be used as before. If, instead, spurious interrupts are still being delivered, the line will be blocked again for a longer period of time.

There has not been a lot of discussion of this patch set so far; one comment worried that polling could cause users not to realize that there are problems in their systems. But Tejun says that this kind of response is required to get reasonably solid behavior out of flaky hardware, and nobody seems to want to challenge that claim. So it seems fairly likely that a future version of this patch will find its way into the mainline at some point.

The state of realtime Linux

By Jonathan Corbet
June 15, 2010
Since 2005, the realtime preemption project has worked to provide deterministic response times in stock Linux kernels. Over that time, though, it has come to appear that there is no guaranteed latency with regard to when all of this code will actually be merged. At LinuxTag 2010, realtime hacker Thomas Gleixner talked about the state of this patch set, what's coming, and, yes, when it might actually be merged in its entirety. Don't hold your breath.

In truth, the realtime preemption code has been going into the mainline, piece by piece, for years. Some recently-merged pieces include threaded interrupt handlers and the sleeping spinlock precursor patches. The threaded handlers make a number of driver tasks simpler (regardless of any realtime needs) by eliminating much of the need for tasklets and workqueues. They have also proved to be useful in providing support for some strange i2c-attached interrupt controller hardware. The spinlock changes do not affect the generated code (in mainline kernels), but they are useful for annotating the type of each lock.

Recent movements of code into the mainline notwithstanding, the realtime patchset isn't getting any smaller. It seems that the realtime developers have an interesting problem: the realtime kernel is a really good place to try out a wide variety of new features. So, despite the fact that code occasionally moves to the mainline, new stuff keeps getting added to the realtime tree.

This tree's attractiveness for the testing of new code comes from the fact that it tends to reveal scalability problems much more quickly than mainline kernels do. The extra preemptibility offered by this kernel comes at a cost: the price for lock contention is much higher. So the realtime tree shows scalability issues at lower levels of contention than non-realtime kernels. The important point is that the scalability bottlenecks encountered by realtime kernels are not unique to realtime; they just come sooner than the same bottlenecks will show up with the mainline. So realtime kernels can be used to look forward to the problems that the mainline kernel will be experiencing next year.

Thus, for example, realtime kernels exhibit scalability problems in the virtual filesystem layer that are otherwise only seen in big-iron torture-test labs. That makes them useful for testing features, and especially useful for testing scalability improvements. That is why code like the VFS scalability patch set currently makes its home in that tree. Eventually, most of these pieces will get merged into the mainline. Thomas says that it will all be in by the end of the year - but which year is not something he is willing to commit to.

The next patch set to move to the mainline might be Peter Zijlstra's memory management preemptibility series, which solves some long latencies in the memory management code; the current plan is to push these patches for 2.6.36. Another bit of code which might make the move is an option to force all drivers to use threaded interrupt handlers regardless of whether they explicitly request them. This option would almost certainly not be turned on for most production kernels, but it makes the testing of drivers with involuntarily threaded handlers easier.

The realtime tree also suffers from a few unsolved problems. One of them is latencies in the slab allocator, which runs with preemption disabled for long periods of time. The SLQB allocator had raised hopes for a while, but it appears that it will not be pushed for merging anytime soon. So the realtime hackers have to find a way to fix one of the existing allocators, or give up and write a slab allocator of their own. Thomas noted that there are still a few letters left in the SL?B namespace, so there might just be an SLRB in the future. That is all quite vague at this point, though; Thomas admitted that he has no idea how this problem will be resolved.

Another ongoing problem is the increasing use of per-CPU data. In throughput-oriented environments, per-CPU data increases scalability by eliminating contention between processors. But use of per-CPU data necessarily requires that preemption be disabled while the data is being manipulated; to do otherwise is to risk that the process working with that data will be preempted or moved to another processor, making a mess of things. Disabling preemption is anathema in an environment where everything is always supposed to be preemptible, though. So the realtime patch set currently puts a lock around per-CPU data accesses, eliminating the preemption problem but wrecking scalability. Here, too, a real solution has not yet been found.
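
The mainline idiom shows the constraint; in this generic example (not code from the realtime tree), get_cpu_var() disables preemption and put_cpu_var() enables it again:

    DEFINE_PER_CPU(int, my_counter);

    static void bump_counter(void)
    {
        /* Preemption is off here, so this task cannot be preempted
           or migrated while it updates this CPU's copy. */
        get_cpu_var(my_counter)++;
        put_cpu_var(my_counter);
    }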

Thomas finished with a bit of talk about testing of the realtime tree. Quite a bit of "enterprise-class" testing is done in the well-furnished labs at companies like IBM and Red Hat. At the embedded level, the Open Source Automation Development Lab has a modest testing lab of its own. But there's another interesting source of testing: the Linux audio community has been enthusiastic in its use of the realtime kernel and has helped find a number of issues. There's also a growing set of tools maintained in the rt-tests collection.

All told, the picture painted by Thomas was one of a healthy project, even if we still don't know when it will all get into the mainline. Even in the realtime world, there are things we simply have to wait for.

ARM and defconfig files

By Jake Edge
June 16, 2010

The kernel tree for the ARM architecture is large and fairly complicated. Because of the large number of ARM system-on-chip (SoC) variants, as well as different versions of the ARM CPU itself, there is something of a combinatorial explosion occurring in the architecture tree. That, in turn, led to something of an explosion from Linus Torvalds, who has grown tired of "pointless churn" in the tree.

A pull request from Daniel Walker for some updates to arch/arm/mach-msm was the proximate cause of Torvalds's unhappiness, but it goes deeper than that. He responded to Walker's request by pointing out a problem he sees with ARM:

There's something wrong with ARM development. The amount of pure noise in the patches is incredibly annoying. Right now, ARM is already (despite me not reacting to some of the flood) 55% of all arch/ changes since 2.6.34, and it's all pointless churn in
	arch/arm/configs/
	arch/arm/mach-xyz
	arch/arm/plat-blah
and at a certain point in the merge window I simply could not find it in me to care about it any more.

He goes on to note that the majority of the diffs are "mind-deadening" because they aren't sensibly readable by humans. He further analyzes the problem by comparing the sizes of the x86 and ARM trees, with the latter being some 800K lines of "code"—roughly three times the size of x86. Of that, 200K lines are default config (i.e. defconfig) files for 170+ different SoCs. To Torvalds, those files are "pure garbage".

In fact, he is "actually considering just getting rid of all the 'defconfig' files entirely". Each of those files represents the configuration choices someone made when building a kernel for a specific ARM SoC, but keeping them around is just a waste, he said:

And I suspect that it really is best to just remove the existing defconfig files. People can see them in the history to pick up what the heck they did, but no way will any sane model ever look even _remotely_ like them, so they really aren't a useful basis for going forward.

Another problem that Torvalds identified is the proliferation of platform-specific drivers, which could very likely be combined into shared drivers in the drivers/ tree or coalesced in other ways. Basically, "we need somebody who cares, and doesn't just mindlessly aggregate all the crud". Ben Dooks agreed that there is a problem, but said that "many of the big company players have yet to really see the necessity" of combining drivers. He also noted that at least some of the defconfig files were being used in automated build testing, but did agree that there are older defconfigs that should be culled.

Dooks also had a longer description of the problems that ARM maintainers have in trying to support so many different SoCs, while also trying to reduce the size and complexity of the sub-architecture trees. Essentially, the maintainers are swamped and "until it hits these big companies in the pocket it [is] very difficult to get them to actually pay" for cleaning up the ARM tree and keeping it clean in the future.

Because Torvalds said that he was planning to remove the ARM (and other) defconfig files, ARM maintainer Russell King posted a warning to the linux-arm-kernel mailing list:

Linus doesn't appear to be listening to reason, so I see now this as a fait accompli. It'll [apparently] happen at the next merge window.

So, don't send anything which contains a defconfig file or updates to it. It's pointless.

That set off a separate discussion on that mailing list—King's and others' attempts to redirect it back to linux-kernel notwithstanding—about ways to reduce the amount of mostly redundant information carried around in the defconfig files. Ryan Mallon is in favor of proactively eliminating some defconfigs, while others discussed various ways to only keep the deltas between the config files for various SoCs.

Based on Torvalds's comments on linux-kernel, some kind of delta scheme is unlikely to fly. His main complaint is that the defconfig files are neither readable nor writable by humans, as they are generated by various tools. He made some specific suggestions of alternatives that would still allow the generation of those config files, using Kconfig files that are usable by humans.

Reducing the number of defconfigs, as Mallon suggested, may be helpful, but King at least is convinced that it doesn't go far enough. He believes that Torvalds has already made up his mind to remove the defconfigs in the next merge window and that the ARM community had better be ready with something else:

I believe the only acceptable solution is to get an [alternative] method in place - no matter what it is - and remove all but one of the defconfig files from the mainline kernel. _And_, most importantly, kautobuild needs to be fixed so that we still get build coverage.

The loss of kautobuild is a major concern here, and I believe it trumps everything else for the next merge window. Kautobuild is an extremely important resource that we simply can not afford to lose.

The discussion ranged from possible solutions to the immediate defconfig problem to the larger issue of reducing the duplication throughout the ARM trees. There is an effort underway to produce a single kernel that would support multiple ARM platforms for Ubuntu 10.10, which will likely help consolidate various sub-architectures. Given that Canonical is working closely with the newly formed Linaro organization—founded to simplify ARM Linux—there is reason to believe that things will get better.

Meanwhile, though, back on linux-kernel, Torvalds started a new thread to flesh out his ideas for a hierarchical collection of Kconfig files that would essentially take the place of the defconfigs. After some back and forth, Torvalds gave an example of exactly what he is suggesting:

Let's say that I want a x86 configuration that has USB enabled. I can basically _ask_ the Kconfig machinery to generate that with something like this:

- create a "Mykconfig" file:

	config MYCONFIG
		bool
		default y
		select USB

	source arch/x86/Kconfig
and then I just do
	KBUILD_KCONFIG=Mykconfig make allnoconfig
and look what appears in the .config file.

He goes on to describe a theoretical Kconfig.omap3_evm file that sets the specific requirements for that platform and then includes Kconfig.omap3. That file sets up whatever is required for the OMAP3 platform and includes Kconfig.arm. That would allow developers or tools like kautobuild to generate the necessary config files without having to carry them around in the kernel tree. Those Kconfig files would also be much more readable and any diffs would be understandable, which is important to Torvalds.
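
No such file exists yet, but a Kconfig.omap3_evm along those lines might look something like this sketch (the configuration symbol is invented for illustration; MACH_OMAP3EVM is the existing board option):

    config OMAP3_EVM_DEFAULTS
        bool
        default y
        select MACH_OMAP3EVM
        select USB

    source Kconfig.omap3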

That solves a significant subset of the problem, but there is still a fly in the ointment: dependencies. In Torvalds's example, CONFIG_USB requires CONFIG_USB_SUPPORT, so that would need to be added to Mykconfig. Not accounting for dependencies will get you a kernel that doesn't build or, worse yet, won't run correctly. There are a number of possible solutions to the dependency problem, though, ranging from Catalin Marinas's patch to track unmet dependencies of options used in select statements to Vegard Nossum's summer of code project to add a satisfiability solver into the configuration editors (menuconfig, etc.).
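
In the USB example above, that would mean selecting the dependency explicitly, something like this (illustrative only):

    config MYCONFIG
        bool
        default y
        select USB_SUPPORT
        select USB

    source arch/x86/Kconfig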

It certainly seems likely that defconfig files will be removed from the kernel tree in the 2.6.36 merge window. Whether there is another solution—based on Torvalds's ideas or something else—to replace them is really up to the architecture teams, as Torvalds is perfectly happy to move on without them. ARM, PowerPC, MIPS, and others all have lots of defconfig files now, but unless he changes his mind, they won't in a few short months. They can keep maintaining those files in a separate repository somewhere, or find an acceptable method to generate them. While it may be painful in the short term, it will reduce the size of the kernel tree and make Torvalds's job easier, both of which are worth striving for.

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Benchmarks and bugs

Page editor: Jonathan Corbet