
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.29-rc6, released on February 22. The list of changes is still pretty long, but, with luck, the problems are getting fixed. See the announcement for the short-form changelog, or see the full changelog for all the details.

As of this writing, a few dozen post-rc6 patches have found their way into the mainline repository. They include more fixes, but also new drivers for Atheros L1C gigabit Ethernet adapters and FireDTV IEEE1394 adapters.

The current stable 2.6 kernel is 2.6.28.7, released (without announcement) on February 20. It contains the usual long list of fixes, many of which are for the ext4 filesystem; the changelog has the details. 2.6.27.19 was also released on the 20th without an announcement; see the changelog for the list of patches included there.


Kernel development news

Quotes of the week

Especially for developers who are just starting out with submitting patches to a project, it's rare that a patch is of sufficiently high quality that it can be applied directly into the repository without needing fixups of one kind or another. The patch might not have the right coding style compared to the surrounding code, or it might be fundamentally buggy because the patch submitter didn't understand the code completely. Indeed, more often than not, when someone submits a patch to me, it is more useful for indicating the location of the bug more than anything else, and I often have to completely rewrite the patch before it enters into the e2fsprogs mainline repository.
-- Ted Ts'o

I personally find it reprehensible that the attitude that network communications ought to be exempt from access controls is so pervasive, but I bend to the will of the people.
-- Casey Schaufler

A better approach would be to design simple, robust kernel interfaces which make sense and which aren't made all complex by putting the user interface in kernel space. And to maintain corresponding userspace tools which manipulate and present the IO from those kernel interfaces.

But we don't do that, because userspace is hard, because we don't have a delivery process. But nobody has even tried!

-- Andrew Morton


Speeding up the page allocator

By Jonathan Corbet
February 25, 2009
It is a rare kernel operation that does not involve the allocation and freeing of memory. Beyond all of the memory-management requirements that would normally come with a complex system, kernel code must be written with extremely tight stack limits in mind. As a result, variables which would be declared as automatic (stack) variables in user-space code require dynamic allocation in the kernel. So the efficiency of the memory management subsystem has a pronounced effect on the performance of the system as a whole. That is why the kernel currently has three slab-level allocators (the original slab allocator, SLOB, and SLUB), with another one (SLQB) waiting for the 2.6.30 merge window to open. Thus far, nobody has been able to create a single slab allocator which provides the best performance in all situations, and the stakes are high enough to make it worthwhile to keep trying.
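
As a concrete illustration of the stack constraint, consider a function that needs a few kilobytes of scratch space. The sketch below (the function and buffer are invented for illustration) shows the usual kernel idiom: allocate dynamically with kmalloc() rather than declaring a large automatic array.

    #include <linux/errno.h>
    #include <linux/slab.h>

    static int process_request(const char *data, size_t len)
    {
            /*
             * char buf[4096]; -- an automatic array this large could
             * overflow the kernel's small, fixed-size stack.
             */
            char *buf = kmalloc(4096, GFP_KERNEL);

            if (!buf)
                    return -ENOMEM;
            /* ... do the real work in buf ... */
            kfree(buf);
            return 0;
    }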

While many kernel memory allocations are done at the slab level (using kmem_cache_alloc() or kmalloc()), there is another layer of memory management below the slab allocators. In the end, all dynamic memory management comes down to the page allocator, which hands out memory in units of full pages. The page allocator must manage memory without allowing it to become overly fragmented; it also must deal with details like CPU and NUMA node affinity, DMA accessibility, and high memory. It also clearly needs to be fast; if it is slowing things down, there is little that the higher levels can do to make things better. So one might do well to be concerned when memory management hacker Mel Gorman writes:

The complexity of the page allocator has been increasing for some time and it has now reached the point where the SLUB allocator is doing strange tricks to avoid the page allocator. This is obviously bad as it may encourage other subsystems to try avoiding the page allocator as well.
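
To make the layering concrete before looking at Mel's response, here is a minimal sketch (the function is invented; the allocator calls are the real interfaces) of the two levels: a slab-level allocation, and a direct page-level request of the sort every slab allocator ultimately depends on.

    #include <linux/errno.h>
    #include <linux/gfp.h>
    #include <linux/slab.h>

    static int layering_demo(void)
    {
            void *obj;
            struct page *pages;
            int ret = 0;

            /* Slab level: a small object, carved by the slab allocator
             * out of pages it obtained from the page allocator. */
            obj = kmalloc(128, GFP_KERNEL);

            /* Page level: an order-2 request - 2^2 = 4 contiguous
             * pages - handed out directly by the page allocator. */
            pages = alloc_pages(GFP_KERNEL, 2);

            if (!obj || !pages)
                    ret = -ENOMEM;

            if (pages)
                    __free_pages(pages, 2);
            kfree(obj);             /* kfree(NULL) is a safe no-op */
            return ret;
    }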

As might be expected, Mel has come up with a set of patches designed to speed up the page allocator and do away with the temptation to work around it. The result appears to be a significant cleaning-up of the code and a real improvement in performance; it also shows the kind of work which is necessary to keep this sort of vital subsystem in top shape.

Mel's 20-part patch set (linked with the quote, above) attacks the problem in a number of ways. Many of its changes are small tweaks; for example, the core page allocation function (alloc_pages_node()) includes the following test:

    if (unlikely(order >= MAX_ORDER))
            return NULL;

But, as Mel puts it, no proper user of the page allocator should be allocating something larger than MAX_ORDER in any case. So his patch set removes this test from the fast path of the allocator, replacing it with a rather more attention-getting test (VM_BUG_ON) in the slow path. The fast allocation path gets a little faster, and misuse of the interface should eventually be caught (and complained about) anyway.
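
Schematically, the change has this shape (a simplified sketch, not the literal patch):

    /* Before: every fast-path allocation pays for this branch. */
    if (unlikely(order >= MAX_ORDER))
            return NULL;

    /*
     * After: the fast path trusts its callers; the slow path, which
     * only runs when the free lists must be refilled, complains
     * loudly about misuse instead.
     */
    VM_BUG_ON(order >= MAX_ORDER);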

Then, there is the little function gfp_zone(), which takes the flags passed to the allocation request and decides which memory zone to try to allocate from. Different requests must be satisfied from different regions of memory, depending on factors like whether the memory will be used for DMA, whether high memory is acceptable, or whether the memory can be relocated if needed for defragmentation purposes. The current code accomplishes this test with a series of four if tests, but lots of jumps can be expensive in fast-path code. So Mel's patch replaces the tests with a table lookup.
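
The idea looks something like the following sketch (the function names, the mask, and the table are invented, and the real patch encodes its table differently): trade a chain of conditional branches for a single indexed load.

    /* Old approach, schematically: one branch per zone modifier. */
    static enum zone_type gfp_zone_branches(gfp_t flags)
    {
            if (flags & __GFP_DMA)
                    return ZONE_DMA;
            if (flags & __GFP_DMA32)
                    return ZONE_DMA32;
            if (flags & __GFP_MOVABLE)
                    return ZONE_MOVABLE;    /* simplified */
            if (flags & __GFP_HIGHMEM)
                    return ZONE_HIGHMEM;
            return ZONE_NORMAL;
    }

    /* Illustrative mask covering the zone-selection GFP bits. */
    #define ZONE_GFP_MASK (__GFP_DMA | __GFP_DMA32 | \
                           __GFP_HIGHMEM | __GFP_MOVABLE)

    /* New approach, schematically: those bits index a small table
     * computed once at initialization, replacing the branches. */
    static enum zone_type zone_lookup[ZONE_GFP_MASK + 1];

    static enum zone_type gfp_zone_table(gfp_t flags)
    {
            return zone_lookup[flags & ZONE_GFP_MASK];
    }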

There are a number of other changes along these lines - seeming micro-optimizations that one would not normally bother with. But, in fast-path code deep within the system, this level of optimization can be worth doing. The patch set also reorganizes things to make the fast path more explicit and contiguous; that, too, can speed things up, but it also helps ensure that developers know when they are working with performance-critical code.

The change which provoked the most discussion, though, was the removal of the distinction between hot and cold pages. This feature, merged for 2.5.45, attempts to track which pages are most likely to be present in the processor's caches. If the memory allocator can give cache-warm pages to requesters, memory performance should improve. But, notes Mel, it turns out that very few pages are being freed as "cold," and that, in general, the decisions on whether to tag specific pages as being hot or cold are questionable. This feature adds some complexity to the page allocator and doesn't seem to improve performance, so Mel decided to take it out. After running some benchmarks, though, he concluded that, in fact, he has no idea whether the feature helps or not. So the second version of the patch has left out the hot/cold removal, but this topic will be revisited in the future.
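
The mechanism at issue lives in the per-CPU free lists; in rough terms (a hypothetical simplification of the 2.6.28-era code, with an invented function name), it works like this:

    /*
     * Pages freed as "hot" go to the head of the CPU's free list and
     * are handed out first, on the theory that they are still in that
     * CPU's cache; pages freed as "cold" go to the tail.
     */
    static void free_to_pcp(struct per_cpu_pages *pcp,
                            struct page *page, int cold)
    {
            if (cold)
                    list_add_tail(&page->lru, &pcp->list);
            else
                    list_add(&page->lru, &pcp->list);
            pcp->count++;
    }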

Mel claims some good results:

Running all of these through a profiler shows me the cost of page allocation and freeing is reduced by a nice amount without drastically altering how the allocator actually works. Excluding the cost of zeroing pages, the cost of allocation is reduced by 25% and the cost of freeing by 12%. Again excluding zeroing a page, much of the remaining cost is due to counters, debugging checks and interrupt disabling. Of course when a page has to be zeroed, the dominant cost of a page allocation is zeroing it.

A number of standard user-space benchmarks also show improvements with this patch set. The reviews are generally good, so the chances are that these changes could avoid the lengthy delays that characterize memory management patches and head for the mainline in the relatively near future. Then there should be no excuse for trying to avoid the page allocator.


Checkpoint/restart tries to head towards the mainline

By Jake Edge
February 25, 2009

In kernel development, there is always tension between the needs of a new feature and the needs of the kernel as a whole. Projects generally want to get their code merged as early as possible, for a variety of reasons, while the rest of the kernel community needs to be comfortable that the feature is sensible, desirable, and, perhaps most importantly, maintainable. The current push for inclusion of a feature to checkpoint and restart processes highlights this tension.

In late January, Oren Laadan posted the latest version of his kernel-based checkpoint and restart code with the notation: "Aiming for -mm". There are many possible uses for checkpoints, but checkpointing is an extremely complex problem. Laadan's current version is quite minimal, implementing only a fairly small subset of the features envisioned, but he would like to get the kind of review and testing that goes along with pushing it towards the mainline.

After two weeks without much in the way of comments, another proponent, Dave Hansen, asked what, if anything, was holding the patchset back from -mm inclusion. Andrew Morton replied that he had raised some concerns which were "inconclusively waffled at" a few months back. Morton's opinion carries a fair amount of weight—not least because he runs the targeted tree. He is looking to the future and trying to ensure that the patches make sense:

I am concerned that this implementation is a bit of a toy, and that we don't know what a sufficiently complete implementation will look like. There is a risk that if we merge the toy we either:

a) end up having to merge unacceptably-expensive-to-maintain code to make it a non-toy or

b) decide not to merge the unacceptably-expensive-to-maintain code, leaving us with a toy or

c) simply cannot work out how to implement the missing functionality.

Morton asked for answers to several questions regarding what features are available in the current implementation, as well as information on what needs to be added. He also asked for indications that Laadan and Hansen had some thoughts on the design for required, but not yet implemented, features. In short, he wants to avoid any of the scenarios he outlined. In response to further questions from Ingo Molnar, Hansen outlined some of the shortcomings of the current implementation:

Right now, it is good for very little. An app has to basically be either specifically designed to work, or be pretty puny in its capabilities. Any fds that are open can only be restored if a simple open();lseek(); would have been sufficient to get it back into a good state. The process must be single-threaded. Shared memory, hugetlbfs, VM_NONLINEAR are not supported.
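
In other words, a file descriptor is restorable only if its state can be recreated by reopening the file and seeking to the saved position. A minimal userspace sketch of that idea (the structure and function are invented for illustration):

    #include <fcntl.h>
    #include <unistd.h>

    struct saved_fd {
            const char *path;   /* path recorded at checkpoint time */
            int flags;          /* open() flags in effect */
            off_t offset;       /* file position at checkpoint time */
    };

    /* Returns the restored descriptor, or -1 on failure. */
    static int restore_fd(const struct saved_fd *s)
    {
            int fd = open(s->path, s->flags);

            if (fd < 0)
                    return -1;
            if (lseek(fd, s->offset, SEEK_SET) != s->offset) {
                    close(fd);
                    return -1;
            }
            return fd;
    }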

Hansen also had a more detailed answer to Morton's questions, which showed a lot of work still to be done. The current code only works for x86 architectures, for example, and only for basic file types, essentially just pipes and regular files. He likened the progress of checkpoint/restart to that of kernel scalability; it is a work in progress, not something that will ever be complete:

We intend to make core kernel functionality checkpointable first. We'll move outwards from there as we (and our users) deem things important, but we'll certainly never be done.

One of the main concerns is not that there is a lot still to be done, but that there may be lurking problems that either don't have solutions or can only be solved by very intrusive kernel changes. Matt Mackall looked at Hansen's list of additional features needing to be implemented and summed up the worries this way:

I think the real questions is: where are the dragons hiding? Some of these are known to be hard. And some of them are critical [for] checkpointing typical applications. If you have plans or theories for implementing all of the above, then great. But this list doesn't really give any sense of whether we should be scared of what lurks behind those doors.

There is, however, a free out-of-tree implementation of checkpoint/restart in the OpenVZ project. OpenVZ is a virtualization scheme using its own implementation of containers—different from that in more recent kernels—that supports checkpointing and migrating those containers. But it is a large patch; Morton looked at it several years ago and concluded that it would not be welcome in the mainline. Hansen sees OpenVZ as a useful example, but "with all the input from the OpenVZ folks and at least three other projects, I bet we can come up with something better".

An incremental approach to implementing checkpoints is reasonable, but Morton is concerned that by merging the current patches, the kernel developers will be committed to merging something that looks a lot like—and is as intrusive as—the OpenVZ patches. Molnar is more upbeat: he sees it as an important feature without "many long-term dragons". He does see one potential problem area in the incremental approach, though:

There is _one_ interim runtime cost: the "can we checkpoint or not" decision that the kernel has to make while the feature is not complete.

That, if this feature takes off, is just a short-term worry - as basically everything will be checkpointable in the long run.

That is one of the technical issues still to be resolved with the current patchset: how does a process programmatically determine whether it can be checkpointed? If a process has performed some action that leaves it in a state the running kernel does not know how to checkpoint, there needs to be a way to detect that. Molnar suggested overloading the LSM security checks such that performing those actions sets a one-way "not checkpointable" flag as appropriate. That flag could be checked by the process or by some other program that was interested. Overloading the LSM hooks is not completely uncontroversial, but LSM does hook the kernel in many of the right places—adding an additional call to those same places just for checkpointing is not likely to fly.
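
A rough sketch of how such a flag might be set (the task_struct field, the helper, and the hook body are all invented for illustration; this is not code from the patchset):

    /* One-way: once cleared, the flag is never set again. */
    static inline void mark_uncheckpointable(struct task_struct *tsk)
    {
            tsk->checkpointable = 0;
    }

    /*
     * Body for an existing LSM hook (socket creation, say), assuming
     * network state cannot yet be checkpointed. The operation itself
     * is still permitted; only the flag changes.
     */
    static int ckpt_socket_create(int family, int type,
                                  int protocol, int kern)
    {
            mark_uncheckpointable(current);
            return 0;
    }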

There was also some question about whether the "not checkpointable" flag needs to be one-way, since it could be cleared once the process has returned to a checkpointable state. Molnar argued that the one-way flag is desirable: "uncheckpointable functionality should be as painful as possible, to make sure it's getting fixed". Users who run into problems checkpointing their applications will then apply pressure to get the requisite state added to checkpoints. As a starting point, Hansen has posted a patch that would add a one-way flag based on the kinds of files a process had opened.

Checkpoints are a useful feature that could be used for migrating processes to different machines, protecting long-running processes against kernel crashes or upgrades, system hibernation, and more. It is a difficult problem that may never really be completely finished, and it touches a lot of core kernel code. For these reasons, caution is certainly justified, but one gets the sense that some kind of checkpoint/restart feature will eventually make its way into the mainline. Whether it is Laadan's version, something derived from OpenVZ, or some other mechanism entirely remains to be seen.


On the management of the Video4Linux subsystem tree

By Jonathan Corbet
February 24, 2009
Once upon a time, the Video4Linux (V4L) development community was seen as a discordant group which hung out in its own playpen and which had not managed to implement support for much of the available hardware. Times have changed; the V4L community is energetic and productive, disruptive flame wars have all but disappeared from the V4L mailing lists, and Linux now supports a large majority of the hardware which can be found on the market. As this community moves forward, it is reorganizing things on many fronts; among other things, its developers are working on the creation of the first true framework for video capture devices. The V4L developers are also having to look at their code management practices; in the process they are encountering a number of issues which have been faced by other subsystems as well.

The discussion started with this RFC from Hans Verkuil. Hans points out that the size of the V4L subsystem (as found under drivers/media in the kernel source) has grown significantly in recent years - it is 2.5 times larger now than it was in the 2.6.16 kernel. This growth is a sign of success: V4L has added features and support for a vast array of new hardware in this time. But it has its costs as well - that is a lot of code to maintain.

As it happens, the V4L developers make that maintenance even harder by incorporating backward compatibility into their tree. The tree run by V4L maintainer Mauro Carvalho Chehab does not support just the current mainline kernel; instead, it can be built on any kernel from 2.6.16 forward. This is not a small trick, considering that the majority of that code did not exist when 2.6.16 was released. There have been some major internal kernel API changes over that time; supporting all those kernels requires a complicated array of #ifdefs, compatibility headers, and more. It takes a lot of work to keep this compatibility structure in place. Additionally, this kind of compatibility code is not welcome in the mainline kernel, so it must all be stripped out prior to sending code upstream.
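
The compatibility glue typically keys off the kernel version. An invented example of the pattern (the handler is hypothetical, but the interrupt-handler prototype really did lose its pt_regs argument in 2.6.19):

    #include <linux/interrupt.h>
    #include <linux/version.h>

    #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 19)
    /* Kernels before 2.6.19 passed a pt_regs pointer to handlers. */
    static irqreturn_t demo_irq(int irq, void *dev_id, struct pt_regs *regs)
    #else
    static irqreturn_t demo_irq(int irq, void *dev_id)
    #endif
    {
            /* ... service the device ... */
            return IRQ_HANDLED;
    }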

The reason for this practice is relatively straightforward: the V4L developers would like to make it possible for testers to try out new drivers without forcing them to install a leading-edge mainline kernel. This is the same reasoning that the DRM developers gave at the 2008 Kernel Summit: allowing testers to build modules for older kernels makes life easier for them. And that, in turn, leads to more testing of current code. But the cost of this compatibility is high, so Hans is proposing a few changes.

One of those would be in how the subsystem tree is managed. Currently, this tree is maintained in a Mercurial repository which represents only the V4L subsystem (it is not a full kernel tree), and which contains the backward compatibility patches. This organization makes interaction with the kernel development process harder in a number of ways. Beyond the effort required to maintain backward compatibility, the separate tree makes it harder to integrate patches written against the mainline kernel, and there is no way for this tree to contain patches which affect kernel code outside of drivers/media. Life would be easier if developers could simply work against an ordinary mainline kernel tree.

So Hans suggests moving to a tree organization modeled on the techniques developed by the ALSA project. The ALSA maintainers (who also keep backward compatibility patches) use as their primary tree a clone of the mainline git repository. Backward compatibility changes are then retrofitted into a separate tree which exists just for that purpose. By working against a mainline tree, the ALSA developers interact more smoothly with the rest of the kernel development process. The down side is that creating the backward-compatible tree requires more work; a team of V4L developers would have to commit to putting time toward that goal.

And that leads, of course, to the biggest question: what is the real value of the backward compatibility work, and how far back should the project go? There seems to be little interest in dropping compatibility with older kernels altogether; the value to testers and developers both seems to be too high. But it is not clear that it is really necessary to support kernels all the way back to 2.6.16. So, asks Hans, what is the oldest kernel that the project should support?

Hans has a clear objective here: the i2c changes which were merged for 2.6.22 create a boundary beyond which backward compatibility gets significantly harder. If kernels before 2.6.22 could be dropped, a lot of backward compatibility hassles would go away. But convenience is not the only thing to bear in mind when dropping support; one must also consider whether that change will significantly reduce the number of testers who can try out the code. It would also be good to have some sort of objective policy on backward compatibility support so that older kernels could be dropped in the future without the need for extensive discussions.

The proposed policy is this: V4L backward compatibility should support the oldest kernels supported by "the three major distros" (Fedora, openSUSE, and Ubuntu). For the moment, that kernel, conveniently, happens to be 2.6.22, which will be supported by Ubuntu 7.10 until April, 2009. (Interestingly, Hans seems to have skipped over the 6.06 "Dapper Drake" release - supported until June, 2009 - which runs a bleeding-edge 2.6.15 kernel). A quick poll run by Hans suggests that there is little opposition to removing support for kernels prior to 2.6.22.

There is some, though: John Pilkington points out:

I think you should be aware that the mythtv and ATrpms communities include a significant number of people who have chosen to use the CentOS_5 series in the hope of getting systems that do not need to be reinstalled every few months. I hope you won't disappoint them.

CentOS 5 (like the RHEL5 distribution it is built from) shipped with a 2.6.18 kernel. It seems, though, that there is little sympathy for CentOS (or any other "enterprise" distribution) in the development community. Running a distribution designed to be held stable for several years and wanting the latest hardware support are seen to be contradictory goals. So it seems unlikely that the V4L tree will be managed with the needs of enterprise distributions in mind.

Thus far, no actual decisions have been made. Mauro, who as the subsystem maintainer would be expected to have a strong voice in any such decision, has not yet shown up in the discussion. Given the lack of any strong opposition to the proposals, though, it would be surprising if those proposals are not adopted in some form.


Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.29-rc6
Thomas Gleixner: 2.6.29-rc6-rt2

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

hooanon05@yahoo.co.jp: Aufs2 documents
Michael Kerrisk: man-pages-3.19 is released

Filesystems and block I/O

Memory management

Networking

Virtualization and containers

Sukadev Bhattiprolu: Container-init signal semantics

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds