Kernel development [LWN.net]

Kernel release status

The current stable 2.6 kernel is 2.6.11.8, released on April 29.

The current 2.6 prepatch remains 2.6.12-rc3.

Linus's git repository contains a number of new "sparse" annotations, a CIFS update, various architecture updates, resource limits for niceness and realtime scheduling (see below), a new valid_signal() function (for testing signal numbers), a JFS update, some networking tweaks, and lots of fixes.

The current -mm tree is 2.6.12-rc3-mm2. Recent changes to -mm include a number of new git trees, a cpufreq update, a new /proc/zoneinfo file, some preparatory patches for Xen, and some ext3 latency reduction work.

Quote of the week

We're still miles away from 2.6.12.

-- Andrew Morton

A web interface to git

Further evidence that the kernel source code management situation is slowly stabilizing: there is now a web interface to the kernel.org git repositories. Most people, perhaps, will be interested in Linus's tree, where the latest patches merged into the mainline can be viewed, but there are several developer trees available as well. (Thanks to Steven Cole.)

Audio latency - resource limits win

The long debate on how to provide preferential scheduling for audio applications would appear to have come to an end. The realtime Linux security module has not been merged; instead, the mainline now includes a version of the rlimit patch. This is not the outcome which was most favored by the audio development community, but it will still be useful for them.

The patch creates two new resource limits. RLIMIT_NICE controls how far a process may raise its own priority (that is, lower its "niceness") in the normal timesharing scheduler. The limit has a range of 0..39, with 39 corresponding to an internal niceness value of -20, the highest priority. The difference between the resource limit value and the actual niceness values may seem confusing, but apparently it's unavoidable: the Single UNIX Specification requires that resource limits be unsigned values.

The other limit is RLIMIT_RTPRIO; it can have a range of 0..100. If it is nonzero, the process is empowered to use the realtime scheduling classes up to the indicated priority.

The problem with this approach, from the point of view of the audio community, is that it is not currently supported by any distribution. It is easy to set up PAM to give expanded limits to specific users or groups - once PAM has been patched to understand the new limits. Shells, too, must be patched before their ulimit commands can be used to change the limits. So it will be some time before an "out of the box" Linux system will be able to take advantage of this new capability.

In the long term, however, the rlimit patch looks like a minimally invasive way of making realtime scheduling available, in a relatively safe way, to ordinary users. Anybody wanting to play with the new mechanism before their distribution catches up can find instructions and patches on this web page.

API change: synchronize_kernel() deprecated

The read-copy-update mechanism works with the fundamental assumption that, if no pointer to an RCU-protected data structure exists, there will be no references to that structure after every processor on the system has scheduled at least once. This assumption works because the rules require that accesses to RCU-protected data structures be atomic; scheduling while holding such a reference is not legal. When RCU was added to the kernel, it brought with it a function called synchronize_kernel() which would wait for every processor to schedule. Since it seemed that this capability could be useful outside of RCU itself, synchronize_kernel() was exported to the world.

A quick grep of the 2.6.12-rc kernel shows a fair number of synchronize_kernel() calls. The module loader uses it to let things calm down when an attempted load fails. The AT keyboard driver calls it at disconnect time to ensure that no processor is still trying to work with the device. The kernel profiling code uses synchronize_kernel() to ensure that all processors notice the unregistration of its timer hook. And so on.

The external uses of synchronize_kernel() have reached a point where they are putting extra demands on the RCU code. RCU, after all, does not really have to wait until every processor has scheduled; the important constraint, instead, is that every processor running within rcu_read_lock() exits from the critical section. This distinction has become more important as the kernel developers have sought ways to make RCU more compatible with the low-latency work.

So, as of 2.6.12-rc4, synchronize_kernel() will be officially deprecated. Its replacements will be synchronize_sched(), which retains the current "wait for all processors to schedule" semantics, and synchronize_rcu(), which is only guaranteed to wait until any processors executing within rcu_read_lock() critical sections have exited those sections. Most external users probably need to be switched over to synchronize_sched(). The comments suggest that a synchronize_irq() variant is also envisioned, but it has not been added as of this writing.

One other significant change: unlike synchronize_kernel(), the two replacements are exported GPL-only.

Defending against fork bombs

Standard wisdom says that the proper defense against fork bomb attacks (where a simple script forks children until the system chokes under the load) is to use resource limits. Put a cap on the number of processes which can be created, and the problem goes away. In reality it's not quite so simple; the limit can be softened by logging in multiple times. And, in any case, some people feel that the system should not collapse when faced with such an attack. A Linux system, it is said, should not be so easy to bring down in its default configuration.

The last defense against fork bombs is typically the out-of-memory (OOM) killer. As the system fills up with processes, it will eventually run out of memory and, in its desperation, start looking for processes to kill. The OOM killer has a set of heuristics which attempt to choose the "best" process to kill. These rules help the system to avoid (sometimes) killing processes which are vital to the continued operation of the system. They are not particularly helpful in dealing with fork bombs, however.

Coywolf Qi Hunt has posted a patch which tries to do a better job of defending against fork bombs in the OOM killer. It works by extending the task structure to keep better track of a process's "biological" parent and children. These lists are maintained separately from the regular process hierarchy pointers, and are not actually used during normal system operation. They are, in other words, pure overhead most of the time.

Things change, however, when an out-of-memory situation hits. When the OOM killer starts up, it will select its first victim in the usual way. When a second process is chosen for an untimely death, however, the new lists come into play. For both the current and previous victim, the OOM killer will traverse the "biological parent" pointers to create a path through the process hierarchy. Using those paths, the code can select the "least common ancestor," the lowest process which is an ancestor to both victims. Then, rather than killing the second chosen victim directly, the OOM killer goes after the ancestor - and all of its children. If the OOM situation persists, the killer should be able to quickly work its way up the process hierarchy until it finds (and eliminates) the process responsible for the whole mess.

Coywolf has a set of test cases and a system he is willing to run them on; for all but the nastiest of the three, the patched system was able to put an end to the fork bomb attack without any ill effects beyond a temporary slowdown. In the worst case, the system still recovered, but with some collateral damage. The patch adds some significant overhead (one pointer and two list_head structures) to each process in the system, so it may encounter some resistance - most systems will pay that overhead, but never actually need to run the OOM killer. But, for systems which are exposed to that sort of attack, this patch could be a useful last line of defense.

The Philips webcam driver - again

The 2.6.12-rc kernels include, among many other things, the long-awaited return of the Philips web camera driver. This driver, remember, was removed at the original author's request; that author (known as "Nemosoft Unv") objected to the removal of a special-purpose hook which allowed a non-free decompression module to be loaded into the kernel. After the removal, Luc Saillard took over the driver, with the goal of getting it back into the mainline. As part of that process, he reverse engineered the image decompression code and included it in the GPL-licensed module. It would appear that this episode has led to a good result: the Philips driver is back, and more free than before.

Nemosoft has recently resurfaced, however, to make the claim that things may not be quite as good as they seem. According to Nemosoft, no real reverse engineering job was done. Instead:

In case you hadn't noticed, that code has been reverse compiled (I would not even call it "reverse engineered"), and is simply illegal. Maybe not in every country, but certainly in some. There are still some intellectual property rights being violated here, you know, and I'm surprised at the contempt you and Linux kernel maintainers show in this regard for a few lines of the law.

Mr. Saillard has been silent on how he performed the reverse engineering task. A look at the code (example - pwc-kiara.c) is somewhat unenlightening - the decompression code consists mostly of a set of tables filled with mysterious numbers. It is hard to imagine how those tables could be created in any way other than extracting them from the binary decompressor module.

If the code was truly decompiled and relicensed, there could be a copyright issue here. On the other hand, the tables used for decompression will be hard to protect if they are truly the only way to interpret images produced by the camera. Alan Cox (who forwarded the PWC patches for merging) acknowledges that there could be an issue with the decompression code, but he is not overly worried about it:

The legal position on reverse engineering is in general fairly clear. What you describe might not be. If so then we need to find someone who hasn't read the code to rewrite it from the algorithm description of the current code. Shouldn't take more than a week.

Alan also points out an issue others have raised: by Nemosoft's admission, the non-disclosure agreement which forced the decompression code to be proprietary ran out some time ago. Nemosoft could thus resolve the licensing issues by simply releasing the decompression code under a free license.

Patches and updates

Andrew Morton: 2.6.12-rc3-mm1

Andrew Morton: 2.6.12-rc3-mm2

Greg KH: Linux 2.6.11.8

Con Kolivas: 2.6.11-ck6

Arnd Bergmann: ppc64: Introduce BPA platform

Arnd Bergmann: ppc64: add BPA platform type

Arnd Bergmann: ppc64: Add SPU file system

Arnd Bergmann: ppc64: Add driver for BPA iommu

Arnd Bergmann: ppc64: Add driver for BPA interrupt controllers

john stultz: new timeofday arch specific hooks (v A4)

john stultz: [RFC][PATCH (3/4)] new timeofday arch specific timesource drivers (v A4)

Hyok S. Choi: release of the ARM MPU/noMMU support 2.6.11.8-hsc0

Eric Piel: ARTiS, an asymmetric real-time scheduler - x86

Eric Piel: ARTiS, an asymmetric real-time scheduler - IA-64

Benjamin LaHaise: unify semaphore implementations

john stultz: new timeofday core subsystem (v A4)

john stultz: new timeofday vsyscall proof of concept (v A4)

Nishanth Aravamudan: new timeofday-based soft-timer subsystem

Dinakar Guniguntala: Dynamic sched domains (v0.5)

Eric Piel: ARTiS, an asymmetric real-time scheduler

Tejun Heo: gitkdiff 0.1

Paul Mackerras: Quick git command reference

Matt Mackall: Mercurial v0.4c

Matt Mackall: Mercurial v0.4d

Pavel Machek: kernel hacker's git howto

Greg KH: kernel maintainer's HOWTO for quilt and -mm

Mingming Cao: Adding multiple block allocation to current ext3

Robert Love: latest inotify.

Robert Love: 2.6-mm inotify update

Miklos Szeredi: [PATCH] unprivileged mount/umount

Christoph Lameter: Page Fault Scalability V20: Overview

Christoph Lameter: Page Fault Scalability V20: Avoid spurious page faults

Christoph Lameter: Page Fault Scalability V20: Avoid lock for anonymous write fault

Christoph Lameter: Page Fault Scalability V20: Avoid first acquisition of lock

Rik van Riel: non resident page management, #4

David Gibson: Hugepage consolidation

Rik van Riel: cleanup of use-once

Mel Gorman: Avoiding external fragmentation with a placement policy Version 10

Evgeniy Polyakov: Asynchronous IPsec processing.

Kristian Sørensen: Umbrella is now feature complete! (v0.7 released)

Coywolf Qi Hunt: oom lca -- fork bombing killer v2.2

Aneesh Kumar: OpenSSI 1.9.0 released

Dave Kleikamp: jfsutils-1.1.8

Erik van Konijnenburg: yaird 0.0.6, a mkinitrd based on hotplug concepts
