
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.32-rc6, released on November 3. Linus says:

There's been a number of other nasty regressions since 2.6.31 that got fixed too (largely drivers, several of them suspend/resume related or in some cases apparently most easily triggered that way), so I'm hoping the delay resulted in a better -rc all around. And I'm obviously hopeful that we didn't introduce any major new regressions.

The short-form changelog is in the announcement, or see the full changelog for all the details.

There have been no stable kernel updates in the last week.


Quotes of the week

Unfortunately, our biggest competitors are our previous kernels, and we (were?) really good at writing really fast kernels. And most of our users who are running the competition are completely satisfied with all the features it has, so an "upgrade" that causes a slowdown does not go down well. A feature that 0.01% of people might use but causes a 0.1% slowdown for everyone else... may not actually be a good idea. Performance is a feature too, and every time we do this, we trade off a little bit of that for things most people don't need.
-- Nick Piggin

The fact is, maintainership does _not_ mean ownership. It means that you should be _responsible_ for the code, and you get credit for it, but if problems happen you do NOT "own" it. Not at all.

If you don't understand that, you shouldn't be a maintainer.

-- Linus Torvalds

It looks like the Linux kernel maintainers are frowning on the FatELF patches. Some got the idea and disagreed, some didn't seem to hear what I was saying, and some showed up just to be rude.

I didn't really expect to be walking into the buzzsaw that I did. I imagined people would discuss the merits and flaws of the idea and we'd work towards an agreeable solution that improves Linux for everyone. It sure seemed to be going that way at first. Ultimately, I got hit over the head with package management, the bane of third-party development, as a panacea for everything.

-- Ryan Gordon

If anyone wants a choice quote from me about the recent Linux holes, this is what I have to say: Linus is too busy thinking about masturabating [sic] monkeys, he doesn't have time to care about Linux security.
-- Theo de Raadt


Another null pointer exploit

By Jonathan Corbet
November 4, 2009
Back in mid-October, Earl Chew reported a null pointer crash in the kernel pipe code. Initial response to his report was somewhat slow, partly because the kernel he was running was based on 2.6.21. Earl took the time to dig through the code and identify the problem, though; it turns out to be an old vulnerability which is still present in current kernels.

What it comes down to is that there is a race condition in the pipe code. Prior to 2.6.32-rc6, the code which opens a pipe (for write-only access, in this case) looks like:

    static int
    pipe_write_open(struct inode *inode, struct file *filp)
    {
        mutex_lock(&inode->i_mutex);
        inode->i_pipe->writers++;
        mutex_unlock(&inode->i_mutex);

        return 0;
    }

The problem is that if the final close of this pipe slips in at the wrong time, inode->i_pipe may have been set to null. So this is yet another null pointer vulnerability; the rest is just a matter of writing the exploit. That exploit must face the challenge that the window of opportunity is quite short, but computers are very good at continually trying things until something works.

The fix makes the code much more careful about checking the current status of the pipe and refusing new opens if the final close has already happened. Distributors are shipping updates.
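The shape of such a check can be sketched in user space. This is a minimal illustration, not the kernel patch itself: struct demo_inode, struct demo_pipe, and demo_pipe_write_open() are hypothetical stand-ins, and a C11 atomic_flag spinlock takes the place of i_mutex. The only point being made is that the pointer is tested while the lock is held, and the open is refused if the final close has already cleared it.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct demo_pipe {
    int writers;
};

struct demo_inode {
    atomic_flag lock;        /* stands in for inode->i_mutex */
    struct demo_pipe *pipe;  /* set to NULL by the final close */
};

void demo_inode_init(struct demo_inode *inode, struct demo_pipe *pipe)
{
    atomic_flag_clear(&inode->lock);
    inode->pipe = pipe;
}

void demo_inode_lock(struct demo_inode *inode)
{
    while (atomic_flag_test_and_set_explicit(&inode->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

void demo_inode_unlock(struct demo_inode *inode)
{
    atomic_flag_clear_explicit(&inode->lock, memory_order_release);
}

/* Returns 0 on success, -ENOENT if the pipe has already gone away. */
int demo_pipe_write_open(struct demo_inode *inode)
{
    int ret = -ENOENT;

    assert(inode != NULL);
    demo_inode_lock(inode);
    if (inode->pipe != NULL) {   /* test under the lock, then dereference */
        inode->pipe->writers++;
        ret = 0;
    }
    demo_inode_unlock(inode);
    return ret;
}
```

The choice of -ENOENT as the error code is an assumption made for the sketch; what matters is that a late opener gets an error instead of dereferencing a null pointer.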

This particular bug is attracting attention because it is in the core kernel and (relatively) straightforward to trigger. But it is far from unique. A quick look at commits since 2.6.31 turns up no fewer than 34 which explicitly fix null pointer dereference bugs. Quite a few more fix things that could be null pointer bugs, and there's no telling how many more were fixed without an explicit mention in the commit title. Null pointer bugs are common, and are likely to remain so for quite some time.

What is surprising about this bug is that some distributions are still vulnerable to it. We have had the ability to keep null pointer bugs from being exploitable for some time, but certain distributions - generally of the "enterprise" variety - disable that protection by default. Sites running such distributions might want to be sure that they have the vm.mmap_min_addr knob set to a reasonable value; either that or expect to be vulnerable to more null pointer exploits in the future.
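For administrators who want to check that knob programmatically, the value can be read from /proc/sys/vm/mmap_min_addr. The following is a small sketch under stated assumptions: the helper names read_sysctl_value() and null_page_mappable() are invented for illustration, and the only policy encoded is that a value of zero leaves page zero mappable as far as this knob is concerned. What counts as a "reasonable value" is left to the distribution's documentation.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single integer from a sysctl file under /proc/sys.
 * Returns the parsed value, or -1 if the file cannot be opened or parsed.
 */
long read_sysctl_value(const char *path)
{
    FILE *f;
    long value;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* With mmap_min_addr set to zero, this knob does not stop mappings at address 0. */
int null_page_mappable(long mmap_min_addr)
{
    return mmap_min_addr == 0;
}
```

A caller would pass "/proc/sys/vm/mmap_min_addr" to read_sysctl_value() and warn if null_page_mappable() returns true for the result.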


Deprecating IDE?

The IDE drivers have been a relative backwater for a while now; most distributions have made the transition to the newer libata-based PATA driver set. But IDE remains in the kernel with no indication that it's no longer the preferred way of doing things. This can be a problem because, among other things, it encourages developers to submit new IDE-based drivers, only to be told that such drivers are no longer being accepted.

To help head off such problems, Robert Hancock has submitted a patch to mark IDE as deprecated. David Miller has accepted the patch for 2.6.33, but it might not yet actually get there. David sees a couple of things which need to be fixed first:

  • He would like to see libata create IDE-style device names (/dev/hdX) so that systems using those names in their fstab files will continue to work. One might argue that any such change is a few years late - most systems have been through the pain of that change already. At this point, mounting by label or UUID is common, so few users should be affected by the loss of old-style device names. And, as Alan Cox pointed out, udev rules can always be written to create those names if need be. So this requirement may not stick.

  • There are some IDE devices which are not yet supported in libata; the "pmac" driver (for PowerMac on-board IDE devices) is the most-cited example. Until these devices have support in libata, the IDE layer clearly cannot be deprecated or removed.

Alan has also suggested that IDE will die of its own accord, and that there is no need for additional pressure for users to move from it. The warning may go in anyway, though, just for those who don't get the message in other ways. If it prevents one developer from spending time on a new IDE driver, it's probably worthwhile.


Kernel development news

JLS: Increasing VFS scalability

By Jonathan Corbet
November 3, 2009
It can be tempting to dismiss scalability work as being of interest mainly to companies running massive server systems; most "ordinary" Linux users are not running into the kind of problems that scalability-oriented developers are trying to fix. But, of course, the truth of the matter is that those users haven't encountered those problems yet. The past work of scalability-oriented developers is what makes our current desktop and laptop systems work as well as they do; their current work will enable next year's consumer-level systems. So Nick Piggin's Japan Linux Symposium talk on virtual filesystem scalability will be of interest to anybody who anticipates using Linux in the future.

That said, one of the key constraints on scalability work is that it must not worsen performance on current systems. So Nick is taking care that his VFS work will improve scalability with no impact on single-threaded performance. Beyond that, he is aiming to improve scalability within a single filesystem - forcing system administrators to split their filesystems to get better performance would be cheating. To get there, he has identified five specific bottlenecks which must be addressed.

The first of those is files_lock; it is, he says, the easiest to fix. This global lock protects a per-superblock list of open files; it is needed by the file open and close paths. As the number of threads grows, this lock limits the scalability of filesystem-oriented workloads. The lock itself is only part of the problem; the real issue is that a single list_head is never going to be scalable in multiprocessor situations. In this case, it turns out that the kernel almost never needs to read the full list of open files; that only happens at unmount time. So turning the single list into a per-CPU list is a viable option; it eliminates the locking altogether and makes the management of the list scalable. The only tricky part is when files are removed; that requires cross-CPU access to the list.
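The data-structure change described above can be sketched as follows. This is a single-threaded, user-space illustration and not Nick's patch: all of the demo_ names are hypothetical, the CPU count is fixed, and synchronization is deliberately left out. It shows the three paths the text mentions: adding on open touches only the local list, removal may touch a list owned by another CPU, and only the unmount-style walk visits every list.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4   /* hypothetical fixed CPU count for the sketch */

struct demo_file {
    struct demo_file *next, *prev;
    int cpu;   /* which per-CPU list this file was added to */
};

/* One circular list head per CPU instead of one shared list. */
struct demo_sb {
    struct demo_file heads[DEMO_NR_CPUS];
};

void demo_sb_init(struct demo_sb *sb)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++) {
        sb->heads[i].next = sb->heads[i].prev = &sb->heads[i];
        sb->heads[i].cpu = i;
    }
}

/* Open path: touches only the list belonging to the current CPU. */
void demo_file_add(struct demo_sb *sb, struct demo_file *f, int cpu)
{
    struct demo_file *head;

    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    head = &sb->heads[cpu];
    f->cpu = cpu;
    f->next = head->next;
    f->prev = head;
    head->next->prev = f;
    head->next = f;
}

/*
 * Close path: the file may be closed on a different CPU than the one that
 * opened it, so this modifies a list that may belong to another CPU.
 * Real code would have to synchronize with that list's owner; f->cpu
 * records which one it is.
 */
void demo_file_del(struct demo_file *f)
{
    f->prev->next = f->next;
    f->next->prev = f->prev;
    f->next = f->prev = f;
}

/* Unmount path: the only place that walks every list. */
size_t demo_count_open_files(const struct demo_sb *sb)
{
    size_t n = 0;

    for (int i = 0; i < DEMO_NR_CPUS; i++)
        for (const struct demo_file *f = sb->heads[i].next;
             f != &sb->heads[i]; f = f->next)
            n++;
    return n;
}
```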

Next on the list is vfsmount_lock, which is used when finding mounts from directory entry ("dentry") structures. This lock is taken when crossing mount points in the path lookup process; it is also used at mount and unmount time. Pathname lookup is clearly a performance-critical path in the kernel, so getting rid of a global lock can only be a good thing. Nick considered using read-copy-update (RCU) for pathname lookup, but he found it to still be too slow. Part of the problem is the need to block all readers at unmount time, something that RCU cannot do on its own.

The solution is to go to per-CPU locks. Nick has introduced a variant on per-CPU locks called brlocks, or "big reader locks." These locks share the name and goal of the 2.4.x brlocks which were removed in the 2.5 development cycle, but the implementation is different. Essentially, a brlock is per-CPU for read access, but write access excludes all other users on all CPUs. Since pathname lookup is a read-only operation, brlocks will be fast where the kernel needs them to be; unmounts will be slow, but those are relatively rare operations.
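The general idea of a big reader lock can be sketched in user space with one spinlock per CPU. This is an illustration of the technique as described above, not the implementation in Nick's patch set: the demo_ names are invented, the CPU count is fixed, and C11 atomic_flag spinlocks stand in for kernel spinlocks. Note that in this sketch two readers on the same CPU slot exclude each other.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_CPUS 4   /* hypothetical fixed CPU count for the sketch */

/* One spinlock per CPU; a user-space stand-in, not the kernel's brlock. */
struct demo_brlock {
    atomic_flag cpu_lock[DEMO_NR_CPUS];
};

void demo_brlock_init(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        atomic_flag_clear(&l->cpu_lock[i]);
}

/* Readers take only the lock for the CPU they are running on. */
void demo_br_read_lock(struct demo_brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    while (atomic_flag_test_and_set_explicit(&l->cpu_lock[cpu],
                                             memory_order_acquire))
        ; /* spin */
}

void demo_br_read_unlock(struct demo_brlock *l, int cpu)
{
    atomic_flag_clear_explicit(&l->cpu_lock[cpu], memory_order_release);
}

/* Writers take every per-CPU lock, in index order, excluding all readers. */
void demo_br_write_lock(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        while (atomic_flag_test_and_set_explicit(&l->cpu_lock[i],
                                                 memory_order_acquire))
            ; /* spin */
}

void demo_br_write_unlock(struct demo_brlock *l)
{
    for (int i = DEMO_NR_CPUS - 1; i >= 0; i--)
        atomic_flag_clear_explicit(&l->cpu_lock[i], memory_order_release);
}

/* Non-blocking probe: returns 1 if the read lock for this CPU was acquired. */
int demo_br_read_trylock(struct demo_brlock *l, int cpu)
{
    return !atomic_flag_test_and_set_explicit(&l->cpu_lock[cpu],
                                              memory_order_acquire);
}
```

The asymmetry is visible in the loop counts: a reader performs one atomic operation on memory that only its own CPU normally touches, while a writer performs one per CPU. That is the trade that makes lookups cheap and unmounts expensive.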

mnt_count is a per-filesystem reference count, incremented for each open and decremented for each close. Like the global list described above, this global counter limits the scalability of opens and closes. Once again, going per-CPU is the obvious solution here, with the minor problem that a put() operation must check whether the (global) count is zero. But, as it happens, that case only comes about when the filesystem is not actually mounted, so this check need not be performed most of the time.
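A per-CPU reference count with that shortcut might look like the sketch below. It is single-threaded and hypothetical throughout (the demo_ names and the mounted flag are assumptions made for illustration); the interesting part is that the expensive sum over all CPUs is skipped whenever the filesystem is still mounted.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4   /* hypothetical fixed CPU count for the sketch */

struct demo_mount {
    long count[DEMO_NR_CPUS];  /* per-CPU deltas; one slot may go negative */
    int mounted;               /* nonzero while the filesystem is mounted */
};

/* Fast path: only the local CPU's slot is touched. */
void demo_mnt_get(struct demo_mount *m, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    m->count[cpu]++;
}

/* Slow path: the true reference count is the sum of all slots. */
long demo_mnt_total(const struct demo_mount *m)
{
    long sum = 0;

    for (int i = 0; i < DEMO_NR_CPUS; i++)
        sum += m->count[i];
    return sum;
}

/* Returns 1 if this put dropped the last reference to an unmounted filesystem. */
int demo_mnt_put(struct demo_mount *m, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    m->count[cpu]--;
    if (m->mounted)
        return 0;   /* still mounted: no need to sum the slots */
    return demo_mnt_total(m) == 0;
}
```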

The hardest one to fix is dcache_lock. Most VFS operations need it, with the sole exception of name lookup, which has used RCU for a while now. Some operations - LRU scanning and reclaim in the dentry cache in particular - can hold the lock for a long time. And the lock covers a whole bunch of different - and sometimes unknown - things. The exporting of dcache_lock to filesystems has not helped here; individual filesystems are using it for their own, not always clear, ends. So a developer trying to bring dcache_lock under control must start by trying to figure out what it is being used to protect.

Nick has done his best to split apart the various locking cases; these include the dentry cache hash, the dentry LRU list, the inode dentry alias list, various statistics, etc. Some of this stuff is moved under the protection of the per-dentry spinlock (d_lock); other things, like the dentry hash and LRU, get new locks. There are a lot of problems still, starting with lock-ordering challenges. Nick is working around some of these using non-blocking "trylock" operations, but that kind of code tends to be hard to merge. The various locking cases are still not truly independent from each other; among other things, that imposes more ordering requirements. And walking up the directory tree (trying to determine a path name from a dentry, usually) becomes much harder in the absence of a global lock.
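The trylock workaround follows a well-known pattern, sketched here in user space. The assumption is a documented lock order of "outer before inner" and a caller that already has the inner object in hand; the demo_ names and the atomic_flag spinlock are illustrative, not taken from the patch set.

```c
#include <assert.h>
#include <stdatomic.h>

struct demo_spinlock {
    atomic_flag held;
};

void demo_spin_init(struct demo_spinlock *l)
{
    atomic_flag_clear(&l->held);
}

void demo_spin_lock(struct demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin */
}

/* Returns 1 if the lock was acquired, 0 if it was already held. */
int demo_spin_trylock(struct demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void demo_spin_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Take both locks starting from the inner one, against the documented
 * order. The outer lock is only ever tried, never waited for, so this
 * path cannot deadlock against code that takes outer and then inner.
 */
void demo_lock_both_from_inner(struct demo_spinlock *outer,
                               struct demo_spinlock *inner)
{
    for (;;) {
        demo_spin_lock(inner);
        if (demo_spin_trylock(outer))
            return;               /* both locks are now held */
        demo_spin_unlock(inner);  /* back off and start over */
    }
}
```

The retry loop is what makes this kind of code hard to review: anything learned while holding the inner lock has to be revalidated after every back-off.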

In summary, cleaning up dcache_lock looks like a long and messy project. This is the lock which shows up as the worst bottleneck in some situations, though, so the work needs to be done.

Finally, there is the matter of inode_lock, which is needed by most inode operations (lookup, creation, destruction, writeback, sync, etc). As with dcache_lock, Nick has split the locking into a number of independent classes - the inode itself, the inode hash, the LRU list, and so on. Some of these classes are moved under the per-inode lock, while specific locks have been added for some cases. The per-superblock inode list has been made into a per-CPU variable, as have the counters used to generate statistics. Nick has also made the allocation of inode numbers into a per-CPU operation by assigning a range of numbers to each processor. This means that inode numbers are no longer allocated sequentially; it's not clear whether that will be a problem or not.
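Range-based inode number allocation can be sketched as below. The batch size, the demo_ names, and the starting number are all assumptions made for the example, and the shared next_batch field would need atomic or locked access in real multiprocessor code. The sketch shows why numbers stop being sequential: each CPU hands out numbers from its own private range.

```c
#include <assert.h>

#define DEMO_NR_CPUS   4      /* hypothetical fixed CPU count */
#define DEMO_INO_BATCH 1024   /* hypothetical range size per refill */

struct demo_ino_alloc {
    unsigned long next_batch;           /* shared: start of the next free range */
    unsigned long next[DEMO_NR_CPUS];   /* per-CPU: next number to hand out */
    unsigned long end[DEMO_NR_CPUS];    /* per-CPU: one past the end of the range */
};

void demo_ino_init(struct demo_ino_alloc *a)
{
    a->next_batch = 1;
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        a->next[i] = a->end[i] = 0;   /* empty range forces a refill */
}

unsigned long demo_ino_get(struct demo_ino_alloc *a, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    if (a->next[cpu] == a->end[cpu]) {
        /* Range exhausted: this is the only step touching shared state. */
        a->next[cpu] = a->next_batch;
        a->next_batch += DEMO_INO_BATCH;
        a->end[cpu] = a->next_batch;
    }
    return a->next[cpu]++;
}
```

With these assumed values, two inodes created back to back on different CPUs get numbers 1 and 1025 instead of 1 and 2.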

So what comes of all this work? Nick claims "great" open/close scalability, and "good" create/unlink scalability. He showed the results of running a microbenchmark which just did close(open(path)) repeatedly; with current mainline, he was able to get 450 operations/second on each of 64 CPUs. With the scalability patches added, that rate went up to over 300,000 operations/second - a significant improvement. Running unlink(creat(path)) shows better scalability even with two CPUs - but it does, for some reason, impose a cost on single-threaded workloads on the ia-64 architecture.
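The inner loop of such a microbenchmark is easy to reproduce. The sketch below is not Nick's benchmark program, which was not shown; demo_open_close_loop() is a hypothetical helper that simply counts successful open/close pairs. Measuring scalability would additionally require running one copy per CPU and timing the result.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Repeatedly open and close a path; returns the number of successful pairs. */
long demo_open_close_loop(const char *path, long iterations)
{
    long ok = 0;

    assert(path != NULL);
    for (long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            continue;          /* count only complete open/close pairs */
        if (close(fd) == 0)
            ok++;
    }
    return ok;
}
```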

The VFS scalability work is clearly worth doing; we'll all be glad that these problems have been ironed out someday. But there are still some messy things to clean up, so this patch set (or the gnarlier parts of it, anyway) may take a while on its way into the mainline.


Relicensing tracepoints and markers

By Jake Edge
November 4, 2009

Sharing code where it is possible is normally considered a good thing, but there are some limits to what can be shared. One of the limiting factors is often license compatibility; GPL code, in particular, often cannot be combined with code under other licenses and then distributed. The kernel is licensed under the GPL, but, since it's rare that anyone wants to combine its code with user-space applications, license incompatibilities have not been much of a problem.

There is, however, some kernel tracing infrastructure that could be shared with user-space tracing applications—likely benefiting both—if those parts of the kernel were available under more permissive licenses. Mathieu Desnoyers, who has developed much of that infrastructure, has set out to try to relicense some fairly small portions of the kernel under dual licenses, so that the code can be shared.

Essentially, Desnoyers would like to be able to use the kernel tracing infrastructure in the Linux Trace Toolkit Next Generation (LTTng) user-space tracer (UST). He describes the need as follows:

The intent is to allow the tracer code developed both on the kernel-side as part of Ftrace and LTTng and on the userspace side within UST to be shared when appropriate. As a result, we can consider userland-only solutions to user-space tracing without rewriting all the kernel tracing infrastructure from scratch.

All of the files are currently licensed under the GPLv2, but Desnoyers would like to see the C files available under a dual GPLv2/LGPLv2.1 license, and the header files under a dual GPLv2/BSD license. In order to do that—at least under the most inclusive interpretation of copyright—he must get permission for the relicensing from each contributor to those files. His message to linux-kernel listed the few remaining contributors that he had not yet heard from.

The files of interest are kernel/marker.c and kernel/tracepoint.c, along with the corresponding header files in include/linux. For 2.6.32, kernel markers have been removed, with all users converted over to use trace events, but marker.[ch] are still used by UST. The idea is that the C files could be turned into a user-space library that could be dynamically linked to applications that required it, while the header files (with an even more permissive license) could be used to add static tracepoints to any application, proprietary or free.

For the most part, the relicensing has been met with approval from the developers who responded, with several saying that they didn't think their contributions warranted requiring their approval, but they gave it anyway. Steven Rostedt ran the C file relicensing by Red Hat's legal department and was granted permission for all of the Red Hat contributions to be dual licensed under the GPLv2/LGPLv2.1. The header file GPLv2/BSD dual licensing is still pending with Red Hat, according to Desnoyers.

There are still a few developers who have not responded, but their contributions are quite small, and could be rewritten rather easily if necessary. A bigger stumbling block may be opposition from Ingo Molnar, who seems to consider the relicensing process to be legally dubious: "the legality of such relicensing is questionable as that code was never developed outside of the kernel but as part of the kernel". In addition, he has technical concerns:

But i also disagree with it on a technical level: code duplication is _bad_. Why does the code have to be duplicated in user-space like that? I'd like Linux tracing code to be in the kernel repo. Why isn't this done properly, as part of the kernel project - to make sure it all stays in sync?

So for those two grounds i cannot give my permission for this relicensing, sorry.

Whether Molnar's permission is actually required is something of an open question as his employer (Red Hat) has already given permission for his work to be relicensed. But, if there are serious concerns that lead to a "nack" from him on the relicensing patch, things get rather murky. It may be that there is a disconnect between Desnoyers and Molnar such that Desnoyers's intent is not clear. As Pierre-Marc Fournier points out, not relicensing the code leads to code duplication as well:

So the GPL code will have to be rewritten. And this will result in the exact same drawbacks you are trying to avoid by being against dual-licensing. The goal of dual-licensing is to make it possible to keep the code in sync between kernel and userspace, not the opposite!

Essentially, Desnoyers wants user-space applications to be able to contain tracepoints that are based on the same code that is used now in the kernel. Those applications may be under a variety of free or proprietary licenses, but the tracepoints are just a static instrumentation technique that could be shared. As Rostedt puts it:

But what I think is trying to be done here is to use the same types of MACROS that we have in the kernel to do tracing in userspace. That a userspace program can add their own "TRACE_EVENT" and that the headers there will create a tracepoint for them the same way we currently do in the kernel.
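To make the idea concrete, here is a deliberately simplified sketch of a header-defined static tracepoint for a user-space program. Every name in it is hypothetical: this is neither the kernel's TRACE_EVENT machinery nor the UST API, both of which are considerably more elaborate. It only shows the basic mechanism of a macro that emits a tracepoint descriptor plus a call site that does almost nothing until a probe is registered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor; not the kernel's struct tracepoint. */
struct demo_tracepoint {
    const char *name;
    int enabled;
    void (*probe)(void *data, long arg);
    void *data;
};

/* Defines demo_tp_<name> and the call-site function demo_trace_<name>(). */
#define DEMO_DEFINE_TRACE(tp)                                        \
    struct demo_tracepoint demo_tp_##tp = { #tp, 0, NULL, NULL };    \
    void demo_trace_##tp(long arg)                                   \
    {                                                                \
        if (demo_tp_##tp.enabled && demo_tp_##tp.probe)              \
            demo_tp_##tp.probe(demo_tp_##tp.data, arg);              \
    }

void demo_tracepoint_register(struct demo_tracepoint *tp,
                              void (*probe)(void *data, long arg),
                              void *data)
{
    assert(tp != NULL && probe != NULL);
    tp->probe = probe;
    tp->data = data;
    tp->enabled = 1;
}

void demo_tracepoint_unregister(struct demo_tracepoint *tp)
{
    assert(tp != NULL);
    tp->enabled = 0;
    tp->probe = NULL;
    tp->data = NULL;
}

/* An application adds its own tracepoint with one line. */
DEMO_DEFINE_TRACE(request_done)

/* Example probe: accumulates the traced argument into a caller-owned total. */
void demo_sum_probe(void *data, long arg)
{
    *(long *)data += arg;
}
```

Because the macro lives in a header that gets compiled into the application, the license of that header is what determines which applications may use it.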

Molnar has gone quiet on the topic, as has the thread, but the idea, overall, seems reasonable. While it does expose a kernel interface to user space, it doesn't tie the kernel to any ABI/API for the future. If the kernel needs to change, either the user-space libraries will change right along with it, or there will be a fork. Given that the players involved work on both the kernel and user-space sides of the problem, a fork seems somewhat unlikely, and it certainly doesn't seem like that split need happen now.


Toward a smarter OOM killer

By Jonathan Corbet
November 4, 2009
The Linux memory management code does its best to ensure that memory will always be available when some part of the system needs it. That effort notwithstanding, it is still possible for a system to reach a point where no memory is available. At that point, things can grind to a painful halt, with the only possible solution (other than rebooting the system) being to kill off processes until a sufficient amount of memory is freed up. That grim task falls to the out-of-memory (OOM) killer. Anybody who has ever had the OOM killer unleashed on a system knows that it does not always pick the best processes to kill, so it is not surprising that making the OOM killer smarter is a recurring theme in Linux virtual memory development.

Before looking at the latest attempt to improve the OOM killer, it is worth mentioning that it is possible to configure a Linux system in a way which all but guarantees that the OOM killer will never make an appearance. OOM situations are caused by the kernel's willingness to overcommit memory. As a general rule, processes only use a portion of the address space they have allocated, so limiting allocations to the total amount of RAM and swap space on the system would lead to underutilization of system memory. But that limitation can be imposed on systems which can never be allowed to go into an OOM state; simply set the vm.overcommit_memory sysctl knob to 2. Individual processes are much more likely to see allocation failures in this mode, but the system as a whole will not overcommit its resources.

Most systems will allow overcommitted memory, though, because the alternative is too limiting. Overcommit works almost always, but the threat of a day when the Firefox developers add one memory leak too many always looms. When that sad occasion comes to be, it would be nice if the OOM killer would target that leaky Firefox process instead of, say, the X server and PostgreSQL. Many attempts have been made to add smarts to the OOM killer over the years; there's also a means by which the system administrator can steer the OOM killer toward or away from specific processes. But manual configuration is only suitable for certain, relatively static workloads; for the rest, the OOM killer often proves less discriminating than one would like.

The latest attempt to fix the OOM killer comes from Hiroyuki Kamezawa. This patch makes a number of fundamental changes to the selection of OOM victims. The result is an OOM killer which is smarter in some ways, but which takes a somewhat different approach to the selection of its victims.

One of the factors that the current OOM killer takes into account, naturally, is the amount of memory being used by each process. But the measure used (mm->total_vm) is somewhat crude: it penalizes processes using a lot of shared memory and says little about how much physical memory the process is using. Hiroyuki's patch tries to move away from total_vm in most situations, looking at the actual resident set size (RSS) and possibly taking into account the amount of swap space used as well.

Figuring in swap usage is controversial. A program which is using a lot of swap is clearly putting pressure on memory, but, if that program has been mostly swapped out, killing it will not immediately free much RAM. Eventually other processes can be shifted into the newly-freed swap space, but it might make more sense to just do away with those other processes at the outset. Even so, Hiroyuki's patch, for now, will figure in swap space if specific constraints do not force the use of other criteria.

One constraint which can change the calculation is when the memory shortage is specific to low memory - the region of memory which can be directly addressed by the kernel. When a low-memory allocation is required, nothing else will do, so there is little value in killing processes which are not hogging low-memory pages. With Hiroyuki's patch, the VM subsystem tracks how much low memory each process is using as a separate statistic. If the OOM situation is caused by an attempt to allocate low memory, the OOM killer's "badness" function will focus on processes holding large amounts of low memory.
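The selection criteria described so far can be summarized in a toy scoring function. This is a sketch of the idea only: the structure, the field names, and the unweighted sum are assumptions made for illustration, and the posted patch takes more factors into account than are shown here. The total_vm field is carried along only to show that it no longer drives the choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process accounting, all in pages. */
struct demo_task {
    unsigned long total_vm;    /* address space size (the old, crude measure) */
    unsigned long rss;         /* resident set size */
    unsigned long swap;        /* pages swapped out */
    unsigned long lowmem_rss;  /* resident pages in low memory */
};

enum demo_constraint {
    DEMO_CONSTRAINT_NONE,
    DEMO_CONSTRAINT_LOWMEM
};

/* Higher score means a more attractive victim. */
unsigned long demo_badness(const struct demo_task *t, enum demo_constraint c)
{
    if (c == DEMO_CONSTRAINT_LOWMEM)
        return t->lowmem_rss;      /* only low-memory usage matters */
    return t->rss + t->swap;       /* physical footprint plus swap */
}

/* Returns the task with the highest score, or NULL if there are none. */
const struct demo_task *demo_pick_victim(const struct demo_task *tasks,
                                         size_t n, enum demo_constraint c)
{
    const struct demo_task *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (best == NULL || demo_badness(&tasks[i], c) > demo_badness(best, c))
            best = &tasks[i];
    return best;
}
```

The same set of processes can yield different victims depending on which kind of allocation failed, which is the behavior the low-memory constraint is meant to produce.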

The current OOM killer makes an attempt to target "fork bomb" processes by adding half of each child's "badness" value to its parent. A process with a lot of children will thus have a high badness and will thus come under the OOM killer's baleful gaze sooner. The problem here, of course, is that some processes legitimately have lots of children - the session manager for the user's desktop environment is a good example. Killing gnome-session is likely to free substantial amounts of memory, but the user's gratitude may be surprisingly limited.

The patch changes the fork bomb detector significantly. The new code counts only the child processes which have been running for less than a specific amount of time (five minutes in the posted patch). If one process has newborn children which make up at least 1/8 of the processes on the system, that process is deemed to be a fork bomb; it is duly rewarded with a spot at the top of the OOM killer's short list.
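The new test reduces to a count and a comparison, sketched below. The function name and the array-of-ages interface are assumptions made for the example; the five-minute window and the 1/8 threshold are the values cited for the posted patch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NEWBORN_SECS (5 * 60)  /* children younger than this are "newborn" */
#define DEMO_BOMB_DIVISOR 8         /* threshold: 1/8 of all processes */

/*
 * child_age_secs holds the running time of each child of one process.
 * Returns 1 if the newborn children make up at least 1/8 of the
 * nr_processes processes on the system.
 */
int demo_is_fork_bomb(const unsigned long *child_age_secs, size_t nr_children,
                      size_t nr_processes)
{
    size_t newborn = 0;

    for (size_t i = 0; i < nr_children; i++)
        if (child_age_secs[i] < DEMO_NEWBORN_SECS)
            newborn++;

    /* newborn / nr_processes >= 1/8, written without division */
    return nr_processes > 0 && newborn * DEMO_BOMB_DIVISOR >= nr_processes;
}
```

A session manager whose children have mostly been running for hours falls below the threshold, while a process that has just created a tenth of the system's processes does not.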

Finally, the current OOM killer tries to kill newly-created processes, while allowing long-running processes to continue. Hiroyuki feels that this approach creates a loophole for long-running processes which slowly leak memory. That web browser may have been running for a long time and is thus a high-value process, but it has been dropping memory on the floor for that long time and is also the cause of the problem. So the new code changes the calculation to look at how long it has been since the process has expanded its virtual memory size. A process which has been running for a long time, but which has not grown in that time, will look better than one which has been expanding.

There seems to be little disagreement with the idea that the OOM killer needs a rework, but not everybody is sold on this approach yet. It looks like a very large change, which makes some people nervous. It also shifts the focus of the OOM killer's attention in a significant way: the current heuristics were designed to be as unsurprising to the user as possible, while the new ones are focused more strongly on freeing RAM quickly. But, given that the existing heuristics are still clearly producing plenty of surprises, perhaps a more goal-oriented approach makes sense.

(Naturally, no article on the OOM killer is complete without a link to this 2004 comment from Andries Brouwer).


Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds