Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.36-rc3; no new prepatches have been released over the last week. Linus has returned from his trip to Brazil and resumed merging changes, so 2.6.36-rc4 can probably be expected in the near future.

Stable updates: there have been no 2.6 stable updates released over the last week.

Stable kernel 2.4.37.10

The 2.4 kernel lives - for a little while longer, at least. Willy Tarreau has just released the 2.4.37.10 update, with a small set of important fixes. This might just be the last update in this series, unless some sort of important fix comes in. "If nothing happens before September 2011, it's possible that there won't be any 2.4.37.11 at all. By that time, the 2.6 kernel will have been available for almost 8 years, this should have been enough for anyone to have a look at it. Users now have one year to migrate or to report critical bugs. I think that's an honest deal." See the announcement for the full description of his planned policy.

Quotes of the week

A lot of my conversations about union mounts with Al [Viro] go like this:

Al: "Rewrite it this way."
Val: "But then how do we get the nameidata?"
Al: "Arrrrrrrrrrrrrggggh."

-- Valerie Aurora

Well damn, good detective work. I wonder how many of those nasty random bad-page-state bug reports just got fixed. I dub thee September's "Hero of the Linux kernel"!
-- Andrew Morton (to Jiri Slaby)

Kernel developers are paid to work on features, yes. They are not paid to fix bugs for random folks who want run the latest stable kernel.
-- Ted Ts'o

Prefetching considered harmful

By Jonathan Corbet
September 8, 2010
As anybody who has read "What every programmer should know about memory" knows, performance on contemporary systems is often dominated by cache behavior. A single cache miss can cause a processor stall lasting for hundreds of cycles. The kernel employs many tricks and techniques to optimize cache behavior, but, as is often the case with low-level optimization, it turns out that some of those tricks are not as helpful as had been thought.

The kernel's linked list macros include a set of operators for iterating through a list. At the top of a list-processing loop, the macros will issue a prefetch operation for the next entry in the list. The hope is that, by the time one entry has been processed, the CPU will have fetched the following entry into its cache, avoiding a stall at the beginning of the next trip through the loop. It seems like the sort of micro-optimization which can only help, and nobody has looked closely at these prefetch operations for a long time - until now. Andi Kleen has just posted a patch removing most of those prefetches.

Andi's contention is that, on contemporary processors, the prefetch operations are actually making things worse. These processors already prefetch everything they can get their hands on, so the explicit prefetch is unlikely to help. Even if that prefetch does start a memory cycle earlier than it would have otherwise happened, list processing loops tend to be so short that the amount of additional parallelism gained is quite small. Meanwhile the prefetch operations bloat the kernel image, increase register use, and cause the compiler to generate worse code. So, he says, we are better off without them.
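
For illustration, the pattern under discussion can be sketched in user space. This is not the kernel's code: struct list_node, the two iteration macros, and the helper functions are hypothetical names chosen for this example, and the GCC/Clang builtin __builtin_prefetch() stands in for the kernel's prefetch operation. The two loops visit exactly the same nodes; the only difference is the hint issued at the top of each iteration.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modeled on the kernel's
 * list_head. All names here are illustrative, not the kernel's. */
struct list_node {
	struct list_node *next, *prev;
};

/* Old style: ask the CPU to start fetching the following node before
 * the loop body runs. __builtin_prefetch() is a GCC/Clang builtin. */
#define list_for_each_prefetch(pos, head) \
	for (pos = (head)->next; \
	     __builtin_prefetch(pos->next), pos != (head); \
	     pos = pos->next)

/* Proposed style: plain traversal, leaving prefetching to the hardware. */
#define list_for_each_plain(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)

struct item {
	int value;
	struct list_node link;
};

/* Recover the enclosing item from its embedded list node. */
#define item_of(ptr) \
	((struct item *)((char *)(ptr) - offsetof(struct item, link)))

void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

void list_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

long sum_prefetch(struct list_node *head)
{
	struct list_node *pos;
	long sum = 0;

	list_for_each_prefetch(pos, head)
		sum += item_of(pos)->value;
	return sum;
}

long sum_plain(struct list_node *head)
{
	struct list_node *pos;
	long sum = 0;

	list_for_each_plain(pos, head)
		sum += item_of(pos)->value;
	return sum;
}
```

With a loop body this short, there is very little work to overlap with the memory access that the prefetch is meant to hide, which is the heart of the argument for removing it.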

With the prefetch operations removed, Andi's kernel image ends up being 10KB smaller. It also shows no performance regressions over mainline kernels. Unless somebody else gets different results, that seems like enough to justify putting this patch into the mainline.

Too many Cc's

By Jonathan Corbet
September 8, 2010
In kernel-related email communications, the general rule is to err on the side of sending copies to too many recipients rather than too few. The volume on the mailing lists is such that one can never assume that interested people will see any specific message there, so it's customary to copy people explicitly. Recently, though, a number of kernel developers have started to complain that they are getting copies of patches that they have no interest in. Often, the selection of recipients seems entirely random.

The culprit is the get_maintainer.pl script shipped with the kernel source. This script is actually a useful tool; it will look in the MAINTAINERS file to find the people who might be interested in a specific patch. Potentially less usefully, it will also dig through the repository history and list other developers who have made changes to the files modified by a given patch. So anybody who has tweaked a given file in the recent past - possibly making only trivial changes - will be listed as somebody to copy on any other patches to the same file.

Looking at the file revision history can, indeed, be a useful way to find the "real" maintainers; the information in MAINTAINERS, by contrast, can be incomplete or outdated. But, clearly, one needs to look at what a developer has actually done in a given area; fixing a file for an API change does not mean that the developer is actively working on that code. Many developers don't perform that check and, instead, just send mail to everybody listed by the script.

The level of grumpiness caused by widely-broadcast patches seems to be on the rise. Developers who don't want to receive an irritated response to their postings might want to take a little care or, at least, use the --nogit option to get_maintainer.pl.

Kernel development news

Further notes on stable kernels

By Jonathan Corbet
September 8, 2010
Last week's article on stable kernels drew a number of comments, both public and private. Those comments suggested a couple of other ways of looking at how the stable tree works and how patches get into it. That, in turn, has inspired this follow-up look at the stable kernel process.

A certain amount of unhappiness was expressed regarding the tables of the most active stable contributors. Those tables attribute stable contributions to the individuals who write the patches. Things were done this way for two reasons: (1) the patch author is, indeed, the person who fixed the bug, and (2) that is the information which is available in the stable kernel repository. It made sense, your editor thought, to assign credit - for the fix, but, also, potentially, for the bug which required the fix - in this way.

It turns out that a number of people see stable contributions in a different way. The real credit, they say, belongs to the person who notices that a patch fixes a bug in stable kernels and ensures that the fix gets directed to the stable kernel maintainer. There are people, often those working on maintaining distributor kernels, who spend a lot of time watching the patch stream and looking for just this kind of fix. It is a lot of work, and the people who do that work certainly deserve credit for the service they are performing for the community.

Your editor would be delighted to be able to produce a table crediting this type of stable contributor. Unfortunately, the electronic trail needed to create this table simply does not exist. One could try to play games by looking at how the patch tags differ between the mainline and stable versions of the fix; there will often be an extra signoff or Cc: tag naming the person who forwarded the patch to the stable tree. But such schemes will be approximate and error-prone. If we really want to track and credit developers who flag patches for the stable tree, we almost certainly need to add a new patch tag making that credit explicit in every patch.

A related complaint came, via private mail, from a subsystem maintainer; his point of view was that the subsystem maintainers are the people doing the real legwork to get important fixes into the stable tree. A diligent maintainer will be evaluating all patches as they are merged into the subsystem tree, catching those which have stable kernel implications and directing them accordingly. He suggested a study to evaluate the percentage of stable patches coming out of each subsystem tree as a way to identify which maintainers are on top of things.

Your editor, intrigued by that idea, ran a quick study. The table below shows numbers for some selected subsystems for the 2.6.32 stable series. Since 2.6.32 is still under maintenance, it will have received patches from all of the mainline releases from 2.6.33 to the present. For each subsystem, we can look at how many patches have gone into the mainline (through 2.6.36-rc3) and how many of those went into the stable series. The results look like this:

Subsystem         Patches   Patches   Pct
               (mainline)  (stable)
fs/ext4               216        90   42%
fs/btrfs              155        42   27%
drivers/usb          1003       112   11%
arch/x86             1877       176    9%
drivers/acpi          291        24    8%
mm                    602        48    8%
kernel               1471        96    7%
sound                1369        88    6%
fs/ext3                58         3    5%
drivers/scsi         1054        51    5%
net                  2324        98    4%
drivers/input         381        13    3%
arch/powerpc          917        18    2%
drivers/media        1705        26    2%
block                 182         3    2%
arch/arm             3221        19   <1%
tools                 873         3   <1%
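
For readers who want to check the arithmetic, the percentage column appears to be the stable count divided by the mainline count, rounded to the nearest whole percent, with anything below one percent shown as "<1%". That rule is inferred from the numbers in the table rather than stated anywhere, and format_stable_pct() is a hypothetical helper written for this sketch:

```c
#include <stdio.h>

/* Format stable/mainline as a whole percentage, rounded to nearest.
 * Ratios below one percent are printed as "<1%", matching what the
 * table seems to do. The rounding rule is an inference, not a fact
 * taken from the study itself. */
void format_stable_pct(unsigned int mainline, unsigned int stable,
		       char *buf, size_t len)
{
	if (mainline == 0) {
		snprintf(buf, len, "n/a");
		return;
	}
	if (stable * 100 < mainline) {
		snprintf(buf, len, "<1%%");
		return;
	}
	/* (200 * s + m) / (2 * m) is 100 * s / m rounded to nearest. */
	snprintf(buf, len, "%u%%", (200 * stable + mainline) / (2 * mainline));
}
```

The counts in the table are small enough that the intermediate products fit comfortably in an unsigned int.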

At the upper end of the table, it is unsurprising to find the ext4 and btrfs filesystems showing a high percentage of stable patches. Both of those filesystems are undergoing heavy stabilization work at the present, so it makes sense that the bulk of the changes merged will be important fixes. The relatively small percentage of ext3 changes going into the stable tree was interesting; a quick check shows that many of the ext3 changes which did not go to stable reflect API changes in the VFS and disk quota code. That said, it also appears that a small number of fixes might have fallen through the cracks.

It's hard to draw conclusions from much of the rest of the table; different subsystems will naturally vary in the ratio of fixes to new features, so they will never have the same percentage of patches going into the stable tree. That said, there do seem to be some real variations in how many fixes are being directed to stable by the subsystem maintainers. One might, for example, wonder if a few more than 19 of the 3221 changes to the ARM architecture could have qualified for the stable tree.

This maintainer also pointed out one other aspect of the problem: the maintainer's real job is often to say "no" in the same way as with mainline patches. It seems that some developers have an expansive view of which changes are suitable for the stable tree, so they flag patches which are too large and invasive, or which do not actually fix serious bugs. In these cases, the maintainer must remove the stable tag and keep the patch from going in that direction. Needless to say, this kind of activity is even harder to track, so there will be no "stable rejections" table.

In any case, maintainers needing to turn away marginal stable patches seems like the right kind of problem to have. Bugs are annoying in the best of times, but they are doubly annoying when a fix exists but is not distributed to people who need it. The stable tree seems to be doing a good job of getting those fixes out; that makes Linux better for all of us.

Working on workqueues

By Jonathan Corbet
September 7, 2010
One of the biggest internal changes in 2.6.36 will be the adoption of concurrency-managed workqueues. The short-term goal of this work is to reduce the number of kernel threads running on the system while simultaneously increasing the concurrency of tasks submitted to workqueues. To that end, the per-workqueue kernel threads are gone, replaced by a central set of threads with names like [kworker/0:0]; workqueue tasks are then dispatched to the threads via an algorithm which tries to keep exactly one task running on each CPU at all times. The result should be better use of the CPU for workqueue tasks and less memory tied up by the workqueue machinery.

That is a worthwhile result in its own right, but it's really only a beginning. The 2.6.36 workqueue patches were deliberately designed to minimize the impact on the rest of the kernel, so they preserved the existing workqueue API. But the new code is intended to do more than replace workqueues with a cleverer implementation; it is really meant to be a general-purpose task management system for the kernel. Making full use of that capability will require changes in the calling code - and in code which does not yet use workqueues at all.

In kernels prior to 2.6.36, workqueues are created with create_workqueue() and a couple of variants. That function will, among other things, start up one or more kernel threads to handle tasks submitted to that workqueue. In 2.6.36, that interface has been preserved, but the workqueue it creates is a different beast: it has no dedicated threads and really just serves as a context for the submission of tasks. The API is considered deprecated; the proper way to create a workqueue now is with:

    struct workqueue_struct *alloc_workqueue(const char *name, unsigned int flags, int max_active);

The name parameter names the queue, but, unlike in the older implementation, it does not create threads using that name. The flags parameter selects among a number of relatively complex options on how work submitted to the queue will be executed; its value can include:

  • WQ_NON_REENTRANT: "classic" workqueues guaranteed that no task would be run by two threads simultaneously on the same CPU, but made no such guarantee across multiple CPUs. If it was necessary to ensure that a task could not be run simultaneously anywhere in the system, a single-threaded workqueue had to be used, possibly limiting concurrency more than desired. With this flag, the workqueue code will provide that systemwide guarantee while still allowing different tasks to run concurrently.

  • WQ_UNBOUND: workqueues were designed to run tasks on the CPU where they were submitted in the hope that better memory cache behavior would result. This flag turns off that behavior, allowing submitted tasks to be run on any CPU in the system. It is intended for situations where the tasks can run for a long time, to the point that it's better to let the scheduler manage their location. Currently the only user is the object processing code in the FS-Cache subsystem.

  • WQ_FREEZEABLE: this workqueue will be frozen when the system is suspended. Clearly, workqueues which can run tasks as part of the suspend/resume process should not have this flag set.

  • WQ_RESCUER: this flag marks workqueues which may be involved in memory reclaim; the workqueue code responds by ensuring that there is always a thread available to run tasks on this queue. It is used, for example, in the ATA driver code, which always needs to be able to run its I/O completion routines to be sure it can free memory.

  • WQ_HIGHPRI: tasks submitted to this workqueue will be put at the head of the queue and run (almost) immediately. Unlike ordinary tasks, high-priority tasks do not wait for the CPU to become available; they will be run right away. That means that multiple tasks submitted to a high-priority queue may contend with each other for the processor.

  • WQ_CPU_INTENSIVE: tasks on this workqueue can be expected to use a fair amount of CPU time. To keep those tasks from delaying the execution of other workqueue tasks, they will not be taken into account when the workqueue code determines whether the CPU is available or not. CPU-intensive tasks will still be delayed themselves, though, if other tasks are already making use of the CPU.

The combination of the WQ_HIGHPRI and WQ_CPU_INTENSIVE flags takes this workqueue out of the concurrency management regime entirely. Any tasks submitted to such a workqueue will simply run as soon as the CPU is available.

The final argument to alloc_workqueue() (we are still talking about alloc_workqueue(), after all) is max_active. This parameter limits the number of tasks which can be executing simultaneously from this workqueue on any given CPU. The default value (used if max_active is passed as zero) is 256, but the actual maximum is likely to be far lower, given that the workqueue code really only wants one task using the CPU at any given time. Code which requires that workqueue tasks be executed in the order in which they are submitted can use a WQ_UNBOUND workqueue with max_active set to one.
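
As a rough sketch of how calling code might use this interface, consider the fragment below. The alloc_workqueue() and destroy_workqueue() calls and the two flags are the kernel's; everything prefixed with example_ is hypothetical. So that the fragment can be compiled and exercised outside the kernel, the #else branch supplies user-space stand-ins whose flag values and structure layout are placeholders that do not match the real definitions.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/workqueue.h>
#else
/* User-space stand-ins so the sketch builds outside the kernel.
 * Flag values and the structure layout are placeholders only. */
#include <errno.h>
#include <stdlib.h>

#define WQ_UNBOUND	(1u << 0)
#define WQ_RESCUER	(1u << 1)

struct workqueue_struct {
	const char *name;
	unsigned int flags;
	int max_active;
};

static struct workqueue_struct *alloc_workqueue(const char *name,
						unsigned int flags,
						int max_active)
{
	struct workqueue_struct *wq = malloc(sizeof(*wq));

	if (wq) {
		wq->name = name;
		wq->flags = flags;
		/* Zero selects the default of 256 described in the text. */
		wq->max_active = max_active ? max_active : 256;
	}
	return wq;
}

static void destroy_workqueue(struct workqueue_struct *wq)
{
	free(wq);
}
#endif

static struct workqueue_struct *example_ordered_wq;
static struct workqueue_struct *example_reclaim_wq;

int example_wq_setup(void)
{
	/* Tasks must run in submission order: unbound, one at a time. */
	example_ordered_wq = alloc_workqueue("example_ordered", WQ_UNBOUND, 1);
	if (!example_ordered_wq)
		return -ENOMEM;

	/* Used on a memory-reclaim path: ask for a rescuer thread and
	 * accept the default max_active by passing zero. */
	example_reclaim_wq = alloc_workqueue("example_reclaim", WQ_RESCUER, 0);
	if (!example_reclaim_wq) {
		destroy_workqueue(example_ordered_wq);
		example_ordered_wq = NULL;
		return -ENOMEM;
	}
	return 0;
}

void example_wq_teardown(void)
{
	destroy_workqueue(example_reclaim_wq);
	destroy_workqueue(example_ordered_wq);
	example_reclaim_wq = NULL;
	example_ordered_wq = NULL;
}
```

Note that neither queue name will show up as a thread name; both queues would be served by the shared kworker threads.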

(Incidentally, much of the above was cribbed from Tejun Heo's in-progress document on workqueue usage.)

The long-term plan, it seems, is to convert all create_workqueue() users over to an appropriate alloc_workqueue() call; eventually create_workqueue() will be removed. That task may take a little while, though; a quick grep turns up nearly 300 call sites.

An even longer-term plan is to merge a number of other kernel threads into the new workqueue mechanism. For example, the block layer maintains a set of threads with names like flush-8:0 and bdi-default; they are charged with getting data written out to block devices. Tejun recently posted a patch to replace those threads with workqueues. This patch has made some developers a little nervous - problems with writeback could create no end of trouble when the system is under memory pressure. So it may be slow to get into the mainline, but it will probably get there eventually unless regressions turn up.

After that, there is no end of special-purpose kernel threads elsewhere in the system. Not all of them will be amenable to conversion to workqueues, but quite a few of them should be. Over time, that should translate to less system resource use, cleaner "ps" output, and a better-running system.

Another old security problem

By Jonathan Corbet
September 8, 2010
In August, a longstanding kernel security hole related to overflowing the stack area was closed. But it turns out there are other problems in this area, at least one of which has been known about since late last year. Fixes are in the works, but it's hard not to wonder whether we are handling security issues as well as we should be.

Once again, the problem was reported by Brad Spengler, who posted a short program demonstrating how easily things can be made to go wrong. The program allocates a single 128KB array, which is filled as a long C string. Then, an array of over 24,000 char * pointers is allocated, with each entry pointing to the large string. The final step is to call execv(), using this array as the arguments to the program to be run. In other words, the exploit is telling the kernel to run a program with as many huge arguments as it can.

Once upon a time, the kernel had a limit on the maximum number of pages which could be used by a new program's arguments. This limit would have prevented any problems resulting from the sort of abuse shown by Brad's program, but it was removed for 2.6.23; it seems that any sort of limit made life difficult for Google. In its place, a new check was put in which looks like this (from fs/exec.c):

	/*
	 * Limit to 1/4-th the stack size for the argv+env strings.
	 * This ensures that:
	 *  - the remaining binfmt code will not run out of stack space,
	 *  - the program will have a reasonable amount of stack left
	 *    to work from.
	 */
	rlim = current->signal->rlim;
	if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) {
		put_page(page);
		return NULL;
	}

The reasoning was clear: if the arguments cannot exceed one quarter of the allowed size for the process's stack, they cannot get completely out of control. It turns out that there's a fundamental flaw in that reasoning: the stack size may well not be subject to a limit at all. In that case, the value of the limit is -1 (all ones, in other words), and the size check becomes meaningless. The end result is that, in some situations, there is no real limit on the amount of stack space which can be consumed by arguments to exec(). And, unfortunately, the consequences are not limited to the offending process.
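
The failure is easy to see in a user-space model of the comparison. The sketch below is not kernel code: args_rejected() and demo_args_size() are hypothetical names, the 24,000 figure is the approximate count from the demonstration program, and the result assumes a 64-bit system where RLIM_INFINITY is the largest value an rlim_t can hold.

```c
#include <stdbool.h>
#include <sys/resource.h>

/* Mirror of the quoted test: the arguments are refused once they
 * exceed one quarter of the stack limit. */
bool args_rejected(rlim_t size, rlim_t stack_limit)
{
	return size > stack_limit / 4;
}

/* Rough total for the demonstration: about 24,000 pointers, each
 * naming the same 128KB string (close to 3GB once copied). */
rlim_t demo_args_size(void)
{
	return (rlim_t)24000 * 128 * 1024;
}
```

With a typical 8MB stack limit, args_rejected(demo_args_size(), 8 * 1024 * 1024) is true and the exec fails as intended. With an unlimited stack, args_rejected(demo_args_size(), RLIM_INFINITY) is false: a quarter of "all ones" is still far larger than any argument list that could be constructed, so the check never fires.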

At a minimum, Brad's exploit is able to oops the system once the stack tries to expand too far. He mentioned the possibility of expanding the stack down to address zero - thus reopening the threat of null-pointer exploits - but has not been able to figure out a way to make such exploits work. The copying of all those arguments will, naturally, consume large amounts of system memory; due to another glitch, that memory use is not properly accounted for, so, if the out-of-memory killer is brought in to straighten things out, it will not target the process which is actually causing the problem. And, as if that were not enough, the counting and copying of the argument strings is not preemptible or killable; given that it can run for a very long time, it can be very hard on the performance of the rest of the system.

Brad says that he first reported this problem in December, 2009, but got no response. More recently, he sent a note to Kees Cook, who posted a partial fix in response. That fix had some technical problems and was not applied, but Roland McGrath has posted a new set of fixes which gets closer. Roland has taken a minimal approach, not wanting to limit argument sizes more than absolutely necessary. So his patch just ensures that the stack will not grow below the minimum allowed user-space memory address (mmap_min_addr). That check, combined with the guard page added to the stack region by the August fix, should prevent the stack from growing into harmful areas. Roland has also added a preemption point to the argument-copying code to improve interactivity in the rest of the system, and a signal check allowing the process to be killed if necessary. He has not addressed the OOM killer issue, which will need to be fixed separately.

Roland's patch seems likely to fix the worst problems, though some commenters feel that it does not go far enough. One assumes that fixes will be headed toward distribution kernels in the near future. But there are a couple of discouraging things to note from this episode:

  • It seems that the code which is intended to block runaway resource use in a core Linux system call was never really tested at its extremes. The Linux kernel community does not have a whole lot of people who do this kind of auditing and testing, unfortunately; that leaves the task to the people who have an interest (either benign or malicious) in security issues.

  • It took some nine months after the initial report before anybody tried to fix the problem. That is not the sort of rapid response that this community normally takes pride in.

The problem may indicate a key shortcoming in how Linux kernel development is supported. There are thousands of developers who are funded to spend at least some of their time doing kernel work. Some of those are paid to work in security-related areas like SELinux or AppArmor. But it's not at all clear that anybody is funded simply to make sure that the core kernel is secure. That may make it easier for security problems to slip into the kernel, and it may slow down the response when somebody points out problems in the code. There is a strong (and increasing) economic interest in exploiting security issues in the kernel; perhaps we need to find a way to increase the level of interest in preventing these issues in the first place.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds