Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.38-rc8, released on March 7. Linus indicated that we should see the final 2.6.38 release in a week or so. "I would have been ok with releasing this as the final 38, but I'm going to be partially unreachable for several days next week, so I felt there was no point in opening the merge window yet. Also, we do have regressions. Some of them hopefully fixed here, but it won't hurt to give it another week." The short-form changelog is in the announcement, or see the full changelog for all the details.

Stable updates: one stable update, containing a single build fix, was released on March 3. It was followed by two more on March 7; each of those releases contains a larger set of important fixes.

The realtime patch set was released on March 4.

Quotes of the week

Usually is not a good answer for correctness and security with firmware. It's really really not a good answer when flashing new firmware. "Usually your expensive hardware is not turned into a valueless brick" lacks a certain something.
-- Alan Cox

What I'm trying to say is that it's _ALWAYS_ about balances and trade offs. Sticking to some or any rules in fundamentalistic manner is a guaranteed way to horrible code base which is not only painful to develop and maintain but also will deliver a lot of WTF moments to its users too in the long run.

So, let's balance it. Avoiding changes to the userland visible behaviors does have a lot of weight but its mass is NOT infinite.

-- Tejun Heo

Removed directories and st_nlink

By Jonathan Corbet
March 9, 2011
Al Viro was doing an audit of portions of the virtual filesystem layer when he noticed a bit of inconsistent behavior. If an application uses rename() to move one directory on top of another, the link count on the just-removed directory might not drop to zero. Some filesystems get the link count right, while others don't. Al saw that behavior as possibly confusing, so he put together a set of patches to fix things up.

Linus pulled the patches for 2.6.38-rc8, but wondered if it was really worthwhile. There are filesystems out there which do not track link counts at all, so applications cannot count on seeing correct link counts on removed directories. Any application which depends on such behavior is, he says, inherently buggy. Arguments that some applications do seem to care, and that inotify will not work properly if the link count is wrong, left him unmoved. It does not matter that much, though, since the fixes went in anyway.

What may have been most surprising, even to experienced filesystem developers, is that the kernel allows the destructive renaming of one directory on top of another. A number of rules apply, but the POSIX standard explicitly specifies that such renames have to work. It has been that way since the early BSD days, and it does indeed work with Linux.
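For the curious, the behavior is easy to observe from user space. What follows is a minimal sketch, not anything taken from Al's patches: it renames one directory on top of another (empty) one while holding a file descriptor open on the victim, then reports the victim's link count before and after. The function name and the temporary directory layout are invented for the illustration, and the "after" value depends on the filesystem and kernel in use, which is exactly the inconsistency described above.

```python
import os
import tempfile


def nlink_after_rename_over():
    """Rename src/ on top of an empty dst/ and watch dst's old inode.

    Returns (nlink_before, nlink_after, rename_worked). The names used
    here are hypothetical; only rename() and fstat() are the real
    interfaces being exercised.
    """
    with tempfile.TemporaryDirectory() as base:
        src = os.path.join(base, "src")
        dst = os.path.join(base, "dst")
        os.mkdir(src)
        os.mkdir(dst)

        # Keep the soon-to-be-removed directory alive via an open descriptor.
        fd = os.open(dst, os.O_RDONLY | os.O_DIRECTORY)
        try:
            before = os.fstat(fd).st_nlink
            os.rename(src, dst)  # only allowed because dst is empty
            after = os.fstat(fd).st_nlink
            worked = os.path.isdir(dst) and not os.path.exists(src)
        finally:
            os.close(fd)
    return before, after, worked


if __name__ == "__main__":
    before, after, worked = nlink_after_rename_over()
    print(f"rename worked: {worked}, st_nlink {before} -> {after}")
```

On a filesystem that gets the count right, the second number should be zero; others may report something else, which is why an application cannot safely depend on it.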

Kernel development news

Protecting /proc/slabinfo

By Jake Edge
March 9, 2011

User-space access to internal kernel information is always something of a balancing act. That information can be useful for debugging or diagnosing problems, but it can also be used by attackers to simplify exploiting security vulnerabilities. At first glance, protecting /proc/slabinfo so that it can't be read by non-root users seems like a reasonable restriction to help reduce targeted heap corruption exploits, but as a patch to do so was discussed, it became clear that there are some more subtle issues to consider.

Memory for kernel objects is managed by the "slab" allocator, which allocates those objects into separate slabs based on their size. The kernel has several different slab allocators available, and users (or distributions) can choose the one they want to use when configuring the kernel. The exploits discussed in the thread were mostly aimed at the SLUB allocator because it mixes multiple object types of the same size into slabs.

The output from /proc/slabinfo lists the various slabs in use, the size of objects that they contain, the number of objects currently used in the slab, and so on. That information can be used to manipulate the heap layout in a way that's favorable to attackers.

Dan Rosenberg proposed changing the permissions of /proc/slabinfo to 0400, noting that "nearly all recent public exploits for heap issues rely on feedback from /proc/slabinfo to manipulate heap layout into an exploitable state". As he points out, normal users shouldn't really need access to that file, and on systems where that is desirable, the administrator can set the permissions appropriately.
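To make the discussion concrete, here is a rough sketch of what a reader of that file can learn and how the proposed permission change could be checked. It assumes the "version 2.1" text layout of /proc/slabinfo; the sample numbers are made up for illustration, and the helper names (parse_slabinfo(), readable_by_others()) are hypothetical.

```python
import os
import stat
from typing import NamedTuple


class SlabCache(NamedTuple):
    name: str
    active_objs: int
    num_objs: int
    objsize: int
    objperslab: int
    pagesperslab: int


def parse_slabinfo(text):
    """Parse the first six columns of slabinfo-style text (assumed v2.1 layout)."""
    caches = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue  # skip the version banner and the column header
        fields = line.split()
        caches.append(SlabCache(fields[0], *map(int, fields[1:6])))
    return caches


def readable_by_others(path="/proc/slabinfo"):
    """True if group or other has read permission, i.e. the mode is not 0400."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))


# Illustrative data only; these are not measurements from a real system.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-128   1536   1600   128   32   1 : tunables 0 0 0 : slabdata 50 50 0
kmalloc-64    4000   4096    64   64   1 : tunables 0 0 0 : slabdata 64 64 0
"""

if __name__ == "__main__":
    for cache in parse_slabinfo(SAMPLE):
        free = cache.num_objs - cache.active_objs
        print(f"{cache.name}: {free} free objects of {cache.objsize} bytes")
```

The number of free objects in a cache of a given size is the sort of feedback that lets an attacker judge when the next allocation will come from a fresh slab.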

While few argued that unprivileged users need access to the information (at least by default), there were immediate questions about whether shutting off the access would really be an obstacle for attackers. Matt Mackall put it this way:

Looking at a couple of these exploits, my suspicion is that looking at slabinfo simply improves the odds of success by a small factor (ie 10x or so) and doesn't present a real obstacle to attackers. All that appears to be required is to arrange that an overrunnable object be allocated next to one that is exploitable when overrun. And that can be arranged with fairly high probability on SLUB's merged caches.

That "10x" factor is important. If it were several orders of magnitude higher, it would be clearer that possibly inconveniencing some users is worthwhile. But making an attacker work only ten times harder may not be worth it, as Mackall observes:

There are thousands of attackers and millions of users. Most of those millions are on single-user systems with no local attackers. For every attacker's life we make trivially more difficult, we're also making a few real user's lives more difficult. It's not obvious that this is a good trade-off.

Mackall describes the basic idea behind these "heap smashing" attacks well, but Rosenberg gives a more detailed description of how they work:

The most common known techniques involve overflowing into an allocated object with useful contents such as a function pointer and then triggering these (various IPC-related structs are often used for this). It's also possible to overflow into a free object and overwrite its free pointer, causing subsequent allocations to result in a fake heap object residing in userland being under an attacker's control.

So an attacker that can observe /proc/slabinfo can gain a great deal of information about slabs that would allow them to arrange the situation they need on the heap. But in the discussion, it became clear that there are other ways to gain some of that information (by observing /proc/meminfo, for instance). It also turns out that the information is not strictly needed: an attacker can rely on probability to arrange objects in the "right" order.

One could, as Mackall suggested, pre-allocate a large number of objects such that a new page is allocated to the slab. The contents of that page are then largely under the control of the attacker, so freeing one of those objects and allocating the other kind will very likely result in the needed arrangement.

That led Pekka Enberg to propose a patch that would randomize the object layout within a slab. The idea is that attackers couldn't depend on the current sequential allocation of objects in the slab, as the free list in a new slab would be in a random order. When freeing one object and then allocating another, there would be no guarantee that the two would be sequential. That approach, too, falls by the wayside because of probability.

Even with a randomized slab, an attacker can half fill the slab page with overrunnable objects, then allocate an exploitable object; it will then have a roughly 50% chance of being in the right place. One could also fill the slab page with overrunnable objects, then free an object (or every tenth object). The holes left behind will have a high probability of being in the right place. Essentially, Mackall and others in the thread showed that the current attacks using /proc/slabinfo output were insufficiently imaginative.
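The probability argument can be checked with a small simulation. This is a toy model rather than anything resembling the real allocator: a slab page is a list of slots, "in the right place" is taken to mean that the exploitable object sits directly after an overrunnable one, and a freed slot is assumed to be handed out again by the very next allocation. The slot count, trial count, and function names are all arbitrary choices made for the sketch.

```python
import random

OVERRUNNABLE = "O"


def half_fill_trial(slots, rng):
    """Randomized free list: half fill the page, then allocate the target."""
    free_order = list(range(slots))
    rng.shuffle(free_order)  # models a randomized free list in a new slab
    page = [None] * slots
    for idx in free_order[: slots // 2]:
        page[idx] = OVERRUNNABLE
    target = free_order[slots // 2]  # slot handed to the exploitable object
    return target > 0 and page[target - 1] == OVERRUNNABLE


def fill_then_free_trial(slots, rng):
    """Fill the page completely, free one object, allocate the target there."""
    page = [OVERRUNNABLE] * slots
    hole = rng.randrange(slots)
    page[hole] = None  # assumption: the next allocation reuses this slot
    return hole > 0 and page[hole - 1] == OVERRUNNABLE


def estimate(trial, slots=32, trials=20000, seed=1):
    """Fraction of trials in which the target follows an overrunnable object."""
    rng = random.Random(seed)
    return sum(trial(slots, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("half fill:     ", estimate(half_fill_trial))
    print("fill then free:", estimate(fill_then_free_trial))
```

Under these assumptions the half-fill strategy succeeds about half the time, and the fill-then-free strategy fails only when the hole happens to be the first slot in the page. Randomization changes which slot is used, but not the odds that its neighbor is attacker-controlled.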

Mackall noted that the real underlying problem is that it is too easy for programmers to copy the wrong amount of data from user space (which is how most of these object overruns occur). He suggested that a copy_from_user() interface that was harder to get wrong might help reduce these kinds of problems:

I think the real issue here is that it's too easy to write code that copies too many bytes from userspace. Every piece of code writes its own bound checks on copy_from_user, for instance, and gets it wrong by hitting signed/unsigned issues, alignment issues, etc. that are on the very edge of the average C coder's awareness.

We need functions that are hard to abuse and coding patterns that are easy to copy, easy to review, and take the tricky bits out of the hands of driver writers.
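One of the signed/unsigned traps Mackall alludes to can be modeled in a few lines. The sketch below is Python standing in for C, using ctypes to mimic C integer conversions; it is not kernel code. It assumes a hypothetical driver that checks a signed int length against a 64-byte buffer and then passes that length to a copy routine taking an unsigned long count, as copy_from_user() does. BUF_SIZE and the three function names are invented for the example.

```python
import ctypes

BUF_SIZE = 64  # size of a hypothetical kernel-side buffer


def naive_check_passes(user_len):
    """Models `if (len > BUF_SIZE) return -EINVAL;` with a signed int len."""
    length = ctypes.c_int(user_len).value
    return length <= BUF_SIZE


def length_seen_by_copy(user_len):
    """Models the implicit conversion to the copy routine's unsigned long count."""
    return ctypes.c_ulong(user_len).value


def checked_length(user_len, limit=BUF_SIZE):
    """Hypothetical safer pattern: convert to unsigned first, then bound it."""
    length = ctypes.c_ulong(user_len).value
    if length > limit:
        raise ValueError(f"length {length} exceeds buffer of {limit} bytes")
    return length


if __name__ == "__main__":
    print("naive check accepts -1:", naive_check_passes(-1))
    print("copy routine would see:", length_seen_by_copy(-1))
```

A length of -1 sails through the signed comparison and then turns into an enormous count once it is treated as unsigned. A shared helper that does the conversion and the bound check in one place is the kind of "hard to abuse" interface being asked for.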

There was general agreement that better interfaces should be added to reduce these kinds of problems. Alan Cox mentioned that Arjan van de Ven had created some "copy_from_user validation code [that] already does verification checks on the copies using gcc magic". While Rosenberg is still interested in pursuing the /proc/slabinfo protection along with Enberg's slab randomization patch, it's not clear that there will be much support for it from other kernel developers. Given that it imposes a performance penalty along with potentially inconveniencing users, without any real benefit, it may be rather hard to sell.

Improving ptrace()

By Jonathan Corbet
March 8, 2011
The ptrace() system call is rarely held up as one of the better parts of the Unix interface. This call, which is used for the tracing and debugging of processes, is gifted with strange semantics (it reparents the traced process to the tracer), numerous interface warts, and occasionally unpredictable behavior. It is also hard to implement within the kernel; there are few developers who are willing to get into the depths of the ptrace() implementation. So it's not surprising that there has been occasional talk of simply replacing ptrace() with something better; see this 2010 article for a description of one such discussion.

While some developers think that ptrace() is beyond repair, Tejun Heo disagrees. To back that up, he has posted a set of proposals for the improvement of the interface, saying:

ptrace currently is in a pretty bad shape and I think one of the biggest reasons is a lot of effort has been spent trying to come up with something completely new instead of concentrating on improving what's already there. I think the existing principles are pretty sound. They just need some love and attention here and there.

The bulk of the "love and attention" Tejun means to apply is addressed at the interaction between tracing and job control. In an untraced process, job control is used by the kernel and the shell to stop and restart processes, possibly moving them between the foreground and the background. Adding tracing to the picture confuses things for a number of reasons. For example, reparenting the traced process deprives the real parent of the ability to get notifications when that process is stopped or started. There are also some strange internal transitions between the TASK_STOPPED and TASK_TRACED states which lead to unpredictable and sometimes surprising behavior. For example, a task which is running under strace can be stopped with ^Z as usual, but the shell will be unable to restart it.

Tejun has a series of concrete proposals to improve the situation. The first of these is that a traced process should always, when stopped, be in the TASK_TRACED state. The current strange transitions between that state and TASK_STOPPED would go away. He would fix things so that notifications when a process stops or starts would always go to the real parent, even when a process has been reparented for tracing. Some edge cases, such as what happens when a traced process is detached, would be fixed so that the process's behavior matches the untraced case.

To fix the "can't start a stopped, traced process" problem, Tejun would further enshrine the rule that the tracing process has total control over the traced process's state. So it's up to the tracer to start a stopped process if the shell wants that done. Currently, tracers have no way to know that the real parent has tried to start a stopped process, so a notification mechanism needs to be added. That would be done by extending the STOPPED notification that can currently be obtained with one of the variants of the wait() system call.
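For reference, the existing stop and continue notifications look like this from the real parent's point of view in the untraced case. The sketch is a plain job-control demonstration, with no ptrace() involved and nothing from Tejun's patches: the parent stops a child with SIGSTOP, collects the stop report with waitpid() and WUNTRACED, restarts the child with SIGCONT, and collects the continue report with WCONTINUED. The function name and the sleeping child are arbitrary.

```python
import os
import signal
import subprocess
import sys


def observe_stop_and_continue():
    """Stop and restart a child, returning (saw_stop, saw_continue)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        saw_stop = os.WIFSTOPPED(status) and os.WSTOPSIG(status) == signal.SIGSTOP

        os.kill(child.pid, signal.SIGCONT)
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        saw_continue = os.WIFCONTINUED(status)
    finally:
        # Clean up the child regardless of what was observed.
        child.kill()
        child.wait()
    return saw_stop, saw_continue


if __name__ == "__main__":
    print(observe_stop_and_continue())
```

These are the reports that the real parent loses when its child is reparented to a tracer, and the stop report is the one Tejun proposes to extend so that a tracer can learn of a restart request.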

Finally, Tejun would like to fix the behavior of the PTRACE_ATTACH operation, which attaches to a process and sends a SIGSTOP signal to put it into the stopped state. The signal confuses things, and the stopped state is undesirable; it is not really possible, though, to change the semantics of PTRACE_ATTACH in this way without breaking applications. So he would create a new PTRACE_SEIZE operation which would attach to a process (if it's not already attached) and put the process immediately into the TASK_TRACED state.

These changes, Tejun thinks, are enough to turn ptrace() into something rather more predictable and civilized. He'd like to go forward into the implementation with a 2.6.40 target for merging. In the following discussion, it seems that most developers agree with these changes, modulo a quibble or two. The one big exception was Roland McGrath, who has done a lot of work in this area. Roland has some different ideas, especially with regard to PTRACE_SEIZE.

Roland's alternative to PTRACE_SEIZE (if it can truly be called an "alternative," having been suggested first) is to add two new commands: PTRACE_ATTACH_NOSTOP and PTRACE_INTERRUPT. The former would attach to a process but not change its running state in any way, while the latter would stop the process and put it into the TASK_TRACED state. He sees a number of advantages to this approach, including the ability to trace a process without ever stopping it. There are cases (strace comes to mind) where there is no need to stop the process; avoiding doing so allows the process to be traced while minimizing the effects on its behavior.

Roland also foresaw a variant of PTRACE_INTERRUPT which would only stop a process when it's running in user space. That would avoid the occasional "interrupted system call" failure that current tracing can cause. He also worries about what happens when PTRACE_SEIZE is, itself, interrupted; handling that situation in a way that supports the writing of robust applications, he says, would be hard. Finally, he raises the issue of scalability; he does not think that PTRACE_SEIZE will work well for the debugging of highly threaded applications. In summary, he said:

None of this means at all that PTRACE_SEIZE is worthless. But it is certainly inadequate to meet the essential needs that motivate adding new interfaces in this area. The PTRACE_ATTACH_NOSTOP idea I suggested is far from complete for all the issues as well, but it is a more versatile building block than PTRACE_SEIZE.

Unfortunately, it seems that Roland is changing jobs and stopping work in this area, so his thoughts may carry less weight than they normally would have. As of this writing, there have been few responses to his post; Tejun has mostly dismissed Roland's concerns. Tejun has also posted a patch series implementing parts of his proposal, but not, yet, PTRACE_SEIZE. The uncontroversial parts of this work will almost certainly be merged; how PTRACE_ATTACH will be fixed in the end remains to be seen.

Delaying the OOM killer

By Jonathan Corbet
March 9, 2011
The out-of-memory (OOM) killer is charged with killing off processes in response to a severe memory shortage. It has been the source of considerable discussion and numerous rewrites over the years. Perhaps that is inevitable given its purpose; choosing the right process to kill at the right time is never going to be an easy thing to code. The extension of the OOM killer into control groups has added to its flexibility, but has also raised some interesting issues of its own.

Normally, the OOM killer is invoked when the system as a whole is catastrophically out of memory. In the control group context, the OOM killer comes into play when the memory usage by the processes within that group exceeds the configured maximum and attempts to reclaim memory from those processes have failed. An out-of-memory situation which is contained to a control group is bad for the processes involved, but it should not threaten the rest of the system. That allows for a little more flexibility in how out-of-memory situations are handled.

In particular, it is possible for user space to take over OOM-killer duties in the control group context. Each group has a control file called oom_control which can be used in a couple of interesting ways:

  • Writing "1" to that file will disable the OOM killer within that group. Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.

  • Through the use of a special eventfd() file descriptor, a process can use the oom_control file to sign up for notifications of out-of-memory events (see Documentation/cgroups/memory.txt for the details on how that is done). That process will be informed whenever the control group runs out of memory; it can then respond to address the problem.
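A user-space handler built on these two mechanisms might be sketched as follows. This assumes the cgroup (v1) memory controller interface as described in Documentation/cgroups/memory.txt, where the files are named memory.oom_control and cgroup.event_control; the cgroup directory path is whatever the administrator has set up, and the function names are hypothetical. It also assumes Python 3.10 or later for os.eventfd(). Running it requires a suitably configured memory control group and the privileges to write to its files.

```python
import os


def event_control_line(event_fd, oom_control_fd):
    """Text written to cgroup.event_control: '<eventfd> <fd of oom_control>'."""
    return f"{event_fd} {oom_control_fd}"


def disable_kernel_oom_killer(cgroup_dir):
    """Write "1" to the group's oom_control file, as described above."""
    with open(os.path.join(cgroup_dir, "memory.oom_control"), "w") as ctl:
        ctl.write("1")


def wait_for_oom(cgroup_dir):
    """Block until the group reports an out-of-memory event."""
    efd = os.eventfd(0)  # needs Python 3.10+ on Linux
    ofd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(event_control_line(efd, ofd))
        return os.eventfd_read(efd)  # blocks; returns the event count
    finally:
        os.close(ofd)
        os.close(efd)
```

After wait_for_oom() returns, the handler must still have enough memory of its own to act on the event.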

There are a number of ways that this user-space OOM killer can fix a memory issue that affects a control group. It could simply raise the limit for that group, for example. Alternatives include killing processes or moving some processes to a different control group. All told, it's a reasonably flexible way of allowing user space to take over the responsibility of recovering from out-of-memory disasters.

At Google, though, it seems that it's not quite flexible enough. As has been widely reported, Google does not have very many machines to work with, so the company has a tendency to cram large numbers of tasks onto each host. That has led to an interesting problem: what happens if the user-space OOM killer is, itself, so starved for memory that it is unable to respond to an out-of-memory condition? What happens, it turns out, is that things just come to an unpleasant halt.

Google operations is not overly fond of unpleasant halts, so an attempt has been made to find another solution. The outcome was a patch from David Rientjes adding another control file to the control group called oom_delay_millisecs. Like oom_control, it holds off the kernel's OOM killer in favor of a user-space alternative. The difference is that the administrator can provide a time limit for the kernel OOM killer's patience; if the out-of-memory situation persists after that much time, the kernel's OOM killer will step in and resolve the situation with as much prejudice as necessary.

To David, this delay looks like a useful new feature for the memory control group mechanism. To Andrew Morton, instead, it looks like a kernel hack intended to work around user-space bugs, and he is not that thrilled by it. In Andrew's view, if user space has set itself up as the OOM handler for a control group, it needs to ensure that it is able to follow through. Adding the delay looks like a way to avoid that responsibility which could have long-term effects:

My issue with this patch is that it extends the userspace API. This means we're committed to maintaining that interface *and its behaviour* for evermore. But the oom-killer and memcg are both areas of intense development and the former has a habit of getting ripped out and rewritten. Committing ourselves to maintaining an extension to the userspace interface is a big thing, especially as that extension is somewhat tied to internal implementation details and is most definitely tied to short-term inadequacies in userspace and in the kernel implementation.

Andrew would rather see development effort put into fixing any kernel problems which might be preventing a user-space OOM killer from doing its job. David, though, doesn't see a way to work without this feature. If it doesn't get in, Google may have to carry it separately; he predicted, though, that other users will start asking for it as usage of the memory controller increases. As of this writing, that's where the discussion stands.

Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds