Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.38-rc8, released on March 7. Linus indicated that we should see the final 2.6.38 release in a week or so. "I would have been ok with releasing this as the final 38, but I'm going to be partially unreachable for several days next week, so I felt there was no point in opening the merge window yet. Also, we do have regressions. Some of them hopefully fixed here, but it won't hurt to give it another week." The short-form changelog is in the announcement, or see the full changelog for all the details.
Stable updates: 2.6.32.31, containing a single build fix, was released on March 3. It was followed by 2.6.32.32 and 2.6.37.3 on March 7; each of those releases contains a larger set of important fixes.
The 2.6.34.8-rt realtime patch set was released on March 4.
Quotes of the week
So, let's balance it. Avoiding changes to the userland visible behaviors does have a lot of weight but its mass is NOT infinite.
Removed directories and st_nlink
Al Viro was doing an audit of portions of the virtual filesystem layer when he noticed a bit of inconsistent behavior. If an application uses rename() to move one directory on top of another, the link count on the just-removed directory might not drop to zero. Some filesystems get the link count right, while others don't. Al saw that behavior as potentially confusing, so he put together a set of patches to fix things up.
Linus pulled the patches for 2.6.38-rc8, but wondered if it was really worthwhile. There are filesystems out there which do not track link counts at all, so applications cannot count on seeing correct link counts on removed directories. Any application which depends on such behavior is, he says, inherently buggy. Arguments that some applications do seem to care and that inotify will not work properly if the link count is wrong left him unmoved - but it does not matter that much, since the fixes went in anyway.
What may have been most surprising, even to experienced filesystem developers, is that the kernel allows the destructive renaming of one directory on top of another. A number of rules apply, but the POSIX standard explicitly specifies that such renames have to work. It has been that way since the early BSD days, and it does indeed work with Linux.
Kernel development news
Protecting /proc/slabinfo
User-space access to internal kernel information is always something of a balancing act. That information can be useful for debugging or diagnosing problems, but it can also be used by attackers to simplify exploiting security vulnerabilities. At first glance, protecting /proc/slabinfo so that it can't be read by non-root users seems like a reasonable restriction to help reduce targeted heap corruption exploits, but as a patch to do so was discussed, it became clear that there are some more subtle issues to consider.
Memory for kernel objects is managed by the "slab" allocator, which allocates those objects into separate slabs based on their size. The kernel has several different slab allocators available, and users (or distributions) can choose the one they want to use when configuring the kernel. The exploits discussed in the thread were mostly aimed at the SLUB allocator because it mixes multiple object types of the same size into slabs.
The output from /proc/slabinfo lists the various slabs in use, the size of objects that they contain, the number of objects currently used in the slab, and so on. That information can be used to manipulate the heap layout in a way that's favorable to attackers.
Dan Rosenberg proposed changing the permissions of /proc/slabinfo to 0400, noting that "nearly all recent public exploits for heap issues rely on feedback from /proc/slabinfo to manipulate heap layout into an exploitable state". As he points out, normal users shouldn't really need access to that file, and on systems where such access is desirable, the administrator can set the permissions appropriately.
While few argued that unprivileged users need access to the information (at least by default), there were immediate questions about whether shutting off the access would really be an obstacle for attackers; Matt Mackall estimated that hiding the file would make an attacker's job only about ten times harder. That "10x" factor is important: if it were several orders of magnitude higher, it would be clearer that possibly inconveniencing some users is worthwhile, but making an attacker work only ten times harder may not be worth it. Mackall describes the basic idea behind these "heap smashing" attacks well, and Rosenberg gives a more detailed description of how they work.
So an attacker that can observe /proc/slabinfo can gain a great deal of information about slabs that would allow them to arrange the situation they need on the heap. But in the discussion, it became clear that there are other ways to gain some of that information (by observing /proc/meminfo, for instance), and that an attacker can instead rely on probability to arrange objects in the "right" order.
One could, as Mackall suggested, pre-allocate a large number of objects so that a new page is allocated to the slab. The contents of that page are then largely under the control of the attacker, so freeing one of those objects and allocating an object of the target type will very likely result in the needed arrangement.
That led Pekka Enberg to propose a patch that would randomize the object layout within a slab. The idea is that attackers couldn't depend on the current sequential allocation of objects in the slab, as the free list in a new slab would be in a random order. When freeing one object and then allocating another, there would be no guarantee that the two would be sequential. That approach, too, falls by the wayside because of probability.
Even with a randomized slab, an attacker can half fill the slab page with overrunnable objects, then allocate an exploitable object; it will then have a roughly 50% chance of being in the right place. One could also fill the slab page with overrunnable objects, then free one object (or every tenth); the holes left behind have a high probability of being in the right place. Essentially, Mackall and others in the thread showed that the current attacks using /proc/slabinfo output were insufficiently imaginative.
Mackall noted that the real underlying problem is that it is too easy for programmers to copy the wrong amount of data from user space (which is how most of these object overruns occur). He suggested that a copy_from_user() interface that was harder to get wrong might help reduce these kinds of problems:
We need functions that are hard to abuse and coding patterns that are easy to copy, easy to review, and take the tricky bits out of the hands of driver writers.
There was general agreement that better interfaces should be added to reduce these kinds of problems. Alan Cox mentioned that Arjan van de Ven had created some "copy_from_user validation code [that] already does verification checks on the copies using gcc magic". While Rosenberg is still interested in pursuing the /proc/slabinfo protection along with Enberg's slab randomization patch, it's not clear that there will be much support for it from other kernel developers. Given that it imposes a performance penalty along with potentially inconveniencing users, without any real benefit, it may be rather hard to sell.
Improving ptrace()
The ptrace() system call is rarely held up as one of the better parts of the Unix interface. This call, which is used for the tracing and debugging of processes, is gifted with strange semantics (it reparents the traced process to the tracer), numerous interface warts, and occasionally unpredictable behavior. It is also hard to implement within the kernel; there are few developers who are willing to get into the depths of the ptrace() implementation. So it's not surprising that there has been occasional talk of simply replacing ptrace() with something better; see this 2010 article for a description of one such discussion.
While some developers think that ptrace() is beyond repair, Tejun Heo disagrees; to back that up, he has posted a set of proposals for the improvement of the interface.
The bulk of the "love and attention" Tejun means to apply is addressed at the interaction between tracing and job control. In an untraced process, job control is used by the kernel and the shell to stop and restart processes, possibly moving them between the foreground and the background. Adding tracing to the picture confuses things for a number of reasons. For example, reparenting the traced process deprives the real parent of the ability to get notifications when that process is stopped or started. There are also some strange internal transitions between the TASK_STOPPED and TASK_TRACED states which lead to unpredictable and sometimes surprising behavior. For example, a task which is running under strace can be stopped with ^Z as usual, but the shell will be unable to restart it.
Tejun has a series of concrete proposals to improve the situation. The first of these is that a traced process should always, when stopped, be in the TASK_TRACED state. The current strange transitions between that state and TASK_STOPPED would go away. He would fix things so that notifications when a process stops or starts would always go to the real parent, even when a process has been reparented for tracing. Some edge cases, such as what happens when a traced process is detached, would be fixed so that process's behavior matches the untraced case.
To fix the "can't start a stopped, traced process" problem, Tejun would further enshrine the rule that the tracing process has total control over the traced process's state. So it's up to the tracer to start a stopped process if the shell wants that done. Currently, tracers have no way to know that the real parent has tried to start a stopped process, so a notification mechanism needs to be added. That would be done by extending the STOPPED notification that can currently be obtained with one of the variants of the wait() system call.
Finally, Tejun would like to fix the behavior of the PTRACE_ATTACH operation, which attaches to a process and sends a SIGSTOP signal to put it into the stopped state. The signal confuses things, and the stopped state is undesirable; it is not really possible, though, to change the semantics of PTRACE_ATTACH in this way without breaking applications. So he would create a new PTRACE_SEIZE operation which would attach to a process (if it's not already attached) and put the process immediately into the TASK_TRACED state.
These changes, Tejun thinks, are enough to turn ptrace() into something rather more predictable and civilized. He'd like to go forward into the implementation with a 2.6.40 target for merging. In the following discussion, it seems that most developers agree with these changes, modulo a quibble or two. The one big exception was Roland McGrath, who has done a lot of work in this area. Roland has some different ideas, especially with regard to PTRACE_SEIZE.
Roland's alternative to PTRACE_SEIZE (if it can truly be called an "alternative," having been suggested first) is to add two new commands: PTRACE_ATTACH_NOSTOP and PTRACE_INTERRUPT. The former would attach to a process but not change its running state in any way, while the latter would stop the process and put it into the TASK_TRACED state. He sees a number of advantages to this approach, including the ability to trace a process without ever stopping it. There are cases (strace comes to mind) where there is no need to stop the process; avoiding doing so allows the process to be traced while minimizing the effects on its behavior.
Roland also foresaw a variant of PTRACE_INTERRUPT which would only stop a process when it's running in user space. That would avoid the occasional "interrupted system call" failure that current tracing can cause. He also worries about what happens when PTRACE_SEIZE is, itself, interrupted; handling that situation in a way that supports the writing of robust applications, he says, would be hard. Finally, he raises the issue of scalability: he does not think that PTRACE_SEIZE will work well for the debugging of highly threaded applications.
Unfortunately, it seems that Roland is changing jobs and stopping work in this area, so his thoughts may carry less weight than they normally would have. As of this writing, there have been few responses to his post; Tejun has mostly dismissed Roland's concerns. Tejun has also posted a patch series implementing parts of his proposal, but not, yet, PTRACE_SEIZE. The uncontroversial parts of this work will almost certainly be merged; how PTRACE_ATTACH will be fixed in the end remains to be seen.
Delaying the OOM killer
The out-of-memory (OOM) killer is charged with killing off processes in response to a severe memory shortage. It has been the source of considerable discussion and numerous rewrites over the years. Perhaps that is inevitable given its purpose; choosing the right process to kill at the right time is never going to be an easy thing to code. The extension of the OOM killer into control groups has added to its flexibility, but has also raised some interesting issues of its own.
Normally, the OOM killer is invoked when the system as a whole is catastrophically out of memory. In the control group context, the OOM killer comes into play when the memory usage by the processes within that group exceeds the configured maximum and attempts to reclaim memory from those processes have failed. An out-of-memory situation which is contained to a control group is bad for the processes involved, but it should not threaten the rest of the system. That allows for a little more flexibility in how out-of-memory situations are handled.
In particular, it is possible for user space to take over OOM-killer duties in the control group context. Each group has a control file called oom_control which can be used in a couple of interesting ways:
- Writing "1" to that file will disable the OOM killer within that group. Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.
- Through the use of a special eventfd() file descriptor, a process can use the oom_control file to sign up for notifications of out-of-memory events (see Documentation/cgroups/memory.txt for the details on how that is done). That process will be informed whenever the control group runs out of memory; it can then respond to address the problem.
There are a number of ways that this user-space OOM killer can fix a memory issue that affects a control group. It could simply raise the limit for that group, for example. Alternatives include killing processes or moving some processes to a different control group. All told, it's a reasonably flexible way of allowing user space to take over the responsibility of recovering from out-of-memory disasters.
At Google, though, it seems that it's not quite flexible enough. As has been widely reported, Google does not have very many machines to work with, so the company has a tendency to cram large numbers of tasks onto each host. That has led to an interesting problem: what happens if the user-space OOM killer is, itself, so starved for memory that it is unable to respond to an out-of-memory condition? What happens, it turns out, is that things just come to an unpleasant halt.
Google operations is not overly fond of unpleasant halts, so an attempt has been made to find another solution. The outcome was a patch from David Rientjes adding another control file to the control group called oom_delay_millisecs. Like oom_control, it holds off the kernel's OOM killer in favor of a user-space alternative. The difference is that the administrator can provide a time limit for the kernel OOM killer's patience; if the out-of-memory situation persists after that much time, the kernel's OOM killer will step in and resolve the situation with as much prejudice as necessary.
To David, this delay looks like a useful new feature for the memory control group mechanism. To Andrew Morton, instead, it looks like a kernel hack intended to work around user-space bugs, and he is not that thrilled by it. In Andrew's view, if user space has set itself up as the OOM handler for a control group, it needs to ensure that it is able to follow through; adding the delay looks like a way to avoid that responsibility, one which could have long-term effects.
Andrew would rather see development effort put into fixing any kernel problems which might be preventing a user-space OOM killer from doing its job. David, though, doesn't see a way to work without this feature. If it doesn't get in, Google may have to carry it separately; he predicted, though, that other users will start asking for it as usage of the memory controller increases. As of this writing, that's where the discussion stands.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
