The current development kernel is 2.6.38-rc8, released
on March 7. Linus indicated
that we should see the final 2.6.38 release in a week or so. "I
would have been ok with releasing this as the final 38, but I'm going to be
partially unreachable for several days next week, so I felt there was no
point in opening the merge window yet. Also, we do have regressions. Some
of them hopefully fixed here, but it won't hurt to give it another
week." The short-form changelog is in the announcement, or see the
full changelog for all the details.
Stable updates: 126.96.36.199,
containing a single build fix, was released on March 3. It was
followed by 188.8.131.52 and 184.108.40.206 on March 7; each of those
releases contains a larger set of important fixes.
The 220.127.116.11-rt realtime patch set was
released on March 4.
Comments (none posted)
Usually is not a good answer for correctness and security with
firmware. It's really really not a good answer when flashing new
firmware. "Usually your expensive hardware is not turned into a
valueless brick" lacks a certain something.
-- Alan Cox
What I'm trying to say is that it's _ALWAYS_ about balances and
trade offs. Sticking to some or any rules in fundamentalistic
manner is a guaranteed way to horrible code base which is not only
painful to develop and maintain but also will deliver a lot of WTF
moments to its users too in the long run.
So, let's balance it. Avoiding changes to the userland visible
behaviors does have a lot of weight but its mass is NOT infinite.
-- Tejun Heo
Comments (4 posted)
Al Viro was doing an audit of portions of the virtual filesystem layer when
he noticed a bit of inconsistent behavior.
If an application uses rename()
to move one directory on top of
another, the link count on the just-removed directory might not drop to
zero. Some filesystems get the link count right, while others don't. Al
saw that behavior as possibly confusing, so he put together a set of
patches to fix things up.
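To make the behavior concrete, here is a minimal user-space sketch (not
from the discussion itself) that renames one empty directory over another,
then checks the victim's link count through a still-open file descriptor:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        mkdir("dir-src", 0755);
        mkdir("dir-dst", 0755);

        /* Hold the victim directory open so it can still be examined
           after rename() removes it from the namespace. */
        int fd = open("dir-dst", O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        /* POSIX requires this to succeed when dir-dst is empty. */
        if (rename("dir-src", "dir-dst") < 0) { perror("rename"); return 1; }

        struct stat st;
        if (fstat(fd, &st) == 0)
            /* A well-behaved filesystem reports zero here; the
               inconsistency Al found is that some filesystems leave a
               nonzero count. */
            printf("link count on removed directory: %ld\n",
                   (long) st.st_nlink);
        close(fd);
        return 0;
    }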
Linus pulled the patches for 2.6.38-rc8, but wondered if it was really worthwhile. There
are filesystems out there which do not track link counts at all, so
applications cannot count on seeing correct link counts on removed
directories. Any application which depends on such behavior is, he says,
inherently buggy. Arguments that some applications do seem to care, and
that inotify will not work properly if the link count is wrong, left him
unmoved. It does not matter that much, though, since the fixes went in anyway.
What may have been most surprising, even to experienced filesystem developers, is that the
kernel allows the destructive renaming of one directory on top of another.
A number of rules apply, but the POSIX
standard explicitly specifies that such renames have to work. It has been
that way since the early BSD days, and it does indeed work with Linux.
Comments (15 posted)
Kernel development news
User-space access to internal kernel information is always something of a
balancing act. That information can be useful for debugging or diagnosing
problems, but it can also be used by attackers to simplify exploiting
security vulnerabilities. At first glance, protecting
/proc/slabinfo so that it can't be read by non-root users seems
like a reasonable
restriction to help reduce targeted heap corruption exploits, but as a patch to do so was discussed, it became clear
that there are some more subtle issues to consider.
Memory for kernel objects is managed by the "slab" allocator, which
allocates those objects into separate slabs based on their size. The kernel
has several different slab allocators available, and users (or
distributions) can choose the
one they want to use when configuring the kernel. The exploits
discussed in the thread were mostly aimed at the SLUB allocator because it mixes
multiple object types of the same size into slabs.
The output from /proc/slabinfo lists the various slabs in use, the size of
objects that they contain, the number of objects currently used in the
slab, and so on. That information can be used to manipulate the heap
layout in a way that's favorable to attackers.
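For reference, the file's contents look roughly like this (a representative
excerpt; the numbers here are invented):

    slabinfo - version: 2.1
    # name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
    kmalloc-96        1836   1848     96   42    1 : tunables 0 0 0 : slabdata 44 44 0
    kmalloc-64        4514   4544     64   64    1 : tunables 0 0 0 : slabdata 71 71 0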
Dan Rosenberg proposed changing the permissions of /proc/slabinfo
to 0400, noting that "nearly all recent public exploits for heap
issues rely on feedback from /proc/slabinfo to manipulate heap layout
into an exploitable state". As he points out, normal users
shouldn't really need access to that file, and on systems where that is
desirable, the administrator can set the permissions appropriately.
While few argued that unprivileged users need access to the information (at
least by default), there were immediate questions about whether shutting
off the access would really be an obstacle for attackers. Matt Mackall put it this way:
Looking at a couple of these exploits, my suspicion is that looking at
slabinfo simply improves the odds of success by a small factor (ie 10x
or so) and doesn't present a real obstacle to attackers. All that
appears to be required is to arrange that an overrunnable object be
allocated next to one that is exploitable when overrun. And that can be
arranged with fairly high probability on SLUB's merged caches.
That "10x" factor is important. If it were several orders of magnitude
higher, it would be clearer that possibly inconveniencing some users is
worthwhile. But making an attacker work only ten times harder may not be
worth it, as Mackall observes:
There are thousands of attackers and millions of users. Most of those
millions are on single-user systems with no local attackers. For every
attacker's life we make trivially more difficult, we're also making a
few real user's lives more difficult. It's not obvious that this is a win.
Mackall describes the basic idea behind these "heap smashing" attacks well,
but Rosenberg gives a more detailed
description of how they work:
The most common known
techniques involve overflowing into an allocated object with useful
contents such as a function pointer and then triggering these (various
IPC-related structs are often used for this). It's also possible to
overflow into a free object and overwrite its free pointer, causing
subsequent allocations to result in a fake heap object residing in
userland being under an attacker's control.
So an attacker that can observe /proc/slabinfo can gain a great
deal of information about slabs that would allow them to arrange the
situation they need on the heap. But in the discussion, it became clear
that there are other ways to gain some of that information (by observing
/proc/meminfo, for instance), and that an attacker can simply rely on
probability to arrange objects in the "right" order.
One could, as Mackall suggested, pre-allocate a large number of objects
such that a new page is allocated to the slab. The contents of that page
are then largely under the control of the attacker, so freeing one of those
objects and allocating the other kind will very likely result in the needed
adjacency.
That led Pekka Enberg to propose a patch
that would randomize the object layout within a slab. The idea is that
attackers couldn't depend on the current sequential allocation of objects
in the slab, as the free list in a new slab would be in a random order.
When freeing one object and then allocating another, there would be no
guarantee that the two would be sequential. That approach, too, falls by
the wayside because of probability.
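The core idea can be expressed in a few lines; this is a simplified
user-space sketch of the concept, not Enberg's actual patch (which would
use the kernel's own random-number infrastructure rather than rand()):

    #include <stdlib.h>

    /* Shuffle the allocation order for the objects in a fresh slab
       page with a Fisher-Yates pass, so that consecutive allocations
       do not hand back adjacent objects. */
    static void shuffle_freelist(unsigned int order[], unsigned int n)
    {
        for (unsigned int i = n - 1; i > 0; i--) {
            unsigned int j = rand() % (i + 1);
            unsigned int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }

The slab code would fill order[] with 0..n-1 for each new page, shuffle
it, then thread the page's objects onto the free list in that order.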
Even with a randomized slab, an attacker can half fill the
slab page with overrunnable objects, then allocate an exploitable object;
it will then have a roughly 50% chance of being in the right place. One
could also fill
the slab page with overrunnable objects,
then free an object here and there (every tenth one, say).
The holes left behind will have a high probability of being in the right
place. Essentially, Mackall and others in the thread showed that the
current attacks using /proc/slabinfo output were simply insufficiently
clever; a smarter attack has no need for that file at all.
Mackall noted that the real underlying
problem is that it is too easy for programmers to copy the wrong amount of
data from user space (which is how most of these object overruns occur).
He suggested that a copy_from_user() interface that was harder to
get wrong might help reduce these kinds of problems:
I think the real issue here is that it's too easy to write code that
copies too many bytes from userspace. Every piece of code writes its own
bound checks on copy_from_user, for instance, and gets it wrong by
hitting signed/unsigned issues, alignment issues, etc. that are on the
very edge of the average C coder's awareness.
We need functions that are hard to abuse and coding patterns that are
easy to copy, easy to review, and take the tricky bits out of the hands
of driver writers.
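As a hypothetical illustration of the bug class Mackall describes (this is
not code from the thread), a signed length parameter can slip past a
careless bounds check and turn into an enormous unsigned copy:

    #include <linux/errno.h>
    #include <linux/uaccess.h>

    #define KBUF_SIZE 64

    static long buggy_copy(void __user *ubuf, int len)
    {
            char kbuf[KBUF_SIZE];

            /* Signed comparison: a negative len sails past this check... */
            if (len > KBUF_SIZE)
                    return -EINVAL;

            /* ...and is then converted to a huge unsigned size here,
               overrunning kbuf. */
            if (copy_from_user(kbuf, ubuf, len))
                    return -EFAULT;
            return 0;
    }

    /* The safer pattern: accept an unsigned size and bound it against
       the actual buffer, leaving no room for sign confusion. */
    static long safer_copy(void __user *ubuf, size_t len)
    {
            char kbuf[KBUF_SIZE];

            if (len > sizeof(kbuf))
                    return -EINVAL;
            if (copy_from_user(kbuf, ubuf, len))
                    return -EFAULT;
            return 0;
    }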
There was general agreement that better interfaces should be added to reduce
these kinds of problems. Alan Cox mentioned
that Arjan van de Ven had created some "copy_from_user validation
code [that] already does verification checks
on the copies using gcc magic". While Rosenberg is still interested
in pursuing the /proc/slabinfo protection along with Enberg's slab
randomization patch, it's not clear that there will be much support for it
from other kernel developers. Given that it imposes a performance penalty
along with potentially inconveniencing users, without any real benefit, it
may be rather hard to sell.
Comments (6 posted)
The ptrace() system call is rarely held up as one of the better
parts of the Unix interface. This call, which is used for the tracing and
debugging of processes, is gifted with strange semantics (it reparents the
traced process to the tracer), numerous interface warts, and occasionally
unpredictable behavior. It is also hard to implement within the kernel;
there are few developers who are willing to get into the depths of the
implementation. So it's not surprising that there has
been occasional talk of simply replacing ptrace() with something
better; see this 2010 article for a
description of one such discussion.
While some developers think that ptrace() is beyond repair, Tejun
Heo disagrees. To back that up, he has posted a set of proposals for the improvement of the
existing interface:
ptrace currently is in a pretty bad shape and I think one of the
biggest reasons is a lot of effort has been spent trying to come up
with something completely new instead of concentrating on improving
what's already there. I think the existing principles are pretty
sound. They just need some love and attention here and there.
The bulk of the "love and attention" Tejun means to apply is addressed at
the interaction between tracing and job control. In an untraced process,
job control is used by the kernel and the shell to stop and restart
processes, possibly moving them between the foreground and the background.
Adding tracing to the picture confuses things for a number of reasons.
For example, reparenting the traced process deprives the real parent of the
ability to get notifications when that process is stopped or started.
There are also some strange internal transitions between the
TASK_STOPPED and TASK_TRACED states which lead to
unpredictable and sometimes surprising behavior. For example, a task which
is running under strace can be stopped with ^Z as usual,
but the shell will be unable to restart it.
Tejun has a series of concrete proposals to improve the situation. The
first of these is that a traced process should always, when stopped, be in
the TASK_TRACED state. The current strange transitions between
that state and TASK_STOPPED would go away. He would fix
things so that notifications when a process stops or starts would always go
to the real parent, even when a process has been reparented for tracing.
Some edge cases, such as what happens when a traced process is detached,
would be fixed so that the process's behavior matches the untraced case.
To fix the "can't start a stopped, traced process" problem, Tejun would
further enshrine the rule that the tracing process has total control over
the traced process's state. So it's up to the tracer to start a stopped
process if the shell wants that done.
Currently, tracers have no way to know that the real
parent has tried to start a stopped process, so a notification mechanism
needs to be added. That would be done by extending the STOPPED
notification that can currently be obtained with one of the variants of the
wait() system call.
Finally, Tejun would like to fix the behavior of the PTRACE_ATTACH
operation, which attaches to a process and sends a SIGSTOP signal
to put it into the stopped state. The signal confuses things, and the
stopped state is undesirable; it is not really possible, though, to change
the semantics of PTRACE_ATTACH in this way without breaking
applications. So he would create a new PTRACE_SEIZE operation
which would attach to a process (if it's not already attached) and put the
process immediately into the TASK_TRACED state.
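For comparison, this is roughly what the current attach sequence, whose
implicit SIGSTOP Tejun wants to avoid, looks like from a tracer's point of
view (a minimal sketch, with error handling pared down):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid_t pid = atoi(argv[1]);

        /* PTRACE_ATTACH implicitly sends SIGSTOP to the target... */
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0) {
            perror("PTRACE_ATTACH");
            return 1;
        }

        /* ...so the tracer must collect the resulting stop before it
           can do anything useful with the process. */
        int status;
        waitpid(pid, &status, 0);
        if (WIFSTOPPED(status))
            printf("%d stopped by signal %d\n", pid, WSTOPSIG(status));

        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;
    }

PTRACE_SEIZE, as proposed, would do away with the signal step entirely.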
These changes, Tejun thinks, are enough to turn ptrace() into
something rather more predictable and civilized. He'd like to go forward
into the implementation with a 2.6.40 target for merging. In the following
discussion, it seems that most developers agree with these changes, modulo
a quibble or two. The one big exception
was Roland McGrath, who has done a lot of work in this area. Roland has
some different ideas, especially with regard to PTRACE_SEIZE.
Roland's alternative to PTRACE_SEIZE (if it can truly be called an
"alternative," having been suggested first) is to add two new commands:
PTRACE_ATTACH_NOSTOP and PTRACE_INTERRUPT. The former
would attach to a process but not change its running state in any way,
while the latter would stop the process and put it into the
TASK_TRACED state. He sees a number of advantages to this
approach, including the ability to trace a process without ever stopping
it. There are cases (strace comes to mind) where there is no need
to stop the process; avoiding doing so allows the process to be traced
while minimizing the effects on its behavior.
Roland also foresaw a variant of PTRACE_INTERRUPT which would only
stop a process when it's running in user space. That would avoid the
occasional "interrupted system call" failure that current tracing can
cause. He also worries about what happens when PTRACE_SEIZE is,
itself, interrupted; handling that situation in a way that supports the
writing of robust applications, he says, would be hard. Finally, he raises
the issue of scalability; he does not think that PTRACE_SEIZE will
work well for the debugging of highly threaded applications. In summary,
None of this means at all that PTRACE_SEIZE is worthless. But it
is certainly inadequate to meet the essential needs that motivate
adding new interfaces in this area. The PTRACE_ATTACH_NOSTOP idea
I suggested is far from complete for all the issues as well, but it
is a more versatile building block than PTRACE_SEIZE.
Unfortunately, it seems that Roland is changing jobs and stopping work in
this area, so his thoughts may carry less weight than they normally would
have. As of this writing, there have been few responses to his post; Tejun
has mostly dismissed Roland's concerns.
Tejun has also posted a patch series implementing parts of his
proposal, but not, yet, PTRACE_SEIZE. The uncontroversial parts
of this work will almost certainly be merged; how PTRACE_ATTACH
will be fixed in the end remains to be seen.
Comments (none posted)
The out-of-memory (OOM) killer is charged with killing off processes in
response to a severe memory shortage. It has been the source of
considerable discussion and numerous rewrites over the years. Perhaps that
is inevitable given its purpose; choosing the right process to kill at the
right time is never going to be an easy thing to code. The extension of
the OOM killer into control groups has added to its flexibility, but has
also raised some interesting issues of its own.
Normally, the OOM killer is invoked when the system as a whole is
catastrophically out of memory. In the control group context, the OOM
killer comes into play when the memory usage by the processes within that group
exceeds the configured maximum and attempts to reclaim memory from those
processes have failed. An out-of-memory situation which is contained to a
control group is bad for the processes involved, but it should not threaten
the rest of the system. That allows for a little more flexibility in how
out-of-memory situations are handled.
In particular, it is possible for user space to take over OOM-killer duties
in the control group context. Each group has a control file called
oom_control which can be used in a couple of interesting ways:
- Writing "1" to that file will disable the OOM killer within
that group. Should an out-of-memory situation come about, the
processes in the affected group will simply block when attempting to
allocate memory until the situation improves somehow.
- Through the use of a special eventfd() file descriptor, a
process can use the oom_control file to sign up for
notifications of out-of-memory events (see Documentation/cgroups/memory.txt for the
details on how that is done). That process will be informed whenever
the control group runs out of memory; it can then respond to
address the problem.
There are a number of ways that this user-space OOM killer can fix a
memory issue that affects a control group. It could simply raise the limit
for that group, for example. Alternatives include killing processes or
moving some processes to a different control group. All told, it's a
reasonably flexible way of allowing user space to take over the
responsibility of recovering from out-of-memory disasters.
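Following the recipe in Documentation/cgroups/memory.txt, the
registration step might look like this in practice (a sketch; the cgroup
mount point /dev/cgroup/mygroup is an invented example):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);
        int ofd = open("/dev/cgroup/mygroup/memory.oom_control", O_RDONLY);
        int cfd = open("/dev/cgroup/mygroup/cgroup.event_control", O_WRONLY);
        if (efd < 0 || ofd < 0 || cfd < 0) {
            perror("setup");
            return 1;
        }

        /* Writing "<eventfd> <oom_control fd>" to cgroup.event_control
           ties out-of-memory events for the group to the eventfd. */
        char buf[32];
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (write(cfd, buf, strlen(buf)) < 0) {
            perror("register");
            return 1;
        }

        /* This read blocks until the group hits its limit; a real
           handler would then raise the limit, kill a process, or move
           processes elsewhere. */
        uint64_t count;
        if (read(efd, &count, sizeof(count)) == sizeof(count))
            printf("OOM event received\n");
        return 0;
    }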
At Google, though, it seems that it's not quite flexible enough. As has
been widely reported, Google does not have very many machines to work with,
so the company has a tendency to cram large numbers of tasks onto each
host. That has led to an interesting problem: what happens if the
user-space OOM killer is, itself, so starved for memory that it is unable
to respond to an out-of-memory condition? What happens, it turns out, is
that things just come to an unpleasant halt.
Google operations is not overly fond of unpleasant halts, so an attempt has
been made to find another solution. The outcome was a patch from David Rientjes adding another
control file to the control group called oom_delay_millisecs.
Like oom_control, it holds off the kernel's OOM killer in favor of
a user-space alternative. The difference is that the administrator can
provide a time limit for the kernel OOM killer's patience; if the
out-of-memory situation persists after that much time, the kernel's OOM
killer will step in and resolve the situation with as much prejudice as
necessary.
To David, this delay looks like a useful new feature for the memory control
group mechanism. To Andrew Morton, instead, it looks like a kernel hack
intended to work around user-space bugs, and he is not that thrilled by
it. In Andrew's view, if user space has
set itself up as the OOM handler for a control group, it needs to ensure
that it is able to follow through. Adding the delay looks like a way to
avoid that responsibility which could have long-term effects:
My issue with this patch is that it extends the userspace API.
This means we're committed to maintaining that interface *and its
behaviour* for evermore. But the oom-killer and memcg are both
areas of intense development and the former has a habit of getting
ripped out and rewritten. Committing ourselves to maintaining an
extension to the userspace interface is a big thing, especially as
that extension is somewhat tied to internal implementation details
and is most definitely tied to short-term inadequacies in userspace
and in the kernel implementation.
Andrew would rather see development effort put into fixing any kernel
problems which might be preventing a user-space OOM killer from doing its
job. David, though, doesn't see a way to
work without this feature. If it doesn't get in, Google may have to carry
it separately; he predicted, though, that other users will start asking for
it as usage of the memory controller increases. As of this writing, that's
where the discussion stands.
Comments (36 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet