The current development kernel is 2.6.33-rc3, released on January 5.
The bulk of the patches are some SH defconfig updates
(40%), but ignoring those we have the normal 'half drivers, half everything
else' pattern. On the driver front, perhaps the most notable change is not
so much a code change as the small change of marking the "new" firewire
stack as the recommended one.
The short-form changelog is in
the announcement, or see the
full changelog for the details.
2.6.33-rc2 was released on
December 24. It included a number of fixes, the Nouveau "ctxprogs"
generator for nv40 chipsets, and a Silicon Motion sm712 video card driver;
this release also saw the removal of the unused and abandoned distributed storage subsystem.
Full details are in the full changelog.
Stable updates: the 2.6.27.44, 2.6.31.11, and 2.6.32.3 stable kernel updates were
released on January 6. All three contain a mixture of fixes;
2.6.27.44 is relatively small while
the other two are large. Updates for 2.6.31 will probably end with 2.6.31.12.
Comments (none posted)
And we do make plenty of mistakes. And when we fix them, we have
to maintain bug-compatibility to allow live migration from the
broken version to the good version. If you're ever feeling overly
happy, do some compat work in qemu and it will suck a year's worth
or two of your life force a pop.
-- Avi Kivity
Application developers have historically been intolerant of systems
that change their security policy on the fly. No, let me say what I
really mean. They hate them with a flaming passion. Sometimes the
system requirements make it necessary, but please don't think the
application developers will thank you for it.
-- Casey Schaufler
Its always easier short term to pee in the pond than install a
toilet - it's just not a good long term plan.
-- Alan Cox
If you start a benchmark and you don't know what the answer should
be, at the very least within a factor of 10 and ideally within a
factor of 2, you shouldn't be running the benchmark. Well, maybe
you should, they are fun. But you sure as heck shouldn't be
publishing results unless you know they are correct.
-- Larry McVoy
Comments (none posted)
The kernel has long had a set of standard functions for the manipulation of
linked lists. What it has lacked, though, is a function for sorting those
lists. Actually, that's not quite true: it has two of them, one in the
direct rendering code and one in the UBIFS filesystem. When Dave Chinner
found himself needing the same functionality for XFS, he decided that
adding a third implementation was probably not the best idea.
So, instead, Dave grabbed the UBIFS version and reworked it into a generic list_sort()
patch. The result is this function:
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a, struct list_head *b));
This function behaves like many generic sort utilities - the cmp()
function will be called with pairs of list elements (and the given
priv pointer); it should return an integer value indicating
whether the first item should sort ahead of or behind the second.
The existing users of this functionality have acknowledged the change, so
it will almost certainly make an appearance in 2.6.34.
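list_sort() itself lives in the kernel, but its calling convention is easy to demonstrate in user space. The sketch below is a hypothetical stand-in, not the kernel code: the list primitives are simplified and a plain insertion sort replaces the real merge sort. What it shows is the part that matters to callers: embedding a list_head in your own structure and writing a cmp() function.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev; item->next = head;
	head->prev->next = item; head->prev = item;
}

/* Stand-in for the kernel's list_sort(): same interface, but a simple
 * insertion sort instead of the kernel's merge sort. */
static void list_sort(void *priv, struct list_head *head,
		      int (*cmp)(void *priv, struct list_head *a,
				 struct list_head *b))
{
	struct list_head sorted;

	list_init(&sorted);
	while (head->next != head) {
		struct list_head *item = head->next, *pos;

		/* Unlink the first remaining item... */
		item->prev->next = item->next;
		item->next->prev = item->prev;
		/* ...find where it belongs in the sorted chain... */
		for (pos = sorted.next; pos != &sorted; pos = pos->next)
			if (cmp(priv, item, pos) < 0)
				break;
		/* ...and insert it there (just before pos). */
		item->prev = pos->prev; item->next = pos;
		pos->prev->next = item; pos->prev = item;
	}
	if (sorted.next == &sorted)
		return;		/* original list was empty */
	/* Splice the sorted chain back onto the original head. */
	head->next = sorted.next; head->prev = sorted.prev;
	sorted.next->prev = head; sorted.prev->next = head;
}

/* A caller embeds the list_head in its own structure... */
struct item { int value; struct list_head list; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* ...and compares its own fields in cmp(): a negative return sorts
 * a ahead of b, following the convention described above. */
static int cmp_items(void *priv, struct list_head *a, struct list_head *b)
{
	(void)priv;
	return container_of(a, struct item, list)->value -
	       container_of(b, struct item, list)->value;
}
```

Only the cmp() convention is part of the interface; the kernel version does the sorting in O(n log n) rather than the quadratic sketch used here.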
Comments (2 posted)
Kernel development news
New security features can be affected by the "law of unintended
consequences", because a seemingly simple restriction can run afoul of
unanticipated interactions with other parts of the system—often other
security mechanisms. These interactions can be difficult to spot
immediately, which makes kernel hackers very careful about adding new security
features. A recent proposal to provide a means for processes to restrict
their network access—something that would be useful for a process
sandbox for instance—ran into unintended consequences. But the
somewhat ad hoc nature of the feature, and its tuning for a fairly
specific use case, also caused objections from some.
The basic idea is fairly simple. Michael Stone would like to have a means
for a process to reduce its privileges such that it can make no new
network connections. It would be a one-way gate for the process (and any
children) that would restrict network usage to previously opened connections.
Because Stone's use case is for the desktop—specifically some parts of
the OLPC Bitfrost security model—there would be an exception made for
connecting to named AF_UNIX sockets, which would allow restricted
processes to still be able to talk to the X server.
When he initially proposed the feature in
an RFC in January 2009, Stone took a straightforward approach using
resource limits. He added a new boolean limit (RLIMIT_NETWORK)
that could be set by a process to turn off further network activity.
One problem with that scheme was that it didn't actually confine the
process, because it didn't prevent the use of ptrace(): a subverted,
network-limited process could still do networking via another process by
using ptrace() on it.
In addition, James Morris noted that
network namespaces might be a possible solution to
the problem. After that round of comments, Stone came back with an updated
patchset in December. He
addressed the ptrace() issue by adding a test for the resource
limit in __ptrace_may_access() that would prevent processes that
are network-limited from using ptrace(). He also noted that using
network namespaces didn't support one part of his use case: processes in a
private namespace could no longer connect to the X server. In addition, using
resource limits as the interface was not very well received by glibc maintainer
Ulrich Drepper ("it's a pain to deal with
rlimit extensions"), who suggested using prctl() instead.
Stone quickly turned around another version of
the patch that used prctl(), but a few problems cropped up
along the way.
At first blush, removing further network access seems like a harmless way
for a process to voluntarily give up some portion of its privileges. But,
when coupled with setuid() binaries that expect to be able to
access the network, things get murkier. As Eric W. Biederman put it: "You can in theory
confuse a suid root application and cause it to take action with it's
elevated privileges that violate the security policy." That is why
privileges are required for entering a new network namespace (as well as
for things like chroot()), because they can violate the
assumptions made by setuid() programs.
Stone is looking for a mechanism that doesn't require a privileged process,
however, which is why he proposed resource limits or prctl() as
the interface. But those don't alleviate the problem with suid programs.
The so-called "sendmail capabilities bug" was brought up several times in
the conversation about Stone's feature as a concrete example of how the
interaction between security mechanisms can go awry. That bug was really
in the kernel, but by manipulating the Linux capabilities of a process
before spawning sendmail (which runs as setuid(0)), attackers
could bypass the privilege separation that sendmail tries to
enforce. Adding a new security mechanism (capabilities) suddenly—mistakenly—changed the behavior of a well-established security technique.
Implementation bugs aside, there are concerns about sprinkling support for
this feature in various places in the kernel: ptrace() and the
networking stack, particularly since the
changes have the AF_UNIX exception as a special case. Alan Cox
puts it this way:
This is a security model, it belongs as a security model using LSM. You
can already do it with SELinux and the like as far as I can see but
that's not to say you shouldn't submit it also as a small handy
standalone security module for people who don't want to load the big boys.
Otherwise you end up putting crap in fast paths that nobody needs but
everyone pays for and weird tests and hacks for address family and like
into core network code.
The fact the patches look utterly ugly should be telling you something -
which is that you are using the wrong hammer.
Unfortunately, switching to an LSM-based solution opens the "stacking-LSM can of worms
again", as Valdis Kletnieks calls
it. Currently, there is no general way to run multiple LSMs in a single
kernel. The idea has come up multiple times, but there are serious
concerns about allowing it. Any new LSM is much less likely to be used, at
least in distributions that already use one of the "monolithic" security
modules like SELinux, TOMOYO, or the out-of-tree AppArmor. In another
thread Stone queried linux-kernel on the use of LSM and
expressed that concern:
Unfortunately, I don't feel that I can make effective use of these hooks
because they seem to be "occupied" by the large mandatory access control
frameworks.
Smack developer Casey Schaufler essentially agreed, but
encouraged Stone to go forward with an LSM-based solution:
You're arguing for stacking a set of small security modules. This
is a direction that has gotten slammed pretty hard in the past but
that crops up every time someone like you comes along with a
module that serves a specific purpose. Mostly the objections have
come from people who will tell you that something else already
does what you're trying to do, and that all you have to do is take
on the entirety of their monolithic approach and you'll be happy.
I'm behind you 100%. Use the LSM. Your module is exactly why we have
the blessed thing. Once we get a collection of otherwise unrelated
LSMs the need for a stacker will be sufficiently evident that we'll
be able to get one done properly.
There are good reasons to be concerned about stacking security modules, but
they mostly stem from trying to combine things like SELinux and TOMOYO
rather than small single-purpose modules. Serge E. Hallyn warned that "the problem is that
composing any two security policies can quickly have
subtle, unforeseen, but dangerous effects." But he also pointed out
that there are ways to "hardcode" stacking with the assistance of the other
LSM:
So with your module, I'd recommend following the route of the capabilities
LSM. You can provide an optional stand-alone LSM which only hooks your
functions. Then smack, for instance, can call the functions in your LSM
from within its own hooks, or it can simply explicitly assign its hooks to
your functions in smack_ops. Selinux can do the same thing, although I
suspect they would more likely implement their own functions for your newly
added hooks.
While not opposed to that approach in principle, Stone notes that it requires others to change their
code, something he has been trying to avoid:
Doesn't it seem a bit strange to you to be recommending that everyone else
using the five security hooks I want to use modify their code *in detail* to
support my functionality given that my functionality is explicitly intended not
to require any such work on their part?
This seems frankly silly to me, not to mention expensive and error-prone.
Another alternative would be to use SELinux to do the restriction as Kyle
Moffett suggested: "If you aren't using SELinux at this time (and therefore have no
existing policy), then it's actually pretty straightforward
(relatively speaking) to set up for your particular goals." He
outlined an SELinux policy scheme to enforce the networking restrictions. Schaufler was skeptical of that approach—while noting
his amusement that an SELinux advocate would call the default policies "fantastically
complicated" as Moffett did. Schaufler expects the full policy to
support Stone's use case to
be rather complicated itself:
I'm willing to bet all the beers you can drink in a sitting that
the policy would be bigger than the proposed LSM. You can count that
in either bytes or lines.
Meanwhile, Stone proposed yet another version that uses the LSM
hooks. The feature is still enabled through prctl(PR_SET_NETWORK,
PR_NETWORK_OFF), but the implementation is done via a
disablenetwork LSM. But there is still the problem of removing
the network for setuid() programs that are spawned from the
restricted, unprivileged program. Some don't see that as a real problem,
because the network could go away for other reasons (network cable pulled,
open file limit set sufficiently low, and so forth), but others like Pavel
Machek, who NAKed the patch, disagree,
envisioning plausible, if unlikely, scenarios where it could cause a problem.
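As far as the interface goes, this version reduces to a single system call. A hypothetical sketch of how a sandboxing process might use it follows; note that PR_SET_NETWORK and PR_NETWORK_OFF come from the unmerged patch, are not in stock kernel headers, and this will not build or run against a mainline kernel.

```c
#include <sys/prctl.h>

/* PR_SET_NETWORK / PR_NETWORK_OFF exist only with the proposed patch
 * applied; on a stock kernel this prctl() simply fails. */
int sandbox_network(void)
{
	/* Open any sockets the process will need first: already-open
	 * descriptors (and connects to named AF_UNIX sockets, such as
	 * the X server's) are intended to keep working. */
	if (prctl(PR_SET_NETWORK, PR_NETWORK_OFF, 0, 0, 0) != 0)
		return -1;	/* kernel lacks the feature */

	/* From here on, this process and its children can make no new
	 * network connections; the change is a one-way gate. */
	return 0;
}
```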
That led Biederman to propose
a mechanism that would allow processes to call
prctl(PR_SET_NOSUID) to permanently revoke their ability to
execute setuid() programs (in much the same manner as the
MNT_NOSUID mount option). Any process that did that would then
be eligible to also revoke its network access. In addition, it would
potentially allow entering private namespaces to become a non-privileged
operation, as namespaces suffer from some of the same issues regarding
setuid() programs.
But, once again, Biederman's patch implements a security model of sorts,
and belongs in an LSM, at least according to
Cox: "Another fine example of why we have security hooks so that we don't get a
kernel full of other 'random security idea of the day' hacks."
Which leads right back to the problem of stacking security modules. Like
Schaufler, though, Cox seems to think LSM stacking will eventually come to
pass:
Yes it might mean the hooks need tweaking, yes it probably means the
people who want these need to do some trivial stacking work, but if as
many people are actually really interested as are having random 'lets add
a button to disable reading serial ports on wednesday' ideas there should
be no shortage of people to do the job right.
Part of the problem is the whole raft of security mechanisms that Linux
supports: setuid(), capabilities, LSMs, monolithic LSMs like
SELinux, securebits (which was mentioned as a possible solution for
PR_SET_NOSUID), seccomp, and more. As the sendmail capabilities
bug showed, these can interact in unexpected ways. Adding a specific knob,
whether it be disabling the network or setuid(), only addresses
that particular problem, while potentially impacting the whole system in
unforeseen ways.
It is rather counter-intuitive that allowing non-root programs to
voluntarily drop some portion of their privileges should lead to other
security problems. The root cause may really be setuid(), but
that mechanism is so ingrained into Unix programming that there is
little to be done but live with it—warts and all. But there will be
more and more pressure to provide ways for processes to sandbox themselves
(and others). The seccomp
changes proposed by Google for its Chrome browser in May are another
way of approaching the problem.
Even with all of the competing—sometimes clashing—security
mechanisms available, one gets the sense that there is more infrastructural
work to be done in Linux
security. If the concern about generalized LSM stacking mainly applies to
the monolithic security models, one could imagine some kind of "LSM lite"
that used the
same hooks but had restrictions on behavior such that modules could stack.
Perhaps some of these restrictions could be implemented as some kind of
trusted user space daemon that changed the capabilities of running
processes. So far, it's not clear where things are headed, but it does
seem clear that sandboxing is something that folks want to be able to do,
and that there are some approaches to that problem that Linux does not yet
support.
Comments (6 posted)
The longstanding memory fragmentation problem has been covered many times
in these pages. In short: as the system runs, pages tend to be scattered
between users, making it hard to find groups of physically-contiguous pages
when they are needed. Much work has gone into avoiding the need for
higher-order (multi-page) memory allocations whenever possible, with the
result that most kernel functionality is not hurt by page fragmentation.
But there are still situations where higher-order allocations are needed;
code which needs such allocations can fail on a fragmented system.
It's worth noting that, in one way, this problem is actually getting
worse. Contemporary processors are not limited to 4K pages; they can work
with much larger pages ("huge pages") in portions of a process's address
space. There can be real performance advantages to using huge pages,
mostly as a result of reduced pressure on the processor's translation
lookaside buffer. But the use of huge pages requires that the system be
able to find physically-contiguous areas of memory which are not only big
enough, but which are properly aligned as well. Finding that kind of space
can be quite challenging on systems which have been running for any period
of time.
Over the years, the kernel developers have made various attempts to
mitigate this problem; techniques like ZONE_MOVABLE and lumpy reclaim have been the
result. There is still more that can be done, though, especially in the
area of fixing fragmentation to recover larger chunks of memory. After
taking a break from this area, Mel Gorman has recently returned with a new
patch set implementing memory
compaction. Here we'll take a quick look at how this patch works.
Imagine a very small memory zone which looks like this:
Here, the white pages are free, while those in red are allocated to some
use. As can be seen, the zone is quite fragmented, with no contiguous
blocks of larger than two pages available; any attempt to allocate, for
example, a four-page block from this zone will fail. Indeed, even two-page
allocations will fail, since none of the free pairs of pages are properly
aligned.
It's time to call in the compaction code. This code runs as two separate
algorithms; the first of them starts at the bottom of the zone and builds a
list of allocated pages which could be moved:
Meanwhile, at the top of the zone, the other half of the algorithm is
creating a list of free pages which could be used as the target of page
migration:
Eventually the two algorithms will meet somewhere toward the middle of the
zone. At that point, it's mostly just a matter of invoking the page migration code (which is
not just for NUMA systems anymore) to shift the used pages to the free
space at the top of the zone, yielding a pretty picture like this:
We now have a nice, eight-page, contiguous span of free space which can be
used to satisfy higher-order allocations if need be.
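The two-scanner idea can be illustrated with a toy model; this is purely a sketch of the algorithm's shape, not the kernel code. Here a zone is modeled as a string where 'U' is a used, movable page, 'N' a non-movable page, and '.' a free page:

```c
/* Toy model of a compaction pass: a migration scanner walks up from the
 * bottom of the zone looking for used, movable pages ('U'), while a free
 * scanner walks down from the top looking for free target pages ('.').
 * Non-movable pages ('N') are skipped and stay where they are. */
static void compact_zone(char *zone, int npages)
{
	int migrate = 0;		/* migration scanner: bottom-up  */
	int free_idx = npages - 1;	/* free scanner: top-down        */

	while (migrate < free_idx) {
		while (migrate < free_idx && zone[migrate] != 'U')
			migrate++;	/* nothing movable here */
		while (migrate < free_idx && zone[free_idx] != '.')
			free_idx--;	/* not a free target */
		if (migrate < free_idx) {
			/* "Migrate" the used page toward the top,
			 * leaving free space behind at the bottom. */
			zone[free_idx--] = 'U';
			zone[migrate++] = '.';
		}
	}
}
```

In the real patch the "pages" are struct page entries, movability is a property of how a page is mapped, and the actual moves go through the page migration code rather than a simple store. Note also how a single 'N' page left in place splits what would otherwise be a larger contiguous free run, which is the point made below about non-movable pages.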
Of course, the picture given here has been simplified considerably from
what happens on a real system. To begin with, the memory zones will be
much larger; that means there's more work to do, but the resulting free
areas may be much larger as well.
But all this only works if the pages in
question can actually be moved. Not all pages can be moved at will; only
those which are addressed through a layer of indirection and which are not
otherwise pinned down are movable. So most user-space pages - which are
accessed through user virtual addresses - can be moved; all that is needed
is to tweak the relevant page table entries accordingly. Most memory used
by the kernel directly cannot be moved - though some of it is reclaimable,
meaning that it can be freed entirely on demand.
It only takes one non-movable page to ruin a contiguous segment of memory.
The good news here is that the kernel already takes care to separate
movable and non-movable pages, so, in reality, non-movable pages should be
a smaller problem than one might think.
The running of the compaction algorithm can be triggered in either of two
ways. One is to write a node number to /proc/sys/vm/compact_node,
causing compaction to happen on the indicated NUMA node. The other is for
the system to fail in an attempt to allocate a higher-order page; in this
case, compaction will run as a preferable alternative to freeing pages
through direct reclaim. In the absence of an explicit trigger, the
compaction algorithm will stay idle; there is a cost to moving pages around
which is best avoided if it is not needed.
Mel ran some simple tests showing that, with compaction enabled, he was
able to allocate over 90% of
the system's memory as huge pages while
simultaneously decreasing the amount of reclaim activity needed. So it
looks like a useful bit of work. It is memory management code, though, so
the amount of time required to get into the mainline is never easy to
predict in advance.
Comments (7 posted)
The sysctl mechanism has seen a lot of work in recent kernel development
cycles, resulting in the removal of a lot of code and a reduction in big
kernel lock usage. It turns out, though, that this work has also introduced some
subtle and rare race conditions into the handling of string data exported
to user space. In response, Andi Kleen has put together a new concept
called "RCU strings," using the read-copy-update mechanism to eliminate the
races without the introduction of new locks on the read path.
There are a number of strings managed through sysctl. As an example,
consider request_module(), which is used by kernel code to ask
user space to load a module. A call to request_module() will
result in an invocation of modprobe, but nobody wants to wire
the name or location of modprobe in kernel code. So the sysctl
variable /proc/sys/kernel/modprobe is used to contain the location
of this utility. It will be set to "/sbin/modprobe" on almost any
Linux system, but an administrator can change it if need be.
Consider the case of a request_module() call which happens at
exactly the same time as a change to /proc/sys/kernel/modprobe
from user space. It is entirely possible that request_module()
could end up with the path to modprobe which has been partially
modified. The most likely result is a failed attempt to load the module,
but worse things could happen. This situation is well worth avoiding.
(One should note that these races are not, in general, potential security
problems. The changing of sysctl variables is a privileged operation, so
it cannot be done from arbitrary user accounts.)
The read-copy-update mechanism was designed to ensure that data -
especially data which is frequently read but rarely modified - remains
stable while it is being used. So it seems well suited to the protection
of sysctl strings which, likely as not, will never be changed over the
lifetime of the system. RCU can be a bit tricky to use, though; the RCU
string type is designed to make things a bit easier.
The creation of an RCU string is accomplished through:
char *alloc_rcu_string(int size, gfp_t gfp);
The size parameter should be the maximum size that the string can
be - null byte included.
Following the normal RCU pattern, read access to this string is
accomplished by way of a pointer to that string. Atomic readers - those
which do not sleep - need only use rcu_read_lock() and
rcu_dereference() to mark their
use of the RCU-protected pointer. Any code which might sleep will have to
take other measures, since the string could change while the code
is not running. In this case, a copy of the string should be made with:
char *access_rcu_string(char **str, int size, gfp_t gfp);
Here, str is a pointer to the string pointer, and size is
the size of the originally-allocated string. Using strlen() to
get size would be a serious mistake, since the string could
possibly change before the copy is made. The new string is allocated with
kmalloc(); the given gfp flags are used for the
allocation. The copied string should be freed with kfree() when
it is no longer needed.
Code changing an RCU string should use alloc_rcu_string() to
allocate a replacement string, copy the data into it, then use
rcu_assign_pointer() to make the new string visible to the rest of
the system. The old string should be passed to free_rcu_string(),
which will use RCU to free the memory once it is known that no references
to that string can still exist.
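Putting the pieces together, the writer and reader patterns might look like the following kernel-style sketch (not standalone-compilable; set_path(), do_something(), and the use of a modprobe_path variable are illustrative names, and concurrent writers are assumed to be serialized by an external lock):

```c
static char *modprobe_path;	/* the RCU-protected string */

/* Writer: build a replacement, publish it, retire the old one. */
static int set_path(const char *new_path)
{
	char *new, *old;

	new = alloc_rcu_string(PATH_MAX, GFP_KERNEL);
	if (!new)
		return -ENOMEM;
	strlcpy(new, new_path, PATH_MAX);

	old = modprobe_path;
	rcu_assign_pointer(modprobe_path, new);
	free_rcu_string(old);	/* actually freed after a grace period */
	return 0;
}

/* Atomic reader: no copy needed, just bracket the access. */
static void use_path_atomic(void)
{
	rcu_read_lock();
	do_something(rcu_dereference(modprobe_path));
	rcu_read_unlock();
}

/* Sleeping reader: take a stable copy first, free it afterward. */
static void use_path_sleeping(void)
{
	char *copy = access_rcu_string(&modprobe_path, PATH_MAX, GFP_KERNEL);

	if (copy) {
		do_something_that_sleeps(copy);
		kfree(copy);
	}
}
```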
String variables tend to be exported through sysctl using
proc_dostring(). To make life easier, Andi has added a new
function, proc_rcu_string(), which handles most of the details of
exporting an RCU string. It's a simple matter of initializing the
appropriate ctl_table structure with a char **
pointer to the string pointer and setting the proc_handler entry
to proc_rcu_string(). The initial value of the string is allowed
to be a compile-time constant string; anything else is expected to be an
RCU string.
This code has been through a couple rounds of review and seems likely to be
merged in the 2.6.34 development cycle.
Comments (1 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet