Brief items
released on January 5. Linus says:
The short-form changelog is in the announcement, or see the full changelog for the details.
2.6.33-rc2 was released on December 24. It included a number of fixes, the Nouveau "ctxprogs" generator for nv40 chipsets, and a Silicon Motion sm712 video card driver; this release also saw the removal of the unused and abandoned distributed storage subsystem. Full details are in the full changelog.
Stable updates: the 220.127.116.11, 18.104.22.168, and 22.214.171.124 stable kernel updates were released on January 6. All three contain a mixture of fixes; 126.96.36.199 is relatively small while the other two are large. Updates for 2.6.31 probably end with 188.8.131.52.
So, instead, Dave grabbed the UBIFS version and reworked it into a generic list_sort() patch. The result is this function:
void list_sort(void *priv, struct list_head *head, int (*cmp)(void *priv, struct list_head *a, struct list_head *b));
This function behaves like many generic sort utilities - the cmp() function will be called with pairs of list elements (and the given priv pointer); it should return an integer value indicating whether the first item should sort ahead of or behind the second.
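As a rough illustration of that calling convention, here is a minimal userspace sketch in C. It is not the kernel implementation: struct list_head is redefined locally, and list_sort_sketch(), struct item, item_of(), list_add_tail_sketch(), and cmp_key() are names invented for this example. It also assumes that cmp() returns zero or a negative value when the first item should stay ahead of the second, and a positive value otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular, doubly-linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical payload type embedding a list node. */
struct item {
    int key;
    struct list_head node;
};

#define item_of(ptr) \
    ((struct item *)((char *)(ptr) - offsetof(struct item, node)))

typedef int (*list_cmp_fn)(void *priv, struct list_head *a,
                           struct list_head *b);

void list_add_tail_sketch(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Merge two sorted, NULL-terminated chains linked through ->next. */
static struct list_head *merge(void *priv, list_cmp_fn cmp,
                               struct list_head *a, struct list_head *b)
{
    struct list_head dummy, *tail = &dummy;

    while (a && b) {
        /* Taking from a on ties keeps equal elements in order. */
        if (cmp(priv, a, b) <= 0) {
            tail->next = a;
            a = a->next;
        } else {
            tail->next = b;
            b = b->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return dummy.next;
}

/* Top-down merge sort of a NULL-terminated chain. */
static struct list_head *sort_chain(void *priv, list_cmp_fn cmp,
                                    struct list_head *chain)
{
    struct list_head *slow = chain, *fast, *second;

    if (!chain || !chain->next)
        return chain;
    /* Find the midpoint with slow and fast pointers. */
    for (fast = chain->next; fast && fast->next; fast = fast->next->next)
        slow = slow->next;
    second = slow->next;
    slow->next = NULL;
    return merge(priv, cmp, sort_chain(priv, cmp, chain),
                 sort_chain(priv, cmp, second));
}

void list_sort_sketch(void *priv, struct list_head *head, list_cmp_fn cmp)
{
    struct list_head *cur, *prev = head;

    assert(head != NULL && cmp != NULL);
    if (head->next == head)
        return;                     /* empty list */
    head->prev->next = NULL;        /* break the circle */
    cur = sort_chain(priv, cmp, head->next);
    /* Rebuild the prev pointers and close the circle again. */
    for (; cur; cur = cur->next) {
        prev->next = cur;
        cur->prev = prev;
        prev = cur;
    }
    prev->next = head;
    head->prev = prev;
}

/* Example comparison callback: ascending order by key. */
int cmp_key(void *priv, struct list_head *a, struct list_head *b)
{
    int ka = item_of(a)->key, kb = item_of(b)->key;

    (void)priv;
    return (ka > kb) - (ka < kb);
}
```

A caller would link its items onto a list head, then call list_sort_sketch(NULL, &head, cmp_key); the priv pointer is simply handed through to every cmp() call, which lets a comparison function reach per-sort context without global variables.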
The existing users of this functionality have acknowledged the change, so it will almost certainly make an appearance in 2.6.34.
Kernel development news
New security features can be affected by the "law of unintended consequences", because a seemingly simple restriction runs afoul of unanticipated interactions with other parts of the system—often other security mechanisms. These interactions can be difficult to spot immediately, which makes kernel hackers very careful about adding new security features. A recent proposal to provide a means for processes to restrict their network access—something that would be useful for a process sandbox for instance—ran into unintended consequences. But the somewhat ad hoc nature of the feature, and its tuning for a fairly specific use case, also caused objections from some.
The basic idea is fairly simple. Michael Stone would like to have a means for a process to reduce its privileges such that it can no longer make network connections. It would be a one way gate for a process (and any subsequent children) that would restrict network usage to previously opened connections. Because Stone's use case is for the desktop—specifically some parts of the OLPC Bitfrost security model—there would be an exception made for connecting to named AF_UNIX sockets, which would allow restricted processes to still be able to talk to the X server.
When he initially proposed the idea in an RFC in January 2009, Stone took a straightforward approach using resource limits. He added a new boolean limit (RLIMIT_NETWORK) that could be set by a process to turn off further network activity. That scheme had a problem: it didn't actually limit the process, because it didn't stop the process from using ptrace(). A subverted process could still do networking via another process by using ptrace() on it.
In addition, James Morris noted that network namespaces might be a possible solution to the problem. After that round of comments, Stone came back with an updated patchset in December. He addressed the ptrace() issue by adding a test for the resource limit in __ptrace_may_access() that would prevent processes that are network-limited from using ptrace(). He also noted that using network namespaces didn't support one part of his use case: processes in a private namespace could no longer connect to the X server using AF_UNIX sockets.
Using resource limits as the interface was not very well received by glibc maintainer Ulrich Drepper ("it's a pain to deal with rlimit extensions"), who suggested using prctl() instead. Stone quickly turned around another version of the patch that used prctl(), but a few problems cropped up along the way.
At first blush, removing further network access seems like a harmless way for a process to voluntarily give up some portion of its privileges. But, when coupled with setuid() binaries that expect to be able to access the network, things get murkier. As Eric W. Biederman put it: "You can in theory confuse a suid root application and cause it to take action with it's elevated privileges that violate the security policy." That is why privileges are required for entering a new network namespace (as well as for things like chroot()), because they can violate the assumptions made by setuid() programs.
Stone is looking for a mechanism that doesn't require a privileged process, however, which is why he proposed resource limits or prctl() as the interface. But those don't alleviate the problem with suid programs. The so-called "sendmail capabilities bug" was brought up several times in the conversation about Stone's feature as a concrete example of how the interaction between security mechanisms can go awry. That bug was really in the kernel, but by manipulating the Linux capabilities of a process before spawning sendmail (which runs as setuid(0)), attackers could bypass the privilege separation that sendmail tries to enforce. Adding a new security mechanism (capabilities) suddenly—mistakenly—changed the behavior of a well-established security technique.
Implementation bugs aside, there are concerns about sprinkling support for this feature in various places in the kernel (ptrace() and the networking stack), particularly since the changes have the AF_UNIX exception as a special case. Alan Cox puts it this way:
Otherwise you end up putting crap in fast paths that nobody needs but everyone pays for and weird tests and hacks for address family and like into core network code.
The fact the patches look utterly ugly should be telling you something - which is that you are using the wrong hammer
Unfortunately, switching to an LSM-based solution opens the "stacking-LSM can of worms again", as Valdis Kletnieks calls it. Currently, there is no general way to run multiple LSMs in a kernel. The idea has come up multiple times, but there are serious concerns about allowing it. Any new LSM is much less likely to be used, at least in distributions that already use one of the "monolithic" security modules like SELinux, TOMOYO, or the out-of-tree AppArmor. In another thread Stone queried linux-kernel on the use of LSM and expressed that concern:
Smack developer Casey Schaufler essentially agreed, but encouraged Stone to go forward with an LSM-based solution:
I'm behind you 100%. Use the LSM. Your module is exactly why we have the blessed thing. Once we get a collection of otherwise unrelated LSMs the need for a stacker will be sufficiently evident that we'll be able to get one done properly.
There are good reasons to be concerned about stacking security modules, but they mostly stem from trying to combine things like SELinux and TOMOYO rather than small single-purpose modules. Serge E. Hallyn warned that "the problem is that composing any two security policies can quickly have subtle, unforeseen, but dangerous effects." But he also pointed out that there are ways to "hardcode" stacking with the assistance of the other LSM developers:
While not opposed to that approach in principle, Stone notes that it requires others to change their code, something he has been trying to avoid:
This seems frankly silly to me, not to mention expensive and error-prone.
Another alternative would be to use SELinux to do the restriction as Kyle Moffett suggested: "If you aren't using SELinux at this time (and therefore have no existing policy), then it's actually pretty straightforward (relatively speaking) to set up for your particular goals." He outlined an SELinux policy scheme to enforce the networking restrictions. Schaufler was skeptical of that approach, while noting his amusement that an SELinux advocate would call the default policies "fantastically complicated" as Moffett did. Schaufler expects the full policy to support Stone's use case to be rather complicated itself:
Meanwhile, Stone proposed yet another version that uses the LSM hooks. The feature is still enabled through prctl(PR_SET_NETWORK, PR_NETWORK_OFF), but the implementation is done via a disablenetwork LSM. But there is still the problem of removing the network for setuid() programs that are spawned from the restricted, unprivileged program. Some don't see that as a real problem, because the network could go away for other reasons (network cable pulled, open file limit set sufficiently low, and so forth), but others like Pavel Machek, who NAKed the patch, disagree, envisioning plausible, if unlikely, scenarios where it could cause a problem.
That led Biederman to propose a mechanism that would allow processes to call prctl(PR_SET_NOSUID) to permanently revoke their ability to execute setuid() programs (in much the same manner as the MNT_NOSUID mount option). Any process that did that would then be eligible to also revoke its network access. In addition, it would potentially allow entering private namespaces to become a non-privileged operation, as namespaces suffer from some of the same issues regarding setuid() programs.
But, once again, Biederman's patch implements a security model of sorts, and belongs in an LSM, at least according to Cox: "Another fine example of why we have security hooks so that we don't get a kernel full of other 'random security idea of the day' hacks." Which leads right back to the problem of stacking security modules. Like Schaufler, though, Cox seems to think LSM stacking will eventually come to pass:
Part of the problem is the whole raft of security mechanisms that Linux supports: setuid(), capabilities, LSMs, monolithic LSMs like SELinux, securebits (which was mentioned as a possible solution for PR_SET_NOSUID), seccomp, and more. As the sendmail capabilities bug showed, these can interact in unexpected ways. Adding a specific knob, whether it be disabling the network or setuid(), only addresses that particular problem, while potentially impacting the whole system in a negative way.
It is rather counter-intuitive that allowing non-root programs to voluntarily drop some portion of their privileges should lead to other security problems. The root cause may really be setuid(), but that mechanism is so ingrained into Unix programming that there is little to be done but live with it—warts and all. But there will be more and more pressure to provide ways for processes to sandbox themselves (and others). The seccomp changes proposed by Google for its Chrome browser in May are another way of approaching the problem.
Even with all of the competing—sometimes clashing—security mechanisms, one gets the sense that there is more infrastructural work to be done in Linux security. If the concern about generalized LSM stacking is only for the monolithic security models, one could imagine some kind of "LSM lite" that used the same hooks but had restrictions on behavior such that modules could stack. Perhaps some of these restrictions could be implemented as some kind of trusted user space daemon that changed the capabilities of running processes. So far, it's not clear where things are headed, but it does seem clear that sandboxing is something that folks want to be able to do, and that there are some approaches to that problem that Linux does not yet support.
It's worth noting that, in one way, this problem is actually getting worse. Contemporary processors are not limited to 4K pages; they can work with much larger pages ("huge pages") in portions of a process's address space. There can be real performance advantages to using huge pages, mostly as a result of reduced pressure on the processor's translation lookaside buffer. But the use of huge pages requires that the system be able to find physically-contiguous areas of memory which are not only big enough, but which are properly aligned as well. Finding that kind of space can be quite challenging on systems which have been running for any period of time.
Over the years, the kernel developers have made various attempts to mitigate this problem; techniques like ZONE_MOVABLE and lumpy reclaim have been the result. There is still more that can be done, though, especially in the area of fixing fragmentation to recover larger chunks of memory. After taking a break from this area, Mel Gorman has recently returned with a new patch set implementing memory compaction. Here we'll take a quick look at how this patch works.
Imagine a very small memory zone which looks like this:
Here, the white pages are free, while those in red are allocated to some use. As can be seen, the zone is quite fragmented, with no contiguous blocks larger than two pages available; any attempt to allocate, for example, a four-page block from this zone will fail. Indeed, even two-page allocations will fail, since none of the free pairs of pages are properly aligned.
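The alignment point can be made concrete with a small sketch in C. It assumes the usual buddy-allocator rule that a block of 2^n pages must start at a page number which is a multiple of 2^n; the boolean array standing in for the zone and the name max_aligned_free_order() are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the largest order n for which some naturally aligned run of
 * 2^n pages is entirely free, or -1 if no page is free at all.
 * used[i] is true when page i is allocated.
 */
int max_aligned_free_order(const bool *used, size_t npages)
{
    int best = -1;

    assert(used != NULL);
    for (int order = 0; ((size_t)1 << order) <= npages; order++) {
        size_t span = (size_t)1 << order;

        /* Only starting points that are multiples of span qualify. */
        for (size_t start = 0; start + span <= npages; start += span) {
            size_t i = 0;

            while (i < span && !used[start + i])
                i++;
            if (i == span) {
                best = order;
                break;
            }
        }
    }
    return best;
}
```

For a toy eight-page zone laid out as used, free, free, used, used, free, free, used, the function reports order 0: both free pairs exist, but they start at odd page numbers, so neither can satisfy a two-page request.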
It's time to call in the compaction code. This code runs as two separate algorithms; the first of them starts at the bottom of the zone and builds a list of allocated pages which could be moved:
Meanwhile, at the top of the zone, the other half of the algorithm is creating a list of free pages which could be used as the target of page migration:
Eventually the two algorithms will meet somewhere toward the middle of the zone. At that point, it's mostly just a matter of invoking the page migration code (which is not just for NUMA systems anymore) to shift the used pages to the free space at the top of the zone, yielding a pretty picture like this:
We now have a nice, eight-page, contiguous span of free space which can be used to satisfy higher-order allocations if need be.
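The two-scanner idea can be modeled in a few lines of C. This is a toy under stated assumptions, not Mel Gorman's code: the zone is a boolean array, every allocated page is treated as movable, a page is "migrated" by flipping two array entries as soon as both scanners have found a candidate (the real code builds lists first), and the name compact_zone_sketch() is invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy compaction of a zone where used[i] is true for an allocated page.
 * One scanner walks up from the bottom looking for used pages; the
 * other walks down from the top looking for free pages. Each used page
 * found below is moved into a free slot found above, until the
 * scanners meet. Returns the number of pages moved.
 */
size_t compact_zone_sketch(bool *used, size_t npages)
{
    size_t migrate = 0;        /* next page the bottom scanner examines */
    size_t free_end = npages;  /* one past the next page the top scanner examines */
    size_t moved = 0;

    assert(used != NULL);
    for (;;) {
        while (migrate < free_end && !used[migrate])
            migrate++;
        while (free_end > migrate && used[free_end - 1])
            free_end--;
        if (migrate >= free_end)
            break;             /* the scanners have met */

        /* "Migrate" the page: the top slot fills, the bottom slot frees. */
        used[free_end - 1] = true;
        used[migrate] = false;
        moved++;
        migrate++;
        free_end--;
    }
    return moved;
}
```

Run on an eight-page zone laid out as used, free, free, used, used, free, free, used, the sketch moves two pages and leaves the bottom four pages free and the top four in use, so a naturally aligned four-page block becomes available.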
Of course, the picture given here has been simplified considerably from what happens on a real system. To begin with, the memory zones will be much larger; that means there's more work to do, but the resulting free areas may be much larger as well.
But all this only works if the pages in question can actually be moved. Not all pages can be moved at will; only those which are addressed through a layer of indirection and which are not otherwise pinned down are movable. So most user-space pages - which are accessed through user virtual addresses - can be moved; all that is needed is to tweak the relevant page table entries accordingly. Most memory used by the kernel directly cannot be moved - though some of it is reclaimable, meaning that it can be freed entirely on demand. It only takes one non-movable page to ruin a contiguous segment of memory. The good news here is that the kernel already takes care to separate movable and non-movable pages, so, in reality, non-movable pages should be a smaller problem than one might think.
The running of the compaction algorithm can be triggered in either of two ways. One is to write a node number to /proc/sys/vm/compact_node, causing compaction to happen on the indicated NUMA node. The other is for the system to fail in an attempt to allocate a higher-order page; in this case, compaction will run as a preferable alternative to freeing pages through direct reclaim. In the absence of an explicit trigger, the compaction algorithm will stay idle; there is a cost to moving pages around which is best avoided if it is not needed.
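For the first trigger, a small C helper might look like the following sketch. The path /proc/sys/vm/compact_node is the one named by the patch set as described above and may not exist on a given kernel; the helper name request_compaction() is invented here, and the path is passed as a parameter so that the function can be exercised against an ordinary file. Writing to the real file would require root privileges.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a NUMA node number to a compaction control file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
int request_compaction(const char *path, int node)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "w");
    if (!f)
        return -1;             /* missing file or insufficient privilege */
    ok = fprintf(f, "%d\n", node) > 0;
    if (fclose(f) != 0)        /* the write may only fail when flushed */
        ok = 0;
    return ok ? 0 : -1;
}
```

On a kernel carrying the patch, request_compaction("/proc/sys/vm/compact_node", 0) would ask for compaction of node 0; a return value of -1 covers both a kernel without the file and a caller without the needed privilege.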
Mel ran some simple tests showing that, with compaction enabled, he was able to allocate over 90% of the system's memory as huge pages while simultaneously decreasing the amount of reclaim activity needed. So it looks like a useful bit of work. It is memory management code, though, so the amount of time required to get into the mainline is never easy to predict in advance.
There are a number of strings managed through sysctl. As an example, consider request_module(), which is used by kernel code to ask user space to load a module. A call to request_module() will result in an invocation of modprobe, but nobody wants to wire the name or location of modprobe in kernel code. So the sysctl variable /proc/sys/kernel/modprobe is used to contain the location of this utility. It will be set to "/sbin/modprobe" on almost any Linux system, but an administrator can change it if need be.
Consider the case of a request_module() call which happens at exactly the same time as a change to /proc/sys/kernel/modprobe from user space. It is entirely possible that request_module() could end up with a partially modified path to modprobe. The most likely result is a failed attempt to load the module, but worse things could happen. This situation is well worth avoiding.
(One should note that these races are not, in general, potential security problems. The changing of sysctl variables is a privileged operation, so it cannot be done from arbitrary user accounts.)
The read-copy-update mechanism was designed to ensure that data - especially data which is frequently read but rarely modified - remains stable while it is being used. So it seems well suited to the protection of sysctl strings which, likely as not, will never be changed over the lifetime of the system. RCU can be a bit tricky to use, though; the RCU string type is designed to make things a bit easier.
The creation of an RCU string is accomplished through:
#include <linux/rcustring.h>
char *alloc_rcu_string(int size, gfp_t gfp);
The size parameter should be the maximum size that the string can be - null byte included.
Following the normal RCU pattern, read access to this string is accomplished by way of a pointer to that string. Atomic readers - those which do not sleep - need only use rcu_read_lock() and rcu_dereference() to mark their use of the RCU-protected pointer. Any code which might sleep will have to take other measures, since the string could change while the code is not running. In this case, a copy of the string should be made with:
char *access_rcu_string(char **str, int size, gfp_t gfp);
Here, str is a pointer to the string pointer, and size is the size of the originally-allocated string. Using strlen() to get size would be a serious mistake, since the string could possibly change before the copy is made. The new string is allocated with kmalloc(); the given gfp flags are used for the allocation. The copied string should be freed with kfree() when it is no longer needed.
Code changing an RCU string should use alloc_rcu_string() to allocate a replacement string, copy the data into it, then use rcu_assign_pointer() to make the new string visible to the rest of the system. The old string should be passed to free_rcu_string(), which will use RCU to free the memory once it is known that no references to that string can still exist.
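The copy-then-publish pattern can be modeled in userspace C. This sketch is only an analogy, not the kernel API: a C11 atomic pointer stands in for rcu_assign_pointer() and rcu_dereference(), there is no grace-period machinery, and STR_MAX, string_ptr, publish_string(), and snapshot_string() are names invented for the example. The caller is assumed to free the old string only once no reader can still be using it, which is the job free_rcu_string() is described as doing in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define STR_MAX 64   /* hypothetical maximum size, null byte included */

/* Stand-in for an RCU-protected string pointer. */
static _Atomic(char *) string_ptr;

/*
 * Writer: build a complete replacement, then make it visible with a
 * single pointer exchange. Readers see either the old string or the
 * new one, never a mixture. The previous pointer comes back through
 * *old so the caller can free it when that is safe.
 */
int publish_string(const char *value, char **old)
{
    char *fresh;

    assert(value != NULL && old != NULL);
    fresh = calloc(1, STR_MAX);
    if (!fresh)
        return -1;
    strncpy(fresh, value, STR_MAX - 1);   /* calloc() left the last byte zero */
    *old = atomic_exchange_explicit(&string_ptr, fresh,
                                    memory_order_acq_rel);
    return 0;
}

/*
 * Reader that needs a private copy: load the pointer once and copy
 * from that one version into the caller's buffer.
 */
int snapshot_string(char *buf, size_t size)
{
    char *cur = atomic_load_explicit(&string_ptr, memory_order_acquire);

    if (!cur || size == 0)
        return -1;
    strncpy(buf, cur, size - 1);
    buf[size - 1] = '\0';
    return 0;
}
```

The design point carried over from the text is that a string is never edited in place; every change builds a new buffer, so a reader holding the old pointer still sees a complete, consistent value.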
String variables tend to be exported through sysctl using proc_dostring(). To make life easier, Andi has added a new function, proc_rcu_string(), which handles most of the details of exporting an RCU string. It's a simple matter of initializing the appropriate ctl_table structure with a char ** pointer to the string pointer and setting the proc_handler entry to proc_rcu_string(). The initial value of the string is allowed to be a compile-time constant string; anything else is expected to be an RCU string.
This code has been through a couple of rounds of review and seems likely to be merged in the 2.6.34 development cycle.
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.