Brief items

The 2.6.39 kernel was released by Linus immediately after last week's Edition was published. He indicated some uncertainty about whether another -rc release would have been appropriate:
Prominent features in this release include IPset, the media controller subsystem, a couple of new network flow schedulers, the block plugging rework, the long-awaited removal of the big kernel lock, and more. See the KernelNewbies 2.6.39 page, the LWN merge window summaries (part 1, part 2, and part 3) and Thorsten Leemhuis's summary on The H for more information about this release.
This wasn't obvious to me, and it seems it's not obvious to others.
Peter Zijlstra started off the new discussion with a brief note reading: "I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter)." Thomas Gleixner jumped in with a more detailed description of his objections. In his view, adding security features to tracepoints will add overhead to the tracing system, make it harder to change things in the future, and generally mix tasks which should not be mixed. It would be better, he said, to keep seccomp as a separate facility which can share the filtering mechanism once a suitable set of internal APIs has been worked out.
Ingo Molnar, a strong supporter of this patch, is undeterred; he believes that more tightly integrated mechanisms will create a more powerful and useful tool. Since the security decisions need to be made anyway, he would like to see them made using the existing instrumentation to the greatest extent possible. That argument does not appear to be carrying the day, though; Peter replied:
As of this writing, that's where things stand. Meanwhile, the expanded secure computing mechanism - which didn't use perf in its original form - will miss this merge window and has no clear path into the mainline. Given that Linus doesn't like the original idea either, it's not at all clear that this functionality has a real future.
Kernel development news
int sendmmsg(int fd, struct mmsghdr *mmsg, unsigned int vlen, unsigned int flags);
It is the counterpart to recvmmsg(), allowing a process to send multiple messages with a single system call.
Also added to the staging tree is the user-space support code for the USB/IP subsystem which allows a system to "export" its USB devices over the net.
Changes visible to kernel developers include:
int strtobool(const char *s, bool *res);
Anything starting with one of [yY1] is considered to be true, while strings starting with one of [nN0] are false; anything else gets an -EINVAL error.
int kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res);
These functions take care of safely copying the string from user space and performing the integer conversion.
void *bsearch(const void *key, const void *base, size_t num, size_t size, int (*cmp)(const void *key, const void *elt));
This function will search for key in an array starting at base containing num elements of the given size.
The merge window for this development cycle is likely to end on May 29, just before Linus boards a plane for Japan. At that time, presumably, we will learn what the next release will be called; Linus has made it clear that he thinks the 2.6.x numbers are getting too high and that it's time for a change. Tune in next week for the conclusion of this merge window and the end of kernel version number suspense.

Last week's Kernel page included a brief item about the hiding of kernel addresses from user space. This hiding has come under fire from a number of developers who say that it breaks things (perf, for example) and that it does not provide any real additional security. That said, there does seem to be a consensus around the idea that it's better if attackers don't know where the kernel keeps its data structures. As it turns out, there might be a better way to do that than simply hiding pointer values.
There is no doubt that having access to the layout of the kernel in memory is useful to attackers. As Dan Rosenberg put it:
The hiding of kernel addresses is meant to deprive attackers of that extra information, making their task harder. One big problem with that approach is that most systems out there are running stock distribution kernels. Getting the needed address information from the distributor's kernel package is not a particularly challenging task. So, on these systems, there is no real mystery about the layout of the kernel, regardless of whether pointer values are allowed to leak to user space or not.
While all of this was being discussed, another idea came out: why not randomize the location of the kernel in memory at boot time? Address space layout randomization has been used to resist canned attacks for a long time, but the kernel takes no such measure for itself. Given that the kernel image is relocatable, there is no real reason why it always needs to be loaded at the same address. If the kernel calculated a different offset for itself at every boot, it could subtract that offset from pointer values before passing them to user space. Those pointers could then be used by tools like perf, but they would no longer be helpful for anybody seeking to overwrite kernel data structures.
Dan has been looking into kernel-space randomization with some success; it turns out that simply relocating the kernel is not that hard. That said, he has run up against a few potential problems. The first of those is that there is very little entropy available at the beginning of the boot process, so the generation of a sufficiently random base address for the kernel is not entirely straightforward. It seems that enough bits of entropy can be derived from the real-time clock and time stamp counter to make it hard for an attacker to derive the base address later on, but a real random number would be better.
Next, as Linus pointed out, the kernel is not infinitely relocatable. There are a number of alignment requirements which constrain the kernel's placement, so, according to Linus, there is a maximum of 8-12 bits of randomization available. That means that an exploit attempt could find the right offset after a maximum of a few thousand tries. Given that computers can try things very quickly, that does not give a site administrator much time to respond.
As others pointed out, though, that amount of randomness is probably enough. Failed exploit attempts have a high probability of generating a kernel oops; even if an administrator does not notice the oops immediately, it should come to their attention at some point. So the ability to stealthily take over a system is gone. Beyond that, failed exploits may well take down the system entirely (especially if, as is the case with many RHEL systems, the "panic on oops" flag is set) or leave it in a state where further exploit attempts cannot work. There is, it seems, a real advantage to forcing an attacker to guess.
That advantage evaporates, though, if an attacker can somehow figure out what offset a given system used at boot time. Dan noticed one way that could happen: the unprivileged SIDT instruction can be used to locate the system's interrupt descriptor table. That location could, in turn, be used to calculate the kernel's base offset. Dynamic allocation of the table can solve that problem at the cost of messing with some tricky very-early-boot code. There could be other advantages to dynamically allocating the table, though; if the table were put into the per-CPU area, it might make the system a little more scalable.
So this problem can be solved, but there will, beyond doubt, be other places where it will be possible for an attacker to obtain a real kernel-space address. There are simply too many ways in which that information might leak into user space. Plugging all of those leaks looks like one of those long-term tasks that is never really done. It may, however, be possible to get close enough to done that attackers will not be able to count on knowing the true location of the kernel in a running system. That may be a bit of security through obscurity that is worth having.
At the kernel level, performance often comes down to cache behavior. Memory references which must actually be satisfied by memory are extremely slow; good performance requires that needed data be in a CPU cache much of the time. The kernel goes out of its way to use cache-hot memory when possible; there has also been some significant work put into tasks like reordering structures so that fields that are commonly accessed together are found in the same cache line. As a general rule, these optimizations have helped performance in measurable ways.
Cache misses are often unavoidable, but it is sometimes possible to attempt to reduce their cost. If the kernel knows that it will be accessing memory at a particular location in the near future, it can use a CPU-specific prefetch instruction to begin the process of bringing the data into cache. This instruction is made available to kernel code via the generic prefetch() function; developers have made heavy use of it. Consider, for example, this commonly-used macro from <linux/list.h>:
#define list_for_each(pos, head) \
	for (pos = (head)->next; prefetch(pos->next), pos != (head); \
	     pos = pos->next)
This macro (in a number of variants) is used to traverse a linked list. The idea behind the prefetch() call here is to begin the process of fetching the next entry in the list while the current entry is being processed. Hopefully by the time the next loop iteration starts, the data will have arrived - or, at least, it will be in transit. Linked lists are known to be cache-unfriendly data structures, so it makes sense that this type of optimization can help to speed things up.
Except that it doesn't - at least, not on x86 processors.
Andi Kleen may have been the first to question this optimization when he tried to remove the prefetches from list operations last September. His patch generated little discussion, though, and apparently fell through the cracks. Recently, Linus did some profiling on one of his favorite workloads (kernel builds) and found that the prefetch instructions were at the top of the ranking. Performing the prefetching cost time, and that time was not being repaid through better cache behavior; simply removing the prefetch() calls made the build go faster.
Ingo Molnar, being Ingo, jumped in and did a week's worth of research in an hour or so. Using perf and a slightly tweaked kernel, he was able to verify that using the prefetch instructions caused a performance loss of about 0.5%. That is not a headline-inspiring performance regression, certainly, but this is an optimization which was supposed to make things go faster. Clearly something is not working the way that people thought it was.
Linus pointed out one problem at the outset: his test involved a lot of traversals of singly-linked hlist hash table lists. Those lists tend to be short, so there is not much scope for prefetching; in fact, much of the time, the only prefetch attempted used the null pointer that indicates the end of the list. Prefetching with a null pointer seems silly, but it's also costly: evidently every such prefetch on x86 machines (and, seemingly, ARM as well) causes a translation lookaside buffer miss and a pipeline stall. Ingo measured this effect and came to the conclusion that each null prefetch cost about 20 processor cycles.
Clearly, null prefetches are a bad idea. It would be nice if the CPU would simply ignore attempts to prefetch using a null pointer, but that's not how things are, so, as is often the case, one ends up trying to solve the problem in software instead. Ingo did some testing with a version of prefetch() which would only issue prefetch instructions for non-null pointers; that version did, indeed, perform better. But it still performed measurably worse than simply skipping the prefetching altogether.
CPU designers are well aware of the cost of waiting for memory; they have put a great deal of effort into minimizing that cost whenever possible. Among other things, contemporary CPUs have their own memory prefetch units which attempt to predict which memory will be wanted next and start the process of retrieving it early. One thing Ingo noticed in his tests is that, even without any software prefetch operations, the number of prefetch operations run by the CPU was about the same. So the hardware prefetcher was busy during this time - and it was doing a better job than the software at deciding what to fetch. Throwing explicit prefetch operations into the mix, it seems, just had the effect of interfering with what the hardware was trying to do.
Ingo summarized his results this way:
One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September. Chances are good that other prefetch operations will be removed as well. There will still be a place for prefetch() in the kernel, but only in specific situations where it can be clearly shown to help performance. As with other low-level optimizations (likely() comes to mind), tossing in a prefetch because it seems like it might help is often not the right thing to do.
One other lesson to be found in this experience is that numbers matter. Andi was right when he wanted to remove these operations, but he did not succeed in getting his patch merged. One could come up with a number of reasons why things went differently this time, starting with the fact that Linus took an interest in the problem. But it's also true that performance-oriented patches really need to come with numbers to show that they are achieving the desired effect; had Andi taken the time to quantify the impact of his change, he would have had a stronger case for merging it.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds