
Kernel development

Brief items

Kernel release status

The 2.6.39 kernel is out; it was, as predicted, released by Linus immediately after last week's Edition was published. He indicated some uncertainty about whether another -rc release would have been appropriate:

However, since I'm going to be at LinuxCon Japan in two weeks, the choice for me ended up whether I should just release, or drag it out *three* more weeks, or have some really messy merge window with a break in between.

Prominent features in this release include IPset, the media controller subsystem, a couple of new network flow schedulers, the block plugging rework, the long-awaited removal of the big kernel lock, and more. See the KernelNewbies 2.6.39 page, the LWN merge window summaries (part 1, part 2, and part 3) and Thorsten Leemhuis's summary on The H for more information about this release.

Stable updates: was released on May 21, followed by and on May 23. Each contains the usual list of important fixes.

Quotes of the week

Quite some time ago I was horrified by the private behaviour of a hacker I deeply respected: malicious, hypocritical stuff. And it caused an internal crisis for me: I thought we were all striving together to make the world a better place. Here are the results I finally derived:

  1. Being a great hacker does not imbue moral or ethical characteristics.
  2. Being a great coder doesn't mean you're not a crackpot.
  3. Working on a great project doesn't mean you share my motivations about it.

This wasn't obvious to me, and it seems it's not obvious to others.

-- Rusty Russell

The more I look at the arguments for why assholes are necessary to good code, the more I have to wonder if some form of Stockholm syndrome is at work.
-- Valerie Aurora

Can we drop most of MCA, EISA and ISA bus if we are going to have a big version change ? A driver spring clean is much overdue and it's all in git in case someone wishes to sneak out at midnight and bring some crawly horror back from the dead.
-- Alan Cox

UEFI stands for "Unified Extensible Firmware Interface", where "Firmware" is an ancient African word meaning "Why do something right when you can do it so wrong that children will weep and brave adults will cower before you", and "UEI" is Celtic for "We missed DOS so we burned it into your ROMs".
-- Matthew Garrett

Linux wireless support education videos

Wireless networking hacker Luis Rodriguez has put together a set of videos on how the Linux 802.11 layer works aimed at developers writing and supporting wireless drivers. "If you have engineers who need to support the 802.11 Linux subsystem you at times see yourself needing to educate each group through some sessions. In hopes of reusing educational sessions I've decided to record my own series and post it on YouTube." Topics covered include overviews of the 802.11 subsystem, how the development process works, driver debugging, and more.


Linus's message warning that this merge window may be a little shorter than usual ends with an interesting postscript: "The voices in my head also tell me that the numbers are getting too big. I may just call the thing 2.8.0. And I almost guarantee that this PS is going to result in more discussion than the rest, but when the voices tell me to do things, I listen."

Seccomp filters: permission denied

By Jonathan Corbet
May 25, 2011
Last week's article on the idea of expanding the "secure computing" facility by integrating it with the perf/ftrace mechanism mentioned the unsurprising fact that the developers of the existing security module mechanism were not entirely enthusiastic about the creation of a new and completely different security framework. Since then, discussion of the patch has continued, and opposition has come from an entirely different direction: the tracing and instrumentation developers.

Peter Zijlstra started off the new discussion with a brief note reading: "I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter)." Thomas Gleixner jumped in with a more detailed description of his objections. In his view, adding security features to tracepoints will add overhead to the tracing system, make it harder to change things in the future, and generally mix tasks which should not be mixed. It would be better, he said, to keep seccomp as a separate facility which can share the filtering mechanism once a suitable set of internal APIs has been worked out.

Ingo Molnar, a big supporter of this patch, is undeterred; his belief is that more strongly integrated mechanisms will create a more powerful and useful tool. Since the security decisions need to be made anyway, he would like to see them made using the existing instrumentation to the highest level possible. That argument does not appear to be carrying the day, though; Peter replied:

But face it, you can argue until you're blue in the face, but both tglx and I will NAK any and all patches that extend perf/ftrace beyond the passive observing role.

As of this writing, that's where things stand. Meanwhile, the expanded secure computing mechanism - which didn't use perf in its original form - will miss this merge window and has no clear path into the mainline. Given that Linus doesn't like the original idea either, it's not at all clear that this functionality has a real future.

Kernel development news

What's coming in $NEXT_KERNEL_VERSION, part 1

By Jonathan Corbet
May 25, 2011
As of this writing, some 5400 non-merge changesets have been pulled into the mainline kernel for the next release. The initial indications are that this development cycle will not have a huge number of exciting new features, but there are still some interesting additions. Among the user-visible changes are the following:

  • There are two new POSIX clock types: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM; they can be used to set timers that will wake the system from a suspended state. See this article for more information on these new clocks.

  • The Quick Fair Queue packet scheduler has been added to the network stack.

  • The just-in-time compiler for BPF packet filters has been merged; only x86-64 is supported for now.

  • There is a new networking system call:

        int sendmmsg(int fd, struct mmsghdr *mmsg, unsigned int vlen,
                     unsigned int flags);

    It is the counterpart to recvmmsg(), allowing a process to send multiple messages with a single system call.

  • The ICMP sockets feature has been merged; its main purpose is to allow unprivileged programs to send echo-request datagrams.

  • Two new sysctl knobs allow the capabilities given to user-mode helpers invoked by the kernel to be restricted; see the commit for details.

  • The tmpfs filesystem has gained support for extended attributes.

  • The Xen block backend driver (allowing guests to export block devices to other guests) has been merged.

  • New hardware support includes:

    • Systems and processors: Netlogic XLR/XLS MIPS CPUs, Lantiq MIPS-based SOCs, PowerPC A2 and "wire speed processor" CPUs, and Armadeus APF9328 development boards.

    • Audio/video: Philips TEA5757 radio tuners, Digigram Lola boards, Apple iSight microphones, Maxim max98095 codecs, Wolfson Micro WM8915 codecs, Asahi Kasei AK4641 codecs, HP iPAQ hx4700 audio interfaces, NXP TDA18212 silicon tuners, Micron MT9V032 sensors, Sony CXD2820R DVB-T/T2/C demodulators, RedRat3 IR transceivers, Samsung S5P and EXYNOS4 MIPI CSI receivers, and Micronas DRXD tuners.

    • Input: PenMount dual touch panels, Maxim max11801 touchscreen controllers, Analog Devices ADP5589 I2C QWERTY keypad and I/O expanders, and Freescale MPR121 Touchkey controllers.

    • Network: Marvell "WiFi-Ex" wireless adapters (SD8787 initially) and Marvell 8787 Bluetooth interfaces.

    • USB: Renesas USBHS controllers, Samsung S5P EHCI controllers, Freescale USB OTG transceivers, and Samsung S3C24XX USB high-speed controllers.

    • Miscellaneous: CARMA DATA-FPGA programmers, Broadcom's "advanced microcontroller bus architecture," Freescale SEC4/CAAM security engines, Samsung S5PV210 crypto accelerators, Maxim MAX16065, MAX16066, MAX16067, MAX16068, MAX16070, and MAX16071 system managers, Maxim MAX6642 temperature sensors, TI UCD90XXX system health controllers, TI UCD9200 system controllers, Analog Devices ADM1275 hot-swap controllers, Analog Devices AD5504, AD5501, AD5760, and AD5780 DACs, Analog Devices AD7780 and AD7781 analog-to-digital converters, Analog Devices ADXRS450 Digital Output Gyroscopes, Xilinx PS UARTs, TAOS TSL2580, TSL2581, and TSL2583 light-to-digital converters, Intel "management engine" interfaces, nVidia Tegra embedded controllers, and IEEE 1588 (precision time protocol) clocks.

    Also added to the staging tree is the user-space support code for the USB/IP subsystem which allows a system to "export" its USB devices over the net.

Changes visible to kernel developers include:

  • Prefetching is no longer used in linked list and hlist traversal; this may be the beginning of a much more extensive program to remove explicit prefetch operations. See this article for more information on the prefetch removal.

  • There is a new strtobool() function for turning user-supplied strings into boolean values:

        int strtobool(const char *s, bool *res);

    Anything starting with one of [yY1] is considered to be true, while strings starting with one of [nN0] are false; anything else gets an -EINVAL error.

  • There is a whole series of new functions for converting user-space strings to kernel-space integer values; all follow this pattern:

        int kstrtol_from_user(const char __user *s, size_t count,
                              unsigned int base, long *res);

    These functions take care of safely copying the string from user space and performing the integer conversion.

  • The kernel has a new generic binary search function:

        void *bsearch(const void *key, const void *base, size_t num, size_t size,
                      int (*cmp)(const void *key, const void *elt));

    This function will search for key in an array starting at base containing num elements of the given size.

  • The use of threads for the handling of interrupts on specific lines can be controlled with irq_set_thread() and irq_set_nothread().

  • The static_branch() interface for the jump label mechanism has been merged.

  • The function tracer can now support multiple users with each tracing a different set of functions.

  • The alarm timer mechanism - which can set timers that fire even if the system is suspended - has been merged.

  • An object passed to kfree_rcu() will be handed to kfree() after the next read-copy-update grace period. There are a lot of RCU callbacks which only call kfree(); it should be possible to replace those with kfree_rcu() calls.

  • The -Os (optimize for size) option is no longer the default for kernel compiles; the associated costs in code quality were deemed to be too high. Linus said: "I still happen to believe that I$ miss costs are a major thing, but sadly, -Os doesn't seem to be the solution. With or without it, gcc will miss some obvious code size improvements, and with it enabled gcc will sometimes make choices that aren't good even with high I$ miss ratios."

  • The first rounds of ARM architecture cleanup patches have gone in. A number of duplicated functionalities have been consolidated, and support for a number of (probably) never-used platform and board configurations has been removed.

  • The W= parameter to kernel builds now takes values from 1 to 3. At the first level, only warnings deemed to have a high chance of being relevant are emitted; a full kernel build generates "only" 4800 of them. At W=3, developers get a full 86,000 warnings to look at. Note that if you want all of the warnings, you need to say W=123.

The merge window for this development cycle is likely to end on May 29, just before Linus boards a plane for Japan. At that time, presumably, we will learn what the next release will be called; Linus has made it clear that he thinks the 2.6.x numbers are getting too high and that he thinks it's time for a change. Tune in next week for the conclusion of this merge window and the end of kernel version number suspense.

Kernel address randomization

By Jonathan Corbet
May 24, 2011
Last week's Kernel Page included a brief item about the hiding of kernel addresses from user space. This hiding has come under fire from a number of developers who say that it breaks things (perf, for example) and that it does not provide any real additional security. That said, there does seem to be a consensus around the idea that it's better if attackers don't know where the kernel keeps its data structures. As it turns out, there might be a better way to do that than simply hiding pointer values.

There is no doubt that having access to the layout of the kernel in memory is useful to attackers. As Dan Rosenberg put it:

I agree about the fact that kptr_restrict is an incomplete security feature. However, I disagree that it lacks usefulness entirely. Virtually every public kernel exploit in the past year leverages /proc/kallsyms or other kernel address leakage to target an attack.

The hiding of kernel addresses is meant to deprive attackers of that extra information, making their task harder. One big problem with that approach is that most systems out there are running stock distribution kernels. Getting the needed address information from the distributor's kernel package is not a particularly challenging task. So, on these systems, there is no real mystery about the layout of the kernel, regardless of whether pointer values are allowed to leak to user space or not.

While all of this was being discussed, another idea came out: why not randomize the location of the kernel in memory at boot time? Address space layout randomization has been used to resist canned attacks for a long time, but the kernel takes no such measure for itself. Given that the kernel image is relocatable, there is no real reason why it always needs to be loaded at the same address. If the kernel calculated a different offset for itself at every boot, it could subtract that offset from pointer values before passing them to user space. Those pointers could then be used by tools like perf, but they would no longer be helpful for anybody seeking to overwrite kernel data structures.

Dan has been looking into kernel-space randomization with some success; it turns out that simply relocating the kernel is not that hard. That said, he has run up against a few potential problems. The first of those is that there is very little entropy available at the beginning of the boot process, so the generation of a sufficiently random base address for the kernel is not entirely straightforward. It seems that enough bits of entropy can be derived from the real-time clock and time stamp counter to make it hard for an attacker to derive the base address later on, but a real random number would be better.

Next, as Linus pointed out, the kernel is not infinitely relocatable. There are a number of alignment requirements which constrain the kernel's placement, so, according to Linus, there is a maximum of 8-12 bits of randomization available. That means that an exploit attempt could find the right offset after a maximum of a few thousand tries. Given that computers can try things very quickly, that does not give a site administrator much time to respond.

As others pointed out, though, that amount of randomness is probably enough. Failed exploit attempts have a high probability of generating a kernel oops; even if an administrator does not notice the oops immediately, it should come to their attention at some point. So the ability to stealthily take over a system is gone. Beyond that, failed exploits may well take down the system entirely (especially if, as is the case with many RHEL systems, the "panic on oops" flag is set) or leave it in a state where further exploit attempts cannot work. There is, it seems, a real advantage to forcing an attacker to guess.

That advantage evaporates, though, if an attacker can somehow figure out what offset a given system used at boot time. Dan noticed one way that could happen: the unprivileged SIDT instruction can be used to locate the system's interrupt descriptor table. That location could, in turn, be used to calculate the kernel's base offset. Dynamic allocation of the table can solve that problem at the cost of messing with some tricky very-early-boot code. There could be other advantages to dynamically allocating the table, though; if the table were put into the per-CPU area, it might make the system a little more scalable.

So this problem can be solved, but there will, beyond doubt, be other places where it will be possible for an attacker to obtain a real kernel-space address. There are simply too many ways in which that information might leak into user space. Plugging all of those leaks looks like one of those long-term tasks that is never really done. It may, however, be possible to get close enough to done that attackers will not be able to count on knowing the true location of the kernel in a running system. That may be a bit of security through obscurity that is worth having.

The problem with prefetch

By Jonathan Corbet
May 24, 2011
Over time, software developers tend to learn that micro-optimization efforts are generally not worthwhile, especially in the absence of hard data pointing out a specific problem. Performance problems are often not where we think they are, so undirected attempts to tweak things to make them go faster can be entirely ineffective. Or, indeed, they can make things worse. That is a lesson that the kernel developers have just relearned.

At the kernel level, performance often comes down to cache behavior. Memory references which must actually be satisfied by memory are extremely slow; good performance requires that needed data be in a CPU cache much of the time. The kernel goes out of its way to use cache-hot memory when possible; there has also been some significant work put into tasks like reordering structures so that fields that are commonly accessed together are found in the same cache line. As a general rule, these optimizations have helped performance in measurable ways.

Cache misses are often unavoidable, but it is sometimes possible to attempt to reduce their cost. If the kernel knows that it will be accessing memory at a particular location in the near future, it can use a CPU-specific prefetch instruction to begin the process of bringing the data into cache. This instruction is made available to kernel code via the generic prefetch() function; developers have made heavy use of it. Consider, for example, this commonly-used macro from <linux/list.h>:

    #define list_for_each(pos, head) \
        for (pos = (head)->next; prefetch(pos->next), pos != (head); \
             pos = pos->next)

This macro (in a number of variants) is used to traverse a linked list. The idea behind the prefetch() call here is to begin the process of fetching the next entry in the list while the current entry is being processed. Hopefully by the time the next loop iteration starts, the data will have arrived - or, at least, it will be in transit. Linked lists are known to be cache-unfriendly data structures, so it makes sense that this type of optimization can help to speed things up.

Except that it doesn't - at least, not on x86 processors.

Andi Kleen may have been the first to question this optimization when he tried to remove the prefetches from list operations last September. His patch generated little discussion, though, and apparently fell through the cracks. Recently, Linus did some profiling on one of his favorite workloads (kernel builds) and found that the prefetch instructions were at the top of the ranking. Performing the prefetching cost time, and that time was not being repaid through better cache behavior; simply removing the prefetch() calls made the build go faster.

Ingo Molnar, being Ingo, jumped in and did a week's worth of research in an hour or so. Using perf and a slightly tweaked kernel, he was able to verify that using the prefetch instructions caused a performance loss of about 0.5%. That is not a headline-inspiring performance regression, certainly, but this is an optimization which was supposed to make things go faster. Clearly something is not working the way that people thought it was.

Linus pointed out one problem at the outset: his test involved a lot of traversals of singly-linked hlist hash table lists. Those lists tend to be short, so there is not much scope for prefetching; in fact, much of the time, the only prefetch attempted used the null pointer that indicates the end of the list. Prefetching with a null pointer seems silly, but it's also costly: evidently every such prefetch on x86 machines (and, seemingly, ARM as well) causes a translation lookaside buffer miss and a pipeline stall. Ingo measured this effect and came to the conclusion that each null prefetch cost about 20 processor cycles.

Clearly, null prefetches are a bad idea. It would be nice if the CPU would simply ignore attempts to prefetch using a null pointer, but that's not how things are, so, as is often the case, one ends up trying to solve the problem in software instead. Ingo did some testing with a version of prefetch() which would only issue prefetch instructions for non-null pointers; that version did, indeed, perform better. But it still performed measurably worse than simply skipping the prefetching altogether.

CPU designers are well aware of the cost of waiting for memory; they have put a great deal of effort into minimizing that cost whenever possible. Among other things, contemporary CPUs have their own memory prefetch units which attempt to predict which memory will be wanted next and start the process of retrieving it early. One thing Ingo noticed in his tests is that, even without any software prefetch operations, the number of prefetch operations run by the CPU was about the same. So the hardware prefetcher was busy during this time - and it was doing a better job than the software at deciding what to fetch. Throwing explicit prefetch operations into the mix, it seems, just had the effect of interfering with what the hardware was trying to do.

Ingo summarized his results this way:

So the conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.

One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September. Chances are good that other prefetch operations will be removed as well. There will still be a place for prefetch() in the kernel, but only in specific situations where it can be clearly shown to help performance. As with other low-level optimizations (likely() comes to mind), tossing in a prefetch because it seems like it might help is often not the right thing to do.

One other lesson to be found in this experience is that numbers matter. Andi was right when he wanted to remove these operations, but he did not succeed in getting his patch merged. One could come up with a number of reasons why things went differently this time, starting with the fact that Linus took an interest in the problem. But it's also true that performance-oriented patches really need to come with numbers to show that they are achieving the desired effect; had Andi taken the time to quantify the impact of his change, he would have had a stronger case for merging it.

Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds