
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.33-rc4, released on January 12. "Hmm. Odd release. Something like 40% of the patches are in DRM (mostly nouveau and radeon, both staging, so it's a bit less scary than it sounds. But there's a noticeable i915 component too). That's all pretty unusual." There are also a couple of new low-level drivers, support for LZO-compressed kernels, and the new generic list_sort() function. Full details can be found in the long-format changelog.

Stable updates: the only stable update in the last week is 2.6.31.11, released on January 7 to fix a build error introduced with 2.6.31.10.


Quotes of the week

If anything, today's computer users are less well adapted to dealing with applications that behave differently when the network is unexpectedly absent because both the user and the programmer assume that the network will be there because it always is. They would never set up a situation where the network would be missing and the programs they use/write are unlikely to handle the situation. Lazy kids.
-- Casey Schaufler

I hope all this is helpful since whatever behavior is being tickled makes recent kernels problematic on this caliber of hardware. Let alone the effects on my rear end from my beloved not being able to play 'Blast the Bubbles' the way she would like.
-- Greg Wettstein redefines mission-critical


Timer slack

By Jonathan Corbet
January 12, 2010
One of the best ways to reduce a system's power usage is to avoid waking up the CPU whenever possible. Minimizing wakeups, in turn, is facilitated by ensuring that timers expire at the same time when it makes sense to do so. Waking the processor once to handle two timers is much more efficient than handling them in two separate wakeups. But doing so typically requires adjusting expiration times. For standard (not high resolution) kernel timers, the only way to make that adjustment is with the round_jiffies() function, which makes timeout periods coarser in the hopes that they will coincide more often. This method works to an extent, but it requires code changes wherever timers are used.

Arjan van de Ven has proposed an enhancement to the timer API - called timer slack - which should make it easier to coalesce timer events. In essence, it adds a certain amount of fuzziness to timer expiration times, giving the kernel some flexibility in how the timers are scheduled. That fuzziness is set with:

    void set_timer_slack(struct timer_list *timer, int slack_hz);

In essence, this call says that any timeout scheduled with the given timer can be delayed by up to slack_hz jiffies. By default, the slack is set to 0.4% of the total timeout period - a very conservative value. When the timer is queued, the actual expiration time is determined by means of a simple algorithm to choose a well-defined time within the slack period.
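
The rounding itself can be sketched in plain C. The following is an illustrative userspace reimplementation of the idea (pick a coarse, well-defined expiry no later than expires + slack), not the kernel's actual code:

```c
/* Illustrative sketch, not kernel code: round an expiration time to a
 * coarse boundary within [expires, expires + slack], so that timers with
 * overlapping slack windows tend to land on the same jiffy and can be
 * handled in a single wakeup. */
static unsigned long apply_slack(unsigned long expires, unsigned long slack)
{
    unsigned long limit = expires + slack;
    unsigned long mask = expires ^ limit;
    int bit;

    if (mask == 0)
        return expires;    /* no room to move: keep the exact time */

    /* find the highest bit in which expires and limit differ */
    for (bit = 8 * (int)sizeof(unsigned long) - 1; !(mask >> bit & 1); bit--)
        ;

    /* clear everything below that bit: a well-defined, coarse expiry
     * that falls between expires and limit */
    return limit & ~((1UL << bit) - 1);
}
```

The larger the slack, the higher the bit that gets rounded to, so timers with generous slack values naturally cluster on coarse boundaries.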

The value of this approach is that it makes it easy to coalesce timer events from multiple sources without needing to change every call site. Additional flexibility can then be had by increasing the slack for specific, frequently-used timers, but, even without that, slack timers should improve power efficiency on many systems.


The end of user-space mode setting?

By Jonathan Corbet
January 13, 2010
It has now been a year since kernel mode setting (KMS) went into the mainline. KMS moves control of low-level graphics processor modes into the kernel and away from user-space drivers, with a number of associated advantages. Initially only the Intel driver supported KMS, but it has since found its way into the Radeon and Nouveau drivers. Now developers are beginning to talk about eliminating user-space mode setting support entirely.

On the Nouveau front, Ben Skeggs posted a patch to remove non-KMS support, saying:

The non-KMS paths are messy, and lets face it, rotting badly. IMO the KMS code is stable enough now that we can continue without the UMS crutch, and indeed, the KMS code supports a lot more chipsets (particularly on GF8 and up) than the UMS code ever will.

The main objection to the removal of this code is that BSD-based systems do not support KMS, but the current driver does not work on those systems anyway. So, while this patch has not found its way to the mainline, it would not be surprising if that happened before the 2.6.34 release.

At about the same time, some Intel driver developers started to ask whether non-KMS support could be dropped. There, too, it seems that the user-space mode setting code is unloved and proving hard to maintain. This code looks like it will remain an unwelcome guest for a while, though; Linus is in no hurry to remove it, and Dave Airlie is even more reluctant:

I'm in the 2-3 years at a minimum, with at least one kernel with no serious regressions in Intel KMS, which we haven't gotten close to yet. I'm not even sure the Intel guys are taking stable seriously enough yet. So far I don't think there is one kernel release (even stable) that works on all Intel chipsets without backporting patches.

So the removal of non-KMS support from the Intel driver is being held up by concerns about the stability of the KMS code. But there is a bigger issue as well: Intel support has been in the kernel for years, so there are plenty of systems which are dependent on user-space mode setting. That means that the support needs to be maintained for long enough to be sure of not breaking those systems. Nouveau, instead, has the advantage of not having been in the mainline until now, so the same regression concerns do not apply. There are advantages, sometimes, to being the latecomer.


Kernel development news

sys_membarrier()

By Jonathan Corbet
January 13, 2010
Mathieu Desnoyers is the longtime developer of the LTTng tracing toolkit. A current project of his is to provide for fast tracing of multithreaded user-space applications; that, in turn, requires a fast, multithreaded tracing utility. Tracing is controlled through a shared memory area; to make that control as fast as possible, Mathieu would like to use the read-copy-update (RCU) algorithm. That means he has been working on porting RCU - a kernel-only technology - to user space. In the process, he has run into some interesting challenges.

As with the kernel version, user-space RCU works by deferring the cleanup of in-memory objects until it is known that no more references to those objects can exist. The implementation must be done differently, though, since user-space code is unable to run in the same atomic mode used by RCU in the kernel. So, in user space, a call to rcu_read_lock() sets a variable in shared memory indicating that the thread is in an RCU critical section. Within that critical section, it's safe for the thread to access RCU-protected variables.

...at least, it's safe as long as nobody reorders operations in a way that causes an access to happen to an RCU-protected variable before the effects of rcu_read_lock() are visible to other CPUs. That kind of reordering can indeed happen, at both the compiler and CPU levels, so it's a problem which must be addressed. Compile-time reordering is relatively easy to avoid, but runtime reordering in the CPU requires the issuing of a memory barrier instruction. And, indeed, user-space RCU can be made to work by putting memory barriers into the rcu_read_lock() call.
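
A minimal sketch of such a read-side critical section, using C11 atomics, might look like the following. The names and layout here are invented for illustration; the real liburcu implementation differs (in particular, each thread registers the address of its flag with the updater):

```c
#include <stdatomic.h>

/* Illustrative sketch of a user-space RCU read side; not liburcu's
 * actual code.  The updater inspects each thread's flag to know
 * whether that thread might still hold references. */
static _Thread_local atomic_int rcu_reader_active;

static inline void rcu_read_lock(void)
{
    atomic_store_explicit(&rcu_reader_active, 1, memory_order_relaxed);
    /* The costly part: prevent reads of RCU-protected data from being
     * reordered (by compiler or CPU) before the flag store above. */
    atomic_thread_fence(memory_order_seq_cst);
}

static inline void rcu_read_unlock(void)
{
    /* order all protected accesses before clearing the flag */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&rcu_reader_active, 0, memory_order_relaxed);
}
```

The two fences are exactly the per-read-side cost that the rest of the article is about eliminating.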

The problem with that solution is that memory barriers slow things down significantly. Even worse, they impose that cost on the read-side fast path to accommodate an event - a change to an RCU-protected variable - which happens only rarely. So Mathieu would like to get rid of that barrier. To that end, he coded up a solution which sends a signal to every thread when an RCU-protected variable is about to be changed, forcing each thread to execute a memory barrier at that time. This solution does speed things up, believe it or not, but signals are almost never the optimal solution to any problem. Mathieu would like to do something better.

His "something better" turned out to be a simple system call:

    void membarrier();

The initial implementation would simply send an inter-processor interrupt to every CPU in the system; the receiving CPUs would respond by executing an explicit memory barrier instruction. The solution worked, but it ran into a couple of objections in review:

  • By allowing a user-space program to force interrupts to all processors on the system, membarrier() presented an easy way to create denial-of-service attacks on the system.

  • The system call interrupted every processor on the system. Interrupting processors running unrelated applications imposes a small but pointless cost. The problem gets a little worse if some of those CPUs are running realtime tasks, which, presumably, would not welcome the forced addition of a bit of latency into their world. It would even interrupt processors which were currently sleeping - a useless exercise which would also increase energy use.

What followed was a long discussion on how to optimize the patch, whether an explicit memory barrier is needed even after the CPU has taken an inter-processor interrupt (the safe answer appears to be "yes"), and so on. All told, an impressive amount of effort has gone into the optimization of a small patch which is, at its core, implementing the slow path which should be rarely executed.

Current status, as of this writing, is that Mathieu has posted a new version of the patch with a number of changes. The first of those is the addition of an integer "expedited" parameter. If this value is zero, the system call simply calls synchronize_sched() and returns; this is the cheapest way of getting the needed functionality, but it comes at the cost of a latency of some milliseconds for the caller. It seems clear that Mathieu expects the "expedited" mode to be used most of the time.

For an expedited barrier, the system call will look at every CPU in the system, building a mask of those which are running in the same address space as the caller; those CPUs will then receive the inter-processor interrupt asking them to execute a memory barrier instruction. It's a rather more complicated implementation, but, since it only interrupts processors which are running the calling application, the denial of service, performance, and energy use concerns are no longer relevant. One assumes that the patch is getting close to its final form, but it's hard to say for sure: sometimes it's the smallest and simplest patches which are scrutinized the most.


FBAC-LSM

By Jake Edge
January 13, 2010

Process sandboxes for security are getting lots of attention these days. There are standalone utilities like isolate and Rainbow, sandboxes integrated into applications like the Chromium sandbox, and tools that use existing LSMs, such as the SELinux sandbox. Furthermore, there are various proposals floating around to add Linux kernel features in support of application sandboxes, such as the seccomp additions and network restrictions. An LSM specifically designed for application sandboxing, which uses a new model called Functionality-Based Application Confinement (FBAC), was introduced on linux-kernel back in December.

FBAC-LSM came out of Z. Cliffe Schreuders's PhD research, and is a prototype implementation of the FBAC model. It uses an earlier version of the LSM interface, with the AppArmor pathname-based hooks, and still needs "quite a bit of work to be done before it is ready for production systems or formal code review." Schreuders is looking for collaborators to work on completing the project, presumably with an eye towards getting it into the mainline.

The basic idea behind FBAC is to make security policy more accessible and understandable to users, so that application restrictions are more widely adopted. A major component of the FBAC system is a GUI-based policy manager that can guide users through setting policies for particular applications. Users specify the high-level needs of an application based on its type (such as web browser or file editor) and the policy manager will help create the policies that will govern its behavior.

In developing the policy manager, Schreuders analyzed over a hundred different applications to extract common behaviors that could be encapsulated in the FBAC policies. This allows the policy manager to automate certain aspects of developing policies for new applications, including things like configuration files, network ports, and other resources that the application requires. The policy manager also has a "learning mode" where it can observe the application and suggest additional privileges that might be granted.

FBAC has the underlying concept of "functionalities", which are essentially a set of permissions for file and network operations that are allowed. These are fine-grained permissions for things like "file_read", "file_getattr", "file_execute", "dir_mkdir", "network_incoming", etc. The permissions which are granted to a particular functionality are listed in its definition.

Functionalities are hierarchical, so that they can incorporate other, lower-level permissions into one that governs an entire application or class of applications. In addition, they are parameterized so that a functionality can be applied to multiple different applications, with the parameters specifying the particular files, directories, and network destinations that the permissions are granted for.

Both mandatory access control (MAC) and discretionary access control (DAC) are supported by FBAC. Application policy can be permanently set by an administrator, so that an ordinary user cannot make changes, or FBAC can be configured to allow users to further restrict applications beyond the policies set by the administrator. The confinement of an application then depends on the intersection of these mandatory and discretionary policies.
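
The mandatory/discretionary combination is easy to model: treating each policy as a set of granted permissions, the effective policy is their intersection. A toy bitmask version follows; the permission bits are invented to echo the names above, not FBAC-LSM's actual representation:

```c
/* Toy model of FBAC's MAC/DAC combination; the permission bits are
 * invented for illustration. */
enum fbac_perm {
    FBAC_FILE_READ        = 1 << 0,
    FBAC_FILE_EXECUTE     = 1 << 1,
    FBAC_DIR_MKDIR        = 1 << 2,
    FBAC_NETWORK_INCOMING = 1 << 3,
};

/* A user can only subtract from what the administrator granted:
 * the effective rights are the intersection of the two policies. */
static unsigned int effective_perms(unsigned int mandatory,
                                    unsigned int discretionary)
{
    return mandatory & discretionary;
}
```

Because the combination is an intersection, a discretionary policy can never grant a permission that the mandatory policy withholds.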

Allowing users to specify reduced privileges for arbitrary applications risks running afoul of the problems with setuid() programs that other sandboxing mechanisms (the network restrictions feature mentioned above for example) have encountered. Some means of stopping unprivileged users from changing the environment expected by setuid() programs will need to be provided.

The interface to FBAC-LSM is via a filesystem which is mounted at /sys/kernel/security/fbac-lsm. Various files in the directory allow querying the existing installed policies as well as adding new ones. There are several steps to sending the policy information, with each piece being written to a separate file in the directory. That is followed by "commit" being written to /sys/kernel/security/fbac-lsm/commit, which actually causes the policy to be processed. That is rather race-prone, but is required by the sysfs "one value per file" rule. It seems likely that FBAC-LSM will eventually change its interface to a private filesystem like those used by Smack and SELinux.

FBAC is a different approach from that taken by other security solutions, but it has enough similarities that Schreuders has plans to make the policy manager read and write AppArmor and SEEdit policies. But FBAC definitely lives up to its prototype billing. The code is rather disorganized and littered with commented-out sections that make it somewhat hard to follow.

The current incarnation of FBAC-LSM certainly has the feel of code that was put together somewhat hurriedly for a PhD dissertation, rather than as a "real" LSM. But it does embody some interesting ideas that merit further attention. One of the biggest hurdles faced by various security solutions (for which SELinux is the poster child) is the complexity of developing and—more importantly—understanding the policies that are being used. That complexity is something that Schreuders set out to reduce with FBAC. It remains to be seen if he has succeeded with that, but any such attempt is worthy of a look.


Speculating on page faults

By Jonathan Corbet
January 12, 2010
Improving the performance of the kernel is generally a good thing to do; that is why many of our best developers have put considerable amounts of time into optimization work. One area which has recently seen some attention is in the handling of soft page faults. As the course of this work shows, though, performance problems are not always where one thinks they might be; sometimes it's necessary to take a step back and reevaluate the situation, possibly dumping a lot of code in the process.

Page faults can be quite expensive, especially those which must be resolved by reading data from disk. On a typical system, though, there are a lot of page faults which do not require I/O. A page fault happens because a specific process does not have a valid page table entry for the needed page, but that page might already be in the page cache, in which case handling the fault is just a matter of fixing the page table entry and increasing the page's reference count; this happens often with shared pages or those brought in via the readahead mechanism. Faults for new anonymous pages (application data and stack space, mostly), instead, can be handled through the allocation of a zero-filled page. In either case, the fault is quickly taken care of with no recourse to backing store required.
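
The two soft-fault cases described above can be sketched as a toy userspace model; nothing here is real kernel code, and page table manipulation is not modeled at all:

```c
#include <stdlib.h>

/* Toy model of the two soft-fault cases from the article: a fault on a
 * page already in the page cache just takes a reference, while an
 * anonymous fault gets a fresh zero-filled page.  Purely illustrative. */
struct page {
    int refcount;
    unsigned char data[64];
};

static struct page *soft_fault(struct page *cached)
{
    if (cached) {
        /* page already in the page cache: fix the PTE (not modeled)
         * and take a reference */
        cached->refcount++;
        return cached;
    }
    /* anonymous fault: allocate a zero-filled page */
    struct page *page = calloc(1, sizeof(*page));
    if (page)
        page->refcount = 1;
    return page;
}
```

In neither case is any I/O needed, which is why these faults are called "soft" and why their per-fault overhead matters so much on fault-heavy workloads.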

In many workloads, this kind of "soft" fault happens much more often than hard faults requiring actual I/O. So it's important that they be handled quickly. Various developers had concluded that the kernel was, in fact, not handling this kind of fault quickly enough, and they identified the use of the mmap_sem reader/writer semaphore as the core of the problem. Contention wasn't the issue in this case - only a reader lock is required for page fault handling - but the cache line bouncing caused by continual acquisition of the lock was killing performance. As the number of cores in systems increases, this kind of problem can only get worse.

In response, Hiroyuki Kamezawa posted the first speculative page fault patch back in November. The core idea behind the patch was to try to handle page faults without taking mmap_sem at all. Doing so invites race conditions; in particular, the vm_area_struct (VMA) structure which controls the memory mapping can change while the work is in progress. So the speculative fault code must handle the fault, then check for concurrent changes and, if necessary, redo the work the older, slower way. That's the "speculative" part: doing the work in a lockless mode in the hope that the world will not change in the meantime and force that work to be done again.

The speculative page fault code must also ensure that no changes which could create real trouble (such as freeing the VMA outright) can happen while the fault is being handled. To this end, various versions of the patch have tried techniques like adding reference counts to the VMA structure or using read-copy-update with the red-black tree code (which is used to locate the VMA covering a specific address within an address space) to defer changes while the speculative code is doing its thing.

This work yielded some real results: the number of page faults per unit time that the system could handle, when running a fault-heavy benchmark, approximately doubled. The idea clearly had merit, but Peter Zijlstra felt that Kamezawa-san's patches "weren't quite crazy enough". He set out to rectify this problem with a speculative page fault patch of his own, which saw a number of revisions. Peter's approach included the addition of speculative page table locks and the use of RCU to manage VMA structures. The result was a patch which would "sometimes boot" and which seemed to improve performance.

This is about when Linus got involved, pointing out some problems with Peter's patch, concluding:

I would say that this whole series is _very_ far from being mergeable. Peter seems to have been thinking about the details, while missing all the subtle big picture effects that seem to actually change semantics.

Peter agreed with this conclusion, noting that he had never considered the patch ready for merging.

It was in further discussion that Linus, looking at a profile of page fault handling activity, noticed something funny: the real overhead seemed to be in spinlock operations, which shouldn't be involved in the x86-optimized rwsem implementation at all. It turns out that said optimization was only applied to 32-bit systems; on 64-bit builds, reader/writer semaphores were using the generic, semaphore-based implementation. That meant that they were rather slower than they needed to be.

So Linus tossed out a new rwsem implementation based on the x86 exchange-and-add (xadd) instruction with a typical warning:

In other words: UNTESTED! It may molest your pets and drink all your beer. You have been warned.

Kamezawa-san bravely tried the code anyway, and got an interesting result. His pets and his beer both came through without trauma - and the page fault performance was better than with his speculative fault patch. Peter, too, ran some tests against his own speculative code; those results showed that the rwsem change tripled page fault performance. His speculative fault patch improved performance by just a tiny bit more than that, and the two together a little more yet. But the rwsem patch is a small and clear fix, while the speculative page fault patch is large, widespread, scary, and with known problems. So nobody really disputed Peter's conclusion:

So while I think its quite feasible to do these speculative faults, it appears we're not quite ready for them.
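
The attraction of the xadd approach is an uncontended reader path that is a single atomic exchange-and-add, with no spinlock to bounce between caches. The following toy userspace model shows the reader fast path; the count layout and constants are invented for illustration, and the real implementation uses a different encoding and a slow path that sleeps rather than failing:

```c
#include <stdatomic.h>

/* Toy model of an xadd-style rwsem reader fast path.  The layout
 * (reader bias in the low bits, writer bias high) is invented; the
 * real x86 implementation differs. */
#define READER_BIAS  1LL
#define WRITER_BIAS  (1LL << 32)

static atomic_llong sem_count;

static int down_read_fast(void)
{
    /* one atomic exchange-and-add on the uncontended path */
    long long old = atomic_fetch_add(&sem_count, READER_BIAS);
    if (old >= WRITER_BIAS) {
        /* a writer holds the lock: back out; real code would now
         * enter the sleeping slow path instead of failing */
        atomic_fetch_sub(&sem_count, READER_BIAS);
        return 0;
    }
    return 1;
}

static void up_read_fast(void)
{
    atomic_fetch_sub(&sem_count, READER_BIAS);
}
```

Multiple readers simply stack up their biases; only the presence of a writer's bias forces anyone off the fast path.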

As of this writing, nobody has posted a final version of the rwsem patch. Linus has noted that there are things which can be improved with it, but it would be fairly typical for him to leave that work to others. But, one assumes, some version of this patch will be waiting in the wings when the 2.6.34 merge window opens. It will be a clear demonstration that low-hanging performance fruit exists even in our highly-optimized kernel; one need only think to look in the right place.


Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.33-rc4
Greg KH Linux 2.6.31.11

Build system

Michal Marek nconfig

Core kernel code

Development tools

Dan Carpenter smatch 1.54
Jason Baron jump label v4

Device drivers

Filesystems and block I/O

Memory management

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds