Brief items
The current development kernel is 2.6.33-rc4, released on January 12.
"Hmm. Odd release. Something like 40% of the patches are in DRM
(mostly nouveau and radeon, both staging, so it's a bit less scary than it
sounds. But there's a noticeable i915 component too). That's all pretty
unusual." There are also a couple of new low-level drivers, support
for LZO-compressed kernels, and the new generic list_sort() function.
Full details can be found in the long-format changelog.
Stable updates: the only stable update in the last week is 2.6.31.11, released on
January 7 to fix a build error introduced with 2.6.31.10.
If anything, today's computer users are less well adapted to
dealing with applications that behave differently when the network
is unexpectedly absent because both the user and the programmer
assume that the network will be there because it always is. They
would never set up a situation where the network would be missing
and the programs they use/write are unlikely to handle the
situation. Lazy kids.
--
Casey Schaufler
I hope all this is helpful since whatever behavior is being tickled
makes recent kernels problematic on this caliber of hardware. Let
alone the effects on my rear end from my beloved not being able to
play 'Blast the Bubbles' the way she would like.
--
Greg Wettstein redefines mission-critical
By Jonathan Corbet
January 12, 2010
One of the best ways to reduce a system's power usage is to avoid waking up
the CPU whenever possible. Minimizing wakeups, in turn, is facilitated by
ensuring that timers expire at the same time when it makes sense to do so.
Waking the processor once to handle two timers is much more efficient than
handling them in two separate wakeups. But doing so typically requires
adjusting expiration times. For standard (not high resolution) kernel
timers, the only way to make that adjustment is with the
round_jiffies() function, which makes timeout periods coarser in
the hopes that they will coincide more often. This method works to an
extent, but it requires code changes wherever timers are used.
Arjan van de Ven has proposed an enhancement to the timer API - called timer slack - which should make
it easier to coalesce timer events. In essence, it adds a certain amount
of fuzziness to timer expiration times, giving the kernel some flexibility
in how the timers are scheduled. That fuzziness is set with:
void set_timer_slack(struct timer_list *timer, int slack_hz);
This call says that any timeout scheduled with the given
timer can be delayed by up to slack_hz jiffies. By
default, the slack is set to 0.4% of the total timeout period - a very
conservative value.
When the timer is queued, the actual expiration time is determined by means
of a simple algorithm to choose a well-defined time within the slack
period.
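The details of that algorithm are not spelled out here. As a rough
user-space sketch of one way such a choice could be made (an assumption
for illustration, not necessarily what the patch does): take the window
running from the requested expiration to the expiration plus slack, find
the highest bit in which the two ends differ, and clear every lower bit of
the upper end. Timers whose windows overlap then tend to land on the same
"round" value. The names pick_expiry() and default_slack() are
hypothetical, and 1/256 stands in for the roughly 0.4% default.

```c
#include <assert.h>

/* Hypothetical helper: default slack of about 0.4% (1/256) of the timeout. */
unsigned long default_slack(unsigned long now, unsigned long expires)
{
    /* A timer which has already expired gets no slack. */
    if (expires <= now)
        return 0;
    return (expires - now) / 256;
}

/*
 * Hypothetical helper: pick a well-defined expiry in
 * [expires, expires + slack]. The result is the upper end of the window
 * with every bit below the highest differing bit cleared.
 */
unsigned long pick_expiry(unsigned long expires, unsigned long slack)
{
    unsigned long limit = expires + slack;
    unsigned long diff = expires ^ limit;
    unsigned long mask;
    int bit = 0;

    /* This sketch does not handle wraparound of the tick counter. */
    assert(limit >= expires);

    if (diff == 0)
        return expires;          /* no room to move the timer */

    /* Find the index of the highest set bit in diff. */
    while (diff >>= 1)
        bit++;

    mask = (1UL << bit) - 1;
    return limit & ~mask;
}
```

With these definitions, a timer due at tick 1000 with 50 ticks of slack and
another due at tick 1010 with 30 ticks of slack both end up at tick 1024,
so a single wakeup serves both.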
The value of this approach is that it makes it easy to coalesce timer
events from multiple sources without needing to change every call site.
Additional flexibility can then be had by increasing the slack for
specific, frequently-used timers, but, even without that, slack timers
should improve power efficiency on many systems.
By Jonathan Corbet
January 13, 2010
It has now been a year since kernel mode setting (KMS) went into the
mainline. KMS moves control of low-level graphics processor modes into the
kernel and away from user-space drivers, with a number of associated
advantages. Initially only the Intel driver supported KMS, but it has
found its way into the Radeon and Nouveau drivers. Now developers are
beginning to talk about eliminating user-space mode setting support
entirely.
On the Nouveau front, Ben Skeggs posted a
patch to remove non-KMS support, saying:
The non-KMS paths are messy, and lets face it, rotting badly. IMO
the KMS code is stable enough now that we can continue without the
UMS crutch, and indeed, the KMS code supports a lot more chipsets
(particularly on GF8 and up) than the UMS code ever will.
The main objection to the removal of this code is that BSD-based systems do
not support KMS, but the current driver does not work on those systems
anyway. So, while this patch has not found its way to the mainline, it
would not be surprising if that happened before the 2.6.34 release.
At about the same time, some Intel driver developers started to ask whether non-KMS support could be dropped.
There, too, it seems that the user-space mode setting code is unloved and
proving hard to maintain. This code looks like it will remain an unwelcome
guest for a while, though; Linus is in no
hurry to remove it, and Dave Airlie is even
more reluctant:
I'm in the 2-3 years at a minimum, with at least one kernel with no
serious regressions in Intel KMS, which we haven't gotten close to
yet. I'm not even sure the Intel guys are taking stable seriously
enough yet. So far I don't think there is one kernel release (even
stable) that works on all Intel chipsets without backporting
patches.
So the removal of non-KMS support from the Intel driver is being held up by
concerns about the stability of the KMS code. But there is a bigger issue
as well: Intel support has been in the kernel for years, so there are
plenty of systems which are dependent on user-space mode setting. That
means that the support needs to be maintained for long enough to be sure of
not breaking those systems. Nouveau, instead, has the advantage of not
having been in the mainline until now, so the same regression concerns do
not apply. There are advantages, sometimes, to being the latecomer.
Kernel development news
By Jonathan Corbet
January 13, 2010
Mathieu Desnoyers is the longtime developer of the
LTTng tracing toolkit.
A current project of his is to provide for fast tracing of multithreaded
user-space applications; that, in turn, requires a fast, multithreaded
tracing utility. Tracing is controlled through a shared memory area; to
make that control as fast as possible, Mathieu would like to use the
read-copy-update (RCU) algorithm. That, in turn, means that he has been
working on porting RCU - a kernel-only technology - to user space. In the
process, he has run into some interesting challenges.
As with the kernel version, user-space RCU works by deferring the cleanup
of in-memory objects until it is known that no more references to those
objects can exist. The implementation must be done differently, though,
since user-space code is unable to run in the same atomic mode used by RCU
in the kernel. So, in user space, a call to rcu_read_lock() sets
a variable in shared memory indicating that the thread is in an RCU
critical section. Within that critical section, it's safe for the thread
to access RCU-protected variables.
...at least, it's safe as long as nobody reorders operations in a way that
causes an access to happen to an RCU-protected variable before the effects
of rcu_read_lock() are visible to other CPUs. That kind of
reordering can indeed happen, at both the compiler and CPU levels, so it's
a problem which must be addressed. Compile-time reordering is relatively
easy to avoid, but runtime reordering in the CPU requires the issuing of a
memory barrier instruction. And, indeed, user-space RCU can be made to
work by putting memory barriers into the rcu_read_lock() call.
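A minimal sketch of that barrier-based read side, written with C11 atomics,
might look like the following. It is an illustration only, not Mathieu's
code: struct rcu_reader and the sketch_ function names are hypothetical,
and a real implementation would also have to track nesting and grace
periods.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-thread state, visible to the thread doing updates. */
struct rcu_reader {
    atomic_int in_critical_section;
};

/*
 * Mark the thread as a reader, then issue a full memory barrier so the
 * store is visible to other CPUs before any RCU-protected data is read.
 */
void sketch_rcu_read_lock(struct rcu_reader *r)
{
    assert(r != NULL);
    atomic_store_explicit(&r->in_critical_section, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* the costly barrier */
}

/* Finish all protected accesses before announcing the exit. */
void sketch_rcu_read_unlock(struct rcu_reader *r)
{
    assert(r != NULL);
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&r->in_critical_section, 0, memory_order_relaxed);
}

/* What an updater would check before freeing an old object. */
int sketch_reader_active(struct rcu_reader *r)
{
    assert(r != NULL);
    return atomic_load_explicit(&r->in_critical_section, memory_order_seq_cst);
}
```

The two atomic_thread_fence() calls are the cost being discussed: every
read-side critical section pays for them, whether or not an update is in
progress.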
The problem with that solution is that memory barriers slow things down
significantly. Even worse, they slow down the fast path for a case - a
change to an RCU-protected variable - which happens rarely. So Mathieu
would like to get rid of that barrier. To that end, he coded up a solution
which sends a signal to every thread when an RCU-protected variable is
about to be changed, forcing each thread to execute a memory barrier at
that time. This solution does speed things up, believe it or not, but
signals are almost never the optimal solution to any problem. Mathieu
would like to do something better.
His "something better" turned out to be a simple system call:
void membarrier();
The initial implementation would simply send an inter-processor interrupt
to every CPU in the system; the receiving CPUs would respond by executing
an explicit memory barrier instruction. The solution worked, but it ran
into a couple of objections in review:
- By allowing a user-space program to force interrupts to all processors
on the system, membarrier() presented an easy way to create
denial-of-service attacks on the system.
- The system call interrupted every processor on the system.
Interrupting processors running different applications is a small but
useless waste. The problem gets a little worse if some of those CPUs
are running realtime tasks, which, presumably, would not welcome the
forced addition of a bit of latency into their world. It would even
interrupt processors which were currently sleeping - a useless
exercise which would also increase energy use.
What followed was a long discussion on how to optimize the patch, whether
an explicit memory barrier is needed even after the CPU has taken an
inter-processor interrupt (the safe answer appears to be "yes"), and so
on. All told, an impressive amount of effort has gone into the
optimization of a small patch which is, at its core, implementing the slow
path which should be rarely executed.
Current status, as of this writing, is that Mathieu has posted a new version of the patch with a number of
changes. The first of those is the addition of an integer
"expedited" parameter. If this value is zero, the system call
simply calls synchronize_sched() and returns; this is the cheapest
way of getting the needed functionality, but it comes at the cost of a
latency of some milliseconds for the caller. It seems clear that Mathieu
expects the "expedited" mode to be used most of the time.
For an expedited barrier, the system call will look at every CPU in the
system, building a mask of those which are running in the same address
space as the caller; those CPUs will then receive the inter-processor
interrupt asking them to execute a memory barrier instruction. It's a
rather more complicated implementation, but, since it only interrupts
processors which are running the calling application, the denial of
service, performance, and energy use concerns are no longer relevant. One
assumes that the patch is getting close to its final form, but it's hard to
say for sure: sometimes it's the smallest and simplest patches which are
scrutinized the most.
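The mask-building step of the expedited path can be illustrated with a
small user-space simulation. Everything here is a stand-in: cpu_mm[] is a
hypothetical table recording which address space each CPU is currently
running (zero meaning idle), and skipping the caller's own CPU is an
assumption of this sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return a bitmask of the CPUs which would need an inter-processor
 * interrupt: those running in the caller's address space, other than the
 * CPU the caller itself is running on.
 */
uint64_t build_ipi_mask(const unsigned long *cpu_mm, size_t ncpus,
                        unsigned long caller_mm, size_t caller_cpu)
{
    uint64_t mask = 0;

    assert(cpu_mm != NULL);
    assert(ncpus <= 64);         /* one bit per CPU in this sketch */

    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == caller_cpu)
            continue;            /* the caller runs its own barrier */
        if (cpu_mm[cpu] == caller_mm)
            mask |= (uint64_t)1 << cpu;
    }
    return mask;
}
```

On a five-CPU system where CPUs 0, 2, and 4 run threads of the calling
process, a call made from CPU 0 selects only CPUs 2 and 4; idle CPUs and
those running other applications are left alone.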
By Jake Edge
January 13, 2010
Process sandboxes for security are getting lots of attention these days.
There are standalone utilities like isolate and Rainbow, sandboxes integrated
into applications like the Chromium sandbox, as well as tools that
use existing LSMs such as the SELinux sandbox. Furthermore, there are
various proposals floating around to add Linux kernel features in support
of application sandboxes, such as the seccomp additions and network
restrictions. An LSM specifically designed for application sandboxing,
which uses a new model called Functionality-Based Application Confinement
(FBAC), was introduced on linux-kernel back in December.
FBAC-LSM came out of Z. Cliffe Schreuders's PhD research, and is a
prototype implementation of the FBAC model. It uses an earlier version of
the LSM interface, with the AppArmor pathname-based hooks, and still needs
"quite a bit of work to be done before it is ready for production systems or
formal code review." Schreuders is looking for collaborators to
work on completing the project, presumably with an eye towards getting it
into the mainline.
The basic idea behind FBAC is to make security policy more accessible and
understandable to users, so that application restrictions are more widely
adopted. A major component of the FBAC system is a GUI-based policy
manager that can guide users through setting policies for particular
applications. Users specify the high-level needs of an application based
on its type (such as web browser or file editor) and the policy manager
will help create the policies that will govern its behavior.
In developing the policy manager, Schreuders analyzed over a hundred
different applications to extract common behaviors that could be
encapsulated in the FBAC policies. This allows the policy manager to
automate certain aspects of developing policies for new applications,
including things like configuration files, network ports, and other
resources that the application requires.
The policy manager also has a "learning mode" where it can observe the
application and suggest additional privileges that might be granted.
FBAC has the underlying concept of "functionalities", each of which is
essentially a set of permissions for the file and network operations that
are allowed. These are fine-grained permissions for things like "file_read",
"file_getattr", "file_execute", "dir_mkdir", "network_incoming", etc.
The permissions which are granted to a particular functionality are listed
in its definition.
Functionalities are hierarchical, so that they can incorporate other,
lower-level permissions into one that governs an entire application or
class of applications. In addition, they are parameterized so that a
functionality can be applied to multiple different applications, with the
parameters specifying the particular files, directories, and network
destinations that the permissions are granted for.
Both mandatory access control (MAC) and discretionary access control (DAC)
are supported by FBAC. Application policy can be permanently set by an
administrator, so that an ordinary user cannot make changes, or FBAC can
be configured to allow users to further restrict applications beyond the
policies set by the administrator. The confinement of an application then
depends on the intersection of these mandatory and discretionary policies.
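The functionality hierarchy and the intersection of mandatory and
discretionary policy can be modeled in a few lines of C. This is only a
toy model under stated assumptions: the permission names come from the
list above, but the bit values, struct functionality, and both functions
are hypothetical, parameters (files, directories, network destinations)
are left out, and nothing here reflects the actual FBAC-LSM code.

```c
#include <assert.h>
#include <stddef.h>

/* Permission bits; names from the article, values invented here. */
enum perm {
    PERM_FILE_READ        = 1 << 0,
    PERM_FILE_GETATTR     = 1 << 1,
    PERM_FILE_EXECUTE     = 1 << 2,
    PERM_DIR_MKDIR        = 1 << 3,
    PERM_NETWORK_INCOMING = 1 << 4
};

#define MAX_CHILDREN 4

/* A functionality grants some permissions and may include others. */
struct functionality {
    const char *name;
    unsigned int perms;                                  /* granted directly */
    const struct functionality *children[MAX_CHILDREN];  /* NULL-terminated */
};

/* Collect the permissions of a functionality and all it contains.
 * The hierarchy is assumed to be free of cycles. */
unsigned int effective_perms(const struct functionality *f)
{
    unsigned int perms;

    assert(f != NULL);
    perms = f->perms;
    for (size_t i = 0; i < MAX_CHILDREN && f->children[i] != NULL; i++)
        perms |= effective_perms(f->children[i]);
    return perms;
}

/* Confinement is what both the administrator and the user allow. */
unsigned int confined_perms(const struct functionality *mandatory,
                            const struct functionality *discretionary)
{
    return effective_perms(mandatory) & effective_perms(discretionary);
}
```

In this model a user-supplied policy can only remove permissions from what
the administrator has granted, never add to them, which is the property
the intersection is meant to provide.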
Allowing users to specify reduced privileges for arbitrary applications
risks running afoul of the problems with setuid() programs that
other sandboxing mechanisms (the network restrictions feature mentioned
above for example) have encountered. Some means of stopping unprivileged
users from changing the environment expected by setuid() programs
will need to be provided.
The interface to FBAC-LSM is via a filesystem which is mounted at
/sys/kernel/security/fbac-lsm. Various files in the directory
allow querying the existing installed policies as well as adding new ones.
There are several steps to sending the policy information, with each piece
being written to a separate file in the directory. That is followed by
"commit" being written to /sys/kernel/security/fbac-lsm/commit, which
actually causes the policy to be processed. That is rather race-prone, but
is required by the sysfs "one value per file" rule. It seems
likely that FBAC-LSM will eventually change its interface to a
private filesystem like those used by
Smack and SELinux.
FBAC is a different approach from that taken by other security solutions, but
it has enough similarities that Schreuders has plans to make the policy
manager read and write AppArmor and SEEdit policies. But FBAC definitely
lives up to its prototype billing. The code is rather disorganized and
littered with commented-out sections that make it somewhat hard to follow.
The current incarnation of FBAC-LSM certainly has the feel of code that was
put together somewhat hurriedly for a PhD dissertation, rather than as a "real" LSM. But it
does embody some interesting ideas that merit further attention. One of
the biggest hurdles faced by various security solutions (for which SELinux
is the poster child) is the complexity of developing and—more
importantly—understanding the policies that are being used. That
complexity is something that Schreuders set out to reduce with FBAC. It
remains to be seen if he has succeeded with that, but any such attempt is
worthy of a look.
By Jonathan Corbet
January 12, 2010
Improving the performance of the kernel is generally a good thing to do;
that is why many of our best developers have put considerable amounts of
time into optimization work.
One area which has recently seen some attention is in the handling of soft
page faults. As the course of this work shows, though, performance
problems are not always where one thinks they might be; sometimes it's
necessary to take a step back and reevaluate the situation, possibly
dumping a lot of code in the process.
Page faults can be quite expensive, especially those which must be resolved
by reading data from disk. On a typical system, though, there are a lot of
page faults which do not require I/O. A page fault happens because a
specific process does not have a valid page table entry for the needed
page, but that page might already be in the page cache, in which case
handling the fault is just a matter of fixing the page table entry and
increasing the page's reference count; this happens often with shared
pages or those brought in via the readahead mechanism.
Faults for new anonymous pages (application data and stack space, mostly),
instead, can be handled through the
allocation of a zero-filled page. In either case, the fault is quickly taken
care of with no recourse to backing store required.
In many workloads, this kind of "soft" fault happens much more often than
hard faults requiring actual I/O. So it's important that they be executed
quickly. Various developers had concluded that the kernel was, in fact,
not handling this kind of fault quickly enough, and they identified the use
of the mmap_sem reader/writer semaphore as the core of the
problem. Contention wasn't the issue in this case - only a reader lock is
required for page fault handling - but the cache line bouncing caused by
continual acquisition of the lock was killing performance. As the number
of cores in systems increases, this kind of problem can only get worse.
In response, Hiroyuki Kamezawa posted the first speculative page fault patch
back in November. The core idea behind the patch was to try to handle page
faults without taking mmap_sem at all. Doing so invites race
conditions; in particular, the vm_area_struct (VMA) structure
which controls the memory mapping can change while the work is in
progress. So the speculative fault code must handle the fault, then check for
concurrent changes and, if necessary, redo the work the older, slower way.
That's the "speculative" part: doing the work in a lockless mode in the hope
that the world will not change in the meantime and force that work to be done
again.
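The general pattern (do the work without the lock, check whether anything
changed, fall back to the slow path if so) can be sketched in user space
with a change counter. This is a simplified, single-threaded illustration
and not the code from the patch: struct mapping, its seq counter, and the
"concurrent" hook that stands in for another thread are all hypothetical,
and a real lockless reader would need far more care when reading the
mapping itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: an address range plus a change counter. */
struct mapping {
    atomic_uint seq;             /* bumped on every change to the range */
    unsigned long start, end;
};

enum fault_path { FAULT_SPECULATIVE, FAULT_LOCKED };

/* Example of a concurrent change: shrink the range and bump the counter. */
void shrink_mapping(struct mapping *m)
{
    m->end = m->start + (m->end - m->start) / 4;
    atomic_fetch_add(&m->seq, 1);
}

/*
 * Try the lockless path first; redo the work the slow way if the mapping
 * changed underneath us. "concurrent" may be NULL.
 */
enum fault_path handle_fault(struct mapping *m, unsigned long addr,
                             bool *covered,
                             void (*concurrent)(struct mapping *))
{
    unsigned int before;
    bool result;

    assert(m != NULL && covered != NULL);

    before = atomic_load(&m->seq);
    result = addr >= m->start && addr < m->end;   /* speculative work */

    if (concurrent != NULL)
        concurrent(m);           /* simulate interference from elsewhere */

    if (atomic_load(&m->seq) == before) {
        *covered = result;       /* nothing changed: keep the result */
        return FAULT_SPECULATIVE;
    }

    /* Slow path: a real implementation would take the lock here. */
    *covered = addr >= m->start && addr < m->end;
    return FAULT_LOCKED;
}
```

When nothing interferes, the fast path's answer stands; when the range is
shrunk in mid-lookup, the stale answer is discarded and the work is
repeated against the current state.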
The speculative page fault code must also ensure that no changes which
could create real trouble (such as freeing the VMA outright) can happen
while the fault is being handled. To this end, various versions of the
patch have tried techniques like adding reference counts to the VMA
structure or using read-copy-update with the red-black tree code (which is
used to locate the VMA covering a specific address within an address space)
to defer changes while the speculative code is doing its thing.
This work yielded some real results: the number of page faults per unit
time that the system could handle, when running a fault-heavy benchmark,
approximately doubled. The idea clearly had merit, but Peter Zijlstra felt that Kamezawa-san's patches
"weren't quite crazy enough". He set out to rectify this
problem with a speculative page
fault patch of his own, which saw a number of revisions. Peter's
approach included the addition of speculative page table locks and the use
of RCU to manage VMA structures. The result was a patch which would
"sometimes boot" and which seemed to improve performance.
This is about when Linus got involved, pointing out some problems
with Peter's patch, concluding:
I would say that this whole series is _very_ far from being
mergeable. Peter seems to have been thinking about the details,
while missing all the subtle big picture effects that seem to
actually change semantics.
Peter agreed with this conclusion, noting that he'd never thought it was
ready yet.
It was in further discussion that Linus, looking at a profile of page fault
handling activity, noticed something funny:
the real overhead seemed to be in spinlock operations, which shouldn't be
involved in the x86-optimized rwsem implementation at all. It turns out
that said optimization was only applied to 32-bit systems; on 64-bit
builds, reader/writer semaphores were using the generic, semaphore-based
implementation. That meant that they were rather slower than they needed
to be.
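To see why the optimized form matters, consider that an uncontended read
acquisition can be reduced to a single atomic add with no spinlock at all.
The following user-space sketch shows the idea with C11 atomics; it is a
simplified illustration, not kernel code. The sketch_rwsem type, the
WRITER_BIAS value, and the trylock-only interface are assumptions, and the
sleeping slow paths of a real semaphore are omitted. On x86, compilers
typically implement atomic_fetch_add() with a locked xadd when the old
value is used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding: a held write lock makes the count negative. */
#define WRITER_BIAS (-0x10000L)

struct sketch_rwsem {
    atomic_long count;           /* number of readers, or WRITER_BIAS */
};

/* Reader fast path: one atomic add, which succeeds if no writer is present. */
bool sketch_down_read_trylock(struct sketch_rwsem *sem)
{
    assert(sem != NULL);
    if (atomic_fetch_add(&sem->count, 1) >= 0)
        return true;
    atomic_fetch_sub(&sem->count, 1);    /* writer present: back out */
    return false;
}

void sketch_up_read(struct sketch_rwsem *sem)
{
    assert(sem != NULL);
    atomic_fetch_sub(&sem->count, 1);
}

/* A writer can only get in when there are no readers and no writer. */
bool sketch_down_write_trylock(struct sketch_rwsem *sem)
{
    long expected = 0;

    assert(sem != NULL);
    return atomic_compare_exchange_strong(&sem->count, &expected, WRITER_BIAS);
}

void sketch_up_write(struct sketch_rwsem *sem)
{
    assert(sem != NULL);
    atomic_fetch_sub(&sem->count, WRITER_BIAS);
}
```

The read side still writes to the shared counter, so the cache line
continues to bounce between CPUs; what disappears is the extra spinlock
traffic of the generic implementation.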
So Linus tossed out a new rwsem
implementation based on the x86 exchange-and-add (xadd)
instruction with a typical warning:
In other words: UNTESTED! It may molest your pets and drink all
your beer. You have been warned.
Kamezawa-san bravely tried the code anyway, and got an interesting result. His pets and his beer
both came through without trauma - and the page fault performance was better than
with his speculative fault patch. Peter, too, ran some tests against his own speculative code;
those results showed that the rwsem change tripled page fault performance.
His speculative fault patch improved performance by just a tiny bit more
than that, and the two together a little more yet. But the rwsem patch is a
small and clear fix, while the speculative page fault patch is large,
widespread, scary, and with known problems. So nobody really disputed
Peter's conclusion:
So while I think its quite feasible to do these speculative faults,
it appears we're not quite ready for them.
As of this writing, nobody has posted a final version of the rwsem patch.
Linus has noted that there are things which can be improved with it, but it
would be fairly typical for him to leave that work to others. But, one
assumes, some version of this patch will be waiting in the wings when the
2.6.34 merge window opens. It will be a clear demonstration that
low-hanging performance fruit exists even in our highly-optimized kernel;
one need only think to look in the right place.
Patches and updates
Build system
- Michal Marek: nconfig. (January 7, 2010)
Page editor: Jonathan Corbet