The current development kernel is 2.6.36-rc2, released by Linus on
August 22. It contains mostly fixes, but Linus did also pull some
small parts of the VFS scalability patch set. "The other big merge
in -rc2 is the intel graphics update. I'm not hugely happy about the timing
of it, but I think I needed to pull it. Apart from that, there's a number
of random fixes all over, the appended shortlog gives you a taste of
[...]" See said shortlog for details, or the full changelog for all the
details.
About 200 changes have been merged (as of this writing) since the
2.6.36-rc2 release. They are dominated by fixes, but there's also a new
driver for Marvell pxa168 Ethernet controllers and a core mutex change (see
below).
Stable updates: the 2.6.27.52, 2.6.32.20, 2.6.34.5, and 2.6.35.3 stable
kernel updates were released on August 20. These are relatively small
updates containing fixes for the new "stack guard page" feature which was
added to close the recently-disclosed X.org local root vulnerability.
There is a rather larger set of updates in the review process currently;
they can be expected on or after August 26.
Well, sir, the wait staff and I thought we'd just write to ask how
your convalescence was going. Sorry about spilling the patch down
your front, sir, but the glass was slippery. Who could have
foreseen that you'd step backwards, fall over the pool boy and
upset a flaming bananas foster on to the front of your shorts. I
did get a commendation for my quick action with a high pressure
hose to the affected area (thank you, sir, for your shocked but
grateful expression). I must say I didn't foresee the force of the
blast catapulting you against the pool steps, but a concussion is a
small price to pay for avoiding fried nuts with the bananas, eh,
sir? It's funny how things turn out, and I bet if you could do it
over again, you'd have got off your arse to fetch your own damn
patch now, wouldn't you, sir?
-- James Bottomley
(Thanks to Jody Belka)
Oh no you don't. As per the documentation in the kernel, I get to
now mock you mercilessly for trying to do such a foolish thing!
-- Greg Kroah-Hartman
As far as I'm concerned, the guard page thing is not - and
shouldn't be thought of - a "hard" feature. If it's needed, it's
really a bug in user space. But given that there are bugs in user
space, the guard page makes it a bit harder to abuse those
bugs. But it's about "a bit harder" rather than anything else.
-- Linus Torvalds
A kernel mutex is a sleeping lock; a thread which loses in contention for a
specific mutex will be blocked until that mutex becomes available. At least,
that's what the documentation says; the reality is a bit more complicated.
Experience has shown that throughput can sometimes be improved if processes waiting
for a lock do not go to sleep immediately. In particular, if (1) a
thread finds a mutex to be unavailable, and (2) the holder of the
mutex is currently running, that thread will spin until the mutex becomes
available or the holder blocks. That "optimistic" spinning allows the transfer of the
mutex without going through a sleep/wakeup cycle, and, importantly, it
gives the mutex to a running (and, thus, cache-hot) thread. The result is
an unfair, but better-performing mutex implementation.
Except that, as it turns out, it doesn't always perform better. While
doing some testing on a 64-core system, Tim Chen noticed a problem:
multiple threads can be waiting for the same mutex at any given time. Once
the mutex becomes available, only one of those spinning threads will obtain
it; the others will continue to spin, contending for the lock. In general,
optimism can be good, but excessive optimism can be harmful if it leads to
continued behavior which does not yield useful results. That would appear
to be the case here.
Tim's response was a patch changing the
optimistic spinning implementation slightly. There is now an additional
check in the loop to see if the owner of the mutex has changed. If the
ownership of a mutex changes while a thread is spinning, waiting for it,
that means that it was released and somebody else grabbed it first. In
other words, there is heavy contention and multiple CPUs are spinning in a
race that only one of them can win. In such cases, it makes sense to just
go to sleep and wait until things calm down a bit.
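In outline, the modified spin loop looks something like the sketch below;
the helper names (owner_running(), mutex_trylock_fast()) are illustrative
stand-ins, not the actual kernel functions:

    /*
     * Illustrative sketch of optimistic spinning with the owner-change
     * check; simplified names, not the real kernel/mutex.c code.
     */
    static int mutex_optimistic_spin(struct mutex *lock)
    {
        struct task_struct *owner = lock->owner;

        while (owner && owner_running(owner)) {
            if (mutex_trylock_fast(lock))
                return 1;    /* got the lock while still cache-hot */
            if (lock->owner != owner)
                return 0;    /* owner changed: heavy contention, sleep */
            cpu_relax();     /* be polite to the hardware, spin again */
        }
        return 0;            /* owner blocked: sleep and await a wakeup */
    }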
Various benchmark results showed significant performance improvements in
heavily-contended situations. That was enough to get the patch merged for
2.6.36.
One of the darker corners of the kernel's memory management subsystem is
the __GFP_NOFAIL flag. That flag says that an allocation request
cannot be allowed to fail regardless of whether memory is actually
available or not; if the request cannot be satisfied, the allocator will
loop continuously in the hope that, somehow, something will eventually
change. Needless to say, kernel developers are not encouraged to use this
option. More recently, David Rientjes has been trying to get rid of it by
pushing the (possibly) infinite looping back to callers.
Andrew Morton was not convinced by the patch:
The reason for adding GFP_NOFAIL in the first place was my
observation that the kernel contained lots of open-coded retry-for-ever
loops.
All of these are wrong, bad, buggy and mustfix. So we consolidated
the wrongbadbuggymustfix concept into the core MM so that
miscreants could be easily identified and hopefully fixed.
David's response is that said miscreants have not been fixed over the
course of many years, and that __GFP_NOFAIL imposes complexity on
the page allocator which slows things down for all users. Andrew came back
with a suggestion for special versions of the allocation functions which
would perform the looping; that would move the implementation out of the
core allocator, but still make it possible to search for code in need of
fixing; David obliged with a
patch adding kmalloc_nofail() and friends.
This kind of patch is guaranteed to bring out comments from those who feel
that it is far better to just fix code which is not prepared to deal with
memory allocation failures. But, as Ted Ts'o pointed out, that is not
always an easy thing to do:
So we can mark the retry loop helper function as deprecated, and
that will make some of these cases go away, but ultimately if we're
going to fail the memory allocation, something bad is going to
happen, and the only question is whether we want to have something
bad happen by looping in the memory allocator, or to force the file
system to panic/oops the system, or have random application die
and/or lose user data because they don't expect write() to return ENOMEM.
Ted's point is that there are always going to be places where recovery from
a memory allocation failure is quite hard, if it's possible at all. So the
kernel can either provide some means by which looping on failure can be done
centrally, or see it done in various ad hoc ways in random places in the
kernel. Bad code is not improved by being swept under the rug, so it seems
likely that some sort of central loop-on-failure mechanism will continue to
exist.
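For illustration, the open-coded pattern that __GFP_NOFAIL was created to
replace, and its flag-based equivalent, look roughly like this (a minimal
sketch; real retry loops usually also yield the CPU between attempts):

    #include <linux/slab.h>

    /* The open-coded retry-for-ever loop Andrew Morton was complaining
     * about; it simply refuses to accept allocation failure. */
    void *must_succeed_alloc(size_t size)
    {
        void *p;

        do {
            p = kmalloc(size, GFP_KERNEL);
        } while (!p);        /* loop until something frees up */
        return p;
    }

    /* The same semantics, with the looping hidden in the page allocator. */
    void *must_succeed_alloc_nofail(size_t size)
    {
        return kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
    }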
Kernel development news
It is rare for Linus to talk about what he plans to merge in a given
development cycle before the merge window opens; it seems that he prefers
to see what the pull requests look like and make his decisions afterward.
He made an exception in the 2.6.35 announcement, though:
On a slightly happier note: one thing I do hope we can merge in the
upcoming merge window is Nick Piggin's cool VFS scalability series.
I've been using it on my own machine, and gone through all the
commits (not that I shouldn't go through some of them some more),
and am personally really excited about it. It's seldom we see major
performance improvements in core code that are quite that
noticeable, and Nick's whole RCU pathname lookup in particular just
tickles me pink.
It's a rare developer who, upon having tickled the Big Penguin to that
particular shade, will hold off on merging his changes. But Nick asked that the patches sit out for one more
cycle, perhaps out of the entirely rational fear of bugs which might
irritate users to a rather deeper shade. So Linus will have to wait a bit
for his RCU pathname lookup code.
That said, some parts of the VFS scalability code did make it into the
mainline for 2.6.36-rc2.
Like most latter-day scalability work, the VFS work is focused on
increasing locality and eliminating situations where CPUs must share
resources. Given that a filesystem is an inherently global structure,
increasing locality can be a challenging task; as a result, parts of Nick's
patch set are on the complex and tricky side. But, in the end, it comes
down to dealing with things locally whenever possible, but making global
action possible when the need arises.
The first step is the introduction of two new lock types, the first of
which is called a "local/global lock" (lglock). An lglock is intended to
provide very fast access to per-CPU data while making it possible (at a
rather higher cost) to get at another CPU's data. An lglock is created
with:

    DEFINE_LGLOCK(name);

The DEFINE_LGLOCK() macro is a 99-line wonder which creates the
necessary data structure and accessor functions. By design, lglocks can
only be defined at the file global level; they are not intended to be
embedded within data structures.
Another set of macros is used for working with the lock:

    lg_local_lock(name);
    lg_local_unlock(name);
    lg_local_lock_cpu(name, int cpu);
    lg_local_unlock_cpu(name, int cpu);
Underneath it all, an lglock is really just a per-CPU array of spinlocks.
So a call to lg_local_lock() will acquire the current CPU's
spinlock, while lg_local_lock_cpu() will acquire the lock
belonging to the specified cpu. Acquiring an lglock also disables
preemption, which would not otherwise happen in realtime kernels. As long
as almost all locking is local, it will be very fast; the lock will not
bounce between CPUs and will not be contended. Both of those assumptions
go away, of course, if the cross-CPU version is used.
Sometimes it is necessary to globally lock the lglock:

    lg_global_lock(name);
    lg_global_unlock(name);
    lg_global_lock_online(name);
    lg_global_unlock_online(name);
A call to lg_global_lock() will go through the entire array,
acquiring the spinlock for every CPU. Needless to say, this will be a very
expensive operation; if it happens with any frequency at all, an lglock is
probably the wrong primitive to use. The _online version only
acquires locks for CPUs which are currently running, while
lg_global_lock() acquires locks for all possible CPUs.
The VFS scalability patch set also brings back the "big reader lock"
concept. The idea behind a brlock is to make locking for read access as
fast as possible, while making write locking possible. The brlock API
(also defined in <linux/lglock.h>) looks like this:

    DEFINE_BRLOCK(name);
    br_read_lock(name);
    br_read_unlock(name);
    br_write_lock(name);
    br_write_unlock(name);
As it happens, this version of brlocks is implemented entirely with
lglocks; br_read_lock() maps directly to lg_local_lock(),
and br_write_lock() turns into lg_global_lock().
The first use of lglocks is to protect the list of open files which is
attached to each superblock structure. This list is currently protected by
the global files_lock, which becomes a bottleneck when a lot of
open() and close() calls are being made. In 2.6.36, the
list of open files becomes a per-CPU array, with each CPU managing its own
list. When a file is opened, a (cheap) call to lg_local_lock()
suffices to protect the local list while the new file is added.
When a file is closed, things are just a bit more complicated. There is no
guarantee that the file will be on the local CPU's list, so the VFS must be
prepared to reach across to another CPU's list to clean things up. That,
of course, is what lg_local_lock_cpu() is for. Cross-CPU locking
will be more expensive than local locking, but (1) it only involves
one other CPU, and (2) in situations where there is a lot of opening
and closing of files, chances are that the process working with any
specific file will not migrate between CPUs during the (presumably short)
time that the file is open.
The real reason that the per-superblock open files list exists is to let
the kernel check for writable files when a filesystem is being remounted
read-only. That operation requires exclusive access to the entire list, so
lg_global_lock() is used. The global lock is costly, but
read-only remounts are not a common occurrence, so nobody is likely to
notice.
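Pulled together, the three cases look something like this sketch; the
names and fields are modeled on, but not identical to, the 2.6.36
fs/file_table.c code, and s_files is assumed here to be a per-CPU list
allocated with alloc_percpu():

    /* Sketch of the per-CPU superblock file list protected by an lglock. */
    DEFINE_LGLOCK(files_lglock);

    static void file_list_add(struct file *file, struct super_block *sb)
    {
        lg_local_lock(files_lglock);
        file->f_list_cpu = smp_processor_id();  /* remember the home CPU */
        list_add(&file->f_list, per_cpu_ptr(sb->s_files, file->f_list_cpu));
        lg_local_unlock(files_lglock);
    }

    static void file_list_del(struct file *file)
    {
        int cpu = file->f_list_cpu;     /* may not be the current CPU */

        lg_local_lock_cpu(files_lglock, cpu);
        list_del_init(&file->f_list);
        lg_local_unlock_cpu(files_lglock, cpu);
    }

    /* The remount-read-only check: lock every CPU's list at once. */
    static int sb_has_writable_files(struct super_block *sb)
    {
        struct file *file;
        int cpu, ret = 0;

        lg_global_lock(files_lglock);
        for_each_possible_cpu(cpu) {
            list_for_each_entry(file, per_cpu_ptr(sb->s_files, cpu), f_list)
                if (file->f_mode & FMODE_WRITE)
                    ret = 1;
        }
        lg_global_unlock(files_lglock);
        return ret;
    }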
Also for 2.6.36, Nick changed the global vfsmount_lock into a
brlock. This lock protects the tree of mounted filesystems; it must be
acquired (in a read-only mode) whenever a pathname lookup crosses from one
mount point to the next. Write access is only needed when filesystems are
mounted or unmounted - again, an uncommon occurrence on most systems. Nick
warns that this change is unlikely to speed up most workloads now - indeed,
it may slow some down slightly - but its value will become clearer when
some of the other bottlenecks are taken care of.
Aside from a few smaller changes, that is where VFS scalability work stops
for the 2.6.36 development cycle. The more complicated work - dealing with
dcache_lock in particular - will go through a few more months of
testing before it is pushed toward the mainline. Then, perhaps, we'll see
Linus in a proper shade of pink.
Adding an interface for user space to be able to access the kernel crypto
subsystem—along with any hardware acceleration available—seems
like a reasonable idea at first blush. But adding a huge chunk of
formerly user-space code to the kernel to implement additional cryptographic
algorithms, including public key cryptosystems, is likely to be difficult to sell.
Coupling that with an ioctl()-based API, with pointers and variable-length
data, raises the barrier further still. Still, there are some good
arguments for providing some kind of user-space interface to the
crypto subsystem, even if the current proposal doesn't pass muster.
Miloslav Trmač posted an RFC
patchset that implements the /dev/crypto user-space
interface. The code is derived from cryptodev-linux, but the
new implementation was largely developed by Nikos Mavrogiannopoulos.
The patchset is rather large, mostly because of the inclusion of two
user-space libraries for handling multi-precision integers (LibTomMath) and
cryptographic algorithms (LibTomCrypt);
some 20,000 lines of code in all. That is the current
implementation, though there is mention of switching to something based on
a different library, one which is believed to be more scrutinized as well
as more actively maintained, but is not particularly small either.
One of the key benefits of the new API is that keys can be handled
completely within the kernel, allowing user space to do whatever encryption
or decryption it needs without ever exposing the key to the application.
That means that application vulnerabilities would be unable to expose any
keys. The keys can also be wrapped by the kernel so that the application
can receive an encrypted blob that it can store persistently to be loaded
back into the kernel after a reboot.
Ted Ts'o questioned the whole idea behind the
interface, specifically whether hardware acceleration would really speed
things up:
More often than not, by the time you take into account the time to move the
crypto context as well as the data into kernel space and back out, and
after you take into account price/performance, most hardware crypto
[accelerators] have marginal performance benefits; in fact, more often
than not, it's a lose.
He was also concerned that the key handling was redundant: "If the
goal is access to hardware-escrowed keys, don't we have the TPM [Trusted
Platform Module] interface for that already?" But Mavrogiannopoulos noted that embedded systems are one target for this work, "where the hardware version of AES might
be 100 times faster than the software". He also said that the TPM
interface was not flexible enough and that one goal of the new API is that
"it can be wrapped by a PKCS #11 [Public-Key Cryptography Standard
for cryptographic tokens like keys]
module and used transparently by other crypto libraries
(openssl/nss/gnutls)", which the TPM interface is unable to support.
There is already support in the kernel for key management, so Kyle Moffett
would like to see that used: "We already have one very nice key/keyring API in the kernel
(see Documentation/keys.txt) that's being used for crypto keys for
NFSv4, AFS, etc. Can't you just add a bunch of cryptoapi key types to
that API instead?" Mavrogiannopoulos thinks
that because the keyring API allows exporting keys to user
space—something that the /dev/crypto API explicitly
prevents—it would be inappropriate. Keyring developer David Howells
suggests an easy way around that particular problem:
"Don't provide a read() key type operation, then".
But the interface itself also drew complaints. To use
/dev/crypto, an application needs to open() the
device, then start issuing ioctl() calls. Each ioctl()
operation (the commands are named NCRIO_*) has its own structure type
that gets passed as the data parameter to ioctl():
    res = ioctl(fd, NCRIO_..., &data);
Many of the structures contain pointers for user data (input and output),
which are declared as void pointers. That necessitates using the
compat_ioctl() machinery to handle 32- vs. 64-bit pointer issues, which Arnd Bergmann disagrees with: "New drivers should be written to *avoid* compat_ioctl calls, using only
very simple fixed-length data structures as ioctl commands."
He doesn't think that pointers should be used in the
interface at all if possible: "Ideally, you would use ioctl to control
the device while you use read and write to pass actual bits of data".
Beyond that, the interface also mixes in netlink-style variable
length attributes to support things like algorithm choice,
initialization vector, key type (secret, private, public), key wrapping
algorithm, and many additional attributes that are algorithm-specific like
key length or RSA and DSA-specific values. Each of these can be tacked
onto the end of the operation-specific structure (for most, but not all, of
the operations) as an array of (struct nlattr, attribute data) pairs, using
the same formatting as netlink messages. It is, in short, a complex
interface that is reasonably well-documented in the first patch of the
series.
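As a rough illustration of that encoding (a sketch, not code from the
patchset; the attribute type constant is a placeholder), appending one
attribute to the buffer following an operation structure might look like:

    #include <string.h>
    #include <linux/netlink.h>   /* struct nlattr, NLA_HDRLEN, NLA_ALIGN() */

    /*
     * Append a netlink-style attribute after the operation-specific
     * structure; "type" would be one of the patchset's attribute
     * constants.  Returns the number of bytes consumed.
     */
    static size_t append_attr(unsigned char *buf, unsigned short type,
                              const void *data, unsigned short len)
    {
        struct nlattr *nla = (struct nlattr *)buf;

        nla->nla_type = type;
        nla->nla_len = NLA_HDRLEN + len;     /* header plus payload */
        memcpy(buf + NLA_HDRLEN, data, len);

        /* Attributes are padded to a 4-byte boundary, as in netlink. */
        return NLA_ALIGN(nla->nla_len);
    }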
Bergmann and others are also concerned about the inclusion of all of the
extra code:
However, the more [significant]
problem is the amount of code added to a security module. 20000 lines of
code that is essentially a user-level library moved into kernel space
can open up so many possible holes that you end up with a less secure
(and slower) setup in the end than just doing everything in user space.
Mavrogiannopoulos thinks that the "benefits outweigh
the risks" of adding the extra code, likening it to the existing
encryption and compression facilities in the kernel. The difference, as
Bergmann points out, is that the kernel actually uses those facilities
itself, so they must be in the kernel. The additional code being added
here is strictly to support user space.
In the patchset introduction, Trmač lists a number of arguments for
adding more algorithms to the kernel and providing a user-space API, most
of which boil down to various government specifications that require a
separation between the crypto provider and user. The intent is to keep the
key material separate from the—presumably more
vulnerable—programs, but there are other ways to do that, including having
a root daemon that offers the needed functionality, as noted in the
introduction.
There is a worry that the overhead of doing it that way would be too
high: "this would be slow due to context switches, scheduler
mismatching and all the IPC overhead". However, no numbers have yet
been offered to show how much overhead is added.
There are a number of interesting capabilities embodied in the API,
in particular for handling keys. A master AES key can be set for the
subsystem by a suitably privileged program; that key is then used to
encrypt and wrap keys before they are handed off to user space. None of
the key handling is persistent across reboots, so user space will have to
store any keys that get generated for it. Using the master key allows
that, without giving user space access to anything other than an encrypted
blob.
All of the expected operations are available through the interface:
encrypt, decrypt, sign, and verify. Each is accessible from a session that
gets initiated by an NCRIO_SESSION_INIT ioctl(), followed by zero
or more NCRIO_SESSION_UPDATE calls, and ending with an NCRIO_SESSION_FINAL.
For one-shot operations, there is also an NCRIO_SESSION_ONCE call that
handles all three of those operations in one call.
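A caller in user space might look something like the following sketch; the
structure layouts and ioctl numbers here are invented stand-ins for the
patchset's real definitions, which live in its headers and documentation:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>
    #include <unistd.h>

    /* Hypothetical stand-ins for the patchset's structures and numbers. */
    struct ncr_session_init   { int algorithm; int key; /* + nlattrs */ };
    struct ncr_session_update { const void *in; size_t in_len;
                                void *out; size_t out_len; };
    struct ncr_session_final  { int ses; };

    #define NCRIO_SESSION_INIT   _IOWR('c', 0x20, struct ncr_session_init)
    #define NCRIO_SESSION_UPDATE _IOWR('c', 0x21, struct ncr_session_update)
    #define NCRIO_SESSION_FINAL  _IOWR('c', 0x22, struct ncr_session_final)

    int encrypt_buffer(const void *in, size_t inlen, void *out, size_t outlen)
    {
        struct ncr_session_init init = { .algorithm = 0, .key = 0 };
        struct ncr_session_update update = { in, inlen, out, outlen };
        struct ncr_session_final final = { 0 };
        int fd, ret = -1;

        fd = open("/dev/crypto", O_RDWR);
        if (fd < 0)
            return -1;

        if (ioctl(fd, NCRIO_SESSION_INIT, &init) < 0)     /* start session */
            goto out;
        if (ioctl(fd, NCRIO_SESSION_UPDATE, &update) < 0) /* feed it data */
            goto out;
        ret = ioctl(fd, NCRIO_SESSION_FINAL, &final);     /* get the result */
    out:
        close(fd);
        return ret;
    }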
While it seems to be a well thought-out interface, with room for expansion
to handle unforeseen algorithms with different requirements, it's also very
complex. Other than the separation of keys and faster encryption for
embedded devices, it doesn't offer that much for desktop or server users,
and it adds an immense amount of code and the associated maintenance
burden. In its current form, it's hard to see /dev/crypto making
its way into the mainline, but some of the ideas it implements might—particularly if they are better integrated with existing kernel facilities
like the keyring.
One thing that kernels do is collect statistics. If one wishes to know how
many multicast packets have been received, page faults have been incurred,
disk reads have been performed, or interrupts have been received, the
kernel has the answer. This role is not normally questioned, but,
recently, there have been occasional suggestions that the handling of
statistics should be changed somewhat. The result is a changing view of
how information should be extracted from the kernel - and some interesting
questions.
Back in July, Gleb Natapov submitted a
patch changing the way paging is handled in KVM-virtualized guests.
Included in the patch was the collection of a couple of new statistics on
page faults handled in each virtual CPU. More than one month later
(virtualization does make things slower), Avi Kivity reviewed the patch; one of his suggestions was:
Please don't add more stats, instead add tracepoints which can be
converted to stats by userspace.
Nobody questioned this particular bit of advice. Perhaps that's because
virtualization seems boring to a lot of developers. But it is also
indicative of a wider trend.
That trend is, of course, the migration of much kernel data collection and
processing to the "perf events" subsystem. It has only been one year since perf
showed up in a released kernel, but it has seen sustained development and
growth since then. Some developers have been known to suggest that,
eventually, the core kernel will be an obscure bit of code that must be
kept around in order to make perf run.
Moving statistics collection to tracepoints brings some obvious
advantages. If nobody is paying attention to the statistics, no data is
collected and the overhead is nearly zero. When individual events can be
captured, their correlation with other events can be investigated, timing
can be analyzed, associated data can be captured, etc. So it makes some
sense to export the actual events instead of boiling them down to a small
set of numbers.
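For reference, kernel tracepoints are defined with the TRACE_EVENT() macro;
a minimal sketch (hypothetical event name and fields, omitting the
trace-header boilerplate that surrounds real definitions under
include/trace/events/) looks like:

    TRACE_EVENT(guest_page_fault,
        TP_PROTO(unsigned long address, int error_code),
        TP_ARGS(address, error_code),

        TP_STRUCT__entry(
            __field(unsigned long, address)
            __field(int,           error_code)
        ),

        TP_fast_assign(
            __entry->address    = address;
            __entry->error_code = error_code;
        ),

        TP_printk("address=%lx error_code=%d",
                  __entry->address, __entry->error_code)
    );

Once the event exists, user space can enable it and count occurrences over
whatever interval it likes, which is exactly the conversion to statistics
that Avi Kivity was suggesting.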
The down side of using tracepoints to replace counters is that it is no
longer possible to query statistics maintained over the lifetime of the
system. As Matt Mackall observed
over a year ago:
Tracing is great for looking at changes, but it completely falls
down for static system-wide measurements because it would require
integrating from time=0 to get a meaningful summation. That's
completely useless for taking a measurement on a system that
already has an uptime of months.
Most often, your editor would surmise, administrators and developers are
looking for changes in counters and do not need to integrate from time=0.
There are times, though, when that information can be useful to have. One
could come close by enabling the tracepoints of interest during the
bootstrap process and continuously collecting the events, but that can be
expensive, especially for high-frequency events.
There is another important issue which has been raised in the past and which
has never really been resolved. Tracepoints are generally seen as
debugging aids used mainly by kernel developers. They are often tied into
low-level kernel implementation details; changes to the code can often force
changes to nearby tracepoints, or make them entirely obsolete.
Tracepoints, in other words, are likely to be nearly as volatile as the
kernel that they are instrumenting. The kernel changes rapidly, so it
stands to reason that the tracepoints would change rapidly as well.
Needless to say, changing tracepoints will create problems for any
user-space utilities which make use of those tracepoints. Thus far, kernel
developers have not encouraged widespread use of tracepoints; the kernel
still does not have that many of them, and, as noted above, they are mainly
debugging tools. If tracepoints are made into a replacement for kernel
statistics, though, then the number of user-space tools using tracepoints
can only increase. And that will lead to resistance to patches which
change those tracepoints and break the tools.
In other words, tracepoints are becoming part of the user-space ABI.
Despite the fact that concerns about the ABI status of tracepoints have been raised in the
past, this change seems to be coming in through the back door with no real
planning. As Linus has pointed
out in the past, the fact that nobody has designated tracepoints as
part of the official ABI or documented them does not really change things.
Once an interface has been exposed to user space and come into wider use,
it's part of the ABI regardless of the developers' intentions. If
user-space tools use tracepoints, kernel developers will have to support
those tracepoints indefinitely into the future.
Past discussions have included suggestions for ways to mark
tracepoints which are intended to be stable, but no conclusions have
resulted. So the situation remains murky. It may well be that things will
stay that way until some future kernel change breaks somebody's tools.
Then the kernel community will be forced to choose between restoring
compatibility for the broken tracepoints or overtly changing its
longstanding promise not to break the user-space ABI (too often). It might
be better to figure things out before they get to that point.