Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.36-rc2, released by Linus on August 22. It contains mostly fixes, but Linus did also pull some small parts of the VFS scalability patch set. "The other big merge in -rc2 is the intel graphics update. I'm not hugely happy about the timing of it, but I think I needed to pull it. Apart from that, there's a number of random fixes all over, the appended shortlog gives you a taste of it." See said shortlog for details, or the full changelog for all the details.

About 200 changes have been merged (as of this writing) since the 2.6.36-rc2 release. They are dominated by fixes, but there's also a new driver for Marvell pxa168 Ethernet controllers and a core mutex change (see below).

Stable updates: four stable kernel updates were released on August 20. These are relatively small updates containing fixes for the new "stack guard page" feature which was added to close the recently-disclosed local root vulnerability.

There is a rather larger set of updates in the review process currently; they can be expected on or after August 26.

Quotes of the week

Well, sir, the wait staff and I thought we'd just write to ask how your convalescence was going. Sorry about spilling the patch down your front, sir, but the glass was slippery. Who could have foreseen that you'd step backwards, fall over the pool boy and upset a flaming bananas foster on to the front of your shorts. I did get a commendation for my quick action with a high pressure hose to the affected area (thank you, sir, for your shocked but grateful expression). I must say I didn't foresee the force of the blast catapulting you against the pool steps, but a concussion is a small price to pay for avoiding fried nuts with the bananas, eh, sir? It's funny how things turn out, and I bet if you could do it over again, you'd have got off your arse to fetch your own damn patch now, wouldn't you, sir?
-- James Bottomley (Thanks to Jody Belka)

Oh no you don't. As per the documentation in the kernel, I get to now mock you mercilessly for trying to do such a foolish thing!
-- Greg Kroah-Hartman

As far as I'm concerned, the guard page thing is not - and shouldn't be thought of - a "hard" feature. If it's needed, it's really a bug in user space. But given that there are bugs in user space, the guard page makes it a bit harder to abuse those bugs. But it's about "a bit harder" rather than anything else.
-- Linus Torvalds

Preventing overly-optimistic spinning

By Jonathan Corbet
August 25, 2010
A kernel mutex is a sleeping lock; a thread which loses in contention for a specific mutex will be blocked until that mutex becomes available. At least, that's what the documentation says; the reality is a bit more complicated. Experience has shown that throughput can sometimes be improved if processes waiting for a lock do not go to sleep immediately. In particular, if (1) a thread finds a mutex to be unavailable, and (2) the holder of the mutex is currently running, that thread will spin until the mutex becomes available or the holder blocks. That "optimistic" spinning allows the transfer of the mutex without going through a sleep/wakeup cycle, and, importantly, it gives the mutex to a running (and, thus, cache-hot) thread. The result is an unfair, but better-performing mutex implementation.

Except that, as it turns out, it doesn't always perform better. While doing some testing on a 64-core system, Tim Chen noticed a problem: multiple threads can be waiting for the same mutex at any given time. Once the mutex becomes available, only one of these spinning threads will obtain it; the others will continue to spin, contending for the lock. In general, optimism can be good, but excessive optimism can be harmful if it leads to continued behavior which does not yield useful results. That would appear to be the case here.

Tim's response was a patch changing the optimistic spinning implementation slightly. There is now an additional check in the loop to see if the owner of the mutex has changed. If the ownership of a mutex changes while a thread is spinning, waiting for it, that means that it was released and somebody else grabbed it first. In other words, there is heavy contention and multiple CPUs are spinning in a race that only one of them can win. In such cases, it makes sense to just go to sleep and wait until things calm down a bit.
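The decision described above can be sketched in user space. What follows is only an illustration with C11 atomics, not the kernel's code: struct toy_task, struct toy_mutex, toy_spin_on_owner(), and the max_spins bound are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct toy_task {
    atomic_bool on_cpu;                 /* is this thread currently running? */
};

struct toy_mutex {
    _Atomic(struct toy_task *) owner;   /* NULL when the mutex is free */
};

/*
 * Spin while "owner" holds the mutex and keeps running.
 * Returns true if the caller should now try to take the mutex,
 * false if it should give up and go to sleep.
 * max_spins only bounds the loop so that the sketch always terminates.
 */
static bool toy_spin_on_owner(struct toy_mutex *m, struct toy_task *owner,
                              long max_spins)
{
    assert(m != NULL && owner != NULL);

    for (long i = 0; i < max_spins; i++) {
        struct toy_task *cur = atomic_load(&m->owner);

        if (cur == NULL)
            return true;    /* released and still free: try to grab it */
        if (cur != owner)
            return false;   /* somebody else won the race: stop spinning */
        if (!atomic_load(&owner->on_cpu))
            return false;   /* holder is no longer running: sleep */
    }
    return false;           /* spin budget used up */
}
```

The second test in the loop is the new one: a changed owner is taken as a sign of heavy contention, so the spinner gives up instead of racing again.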

Various benchmark results showed significant performance improvements in heavily-contended situations. That was enough to get the patch merged for 2.6.36-rc2.

When memory allocation failure is not an option

By Jonathan Corbet
August 25, 2010
One of the darker corners of the kernel's memory management subsystem is the __GFP_NOFAIL flag. That flag says that an allocation request cannot be allowed to fail regardless of whether memory is actually available or not; if the request cannot be satisfied, the allocator will loop continuously in the hope that, somehow, something will eventually change. Needless to say, kernel developers are not encouraged to use this option. More recently, David Rientjes has been trying to get rid of it by pushing the (possibly) infinite looping back to callers.

Andrew Morton was not convinced by the patch:

The reason for adding GFP_NOFAIL in the first place was my observation that the kernel contained lots of open-coded retry-for-ever loops.

All of these are wrong, bad, buggy and mustfix. So we consolidated the wrongbadbuggymustfix concept into the core MM so that miscreants could be easily identified and hopefully fixed.

David's response is that said miscreants have not been fixed over the course of many years, and that __GFP_NOFAIL imposes complexity on the page allocator which slows things down for all users. Andrew came back with a suggestion for special versions of the allocation functions which would perform the looping; that would move the implementation out of the core allocator, but still make it possible to search for code needing to be fixed. David obliged with a patch adding kmalloc_nofail() and friends.
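The idea of a helper that does the looping outside of the core allocator can be illustrated with a small user-space sketch. This is not the code from David's patch; toy_alloc_nofail(), toy_alloc_fn, and the deliberately flaky test allocator are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator signature, so that a failing one can be injected. */
typedef void *(*toy_alloc_fn)(size_t size);

/*
 * Retry until the allocation succeeds. The (possibly infinite) loop
 * lives in one easily-searched helper instead of in the allocator.
 * If "retries" is non-NULL it is incremented once per failed attempt.
 */
static void *toy_alloc_nofail(toy_alloc_fn alloc, size_t size,
                              unsigned long *retries)
{
    void *p;

    assert(alloc != NULL);
    while ((p = alloc(size)) == NULL) {
        if (retries != NULL)
            (*retries)++;
        /* A real implementation would wait for memory to be freed here. */
    }
    return p;
}

/* Test allocator: fails toy_fail_budget times, then defers to malloc(). */
static int toy_fail_budget;

static void *toy_flaky_alloc(size_t size)
{
    if (toy_fail_budget > 0) {
        toy_fail_budget--;
        return NULL;
    }
    return malloc(size);
}
```

With toy_fail_budget set to 3, a call to toy_alloc_nofail(toy_flaky_alloc, 16, &retries) returns a valid pointer after three failed attempts; the caller never sees the failures, which is both the appeal and the danger of the approach.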

This kind of patch is guaranteed to bring out comments from those who feel that it is far better to just fix code which is not prepared to deal with memory allocation failures. But, as Ted Ts'o pointed out, that is not always an easy thing to do:

So we can mark the retry loop helper function as deprecated, and that will make some of these cases go away, but ultimately if we're going to fail the memory allocation, something bad is going to happen, and the only question is whether we want to have something bad happen by looping in the memory allocator, or to force the file system to panic/oops the system, or have random application die and/or lose user data because they don't expect write() to return ENOMEM.

Ted's point is that there are always going to be places where recovery from a memory allocation failure is quite hard, if it's possible at all. So the kernel can provide some means by which looping on failure can be done centrally, or see it done in various ad hoc ways in random places in the kernel. Bad code is not improved by being swept under the rug, so it seems likely that some sort of central loop-on-failure mechanism will continue to exist indefinitely.

Kernel development news

VFS scalability patches in 2.6.36

By Jonathan Corbet
August 24, 2010
It is rare for Linus to talk about what he plans to merge in a given development cycle before the merge window opens; it seems that he prefers to see what the pull requests look like and make his decisions afterward. He made an exception in the 2.6.35 announcement, though:

On a slightly happier note: one thing I do hope we can merge in the upcoming merge window is Nick Piggin's cool VFS scalability series. I've been using it on my own machine, and gone through all the commits (not that I shouldn't go through some of them some more), and am personally really excited about it. It's seldom we see major performance improvements in core code that are quite that noticeable, and Nick's whole RCU pathname lookup in particular just tickles me pink.

It's a rare developer who, upon having tickled the Big Penguin to that particular shade, will hold off on merging his changes. But Nick asked that the patches sit out for one more cycle, perhaps out of the entirely rational fear of bugs which might irritate users to a rather deeper shade. So Linus will have to wait a bit for his RCU pathname lookup code. That said, some parts of the VFS scalability code did make it into the mainline for 2.6.36-rc2.

Like most latter-day scalability work, the VFS work is focused on increasing locality and eliminating situations where CPUs must share resources. Given that a filesystem is an inherently global structure, increasing locality can be a challenging task; as a result, parts of Nick's patch set are on the complex and tricky side. But, in the end, it comes down to dealing with things locally whenever possible, but making global action possible when the need arises.

The first step is the introduction of two new lock types, the first of which is called a "local/global lock" (lglock). An lglock is intended to provide very fast access to per-CPU data while making it possible (at a rather higher cost) to get at another CPU's data. An lglock is created with:

    #include <linux/lglock.h>

    DEFINE_LGLOCK(name);

The DEFINE_LGLOCK() macro is a 99-line wonder which creates the necessary data structure and accessor functions. By design, lglocks can only be defined at the file global level; they are not intended to be embedded within data structures.

Another set of macros is used for working with the lock:

    lg_local_lock(name);
    lg_local_unlock(name);
    lg_local_lock_cpu(name, int cpu);
    lg_local_unlock_cpu(name, int cpu);

Underneath it all, an lglock is really just a per-CPU array of spinlocks. So a call to lg_local_lock() will acquire the current CPU's spinlock, while lg_local_lock_cpu() will acquire the lock belonging to the specified cpu. Acquiring an lglock also disables preemption, which would not otherwise happen in realtime kernels. As long as almost all locking is local, it will be very fast; the lock will not bounce between CPUs and will not be contended. Both of those assumptions go away, of course, if the cross-CPU version is used.

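Sometimes it is necessary to globally lock the lglock; lg_global_lock() and lg_global_lock_online(), along with their matching unlock calls, exist for that purpose.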

A call to lg_global_lock() will go through the entire array, acquiring the spinlock for every CPU. Needless to say, this will be a very expensive operation; if it happens with any frequency at all, an lglock is probably the wrong primitive to use. The _online version only acquires locks for CPUs which are currently running, while lg_global_lock() acquires locks for all possible CPUs.
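To make the "per-CPU array of spinlocks" idea concrete, here is a minimal user-space sketch built on C11 atomic_flag. It is not the kernel implementation: the toy_lg_* names and the fixed TOY_NR_CPUS are assumptions, and the CPU index is passed explicitly since user space cannot disable preemption or rely on a "current CPU."

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4   /* assumption: fixed CPU count for the sketch */

/* One spinlock per CPU; a stand-in for what DEFINE_LGLOCK() sets up. */
struct toy_lglock {
    atomic_flag lock[TOY_NR_CPUS];
};

static void toy_lg_init(struct toy_lglock *lg)
{
    for (int i = 0; i < TOY_NR_CPUS; i++)
        atomic_flag_clear(&lg->lock[i]);
}

/* Non-blocking attempt on one CPU's lock; returns true on success. */
static bool toy_lg_trylock_cpu(struct toy_lglock *lg, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    return !atomic_flag_test_and_set_explicit(&lg->lock[cpu],
                                              memory_order_acquire);
}

/* Cheap case: only the lock belonging to "cpu" is touched. */
static void toy_lg_local_lock_cpu(struct toy_lglock *lg, int cpu)
{
    while (!toy_lg_trylock_cpu(lg, cpu))
        ;   /* spin */
}

static void toy_lg_local_unlock_cpu(struct toy_lglock *lg, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    atomic_flag_clear_explicit(&lg->lock[cpu], memory_order_release);
}

/* Expensive case: walk the whole array, always in the same order. */
static void toy_lg_global_lock(struct toy_lglock *lg)
{
    for (int i = 0; i < TOY_NR_CPUS; i++)
        toy_lg_local_lock_cpu(lg, i);
}

static void toy_lg_global_unlock(struct toy_lglock *lg)
{
    for (int i = 0; i < TOY_NR_CPUS; i++)
        toy_lg_local_unlock_cpu(lg, i);
}
```

The cost asymmetry is visible in the code: the local operations touch a single array slot, while the global ones touch every slot and will wait behind any CPU that happens to hold its own lock.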

The VFS scalability patch set also brings back the "big reader lock" concept. The idea behind a brlock is to make locking for read access as fast as possible, while making write locking possible. The brlock API (also defined in <linux/lglock.h>) consists of br_read_lock() and br_write_lock(), along with the corresponding unlock functions.

As it happens, this version of brlocks is implemented entirely with lglocks; br_read_lock() maps directly to lg_local_lock(), and br_write_lock() turns into lg_global_lock().
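The mapping from reader/writer semantics onto per-CPU locks can be sketched in user space as follows. The toy_br_* names, the explicit cpu argument, and the trylock variants (added so that the behavior can be checked from a single thread) are all assumptions of this example rather than part of the kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4   /* assumption: fixed CPU count for the sketch */

struct toy_brlock {
    atomic_flag lock[TOY_NR_CPUS];
};

static void toy_br_init(struct toy_brlock *br)
{
    for (int i = 0; i < TOY_NR_CPUS; i++)
        atomic_flag_clear(&br->lock[i]);
}

/* A reader takes only its own CPU's lock, so readers on different
 * CPUs never touch the same memory. Returns true on success. */
static bool toy_br_read_trylock(struct toy_brlock *br, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    return !atomic_flag_test_and_set(&br->lock[cpu]);
}

static void toy_br_read_unlock(struct toy_brlock *br, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    atomic_flag_clear(&br->lock[cpu]);
}

/* A writer needs every CPU's lock; back out if any one is held. */
static bool toy_br_write_trylock(struct toy_brlock *br)
{
    for (int i = 0; i < TOY_NR_CPUS; i++) {
        if (atomic_flag_test_and_set(&br->lock[i])) {
            while (--i >= 0)
                atomic_flag_clear(&br->lock[i]);
            return false;
        }
    }
    return true;
}

static void toy_br_write_unlock(struct toy_brlock *br)
{
    for (int i = 0; i < TOY_NR_CPUS; i++)
        atomic_flag_clear(&br->lock[i]);
}
```

Two readers on different CPUs both succeed at once, while a writer is excluded for as long as any single reader remains; that is the trade the brlock makes.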

The first use of lglocks is to protect the list of open files which is attached to each superblock structure. This list is currently protected by the global files_lock, which becomes a bottleneck when a lot of open() and close() calls are being made. In 2.6.36, the list of open files becomes a per-CPU array, with each CPU managing its own list. When a file is opened, a (cheap) call to lg_local_lock() suffices to protect the local list while the new file is added.

When a file is closed, things are just a bit more complicated. There is no guarantee that the file will be on the local CPU's list, so the VFS must be prepared to reach across to another CPU's list to clean things up. That, of course, is what lg_local_lock_cpu() is for. Cross-CPU locking will be more expensive than local locking, but (1) it only involves one other CPU, and (2) in situations where there is a lot of opening and closing of files, chances are that the process working with any specific file will not migrate between CPUs during the (presumably short) time that the file is open.

The real reason that the per-superblock open files list exists is to let the kernel check for writable files when a filesystem is being remounted read-only. That operation requires exclusive access to the entire list, so lg_global_lock() is used. The global lock is costly, but read-only remounts are not a common occurrence, so nobody is likely to notice.
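The three access patterns for the open files list (local add, cross-CPU removal, and global scan) might look like the following in a user-space sketch. Everything here is hypothetical: struct toy_sb and struct toy_file are stand-ins, and storing the owning CPU in each file is an assumption of the sketch about how the right list is found at close time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4   /* assumption: fixed CPU count for the sketch */

struct toy_file {
    struct toy_file *next;
    int cpu;            /* which per-CPU list this file is on */
    bool writable;
};

struct toy_sb {
    atomic_flag lock[TOY_NR_CPUS];          /* one lock per list */
    struct toy_file *files[TOY_NR_CPUS];    /* one list per CPU */
};

static void toy_lock(struct toy_sb *sb, int cpu)
{
    while (atomic_flag_test_and_set(&sb->lock[cpu]))
        ;   /* spin */
}

static void toy_unlock(struct toy_sb *sb, int cpu)
{
    atomic_flag_clear(&sb->lock[cpu]);
}

static void toy_sb_init(struct toy_sb *sb)
{
    for (int i = 0; i < TOY_NR_CPUS; i++) {
        atomic_flag_clear(&sb->lock[i]);
        sb->files[i] = NULL;
    }
}

/* Open: add to the list of the CPU we are running on (local lock). */
static void toy_file_add(struct toy_sb *sb, struct toy_file *f, int this_cpu)
{
    assert(this_cpu >= 0 && this_cpu < TOY_NR_CPUS);
    toy_lock(sb, this_cpu);
    f->cpu = this_cpu;
    f->next = sb->files[this_cpu];
    sb->files[this_cpu] = f;
    toy_unlock(sb, this_cpu);
}

/* Close: the file may be on another CPU's list, so lock that one. */
static void toy_file_del(struct toy_sb *sb, struct toy_file *f)
{
    int cpu = f->cpu;

    toy_lock(sb, cpu);
    for (struct toy_file **pp = &sb->files[cpu]; *pp; pp = &(*pp)->next) {
        if (*pp == f) {
            *pp = f->next;
            break;
        }
    }
    toy_unlock(sb, cpu);
}

/* Read-only remount check: take every lock, then scan every list. */
static bool toy_sb_has_writers(struct toy_sb *sb)
{
    bool found = false;

    for (int i = 0; i < TOY_NR_CPUS; i++)
        toy_lock(sb, i);
    for (int i = 0; i < TOY_NR_CPUS; i++)
        for (struct toy_file *f = sb->files[i]; f; f = f->next)
            if (f->writable)
                found = true;
    for (int i = 0; i < TOY_NR_CPUS; i++)
        toy_unlock(sb, i);
    return found;
}
```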

Also for 2.6.36, Nick changed the global vfsmount_lock into a brlock. This lock protects the tree of mounted filesystems; it must be acquired (in a read-only mode) whenever a pathname lookup crosses from one mount point to the next. Write access is only needed when filesystems are mounted or unmounted - again, an uncommon occurrence on most systems. Nick warns that this change is unlikely to speed up most workloads now - indeed, it may slow some down slightly - but its value will become clearer when some of the other bottlenecks are taken care of.

Aside from a few smaller changes, that is where VFS scalability work stops for the 2.6.36 development cycle. The more complicated work - dealing with dcache_lock in particular - will go through a few more months of testing before it is pushed toward the mainline. Then, perhaps, we'll see Linus in a proper shade of pink.

An API for user-space access to kernel cryptography

By Jake Edge
August 25, 2010

Adding an interface for user space to be able to access the kernel crypto subsystem—along with any hardware acceleration available—seems like a reasonable idea at first blush. But adding a huge chunk of formerly user-space code to the kernel to implement additional cryptographic algorithms, including public key cryptosystems, is likely to be difficult to sell. Coupling that with an ioctl()-based API, with pointers and variable length data, raises the barrier further still. Still, there are some good arguments for providing some kind of user-space interface to the crypto subsystem, even if the current proposal doesn't pass muster.

Miloslav Trmač posted an RFC patchset that implements the /dev/crypto user-space interface. The code is derived from cryptodev-linux, but the new implementation was largely developed by Nikos Mavrogiannopoulos. The patchset is rather large, mostly because of the inclusion of two user-space libraries for handling multi-precision integers (LibTomMath) and additional cryptographic algorithms (LibTomCrypt); some 20,000 lines of code in all. That is the current implementation, though there is mention of switching to something based on Libgcrypt, which is believed to be more scrutinized as well as more actively maintained, but is not particularly small either.

One of the key benefits of the new API is that keys can be handled completely within the kernel, allowing user space to do whatever encryption or decryption it needs without ever exposing the key to the application. That means that application vulnerabilities would be unable to expose any keys. The keys can also be wrapped by the kernel so that the application can receive an encrypted blob that it can store persistently to be loaded back into the kernel after a reboot.

Ted Ts'o questioned the whole idea behind the interface, specifically whether hardware acceleration would really speed things up:

more often than not, by the time you take into account the time to move the crypto context as well as the data into kernel space and back out, and after you take into account price/performance, most hardware crypto [accelerators] have marginal performance benefits; in fact, more often than not, it's a lose.

He was also concerned that the key handling was redundant: "If the goal is access to hardware-escrowed keys, don't we have the TPM [Trusted Platform Module] interface for that already?" But Mavrogiannopoulos noted that embedded systems are one target for this work, "where the hardware version of AES might be 100 times faster than the software". He also said that the TPM interface was not flexible enough and that one goal of the new API is that "it can be wrapped by a PKCS #11 [Public-Key Cryptography Standard for cryptographic tokens like keys] module and used transparently by other crypto libraries (openssl/nss/gnutls)", which the TPM interface is unable to support.

There is already support in the kernel for key management, so Kyle Moffett would like to see that used: "We already have one very nice key/keyring API in the kernel (see Documentation/keys.txt) that's being used for crypto keys for NFSv4, AFS, etc. Can't you just add a bunch of cryptoapi key types to that API instead?" Mavrogiannopoulos thinks that because the keyring API allows exporting keys to user space—something that the /dev/crypto API explicitly prevents—it would be inappropriate. Keyring developer David Howells suggests an easy way around that particular problem: "Don't provide a read() key type operation, then".

But the interface itself also drew complaints. To use /dev/crypto, an application needs to open() the device, then start issuing ioctl() calls. Each ioctl() operation (which are named NCRIO_*) has its own structure type that gets passed as the data parameter to ioctl():

    res = ioctl(fd, NCRIO_..., &data);

Many of the structures contain pointers for user data (input and output), which are declared as void pointers. That necessitates using the compat_ioctl to handle 32 vs. 64-bit pointer issues, which Arnd Bergmann disagrees with: "New drivers should be written to *avoid* compat_ioctl calls, using only very simple fixed-length data structures as ioctl commands." He doesn't think that pointers should be used in the interface at all if possible: "Ideally, you would use ioctl to control the device while you use read and write to pass actual bits of data".
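The 32 vs. 64-bit problem comes from the fact that a structure containing pointers changes size and layout with the pointer width. The sketch below contrasts such a structure with a fixed-length one that carries addresses in 64-bit integer fields, a common convention for ioctl structures. Both layouts are hypothetical and are not taken from the patch set; the second also still passes addresses, which goes less far than Bergmann's read()/write() suggestion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pointer-bearing layout: sizeof() and the field offsets
 * differ between 32-bit and 64-bit user space, so a 64-bit kernel
 * needs a translation layer for 32-bit callers. */
struct toy_op_ptr {
    uint32_t op;
    void *input;
    void *output;
    uint32_t len;
};

/* Hypothetical fixed-length layout: explicit widths, 64-bit fields on
 * 8-byte offsets, no implicit padding. Identical on both ABIs. */
struct toy_op_fixed {
    uint32_t op;
    uint32_t len;
    uint64_t input;     /* user address carried as an integer */
    uint64_t output;    /* user address carried as an integer */
};

static_assert(sizeof(struct toy_op_fixed) == 24,
              "layout must not depend on pointer size");
static_assert(offsetof(struct toy_op_fixed, input) == 8,
              "no hidden padding before the 64-bit fields");

/* Widen a user pointer for storage in a fixed-width field. */
static uint64_t toy_ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```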

Beyond that, the interface also mixes in netlink-style variable length attributes to support things like algorithm choice, initialization vector, key type (secret, private, public), key wrapping algorithm, and many additional attributes that are algorithm-specific like key length or RSA and DSA-specific values. For most, but not all, of the operations, these can be appended to the end of the operation-specific structure as an array of (struct nlattr, attribute data) pairs, using the same formatting as netlink messages. It is, in short, a complex interface that is reasonably well-documented in the first patch of the series.
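Netlink attributes are type-length-value records: a four-byte header holding a 16-bit length (header plus payload) and a 16-bit type, followed by the payload, with each record padded to a four-byte boundary. A rough user-space sketch of appending such records to a buffer follows; struct toy_attr mirrors the shape of struct nlattr, but toy_attr_put() and the other toy_ names are invented here and are not part of the proposed API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ALIGNTO ((size_t)4)
#define TOY_ALIGN(n) (((n) + TOY_ALIGNTO - 1) & ~(TOY_ALIGNTO - 1))

/* Same shape as a netlink attribute header: length, then type. */
struct toy_attr {
    uint16_t len;   /* header + payload, excluding padding */
    uint16_t type;
};

/*
 * Append one attribute at offset "off" in a buffer of "cap" bytes.
 * Returns the offset for the next attribute, or 0 if it does not fit
 * (a successful call always returns at least 4).
 */
static size_t toy_attr_put(unsigned char *buf, size_t cap, size_t off,
                           uint16_t type, const void *data, size_t data_len)
{
    struct toy_attr hdr;
    size_t total = sizeof(hdr) + data_len;
    size_t padded = TOY_ALIGN(total);

    assert(buf != NULL);
    if (total > UINT16_MAX || off > cap || padded > cap - off)
        return 0;

    hdr.len = (uint16_t)total;
    hdr.type = type;
    memcpy(buf + off, &hdr, sizeof(hdr));
    if (data_len > 0)
        memcpy(buf + off + sizeof(hdr), data, data_len);
    memset(buf + off + total, 0, padded - total);   /* zero the padding */
    return off + padded;
}
```

A five-byte payload, for example, yields a length field of 9 and occupies 12 bytes in the buffer; the receiving side has to perform the same alignment arithmetic to find the next attribute.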

Bergmann and others are also concerned about the inclusion of all of the extra code:

However, the more [significant] problem is the amount of code added to a security module. 20000 lines of code that is essentially a user-level library moved into kernel space can open up so many possible holes that you end up with a less secure (and slower) setup in the end than just doing everything in user space.

Mavrogiannopoulos thinks that the "benefits outweigh the risks" of adding the extra code, likening it to the existing encryption and compression facilities in the kernel. The difference, as Bergmann points out, is that the kernel actually uses those facilities itself, so they must be in the kernel. The additional code being added here is strictly to support user space.

In the patchset introduction, Trmač lists a number of arguments for adding more algorithms to the kernel and providing a user-space API, most of which boil down to various government specifications that require a separation between the crypto provider and user. The intent is to keep the key material separate from the (presumably more vulnerable) user-space programs, but there are other ways to do that, including having a root daemon that offers the needed functionality, as noted in the introduction. There is a worry that the overhead of doing it that way would be too high: "this would be slow due to context switches, scheduler mismatching and all the IPC overhead". However, no numbers have yet been offered to show how much overhead is added.

There are a number of interesting capabilities embodied in the API, in particular for handling keys. A master AES key can be set for the subsystem by a suitably privileged program which will then be used to encrypt and wrap keys before they are handed off to user space. None of the key handling is persistent across reboots, so user space will have to store any keys that get generated for it. Using the master key allows that, without giving user space access to anything other than an encrypted blob.

All of the expected operations are available through the interface: encrypt, decrypt, sign, and verify. Each is accessible from a session that gets initiated by an NCRIO_SESSION_INIT ioctl(), followed by zero or more NCRIO_SESSION_UPDATE calls, and ending with an NCRIO_SESSION_FINAL. For one-shot operations, there is also an NCRIO_SESSION_ONCE call that handles all three of those operations in one call.
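The init/update/final pattern, with a one-shot wrapper, is a familiar shape for streaming crypto interfaces. The sketch below shows only that shape, using the non-cryptographic 32-bit FNV-1a hash as a stand-in for a real algorithm; the toy_session_* functions are hypothetical and have nothing to do with the actual NCRIO_* ioctl() calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical session state; a real one would hold keys and
 * algorithm context rather than a 32-bit hash value. */
struct toy_session {
    uint32_t state;
};

/* "INIT": set up the running state (FNV-1a offset basis). */
static void toy_session_init(struct toy_session *s)
{
    assert(s != NULL);
    s->state = 2166136261u;
}

/* "UPDATE": feed in another chunk; may be called any number of times. */
static void toy_session_update(struct toy_session *s,
                               const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        s->state ^= p[i];
        s->state *= 16777619u;      /* FNV-1a prime */
    }
}

/* "FINAL": produce the result. */
static uint32_t toy_session_final(const struct toy_session *s)
{
    assert(s != NULL);
    return s->state;
}

/* "ONCE": all three steps in a single call. */
static uint32_t toy_session_once(const void *data, size_t len)
{
    struct toy_session s;

    toy_session_init(&s);
    toy_session_update(&s, data, len);
    return toy_session_final(&s);
}
```

Feeding "hello " and then "world" through two update calls gives the same result as a single one-shot call on "hello world"; that equivalence is what lets callers choose between streaming and one-shot use.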

While it seems to be a well thought-out interface, with room for expansion to handle unforeseen algorithms with different requirements, it's also very complex. Other than the separation of keys and faster encryption for embedded devices, it doesn't offer that much for desktop or server users, and it adds an immense amount of code and the associated maintenance burden. In its current form, it's hard to see /dev/crypto making its way into the mainline, but some of the ideas it implements might—particularly if they are better integrated with existing kernel facilities like the keyring.

Statistics and tracepoints

By Jonathan Corbet
August 24, 2010
One thing that kernels do is collect statistics. If one wishes to know how many multicast packets have been received, page faults have been incurred, disk reads have been performed, or interrupts have been received, the kernel has the answer. This role is not normally questioned, but, recently, there have been occasional suggestions that the handling of statistics should be changed somewhat. The result is a changing view of how information should be extracted from the kernel - and some interesting ABI questions.

Back in July, Gleb Natapov submitted a patch changing the way paging is handled in KVM-virtualized guests. Included in the patch was the collection of a couple of new statistics on page faults handled in each virtual CPU. More than one month later (virtualization does make things slower), Avi Kivity reviewed the patch; one of his suggestions was:

Please don't add more stats, instead add tracepoints which can be converted to stats by userspace.

Nobody questioned this particular bit of advice. Perhaps that's because virtualization seems boring to a lot of developers. But it is also indicative of a wider trend.

That trend is, of course, the migration of much kernel data collection and processing to the "perf events" subsystem. It has only been one year since perf showed up in a released kernel, but it has seen sustained development and growth since then. Some developers have been known to suggest that, eventually, the core kernel will be an obscure bit of code that must be kept around in order to make perf run.

Moving statistics collection to tracepoints brings some obvious advantages. If nobody is paying attention to the statistics, no data is collected and the overhead is nearly zero. When individual events can be captured, their correlation with other events can be investigated, timing can be analyzed, associated data can be captured, etc. So it makes some sense to export the actual events instead of boiling them down to a small set of numbers.

The down side of using tracepoints to replace counters is that it is no longer possible to query statistics maintained over the lifetime of the system. As Matt Mackall observed over a year ago:

Tracing is great for looking at changes, but it completely falls down for static system-wide measurements because it would require integrating from time=0 to get a meaningful summation. That's completely useless for taking a measurement on a system that already has an uptime of months.

Most often, your editor would surmise, administrators and developers are looking for changes in counters and do not need to integrate from time=0. There are times, though, when that information can be useful to have. One could come close by enabling the tracepoints of interest during the bootstrap process and continuously collecting the events, but that can be expensive, especially for high-frequency events.
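The difference between a lifetime counter and a statistic derived from trace events can be shown with a small user-space sketch. The event types, struct toy_event, and toy_count_events() are all invented for illustration; real tools would read events from the perf or ftrace interfaces instead of an in-memory array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types standing in for kernel tracepoints. */
enum toy_event_type { TOY_EV_PAGE_FAULT, TOY_EV_DISK_READ, TOY_EV_NR };

struct toy_event {
    uint64_t timestamp;
    enum toy_event_type type;
};

/*
 * Turn a stream of events into per-type counters, the way a
 * user-space tool might. Only events at or after "since" (the moment
 * tracing was enabled) are visible; anything earlier is lost.
 */
static void toy_count_events(const struct toy_event *ev, size_t n,
                             uint64_t since, uint64_t counts[TOY_EV_NR])
{
    assert(ev != NULL || n == 0);

    for (int t = 0; t < TOY_EV_NR; t++)
        counts[t] = 0;
    for (size_t i = 0; i < n; i++) {
        if (ev[i].timestamp < since)
            continue;               /* happened before we were listening */
        if (ev[i].type >= 0 && ev[i].type < TOY_EV_NR)
            counts[ev[i].type]++;
    }
}
```

With since set to zero the counts match what an in-kernel counter would report; with any later starting point the tool sees only the change since tracing began, which is the limitation Matt describes.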

There is another important issue which has been raised in the past and which has never really been resolved. Tracepoints are generally seen as debugging aids used mainly by kernel developers. They are often tied into low-level kernel implementation details; changes to the code can often force changes to nearby tracepoints, or make them entirely obsolete. Tracepoints, in other words, are likely to be nearly as volatile as the kernel that they are instrumenting. The kernel changes rapidly, so it stands to reason that the tracepoints would change rapidly as well.

Needless to say, changing tracepoints will create problems for any user-space utilities which make use of those tracepoints. Thus far, kernel developers have not encouraged widespread use of tracepoints; the kernel still does not have that many of them, and, as noted above, they are mainly debugging tools. If tracepoints are made into a replacement for kernel statistics, though, then the number of user-space tools using tracepoints can only increase. And that will lead to resistance to patches which change those tracepoints and break the tools.

In other words, tracepoints are becoming part of the user-space ABI. Despite the fact that concerns about the ABI status of tracepoints have been raised in the past, this change seems to be coming in through the back door with no real planning. As Linus has pointed out in the past, the fact that nobody has designated tracepoints as part of the official ABI or documented them does not really change things. Once an interface has been exposed to user space and come into wider use, it's part of the ABI regardless of the developers' intentions. If user-space tools use tracepoints, kernel developers will have to support those tracepoints indefinitely into the future.

Past discussions have included suggestions for ways to mark tracepoints which are intended to be stable, but no conclusions have resulted. So the situation remains murky. It may well be that things will stay that way until some future kernel change breaks somebody's tools. Then the kernel community will be forced to choose between restoring compatibility for the broken tracepoints or overtly changing its longstanding promise not to break the user-space ABI (too often). It might be better to figure things out before they get to that point.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds