Kernel development
Brief items
Kernel release status
The 2.6.37 merge window is open as of this writing, so there is no current development kernel prepatch. The merge window can be expected to close right around the end of the month. See the article below for a summary of activity in this merge window so far.

Stable updates: there have been no stable updates in the last week. The 2.6.27.55, 2.6.32.25, and 2.6.35.8 updates are currently in the review process and may be released at any time.
Linus warns of a short merge window
Linus has sent out a notice that the 2.6.37 merge window will indeed be shorter than usual; it will probably conclude on October 30 or 31, just in time for the 2010 Kernel Summit. "And so far, in the five days since the 2.6.36 release, we've merged 5500+ commits. That has turned my "maybe we can do a shorter merge window" into a 'we can definitely do a shorter merge window'. Because we already have enough changes, and there's almost a week to go - so I think we're well on track for doing that."
Clang builds a working 2.6.36 Kernel
Bryce Lelbach has announced that he has managed to build and boot a (mostly) working kernel using the LLVM-based Clang compiler. It seems that there are a lot of problems remaining, though, and he had to use a couple of GCC-compiled pieces to get the system to boot. "SELinux, Posix ACLs, IPSec, eCrypt, anything that uses the crypto API - None of these will compile, due to either an ICE or variable-length arrays in structures (don't remember which, it's in my notes somewhere). If it's variable-length arrays or another intentionally unsupported GNUtension, I'm hoping it's just used in some isolated implementation detail (or details), and not a fundamental part of the crypto API (honestly just haven't had a chance to dive into the crypto source yet)."
Running work in hardware interrupt context
As a general rule, kernel developers work to avoid running code in hardware interrupt context; there is a whole array of mechanisms by which interrupt-driven work can be deferred to less pressing times. Apparently, however, there is an occasional need to run arbitrary code in hardware interrupt context - even when no hardware is conveniently signaling interrupts at the time. To enable the running of code in hardware interrupt context, a new API has been added to 2.6.37.

The first step is to fill in an irq_work structure:
#include <linux/irq_work.h>
struct irq_work my_work;
init_irq_work(struct irq_work *entry, void (*func)(struct irq_work *));
There is then a fairly familiar pair of functions for running the work indicated by this structure:
bool irq_work_queue(struct irq_work *entry);
void irq_work_sync(struct irq_work *entry);
The intended area of use is apparently code running from non-maskable interrupts which needs to be able to interact with the rest of the system. One should assume that just about any other use of this feature is likely to be scrutinized closely.
Jump label
The kernel is filled with tests whose results almost never change. A classic example is tracepoints, which will be disabled on running systems with only very rare exceptions. There has long been interest in optimizing the tests done in such places; with 2.6.37, the "jump label" feature will make those tests go away entirely.

Consider the definition of a typical tracepoint, which, behind all of the preprocessor madness, looks something like:
static inline void trace_foo(args)
{
	if (unlikely(trace_foo_enabled))
		goto do_trace;
	return;
do_trace:
	/* Actually do tracing stuff */
}
The cost of a test for a single tracepoint is essentially zero. The number of tracepoints in the kernel is growing, though, and each one adds a new test. Each test must fetch a value from memory, adding to the pressure on the cache and hurting performance. Given that the value almost never changes, it would be nice to find a way to optimize the "tracepoint disabled" case.
In 2.6.37, this tracepoint can be rewritten using a new macro:
#include <linux/jump_label.h>
#define JUMP_LABEL(key, label) \
if (unlikely(*key)) \
goto label;
The nice thing is that JUMP_LABEL() does not have to be implemented like that. It can, instead, (1) note the location of the test and the key value in a special table, and (2) simply insert a no-op instruction. That reduces the cost of the test (and the tracepoint) to zero for the common "not enabled" case. Most of the time, the tracepoint will never be enabled and the omitted test will never be missed.
The tricky part happens when somebody wants to enable the tracepoint. Changing its status now requires calling one of a pair of special functions:
void enable_jump_label(void *key);
void disable_jump_label(void *key);
A call to enable_jump_label() will look up the key in the jump label table, then replace the special no-op instructions with the assembly equivalent of "goto label", enabling the tracepoint. Disabling the jump label will cause the no-op instruction to be restored.
The end result is a significant reduction in the overhead of disabled tracepoints. This feature only works on architectures which support it (x86 only, at the moment) and only with relatively recent versions of GCC; otherwise the preprocessor version is used.
Kernel development news
2.6.37 merge window, part 1
The 2.6.36 kernel was released on October 20, and the 2.6.37 merge window duly started shortly thereafter. As of this writing, some 6450 changes have been merged for the next development cycle, with more surely to come. Some of the more significant, user-visible changes merged for 2.6.37 include:
- The first parts of the inode scalability patch set have been merged,
but, as of this writing, the core locking changes have not yet been
pushed for inclusion. See this
article for more information on the inode scalability work.
- The x86 architecture now uses separate stacks for interrupt handling
when 8K stacks are in use. The option to use 4K stacks has been
removed.
- The big kernel lock removal process continues; the core kernel is
almost entirely BKL-free. There is now a configuration option which
may be used to build a kernel without the BKL. File locking still
requires the BKL, though; schemes are afoot to fix it before the
close of the merge window, but this work is not yet complete. If file
locking can be cleaned up, it will be possible for many (or most)
users to run a BKL-free 2.6.37 kernel.
- The "rados block device" has been added. RBD allows the creation
of a special block device which is backed by objects stored in the
Ceph distributed system.
- The GFS2 cluster filesystem is no longer marked "experimental." GFS2
has also gained support for the fallocate() system call.
- A new sysfs file, /sys/selinux/status, allows a user-space
application to quickly notice when security policies have changed.
The intended use is evidently daemons which cache the results of
access-control decisions and need to know when those results might
change. A separate file, called policy, has been added for
those simply wanting to read the current policy from the kernel.
- The scheduler now works harder to avoid migrating high-priority
realtime tasks. The
scheduler also will no longer charge processor time used to handle
interrupts to the process which happened to be running at the time.
- VMware's VMI paravirtualization support has been deprecated
by the company and, as scheduled, removed from the 2.6.37 kernel.
- Some hibernation improvements have been merged, including the ability
to compress the hibernation image with LZO.
- The ARM architecture has gained support for the seccomp (secure computing)
feature.
- The block layer can now throttle I/O bandwidth to specific devices,
controlled by the cgroup mechanism. This is the second piece of the
I/O bandwidth controller puzzle which allows the establishment of
specific bandwidth limits which will be enforced even if more I/O
bandwidth is available.
- The new "ttyprintk" device allows suitably-privileged user space to
feed messages through the kernel by way of a pseudo TTY device.
- The kernel has gained support for the point-to-point tunneling
protocol (PPTP); see the
accel-pptp project page for more information.
- The NFS client has a new "idmapper" implementation for the translation
between user and group names and IDs. The new code is more flexible and
performs better; see Documentation/filesystems/nfs/idmapper.txt for
details.
- There is a new -olocal_lock= mount option for the NFS client
which can cause it to treat either (or both) of flock() and
POSIX locks as local.
- Most of the functions of the nfsservctl() system call have
been deprecated and marked for removal in 2.6.40. There is a new
configuration option for those who would like to remove this
functionality ahead of time.
- Simple support for the pNFS protocol has been merged.
- Huge pages can now be migrated between nodes like normal memory pages.
- There is the usual pile of new drivers:
- Systems and processors: Flexibility Connect boards,
Telechips TCC ARM926-based systems,
Telechips TCC8000-SDK development kits,
Vista Silicon Visstrim_m10 i.MX27-based boards,
LaCie d2 Network v2 NAS boards,
Qualcomm MSM8x60 RUMI3 emulators,
Qualcomm MSM8x60 SURF eval boards,
Eukrea CPUIMX51SD modules,
Freescale MPC8308 P1M boards,
APM APM821xx evaluation boards,
Ito SH-2007 reference boards,
IBM "SMI-free" realtime BIOS's,
MityDSP-L138 and MityDSP-1808 systems,
OMAP3 Logic 3530 LV SOM boards,
OMAP3 IGEP modules, and
taskit Stamp9G20 CPU modules.
- Block: Chelsio T4 iSCSI offload engines.
- Input: Roccat Pyra gaming mice,
UC-Logic WP4030U, WP5540U and WP8060U tablets,
several varieties of Waltop tablets,
OMAP4 keyboard controllers,
NXP Semiconductor LPC32XX touchscreen controllers,
Hanwang Art Master III tablets,
ST-Ericsson Nomadik SKE keyboards,
ROHM BU21013 touch panel controllers, and
TI TNETV107X touchscreens.
- Miscellaneous: Freescale eSPI controllers,
Topcliff platform controller hub devices,
OMAP AES crypto accelerators,
NXP PCA9541 I2C master selectors,
Intel Clarksboro memory controller hubs,
OMAP 2-4 onboard serial ports,
GPIO-controlled fans,
Linear Technology LTC4261 Negative Voltage Hot Swap Controller
I2C interfaces,
TI BQ20Z75 gas gauge ICs,
OMAP TWL4030 BCI chargers,
ROHM BH1770GLC and OSRAM SFH7770 combined ALS and proximity sensors,
Avago APDS990X combined ALS and proximity sensors,
Intersil ISL29020 ambient light sensors, and
Medfield Avago APDS9802 ALS sensor modules.
- Network: Brocade 1010/1020 10Gb Ethernet cards,
Conexant CX82310 USB ethernet ports,
Atheros AR9170 "otus" 802.11n USB devices, and
Topcliff PCH Gigabit Ethernet controllers.
- Sound: Marvell 88pm860x codecs,
TI WL1273 FM radio codecs,
HP iPAQ RX1950 audio devices,
Native Instruments Traktor Kontrol S4 audio devices,
Aztech Sound Galaxy AZT1605 and AZT2316 ISA sound cards,
Wolfson Micro WM8985 and WM8962 codecs,
Wolfson Micro WM8804 S/PDIF transceivers,
Samsung S/PDIF controllers, and
Cirrus Logic EP93xx AC97 controllers.
- USB: Intel Langwell USB OTG transceivers, YUREX "leg shake" sensors, and USB-attached SCSI devices.
- The old ieee1394 stack has been removed, replaced at last by the "firewire" drivers.
Changes visible to kernel developers include:
- The jump label
optimization mechanism has been merged; its initial purpose is to
reduce the overhead of inactive tracepoints.
- Yet another RCU variant has been added: "tiny preempt RCU" is meant
for uniprocessor systems. "This implementation uses but a single
blocked-tasks list rather than the combinatorial number used per leaf
rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and
greatly simplifies processing. This version also takes advantage of
uniprocessor execution to accelerate grace periods in the case where
there are no readers."
- New tracepoints have been added in the network device layer, places
where sk_buff structures are freed, softirq_raise(), workqueue
operations, and memory management LRU list shrinking operations.
There is also a new script for using perf to analyze network device
events.
- The wakeup latency tracer now has function graph support.
- There is a new mechanism for running
arbitrary code in hardware interrupt context.
- The power management layer now has a formal concept of "wakeup
sources" which can bring the system out of a sleep state. Among other
things, it can collect statistics to help the user determine what is
keeping a system awake. Wakeup events can abort the freezing of
tasks, reducing the time required to recover from an aborted suspend
or hibernate operation.
- A new mechanism for managing the automatic suspending of idle devices
has been added.
- There is a new set of functions for managing the "operating
performance points" of system-on-chip components. (commit).
- A long list of changes to the memblock (formerly LMB) low-level
management code has been merged, and the x86 architecture now uses
memblock for its early memory management.
- The default handling for lseek() has changed: if a driver
does not provide its own llseek() function, the VFS layer
will cause all attempts to change the file position to fail with an
ESPIPE error. All in-tree drivers which lacked
llseek() functions have been changed to use
noop_llseek(), which preserves the previous behavior.
- There is a new way to create workqueues:

    struct workqueue_struct *alloc_ordered_workqueue(const char *name, unsigned int flags);

Items submitted to the resulting workqueue will be run in order, one at a time. It's meant to eventually replace the old single-threaded workqueues.
Also added is:

    bool flush_work_sync(struct work_struct *work);

This function will wait until a specific work item has completed.
- The ALSA ASoC API has been significantly extended to support sound
cards with multiple codecs and DMA controllers. (commit).
- The stack-based
kmap_atomic() patch has been merged, with an associated
API change. See the new Documentation/vm/highmem.txt file for
details.
- There are two new memory allocation helpers:
    void *vzalloc(unsigned long size);
    void *vzalloc_node(unsigned long size, int node);

Both behave like the equivalent vmalloc() calls, but they also zero the allocated memory.
- Most of the work needed to remove the concept of hard barriers from the block layer has been merged. This task will probably be completed before the closing of the merge window.
Linus has let it be known that he expects this merge window to be shorter than usual so that it can be closed before the 2010 Kernel Summit begins on November 1. Expect patches to be merged at a high rate until the end of October; an update next week will cover the changes merged in the last part of the 2.6.37 merge window.
Resolving the inode scalability discussion
Nick Piggin's VFS scalability patch set has been under development for well over a year. Linus was ready to pull this work during the 2.6.36 merge window, but Nick asked for more time for things to settle out; as a result, only some of the simpler parts were merged then. Last week, we mentioned that some developers became concerned when it started to become clear that the remaining work would not be ready for 2.6.37 either. Out of that concern came a competing version of the patch set (by Dave Chinner) and a big fight. This discussion was of the relatively deep and intimidating variety, but your editor, never afraid to make a total fool of himself, will attempt to clarify the core disagreements and a possible path forward anyway.

The global inode_lock is used within the virtual filesystem layer (VFS) to protect several data structures and a wide variety of inode-oriented operations. As a global lock, it has become an increasingly annoying bottleneck as the number of CPUs and threads in systems increases; it clearly needs to be broken up in a way which makes it more scalable. Unfortunately, like a number of old locks in the VFS, the boundaries of what's protected by inode_lock are not always entirely clear, so any attempts to change locking in that area must be done with a great deal of caution. That is why improving inode locking scalability has been such a slow affair.
Getting rid of inode_lock requires putting some other locking in place for everything that inode_lock protects. Nick's patch set creates separate global locks for some of those resources: wb_inode_list_lock for the list of inodes under writeback, and inode_lru_lock for the list of inodes in the cache. The standalone inodes_stat statistics structure is converted over to atomic types. Then the existing i_lock per-inode spinlock is used to cover everything else in the inode structure; once that is done, inode_lock can be removed. The remainder of the patch set (more than half of the total) is then dedicated to reducing the coverage of i_lock, often by using read-copy-update (RCU) instead.
Before any of that, though, Nick's patch set changed the way the core memory management "shrinker" code works. Shrinkers are callbacks which can be invoked by the core when memory is tight; their job is then to reduce the amount of memory used by a specific data structure. The inode and dentry caches can take up quite a bit of memory, so they both have shrinkers which will free up (hopefully) unneeded cache entries when the memory is needed elsewhere. Nick changed the shrinker API to cause it to target specific memory zones; that allows the core to balance free memory across memory types and across NUMA nodes.
The per-zone shrinkers were one of the early flash points in this debate. Dave Chinner and others on the VFS side of the house worried that invoking shrinkers in such a fine-grained way would increase contention at the filesystem level and make it harder to shrink the caches in an efficient way. They also thought that this change was orthogonal to the core goal of eliminating the scalability problems caused by the global inode_lock. Nick fought hard for per-zone shrinkers, and he clearly believes that they are necessary, but he has also dropped them from his patch set for now in an attempt to push things forward.
The next disagreement has to do with the coverage of i_lock; Dave Chinner's alternative patch set avoids using i_lock to cover most of the inode structure. Instead, Dave introduces other locks from the outset, reaching a point where he has relatively fine-grained lock coverage by the time inode_lock is removed at the end of his series. Compared to this approach, Nick's patches have been criticized as being messy and not as scalable.
Nick's response is that the "width" of i_lock is a detail which can be resolved later. His intent was to do the minimal amount of work required to allow the removal of inode_lock, without going straight for the ultimate scalable solution. The goal was to be able to ensure that the locking remains correct by changing as little as possible before the removal of the global lock; that way, hopefully, there are fewer chances of breaking things. Beyond that, any bugs which do slip through before the patch removing inode_lock will almost certainly not reveal themselves until after that removal. That means that anybody trying to use bisection to find a bug will end up at the inode_lock removal patch instead of the real culprit. Thus, minimizing the number of changes before that removal should make debugging easier.
That is why Nick removes inode_lock before the middle of his patch series, while Dave's series does that removal near the end. Both patch sets include a number of the same changes - putting per-bucket locks onto the inode hash table, for example - but Nick does it after removing inode_lock, while Dave does it before. There are also differences, with Nick heading deep into RCU territory while Dave avoids using RCU. Both developers claim to be aiming for similar end results, they just take different roads to get there.
[PULL QUOTE: One of the hardest problems in the VFS is ensuring that all locks are taken in the proper order so that the system will not deadlock. END QUOTE] Finally, there is also a deep disagreement over the locking of the inode cache itself. In current kernels, the cache data structure (the LRU and writeback lists, essentially) is covered by inode_lock with the rest. Both patch sets create separate locks for the LRU and for writeback. The problem is with lock ordering; one of the hardest problems in the VFS is ensuring that all locks are taken in the proper order so that the system will not deadlock. Nick's patches require the VFS to acquire i_lock for the inode(s) of interest prior to acquiring the writeback or LRU locks; Dave, instead, wants i_lock to be the innermost lock.
The problem is that it is not always possible to acquire the locks in the specified order. Code which is working through the LRU list, for example, must have that list locked; if it then decides to operate on an inode found in the LRU list, it must lock the inode. But that violates Nick's locking order. To make things work correctly, Nick uses spin_trylock() in such situations to avoid hanging. Uses of spin_trylock() tend to attract scrutiny, and that is the case here; Dave has described the code as "a large mess of trylock operations" which he has gone out of his way to avoid. Nick responds that the code is not that bad, and that Dave's approach brings locking complexities of its own.
This is about where Al Viro jumped in, calling both approaches wrong. Al would like to see the writeback locks taken prior to i_lock (because code tends to work from the list first, prior to attacking individual inodes), but he says the LRU lock should be taken after i_lock because code changing the LRU status of an inode will normally already have that inode's lock. According to Al, Nick is overly concerned with the management of the various inode lists and, as a result, "overengineering" the code. After some discussion, Dave eventually agreed with something close to Al's view and acknowledged that Nick's placement of the LRU lock below i_lock was correct, eliminating that point of contention.
Al has also described the way he would like things to proceed; this is a good thing. When it comes to VFS locking, few are willing to challenge his point of view; that means that he can probably bring about a resolution to this particular dispute. He wants a patch series which starts with the split of the writeback and LRU lists, then proceeds by pulling things out from under inode_lock one at a time. He is apparently pulling together a tree based on both Nick's and Dave's work, but with things done in the order he likes. The end result will probably be credited to Nick, who figured out how to solve a long list of difficult problems around inode_lock, but it will differ significantly from what he initially proposed.
What is not at all clear, though, is how much of this will come together for the 2.6.37 merge window. Al has a long history of last-second pull requests full of hairy changes; Linus tends to let him get away with it. But this would be very last minute, and the changes are deep, so, while Al has pushed some of the initial changes, the core locking work may not be ready in time for 2.6.37. Either way, once inode scalability has been taken care of, discussion can begin on the removal of dcache_lock, which is a rather more complex problem than inode_lock; that should be interesting to watch.
Linux at NASDAQ OMX
One tends to think of "the NASDAQ" as a single exchange based in the US, but, in fact, NASDAQ OMX operates exchanges all over the world - and they run on Linux. In the US, for instance, that includes markets like the NASDAQ Stock Market, The NASDAQ Options Market, and NASDAQ OMX PSX, its newest market, which launched on October 8. At a brief presentation at the Linux Foundation's invitation-only End User Summit in Jersey City, NASDAQ OMX vice president Bob Evans talked about the ups and downs of using Linux in a seriously mission-critical environment.

NASDAQ OMX's exchanges run on thousands of Linux-based servers. These servers handle realtime transaction processing, monitoring, and development as well. The big challenge in this environment, of course, is performance; real money depends on whether the exchange can keep up with the order stream. Latency matters as much as throughput, though; orders must be responded to (and executed) within a bounded period of time. Needless to say, reliability is also crucially important; down time is not well received, to say the least.
To meet these requirements, NASDAQ OMX runs large clusters of thousands of machines. These clusters can process hundreds of millions of orders per day - up to one million orders per second - with 250µs latency.
According to Bob, Linux has incorporated some useful technologies in recent years. The NAPI interrupt mitigation technique for network drivers has, on its own, freed up about 1/3 of the available CPU time for other work. The epoll system call cuts out much of the per-call overhead, taking 33µs off of the latency in one benchmark. Handling clock_gettime() in user space via the VDSO page cuts almost another 60ns. Bob was also quite pleased with how the Linux page cache works; it is effective enough, he says, to eliminate the need to use asynchronous I/O, simplifying the code considerably.
On the other hand, there are some things which have not worked out as well for them. These include I/O signals; they are complex to program with and, if things get busy, the signal queue can overflow. The user-space libaio asynchronous I/O (AIO) implementation is thread-based; it scales poorly, he says, and does not integrate well with epoll. Kernel-based asynchronous I/O, instead, lacks proper socket support. He also mentioned the recvmsg() system call, which requires a call into the kernel for every incoming packet.
There is some new stuff coming along which shows some promise. The new recvmmsg() system call can receive multiple packets with a single call. For now, though, it is just a wrapper around the internal recvmsg() implementation and does not hold the socket lock across the entire operation. But, he said, recvmmsg() is a good example of how the ability to add new APIs to Linux is a good thing. He also likes the combination of kernel-based AIO and the eventfd() system call; that makes it possible to integrate file-based AIO into an application's normal event-processing loop. There is also some potential in syslets, which he sees as a way of delivering cheap notifications to user space; it's not clear whether syslets will scale usefully, though.
What NASDAQ OMX would really like to see in Linux now is good socket-based AIO. That would make it possible to replace epoll/recvmsg/sendmsg sequences with fewer system calls. Even better would be if the kernel could provide notifications for multiple events at a time. Best would be if the interface to this functionality were completely based on sockets. He described a vision of an "epoll-like kernel object" which would handle in-kernel network traffic processing. The application could post asynchronous send and receive requests to the queue, and receive notifications when they have been executed. He would like to see multiple sockets attached to a single object, and a file descriptor suitable for passing to poll() for notifications. With a setup like that, it should be possible to push more network traffic through the kernel with lower latencies.
In summary, NASDAQ OMX seems to be happy with its use of Linux. They also seem to like to go with current software - the exchange is currently rolling out 2.6.35.3 kernels. "Emerging APIs" are helping operations like NASDAQ OMX realize real-world performance gains in areas that matter. Linux, Bob says, is one of the few systems that are willing to introduce new APIs just for performance reasons. That is an interesting point of view to contrast with Linus Torvalds's often-stated claim that nobody uses Linux-specific APIs; it seems that there are users, they just tend to be relatively well hidden.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Miscellaneous
Page editor: Jonathan Corbet
