The 2.6.37 merge window is open
as of this writing, so there is no
current development kernel prepatch. The merge window can be expected to
close right around the end of the month. See the article below for a
summary of activity in this merge window so far.
Stable updates: there have been no stable updates in the last week.
Three stable updates are currently in the review
process and may be released at any time.
Comments (none posted)
All we really need to do is get someone a case of the beverage of
their choice and turn them loose on the problem. I think that the
few anti-stacking holdouts (I was one, but converted a couple years
ago) can be swayed by a reasonable implementation. It won't be
easy, there are plenty of problems that need to be solved, but
anyone who wants easy should stick to developing web portals and
stay out of the kernel.
-- Casey Schaufler
This option enables support for Zalgo in kernel
messages. Zalgo is a corruption. The arrival of Zalgo has
been foretold. Zalgo will not... wait.
-- Matthew Garrett
People do tend to prefer to do localised expedient things rather
than sticking their necks out and implementing proper, generic
kernel-wide functions. If I see it happen, I'll tell them.
Usually I don't see it until months after it's merged.
-- Andrew Morton
Comments (3 posted)
Linus has sent out a notice that the 2.6.37 merge window will indeed be
shorter than usual; it will probably conclude on October 30 or 31, just in
time for the 2010 Kernel Summit. "And so far, in the five days since the 2.6.36 release, we've merged
5500+ commits. That has turned my 'maybe we can do a shorter merge
window' into a 'we can definitely do a shorter merge window'. Because
we already have enough changes, and there's almost a week to go - so I
think we're well on track for doing that."
Full Story (comments: none)
Bryce Lelbach has announced that he has managed to build and boot a
(mostly) working kernel using the LLVM-based Clang compiler. It seems that
there are a lot of
problems remaining, though, and he had to use a couple of GCC-compiled
pieces to get the system to boot. "SELinux, Posix ACLs, IPSec,
eCrypt, anything that uses the crypto API - None of these will compile, due
to either an ICE or variable-length arrays in structures (don't remember
which, it's in my notes somewhere). If it's variable-length arrays or
another intentionally unsupported GNUtension, I'm hoping it's just used in
some isolated implementation detail (or details), and not a fundamental
part of the crypto API (honestly just haven't had a chance to dive into the
crypto source yet)."
Full Story (comments: 69)
As a general rule, kernel developers work to avoid running code in hardware
interrupt context; there is a whole array of mechanisms by which
interrupt-driven work can be deferred to less pressing times. Apparently,
however, there is an occasional need to run arbitrary code in hardware
interrupt context - even when no hardware is conveniently signaling an
interrupt at the time. To enable the running of code in hardware
interrupt context, a new API has been added to 2.6.37.
The first step is to fill in an irq_work structure:
    struct irq_work my_work;

    void init_irq_work(struct irq_work *entry, void (*func)(struct irq_work *));
There is then a fairly familiar pair of functions for running the work
indicated by this structure:
bool irq_work_queue(struct irq_work *entry);
void irq_work_sync(struct irq_work *entry);
The intended area of use is apparently code running from non-maskable
interrupts which needs to be able to interact with the rest of the system.
One should assume that just about any other use of this feature is likely
to be scrutinized closely.
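As a sketch of how this API might be used - the handler and function names here are invented for illustration - an NMI handler could defer its interaction with the rest of the system like this:

```c
#include <linux/irq_work.h>

static struct irq_work my_work;

/* Runs in hard interrupt context once the queued work is raised. */
static void my_work_func(struct irq_work *work)
{
	/* Safe to interact with (most of) the rest of the system here. */
}

static int __init my_setup(void)
{
	init_irq_work(&my_work, my_work_func);
	return 0;
}

/* Called from NMI context, where almost nothing else is safe: */
static void my_nmi_handler(void)
{
	irq_work_queue(&my_work);	/* returns false if already queued */
}
```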
Comments (2 posted)
The kernel is filled with tests whose results almost never change. A
classic example is tracepoints, which will be disabled on running systems
with only very rare exceptions. There has long been interest in optimizing
the tests done in such places; with 2.6.37, the "jump label" feature
will make those tests go away entirely.
Consider the definition of a typical tracepoint, which, behind all of the
preprocessor madness, looks something like:
    static inline void trace_foo(args)
    {
        if (unlikely(trace_foo_enabled))
            /* Actually do tracing stuff */
    }
The cost of a test for a single tracepoint is essentially zero. The number
of tracepoints in the kernel is growing, though, and each one adds a new
test. Each test must fetch a value from memory, adding to the pressure on
the cache and hurting performance. Given that the value almost never changes, it
would be nice to find a way to optimize the "tracepoint disabled" case.
In 2.6.37, this tracepoint can be rewritten using a new macro:
    #define JUMP_LABEL(key, label)		\
	if (unlikely(*key))			\
	    goto label;
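With this macro, the tracepoint sketched above can be recast along these lines (the key name is illustrative):

```c
static inline void trace_foo(args)
{
	JUMP_LABEL(&trace_foo_enabled, do_trace);
	return;
do_trace:
	/* Actually do tracing stuff */
}
```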
The nice thing is that JUMP_LABEL() does not have to be
implemented like that. It can, instead, (1) note the location of the
test and the key value in a special table, and (2) simply
insert a no-op instruction. That reduces the cost of the test (and the
tracepoint) to zero for the common "not enabled" case. Most of the time,
the tracepoint will never be enabled and the omitted test will never be
missed.
The tricky part happens when somebody wants to enable the tracepoint.
Changing its status now requires calling one of a pair of special
functions:
void enable_jump_label(void *key);
void disable_jump_label(void *key);
A call to enable_jump_label() will look up the key in the jump
label table, then replace the special no-op instructions with the assembly
equivalent of "goto label", enabling the tracepoint.
Disabling the jump label will cause the no-op instruction to be restored.
The end result is a significant reduction in the overhead of disabled
tracepoints. This feature only works on architectures which support it
(x86 only, at the moment) and only with relatively recent versions of GCC;
otherwise, the ordinary conditional-test version is used.
Comments (16 posted)
Kernel development news
The 2.6.36 kernel was released on October 20, and the 2.6.37 merge window
duly started shortly thereafter. As of this writing, some 6450
changes have been merged for the next development cycle, with more surely
to come. Some of the more significant, user-visible changes merged so far
include:
- The first parts of the inode scalability patch set have been merged,
but, as of this writing, the core locking changes have not yet been
pushed for inclusion. See this
article for more information on the inode scalability work.
- The x86 architecture now uses separate stacks for interrupt handling
when 8K stacks are in use. The option to use 4K stacks has been
removed.
- The big kernel lock removal process continues; the core kernel is
almost entirely BKL-free. There is now a configuration option which
may be used to build a kernel without the BKL. File locking still
requires the BKL, though; schemes are afoot to fix it before the
close of the merge window, but this work is not yet complete. If file
locking can be cleaned up, it will be possible for many (or most)
users to run a BKL-free 2.6.37 kernel.
- The "rados block device" has been added. RBD allows the creation
of a special block device which is backed by objects stored in the
Ceph distributed system.
- The GFS2 cluster filesystem is no longer marked "experimental." GFS2
has also gained support for the fallocate() system call.
- A new sysfs file, /sys/selinux/status, allows a user-space
application to quickly notice when security policies have changed.
The intended use is evidently daemons which cache the results of
access-control decisions and need to know when those results might
change. A separate file, called policy, has been added for
those simply wanting to read the current policy from the kernel.
- The scheduler now works harder to avoid migrating high-priority
realtime tasks. The
scheduler also will no longer charge processor time used to handle
interrupts to the process which happened to be running at the time.
- VMware's VMI paravirtualization support has been deprecated
by the company and, as scheduled, removed from the 2.6.37 kernel.
- Some hibernation improvements have been merged, including the ability
to compress the hibernation image with LZO.
- The ARM architecture has gained support for the seccomp (secure computing)
mechanism.
- The block layer can now throttle I/O bandwidth to specific devices,
controlled by the cgroup mechanism. This is the second piece of the
I/O bandwidth controller puzzle which allows the establishment of
specific bandwidth limits which will be enforced even if more I/O
bandwidth is available.
- The new "ttyprintk" device allows suitably-privileged user space to
feed messages through the kernel by way of a pseudo TTY device.
- The kernel has gained support for the point-to-point tunneling
protocol (PPTP); see the
accel-pptp project page for more information.
- The NFS
client has a new "idmapper" implementation for the translation
between user and group names and IDs. The new code is more flexible
and performs better; see Documentation/filesystems/nfs/idmapper.txt
for details.
- There is a new -olocal_lock= mount option for the NFS client
which can cause it to treat either (or both) of flock() and
POSIX locks as local.
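As a hedged illustration of this option - the server name and paths are invented - a mount command using it might look like:

```
# Treat flock() locks as local to this client:
mount -t nfs -o local_lock=flock server:/export /mnt

# Treat both flock() and POSIX locks as local:
mount -t nfs -o local_lock=all server:/export /mnt
```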
- Most of the functions of the nfsservctl() system call have
been deprecated and marked for removal in 2.6.40. There is a new
configuration option for those who would like to remove this
functionality ahead of time.
- Simple support for the pNFS protocol has been merged.
- Huge pages can now be migrated between nodes like normal memory pages.
- There is the usual pile of new drivers:
- Systems and processors: Flexibility Connect boards,
Telechips TCC ARM926-based systems,
Telechips TCC8000-SDK development kits,
Vista Silicon Visstrim_m10 i.MX27-based boards,
LaCie d2 Network v2 NAS boards,
Qualcomm MSM8x60 RUMI3 emulators,
Qualcomm MSM8x60 SURF eval boards,
Eukrea CPUIMX51SD modules,
Freescale MPC8308 P1M boards,
APM APM821xx evaluation boards,
Ito SH-2007 reference boards,
IBM "SMI-free" realtime BIOSes,
MityDSP-L138 and MityDSP-1808 systems,
OMAP3 Logic 3530 LV SOM boards,
OMAP3 IGEP modules, and
taskit Stamp9G20 CPU modules.
- Block: Chelsio T4 iSCSI offload engines.
- Input: Roccat Pyra gaming mice,
UC-Logic WP4030U, WP5540U and WP8060U tablets,
several varieties of Waltop tablets,
OMAP4 keyboard controllers,
NXP Semiconductor LPC32XX touchscreen controllers,
Hanwang Art Master III tablets,
ST-Ericsson Nomadik SKE keyboards,
ROHM BU21013 touch panel controllers, and
TI TNETV107X touchscreens.
- Miscellaneous: Freescale eSPI controllers,
Topcliff platform controller hub devices,
OMAP AES crypto accelerators,
NXP PCA9541 I2C master selectors,
Intel Clarksboro memory controller hubs,
OMAP 2-4 onboard serial ports,
Linear Technology LTC4261 negative voltage hot swap controllers,
TI BQ20Z75 gas gauge ICs,
OMAP TWL4030 BCI chargers,
ROHM BH1770GLC and OSRAM SFH7770 combined ALS and proximity sensors,
Avago APDS990X combined ALS and proximity sensors,
Intersil ISL29020 ambient light sensors, and
Medfield Avago APDS9802 ALS sensor modules.
- Network: Brocade 1010/1020 10Gb Ethernet cards,
Conexant CX82310 USB ethernet ports,
Atheros AR9170 "otus" 802.11n USB devices, and
Topcliff PCH Gigabit Ethernet controllers.
- Sound: Marvell 88pm860x codecs,
TI WL1273 FM radio codecs,
HP iPAQ RX1950 audio devices,
Native Instruments Traktor Kontrol S4 audio devices,
Aztech Sound Galaxy AZT1605 and AZT2316 ISA sound cards,
Wolfson Micro WM8985 and WM8962 codecs,
Wolfson Micro WM8804 S/PDIF transceivers,
Samsung S/PDIF controllers, and
Cirrus Logic EP93xx AC97 controllers.
- USB: Intel Langwell USB OTG transceivers,
YUREX "leg shake" sensors, and
USB-attached SCSI devices.
- The old ieee1394 stack has been removed, replaced at last by
the "firewire" drivers.
Changes visible to kernel developers include:
- The jump label
optimization mechanism has been merged; its initial purpose is to
reduce the overhead of inactive tracepoints.
- Yet another RCU variant has been added: "tiny preempt RCU" is meant
for uniprocessor systems. "This implementation uses but a
single blocked-tasks list rather than the combinatorial number used
per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory
consumption and greatly simplifies processing. This version also
takes advantage of uniprocessor execution to accelerate grace periods
in the case where there are no readers."
- New tracepoints have been added in the network device layer, places
where sk_buff structures are freed,
softirq_raise(), workqueue operations, and
memory management LRU list shrinking operations.
There is also a new script for using perf to analyze network device
activity.
- The wakeup latency tracer now has function graph support.
- There is a new mechanism for running
arbitrary code in hardware interrupt context.
- The power management layer now has a formal concept of "wakeup
sources" which can bring the system out of a sleep state. Among other
things, it can collect statistics to help the user determine what is
keeping a system awake. Wakeup events can abort the freezing of
tasks, reducing the time required to recover from an aborted suspend
or hibernate operation.
- A new mechanism for managing the automatic suspending of idle devices
has been added.
- There is a new set of functions for managing the "operating
performance points" of system-on-chip components. (commit).
- A long list of changes to the memblock (formerly LMB) low-level
management code has been merged, and the x86 architecture now uses
memblock for its early memory management.
- The default handling for lseek() has changed: if a driver
does not provide its own llseek() function, the VFS layer
will cause all attempts to change the file position to fail with an
ESPIPE error. All in-tree drivers which lacked
llseek() functions have been changed to use
noop_llseek(), which preserves the previous behavior.
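A driver wanting to preserve the old behavior explicitly can simply plug in the helper; the operations structure and callbacks here are invented for illustration:

```c
static const struct file_operations my_fops = {
	.owner	= THIS_MODULE,
	.read	= my_read,
	.write	= my_write,
	/* Explicitly keep the pre-2.6.37 "do nothing, succeed" lseek: */
	.llseek	= noop_llseek,
};
```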
- There is a new way to create workqueues:
struct workqueue_struct *alloc_ordered_workqueue(const char *name,
unsigned int flags);
Items submitted to the resulting workqueue will be run in order, one
at a time. It's meant to eventually replace the old single-threaded
workqueue interface.
Also added is:
bool flush_work_sync(struct work_struct *work);
This function will wait until a specific work item has completed.
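A brief sketch of how these two functions might be used together; the workqueue name and work function are invented for the example, and cleanup is omitted for brevity:

```c
static struct workqueue_struct *my_wq;
static struct work_struct my_work_item;

static void my_work_fn(struct work_struct *work)
{
	/* ... do the deferred work ... */
}

static int __init my_setup(void)
{
	my_wq = alloc_ordered_workqueue("my_wq", 0);
	if (!my_wq)
		return -ENOMEM;
	INIT_WORK(&my_work_item, my_work_fn);
	queue_work(my_wq, &my_work_item); /* items run one at a time, in order */
	flush_work_sync(&my_work_item);	  /* wait for this item to complete */
	return 0;
}
```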
- The ALSA ASoC API has been significantly extended to support sound
cards with multiple codecs and DMA controllers. (commit).
- The stack-based
kmap_atomic() patch has been merged, with an associated
API change. See the new Documentation/vm/highmem.txt file for
details.
- There are two new memory allocation helpers:
void *vzalloc(unsigned long size);
void *vzalloc_node(unsigned long size, int node);
Both behave like the equivalent vmalloc() calls, but they
also zero the allocated memory.
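In other words, vzalloc() behaves roughly like the following open-coded sequence (a sketch, not the actual implementation):

```c
void *vzalloc_equivalent(unsigned long size)
{
	void *p = vmalloc(size);

	if (p)
		memset(p, 0, size);
	return p;
}
```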
- Most of the work needed to remove the concept of hard
barriers from the block layer has been merged. This task will
probably be completed before the closing of the merge window.
Linus has let it be known that he expects this merge window to be shorter
than usual so that it can be closed before the 2010 Kernel Summit begins on
November 1. Expect patches to be merged at a high rate until the end
of October; an update next week will cover the changes merged in the last
part of the 2.6.37 merge window.
Comments (13 posted)
Nick Piggin's VFS scalability patch set
has been under development for well over a year. Linus was
ready to pull this work during the 2.6.36 merge window, but Nick asked for
more time for things to settle out; as a result, only some of the simpler
parts were merged then. Last week, we mentioned
that some developers
became concerned when it started to become clear that the remaining work
would not be ready for 2.6.37 either. Out of that concern came a competing
version of the patch set (by Dave Chinner) and a big fight. This
discussion was of the relatively deep and intimidating variety, but your
editor, never afraid to make a total fool of himself, will attempt to
clarify the core disagreements and a possible path forward anyway.
The global inode_lock is used within the virtual filesystem layer
(VFS) to protect several data structures and a wide variety of
inode-oriented operations. As a global lock,
it has become an increasingly annoying bottleneck as the number of CPUs and
threads in systems increases; it clearly needs to be broken up in a way
which makes it more scalable. Unfortunately, like a number of old locks in
the VFS, the boundaries of what's protected by inode_lock are not
always entirely clear, so any attempts to change locking in that area must
be done with a great deal of caution. That is why improving inode locking
scalability has been such a slow affair.
Getting rid of inode_lock requires putting some other locking in
place for everything that inode_lock protects. Nick's patch set
creates separate global locks for some of those resources:
wb_inode_list_lock for the list of inodes under writeback, and
inode_lru_lock for the list of inodes in the cache. The standalone
inodes_stat statistics structure is converted over to atomic
types. Then the existing i_lock per-inode spinlock is used to
cover everything else in the inode structure; once that is done,
inode_lock can be removed. The remainder of the patch set (more
than half of the total) is then dedicated to reducing the coverage of
i_lock, often by using read-copy-update (RCU) instead.
Before any of that, though, Nick's patch set changed the way the core
memory management "shrinker" code works. Shrinkers are callbacks which can
be invoked by the core when memory is tight; their job is then to reduce
the amount of memory used by a specific data structure. The inode and
dentry caches can take up quite a bit of memory, so they both have
shrinkers which will free up (hopefully) unneeded cache entries when the
memory is needed elsewhere. Nick changed the shrinker API to cause it to
target specific memory zones; that allows the core to balance free memory
across memory types and across NUMA nodes.
The per-zone shrinkers were one of the early flash points in this debate.
Dave Chinner and others on the VFS side of the house worried that invoking
shrinkers in such a fine-grained way would increase contention at the
filesystem level and make it
harder to shrink the caches in an efficient way. They also thought that
this change was orthogonal to the core goal of eliminating the scalability
problems caused by the global inode_lock. Nick fought hard for
per-zone shrinkers, and he clearly believes that they are necessary, but he
has also dropped them from his patch set for now in an attempt to push
the rest of the series forward.
The next disagreement has to do with the coverage of i_lock; Dave
Chinner's alternative patch set avoids using i_lock to cover most
of the inode structure. Instead, Dave introduces other locks from
the outset, reaching a point where he has relatively fine-grained lock
coverage by the time inode_lock is removed at the end of his
series. Compared to this approach, Nick's patches have been criticized as
being messy and not as scalable.
Nick's response is that the "width" of i_lock is a detail which
can be resolved later. His
intent was to do the minimal amount of work required to allow the removal
of inode_lock, without going straight for the ultimate scalable
solution. The goal was to be able to ensure that the locking remains
correct by changing as little as possible before the removal of the global
lock; that way, hopefully, there are fewer chances of breaking things.
Beyond that, any bugs which do slip through before the patch removing
inode_lock will almost certainly not reveal themselves until after
that removal. That means that anybody trying to use bisection to find a
bug will end up at the inode_lock removal patch instead of the
real culprit. Thus, minimizing the number of changes before that removal
should make debugging easier.
That is why Nick removes inode_lock before the middle of his patch
series, while Dave's series does that removal near the end. Both patch
sets include a number of the same changes - putting per-bucket locks onto
the inode hash table, for example - but Nick does it after removing
inode_lock, while Dave does it before. There are other
differences as well, with Nick heading deep into RCU territory while Dave avoids
using RCU. Both developers claim to be aiming for similar end results,
they just take different roads to get there.
Finally, there is also a deep disagreement over the locking of the inode
cache itself. In current kernels, the cache data structure (the LRU and
writeback lists, essentially) is covered by inode_lock with the
rest. Both patch sets create separate locks for the LRU and for
writeback. The problem is with lock ordering; one of the hardest problems
in the VFS is ensuring that all locks are taken in the proper order so that
the system will not deadlock. Nick's patches require the VFS to acquire
i_lock for the inode(s) of interest prior to acquiring the
writeback or LRU locks; Dave, instead, wants i_lock to be the
innermost lock, taken after the list locks.
The problem is that it is not always possible to acquire the locks in the
specified order. Code which is working through the LRU list, for example, must
have that list locked; if it then decides to operate on an inode found in
the LRU list, it must lock the inode. But that violates Nick's locking
order. To make things work correctly, Nick uses spin_trylock() in
such situations to avoid hanging. Uses of spin_trylock() tend to
attract scrutiny, and that is the case here; Dave has described the code as "a
large mess of trylock operations" which he has gone out of his way
to avoid. Nick responds that the code is
not that bad, and that Dave's approach brings locking complexities of its
own.
This is about where Al Viro jumped in,
calling both approaches wrong. Al would like to see the writeback locks
taken prior to i_lock (because code tends to work from the list
first, prior to attacking individual inodes), but he says the LRU lock
should be taken after i_lock because code changing the LRU status
of an inode will normally already have that inode's lock. According to Al, Nick is overly concerned with
the management of the various inode lists and, as a result,
"overengineering" the code. After some discussion, Dave eventually agreed with something close to Al's view and
acknowledged that Nick's placement of the LRU lock below i_lock
was correct, eliminating that point of contention.
Al has also described the way he would like things
to proceed; this is a good thing. When it comes to VFS locking, few are
willing to challenge his point of view; that means that he can probably
bring about a resolution to this particular dispute. He wants a patch
series which starts with the split of the writeback and LRU lists, then
proceeds by pulling things out from under inode_lock one at a
time. He is apparently pulling together a tree based on both Nick's and
Dave's work, but with things done in the order he likes. The end result
will probably be credited to Nick, who figured out how to solve a long list
of difficult problems around inode_lock, but it will differ
significantly from what he initially proposed.
What is not at all clear, though, is how much of this will come together
for the 2.6.37 merge window. Al has a long history of last-second pull
requests full of hairy changes; Linus tends to let him get away with it.
But this would be very last minute, and the changes are deep, so, while Al
has pushed some of the initial changes, the core locking work may not be
ready in time for 2.6.37. Either way, once inode scalability has been
taken care of, discussion can begin
on the removal of dcache_lock, which is a rather more complex
problem than inode_lock; that should be interesting to watch.
Comments (none posted)
One tends to think of "the NASDAQ" as a single exchange based in the US,
but, in fact, NASDAQ OMX operates
exchanges all over the world - and they
run on Linux. In the US for instance, that includes markets like the
NASDAQ Stock Market, The NASDAQ Options Market, and NASDAQ OMX's
newest market, which launched on October 8. At a brief presentation at the
Linux Foundation's invitation-only End User Summit in Jersey City, NASDAQ
OMX vice president Bob Evans talked about the ups and downs of using Linux
in a seriously mission-critical environment.
NASDAQ OMX's exchanges run on thousands of Linux-based servers. These
servers handle realtime transaction processing, monitoring, and development
as well. The big challenge in this environment, of course, is performance;
real money depends on whether the exchange can keep up with the order
stream. Latency matters as much as throughput, though; orders must be
responded to (and executed) within a bounded period of time. Needless to say,
reliability is also crucially important; down time is not well received, to
say the least.
To meet these requirements, NASDAQ OMX runs large clusters of thousands of
machines. These clusters can process hundreds of millions of orders per day
- up to one million orders per second - with 250µs latency.
According to Bob, Linux has incorporated some useful technologies in recent
years. The NAPI interrupt mitigation technique for network drivers has, on
its own, freed up about 1/3 of the available CPU time for other work. The
epoll system call cuts out much of the per-call overhead, taking 33µs off
of the latency in one benchmark. Handling clock_gettime() in user space via
the VDSO page cuts almost another 60ns. Bob was also quite pleased with how
the Linux page cache works; it is effective enough, he says, to eliminate
the need to use asynchronous I/O, simplifying the code considerably.
On the other hand, there are some things which have not worked out as
well for them. These include I/O signals; they are complex to program with
and, if things get busy, the signal queue can overflow. The user-space
libaio asynchronous I/O (AIO) implementation is thread-based; it scales
poorly, he says, and does not integrate well with epoll. Kernel-based
asynchronous I/O, instead, lacks proper socket support. He also mentioned
the recvmsg() system call, which requires a call into the kernel for every
message received.
There is some new stuff coming along which shows some promise. The new
recvmmsg() system call can receive multiple packets with a single
call. For now, though, it is just a wrapper around the internal
recvmsg() implementation and does not hold the socket lock across
the entire operation. But, he said, recvmmsg() is a good example
of how the ability to add new APIs to Linux is a good thing. He also likes the
combination of kernel-based AIO and the eventfd() system call; that makes
it possible to integrate file-based AIO into an application's normal
event-processing loop. There is also some potential in syslets, which he
sees as a way of delivering cheap notifications to user space; it's not
clear whether syslets will scale usefully, though.
What NASDAQ OMX would really like to see in Linux now is good socket-based
AIO. That would make it possible to replace epoll/recvmsg/sendmsg sequences
with fewer system calls. Even better would be if the kernel could provide
notifications for multiple events at a time. Best would be if the interface
to this functionality were completely based on sockets. He described a
vision of an "epoll-like kernel object" which would handle in-kernel
network traffic processing. The application could post asynchronous send
and receive requests to the queue, and receive notifications when they have
been executed. He would like to see multiple sockets attached to a single
object, and a file descriptor suitable for passing to poll() for
notifications. With a setup like that, it should be possible to push more
network traffic through the kernel with lower latencies.
In summary, NASDAQ OMX seems to be happy with its use of Linux. They also
seem to like to go with current software - the exchange is currently
rolling out recent mainline kernels. "Emerging APIs" are helping operations like
NASDAQ OMX realize real-world performance gains in areas that
matter. Linux, Bob says, is one of the few systems that are willing to
introduce new APIs just for performance reasons. That is an interesting
point of view to contrast with Linus Torvalds's often-stated claim that
nobody uses Linux-specific APIs; it seems that there are users, they just
tend to be relatively well hidden.
Comments (80 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet
Next page: Distributions>>