
Kernel development

Brief items

Kernel release status

The current development kernel is 3.14-rc6, released on March 9. Linus says: "There haven't been any huge problems, but there's been quite a few small bumps that shouldn't happen this late in the release cycle. And rc6 is noticeably bigger than rc5 was, as well. So I'm really hoping that the upcoming week will be calmer, because otherwise I'll start thinking rc8 and even rc9."

Stable updates: 3.13.6 and 3.10.33 were released on March 6; 3.4.83 followed on March 11.

Comments (2 posted)

A discussion between database and kernel developers

On March 27, there will be a meeting between database and kernel developers at the Linux Foundation's Collaboration Summit; interested developers are invited to attend. "If there are developers attending Collaboration Summit that work in the database or kernel communities, it would be great if you could come along. Previous discussions were on the PostgreSQL list and that should be expanded in case we accidentally build postgres-only features. The intent is to identify the problems encountered by databases and where relevant, test cases that can be used to demonstrate them if they exist. While the kernel community may be aware of some of the problems, they are not always widely known or understood."

Full Story (comments: 56)

Kernel development news

3.14 development statistics

By Jonathan Corbet
March 12, 2014
Normally, by the time the -rc6 prepatch is released in a given development cycle, the final release is thought to be not too far away. This time around, Linus is making noises about needing to go as far as -rc8 or even -rc9 due to the number of outstanding problems. Even if we are still a few weeks from release, the situation with 3.14 will be stable enough that a look at where the changes in this release came from makes sense. The picture that emerges can be described as "business as usual," with the continuation of some longstanding trends.

As of this writing, the 3.14 development cycle has seen the addition of just over 12,000 non-merge changesets from exactly 1,400 developers. These changes added 591,000 lines of code while removing 250,000 lines, for a net growth of 341,000 lines. The most active developers this time around were:

Most active 3.14 developers
By changesets
  H Hartley Sweeten        278   2.3%
  Laurent Pinchart         232   1.9%
  Jingoo Han               174   1.4%
  Rashika Kheria           162   1.3%
  Geert Uytterhoeven       138   1.1%
  Tejun Heo                123   1.0%
  Sachin Kamat             122   1.0%
  Kuninori Morimoto        110   0.9%
  Sujith Manoharan          97   0.8%
  Linus Walleij             89   0.7%
  Wei Yongjun               86   0.7%
  Alex Deucher              82   0.7%
  Stephen Warren            81   0.7%
  Lars-Peter Clausen        79   0.7%
  Ville Syrjälä             78   0.6%
  Namhyung Kim              76   0.6%
  Thierry Reding            74   0.6%
  Johannes Berg             73   0.6%
  Christoph Hellwig         73   0.6%
  Ding Tianhong             71   0.6%
By changed lines
  Greg Kroah-Hartman     73035  10.5%
  Micky Ching            23657   3.4%
  Stephen Boyd           17511   2.5%
  Paul Gortmaker         17511   2.5%
  Greg Rose              16428   2.4%
  Tero Kristo            14509   2.1%
  Ben Skeggs             13320   1.9%
  Sergio Aguirre          8388   1.2%
  Ben Hutchings           8002   1.2%
  Tejun Heo               7975   1.2%
  Laurent Pinchart        7799   1.1%
  Frank Haverkamp         7424   1.1%
  Takashi Iwai            6247   0.9%
  Thomas Hellstrom        6148   0.9%
  Tom Lendacky            6103   0.9%
  Upinder Malhi           6012   0.9%
  Sujith Manoharan        5837   0.8%
  Peter De Schrijver      5680   0.8%
  H Hartley Sweeten       5586   0.8%
  Rob Clark               5345   0.8%

After a brief absence, H. Hartley Sweeten has returned to the top of the "by changesets" list with more work on the seemingly interminable project of fixing up the Comedi drivers in the staging tree. Laurent Pinchart did a lot of work in the Video4Linux and ARM architecture trees, while Jingoo Han continues to work on cleaning up the driver tree. Rashika Kheria, an Outreach Program for Women participant, contributed lots of driver cleanup patches, and Geert Uytterhoeven did a lot of work in the m68k architecture and embedded driver subsystems.

Greg Kroah-Hartman often appears near the top of the "lines changed" column; this time, it is as the result of the addition of the Realtek 8821 PCI WIFI driver and more than the usual number of reverted patches. Micky Ching added the rts5208 and rts5288 drivers to the staging tree, Stephen Boyd added a bunch of Qualcomm hardware support, Paul Gortmaker did (among other things) a bunch of header file cleanups, and Greg Rose contributed the Intel i40evf network driver.

At least 210 employers supported work on the 3.14 kernel. The most active of these employers were:

Most active 3.14 employers
By changesets
  Red Hat                           877   7.3%
  Renesas Electronics               339   2.8%
  Texas Instruments                 288   2.4%
  Vision Engraving Systems          278   2.3%
  FOSS Outreach Program for Women   237   2.0%
  Huawei Technologies               211   1.8%
By lines changed
  Linux Foundation          78675  11.4%
  Red Hat                   46371   6.7%
  Texas Instruments         28274   4.1%
  Realsil Microelectronics  23676   3.4%
  Wind River                19946   2.9%
  Code Aurora Forum         17878   2.6%
  Renesas Electronics        9655   1.4%

There are few surprises here; instead, this table shows the continuation of some trends we have been seeing for a while. After a short-lived jump in 3.13, the number of contributions from volunteers is back to its long-term decline. Red Hat seems to have taken a permanent place at the top of the list of changeset contributors. Contributions from the mobile and embedded industry continue to grow. It's tempting to call out the fact that 3.14 will contain a fix to the nouveau driver that came directly from NVIDIA, but this turns out to be the second time that has happened; the first fix from NVIDIA was quietly merged for 3.9 in early 2013.

A slightly different picture emerges if one looks at non-author signoffs — Signed-off-by tags applied to patches by developers other than the author. Such tags are applied by subsystem maintainers as they accept patches; the statistics can thus indicate who the gatekeepers to the kernel are. Associating signoffs with employers leads to these results:

Most signoffs in 3.14
By developer
  Greg Kroah-Hartman        1516  13.1%
  David S. Miller           1128   9.7%
  Mark Brown                 502   4.3%
  Andrew Morton              465   4.0%
  Mauro Carvalho Chehab      370   3.2%
  John W. Linville           352   3.0%
  Simon Horman               256   2.2%
  Arnaldo Carvalho de Melo   237   2.0%
  Daniel Vetter              225   1.9%
  Rafael J. Wysocki          173   1.5%
  Jeff Kirsher               166   1.4%
  Chris Mason                160   1.4%
  Linus Walleij              148   1.3%
  Ingo Molnar                146   1.3%
  Brian Norris               140   1.2%
  John Crispin               129   1.1%
  Jesse Brandeburg           127   1.1%
  Josef Bacik                126   1.1%
  Johannes Berg              125   1.1%
  Benjamin Herrenschmidt     121   1.0%
By employer
  Red Hat                   2336  20.2%
  Linux Foundation          1548  13.4%
  Renesas Electronics        265   2.3%
  Texas Instruments          200   1.7%
  Pure Storage                91   0.8%
  University of Cambridge     77   0.7%

The concentration of companies here is reduced from what it once was, but it is still true that more than half of the patches that were merged into 3.14 went by way of the employees of just four organizations. Linaro and Facebook are moving up the list; in both cases, this has mostly been done by hiring developers who were already working as subsystem maintainers. Here, too, the presence of volunteer developers has been slowly declining over time.

All told, the kernel development machine appears to be in a relatively steady state, with things running smoothly and changes happening over relatively long periods of time. Your editor is thus curious to know whether these reports remain useful on a per-release basis. Perhaps it makes sense to scale them back to, say, an annual exercise where the long-term trends might be more pronounced. This is a question that will be considered over the course of the 3.15 development cycle.

Comments (9 posted)

Unmixing the pool

By Jake Edge
March 12, 2014

One of the more useful outcomes from the Snowden revelations may well be the increased scrutiny of the security of our systems. While no one would (sensibly) claim that all of the problems have been found and fixed, there has been improvement in many different areas over the last year or so. One of the areas that has received some attention recently has been the kernel's random number generation. Two recent patches continue that trend, though it is hard to claim that they are the direct result of the exposure of NSA (and other secret agency) spying.

Both patches update the state of the "entropy pool" that is used to generate random numbers for both /dev/random and /dev/urandom. That pool is associated with a (conservative) estimate of the amount of entropy (i.e. state unknowable by an attacker) stored within it. Anything that is believed to truly add entropy gets credited in that estimate, while other, possibly even attacker-controlled, input is simply mixed into the pool without entropy credit. The entropy estimate is used to block /dev/random readers when the amount of data requested is larger than the amount of entropy in the pool.


The first patch is from H. Peter Anvin and it simply adds support for the RDSEED instruction to the kernel. RDSEED is an instruction being added to Intel processors that returns "fully conditioned entropy that is suitable for use as seeds to a PRNG [pseudo-random number generator]". The patches use four bytes of RDSEED output to mix into the entropy pool at boot time (with no entropy credit). In addition, four bytes are generated using the instruction once per second and then mixed into the pool. It is also used to do an "emergency refill" with 64 bytes of RDSEED output if /dev/random is about to block due to a lack of entropy credit to fulfill a request. In both of the latter cases, four bits of credit are given for each byte of RDSEED output that gets mixed into the pool.

Some may not be convinced that a black-box hardware random number generator (RNG) buried inside an Intel chip should be given that much (or any) entropy credit. It is a difficult question, as there is no technical barrier to the instruction returning known-to-the-NSA sequences and there is no way for anyone (at least anyone outside of Intel) to know for sure. While that may seem paranoid, many formerly paranoid scenarios have moved into the "plausible" category over the last year. That concern has not been raised about the RDSEED patches, however.

Mixing and unmixing

The other patch, from Kees Cook, would add some output from a newly instantiated hardware RNG into the entropy pool. When the RNG is registered (via hwrng_register()), sixteen bytes of its output would get mixed into the pool, but without any entropy credit. Jason Cooper was concerned that even mixing these bytes into the pool could lead to problems: "By adding this patch, even without crediting entropy to the pool, a rogue hwrng now has significantly more influence over the initial state of the entropy pools."

But Cook didn't see it as any different than mixing in other random or system-specific data at initialization time:

Sure, and I don't want to be the one weakening the entropy pool. However, I think this patch is no different from the ones that stuff a NIC MAC into the pool -- it's taking something from my system that is unique or random and pushing the entropy seed around. It seems silly that if I've loaded the hwrng-tpm module, my system entropy pool isn't bumped.

In addition, former random subsystem maintainer Matt Mackall brought up an important aspect of the design of the mixing function. Because it can be reversed, mixing even attacker-controlled data into the pool can never remove randomness that was there at the outset:

The pool mixing function is intentionally _reversible_. This is a crucial security property.

That means, if I have an initial secret pool state X, and hostile attacker controlled data Y, then we can do:

X' = mix(X, Y)


X = unmix(X', Y)

We can see from this that the combination of (X' and Y) still contain the information that was originally in X. Since it's clearly not in Y.. it must all remain in X'.

That didn't entirely mollify Cooper, who was still concerned that built-in hardware RNGs would have their output mixed in early in the boot sequence. He was worried that those bytes could pollute the pool, but Mackall reiterated his argument, putting it in starker terms:

If the pool is in an attacker-knowable state at early boot, adding attacker-controlled data does not make the situation any worse. In fact, if the attacker has less-than-perfect control of the inputs, mixing more things in will make things exponentially harder for the attacker.

Put another way: mixing can't ever [remove] unknownness from the pool, it can only add more. So the only reason you should ever choose not to mix something into the pool is performance.

While the reversible design of the mixing function (and its implications) was clear in Mackall's mind, it would seem that others in the kernel community did not know about it. It's an interesting property that makes perfect sense once you know about it, but is rather counter-intuitive otherwise. In any case, Cooper's objections were withdrawn, and hardware RNG maintainer Herbert Xu queued the patch. We should see it in 3.15.

Comments (8 posted)

MCS locks and qspinlocks

By Jonathan Corbet
March 11, 2014
Impressive amounts of effort have gone into optimizing the kernel's low-level locking mechanisms over the years, but that does not mean that there is no room for improving their performance further. Some work that will be in the 3.15 kernel, with more likely to come later, has the potential to speed up kernel locking considerably, especially in situations where there are significant amounts of contention.

Conceptually, a spinlock is a simple mechanism. The lock can be as small as a single bit; if that bit is clear, the lock is available. A thread wanting to acquire the lock will attempt to set that bit with an atomic compare-and-swap instruction, "spinning" repeatedly if the lock is not available at the time. Over the years, spinlocks have become more complex; ticket spinlocks added fairness to the mechanism in 2008, and 2013 saw the addition of better paravirtualization support, for example.

Spinlocks still suffer from a fundamental problem beyond the fact that simply spinning for a lock can be painful, though: every attempt to acquire a lock requires moving the cache line containing that lock to the local CPU. For contended locks, this cache-line bouncing can hurt performance significantly. So it's not surprising that developers have been working to reduce cache contention with spinlocks; an attempt to add automatic backoff to spinlocks in early 2013 was working toward that goal, for example, but this work was never merged.

MCS locks

More recently, Tim Chen put together a different approach based on a primitive called an "MCS lock" (described in this paper [PDF]). By expanding a spinlock into a per-CPU structure, an MCS lock is able to eliminate much of the cache-line bouncing experienced by simpler locks, especially in the contended case.

In the tip tree for 3.15, an MCS lock is defined by an instance of this structure:

    struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked; /* 1 if lock acquired */
    };
If one is willing to put up with your editor's second-rate drawing skills, one could envision an unlocked MCS spinlock as follows:

[Unlocked MCS lock]

When a CPU comes along with a desire to acquire this lock, it will provide an mcs_spinlock structure of its own. Using an unconditional atomic exchange operation, it stores the address of its own structure in the lock's next field and marks the lock as taken, yielding a situation that looks like this:

[Locked MCS lock]

The atomic exchange will return the previous value of the next pointer. Since that pointer was null, the acquiring CPU knows that it was successful in acquiring the lock. Once that is done, the lock is busy, but there is no contention for it. Should a second CPU come along and try to acquire the lock, it will start in the same way, storing a pointer to its mcs_spinlock structure in the next pointer of the main lock:

[Contending an MCS lock]

When the second CPU does this atomic swap on the main lock, it too will get back the previous contents of the next field — the pointer to the first CPU's mcs_spinlock structure. The non-NULL value tells the second CPU that the lock is not available, while the specific pointer value says who is ahead in line for the lock. The second CPU will respond to this situation by storing a pointer to its mcs_spinlock structure in the next field of CPU 1's structure:

[Contending an MCS lock]

Note that the use of an atomic swap operation on the main lock means that only CPU 2 can have a pointer to CPU 1's mcs_spinlock structure. So there is no need for atomic operations when making changes to that structure, though some careful programming is still needed to make sure that changes are visible to CPU 1 at the right times.

Once this assignment is done, CPU 2 will spin on the locked value in its own mcs_spinlock structure rather than the value in the main lock. Its spinning will thus be entirely CPU-local, not touching the main lock at all. This process can go on indefinitely as contention for the lock increases, with each CPU placing itself in line behind those that are already there, and each CPU spinning on its own copy of the lock. The pointer in the "main" lock, thus, always indicates the tail of the queue of waiting CPUs.

When CPU 1 finally finishes with the lock, it will do a compare-and-swap operation on the main lock, trying to set the next pointer to NULL on the assumption that this pointer still points to its own structure. If that operation succeeds, the lock was never contended and the job is done. If some other CPU has changed that pointer as shown above, though, the compare-and-swap will fail. In that case, CPU 1 will not change the main lock at all; instead, it will change the locked value in CPU 2's structure and remove itself from the situation:

[Transferred MCS lock]

Once its copy of locked changes, CPU 2 will break out of its spin and become the new owner of the lock.

An MCS lock, thus, is somewhat more complicated than a regular spinlock. But that added complexity removes much of the cache-line bouncing from the contended case; it also is entirely fair, passing the lock to each CPU in the order that the CPUs arrived.


In the tip tree, MCS locks are used in the implementation of mutexes, but they do not replace the existing ticket spinlock implementation. One reason for this is size: ticket spinlocks fit into a single 32-bit word, while MCS locks do not. That turns out to be important: spinlocks are embedded into many kernel structures, some of which (notably struct page) cannot tolerate an increase in size. If the MCS lock technique is to be used throughout the kernel, some other approach will be needed.

The version of that approach which is likely to be merged can be seen in the "qspinlock" patch series from Peter Zijlstra which, in turn, is based on an implementation by Waiman Long. In this patch set, each CPU gets an array of four mcs_spinlock structures in a well-known location. Four structures are needed because a CPU could be trying to acquire more than one spinlock at a time: imagine what happens if a hardware interrupt comes in while a thread is spinning on a lock, and the interrupt handler tries to take a lock of its own, for example. The array of structures allows lock acquisition attempts from the normal, software interrupt, hardware interrupt, and non-maskable interrupt contexts to be kept apart.

The 32-bit qspinlock is divided into a number of fields:

  • an integer counter to function like the locked field described above,
  • a two-bit "index" field saying which entry in the per-CPU mcs_spinlock array is used by the waiter at the tail of the list,
  • a single "pending" bit, and
  • an integer field to hold the CPU number indicating the tail of the queue.

The last field is arguably the key: by storing a CPU number rather than a pointer, the qspinlock patch set allows all of the relevant information to be crammed into a single 32-bit word. But there are a couple of other optimizations to be found in this patch set as well.

One has to do with the value used by each CPU for spinning. When a CPU is next in line to acquire a lock, it will spin on the lock itself instead of spinning on its per-CPU structure. In this way, the per-CPU structure's cache line need not be touched when the lock is released, removing one cache line miss from the equation. Any subsequent CPUs will spin on their own structures until they reach the head of the queue.

The "pending" bit extends that strategy a bit further. Should a CPU find that a lock is busy but that no other CPUs are waiting, it will simply set the pending bit and not bother with its own mcs_spinlock structure at all. The second CPU to come along will see the pending bit, begin the process of building the queue, and spin on its local copy of the locked field as usual. Cache-line bouncing between waiters is still eliminated, but the first waiter is also able to avoid the cache-miss penalty associated with accessing its own mcs_spinlock array.

Performance-oriented patches should, of course, always be accompanied by benchmark results. In this case, Waiman included a set of AIM7 benchmark results with his patch set (which did not include the pending-bit optimization). Some workloads regressed a little, but others showed improvements of 1-2% — a good result for a low-level locking improvement. The disk benchmark runs, however, improved by as much as 116%; that benchmark suffers from especially strong contention for locks in the virtual filesystem layer and ext4 filesystem code.

The qspinlock patch can, thus, improve performance in situations where locks are highly contended, though, as Waiman noted in the patch posting, it is not meant to be an actual solution for contention problems. Importantly, qspinlocks also perform better in the uncontended case. Releasing a ticket spinlock requires a read-modify-write operation, while a qspinlock can be released with a simple write. So, while the main performance benefits of qspinlocks are to be seen on large systems, most systems should see at least a small improvement. That should be enough to get this code merged as soon as the 3.15 development cycle.

Comments (19 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management




Page editor: Jonathan Corbet

Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds