Kernel development
Brief items
Kernel release status
The current development kernel is 3.14-rc6, released on March 9. Linus says: "There haven't been any huge problems, but there's been quite a few small bumps that shouldn't happen this late in the release cycle. And rc6 is noticeably bigger than rc5 was, as well. So I'm really hoping that the upcoming week will be calmer, because otherwise I'll start thinking rc8 and even rc9."
Stable updates: 3.13.6 and 3.10.33 were released on March 6; 3.4.83 followed on March 11.
A discussion between database and kernel developers
On March 27, there will be a meeting between database and kernel developers at the Linux Foundation's Collaboration Summit; interested developers are invited to attend. "If there are developers attending Collaboration Summit that work in the database or kernel communities, it would be great if you could come along. Previous discussions were on the PostgreSQL list and that should be expanded in case we accidentally build postgres-only features. The intent is to identify the problems encountered by databases and where relevant, test cases that can be used to demonstrate them if they exist. While the kernel community may be aware of some of the problems, they are not always widely known or understood."
Kernel development news
3.14 development statistics
Normally, by the time the -rc6 prepatch is released in a given development cycle, the final release is thought to be not too far away. This time around, Linus is making noises about needing to go as far as -rc8 or even -rc9 due to the number of outstanding problems. Even if we are still a few weeks from release, the situation with 3.14 will be stable enough that a look at where the changes in this release came from makes sense. The picture that emerges can be described as "business as usual," with the continuation of some longstanding trends.

As of this writing, the 3.14 development cycle has seen the addition of just over 12,000 non-merge changesets from exactly 1,400 developers. These changes added 591,000 lines of code while removing 250,000 lines, for a net growth of 341,000 lines. The most active developers this time around were:
Most active 3.14 developers
By changesets
  H Hartley Sweeten       278   2.3%
  Laurent Pinchart        232   1.9%
  Jingoo Han              174   1.4%
  Rashika Kheria          162   1.3%
  Geert Uytterhoeven      138   1.1%
  Tejun Heo               123   1.0%
  Sachin Kamat            122   1.0%
  Kuninori Morimoto       110   0.9%
  Sujith Manoharan         97   0.8%
  Linus Walleij            89   0.7%
  Wei Yongjun              86   0.7%
  Alex Deucher             82   0.7%
  Stephen Warren           81   0.7%
  Lars-Peter Clausen       79   0.7%
  Ville Syrjälä            78   0.6%
  Namhyung Kim             76   0.6%
  Thierry Reding           74   0.6%
  Johannes Berg            73   0.6%
  Christoph Hellwig        73   0.6%
  Ding Tianhong            71   0.6%
By changed lines
  Greg Kroah-Hartman    73035  10.5%
  Micky Ching           23657   3.4%
  Stephen Boyd          17511   2.5%
  Paul Gortmaker        17511   2.5%
  Greg Rose             16428   2.4%
  Tero Kristo           14509   2.1%
  Ben Skeggs            13320   1.9%
  Sergio Aguirre         8388   1.2%
  Ben Hutchings          8002   1.2%
  Tejun Heo              7975   1.2%
  Laurent Pinchart       7799   1.1%
  Frank Haverkamp        7424   1.1%
  Takashi Iwai           6247   0.9%
  Thomas Hellstrom       6148   0.9%
  Tom Lendacky           6103   0.9%
  Upinder Malhi          6012   0.9%
  Sujith Manoharan       5837   0.8%
  Peter De Schrijver     5680   0.8%
  H Hartley Sweeten      5586   0.8%
  Rob Clark              5345   0.8%
After a brief absence, H. Hartley Sweeten has returned to the top of the "by changesets" list with more work on the seemingly interminable project of fixing up the Comedi drivers in the staging tree. Laurent Pinchart did a lot of work in the Video4Linux and ARM architecture trees, while Jingoo Han continues to work on cleaning up the driver tree. Rashika Kheria, an Outreach Program for Women participant, contributed lots of driver cleanup patches, and Geert Uytterhoeven did a lot of work in the m68k architecture and embedded driver subsystems.
Greg Kroah-Hartman often appears near the top of the "lines changed" column; this time, it is as the result of the addition of the Realtek 8821 PCI WIFI driver and more than the usual number of reverted patches. Micky Ching added the rts5208 and rts5288 drivers to the staging tree, Stephen Boyd added a bunch of Qualcomm hardware support, Paul Gortmaker did (among other things) a bunch of header file cleanups, and Greg Rose contributed the Intel i40evf network driver.
At least 210 employers supported work on the 3.14 kernel. The most active of these employers were:
Most active 3.14 employers
By changesets
  Intel                              1233  10.2%
  (None)                             1075   8.9%
  Red Hat                             877   7.3%
  (Unknown)                           701   5.8%
  Linaro                              528   4.4%
  Samsung                             523   4.3%
  SUSE                                396   3.3%
  IBM                                 351   2.9%
  Renesas Electronics                 339   2.8%
  ?                                   324   2.7%
  Texas Instruments                   288   2.4%
  Vision Engraving Systems            278   2.3%
  (Consultant)                        268   2.2%
  NVIDIA                              248   2.1%
  FOSS Outreach Program for Women     237   2.0%
  Huawei Technologies                 211   1.8%
  Freescale                           210   1.7%
  Qualcomm                            157   1.3%
  Oracle                              152   1.3%
  Broadcom                            144   1.2%
By lines changed
  Linux Foundation          78675  11.4%
  Intel                     69526  10.0%
  (None)                    47083   6.8%
  Red Hat                   46371   6.7%
  Texas Instruments         28274   4.1%
  (Unknown)                 25716   3.7%
  IBM                       25427   3.7%
  Realsil Microelectronics  23676   3.4%
  SUSE                      22686   3.3%
  NVIDIA                    20720   3.0%
  Samsung                   19988   2.9%
  Wind River                19946   2.9%
  Code Aurora Forum         17878   2.6%
  ?                         13452   1.9%
  Linaro                    12945   1.9%
  Cisco                     12747   1.8%
  (Consultant)              12301   1.8%
  Qualcomm                  10806   1.6%
  Renesas Electronics        9655   1.4%
  Freescale                  9533   1.4%
There are few surprises here; instead, this table shows the continuation of some trends we have been seeing for a while. After a short-lived jump in 3.13, the number of contributions from volunteers is back to its long-term decline. Intel seems to have taken a permanent place at the top of the list of changeset contributors. Contributions from the mobile and embedded industry continue to grow. It's tempting to call out the fact that 3.14 will contain a fix to the nouveau driver that came directly from NVIDIA, but this turns out to be the second time that has happened; the first fix from NVIDIA was quietly merged for 3.9 in early 2013.
A slightly different picture emerges if one looks at non-author signoffs — Signed-off-by tags applied to patches by developers other than the author. Such tags are applied by subsystem maintainers as they accept patches; the statistics can thus indicate who the gatekeepers to the kernel are. Associating signoffs with employers leads to these results:
Most signoffs in 3.14
By developer
  Greg Kroah-Hartman         1516  13.1%
  David S. Miller            1128   9.7%
  Mark Brown                  502   4.3%
  Andrew Morton               465   4.0%
  Mauro Carvalho Chehab       370   3.2%
  John W. Linville            352   3.0%
  Simon Horman                256   2.2%
  Arnaldo Carvalho de Melo    237   2.0%
  Daniel Vetter               225   1.9%
  Rafael J. Wysocki           173   1.5%
  Jeff Kirsher                166   1.4%
  Chris Mason                 160   1.4%
  Linus Walleij               148   1.3%
  Ingo Molnar                 146   1.3%
  Brian Norris                140   1.2%
  John Crispin                129   1.1%
  Jesse Brandeburg            127   1.1%
  Josef Bacik                 126   1.1%
  Johannes Berg               125   1.1%
  Benjamin Herrenschmidt      121   1.0%
By employer
  Red Hat                    2336  20.2%
  Linux Foundation           1548  13.4%
  Intel                      1367  11.8%
  Linaro                     1093   9.4%
  ?                           649   5.6%
  Samsung                     647   5.6%
  (None)                      417   3.6%
  ?                           283   2.4%
  SUSE                        270   2.3%
  Renesas Electronics         265   2.3%
  IBM                         262   2.3%
  Texas Instruments           200   1.7%
  Broadcom                    183   1.6%
  (Unknown)                   159   1.4%
  Fon                         129   1.1%
  NVIDIA                       97   0.8%
  Cisco                        91   0.8%
  Pure Storage                 91   0.8%
  (Consultant)                 82   0.7%
  University of Cambridge      77   0.7%
The concentration of companies here is reduced from what it once was, but it is still true that more than half of the patches that were merged into 3.14 went by way of the employees of just four organizations. Linaro and Facebook are moving up the list; in both cases, this has mostly been done by hiring developers who were already working as subsystem maintainers. Here, too, the presence of volunteer developers has been slowly declining over time.
All told, the kernel development machine appears to be in a relatively steady state, with things running smoothly and changes happening over relatively long periods of time. Your editor is thus curious to know whether these reports remain useful on a per-release basis. Perhaps it makes sense to scale them back to, say, an annual exercise where the long-term trends might be more pronounced. This is a question that will be considered over the course of the 3.15 development cycle.
Unmixing the pool
One of the more useful outcomes from the Snowden revelations may well be the increased scrutiny of the security of our systems. While no one would (sensibly) claim that all of the problems have been found and fixed, there has been improvement in many different areas over the last year or so. One of the areas that has received some attention recently has been the kernel's random number generation. Two recent patches continue that trend, though it is hard to claim that they are the direct result of the exposure of NSA (and other secret agency) spying.
Both patches update the state of the "entropy pool" that is used to generate random numbers for both /dev/random and /dev/urandom. That pool is associated with a (conservative) estimate of the amount of entropy (i.e. state unknowable by an attacker) stored within it. Anything that is believed to truly add entropy gets credited in that estimate, while other, possibly even attacker-controlled, input is simply mixed into the pool without entropy credit. The entropy estimate is used to block /dev/random readers when the amount of data requested is larger than the amount of entropy in the pool.
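For those who want to watch that accounting in action, the kernel exports its current estimate through /proc/sys/kernel/random/entropy_avail, a long-standing interface. The small program below is only an illustration of reading that estimate; it is not part of either patch discussed here.

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/kernel/random/entropy_avail", "r");
        int bits;

        /* The file reports the kernel's conservative estimate, in bits,
         * of the entropy currently credited to the pool. */
        if (f && fscanf(f, "%d", &bits) == 1)
            printf("entropy estimate: %d bits\n", bits);
        if (f)
            fclose(f);
        return 0;
    }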
Adding RDSEED
The first patch is from H. Peter Anvin and it simply adds support for the RDSEED instruction to the kernel. RDSEED is an instruction being added to Intel processors that returns "fully conditioned entropy that is suitable for use as seeds to a PRNG [pseudo-random number generator]". The patches use four bytes of RDSEED output to mix into the entropy pool at boot time (with no entropy credit). In addition, four bytes are generated using the instruction once per second and then mixed into the pool. RDSEED is also used to do an "emergency refill" with 64 bytes of output if /dev/random is about to block due to a lack of entropy credit to fulfill a request. In both of the latter cases, four bits of credit are given for each byte of RDSEED output that gets mixed into the pool.
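For the curious, RDSEED can also be exercised from user space on processors that support it. The sketch below uses the _rdseed32_step() compiler intrinsic (build with -mrdseed); it is not the interface the kernel patches use, but it shows the retry-on-failure behavior the instruction requires.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int seed;
        int tries = 10;

        /* RDSEED can transiently fail when no conditioned entropy is
         * available, so callers are expected to retry (or give up). */
        while (tries--) {
            if (_rdseed32_step(&seed)) {
                printf("RDSEED output: %08x\n", seed);
                return 0;
            }
        }
        fprintf(stderr, "no RDSEED data available\n");
        return 1;
    }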
Some may not be convinced that a black-box hardware random number generator (RNG) buried inside an Intel chip should be given that much (or any) entropy credit. It is a difficult question, as there is no technical barrier to the instruction returning known-to-the-NSA sequences and there is no way for anyone (at least anyone outside of Intel) to know for sure. While that may seem paranoid, many formerly paranoid scenarios have moved into the "plausible" category over the last year. That concern has not been raised about the RDSEED patches, however.
Mixing and unmixing
The other patch, from Kees Cook, would add some output from a newly instantiated hardware RNG into the entropy pool. When the RNG is registered (via hwrng_register()), sixteen bytes of its output would get mixed into the pool, but without any entropy credit. Jason Cooper was concerned that even mixing these bytes into the pool could lead to problems: "By adding this patch, even without crediting entropy to the pool, a rogue hwrng now has significantly more influence over the initial state of the entropy pools."
But Cook didn't see it as any different from mixing in other random or system-specific data at initialization time.
In addition, former random subsystem maintainer Matt Mackall brought up an important aspect of the design of the mixing function. Because it can be reversed, mixing even attacker-controlled data into the pool can never remove randomness that was there at the outset:
That means, if I have an initial secret pool state X, and hostile attacker controlled data Y, then we can do:
X' = mix(X, Y)
and
X = unmix(X', Y)
We can see from this that the combination of (X' and Y) still contain the information that was originally in X. Since it's clearly not in Y.. it must all remain in X'.
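Mackall's point is easy to demonstrate with any mixing step that is invertible given the input. The toy mix() below (a simple XOR with a multiplied value, not the kernel's actual mixing function) makes the recovery of X explicit:

    #include <assert.h>
    #include <stdint.h>

    /* Toy mixing step: invertible for any given input word y. */
    static uint32_t mix(uint32_t x, uint32_t y)    { return x ^ (y * 2654435761u); }
    static uint32_t unmix(uint32_t xp, uint32_t y) { return xp ^ (y * 2654435761u); }

    int main(void)
    {
        uint32_t x = 0xdeadbeef;   /* secret pool state, unknown to the attacker */
        uint32_t y = 0x41414141;   /* attacker-controlled input */
        uint32_t xp = mix(x, y);

        /* Knowing Y (and the algorithm) lets anyone undo the mix, so the
         * information that was in X must still be present in X'. */
        assert(unmix(xp, y) == x);
        return 0;
    }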
That didn't entirely mollify Cooper, who was still concerned that built-in hardware RNGs would have their output mixed in early in the boot sequence. He was worried that those bytes could pollute the pool, but Mackall reiterated his argument, putting it in starker terms:
Put another way: mixing can't ever [remove] unknownness from the pool, it can only add more. So the only reason you should ever choose not to mix something into the pool is performance.
While the reversibility of the mixing function (and its implications) was clear in Mackall's mind, it would seem that others in the kernel community were not aware of it. It's an interesting property that makes perfect sense, once you know about it, but is rather counter-intuitive otherwise. In any case, Cooper's objections were withdrawn, and hardware RNG maintainer Herbert Xu queued the patch. We should see it in 3.15.
MCS locks and qspinlocks
Impressive amounts of effort have gone into optimizing the kernel's low-level locking mechanisms over the years, but that does not mean that there is no room for improving their performance further. Some work now sitting in the tip tree, and possibly headed for 3.15, suggests where some of those improvements may come from.

Conceptually, a spinlock is a simple mechanism. The lock can be as small as a single bit; if that bit is clear, the lock is available. A thread wanting to acquire the lock will attempt to set that bit with an atomic compare-and-swap instruction, "spinning" repeatedly if the lock is not available at the time. Over the years, spinlocks have become more complex; ticket spinlocks added fairness to the mechanism in 2008, and 2013 saw the addition of better paravirtualization support, for example.
Spinlocks still suffer from a fundamental problem beyond the fact that simply spinning for a lock can be painful, though: every attempt to acquire a lock requires moving the cache line containing that lock to the local CPU. For contended locks, this cache-line bouncing can hurt performance significantly. So it's not surprising that developers have been working to reduce cache contention with spinlocks; an attempt to add automatic backoff to spinlocks in early 2013 was working toward that goal, for example, but this work was never merged.
MCS locks
More recently, Tim Chen put together a different approach based on a primitive called an "MCS lock" (described in this paper [PDF]). By expanding a spinlock into a per-CPU structure, an MCS lock is able to eliminate much of the cache-line bouncing experienced by simpler locks, especially in the contended case.
In 3.14 and in the tip tree for 3.15, an MCS lock is defined by an instance of this structure:
    struct mcs_spinlock {
        struct mcs_spinlock *next;
        int locked; /* 1 if lock acquired */
    };
If one is willing to put up with your editor's second-rate drawing skills, one could envision an unlocked MCS spinlock as follows:
When a CPU comes along with a desire to acquire this lock, it will provide an mcs_spinlock structure of its own. Using an unconditional atomic exchange operation, it stores the address of its own structure in the lock's next field and marks the lock as taken, yielding a situation that looks like this:
The atomic exchange will return the previous value of the next pointer. Since that pointer was null, the acquiring CPU knows that it was successful in acquiring the lock. Once that is done, the lock is busy, but there is no contention for it. Should a second CPU come along and try to acquire the lock, it will start in the same way, storing a pointer to its mcs_spinlock structure in the next pointer of the main lock:
When the second CPU does this atomic swap on the main lock, it too will get back the previous contents of the next field — the pointer to the first CPU's mcs_spinlock structure. The non-NULL value tells the second CPU that the lock is not available, while the specific pointer value says who is ahead in line for the lock. The second CPU will respond to this situation by storing a pointer to its mcs_spinlock structure in the next field of CPU 1's structure:
Note that the use of an atomic swap operation on the main lock means that only CPU 2 can have a pointer to CPU 1's mcs_spinlock structure. So there is no need for atomic operations when making changes to that structure, though some careful programming is still needed to make sure that changes are visible to CPU 1 at the right times.
Once this assignment is done, CPU 2 will spin on the locked value in its own mcs_spinlock structure rather than the value in the main lock. Its spinning will thus be entirely CPU-local, not touching the main lock at all. This process can go on indefinitely as contention for the lock increases, with each CPU placing itself in line behind those that are already there, and each CPU spinning on its own copy of the lock. The pointer in the "main" lock, thus, always indicates the tail of the queue of waiting CPUs.
When CPU 1 finally finishes with the lock, it will do a compare-and-swap operation on the main lock, trying to set the next pointer to NULL on the assumption that this pointer still points to its own structure. If that operation succeeds, the lock was never contended and the job is done. If some other CPU has changed that pointer as shown above, though, the compare-and-swap will fail. In that case, CPU 1 will not change the main lock at all; instead, it will change the locked value in CPU 2's structure and remove itself from the situation:
Once its copy of locked changes, CPU 2 will break out of its spin and become the new owner of the lock.
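The acquire and release steps just described can be sketched in a few lines of user-space C using GCC's atomic builtins and the mcs_spinlock structure shown above. This is only an illustration of the algorithm, not the kernel's implementation, which adds the appropriate barriers and per-CPU plumbing.

    /* Illustration only: acquire/release following the steps described above. */
    static void mcs_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
    {
        struct mcs_spinlock *prev;

        node->next = NULL;
        node->locked = 0;

        /* Unconditionally make our node the new tail of the queue. */
        prev = __atomic_exchange_n(lock, node, __ATOMIC_ACQ_REL);
        if (!prev)
            return;                 /* lock was free; we now own it */

        /* Queue behind the old tail, then spin on our own locked flag. */
        __atomic_store_n(&prev->next, node, __ATOMIC_RELEASE);
        while (!__atomic_load_n(&node->locked, __ATOMIC_ACQUIRE))
            ;                       /* CPU-local spin */
    }

    static void mcs_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
    {
        struct mcs_spinlock *next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE);

        if (!next) {
            struct mcs_spinlock *expected = node;

            /* No visible successor: if we are still the tail, reset the lock. */
            if (__atomic_compare_exchange_n(lock, &expected, NULL, 0,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                return;

            /* A successor is in the middle of queueing; wait for its pointer. */
            while (!(next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE)))
                ;
        }
        /* Hand the lock to the next waiter by setting its local flag. */
        __atomic_store_n(&next->locked, 1, __ATOMIC_RELEASE);
    }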
An MCS lock, thus, is somewhat more complicated than a regular spinlock. But that added complexity removes much of the cache-line bouncing from the contended case; it also is entirely fair, passing the lock to each CPU in the order that the CPUs arrived.
Qspinlocks
In the tip tree, MCS locks are used in the implementation of mutexes, but they do not replace the existing ticket spinlock implementation. One reason for this is size: ticket spinlocks fit into a single 32-bit word, while MCS locks do not. That turns out to be important: spinlocks are embedded into many kernel structures, some of which (notably struct page) cannot tolerate an increase in size. If the MCS lock technique is to be used throughout the kernel, some other approach will be needed.
The version of that approach which is likely to be merged can be seen in the "qspinlock" patch series from Peter Zijlstra which, in turn, is based on an implementation by Waiman Long. In this patch set, each CPU gets an array of four mcs_spinlock structures in a well-known location. Four structures are needed because a CPU could be trying to acquire more than one spinlock at a time: imagine what happens if a hardware interrupt comes in while a thread is spinning on a lock, and the interrupt handler tries to take a lock of its own, for example. The array of structures allows lock acquisition attempts from the normal, software interrupt, hardware interrupt, and non-maskable interrupt contexts to be kept apart.
The 32-bit qspinlock is divided into a number of fields:
- an integer counter to function like the locked field described above,
- a two-bit "index" field saying which entry in the per-CPU mcs_spinlock array is used by the waiter at the tail of the list,
- a single "pending" bit, and
- an integer field to hold the CPU number indicating the tail of the queue.
The last field is arguably the key: by storing a CPU number rather than a pointer, the qspinlock patch set allows all of the relevant information to be crammed into a single 32-bit word. But there are a couple of other optimizations to be found in this patch set as well.
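As a rough sketch of how those fields might be packed into one 32-bit word (the exact bit positions here are illustrative; the real layout is defined in the qspinlock patches themselves):

    #include <stdint.h>

    #define Q_LOCKED         (1u << 0)    /* lock is held */
    #define Q_PENDING        (1u << 8)    /* first waiter spins on the lock word */
    #define Q_TAIL_IDX_SHIFT 16           /* which of the four per-CPU nodes */
    #define Q_TAIL_CPU_SHIFT 18           /* CPU number of the queue tail */

    /* Encode the queue tail; the CPU is stored as cpu + 1 so that a zero
     * tail field can mean "no CPUs are queued". */
    static inline uint32_t q_encode_tail(unsigned int cpu, unsigned int idx)
    {
        return ((cpu + 1) << Q_TAIL_CPU_SHIFT) | (idx << Q_TAIL_IDX_SHIFT);
    }

    static inline unsigned int q_tail_cpu(uint32_t val)
    {
        return (val >> Q_TAIL_CPU_SHIFT) - 1;
    }

    static inline unsigned int q_tail_idx(uint32_t val)
    {
        return (val >> Q_TAIL_IDX_SHIFT) & 3;
    }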
One has to do with the value used by each CPU for spinning. When a CPU is next in line to acquire a lock, it will spin on the lock itself instead of spinning on its per-CPU structure. In this way, the per-CPU structure's cache line need not be touched when the lock is released, removing one cache line miss from the equation. Any subsequent CPUs will spin on their own structures until they reach the head of the queue.
The "pending" bit extends that strategy a bit further. Should a CPU find that a lock is busy but that no other CPUs are waiting, it will simply set the pending bit and not bother with its own mcs_spinlock structure at all. The second CPU to come along will see the pending bit, begin the process of building the queue, and spin on its local copy of the locked field as usual. Cache-line bouncing between waiters is still eliminated, but the first waiter is also able to avoid the cache-miss penalty associated with accessing its own mcs_spinlock array.
Performance-oriented patches should, of course, always be accompanied by benchmark results. In this case, Waiman included a set of AIM7 benchmark results with his patch set (which did not include the pending-bit optimization). Some workloads regressed a little, but others showed improvements of 1-2% — a good result for a low-level locking improvement. The disk benchmark runs, however, improved by as much as 116%; that benchmark suffers from especially strong contention for locks in the virtual filesystem layer and ext4 filesystem code.
The qspinlock patch can, thus, improve performance in situations where locks are highly contended, though, as Waiman noted in the patch posting, it is not meant to be an actual solution for contention problems. Importantly, qspinlocks also perform better in the uncontended case. Releasing a ticket spinlock requires a read-modify-write operation, while a qspinlock can be released with a simple write. So, while the main performance benefits of qspinlocks are to be seen on large systems, most systems should see at least a small improvement. That should be enough to get this code merged, perhaps as soon as the 3.15 development cycle.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet