The 3.11 merge window remains open; see the separate article, below, for details on what has been merged in the last week.
Stable updates: 188.8.131.52 was released on July 3, 184.108.40.206 on July 9, and 220.127.116.11 on July 5.
When experienced developers tell you that you are mistaken, you
need to make an effort to understand what the mistake was so you
can learn from it and not make the same mistake again. If you make
the same mistakes again, maintainers will get annoyed and ignore
you (or worse), which is not a good situation to be in when you
want to get your patches merged.
— Arnd Bergmann
Scalability is not an afterthought anymore - new filesystem and
kernel features need to be designed from the ground up with this in
mind. We're living in a world where even phones have 4 CPUs.
— Dave Chinner
Copy and paste is a convenient thing, right? It just should have a
pop up window assigned which asks at the second instance of copying
the same thing whether you really thought about it.
— Thomas Gleixner
Kernel development news
As of this writing, Linus has pulled 8,275 non-merge changesets into the
mainline repository for the 3.11 development cycle. Once again, a lot of
the changes are internal improvements and cleanups that will not be
directly visible to users of the kernel. But there has still been quite a
bit of interesting work merged since last week.
Some of the more noteworthy user-visible changes include:
- There is a new "soft dirty" mechanism that can be employed by user
space to track the pages written to by a process. It is intended for
use by the user-space checkpoint/restart code, but other uses may be
possible; see Documentation/vm/soft-dirty.txt for details.
- The Smack security module now works with the IPv6 network protocol.
- The ICMP socket mechanism has gained
support for ping over IPv6.
- The ptrace() system call has two new operations
(PTRACE_GETSIGMASK and PTRACE_SETSIGMASK) to
retrieve and set the blocked-signal mask.
- 64-bit PowerPC machines can now make use of the transparent huge pages
feature.
- The kernel NFS client implementation now supports version 4.2 of the
NFS protocol. Also supported on the client side is labeled NFS,
allowing mandatory access control to be used with NFSv4 filesystems.
- The kernel has new support for LZ4 compression, both in the
cryptographic API and for compression of the kernel binary itself.
- Dynamic power management support for most AMD Radeon graphics chips
(the r600 series and all that came thereafter)
has been merged. It is a huge amount of code and is still considered
experimental, so it is disabled by default for now; booting with the
radeon.dpm=1 command-line option will turn this feature on
for those who would like to help debug it.
- The low-latency network polling patches have been merged after a
last-minute round of repairs; see the article below for details.
- The Open vSwitch subsystem now supports tunneling with the generic
routing encapsulation (GRE) protocol.
- New hardware support includes:
Graphics:
Renesas R-Car display units and
AMD Radeon HD 8000 "Sea Islands" graphics processors.
Input:
Huion 580 tablets,
ELO USB 4000/4500 touchscreens,
OLPC XO-1.75 and XO-4 keyboards and touchpads, and
Cypress TrueTouch Gen4 touchscreens.
Miscellaneous:
Toumaz Xenif TZ1090 pin controllers,
Intel Baytrail GPIO pin controllers,
Freescale Vybrid VF610 pin controllers,
Maxim MAX77693 voltage/current regulators,
TI Adaptive Body Bias on-chip LDO regulators,
NXP PCF2127/29 real-time clocks,
SiRF SoC real-time clocks,
Global Mixed-mode Technology Inc G762 and G763 fan speed PWM controllers,
Wondermedia WM8xxx SoC I2C controllers,
Kontron COM I2C controllers,
Kontron COM watchdog timers,
NXP PCA9685 LED controllers,
Renesas TPU PWM controllers, and
Broadcom Kona SDHCI controllers.
Networking:
Allwinner A10 EMAC Ethernet interfaces,
Marvell SD8897 wireless chipsets,
ST-Ericsson CW1100 and CW1200 WLAN chipsets,
Qualcomm Atheros 802.11ac QCA98xx wireless interfaces, and
Broadcom BCM6345 SoC Ethernet adapters.
There is also a new hardware simulator for near-field
communications (NFC) driver development.
Audio:
Realtek ALC5640 codecs,
Analog Devices SSM2516 codecs, and
M2Tech hiFace USB-SPDIF interfaces.
Changes visible to kernel developers include:
- Two new device callbacks — offline() and online() —
have been added at the bus level. offline() will be called
when a device is about to be hot-unplugged; it should verify that the
device can, indeed, be unplugged, but not power the device down yet.
Should the unplug be aborted, online() will be called to put
the device back online. The purpose behind these calls is to ensure
that hot removal can be performed before committing to the action.
- The checkpatch utility has a new, experimental --fix option
that will attempt to automatically repair a range of simple formatting
problems.
The merge window should remain open for the better part of another week.
Next week's Kernel Page will include a summary of the final changes pulled
into the mainline for the 3.11 development cycle.
Dave Miller's networking git tree is a busy place; it typically feeds over
1,000 changesets into the mainline each development cycle. Linus clearly
sees the networking subsystem as being well managed, though, and there are
rarely difficulties when Dave puts in his pull requests. So it was
surprising to see Linus reject Dave's request for the big 3.11 pull. In
the end, it came down to the low-latency Ethernet device polling
patches, which had to go through some urgent repairs while the rest of
the networking pull request waited.
The point of this patch set is to enable low-latency data reception by
applications that are willing to busy wait (in the kernel) if data is not
available when a read() or poll() operation is performed
on a socket. Busy waiting is normally avoided in the kernel, but, if
latency matters more than anything else, some users will be willing to
accept the cost of spinning in the kernel if it allows them to avoid
the cost of context switches when the data arrives. The hope is that
functionality in the kernel will lessen the incentive for certain types of
users to install user-space networking stacks.
Since this patch set was covered here in May, it has seen a few changes.
As was predicted, a setsockopt() option (SO_LL) was added
so that the polling behavior could be adjusted on a per-socket basis;
previously, all sockets in the system would use busy waiting if the feature
was enabled in the kernel. Another flag (POLL_LL) was added for
the poll() system call; once again, it causes busy waiting to
happen even if the kernel is otherwise configured not to use it. The
runtime kernel configuration itself was split into two sysctl knobs:
low_latency_read to set the polling time for read()
operations, and low_latency_poll for poll() and
select(). Setting either knob to zero (the default) disables busy
waiting for the associated operation.
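As a rough illustration of the per-socket knob, the sketch below queries a socket's busy-poll budget. The names here reflect what eventually shipped rather than the patches as described: the merged code renamed SO_LL to SO_BUSY_POLL (and the sysctls to net.core.busy_read and net.core.busy_poll), so that name is used. The helper function is our own, not a kernel or libc interface.

```c
/* Sketch: querying a socket's busy-poll budget.  The patches described
 * here called the option SO_LL; the name in released kernels is
 * SO_BUSY_POLL, used below.  A value of 0 (the default) means busy
 * waiting is disabled for the socket. */
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46   /* Linux value; defined here for older headers */
#endif

/* Returns the socket's busy-poll time in microseconds, or -1 on error. */
static int query_busy_poll(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    unsigned int usecs = 0;
    socklen_t len = sizeof(usecs);
    int ret = getsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, &len);
    close(fd);
    return ret == 0 ? (int)usecs : -1;
}
```

Setting a nonzero value with setsockopt() on the same option opts that one socket into busy waiting without touching the system-wide sysctls.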
When the time came to push the networking changes for 3.11, Dave put the
low-latency patches at the top of his list of new features. Linus was not impressed, though. He had a number of
complaints, ranging from naming and documentation through to various
implementation issues and the fact that changes had been made to the core
poll() code without going through the usual channels. He later retracted some of his complaints, but still
objected to a number of things. For example, he called out code like:
if (ll_flag && can_ll && can_poll_ll(ll_start, ll_time))
saying that it "should have made anybody sane go 'WTF?' and wonder
about bad drugs." More seriously, he strongly disliked the
"low-latency" name, saying that it obscured the real effect of the patch.
That name, he said, should be changed:
The "ll" stands for "low latency", but that makes it sound all
good. Make it describe what it actually does: "busy loop", and
write it out. So that people understand what the actual downsides
are. We're not a marketing group.
So, for example, he was not going to accept POLL_LL in the
user-space interface; he requested POLL_BUSY_LOOP instead.
Beyond that, Linus disliked how the core polling code worked, saying that
it was more complicated than it needed to be. He made a number of
suggestions for improving the implementation. Importantly, he wanted to be
sure that polling would not happen if the need_resched flag is set
in the current structure. That flag indicates that a
higher-priority process is waiting to run on the CPU; when it is set, the
current process needs to get out of the way as quickly as possible.
Clearly, performing a busy wait for network data would not be the right
thing to do in such a situation. Linus did not say that the proposed patch
violated that rule, but it was not sufficiently clear to him that things
would work as they needed to.
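The rule Linus wanted enforced can be sketched in user space as follows. All of the helpers here (data_available(), need_resched(), budget_left()) are hypothetical stand-ins for kernel internals, not the actual implementation; the point is only the shape of the loop: spin while no data has arrived, but bail out the moment the budget runs out or another task needs the CPU.

```c
/* Userspace sketch of the busy-wait rule: spin only while (a) no data
 * has arrived, (b) the polling budget has not run out, and (c) no
 * higher-priority task needs the CPU.  All helpers are illustrative
 * stand-ins for kernel internals. */
#include <stdbool.h>

static int spins_until_data = 5;   /* pretend data arrives after 5 polls */
static int resched_after = 1000;   /* pretend need_resched() stays clear  */
static int polls;

static bool data_available(void) { return polls >= spins_until_data; }
static bool need_resched(void)   { return polls >= resched_after; }
static bool budget_left(void)    { return polls < 100; }

/* Returns true if data arrived while spinning, false if the caller
 * must fall back to sleeping. */
static bool busy_poll(void)
{
    while (!data_available()) {
        if (need_resched() || !budget_left())
            return false;      /* give up the CPU */
        polls++;               /* in the kernel: poll the device driver */
    }
    return true;
}
```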
In response to these perceived shortcomings, Linus refused the entire patch
set, putting just
over 1,200 changes on hold. He didn't reject the low-latency work
outright, though:
End result: I think the code is salvageable and people who want
this kind of busy-looping can have it. But I really don't want to
merge it as-is. I think it was badly done, I think it was badly
documented, and I think somebody over-sold the feature by
emphasizing the upsides and not the problems.
As one might imagine, that put a bit of pressure on Eliezer Tamir, the
author of the patches in question. The merge window is only two weeks
long, so the requested changes needed to be made in a hurry. Eliezer was
up to the challenge, though, producing the requested changes in short
order. On July 9, Dave posted a new pull
request with the updated code; Linus pulled the networking tree the
same day, though not before posting a
complaint about some unrelated issues.
In this case, the last-minute review clearly improved the quality of the
implementation; in particular, the user-visible option to poll()
is now more representative of what it really does
(SO_LL remains unchanged, but it will become SO_BUSY_POLL
before 3.11 is released). The cost, of course,
was undoubtedly a fair amount of adrenaline on Eliezer's part as he
imagined Dave busy waiting for the fixes. Better
review earlier in the process might have allowed some of these issues to be
found and fixed in a more relaxed manner. But review bandwidth is, as
is the case in most projects, the most severely limited resource of all.
The full dynamic tick
feature that made its
debut in the 3.10 kernel can be good for users who want their applications
to have full use of one or more CPUs without interference from the kernel.
By getting the clock tick out of the way,
this feature minimizes kernel overhead and the potential latency problems.
Unfortunately, full dynamic tick operation also has the potential to increase
power consumption. Work is underway to fix that problem, but it turns out
to require a bit of information that is surprisingly hard to get: is the
system fully idle or not?
The kernel has had the ability to turn off the periodic clock interrupt on
idle processors for many years. Each processor, when it goes idle, will
simply stop its timer tick; when all processors are idle, the system will
naturally have the timer tick disabled systemwide. Fully dynamic tick —
where the timer tick can be disabled on non-idle CPUs — adds an interesting
complication, though. While most processors can (when the conditions are
right) run without the clock tick, one processor must continue to keep the
tick enabled so that it can perform a number of necessary system
timekeeping operations. Clearly, this "timekeeping CPU" should be able to
disable its tick and go idle if nothing else is running in the system, but,
in current kernels, there is no way for that CPU to detect this situation.
A naive solution to this problem will come easily to mind: maintain a
global counter tracking the number of idle CPUs. Whenever a processor goes
idle, it increments the counter; when the processor becomes busy again, it
decrements the counter. When the number of idle CPUs matches the number of
CPUs in the system, the kernel will know that no work is being done and the
timekeeping CPU can take a break.
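That naive scheme amounts to little more than a shared atomic counter; a minimal sketch, with NR_CPUS and the helper names chosen for illustration, might look like this. It is correct, but every idle transition touches the same cache line, which is exactly the contention problem described next.

```c
/* Sketch of the naive scheme: one global count of idle CPUs.
 * NR_CPUS and the helpers are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

static atomic_int nr_idle_cpus;

static void cpu_enter_idle(void) { atomic_fetch_add(&nr_idle_cpus, 1); }
static void cpu_exit_idle(void)  { atomic_fetch_sub(&nr_idle_cpus, 1); }

/* The timekeeping CPU may stop its tick when this returns true. */
static bool system_fully_idle(void)
{
    return atomic_load(&nr_idle_cpus) == NR_CPUS;
}
```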
The problem, of course, is that cache contention for that global counter
would kill performance on larger systems. Transitions to and from idle are
common under most workloads, so the cache line containing the counter would
bounce frequently across the system. That would defeat some of the point
of the dynamic tick feature; it seems likely that many users would prefer
the current power-inefficient mode to a "solution" that carried such a
cost.
So something smarter needs to be done. That's the cue for an entry by Paul
McKenney, whose seven-part full-system idle
patch set may well be the solution to this problem.
As one might expect, the solution involves the maintenance of a per-CPU
array of idle states. Each CPU can update its status in the array without
contending with the other CPUs.
But, once again, the naive solution is inadequate.
With a per-CPU array, determining whether the system is fully idle requires
iterating through the entire array to examine the state of each CPU. So,
while maintaining the state becomes cheap, answering the "is the system
idle?" question becomes expensive if the number of CPUs is large. Given
that the timekeeping code is
likely to want to ask that question frequently (at each timer tick, at
least), an expensive implementation is not indicated; something else must
be found.
Paul's approach is to combine the better parts of both naive solutions. A
single global variable is created to represent the system's idle state and
to make that state easy to query quickly. That variable is updated from a scan over the
individual CPU idle states, but only under specific conditions that
minimize cross-CPU contention. The result should be the best of both
worlds, at the cost of delayed detection of the full-system idle state and
the addition of some tricky code.
The actual scan of the per-CPU idle flags is not done in the scheduler or
timekeeping code, as one might expect. Instead (as others might expect),
Paul put it into the read-copy-update (RCU) subsystem. That may seem like
a strange place, but it makes a certain sense: RCU is already tracking the
state of the system's CPUs, looking for "grace periods" during which
unused RCU-protected data structures can be reclaimed. Tracking whether
each CPU is fully idle is a relatively small change to the RCU code. As an
added benefit, it is easy for RCU to avoid scanning over the CPUs when
things are busy, so the overhead of maintaining the global full-idle state
vanishes when the system has other things to do.
The actual idleness of the system is tracked in a global variable called
full_sysidle_state. Updating this variable too often would bring
back the cache-line contention problem, though, so the code takes a more
roundabout path. Whenever the system is perceived to be idle, the code
keeps track of when the last processor went idle. Only after a delay will
the global idle state be changed. That delay is zero for "small"
machines (those with no more than eight processors); it increases linearly
as the number of processors goes up. So, on a very large system, all
processors must be idle for quite some time before
full_sysidle_state will change to reflect that state of affairs.
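Putting the pieces together, the overall logic might be sketched as below. The state variable loosely echoes the patch set's full_sysidle_state, but the states, the scan, and the delay calculation are simplified illustrations, not Paul's actual code (which uses more states and RCU's own CPU bookkeeping).

```c
/* Sketch of the combined approach: per-CPU idle flags that each CPU
 * updates without contention, plus a single global state advanced only
 * after every CPU has been idle for a delay that scales with the
 * machine size.  All names and values are illustrative. */
#include <stdbool.h>

#define NR_CPUS 16

enum sysidle_state { NOT_IDLE, IDLE_PENDING, FULLY_IDLE };

static bool cpu_idle[NR_CPUS];            /* per-CPU: cheap to update    */
static enum sysidle_state full_sysidle_state = NOT_IDLE;
static unsigned long last_busy_jiffies;   /* when a CPU was last busy    */

/* Small machines pay no delay; larger ones wait longer before the
 * whole system is declared idle. */
static unsigned long sysidle_delay(void)
{
    return NR_CPUS <= 8 ? 0 : NR_CPUS;    /* grows linearly with CPUs */
}

/* Periodic scan, as RCU might run it from its CPU-tracking machinery. */
static void sysidle_scan(unsigned long now)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!cpu_idle[cpu]) {
            full_sysidle_state = NOT_IDLE;
            last_busy_jiffies = now;
            return;
        }
    }
    if (now - last_busy_jiffies >= sysidle_delay())
        full_sysidle_state = FULLY_IDLE;
    else
        full_sysidle_state = IDLE_PENDING;
}
```

The scan is cheap precisely because it runs from code that is already iterating over CPU states, and it simply stops advancing the global state whenever any CPU is found busy.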
The result is that detection of full-system idle will be delayed on larger
systems, possibly by a significant fraction of a second. So the timer tick
will run a little longer than it strictly needs to. That is a cost of
Paul's approach, as is the fact that his patch set adds some 500 lines of
core kernel code for what is, in the end, the maintenance of a single
integer value. But that, it seems, is the price that must be paid for
scalability in a world where systems have large numbers of CPUs.
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet