The 3.6 kernel was released on September 30. In the announcement, Linus said:
When I did the -rc7 announcement a week ago, I said I might have to
do an -rc8, but a week passed, and things have been calm, and I
honestly cannot see a major reason to do another rc. So here it
is, 3.6 final.
Notable features in 3.6 include TCP small
queues, the client-side TCP fast open
implementation (server side has been merged for 3.7), IOMMU groups, the Btrfs send/receive feature, the VFIO device virtualization mechanism, and
more. See the KernelNewbies
3.6 page for details.
Stable updates: 3.5.5, 3.4.12 and 3.0.44 were released on October 2; each
contains a longer-than-usual list of important fixes.
It's not a very advanced regular expression, but I still find this
a bit alarming in the Linux kernel:
$ git log --no-merges v3.5..v3.6 | \
egrep -i '(integer|counter|buffer|stack|fix) (over|under)flow' | \
How many were security relevant? How many got CVEs?
— Kees Cook
I chose SHA-512 because everyone knows it's 512 times more secure
— Rusty Russell
A familiar test case that makes 5 million random accesses to a 1GB
memory area goes from 20 seconds down to 0.43 seconds with THP
enabled on my SPARC T4-2 box.
— minor performance improvements from David
I added "having no life" as a skill on my Linked In profile. Please
One outcome of the recently-concluded Linux Security Summit was the
decision to form a workgroup around Linux security issues. That workgroup
now exists; it will be using the existing kernel-hardening list for its
The charter of the workgroup is to provide on-going security
verification of Linux kernel subsystems in order to assist in securing the
Linux Kernel and maintain trust and confidence in the security of the Linux
This may include, but is not limited to, topics such as tooling to assist
in securing the Linux Kernel, verification and testing of critical
subsystems for vulnerabilities, security improvements for build tools, and
providing guidance for maintaining subsystem security.
The group intends to discuss a wide range of approaches including tool
development, static analysis, verification efforts, and even the
possibility of tightening the rules for patch signing. Interested people
are encouraged to join in.
Rusty Russell has announced a proposal to standardize the virtio
I/O subsystem. He writes:
I believe that a documented standard (aka virtio 1.0) will increase
visibility and adoption in areas outside our normal linux/kvm
universe. There's been some of that already, but this is the
clearest path to accelerate it. Not the easiest path, but I
believe that a solid I/O standard is a Good Thing for everyone.
The plan is to start an OASIS working group which would help in the
development (and standardization) of version 1.0 of the virtio
specification. He is asking for comments on the idea, but few have been
posted as of this writing.
One of the headline features in the 3.6 release was the long-awaited advent
of security restrictions
that change the
handling of hard and soft links in world-writable directories. One of the
reasons this change took so long to merge was concerns about breaking
programs and scripts on user systems. The case was finally made that
problems would be limited to malware, and the feature was merged.
Now, a single report of trouble on the
linux-kernel list has developers questioning the change — or, at least,
whether it should be turned on by default. Linus fears that this report could be followed by others:
However, I suspect we'll see more. And once that happens, we're not
going to keep a default that breaks peoples old scripts, and we're
going to have to rely on distributions (or users) explicitly
Compatibility is just too important.
Other developers have argued for making the
change as soon as the 3.6.1 stable update. Needless to say, agreement on
this point is not universal; Kees Cook, the author of the change, argues that the benefits far outweigh the
pain. The kernel community is committed to not breaking things that used
to work, though; if this change appears to be causing problems more widely,
it will probably be reversed in the near future.
Kernel development news
A mere 72 days after the beginning of the 3.6 development cycle, the
process has started again with the opening of the 3.7 merge window. As of
this writing, some 5540 non-merge changesets have been pulled into the
mainline, with more to come. Some of the more interesting user-visible
changes merged thus far include:
- The arm64 patch set, adding support
for ARM's 64-bit "AARCH64" architecture, has been merged.
- The perf kvm tool has gained a "stat" command for
analysis of event data. Extensive bash completion support for perf
(for both commands and event names) has also been added.
- The new perf trace tool is meant to function like the
strace utility, but with the ability to show events beyond
system calls. This tool appears to be just getting started; the
message reads "It gets stuck sometimes, but hey, it works
- Applications on the s/390 architecture can now make use of the
System zEC12 hardware transactional memory feature.
- Support for the Intel supervisor mode
access prevention feature has been added.
- The CIFS filesystem now has complete SMB2.1 support; SMB2 is still
marked as experimental, but that's a step forward from its previous
- The ARM subtree cleanup continues; the Tegra subarchitecture is now
fully converted to the device tree mechanism. The unloved and unused
Philips Nexperia PNX4008 subarchitecture support has been removed.
- Extended attributes are now implemented on the control directories for
control groups. This is a Systemd-inspired feature allowing ancillary
information to be attached to control groups.
- If non-hierarchical control group controllers are used with nested
control groups, a warning will now be emitted. The behavior of those
controllers in that situation might change in the future; see this article for more information.
- The Generic
Routing Encapsulation (GRE) tunneling protocol is now supported
over IPv6. Network address translation (NAT) is also now available
- Server-side support for the TCP fast
open protocol enhancement has been merged.
- The kernel now has support for the VXLAN
tunneling protocol. See Documentation/networking/vxlan.txt for details.
- The IMA integrity appraisal security
extension has been merged.
- Subject to a configuration option, the "Yama" security module can be
automatically stacked regardless of
which security module is the "primary" module.
- A number of changes improving support for trusted platform module
(TPM) devices have gone in. There is now support for TPM modules
supporting the TCG TIS 1.2 specification and Infineon's I2C 0.20
specification. IBM virtual TPMs are now supported. The "physical
presence interface" mechanism is also supported, making TPM
- New hardware support includes:
- Boards and processors:
Broadcom BCM2835 SoCs,
Raspberry Pi boards, and
Micrel KS8695 SoC-based boards.
s/390 "storage class memory" devices,
Calxeda Highbank SATA controllers, and
QLogic ISP83xx iSCSI host adapters.
Sony PS3 BD remote controls.
Fairchild FAN53555 regulators,
Maxim 8907 voltage regulators,
Freescale i.MX28 LRADC analog to digital converters (ADCs),
Analog Devices AD7787, AD7788, AD7789, AD7790 and AD7791 SPI ADCs,
Analog Devices AD5755/AD5755-1/AD5757/AD5735/AD5737 digital to analog converters (DACs),
TI LP8788 ADCs,
Maxim MAX197 ADCs,
Analog Devices ADT7410 temperature monitoring chips,
Samsung GPIO/pinmux controllers,
Nomadik DB8540 pin controllers,
Freescale IMX35 pin controllers,
Avionic Design N-bit GPIO expanders,
Broadcom BCM2835 GPIO units,
Freescale MXS SPI controllers, and
NXP SC18IS602/603 SPI controllers.
Silicom Bypass network interface cards,
Freescale XGMAC MDIO controllers, and
Microchip MRF24J40 transceivers.
NXP SCCNXP serial ports,
NXP LPC32XX high speed serial ports,
Maxim MAX3108 UARTs, and
Digi Realport remote serial devices.
Broadcom BCM63xx peripheral controllers,
Marvell USB 3.0 PHY controllers,
ZTE USB to serial devices, and
Cambridge Electronic Design 1401 USB devices (described as
"whatever that is" in the Kconfig entry).
Changes visible to kernel developers include:
- The regulator subsystem now supports a "bypass mode" wherein the
input is connected directly to the output.
- The handling of read-copy-update grace periods has been pushed into a
set of kernel threads, allowing for better preemptability and reduced
power consumption; the October 11 LWN Weekly Edition will
include an article on this work. RCU has also seen work to allow
user-mode execution to be seen as a sort of quiescent state; this is a
necessary precondition to fully tickless execution.
- There is a new "parking" facility for kernel threads. The primary
purpose is to provide a lightweight mechanism to get these threads out
of the way when CPU hotplug events are processed.
- The new TIMER_IRQSAFE timer flag causes the timer function to
be executed with interrupts off. It exists to make it possible to
safely wait for (and cancel) timers from within interrupt handlers.
- There is a new sensor framework for human input devices; it registers
a multifunction device for each sensor hub and enumerates the sensors
found attached to it. See Documentation/hid/hid-sensor.txt for details.
- The firmware caching API has been
merged. This subsystem will pull copies of potentially interesting
device firmware into memory just prior to a system suspend, thus
ensuring that the firmware will be available at resume time.
- The feature-removal.txt file is now a removed feature. Linus removed
it, saying: "There is never any reason to add stuff to this
idiotic file. Either something isn't getting used, and you should
just remove it, or there is no excuse for removing it in the first
place. Just stop the idiocy."
- Initial multiplatform support for the ARM architecture has been
merged. This is an important step toward the "single zImage" goal,
where one kernel can run on a wide variety of ARM systems, but there
is still a lot of work to be done before that goal can be reached.
- The non-reentrant workqueues patch has
been merged. There are also new mod_delayed_work() and
mod_delayed_work_on() functions to modify the expiration time
for delayed work items.
- The user namespace conversion work
continues, meaning that the newish kuid_t and kgid_t
types are appearing in more kernel subsystems.
The 3.7 merge window can be expected to stay open until approximately
October 14. That said, Linus has warned the community that he will be
traveling during this time; he, along with your editor, will be at the
Linux Foundation's Korea
Linux Forum. If the travel interferes with the merging process — which
hasn't been a problem in previous merge windows — this merge window may be
extended to compensate.
Anyone who follows Linux kernel security discussions has probably heard of
the "LSM stacking issue". It is a perennial topic on the mailing lists and
solutions have been proposed from time to time. The basic problem is that
only one Linux Security Module (LSM) can be active in a running kernel, and
that single slot is often occupied by one of the "monolithic" solutions
(e.g. SELinux or AppArmor) supplied by distributions. That leaves some of
the smaller or more special-purpose LSMs—or users who want to use
multiple approaches—out in the cold.
Back in February 2011, David Howells proposed
a stacking solution for LSMs. At the time, Casey Schaufler mentioned a
solution he had been working on that would be posted in a "day or two".
That prediction turns out to have been overly optimistic, but his solution
has surfaced—more than a year-and-a-half later. He also discussed the patches in a lightning talk at the
recently held Linux Security Summit.
There are three types of LSMs available in the kernel today and there are
use cases for combining them in various ways. Administrators might want to
add some AppArmor restrictions on top of the distribution-supplied SELinux
configuration—or use SELinux-based sandboxes on a TOMOYO
system. The two "labeled" LSMs,
SELinux and Smack, require that files have extended attributes (xattrs)
containing labels that are used for access decisions. The two "path-based"
LSMs, AppArmor and TOMOYO, both base their access decisions on the paths
used to access
files in the system. The only other LSM currently available is Yama, which
is something of a container for discretionary access control (DAC)
Yama is the LSM that is perhaps most likely to be stacked. It adds some
restrictions to the
ptrace() attach operation that Ubuntu and ChromeOS use, and other
distributions are considering it as well. In fact, Yama developer Kees
Cook has proposed making the LSM
unconditionally stackable via the CONFIG_SECURITY_YAMA_STACKED
kernel build option (which was merged for 3.7). Over the years, though,
various other security ideas
have been proposed and pointed in the direction of the LSM API, so other
targeted LSMs may come about down the road. Making each separately
stackable is less than ideal, so a more general solution is desirable.
In addition, combining labeled and path-based solutions manually can't
really be sanely done.
When Howells posted his solution, he explicitly disallowed combining the
two labeled LSMs because of implementation difficulties (mainly with
respect to the LSM-specific secid which is used by SELinux and
Smack, but none of the others). There was also a
belief that mixing SELinux and Smack (or AppArmor and TOMOYO for that
matter) is not a particularly sought-after feature. But Schaufler thought that
an unnecessary restriction, one that he was trying to address in his solution.
As it turns out, Schaufler ended up at the same place. His proposal also
defers stacking (or "composing") SELinux and Smack, noting that it
"has proven quite a
challenge". But he was able to get the other combinations
working—at least to the extent that the kernel would boot without
complaints in the logs. The Smack tests passed as well. Performance for
Smack with AppArmor, TOMOYO, and
Yama enabled is "within the noise", he said.
Schaufler's version ensures that the hooks for each enabled LSM are
called, which is different from Howells's approach, which short-circuited
the other hooks if one denied the access. Instead, Schaufler's patches call
each LSM's hooks, remembering the last non-zero return (denial or error of
some sort) as the return value for the hook. His argument is that an LSM
could reasonably expect to see—and possibly record information
about—each access decision, even if it has been denied by another LSM.
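The difference between the two dispatch strategies can be illustrated with a small user-space model. This is only a sketch in Python, not kernel code: the names call_all_hooks() and call_until_denied() are invented for illustration, and hooks are modeled as plain callables that return 0 to allow an access or a negative errno-style value otherwise.

```python
from typing import Callable, List

# A hook returns 0 to allow the access, or a non-zero (negative,
# errno-style) value for a denial or some other error.
Hook = Callable[[], int]


def call_all_hooks(hooks: List[Hook]) -> int:
    """Call every hook and return the last non-zero result, if any.

    Models the behavior described for Schaufler's patches: every LSM
    sees the access, even if an earlier LSM has already denied it.
    """
    result = 0
    for hook in hooks:
        rc = hook()
        if rc != 0:
            result = rc  # remember the most recent denial or error
    return result


def call_until_denied(hooks: List[Hook]) -> int:
    """Stop at the first non-zero result (the short-circuit approach)."""
    for hook in hooks:
        rc = hook()
        if rc != 0:
            return rc  # later hooks never see this access
    return 0
```

With four hooks returning 0, -13, -1 and 0 in that order, the first function runs all four and returns -1, while the second stops after the second hook and returns -13.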
Much of the "guts" of the changes are described in the infrastructure
patch, which is the largest of the five patches. The others make
fairly modest (if pervasive) changes to SELinux, Smack, TOMOYO, and
AppArmor to support stacking. As it turns out, Yama "required no
change and gets in free". The changes to the individual LSMs are
optional, as they can still be used (in a non-stackable way) without them.
Stacking is governed by the CONFIG_SECURITY_COMPOSER option. If
that is not chosen, all of the existing LSMs function as they do today.
If stacking is built in, the security= boot parameter can then be
used to control which
LSMs are enabled. For example, security=selinux,apparmor will
enable those two. If nothing is specified on the boot command line,
all of the LSMs built into the kernel will be enabled. The
/proc/PID/attr/current interface has also been changed to report
information from any of the active LSMs (only SELinux, Smack, and AppArmor
actually use that interface today).
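The selection rule for the security= parameter can be modeled in a few lines. The following Python sketch is an illustration only: select_lsms() is a hypothetical name, and two details are assumptions rather than anything stated about the patch set, namely that unknown names are silently ignored and that the resulting order follows the command line.

```python
from typing import List, Optional


def select_lsms(built_in: List[str], security_param: Optional[str]) -> List[str]:
    """Return the LSMs to enable for a given security= value.

    built_in       -- names of the LSMs compiled into the kernel
    security_param -- the comma-separated value of security=, or None
                      if the parameter was not given at boot
    """
    if not security_param:
        # Nothing on the command line: enable everything that was built in.
        return list(built_in)

    requested = [name.strip() for name in security_param.split(",") if name.strip()]
    # Assumption: names that were not built in are skipped, not treated as errors.
    return [name for name in requested if name in built_in]
```

For a kernel built with SELinux, Smack, AppArmor and Yama, passing "selinux,apparmor" yields just those two, while passing nothing yields all four.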
Existing kernels store pointers to the hooks implemented by an LSM in a
struct security_operations called
security_ops. Schaufler's patch replaces that with an array of
security_operations pointers called composer_ops. That
array is indexed based on
the order that is assigned to each LSM as it is registered. The
first entry (composer_ops[0]) is reserved for the Linux capabilities
hooks. Those have been manually "stacked" into the LSMs for some time, so
entries in composer_ops[0] get zeroed out if one of the other LSMs
implements the hook (as the capabilities checks will be done there). If
there is no entry in composer_ops[0], each of the hooks in the
other entries in that array is called, as described above.
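A rough model of that array may make the bookkeeping easier to follow. In this Python sketch the Composer class, its method names, and the dictionary-based representation of a hook table are all invented for illustration; the dispatch order (the reserved first slot, then the remaining LSMs in registration order) is likewise an assumption.

```python
from typing import Callable, Dict, List, Optional

Hook = Callable[[], int]
Ops = Dict[str, Optional[Hook]]  # hook name -> implementation, or None


class Composer:
    """Toy model of an array of per-LSM hook tables."""

    def __init__(self, capability_ops: Ops) -> None:
        # Slot 0 is reserved for the capabilities hooks.
        self.composer_ops: List[Ops] = [dict(capability_ops)]

    def register(self, ops: Ops) -> int:
        """Add an LSM's hook table and return its index in the array."""
        for name, hook in ops.items():
            if hook is not None and name in self.composer_ops[0]:
                # The LSM performs the capabilities check itself, so the
                # slot-0 entry is cleared to avoid doing the check twice.
                self.composer_ops[0][name] = None
        self.composer_ops.append(dict(ops))
        return len(self.composer_ops) - 1

    def call(self, name: str) -> int:
        """Call every remaining implementation of a hook.

        The last non-zero return value wins; 0 means nobody objected.
        """
        result = 0
        for ops in self.composer_ops:
            hook = ops.get(name)
            if hook is None:
                continue
            rc = hook()
            if rc != 0:
                result = rc
        return result
```

Registering an LSM that implements a hook also found in slot 0 clears the slot-0 entry, so the capabilities logic runs only once, inside that LSM's own hook.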
The security "blobs" (private storage for each LSM) are still managed by
the LSMs, but because there are blob pointers sprinkled around various
kernel data structures (e.g. inodes, files, sockets, keys, etc.), a
"composer blob" is used. That blob contains pointers to each of the active
LSM blobs, and new calls are used to get and set the blob pointers
(e.g. lsm_get_inode() or lsm_set_sock()). Most of the
changes for the individual LSMs are converting to use this new interface.
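The indirection can be pictured as follows. This is a Python sketch of the idea, not the patch's C code: lsm_get_inode() is named in the patch set, but its signature here, the matching lsm_set_inode() helper, and the ComposerBlob and Inode classes are all assumptions made for the example.

```python
from typing import Any, Dict, Optional


class ComposerBlob:
    """Holds one private blob per active LSM, keyed here by LSM name."""

    def __init__(self) -> None:
        self._blobs: Dict[str, Any] = {}

    def get(self, lsm: str) -> Optional[Any]:
        return self._blobs.get(lsm)

    def set(self, lsm: str, blob: Any) -> None:
        self._blobs[lsm] = blob


class Inode:
    """Stand-in for a kernel object that carries a single security pointer."""

    def __init__(self) -> None:
        # One pointer per object; it refers to the composer blob rather
        # than to any single LSM's data.
        self.i_security = ComposerBlob()


def lsm_get_inode(inode: Inode, lsm: str) -> Optional[Any]:
    """Fetch the named LSM's blob for this inode (hypothetical signature)."""
    return inode.i_security.get(lsm)


def lsm_set_inode(inode: Inode, blob: Any, lsm: str) -> None:
    """Store the named LSM's blob for this inode (hypothetical helper)."""
    inode.i_security.set(lsm, blob)
```

Each LSM continues to allocate and interpret its own blob; only the lookup goes through the shared container, which is why most of the per-LSM changes amount to swapping direct pointer accesses for these calls.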
So far, most of the comments have been about implementation details;
Schaufler addressed those in the second version of the patch set. Notably
missing, at least so far, were some of the concerns about strange
interactions between stacked LSMs leading to vulnerabilities that have come
up in earlier discussions. But, without
any major complaints, one would guess some more testing will be done,
including gathering some additional performance numbers, before the
linux-kernel gauntlet will be run. The rest of the kernel developers have
heard about the need for stacking LSMs enough times that it seems likely
that Schaufler's patches (or something derived from them) will eventually
In mid-September, the 3.6 kernel appeared to be stabilizing nicely. Most
of the known regressions had been fixed, the patch volume was dropping, and
Linus was relatively happy. Then Nikolay Ulyanitsky showed up with a problem: the pgbench
benchmark ran 20% slower than under 3.5. The resulting discussion shows
just how hard scalability can be on contemporary hardware and how hard
scheduling can be in general.
Borislav Petkov was able to reproduce the problem; a dozen or so bisection
iterations later he narrowed down the problem to this
patch, which was duly reverted. There is just one little problem left: the
offending patch was, itself, meant to improve scheduler performance.
Reverting it fixed the PostgreSQL regression, but at the cost of losing an
optimization that improves things for many (arguably most) other
workloads. Naturally, that led to a search to figure out what the real
problem was so that the optimization could be restored without harmful
effects on PostgreSQL.
What went wrong
The kernel's scheduling domains mechanism
exists to optimize scheduling decisions by modeling the costs of moving
processes between CPUs. Migrating a process from one CPU to a
hyperthreaded sibling is nearly free; cache is shared at all levels, so the
moved process will not have to spend time repopulating cache with its
working set. Moving to another CPU within the same physical package will
cost more, but mid-level caches are still shared, so such a move is still
much less expensive than moving to another package entirely. The current
scheduling code thus tries to keep processes within the same package
whenever possible, but it also tries to spread runnable processes across the
package's CPUs to maximize throughput.
The problem that the offending patch (by Mike Galbraith) was trying to
solve comes from the fact that the number of CPUs built into a single
package has been growing over time. Not too long ago, examining every
processor within a package in search of an idle CPU for a runnable process
was a relatively quick affair. As the number of CPUs in a package
increases, the cost of that search increases as well, to the point that it
starts to look expensive. The current scheduler's behavior, Mike said at
the time, could also result in processes bouncing around the package
excessively. The result was less-than-optimal performance.
Mike's solution was to organize CPUs into pairs; each CPU gets one "buddy"
CPU. When one CPU wakes a process and needs to find a processor for that
process to run on, it examines only the buddy CPU. The process will be
placed on either the original CPU or the buddy; the search will go no
further than that even if there might be a more lightly loaded CPU
elsewhere in the package. The cost of iterating
over the entire package is eliminated, process bouncing is reduced, and
things run faster. Meanwhile, the scheduler's load balancing code can
still be relied upon to distribute the load across the available CPUs in
the longer term. Mike reported significant improvements in tbench
benchmark results with the patch, and it was quickly accepted for the 3.6
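The trade-off between the two wakeup strategies can be shown with a toy model. This Python sketch is not the scheduler's code: the function names are invented, load is reduced to a count of runnable processes per CPU, and the pairing rule in buddy_of() (neighbors in the package list) is a placeholder assumption rather than the pairing the patch actually uses.

```python
from typing import Dict, List


def pick_full_search(waker: int, package: List[int], load: Dict[int, int]) -> int:
    """Scan the whole package for an idle CPU; fall back to the waker.

    The cost of this search grows with the number of CPUs in the package.
    """
    for cpu in package:
        if load[cpu] == 0:
            return cpu
    return waker


def buddy_of(cpu: int, package: List[int]) -> int:
    """Hypothetical pairing: each CPU is paired with its list neighbor."""
    i = package.index(cpu)
    j = i ^ 1  # 0<->1, 2<->3, ...
    return package[j] if j < len(package) else cpu


def pick_buddy(waker: int, package: List[int], load: Dict[int, int]) -> int:
    """Examine only the buddy CPU; otherwise stay on the waker's CPU."""
    buddy = buddy_of(waker, package)
    if load[buddy] == 0:
        return buddy
    return waker
```

On a four-CPU package where only CPUs 2 and 3 are idle, a wakeup from CPU 0 lands on CPU 2 with the full search but stays on CPU 0 with the buddy scheme, since the buddy (CPU 1) is busy. That is the cheaper decision, but it also leaves two runnable processes competing for one CPU until the load balancer steps in.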
So what is different about PostgreSQL that caused it to slow down in
response to this change? It seems to come down to the design of the
PostgreSQL server and the fact that it does a certain amount of its own
scheduling with user-space spinlocks. Carrying its own spinlock
implementation does evidently yield performance benefits for the PostgreSQL
project, but it also makes the system more susceptible to problems
resulting from scheduler changes in the underlying system.
In this case, restricting the set of CPUs on which a newly-woken process
can run increases the chance that it will end up preempting another
PostgreSQL process. If the new process needs a lock held by the preempted
process, it will end up waiting until the preempted process manages to
run again, slowing things down. Possibly even worse is that preempting the
process — also more likely with Mike's patch — can slow the flow of tasks
to all PostgreSQL worker processes; that, too, will hurt performance.
Making things better
What is needed is a way to gain the benefits of Mike's patch without making
things worse for PostgreSQL-style loads. One possibility, suggested by Linus, is to try to reduce the
cost of searching for an idle CPU instead of eliminating the search
outright. It appears that there is some low-hanging fruit in this area,
but it is not at all clear that optimizing the search, by itself, will
solve the entire problem. Mike's patch eliminates that search cost, but it
also reduces movement of processes around the package; a fix that only
addresses the first part risks falling short in the end.
Another possibility is to simply increase the scheduling granularity,
essentially giving longer time slices to running processes. That will
reduce the number of preemptions, making it less likely that PostgreSQL
processes will step on each other's toes.
Increasing the granularity does, indeed, make things better for the pgbench
load. There may be some benefit to be had from messing with the
granularity, but it is not without its risks. In particular, increasing
the granularity could have an adverse effect on desktop interactivity;
there is no shortage of Linux users who would consider that to be a bad
Yet another possibility is to somehow teach the scheduler to recognize
processes — like the PostgreSQL dispatcher — that should not be preempted
by related processes if it can be avoided. Ingo Molnar suggested investigating this idea:
Add a kernel solution to somehow identify 'central' processes and
bias them. Xorg is a similar kind of process, so it would help
other workloads as well. That way lie dragons, but might be worth
an attempt or two.
The problem, of course, is the dragons. The O(1) scheduler, used by Linux
until the Completely Fair Scheduler (CFS) was merged for 2.6.23, had, over
time, accumulated no end of heuristics and hacks designed to provide the
"right" kind of scheduling for various types of workloads. All these
tweaks complicated the scheduler code considerably, making it fragile and
difficult to work with — and they didn't even work much of the time. This
complexity inspired Con Kolivas's "staircase deadline scheduler" as a much
simpler solution to the problem; that work led to the writing (and merging) of CFS.
Naturally, CFS has lost a fair amount of its simplicity since it was merged;
contact with the real world tends to do that to scheduling algorithms. But
it is still relatively free of workload-specific heuristics. Opening the
door to more of them now risks driving the scheduler in a less
maintainable, more brittle direction where nothing can be done without
a significant chance of creating problems in unpredictable places. It
seems unlikely that the development community wants to go there.
A potentially simpler alternative is to let the application itself tell the
scheduler that one of its processes is special. PostgreSQL could request
that its dispatcher be allowed to run at the expense of one of its own
workers, even if the normal scheduling algorithm would dictate otherwise.
That approach reduces complexity, but it does so by pushing some of the
cost into applications. Getting application developers to accept that cost
can be a challenge, especially if they are interested in supporting
operating systems other than Linux. As a general rule, it is far better if
things just work without the need for manual intervention of this type.
So, in other words, nobody really knows how this problem will be solved at
this time. There are several interesting ideas to pursue, but none that
seem like an obvious solution. Further research is clearly called for.
One good point in all of this is that the problem was found before the
final 3.6 kernel shipped. Performance regressions have a way of hiding,
sometimes for years, before they emerge to bite some important workload.
Eventually, tools like Linsched may help to
find more of these problems early, but we will always be dependent on users
who will perform this kind of testing with workloads that matter to them.
Without Nikolay's 3.6-rc testing, PostgreSQL users might have had an
unpleasant surprise when this kernel was released.
Patches and updates
- Linus Torvalds: Linux 3.6.
(October 1, 2012)
Page editor: Jonathan Corbet