Kernel development

Brief items

Kernel release status

The 3.6 kernel was released on September 30. In the announcement Linus said:

When I did the -rc7 announcement a week ago, I said I might have to do an -rc8, but a week passed, and things have been calm, and I honestly cannot see a major reason to do another rc. So here it is, 3.6 final.

Notable features in 3.6 include TCP small queues, the client-side TCP fast open implementation (server side has been merged for 3.7), IOMMU groups, the Btrfs send/receive feature, the VFIO device virtualization mechanism, and more. See the KernelNewbies 3.6 page for details.

Stable updates: 3.5.5, 3.4.12 and 3.0.44 were released on October 2; each contains a longer-than-usual list of important fixes.

Quotes of the week

It's not a very advanced regular expression, but I still find this a bit alarming in the Linux kernel:

    $ git log --no-merges v3.5..v3.6 | \
        egrep -i '(integer|counter|buffer|stack|fix) (over|under)flow' | \
        wc -l
    31

How many were security relevant? How many got CVEs?

— Kees Cook

I chose SHA-512 because everyone knows it's 512 times more secure than SHA-1.

— Rusty Russell

A familiar test case that makes 5 million random accesses to a 1GB memory area goes from 20 seconds down to 0.43 seconds with THP enabled on my SPARC T4-2 box.

— minor performance improvements from David Miller

I added "having no life" as a skill on my Linked In profile. Please endorse me!

— Jon Masters

Linux security workgroup formed

One outcome of the recently-concluded Linux Security Summit was the decision to form a workgroup around Linux security issues. That workgroup now exists; it will be using the existing kernel-hardening list for its discussions.

The charter of the workgroup is to provide on-going security verification of Linux kernel subsystems in order to assist in securing the Linux Kernel and maintain trust and confidence in the security of the Linux ecosystem.

This may include, but is not limited to, topics such as tooling to assist in securing the Linux Kernel, verification and testing of critical subsystems for vulnerabilities, security improvements for build tools, and providing guidance for maintaining subsystem security.

The group intends to discuss a wide range of approaches including tool development, static analysis, verification efforts, and even the possibility of tightening the rules for patch signing. Interested people are encouraged to join in.

Standardizing virtio

Rusty Russell has announced a proposal to standardize the virtio I/O subsystem. He says:

I believe that a documented standard (aka virtio 1.0) will increase visibility and adoption in areas outside our normal linux/kvm universe. There's been some of that already, but this is the clearest path to accelerate it. Not the easiest path, but I believe that a solid I/O standard is a Good Thing for everyone.

The plan is to start an OASIS working group which would help in the development (and standardization) of version 1.0 of the virtio specification. He is asking for comments on the idea, but few have been posted as of this writing.

Questioning link restrictions

One of the headline features in the 3.6 release was the long-awaited advent of security restrictions that change the handling of hard and soft links in world-writable directories. One of the reasons this change took so long to merge was concerns about breaking programs and scripts on user systems. The case was finally made that problems would be limited to malware, and the feature was merged.

Now, a single report of trouble on the linux-kernel list has developers questioning the change — or, at least, whether it should be turned on by default. Linus fears that this report could be followed by others:

However, I suspect we'll see more. And once that happens, we're not going to keep a default that breaks peoples old scripts, and we're going to have to rely on distributions (or users) explicitly setting it.

Compatibility is just too important.

Other developers have argued for making the change as soon as the 3.6.1 stable update. Needless to say, agreement on this point is not universal; Kees Cook, the author of the change, argues that the benefits far outweigh the pain. The kernel community is committed to not breaking things that used to work, though; if this change appears to be causing problems more widely, it will probably be reversed in the near future.

Kernel development news

3.7 Merge window part 1

By Jonathan Corbet
October 3, 2012

A mere 72 days after the beginning of the 3.6 development cycle, the process has started again with the opening of the 3.7 merge window. As of this writing, some 5540 non-merge changesets have been pulled into the mainline, with more to come. Some of the more interesting user-visible changes merged thus far include:

The arm64 patch set, adding support for ARM's 64-bit "AARCH64" architecture, has been merged.
The perf kvm tool has gained a "stat" command for analysis of event data. Extensive bash completion support for perf (for both commands and event names) has also been added.
The new perf trace tool is meant to function like the strace utility, but with the ability to show events beyond system calls. This tool appears to be just getting started; the commit message reads "It gets stuck sometimes, but hey, it works sometimes too!"
Applications on the s/390 architecture can now make use of the System zEC12 hardware transactional memory feature.
Support for the Intel supervisor mode access prevention feature has been added.
The CIFS filesystem now has complete SMB2.1 support; SMB2 is still marked as experimental, but that's a step forward from its previous "broken" status.
The ARM subtree cleanup continues; the Tegra subarchitecture is now fully converted to the device tree mechanism. The unloved and unused Philips Nexperia PNX4008 subarchitecture support has been removed.
Extended attributes are now implemented on the control directories for control groups. This is a Systemd-inspired feature allowing ancillary information to be attached to control groups.
If non-hierarchical control group controllers are used with nested (hierarchical) control groups, a warning will now be emitted. The behavior of those controllers in that situation might change in the future; see this article for more information.
The Generic Routing Encapsulation (GRE) tunneling protocol is now supported over IPv6. Network address translation (NAT) is also now available for IPv6.
Server-side support for the TCP fast open protocol enhancement has been merged.
The kernel now has support for the VXLAN tunneling protocol. See Documentation/networking/vxlan.txt for more information.
The IMA integrity appraisal security extension has been merged.
Subject to a configuration option, the "Yama" security module can be automatically stacked regardless of which security module is the "primary" module.
A number of changes improving support for trusted platform module (TPM) devices have gone in. There is now support for TPM modules supporting the TCG TIS 1.2 specification and Infineon's I2C 0.20 specification. IBM virtual TPMs are now supported. The "physical presence interface" mechanism is also supported, making TPM administration easier.
New hardware support includes:
- Boards and processors: Broadcom BCM2835 SoCs, Raspberry Pi boards, and Micrel KS8695 SoC-based boards.
- Block: s/390 "storage class memory" devices, Calxeda Highbank SATA controllers, and QLogic ISP83xx iSCSI host adapters.
- Input: Sony PS3 BD remote controls.
- Miscellaneous: Fairchild FAN53555 regulators, Maxim 8907 voltage regulators, Freescale i.MX28 LRADC analog to digital converters (ADCs), Analog Devices AD7787, AD7788, AD7789, AD7790 and AD7791 SPI ADCs, Analog Devices AD5755/AD5755-1/AD5757/AD5735/AD5737 digital to analog converters (DACs), TI LP8788 ADCs, Maxim MAX197 ADCs, Analog Devices ADT7410 temperature monitoring chips, Samsung GPIO/pinmux controllers, Nomadik DB8540 pin controllers, Freescale IMX35 pin controllers, Avionic Design N-bit GPIO expanders, Broadcom BCM2835 GPIO units, Freescale MXS SPI controllers, and NXP SC18IS602/603 SPI controllers.
- Networking: Silicom Bypass network interface cards, Freescale XGMAC MDIO controllers, and Microchip MRF24J40 transceivers.
- Serial: NXP SCCNXP serial ports, NXP LPC32XX high speed serial ports, Maxim MAX3108 UARTs, and Digi Realport remote serial devices.
- USB: Broadcom BCM63xx peripheral controllers, Marvell USB 3.0 PHY controllers, ZTE USB to serial devices, and Cambridge Electronic Design 1401 USB devices (described as "whatever that is" in the Kconfig entry).

Changes visible to kernel developers include:

The regulator subsystem now supports a "bypass mode" wherein the input is connected directly to the output.
The handling of read-copy-update grace periods has been pushed into a set of kernel threads, allowing for better preemptability and reduced power consumption; the October 11 LWN Weekly Edition will include an article on this work. RCU has also seen work to allow user-mode execution to be seen as a sort of quiescent state; this is a necessary precondition to fully tickless execution.
There is a new "parking" facility for kernel threads. The primary purpose is to provide a lightweight mechanism to get these threads out of the way when CPU hotplug events are processed.
The new TIMER_IRQSAFE timer flag causes the timer function to be executed with interrupts off. It exists to make it possible to safely wait for (and cancel) timers from within interrupt handlers.
There is a new sensor framework for human input devices; it registers a multifunction device for each sensor hub and enumerates the sensors found attached to it. See Documentation/hid/hid-sensor.txt for details.
The firmware caching API has been merged. This subsystem will pull copies of potentially interesting device firmware into memory just prior to a system suspend, thus ensuring that the firmware will be available at resume time.
The feature-removal.txt file is now a removed feature. Linus zapped it, saying: "There is never any reason to add stuff to this idiotic file. Either something isn't getting used, and you should just remove it, or there is no excuse for removing it in the first place. Just stop the idiocy."
Initial multiplatform support for the ARM architecture has been merged. This is an important step toward the "single zImage" goal, where one kernel can run on a wide variety of ARM systems, but there is still a lot of work to be done before that goal can be reached.
The non-reentrant workqueues patch has been merged. There are also new mod_delayed_work() and mod_delayed_work_on() functions to modify the expiration time for delayed work items.
The user namespace conversion work continues, meaning that the newish kuid_t and kgid_t types are appearing in more kernel subsystems.

The 3.7 merge window can be expected to stay open until approximately October 14. That said, Linus has warned the community that he will be traveling during this time; he, along with your editor, will be at the Linux Foundation's Korea Linux Forum. If the travel interferes with the merging process — which hasn't been a problem in previous merge windows — this merge window may be extended to compensate.

Another LSM stacking approach

By Jake Edge
October 3, 2012

Anyone who follows Linux kernel security discussions has probably heard of the "LSM stacking issue". It is a perennial topic on the mailing lists and solutions have been proposed from time to time. The basic problem is that only one Linux Security Module (LSM) can be active in a running kernel, and that single slot is often occupied by one of the "monolithic" solutions (e.g. SELinux or AppArmor) supplied by distributions. That leaves some of the smaller or more special-purpose LSMs—or users who want to use multiple approaches—out in the cold.

Back in February 2011, David Howells proposed a stacking solution for LSMs. At the time, Casey Schaufler mentioned a solution he had been working on that would be posted in a "day or two". That prediction turns out to have been overly optimistic, but his solution has surfaced—more than a year-and-a-half later. He also discussed the patches in a lightning talk at the recently held Linux Security Summit.

There are three types of LSMs available in the kernel today and there are use cases for combining them in various ways. Administrators might want to add some AppArmor restrictions on top of the distribution-supplied SELinux configuration—or use SELinux-based sandboxes on a TOMOYO system. The two "labeled" LSMs, SELinux and Smack, require that files have extended attributes (xattrs) containing labels that are used for access decisions. The two "path-based" LSMs, AppArmor and TOMOYO, both base their access decisions on the paths used to access files in the system. The only other LSM currently available is Yama, which is something of a container for discretionary access control (DAC) enhancements.

Yama is the LSM that is perhaps most likely to be stacked. It adds some restrictions to the ptrace() attach operation that Ubuntu and ChromeOS use, and other distributions are considering it as well. In fact, Yama developer Kees Cook has proposed making the LSM unconditionally stackable via the CONFIG_SECURITY_YAMA_STACKED kernel build option (which was merged for 3.7). Over the years, though, various other security ideas have been proposed and pointed in the direction of the LSM API, so other targeted LSMs may come about down the road. Making each separately stackable is less than ideal, so a more general solution is desirable. In addition, combining labeled and path-based solutions manually can't really be sanely done.

When Howells posted his solution, he explicitly disallowed combining the two labeled LSMs because of implementation difficulties (mainly with respect to the LSM-specific secid which is used by SELinux and Smack, but none of the others). There was also a belief that mixing SELinux and Smack (or AppArmor and TOMOYO for that matter) is not a particularly sought-after feature. But Schaufler thought that was an unnecessary restriction, one that he was trying to address in his solution.

As it turns out, Schaufler ended up at the same place. His proposal also defers stacking (or "composing") SELinux and Smack, noting that it "has proven quite a challenge". But he was able to get the other combinations working—at least to the extent that the kernel would boot without complaints in the logs. The Smack tests passed as well. Performance for Smack with AppArmor, TOMOYO, and Yama enabled is "within the noise", he said.

Schaufler's version ensures that the hooks for each enabled LSM are called, which differs from Howells's approach, which short-circuited the remaining hooks as soon as one denied the access. Instead, Schaufler's patches call each LSM's hooks, remembering the last non-zero return (denial or error of some sort) as the return value for the hook. His argument is that an LSM could reasonably expect to see (and possibly record information about) each access decision, even if it has been denied by another LSM.
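The difference between the two calling conventions can be shown with a small user-space model. This is only a sketch in Python, not kernel code: the function names, the toy hooks, and the use of negative errno values as "denials" are assumptions made for illustration.

```python
import errno


def call_all_hooks(hooks, *args):
    """Call every hook; return the last non-zero result (0 if there is none)."""
    result = 0
    for hook in hooks:
        rc = hook(*args)
        if rc != 0:
            result = rc  # a later denial or error overwrites an earlier one
    return result


def call_until_denied(hooks, *args):
    """Short-circuit variant: stop at the first non-zero result."""
    for hook in hooks:
        rc = hook(*args)
        if rc != 0:
            return rc
    return 0


def make_hook(name, verdict, seen):
    """Build a toy hook that records its name in `seen` and returns `verdict`."""
    def hook(*args):
        seen.append(name)
        return verdict
    return hook


if __name__ == "__main__":
    seen = []
    hooks = [
        make_hook("lsm_a", 0, seen),
        make_hook("lsm_b", -errno.EACCES, seen),
        make_hook("lsm_c", 0, seen),
    ]
    print(call_all_hooks(hooks, "/tmp/file"), seen)
    seen.clear()
    print(call_until_denied(hooks, "/tmp/file"), seen)
```

In the first variant every module gets to observe the access, which is the property Schaufler argues for; in the second, modules after the first denial never see it.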

Much of the "guts" of the changes are described in the infrastructure patch, which is the largest of the five patches. The others make fairly modest (if pervasive) changes to SELinux, Smack, TOMOYO, and AppArmor to support stacking. As it turns out, Yama "required no change and gets in free". The changes to the individual LSMs are optional, as they can still be used (in a non-stackable way) without them.

Stacking is governed by the CONFIG_SECURITY_COMPOSER option. If that is not chosen, all of the existing LSMs function as they do today. If stacking is built in, the security= boot parameter can then be used to control which LSMs are enabled. For example, security=selinux,apparmor will enable those two. If nothing is specified on the boot command line, all of the LSMs built into the kernel will be enabled. The /proc/PID/attr/current interface has also been changed to report information from any of the active LSMs (only SELinux, Smack, and AppArmor actually use that interface today).
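As a rough illustration of the selection rule, the following Python sketch picks the enabled modules from a kernel command line. The function name is hypothetical, and two behaviors are assumptions not stated in the patch description: unknown names are silently ignored, and the enabled modules keep the order given on the command line.

```python
def select_lsms(built_in, cmdline):
    """Return the LSMs to enable, given the built-in list and a command line."""
    prefix = "security="
    for token in cmdline.split():
        if token.startswith(prefix):
            requested = [name for name in token[len(prefix):].split(",") if name]
            # Assumption: names that are not built in are simply skipped.
            return [name for name in requested if name in built_in]
    # No security= parameter: everything built into the kernel is enabled.
    return list(built_in)


if __name__ == "__main__":
    built_in = ["selinux", "smack", "tomoyo", "apparmor", "yama"]
    print(select_lsms(built_in, "root=/dev/sda1 ro security=selinux,apparmor"))
    print(select_lsms(built_in, "root=/dev/sda1 ro"))
```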

Existing kernels store pointers to the hooks implemented by an LSM in a struct security_operations called security_ops. Schaufler's patch replaces that with an array of security_operations pointers called composer_ops. That array is indexed based on the order that is assigned to each LSM as it is registered. The first entry (composer_ops[0]) is reserved for the Linux capabilities hooks. Those have been manually "stacked" into the LSMs for some time, so entries in composer_ops[0] get zeroed out if one of the other LSMs implements the hook (as the capabilities checks will be done there). If there is no entry in composer_ops[0], each of the hooks in the other entries in that array is called, as described above.
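One reading of that layout can be modeled in a few lines of Python. This is a sketch only: the Composer class, its methods, and the use of dictionaries in place of struct security_operations are all assumptions, and the hook names in the example are merely illustrative.

```python
class Composer:
    """Toy model of the composer_ops array; slot 0 holds the capability hooks."""

    def __init__(self, capability_hooks):
        self.ops = [dict(capability_hooks)]

    def register(self, hooks):
        """Add an LSM's hooks; return the order (array index) assigned to it."""
        order = len(self.ops)
        self.ops.append(dict(hooks))
        for name in hooks:
            # The LSM performs the capability check itself for this hook,
            # so the slot-0 entry is cleared.
            self.ops[0].pop(name, None)
        return order

    def call(self, name, *args):
        """Dispatch a hook according to the rules described in the text."""
        cap_hook = self.ops[0].get(name)
        if cap_hook is not None:
            return cap_hook(*args)
        result = 0
        for ops in self.ops[1:]:
            hook = ops.get(name)
            if hook is not None:
                rc = hook(*args)
                if rc != 0:
                    result = rc  # remember the last non-zero return
        return result


if __name__ == "__main__":
    composer = Composer({"capable": lambda: 0, "ptrace_access_check": lambda: 0})
    order = composer.register({"ptrace_access_check": lambda: -1})
    print(order, sorted(composer.ops[0]), composer.call("ptrace_access_check"))
```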

The security "blobs" (private storage for each LSM) are still managed by the LSMs, but because there are blob pointers sprinkled around various kernel data structures (e.g. inodes, files, sockets, keys, etc.), a "composer blob" is used. That blob contains pointers to each of the active LSM blobs, and new calls are used to get and set the blob pointers (e.g. lsm_get_inode() or lsm_set_sock()). Most of the changes for the individual LSMs are converting to use this new interface.
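The composer blob idea can be sketched as one extra level of indirection. In this Python model, the accessor names come from the article, but their signatures, the NUM_LSMS constant, and the lazy allocation of the composer blob are assumptions made for illustration.

```python
NUM_LSMS = 4  # hypothetical number of slots, one per registered LSM


class Inode:
    """Toy stand-in for a kernel object that carries one security pointer."""

    def __init__(self):
        self.i_security = None  # will point at the composer blob


def lsm_set_inode(inode, order, blob):
    """Store `blob` in the slot belonging to the LSM registered at `order`."""
    if inode.i_security is None:
        inode.i_security = [None] * NUM_LSMS  # the composer blob
    inode.i_security[order] = blob


def lsm_get_inode(inode, order):
    """Return the blob stored for the LSM at `order`, or None if unset."""
    if inode.i_security is None:
        return None
    return inode.i_security[order]


if __name__ == "__main__":
    inode = Inode()
    lsm_set_inode(inode, 1, {"label": "_"})         # a label-based module
    lsm_set_inode(inode, 2, {"path": "/usr/bin/x"})  # a path-based module
    print(lsm_get_inode(inode, 1), lsm_get_inode(inode, 2))
```

Each module keeps ownership of its own blob; only the lookup goes through the shared pointer.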

So far, most of the comments have been about implementation details; Schaufler addressed those in the second version of the patch set. Notably missing, at least so far, were some of the concerns about strange interactions between stacked LSMs leading to vulnerabilities that have come up in earlier discussions. But, without any major complaints, one would guess some more testing will be done, including gathering some additional performance numbers, before the linux-kernel gauntlet will be run. The rest of the kernel developers have heard about the need for stacking LSMs enough times that it seems likely that Schaufler's patches (or something derived from them) will eventually pass muster.

How 3.6 nearly broke PostgreSQL

By Jonathan Corbet
October 2, 2012

In mid-September, the 3.6 kernel appeared to be stabilizing nicely. Most of the known regressions had been fixed, the patch volume was dropping, and Linus was relatively happy. Then Nikolay Ulyanitsky showed up with a problem: the pgbench PostgreSQL benchmark ran 20% slower than under 3.5. The resulting discussion shows just how hard scalability can be on contemporary hardware and how hard scheduling can be in general.

Borislav Petkov was able to reproduce the problem; a dozen or so bisection iterations later he narrowed down the problem to this patch, which was duly reverted. There is just one little problem left: the offending patch was, itself, meant to improve scheduler performance. Reverting it fixed the PostgreSQL regression, but at the cost of losing an optimization that improves things for many (arguably most) other workloads. Naturally, that led to a search to figure out what the real problem was so that the optimization could be restored without harmful effects on PostgreSQL.

What went wrong

The kernel's scheduling domains mechanism exists to optimize scheduling decisions by modeling the costs of moving processes between CPUs. Migrating a process from one CPU to a hyperthreaded sibling is nearly free; cache is shared at all levels, so the moved process will not have to spend time repopulating cache with its working set. Moving to another CPU within the same physical package will cost more, but mid-level caches are still shared, so such a move is still much less expensive than moving to another package entirely. The current scheduling code thus tries to keep processes within the same package whenever possible, but it also tries to spread runnable processes across the package's CPUs to maximize throughput.

The problem that the offending patch (by Mike Galbraith) was trying to solve comes from the fact that the number of CPUs built into a single package has been growing over time. Not too long ago, examining every processor within a package in search of an idle CPU for a runnable process was a relatively quick affair. As the number of CPUs in a package increases, the cost of that search increases as well, to the point that it starts to look expensive. The current scheduler's behavior, Mike said at the time, could also result in processes bouncing around the package excessively. The result was less-than-optimal performance.

Mike's solution was to organize CPUs into pairs; each CPU gets one "buddy" CPU. When one CPU wakes a process and needs to find a processor for that process to run on, it examines only the buddy CPU. The process will be placed on either the original CPU or the buddy; the search will go no further than that even if there might be a more lightly loaded CPU elsewhere in the package. The cost of iterating over the entire package is eliminated, process bouncing is reduced, and things run faster. Meanwhile, the scheduler's load balancing code can still be relied upon to distribute the load across the available CPUs in the longer term. Mike reported significant improvements in tbench benchmark results with the patch, and it was quickly accepted for the 3.6 development cycle.
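The trade-off can be illustrated with a toy model in Python. This is not the scheduler's actual code: the function names, the pairing of adjacent CPU numbers, and the representation of load as a count of runnable tasks per CPU are all assumptions.

```python
def make_buddies(cpus):
    """Pair adjacent CPUs (0<->1, 2<->3, ...); an odd CPU out gets no buddy."""
    buddies = {}
    for i in range(0, len(cpus) - 1, 2):
        a, b = cpus[i], cpus[i + 1]
        buddies[a] = b
        buddies[b] = a
    return buddies


def select_full_search(waker, load):
    """Scan the whole package for an idle CPU; return (cpu, cpus examined)."""
    examined = 0
    for cpu in sorted(load):
        examined += 1
        if load[cpu] == 0:
            return cpu, examined
    return waker, examined


def select_buddy(waker, load, buddies):
    """Look only at the waker's buddy; fall back to the waker's own CPU."""
    buddy = buddies.get(waker)
    if buddy is not None and load[buddy] == 0:
        return buddy
    return waker


if __name__ == "__main__":
    load = {0: 2, 1: 1, 2: 0, 3: 0}  # runnable tasks per CPU
    buddies = make_buddies(sorted(load))
    print(select_full_search(0, load))
    print(select_buddy(0, load, buddies))
```

With CPUs 2 and 3 idle, the full search finds an idle CPU at the cost of examining several candidates, while the buddy lookup leaves the task on the busy waking CPU. That missed placement is the kind of case left for the load balancer to fix up later.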

So what is different about PostgreSQL that caused it to slow down in response to this change? It seems to come down to the design of the PostgreSQL server and the fact that it does a certain amount of its own scheduling with user-space spinlocks. Carrying its own spinlock implementation does evidently yield performance benefits for the PostgreSQL project, but it also makes the system more susceptible to problems resulting from scheduler changes in the underlying system. In this case, restricting the set of CPUs on which a newly-woken process can run increases the chance that it will end up preempting another PostgreSQL process. If the new process needs a lock held by the preempted process, it will end up waiting until the preempted process manages to run again, slowing things down. Possibly even worse is that preempting the PostgreSQL dispatcher process (also more likely with Mike's patch) can slow the flow of tasks to all PostgreSQL worker processes; that, too, will hurt performance.

Making things better

What is needed is a way to gain the benefits of Mike's patch without making things worse for PostgreSQL-style loads. One possibility, suggested by Linus, is to try to reduce the cost of searching for an idle CPU instead of eliminating the search outright. It appears that there is some low-hanging fruit in this area, but it is not at all clear that optimizing the search, by itself, will solve the entire problem. Mike's patch eliminates that search cost, but it also reduces movement of processes around the package; a fix that only addresses the first part risks falling short in the end.

Another possibility is to simply increase the scheduling granularity, essentially giving longer time slices to running processes. That will reduce the number of preemptions, making it less likely that PostgreSQL processes will step on each other's toes. Increasing the granularity does, indeed, make things better for the pgbench load. There may be some benefit to be had from messing with the granularity, but it is not without its risks. In particular, increasing the granularity could have an adverse effect on desktop interactivity; there is no shortage of Linux users who would consider that to be a bad trade.

Yet another possibility is to somehow teach the scheduler to recognize processes — like the PostgreSQL dispatcher — that should not be preempted by related processes if it can be avoided. Ingo Molnar suggested investigating this idea:

Add a kernel solution to somehow identify 'central' processes and bias them. Xorg is a similar kind of process, so it would help other workloads as well. That way lie dragons, but might be worth an attempt or two.

The problem, of course, is the dragons. The O(1) scheduler, used by Linux until the Completely Fair Scheduler (CFS) was merged for 2.6.23, had, over time, accumulated no end of heuristics and hacks designed to provide the "right" kind of scheduling for various types of workloads. All these tweaks complicated the scheduler code considerably, making it fragile and difficult to work with — and they didn't even work much of the time. This complexity inspired Con Kolivas's "staircase deadline scheduler" as a much simpler solution to the problem; that work led to the writing (and merging) of CFS.

Naturally, CFS has lost a fair amount of its simplicity since it was merged; contact with the real world tends to do that to scheduling algorithms. But it is still relatively free of workload-specific heuristics. Opening the door to more of them now risks driving the scheduler in a less maintainable, more brittle direction where nothing can be done without a significant chance of creating problems in unpredictable places. It seems unlikely that the development community wants to go there.

A potentially simpler alternative is to let the application itself tell the scheduler that one of its processes is special. PostgreSQL could request that its dispatcher be allowed to run at the expense of one of its own workers, even if the normal scheduling algorithm would dictate otherwise. That approach reduces complexity, but it does so by pushing some of the cost into applications. Getting application developers to accept that cost can be a challenge, especially if they are interested in supporting operating systems other than Linux. As a general rule, it is far better if things just work without the need for manual intervention of this type.

So, in other words, nobody really knows how this problem will be solved at this time. There are several interesting ideas to pursue, but none that seem like an obvious solution. Further research is clearly called for.

One good point in all of this is that the problem was found before the final 3.6 kernel shipped. Performance regressions have a way of hiding, sometimes for years, before they emerge to bite some important workload. Eventually, tools like Linsched may help to find more of these problems early, but we will always be dependent on users who will perform this kind of testing with workloads that matter to them. Without Nikolay's 3.6-rc testing, PostgreSQL users might have had an unpleasant surprise when this kernel was released.

Patches and updates

Kernel trees

Linus Torvalds Linux 3.6 ?

Greg KH Linux 3.5.5 ?

Greg KH Linux 3.4.12 ?

Steven Rostedt 3.2.30-rt45 ?

Greg KH Linux 3.0.44 ?

Architecture-specific

Andi Kleen perf PMU support for Haswell ?

Joerg Roedel Interrupt remapping support for AMD IOMMU ?

Core kernel code

Daniel Santos Generic Red-Black Trees ?

Kent Overstreet Extensible AIO interface ?

Development tools

Jan Kiszka Add gdb python scripts as kernel debugging helpers ?

Device drivers

Zhang Rui Introduce INT33B1 I2C controller driver Aug 24

Arun Murthy modem_shm: U8500 SHaRed Memory driver(SHRM) ?

Vineet.Gupta1@synopsys.com serial/arc-uart: Add new driver ?

Nicholas A. Bellinger target: Reenable buffered FILEIO + add iscsi-target MXDSL logic ?

Roland Stigge gpio: Add a block GPIO API to gpiolib ?

Vineet.Gupta1@synopsys.com serial/arc-uart: Add new driver ?

Alexandra Chin [PATCH] Input: Add new driver into Input Subsystem for Synaptics DS4 touchscreen I2C devices ?

Jon Mason PCI-Express Non-Transparent Bridge Support ?

Tomasz Stanislawski Integration of videobuf2 with DMABUF ?

Filesystems and block I/O

Vivek Goyal Use vdisktime based scheduling logic for cfq queues ?

Jeff Layton audit/getname/estale patch series ?

Memory management

Kirill A. Shutemov Virtual huge zero page ?

John Stultz Volatile Ranges (v7) & Lots of words ?

Networking

Stephen Hemminger vxlan: virtual extensible lan ?

Security-related

Jeff Garzik Add SHA-3 hash algorithm ?

Virtualization and containers

Daniel Kiper xen: Initial kexec/kdump implementation ?

Miscellaneous

Stephen Hemminger iproute2 v3.6.0 ?

Page editor: Jonathan Corbet