Kernel development
Brief items
Kernel release status
The current development kernel is 4.10-rc3, released on January 8. Linus said: "It still feels a bit smaller than a usual rc3, but for the first real rc after the merge window (ie I'd compare it to a regular rc2), it's fairly normal."
Stable updates: 4.9.1, 4.8.16, and 4.4.40 were released on January 6, followed by 4.9.2, 4.8.17, and 4.4.41 on January 9. Note that 4.8.17 is the final 4.8.x update.
The (large) 4.9.3 and 4.4.42 updates are in the review process as of this writing; they can be expected on or after January 12.
Quotes of the week
Kernel development news
Last-minute control-group BPF ABI concerns
One of the features pulled into the mainline during the 4.10 merge window is the ability to attach a BPF program to a control group; that program can then filter packets received or transmitted by processes within the control group. The feature itself is relatively uncontroversial (though some would prefer a different implementation). Until recently, the feature's interface and semantics were also uncontroversial — or at least not closely examined. Since the feature was merged, however, some concerns have been raised. The development community will have to decide whether changes need to be made, or the feature temporarily disabled, before the 4.10 release sets the interface in stone.

The conversation was started by Andy Lutomirski, who played with the new capability for a while and found a few things that worried him. The first of these is that the bpf() system call is used to attach the program to the control group. This is, he thinks, fundamentally a control-group operation, not a BPF operation, so it should be handled through the control-group interface. If, in the future, somebody adds the ability to impose other types of controls — controls that don't involve BPF programs — then the use of bpf() would make no sense. And, in any case, he said, bpf() is an increasingly unwieldy multiplexer system call.
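For reference, an attach with the 4.10-era interface looks roughly like the sketch below. The attach_filter() helper, its error handling, and the choice of ingress filtering are illustrative; only the bpf() command, the bpf_attr fields, and the attach type come from the merged interface.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    /*
     * prog_fd is assumed to refer to a BPF_PROG_TYPE_CGROUP_SKB program
     * loaded earlier with bpf(BPF_PROG_LOAD, ...).
     */
    static int attach_filter(const char *cgroup_path, int prog_fd)
    {
        union bpf_attr attr;
        int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);

        if (cg_fd < 0)
            return -1;

        memset(&attr, 0, sizeof(attr));
        attr.target_fd = cg_fd;         /* the control group */
        attr.attach_bpf_fd = prog_fd;   /* the filter to attach */
        attr.attach_type = BPF_CGROUP_INET_INGRESS;

        /* The attach goes through bpf(), not the cgroup filesystem —
         * which is exactly the interface choice Lutomirski questions. */
        return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }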
This objection didn't get far; there does not seem to be a large contingent of developers interested in adding other packet-filtering mechanisms to control groups. BPF developer Alexei Starovoitov dismissed the idea, suggesting that any other mechanism could be just as easily implemented in BPF. Networking maintainer David Miller agreed with Starovoitov on this issue, so it seems that little is likely to change on this point.
The next issue runs a little deeper. Control groups are hierarchical in nature and, with version 2 of the control-group interface, all controllers are expected to behave in a fully hierarchical manner. The BPF filter mechanism is not a proper controller (a bit of an interface oddity in its own right), but its behavior in control-group hierarchies is still of interest. Controller policies are normally composed as one moves down the hierarchy. For example, if a control group is configured with the CPU controller to have 10% of the available CPU time, then a sub-group of that group is configured to get 50%, it will end up with 50% of the 10% the parent group has, or 5% in absolute terms.
If a process is running in a two-level control group hierarchy, where both levels have filter programs attached, one might think that both filters would be run — that the restrictions imposed by those filters would be additive. But that is not what happens; instead, only the filter program at the lowest level is run, while those at higher levels are ignored. The upper level filter might prohibit certain kinds of traffic, but the mere existence of a lower-level filter overrides that prohibition. In a setting where one administrator is setting filters at all levels, these semantics might not be a problem. But if one wants to set up a system with containers and user namespaces, where containers can add filter programs of their own, this behavior would allow the system-level policy to be circumvented.
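The effect of those semantics can be expressed in a few lines of illustrative pseudo-C; the types and field names here are invented for clarity and are not the kernel's actual data structures.

    struct filter;                    /* an attached BPF program */

    struct cg {
        struct cg *parent;
        struct filter *prog;          /* attached filter, or NULL */
    };

    /*
     * Only the program nearest the process runs; any program higher
     * in the hierarchy, however restrictive, is never consulted.
     */
    static struct filter *effective_filter(struct cg *cg)
    {
        for (; cg; cg = cg->parent)
            if (cg->prog)
                return cg->prog;
        return NULL;
    }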
Starovoitov acknowledged that, at a minimum, there might be a use case for composing all the filters in a given hierarchy. But he also asserted that "the current semantics is fine for what it's designed for" and said that different behavior can be implemented in the future. The problem with that approach is that changing the semantics would be a significant ABI change that could easily break systems that were designed around the 4.10 semantics; such a change would not be allowed. In the absence of a plan for how the new semantics could be added in a compatible way, it has to be assumed that, if 4.10 is released with the current behavior, nobody will be able to change it going forward.
Other developers (Peter Zijlstra and Michal Hocko) have expressed concerns about this behavior as well. Zijlstra asked control-group maintainer Tejun Heo for his thoughts on the matter, but no such thoughts have been forthcoming as of this writing. Starovoitov seems convinced that the current semantics are not problematic, and that they can be changed in some (unspecified) way without breaking compatibility in the future.
Lutomirski's final worry is a bit more nebulous. Until now, control groups have been concerned with resource control; the addition of BPF filters changes the game. These programs could be another way for an attacker to run hostile code; they could, for example, interfere with the input to a setuid program, leading to potential privilege-escalation issues. The programs could also stash useful information where an attacker could find it.
Unfortunately, while seccomp is very, very careful to prevent injection of a privileged victim into a malicious sandbox, the CGROUP_BPF mechanism appears to have no real security model. There is nothing to prevent a program that's in a malicious cgroup from running a setuid binary.
For now, attaching a network filter program is a privileged operation, so exploits are not an immediate concern. But as soon as somebody tries to make it work within user namespaces a whole new can of worms would be opened up. Lutomirski put out a "half-baked proposal" that would prevent the creation of "dangerous" control groups (those that have filter programs attached) unless various conditions were met to prevent privilege escalation issues in the future.
That proposal has not met with a lot of approval. Once again, such restrictions would need to be imposed from the outset to limit the risk of breaking systems in the future; that would imply that this feature would need to be disabled for the 4.10 release. But there seems to be little interest in doing that; while Starovoitov agreed early on that there was work to be done in the security area, he once again said that it could be done at some future point.
That is where the discussion stands, as of this writing. If no action is taken, 4.10 will be released with a new feature despite the existence of concerns about its ABI and security. History has some clear lessons about what can happen when new ABIs are shipped with this kind of unanswered question; indeed, one need not look beyond control groups for examples of the kinds of problems that can be created. Given the probable outcome here, one can only hope that the BPF developers are correct that some way can be found to address the semantic and security issues without creating ABI compatibility problems.
Bulk memory allocation without a new allocator
The kernel faces a number of scalability challenges resulting from the increasing data rates that can be handled by peripherals like storage devices and network interfaces. Often, the key to improved throughput is doing work in batches; in many cases, the overhead of performing a series of related operations is not much higher than for performing a single operation. Memory allocation is one place where batching offers the potential for significant performance improvements, but there has, so far, been no agreement on how that batching should be done. A new patch set from Mel Gorman might just show how this problem can be solved.

Network interfaces tend to require a lot of memory; all those incoming packets have to be put somewhere, after all. But the overhead of allocating that memory is high, to the point that it can limit the maximum throughput of the system as a whole. In response, driver developers are resorting to workarounds like allocating (then splitting up) high-order pages, but high-order page allocation can stress the system as a whole and runs counter to normal kernel development practice. It would be good to have a better way.
At the 2016 Linux Storage, Filesystem, and Memory-Management Summit, networking developer Jesper Dangaard Brouer proposed the creation of a new memory allocator designed from the beginning for batch operations. Drivers could use it to allocate many pages in a single call, thus minimizing the per-page overhead. The memory-management developers at that session understood the problem but disagreed with the idea of creating a new allocator. Doing so, they said, would make the memory-management subsystem less maintainable. Additionally, the new allocator would tend to repeat the mistakes of the existing allocators and, by the time it had all the necessary features, it might not be any faster.
The right solution, from the memory-management perspective, is to modify the existing page allocator, reducing overheads and making it more friendly to multi-page allocations. This has not been done so far for a simple reason: most memory users immediately zero every page they allocate, an operation that is far more expensive than the allocation itself. That zeroing is not necessary for pages that will be overwritten with incoming packet data by a network interface, though, so high-performance networking workloads are more seriously affected by the overhead in the allocator. Fixing that overhead in the existing page allocator would solve the problem for the networking subsystem while avoiding the addition of a new allocator and providing improved performance for all parts of the kernel.
The idea made sense but had one shortcoming: nobody had actually done the work to improve the existing page allocator in this way. That situation has changed, though, with the posting of Gorman's bulk page allocator patch set. The patches are relatively small, but the claimed result is a significant improvement in page-allocator performance.
Two fundamental changes are required to support batched allocations; both take the same form. The first of these addresses the function buffered_rmqueue(), which removes a page from a per-CPU free list in preparation for handing it out in response to an allocation request. Since the list is per-CPU, no locking is required before making changes, but it is necessary to disable interrupts on the relevant CPU to prevent concurrent access from an interrupt handler. Disabling and restoring interrupts takes significant time, and that time adds up if it must be done repeatedly for each page being allocated.
Gorman's patch set splits up this function in a way that is common in kernel programming. A new function (__rmqueue_pcplist()) removes a page from the list but does not concern itself with disabling interrupts; that is expected to be handled by the caller. A call to rmqueue_pcplist() (without the leading underscores) will disable interrupts and allocate the page in the usual way. But now other code can disable interrupts once, then call __rmqueue_pcplist() multiple times to allocate a whole set of pages.
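The shape of that change can be sketched as follows. This is simplified, invented code illustrating the pattern; the function bodies, the rmqueue_batch() caller, and the per_cpu_pages details are not the actual mm/page_alloc.c code.

    /* Lock-free removal from the per-CPU list; the caller must
     * already have disabled interrupts. */
    static struct page *__rmqueue_pcplist(struct per_cpu_pages *pcp)
    {
        struct page *page;

        page = list_first_entry_or_null(&pcp->lists[0], struct page, lru);
        if (page) {
            list_del(&page->lru);
            pcp->count--;
        }
        return page;
    }

    /* The single-page path: one interrupt disable/restore per page. */
    static struct page *rmqueue_pcplist(struct per_cpu_pages *pcp)
    {
        struct page *page;
        unsigned long flags;

        local_irq_save(flags);
        page = __rmqueue_pcplist(pcp);
        local_irq_restore(flags);
        return page;
    }

    /* A batch caller pays the interrupt-disable cost only once. */
    static int rmqueue_batch(struct per_cpu_pages *pcp, int n,
                             struct list_head *list)
    {
        unsigned long flags;
        int i;

        local_irq_save(flags);
        for (i = 0; i < n; i++) {
            struct page *page = __rmqueue_pcplist(pcp);

            if (!page)
                break;
            list_add_tail(&page->lru, list);
        }
        local_irq_restore(flags);
        return i;
    }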
Similarly, __alloc_pages_nodemask() spends a fair amount of time figuring out which zone of memory should be used to satisfy a given request, then returns a page. Here, too, those two operations can be split apart, so that the zone calculation can be reused for multiple page allocations rather than being performed anew for each page.
With these two changes in place, Gorman's patch set can add a new allocation function:
    unsigned long alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
                                   unsigned long nr_pages, struct list_head *list);
This function will attempt to allocate nr_pages pages in an efficient manner, storing them in the given list. The order argument suggests that any size of allocation can be done in bulk but, in the current patch, any order other than zero (single pages) will result in a failure return.
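A hypothetical caller might use the interface along these lines, assuming (as is usual for page lists) that the returned pages are chained through their lru fields:

    /* Allocate a batch of single (order-0) pages, then drain the list. */
    LIST_HEAD(pages);
    struct page *page, *tmp;
    unsigned long allocated;

    allocated = alloc_pages_bulk(GFP_KERNEL, 0, 32, &pages);

    list_for_each_entry_safe(page, tmp, &pages, lru) {
        list_del(&page->lru);
        /* ... hand the page to a receive ring, etc. ... */
    }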
The result of using this interface, he says, is a "roughly 50-60% reduction in the cost of allocating pages". That should help the networking developers in their quest to improve packet throughput rates. They will find that some assembly is required, though; Gorman went as far as to show that the memory-allocator overhead can be reduced, but stopped short of creating an API with all of the features that those developers need. His plan is to merge the preparatory patches without the alloc_pages_bulk() API, with the idea that the actual bulk-allocation API should be designed by the developers who need it. Thus, once these changes find their way into the mainline, it will be up to the networking crew to do something useful with them.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet