Kernel development
Brief items
Kernel release status
The current development kernel is 4.10-rc3, released on January 8. Linus said: "It still feels a bit smaller than a usual rc3, but for the first real rc after the merge window (ie I'd compare it to a regular rc2), it's fairly normal."
Stable updates: 4.9.1, 4.8.16, and 4.4.40 were released on January 6, followed by 4.9.2, 4.8.17, and 4.4.41 on January 9. Note that 4.8.17 is the final 4.8.x update.
The (large) 4.9.3 and 4.4.42 updates are in the review process as of this writing; they can be expected on or after January 12.
Quotes of the week
Kernel development news
Last-minute control-group BPF ABI concerns
One of the features pulled into the mainline during the 4.10 merge window is the ability to attach a BPF program to a control group; that program can then filter packets received or transmitted by processes within the control group. The feature itself is relatively uncontroversial (though some would prefer a different implementation). Until recently, the feature's interface and semantics were also uncontroversial — or at least not closely examined. Since the feature was merged, however, some concerns have been raised. The development community will have to decide whether changes need to be made, or the feature temporarily disabled, before the 4.10 release sets the interface in stone.

The conversation was started by Andy Lutomirski, who played with the new capability for a while and found a few things that worried him. The first of these is that the bpf() system call is used to attach the program to the control group. This is, he thinks, fundamentally a control-group operation, not a BPF operation, so it should be handled through the control-group interface. If, in the future, somebody adds the ability to impose other types of controls — controls that don't involve BPF programs — then the use of bpf() would make no sense. And, in any case, he said, bpf() is an increasingly unwieldy multiplexer system call.
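For reference, an attach with the 4.10-era interface looks roughly like the sketch below. The attach_filter() helper, its error handling, and the choice of ingress filtering are illustrative; only the bpf() command, the bpf_attr fields, and the attach type come from the merged interface.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    /*
     * prog_fd is assumed to refer to a BPF_PROG_TYPE_CGROUP_SKB program
     * loaded earlier with bpf(BPF_PROG_LOAD, ...).
     */
    static int attach_filter(const char *cgroup_path, int prog_fd)
    {
        union bpf_attr attr;
        int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);

        if (cg_fd < 0)
            return -1;

        memset(&attr, 0, sizeof(attr));
        attr.target_fd = cg_fd;         /* the control group */
        attr.attach_bpf_fd = prog_fd;   /* the filter to attach */
        attr.attach_type = BPF_CGROUP_INET_INGRESS;

        /* The attach goes through bpf(), not the cgroup filesystem —
         * which is exactly the interface choice Lutomirski questions. */
        return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }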
This objection didn't get far; there does not seem to be a large contingent of developers interested in adding other packet-filtering mechanisms to control groups. BPF developer Alexei Starovoitov dismissed the idea, suggesting that any other mechanism could be just as easily implemented in BPF. Networking maintainer David Miller agreed with Starovoitov on this issue, so it seems that little is likely to change on this point.
The next issue runs a little deeper. Control groups are hierarchical in nature and, with version 2 of the control-group interface, all controllers are expected to behave in a fully hierarchical manner. The BPF filter mechanism is not a proper controller (a bit of an interface oddity in its own right), but its behavior in control-group hierarchies is still of interest. Controller policies are normally composed as one moves down the hierarchy. For example, if a control group is configured with the CPU controller to have 10% of the available CPU time, then a sub-group of that group is configured to get 50%, it will end up with 50% of the 10% the parent group has, or 5% in absolute terms.
If a process is running in a two-level control group hierarchy, where both levels have filter programs attached, one might think that both filters would be run — that the restrictions imposed by those filters would be additive. But that is not what happens; instead, only the filter program at the lowest level is run, while those at higher levels are ignored. The upper level filter might prohibit certain kinds of traffic, but the mere existence of a lower-level filter overrides that prohibition. In a setting where one administrator is setting filters at all levels, these semantics might not be a problem. But if one wants to set up a system with containers and user namespaces, where containers can add filter programs of their own, this behavior would allow the system-level policy to be circumvented.
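The effect of those semantics can be expressed in a few lines of illustrative pseudo-C; the types and field names here are invented for clarity and are not the kernel's actual data structures.

    struct filter;                    /* an attached BPF program */

    struct cg {
        struct cg *parent;
        struct filter *prog;          /* attached filter, or NULL */
    };

    /*
     * Only the program nearest the process runs; any program higher
     * in the hierarchy, however restrictive, is never consulted.
     */
    static struct filter *effective_filter(struct cg *cg)
    {
        for (; cg; cg = cg->parent)
            if (cg->prog)
                return cg->prog;
        return NULL;
    }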
Starovoitov acknowledged that, at a minimum, there might be a use case for composing all the filters in a given hierarchy. But he also asserted that "the current semantics is fine for what it's designed for" and said that different behavior can be implemented in the future. The problem with that approach is that changing the semantics would be a significant ABI change that could easily break systems that were designed around the 4.10 semantics; such a change would not be allowed. In the absence of a plan for how the new semantics could be added in a compatible way, it has to be assumed that, if 4.10 is released with the current behavior, nobody will be able to change it going forward.
Other developers (Peter Zijlstra and Michal Hocko) have expressed concerns about this behavior as well. Zijlstra asked control-group maintainer Tejun Heo for his thoughts on the matter, but no such thoughts have been forthcoming as of this writing. Starovoitov seems convinced that the current semantics are not problematic, and that they can be changed in some (unspecified) way without breaking compatibility in the future.
Lutomirski's final worry is a bit more nebulous. Until now, control groups have been concerned with resource control; the addition of BPF filters changes the game. These programs could be another way for an attacker to run hostile code; they could, for example, interfere with the input to a setuid program, leading to potential privilege-escalation issues. The programs could also stash useful information where an attacker could find it.
Unfortunately, while seccomp is very, very careful to prevent injection of a privileged victim into a malicious sandbox, the CGROUP_BPF mechanism appears to have no real security model. There is nothing to prevent a program that's in a malicious cgroup from running a setuid binary.
For now, attaching a network filter program is a privileged operation, so exploits are not an immediate concern. But as soon as somebody tries to make it work within user namespaces a whole new can of worms would be opened up. Lutomirski put out a "half-baked proposal" that would prevent the creation of "dangerous" control groups (those that have filter programs attached) unless various conditions were met to prevent privilege escalation issues in the future.
That proposal has not met with a lot of approval. Once again, such restrictions would need to be imposed from the outset to limit the risk of breaking systems in the future; that would imply that this feature would need to be disabled for the 4.10 release. But there seems to be little interest in doing that; while Starovoitov agreed early on that there was work to be done in the security area, he once again said that it could be done at some future point.
That is where the discussion stands, as of this writing. If no action is taken, 4.10 will be released with a new feature despite the existence of concerns about its ABI and security. History has some clear lessons about what can happen when new ABIs are shipped with this kind of unanswered question; indeed, one need not look beyond control groups for examples of the kinds of problems that can be created. Given the probable outcome here, one can only hope that the BPF developers are correct that some way can be found to address the semantic and security issues without creating ABI compatibility problems.
Bulk memory allocation without a new allocator
The kernel faces a number of scalability challenges resulting from the increasing data rates that can be handled by peripherals like storage devices and network interfaces. Often, the key to improved throughput is doing work in batches; in many cases, the overhead of performing a series of related operations is not much higher than for performing a single operation. Memory allocation is one place where batching offers the potential for significant performance improvements, but there has, so far, been no agreement on how that batching should be done. A new patch set from Mel Gorman might just show how this problem can be solved.

Network interfaces tend to require a lot of memory; all those incoming packets have to be put somewhere, after all. But the overhead of allocating that memory is high, to the point that it can limit the maximum throughput of the system as a whole. In response, driver developers are resorting to workarounds like allocating (then splitting up) high-order pages, but high-order page allocation can stress the system as a whole and runs counter to normal kernel development practice. It would be good to have a better way.
At the 2016 Linux Storage, Filesystem, and Memory-Management Summit, networking developer Jesper Dangaard Brouer proposed the creation of a new memory allocator designed from the beginning for batch operations. Drivers could use it to allocate many pages in a single call, thus minimizing the per-page overhead. The memory-management developers at that session understood the problem but disagreed with the idea of creating a new allocator. Doing so, they said, would make the memory-management subsystem less maintainable. Additionally, the new allocator would tend to repeat the mistakes of the existing allocators and, by the time it had all the necessary features, it might not be any faster.
The right solution, from the memory-management perspective, is to modify the existing page allocator, reducing overheads and making it more friendly to multi-page allocations. This has not been done so far for a simple reason: most memory users immediately zero every page they allocate, an operation that is far more expensive than the allocation itself. That zeroing is not necessary for pages that will be overwritten with incoming packet data by a network interface, though, so high-performance networking workloads are more seriously affected by the overhead in the allocator. Fixing that overhead in the existing page allocator would solve the problem for the networking subsystem while avoiding the addition of a new allocator and providing improved performance for all parts of the kernel.
The idea made sense but had one shortcoming: nobody had actually done the work to improve the existing page allocator in this way. That situation has changed, though, with the posting of Gorman's bulk page allocator patch set. The patches are relatively small, but the claimed result is a significant improvement in page-allocator performance.
Two fundamental changes are required to support batched allocations; both take the same form. The first of these addresses the function buffered_rmqueue(), which removes a page from a per-CPU free list in preparation for handing it out in response to an allocation request. Since the list is per-CPU, no locking is required before making changes, but it is necessary to disable interrupts on the relevant CPU to prevent concurrent access from an interrupt handler. Disabling and restoring interrupts takes significant time, and that time adds up if it must be done repeatedly for each page being allocated.
Gorman's patch set splits up this function in a way that is common in kernel programming. A new function (__rmqueue_pcplist()) removes a page from the list but does not concern itself with disabling interrupts; that is expected to be handled by the caller. A call to rmqueue_pcplist() (without the leading underscores) will disable interrupts and allocate the page in the usual way. But now other code can disable interrupts once, then call __rmqueue_pcplist() multiple times to allocate a whole set of pages.
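The shape of that change can be sketched as follows. This is simplified, invented code illustrating the pattern; the function bodies, the rmqueue_batch() caller, and the per_cpu_pages details are not the actual mm/page_alloc.c code.

    /* Lock-free removal from the per-CPU list; the caller must
     * already have disabled interrupts. */
    static struct page *__rmqueue_pcplist(struct per_cpu_pages *pcp)
    {
        struct page *page;

        page = list_first_entry_or_null(&pcp->lists[0], struct page, lru);
        if (page) {
            list_del(&page->lru);
            pcp->count--;
        }
        return page;
    }

    /* The single-page path: one interrupt disable/restore per page. */
    static struct page *rmqueue_pcplist(struct per_cpu_pages *pcp)
    {
        struct page *page;
        unsigned long flags;

        local_irq_save(flags);
        page = __rmqueue_pcplist(pcp);
        local_irq_restore(flags);
        return page;
    }

    /* A batch caller pays the interrupt-disable cost only once. */
    static int rmqueue_batch(struct per_cpu_pages *pcp, int n,
                             struct list_head *list)
    {
        unsigned long flags;
        int i;

        local_irq_save(flags);
        for (i = 0; i < n; i++) {
            struct page *page = __rmqueue_pcplist(pcp);

            if (!page)
                break;
            list_add_tail(&page->lru, list);
        }
        local_irq_restore(flags);
        return i;
    }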
Similarly, __alloc_pages_nodemask() spends a fair amount of time figuring out which zone of memory should be used to satisfy a given request, then returns a page. Here, too, those two operations can be split apart, so that the zone calculation can be reused for multiple page allocations rather than being performed anew for each page.
With these two changes in place, Gorman's patch set can add a new allocation function:
    unsigned long alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
                                   unsigned long nr_pages, struct list_head *list);
This function will attempt to allocate nr_pages pages in an efficient manner, storing them in the given list. The order argument suggests that any size of allocation can be done in bulk but, in the current patch, any order other than zero (single pages) will result in a failure return.
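A hypothetical caller might use the interface along these lines, assuming (as is usual for page lists) that the returned pages are chained through their lru fields:

    /* Allocate a batch of single (order-0) pages, then drain the list. */
    LIST_HEAD(pages);
    struct page *page, *tmp;
    unsigned long allocated;

    allocated = alloc_pages_bulk(GFP_KERNEL, 0, 32, &pages);

    list_for_each_entry_safe(page, tmp, &pages, lru) {
        list_del(&page->lru);
        /* ... hand the page to a receive ring, etc. ... */
    }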
The result of using this interface, he says, is a "roughly 50-60% reduction in the cost of allocating pages". That should help the networking developers in their quest to improve packet throughput rates. They will find that some assembly is required, though; Gorman went as far as to show that the memory-allocator overhead can be reduced, but stopped short of creating an API with all of the features that those developers need. His plan is to merge the preparatory patches without the alloc_pages_bulk() API, with the idea that the actual bulk-allocation API should be designed by the developers who need it. Thus, once these changes find their way into the mainline, it will be up to the networking crew to do something useful with them.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet