Kernel development
Brief items
Kernel release status
The current development kernel is 4.5-rc3, released on February 7. Linus said: "It's slightly bigger than I'd like, but not excessively so (and not unusually so). Most of the patches are pretty small, although the diff is utterly dominated by the (big) removal of a couple of staging rdma drivers that just weren't going anywhere. Those removal patches are 90% of the bulk of the diff."
Stable updates: none have been released in the last week.
Kernel development news
Netconf discussions, part 1
For two days prior to the Netdev 1.1 conference in Seville, Spain, kernel networking developers gathered for Netconf, an informal summit to discuss recent issues face to face and to hash out approaches to take for upcoming work. What follows is a recap of how those discussions progressed on the first day of the event; an account of the second day is forthcoming.

Checksums
The first issue on the table was checksum offloading, in particular with respect to tunneling protocols. Since we first looked at checksum offloading in December, an updated patch set has been posted. Checksums allow the kernel to verify that a packet was unaltered during transit; computing them is a task that it would be nice to offload to hardware. For tunneling there are two checksums to cope with: inner checksums cover a packet encapsulated within another protocol (the tunneled payload) while outer checksums cover the entire packet, including the payload and its wrapper.
In short, Local Checksum Offloading (LCO) allows the kernel to compute the outer checksum of an outgoing packet in software, while offloading the job of calculating the inner checksum to the hardware device. This is a bit of clever sleight-of-hand; it can be done efficiently since the outer checksum is merely the checksum of the outer header—a substantially smaller set of bytes, which the kernel is already processing in memory—added to the inner checksum in one's-complement arithmetic. For incoming packets, most network interfaces are capable of offloading only the outer checksum but, again, the kernel can derive the inner checksum relatively easily by processing only the outer header.
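The arithmetic can be sketched in a few lines; the following is an illustrative user-space rendition in C that glosses over byte order, pseudo-headers, and the final inversion performed by the kernel's real checksum helpers:

```c
#include <stdint.h>

/* Fold a 32-bit accumulator down to 16 bits, one's-complement style. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * The property LCO exploits: the one's-complement sum of a whole
 * packet is the fold of the sums of its parts. Given the sum over
 * the (small) outer header and the inner checksum the hardware will
 * produce, the outer checksum follows without ever walking the
 * payload bytes.
 */
static uint16_t combine(uint16_t outer_header_sum, uint16_t inner_sum)
{
	return fold16((uint32_t)outer_header_sum + inner_sum);
}
```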
Several factors continue to make checksum offloading a convoluted issue from the kernel's perspective, though. As Alex Duyck explained, some newer networking devices are designed to compute the checksum of the innermost recognized packet type. If the inner packet is VXLAN, the device will compute the inner checksum; if not, the device will compute the outer checksum instead. And additional work is required to get checksum offloading to play nicely with TCP segmentation offload (TSO), where the network hardware splits up packets before sending them out over the outgoing interface. Thus, the device performing the hardware checksum computations must be told the proper offsets at which to begin the checksum calculations.
In addition to the implementation details, which at this point seem to be relatively well-understood, the need to improve the documentation of the checksum-offloading features was raised. As Jesse Brandeburg pointed out based on his experiences at Intel, it can be rather difficult for the authors of new device drivers to make sense out of all of the flags in use. Tom Herbert commented that it is probably incorrect to call many of the calculations involved "checksums" to begin with, since they are, in fact, cyclic redundancy checks (CRCs). It does not look like that terminology change, however, should be expected any time soon.
IPv4 containers on IPv6 hosts
Next, Thomas Graf proposed creating a new socket type that would enable an IPv6-only system to support containerized applications that are, internally, IPv4-aware only. The idea would be that a containerized application explicitly asking for an IPv4 address would be given a socket that was transparently bound to an external IPv6 address. Doing so would do away with the need to set up an IPv4 routing table and other overhead—which is required, currently, whenever an application explicitly asks for an IPv4 address on an IPv6 system.
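The application-side pattern at issue is nothing more exotic than the usual IPv4 socket calls, as in the sketch below; the kernel mechanism that would transparently back such a socket with an IPv6 address existed only as an idea at this point:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * A legacy IPv4-only application does this and nothing else; under
 * the proposal, the kernel would map such a socket onto the host's
 * IPv6 connectivity, with no IPv4 routing state in the container.
 */
int legacy_ipv4_connect(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(80);
	inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* example address */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```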
Not everyone was persuaded that the approach was a worthwhile idea; Shrijeet Mukherjee, for example, asked whether setting up a 6to4 tunnel instead would suffice. The end goal, Graf said, was to make containerized applications as lightweight as possible, thus making life simpler for systems that have to cope with ancient containers that are no longer updated (or whose developers refuse to transition to proper IPv6 support). There was enough interest to convince Graf to pursue the idea further, though, and he agreed to work on a patch and submit it for comment.
SCTP
Eric Dumazet asked the group how important it was to support the Stream Control Transmission Protocol (SCTP)—the implication being that SCTP is rarely used and, therefore, not worth expending excess effort on. Networking subsystem maintainer David Miller agreed that it was difficult to know how many users there were of the SCTP code (as is true of other protocols) and floated the idea of asking Linux distributions to gather anonymized usage statistics—with end-user permission—much as they already collect statistics on package installations. If such information were available, he said, it could lead to further investment in networking features; it is difficult to get traction on work like SCTP support unless that work is funded. Anecdotally, many SCTP users seem to use proprietary, out-of-tree SCTP stacks, so this would seem to be an area where there is a need for more investment.
Virtual routing and forwarding
David Ahern presented an update on Virtual Routing and Forwarding (VRF), a feature used to create a virtual layer-3 (IP) routing domain in the network stack. Basic support for IPv4 VRFs was added in kernel 4.3, and IPv6 support in kernel 4.4. Several usability issues remain, however. Creating the routing-table rules required for the VRF is still a cumbersome process. Miller has already rejected a patch that would have allowed a VRF driver to automatically create "simple" routing rules; Ahern noted that there were other possible solutions, such as defining a vrf subcommand for the ip utility.
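Creating a VRF with the iproute2 tools currently looks roughly like the following (device and table identifiers are invented for illustration); the two ip rule lines at the end are the step the rejected patch would have automated:

```sh
# Create a VRF device bound to routing table 10 and enslave eth1 to it
ip link add vrf-blue type vrf table 10
ip link set dev vrf-blue up
ip link set dev eth1 master vrf-blue

# The rules that currently must be added by hand
ip rule add oif vrf-blue table 10
ip rule add iif vrf-blue table 10
```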
A more serious usability problem is that every time a VRF's enslaved network devices (that is, the individual interfaces that are combined into the VRF) are brought down and then back up, they lose their IPv6 addresses. The cycling is required, since it is used to flush the neighbor cache and routes. But, technically, flushing the entire interface is not required—only wiping the cache and route data. Ahern said that a patch was being developed to perform more of a "pause/resume" or "soft down" operation. Miller noted, however, that the problem is not limited to VRF; ultimately one would like to be able to cycle an interface without losing IPv6 addresses. Arriving at a fix for that underlying problem is going to take considerable effort, Miller said, but he encouraged Ahern to proceed with the VRF patch anyway, "and let's see what happens."
Ahern also listed several features missing from VRF, starting with the ability to run tasks in a VRF context. Control groups seem like the right approach for implementing that feature, but Tejun Heo objected to that idea when Ahern sent an RFC in January. Miller noted that Heo has spent quite a bit of time cleaning up control groups and is likely to balk at uses that violate the new model he is moving toward. Miller said he would attempt to smooth the path to acceptance in that regard, but suggested that Ahern look at alternative solutions in case he is unable to persuade Heo.
Ahern also noted that the VRF developers would like more netfilter hooks on transmit and receive paths, and would like to be able to bind a socket to an enslaved device. Finally, he noted that there is a major roadblock to using VRF with switchdev: currently, switchdev disables layer-3 offload on a system if any IP rules are installed in the Forwarding Information Base (FIB). That makes it impossible to use VRF with switchdev's hardware-offloading capabilities. He suggested that the "overly cautious" ban on IP rules be relaxed, perhaps allowing rules for non-hardware (i.e., virtual) ports or allowing the "simple" rules needed for VRF. He conceded, though, that it could be challenging to find a solution.
Header (un)alignment
Herbert then discussed how to approach byte-alignment problems. It is a fundamental issue that the headers on a network packet do not always arrive conveniently aligned on four-byte boundaries as the kernel would prefer, thus decreasing performance. In some cases, the misalignment is entirely predictable—for example, in tunneling, when the outer IP header is stripped off, whatever is inside will be offset by two bytes. While several approaches to alleviating the performance impact have been raised (shifting headers with memmove(), for example), they do not attack the underlying issue head-on.
Miller advocated taking a moderate, fix-the-major-cases approach rather than attempting an attack on the root of the problem that could prove disruptive. On Sparc systems, for instance, unaligned headers trigger scores of kernel warnings. While it might be possible to wrap every access on Sparc in a get_unaligned() macro, he said, that solution would certainly be rejected by Linus. In the end, Duyck posted a short patch that will correct unaligned headers in the Generic Routing Encapsulation (GRE) code, which Herbert signed off on.
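For reference, the macro in question is used along these lines; this fragment is illustrative and is not taken from any of the patches discussed:

```c
#include <linux/types.h>
#include <asm/unaligned.h>

/*
 * Read a 32-bit field from a header that may not be naturally
 * aligned. On x86 this compiles to an ordinary load; on
 * strict-alignment architectures such as Sparc it becomes byte-wise
 * loads, avoiding the unaligned-access trap (and the warnings
 * Miller mentioned).
 */
static u32 read_header_field(const void *hdr, size_t offset)
{
	return get_unaligned((const u32 *)(hdr + offset));
}
```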
Documentation
Jeff Kirsher proposed refreshing the skeleton driver (the "outline" that developers are intended to use when writing a new network device driver), which has not been updated to show new features like offloading. In addition, he suggested that the documentation could be improved in several areas—code comments most obviously, but perhaps more papers as well, like Herbert's 2015 white paper on checksum offloading. He also asked about adding per-queue statistics. Miller did not find the idea appealing, since network interface dumps are "enormous" already. But he noted that perhaps "extended statistics" could be made available with filter bits; there are several other cases where more statistics need to be exposed, such as network interface cards (NICs) that include built-in switches.
TCP performance
Dumazet discussed progress on the new TCP congestion algorithm in use at Google. Though it is working well enough so far, there is concern over the fact that it is not currently possible to tell if a packet was dropped because of a bottleneck or because of the action of a firewall (or some other interfering device) somewhere along the route. He also reported that work was still underway exploring random packet spraying (RPS), which gives up attempting to force all TCP packets in a stream to follow the exact same path. The idea of using RPS was first floated after a paper [PDF] on the subject was published by researchers at Purdue University. The challenge is how to reorder TCP packets that arrive on different paths; using flow labels was discussed briefly, but there may be no simple solution.
Wireless
Johannes Berg gave a brief report on recent work on the wireless side. At present, major effort is being directed at receive side scaling (RSS), he said, splitting up everything that is currently serialized in order to make it parallel. Currently, the effort is focused on implementing RSS on a single TCP stream; the developers would like to support aggregating TCP streams, but real-world experience seems to indicate that most common wireless devices (e.g., mobile phones) will use only one stream at a time anyway.
Miller asked if anything is "getting in the way," to which Berg replied that the wireless developers would like to have more transmit queues, to better support multi-user MIMO. That feature allows a wireless access point to simultaneously stream to different devices via separate antennas (on the same frequency band), which works if the receiving devices are physically far apart, and provides better overall throughput. Modern access points have as many as eight antennas—all sharing a single network interface. The kernel, therefore, needs to be able to manage multiple transmit queues to keep the antennas filled. Work is underway on supporting the feature, although only on Intel hardware so far.
Statistics
Jamal Hadi Salim closed out the first day by proposing a new message type for netlink that could be used to subscribe to periodic status updates. The use case is recording statistics in high volume (say, once every one to five seconds) but, specifically, statistics that are intended to profile system performance. Collecting data in this fashion is different from subscribing to other netlink updates, where one generally wants to hear about an event immediately. There did not seem to be any serious objection to the idea, so a new "stats" message would appear to be a possibility.
With that, the discussion wound down for day one.
[The author would like to thank the Netconf and Netdev organizers for travel assistance to Seville.]
Measuring packet classifier performance
At Netdev 1.1 in Seville, Spain, Jamal Hadi Salim presented the results of recent tests he has done to assess the performance characteristics of the kernel's various network-packet classifiers. While the raw numbers may be of interest to those seeking the best possible performance, the testing process itself revealed other factors of interest to the kernel networking community.
There are a number of reasons to run such tests, he said. First, there have been two new classifiers added in the past year: flower and the extended Berkeley Packet Filter (eBPF) classifier. Naturally, that makes one curious how they compare to the older mechanisms. Second, he had spent several years working with software-defined networking (SDN) running on big application-specific integrated circuit (ASIC) hardware switches, and was interested in seeing how much could be done within the kernel itself. Third, an examination of kernel performance would also serve to counterbalance all the "noise" one hears about the superiority of the Data Plane Development Kit (DPDK) and similar user-space libraries. And, he added, it just sounded like an easy topic for a paper—although that notion turned out to be delusional.
![Jamal Hadi Salim](https://static.lwn.net/images/2016/02-netdev-jamal-sm.jpg)
Constructing a test
In order not to compare apples and oranges, he said, some care has to be taken in designing both the tests and the test system. If the goal is to allow a comparison between the kernel and ASICs, then the actions tested in kernel classifiers need to be limited to those a typical ASIC would perform—no hashing or using masks to group flows. As for testing the new classifiers, a baseline is required, so he chose to profile the "Swiss-army knife of packet filtering," the u32 classifier.
He then spent a few minutes comparing the design and functionality of the various classifiers. "Classic" BPF, he pointed out, was a register-based virtual machine allowing classification rules that branch but, notably, could not include loops. Classic BPF has now been superseded by eBPF, but the prohibition on loops remains. Still, the virtual-machine byte code suggests a natural mapping to hardware CPU instructions, and the recent work to add just-in-time (JIT) compilation provided substantial performance improvements. In fact, they were substantial enough that he dropped all non-JIT eBPF data from his performance tests, as the JIT version was always significantly superior.
The new eBPF classification engine allows one to compile multiple, independent "proglets," which are then loaded into the tc classifier. More importantly, the proglets can be combined to create policy loops within the tc framework.
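For the curious, a minimal proglet is nothing more than a C function compiled for the BPF target; the sketch below follows the general cls_bpf examples rather than any code shown at the event:

```c
#include <linux/bpf.h>

#ifndef __section
#define __section(NAME) __attribute__((section(NAME), used))
#endif

/*
 * A trivial proglet: match every packet. Returning -1 tells cls_bpf
 * to use the flowid given on the tc command line; returning 0 means
 * "no match," and any other value is taken as a class ID directly.
 */
__section("classifier")
int cls_match_all(struct __sk_buff *skb)
{
	return -1;
}

char __license[] __section("license") = "GPL";
```

Such a file would typically be built with clang -O2 -target bpf -c cls.c -o cls.o and attached with something like tc filter add dev eth0 parent ffff: bpf obj cls.o sec classifier flowid 1:1.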
The flower classifier was written by Jiří Pírko and, cleverly, makes use of several "commodity" kernel features. As a packet traverses the stack, the flow it uses is cached. Flower then stores the flow in an rhashtable; subsequent packets matching a classifier rule can quickly be directed into the correct flow by retrieving the cached copy from the rhashtable. Currently, flower supports rules based on just a subset of the possible flow parameters (source and destination address, ingress and egress ports, MAC addresses, etc.), but the design supports extending the set of supported tuples, and that set will most likely continue to grow.
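A minimal flower rule looks like the following (interface and class IDs invented for illustration):

```sh
# Steer HTTP traffic into class 1:1
tc filter add dev eth0 parent ffff: protocol ip flower \
	ip_proto tcp dst_port 80 classid 1:1
```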
The "u" in the name of the u32 classifier stands for "ugly," he said, and that is a more or less apt description. Consequently, few users know u32 well, but it can surprise them when it is well-tuned. The design is centered on a set of hash tables. Each filter rule can direct a matching packet to any of a series of buckets, each of which can then optionally point to another hash table. All flows begin at the root bucket; a user that knows the exact flow a packet should take can script the u32 rules to be extremely efficient. The fact that few do so points to usability problems, not to performance limitations.
For his tests, Hadi Salim chose to measure data-path throughput performance. It is the simplest metric and easy to understand: one fires off a bunch of packets and then counts how many make it through in a given time period. He hopes to continue the work and measure latency as well, but was not able to collect those results in time to present them. All of the classifiers were tested on the same hardware: a quad-core Intel NUC with 16GB of the fastest RAM available (1600MHz). The machine ran kernel net-next 4.4.1-rc1, patched to support flower.
The classifiers were attached at ingress and connected to an egress queue on a dummy network device (to remove any influence of device driver performance). Several "baseline" tests were run to measure the system characteristics at other points (such as dropping all packets at ingress) in order to properly account for the effects of the rest of the system. Then, each classifier was run with a variety of rule sets (from a single rule up to 1000 rules), and across a variety of packet sizes.
Results and observations
After several thousand test runs, he said, the most interesting conclusions he drew were not which classifier had the highest throughput—the throughput winner was u32 in essentially every test permutation—but the unusual performance characteristics observed in the system across the variables. For instance, there was no measurable variance on any of the classifiers: the mean scores were indistinguishable from the maximums and minimums.
More importantly, though, he initially tested classifier rules that forwarded packets, but the performance hit was so substantial that it overshadowed the other factors. The test machine was able to produce a peak throughput of 60Gbps (the maximum throughput being observed with 1020-byte packets) when copying packets to the dummy interface. But when forwarding the packets to a "blackhole" destination IP, the throughput rather mysteriously dropped to 25Gbps.
The lack of variance is probably a good sign from a reliability standpoint, but the forwarding performance suggests that the kernel's forwarding code should get a closer look. The tests also suggested that memory latency is a significant factor in throughput. He said he hopes to find RAM chips with a different latency and re-run the same tests.
As to the actual throughput numbers of the classifiers, which many were interested in seeing plotted together, the "best case" performance test measured the classifiers on a set of rules where the first rule matched every packet. u32 processed around 180Gbps, eBPF 160Gbps, and flower 60Gbps. The "worst case" test measured the classifiers with 1000-rule sets, for which the last rule matched. Under those circumstances, u32 processed 463Mbps, flower 88Mbps, and eBPF 73Mbps. In between, all three classifiers' throughput dropped off at about the same rate as the size of the rule set increased. He noted, though, that the tests always showed flower's worst-case performance scenario, since the test framework forced flower to cache every packet's flow. Its real-world behavior can only be better.
Time ran short on the conference schedule, so he had to skip over the details of several test runs (and never got back to the question of comparing the kernel's performance to ASIC hardware), but Hadi Salim took a few minutes at the end to point out that "throughput" is hardly the only factor worth considering. In particular, the design and execution of the tests provided practical insights along the way. The eBPF classifier is the best option for extensibility, he said, but flower is the clear winner on usability. The flower command-line interface is the most human-friendly option, and it is the easiest classifier to control from an external program. In contrast, time spent scripting the u32 classifier can produce a 4x performance improvement over unscripted usage, but few network administrators seem to regard that as time well spent.
[The author would like to thank the Netconf and Netdev organizers for travel assistance to Seville.]
Writing your own security module
Casey Schaufler started off his linux.conf.au 2016 talk by noting that a prospective security-module author may be told that writing a new module is not necessarily a good idea. They'll hear that we already have a good selection of security modules, or that it can all be done with SELinux anyway and, besides, kernel programming is hard. But, he said, there are indeed good reasons for wanting to write a new security module; he was there to discuss those reasons and give some guidance for those unfamiliar with this part of the kernel.

So why might one want to write a new security module? Our current modules, he said, are showing their age; they follow a design that dates from a time when users actually sat in a computer room to do their work. These modules are a poor fit to the concerns we have now; they were not designed with systems like smartphones in mind. We need to start working on ideas that don't date from the era of paper tape. And, in any case, there are things that cannot be done with SELinux.
It is important to understand what security modules can do. They are a way to add new restrictive controls to operations that a process might try to perform; they cannot replace the existing discretionary access control checks, which will be done anyway. Thus, they offer a new opportunity to say "no" to an operation, but they cannot authorize an action that the user would not have otherwise been able to perform.
There are a few basic rules to bear in mind when contemplating a new security module. The first of those is to avoid duplicating existing modules. If SELinux can do what you need, you're better off joining its community and pushing things in the direction that you need. Another rule is to not rely heavily on user-space helpers. There is a proprietary module out there, he said, that asks user space to make the actual security decisions. This solution is inefficient and provides a hook for proprietary code; either reason would be enough to keep it from ever going upstream. Remember that you're playing in kernel space, so try not to upset the kernel developers; he named Al Viro in particular as one not to inflame.
The most important rule, though, is to plagiarize freely from other security modules; don't reinvent things that already exist and work. When he wrote Smack, he took a lot of things from SELinux. That, he said, is how we do things in the Linux community.
How security modules work
Casey started into the mechanics of security modules by talking about the hooks that they rely on. There are function calls scattered around the kernel that exist to give security modules a chance to make a decision on a specific action; their names all start with "security_". Modules use these hooks to do access checks and perform general data management; it is only necessary to write the hooks that are relevant to the task at hand. Following normal kernel conventions, hooks return zero for success (generally allowing the requested operation); EACCES should be used to indicate a policy denial, while EPERM means that a necessary privilege is missing.
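As an unofficial sketch of what hook registration looks like, the fragment below is modeled on the 4.x-era interfaces (LSM_HOOK_INIT() and security_add_hooks()); hook signatures and registration details vary between kernel versions, and the policy shown is invented purely for illustration:

```c
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/lsm_hooks.h>

/*
 * Illustrative policy: refuse to open any file larger than 1GB.
 * A real module would consult labels or other state instead.
 */
static int example_file_open(struct file *file, const struct cred *cred)
{
	if (i_size_read(file_inode(file)) > 1024 * 1024 * 1024)
		return -EACCES;	/* policy denial */
	return 0;		/* no objection */
}

static struct security_hook_list example_hooks[] = {
	LSM_HOOK_INIT(file_open, example_file_open),
};

static __init int example_init(void)
{
	security_add_hooks(example_hooks, ARRAY_SIZE(example_hooks));
	return 0;
}
security_initcall(example_init);
```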
Many (or most) hooks are object-based, in that they relate to a specific object in kernel space. Hooks often deal with inode structures, for example, since those structures represent files within the kernel. Some hooks, though, work with pathnames instead. Paths are more human-friendly, since that's how people deal with files, but they may not uniquely identify the underlying object (files can have more than one name, for example).
"Security blobs" are data structures attached to objects by a security
module. One will often see the terms secctx, which refers to a
text string associated with a blob, and secid, a 32-bit ID number
for a blob. There are two types of modules, called "major" and "minor,"
with the difference being that major modules use blobs. There can only be
one major module active in the kernel, and it runs after any other
module. There is no mechanism, yet, for sharing the security-blob pointers
between modules; Casey allowed as to how fixing that is an ongoing crusade of his. Minor modules, thus,
have no blobs and don't maintain any per-object state. They run after the
discretionary checks, but before the major module, if one exists.
There are some questions that should be answered before attempting to design a new security module, starting with: what resource is to be protected by the module? If you can't answer that, Casey said, you're not thinking about security. Answers might be files created by a particular user, or specific paths within the filesystem, or one might want to protect specific processes from each other.
That leads to the second question: what is that resource to be protected from? Traditionally, security has been based around users, but now we think about things like malicious apps. So, rather than protecting users from each other, we're now concerned with protecting Facebook's data from Netflix.
Finally, a security-module developer needs to know how that protection will be done. The traditional answer is to simply deny access to unauthorized users, but other approaches are possible. Logging of such attempts is often done, of course. One could consider more wide-ranging changes, such as changing the ownership of a file to match the last user who wrote to it via group access.
Various details
Security modules can make information available under /proc. One should, however, resist the temptation to reuse the attribute names already used by SELinux; a new security module should define its own names.
Objects contain numerous attributes that can be used by a security module; these include ownership, access mode, object types and sizes, etc. Modules can use them any way they like, but they should not change their fundamental meaning. The user ID should not identify the application, as "a certain system" has done. Security modules can make decisions based on pathnames if that works best, though the interface to pathnames inside the kernel is not the most convenient.
With regard to networking, Casey said, there may not be much for a security-module developer to do. Linux has the netfilter subsystem that can make all kinds of access-control decisions; that approach should always be tried first. If a module must be written, there are hooks for various socket operations, and for packet delivery as well. The SO_PEERSEC socket operation can be used to pass security attributes to another process. Working with Unix-domain sockets is easy, he said, because the security module has access to both ends of the connection. Internet sockets are harder, since only one end is available. The CIPSO mechanism can be used to send attributes across a link; support for CALIPSO, which will make similar functionality available under IPv6, is coming.
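On the user-space side, retrieving the peer's attributes is a single getsockopt() call; a minimal illustration:

```c
#include <stdio.h>
#include <sys/socket.h>

/*
 * Print the security context of the peer on a connected Unix-domain
 * socket. The returned string is whatever the active security module
 * provides: an SELinux context, a Smack label, and so on.
 */
static void print_peer_context(int fd)
{
	char ctx[256];
	socklen_t len = sizeof(ctx);

	if (getsockopt(fd, SOL_SOCKET, SO_PEERSEC, ctx, &len) == 0)
		printf("peer context: %.*s\n", (int)len, ctx);
	else
		perror("SO_PEERSEC");
}
```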
Casey suggested that modules probably want to log access denials, to help with policy debugging if nothing else. Helpful stuff can be found in <linux/lsm_audit.h>. The actual data to be logged is up to the module author; various utilities are available for formatting that data in user space.
Non-trivial modules probably need control interfaces; administrators will want to be able to change access rules, look at statistics, and more. Casey advised against adding new ioctl() calls or system calls; instead, a virtual filesystem should be created. A call to sysfs_create_mount_point() makes that easy; he recommended borrowing the relevant code from an existing module.
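That call can be used roughly as follows (the module and mount-point names here are, of course, hypothetical):

```c
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Hypothetical module: reserve /sys/kernel/example_lsm as a place
 * where the module's own control filesystem can be mounted. */
static struct kobject *example_mount;

static int __init example_fs_init(void)
{
	example_mount = sysfs_create_mount_point(kernel_kobj, "example_lsm");
	return IS_ERR(example_mount) ? PTR_ERR(example_mount) : 0;
}
```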
Finally, what about stacking of security modules? At this point, stacking of minor modules is easy and supported, but there can only be one major module at a time, since there is only one security-blob pointer. There is, he said, a way to cheat on systems where the kernel is built from source: simply add the new security blob to the SELinux blob. Work is in progress to allow multiple major modules, but they cannot be supported now.
In the end, Casey said, anybody looking to write a security module should have a good reason for doing it. Some stuff, after all, really does belong in user space. If you write a module, do it properly: provide documentation and support the code. Don't reinvent the wheel; generic security has long since been done, so show us something new instead. Nobody has done a good application resource-management security policy, for example. There is interesting potential around policies tied to the sensors found on current devices. Security, he said, does not have to be dull.
The video of this talk is available on the LCA site.
[Your editor thanks LCA for assisting with his travel expenses.]