Kernel development
Brief items
Kernel release status
The current development kernel is 4.5-rc4, released on Valentine's day, February 14. Linus said: "So in between romancing your significant other, go out and test."
Stable updates: 4.4.2 and 3.14.61 were released on February 17. The 4.3.6 and 3.10.97 updates are in the review process as of this writing; they can be expected at any time. Note that 4.3.6 will be the final update in the 4.3 series.
Kernel development news
An in-kernel file loading interface
One of the many interesting aspects to kernel development is that much of the kernel's functionality is, itself, not available to the kernel. Most system calls are not intended to be called internally. Traditionally, this feature gap has extended to the reading of files from the filesystem, an act which tends to look like the implementation of policy within the kernel and, potentially, opens up security issues; thus it has long been discouraged.
Over time, though, we have seen the introduction of kernel code that does, indeed, read files. The first step in that direction was probably the in-kernel module loader, which replaced the user-space loader back in 2002. The module loader does not actually open files; it depends on user space to hand it a file descriptor corresponding to the module to be loaded. But, given that, it does read the module code directly, perform the necessary symbol resolution, and bind it into the kernel.
The door opened wider when the firmware-loading mechanism was moved in-kernel; in this case, the file containing the firmware is being opened by name from within the kernel. The integrity management architecture code also has to open files, and it seems likely that other uses will show up over time. Since there is no standard way to open and read a file within the kernel, there is a separate implementation for each of these users, each of which does things in its own way.
Mimi Zohar recently decided that it was time to make file reading a first-class supported operation within the kernel; the result is this patch set adding a common file loader. It makes this operation easier to perform, but, as will be seen, it still seems like it's not really meant for common use.
At the lowest level, Mimi's patch set adds a new function to read a file's contents into memory:
    int kernel_read_file(struct file *file, void **buf, loff_t *size,
                         loff_t max_size, enum kernel_read_file_id id);
This function will read the data from the open file indicated by file; up to max_size bytes will be read. It will allocate a buffer (using vmalloc()) to hold the file's contents, storing a pointer in *buf; the caller should free the buffer when it is no longer needed. The actual length of the file will be placed in *size. If the file is larger than max_size, nothing will be allocated or read, and -EFBIG will be returned.
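The contract described above can be modeled in user space. The following is only a sketch under stated assumptions: model_read_file() is a hypothetical name, malloc() stands in for vmalloc(), a stdio stream stands in for struct file, and the id argument is left out. It is not the code from the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * User-space model of the kernel_read_file() contract (hypothetical name).
 * malloc() stands in for vmalloc(); a FILE stream stands in for struct file.
 * Returns 0 on success or a negative errno value.
 */
static int model_read_file(FILE *fp, void **buf, long *size, long max_size)
{
    long len;
    void *data;

    if (fseek(fp, 0, SEEK_END) != 0)
        return -EIO;
    len = ftell(fp);
    if (len < 0)
        return -EIO;
    if (len > max_size)
        return -EFBIG;          /* too large: nothing is allocated or read */
    rewind(fp);

    data = malloc(len ? (size_t)len : 1);
    if (!data)
        return -ENOMEM;
    if (fread(data, 1, (size_t)len, fp) != (size_t)len) {
        free(data);
        return -EIO;
    }

    *buf = data;                /* the caller frees the buffer when done */
    *size = len;                /* actual length of the file */
    return 0;
}
```

The size check happens before any allocation, so an oversized file costs the caller nothing beyond the error return; that mirrors the -EFBIG behavior described above.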
The id argument is, arguably, where the interface (intentionally) loses a bit of generality. It is an enum type meant to indicate the purpose for which the file is being read; the values defined in the patch are READING_KEXEC_IMAGE, READING_KEXEC_INITRAMFS, READING_FIRMWARE, READING_MODULE, and READING_POLICY. The READING_POLICY option appears to be the motivation for the patch set; the IMA code can use it to read the policy and perform signature checking on the policy file. Developers wanting to use this interface will, most likely, have to add their own kernel_read_file_id constant to describe what they are doing.
There are a couple of helpers built on top of kernel_read_file():
    int kernel_read_file_from_path(char *path, void **buf, loff_t *size,
                                   loff_t max_size,
                                   enum kernel_read_file_id id);
    int kernel_read_file_from_fd(int fd, void **buf, loff_t *size,
                                 loff_t max_size, enum kernel_read_file_id id);
As might be expected, the first one opens and reads a file given its pathname, while the second takes an open file descriptor and reads from that.
One advantage to implementing this functionality in a single place is that it becomes possible to apply a uniform security policy in all settings where the kernel tries to read a file. To that end, Mimi's patch set adds two new security hooks (security_kernel_read_file() and security_kernel_post_read_file()) that can pass judgment on file-reading operations. The security_kernel_module_from_file() and security_kernel_fw_from_file() hooks have been removed in favor of the new hooks. This is the purpose of the kernel_read_file_id parameter described above; it is passed to the loaded security module(s) and can be checked by the current security policy.
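To illustrate how a policy might key its decision on the purpose of a read, here is a minimal user-space sketch. The enum values are the ones named in the patch set, but their ordering, the READING_MAX_ID sentinel, the signature_required table, and the model_post_read_check() function are all assumptions made for this example; the real hooks receive the file and its contents and are implemented by the security modules.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values named in the patch set; READING_MAX_ID is added for this sketch. */
enum kernel_read_file_id {
    READING_KEXEC_IMAGE,
    READING_KEXEC_INITRAMFS,
    READING_FIRMWARE,
    READING_MODULE,
    READING_POLICY,
    READING_MAX_ID
};

/* Hypothetical policy: which kinds of read must carry a valid signature. */
static const bool signature_required[READING_MAX_ID] = {
    [READING_MODULE] = true,
    [READING_POLICY] = true,
};

/*
 * Stand-in for a post-read hook: the decision depends on why the file
 * was read, not on who asked for it.
 */
static int model_post_read_check(enum kernel_read_file_id id, bool signature_ok)
{
    if ((int)id < 0 || id >= READING_MAX_ID)
        return -EINVAL;
    if (signature_required[id] && !signature_ok)
        return -EACCES;
    return 0;
}
```

A new user of the interface would, in this model, add its own enum value and a matching table entry, which is the loss of generality noted above.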
This patch set has been through a few revisions and has gotten acknowledgments from a number of the relevant developers. At this point, there would appear to be few obstacles between it and the mainline kernel. So, in the near future, the kernel is likely to have a set of generic functions for opening and reading files, but any future users will have to tell the kernel what they are up to.
Netconf discussions, part 2
On February 8 and 9, the Monday and Tuesday before the Netdev 1.1 conference in Seville, Spain, kernel networking developers gathered for Netconf, an informal roundtable to discuss recent issues face to face and to debate upcoming work. Last week, we covered the discussions that took place on the event's first day; what follows is a recap of how those discussions progressed on the second day.
SR-IOV
First, Alex Duyck raised the issue of supporting single-root I/O virtualization (SR-IOV), which is often employed to share a network interface device between virtual machines (VMs). The kernel has SR-IOV support for several devices, but there is no formal specification that the in-kernel implementations adhere to, which leads to increasing complexity. Intel and Mellanox devices operate differently, he said; in particular, some are capable of learning about the virtual network, while others need the hypervisor to explicitly pass down most configuration information.
There are also newer SR-IOV devices that include an embedded switch, which raises the question of whether all SR-IOV devices should be supported through the switchdev driver model. There is a case to be made that SR-IOV hardware, in general, mediates access between physical and virtual network devices, which is "switch-like," but not everyone is persuaded—Jesse Brandeburg said that Intel, for one, was not sure if it should take the switchdev approach. Networking subsystem maintainer David Miller, however, was strongly in favor of the approach. He noted that switchdev was intended to support a number of abstract models that amount to "packets flowing through non-net_device devices." He also pointed out that using switchdev would provide better netfilter support for SR-IOV.
eBPF
Alexei Starovoitov then provided an update on the extended Berkeley Packet Filter (eBPF). He highlighted, among other things, the BPF Compiler Collection (bcc) toolkit from IO Visor, the ability to attach eBPF programs to tracepoints, and the perf integration. These recent additions to eBPF have opened the door to significantly better tracing from user space, including better profiling of network performance. There is still more to be done in that area, however; he wants to add three or four new tracepoints in strategic places and to add some metadata to struct sock to eliminate the need for another lookup.
He also discussed upcoming changes and ideas for eBPF, starting with the ability to map eBPF maps into user-space memory with mmap(). That should make it possible to avoid locks and write zero-copy eBPF programs. He is also developing a kernel sampling counter that can be set from user space and used to collect periodic statistics. Further out, he would like to add eBPF support for generic segmentation offload (GSO) and generic receive offload (GRO) for UDP-based protocols like QUIC, which he hoped would discourage people from bypassing the kernel to implement new protocols in user space. He is also interested in adding support for bounded loops and vector instructions, he said, but he lacked time to explain the full rationale. He noted, however, that he has heard several use cases for those features, including vector instructions "which I first thought was crazy."
Starovoitov also proposed a different kind of "offload"—offloading the execution of eBPF programs to hardware. He knows of an implementation of eBPF that runs on a field-programmable gate array (FPGA), he said, which is reported to be released soon as open-source hardware. eBPF could also run on dedicated chips; the important thing is that the API remain stable and that the tool chain provide what developers need.
That change would make eBPF programs more like firmware, which several in the group seemed to regard as an unwise move. Miller, in particular, was not a fan of the concept, particularly since the eBPF virtual machine can run arbitrary code. That puts it in a different class than (for example) WiFi adapters, which load firmware designed to implement a well-known API. Jeff Kirsher noted that the debate was similar to the one over how much code should be offloaded from the kernel to switch hardware, and proposed tabling further discussion for now, to take up both topics together further down the line.
Replacing ethtool
Next, Brandeburg proposed writing a modernized replacement for the increasingly outdated ethtool network-interface control utility, most likely written on top of netlink. Because ethtool has numerous limitations, there is general agreement that a replacement is needed, although there are differing ideas about what constitute the critical features. Thomas Graf suggested that every operation should be split into a validation step and an execution step, thus making transactional operations possible. Shrijeet Mukherjee said that operations need to be asynchronous, in order to avoid locking when hundreds of commands are processed in short order—even on "read" commands, which are important to collecting high-quality statistics.
Mukherjee also suggested splitting such a replacement tool into a separate daemon and front-end, although Brandeburg and Graf both felt that such a split might be unwarranted. Miller observed that many people seem to want to query multiple hardware devices at the same time, so multi-device support should probably be made a design goal as well. Graf asked whether that included "multi-SET" commands to update several parameters at once; Miller replied that developers' lives would be much easier if they do not try to support that model.
There was consensus that all of ethtool's existing functions map onto netlink functions, and that ethtool will likely never disappear completely. So a migration to the successor utility seems to be the path forward. Mukherjee noted that Cumulus has written a generic netlink front-end tool, and volunteered to release it as an RFC. Other than that, it may be a while before the plan takes solid form, although Brandeburg did suggest the name "nettool."
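Graf's suggestion of splitting each operation into a validation step and an execution step can be sketched as follows. Nothing here comes from an actual proposal: struct nic_settings, struct nic_op, the value ranges, and run_transaction() are hypothetical names and numbers chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device settings; the fields are illustrative only. */
struct nic_settings {
    int speed_mbps;
    int mtu;
};

/* One requested change: a side-effect-free check plus an apply step. */
struct nic_op {
    int (*validate)(const struct nic_settings *cur, int arg);
    void (*execute)(struct nic_settings *cur, int arg);
    int arg;
};

static int validate_mtu(const struct nic_settings *cur, int arg)
{
    (void)cur;
    return (arg >= 68 && arg <= 9000) ? 0 : -EINVAL;   /* assumed limits */
}

static void execute_mtu(struct nic_settings *cur, int arg)
{
    cur->mtu = arg;
}

static int validate_speed(const struct nic_settings *cur, int arg)
{
    (void)cur;
    return (arg == 1000 || arg == 10000) ? 0 : -EINVAL; /* assumed speeds */
}

static void execute_speed(struct nic_settings *cur, int arg)
{
    cur->speed_mbps = arg;
}

/* Validate every operation first; touch the device only if all pass. */
static int run_transaction(struct nic_settings *cur,
                           const struct nic_op *ops, size_t n)
{
    size_t i;
    int err;

    for (i = 0; i < n; i++) {
        err = ops[i].validate(cur, ops[i].arg);
        if (err)
            return err;         /* nothing has been applied yet */
    }
    for (i = 0; i < n; i++)
        ops[i].execute(cur, ops[i].arg);
    return 0;
}
```

In this simplified model each check sees only the settings as they were before the batch began; operations that depend on one another would need a richer validation step.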
Devlink
Next, Jiří Pírko discussed devlink, a tool he has written to simplify the administration of physical hardware devices that provide more than one network port. Such devices include certain network interface cards (NICs), splitter cables for some newer NICs, and various application-specific integrated circuit (ASIC) switch devices. In each case, the hardware has device-wide capabilities a level above what the net_device interface provides. There is currently no generic solution to managing such devices; many of them also do not map easily to existing configuration tools.
Pírko had already posted an RFC about devlink to the mailing list, so he demonstrated it and fielded questions from around the room. Mukherjee worried that the tool was adding yet another command-line utility to what is already a lengthy list; Miller countered that a similar set of hurdles is already facing users who want to take advantage of the newer features in many WiFi chipsets.
In the end, however, Miller concurred that there is probably a need for some higher-level interface to these devices, and noted that it may overlap with user-space tools desired for switchdev. He also pointed out that the network developers have recognized the need for some higher-level object for quite some time. Known colloquially as "the thing," what this abstraction will eventually become is far from clear. But it is clear that the kernel is having to cope with configuring and managing devices above the "NIC level," so new tools will undoubtedly follow.
Lightweight tunnels and MPLS
Roopa Prabhu then spoke about lightweight tunnel (lwt) support. She noted that there are currently two distinct user classes for tunneling: those that make use of a net_device (such as VXLAN and GENEVE tunnels) and those that do not, instead redirecting packets via Multiprotocol Label Switching (MPLS) or Identifier Locator Addressing (ILA). The latter class constitutes the use case for lwt.
However, the redirection of lwt packets needs optimization. For instance, outgoing MPLS packets are redirected too early, before IP fragmentation is done. Incoming ILA packets, though, are redirected too late, after they are demultiplexed. In both cases, the timing of the redirection hurts throughput. Work is underway to fix the redirection examples cited, but Prabhu also suggested that it might be worth adding additional redirection hooks at other strategically placed points in the lwt-processing pipeline.
In addition, she reported on some other patches still in the works for MPLS. Included are additional statistics reporting, support for MPLS-based VPNs, ping and traceroute support, and hardware offload.
Netlink API
Prabhu then raised some potential improvements that could be made to the netlink API. A chief concern is how to extend the API as new functionality (such as switchdev) is merged into the kernel. Adding new attributes and extending existing attributes is not complicated, but the kernel does not return errors when it encounters an unknown attribute; it ignores them. Thus, users who copy and paste routing examples from kernel documentation end up with silent failures on later kernel releases, when the API has changed.
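The silent-failure problem can be seen in a simplified model of attribute parsing. This sketch does not use the real netlink wire format or kernel parsing helpers; struct model_attr, the attribute numbers, and parse_attrs() are invented for illustration, and the strict flag is an assumption added only for contrast rather than one of the fixes discussed at the meeting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified attribute; real netlink attributes are TLVs in a byte stream. */
struct model_attr {
    uint16_t type;
    uint32_t value;
};

/* Hypothetical attribute numbers known to this "kernel version". */
enum { ATTR_UNSPEC, ATTR_MTU, ATTR_METRIC };

struct route_cfg {
    uint32_t mtu;
    uint32_t metric;
};

/*
 * Lenient mode skips unknown attributes and still reports success, which
 * is the behavior described above. Strict mode rejects the whole request.
 */
static int parse_attrs(const struct model_attr *attrs, size_t n,
                       bool strict, struct route_cfg *cfg)
{
    struct route_cfg tmp = *cfg;    /* commit only if parsing succeeds */
    size_t i;

    for (i = 0; i < n; i++) {
        switch (attrs[i].type) {
        case ATTR_MTU:
            tmp.mtu = attrs[i].value;
            break;
        case ATTR_METRIC:
            tmp.metric = attrs[i].value;
            break;
        default:
            if (strict)
                return -EINVAL;
            break;                  /* silently ignored */
        }
    }
    *cfg = tmp;
    return 0;
}
```

A request carrying an attribute this parser does not know returns 0 in lenient mode, so the sender cannot tell that part of the request had no effect.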
Several potential fixes to this problem were discussed, from providing a "features" bitmask that software could retrieve to examine the attributes available to exporting the entire network hierarchy (perhaps filtered by protocol). Prabhu also suggested writing an official set of guidelines documenting the use of the netlink API. Finally, she reported on some ongoing work; patches should be expected soon to provide IGMP and per–virtual-LAN statistics for bridges, and to add full support for netlink's bridge and bond attributes in iproute2.
Miller then asked whether it would ever be possible to get rid of the netlink mmap() functions, which are widely regarded as a failed experiment. The functionality is rarely, if ever, used, likely because mmap()ing outgoing traffic makes it difficult to verify, although it does make dumps easier. The consensus seemed to be that the feature could be removed, since all user-space code would fall back onto the non-mmap() paths anyway.
APD and control protocols
Mukherjee then discussed Cumulus's work on adding support for ACPI Platform Description (APD) to the kernel. APD is not networking-specific, but it enables a "self-describing" infrastructure that the kernel could use to describe the available network hardware. The most important examples, he said, are the lane maps (that is, which lines are for transmit and which for receive).
He also reported on some work in progress to protect control protocols. Frequently, he said, when a control protocol misses its heartbeat, it will freeze up, which can trigger a cascade effect and "melt down the rest of the network." The team is currently exploring using deadline scheduling, so that if a heartbeat is missed, the kernel will treat it as a "hard miss" and take the network device offline.
Specifications
Miller asked the attendees whether the kernel community should write "NIC specifications" to tell hardware vendors what the kernel wants. Too often, it seems, network-device vendors expect to have easily accessible documents that describe the interfaces and behavior the operating system needs—largely because Microsoft was in the habit of publishing such specifications for Windows. The kernel community, however, has never had such documents, so even as Linux has supplanted Windows as the networking OS of choice, vendors have continued to design products around the Microsoft specifications.
The trick is deciding how the kernel community would write and publish such documents, as they do not fit into the kernel's established development process. But there was sufficient interest in the idea that it may be kicked around further. Herbert, for example, noted that during the recent effort to implement checksum offloading, it became apparent that none of the NIC vendors had heard of the kernel community's strong preference for a "generic" checksum feature rather than protocol-specific checksums. Had that information been better disseminated, perhaps more vendors would have implemented generic checksum offloading.
Netfilter
Pablo Neira Ayuso closed out the second day with a report on Netfilter. He highlighted a long list of enhancements made over the previous year, such as cleanups to bridge-netfilter, the addition of per-namespace netfilter hooks, and support for the new unified control group hierarchy. There have also been many improvements to nftables over the same time period, including the addition of garbage collection, generic packet mangling, a new tracing infrastructure, and the ability to set timeouts.
He also reported on the addition of switchdev support in the nf_tables kernel module, which allows the user to offload network access-control lists (ACLs) to switch hardware. The idea is to provide a netlink front-end for creating rulesets, which are mapped to an intermediate representation (IR); the IR will then be pushed to the switch device driver, which is responsible for converting the IR into the necessary internal representation. The goal is to maintain compatibility between rulesets written for software nf_tables and for devices with hardware offload. Daniel Borkmann asked if the IR could be used to generate eBPF programs; Neira replied that it has been discussed, but that such an effort is not yet underway.
Finally, he discussed some ongoing work. Command autocompletion is being added to the nftables user-space tool, along with support for the new tracing infrastructure. Connection tracking will be added for bridge filtering, and several new features are in the works for the ingress hook (including connection tracking, logging, and queueing). There is also a high-level library under development, although it is still experimental at this stage; Neira said that the project was in no rush to publish it as of yet.
After the last presentation, Miller wrapped up the day by expressing thanks to the organizers and to each of the presenters. The attendees gathered for a group photo, then dispersed to prepare for the coming three days of the Netdev conference.
[The author would like to thank the Netconf and Netdev organizers for travel assistance to Seville.]
The switchdev driver model
Over the years, Linux has grown to a position of dominance in many branches of computing, but there are still niche markets—some of them surprising—where it has yet to make a serious impact. One of those markets is in high-end networking hardware: the managed switches and high-density routers that connect many large-scale networks or data centers. Such hardware is often marketed as "Linux capable" or "built with Linux" but, in reality, has its functionality implemented only in a proprietary blob. In 2015, that situation led the kernel networking developers to adopt a new driver model called switchdev that is aimed at replacing those proprietary blocks with standard kernel interfaces. Early on, it may have seemed like a rather speculative move, but at the Netdev 1.1 conference in Seville, Spain, switchdev was front and center.
The market for the switches in question is dominated by a few companies: Cisco is the largest vendor by a comfortable margin, followed by Arista, Juniper, and several other smaller companies. These appliance vendors often advertise Linux support for their products, but the model they employ is, at best, a small Linux system connected to a proprietary application-specific integrated circuit (ASIC) that actually handles all packet processing in hardware. The Linux system boots to a shell prompt, where proprietary command-line utilities are used to manipulate the switch chip. Furthermore, those utilities are generally built around a binary-blob "software development kit" (SDK) provided by the chip vendor. Thus, for Linux to break through the appliance vendor's lock-in, the chip vendor's lock-in must be broken as well.
The design of switchdev originates in the Open vSwitch project, and was first described by Jiří Pírko in a September 2014 patch. At the Netdev 0.1 conference in February 2015, the networking developers decided to expand and adopt switchdev as a general solution for hardware switch chips and to make a concerted effort to break the binary-blob "SDK" stranglehold.
An overview of switchdev can be found at Documentation/networking/switchdev.txt. Each physical port on a device is registered with the kernel as a net_device, as is done for existing network interface cards (NICs). Ports can be bonded or bridged, tunneled or divided into virtual LANs (VLANs) using the existing tools (such as bridge, ip, and iproute2). The advantage of a switchdev driver is that such switching constructs can be offloaded to the switch hardware. As such, the driver mirrors each entry in the forwarding database (FDB) down to the hardware, and monitors for changes.
The kernel's netlink facility issues a NETDEV_CHANGEUPPER notification whenever there is a modification to the "upper" (that is, software) side of the mapping, so that the driver can make the corresponding adjustment to the "lower" (that is, hardware) side. That allows users to manage the switch hardware with the standard complement of tools. On the flip side, switch hardware is also often capable of learning about the network, such as which MAC addresses are found on which VLAN/port combinations. The driver sends a notification (for example, SWITCHDEV_FDB_ADD when it learns a new {port, MAC, VLAN} tuple), which is then used to update the FDB.
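The hardware-learning direction can be modeled with a small table keyed on MAC address and VLAN. This is a user-space sketch only: struct fdb, fdb_notify(), and the MODEL_FDB_* events are hypothetical stand-ins, not the switchdev notifier API, and the fixed-size array is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 8

/* One learned {port, MAC, VLAN} tuple. */
struct fdb_entry {
    int port;
    uint8_t mac[6];
    uint16_t vlan;
};

struct fdb {
    struct fdb_entry entries[FDB_SIZE];
    size_t count;
};

/* Hypothetical events a driver would raise as the hardware learns. */
enum model_event { MODEL_FDB_ADD, MODEL_FDB_DEL };

static struct fdb_entry *fdb_find(struct fdb *fdb, const uint8_t mac[6],
                                  uint16_t vlan)
{
    size_t i;

    for (i = 0; i < fdb->count; i++)
        if (fdb->entries[i].vlan == vlan &&
            memcmp(fdb->entries[i].mac, mac, 6) == 0)
            return &fdb->entries[i];
    return NULL;
}

/* Software-side handler: keep the FDB in step with what hardware reports. */
static int fdb_notify(struct fdb *fdb, enum model_event ev,
                      const struct fdb_entry *e)
{
    struct fdb_entry *cur = fdb_find(fdb, e->mac, e->vlan);

    switch (ev) {
    case MODEL_FDB_ADD:
        if (cur) {
            cur->port = e->port;    /* station moved to another port */
            return 0;
        }
        if (fdb->count == FDB_SIZE)
            return -1;              /* table full */
        fdb->entries[fdb->count++] = *e;
        return 0;
    case MODEL_FDB_DEL:
        if (!cur)
            return -1;
        *cur = fdb->entries[--fdb->count];  /* fill the hole with the last entry */
        return 0;
    }
    return -1;
}
```

The opposite direction, where a software change is pushed down to the hardware, would be a second handler in the driver; it is omitted here.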
Up until now, much of the emphasis has been on modeling layer-2 (Ethernet) switching in switchdev, but layer-3 (IP) offload is supported as well for devices that perform IP routing. Currently, that support is limited to IPv4, but IPv6 support is in the works.
Initially, the only device supported by switchdev was the "rocker" software switch for QEMU. But, in July 2015, support was introduced for the Mellanox SwitchX-2 ASIC. Mellanox has continued to fund in-kernel driver development since. At the low end of the spectrum, Broadcom BCM53xx switches (typically found in home router hardware) gained switchdev support, too, although the emphasis remained on the large switch devices and the vendors of the chips within.
As our coverage of the 2016 Netconf meetings (part 1, part 2) notes, switchdev is increasingly the layer at which new kernel networking features are being planned—such as single-root I/O virtualization (SR-IOV) or Virtual Routing and Forwarding (VRF). But in-kernel usage of the API and a single chip vendor hardly equates with breaking the existing switch-vendors' logjam.
What is significant, however, is a major telecom carrier throwing its weight behind switchdev. Damascene Joachimpillai from Verizon did just that in his keynote talk at this year's Netdev. Joachimpillai (or "DJ" as he is known around the community) runs the Network Architecture team for the network-applications division inside Verizon, the group that develops and deploys the carrier's own mobile apps, for everything from contacts synchronization to video streaming. The division runs several data centers in the United States, he said, but it is the switches and other "middle boxes" that impose the most significant requirements, rather than the servers.
To that end, he has grown dissatisfied with the user-space network-configuration tools provided by most switch vendors, and is backing switchdev development. Several dozen switches that each require a separate CLI application for management becomes a support nightmare, he said. Primarily, that is because the proprietary CLI tools do not make it easy to automate operations; his division needs to adjust network policies on the fly to adapt to changing traffic. The switch vendors are typically reluctant to make programmatic interfaces to their tools, and even those that do, he said, seem to change those interfaces with each product release.
Joachimpillai took numerous questions from the audience, most of which dealt with what a large carrier wants to see from switchdev and from the kernel in general. In reply, he cited uniformity as a critical feature, in contrast to the ever-changing interfaces of proprietary switch vendors' tools. He also noted that hardware offload was vital, and that Verizon tends to purchase hardware that can be upgraded later in the field. He hopes to choose hardware today, he said, that has good software support for features now that will be hardware-accelerated in future kernels.
An audience member asked whether Joachimpillai's division was doing any kernel development internally, or if using existing knobs was sufficient. He replied that, so far, the team has just been working with outside developers who were already working on projects that it was interested in. The kernel community has been quite responsive to change requests so far, he added, in particular with its recent work on improving IPsec support.
Throughout the rest of Netdev, it was clear that Verizon's endorsement of switchdev was seen as a big win for the project. Whether it prompts any proprietary switch vendors to open up their product lines certainly remains to be seen, although, on the following day, Intel developers did announce that the company would be writing switchdev drivers for its own switch chips.
Together, Mellanox and Intel still only account for a fraction of the overall switch device market, but their migration to switchdev is likely to influence other small vendors in the short term. And, if big customers like Verizon start purchasing switch hardware based on switchdev support, the proprietary switch vendors will surely begin to feel the heat.
[The author would like to thank the Netconf and Netdev organizers for travel assistance to Seville.]
Page editor: Jonathan Corbet
