Netconf 2017, Part 2, day 1 report
The two-day Linux kernel networking development plenary session, called Netconf, was held in Seoul, South Korea, on November 6th and 7th. Covered here is day one of the sessions, attended by 17 developers (one virtually). A subsequent article will cover day two of the event.
Data Structure Shrinkage (David S. Miller)
David discussed the major data structures in Linux kernel networking, how big they are right now, and what can be done to decrease their size.
The main focus was "struct dst_entry" which serves as the base class of all route types in the kernel. For example, ipv4 routes start with a "struct dst_entry" member and then continue with the ipv4 specific portions. Likewise for other protocols.
Several members of dst_entry are actually only used by some, but not all, subclasses. If we start "un-commoning" dst_entry, by moving such members down to where they are actually used, we can shrink some subclasses.
After un-commoning we can rearrange the dst_entry layout such that the special padding used to align the refcnt on a cacheline is no longer necessary.
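The effect of un-commoning can be sketched with a toy layout. The field names and sizes below are hypothetical and heavily simplified; they do not mirror the real "struct dst_entry". The sketch only shows that moving a member used by one subclass out of the base shrinks every other subclass by that member's size.

```python
import ctypes

# Hypothetical, simplified layouts. None of these field names or sizes
# are taken from the real kernel structures.

class DstEntryBefore(ctypes.Structure):
    """Base class that still carries a member only one subclass uses."""
    _fields_ = [
        ("dev", ctypes.c_uint64),                # used by every route type
        ("expires", ctypes.c_uint64),            # used by every route type
        ("v6_only_state", ctypes.c_uint64 * 4),  # used by one subclass only
        ("refcnt", ctypes.c_uint32),
    ]

class DstEntryAfter(ctypes.Structure):
    """Base class after the subclass-specific member has been moved down."""
    _fields_ = [
        ("dev", ctypes.c_uint64),
        ("expires", ctypes.c_uint64),
        ("refcnt", ctypes.c_uint32),
    ]

class V4RouteBefore(ctypes.Structure):
    _fields_ = [("dst", DstEntryBefore), ("gateway", ctypes.c_uint32)]

class V4RouteAfter(ctypes.Structure):
    _fields_ = [("dst", DstEntryAfter), ("gateway", ctypes.c_uint32)]

class V6RouteAfter(ctypes.Structure):
    """The subclass that actually needs the member now owns it."""
    _fields_ = [
        ("dst", DstEntryAfter),
        ("v6_only_state", ctypes.c_uint64 * 4),
        ("gateway", ctypes.c_uint8 * 16),
    ]

# Every v4 route no longer pays for state it never used.
saved = ctypes.sizeof(V4RouteBefore) - ctypes.sizeof(V4RouteAfter)
print("bytes saved per v4 route:", saved)
```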
RFC patches have been posted to the netdev list, and after some TODO items have been addressed it should be possible to upstream these changes.
Geneve Options (Simon Horman)
Simon discussed Geneve tunneling options, and how to represent them both from the user side and the data path.
These options are TLVs and people want to match on structured data. Should we expose the TLVs directly or represent them structurally?
Two kinds of matching might be wanted. First, we want to know if an option appears at all in a packet. There also may be circumstances where the ordering of options matters.
Simon's current game plan is to support simple matching in the initial implementation and then expand it in the future to support order-dependent matches.
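The two kinds of matching can be sketched in userspace as follows. This is not how the kernel or TC represents Geneve options; it is a minimal illustration that assumes the option header layout from the Geneve specification (16-bit option class, 8-bit type, and a 5-bit length counted in 4-byte words). The function names and the example class and type values are made up for illustration.

```python
import struct

def parse_geneve_options(buf):
    """Split a Geneve options blob into (class, type, data) tuples, in wire order."""
    opts = []
    off = 0
    while off < len(buf):
        if len(buf) - off < 4:
            raise ValueError("truncated option header")
        opt_class, opt_type, flags_len = struct.unpack_from("!HBB", buf, off)
        data_len = (flags_len & 0x1F) * 4  # low 5 bits: data length in 4-byte words
        off += 4
        if len(buf) - off < data_len:
            raise ValueError("truncated option data")
        opts.append((opt_class, opt_type, bytes(buf[off:off + data_len])))
        off += data_len
    return opts

def has_option(opts, opt_class, opt_type):
    """Simple match: does the option appear at all?"""
    return any(c == opt_class and t == opt_type for c, t, _ in opts)

def matches_in_order(opts, wanted):
    """Order-dependent match: do the (class, type) pairs in `wanted`
    appear in that relative order? Other options may sit in between."""
    remaining = iter((c, t) for c, t, _ in opts)
    return all(key in remaining for key in wanted)

# Two made-up options: one with 4 bytes of data, one with none.
blob = (struct.pack("!HBB", 0x0102, 0x80, 1) + b"\xde\xad\xbe\xef"
        + struct.pack("!HBB", 0x0103, 0x01, 0))
opts = parse_geneve_options(blob)
```

The simple match only needs has_option(); the order-dependent case is a subsequence check over the same parsed list, which is why it can be layered on later.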
Hardware GRO (Michael Chan)
Michael talked to us next about hardware GRO offload. Many people know what TSO, GSO, software GRO, and even hardware LRO are all about, but fewer are familiar with hardware GRO.
LRO is deficient since you cannot reconstruct the original packet stream from the LRO batched receive packet. Therefore, bridging and routing cannot be used with LRO.
GRO maintains all of the information necessary for packet stream reconstitution. It can be viewed as a stricter implementation of LRO.
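The difference can be illustrated with a toy model that looks only at payload boundaries. Real GRO compares many header fields before merging; the class and function names here are hypothetical. The point is that a GRO-style aggregate records the per-segment size and refuses merges that would lose boundaries, so the original stream can be rebuilt, while an LRO-style merge keeps only the bytes.

```python
from dataclasses import dataclass

@dataclass
class Aggregate:
    payload: bytes
    seg_size: int  # per-segment payload size, kept so the stream can be rebuilt

def lro_like_merge(segments):
    """Concatenate payloads and forget where each packet ended."""
    return b"".join(segments)

def gro_like_merge(segments):
    """Merge only if every segment but the last has the same size, so the
    boundaries stay recoverable from (payload, seg_size) alone."""
    if not segments or len(segments[0]) == 0:
        raise ValueError("need at least one non-empty segment")
    seg_size = len(segments[0])
    for seg in segments[:-1]:
        if len(seg) != seg_size:
            raise ValueError("merge would lose packet boundaries")
    if not 0 < len(segments[-1]) <= seg_size:
        raise ValueError("merge would lose packet boundaries")
    return Aggregate(b"".join(segments), seg_size)

def resegment(agg):
    """Rebuild the original payload sequence, e.g. before forwarding."""
    return [agg.payload[i:i + agg.seg_size]
            for i in range(0, len(agg.payload), agg.seg_size)]

original = [b"a" * 4, b"b" * 4, b"c" * 2]
rebuilt = resegment(gro_like_merge(original))
```

This is why a GRO aggregate can be bridged or routed: it can be split back into the packets that arrived.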
Drivers for chips which can do this in hardware are bnxt_en, bnx2x, and qede.
Michael explained what some of the challenges are with respect to tunneling, in order to support all possible cases.
He also showed some perf profiles when using SW vs. HW GRO. With hardware GRO we are able to execute the GRO stack 1 time rather than N times.
AF_PACKET V4 (John Fastabend)
John discussed the ongoing work on AF_PACKET V4. The main idea is to push raw packets into userspace as fast as possible using zero copy. A primary design component is an abstracted packet ring layout. The driver hooks for this mechanism convert between the virtualized format and their hardware-specific one.
Control plane elements are necessary to map the right flows into the RX queue which is using AF_PACKET V4 zero copy receive. Currently this would be done using TC or flow director.
The user gives a set of buffers to the kernel to use for this new AF_PACKET V4 mode. So one discussion is whether this is better than the existing AF_PACKET modes which allocate buffers in the kernel and then map them into userspace.
John expressed that after looking at several example user applications, and considerations such as the use of huge pages, the user buffer allocation approach turned out to be much more useful.
Some have suggested moving this new facility into a new socket family, called AF_CAPTURE or similar, in order to avoid all of the backwards compatibility and old cruft that has built up in AF_PACKET after all of these years. There was considerable agreement on this proposal.
Another idea is to make sure that even if the driver doesn't support the new virtual ring buffer mechanism, we simulate it in software so that applications can be written to one API and be unaware whether it is accelerated in the driver or not.
Informal discussion on the flow dissector
The two main users of the flow dissector are flow separation and the TC flower offloads.
When the TC flower filter was added, the initial proposed implementation had its own private flow dissector. It provided functionality very similar to the generic flow dissector that already existed in the kernel, along with support for dissecting some more fields.
At the time it seemed prudent to just extend the generic flow dissector for the TC flower filter's needs. The idea was that we could benefit from having shared code.
The TC flower aspects of the flow dissector have grown considerably. The flow dissector's size and complexity are a concern since it is used for fundamental operations like receive packet hash computation and flow separation.
Simon Horman suggested that we separate the paths that do the more TC flower specific tunneling support from the generic bits. TC flower would then make two function calls, one to the generic part and then one for the decapsulation handling.
One could also use eBPF to build custom flow dissection programs.
Updates From Plumbers, BPF as used at Facebook (Alexei Starovoitov)
Alexei first discussed what happened at Plumbers recently regarding tracing infrastructure and eBPF usage.
Please see the LPC2017 tracing microconference summary for more information.
It was suggested that we should have more of a networking presence at Plumbers, and likewise we should seek to have some kind of a tracing presence at netdev/netconf.
Facebook's usage of various eBPF technologies is quite pervasive.
XDP is used for analytics, DDoS protection, firewalling, load balancing and ILA. cls_bpf is used for analytics and shaping. Kprobes, uprobes, and tracepoints are used for analytics and kernel debugging. Lightweight tunnels via cls_bpf are used for ILA. And finally cgroups with bpf are used for resource control, isolation, tcp tuning and SO_REUSEPORT.
Facebook has, through their experience, discovered several pain points. The main issue is the lack of common code and libraries. Much code is copied over and over.
This comes from the lack of a true function call in bpf. BPF tail calls are inadequate for this purpose. The plan is to add real function calls which can be used to build more complicated bpf programs from reusable components and object files.
We can then have static linking of bpf object files together at either compilation time or bpf program load time. One could also create a "BPF library" object. Here, BPF programs can link to BPF libraries already loaded into the kernel. Code can then be physically shared in the running kernel image.
Other features being considered include indirect calls, bounded loops, global variables and arrays, jump tables, and read-only sections.
TCP-BPF and TCP Congestion Control (ex. BBR)
Lawrence Brakmo discussed TCP-BPF which allows tuning existing applications by associating a BPF program with a cgroup. An application placed into such a cgroup will have various TCP sockopt settings adjusted by the bpf program.
This allows wide deployment of application tuning changes without having to modify the applications.
Future work will expand the list of settings that can be modified such as TOS, traffic classes, flow labels, congestion windows, slow-start threshold, round trip times, etc.
One future idea involves supporting new TCP header options using bpf programs. Two entities testing such a new option need only exchange the bpf program, and would not need to change their running kernels.
Another idea is to allow implementing congestion control algorithms using bpf. Our fully pluggable TCP congestion control model should make this quite easy.
Bridge, ECMP, and TC issues (Jamal Hadi Salim)
For TC testing, Jamal would like to see some features such as having the ability to describe tests that have dependencies. Another related topic was using Scapy as part of the test suite and whether we thought this was OK.
A discussion ensued about skipping tests in selftests if a dependency is not present. Generally the attendees agreed that this wasn't acceptable. Skipped tests are a missed opportunity.
Next he showed Arachne, a framework for constructing large Clos networks in software. This allows running tests on large scale switched racks. The tools create a topology, display the output, and then push configurations to the nodes. The network is deployed and you can add TC rules on startup, run the test, and then collect the results.
Some minor kernel and tool issues were encountered, each of which seemed relatively easy to fix. For example, they hit the limit on the number of ports possible on a Linux bridge.
Finally, there was a discussion about allowing iproute2 to support a per-ns UTS name for dhcp resolution.
TCP Issues (Eric Dumazet)
At 100Gbit speeds with large round trip times we hit very serious scalability issues in the TCP stack.
In particular, the sender's retransmit queue performs quite poorly. It has always been a simple linked list.
With a gigabyte of in-flight data, it doesn't work.
To solve this, Eric reimplemented TCP's retransmit queue using an rbtree. For small windows this is, of course, slightly slower than the simple linked list but no reasonable alternative seems to exist.
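The scaling difference can be sketched in userspace. Python has no rbtree in its standard library, so the sketch below substitutes binary search over a sorted list for the tree lookup; both are logarithmic, but this is a stand-in and not the kernel implementation. The segment size, queue length, and function names are assumptions chosen for illustration.

```python
import bisect

def find_linear(starts, seq):
    """Walk the queue from the head, as a linked list forces us to.
    Returns (index of the segment covering seq, nodes visited)."""
    steps = 0
    found = -1
    for i, start in enumerate(starts):
        steps += 1
        if start > seq:
            break
        found = i
    return found, steps

def find_sorted(starts, seq):
    """Logarithmic lookup of the last segment starting at or before seq
    (stand-in for an rbtree descent)."""
    return bisect.bisect_right(starts, seq) - 1

MSS = 1448                                    # assumed payload bytes per segment
starts = [i * MSS for i in range(100_000)]    # starting sequence of each queued segment
target = starts[-1] + 10                      # a byte inside the last segment

linear_index, linear_steps = find_linear(starts, target)
sorted_index = find_sorted(starts, target)
print("nodes visited by the linear walk:", linear_steps)
```

With a gigabyte in flight the queue holds several hundred thousand segments of this size, so a walk from the head for each lookup near the tail is what stops scaling.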
Next, on the receiver, we have spikes in cpu cost during loss. If we have a one gigabyte window and a packet is lost, we accumulate an enormous amount of work when the retransmit arrives to fill in that hole.
Suddenly, we have nearly two gigabytes of data to pass down to the user, along with cleaning up all of the packet metadata.
Eric's idea is to allow the user to register pinned memory in userspace for the kernel where out-of-order receive data can be placed.
If the kernel can place out-of-order data in userspace, it can liberate the kernel side copy immediately, amortizing the cost more smoothly over time.
XDP + Performance (Jesper Dangaard Brouer)
Jesper primarily focused on XDP developments that have occurred since his presentation at netdev 2.1 in Montreal.
One major change is that we've basically killed the "1-page per packet" restriction. A consequence of this is we are now tied to a refcnt based page model.
Next, XDP programs can redirect packets using the XDP_REDIRECT action to "other places". These are controlled by the bpf MAP which the XDP_REDIRECT program uses. Currently network devices and cpus are supported.
A device XDP_REDIRECT sends the packet to another network device.
A "cpumap" XDP_REDIRECT sends the packet to another cpu. This causes the SKB metadata for the packet to be allocated on the remote cpu. And, indirectly, we get an effective RX bulking mechanism.
Jesper was disappointed that XDP_REDIRECT has limited driver support right now. However, there is hope that XDP_REDIRECT, since it is so extensible via bpf MAP types, could be the last XDP action we need to ever add.
Future improvement involves extending XDP_REDIRECT to support redirecting multiple packets at a time. It currently takes a pointer to a single XDP buffer.
GRO support is also planned for cpumap redirects. Jesper feels that redirects will perform better than RPS/RFS.
