Netconf 2018 Day 2
This article covers day two of Netconf, the plenary sessions for Linux kernel networking, which was held in Boston, Massachusetts. (For coverage of day one of the sessions, see this article.)

====================

Being Less Indirect (David S. Miller)

David presented a brief overview of Spectre and the implications it has for the kernel. The networking subsystem is particularly vulnerable because, by design, there are indirect function calls all over the data paths for protocol demux and for running filters or rules.

A side effect of Spectre is that optimization schemes which proved not so effective when tried in the past are now open for a full re-evaluation. For example, various methods of batching which did not give clear performance gains should be looked into once more. We can also use eBPF to translate filtering rules into straight-line code paths. This linearization eliminates all of the indirect calls except the one necessary to call into the eBPF code stream itself.

Some side topics were discussed next, the first being SKB lists. The SKB metadata is one of the few core data structures in the kernel that implements its own linked-list handling. Different parts of the networking stack use the list's next and prev pointers in different ways, which complicates any conversion over to the kernel's generic list_head. The main offender is GRO, and Eric Dumazet said that he planned to change that singly linked list scheme into a per-NAPI hash table. This would significantly ease the conversion over to generic list handling.

The various robots we have operating on the kernel tree were discussed next. From the kbuild robot onwards to technologies like syzbot, we have a lot of automated critters building and testing our code. Is this, overall, a good thing? The verdict is not entirely clear. Kbuild obviously provides value: it can detect build problems before they propagate too deeply into permanently committed changes. Early on, kbuild had trouble figuring out exactly which tree a set of changes applied to, but it is much better at this now.

Syzbot tends to have a roller-coaster pattern of activity. Every time it learns how to explore a new area of the kernel, it finds a lot of new bugs. Developers scramble to fix the onslaught of new bug reports, and once that initial peak subsides things are quiet for a while. This sequence repeats over and over again. The worry is that this kind of tool can become a crutch: people can just decide to run syzbot in order to test a set of new changes, and there could be less and less human, by-hand auditing of the sources. Such by-hand auditing is an effective way to find problems and to become more familiar with the code base itself.

David really likes the number of tests being added for networking under selftests, and he hopes the trend continues. Perhaps at some point we can require new tests for any set of changes that adds a new feature to the networking stack.

====================

TC Flower Tunneling (Simon Horman)

Simon began by explaining how encapsulation key dissection works. This is a sub-part of the generic networking flow dissector and is currently only used by TC flower. The key point is that this facility allows differentiation between inner and outer header keys. Most of the networking stack dissects on the inner header before decapsulation, using skb_flow_dissect(). Encapsulation keys, on the other hand, are seeded from the outer header during decapsulation, using skb_flow_dissect_tunnel_info().
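As a rough illustration of what that separation buys, here is a toy view of a flower-style key that carries both inner and encapsulation fields. The struct below is invented for this article; only the FLOW_DISSECTOR_KEY_* names mentioned in the comments are real kernel identifiers.

    #include <stdint.h>

    /*
     * Toy model for illustration only; these are not the kernel's
     * structures.  The FLOW_DISSECTOR_KEY_* names in the comments are
     * the real keys from include/net/flow_dissector.h.
     */
    struct toy_ipv4_addrs { uint32_t src, dst; };
    struct toy_ports      { uint16_t src, dst; };

    struct toy_flower_key {
            /* Inner headers: what skb_flow_dissect() extracts. */
            struct toy_ipv4_addrs ip;         /* FLOW_DISSECTOR_KEY_IPV4_ADDRS */
            struct toy_ports      tp;         /* FLOW_DISSECTOR_KEY_PORTS */

            /* Outer (encapsulation) headers: seeded from the tunnel
             * metadata by skb_flow_dissect_tunnel_info(). */
            struct toy_ipv4_addrs enc_ip;     /* FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS */
            struct toy_ports      enc_tp;     /* FLOW_DISSECTOR_KEY_ENC_PORTS */
            uint32_t              enc_keyid;  /* FLOW_DISSECTOR_KEY_ENC_KEYID (e.g. a VNI) */
    };

A rule that matches on enc_ip and enc_keyid while leaving the inner fields wildcarded effectively classifies on the tunnel itself; a rule that matches only the inner fields ignores how the packet arrived.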
One main limitation of the current encapsulation keys is that they key on addresses and ports. This basically limits them to UDP-based encapsulations (VXLAN, Geneve, and so on), and further limits them to the RFC-defined protocol ports, because the port in the key is what is used to identify the tunnel type. Simon proposed adding an encapsulation type field to the encapsulation key. This would make the type of tunnel explicit, allow non-UDP-based tunneling, and also allow UDP tunnels to run on non-standard ports.

On the topic of matching on Geneve tunnel options, it is not exactly clear what the right way to expose a configuration mechanism for this would be. Do we accept raw TLV blobs? At the very least we should validate the TLVs, so that we only accept Geneve TLVs for Geneve tunnels. That works for setting, but for matching, TC flower needs support added for maskable TLVs. Also, what about ordering? Does TLV matching require exact ordering, or is any order allowed?

====================

Who Fears the Spectres? (Paolo Abeni)

Paolo provided graphs of UDP small-packet receive performance in several scenarios: with no mitigations, with PTI enabled, with just retpolines, and with both PTI and retpolines. Generally, PTI has the largest effect, but retpolines give a noticeable hit as well.

Some perf output snapshots were shown, along with how each of the mitigations changes what we end up seeing in perf. PTI causes the syscall entry/exit to show up clearly. The retpoline effect is less clear: what we see at the top of the perf output is not exactly an indirect call, but rather some point right after that indirect call.

Another test, this time using pktgen, was shown. Here, since there are essentially no transitions between the kernel and userspace, the PTI effect is basically zero, but we can clearly see a performance hit from retpolines.

Some suggestions for fighting these problems followed. Batching was mentioned first. We already have batching in GRO, GSO, and qdisc dequeue, but these don't really trigger for forwarding workloads. Furthermore, we only have partial support for segmentation offloading of UDP.

A lot of our indirect calls in the kernel are actually not really indirect. An interesting case is skb->destructor, which only ever takes on one of a few values. We could instead encode an integer and make a direct call to the desired destructor based on it. Eric Dumazet mentioned that there have been experiments to formalize this kind of technique; for example, a macro at the protocol demux points that expands to a series of tests on the protocol value, all of which lead to direct invocations of that protocol's packet input routine.
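A minimal sketch of that idea, with invented names standing in for the real destructors, might look like this:

    /*
     * Sketch only, not kernel code.  Instead of storing a function
     * pointer and calling through it (which now costs a retpoline),
     * store a small integer identifying the destructor and dispatch
     * with a switch, so the calls become direct and predictable.
     */
    enum toy_destructor_id {
            TOY_DTOR_NONE,
            TOY_DTOR_SOCK_WFREE,
            TOY_DTOR_SOCK_RFREE,
    };

    struct toy_skb {
            enum toy_destructor_id destructor_id;
            /* ... packet data and metadata ... */
    };

    static void toy_sock_wfree(struct toy_skb *skb) { (void)skb; /* release TX accounting */ }
    static void toy_sock_rfree(struct toy_skb *skb) { (void)skb; /* release RX accounting */ }

    static void toy_skb_run_destructor(struct toy_skb *skb)
    {
            switch (skb->destructor_id) {
            case TOY_DTOR_SOCK_WFREE:
                    toy_sock_wfree(skb);            /* direct call, no retpoline */
                    break;
            case TOY_DTOR_SOCK_RFREE:
                    toy_sock_rfree(skb);            /* direct call, no retpoline */
                    break;
            case TOY_DTOR_NONE:
                    break;
            }
    }

Eric's demux macro applies the same idea at the protocol dispatch points: test for a handful of likely protocols explicitly and fall back to the indirect call only for everything else.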
Three more ideas were brought up. First, it may be possible to move a lot of our COMPAT-layer code into a new usermode helper (UMH) of some sort. Next, Paolo suggested reviving an old idea of Eric Dumazet's: perform remote SKB freeing (always free the SKB on the CPU the packet came from). Finally, he asked whether early socket demux for unconnected sockets is still worth pursuing.

There was also a discussion about TCP regressions due to how skb_scrub_packet() works. One side effect of this operation is that it NULLs out the SKB's socket pointer, which in turn causes us to lose the TX queue mapping in use by that socket; this breaks flow separation completely. After some discussion of various ways to deal with the problem, it was generally agreed that we can preserve the socket pointer during a scrub operation.

====================

TLS, Crypto, and ULPs (Dave Watson)

Dave discussed the current state of the in-kernel TLS layer. Right now we have full software support for RX and TX; the Chelsio driver can offload both TX and RX, and the Mellanox driver can offload TX. A basic design principle of the TLS layer is that the handshake is still done in userspace.

The challenges of zerocopy were discussed next, and how this all fits together when one tries to use kTLS over NBD; in particular, NBD needs larger buffer sizes to achieve zerocopy.

kTLS currently only supports 128-bit keys, and Dave would like to remove that limitation soon. TLS 1.3 support is also on the way, and it encompasses several things. One nice aspect of TLS 1.3 is that rekeying has been removed; rekeying is tricky to support properly in this multi-modal design.

Next, the limitations and problems of using the strparser layer for kTLS were listed. In particular, its handling of SKB sharing needs to be improved: it uses skb_clone(), and this interacts badly with skb_cow_data() and friends.

We have several implementations of the relevant crypto algorithms, all with different performance characteristics. The C version supports all key sizes and full scatter-gather, but is the slowest. The AVX and SSE versions are faster, but support only certain key sizes or lack scatter-gather.

The next issue in the crypto layer is that FPU usage is inefficient. A round of crypto using SSE/AVX starts by loading the key material into the FPU registers; if we are doing multiple rounds, we should only do this key load once. However, that is not what happens currently; the keying material is loaded every single time around.

ULPs (upper-layer protocols) are a hot topic of conversation because we'd like to use them in new and interesting ways. The current users of ULPs are kTLS and BPF. ULPs operate by overloading the proto and socket operations so that they can sit in the middle. This works fine when only one ULP is active, but once you try to run several at a time it gets messy. For some callbacks you want to invoke the ULPs starting from the "topmost" one and working down; in other situations you want the exact opposite. Also, once multiple ULPs are stacked together, how do you unload one of them?
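For reference, a ULP is attached with the TCP_ULP socket option. Following the kernel's TLS documentation, handing an already-established connection's transmit path to kTLS looks roughly like this (a sketch: AES-GCM-128 with TLS 1.2, the only combination supported at the time, with error handling trimmed; the fallback #defines are only needed if the userspace headers are older):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <linux/tls.h>

    #ifndef TCP_ULP
    #define TCP_ULP 31          /* attach an upper-layer protocol to a socket */
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282         /* setsockopt() level used by the TLS ULP */
    #endif

    /*
     * The TLS handshake has already been done in userspace; iv, key, salt,
     * and rec_seq are the negotiated values for the write direction.
     */
    static int enable_ktls_tx(int sock, const unsigned char *iv,
                              const unsigned char *key,
                              const unsigned char *salt,
                              const unsigned char *rec_seq)
    {
            struct tls12_crypto_info_aes_gcm_128 ci;

            /* Step 1: attach the "tls" ULP, which overrides the socket's ops. */
            if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
                    return -1;

            /* Step 2: hand the keying material for the TX direction to the kernel. */
            memset(&ci, 0, sizeof(ci));
            ci.info.version = TLS_1_2_VERSION;
            ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
            memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
            memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
            memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
            memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

            return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }

After this, plain write() and sendfile() calls on the socket produce TLS records. The stacking problem described above shows up as soon as a second ULP wants to interpose on the same proto operations.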
====================

TC changes and "ethlink" (Jiri Pirko)

Jiri first discussed a feature he is working on called chain templates. One limitation of hardware offloading is that the driver can't assume anything about the width of the key masks that will arrive in future operations, so it may allocate more space in the chip's tables than is really necessary. Chain templates aim to give the driver a better idea of how wide such values can ever be; it is all about using hardware tables, which are a scarce resource, in a more optimal way.

Vlad Buslov has been working to increase the rate of TC rule insertion. Inserts currently cannot run in parallel; they are all serialized by the RTNL lock. The idea is to elide RTNL locking and use reference counting and smaller locks for more fine-grained control. The initial work will target TC actions and the flower classifier; future work will involve batching of changes.

An overview of devlink was given, as well as some discussion of the ongoing work by Moshe Shemesh to add configuration parameters. Different kinds of parameter changes can require that the hardware be reset in different ways for the change to take effect, and this concept is built directly into the new facility: each changeable parameter carries a value describing what kind of reset is necessary, and when changing a parameter the user asks explicitly for one of those reset types.

Alex Vesker has been working on support for resources in devlink; this allows the user to operate on memory regions that are usually part of the firmware image. Users can snapshot, dump, and read these regions.

The final topic of conversation here was "ethlink", which is Jiri's name for exposing ethtool operations via netlink. He believes it would be a mistake to place this into devlink: devlink is for performing operations on objects which are not netdevs, whereas ethtool is firmly netdev-based. A high-level design, as well as a migration strategy, was provided. As a first step, we could implement the netlink attributes for ethlink and start with code that simply emits netlink events when ethtool operations are performed.

====================

RX Batching, GRO, Megaflow merging, ARFS, BPF Verifier (Edward Cree)

Edward touched upon a variety of topics, starting with RX batching. This is a patch series he proposed a while ago which, at the time, received a lukewarm reception; he hopes that with Spectre it may now gain more traction. The idea is to gather packets into a list at NAPI poll time and walk them through the networking stack as a group, even if they are somewhat unrelated. He went on to describe the listification algorithm he uses and how it leverages the existing networking stack: it is simply hooking up the existing datapath in a new way. Plans are underway to rebase his changes, clean them up, and get more recent performance numbers.

Another experimental idea is to extend this concept to XDP: the verifier emits a prologue and epilogue that transparently make an XDP program capable of processing a list of packets instead of just one.
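A toy sketch of the listified-receive idea described above (invented names, not the kernel's NAPI API) looks something like this:

    #include <stddef.h>

    /*
     * Toy sketch of listified receive; invented names, not the kernel API.
     * Gather everything available at poll time into a list first, then hand
     * the whole batch to the stack in one call, amortizing per-packet costs
     * such as (retpolined) indirect calls and cache misses.
     */
    struct toy_skb {
            struct toy_skb *next;
            /* ... packet data and metadata ... */
    };

    /* Stand-ins for the driver's ring dequeue and the stack's entry point. */
    struct toy_skb *toy_driver_poll_one(void *rxq);
    void toy_stack_receive_list(struct toy_skb *list);

    static int toy_napi_poll(void *rxq, int budget)
    {
            struct toy_skb *head = NULL, **tail = &head;
            int work = 0;

            /* Phase 1: pull up to 'budget' packets off the RX ring. */
            while (work < budget) {
                    struct toy_skb *skb = toy_driver_poll_one(rxq);

                    if (!skb)
                            break;
                    *tail = skb;
                    tail = &skb->next;
                    work++;
            }
            *tail = NULL;

            /* Phase 2: let the stack walk the whole batch at once. */
            if (head)
                    toy_stack_receive_list(head);

            return work;
    }

The interesting part is what toy_stack_receive_list() would do: each layer processes the entire list before passing it on to the next, so per-layer costs such as indirect calls are paid once per batch rather than once per packet.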
Megaflow merging is a scheme which can hopefully improve the performance of OVS/flower, at least for certain workloads. The trick is finding the optimal set of flows to merge. Edward has some test code in Python which tries to achieve this in various ways; it is still very early, experimental work, but the initial results are promising.

Loose GRO was the next topic. LRO is generally frowned upon because, unlike GRO, it doesn't maintain enough state to allow emitting exactly the original packet stream. LRO can therefore coalesce in some situations where GRO can't, so at least theoretically it might be faster in some circumstances. Although this is unproven, Edward said that many customers nevertheless think they need LRO. The proposal was to allow less strict, LRO-style behavior in GRO when some option or sysctl is set. In reality, we only need to respect the resegmentability of the packet stream when we are forwarding, so perhaps that could act as the key to control the behavior. Edward is interested in upstreaming something like this, as it would allow him to remove a bit of code from the Solarflare driver. What to key on was discussed some more, and one idea from Eric Dumazet is to perform early demux of sockets even earlier than we do now: if an SKB maps to a local socket, we know we won't be forwarding the packet, so we can safely do loose GRO in that case.

ARFS flapping can occur when interrupt affinities are misconfigured. Edward didn't initially realize this was the cause, so he implemented code to avoid the flapping explicitly. In the end, he thinks that interrupt affinities should simply be set properly, so his changes probably aren't very useful anymore.

Recently there has been some disagreement on how to proceed in making the eBPF verifier more powerful. Edward has one set of approaches he would like to use, and Alexei Starovoitov has another. These disagreements have been fermenting on the mailing lists for some time, so it was great to have both of them in the room to discuss the differences and try to work things out. One of Edward's chief complaints seems to be the hard push to stick to traditional, tested userspace compiler algorithms for this work; he would like more freedom to explore more inventive approaches. Alexei, for his part, is very concerned about the verifier in the long term: he would like to see it handle a one-million-instruction program efficiently. Because of new developments in the pipeline (bounded loops, real function calls, libraries), he thinks getting this right is a life-or-death situation for BPF, which is why he wants to start with tried-and-tested mechanisms for data-flow analysis.

====================

SCTP offload and tunnel ICMP handling (Xin Long)

Xin gave a detailed explanation of how SCTP chunking works. This is important to understand because it has a huge impact on how GRO and GSO need to work. Unlike a normal stream protocol like TCP, SCTP can transport multiple data streams at once. These are managed by the stream scheduler and packaged into SCTP frames using what are called chunks. So the data stream is not just the SCTP payload; there is a further level of header parsing and demuxing that must occur, and it is this which complicates a GSO implementation considerably.
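To give a feel for that extra level, here is a simplified view of the wire format; the kernel's real definitions live in include/linux/sctp.h, and the structures and walker below are only for illustration.

    #include <stdint.h>
    #include <stddef.h>
    #include <arpa/inet.h>

    /* Simplified SCTP wire format (see RFC 4960); illustration only. */
    struct toy_sctp_hdr {
            uint16_t source;        /* source port */
            uint16_t dest;          /* destination port */
            uint32_t vtag;          /* verification tag */
            uint32_t checksum;      /* CRC32c */
    };

    struct toy_sctp_chunk_hdr {
            uint8_t  type;          /* DATA, SACK, HEARTBEAT, ... */
            uint8_t  flags;
            uint16_t length;        /* includes this header; network byte order */
    };

    /*
     * One SCTP packet can bundle several chunks, and DATA chunks may belong
     * to different streams.  This is why GSO/GRO cannot simply slice the
     * payload at MSS-sized boundaries the way they can for TCP; they have
     * to respect chunk boundaries.
     */
    static void toy_walk_chunks(const uint8_t *pkt, size_t len,
                                void (*cb)(const struct toy_sctp_chunk_hdr *))
    {
            size_t off = sizeof(struct toy_sctp_hdr);

            while (off + sizeof(struct toy_sctp_chunk_hdr) <= len) {
                    const struct toy_sctp_chunk_hdr *ch =
                            (const struct toy_sctp_chunk_hdr *)(pkt + off);
                    size_t clen = ntohs(ch->length);

                    if (clen < sizeof(*ch) || off + clen > len)
                            break;                  /* malformed packet */
                    cb(ch);
                    off += (clen + 3) & ~(size_t)3; /* chunks are padded to 4 bytes */
            }
    }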
Right now, the SCTP GSO implementation puts the packets together using SKB fragment lists. Not many drivers support receiving SKB fragment lists, so the SCTP GSO frame has to be unraveled right before it is given to the driver. The discussion went on to cover how feasible it would be to get drivers to support SKB fragment lists in order to handle this. It isn't easy, because once you have fragment lists, drivers have trouble figuring out exactly how much space in their TX ring might be necessary to send an arbitrary packet. Eric Dumazet suggested that it might be possible to encode SCTP GSO packets using paged frags, just like TCP does, and that the current implementation of SCTP GSO should be reconsidered. He made it clear that doing this would be much easier than getting drivers to accept SKB fragment lists.

Next, Xin presented various cases of ICMP handling over tunnels that are troublesome. The gist of the issue is that the current way UDP sockets are looked up for UDP-based tunnels doesn't work when processing ICMP messages. Xin feels that we can adjust the demux in certain ways so that we find the correct UDP socket in most cases. He went on to show that L2TP tunnels have a similar set of issues, although the path to solving the problem there is much less clear.

====================

BPF and the Future of Kernel Extensibility (Alexei Starovoitov)

Alexei explained that the goal of BPF is simply to let non-kernel developers safely and easily modify kernel behavior, and this means making BPF easy to use. He used various visuals (a Transformer, a bunch of children playing with Lego) to show the difference between BPF in the past (cool and powerful, but only in limited forms) and BPF in the present (a giant Lego set with no instruction manual). However, even without this instruction manual, people have built lots of interesting and powerful things. He went on to describe why people learn BPF, and why the kernel needs BPF in order to stay relevant in this quickly changing world.

The BPF verifier, as always, was a major topic. Programs up to this point have been relatively simple for the verifier to work with, but this is going to change, especially because of BPF-to-BPF calls. There is also work afoot to track pointer lifetimes in the verifier; this kind of analysis will allow things like lock/unlock and malloc/free patterns to be verified in BPF programs. Another major area of work is supporting bounded loops. Other things to look forward to are global variables, local storage, indirect calls which are statically verified and patched, libraries, and dynamic linking. Alexei would like to see the verifier move away from its brute-force "walk all instructions" approach; he feels that this is fundamentally necessary in order to handle huge programs efficiently.

Support for BTF (BPF Type Format) was merged recently, and more is coming. It describes all of the types in a given program. In the future, BTF can integrate with the verifier for safety checks. Also, LLVM could emit code that references structure members symbolically, and the BPF loader could fix them up using the BTF information.

Facebook recently fully open sourced its XDP-based load balancer, called katran. Alexei explained the key aspects of its design which were enabled by XDP. He would also like to see a move to more file-descriptor-based APIs for BPF objects; these solve all of the weird concurrency and lifetime issues we see these days.

Another upcoming feature is cgroup local storage. Variables are annotated with "__cgroup", just as one can use "__thread" in userspace; such a variable becomes local to a cgroup and can be used accordingly.

Drivers are getting more and more complex, so it is proposed to make the memory-management aspect of adding XDP support as common as possible. Alexei feels that this complexity is part of why XDP uptake in drivers has been so slow.

Finally, the topic of firmware came up. The firmware running on today's hardware is a different beast than what we had in the past. It used to be just a shim over the hardware, but now it is something much more; in fact, at many shops the firmware team is several times larger than the kernel driver team. Such a huge piece of unpublished code is just a security problem waiting to happen. It is also full of secret features and, even more unfortunately, bugs. Therefore, firmware needs to become more open, part of the driver, and part of the kernel git tree before this problem becomes even worse. It will be a huge uphill challenge to achieve this, however.
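To close, here is a rough idea of what the restricted C discussed in the verifier sections looks like once BPF-to-BPF calls are available. This is an illustrative sketch intended for clang's BPF target, not code from any of the projects mentioned above.

    #include <linux/bpf.h>

    /* Section annotation the BPF loaders key off of; normally provided by a
     * helper header such as the one in the kernel's selftests. */
    #define SEC(name) __attribute__((section(name), used))

    /*
     * A BPF-to-BPF call: marked noinline so it stays a real call that the
     * verifier must follow, rather than being inlined away by clang.
     */
    static __attribute__((noinline)) int pick_verdict(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;

            /* Drop frames too short to hold an Ethernet header. */
            if (data + 14 > data_end)
                    return XDP_DROP;
            return XDP_PASS;
    }

    SEC("xdp")
    int xdp_example(struct xdp_md *ctx)
    {
            return pick_verdict(ctx);
    }

    char _license[] SEC("license") = "GPL";

The verifier has to prove that both functions terminate and that every packet access is bounds-checked, which is exactly the kind of analysis the scaling discussion above is about.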
