
Netdev 2.2 day 2 report

This is coverage from day two of Netdev 2.2; other days (and Netconf coverage) can be found here.

Whatever Happens in Netconf behind the Closed Door?
Panel moderated by Shrijeet Mukherjee
Reported by Alexander Alemayhu

Netconf is a two-day, invite-only conference whose attendees are key contributors to the networking stack; it runs back to back with Netdev. The topics range from current problems and bottlenecks to future work and overall improvements. The moderator for the session is Shrijeet Mukherjee. Shrijeet starts off by explaining the goal of the session: what Netconf is and the role it serves for the networking subsystem. Shrijeet has a few questions prepared but encourages audience participation for an interactive session.

The panel consists of Jamal Hadi Salim, Eric Dumazet, Jesper Dangaard Brouer, John Fastabend, and Simon Horman. At the beginning, the panelists introduced themselves. Jesper works for Red Hat and his main focus is optimizing performance in the kernel. Jamal focuses mostly on tc but does other things in the network code. Simon started out with load balancing but currently focuses mostly on hardware offloads. John focuses mostly on BPF and XDP but has worked on other parts of the networking subsystem. Eric Dumazet works at Google and is mostly a network guy working on the TCP stack. Rumour has it he is a network ninja.

Question 1 - What is Netconf to you?

Eric: Two days back to back with netdev. Good conversation, good beer. Remembers mostly beer and good food :)

John: Two days true. Talk about current issues, what people work on long term vs just seeing patches on the list.

Simon: A lot of people work in different places. Good chance to meet face to face. Propose solutions, and get instantaneous feedback from a focused group. Good to get together.

Jamal: Two days confirmed. Opportunity to have attention from David Miller. Problems do get solved. Some questions are hard to discuss over email. Half the format is to present and get feedback. Best place to solve hard problems.

Jesper: Not sure if he can add much. Important to meet face to face.

Shrijeet: It’s all about collaboration to reach agreement on opinions that were originally divergent.

Question 2 - What were the disagreements discussed at this Netconf in particular?

Simon: Flow dissector has become more and more complex over time. A point of contention on the mailing list. Concerns were raised at Netconf and solutions for centralized infrastructure were proposed. These outcomes would not have been achievable on the mailing list.

John: Important to know the motivation behind the work of others.

Jesper: Good to hear what companies are doing, for example: Why is Google pushing X?

Eric: Does not recall having said anything about Google. No secrets shared. Maybe after the first beer?

Shrijeet: Problem with finding a middle ground between performance and features. Linux is being used in a lot of different places.

Simon: Not so much contention. No tables were damaged this week.

Question 3 - What was the most fun disagreement?

Jesper: Some of the participants originally disagreed with the XDP mechanism before it got merged. The discussions got heated.

Simon: People can get emotional and sometimes can even hit tables. The person who did it last time should remain unnamed :)

Then Shrijeet injects a question on how much time is right for discussing a problem: 30 minutes, an hour, a day?

Simon: Discussions are very unstructured and people can take the time they need.

John: Most things are resolved with these discussions anyway. Some topics are discussed over several Netconfs.

Eric: Sometimes the solutions discussed don't get implemented because people get involved in other topics and lack time.

Shrijeet: Emphasizes that subjects of high impact will generate more discussions.

Question 4 - What are the pain points the audience is feeling with the network stack?

Dave: Gave the example of netlink error annotations, which can give users informative error messages instead of a cryptic error code (EINVAL).

Jason A. Donenfeld: thinks more automated testing would be good. Using tracing to enhance debugging instead of the primitive print messages was also mentioned.

Someone from the audience raised current pain points with the xfrm interface. There is no feedback on what is happening, which can make troubleshooting very difficult.

Dave: says this is a visibility problem and goes into the topic of tracepoints. We should add reasonable tracepoints for common problems.

Shrijeet ends the trip behind the Netconf closed door by joking about the fact that the audience feels only four pain points.

Resource Management in Offloaded Switches
Work by Andy Roulin et al at Cumulus
Presented by Shrijeet Mukherjee
Reported by Michio Honda

The talk covered synchronization challenges with updating hardware tables. The update path starts from a user-space app like iproute2, goes down to the kernel, and then down to the hardware. This all looks simple until you start factoring in that the installation from user space needs to deal with hardware state, capacity mismatch, rate mismatch, and type mismatch. The hardware and software abstractions, and therefore the resource mapping, also often lead to an impedance mismatch. These issues could all lead to a request to the hardware eventually failing silently. Shrijeet pointed to the fact that these failures can result in major, hard-to-pin-down symptoms; one example is traffic being black-holed after a BGP route update whose corresponding hardware update silently failed.

Shrijeet mentioned three options to solve this problem, abstracting it as a three-step loop from user space to the kernel, from the kernel to the hardware, and from the hardware to the table. He showed that a credit-based, asynchronous approach works well to amortize hardware access and transaction latencies over multiple updates, using the case of updating 10K routes from FRR.
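
A minimal sketch of how such a credit-based scheme could work is shown below; the names and structure are hypothetical, not the actual Cumulus implementation, and assume a single software queue feeding one hardware table.

    /*
     * Illustrative sketch of credit-based, asynchronous hardware-table
     * updates: a route change is only pushed to the hardware when a credit
     * is held, and credits come back as the hardware acknowledges writes.
     * Names and structure are hypothetical.
     */
    #include <stdbool.h>

    #define SW_QUEUE_LEN 1024

    struct route_update;                        /* opaque route change */

    struct hw_credit_channel {
        unsigned int credits;                   /* hardware slots we may use now */
        struct route_update *pending[SW_QUEUE_LEN];
        unsigned int head, tail;                /* software queue indices */
    };

    /* Called for every route change coming down from user space. */
    static bool queue_update(struct hw_credit_channel *ch, struct route_update *u)
    {
        unsigned int next = (ch->tail + 1) % SW_QUEUE_LEN;

        if (next == ch->head)                   /* queue full: report it loudly */
            return false;                       /* instead of failing silently  */
        ch->pending[ch->tail] = u;
        ch->tail = next;
        return true;
    }

    /* Push as many queued updates as there are credits; a single hardware
     * transaction can carry the whole batch, amortizing its latency. */
    static void flush_to_hw(struct hw_credit_channel *ch,
                            void (*hw_write)(struct route_update *))
    {
        while (ch->credits && ch->head != ch->tail) {
            hw_write(ch->pending[ch->head]);
            ch->head = (ch->head + 1) % SW_QUEUE_LEN;
            ch->credits--;
        }
    }

    /* Completion handler: the hardware finished 'n' writes, return credits. */
    static void hw_completed(struct hw_credit_channel *ch, unsigned int n)
    {
        ch->credits += n;
    }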

He concluded the talk by emphasizing that hardware resource management is non-negotiable for a good user experience.

AF_PACKET V4 and PACKET_ZEROCOPY
Presented by Magnus Karlsson, Björn Töpel, John Fastabend
Reported by Michio Honda

The presenters started by providing motivation for their work: AF_PACKET performance is not something they thought was up to par for modern needs, so they decided to work on a zero-copy infrastructure. The scheme works for both receive and transmit. Magnus introduced the approach of AF_PACKET V4, with eliminated system calls, hardware descriptors kept private in the kernel, and packet-array abstractions built around tpacket4_desc and tpacket4_queue. I note that these structures look very similar to something I am familiar with, namely the netmap slot and ring, respectively. AF_PACKET V4 relies on hardware steering and on dedicating a driver DMA area for user-space consumption. XDP (work in progress) could also be used to do initial pre-processing before allowing the packet to proceed on to user space.
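
For readers unfamiliar with this style of interface, here is a rough, hypothetical sketch of what such a shared descriptor/ring pair looks like; the real tpacket4_desc and tpacket4_queue layouts in the patches differ in their details.

    /*
     * Rough, hypothetical sketch of a user-space visible descriptor ring of
     * the kind AF_PACKET V4 (and netmap) use.  The real tpacket4_desc and
     * tpacket4_queue layouts in the patch set differ in their details.
     */
    #include <stdint.h>

    struct pkt_desc {               /* one frame in the shared packet buffer */
        uint64_t offset;            /* where the frame lives in the buffer */
        uint32_t len;               /* frame length */
        uint16_t flags;             /* e.g. owned-by-kernel vs owned-by-user */
        uint16_t reserved;
    };

    struct pkt_queue {              /* one RX or TX ring shared with the kernel */
        struct pkt_desc *ring;      /* descriptor array, mmap()ed by the app */
        uint32_t num_descs;
        uint32_t head;              /* advanced by the producer (kernel on RX) */
        uint32_t tail;              /* advanced by the consumer (the app on RX) */
    };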

The presenter emphasized the need for security and isolation properties for zero copy, including guarantees that the kernel won't crash and that no kernel data can be modified between processes. He showed many performance improvements; a particularly interesting one is a 20x improvement over tcpdump. The presenter argued the need for unification of XDP and zero copy. He finally listed to-dos, including performance issues, skb conversions for V4, shared packet buffer support, multi-segment packet support, and more.

A lot of discussion happened afterwards. David Miller recommended having a new address family instead, and said there should be a software fallback for drivers that do not have the appropriate NDOs for hardware acceleration. There were questions on how to steer traffic, with a response from John on how ethtool and tc could be used for this; Lorenzo Colitti stated that he saw the need for this in Android to do IPv6 address translations. Drawbacks of SR-IOV, and how the use of AF_PACKET V4 could remove those drawbacks, were also discussed.

TC filters (rules) insertion rate
Work by: Rony Efraim, Guy Shattah
Presented by: Rony Efraim
Reported by: Jae Won Chung

Rony Efraim from Mellanox presented their effort to improve the TC rule insertion rate. This is important for them because tc is used for offloading flow rules and there are now demands for inserting a million rules per second.

In an attempt to improve TC rule insertion performance, the linked-list implementation of classifier rule handle store/lookup and the small-bucket hash-table implementation of action store/lookup were replaced with an IDR. These store/lookup optimizations boost the TC rule insertion rate to 50K rules per second when tested on an E3120 Xeon.
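
For context, the kernel's IDR maps integer IDs to pointers with efficient allocation and lookup. The following is a minimal sketch of how a classifier might use it in place of a linked-list walk; the names are made up, and the real cls_*/act_* code adds locking, RCU, and error handling.

    /*
     * Minimal sketch of replacing a linked-list handle lookup with the
     * kernel's IDR API; real classifier code adds locking, RCU, and error
     * handling, and the names below are made up.
     */
    #include <linux/idr.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    static DEFINE_IDR(filter_idr);          /* maps handle -> filter */

    struct my_filter {
        u32 handle;
        /* ... match and action state ... */
    };

    /* Insert: allocate a free handle, or claim the requested one. */
    static int my_filter_insert(struct my_filter *f, u32 requested)
    {
        int id;

        if (requested)
            id = idr_alloc(&filter_idr, f, requested, requested + 1, GFP_KERNEL);
        else
            id = idr_alloc(&filter_idr, f, 1, 0, GFP_KERNEL);
        if (id < 0)
            return id;                      /* e.g. -ENOSPC if the handle is taken */
        f->handle = id;
        return 0;
    }

    /* Lookup: direct ID-to-pointer resolution instead of a list walk. */
    static struct my_filter *my_filter_get(u32 handle)
    {
        return idr_find(&filter_idr, handle);
    }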

To further improve TC rule insertion throughput towards the 1M/sec goal, the authors proposed two flavors of batched rule insertion: multiple netlink messages and a compound netlink message. By processing the rules in a batch in parallel, a much higher TC rule insertion throughput can be achieved. This method, however, requires the rules being inserted to be independent of one another.

During the discussion, it was pointed out that sending multiple netlink messages to the kernel in a batch is already supported. Also, generalizing the (compound) netlink message interface to handle safe processing of dependent rule insertions was suggested as future work, and compressing the rules in user space before inserting them into TC was suggested as a general recommendation.

TTCN-3 and Eclipse TITAN for testing protocol stacks
Presenter: Harald Welte
Reporter: Hajime Tazaki

Harald Welte, who has been working in the telecom area for a decade now, started his talk by introducing the importance of protocol testing, especially in the telecom area because of the large number of protocols used in that space. He introduced TTCN-3, a domain-specific language designed just for protocol conformance tests. Eclipse TITAN is an open-source tool chain supporting TTCN-3 that was released by Ericsson in 2015.

Harald then walked through the language definition and functionality of TTCN-3. The TTCN-3 language is designed with a comprehensive type system, powerful program-control statements, a template system, a run-time executor for parallel test components (to speed up testing), and more. It is also powerful thanks to its abstracted encoders and decoders, which verify that the contents of incoming packets are what a test expects.

Harald then shared his experience with example test code from a real use case. His demo showed a test for a standardized protocol implementation, the Media Gateway Control Protocol (MGCP, RFC 3435). Harald showed how he utilized the language features for MGCP testing.

A question was asked whether there is any experience using it to test the Linux kernel network stack. Harald then shared his experience testing the connection-tracking feature of the netfilter subsystem, which highlights the usefulness of the tool and the language.

TCP-BPF: Programmatically tuning TCP behavior through BPF
Speaker: Lawrence Brakmo
Reported by: Hajime Tazaki

Lawrence Brakmo talked about his newly introduced feature of the Linux TCP stack, called TCP-BPF. The main objective of this feature is to configure run-time parameters of the TCP subsystem, which is not possible (or is difficult) with existing mechanisms. Lawrence illustrated the capability of TCP-BPF with an example showing buffer-size optimization of a socket. To achieve the same effect, the setsockopt(2) system call requires modifying the application, sysctl knobs are hard to optimize and fine-tune, and the iproute2 utilities are more global, making them even more rigid. TCP-BPF, on the other hand, is able to optimize those run-time parameters programmatically. Lawrence also mentioned that the primary use case is to optimize per-flow TCP parameters in a production environment, experimenting with different tweaks of the TCP subsystem without making kernel changes.

Lawrence then explained the internals of the implementation, which has been available since the 4.13 kernel. He showed that TCP-BPF works by utilizing several kernel entry points where TCP activity happens. The kernel calls the BPF code at these hooks, which can then make tweaks. He then walked through usage scenarios with a couple of utility programs making use of this feature.
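
A minimal sketch of such a "sockops" program, in the spirit of the kernel's samples/bpf TCP examples, is shown below; the buffer size and the libbpf-style include path are assumptions, and the real samples differ in their details.

    /*
     * A minimal "sockops" sketch: bump socket buffers once a connection is
     * established, with no setsockopt() call in the application.  The
     * buffer size is an arbitrary example and the include path assumes
     * libbpf.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Values from <asm/socket.h>, repeated here to keep the sketch small. */
    #define SOL_SOCKET 1
    #define SO_SNDBUF  7
    #define SO_RCVBUF  8

    SEC("sockops")
    int bump_buffers(struct bpf_sock_ops *skops)
    {
        int bufsize = 1500000;              /* example value, not advice */
        int rv = 0;

        switch (skops->op) {
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            /* Per-flow tuning at connection establishment. */
            rv  = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
                                 &bufsize, sizeof(bufsize));
            rv += bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
                                 &bufsize, sizeof(bufsize));
            break;
        default:
            rv = -1;                        /* op we do not handle */
        }
        skops->reply = rv;                  /* value returned to the stack */
        return 1;
    }

    char _license[] SEC("license") = "GPL";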

Lawrence presented experimental results showing several plots of achieved goodput, the number of retransmitted packets, and the delivery delay of each packet, as well as the optimized result with TCP-BPF, which clearly confirm the benefit of TCP-BPF.

Finally, he described future plans for the project: not only to add more entry and call points that will be available to users, but also the possibility of implementing new congestion-control algorithms with the TCP-BPF feature.

A question was asked whether there is a more generic way to define the hooks implemented for TCP-BPF, so that the information tunable by users can be expanded easily. David S. Miller, the maintainer of the networking subsystem, explained the importance of carefully chosen entry points and recommended that approach for the interface exposed to users.

WireGuard: Next-generation Secure Kernel Network Tunnel
Presented by: Jason Donenfeld
Reported by: Pablo Neira Ayuso

Jason Donenfeld talked about his simple in-kernel VPN implementation, which can be used as an alternative to the existing user-space OpenVPN and the Linux IPsec stack. This new approach implements a tunnel device driver that encapsulates packets in a UDP header plus a specific VPN protocol header that he designed himself. The source is just under 4,000 lines of code, so it is rather small compared to the existing Linux xfrm code and user-space OpenVPN (in the hundreds of thousands of lines).

Configuration is rather simple: you use the wg device and the iproute2 command-line tools to set up tunnels, with no need for any user-space daemon.

The code has not received any in-depth review so far, so Jason has been calling for reviews. He's planning to send an RFC to the Linux netdev mailing list no later than the beginning of 2018.

Accelerating XDP programs using HW-based hints
Work by: PJ Waskiewicz, Anjali Singhai Jain, Neerav Parikh
Presented by: PJ and Anjali
Reported by Andy Gospodarek

PJ and Anjali started the presentation with a brief primer on XDP; since it has been covered extensively during Netdev, they wisely decided not to provide too many details on the basics. Significant parsing is already done by most NIC hardware, with the output of this parsing placed in the receive buffer descriptor. The goal is to create a way to pass as much of this useful information as possible to XDP programs, avoiding a lot of simple packet parsing in the XDP program and reducing the number of CPU cycles and memory accesses XDP programs need.

There were several recognized goals for this solution:

  1. Consider only what was possible using present-day hardware.
  2. Avoid temptation to make significant kernel and driver changes.
  3. Target a "best-effort" acceleration based on hints.
  4. Create a framework that can evolve over time.

The recent changes in the upstream kernel that provide a writeable scratch area in the XDP buffer served as a perfect spot to store data available in receive buffer descriptors. This 'data_meta' element provides plenty of space for information that could be easily populated before passing the buffer to the XDP program.
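
As a rough illustration of how an XDP program might consume such a hint, here is a hedged sketch; the 'meta_hint' layout and the packet-type code are hypothetical stand-ins for whatever format the driver and program would agree on.

    /*
     * Illustrative XDP program consuming a driver-provided hint from the
     * metadata area in front of the packet.  The 'meta_hint' layout and the
     * packet-type code are hypothetical; the real format is exactly what
     * this proposal is trying to settle.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct meta_hint {              /* hypothetical layout written by the driver */
        __u32 ptype;                /* hardware-parsed packet type */
    };

    #define HINT_PTYPE_IPV4 1       /* assumed encoding, for illustration only */

    SEC("xdp")
    int xdp_hints(struct xdp_md *ctx)
    {
        void *data      = (void *)(long)ctx->data;
        void *data_meta = (void *)(long)ctx->data_meta;
        struct meta_hint *hint = data_meta;

        /* Verifier-mandated bounds check; it also tells us whether the
         * driver left a hint at all (if not, data_meta == data). */
        if ((void *)(hint + 1) > data)
            return XDP_PASS;

        /* Decide based on the hint alone -- the packet payload is never
         * touched, which is where the cycles are saved. */
        if (hint->ptype == HINT_PTYPE_IPV4)
            return XDP_DROP;

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";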

Since Intel XL710 hardware is capable of providing packet-type information for any receive descriptor, and samples/bpf/xdp1 in the kernel tree also parses the packet contents for packet-type information, populating the 'data_meta' area of an 'xdp_buff' seemed like a good way to test whether this idea provided the expected performance benefit.

As a baseline, 'xdp1' was modified so that packet contents were not touched, and renamed 'xdp3'. This would provide an indication of the maximum number of packets per second that the particular test system could handle. A new program called 'xdphints' was created that inspected the 'data_meta' area for packet-type information populated by the driver. The expectation is that this data is stored in the first cache line of the buffer and would not be as expensive to access as data further down in the packet.

Testing proved this to be correct:

Application         Drop Rate
xdp1                7.5Mpps
xdp3                21Mpps
xdphints            21Mpps

The program that was able to use the hardware-provided hints before dropping packets operated as quickly as the program that did not access any packet data at all.

PJ and Anjali understand that this is a proof of concept, but wanted to bring these test results to the community with the hope that exposure to a wider audience would spark discussion and ideas they had not even considered. They proposed three different methods for reading and writing the XDP metadata, but realized that this may not be an exhaustive list of the options available for populating metadata.

  1. Define a common layout that could be populated by drivers. This seemed like an easy-to-implement solution, but there was concern that this struct would create an interface constrained by UAPI rules.
  2. eBPF programs that would be specific to a driver. Since driver-specific eBPF programs could check values based on the capabilities or elements available in that driver's receive buffer descriptor, device-specific programs could be a nice option.
  3. Chaining eBPF programs to write metadata. In this case, the first program writes information from the receive buffer descriptor into the data_meta area and the second program in the chain reads it and acts on that information. A slightly different alternative would be to place the program that populates the data_meta area of the xdp_buff in a separate ELF section of the file that contains the main eBPF program.

Overall this looks like a promising alternative to assist with packet parsing in eBPF programs. There was interest in testing more complex eBPF programs, and agreement that more prototyping is needed. Based on the number of questions and conversations around this topic, the decision to involve the community early in the process seems to have been an excellent choice.

It certainly seems like this feature will be discussed at future Netdev conferences and on the mailing lists, as hardware-assisted eBPF hints could provide massive optimizations for eBPF programs.

Performance improvements of Virtual Machine Networking
Presenter: Jason Wang
Reporter: Michio Honda

Jason started with the typical VM networking setup, using TAP (to a bridge) or macvtap (to a NIC). He discussed the threading model, or I/O thread, pointing out that use of a single thread is inefficient, referring to ELVIS (Abel Gordon) and CMWQ (Bandan Das). He showed a study of I/O bottlenecks and introduced busy-polling-based improvements.

He then introduced TAP improvements: better producer-consumer interaction using lock-less data structures and pointer arrays, as well as careful cache-line invalidation and aggressive batching. He then presented virtio improvements.
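
As a rough illustration of the lock-less producer-consumer idea (in the spirit of the kernel's ptr_ring, though much simplified), here is a single-producer/single-consumer sketch; note how testing the slot itself, rather than sharing index variables, keeps the producer and consumer from bouncing the same cache lines.

    /*
     * Simplified single-producer/single-consumer pointer ring in the spirit
     * of the kernel's ptr_ring (the real one adds locking options, batching,
     * and careful cache-line placement).  Emptiness is detected by testing
     * the slot itself, so producer and consumer never share an index.
     */
    #define RING_SIZE 256                    /* power of two for the mask trick */

    struct ring_sketch {
        void *slots[RING_SIZE];              /* NULL means "empty slot" */
        unsigned int producer;               /* touched only by the producer */
        unsigned int consumer;               /* touched only by the consumer */
    };

    /* Producer: publish a packet pointer; returns 0 on success, -1 if full. */
    static int ring_produce(struct ring_sketch *r, void *ptr)
    {
        unsigned int idx = r->producer & (RING_SIZE - 1);

        if (__atomic_load_n(&r->slots[idx], __ATOMIC_ACQUIRE))
            return -1;                       /* consumer has not drained it yet */
        __atomic_store_n(&r->slots[idx], ptr, __ATOMIC_RELEASE);
        r->producer++;
        return 0;
    }

    /* Consumer: take the next packet pointer, or NULL if the ring is empty. */
    static void *ring_consume(struct ring_sketch *r)
    {
        unsigned int idx = r->consumer & (RING_SIZE - 1);
        void *ptr = __atomic_load_n(&r->slots[idx], __ATOMIC_ACQUIRE);

        if (!ptr)
            return NULL;
        __atomic_store_n(&r->slots[idx], NULL, __ATOMIC_RELEASE);
        r->consumer++;
        return ptr;
    }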

Jason then switched to XDP for TAP. He introduced challenges in XDP+TAP: multi-queue support, LRO/RSC/jumbo-frame configuration, data copies on skb allocation, lack of NAPI, and zero copy. He reported that all of these problems had been solved as of 4.13.

He showed performance improvements with each new method, reaching 4.9 Mpps on RX and 3 Mpps on TX with all the methods enabled.

