Netdev 2.2 day 1 report

Netdev is the technical conference on Linux networking. It is a community-driven conference organized by the Netdev Society, a non-profit organization based in Canada. The conference is geared towards Linux netheads. Its main focus is Linux kernel networking and user-space use of the interfaces to the Linux kernel networking subsystem. Netdev 2.2 was held in Seoul, Korea from November 8th to 10th, 2017 at SpaceShare Daechi Center, back to back with Netconf.

Network Stack Personality in Android Phone Talk
Work by Cristina Opriceana and Hajime Tazaki, presented by Hajime
Reported by Lawrence Brakmo

In this talk, Hajime introduced the use of LKL (Linux Kernel Library) to allow running newer network protocols in Android phones. The example he used was using MPTCP (Multi-Path TCP) in Android phones to achieve better network throughput.

An issue with Android is that the base kernel running in most devices is version 3.10 or earlier (although they do have many backports). Hence there is no easy way to take advantage of newer network features unless one is running a device with a version of the base kernel that has it (i.e. an old feature) or if the feature has been backported to that base kernel version.

The goal of the project was to support Network Stack in User Space (NUSE) while maintaining application compatibility. This is achieved through LKL and the use of LD_PRELOAD to hijack library calls to use LKL. On Android, only socket-related calls are redirected.

As mentioned earlier, Hajime used MPTCP to demonstrate the use and value of their approach. MPTCP is an extension to the TCP subsystem that lets an application transparently use multiple paths over TCP. He demoed MPTCP on a Nexus 5 phone and compared the connection goodput and CPU usage to a baseline. With LKL and standard TCP, the goodput was the same and CPU utilization was a little lower with LKL. With MPTCP, the goodput was higher, but CPU usage was unstable (high variability).

The current implementation has limitations: DHCP only works at boot time (so handover will fail), and only IPv4 is supported on cellular.

XDP + Netem = XNetem
Presented by Stephen Hemminger
Reported by Lawrence Brakmo

Stephen shared that the motivation for the work was a desire to work with XDP. However, he also saw the opportunity to achieve higher performance (that is, to support higher packet rates) than Netem can. XNetem is not a replacement for Netem. Each tool has its strengths and weaknesses, and one should use the tool better suited to a particular use case.

Netem was initially developed to test TCP throughput under various network conditions. It supports adding delay, packet losses (in various ways), duplicating packets and mangling packets (i.e. introducing packet errors).

XNetem uses XDP and an associated BPF program to achieve its goals. This program uses 2 BPF maps, one to store its parameters such as loss rate, and another to store its state. These maps are shared with a user CLI program that is used to specify its parameters. The CLI provides an interface that is similar to that used when controlling Netem through tc.
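As a rough illustration of the two-map design described above, the following Python sketch models a parameter map written by a CLI and a state map updated for every packet. It is only a simulation: the real XNetem is a BPF program attached through XDP, and the names used here (params, state, loss_percent, set_loss, process_packet) are assumptions made for this example, not taken from Stephen's code.

```python
import random

# Hypothetical stand-ins for the two BPF maps: one holds parameters set
# from user space, the other holds state updated on every packet.
params = {"loss_percent": 0.0}
state = {"seen": 0, "dropped": 0}

# Action codes as defined by the kernel's enum xdp_action.
XDP_DROP, XDP_PASS = 1, 2


def set_loss(percent):
    """What a CLI would do: write a new loss rate into the parameter map."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("loss percentage must be between 0 and 100")
    params["loss_percent"] = percent


def process_packet(rng):
    """Per-packet decision: drop with the configured probability."""
    state["seen"] += 1
    if rng.random() * 100.0 < params["loss_percent"]:
        state["dropped"] += 1
        return XDP_DROP
    return XDP_PASS
```

Calling set_loss(10.0) and then process_packet(random.Random()) repeatedly should drop roughly one packet in ten, with the counters in state playing the role of statistics that the CLI could read back.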

XNetem currently only supports introducing packet losses and packet corruption. There is no support for adding delay, and this will be hard to add (if it is possible at all). XNetem is a work in progress, and Stephen is hoping it will create the same type of ecosystem that Netem created.

Receive Side Crypto Offload
Presented by Boris Pismenny
Reported by Jae Won Chung

Boris Pismenny from Mellanox presented Network Interface Card (NIC) support for Transport Layer Security (TLS) receive-side crypto offload. This extends the talk on transmit-side crypto offload given at the previous NetDev 1.2 conference. Innova-TLS NICs (ConnectX4-Lx + Xilinx FPGA) are integrated with the kernel TLS (kTLS) software stack, where kTLS orchestrates the offload capability of the NIC. kTLS is enabled through the setsockopt() system call interface, through which the information required for TLS operations is provided to the kernel. The authors extended the API to support TLS_RX socket options.

The challenge of TLS receive-side crypto offload comes when packet reordering or loss occurs. In that case the NIC loses the state required to offload the next TLS record, such as the location of the TLS record frames in the TCP stream and the sequence number. The NIC then delivers undecrypted packets, relying on software decryption, until a resync is initiated and accepted by kTLS. A similar problem may occur during TLS key renegotiation, since the hardware may not be informed of the new key in time. When it sees an authentication error from using the old key on TLS records encrypted with the new key, the NIC stops decrypting and waits for a resync.
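The loss-of-sync behaviour described above can be pictured with a toy state machine. The Python sketch below is an assumption-laden model, not the Mellanox implementation: the class name RxOffloadModel and its methods are invented for illustration, and the model tracks only the next expected TCP sequence number rather than real TLS record boundaries.

```python
class RxOffloadModel:
    """Toy model of a NIC tracking its position in a TCP stream."""

    def __init__(self, start_seq):
        self.expected_seq = start_seq
        self.in_sync = True

    def receive(self, seq, length):
        """Return True if the NIC decrypted the segment, False otherwise."""
        if self.in_sync and seq == self.expected_seq:
            self.expected_seq = seq + length
            return True
        # Reordering or loss: the NIC no longer knows where records start,
        # so it hands packets up undecrypted for software to handle.
        self.in_sync = False
        return False

    def resync(self, next_record_seq):
        """Software (kTLS in the talk) confirms where the next record starts."""
        self.expected_seq = next_record_seq
        self.in_sync = True
```

In this model a single out-of-order segment turns off hardware decryption for every later segment until resync() is called, which mirrors the fallback to software decryption described in the talk.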

This approach requires 2 bits in the SKB, crypto_done and crypto_success, to notify the software of the decrypt state, which is a design challenge as increasing SKB size may directly impact Linux fastpath performance. Future work includes performance evaluation of the TLS receive-side crypto offload given various packet reordering and loss traffic models.

Arachne: Large Scale Data Center SDN Testing
Presented by: Alexander Aring and Jamal Hadi Salim
Reported by Jae Won Chung

Jamal Hadi Salim and Alexander Aring presented and demonstrated a new tool, Arachne, which can be used to design a Clos network and to test the scalability and robustness of its control path. Control of the datapath and other resources in a Clos network is typically built using Software Defined Networking (SDN), where a set of SDN controllers on the management plane configures, manages and monitors the network nodes, including switches and hosts, through one or more software daemons/agents on the nodes. The agents typically communicate with the SDN controllers over a management/control network that is physically separate from the data network. The scalability and robustness of the SDN-controlled management network can be evaluated with Arachne and a single Linux box.

A typical Arachne workflow is to design a Clos network iteratively and then run a weave phase. In the design phase, the designer describes a network intent via the CLI, choosing whether an L2 or L3 network is desired and giving the intended Clos network parameters, such as the number of PoDs, the spines and racks in a PoD, and the number of hosts per rack. Arachne can then be asked to show the modeled Clos network graphically to ease human validation. Arachne uses geographic addressing schemes for L2 (MAC) and L3 (IPv4) addressing. When running in L3 mode it uses static routing for L3 configurations by default, although support for dynamic routing protocols is claimed.
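To make the design parameters and the idea of geographic addressing more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the ClosIntent fields, the host naming, and the 10.<pod>.<rack>.<host> layout are assumptions chosen for illustration, since the report does not describe Arachne's actual addressing format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClosIntent:
    """Design-phase parameters mentioned in the talk (field names assumed)."""
    pods: int
    spines_per_pod: int
    racks_per_pod: int
    hosts_per_rack: int


def host_address(pod, rack, host):
    """Hypothetical geographic IPv4 scheme: 10.<pod>.<rack>.<host>."""
    for value in (pod, rack, host):
        if not 1 <= value <= 254:
            raise ValueError("each address component must be in 1..254")
    return f"10.{pod}.{rack}.{host}"


def enumerate_hosts(intent):
    """Yield (name, address) for every host in the modelled network."""
    for pod in range(1, intent.pods + 1):
        for rack in range(1, intent.racks_per_pod + 1):
            for host in range(1, intent.hosts_per_rack + 1):
                yield f"p{pod}r{rack}h{host}", host_address(pod, rack, host)
```

The point of such a scheme is that an address can be read back as a location: under these assumptions, 10.2.3.4 is host 4 in rack 3 of PoD 2.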

After successful modeling comes a phase known as weave. Arachne consumes the dot files generated in the design phase to build a virtual Clos network using network containers, virtual Ethernet (veth) devices and Linux bridges. The host containers are connected via veth to a switch container that internally connects to the resident Linux bridge. To emulate a control network, each container has a management veth connected to a Linux bridge in the host.
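The wiring of host containers to a switch container can be sketched as a list of iproute2 commands. The Python function below only builds command strings and does not run them (doing so would need root). The interface naming scheme and the function name weave_commands are assumptions for this example, not Arachne's own, and the management veth is left out for brevity.

```python
def weave_commands(switch, hosts, bridge="br0"):
    """Return ip(8) command strings wiring each host to one switch."""
    cmds = [
        f"ip netns add {switch}",
        f"ip netns exec {switch} ip link add {bridge} type bridge",
    ]
    for index, host in enumerate(hosts):
        host_if = f"{host}-up"            # hypothetical naming scheme
        switch_if = f"{switch}-p{index}"  # must stay within 15 characters
        cmds += [
            f"ip netns add {host}",
            # Create the veth pair, then move each end into its namespace.
            f"ip link add {host_if} type veth peer name {switch_if}",
            f"ip link set {host_if} netns {host}",
            f"ip link set {switch_if} netns {switch}",
            # Attach the switch-side end to the bridge inside the switch.
            f"ip netns exec {switch} ip link set {switch_if} master {bridge}",
        ]
    return cmds
```

For example, weave_commands("sw1", ["h1", "h2"]) yields twelve commands: two to set up the switch namespace and its bridge, then five per host.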

Arachne currently SSHes to the emulation host and uses the “ip netns” command to configure the network containers, i.e., the virtual Clos nodes. During the discussion, using Ansible to push the network configuration to each container via the management interface was suggested as a path to evolve Arachne into a Clos network design and configuration tool. Future work includes extending the tool to support 7-stage Clos networks, supporting IPv6 addressing and network configuration, and introducing some variant of chaos monkey to simulate random failure scenarios.

RTNL mutex, the network stack big kernel lock
Presented by Florian Westphal
Reported by Pablo Neira Ayuso

For those unfamiliar with rtnetlink, it is the interface that allows user space to configure and query Linux networking. This includes config points such as IPv4/IPv6, network device management, tunneling, neighbour entries, routing and qdiscs, among many other things. The interface is always built in if your Linux kernel comes with networking support. Florian Westphal started by explaining that the rtnl_lock mutex protects interactions between userspace and the rtnetlink interface. This mutex has existed since the beginning, when the netlink interface was initially added to the kernel in 2001.

The rtnl_lock mutex ensures that only one request from user space is handled at a time. This includes both netlink requests and dumps to obtain configuration listings. According to Florian, this mutex can be held for long periods of time, depending on the size of the objects being dumped. This delay is also aggravated by calls to schedule user space and synchronize_rcu() in its path, preventing other userspace processes from running netlink requests concurrently.

For all these reasons Florian set out to remove the dependency on rtnl_lock. Florian provided an overview of the general problems he faced in his effort to remove this mutex, such as the many calls through netdev_ops indirections that would otherwise result in races, e.g. with module removal. He paid close attention to two config subsystems in his talk to show an example analysis of the issues: the IP fib and devinet subsystems.

Florian's initial approach is to allow a subsystem to indicate that it does not need the rtnl_mutex in certain cases. He is now focusing his efforts on lockless dumps, an approach that was already tried a few years ago. However, a number of subsystems behind this mutex make assumptions about it, so getting rid of the rtnl mutex is not straightforward. Specifically, he referred to qdisc information, XDP information, SR-IOV information and link stats.

Work is in progress, so hopefully, this will result in patches upstream soon.

Status, Open Issues and Extensions for switchdev SR-IOV mode
Presented by Or Gerlitz and Simon Horman
Reported by Andy Gospodarek

Currently there are two modes for NICs that operate in SR-IOV mode. Though the names are at times confusing ('Legacy Mode' and 'Switchdev Mode') things are working well and implementations of Switchdev Mode for SR-IOV-capable NICs are available from Mellanox (mlx5), Netronome (nfp), Broadcom (bnxt) and Cavium (liquidio). There are also patches that have been posted to netdev mailing-list proposing support for Switchdev mode on i40e drivers. Intel developers explained that some concerns about hardware behavior with Switchdev Mode on i40e hardware prevented a second version of their patches from being posted to netdev, but they felt confident these issues could be resolved. Or also made sure it was clear to all in attendance that when a device supports 'Switchdev Mode,' the only hardware datapath offload mechanism that is currently supported by these drivers is Flow offload via TC Flower.

Or then proposed that devices operating in 'Switchdev Mode' should also be capable of offloading traditional bridging and routing as well as Flow Offload. This would be possible if the drivers contained support for switchdev ops, much like the dsa and mlxsw drivers do. Or commented that Bridging/L2 offload would probably be quite easy to implement at both the driver and firmware level, but that Routing/L3 offload may present more challenges. John Fastabend commented that this API seems easy to support at the kernel and driver level, but that significant work may be needed on the hardware side to implement it. Other vendors agreed with this sentiment. Or also proposed that new features would only be made available in 'Switchdev Mode', and many agreed this was a reasonable option.

Simon presented ideas for a naming convention for netdevices that are PFs, VFs, or representers. Consensus was reached that we would rely on systemd/userspace to name these devices, but not on whether there should be representer netdevs for PFs, VFs, and the MAC/physical port. There was, however, agreement that people would meet on Thursday afternoon during a break to discuss this topic in more detail. The main topic would be whether there is a need to extend the Legacy and Switchdev modes to the PFs, to allow vendors the flexibility to create network devices in the way that will be most beneficial to users while still providing a familiar user experience for those with hardware from multiple vendors.

TC Flower Offload
Presented by Simon Horman
Reported by Jamal Hadi Salim

Simon first described the tc flower classifier. When a packet enters or leaves the system, it encounters the Traffic Control (tc) network subsystem. The tc subsystem has classifiers of different types, which are used to match packets based on different criteria. The tc flower classifier can match packets based on many packet fields (e.g. protocol, source or destination IP address). Upon a match, one or more actions are executed. Actions can account for, modify, forward or drop a packet, among other things. Simon highlighted the pedit, mirred and vlan actions in particular.
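The match-then-act model described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not how the kernel implements flower: rules here are plain dictionaries, the first matching rule wins, and the field names and action strings are assumptions chosen for the example.

```python
def make_rule(match, actions):
    """A rule is a set of required field values plus a list of actions."""
    return {"match": dict(match), "actions": list(actions)}


def classify(packet, rules):
    """Return the actions of the first rule whose fields all match."""
    for rule in rules:
        if all(packet.get(k) == v for k, v in rule["match"].items()):
            return rule["actions"]
    return []  # no rule matched, so no action is taken in this sketch
```

A rule only constrains the fields it names, so make_rule({"ip_proto": "udp"}, ["vlan push"]) matches every UDP packet regardless of its addresses, much as a flower filter matches only on the keys given by the user.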

Simon showed a tc rule making use of the flower classifier. He then went on to describe how the same rule could be offloaded to the NIC through a user policy specification. The NFP hardware/driver (from Simon's employer Netronome) is capable of doing such offloads. The question that often comes up is: why would you want to offload? The simple answer is that offloading frees up CPU cycles to do other things. The amount of CPU saved can be substantial, depending on your port speeds.

Simon then gave a history of tc hardware offload. He credited John Fastabend for being the first to introduce the ndo_setup_tc() driver API back in 2.6.39. In addition to the offload Netronome currently implements, Simon talked about match enhancements for IPv6 labels, MPLS, and GENEVE options.

Finally, he talked about conntrack integration into hardware offload. From a security standpoint it is sometimes useful to be more fine-grained than an n-tuple match. Conntrack provides this by allowing an action on a packet (e.g. dropping it) to depend on the state of its flow. Such information could then be offloaded for more efficient processing.

Question (Anjali): Can you support offset/length/mask/value classification in your setup, similar to what the u32 classifier supports? Answer: Such a feature is currently not supported by flower, but nfp would be able to support it when the flower classifier does.

XDP for the Rest of Us
Chaired by: Andy Gospodarek and Jesper Dangaard Brouer
Reported by Quentin Monnet

As a sequel to their tutorial from the Netdev 2.1 conference in April 2017, Andy Gospodarek and Jesper Dangaard Brouer created a new workshop on XDP features and workflow. It focused on the latest XDP features and monitoring tools, and also came as an effort to clarify several points about XDP for the users.

Andy and Jesper started with a short reminder, explaining that XDP is a programmable low-level layer in the Linux kernel stack. It acts as a hook at the driver level for attaching eBPF programs, which are run in the kernel in a safe way. These programs return an action code (drop packet, drop with error code, pass to stack, transmit back on same interface, transmit out of another interface), and they can read and write values in different types of maps, which are also accessible from userspace. This makes it possible, for example, to manipulate encapsulation headers based on a set of rules.
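The five action codes listed above correspond to the values of enum xdp_action in the kernel's UAPI header. The Python sketch below mirrors those values and pairs them with a made-up policy function. The function example_program and the blocked_sources set (standing in for an eBPF map shared with user space) are assumptions for illustration, since real XDP programs are written in restricted C and compiled to eBPF.

```python
from enum import IntEnum


class XdpAction(IntEnum):
    """Mirrors enum xdp_action from the kernel's uapi/linux/bpf.h."""
    XDP_ABORTED = 0   # drop and signal an error
    XDP_DROP = 1      # drop the packet
    XDP_PASS = 2      # hand the packet to the network stack
    XDP_TX = 3        # transmit back out of the same interface
    XDP_REDIRECT = 4  # transmit out of another interface


def example_program(packet, blocked_sources):
    """Hypothetical policy: drop blocked senders, pass everything else."""
    if "src_ip" not in packet:
        return XdpAction.XDP_ABORTED
    if packet["src_ip"] in blocked_sources:
        return XdpAction.XDP_DROP
    return XdpAction.XDP_PASS
```

Because blocked_sources is passed in rather than hard-coded, the policy can change without touching the program, which is the role maps play for real eBPF programs.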

One recent change for XDP is the improvement of introspection features. eBPF programs and maps are now associated with unique tags (programs only) and IDs, which can be used to reference and manipulate them. Furthermore, several command-line utilities have been developed to help with eBPF and XDP debugging. bpftool, for example, can show the names, types, IDs and other metadata of eBPF programs and maps loaded on the system, dump their contents, and manipulate them to some extent. xdp_monitor, under samples/bpf, is another utility used for debugging XDP through tracepoints. Both tools have room for improvement, and could be ideal targets for contributors looking for an easy way to get involved.

Two other significant updates were presented. First, XDP recently acquired the possibility to manipulate packet metadata, in order to better cooperate with the network stack, or to provide meaningful information for eBPF programs attached to later hooks. The second change is the addition of the latest return code XDP_REDIRECT, which can be used in conjunction with specific eBPF maps to redirect packets and transmit them through another interface. Both metadata manipulation and packet redirection through XDP are already actively used by some companies.

At the end of the presentation, Andy and Jesper recalled that DDoS protection and load balancing are traditionally the use cases cited for XDP. But they also mentioned additional possibilities such as “fixing” kernel or hardware limitations by handling protocols unknown to the kernel, or offloading the routing stack with XDP.

The tone of the workshop seemed to indicate that eBPF and XDP are gradually spreading, and that an increasing number of users need some “demystification” about the technology. In that, it is in the same spirit as the BPF Questions & Answers recently added to the documentation of the kernel.

Multi-PCI socket network device
Presented by Achiad Shochat
Reported by Michio Honda

Achiad Shochat from Mellanox began by motivating the need to support multiple PCIe buses in a single netdevice: the emergence of 200Gbps NICs and the desire to make better use of NUMA architectures. He spent time detailing how PCIe buses and NUMA architectures work, to provide context for the talk. Modern NICs (at least the Mellanox ones) provide multiple host interfaces via PCIe.

Achiad explained challenges in multi-PCIe-bus NICs, showing problems with interaction between netdev initialization and PCIe probes, difficulty of a dynamic approach and adoption of a device driver approach. He showed the need for breaking the netdev per-port convention, and explained complexity with TX/RX queue affinity on PCIe bus.

Achiad then presented various use cases that benefit from multi-PCI socket network devices. One is ARFS (Accelerated Receive Flow Steering), where multi-PCIe can eliminate DMA traffic over the QPI across NUMA nodes. Another is virtualization: he showed the need for the VMM to take the PCIe bus into account when assigning VFs to VMs. This work is in progress. Finally, he presented a congestion use case illustrating how modern networks are faster than the capacity of a single PCIe bus.

Achiad shared performance tests on a NUMA machine with a Mellanox ConnectX-4 100G NIC with two PCIe sockets. He showed that NUMA access to remote memory reduces throughput from 94 Gbps (when using multi-PCIe) to 80 Gbps. He also showed latency improvements when not going remote for memory access: a netperf run of 150 TCP streams averaged 250us latency with remote access, while local access (taking advantage of multi-PCIe) was in the 50-100us range.

A lot of discussion happened after the talk, covering concerns related to TC and the dynamic approach for PCIe-netdev bindings.

Extend TC to support conntrack (CT)
Presented by Rony Efraim
Reported by Jamal Hadi Salim

Rony started by describing how connection tracking keeps track of state transitions within a flow.

The original user of conntrack was netfilter, but recently the OVS conntrack action started to use the conntrack code as well. Rony then illustrated how OVS uses connection tracking.

His motivation was to add conntrack support to tc for the purpose of flow offloading. He was excited when he saw the tc connmark action, but found it insufficient because it doesn't keep track of flow state.

He proceeded to show how an outgoing packet would use the new tc action, and how a packet arriving in the kernel from the outside would use that same action. At some point flows get offloaded and handled in hardware. What doesn't get offloaded is processed by the kernel (the slow path).

Justin Pettit suggested that Rony take a look at the OVS approach, which takes care of both mark and state. Question 1 (Jamal): Why could you not just extend connmark to add state tracking? It already has all that information. Answer: The name is misleading, although a lot of the code in connmark is reusable.

The discussion then turned to when to offload. Rony suggested not offloading any flow that has not reached the established state. That left the question of at what point during the established state it becomes reasonable to offload the flow. Rony didn't think the current conntrack could distinguish between mice and elephant flows. Someone asked how many flows can be offloaded. Rony claimed his hardware can do millions.

The other discussion concerned how the software state would be refreshed so that one could retrieve stats and so on. Rony responded that their approach is to poll the hardware at some defined interval in order to maintain the software timers.
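The offload policy discussed here can be written down as a small predicate. The Python sketch below is only one possible reading of the discussion: the established-state check follows Rony's suggestion, while the packet-count threshold (min_packets) is a hypothetical heuristic for telling elephant flows from mice. The talk left that question open and this is not something the presenter proposed.

```python
ESTABLISHED = "established"


def should_offload(flow_state, packets_seen, min_packets=10):
    """Offload only established flows that have carried enough packets.

    min_packets is an assumed threshold for skipping short-lived flows.
    """
    return flow_state == ESTABLISHED and packets_seen >= min_packets
```

With the default threshold, a flow still in its handshake is never offloaded no matter how many packets it has carried, and a short established flow stays on the software path.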

The vendors in attendance generally agreed that this is a good idea and said they would like to take part in the API definition.





Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds