Netdev 2018 day 2
Speaker: Van Jacobson
Report by: Boris Pismenny
Industry legend Van Jacobson gave the keynote at the Netdev 0x12 conference. Van argues that we need to make some small changes to the contract between OSes and network devices. He started by providing historical context: some of the design of networking protocols is based on legacy. Historically, computer vendors developed computer networks to:
- Lock in customers - their customers could only talk to that vendor's stuff.
- Sell more hardware (remote access terminals and mainframes)
- Lock in customers, again.
As a result:
- CRT / printer / card reader engineers drove the design - device artifacts were mixed with the protocol.
- Communications links were slow (up to 56Kbps) - Huge cost for headers & implicit communication
- Encapsulation expressed the mainframe to device path. Everything had to know the path details to build a packet.
IP/TCP was different. The major architectural emphasis was on simplicity, expressive abstractions, and implementable constructs. These layers are notable for what they don't say - there's nothing about intended applications, protocol efficiency, or network structure.
AFAP (as fast as possible): TCP's reliable delivery constrains how much data is sent, but not how fast it is sent. This leads to an AFAP output contract, usually implemented as device output queue(s) - how fast depends on the queue drain rate.
AFAP makes the bottleneck run at 100%. According to queuing theory, when the dequeue process runs at the bottleneck rate and the enqueue process also runs at roughly the bottleneck speed, a backlog builds up whenever the enqueue process receives packets slightly faster, because the dequeue process cannot go any faster.
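To make the backlog argument concrete, here is a minimal Python sketch (not from the talk; the rates are made up) showing how even a small excess of arrival rate over drain rate makes the queue - and queueing delay - grow without bound:

```python
# Minimal sketch (not from the talk): simulate an AFAP sender feeding a
# bottleneck queue. The enqueue rate is only 5% above the drain rate, yet
# the backlog grows without bound because the dequeue side cannot go faster.

def simulate(seconds, enqueue_pps, drain_pps):
    backlog = 0.0
    for t in range(1, seconds + 1):
        backlog += enqueue_pps - drain_pps   # net packets left behind each second
        backlog = max(backlog, 0.0)
        print(f"t={t:2d}s backlog={backlog:8.0f} packets "
              f"(queueing delay ~{backlog / drain_pps * 1000:.1f} ms)")

if __name__ == "__main__":
    # Hypothetical numbers: a 1M pps bottleneck fed at 1.05M pps.
    simulate(seconds=10, enqueue_pps=1_050_000, drain_pps=1_000_000)
```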
Delay/loss explosion isn't an issue if:
- bandwidth delay product is small
- there's a fat buffered router in front of every bottleneck
- links from hosts to ToRs run slower than fabric
The first of these saved us until ~1995, then the second and third until ~2012. Since then the problem has been getting worse. Dennard scaling stopped, and the old assumption that the switch is faster than the host no longer holds; CPU upgrades cannot solve network problems anymore. This has had a big impact on the network. Google has been working to address some of these issues; Van mentioned several Google-authored papers: HULL, BwE, FQ/pacing, Timely, BBR, and Carousel. All of these tried to figure out how to find the bottleneck link downstream and prevent pressure in downstream buffers. BwE discussed how to fix things at the host to prevent queue buildup in switches; FQ/pacing was about preventing bursts of many packets traveling to the same destination.
Van argued that AFAP isn't working for us now because it's local to the host and our problems aren't local. We need a mechanism that allows for more control of packet spacing on the wire. To enforce relationships between all outgoing packets, the enforcement mechanism needs to be just in front of the NIC. Carousel is a great example of this.
In Carousel, an earliest departure timestamp is set on every SKB by the send routine. A timing-wheel scheduler (qdisc) replaces the queue in front of (or in) the NIC; the timing wheel is responsible for sending traffic according to the timestamps. The timing-wheel implementation could be lock-free using RCU. A timing wheel has a horizon: the window of time within which packets can be scheduled, which is a hard bound on when a packet can be sent. The NIC gets access to all packets within the horizon, enabling an efficient NIC implementation.
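To illustrate the idea, here is a minimal Python sketch of a Carousel-style timing wheel; the slot granularity, horizon, and class names are illustrative assumptions, not the kernel implementation:

```python
# Minimal sketch of a Carousel-style timing wheel, assuming a fixed slot
# granularity and horizon (names and parameters are illustrative, not the
# kernel implementation). Packets carry an earliest-departure timestamp;
# the wheel buckets them by slot and releases each slot when its time comes.

import collections

class TimingWheel:
    def __init__(self, slot_ns=1_000, horizon_slots=1_000):
        self.slot_ns = slot_ns                  # granularity of one slot
        self.horizon_ns = slot_ns * horizon_slots
        self.slots = collections.defaultdict(list)

    def enqueue(self, pkt, depart_ns, now_ns):
        # Timestamps beyond the horizon are clamped; "now" means AFAP.
        depart_ns = min(max(depart_ns, now_ns), now_ns + self.horizon_ns)
        self.slots[depart_ns // self.slot_ns].append(pkt)

    def drain(self, now_ns):
        # Release every slot whose time has arrived, in timestamp order.
        due = sorted(s for s in self.slots if s * self.slot_ns <= now_ns)
        for s in due:
            for pkt in self.slots.pop(s):
                yield pkt

wheel = TimingWheel()
wheel.enqueue("pkt-A", depart_ns=2_500, now_ns=0)   # paced packet
wheel.enqueue("pkt-B", depart_ns=0, now_ns=0)       # AFAP packet, goes first
print(list(wheel.drain(now_ns=3_000)))              # ['pkt-B', 'pkt-A']
```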
The desired qdisc is one that optimizes completion time, not fairness: with fairness, the packets of two transactions are interleaved in round robin, whereas the packets of each transaction could instead be batched and the transactions served one after the other, letting one transaction finish in half the time (see the worked example below). No existing qdisc can do this, but the Carousel timing-wheel scheduler can accomplish it easily.
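A tiny worked example, with numbers of my own choosing rather than Van's, shows the completion-time benefit:

```python
# Toy illustration (illustrative numbers, not from the talk): two transactions
# of 10 packets each on a link that sends one packet per time unit.

n = 10
# Fair interleaving: A and B alternate, so both finish at t = 2n.
fair = (2 * n, 2 * n)
# Run-to-completion: A is sent first and finishes at t = n, B at t = 2n.
batched = (n, 2 * n)

print("fair    completion times:", fair,    "mean:", sum(fair) / 2)     # 20, 20 -> 20.0
print("batched completion times:", batched, "mean:", sum(batched) / 2)  # 10, 20 -> 15.0
```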
One way to implement Carousel is to co-exist with the existing qdisc layer, where timestamps are set to AFAP (now) for packets that are not aware of the Carousel mechanism.
A lot of discussion ensued; watch the videos when they come out. The most interesting comment was from Eric Dumazet, who indicated that he will be sending out patches for some of the ideas Van discussed.
Talk #2: Challenges migrating from NDIV to XDP
Speaker: Willy Tarreau, Emeric Brun
Report by: Evangelos Haleplidis.
Emeric Brun presented the challenges HAProxy faced when porting their PacketShield solution from NDIV to XDP. Emeric began his talk with a quick overview of HAProxy and the Quick Packet Shield anti-DDoS solution developed with NDIV.
The first attempt at realizing Packet Shield was implemented using Netmap, but it faced multiple issues such as multi-second outages, very slow communication with the stack, CPU cycles wasted reformatting descriptors, and scaling issues with threads.
To solve these issues, the HAProxy team decided to do filtering fully in the kernel to avoid bouncing packets to user space. While packet delivery performance mattered, what was critical was blocking performance. For the second implementation the decision points were the following:
- Seamless attachment to the NICs
- Work directly on NIC buffers to avoid copying
- Exploit all metadata the NIC+driver provided
- Write responses to a different buffer than Rx to save a memcpy()
- Avoid alloc/free functions on the fast path by using recyclable Tx descriptors
- At the end of the Rx loop have a function to perform slow/locked operations
- To maintain states and provide socket info, have a function on the transmit path with an SKB.
The second attempt was implemented by adding NDIV to existing drivers.
While NDIV results showed no measurable impact on TCP traffic, the reasons to shift to XDP were to avoid duplicated code-maintenance effort, to improve XDP and upstream it on drivers that don't have it, and to immediately benefit from new driver support.
Emeric continued with similarities and differences. On the similarity side, the Rx code is very similar, and the Tx code roughly covers the same needs and requires only minor adaptations for XDP. On the difference side, XDP supports a redirect action, XDP does not support an rx_done callback, and XDP doesn't care about outgoing packets.
The talk continued with a hybrid approach using both NDIV and XDP, but the team was not satisfied with it and improved on this attempt by identifying a number of shortcomings in XDP.
Emeric finished his presentation with initial implementation results and possible proposal items: extending xdp_buff to carry information extracted from the Rx descriptor and the information needed to build a response Tx descriptor, as well as having NDIV still provide its own handle_rx, rx_done, and handle_tx calls.
Talk #3: Toward an eBPF-based clone of iptables
Speaker: Fulvio Risso, Matteo Bertrone, Sebastiano Miano, Massimo Tumolo
Report by: Roopa Prabhu
Fulvio Risso introduced a possible implementation of iptables using eBPF. The goal of the talk was to examine the feasibility of creating an iptables clone with eBPF.
The authors faced four main challenges. The first is how to preserve the semantics of iptables rules, given the different locations of the netfilter and eBPF hook points. The second was the selection of a matching algorithm that can outperform the current implementation (linear search) but that is feasible using one of the eBPF maps available in the kernel. The third was to support connection tracking, enabling ebpf-iptables to filter based on the state of the session. The final objective was to preserve the iptables syntax, which resulted in the choice of having two executables, one for normal iptables and one for ebpf-iptables (which currently supports a subset of the original's commands), and letting the user decide.
The implementation is based on cascading eBPF programs using tail calls. Linear Bit Vector Search is used to eliminate iptables's linear search; it offers good performance, especially with a large number of rules, and is possible with vanilla eBPF today. The architecture includes an XDP ingress hook that leads to the ingress chain selector, a TC egress hook that leads to the egress chain selector, and a connection-tracking table that can be updated to store session state.
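For readers unfamiliar with the technique, here is a minimal Python sketch of Linear Bit Vector Search classification; the fields, rules, and helper names are made up for illustration and the real implementation stores the bitmaps in eBPF maps:

```python
# Minimal sketch of Linear Bit Vector Search (LBVS) classification, the
# technique used to avoid iptables-style linear search. Per field, each
# possible value maps to a bitmap of the rules that match it; ANDing the
# bitmaps for all fields of a packet yields the matching rules, and the
# lowest set bit is the highest-priority match. Rules here are made up.

RULES = [  # (src_ip, dst_port, action) - "*" is a wildcard
    ("10.0.0.1", 80,  "DROP"),    # rule 0
    ("10.0.0.1", "*", "ACCEPT"),  # rule 1
    ("*",        22,  "ACCEPT"),  # rule 2
]

def build_bitmaps(field_index):
    """Map each concrete field value to a bitmap of rules matching it."""
    wildcard = 0
    bitmaps = {}
    for i, rule in enumerate(RULES):
        value = rule[field_index]
        if value == "*":
            wildcard |= 1 << i
        else:
            bitmaps[value] = bitmaps.get(value, 0) | 1 << i
    # Wildcard rules match every value, so fold them into each bitmap.
    return {v: b | wildcard for v, b in bitmaps.items()}, wildcard

SRC_MAP, SRC_WILD = build_bitmaps(0)
DPORT_MAP, DPORT_WILD = build_bitmaps(1)

def classify(src_ip, dst_port):
    matched = SRC_MAP.get(src_ip, SRC_WILD) & DPORT_MAP.get(dst_port, DPORT_WILD)
    if matched == 0:
        return "policy: default"
    first = (matched & -matched).bit_length() - 1   # index of lowest set bit
    return f"rule {first}: {RULES[first][2]}"

print(classify("10.0.0.1", 80))   # rule 0: DROP
print(classify("10.0.0.1", 443))  # rule 1: ACCEPT
print(classify("10.0.0.9", 22))   # rule 2: ACCEPT
```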
The performance evaluation showed that, for a small number of rules, performance is almost equivalent to iptables. It performs well with a large number of rules, although rule updates are not as fast. The results also show that latency is always better than with iptables.
Future work items include integrating NAT functions into ebpf-iptables, possible integration with bpfilter, and more work on connection tracking. The lesson learned from the implementation was the constraint imposed by the number and location of eBPF hooks. Having only two hooks - XDP for ingress and TC for egress - required implementing everything from scratch. Having an eBPF hook at every netfilter hook would have been better and would enable a more cooperative approach between eBPF and netfilter, instead of the current "replace all" option. For example, it is hard to find a one-size-fits-all matching algorithm; more hooks would allow simply changing the matching algorithm while keeping connection tracking and NAT in netfilter.
Talk #4: Building a Better NOS with Linux and switchdev
Speaker: David Ahern, Shrijeet Mukherjee
Report by: Evangelos Haleplidis
In this talk, David Ahern discussed how switchdev and Linux make a better network operating system (NOS), with ASIC drivers in the kernel. The talk started with David describing previous NOS designs as vendor-dependent silos, each with independent blobs. Open networking with Linux provides building blocks that enable innovation, flexibility, and reconfigurability, instead of a big blob of highly interdependent processes.
David continued with the evolution of the NOS, at the heart of which is Linux. Linux is everywhere, and the end goal is to move these functionalities into the kernel. Current commodity systems, like servers and load balancers, use Linux. Switches, however, are still controlled by SDKs, usually in userspace. Thus early NOSes had to stay in userspace, highly customized and interdependent, with no networking data in the kernel and Linux playing no networking role. NOSes were seen as a black box with a CLI for user access.
When legacy NOSes started, their makers were afraid of Linux and its licensing, and the kernel stack didn't have what they needed. To use other software, a user would have to recompile it against the vendor's SDK, which was a poor form of openness. Vendors then started using more Linux constructs by creating netdev ports for NOSes. But the question was how much networking data the vendor should put in the kernel - all of the routes, bridges, VLANs? This produced ad-hoc solutions at best.
The next step was to expose more networking functionality via Linux constructs. The vendor SDK still remains and the ASIC must be programmed through it, but more control APIs move into the kernel; the SDK programs the ASIC by listening to notifications from the kernel. That was the pre-switchdev phase.
Moving forward meant moving the vendor's ASIC driver from userspace into the kernel - that's the point of switchdev. The kernel driver listens to commands and notifications and executes them on the ASIC. To make this functional, new APIs were needed and came into existence, such as the devlink API. David pointed out that there are still a few SDK hassles, like error handling, but overall this approach is much better.
David argued that the switchdev model is not working as promised, because switchdev requires ASIC vendors to support its model and ASIC vendors have no motivation to do so. So the question is: given that, how do you prove switchdev is really the best approach? To ease the transition, the proposed solution is to provide a common layer for switchdev (clsw) that reduces the overhead of writing a proper kernel driver supporting switchdev. In addition, clsw can be used with SDKs running in kernel mode to provide a transition path for ASIC vendors.
David asked: with SDKs being proprietary blobs, do we want to put them into the kernel? His answer was that this approach paves the way to having Linux everywhere, by providing a way to show ASIC vendors that this is the best approach - and that is the best end goal. Having the Linux kernel at the center of the open NOS, holding networking configuration and state, will lead to simpler designs, more consistent tools and methodologies, and true openness.
David concluded with a proof of concept: reboot via kexec. The approach leads to a simpler architecture with a simpler initialization order, cutting the reboot time by a factor of three.
Talk #5: LSDN – Manage complex (virtual) networks in cloud environment with Linux kernel facilities
Speaker: Vojtech Aschenbrenner, Roman Kapl, Jan Matejek, Adam Vyskovsky
Report by: Evangelos Haleplidis
In this talk, Vojtech presented a tool for easy management of the configuration of data-center scenarios by leveraging TC. The presented tool is a C library with a clean API intended to be used by cloud orchestrators for managing the whole network of a cloud. The API allows the user to create a model of the desired network. The library then converts the model into a set of TC rules, mostly flower classifiers, which are loaded through a netlink socket. This is a one-time action that does not leave any running processes. The library also allows the user to validate the network model prior to loading it into the kernel, detecting common configuration errors such as duplicated MAC addresses of network nodes.
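To illustrate the "validate the model, then flatten it into rules" workflow, here is a minimal Python sketch; the model types, field names, and rule strings are my own illustrations, not LSDN's actual C API, and a real implementation would emit tc flower rules over netlink instead of strings:

```python
# Minimal sketch of the "validate before loading" idea, with made-up model
# types (LSDN's real C API differs). The model is checked for duplicate MACs
# and then flattened into human-readable rule descriptions that a real
# implementation would translate into tc flower rules over netlink.

from dataclasses import dataclass

@dataclass
class VirtPort:
    name: str
    mac: str
    phys_machine: str

@dataclass
class VirtNetwork:
    name: str
    vni: int                 # e.g. a VXLAN network identifier
    ports: list

def validate(net: VirtNetwork):
    seen = {}
    errors = []
    for port in net.ports:
        if port.mac in seen:
            errors.append(f"duplicate MAC {port.mac}: {seen[port.mac]} and {port.name}")
        seen[port.mac] = port.name
    return errors

def to_rules(net: VirtNetwork):
    # One forwarding rule per destination MAC, per hosting machine.
    for port in net.ports:
        yield (f"on {port.phys_machine}: match dst_mac {port.mac} "
               f"-> encapsulate vni {net.vni}, forward to {port.name}")

net = VirtNetwork("tenant-a", vni=100, ports=[
    VirtPort("vm1", "52:54:00:00:00:01", "hypervisor-1"),
    VirtPort("vm2", "52:54:00:00:00:02", "hypervisor-2"),
])
assert not validate(net), validate(net)
for rule in to_rules(net):
    print(rule)
```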
LSDN is accompanied by a CLI tool which can read networks from configuration files. The intended use case for this tool is deployment without orchestrators; it is also used for automated testing of the library.
LSDN supports several back-ends (VLAN, VXLAN, Geneve) for creating isolated virtual networks. In these networks, virtual ports can be configured via a simple API, as can a stateless firewall (a stateful one is planned) and traffic shaping. The corresponding part of the virtual network on a particular physical machine can be represented by a static or learning Linux kernel switch.
Vojtech concluded with a demo of the approach, writing a script and creating containers, and invited people to try the tool. There are packages for Arch and for RPM- and DEB-based distributions. As future plans, they want to stabilize the project, add a stateful firewall, and support migrations without a daemon.
Github project location: https://github.com/asch/lsdn
Talk #6: Subscriber Billing System Over Commodity Hardware
Speaker: Vikram Siwach, Boemjun Kim, Jamal Hadi Salim, Roman Mashak, Craig Dillabaugh
Report by: Brenda Butler.
Vikram and Boemjun presented a telco billing system using the Linux TC architecture. The talk started with Vikram presenting the traditional approach to subscriber billing and policy enforcement for telcos. The current billing system is defined by 3GPP and is complex, as there are several interconnected systems, namely the PCRF, the OCS, and the OFCS.
The authors set out to replace the current billing system by simplifying the architecture. The concept is to set up a controller in the cloud and make the infrastructure programmable. The goals were to eliminate multiple data storage points and to let users implement individual customer billing policies that are enforced immediately.
The talk continued with how the datapath pipeline works. Vikram pointed out that they used TC to implement that functionality and count bytes. He then explained the system they built. The architecture included a network controller that created TC rules. The business applications acted as services; their RESTful calls were translated into policies. The SDN controller created the policies and got feedback from the datapath. They then discussed the new QE module that does centralized credit and billing using TC.
They have tested their solution at scale with some impressive results. For large packets, they get almost line rate with 128,000 actions in play.
Part of their work resulted in kernel improvements. One was time-based filtering of statistics: they limited the dumping of statistics to only those that changed within a given time period. A second improvement was batching dumps of actions and hash-table sizes: they improved the kernel logic to calculate the maximum number of actions that could be dumped in a batch, instead of using a fixed limit of 32 (a sketch of these two dump optimizations appears after the remaining improvements below).
A third improvement was NIC configuration. They enabled RSS on the NIC with 12 queues and used the mqprio qdisc; the filtering needed bigger hash buckets to go faster, and Chris Mi from Mellanox provided a patch which solved this.
The final improvement addressed kernel locks that were causing bottlenecks. There was a single transmit lock in the prio qdisc (not mqprio) that caused poor performance when contended for by multiple CPUs. For the TC action context and stats-update lock, they modified the TC VLAN action to use RCU instead of a spinlock.
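Here is a minimal Python sketch of the statistics-dump optimizations described above (time filtering plus bounded batches); the data structures and names are illustrative, not the kernel code:

```python
# Minimal sketch of the two dump optimizations described above, in plain
# Python rather than kernel code: dump only actions whose statistics changed
# within the last interval, and emit them in bounded batches instead of a
# fixed limit of 32. All names here are illustrative.

import time

ACTIONS = [{"index": i, "bytes": 0, "last_update": 0.0} for i in range(100_000)]

def dump_changed_since(actions, since, batch_size=4096):
    """Yield batches of actions whose stats changed after `since`."""
    batch = []
    for act in actions:
        if act["last_update"] < since:
            continue                      # unchanged, skip to keep the dump small
        batch.append(act)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Simulate a handful of counters being updated, then dump only those.
now = time.time()
for act in ACTIONS[:10]:
    act["bytes"] += 1500
    act["last_update"] = now

for batch in dump_changed_since(ACTIONS, since=now - 1.0):
    print(f"dumping batch of {len(batch)} changed actions")
```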
The talk concluded with a demo showing that the policy enforcement kicked in at exactly the right moment (with 289 bytes of bandwidth left over, i.e., less than the next packet size) and then dropping packets as the limit was reached.
Talk #7: Composing and configuring HW assisted I/O virtualized network interfaces
Speaker: Sridhar Samudrala, Anjali Singh Jain
Report by: Roopa Prabhu
In this talk, Sridhar presented the goal of unifying the management of different types of HW-assisted I/O virtualized network interfaces (SR-IOV, VMDq and Scalable-IOV) in legacy and switchdev modes. The problem is that the Linux networking stack and drivers do not provide sufficient hooks to configure, control, and monitor the network interfaces assigned to a VM/container that are composed from NICs supporting hardware-assisted I/O virtualization.
Sridhar started with a discussion of the properties and features of a standard NIC without virtualization, then continued by describing the different HW-assisted virtualization models and their properties and features: SR-IOV, VMDq, and Scalable-IOV. For each model, Sridhar discussed how the host connects to the NIC and which properties cannot be controlled from the host.
Sridhar summarized the talk in a few points. The first was that the host admin does not have full control of the network interfaces assigned to a VM/container in an I/O virtualized environment. The second was that switchdev mode introduces a control plane via port representor netdevs that provide a hook to configure and control guest network interfaces. The final point was that the PF driver decides on default parameters for resource assignments when composing a virtual network interface.
Sridhar then argued that a system/network admin needs to be able to configure, per device, the defaults and allowed ranges of values for the parameters supported by each type of virtualized interface, as well as per-virtualized-interface ranges of values for the parameters configurable from the VM/container.
Finally, Sridhar proposed extending devlink on the PCIe interface to get and set per-device hardware defaults and allowed parameter ranges for each type of virtualization model. For per-virtualized-interface policy configuration, he proposed extending ndo_ops on port netdevs to get/set limits for parameters configurable from the VM/container. He also suggested introducing a way to enable switchdev mode without the requirement to support the slow path.
Talk #8: Communication Groups in TIPC
Speaker: Jon Maloy
Report by: Roopa Prabhu
Jon Maloy started the talk by introducing the existing transport and addressing modes of TIPC. TIPC supports two messaging modes: connection-oriented messaging, like TCP, and datagram messaging, like UDP. This talk is about datagram messaging. There are two address types: sockets for unicast messages and a service address for anycast and multicast messages. Jon set out to address two main problems related to this.
The first problem is that none of unicast, anycast, or multicast has socket-to-socket flow control, so despite a reliable node-to-node transport layer, messages may still be dropped due to receive-buffer overflow.
The second problem is related to service tracking. There is no synchronization between the service-tracking connection socket and client sockets. A new server may receive and respond to messages before the client has received the event announcing its availability, and the client may receive messages from a server after an event has declared it unavailable.
To solve these issues, Jon proposed redefining what was initially a multicast group to what he calls a "Communication Group", where all sorts of reliable messaging is possible. Jon suggested it would be possible to leverage the TIPC service tracking feature to let group members subscribe for join/leave events from other members.
In a communication group, each member socket has two addresses: a group-member tuple bound by the user and a port-node tuple bound by the system. The TIPC binding table is the registry and distribution channel for member identities. Groups are closed: member sockets can only exchange messages with other sockets in the same group.
Jon then introduced the basic features. A user can create an instance of its own group, and there is end-to-end flow control, so messages are never dropped because of buffer overflow. Point-to-multipoint uses a sliding-window algorithm, and multipoint-to-point uses a coordinated sliding window. He then went on to present how unicast, multicast, and broadcast work.
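As a rough illustration of the point-to-multipoint sliding-window idea, here is a minimal Python sketch; the window size, class, and message types are assumptions for illustration, not TIPC's actual protocol:

```python
# Minimal sketch of point-to-multipoint sliding-window flow control, in the
# spirit of what Jon described (window sizes and message types are made up).
# The sender may only have `window` unacknowledged messages outstanding per
# receiver, so a slow receiver throttles the sender instead of overflowing.

class GroupSender:
    def __init__(self, members, window=16):
        self.window = window
        self.outstanding = {m: 0 for m in members}   # unacked msgs per member

    def can_send(self, member):
        return self.outstanding[member] < self.window

    def send(self, member, msg):
        if not self.can_send(member):
            raise BlockingIOError(f"window to {member} is full, caller must wait")
        self.outstanding[member] += 1
        return f"sent {msg!r} to {member}"

    def on_ack(self, member, acked):
        # Receiver advertises consumed messages, reopening the window.
        self.outstanding[member] = max(0, self.outstanding[member] - acked)

sender = GroupSender(["member-a", "member-b"], window=2)
print(sender.send("member-a", "m1"))
print(sender.send("member-a", "m2"))
print(sender.can_send("member-a"))   # False: window full, no buffer overflow possible
sender.on_ack("member-a", acked=2)
print(sender.can_send("member-a"))   # True again
```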
Jon concluded his talk by arguing that this approach has several advantages over traditional models. It provides connectionless messaging within a group with flow control for loss-free datagram and multicast messaging. It has a simple programming model with a single socket for send, receive, and membership events. It is memory- and resource-efficient, as a single socket occupies approximately 4 kbytes. It is bandwidth-efficient, as it leverages L2 broadcast or UDP multicast and can scale to hundreds of members without choking. Finally, it retains all the other TIPC advantages, such as service addressing and immediate reception of membership events.
Talk #9: Suricata, an XDP adventure
Speaker: Eric Leblond
Report by: Aaron Conole
In this talk, Eric discusses the recent work both in the Suricata IDS project as well as work in the XDP infrastructure that Suricata uses. Eric introduces Suricata as an open source GPLv2 Intrusion Detection System which monitors the network for specific signatures and then takes action when those signatures are detected. Suricata supports a wide variety of deep inspection modules which includes HTTP (gzip'd or not), FTP, SMTP, TFTP, KRB, and others - even file data. It also validates TCP parameters.
An issue for Suricata is packet loss. With simulation, Eric found that 3% packet loss can result in missing 10% of the alerts that should be triggered. Things only get worse for higher packet loss rates. For file extraction it is even worse, as one single packet loss can cause a file to be incomplete.
Another problem is elephant flows. A thread in Suricata can handle up to 500 Mbytes/second; if a flow's bandwidth is larger than that, packets will be lost, because all packets of a flow have to be sent to the same thread. Increasing the ring size could help, but that only fixes burst issues.
The solution to both problems is to bypass packets so they never hit the Suricata processing engine. Two different approaches to bypass were looked at: local bypass, which occurs in the worker threads, and capture bypass, which occurs in the kernel.
Two bypass implementations were pursued: AF_PACKET and XDP. With the AF_PACKET capture method using an eBPF socket filter, they still weren't happy with the performance achieved, so they investigated XDP and libbpf, which required lots of changes in Suricata to support its complex design (multiple threads, lots of different parsers). Porting the eBPF filter to an XDP filter was easy, but parsing could no longer be done on an skb. libbpf needed more work, as it was not ready and had to be patched. One of the biggest wins was an improvement by Jesper Brouer adding CPU redirection, which keeps the packets of a flow grouped together.
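The core of the bypass mechanism is a flow table consulted on the fast path. Here is a minimal Python sketch of that idea; in the real system the table is an eBPF map consulted from an XDP or socket-filter program, and the packet fields and function names below are illustrative:

```python
# Minimal sketch of the bypass idea: once Suricata decides a flow no longer
# needs inspection (e.g. an elephant flow already classified), its 5-tuple is
# added to a map consulted on the fast path, and further packets are dropped
# before they reach the detection engine. In the real system that map is an
# eBPF map consulted from an XDP or socket-filter program; here it is a dict.

bypass_table = {}   # 5-tuple -> packet/byte counters

def flow_key(pkt):
    return (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["proto"])

def fast_path(pkt):
    """Return True if the packet was bypassed and must not hit the engine."""
    key = flow_key(pkt)
    if key in bypass_table:
        stats = bypass_table[key]
        stats["packets"] += 1
        stats["bytes"] += pkt["len"]
        return True
    return False

def bypass_flow(pkt):
    """Called by the detection engine when it gives up on a flow."""
    bypass_table[flow_key(pkt)] = {"packets": 0, "bytes": 0}

pkt = {"src": "10.0.0.1", "sport": 40000, "dst": "10.0.0.2",
       "dport": 443, "proto": 6, "len": 1500}
print(fast_path(pkt))   # False: first packet goes to the engine
bypass_flow(pkt)        # engine decides this flow is not interesting
print(fast_path(pkt))   # True: subsequent packets are short-circuited
```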
Eric continued by presenting the XDP-related support in Suricata: for AF_PACKET with a socket filter, features such as eBPF load balancing, eBPF filtering, and eBPF bypass; for XDP, features such as XDP bypass for AF_PACKET capture with an XDP filter, XDP bypass in IPS mode, and XDP CPU redirect.
Further areas to investigate are using TC for hardware offload, using AF_XDP when it lands, and improving libbpf to be more generic.
Eric concluded his talk by noting that this work is integrated in the Suricata master branch, a network-card bypass for Netronome is coming, and AF_XDP capture is now in vanilla Linux, opening interesting opportunities.
Talk #10: Smart Network Analytics with BPF using Skydive and CogNETive
Speakers: Liran Schour, Eran Raichstein, Sylvain Afchain
Report by: Evangelos Haleplidis
In this talk, Nicolas Planel and Liran Schour presented Skydive and CogNETive smart analytics. Skydive is a real-time network topology and protocol analyzer. The problem is that SDN is complex: topologies are dynamic and multiple opaque tunneling layers make it hard to analyze the network with existing tools. Skydive is API-driven software with a graph engine and a distributed architecture of topology and flow/capture probes.
The Skydive architecture has a number of analyzers connecting to a database that contains flows and topologies. Skydive has multiple capture techniques: it can monitor all packets, use a BPF filter to check only SYN/FIN/RST packets for long-lasting TCP flows, or use eBPF (limited to kernels where it is enabled).
One of the goals was to capture TCP flows without computational cost. A BPF filter was used to capture the beginning and end of each TCP session; to handle sequence-number wraparound, they captured packets whose sequence numbers are separated by 64K windows. In a simple evaluation environment, the CPU utilization versus throughput with BPF was significantly lower than with AF_PACKET and slightly lower than with eBPF.
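To show why capturing only SYN/FIN/RST is so cheap, here is a minimal Python sketch of tracking flow boundaries from those packets alone; the flag constants are the standard TCP flag bits, but the packet dictionaries and function names are illustrative, not Skydive's API:

```python
# Minimal sketch of the "capture only SYN/FIN/RST" idea: for long-lasting TCP
# flows only the packets that open or close the session are passed up, which
# is enough to time the flow without paying per-packet capture costs. The
# packet dicts are illustrative, not Skydive's API.

SYN, FIN, RST = 0x02, 0x01, 0x04

flows = {}   # 4-tuple -> {"start": ts, "end": ts or None}

def interesting(pkt):
    # Equivalent in spirit to a BPF filter matching tcp-syn|tcp-fin|tcp-rst.
    return pkt["flags"] & (SYN | FIN | RST) != 0

def observe(pkt):
    if not interesting(pkt):
        return                      # the vast majority of packets are skipped
    key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
    if pkt["flags"] & SYN:
        flows.setdefault(key, {"start": pkt["ts"], "end": None})
    elif key in flows:              # FIN or RST closes the flow
        flows[key]["end"] = pkt["ts"]

observe({"src": "a", "sport": 1, "dst": "b", "dport": 80, "flags": SYN, "ts": 0.0})
observe({"src": "a", "sport": 1, "dst": "b", "dport": 80, "flags": 0,   "ts": 1.0})
observe({"src": "a", "sport": 1, "dst": "b", "dport": 80, "flags": FIN, "ts": 9.5})
print(flows)   # one flow, lasting ~9.5 seconds, seen from just two packets
```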
The CogNETive service is a GUI based on the Skydive project and is a visual communication debugger for cloud DevOps. CogNETive can analyze and perform anomaly detection; it can also do operational analytics. Skydive offers an analytics helper (SocketInfo), quite similar to running "ss -tunp" on the whole infrastructure, which relies on kprobes and eBPF.
To conclude, Skydive is a multi-layer network topology and flow exploration framework with multiple capture mechanisms depending on the use case, and CogNETive is a network exploration and operational-analytics service.
As future work, they want to do hybrid packet capture: capturing the first packets of a flow until it is classified, then continuing to capture only flow counters via eBPF. They also want to retrieve Linux network stack counters per local process, thanks to eBPF/kprobes.
Talk #11: DIM – Generic Dynamic Interrupt Moderation library
Speaker: Tal Gilboa
Report by: Roopa Prabhu
In this talk, Tal Gilboa described DIM, a generic dynamic interrupt moderation library: its flow, its algorithm, its usage, and its performance benefits. The DIM library is a net-driver-independent framework for dynamically tuning interrupt moderation; it exposes an API that any net driver may use to optimize its throughput, packet rate, latency, and interrupt rate.
The problem is moderating interrupts. Interrupts consume CPU, and if interrupt handling is not efficient enough, the CPU becomes the bottleneck. Tal wanted to break the 1:1 ratio between packets and interrupts for high-load scenarios, with little to no increase in latency. The focus was on interrupts for received packets, where queue items reside in serial memory and making multiple function calls per packet is inefficient. The benefit of DIM is fewer function calls per packet, resulting in less CPU overhead and better utilization of data-prefetch mechanisms.
Static configuration is fine, but it does not work the same on all systems and does not optimize for different traffic patterns; the suggested solution is dynamic interrupt moderation. The optimal approach would be to find a moderation setting for every existing traffic pattern, but in real life trade-offs between metrics force a choice and traffic patterns are rarely stable. This leads to an algorithm that prefers certain metrics over others: in DIM, bandwidth is preferred over packet rate, and packet rate over interrupt rate.
Tal continued the talk by presenting the algorithm, which samples statistics, compares the current iteration to the previous one, and decides. DIM selects a profile either with or without a timer reset. The algorithm compares samples and selects the best profile of frame counts and timers by moving left or right along a line of indexed profiles.
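Here is a minimal Python sketch of that profile-stepping idea; the profile table, thresholds, and function names are illustrative assumptions, not the real net_dim tables:

```python
# Minimal sketch of DIM-style profile stepping (the real net_dim tables and
# comparison thresholds differ; this only shows the "compare the last two
# samples and move along an indexed profile line" idea).

PROFILES = [  # (max usecs to wait, max frames to aggregate) per interrupt
    (1, 2), (8, 16), (64, 32), (128, 64), (256, 128),
]

def better(curr, prev):
    """Prefer bandwidth, then packet rate, then lower interrupt rate."""
    if curr["bps"] != prev["bps"]:
        return curr["bps"] > prev["bps"]
    if curr["pps"] != prev["pps"]:
        return curr["pps"] > prev["pps"]
    return curr["ints"] < prev["ints"]

def step(index, direction, curr, prev):
    """Keep moving in `direction` while things improve, else turn around."""
    if not better(curr, prev):
        direction = -direction
    return max(0, min(len(PROFILES) - 1, index + direction)), direction

index, direction = 2, +1
prev = {"bps": 10e9, "pps": 1e6, "ints": 100_000}
curr = {"bps": 12e9, "pps": 1.2e6, "ints": 90_000}   # improved: keep going right
index, direction = step(index, direction, curr, prev)
print(PROFILES[index])   # the new moderation profile handed to the driver
```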
Tal presented the benefits of DIM: the interrupt rate is significantly lower than with static moderation. For multiple streams, DIM still performs better for packets up to 512 bytes; for 1024-byte packets, it performs worse than static moderation.
Tal concluded his talk with the current status of DIM, which is upstream in mlx5 and the net_dim library, with support for adaptive TX. The future plans are to stabilize the profile selection, reduce overhead, and add more debugging facilities.
Posted Aug 28, 2018 20:35 UTC (Tue)
by mtaht (subscriber, #11087)
[Link] (16 responses)
This is debatable, although I get where he is coming from from a tcp perspective in the datacenter. If your 100Gige transaction is 10 packets in 6 nanoseconds, by ghu, burst up.
10 packets in a row for 13ms (10mbit), not so much.
The other major reason for my continued advocacy of "pure FQ" is for the wide range of traffic that has "no round trip completion" concept except "get me through fast with minimal delay and jitter" - voip, gaming, dns, videoconferencing, in particular.
Also, as multiple people have finally noticed ( https://github.com/systemd/systemd/issues/9725 ) , sch_fq is grossly unfair to anything other than locally sourced tcp traffic, with the large 15k initial quantum, and a quantum of 3028, to avoid del-acks. It should have been called sch_tcp_pacing or something like that. Real packet average sizes are 300 bytes.
Despite me saying that I do support DRR over SFQ and sometimes larger quantums - as it helps bunch up packets *a little bit* (easing the load on the host stack) and can be configured up or down as needed. In my pathetically slow portion of the universe we often run fq_codel with 300 byte quantums, and sch_cake autoconfigures and does gso-splitting by default, leading to much fairer results at a gigE and below.
"No existing qdisc can do this. But, the carousel/timing-wheel scheduler can accomplish this easily."
I like the timer wheel idea a lot regardless, and, with hardware support, seems needed and useful at speeds over 100mbit. If it works in software... oh, wow, we get to toss out 7 years of bufferbloat-fighting with tbf + fq_codel in favor of something new and shiny. I can't wait to see whatever eric comes up with. :)
... but I kind of figure I'll have to dig up some dusty research into virtualclock FQ systems to make it work where I want.
Posted Aug 29, 2018 6:24 UTC (Wed)
by zdzichu (subscriber, #17118)
[Link] (15 responses)
I, for one, still haven't found an answer to what would be the best setting for my home server. It serves some webpages out to the internet, and various media inside my home network (bulk transfer), interacts with IoT (lowest latency required), hosts a few virtual machines and ALSO is a router to the internet and a wifi access-point. CAKE looks like a thing that will finally behave correctly in such a scenario.
(Also an interesting note: although I'm a bit of a network guy, your post was very close to technobabble, and required increased mental power to process. I wonder if network issues are complicated by their nature).
Posted Aug 29, 2018 9:16 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (12 responses)
It's like adding ketchup to cotton candy.
Posted Aug 29, 2018 12:34 UTC (Wed)
by zdzichu (subscriber, #17118)
[Link] (11 responses)
Posted Aug 29, 2018 17:31 UTC (Wed)
by mtaht (subscriber, #11087)
[Link] (10 responses)
But the idea of can bus folk thinking queues are infinite and non-loss ever scares the bejesus out of me. Plenty of other reasons (like running out of memory, or corruption due to a wild memory write, or a cosmic ray event) exist for a packet to be lost.
I won't sleep better if you can point me at what's going on over in that world, but please do so....
Posted Aug 29, 2018 18:21 UTC (Wed)
by mtaht (subscriber, #11087)
[Link] (7 responses)
my general hope is that the CAN folk takeaway is not that fq_codel is dangerous, but short buffer and packet loss is. Merely switching back to a fifo masks the problem. So I'd hope - after getting burned by fq_codel (what speed does can run at?) they'd apply rigor to their protocol designs to make them tolerant of loss in general. (and figure out how much steady-state buffering is needed) An example would be an accelerator sensor should always send absolute positions not deltas, at a high sample rate and the overall capacity of the bus left with sufficient headroom with all the other devices on it to never get even close to capacity.
Another takeaway from the systemd bug I cite above is that Linux's network behavior is still far from optimal on short paths. TSQ doesn't scale past a dozen flows. We hit a wall on decreasing cwnd in tcp that can possibly be met with better pacing. Or not pulling data out of the socket buffer until it's needed...
On short paths on that test, there was 760 times more buffering than that required to fill the gigE path, even with codel dropping madly. I ran the same test with a target 1ms, got it down to half that, same throughput same rtt (and quite a few RTOs)....
Posted Aug 29, 2018 18:44 UTC (Wed)
by mtaht (subscriber, #11087)
[Link] (2 responses)
this can device is noqueue. I don't know what that means, device has a lot of internal buffering?
http://docs.automotivelinux.org/docs/apis_services/en/dev...
this on the other hand might be a problem
https://github.com/lxc/lxc/issues/2474
but this suggests it isn't, though a correct fq_codel target setting would be well above 50ms in this case
https://github.com/GBert/openwrt-misc/tree/master/mcp2515...
Posted Aug 29, 2018 18:45 UTC (Wed)
by mtaht (subscriber, #11087)
[Link] (1 responses)
Posted Aug 29, 2018 19:39 UTC (Wed)
by mtaht (subscriber, #11087)
[Link]
Thanks for pointing this out. For the first time in years I'm led to conclude strict fifo is better in a given system. Not that I don't see CAN has problems, mind you, but they aren't fq_codels domain to solve.
I'll have to write this up somewhere, maybe discuss on the can list.
Posted Aug 29, 2018 20:03 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Typically people just over-provision CAN networks so there's almost no contention.
Posted Aug 29, 2018 20:29 UTC (Wed)
by mtaht (subscriber, #11087)
[Link] (2 responses)
Posted Aug 29, 2018 20:57 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Aug 29, 2018 21:28 UTC (Wed)
by mtaht (subscriber, #11087)
[Link]
default pfifo_fast would be buffering up 1000, which strikes me as a bit much. I did see (I think) something that set txqueuelen to 10.
I'm (sigh) going to have to go look at every "packet" interface in linux now...
Posted Aug 30, 2018 7:02 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
The bug report about the default qdisc vs CAN: https://github.com/systemd/systemd/issues/9194
It has originated from this thread: https://marc.info/?l=linux-can&m=152785826628439&w=2
Basically, it is impossible to have a good default, fighting bufferbloat and fitting all Linux usages at the same time.
Posted Aug 30, 2018 15:08 UTC (Thu)
by mtaht (subscriber, #11087)
[Link]
Posted Aug 29, 2018 17:21 UTC (Wed)
by mtaht (subscriber, #11087)
[Link]
One use for ecn is certainly in networks that cannot tolerate loss, I know of one application in a flying spacecraft, (thankfully mostly merely multiplexing sensors across the bus which has a predictable outcome) and I would certainly approve of ecn + *rigorous* congestion awareness in every app in a can (or nmea 20000) network. Certainly I'd support setting an appropriate target delay to supply adequate buffering.
I'm sorry for the technobabble, I've been at this gig for too long. I did a lot of introductory talks in the beginning....
Posted Aug 29, 2018 17:23 UTC (Wed)
by mtaht (subscriber, #11087)
[Link]
It's still turtles all the way down.
This is sadly an illustration that we can not have nice defaults in this world. Everything is broken :(