
A report from the networking miniconference

By Jonathan Corbet
August 27, 2014

Kernel Summit 2014
The second day of the 2014 Kernel Summit included a miniconference for networking subsystem developers. Your editor was unable to attend, but he did get to hear Dave Miller's rapid-fire summary of the topics discussed there. The following report has no hope of being complete — taking notes that quickly is difficult — but, with luck, it covers the most important points.

Dave started with a couple of quick topics, the first of which was the Stream Control Transmission Protocol (SCTP). In general, he said, the networking layer has a lot of highly abstracted code that is shared between protocol implementations. It has always been hard for SCTP to participate in that sharing, though, due to its concept of "associations." The result has been a lot of code duplication in the SCTP subsystem. Now, it seems, there is a new effort afoot to rework the SCTP implementation and unify the code (to a greater extent) with the rest of the networking subsystem.

One longstanding suboptimal area in the networking code has been the large hash tables allocated for protocols like TCP at boot time. These tables take a lot of memory; they do not necessarily have to be as big as they are, but there is no way to know what the proper size is when the system is coming up. Now, though, the networking layer has resizeable hash tables protected by the read-copy-update (RCU) mechanism. These tables can be reallocated as necessary, so there is no longer a need to keep large hash tables throughout the life of the system.

The extended Berkeley Packet Filter (eBPF) work, Dave noted, remains somewhat controversial. The biggest problem seems to be that eBPF developer Alexei Starovoitov has a great deal of energy and reviewers are having a hard time keeping up. So, Dave said, he is going to start pushing back a bit on these patches to get Alexei to slow things down.

There are concerns, Dave said, about the proposal to add the ability to dereference general pointers to eBPF. The possibility of adding backward branches to the eBPF virtual machine is also worrying to some. Nobody disagrees with Alexei's main goal: the creation of a generic virtual machine that is useful throughout the kernel. But it is important not to lose the protected execution environment that eBPF has always provided; it would not be good if eBPF were to become a source of security holes in the kernel. So there will need to be more restrictive rules about pointer access and a lot more checking, he said.
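The kind of static checking under discussion can be illustrated with a toy model. This is purely an illustrative sketch, not the kernel's actual eBPF verifier; the instruction format and rules here are invented for the example:

```python
# Toy model of verifier-style checking: reject programs containing
# backward branches (potential unbounded loops) or loads through
# registers not proven to hold a bounds-checked pointer.
# The instruction format is invented for illustration.

def verify(program):
    """Return (ok, reason) for a list of (op, *args) instructions."""
    valid_ptrs = set()            # registers proven to hold checked pointers
    for pc, insn in enumerate(program):
        op = insn[0]
        if op == "jmp":
            target = insn[1]
            if target <= pc:      # backward branch: could loop forever
                return False, f"backward branch at {pc}"
        elif op == "check_ptr":   # explicit bounds check blesses a register
            valid_ptrs.add(insn[1])
        elif op == "load":        # load through a register
            if insn[1] not in valid_ptrs:
                return False, f"unchecked pointer deref at {pc}"
    return True, "ok"
```

A program that checks a pointer before dereferencing it passes; one that loads through an unchecked register, or jumps backward, is rejected — which is the flavor of restriction being proposed for general pointer access in eBPF.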

Ted Ts'o suggested that the SystemTap developers should have a look at eBPF, as it might make a good replacement for the specially-created kernel modules that are loaded now. But James Bottomley responded that SystemTap needs a thoroughly general execution engine — with wide-ranging access to the kernel — which is something that eBPF is explicitly not trying to be.

Dave then relayed Pablo Neira Ayuso's report on the Netfilter workshop recently held in France. A lot of work has gone into the removal of the central lock in the connection-tracking code, making that code quite a bit more efficient. There is also, it seems, a determined effort under way to figure out what it will take to run interfaces at the full hardware speed when the traffic is made up of small packets — an area where the Linux network stack falls behind a bit.

There is interest in Intel's Data Plane Development Kit (DPDK), which is a mechanism that pushes packet handling out to user space. It produces good numbers on benchmarks, Dave said, but, in his opinion, there is always going to be some way to get similar performance with in-kernel code. He mentioned receive polling as an example: it gives the desired performance, but still keeps the full Linux network stack available.

Naturally, there was a discussion of nftables, the in-kernel virtual machine intended to eventually replace iptables. There has been a lot of work done on the iptables compatibility layer, a command-line interface that makes it possible for administrators to run their existing firewall scripts unchanged under nftables. But that does not mean that nftables will be replacing iptables anytime soon; the two are not compatible at the kernel interface level, so iptables will have to stay around for a long time. There was "a brawl" at the workshop about possibly replacing the nftables virtual machine with eBPF, but there is one major show-stopper in any such plan: nftables allows partial replacement of a firewall configuration, while eBPF, in its current form, would not.

From there, Dave shifted to encapsulation offloading. Whenever you start encapsulating packets and tunneling them through other transports, you have to worry about issues like where the checksumming happens and how flow distribution is managed. This will become a bigger issue, he said, because UDP encapsulation is going to become ubiquitous; just about every chip out there can checksum UDP packets, so support is easy. But steering the various flows is less so. The networking developers want to avoid the use of deep packet inspection for clean handling of encapsulated flows; to that end, they have come up with a trick using source port numbers to identify flows and steer them accordingly. Other tricks manage the checksumming at various layers; one, called "remote checksum offload," limits checksumming to the outer packet, with inner checksumming done at the receiver.
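The source-port trick can be sketched in a few lines: the encapsulating endpoint hashes the inner flow's five-tuple into the outer UDP source port, so switches and NICs doing ordinary five-tuple hashing spread tunneled flows across paths without inspecting the inner headers. This is a conceptual sketch; the hash function and port range here are illustrative, not what any particular tunnel implementation uses:

```python
import zlib

EPHEMERAL_LO, EPHEMERAL_HI = 49152, 65535  # IANA dynamic port range

def outer_source_port(src_ip, dst_ip, src_port, dst_port, proto):
    """Map an inner flow's five-tuple to an outer UDP source port.

    Packets of the same inner flow always get the same outer source
    port, so five-tuple hashing in switches and NICs keeps them on one
    path; different flows spread across the ephemeral range.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = zlib.crc32(key)           # stand-in for the kernel's flow hash
    span = EPHEMERAL_HI - EPHEMERAL_LO + 1
    return EPHEMERAL_LO + (h % span)
```

Because the mapping is deterministic, packet ordering within a flow is preserved, while the population of flows as a whole gets spread out — the property the networking developers are after.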

Of general interest to the network stack as a whole is the concept of send batching. The network driver interface is designed around sending a single packet at a time; there is no way for a driver to know if there are more packets coming immediately afterward — which there often are. If the driver knew more traffic was coming, it could defer starting the hardware, cutting transmit overhead significantly. The plan is to add a new "transmit flush" operation; if a driver provides that function, it will not start the hardware immediately on receipt of a packet to transmit. Instead, that "kick" will be deferred until the flush function is called. Some tweaking may be called for; deferring hardware startup could cause the wire to go idle, which is not desirable. But that seems to be a solvable problem.
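The deferred-kick idea can be modeled schematically. This is a simulation for illustration only, not driver code; the class and method names are invented:

```python
class BatchingDriver:
    """Schematic model of deferred-doorbell transmit batching."""
    def __init__(self):
        self.queue = []
        self.doorbells = 0        # how many times the hardware was kicked

    def xmit(self, pkt, more_coming=False):
        self.queue.append(pkt)
        if not more_coming:       # no hint of more traffic: kick now
            self.flush()

    def flush(self):
        """The 'transmit flush' operation: one kick for the whole batch."""
        if self.queue:
            self.doorbells += 1   # one doorbell write starts the hardware
            self.queue.clear()
```

Sending eight packets one at a time costs eight doorbell writes; with the stack signaling that more packets are coming and a single flush at the end, the hardware is started once — the transmit-overhead saving described above.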

Wireless networking

There was, Dave said, a "prisoner exchange," bringing in some developers from the wireless networking summit. Among the topics discussed was ARP proxying in access points to save power. Access points typically already know the MAC addresses of the systems they talk to; they should be able to answer ARP requests and avoid waking the destination system. It was agreed that this task should be handled in the bridging code, which already had related duties.

A bigger issue is network function offloading, where bridging chips can manage the forwarding database and take the processor out of the loop entirely. It is a nice feature, but there is one problem: it's all managed either via binary-only drivers or vendor-specific user-space code. OpenWRT, evidently, is "having fits" over these drivers. To try to address this problem, some work is being done to extend the netlink interface to cover some of these functions; then, hopefully, vendors can at least be convinced to work with the standard tools. There is a QEMU-based device being developed to test this code with.

Wireless maintainer John Linville got up briefly to discuss a few issues from the wireless summit. One problem the wireless developers are facing is that Android is still using the "wireless extensions" ABI, which has been deprecated for many years. It seems that it is easy to add vendor-specific operations to wireless extensions, so vendors are doing that. In response, the wireless developers have been adding some options to the current interface to give it some more flexibility. But that work has not immediately translated into vendors switching over. The current plan is to "talk to Google" and try to get it to encourage vendors to move away from the wireless extensions.

There has been some work to get a firmware dump tool in place. After some discussion, the developers came up with an option using sysfs to get the relevant data.

Finally, John let it be known that he is getting a little tired of being the wireless maintainer, but he has not yet been able to find a good candidate for a replacement. There are a lot of talented developers working on the wireless stack, he said, but most of those work for hardware vendors. It seems that these vendors are, as a general rule, unenthusiastic about having their developers working to support drivers for other vendors' hardware. So a new wireless maintainer almost certainly needs to work for a hardware-neutral organization — a distributor, for example. If there are any such people out there, John would like to hear from them.

This session covered a number of other topics. For example, Bluetooth maintainer Marcel Holtmann gave a high-speed update on that subsystem that was far beyond your editor's fingers' ability to follow. Suffice to say that the 3.17 kernel will include Bluetooth 4.1 support. The conclusion that results from all this, clearly, is that there is still a lot going on in the networking subsystem.




A report from the networking miniconference

Posted Aug 28, 2014 20:47 UTC (Thu) by mtaht (subscriber, #11087) [Link] (4 responses)

I would really like to see a "bql-on-everything" effort - only 22 or so Ethernet drivers have gained support for it so far, and the results from using it at a variety of bandwidths, from 1Mbit to 10GigE, are really impressive for the 4-8 extra lines of code added to the network driver. Entire manufacturers - AMD, Cisco, Xilinx, and more - still lack BQL support, even though it's been in the kernel for 2+ years now.

BQL is really spectacular and needed at 10GigE+, but it's useful at all bandwidths.

Recently I added BQL to the BeagleBone Black; the result was reduced latency and increased throughput at 100Mbit.

Some plots:

http://snapon.lab.bufferbloat.net/~d/beagle_bql/bql_makes...
http://snapon.lab.bufferbloat.net/~d/beagle_bql/beaglebon...

(6 line patch not submitted to mainline yet)

The hard part about adding BQL is that you really need to have the device in front of you and run tests; with the various possible error-out conditions, you can hose yourself in any number of ways. So it would be best to mandate BQL in new drivers, and to ask the various maintainers to try adding BQL to their drivers as time permits.
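The mechanism those few lines of driver code hook into (the kernel's netdev_tx_sent_queue() and netdev_tx_completed_queue() helpers) can be modeled roughly like this. This is a simplified sketch: the real implementation also adapts the limit dynamically based on completion behavior:

```python
class ByteQueueLimit:
    """Simplified BQL: cap the bytes queued to the NIC, not the packets."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.inflight = 0         # bytes handed to hardware, not yet completed
        self.stopped = False      # whether the stack should stop feeding us

    def sent(self, nbytes):       # loosely: netdev_tx_sent_queue()
        self.inflight += nbytes
        if self.inflight >= self.limit:
            self.stopped = True   # excess traffic now queues in the qdisc,
                                  # where fq_codel can actually manage it

    def completed(self, nbytes):  # loosely: netdev_tx_completed_queue()
        self.inflight -= nbytes
        if self.inflight < self.limit:
            self.stopped = False  # wake the stack again
```

The point of the byte-based limit is that it keeps the uncontrollable hardware ring short, pushing standing queues up into the qdisc layer where smart queue management applies.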

It would be good to put some work into improving usbnet as well...

And Wi-Fi is a terrible mess that needs much love in the upcoming MU-MIMO world.

A report from the networking miniconference

Posted Aug 31, 2014 19:59 UTC (Sun) by flussence (guest, #85566) [Link] (3 responses)

> Entire manufacturers like AMD, cisco, xilinx and more still lack BQL support

That's news to me - is there a list anywhere (or straightforward way to generate one)? I was a bit underwhelmed with fq_codel after switching everything I could to it, maybe this is why.

A report from the networking miniconference

Posted Aug 31, 2014 20:07 UTC (Sun) by dlang (guest, #313) [Link] (1 responses)

It's important to realize that fq_codel is going to make no difference if there is no data in the transmit queue for it to manage, and it only affects packets that you send.

So you can enable it on every machine on your network, but if you don't enable it on your router that is the bottleneck of your connection to the Internet, it does absolutely no good.

Even if you do enable it on your router, if your ISP doesn't have it on their router, it's not going to help for traffic from the Internet to you.

This is why CeroWrt implements inbound rate limiting of connections (which is very CPU-intensive): it attempts to keep the traffic flow low enough that the ISP's router never becomes the bottleneck and so never starts buffering traffic.

A report from the networking miniconference

Posted Sep 1, 2014 16:26 UTC (Mon) by mtaht (subscriber, #11087) [Link]

re: "no difference" and "absolutely no benefit" of fq_codel on servers.

This is not strictly true. The "fq" portion of fq_codel serves to break up microbursts when multiple flows are in play. So a busy web, file, or VoIP server can benefit. A machine hosting VMs can benefit. Etc.

Most importantly, fq_codel "does no harm" in the vast majority of cases, so I would like to see it, or something derived from it, become the Linux default, replacing pfifo_fast, with it made easy to choose something different.

Now this benefit keeps getting reduced (on servers) as new subsystems like TCP small queues come into play, at least on simplistic benchmarks. I'd like it a lot if TSQ could scale correctly against a large number of flows, but that's what the new sch_fq is for.

As David says, the biggest benefit from this style of queue discipline is anywhere you have a fast-to-slow transition, and/or multiple ports feeding into one on a switch or router, where persistent queues can form. The bigger the difference in bandwidth, the more benefit can be had from both the "fq" and AQM portions of the algorithm. Pretty much all the aftermarket Linux router firmware has adopted fq_codel at this point, but getting something like it into switches will take some doing....
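The "fq" effect described above can be seen in a tiny sketch: draining per-flow queues round-robin interleaves a microburst from one flow with the packets of the others. This is illustrative only; fq_codel proper uses deficit round-robin over hashed flow queues plus the CoDel AQM on each queue:

```python
from collections import OrderedDict

def fq_interleave(packets):
    """Round-robin across per-flow queues, so a microburst from one
    flow no longer monopolizes the wire.

    `packets` is a list of (flow, seq) tuples in arrival order.
    """
    flows = OrderedDict()
    for flow, seq in packets:
        flows.setdefault(flow, []).append(seq)
    out = []
    while flows:
        # one packet from each active flow per round
        for flow in list(flows):
            out.append((flow, flows[flow].pop(0)))
            if not flows[flow]:
                del flows[flow]
    return out
```

A burst of three packets from flow A arriving ahead of single packets from B and C goes out interleaved, so B and C are not stuck behind A's burst.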

A report from the networking miniconference

Posted Sep 1, 2014 16:39 UTC (Mon) by mtaht (subscriber, #11087) [Link]

BQL was a breakthrough in low level queue management. As of the last kernel we looked at, support for it was in the following multi-queued (and mostly 10GigE) drivers:

find drivers/net/ethernet -name '*.c' -exec fgrep -l netdev_tx_completed_queue {} \;

drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
drivers/net/ethernet/intel/i40evf/i40e_txrx.c
drivers/net/ethernet/intel/i40e/i40e_txrx.c
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
drivers/net/ethernet/broadcom/bnx2.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/ethernet/sfc/tx.c
drivers/net/ethernet/mellanox/mlx4.save/en_tx.c
drivers/net/ethernet/mellanox/mlx4/en_tx.c
drivers/net/ethernet/freescale/gianfar.c

And in the following (most GigE and lower) drivers:

find drivers/net/ethernet -name '*.c' -print | xargs fgrep -l netdev_completed_queue

drivers/net/ethernet/nvidia/forcedeth.c
drivers/net/ethernet/atheros/alx/main.c
drivers/net/ethernet/broadcom/b44.c
drivers/net/ethernet/broadcom/bgmac.c
drivers/net/ethernet/intel/e1000/e1000_main.c
drivers/net/ethernet/intel/e1000e/netdev.c
drivers/net/ethernet/realtek/8139cp.c
drivers/net/ethernet/marvell/sky2.c
drivers/net/ethernet/marvell/skge.c
drivers/net/ethernet/hisilicon/hix5hd2_gmac.c

I know there is currently out-of-tree support for the atheros ar71xx, and now beaglebone black (ti cpsw driver), but that's it. Vast swaths of drivers are presently left uncovered, including everything from allwinner, amd, cisco, octeon, freescale, nvidia, and brocade...

and a technology like powerline ethernet would probably benefit greatly too, as well as DSL and a few others. Wifi would greatly benefit from something BQL-like but pure BQL won't work there.

There is a tool for monitoring the BQL behavior:

https://github.com/ffainelli/bqlmon

And as I said, it's *really* easy to add to most ethernet drivers, 4-8 lines of code. It's getting hard to see the benefit under simplistic workloads, so I tend to drive tests with the rrul test from:

https://github.com/tohojo/netperf-wrapper

You can take an existing BQL-enabled driver and disable BQL to get a reasonable measurement of the before/after behavior. (Put a very large value in /sys/class/net/your_device/queues/tx-*/byte_queue_limits/limit.)

New wireless maintainer

Posted Aug 31, 2014 20:46 UTC (Sun) by robbe (guest, #16131) [Link]

Which distributor cares that much about wireless? Canonical? Google?

John's replacement need not already work at that company. I guess one of these "talented developers" could, if suitably motivated, change employer and take up a more managerial role.

A report from the networking miniconference

Posted Sep 1, 2014 0:27 UTC (Mon) by dlang (guest, #313) [Link]

> deferring hardware startup could cause the wire to go idle, which is not desirable. But that seems to be a solvable problem.

The right thing to do is not to delay the hardware startup, but instead to make the transmission of packets greedy: if there are multiple packets that can be sent at once (up to a byte-size limit), send them all.

This avoids the whole area of problems of the media going idle, or of forgetting to call flush in some codepath.

It also adds less latency to the packets (especially if they turn out to be sparse).

If you already have fq_codel or something like it in place to categorize the packets, finding if you have more packets that you can combine should be a lot easier.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds