A report from the networking miniconference
Dave started with a couple of quick topics, the first of which was the Stream Control Transmission Protocol (SCTP). In general, he said, the networking layer has a lot of highly abstracted code that is shared between protocol implementations. It has always been hard for SCTP to participate in that sharing, though, due to its concept of "associations." The result has been a lot of code duplication in the SCTP subsystem. Now, it seems, there is a new effort afoot to rework the SCTP implementation and unify the code (to a greater extent) with the rest of the networking subsystem.
One longstanding suboptimal area in the networking code has been the large
hash tables allocated for protocols like TCP at boot time. These tables
take a lot of memory; they do not necessarily have to be as big as they
are, but there is no way to know what the proper size is when the system is
coming up. Now,
though, the networking layer has resizeable hash tables protected by the
read-copy-update (RCU) mechanism. These tables can be reallocated as
necessary, so there is no longer a need to keep large hash tables
throughout the life of the system.
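The idea of a table that starts small and grows on demand can be sketched in a few lines. What follows is only a toy illustration in Python, not the kernel's implementation: the class name, the doubling policy, and the load-factor threshold are all assumptions made for the example, and there is no RCU here. The one property it does try to mirror is that the new bucket array is built off to the side and then published with a single reference assignment.

```python
class ResizableTable:
    """Toy chained hash table that starts small and doubles when it fills up."""

    def __init__(self, initial_buckets=4, max_load=1.0):
        self.buckets = [[] for _ in range(initial_buckets)]
        self.max_load = max_load  # assumed policy: grow when entries > buckets
        self.count = 0

    @staticmethod
    def _chain(key, buckets):
        return buckets[hash(key) % len(buckets)]

    def insert(self, key, value):
        chain = self._chain(key, self.buckets)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite an existing entry
                return
        chain.append((key, value))
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._resize(2 * len(self.buckets))

    def lookup(self, key):
        for k, v in self._chain(key, self.buckets):
            if k == key:
                return v
        return None

    def _resize(self, new_size):
        # Build the replacement array separately, then swap it in with one
        # assignment; readers see either the old array or the new one.
        new_buckets = [[] for _ in range(new_size)]
        for chain in self.buckets:
            for k, v in chain:
                self._chain(k, new_buckets).append((k, v))
        self.buckets = new_buckets


table = ResizableTable()
for port in range(100):
    table.insert(port, "conn-%d" % port)
print(len(table.buckets), table.lookup(42))
```

The table only ever uses as much memory as its current population calls for, which is the point of not sizing it at boot time.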
The extended Berkeley Packet Filter (eBPF) work, Dave noted, remains somewhat controversial. The biggest problem seems to be that eBPF developer Alexei Starovoitov has a great deal of energy and reviewers are having a hard time keeping up. So, Dave said, he is going to start pushing back a bit on these patches to get Alexei to slow things down.
There are concerns, Dave said, about the proposal to add the ability to dereference general pointers to eBPF. The possibility of adding backward branches to the eBPF virtual machine is also worrying to some. Nobody disagrees with Alexei's main goal: the creation of a generic virtual machine that is useful throughout the kernel. But it is important not to lose the protected execution environment that eBPF has always provided; it would not be good if eBPF were to become a source of security holes in the kernel. So there will need to be more restrictive rules about pointer access and a lot more checking, he said.
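To see why backward branches are a concern, consider a toy checker for an invented instruction format. This is not the eBPF instruction set or the kernel's verifier; the opcode names and the tuple layout are assumptions for illustration only. If every jump goes forward and lands inside the program, and the last instruction is an exit, then execution must terminate; a single backward jump removes that guarantee.

```python
def verify(program):
    """Accept a program only if all jumps go forward and stay in bounds.

    `program` is a list of (opcode, argument) tuples in a made-up format;
    for the jump opcodes the argument is an offset relative to the next
    instruction.
    """
    if not program or program[-1][0] != "exit":
        return False  # falling off the end is not allowed
    for pc, (op, arg) in enumerate(program):
        if op in ("jmp", "jeq"):
            target = pc + 1 + arg
            if arg < 0 or target >= len(program):
                return False  # backward or out-of-bounds jump
    return True


straight_line = [("load", 0), ("jeq", 1), ("load", 1), ("exit", 0)]
with_loop = [("load", 0), ("jmp", -2), ("exit", 0)]
print(verify(straight_line), verify(with_loop))
```

Allowing loops would require a checker that can prove termination some other way, which is where the "lot more checking" comes in.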
Ted Ts'o suggested that the SystemTap developers should have a look at eBPF, as it might make a good replacement for the specially-created kernel modules that are loaded now. But James Bottomley responded that SystemTap needs a thoroughly general execution engine — with wide-ranging access to the kernel — which is something that eBPF is explicitly not trying to be.
Dave then relayed Pablo Neira Ayuso's report on the Netfilter workshop recently held in France. A lot of work has gone into removing the central lock in the connection-tracking code, making that code quite a bit more efficient. There is also, it seems, a determined effort under way to figure out what it will take to run interfaces at the full hardware speed when the traffic is made up of small packets, an area where the Linux network stack falls behind a bit.
There is interest in Intel's Data Plane Development Kit (DPDK), which is a mechanism that pushes packet handling out to user space. It produces good numbers on benchmarks, Dave said, but, in his opinion, there is always going to be some way to get similar performance with in-kernel code. He mentioned receive polling as an example: it gives the desired performance, but still keeps the full Linux network stack available.
Naturally, there was a discussion of nftables, the in-kernel virtual machine intended to eventually replace iptables. There has been a lot of work done on the iptables compatibility layer, a command-line interface that makes it possible for administrators to run their existing firewall scripts unchanged under nftables. But that does not mean that nftables will be replacing iptables anytime soon; the two are not compatible at the kernel interface level, so iptables will have to stay around for a long time. There was "a brawl" at the workshop about possibly replacing the nftables virtual machine with eBPF, but there is one major show-stopper in any such plan: nftables allows partial replacement of a firewall configuration, while eBPF, in its current form, would not.
From there, Dave shifted to encapsulation offloading. Whenever you start encapsulating packets and tunneling them through other transports, you have to worry about issues like where the checksumming happens and how flow distribution is managed. This will become a bigger issue, he said, because UDP encapsulation is going to become ubiquitous; just about every chip out there can checksum UDP packets, so support is easy. But steering the various flows is less so. The networking developers want to avoid the use of deep packet inspection for clean handling of encapsulated flows; to that end, they have come up with a trick using source port numbers to identify flows and steer them accordingly. Other tricks manage the checksumming at various layers; one, called "remote checksum offload," limits checksumming to the outer packet, with inner checksumming done at the receiver.
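The source-port trick can be illustrated with a short sketch. The function name, the use of CRC32 as the hash, and the 49152-65535 port range are assumptions made for this example rather than details from the talk. The idea is that the encapsulating side hashes the inner flow's addresses and ports and uses the result as the outer UDP source port, so that receivers and intermediate devices can spread flows across queues by looking at the outer header alone.

```python
import zlib

# Assumed range for the example: the dynamic/private port range.
PORT_MIN, PORT_MAX = 49152, 65535


def outer_source_port(src_ip, dst_ip, proto, src_port, dst_port):
    """Derive an outer UDP source port from the inner flow's 5-tuple."""
    key = "|".join(map(str, (src_ip, dst_ip, proto, src_port, dst_port)))
    flow_hash = zlib.crc32(key.encode())
    return PORT_MIN + flow_hash % (PORT_MAX - PORT_MIN + 1)


# Every packet of one inner flow gets the same outer source port, so the
# flow stays on one queue without anyone parsing the inner headers.
print(outer_source_port("10.0.0.1", "10.0.0.2", 6, 40000, 80))
```

Different inner flows may occasionally hash to the same port; that only means they share a queue, not that anything breaks.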
Of general interest to the network stack as a whole is the concept of send batching. The network driver interface is designed around sending a single packet at a time; there is no way for a driver to know if there are more packets coming immediately afterward, which there often are. If the driver knew more traffic was coming, it could defer starting the hardware, cutting transmit overhead significantly. The plan is to add a new "transmit flush" operation; if a driver provides that function, it will not start the hardware immediately on receipt of a packet to transmit. Instead, that "kick" will be deferred until the flush function is called. Some tweaking may be called for; deferring hardware startup could cause the wire to go idle, which is not desirable. But that seems to be a solvable problem.
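A small simulation shows where the savings come from. Everything here is hypothetical: the class and method names are invented for the example and do not correspond to the actual driver interface. The sketch simply counts how many times the hardware is started for a burst of packets, with and without a flush operation.

```python
class ToyDriver:
    """Counts how often the (imaginary) hardware is started."""

    def __init__(self, has_flush):
        self.has_flush = has_flush  # does this driver offer a flush operation?
        self.ring = []              # packets queued but not yet given to hardware
        self.kicks = 0              # number of hardware starts
        self.sent = []              # packets the hardware has picked up

    def _kick(self):
        if self.ring:
            self.kicks += 1
            self.sent.extend(self.ring)
            self.ring.clear()

    def xmit(self, packet):
        self.ring.append(packet)
        if not self.has_flush:
            self._kick()  # no flush available: start the hardware every time

    def flush(self):
        self._kick()      # one deferred kick covers everything queued


def send_burst(driver, packets):
    for packet in packets:
        driver.xmit(packet)
    if driver.has_flush:
        driver.flush()


legacy, batched = ToyDriver(has_flush=False), ToyDriver(has_flush=True)
send_burst(legacy, range(8))
send_burst(batched, range(8))
print(legacy.kicks, batched.kicks)  # 8 1
```

The idle-wire worry is visible here too: in the batched case nothing reaches the hardware until flush() runs, so a long gap before the flush leaves the link unused.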
Wireless networking
There was, Dave said, a "prisoner exchange," bringing in some developers from the wireless networking summit. Among the topics discussed was ARP proxying in access points to save power. Access points typically already know the MAC addresses of the systems they talk to; they should be able to answer ARP requests and avoid waking the destination system. It was agreed that this task should be handled in the bridging code, which already had related duties.
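The proxying decision itself is simple, as the following sketch suggests. The function name and the dictionary-based packet representation are assumptions for illustration; they say nothing about how the bridging code would actually do it. If the access point already knows the MAC address for the requested IP address, it answers on the station's behalf; otherwise the request is forwarded as usual.

```python
def handle_arp_request(stations, request):
    """Answer an ARP request from the access point's own table if possible.

    `stations` maps IP address -> MAC address for associated systems.
    Returns a reply dictionary, or None if the request must be forwarded.
    """
    mac = stations.get(request["target_ip"])
    if mac is None:
        return None  # unknown target: forward the request as usual
    return {
        "op": "reply",
        "sender_ip": request["target_ip"],
        "sender_mac": mac,  # answered on behalf of the sleeping station
        "target_ip": request["sender_ip"],
        "target_mac": request["sender_mac"],
    }


stations = {"192.168.1.20": "02:00:00:00:00:20"}
request = {
    "op": "request",
    "sender_ip": "192.168.1.1",
    "sender_mac": "02:00:00:00:00:01",
    "target_ip": "192.168.1.20",
}
print(handle_arp_request(stations, request))
```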
A bigger issue is network function offloading, where bridging chips can manage the forwarding database and take the processor out of the loop entirely. It is a nice feature, but there is one problem: it's all managed either via binary-only drivers or vendor-specific user-space code. OpenWRT, evidently, is "having fits" over these drivers. To try to address this problem, some work is being done to extend the netlink interface to cover some of these functions; then, hopefully, vendors can at least be convinced to work with the standard tools. There is a QEMU-based device being developed to test this code with.
Wireless maintainer John Linville got up briefly to discuss a few issues
from the wireless summit. One problem the wireless developers are facing
is that Android is still using the "wireless extensions" ABI, which has
been deprecated for many years. It seems that it is easy to add
vendor-specific operations to wireless extensions, so vendors are doing
that. In response, the wireless developers have been adding some options
to the current interface to give it some more flexibility. But that work
has not
immediately translated into vendors switching over. The current plan is to
"talk to Google" and try to get it to encourage vendors to move away from
the wireless extensions.
There has been some work to get a firmware dump tool in place. After some discussion, the developers came up with an option using sysfs to get the relevant data.
Finally, John let it be known that he is getting a little tired of being the wireless maintainer, but he has not yet been able to find a good candidate for a replacement. There are a lot of talented developers working on the wireless stack, he said, but most of those work for hardware vendors. It seems that these vendors are, as a general rule, unenthusiastic about having their developers working to support drivers for other vendors' hardware. So a new wireless maintainer almost certainly needs to work for a hardware-neutral organization — a distributor, for example. If there are any such people out there, John would like to hear from them.
This session covered a number of other topics. For example, Bluetooth
maintainer Marcel Holtmann gave a high-speed update on that subsystem that
was far beyond your editor's fingers' ability to follow. Suffice it to say
that the 3.17 kernel will include Bluetooth 4.1 support. The conclusion
that results from all this, clearly, is that there is still a lot going on
in the networking subsystem.
Index entries for this article | |
---|---|
Kernel | BPF |
Kernel | Networking/Networking summits |
Conference | Kernel Summit/2014 |
Posted Aug 28, 2014 20:47 UTC (Thu)
by mtaht (subscriber, #11087)
[Link] (4 responses)
BQL is really spectacular and needed at 10GigE+, but it's useful at all bandwidths.
Recently I added BQL to the beagle bone black, which had this result for reduced latency and increased throughput at 100Mbit.
Some plots:
http://snapon.lab.bufferbloat.net/~d/beagle_bql/bql_makes...
(6 line patch not submitted to mainline yet)
The hard part about adding BQL is that you really need to have the device in front of you and run tests; because of the multiple possible error-out conditions, you can hose yourself in various ways. So it would be best to mandate BQL in new drivers, and to ask various maintainers to try adding BQL to their drivers as time permits.
It would be good to have some work into improving usbnet as well...
and wifi is a terrible mess that needs much love in the upcoming mu-mimo world.
Posted Aug 31, 2014 19:59 UTC (Sun)
by flussence (guest, #85566)
[Link] (3 responses)
That's news to me - is there a list anywhere (or straightforward way to generate one)? I was a bit underwhelmed with fq_codel after switching everything I could to it, maybe this is why.
Posted Aug 31, 2014 20:07 UTC (Sun)
by dlang (guest, #313)
[Link] (1 responses)
So you can enable it on every machine on your network, but if you don't enable it on your router that is the bottleneck of your connection to the Internet, it does absolutely no good.
Even if you do enable it on your router, if your ISP doesn't have it on their router, it's not going to help for traffic from the Internet to you.
This is why Cerowrt implements inbound rate limiting of connections (which is very CPU intensive), it's attempting to keep the traffic flow low enough that the ISP's router never becomes the bottleneck and so it never starts buffering traffic.
Posted Sep 1, 2014 16:26 UTC (Mon)
by mtaht (subscriber, #11087)
[Link]
This is not strictly true. The "fq" portion of fq_codel serves to break up microbursts when multiple flows are in play. So a busy web, file, or voip server can benefit. A machine hosting vms can benefit. Etc.
Most importantly fq_codel "does no harm" in the vast majority of cases, so I would rather like to see it, or something derived from it, become the linux default, replacing pfifo_fast... and have it made easy to choose something different.
Now this benefit keeps getting reduced (on servers) as new subsystems like TCP small queues come into play, at least on simplistic benchmarks. I'd like it a lot if TSQ could scale correctly against a large number of flows, but that's what the new sch_fq is for.
As david says, the biggest benefit from this style of queue discipline is anywhere you have a fast to slow transition, and/or multiple ports feeding into one on a switch or router where persistent queues can form. The bigger the difference in bandwidth, the more benefit can be had from both the "fq" and aqm portions of the algorithm. Pretty much all the aftermarket linux router firmware has adopted fq_codel at this point, but getting something like it into switches will take some doing....
Posted Sep 1, 2014 16:39 UTC (Mon)
by mtaht (subscriber, #11087)
[Link]
find drivers/net/ethernet -name '*.c' -exec fgrep -l netdev_tx_completed_queue {} \;
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
drivers/net/ethernet/intel/i40evf/i40e_txrx.c
drivers/net/ethernet/intel/i40e/i40e_txrx.c
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
drivers/net/ethernet/broadcom/bnx2.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/ethernet/sfc/tx.c
drivers/net/ethernet/mellanox/mlx4.save/en_tx.c
drivers/net/ethernet/mellanox/mlx4/en_tx.c
drivers/net/ethernet/freescale/gianfar.c
drivers/net/ethernet/intel/igb/igb_main.c
And in the following (most GigE and lower) drivers:
find drivers/net/ethernet -name '*.c' -print | xargs fgrep -l netdev_completed_queue
drivers/net/ethernet/atheros/alx/main.c
drivers/net/ethernet/broadcom/b44.c
drivers/net/ethernet/broadcom/bgmac.c
drivers/net/ethernet/intel/e1000/e1000_main.c
drivers/net/ethernet/intel/e1000e/netdev.c
drivers/net/ethernet/realtek/8139cp.c
drivers/net/ethernet/marvell/sky2.c
drivers/net/ethernet/marvell/skge.c
drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
drivers/net/ethernet/nvidia/forcedeth.c
I know there is currently out-of-tree support for the atheros ar71xx, and now beaglebone black (ti cpsw driver), but that's it. Vast swaths of drivers are presently left uncovered, including everything from allwinner, amd, cisco, octeon, freescale, nvidia, and brocade...
and a technology like powerline ethernet would probably benefit greatly too, as well as DSL and a few others. Wifi would greatly benefit from something BQL-like but pure BQL won't work there.
There is a tool for monitoring the BQL behavior:
https://github.com/ffainelli/bqlmon
And as I said, it's *really* easy to add to most ethernet drivers, 4-8 lines of code. It's getting hard to see the benefit under simplistic workloads, so I tend to drive tests with the rrul test from:
https://github.com/tohojo/netperf-wrapper
You can take an existing BQL-enabled driver and disable BQL to do a reasonable measurement of what before/after behavior looks like. (put a very large value in /sys/class/net/your_device/queues/tx-*/byte_queue_limits/limit)
New wireless maintainer
Posted Aug 31, 2014 20:46 UTC (Sun)
by robbe (guest, #16131)
[Link]
John's replacement need not already work at that company. I guess one of these "talented developers" could, if suitably motivated, change employer and take up a more managerial role.
Posted Sep 1, 2014 0:27 UTC (Mon)
by dlang (guest, #313)
[Link]
The right thing to do is not to delay the hardware startup, but instead make the transmission of packets greedy, if there are multiple packets that can be sent at once (up to a byte size limit), send them all.
This avoids the whole area of problems of the media going idle, or of forgetting to call flush in some codepath.
It also adds less latency to the packets (especially if they turn out to be sparse)
If you already have fq_codel or something like it in place to categorize the packets, finding if you have more packets that you can combine should be a lot easier.
http://snapon.lab.bufferbloat.net/~d/beagle_bql/beaglebon...