Batch processing of network packets
Once upon a time, network interfaces would interrupt the processor every time a packet was received. That may have worked well with the kind of network interfaces we had in the 1990s, but an interface that worked that way now would be generating many thousands of interrupts per second. That, in turn, would swamp the CPU and prevent any work from getting done. The response to this problem in network circles was the adoption of an API called "NAPI" (for "new API") during the long 2.5 development series.
Old-timers on the net — like your editor — used to have their computers beep at them every time an email arrived. Most of us stopped doing that long ago; the beeps were nonstop, and things reached a point where we simply knew there would be email waiting anytime we got over our dread and opened a mail client. NAPI follows a similar approach: rather than poke the processor when packets arrive, the interface just lets them accumulate. The kernel will then poll the interface occasionally, secure in the knowledge that there will always be packets waiting to be processed. Those packets are then processed in a batch, with the batch size limited by the "weight" assigned to the interface.
At this level, we can see that batching of packet processing was added some fifteen years ago. But that is where the batching stops; when the NAPI poll happens, the device driver will pass each packet into the network stack with a call to netif_receive_skb(). From that point on, each packet is handled independently, with no further batching. In retrospect, with all of the effort that has gone into streamlining packet processing, one might wonder why that old API was never revisited, but that is how things often go in the real world.
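To make that concrete, here is a rough sketch of what a driver's NAPI poll routine looked like under that model. It is illustrative only: my_hw_next_packet() is a hypothetical stand-in for a real driver's receive-ring processing, and the budget argument is capped by the NAPI weight assigned to the interface.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical stand-in for a driver's receive-ring processing. */
    static struct sk_buff *my_hw_next_packet(struct napi_struct *napi);

    /* The traditional per-packet pattern in a NAPI poll routine. */
    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
        struct sk_buff *skb;
        int work_done = 0;

        /* "budget" is limited by the interface's NAPI weight */
        while (work_done < budget && (skb = my_hw_next_packet(napi))) {
            netif_receive_skb(skb);   /* one trip into the stack per packet */
            work_done++;
        }

        if (work_done < budget)
            napi_complete_done(napi, work_done);
        return work_done;
    }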
Eventually, though, somebody usually notices an issue like that; in this case, that somebody was Edward Cree, who put together a patch set changing how low-level packet reception works. The first step was to supplement netif_receive_skb() with a batched version that reads, in its entirety:
    void netif_receive_skb_list(struct list_head *head)
    {
        struct sk_buff *skb, *next;

        list_for_each_entry_safe(skb, next, head, list)
            netif_receive_skb(skb);
    }
Now, rather than calling netif_receive_skb() for every incoming packet, a driver can make a list out of a batch of packets and pass them upward with a single call. Not much has changed at this point, but even this tweak improves performance by quite a bit, as it turns out.
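In a driver, the change amounts to queueing received packets on a local list and making one call at the end of the poll routine. The following reworks the hypothetical poll function shown earlier; it is a sketch of the pattern, not code from any in-tree driver.

    #include <linux/list.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical stand-in for a driver's receive-ring processing. */
    static struct sk_buff *my_hw_next_packet(struct napi_struct *napi);

    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
        LIST_HEAD(rx_list);
        struct sk_buff *skb;
        int work_done = 0;

        while (work_done < budget && (skb = my_hw_next_packet(napi))) {
            /* Queue the packet locally instead of calling
               netif_receive_skb() right away. */
            list_add_tail(&skb->list, &rx_list);
            work_done++;
        }

        /* Hand the whole batch to the network stack in one call. */
        if (!list_empty(&rx_list))
            netif_receive_skb_list(&rx_list);

        if (work_done < budget)
            napi_complete_done(napi, work_done);
        return work_done;
    }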
The rest of the patch series is occupied with pushing the batching further up the network stack, so that packets can be passed in lists as far as possible. That gets a little trickier at the higher levels, since some packets have to be handled in fundamentally different ways. For example, some may have been allocated from the system's memory reserves (part of a mechanism to avoid deadlocks on network block devices); those require special handling. When such situations are encountered, the list of packets must be split into smaller lists, but the batching is preserved as far as possible.
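The general pattern is sketched below with hypothetical helpers; skb_needs_special_path(), process_special_list(), and process_normal_list() are illustrative names, not functions from the patch series. The exceptional packets are peeled off onto a sublist so that both groups can still move through the stack as batches.

    #include <linux/list.h>
    #include <linux/skbuff.h>

    /* Hypothetical helpers, for illustration only. */
    static bool skb_needs_special_path(const struct sk_buff *skb);
    static void process_special_list(struct list_head *head);
    static void process_normal_list(struct list_head *head);

    static void rx_list_split_example(struct list_head *head)
    {
        struct sk_buff *skb, *next;
        LIST_HEAD(special);

        list_for_each_entry_safe(skb, next, head, list) {
            /* Packets needing different treatment (e.g. those allocated
               from the memory reserves) move to their own sublist. */
            if (skb_needs_special_path(skb))
                list_move_tail(&skb->list, &special);
        }

        process_special_list(&special);  /* the smaller, special-case batch */
        process_normal_list(head);       /* the remainder stays batched */
    }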
The benchmark results (included in this merge commit) are interesting. In one test case, using a single receive queue, a kernel with these patches (and a suitably patched driver) showed a 4% improvement in packet-processing speed. That would certainly justify the addition of this bit of infrastructure, but it turns out that this number is the worst case that Cree could find. In general, just adding and using netif_receive_skb_list() improves performance by 10%, and the performance improvement with the entire patch series centers around 25%. One test showed a 35% speed improvement. In an era where developers have sweated mightily for much smaller gains, this is an impressive performance improvement.
One might well wonder why even the simplest batching shown above can improve things by so much. It mostly comes down to cache behavior. As Cree notes in the patch introduction, the processor's instruction cache is not large enough to hold the entire packet-processing pipeline. A device driver will warm the cache with its own code, but then the processing of a single packet pushes that code out of cache, and the driver must start cold with the next one. Just eliminating that bit of cache contention by putting the packets into a list before handing them to the network stack thus improves things considerably; creating the same sort of cache efficiency through the network stack improves things even more.
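The effect can be illustrated outside the kernel entirely. In the toy code below (purely illustrative; the stage_*() functions are hypothetical), the first loop runs the whole pipeline on each packet, so the stages keep evicting each other from the instruction cache; the second runs each stage across the whole batch while its code is hot.

    struct pkt;
    /* Hypothetical processing stages, for illustration only. */
    void stage_driver(struct pkt *p);
    void stage_ip(struct pkt *p);
    void stage_tcp(struct pkt *p);

    void per_packet(struct pkt **pkts, int n)
    {
        for (int i = 0; i < n; i++) {   /* whole pipeline for every packet */
            stage_driver(pkts[i]);
            stage_ip(pkts[i]);
            stage_tcp(pkts[i]);
        }
    }

    void per_stage(struct pkt **pkts, int n)
    {
        for (int i = 0; i < n; i++)     /* each stage stays cache-hot */
            stage_driver(pkts[i]);
        for (int i = 0; i < n; i++)
            stage_ip(pkts[i]);
        for (int i = 0; i < n; i++)
            stage_tcp(pkts[i]);
    }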
Networking also uses a lot of indirect function calls. These calls were never cheap, but the addition of retpolines for Spectre mitigation has made things worse. Batching replaces a bunch of per-packet indirect calls with single per-list calls, reducing that overhead.
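The shape of that saving can be sketched as follows; the structure and callbacks here are illustrative only, not the kernel's actual packet-delivery machinery.

    #include <linux/list.h>
    #include <linux/skbuff.h>

    /* Illustrative only: a handler that can deliver either way. */
    struct rx_handler_example {
        void (*deliver_one)(struct sk_buff *skb);
        void (*deliver_list)(struct list_head *head);
    };

    static void deliver(const struct rx_handler_example *h, struct list_head *head)
    {
        struct sk_buff *skb, *next;

        if (h->deliver_list) {
            h->deliver_list(head);          /* one indirect call per batch */
            return;
        }

        list_for_each_entry_safe(skb, next, head, list)
            h->deliver_one(skb);            /* one indirect call per packet */
    }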
Throughput-oriented optimizations, and batching in particular, often come at the cost of increased latency. In the networking case, though, that cost was already paid years ago when NAPI was added. The new code operates on batches of packets that have already accumulated by the time the NAPI poll happens, so it adds no further delay; from that point of view, the improvement is nearly free.
This code has been merged for the 4.19 kernel, so it will be generally available when the release happens. As of this writing, only the Solarflare network interfaces use the new netif_receive_skb_list() API. The necessary changes at the driver level are quite small, though, so it would be surprising if other drivers were not updated in the relatively near future, possibly even before the 4.19 release. This particular fruit is hanging too low to go unpicked for long.
Index entries for this article: Kernel: Networking/Performance
Posted Aug 21, 2018 13:17 UTC (Tue) by ernstp (guest, #13694):
drm/ttm,amdgpu: Introduce LRU bulk move functionality
https://patchwork.freedesktop.org/series/47871/
Credit where it's due
Posted Aug 21, 2018 15:42 UTC (Tue) by ecree (guest, #95790):
He suggested it in http://lists.openwall.net/netdev/2016/01/15/51 and the following thread seems also to contain some early prefiguring of XDP.
Posted Aug 22, 2018 3:57 UTC (Wed) by mtaht (subscriber, #11087):
However I've seen a lot of code that does stupid things to software batch gro stuff, with things that actually overwhelm the cache. I'm curious, with this new code, if folk have experimented with decreasing the (oft rather large) NAPI value in the first place, and to what extent it was tested on devices with small i/dcaches.
Skylake is one thing. But a typical cache size on small boxes is 32k/32k for these.
Posted Aug 24, 2018 7:26 UTC (Fri) by mtaht (subscriber, #11087):
In my ideal world, packets inside the kernel and from future devices would have something like the following format:
|tx_timestamp|rx_timestamp|packet|some|different|hashes|skb control block for all kinds of other stuff|
The rx_timestamp is free if you have it in hw, the rx/tx_timestamp would make all the codel-y work faster on tx, and also enable the timer queues VJ is talking about ( https://netdevconf.org/0x12/session.html?evolving-from-af... ) see also sch_etx and the igb network hw, and... selfishly - on top of "timestamp always", 3 hashes would make sch_cake fast enough for general use. timestamps are at the front because they are "free" though they could live at the back, hashes at the back because you need time to calc them and on a cut through switch you can't wait for them. The existing cb has some other fields I'd do always too and make persistent through the stack.
I'm so totally not going to make more of this "modest proposal", as much as I think the whole skb layout could use a major revision, as it would take a forklift, time, and taste to redo in linux, dpdk, rgmii, etc. It would make it really difficult to backport code from one format to the other. It would take years to implement in linux (years more in bsd, osx, windows), for a benefit that would mostly be for 100gige+ interfaces originally. A metric ton of people would have their favorite fixed format field they'd want somewhere, thus politics and gnashing of standards bodies teeth would happen... specialized hw offload engines break...
Still, if I could have 1 sysctl to immutably rx timestamp packets on all interfaces always it would be great. Packets being managed by codel could measure the entire system from ingress to egress and *drop them*, shedding load automatically as a system as a whole (rather than at the qdisc) got more stress than it should take; figuring out the cost of each substep through the layers would be something you could do on a per packet basis on your workload, rather than by blasting specialized test tools through it that *don't model real traffic* and claiming that pps really meant something...
and there'd be ponies! and speckled dancing unicorns! harder RTT deadlines! and systems that didn't collapse under load! and poppies, poppies, poppies everywhere to roll around in to help you sleep even better!
It's after midnight. I usually don't post anything after midnight.
...
I am happy that ebpf is making it easier to offload more stuff into smarter hardware. and I look forward to trying this new skb list idea out next quarter on some very old, slow, tiny hardware.
Posted Aug 21, 2018 16:31 UTC (Tue) by eru (subscriber, #2753):
That is an interesting point about caches that may be applicable in other situations as well. So far I have imagined it is pretty much neutral whether you do

    for each item in input
        do_b(do_a(item))

or

    for each item in input
        push(tmplist, do_a(item))
    for each item in tmplist
        do_b(item)

except that the second may need extra storage for the intermediate results, but this article suggests the second solution may be better despite that.
Posted Aug 21, 2018 17:19 UTC (Tue) by HIGHGuY (subscriber, #62277):
In this case the data was most likely transferred into system memory via DMA and it is not or no longer in cache. In parallel the handling of each packet only touches the first tens of bytes for the headers.
That makes the instruction cache cost higher than the data cache cost. If the data cache cost were higher, the gains might have been more modest or maybe even counterproductive.
Posted Aug 21, 2018 18:57 UTC (Tue) by mm7323 (subscriber, #87386):
I must admit, that while it makes sense to me having read the article, I doubt I'd have spotted that optimisation in a long long time! Open Source is cool!
Posted Aug 22, 2018 15:55 UTC (Wed) by excors (subscriber, #95769):
Even outside of proper DSPs, I think this is a fairly natural pattern when using SIMD CPUs or GPUs. You want to process multiple data items concurrently, so you think about your data as a single array (or maybe a structure-of-arrays) rather than as a large collection of independent OOP-style objects, and you write your processing code as array transformations.
But the processing code has limited resources (registers, constants, instruction cache, local memory in GPUs, etc), and the compiler will either complain or go dreadfully slow (which should be obvious in a profiler) if you exceed the limits; or maybe some of the processing works best on vectors of 8 elements while some only works on vectors of 4, etc; or maybe it's just hard to understand the whole algorithm at once; so it's natural to split the processing into multiple simpler passes that can be optimised individually. You might run each pass over the entire data set in sequence, or (if memory bandwidth is a concern) you might split the data into chunks that fit within the data cache then run all passes over one chunk before moving onto the next chunk.
(This seems kind of like the data-oriented design thing that some game developers have been interested in, as a reaction against the cache-inefficiency of common OOP designs. But I imagine it's much easier to recognise and apply in a game engine that you're writing from scratch, and that you know is going to have to process huge amounts of data, than in a large existing OOP-ish codebase like Linux that was designed with very different performance requirements.)
Posted Aug 22, 2018 20:32 UTC (Wed) by mm7323 (subscriber, #87386):
One thing that springs to mind, in an object oriented context, is that inheritance really breaks (instruction) cache efficiency. You may have a collection of objects and wish to call the same method on each one, but due to inheritance each object can take very different code paths and foul the instruction caches.
The data orientated approach the link describes mitigates this. But sorting collections by object type (before any other comparison criteria) may also yield better performance in this vein too, I guess.
Posted Aug 22, 2018 13:06 UTC (Wed) by ncm (guest, #165):
Cache awareness is right at the heart of our deal we have made with the devil for performance. There are more different caches in modern chips than you would ever imagine, many that are undocumented or barely even mentioned.
The ones we know of include your regular data caches, along with instruction caches, microinstruction caches, conditional-branch target caches, and page map caches. Many other systems that act in the role of caches include the familiar pipelines, and also register renaming, speculative execution, and memory pre-fetching.
The soul that we have handed over to the devil in exchange is made of data security (spectre etc.) and our ability to predict the consequences of design choices. Crack-brained newbie design choices (I'm looking at you, function pointers!) get a billion transistors thrown at them, which works just well enough to prevent learning better.
Too-clever compilers do their part. Often, a really dumb algorithm (e.g. counting bits) is recognized and replaced with special instructions not normally accessible, and faster than a smart algorithm the compiler cannot recognize.
I have seen 2x performance differences because the compiler decided the unlikely branch in a loop was the more likely, or because it was optimizing for older chips that like random other instructions mixed into sequences, where newer chips like to fuse a comparison that has a directly-adjacent branch into a single microinstruction -- but only in a small-enough loop.
Posted Aug 22, 2018 17:32 UTC (Wed) by excors (subscriber, #95769):
> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.
Isn't that pretty much exactly what a GPU is? They basically do everything with 8/16/32-wide vectors (though presented as scalars to programmers), with many thousands of threads, limited pipelining within a thread, no out-of-order execution, and the memory model is typically that you get absolutely no cache coherence (except when using special instructions that simply bypass the L1 caches, like atomics or Vulkan's Coherent variables).
GPUs have been widely available for a long time and are much faster than CPUs in terms of FLOPS and memory bandwidth, but even most new software (where legacy compatibility isn't that important) is still written almost entirely for CPUs, so that kind of hardware design is evidently not the solution. (It's still a commercial success, though. And incidentally most GPU programming uses dialects of C, so C isn't holding them back.)
As for the CPU features that are relevant to Spectre - speculative execution and caches - if their invisibility from C makes C not a low-level language, then x86 machine code is not a low-level language either, and it seems a bit unfair to blame C for the limitations of machine code.
I think most programmers who care about low-level details can tolerate the complex transformations that a C compiler performs, because they can always check the generated assembly code to see what really happened, and can easily influence the compiler (with intrinsics, attributes, compiler flags, inline assembly, compiler plugins, etc), so it's not really all that mysterious. But the transformation between machine code and what actually happens in the CPU is much more opaque and poorly documented and hard for software people to influence, which I think is why Spectre was such a hard problem to discover and to solve.
Hmm, that kind of implies that moving most of the magic out of the CPU hardware and into the C compiler would help a lot. But that sounds like the Itanium approach, and that didn't work so well either.
Posted Aug 22, 2018 23:24 UTC (Wed) by ncm (guest, #165):
We have got some more mileage out of C++ (which is quite a lot faster than C, for common problems), compiler intrinsics, and shaders, but we may find that a powerful functional language is needed to manage the complexity of what modern hardware can be. (If only the functional-language literati could break their nasty garbage-collection habit!) The biggest opportunities for using what hardware can be made to do depend on values not having or needing addresses, for longer periods. There are blessed moments in C++ when values have no addresses, and we can get a lot done in those moments.
Much of what the hundreds of billions of transistors on CPU chips nowadays do is just keeping this computation from getting in the way of that computation. Remarkably few of them are actually doing the computations themselves. By eliminating the von Neumann hierarchical, addressable memory model, those transistors could be put to work properly -- if only we knew a better way to program them.
"Say what you will about sequential instruction machines, they don't often have a problem deciding what to do in the next cycle."
Posted Aug 22, 2018 14:23 UTC (Wed) by cagrazia (guest, #124754):
I suppose that, over the years, the problem of passing packets between the NIC and applications directly has been solved with a mix of Rizzo's (the author of Netmap) design and the Linux network developers' design. This work instead focuses on the internals of the network stack, from the driver to the upper layers (without reaching the application).
Posted Aug 22, 2018 16:18 UTC (Wed) by dps (guest, #5725):
At least some applications where saving 100ns is worth money sometimes bypass the Linux kernel network stack. I believe that solarflare has a LD_PRELOAD library that moves some of tcp to user space, thereby avoiding context switching.
I suspect there is a lot of hardware off load too but in the final analysis somebody else got that job :-(
It might be worth knowing that solarflare et al target customers that want to be close to stock exchanges because the speed of light is finite.
Posted Aug 26, 2018 0:44 UTC (Sun) by BenHutchings (subscriber, #37955):
It certainly used to be that the major performance win of user-level networking was not so much the avoidance of kernel/user context switches, but improving temporal locality of access to packets. With kernel networking, the kernel has to demux incoming packets into socket queues and account for the memory allocated to each socket, as the packets come in. So the CPU will access packet headers during demux and then again some time later when the application receives the packets. With user-level networking, the hardware does demux into (typically) per-process queues and the CPU will access packet headers only when the application receives the packet. (The packet buffers are naturally accounted to the process.)
Still, the cost of context switches has been increased substantially by mitigations for speculation leaks. So that may be a bigger part of the performance advantage now.
I worked on Solarflare drivers up to the SFC9100 generation, and there was no TCP offload or anything really unusual there. The essential features are checksum offload, flow steering/filtering, and lots of queues.
Posted Aug 22, 2018 22:51 UTC (Wed) by ncm (guest, #165):
Some of these uses actually need to act on the contents of the packet in real time, and those users build custom ASICs that deliver incompletely-arrived packets for speculative processing. But most uses just need to timestamp, filter, and direct packets to various places for parallel downstream processing. Those are the uses that can benefit from kernel and driver improvements, and they get everything they need from netmap, DPDK, ef_vi and the like. DPDK is a mess, and ef_vi is proprietary. Netmap offers a plausible route to break out of lock-in, and actually use commodity hardware in many of what are now thought of as high-end, niche applications that need specialized equipment.
Posted Aug 23, 2018 15:32 UTC (Thu) by willemb (subscriber, #73364):
https://www.kernel.org/doc/html/latest/networking/af_xdp....
Posted Aug 24, 2018 12:42 UTC (Fri) by fillods (guest, #22226):
This reminds me of an insightful article from some years ago: "Improving Linux networking performance", https://lwn.net/Articles/629155/. Batching was already mentioned at that time.
zeromq batching
Posted Aug 24, 2018 13:35 UTC (Fri) by zoobab (guest, #9945):
"* The graph in the test results shows that ZeroMQ is slower than TCP/IP. What's the point then?
Obviously, you would expect system working on top of TCP to have higher latencies than TCP. Anything else would be - simply speaking - supernatural. However, throughput is a different matter. ZeroMQ gets you more throughput than TCP has using intelligent batching algorithms. Moreover ZeroMQ delivers value-add over the TCP. Asynchronicity, message queueing, routing based on business logic, multicast etc.
* How come ZeroMQ has higher throughput than TCP although it's built on top of TCP?
Avoiding redundant networking stack traversals can improve throughput significantly. In other words, sending two messages down the networking stack in one go takes much less time then sending each of them separately. This technique is known as message batching.
* When sending messages in batches you have to wait for the last one to send the whole batch. This would make the latency of the first message in the batch much worse, wouldn't it?
ZeroMQ batches messages in opportunistic manner. Rather than waiting for a predefined number of messages and/or predefined time interval, it sends all the messages available at the moment in one go. Imagine the network interface card is busy sending data. Once it is ready to send more data it asks ZeroMQ for new messages. ZeroMQ sends all the messages available at the moment. Does it harm the latency of the first message in the batch? No. The message won't be sent earlier anyway because the networking card was busy. On the contrary, latency of subsequent messages will be improved because sending single batch to the card is faster then sending lot of small messages. On the other hand, if network card isn't busy, the message is sent straight away without waiting for following messages. Thus it'll have the best possible latency."