Batch processing of network packets
Once upon a time, network interfaces would interrupt the processor every time a packet was received. That may have worked well with the kind of network interfaces we had in the 1990s, but an interface that worked that way now would be generating many thousands of interrupts per second. That, in turn, would swamp the CPU and prevent any work from getting done. The response to this problem in network circles was the adoption of an API called "NAPI" (for "new API") during the long 2.5 development series.
Old-timers on the net — like your editor — used to have their computers beep at them every time an email arrived. Most of us stopped doing that long ago; the beeps were nonstop, and things reached a point where we simply knew there would be email waiting anytime we got over our dread and opened a mail client. NAPI follows a similar approach: rather than poke the processor when packets arrive, the interface just lets them accumulate. The kernel will then poll the interface occasionally, secure in the knowledge that there will always be packets waiting to be processed. Those packets are then processed in a batch, with the batch size limited by the "weight" assigned to the interface.
At this level, we can see that batching of packet processing was added some fifteen years ago. But that is where the batching stops; when the NAPI poll happens, the device driver will pass each packet into the network stack with a call to netif_receive_skb(). From that point on, each packet is handled independently, with no further batching. In retrospect, with all of the effort that has gone into streamlining packet processing, one might wonder why that old API was never revisited, but that is how things often go in the real world.
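To make that concrete, here is a rough sketch of what a driver's NAPI poll routine looked like under that model. It is illustrative only: my_hw_next_packet() is a hypothetical stand-in for a real driver's receive-ring processing, and the budget argument is capped by the NAPI weight assigned to the interface.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical stand-in for a driver's receive-ring processing. */
    static struct sk_buff *my_hw_next_packet(struct napi_struct *napi);

    /* The traditional per-packet pattern in a NAPI poll routine. */
    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
        struct sk_buff *skb;
        int work_done = 0;

        /* "budget" is limited by the interface's NAPI weight */
        while (work_done < budget && (skb = my_hw_next_packet(napi))) {
            netif_receive_skb(skb);   /* one trip into the stack per packet */
            work_done++;
        }

        if (work_done < budget)
            napi_complete_done(napi, work_done);
        return work_done;
    }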
Eventually, though, somebody usually notices an issue like that; in this case, that somebody was Edward Cree, who put together a patch set changing how low-level packet reception works. The first step was to supplement netif_receive_skb() with a batched version that reads, in its entirety:
    void netif_receive_skb_list(struct list_head *head)
    {
        struct sk_buff *skb, *next;

        list_for_each_entry_safe(skb, next, head, list)
            netif_receive_skb(skb);
    }
Now, rather than calling netif_receive_skb() for every incoming packet, a driver can make a list out of a batch of packets and pass them upward with a single call. Not much has changed at this point, but even this tweak improves performance by quite a bit, as it turns out.
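In a driver, the change amounts to queueing received packets on a local list and making one call at the end of the poll routine. The following reworks the hypothetical poll function shown earlier; it is a sketch of the pattern, not code from any in-tree driver.

    #include <linux/list.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical stand-in for a driver's receive-ring processing. */
    static struct sk_buff *my_hw_next_packet(struct napi_struct *napi);

    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
        LIST_HEAD(rx_list);
        struct sk_buff *skb;
        int work_done = 0;

        while (work_done < budget && (skb = my_hw_next_packet(napi))) {
            /* Queue the packet locally instead of calling
               netif_receive_skb() right away. */
            list_add_tail(&skb->list, &rx_list);
            work_done++;
        }

        /* Hand the whole batch to the network stack in one call. */
        if (!list_empty(&rx_list))
            netif_receive_skb_list(&rx_list);

        if (work_done < budget)
            napi_complete_done(napi, work_done);
        return work_done;
    }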
The rest of the patch series is occupied with pushing the batching further up the network stack, so that packets can be passed in lists as far as possible. That gets a little trickier at the higher levels, since some packets have to be handled in fundamentally different ways. For example, some may have been allocated from the system's memory reserves (part of a mechanism to avoid deadlocks on network block devices); those require special handling. When such situations are encountered, the list of packets must be split into smaller lists, but the batching is preserved as far as possible.
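The general pattern is sketched below with hypothetical helpers; skb_needs_special_path(), process_special_list(), and process_normal_list() are illustrative names, not functions from the patch series. The exceptional packets are peeled off onto a sublist so that both groups can still move through the stack as batches.

    #include <linux/list.h>
    #include <linux/skbuff.h>

    /* Hypothetical helpers, for illustration only. */
    static bool skb_needs_special_path(const struct sk_buff *skb);
    static void process_special_list(struct list_head *head);
    static void process_normal_list(struct list_head *head);

    static void rx_list_split_example(struct list_head *head)
    {
        struct sk_buff *skb, *next;
        LIST_HEAD(special);

        list_for_each_entry_safe(skb, next, head, list) {
            /* Packets needing different treatment (e.g. those allocated
               from the memory reserves) move to their own sublist. */
            if (skb_needs_special_path(skb))
                list_move_tail(&skb->list, &special);
        }

        process_special_list(&special);  /* the smaller, special-case batch */
        process_normal_list(head);       /* the remainder stays batched */
    }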
The benchmark results (included in this merge commit) are interesting. In one test case, using a single receive queue, a kernel with these patches (and a suitably patched driver) showed a 4% improvement in packet-processing speed. That would certainly justify the addition of this bit of infrastructure, but it turns out that this number is the worst case that Cree could find. In general, just adding and using netif_receive_skb_list() improves performance by 10%, and the performance improvement with the entire patch series centers around 25%. One test showed a 35% speed improvement. In an era where developers have sweated mightily for much smaller gains, this is an impressive performance improvement.
One might well wonder why even the simplest batching shown above can improve things by so much. It mostly comes down to cache behavior. As Cree notes in the patch introduction, the processor's instruction cache is not large enough to hold the entire packet-processing pipeline. A device driver will warm the cache with its own code, but then the processing of a single packet pushes that code out of cache, and the driver must start cold with the next one. Just eliminating that bit of cache contention by putting the packets into a list before handing them to the network stack thus improves things considerably; creating the same sort of cache efficiency through the network stack improves things even more.
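The effect can be illustrated outside the kernel entirely. In the toy code below (purely illustrative; the stage_*() functions are hypothetical), the first loop runs the whole pipeline on each packet, so the stages keep evicting each other from the instruction cache; the second runs each stage across the whole batch while its code is hot.

    struct pkt;
    /* Hypothetical processing stages, for illustration only. */
    void stage_driver(struct pkt *p);
    void stage_ip(struct pkt *p);
    void stage_tcp(struct pkt *p);

    void per_packet(struct pkt **pkts, int n)
    {
        for (int i = 0; i < n; i++) {   /* whole pipeline for every packet */
            stage_driver(pkts[i]);
            stage_ip(pkts[i]);
            stage_tcp(pkts[i]);
        }
    }

    void per_stage(struct pkt **pkts, int n)
    {
        for (int i = 0; i < n; i++)     /* each stage stays cache-hot */
            stage_driver(pkts[i]);
        for (int i = 0; i < n; i++)
            stage_ip(pkts[i]);
        for (int i = 0; i < n; i++)
            stage_tcp(pkts[i]);
    }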
Networking also uses a lot of indirect function calls. These calls were never cheap, but the addition of retpolines for Spectre mitigation has made things worse. Batching replaces a bunch of per-packet indirect calls with single per-list calls, reducing that overhead.
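The shape of that saving can be sketched as follows; the structure and callbacks here are illustrative only, not the kernel's actual packet-delivery machinery.

    #include <linux/list.h>
    #include <linux/skbuff.h>

    /* Illustrative only: a handler that can deliver either way. */
    struct rx_handler_example {
        void (*deliver_one)(struct sk_buff *skb);
        void (*deliver_list)(struct list_head *head);
    };

    static void deliver(const struct rx_handler_example *h, struct list_head *head)
    {
        struct sk_buff *skb, *next;

        if (h->deliver_list) {
            h->deliver_list(head);          /* one indirect call per batch */
            return;
        }

        list_for_each_entry_safe(skb, next, head, list)
            h->deliver_one(skb);            /* one indirect call per packet */
    }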
Throughput-oriented optimizations, and batching in particular, often come at the cost of increased latency. In the networking case, though, that cost was already paid years ago when NAPI was added. The new code operates on batches of packets that have already accumulated by the time the NAPI poll happens, so it adds no further delay; from that point of view, the improvement is nearly free.
This code has been merged for the 4.19 kernel, so it will be generally available when the release happens. As of this writing, only the Solarflare network interfaces use the new netif_receive_skb_list() API. The necessary changes at the driver level are quite small, though, so it would be surprising if other drivers were not updated in the relatively near future, possibly even before the 4.19 release. This particular fruit is hanging too low to go unpicked for long.
Index entries for this article: Kernel: Networking/Performance
Posted Aug 21, 2018 13:17 UTC (Tue) by ernstp (guest, #13694):
drm/ttm,amdgpu: Introduce LRU bulk move functionality
https://patchwork.freedesktop.org/series/47871/
Credit where it's due
Posted Aug 21, 2018 15:42 UTC (Tue) by ecree (guest, #95790):
He suggested it in http://lists.openwall.net/netdev/2016/01/15/51 and the following thread seems also to contain some early prefiguring of XDP.
Posted Aug 22, 2018 3:57 UTC (Wed) by mtaht (subscriber, #11087):
However I've seen a lot of code that does stupid things to software batch gro stuff, with things that actually overwhelm the cache. I'm curious, with this new code, if folk have experimented with decreasing the (oft rather large) NAPI value in the first place, and to what extent it was tested on devices with small i/dcaches.
Skylake is one thing. But a typical cache size on small boxes is 32k/32k for these.
Posted Aug 24, 2018 7:26 UTC (Fri) by mtaht (subscriber, #11087):
In my ideal world, packets inside the kernel and from future devices would have something like the following format:
|tx_timestamp|rx_timestamp|packet|some|different|hashes|skb control block for all kinds of other stuff|
The rx_timestamp is free if you have it in hw, the rx/tx_timestamp would make all the codel-y work faster on tx, and also enable the timer queues VJ is talking about ( https://netdevconf.org/0x12/session.html?evolving-from-af... ) see also sch_etx and the igb network hw, and... selfishly - on top of "timestamp always", 3 hashes would make sch_cake fast enough for general use. timestamps are at the front because they are "free" though they could live at the back, hashes at the back because you need time to calc them and on a cut through switch you can't wait for them. The existing cb has some other fields I'd do always too and make persistent through the stack.
I'm so totally not going to make more of this "modest proposal", as much as I think the whole skb layout could use a major revision, as it would take a forklift, time, and taste to redo in linux, dpdk, rgmii, etc. It would make it really difficult to backport code from one format to the other. It would take years to implement in linux (years more in bsd, osx, windows), for a benefit that would mostly be for 100gige+ interfaces originally. A metric ton of people would have their favorite fixed format field they'd want somewhere, thus politics and gnashing of standards bodies teeth would happen... specialized hw offload engines break...
Still, if I could have 1 sysctl to immutably rx timestamp packets on all interfaces always it would be great. Packets being managed by codel could measure the entire system from ingress to egress and *drop them*, shedding load automatically as a system as a whole (rather than at the qdisc) got more stress than it should take; figuring out the cost of each substep through the layers would be something you could do on a per packet basis on your workload, rather than by blasting specialized test tools through it that *don't model real traffic* and claiming that pps really meant something...
and there'd be ponies! and speckled dancing unicorns! harder RTT deadlines! and systems that didn't collapse under load! and poppies, poppies, poppies everywhere to roll around in to help you sleep even better!
It's after midnight. I usually don't post anything after midnight.
...
I am happy that ebpf is making it easier to offload more stuff into smarter hardware. and I look forward to trying this new skb list idea out next quarter on some very old, slow, tiny hardware.
Posted Aug 21, 2018 16:31 UTC (Tue) by eru (subscriber, #2753):
That is an interesting point about caches that may be applicable in other situations as well. So far I have imagined it is pretty much neutral whether you do

    for each item in input
        do_b(do_a(item))

or

    for each item in input
        push(tmplist, do_a(item))
    for each item in tmplist
        do_b(item)

except that the second may need extra storage for the intermediate results, but this article suggests the second solution may be better despite that.
Posted Aug 21, 2018 17:19 UTC (Tue) by HIGHGuY (subscriber, #62277):
In this case the data was most likely transferred into system memory via DMA and it is not or no longer in cache. In parallel the handling of each packet only touches the first tens of bytes for the headers.
That makes the instruction cache cost higher than the data cache cost. If the data cache cost were higher, the gains might have been more modest or maybe even counterproductive.
Posted Aug 21, 2018 18:57 UTC (Tue) by mm7323 (subscriber, #87386):
I must admit, that while it makes sense to me having read the article, I doubt I'd have spotted that optimisation in a long long time! Open Source is cool!
Posted Aug 22, 2018 15:55 UTC (Wed) by excors (subscriber, #95769):
Even outside of proper DSPs, I think this is a fairly natural pattern when using SIMD CPUs or GPUs. You want to process multiple data items concurrently, so you think about your data as a single array (or maybe a structure-of-arrays) rather than as a large collection of independent OOP-style objects, and you write your processing code as array transformations.
But the processing code has limited resources (registers, constants, instruction cache, local memory in GPUs, etc), and the compiler will either complain or go dreadfully slow (which should be obvious in a profiler) if you exceed the limits; or maybe some of the processing works best on vectors of 8 elements while some only works on vectors of 4, etc; or maybe it's just hard to understand the whole algorithm at once; so it's natural to split the processing into multiple simpler passes that can be optimised individually. You might run each pass over the entire data set in sequence, or (if memory bandwidth is a concern) you might split the data into chunks that fit within the data cache then run all passes over one chunk before moving onto the next chunk.
(This seems kind of like the data-oriented design thing that some game developers have been interested in, as a reaction against the cache-inefficiency of common OOP designs. But I imagine it's much easier to recognise and apply in a game engine that you're writing from scratch, and that you know is going to have to process huge amounts of data, than in a large existing OOP-ish codebase like Linux that was designed with very different performance requirements.)
Posted Aug 22, 2018 20:32 UTC (Wed) by mm7323 (subscriber, #87386):
One thing that springs to mind, in an object oriented context, is that inheritance really breaks (instruction) cache efficiency. You may have a collection of objects and wish to call the same method on each one, but due to inheritance each object can take very different code paths and foul the instruction caches.
The data orientated approach the link describes mitigates this. But sorting collections by object type (before any other comparison criteria) may also yield better performance in this vein too, I guess.
Posted Aug 22, 2018 13:06 UTC (Wed) by ncm (guest, #165):
Cache awareness is right at the heart of our deal we have made with the devil for performance. There are more different caches in modern chips than you would ever imagine, many that are undocumented or barely even mentioned.
The ones we know of include your regular data caches, along with instruction caches, microinstruction caches, conditional-branch target caches, and page map caches. Many other systems that act in the role of caches include the familiar pipelines, and also register renaming, speculative execution, and memory pre-fetching.
The soul that we have handed over to the devil in exchange is made of data security (spectre etc.) and our ability to predict the consequences of design choices. Crack-brained newbie design choices (I'm looking at you, function pointers!) get a billion transistors thrown at them, which works just well enough to prevent learning better.
Too-clever compilers do their part. Often, a really dumb algorithm (e.g. counting bits) is recognized and replaced with special instructions not normally accessible, and faster than a smart algorithm the compiler cannot recognize.
I have seen 2x performance differences because the compiler decided the unlikely branch in a loop was the more likely, or because it was optimizing for older chips that like random other instructions mixed into sequences, where newer chips like to fuse a comparison that has a directly-adjacent branch into a single microinstruction -- but only in a small-enough loop.
Posted Aug 22, 2018 17:32 UTC (Wed) by excors (subscriber, #95769):
> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.
Isn't that pretty much exactly what a GPU is? They basically do everything with 8/16/32-wide vectors (though presented as scalars to programmers), with many thousands of threads, limited pipelining within a thread, no out-of-order execution, and the memory model is typically that you get absolutely no cache coherence (except when using special instructions that simply bypass the L1 caches, like atomics or Vulkan's Coherent variables).
GPUs have been widely available for a long time and are much faster than CPUs in terms of FLOPS and memory bandwidth, but even most new software (where legacy compatibility isn't that important) is still written almost entirely for CPUs, so that kind of hardware design is evidently not the solution. (It's still a commercial success, though. And incidentally most GPU programming uses dialects of C, so C isn't holding them back.)
As for the CPU features that are relevant to Spectre - speculative execution and caches - if their invisibility from C makes C not a low-level language, then x86 machine code is not a low-level language either, and it seems a bit unfair to blame C for the limitations of machine code.
I think most programmers who care about low-level details can tolerate the complex transformations that a C compiler performs, because they can always check the generated assembly code to see what really happened, and can easily influence the compiler (with intrinsics, attributes, compiler flags, inline assembly, compiler plugins, etc), so it's not really all that mysterious. But the transformation between machine code and what actually happens in the CPU is much more opaque and poorly documented and hard for software people to influence, which I think is why Spectre was such a hard problem to discover and to solve.
Hmm, that kind of implies that moving most of the magic out of the CPU hardware and into the C compiler would help a lot. But that sounds like the Itanium approach, and that didn't work so well either.
Posted Aug 22, 2018 23:24 UTC (Wed) by ncm (guest, #165):
We have got some more mileage out of C++ (which is quite a lot faster than C, for common problems), compiler intrinsics, and shaders, but we may find that a powerful functional language is needed to manage the complexity of what modern hardware can be. (If only the functional-language literati could break their nasty garbage-collection habit!) The biggest opportunities for using what hardware can be made to do depend on values not having or needing addresses, for longer periods. There are blessed moments in C++ when values have no addresses, and we can get a lot done in those moments.
Much of what the hundreds of billions of transistors on CPU chips nowadays do is just keeping this computation from getting in the way of that computation. Remarkably few of them are actually doing the computations themselves. By eliminating the von Neumann hierarchical, addressable memory model, those transistors could be put to work properly -- if only we knew a better way to program them.
"Say what you will about sequential instruction machines, they don't often have a problem deciding what to do in the next cycle."
Posted Aug 22, 2018 14:23 UTC (Wed) by cagrazia (guest, #124754):
I suppose that, over the years, the problem of passing packets between the NIC and applications directly has been solved with a mix of Rizzo's (the author of Netmap) design and the Linux network developers' design. This work instead focuses on the internals of the network stack, from the driver to the upper layers (without reaching the application).
Posted Aug 22, 2018 16:18 UTC (Wed) by dps (guest, #5725):
At least some applications where saving 100ns is worth money sometimes bypass the Linux kernel network stack. I believe that solarflare has a LD_PRELOAD library that moves some of tcp to user space, thereby avoiding context switching.
I suspect there is a lot of hardware off load too but in the final analysis somebody else got that job :-(
It might be worth knowing that solarflare et al target customers that want to be close to stock exchanges because the speed of light is finite.
Posted Aug 26, 2018 0:44 UTC (Sun) by BenHutchings (subscriber, #37955):
It certainly used to be that the major performance win of user-level networking was not so much the avoidance of kernel/user context switches, but improving temporal locality of access to packets. With kernel networking, the kernel has to demux incoming packets into socket queues and account for the memory allocated to each socket, as the packets come in. So the CPU will access packet headers during demux and then again some time later when the application receives the packets. With user-level networking, the hardware does demux into (typically) per-process queues and the CPU will access packet headers only when the application receives the packet. (The packet buffers are naturally accounted to the process.)
Still, the cost of context switches has been increased substantially by mitigations for speculation leaks. So that may be a bigger part of the performance advantage now.
I worked on Solarflare drivers up to the SFC9100 generation, and there was no TCP offload or anything really unusual there. The essential features are checksum offload, flow steering/filtering, and lots of queues.
Posted Aug 22, 2018 22:51 UTC (Wed) by ncm (guest, #165):
Some of these uses actually need to act on the contents of the packet in real time, and those users build custom ASICs that deliver incompletely-arrived packets for speculative processing. But most uses just need to timestamp, filter, and direct packets to various places for parallel downstream processing. Those are the uses that can benefit from kernel and driver improvements, and they get everything they need from netmap, DPDK, ef_vi and the like. DPDK is a mess, and ef_vi is proprietary. Netmap offers a plausible route to break out of lock-in, and actually use commodity hardware in many of what are now thought of as high-end, niche applications that need specialized equipment.
Posted Aug 23, 2018 15:32 UTC (Thu) by willemb (subscriber, #73364):
https://www.kernel.org/doc/html/latest/networking/af_xdp....
Posted Aug 24, 2018 12:42 UTC (Fri) by fillods (guest, #22226):
This reminds me of an insightful article from some years ago: "Improving Linux networking performance", https://lwn.net/Articles/629155/. Batching was already mentioned at that time.
zeromq batching
Posted Aug 24, 2018 13:35 UTC (Fri) by zoobab (guest, #9945):
"* The graph in the test results shows that ZeroMQ is slower than TCP/IP. What's the point then?
Obviously, you would expect system working on top of TCP to have higher latencies than TCP. Anything else would be - simply speaking - supernatural. However, throughput is a different matter. ZeroMQ gets you more throughput than TCP has using intelligent batching algorithms. Moreover ZeroMQ delivers value-add over the TCP. Asynchronicity, message queueing, routing based on business logic, multicast etc.
* How come ZeroMQ has higher throughput than TCP although it's built on top of TCP?
Avoiding redundant networking stack traversals can improve throughput significantly. In other words, sending two messages down the networking stack in one go takes much less time then sending each of them separately. This technique is known as message batching.
* When sending messages in batches you have to wait for the last one to send the whole batch. This would make the latency of the first message in the batch much worse, wouldn't it?
ZeroMQ batches messages in opportunistic manner. Rather than waiting for a predefined number of messages and/or predefined time interval, it sends all the messages available at the moment in one go. Imagine the network interface card is busy sending data. Once it is ready to send more data it asks ZeroMQ for new messages. ZeroMQ sends all the messages available at the moment. Does it harm the latency of the first message in the batch? No. The message won't be sent earlier anyway because the networking card was busy. On the contrary, latency of subsequent messages will be improved because sending single batch to the card is faster then sending lot of small messages. On the other hand, if network card isn't busy, the message is sent straight away without waiting for following messages. Thus it'll have the best possible latency."