Batch processing of network packets
Posted Aug 22, 2018 13:06 UTC (Wed) by ncm (guest, #165)
In reply to: Batch processing of network packets by eru
Parent article: Batch processing of network packets
Cache awareness is right at the heart of the deal we have made with the devil for performance. There are more different caches in modern chips than you would ever imagine, many of them undocumented or barely even mentioned.
The ones we know of include your regular data caches, along with instruction caches, micro-op caches, branch-target caches, and page-map caches (TLBs). Many other mechanisms act in the role of caches, too: the familiar pipelines, register renaming, speculative execution, and memory prefetching.
The soul we have handed over to the devil in exchange is made of data security (Spectre, etc.) and our ability to predict the consequences of design choices. Crack-brained newbie design choices (I'm looking at you, function pointers!) get a billion transistors thrown at them, which works just well enough to prevent learning better.
Too-clever compilers do their part. Often a really dumb algorithm (e.g., counting set bits) is recognized and replaced with special instructions not normally accessible from the language, ending up faster than a smarter algorithm the compiler cannot recognize.
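To make that concrete -- a minimal C++ sketch, assuming a recent Clang or GCC on x86-64 (the function name and build flags are mine, not from the comment):

    #include <cstdint>

    // Kernighan's bit-count loop: each pass clears the lowest set bit.
    // With optimization (e.g. clang++ -O2 -march=x86-64-v2), the
    // loop-idiom pass can recognize this and emit a single POPCNT
    // instruction in place of the whole loop.
    unsigned popcount_naive(std::uint64_t x) {
        unsigned count = 0;
        while (x != 0) {
            x &= x - 1;  // clear the lowest set bit
            ++count;
        }
        return count;
    }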
I have seen 2x performance differences because the compiler decided the unlikely branch in a loop was the more likely one, or because it was optimizing for older chips that like random other instructions mixed into a sequence, where newer chips prefer to fuse a comparison and a directly adjacent branch into a single micro-op -- but only in a small-enough loop.
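For what it's worth, one way to push back on the compiler's guess is __builtin_expect (GCC/Clang). A hedged sketch, with a hypothetical hot loop where the branch is almost never taken:

    // Tell the compiler the negative case is rare, so it lays out the
    // loop with the common path falling straight through -- the layout
    // the complaint above is about the compiler getting backwards.
    long sum_positive(const int *v, long n) {
        long sum = 0;
        for (long i = 0; i < n; ++i) {
            if (__builtin_expect(v[i] < 0, 0))  // rarely taken
                continue;
            sum += v[i];
        }
        return sum;
    }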
Posted Aug 22, 2018 14:11 UTC (Wed) by cagrazia (guest, #124754)
Posted Aug 22, 2018 17:32 UTC (Wed) by excors (subscriber, #95769)
> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.
Isn't that pretty much exactly what a GPU is? They basically do everything with 8/16/32-wide vectors (though presented as scalars to programmers), have many thousands of threads, limited pipelining within a thread, no out-of-order execution, and a memory model that typically gives you absolutely no cache coherence (except via special instructions that simply bypass the L1 caches, like atomics or Vulkan's Coherent variables).
GPUs have been widely available for a long time and are much faster than CPUs in terms of FLOPS and memory bandwidth, but even most new software (where legacy compatibility isn't that important) is still written almost entirely for CPUs, so that kind of hardware design is evidently not the solution. (It's still a commercial success, though. And incidentally most GPU programming uses dialects of C, so C isn't holding them back.)
As for the CPU features that are relevant to Spectre - speculative execution and caches - if their invisibility from C makes C not a low-level language, then x86 machine code is not a low-level language either, and it seems a bit unfair to blame C for the limitations of machine code.
I think most programmers who care about low-level details can tolerate the complex transformations that a C compiler performs, because they can always check the generated assembly code to see what really happened, and can easily influence the compiler (with intrinsics, attributes, compiler flags, inline assembly, compiler plugins, etc), so it's not really all that mysterious. But the transformation between machine code and what actually happens in the CPU is much more opaque and poorly documented and hard for software people to influence, which I think is why Spectre was such a hard problem to discover and to solve.
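A sketch of those escape hatches, assuming GCC or Clang on x86-64 (the function and data layout here are hypothetical): a builtin pins down behavior the optimizer would otherwise decide on its own, and compiling with -S lets you read what actually came out.

    #include <cstddef>

    // Pointer-chasing loop with an explicit prefetch hint.
    // __builtin_prefetch (GCC/Clang) asks for the next node's data to
    // be pulled toward the cache before it is needed. Compile with
    // g++ -O2 -S to inspect the assembly and confirm what was emitted.
    long walk(const long *data, const std::size_t *next,
              std::size_t i, int steps) {
        long sum = 0;
        while (steps-- > 0) {
            __builtin_prefetch(&data[next[i]]);  // hint; may be dropped
            sum += data[i];
            i = next[i];
        }
        return sum;
    }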
Hmm, that kind of implies that moving most of the magic out of the CPU hardware and into the C compiler would help a lot. But that sounds like the Itanium approach, and that didn't work so well either.
Posted Aug 22, 2018 23:24 UTC (Wed) by ncm (guest, #165)
We have got some more mileage out of C++ (which is quite a lot faster than C, for common problems), compiler intrinsics, and shaders, but we may find that a powerful functional language is needed to manage the complexity of what modern hardware can be. (If only the functional-language literati could break their nasty garbage-collection habit!) The biggest opportunities for using what hardware can be made to do depend on values not having, or needing, addresses for longer periods. There are blessed moments in C++ when values have no addresses, and we can get a lot done in those moments.
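One reading of those blessed moments, as a hedged C++ sketch (the types and names are mine): small value types passed and returned by value, where nothing ever takes an address.

    // A small value type passed and returned by value. On x86-64 it
    // travels in SSE registers; as long as nobody takes its address,
    // the whole chain below can stay out of memory, and the compiler
    // may reorder and combine freely with no aliasing to worry about.
    struct Vec { float x, y, z; };

    Vec scale(Vec v, float s) { return {v.x * s, v.y * s, v.z * s}; }
    Vec add(Vec a, Vec b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
    float len2(Vec v)         { return v.x*v.x + v.y*v.y + v.z*v.z; }

    float demo(Vec a, Vec b, float s) {
        return len2(add(scale(a, s), b));  // values with no addresses
    }

The moment an & is taken, the value needs a home in memory, and that freedom is gone.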
Much of what the billions of transistors on a modern CPU do is just keep this computation from getting in the way of that computation. Remarkably few of them are actually doing the computations themselves. By eliminating the von Neumann hierarchical, addressable memory model, those transistors could be put to work properly -- if only we knew a better way to program them.
"Say what you will about sequential instruction machines, they don't often have a problem deciding what to do in the next cycle."