Batch processing of network packets
Posted Aug 22, 2018 17:32 UTC (Wed) by excors (subscriber, #95769)
In reply to: Batch processing of network packets by cagrazia
Parent article: Batch processing of network packets
> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.
Isn't that pretty much exactly what a GPU is? They basically do everything with 8/16/32-wide vectors (though presented as scalars to programmers), with many thousands of threads, limited pipelining within a thread, no out-of-order execution, and the memory model is typically that you get absolutely no cache coherence (except when using special instructions that simply bypass the L1 caches, like atomics or Vulkan's Coherent variables).
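To illustrate (this is a hypothetical sketch in plain C++, not real GPU code): in the SIMT model, each lane of a fixed-width "warp" runs the same scalar-looking program in lockstep, and divergent branches are handled by executing both sides under an execution mask. Roughly:

```cpp
// Hypothetical illustration of SIMT execution: a "warp" of 8 lanes runs
// the same scalar-looking program in lockstep. A divergent branch is
// handled by executing BOTH paths and masking off inactive lanes, which
// is roughly how GPU hardware presents wide vectors as scalars.
#include <array>

constexpr int WARP = 8;

std::array<int, WARP> run_kernel(const std::array<int, WARP>& input) {
    std::array<int, WARP> out{};
    // Per-lane branch condition: even inputs take the "then" path.
    std::array<bool, WARP> mask{};
    for (int lane = 0; lane < WARP; ++lane)
        mask[lane] = (input[lane] % 2 == 0);
    // "then" side, executed for all lanes but masked:
    for (int lane = 0; lane < WARP; ++lane)
        if (mask[lane]) out[lane] = input[lane] / 2;
    // "else" side, executed for the complementary mask:
    for (int lane = 0; lane < WARP; ++lane)
        if (!mask[lane]) out[lane] = input[lane] * 3 + 1;
    return out;
}
```

The programmer writes the body as if for one scalar thread; the masking and lockstep execution are the hardware's business.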
GPUs have been widely available for a long time and are much faster than CPUs in terms of FLOPS and memory bandwidth, but even most new software (where legacy compatibility isn't that important) is still written almost entirely for CPUs, so that kind of hardware design is evidently not the solution. (It's still a commercial success, though. And incidentally most GPU programming uses dialects of C, so C isn't holding them back.)
As for the CPU features that are relevant to Spectre - speculative execution and caches - if their invisibility from C makes C not a low-level language, then x86 machine code is not a low-level language either, and it seems a bit unfair to blame C for the limitations of machine code.
I think most programmers who care about low-level details can tolerate the complex transformations that a C compiler performs, because they can always check the generated assembly code to see what really happened, and can easily influence the compiler (with intrinsics, attributes, compiler flags, inline assembly, compiler plugins, etc), so it's not really all that mysterious. But the transformation between machine code and what actually happens inside the CPU is much more opaque, poorly documented, and hard for software people to influence, which I think is why Spectre was such a hard problem to discover and to solve.
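As a concrete example of that kind of influence (a small sketch assuming GCC or Clang, whose builtins and attributes are used here), a programmer can steer code generation without leaving C/C++ and then verify the result by inspecting the emitted assembly, e.g. with `g++ -O2 -S` or a tool like Compiler Explorer:

```cpp
// Sketch assuming GCC/Clang: builtins and attributes let the programmer
// influence code generation directly from C/C++.
#include <cstdint>

// __builtin_popcount typically compiles to a single POPCNT instruction
// on x86 when the target supports it.
int set_bits(std::uint32_t x) { return __builtin_popcount(x); }

// __builtin_expect hints the likely branch direction so the compiler can
// lay out the hot path; [[gnu::noinline]] pins the function boundary so
// its generated assembly is easy to locate and inspect.
[[gnu::noinline]] int classify(int err) {
    if (__builtin_expect(err != 0, 0))  // hint: err != 0 is unlikely
        return -1;
    return 0;
}
```

None of this is portable ISO C++, but it shows the escape hatches the comment refers to: the compiler's transformations are observable and steerable in a way that the CPU's internal speculation is not.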
Hmm, that kind of implies that moving most of the magic out of the CPU hardware and into the C compiler would help a lot. But that sounds like the Itanium approach, and that didn't work so well either.
Posted Aug 22, 2018 23:24 UTC (Wed) by ncm (guest, #165)
We have got some more mileage out of C++ (which is quite a lot faster than C, for common problems), compiler intrinsics, and shaders, but we may find that a powerful functional language is needed to manage the complexity of what modern hardware can be. (If only the functional-language literati could break their nasty garbage-collection habit!) The biggest opportunities for using what hardware can be made to do depend on values not having or needing addresses, for longer periods. There are blessed moments in C++ when values have no addresses, and we can get a lot done in those moments.
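One such "blessed moment" (a sketch of my own choosing, not the comment author's example) is `constexpr` evaluation: a value computed entirely at compile time never occupies memory at all, so the compiler is unconstrained by the addressable-memory model while producing it:

```cpp
// Sketch: one C++ mechanism for values without addresses. The whole dot
// product below is evaluated at compile time; its intermediate values
// never exist in runtime memory, so no loads, stores, or addresses are
// involved in producing the result.
constexpr int dot(const int (&a)[4], const int (&b)[4]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) sum += a[i] * b[i];
    return sum;
}

constexpr int a[4] = {1, 2, 3, 4};
constexpr int b[4] = {5, 6, 7, 8};

// Verified by the compiler itself, before any code runs.
static_assert(dot(a, b) == 70, "computed without runtime memory");
```

The same freedom appears at runtime with prvalues and values the optimizer keeps in registers, as long as nothing takes their address.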
Much of what the hundreds of billions of transistors on CPU chips nowadays do is just keep this computation from getting in the way of that computation. Remarkably few of them are actually doing the computations themselves. By eliminating the von Neumann hierarchical, addressable memory model, those transistors could be put to work properly -- if only we knew a better way to program them.
"Say what you will about sequential instruction machines, they don't often have a problem deciding what to do in the next cycle."