
Linear speedup

Posted Mar 6, 2025 11:34 UTC (Thu) by SLi (subscriber, #53131)
Parent article: Two new graph-based functional programming languages

Why is that not a linear speedup? I mean, it could be, or it could not be. Are you assuming the GPU cores are as fast as your CPU cores? That's very unlikely to be true.

I think that to measure whether the speedup is linear, you'd need to run it on a single GPU core and compare to many GPU cores. Another interesting metric could be the energy consumption of the calculation on a CPU vs GPU.

I think a 32x speedup on GPU vs CPU suggests almost perfect parallelization. It depends, of course, on the details of the CPU and the GPU. In my own deep-learning work, I use a rule of thumb of expecting a high-end GPU to beat a very decent CPU by a factor of 40-100x (for training, which is the best case for the GPU).



Linear speedup

Posted Mar 6, 2025 12:32 UTC (Thu) by excors (subscriber, #95769) [Link] (3 responses)

> Are you assuming the GPU cores are as fast as your CPU cores? That's very unlikely to be true.

As I understand it, a CUDA core (in Nvidia's terminology) is essentially a single 32-bit ALU. A modern Intel CPU core has the equivalent of maybe 76 32-bit ALUs (depending on how you count). A GPU with 6144 CUDA cores therefore has only about 10x the theoretical compute power of an 8-core CPU.
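(A back-of-the-envelope check of that 10x figure, taking the numbers above at face value; the ~76-lane figure per CPU core is, as noted, a matter of how you count:)

    // Rough check of the GPU/CPU ratio using the figures from this comment.
    #include <cstdio>

    int main() {
        const int cpu_cores      = 8;
        const int lanes_per_core = 76;    // rough 32-bit ALU equivalent per CPU core
        const int cuda_cores     = 6144;  // one 32-bit lane each

        int cpu_lanes = cpu_cores * lanes_per_core;   // 608
        printf("GPU/CPU lane ratio: ~%.1fx\n",
               (double)cuda_cores / cpu_lanes);       // ~10.1x
        return 0;
    }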

Obviously there are very big differences in how easily you can exploit the full compute power, so comparing ALU count is not very meaningful, but it's more meaningful than comparing "cores".

Linear speedup

Posted Mar 6, 2025 15:51 UTC (Thu) by Athas (subscriber, #104895) [Link] (2 responses)

It's a little more complicated than that: you need to take arithmetic throughput, core count, and frequency into account. Ultimately, you can indeed compute the peak computational capacity of a CPU or GPU (it's usually given in FLOP/s; I suspect measuring integer addition may be trickier), and 10x does seem fairly reasonable.
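(A rough sketch of that peak-capacity calculation; the core counts, SIMD widths, and clocks below are illustrative made-up values, not measurements of any particular chip:)

    // peak FLOP/s ~= cores * SIMD lanes * FMA units * 2 (FLOPs per FMA) * clock
    #include <cstdio>

    static double peak_gflops(int cores, int lanes, int fma_units, double ghz) {
        return cores * lanes * fma_units * 2.0 * ghz;   // GHz in -> GFLOP/s out
    }

    int main() {
        // Hypothetical 8-core CPU: two 512-bit FMA units (16 fp32 lanes each) at 4 GHz.
        double cpu = peak_gflops(8, 16, 2, 4.0);     // ~2048 GFLOP/s
        // Hypothetical GPU: 48 SMs of 128 FMA lanes at 1.8 GHz (6144 "CUDA cores").
        double gpu = peak_gflops(48, 128, 1, 1.8);   // ~22118 GFLOP/s
        printf("ratio: ~%.1fx\n", gpu / cpu);        // ~10.8x
        return 0;
    }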

Focusing on arithmetic is a bit misleading, however, as most programs are not bottlenecked by arithmetic but by memory accesses. GPUs also have far higher memory bandwidth (and latency) than CPUs, although with a lot of caveats about what the access patterns must look like in order to fully utilize a GPU memory system.

In fact, that is the reason why I am a little skeptical of Bend ever performing well on a GPU. While the execution model is obviously parallel, do we have any reason to suspect that the memory access pattern will have the kind of specific cross-thread spatial locality that is needed to take advantage of GPU memory? I don't see it. Perhaps Bend can run well on a specialised graph reduction machine, but on today's machines the main problem is locality, and I don't see how interaction nets address that challenge.
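(To illustrate the access-pattern caveat in general terms, not anything specific to Bend or HVM: the GPU memory system only delivers its headline bandwidth when consecutive threads in a warp touch consecutive addresses, as in the first kernel below; the second, strided pattern wastes most of each memory transaction.)

    // Coalesced: thread k reads word k, so a warp's 32 loads become a few
    // wide memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: neighbouring threads hit addresses far apart, so each load
    // touches a different cache line and effective bandwidth collapses.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[((size_t)i * stride) % n];
    }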

Linear speedup

Posted Mar 6, 2025 21:15 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

I do not pretend to be an expert on these interaction nets. But based on the sample code shown in the article, it looks like it would be fairly straightforward to vectorize most of these operations, if we assume that List[N32] is a contiguous array and not (e.g.) a cons list or something.

I also assume that a "node" is an implementation detail and does not really need to exist at runtime (for reflection and the like), so that the optimizer is allowed to reason (loosely) as follows (a rough sketch of the end result appears after the list):

1. As a starting point, a node in this case is something like (subtract, val, ~min), and we construct a temporary array of these nodes (the nodes are not the same size as the input type, so we can't put them in the original array). The last operation of the function is to evaluate all of these nodes and copy the results back into the input array, and we start out with a loop that evaluates each node (possibly in parallel).
2. These nodes do not escape the function, so we do not need to obey any particular ABI for them, and can change their representation in arbitrary ways (as long as we also change the evaluation function to match).
3. Subtract is a constant. Elide it. Nodes are now (val, ~min) and the evaluation operation is now hard-coded to do a subtract.
4. ~min is a single value over the whole array. Move it to a register. Nodes are now (val), and the evaluation operation is now hard-coded to subtract ~min from each node. Furthermore, since nodes are now the same size as N32, we can elide the temporary array and do the evaluation in-place.
5. The tuple wrapper around (val) isn't doing anything useful (and presumably doesn't even change the runtime representation at all). Elide it. Nodes are now val, which is the same type and value as the input array, so we need not do anything to construct nodes - we can just reinterpret the input array as an array of pre-constructed nodes.
6. Notice that the evaluation function is now identical to a vectorized subtract. So emit that instead of a loop.
7. Now the function doesn't construct nodes at all, so if there are any further optimization opportunities, we can find them with conventional analysis.
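(A sketch of the before/after shapes from steps 1 and 6; the node layout is invented for illustration and is not what the Bend/HVM toolchain actually emits:)

    // Step 1 (hypothetical node layout): (op, val, ~min), evaluated by a generic loop.
    enum Op { OP_SUB };
    struct Node { Op op; unsigned val; unsigned min; };

    // Steps 3-6: the op is constant, ~min is hoisted to a register, and the
    // one-element tuple is elided, so evaluation collapses to an in-place loop
    // that any vectorizing compiler (or a trivial GPU kernel) handles directly.
    void subtract_min(unsigned *vals, int n, unsigned min) {
        for (int i = 0; i < n; i++)
            vals[i] -= min;
    }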

Of course, this all depends on a lot of assumptions. If they are representing every object as a separate heap allocation as in CPython, and List[N32] is an array of pointers (or a cons list), then yeah, it is going to be slow. But the problem there is the representation: everything as a separate heap allocation (and lists built out of cons cells, if they are doing that). It's not inherent to the graph theory; it's just a consequence of how you decided to represent your data at runtime, and (to some extent) how much freedom you decided to expose in your APIs (e.g. if you get yourself into the business of garbage collecting arbitrary object graphs, that is going to slow you down at least a little compared to a more restrictive set of pointer/reference rules that is amenable to refcounting and the like).
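(For contrast, the two List[N32] representations being discussed, with layouts invented for illustration; the cons version defeats the vectorization sketched above because every element sits behind its own pointer:)

    // "Slow" case: one heap allocation per element, traversal is a pointer chase.
    struct ConsCell {
        unsigned  head;
        ConsCell *next;
    };

    // "Fast" case: one contiguous allocation, elements reachable by index.
    struct ArrayList {
        int       len;
        unsigned *data;
    };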

Linear speedup

Posted Mar 27, 2025 16:27 UTC (Thu) by Spudd86 (subscriber, #51683) [Link]

That's a memory placement problem, and it would also be under the control of the compiler. Because of the execution model of GPUs, there is far less reason to keep a single representation of the data. It's usually far better to build kernels that store their output to a copy, and to change the layout of that new copy so it is better suited to the next step.

If you can do the analysis to prove that something can be done in parallel without locks, you can likely also do the work to reorganize the data using the same analysis, especially in functional languages, where people aren't writing update-in-place logic in the first place.
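(One concrete form that relayout could take, with invented names and not anything Bend's compiler does today: a kernel that reads array-of-structures input and writes structure-of-arrays output, so the next kernel's per-field loads coalesce.)

    struct Particle { float x, y, z, mass; };   // AoS layout produced by one step

    // Copy into SoA layout so the next kernel reads each field contiguously.
    __global__ void aos_to_soa(const Particle *in,
                               float *x, float *y, float *z, float *mass, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            x[i]    = in[i].x;
            y[i]    = in[i].y;
            z[i]    = in[i].z;
            mass[i] = in[i].mass;
        }
    }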

Linear speedup

Posted Mar 6, 2025 14:04 UTC (Thu) by daroc (editor, #160859) [Link]

Well, yes. As I mentioned in the article, it's a rough measure. The actual details of what is faster are going to depend strongly not only on your hardware, but also on your particular problem and how well it can be automatically parallelized.

The basic existing benchmarks are definitely exciting, though, for exactly the reasons you point out: for problems that can be parallelized, it sure looks like Bend does an excellent job of doing that without much additional effort on the programmer's part.

