Linear speedup
Posted Mar 6, 2025 11:34 UTC (Thu) by SLi (subscriber, #53131)
Parent article: Two new graph-based functional programming languages
I think that to measure whether the speedup is linear, you'd need to run it on a single GPU core and compare to many GPU cores. Another interesting metric could be the energy consumption of the calculation on a CPU vs GPU.
I think a 32x speedup on GPU vs CPU suggests almost perfect parallelization. It of course also depends on the details of the CPU and the GPU. In my own deep learning work, I use a rule of thumb of expecting a high-end GPU to beat a very decent CPU by a factor of 40-100x (for training, which is the best case for a GPU).
Posted Mar 6, 2025 12:32 UTC (Thu) by excors (subscriber, #95769) [Link] (3 responses)
As I understand it, a CUDA core (in Nvidia's terminology) is essentially a single 32-bit ALU. A modern Intel CPU core has the equivalent of maybe 76 32-bit ALUs (depending on how you count). A GPU with 6144 CUDA cores therefore has only about 10x more theoretical compute power than an 8-core CPU.
Obviously there are very big differences in how easily you can exploit the full compute power, so comparing ALU count is not very meaningful, but it's more meaningful than comparing "cores".
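As a back-of-the-envelope check of that ratio (using the rough estimates above, not spec-sheet figures), the arithmetic works out like this:

    // Hypothetical sanity check of the "about 10x" figure, using the
    // estimates from this comment rather than spec-sheet numbers.
    #include <cstdio>

    int main() {
        const int gpu_cuda_cores    = 6144; // roughly one 32-bit ALU each
        const int cpu_cores         = 8;
        const int alus_per_cpu_core = 76;   // rough equivalent, as estimated above
        double ratio = double(gpu_cuda_cores) / (cpu_cores * alus_per_cpu_core);
        printf("GPU/CPU 32-bit ALU ratio: ~%.1fx\n", ratio); // prints ~10.1x
        return 0;
    }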
Posted Mar 6, 2025 15:51 UTC (Thu) by Athas (subscriber, #104895) [Link] (2 responses)
Focusing on arithmetic is a bit misleading, however, as most programs are not bottlenecked by arithmetic but by memory accesses. GPUs also have far higher memory bandwidth (and latency) than CPUs, although with a lot of caveats about what the access patterns must look like in order to fully utilize a GPU memory system. In fact, that is the reason why I am a little skeptical of Bend ever performing well on a GPU. While the execution model is obviously parallel, do we have any reason to suspect that the memory access pattern will have the kind of specific cross-thread spatial locality that is needed to take advantage of GPU memory? I don't see it. Perhaps Bend can run well on a specialised graph reduction machine, but on today's machines the main problem is locality, and I don't see how interaction nets address that challenge.
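To make the access-pattern caveat concrete, here is a hypothetical CUDA sketch (invented kernel and type names, nothing to do with Bend's or HVM's actual runtime). In the first kernel, adjacent threads read adjacent words, so a warp's loads coalesce into a few wide memory transactions; in the second, each value lives in its own heap allocation, which is roughly what a naive graph-of-nodes representation looks like to the memory system:

    // Coalesced: thread i touches element i, so a warp's 32 accesses land in
    // a handful of contiguous cache lines and bandwidth is used efficiently.
    __global__ void sub_min_coalesced(unsigned *vals, unsigned n, unsigned minval) {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            vals[i] -= minval;
    }

    // Pointer-chasing: each value sits in its own allocation, so adjacent
    // threads dereference unrelated addresses and the same arithmetic costs
    // many scattered memory transactions.
    struct HeapNode { unsigned val; };

    __global__ void sub_min_scattered(HeapNode **nodes, unsigned n, unsigned minval) {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            nodes[i]->val -= minval;
    }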
Posted Mar 6, 2025 21:15 UTC (Thu) by NYKevin (subscriber, #129325) [Link]
I also assume that a "node" is an implementation detail and does not really need to exist at runtime (for reflection and the like), so that the optimizer is allowed to reason (loosely) as follows:

1. As a starting point, a node in this case is something like (subtract, val, ~min), and we construct a temporary array of these nodes (the nodes are not the same size as the input type, so we can't put them in the original array). The last operation of the function is to evaluate all of these nodes and copy the results back into the input array, and we start out with a loop that evaluates each node (possibly in parallel).
2. These nodes do not escape the function, so we do not need to obey any particular ABI for them, and can change their representation in arbitrary ways (as long as we also change the evaluation function to match).
3. Subtract is a constant. Elide it. Nodes are now (val, ~min) and the evaluation operation is now hard-coded to do a subtract.
4. ~min is a single value over the whole array. Move it to a register. Nodes are now (val), and the evaluation operation is now hard-coded to subtract ~min from each node. Furthermore, since nodes are now the same size as N32, we can elide the temporary array and do the evaluation in-place.
5. The tuple wrapper around (val) isn't doing anything useful (and presumably doesn't even change the runtime representation at all). Elide it. Nodes are now val, which is the same type and value as the input array, so we need not do anything to construct nodes - we can just reinterpret the input array as an array of pre-constructed nodes.
6. Notice that the evaluation function is exactly identical to a vectorized subtract. So emit that instead of a loop.
7. Now the function doesn't construct nodes at all, so if there are any further optimization opportunities, we can find them with conventional analysis.

Of course, this all depends on a lot of assumptions. If they are representing every object as a separate heap allocation as in CPython, and List[N32] is an array of pointers (or a cons list), then yeah, it is going to be slow. But then the problem is the representation itself: everything as a separate heap allocation (and lists built out of cons cells, if they are doing that). That is not inherent to the graph theory; it's just a consequence of how you decided to represent your data at runtime, and (to some extent) how much freedom you decided to expose in your APIs (e.g. if you get yourself into the business of garbage collecting arbitrary object graphs, that is going to slow you down at least a little compared to a more restrictive set of pointer/reference rules that is amenable to refcounting and the like).
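To illustrate where that chain of reasoning starts and ends, here is a sketch with invented names (this is not HVM's or Bend's actual output): step 1 corresponds to eval_nodes, and steps 3-7 collapse it into eval_direct, which a compiler can emit as a vectorized subtract.

    #include <cstdint>
    #include <vector>

    // Step 1: explicit (subtract, val, ~min) nodes in a temporary array,
    // evaluated at the end and copied back into the input array.
    enum { OP_SUB };
    struct Node { int op; uint32_t val; uint32_t min; };

    void eval_nodes(uint32_t *vals, size_t n, uint32_t min) {
        std::vector<Node> nodes(n);
        for (size_t i = 0; i < n; i++)
            nodes[i] = { OP_SUB, vals[i], min };
        for (size_t i = 0; i < n; i++)          // evaluate, copy back
            vals[i] = nodes[i].val - nodes[i].min;
    }

    // Steps 3-7: the op is a constant, ~min is loop-invariant, and the node
    // wrapper adds nothing, so everything collapses to an in-place loop that
    // the compiler can emit as a vectorized subtract.
    void eval_direct(uint32_t *vals, size_t n, uint32_t min) {
        for (size_t i = 0; i < n; i++)
            vals[i] -= min;
    }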
Posted Mar 27, 2025 16:27 UTC (Thu) by Spudd86 (subscriber, #51683) [Link]
If you can do the analysis to prove something can be done in parallel without locks, you can likely also do the work to re-organize the data with the same analysis. That is especially true in functional languages, where people aren't writing update-in-place logic in the first place.
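As a hypothetical illustration of what "re-organize the data" could look like (invented types; not anything Bend or HVM is documented to do), the same independence analysis that justifies the parallel loop also justifies flattening a linked, heap-allocated representation into a contiguous array before the parallel step:

    #include <cstdint>
    #include <vector>

    // Hypothetical: a cons-style list (poor locality) flattened into a
    // contiguous array (good locality) once analysis shows the elements are
    // processed independently.
    struct Cons { uint32_t head; Cons *tail; };

    std::vector<uint32_t> flatten(const Cons *list) {
        std::vector<uint32_t> out;
        for (const Cons *c = list; c != nullptr; c = c->tail)
            out.push_back(c->head);
        return out;  // contiguous, cache- and GPU-friendly storage
    }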
Posted Mar 6, 2025 14:04 UTC (Thu) by daroc (editor, #160859) [Link]
The basic existing benchmarks are definitely exciting, though, for exactly the reasons you point out: for problems that can be parallelized, it sure looks like Bend does an excellent job of doing that without much additional effort on the programmer's part.