LWN: Comments on "Two new graph-based functional programming languages"
https://lwn.net/Articles/1011803/
This is a special feed containing comments posted to the individual LWN article titled "Two new graph-based functional programming languages".

Linear speedup
https://lwn.net/Articles/1015628/
Posted by Spudd86 on Thu, 27 Mar 2025 16:27:55 +0000

That's a memory placement problem, and it would also be under the control of the compiler. Because of the execution model of GPUs, there is far less reason to keep a single representation of the data. It's usually far better to build kernels that store their output to a copy, and to change the layout of that new copy to suit the next step.

If you can do the analysis to prove something can be done in parallel without locks, you can also likely do the work to reorganize the data with the same analysis. Especially in functional languages, where people aren't writing update-in-place logic in the first place.

Chicken and egg
https://lwn.net/Articles/1013454/
Posted by marcH on Sat, 08 Mar 2025 17:55:26 +0000

> The major disadvantage is that computers don't work like that. In practice, CPUs excel at applying complex operations to small amounts of data that are locally held in the cache, and struggle with operations that require operating over large blocks of memory.

To some extent, computers don't work like that because... workloads like Bend and Vine are not popular! Microbenchmarks used by CPU designers are based on popular software. It's even common to say that computers are not designed to run software but benchmarks :-)

Bret Victor's "The Future of Programming" has a good joke about that (among other good ones), and of course there's the seminal "C Is Not a Low-Level Language". Jim Keller has been pretty open about hardware design in interviews, but unlike the previous two you need a lot more time.

GPGPUs, CUDA, etc. had to happen anyway; it was inevitable. Funny that the seed money came from gamers.

Sorting on GPU
https://lwn.net/Articles/1013340/
Posted by jezuch on Fri, 07 Mar 2025 14:15:03 +0000

> 68 seconds on a single thread, 25 seconds on four threads (although this may have been impacted by hyperthreading on the virtual machine, which had access to four dedicated vCPUs), and 0.78 seconds on the GPU.

Nice. I haven't been keeping up with developments in sorting algorithms, so I completely missed how you can make them efficient on GPUs. (By "efficient" I mean "efficiently exploiting the parallelism", because Wikipedia tells me that bitonic sort has a complexity of O(n × log^2 n), which is worse than the O(n log n) lower bound for serial comparison sorts.)
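For reference, the bitonic sorting network mentioned here is short enough to sketch directly. A minimal serial C version follows, assuming the input length is a power of two; the function and the test values are illustrative, not taken from the article. The key property is that every iteration of the innermost loop within one (k, j) pass is an independent compare-and-swap, which is why the algorithm maps well onto one-thread-per-element GPU kernels even though it performs O(n × log^2 n) comparisons in total.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Iterative bitonic sort for an array whose length n is a power of two.
       Within a given (k, j) pass, every compare-and-swap is independent of
       the others, so each one can be handled by its own GPU thread. */
    static void bitonic_sort(uint32_t *a, size_t n)
    {
        for (size_t k = 2; k <= n; k <<= 1) {          /* length of the runs being merged */
            for (size_t j = k >> 1; j > 0; j >>= 1) {  /* comparator distance for this pass */
                for (size_t i = 0; i < n; i++) {       /* independent: one GPU thread each */
                    size_t partner = i ^ j;
                    if (partner > i) {
                        int ascending = (i & k) == 0;
                        if ((ascending && a[i] > a[partner]) ||
                            (!ascending && a[i] < a[partner])) {
                            uint32_t tmp = a[i];
                            a[i] = a[partner];
                            a[partner] = tmp;
                        }
                    }
                }
            }
        }
    }

    int main(void)
    {
        uint32_t data[8] = { 5, 1, 4, 2, 8, 0, 7, 3 };

        bitonic_sort(data, 8);
        for (int i = 0; i < 8; i++)
            printf("%u ", data[i]);     /* prints: 0 1 2 3 4 5 7 8 */
        printf("\n");
        return 0;
    }

On a GPU, the two outer loops would typically stay on the host and each (k, j) pass would become one kernel launch over n threads, so the depth of the network (on the order of log^2 n passes), rather than the total comparison count, is what bounds the runtime.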
Linear speedup
https://lwn.net/Articles/1013271/
Posted by NYKevin on Thu, 06 Mar 2025 21:15:16 +0000

I do not pretend to be an expert on these interaction nets. But based on the sample code shown in the article, it looks like it would be fairly straightforward to vectorize most of these operations, if we assume that List[N32] is a contiguous array and not (e.g.) a cons list or something.

I also assume that a "node" is an implementation detail and does not really need to exist at runtime (for reflection and the like), so that the optimizer is allowed to reason (loosely) as follows:

1. As a starting point, a node in this case is something like (subtract, val, ~min), and we construct a temporary array of these nodes (the nodes are not the same size as the input type, so we can't put them in the original array). The last operation of the function is to evaluate all of these nodes and copy the results back into the input array, and we start out with a loop that evaluates each node (possibly in parallel).
2. These nodes do not escape the function, so we do not need to obey any particular ABI for them, and can change their representation in arbitrary ways (as long as we also change the evaluation function to match).
3. Subtract is a constant. Elide it. Nodes are now (val, ~min), and the evaluation operation is now hard-coded to do a subtract.
4. ~min is a single value over the whole array. Move it to a register. Nodes are now (val), and the evaluation operation is now hard-coded to subtract ~min from each node. Furthermore, since nodes are now the same size as N32, we can elide the temporary array and do the evaluation in place.
5. The tuple wrapper around (val) isn't doing anything useful (and presumably doesn't even change the runtime representation at all). Elide it. Nodes are now val, which is the same type and value as the input array, so we need not do anything to construct nodes; we can just reinterpret the input array as an array of pre-constructed nodes.
6. Notice that the evaluation function is now exactly identical to a vectorized subtract. So emit that instead of a loop.
7. Now the function doesn't construct nodes at all, so if there are any further optimization opportunities, we can find them with conventional analysis.

Of course, this all depends on a lot of assumptions. If they are representing every object as a separate heap allocation, as in CPython, and List[N32] is an array of pointers (or a cons list), then yeah, it is going to be slow. But the problem with doing that is representing everything as a separate heap allocation (and building lists out of cons cells, if they are doing that). It's not inherent to the graph theory; it's just a consequence of how you decided to represent your data at runtime, and (to some extent) how much freedom you decided to expose in your APIs (e.g. if you get into the business of garbage collecting arbitrary object graphs, that is going to slow you down at least a little compared to a more restrictive set of pointer/reference rules that is amenable to refcounting and the like).
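A minimal C sketch of the end state this chain of reasoning arrives at, under the same assumption that List[N32] is a contiguous array of 32-bit integers; the function name subtract_min is made up for illustration and is not anything from Bend or Vine:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical end state of the rewrite sketched above: once the
       (subtract, val, ~min) nodes have been elided, all that is left is
       "subtract one scalar from every element", which a compiler can
       auto-vectorize (or which can be written directly with SIMD intrinsics). */
    void subtract_min(uint32_t *vals, size_t n, uint32_t min)
    {
        for (size_t i = 0; i < n; i++)
            vals[i] -= min;    /* no node objects survive to run time */
    }

Whether Bend's compiler and runtime actually perform this kind of node elision is a separate question; the point is only that nothing in the source program forces the nodes to exist at run time.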
Linear speedup
https://lwn.net/Articles/1013242/
Posted by Athas on Thu, 06 Mar 2025 15:51:25 +0000

It's a little more complicated than that - you need to take arithmetic throughput, core count, and frequency into account. Ultimately, you can indeed compute the peak computational capacity of a CPU or GPU (although it's usually given in FLOP/s - I suspect measuring integer addition may be trickier), although 10x does seem fairly reasonable.

Focusing on arithmetic is a bit misleading, however, as most programs are not bottlenecked by arithmetic but by memory accesses. GPUs also have far higher memory bandwidth (and latency) than CPUs, although with a lot of caveats about how the access patterns must look in order to fully utilize a GPU memory system. In fact, that is the reason why I am a little skeptical of Bend ever performing well on a GPU. While the execution model is obviously parallel, do we have any reason to suspect that the memory access pattern will have the kind of specific cross-thread spatial locality that is needed to take advantage of GPU memory? I don't see it. Perhaps Bend can run well on a specialised graph reduction machine, but on today's machines the main problem is locality, and I don't see how interaction nets address that challenge.

Very interesting article!
https://lwn.net/Articles/1013190/
Posted by daroc on Thu, 06 Mar 2025 14:08:58 +0000

So Taelin goes into more detail about that in his discussion of the history of the project, actually. Long story short: interaction nets are not optimal for every possible program, but there is a large subset of programs for which they are provably optimal. Bend, as a language, has a certain number of contortions in its type system to ensure that any programs you write with it fall inside the subset that can be evaluated optimally.

(Specifically, programs that are well-typed with elementary affine types can be executed optimally; they don't require the use of Lamping's oracle, which eliminates the brackets and lenses from his original algorithm.)

And some of Bend's benchmarks do demonstrate an asymptotic speedup compared to the same algorithm written in Haskell and executed either lazily or strictly. The challenge now is improving Bend's code generation, libraries, runtime, etc. in order to make things usefully faster in practice, as opposed to just in theory.

Linear speedup
https://lwn.net/Articles/1013188/
Posted by daroc on Thu, 06 Mar 2025 14:04:24 +0000

Well, yes. As I mentioned in the article, it's a rough measure. What is actually faster is going to depend strongly not only on your hardware, but also on your particular problem and how well it can be automatically parallelized.

The basic existing benchmarks are definitely exciting, though, for exactly the reasons you point out: for problems that can be parallelized, it sure looks like Bend does an excellent job of doing that without much additional effort on the programmer's part.

Linear speedup
https://lwn.net/Articles/1013179/
Posted by excors on Thu, 06 Mar 2025 12:32:40 +0000

> Are you assuming the GPU cores are as fast as your CPU cores? That's very unlikely to be true.

As I understand it, a CUDA core (in Nvidia's terminology) is essentially a single 32-bit ALU. A modern Intel CPU core has the equivalent of maybe 76 32-bit ALUs (depending on how you count), so a GPU with 6144 CUDA cores has only about 10x more theoretical compute power than an 8-core CPU.

Obviously there are very big differences in how easily you can exploit the full compute power, so comparing ALU counts is not very meaningful, but it's more meaningful than comparing "cores".
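A back-of-the-envelope version of that comparison, treating the numbers above (8 CPU cores, 76 ALU-equivalents per CPU core, one 32-bit ALU per CUDA core, 6144 CUDA cores) as illustrative assumptions rather than measurements:

    #include <stdio.h>

    /* Compare theoretical 32-bit ALU counts instead of "cores".
       The figures are the illustrative assumptions from the comment above;
       clock speed is ignored. */
    int main(void)
    {
        int cpu_cores = 8, cpu_alus_per_core = 76;
        int gpu_cuda_cores = 6144;

        int cpu_alus = cpu_cores * cpu_alus_per_core;    /* 608 */

        printf("GPU/CPU ratio of 32-bit ALUs: %.1f\n",
               (double)gpu_cuda_cores / cpu_alus);        /* prints ~10.1 */
        return 0;
    }

Accounting for clock speed would shrink the ratio somewhat, since CPU cores usually run at higher frequencies than GPU cores.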
Linear speedup
https://lwn.net/Articles/1013172/
Posted by SLi on Thu, 06 Mar 2025 11:34:11 +0000

Why is that not a linear speedup? I mean, it could be, it could not be. Are you assuming the GPU cores are as fast as your CPU cores? That's very unlikely to be true.

I think that to measure whether the speedup is linear, you'd need to run it on a single GPU core and compare that to many GPU cores. Another interesting metric could be the energy consumption of the calculation on a CPU vs. a GPU.

I think a 32x speedup on a GPU vs. a CPU suggests almost perfect parallelization. It depends, of course, also on the details of the CPU and the GPU. In my own deep learning work, I use a rule of thumb of expecting a high-end GPU to beat a very decent CPU by 40-100x (for training, which is the best case for a GPU).

Very interesting article!
https://lwn.net/Articles/1013155/
Posted by bredelings on Thu, 06 Mar 2025 02:08:39 +0000

I didn't realize that people were still working on that stuff! I had contented myself with the Spineless Tagless G-machine.

Is there some promise of doing optimal reduction by using interaction nets? I seem to recall that optimal reduction was even better than specializing each function for the data it was applied to, but in practice determining the optimal reduction order had too much overhead. But perhaps using interaction nets changes that...
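For concreteness, the timings quoted in the "Sorting on GPU" comment near the top of this thread (68 seconds on one CPU thread, 25 seconds on four threads, 0.78 seconds on the GPU) work out to the following speedup and parallel-efficiency figures; a small C sketch of the arithmetic:

    #include <stdio.h>

    /* Speedup and parallel efficiency computed from the timings quoted in
       the article: 68 s on one CPU thread, 25 s on four threads, and
       0.78 s on the GPU. */
    int main(void)
    {
        double t1 = 68.0, t4 = 25.0, tgpu = 0.78;
        double s4 = t1 / t4;                   /* ~2.72x on four threads */

        printf("4-thread speedup: %.2fx (efficiency %.0f%%)\n",
               s4, 100.0 * s4 / 4.0);           /* ~68% */
        printf("GPU speedup over one CPU thread: %.0fx\n",
               t1 / tgpu);                      /* ~87x */
        return 0;
    }

The roughly 87x figure compares one CPU thread against a whole GPU, which, as noted above, says little about whether the scaling across the GPU's own cores is linear.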