Two new graph-based functional programming languages
Functional programming languages have a long association with graphs. In the 1990s, it was even thought that parallel graph-reduction architectures could make functional programming languages much faster than their imperative counterparts. Alas, that prediction mostly failed to materialize. Graphs are still used as a theoretical formalism to define and optimize functional languages (as in Haskell's Spineless Tagless G-machine), but those languages are still mostly compiled down to the same old non-parallel assembly code that every other language uses. Now, two projects — Bend and Vine — have sprung up attempting to change that, and to prove that parallel graph reduction can be a useful technique for real programs.
Background
A graph is a fairly general data structure consisting of nodes (which, for example, might represent commits, filesystem objects, or network routers), connected by a set of edges (which might represent relationships between commits, directory entries, or network cables). Graphs are frequently used to represent interconnected data of all kinds.
In the case of graph-reduction-based functional programming, nodes represent the data and functions of a program, and the edges represent the dataflow between those parts. For example, a particular function application (function call) might be represented by a node that is connected to the function's arguments, its definition (represented by a function abstraction node), and where the return value is used. The program is executed by incrementally rewriting the graph into a simpler and simpler form, until all that's left is the result. Andrea Asperti and Stefano Guerrini's book, The Optimal Implementation of Functional Programming Languages, gives this example of what part of the rewrite rules for a version of the lambda calculus might look like:
That diagram is a graphical representation of a rule that rewrites function application nodes (written with an @) and function abstraction nodes (written with a λ) when they come in contact. The graph on the left represents the configuration that the rule matches, and the graph on the right is what the rule rewrites it into. The connections are labeled with letters to make the rewrite easier to see, but the actual rule itself just needs to act on these two nodes whenever they become connected. The idea behind the rule is to attach the argument being given to the function (d) to the place in the function that references its argument (b). Then the result of the application (a) is connected to the final value of the body of the function (c).
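For readers who prefer code to diagrams, here is a minimal sketch in Haskell of the same rule; the Wire and Node types are invented for illustration and are not how Bend or Vine actually represent their graphs.

-- A toy representation of a term graph: wires are identified by integers,
-- and each node records which wires it is attached to.
type Wire = Int

data Node
  = App Wire Wire Wire   -- result (a), function being applied, argument (d)
  | Lam Wire Wire Wire   -- reference to this lambda, parameter (b), body (c)
  deriving Show

-- When an application node is connected to the abstraction it applies,
-- both nodes are removed and two pairs of wires are joined: the result of
-- the application (a) is connected to the body (c), and the argument (d)
-- is connected to the parameter (b).
betaRule :: Node -> Node -> Maybe [(Wire, Wire)]
betaRule (App a f d) (Lam l b c)
  | f == l = Just [(a, c), (d, b)]
betaRule _ _ = Nothing

In a real runtime such as the HVM, the reconnections are applied directly to the in-memory graph rather than returned as a list of wire identifications.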
Through a series of small rewrite rules like this, a graph machine can implement every operation in a program. Since the nodes are all directly connected to the data that they need to reference during a rewrite, there is no global synchronization needed at run time. Because these rules are local, they can be applied in parallel with no locking, automatically scaling the evaluation of the program across as many cores as are available. In some respects, the approach is similar to cellular automata — a method of computation based on a dynamic, evolving structure rather than on a static sequence of instructions.
The major advantages of this approach are parallelism and simplicity; the evaluation of the program depends on a handful of simple rewrite rules that can be simultaneously applied across the entire program. As processors get faster, the execution time of programs is increasingly defined by the latency of retrieving information from memory. In theory, structuring a computation so that it can apply one rule at a time to every node in the program could help with that, by ensuring that the CPU needs to fetch less context for each operation, and that operations can be located near the data they require.
The major disadvantage is that computers don't work like that. In practice, CPUs excel at applying complex operations to small amounts of data held locally in the cache, and struggle with workloads that range over large blocks of memory. Evaluating a graph-reduction-based language is a pathological case for modern CPUs: simple rules whose evaluation requires fetching lots of data from memory and chasing chains of pointers. So the graph-reduction-based future of the 1990s didn't come to pass, and graphs are now mostly used in programming language implementations only as an internal compiler representation. At the end of the compilation process, compilers end up generating code in pretty much the same way regardless of language, although the details can vary.
In recent years, however, hardware trends have been shifting. GPUs are designed to apply many simple operations across a large piece of memory, although there are still problems with pointer-chasing that make implementing graph reduction difficult. Modern computers also sport many CPU cores, which makes the idea of being able to automatically parallelize execution tempting. So some projects, at least, believe that the graph-reduction-based approach warrants another look.
Bend and the HVM
Bend is a new, Apache-2.0-licensed programming language designed to scale automatically to thousands of threads. It was started by Victor Taelin, who has been experimenting with prototypes of graph-reduction-based languages for many years, and who founded the Higher Order Company to develop those ideas further. In 2024, he published a summary of his rationale for working on graph-based functional programming.
Programs written in Bend do not yet have competitive performance compared to existing functional programming languages like Haskell or OCaml, a circumstance that the language's documentation attributes to the fact that "our code gen is still embarrassingly bad, and there is a lot to improve". However, Bend already allows programs written in it to be run on a GPU and automatically parallelized, with measurable speedups from doing so.
The language does need to make a few concessions in order to make that possible. For example, a traditional garbage collector would be a single, centralized piece of code that touches every value in the program, which would immediately destroy the value of the graph representation by turning local rewrites into something with global consequences. So Bend's garbage collection is a bit unorthodox: every use of a variable points to an independent copy of its data. If a variable is used more than once, the data is copied so that each use gets its own copy; if one part of a data structure is used in one place and a different part in another, that still counts as a single use of the whole structure, so no copy is made. Because each use owns its own copy of the data, memory can be garbage-collected as soon as a variable goes out of scope, with no other coordination.
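As a rough sketch of the compile-time side of that rule, consider the following toy Haskell code (the Expr type and the needsCopy helper are invented for illustration, and are not Bend's actual intermediate representation): a value only needs to be duplicated when the variable bound to it occurs more than once.

-- A toy model of the "copy when used more than once" rule. Note that, in
-- Bend, pattern-matching a structure and then using its fields counts as a
-- single use of the whole structure; this simple counter does not model
-- that subtlety.
data Expr
  = Var String
  | Lam String Expr
  | App Expr Expr

-- Count how many times a variable occurs in an expression.
uses :: String -> Expr -> Int
uses x (Var y)   = if x == y then 1 else 0
uses x (Lam y b) = if x == y then 0 else uses x b
uses x (App f a) = uses x f + uses x a

-- In a binding like "let x = v in body", v only has to be duplicated
-- when x is used more than once in body.
needsCopy :: String -> Expr -> Bool
needsCopy x body = uses x body > 1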
Copying everything that is used more than once would not be good for performance in a traditional system, however. Bend partially circumvents that cost by performing copy operations lazily: copying a data structure and passing it to a function that only uses part of it duplicates only the parts that are actually used, and copying some data and then immediately freeing one of the copies by letting its variable go out of scope is essentially free. In the future, better code generation should help with eliminating unnecessary copies.
Bend's syntax is inspired by Python, although it does have required type annotations and some features taken from Haskell. An implementation of merge sort looks like this:
def merge_sort(l: List(f24)) -> List(f24):
  match l:
    # Here List is the type; Nil and Cons are possible values of the type
    # Nil represents the empty list.
    case List/Nil:
      return List/Nil
    case List/Cons:
      match l.tail:
        case List/Nil:
          # Returning a new list here instead of just doing "return l"
          # avoids copying. Since l.head and l.tail are each used once, l
          # is only used once.
          return List/Cons(l.head, List/Nil)
        case List/Cons:
          (as, bs) = split(l)
          return merge_lists(merge_sort(as), merge_sort(bs))
That code demonstrates a few features of Bend; notably, that accessing data inside a structure like a list is done by pattern matching. If a list is zero elements long (that is, just a Nil value) or one element long (a Cons holding a single value and ending with a Nil), it's already sorted and we can just put the list back together and return it. The interesting case is when a list is longer than that: we split the list into two parts, sort each part independently, and then merge them back together. How the list is split doesn't matter, as long as the parts are roughly equal in size, but interested readers can see one possibility here. This is a naive translation of the definition of merge sort, but the fact that the two recursive merge_sort() calls are independent of each other is enough to let Bend evaluate them in parallel.
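For illustration, here is what a suitable split could look like, sketched in Haskell rather than Bend: it deals the elements out alternately, so the two halves end up roughly equal in size without the length having to be computed first.

-- Split a list into two roughly equal halves by handing the elements out
-- alternately, one to each half.
split :: [a] -> ([a], [a])
split []             = ([], [])
split [x]            = ([x], [])
split (x : y : rest) =
  let (as, bs) = split rest
  in (x : as, y : bs)

The fact that this interleaved split does not preserve the order of the elements is fine for merge sort, since both halves are sorted afterward anyway.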
Bend compiles programs into a graph that can be executed by the Higher-order Virtual Machine (HVM). From there, the graph can either be evaluated on HVM's slow-but-definitely-correct interpreter, translated to C and compiled for execution on the CPU, or compiled to NVIDIA's CUDA language for execution on the GPU. Both of the compiled versions execute in multiple threads, although the GPU version typically has significantly more hardware threads available.
Bend is not magic, of course. Some algorithms are inherently serial, and so there's no way for Bend to effectively split them over multiple threads. Generally, in order to take advantage of Bend's capabilities, the documentation recommends writing things using a divide-and-conquer approach. For example, insertion sort is not any faster in Bend than in any other language (because each version of the array is needed to calculate the next array), but bitonic sort, which breaks the array into independent chunks during sorting, can be much faster.
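To sketch the difference (in Haskell rather than Bend, for brevity): a left-to-right fold creates a chain in which each step depends on the previous one, while a balanced divide-and-conquer version produces two independent halves that a runtime like Bend's can, in principle, reduce at the same time.

-- Serial: every addition depends on the accumulator produced by the
-- previous one, so there is nothing for a parallel runtime to exploit.
sumSerial :: [Int] -> Int
sumSerial = foldl (+) 0

-- Divide and conquer: the two recursive calls are independent of each
-- other, so they can be evaluated simultaneously.
sumParallel :: [Int] -> Int
sumParallel []  = 0
sumParallel [x] = x
sumParallel xs  =
  let (as, bs) = splitAt (length xs `div` 2) xs
  in sumParallel as + sumParallel bs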
I ran a series of performance experiments with Bend to evaluate the potential speedup from running on the GPU, using the examples from the Bend repository, which are mostly microbenchmarks of various kinds and may not be representative of real programs. Also, obviously, the CPU and GPU of a given computer are not necessarily comparable in terms of straight-line performance. So these numbers should be taken with a large grain of salt, as approximate indications of what is possible for some algorithms.
With that said, running a test that generated a large array and then sorted it with bitonic sort took 68 seconds on a single thread, 25 seconds on four threads (although this may have been impacted by hyperthreading on the virtual machine, which had access to four dedicated vCPUs), and 0.78 seconds on the GPU (roughly 87 times faster than the single-threaded run). This is not exactly a linear speedup — the GPU I tested on has 6,144 cores — but the shortfall is probably down to the overhead of loading the data into GPU memory and decoding the result once the program finishes. The language definitely shows its potential for inherently parallelizable algorithms.
Vine
Vine is a programming language started by "T6" in June 2024. When I emailed them to ask about Vine, they said they started the project because "there's a lot of potential in interaction nets, and I wanted a language that could realize that." Since then, the project has attracted a handful of other contributors. Unlike Bend, Vine does not currently have code generation for GPUs, but it is still designed around a graph-based representation, known as an interaction net, that can be executed on Vine's own virtual machine. Vine is dual-licensed under the Apache 2.0 and MIT licenses.
Vine focuses less on having a simple syntax, and more on exposing the power of its underlying model to the programmer. For example, it uses borrowed references to let the programmer avoid copies in many cases.
Its most unusual feature is probably a concept called an "inverse type", which can loosely be taken as representing an edge in the graph that "goes the other way". This is a concept more or less unique to graph-based languages, although it's somewhat similar to a promise in JavaScript or a channel in Go. A variable of a normal type represents a value that some other part of the program is responsible for generating; a variable of an inverse type represents a value that this part of the program is responsible for generating and sending out.
For example, a variable of type N32 (a 32-bit unsigned integer) corresponds to an edge that points to some computation that eventually returns an N32. The other end of that edge is represented by the inverse type, ~N32, and must eventually be given an N32. This mechanism lets the programmer write code like this function, which finds the minimum value in a list and then subtracts that value from each element:
fn sub_min(&list: &List[N32]) {
  // A normal variable, to store the smallest item of the list that we've
  // seen so far. We start with the maximum value.
  let min_acc = 0xFFFF_FFFF;
  // An inverted variable, which represents an obligation to eventually
  // provide the actual minimum value in the list.
  let ~min: N32;
  let it = list.iter();
  while it.next() is Some(&val) {
    // As we iterate through the list, we keep track of the smallest value
    // so far in the normal way.
    if val < min_acc {
      min_acc = val;
    }
    // But we simultaneously get the eventual minimum value and subtract
    // it from each element.
    val -= ~min;
  }
  // After going through the whole list, we know what the real minimum was,
  // so we fulfil our obligation to provide it by putting the value into ~min.
  ~min = min_acc;
}
Unlike the same code in a more typical language, this code only needs to traverse the list once. As it does, it fills the list up with partially-evaluated nodes corresponding to the subtraction. Once it reaches the last line (where the obligation to provide a value for ~min is discharged), those nodes have access to both of their arguments, and get evaluated in place, putting the final values into the list.
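A loosely similar effect is possible in a lazy language; the classic Haskell "repmin" trick returns both the rewritten list and the minimum from a single traversal, feeding the final minimum back into the result before it has been computed. The sketch below is only an analogy: Haskell gets there through lazy thunks, while Vine expresses the feedback edge directly in its types.

-- A one-pass "subtract the minimum", in the style of the classic repmin
-- circular program. The lazily-bound minVal plays a role loosely analogous
-- to Vine's ~min: it is referenced while the result is being built, and
-- only becomes known once the whole list has been traversed.
subMin :: [Int] -> [Int]
subMin xs = result
  where
    (result, minVal) = go xs
    go []       = ([], maxBound)
    go (v : vs) =
      let (vs', m) = go vs
      in (v - minVal : vs', min v m)

For example, subMin [5, 2, 9] evaluates to [3, 0, 7].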
While interesting, and more than a little mind-bending, Vine hasn't yet shown whether this feature is really useful in real-world programs. I asked T6 whether they thought that Vine was practical for real programs, to which they said:
Not quite yet. There are a few additional features that I think are useful for writing more complex programs; they've recently been unblocked by my implementation of traits. Some more [language server] features would improve [quality of life] a bit as well.
[...]
Now, I think for it to be used in the real world, there will need to be applications of interaction nets in the real world. I think there will be, but it remains to be seen.
They also thought that eventually implementing graph-based functional programming languages on the GPU (or even on custom hardware designed for the purpose) could be "very fast", but that it would take some additional work to catch up to the decades of work that have gone into optimizing programming language runtimes on existing CPUs. "I'm not trying to compete with that."
Their future plans for Vine at this point involve rewriting Vine's compiler in the language itself, which "will be a good test of the practicality of the language". They also want to investigate ways to statically prevent deadlocks, and to develop a debugger. In general, T6 seems optimistic about the future of graph-based functional programming, and encouraged anyone who was interested in the topic to look into it, since relatively few people are focused on it at the moment.
Conclusions
While neither Vine nor Bend seems likely to become a competitor to major programming languages any time soon, they do point to an exciting area of research for future languages. The hardware may finally be catching up with the lofty 1990s predictions about how to implement efficient functional languages. If computers continue adding more parallelism, there could be a lot of utility in language designs that automatically scale to match the hardware. Unlike some other advances in programming language design, however, these features seem difficult for existing languages to adopt, since doing so would involve completely changing their evaluation model.
Very interesting article!
Posted Mar 6, 2025 2:08 UTC (Thu)
by bredelings (subscriber, #53082)
[Link] (1 responses)
Is there some promise of doing optimal reduction by using interaction nets? I seem to recall that optimal reduction was even better than specializing each function for the data it was applied to, but in practice determining the optimal reduction order had too much overhead. But perhaps using interaction nets changes that...
Very interesting article!
Posted Mar 6, 2025 14:08 UTC (Thu)
by daroc (editor, #160859)
[Link]
(Specifically, programs that are well-typed with elementary affine types can be executed optimally; they don't require the use of Lamping's oracle, which eliminates the brackets and lenses from his original algorithm.)
And some of Bend's benchmarks do demonstrate an asymptotic speedup compared to the same algorithm written in Haskell and executed either lazily or strictly. The challenge now is improving Bend's code generation, libraries, runtime, etc. in order to make things usefully faster in practice, as opposed to just in theory.
Linear speedup
Posted Mar 6, 2025 11:34 UTC (Thu)
by SLi (subscriber, #53131)
[Link] (5 responses)
I think that to measure whether the speedup is linear, you'd need to run it on a single GPU core and compare to many GPU cores. Another interesting metric could be the energy consumption of the calculation on a CPU vs GPU.
I think a 32x speedup on GPU vs CPU suggests almost perfect parallelization. It depends, of course, also on the details of the CPU and the GPU. In my own deep-learning work, I use a rule of thumb of expecting a high-end GPU to beat a very decent CPU by a factor of 40-100x (for training, which is the best case for the GPU).
Linear speedup
Posted Mar 6, 2025 12:32 UTC (Thu)
by excors (subscriber, #95769)
[Link] (3 responses)
As I understand it, a CUDA core (in Nvidia's terminology) is essentially a single 32-bit ALU. A modern Intel CPU core has the equivalent of maybe 76 32-bit ALUs (depending on how you count). A GPU with 6144 CUDA cores therefore has only about 10x more theoretical compute power than an 8-core CPU.
Obviously there are very big differences in how easily you can exploit the full compute power, so comparing ALU count is not very meaningful, but it's more meaningful than comparing "cores".
Linear speedup
Posted Mar 6, 2025 15:51 UTC (Thu)
by Athas (subscriber, #104895)
[Link] (2 responses)
Focusing on arithmetic is a bit misleading, however, as most programs are not bottlenecked by arithmetic, but by memory accesses. GPUs also have far higher memory bandwidth (and latency) than CPUs, although with a lot of caveats about how the access patterns must look in order to fully utilize a GPU memory system. In fact, that is the reason why I am a little skeptical of Bend ever performing well on a GPU. While the execution model is obviously parallel, do we have any reason to suspect that the memory access pattern will have the kind of specific cross-thread spatial locality that is needed to take advantage of GPU memory? I don't see it. Perhaps Bend can run well on a specialised graph-reduction machine, but on today's machines, the main problem is locality, and I don't see how interaction nets address that challenge.
Linear speedup
Posted Mar 6, 2025 21:15 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
I also assume that a "node" is an implementation detail and does not need to really exist at runtime (for reflection and the like), so that the optimizer is allowed to reason (loosely) as follows:
1. As a starting point, a node in this case is something like (subtract, val, ~min), and we construct a temporary array of these nodes (the nodes are not the same size as the input type, so we can't put them in the original array). The last operation of the function is to evaluate all of these nodes and copy the results back into the input array, and we start out with a loop that evaluates each node (possibly in parallel).
2. These nodes do not escape the function, so we do not need to obey any particular ABI for them, and can change their representation in arbitrary ways (as long as we also change the evaluation function to match).
3. Subtract is a constant. Elide it. Nodes are now (val, ~min) and the evaluation operation is now hard-coded to do a subtract.
4. ~min is a single value over the whole array. Move it to a register. Nodes are now (val), and the evaluation operation is now hard-coded to subtract ~min from each node. Furthermore, since nodes are now the same size as N32, we can elide the temporary array and do the evaluation in-place.
5. The tuple wrapper around (val) isn't doing anything useful (and presumably doesn't even change the runtime representation at all). Elide it. Nodes are now val, which is the same type and value as the input array, so we need not do anything to construct nodes - we can just reinterpret the input array as an array of pre-constructed nodes.
6. Notice that the evaluation function is exactly identical to a vectorized subtract. So emit that instead of a loop.
7. Now the function doesn't construct nodes at all, so if there are any further optimization opportunities, we can find them with conventional analysis.
Of course, this all depends on a lot of assumptions. If they are representing every object as a separate heap allocation as in CPython, and List[N32] is an array of pointers (or a cons list), then yeah, it is going to be slow. But the problem with doing that is representing everything as a separate heap allocation (and building lists out of cons, if they are doing that). It's not inherent to the graph theory, it's just a consequence of how you decided to represent your data at runtime, and (to some extent) how much freedom you decided to expose in your APIs (e.g. if you get yourself into the business of garbage collecting arbitrary object graphs, that is going to slow you down at least a little compared to a more restrictive set of pointer/reference rules that is amenable to refcounting and the like).
Linear speedup
Posted Mar 27, 2025 16:27 UTC (Thu)
by Spudd86 (subscriber, #51683)
[Link]
If you can do the analysis to prove something can be done in parallel without locks you can also likely do the work to re-organize the data with the same analysis. Especially in functional languages where people aren't writing update in place logic in the first place.
Linear speedup
Posted Mar 6, 2025 14:04 UTC (Thu)
by daroc (editor, #160859)
[Link]
The basic existing benchmarks are definitely exciting, though, for exactly the reasons you point out: for problems that can be parallelized, it sure looks like Bend does an excellent job of doing that without much additional effort on the programmer's part.
Sorting on GPU
Posted Mar 7, 2025 14:15 UTC (Fri)
by jezuch (subscriber, #52988)
[Link]
Nice. I wasn't keeping up with developments in sorting algorithms so I completely missed how you can make them efficient on GPUs. (By "efficient" I mean "efficiently exploiting the parallelism", because Wikipedia tells me that bitonic sort has a complexity of O(n × log^2 n), which is more complex than the theoretical minimum for serial algorithms.)
Chicken and egg
Posted Mar 8, 2025 17:55 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
To some extent, computers don't work like that because... workloads like Bend and Vine are not popular! Microbenchmarks used by CPU designers are based on popular software. It's even common to say that computers are not designed to run software but benchmarks :-)
Bret Victor's "The Future of Programming" has a good joke about that (among other good ones), and of course there's the seminal "C Is Not a Low-level Language". Jim Keller has been pretty open about hardware design in interviews, but unlike the previous two, you need a lot more time.
GPGPUs, CUDA, etc. had to happen anyway; it was inevitable. Funny that the seed money came from gamers.