LWN: Comments on "DeVault: Announcing the Hare programming language"

DeVault: Announcing the Hare programming language

mpr22 — Thu, 23 Feb 2023 17:56:58 +0000

manual memory management is C's most infamous footgun.

DeVault: Announcing the Hare programming language

tinydev.art — Thu, 23 Feb 2023 14:33:28 +0000

This programming language doesn't seem to be aimed at you.

Hare seems to be C without all the footguns, with more consistent syntax. It solves the gripes modern C programmers have with C, not the gripes python programmers have with C.

DeVault: Announcing the Hare programming language

federico3 — Tue, 31 May 2022 00:05:55 +0000

Why Zig when you can use Nim?

DeVault: Announcing the Hare programming language

wtarreau — Sun, 15 May 2022 05:37:27 +0000

Never heard about V, thanks for the tip! There is some very interesting stuff there. Just tried it, there are some issues (invalid memory accesses and crashes when using wrong argument type to some functions without even a warning at build time, memory leaks like crazy triggering the OOM killer in a few seconds, and being 20 times slower than C on a the Vweb server test), but it looks well balanced, very likely suitable for scripting, and I like the fact that it translates to/from C. It looks quite young and we can hope that some of the problems above (invalid memory accesses an OOM) will get solved soon. It definitely deserves being watched!

DeVault: Announcing the Hare programming language

flussence — Fri, 13 May 2022 23:56:25 +0000

Alright, I'll give you the rest of that, but I don't think copying data between processes has been a hard problem for something like a decade now? (or else Linux would be absolutely awful for s/openssl/opengl/)

DeVault: Announcing the Hare programming language

tialaramex — Wed, 11 May 2022 16:09:06 +0000

> Where provenance makes sense is if the language you have has some higher level concept of an object which has some properties described in the program and known to the compiler.

I shall quote K&R themselves on this subject in their second edition:

"An object, sometimes called a variable, is a location in storage, and its interpretation depends on two main attributes ..."

> This ends up bringing with it rules such as you can not just arbitrarily turn one object into another and back again (i.e. no pointer casts), you can not arbitrarily split one object into two (i.e. no pointer arithmetic)

Nope, a language can (and some do, notably Rust of course but also this is possible with care in C and C++) provide pointer casts and pointer arithmetic. Provenance works this just fine for these operations.

Rust's Vec::split_with_spare_mut() isn't even unsafe. Most practical uses for this feature are unsafe, but the call itself is fine, it merely gives you back your Vec<T> (which will now need to re-allocate if grown because any spare space at the end of it is gone) and that space which was not used yet as a mutable slice of MaybeUninit<T> to do with as you see fit.

> and you can not arbitrarily manufacture pointers out of random data.

But here's where your problem arises. Here provenance is conjured from nowhere. It's impossible magic.

> Unfortunately C is not such a language and by forcing provenance rules on it, one is in essence trying to retrofit some kind of object model to it without any of the expressiveness and enforced rules that are needed for the programmer to not make programmes that are obviously wrong under the assumptions (i.e. provenance)

As we saw C is in fact such a language after all. The fact that many of its staunchest proponents don't seem to understand it very well is a problem for them and for C.

DeVault: Announcing the Hare programming language

tialaramex — Tue, 10 May 2022 23:48:50 +0000

Rust doesn't have anything similar to ISO/IEC 14882:2020, a large written document which is the product of a lot of work but is of limited practical value since it doesn't describe anything that actually exists today.

However, Rust does extensively document what is promised (and what is not) about the Rust language and its standard library, and especially the safe subset which Rust programmers should (and most do) spend the majority of their time working with.

For example, all that ISO document has to say about what happens if I've got two byte-sized signed integers which may happen to have the value 100 in them and I add them together is that this is "Undefined Behaviour" and offers no suggestions as to what to do about that besides try to ensure it never happens. In Rust the "no specification" tells us that this will panic in debug mode, but, if it doesn't panic (because I'm not in debug mode and I didn't enable this behaviour in release builds) it will wrap, to -56. I don't know about you, but I feel like "Absolutely anything might happen" is less specific than "The answer is exactly -56".

Rust also provides plenty of alternatives, including checked_add() unchecked_add() wrapping_add() saturating_add() and overflowing_add() depending on what you actually mean to happen for overflow, as well as the type wrappers Saturating and Wrapping which are useful here (e.g. Saturating<i16> is probably the correct type for a 16-bit signed integer used to represent CD-style PCM audio samples)

DeVault: Announcing the Hare programming language

nybble41 — Tue, 10 May 2022 23:48:27 +0000

The access to volatile memory at pte->v_addr through the non-volatile pointer ptr1 is UB because according to [C89]:

> If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.[57]

> [57] This applies to those objects that behave as if they were defined with qualified types, even if they are never actually defined as objects in the program (such as an object at a memory-mapped input/output address).

Objects in the page at pte->v_addr behave as if they were defined as volatile objects because the content changes in ways not described by the C abstract machine when pte->p_addr is updated. The same applies to passing a pointer to volatile object(s) to memset(), which takes a non-volatile pointer.

The initializer for page_location (pte->v_addr) is also not a constant, but I assume this is just pseudo-code for the value being set by some initialization function not shown here.

[C89] http://port70.net/~nsz/c/c89/c89-draft.html

Last comment, I promise.

cbushey — Tue, 10 May 2022 20:41:50 +0000

oh, and I have absolutely no clue how polyglot programming works using GralVM. At least the last two comments are pretty short.

And almost a week late!

cbushey — Tue, 10 May 2022 20:32:04 +0000

This is so not going to be read by anyone. Hey, I'm like that backup language!

I think I figured it out!

cbushey — Tue, 10 May 2022 20:29:06 +0000

>with a relatively unimportant and untested fallback in some other language for slower systems. (Almost no applications fall into this category.)

That sounds like a perfect problem for a programming language to solve. Not some newfangled fresh language where you need to solve a lot of problems and people need to learn new semantics, tooling, and libraries of course! No, it should be lua or javascript (or is that typescript or ecmascript), hmmm. Oh, I've got it! Use GralVM. Then you can aot compile all the code that can run on a jvm. Oh, plus it's designed to be a polyglot compiler so you should be able to mix and match languages you're using with it's help. Maybe even one language per module. Oh, and you've got that Truffle Language Implementation Framework with some more polyglot programing. Now you have a much better solution than making your own language. You can use everybody else's well tested and supported programming languages. Well, you end up learning half a dozen or so unfamiliar syntax but that is clearly better then figuring out something simpler in a single language that can easily solve the problem for a relatively unimportant and untested fallback for slower systems. Sorry. I just wanted to ramble some on lwn and this page has so many comments that my comment is guaranteed to get lost in the noise. I hope you have a nice day.

DeVault: Announcing the Hare programming language

khim — Tue, 10 May 2022 20:10:40 +0000

> Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

Rust hasn't needed any specs because till very recently there was just one compiler. Today we have 1.5: LLVM-based rustc and GCC-based rustc. One more is in development, thus I assume formal specs would be higher on list of priorities now.

This being said IMNSHO it's better to not have specs rather than have ones which are silently violated by actual compilers. At least when there are no specs you know that discussions between compiler developers and language users have to happen, when you have one which is ignored…

This looks like a good time to stop

corbet — Tue, 10 May 2022 20:05:17 +0000

When we start seeing this type of language, and people accusing each other of lying, it's a strong signal that a thread may have outlived its useful lifetime. This article now has nearly 350 comments on it, so I may not be the only one who feels like that outliving happened a little while ago.

I would like to humbly suggest that we let this topic rest at this point.

Thank you.

DeVault: Announcing the Hare programming language

khim — Tue, 10 May 2022 19:57:57 +0000

> And the problem is that formally specified C - including K&R C, and C89 - left a huge amount unspecified; users of C assumed that the behaviour of their implementation of the language was C behaviour, and not merely the way their implementation behaved.

True, but irrelevant. The most important part that we discussing here was specified in both: pointers are addresses, if two pointers are equal they can be used interchangeably.

> But then we got SSA, the polytope model, and other tools that allowed compilers to do significant optimizations beyond peephole optimizations on the source code, their IR, and the object code.

Yes. And there was an attempt to inject ideas that make these useful into C89. Failed one. The committee has created an unreal language that no one can or will actually use. It was ripped out (and replaced with crazy aliasing rules, but that's another story).

> And now we have a mess - the provenance model, for example, is compiler authors trying to find a rigorous way to model what users "expect" from pointers, not just what C89 permits users to assume

Can you, please stop lying? Provenance models are trying to justify deliberate sabotage where fully-standard compliant programs are broken. It's not my words, the provenance proposal itself laments:

These GCC and ICC outcomes would not be correct with respect to a concrete semantics, and so to make the existing compiler behaviour sound it is necessary for this program to be deemed to have undefined behaviour.

To make the existing compiler behavior sound, my ass. The whole story of provenance started with sabotage: after failed attempt to bring provanance-like properties to C89 saboteurs returned in C99 and, finally, succeeded in adding some (and thus rendered some C89 programs invalid in the process), but that wasn't enough: they got the infamous DR260 resolution which was phrased like that: After much discussion, the UK C Panel came to a number of conclusions as to what it would be desirable for the Standard to mean.

Note: the resolution hasn't changed the standard. It hasn't allowed saboteurs to break more programs. No. It was merely a suggestion to develop adjustments to the standards — and listed three cases where such adjustments were supposed to cause certain outcomes.

Nothing like that happened. For more than two decades compilers invented more and more clever ways to screw the developers and used that resolution as a fig leaf.

And then… Rust happened. Since Rust guys are pretty concerned about program correctness (and LLVM sometimes miscomplied IR-programs they perceived correct) they went to C++ guys and asked “hey, what are the actual rules we have to follow”? And the answer was… “here is the defect report, we use it to screw the developers and miscompile their valid programs”. Rust developers weren't amused.

And that is when the lie was, finally, exposed.

So, please, don't liken problems with pointer provenance to problems with C11 memory model.

Indeed, C89 or C99 doesn't allow one to write valid multi-threaded programs. Everything is defined strictly for single-threaded program. To support programs where two threads of execution can touch objects “simultaneously” you need to extend the language somehow.

Provenance excuse is used to break completely valid C and C++ programs. It's not about extending the language, it's about narrowing it. Certain formerly valid programs have to be deemed to have undefined behaviour..And after more than two decades we don't even have the rules which we are supposed to follow finalized!

And they express it in a form of if you believe pointers are mere addresses, you are not writing C++; you are writing the C of K&R, which is a dead language. IOW: they know they sabotaged C developers —and they are proud of it.

> Nothing else is promised to you - there's no such thing as a "compiler barrier" in C89, for example.

Yes. And to express many things which would be needed to, e.g., write an OS kernel in C89, you need to extend the language in some way. This is deliberate: the idea was to make sure strictly-conforming C89 programs run everywhere, but conforming programs may require certain language extensions. Not ideal, but works.

Saboteurs managed to screw it all completely and categorically refuse to fix what they have done.

DeVault: Announcing the Hare programming language

Vipketsh — Tue, 10 May 2022 19:22:56 +0000

I still don't see it. You have that memset here:

> void *get_zeroed_page()
> [...]
> memset(pte->v_addr, 0, PAGE_SIZE);

If you don't have to assume that this writes over data pointed to by some other pointer it means that your aliasing rules say that no two pointers alias. Or put another way, for all practical purposes, having two pointers pointing to the same thing is unworkable. By some reading of C89 that may be the conclusion, but quite clearly that was never the intent and exactly no one expects things to work that way (including compiler writers, oddly enough).

> compiler does not have to assume that *ptr1 has changed within the C abstract machine

You mean across a function call ? That quite simply means that exactly no data could ever be shared by any two functions (in different compile units). Again, this would make the language completely unworkable and be counter what anyone expects.

> [...] you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

No. The language is defined, first and foremost, by what existing programs expect. If the standard allows interpretations and compilers to do things counter to what a majority of these programs expect, it is the standard that is broken and not the majority of all programs. I firmly believe that the job of a standard is to document existing behaviour and not to be a tool to change all programmes out there.

p.s.: I find it fascinating that instead of arguing about actual behaviour the C standard keeps coming up as if it where a bible handed down by some higher power and everything in it is completely infallible. Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

DeVault: Announcing the Hare programming language

khim — Tue, 10 May 2022 19:01:08 +0000

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Which is the only way to have PTEs in C code.

> And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change.

If you have such an address then you have to extend the language somehow. Or, alternatively, don't touch it.

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Of course. Because it's impossible to write C89 program which changes the PTEs, such a concept just couldn't exist in it. You have to extend the language to cover that usecase.

> If compilers really did implement C89 to the letter of the specification

…then such compilers would have been as popular as ISO 7185. Means: no one would have cared about these and no one would have mentioned their existence.

> If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible

Yes. But some programs would still be possible. Programs which do tricks with pointers would work just fine, programs which touch PTEs wouldn't.

> provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89)

Citation needed. Badly. Because, I would repeat once more, in C89 (not in C99 and newer) the notion of “pointers which have the same bitpattern yet different” doesn't exist. If you add a few bits to the pointer converted to integer and then clear these same bits you would get the exact same pointer — guaranteed. The fact that these bits are lowest bits of converted integer is implementation-specific thing, you can imagine a case where these would live as top bits, e.g. So yet, that requires some implementation-specific extension. But pretty mild and limited.

Yes, provenance is an attempt to deal with the idea of C99+ that some pointers may be equal to others yet, somehow, still different — but that's not allowed in C89. If two pointers are equal then they are either both null pointers, or both point to the same object, end of story.

Sure, it makes some optimizations hard and/or impossible. So what? This just means that you cannot do such optimizations in C89 mode. Offer -fno-provenance option, enable it for -std=c89, done.

DeVault: Announcing the Hare programming language

farnz — Tue, 10 May 2022 18:36:32 +0000

If I promise that the function calls are just naming what code does, but it's real behaviour is poking global volatile pointers, and those functions are implemented in pure C89, there's no difference in behaviour. Given the following C definitions of get_zeroed_page, get_ptr_to_handle and release_page, you still have the non-determinism, albeit I've introduced a platform dependency:


const size_t PAGE_SIZE;
struct pt_entry {
    volatile char *v_addr;
    volatile char *p_addr;
}
volatile struct pt_entry *pte; // Initialized by outside code, with suitable guarantees on v_addr for the compiler implementation and on p_addr
int *page_location = pte->v_addr;

void *get_zeroed_page() {
   pte->p_addr += PAGE_SIZE;
   memset(pte->v_addr, 0, PAGE_SIZE);
   return pte->v_addr;
}

void release_page(void *handle) {
  assert(handle == pte->v_addr);
  pte->p_addr -= PAGE_SIZE;
}

void *get_ptr_to_handle(void* handle) {
  assert(handle == pte->v_addr);
  return page_location;
}

This has semantics on the real machine, because of the use of volatile - the writes to *pte are guaranteed to occur in program order. But the compiler does not have any way to know that volatile int *v_addr ends up with the same value between two separate calls to get_ptr_to_handle but points to different memory.

Also, I'd note that C89 does not have language asserting what you claim - it actually says quite the opposite, that the compiler does not have to assume that *ptr1 has changed within the C abstract machine, since ptr1 is not volatile. It's just that early implementations made that assumption because to do otherwise would require them to analyse not just the function at hand, but also other functions, to determine if *ptr1 could change. Like khim, you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

DeVault: Announcing the Hare programming language

farnz — Tue, 10 May 2022 18:10:40 +0000

fread and fwrite are poor examples, because they are C code defined in terms of the state change they make to the abstract machine, and with a QoI requirement that the same state change happens to the real machine. Indeed, everything that's defined in C89 has its impact on the abstract machine fully defined by the spec; the only get-out is that volatile marks something where all reads and writes through it must be visible in the real machine in program order.

But note that this is a very minimal promise; the only thing happening in the real machine that I can reason about in C89 is the program order of accesses to volatiles. Nothing else that happens in the abstract machine is guaranteed to be visible outside it - everything else is left to the implementation's discretion.

And no, the state change is not visible inside the C89 abstract machine; if I write through a volatile pointer to a PTE, the implementation must ensure that my write happens in the real machine as well as the abstract machine, but it does not have to assume that anything has changed in the abstract machine. That, in turn, means that it may not know that ptr1 now has changed in the "real" machine, because it's not volatile and thus changes in the real machine are irrelevant.

And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change. Now, depending on the architecture, that almost certainly is not enough to guarantee an instant change - e.g. on x86, the processor can use old or new value of the PTE until the TLB is flushed, and I can force a TLB flush with invlpg to get deterministic behaviour - but I can bring the program into a non-deterministic state without calling into an assembly block or running a system routine, as long as I have the address of a PTE.

And there's no "list of system routines" in C89; the behaviour of fread, fwrite and other such functions is fully defined in the abstract machine by the spec, with a QoI requirement to have their behaviour be reflected in the "real" machine. By invoking the idea of a "list of system routines", you're extending the language beyond C89.

You're making the same mistake a lot of people make, of assuming that the behaviour of compilers in the early 1990s and earlier reflected the specification at the time, and wasn't just a consequence of limited compiler technology. If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible; provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89) while still allowing the compiler to reason about the meaning of your code in a way compatible with the C89 specification.

DeVault: Announcing the Hare programming language

farnz — Tue, 10 May 2022 18:03:23 +0000

Early C did not have a formal specification - what the one and only implementation did was what the language did.

And the problem is that formally specified C - including K&R C, and C89 - left a huge amount unspecified; users of C assumed that the behaviour of their implementation of the language was C behaviour, and not merely the way their implementation behaved.

Up until the early 1990s, this wasn't a big deal. The tools needed for compilers to do much more than peephole optimizations simply didn't exist in usable form; remember that SSA, the first of the tools needed to start reasoning about blocks or even entire programs doesn't appear in the literature until 1988. As a result, most implementations happened, more by luck than judgement, to behave in similar ways where the specification was silent.

But then we got SSA, the polytope model, and other tools that allowed compilers to do significant optimizations beyond peephole optimizations on the source code, their IR, and the object code. And now we have a mess - the provenance model, for example, is compiler authors trying to find a rigorous way to model what users "expect" from pointers, not just what C89 permits users to assume, while C11's concurrent memory model is an effort to come up with a rigorous model for what users can expect when multiple threads of execution alter the state of the abstract machine.

Remember that all you are guaranteed about your code in C89 is that the code behaves as-if it made certain changes to the abstract machine for each specified operation (standard library as well as language), and that all volatile accesses are visible in the real machine in program order. Nothing else is promised to you - there's no such thing as a "compiler barrier" in C89, for example.

DeVault: Announcing the Hare programming language

Vipketsh — Tue, 10 May 2022 17:26:01 +0000

You are playing a nice slight-of-hand here. Your code clearly shows function calls, and without knowing what's in those called functions, one has no choice but to assume that *ptr1 has changed (within the C abstract machine), and thus you can never get to the output of "-1 == 0". If, on the other hand, you do see the internals of those functions you will see the modification of some memory that may alias with ptr1 and then because of that you can not get the "-1 == 0" output.

The only way I can see your reasoning working is if you are somehow allowed to assume that function calls are an elaborate way of saying "nop".

DeVault: Announcing the Hare programming language

Vipketsh — Tue, 10 May 2022 17:16:08 +0000

Where provenance makes sense is if the language you have has some higher level concept of an object which has some properties described in the program and known to the compiler. Most importantly object lifetime is known. This ends up bringing with it rules such as you can not just arbitrarily turn one object into another and back again (i.e. no pointer casts), you can not arbitrarily split one object into two (i.e. no pointer arithmetic) and you can not arbitrarily manufacture pointers out of random data. Unfortunately C is not such a language and by forcing provenance rules on it, one is in essence trying to retrofit some kind of object model to it without any of the expressiveness and enforced rules that are needed for the programmer to not make programmes that are obviously wrong under the assumptions (i.e. provenance). Worse yet the rules, or better said heuristics*, that standards and compilers chose to signal lifetime are counter to what existing programmes expect and exploit.

Re: your mathematics analogy. I think you have taken the wrong view point there: pretty much all of mainstream mathematics is concerned with either extending existing and useful theory (e.g. how rational numbers where extended to create irrational numbers) or to put existing theory on a more sound footing (e.g. Hilbert's axioms), possibly closing off various paradoxes. Realise how in pretty much all of the evolution of mathematics a very strong emphasis was placed on any new theory being backwards compatible -- no-one, taken seriously, wanted to ever end up with 1+1=3 but instead worked to solidify the intuition that 1+1=2. I think that if one wanted to paint mathematical evolution onto C, the definitions underpinning C would need to be changed in a way that (i) they are backwards compatible with existing programs and (ii) that loopholes exploited by compiler writers be closed instead of officially sanctioned. Right now, it's the opposite: people are trying to convince C programmers that the intuition they had all along was always false and reality is actually completely different.

*: Possibly the one with the most problems is the idea that realloc() will, in the absence of failure, always (i) destroy the object passed to it, and (ii) will allocate a completely new one. This is counter to the intuition of many a programmer and there is no enforced rule in C that prevents programmers from carrying pointers over the realloc() call, which would make the idea actually work. The reason people are annoyed is that such code exists, is used and has worked for a long time and there is no evidence that this idea has much, if any, benefit on the compiled program.

DeVault: Announcing the Hare programming language

khim — Tue, 10 May 2022 17:00:27 +0000

> Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile.

No. Calling fread and fwrite can certainly change things, too.

> But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

Yes and no. Change is dramatic, sure. But it's most definitely visible to C code.

By necessity such things have to either be implemented with volatile or by calling system routine (which must be added to the list of functions like fread and fwrite as system extension, or else you couldn't use them them).

Place where you pass your pointer to the invlpg would be place where compiler would know that object may suddenly change value.

> In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

In practice people who are doing these things have to use volatile at some point in the kernel, or else it just wouldn't work. Thus I don't see what you are trying to prove.

The fact that real OSes have to expand list of “special” functions which may do crazy things? It's obvious. In practice your functions are called mmap and munmap and they should be treated by compiler similarly to read and write: compiler either have to know what they are doing or it should assume they may touch and change any object they can legitimately refer given their arguments.

> I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

No. You couldn't do things like change to PTEs in a fully portable C89 program. It's pointless to talk about such programs since they don't exist.

The only way to do it is via asm and/or call to system routine which, by necessity, needs extensions to C89 standard to be usable. In both cases everything is fully defined.

DeVault: Announcing the Hare programming language

farnz — Tue, 10 May 2022 16:39:10 +0000

When you say "if pointers are equal, then they are completely equivalent", are you talking at a single point in time, or over the course of execution of the program?

Given, for example, the following program, it is a fault to assume that ptr1 and ptr2 are equivalent throughout the runtime, because ptr1 is invalidated by the call to release_page:


handle page_handle = get_zeroed_page();
int test;
int *ptr2;
int *ptr1 = get_ptr_to_handle(page_handle);
*ptr1 = -1; // legitimate - ptr1 points to a page, which is larger than an int in this case and correctly aligned etc.
test = *ptr1; // makes test -1
release_page(page_handle);
page_handle = get_zeroed_page();
ptr2 = get_ptr_to_handle(page_handle); // ptr2 could have the same numeric value as ptr1.
if (ptr2 == ptr1 && *ptr1 == test) {
    puts("-1 == 0");
} else {
    puts("-1 != 0");
}
release_page(page_handle);

This is the sort of code that you need to be clear about; C89's language leaves it unclear whether it's legitimate to assume that *ptr1 == test, even though the only assignments in the program are to *ptr1 (setting it to -1) and test. The thing that hurts here is that even if, in bitwise terms including hidden CHERI bits etc, ptr1 == ptr2, it's possible for the underlying machine to change state over time, and any definition of "completely equivalent" has to take that into account.

One way to handle that is to say that even though the volatile keyword does not appear anywhere in that code snippet, you give dereferencing a pointer volatile-like semantics (basically asserting that locations pointed to can change outside the changes done by the C program), and say that each time it's dereferenced, it could be referring to a new location in physical memory. In that case, this program cannot print "-1 == 0", because it has to dereference ptr1 to determine that.

Another is to follow the letter of the C89 spec, which says that the only things that can change in the abstract machine's view of the world other than via a C statement are things marked volatile. In that case, this program is allowed to print "-1 == 0" or "-1 != 0" depending on whether ptr1 == ptr2, because the implementation "knows" that it is the only thing that can assign a value to *ptr1, and thus it "knows" that because no-one has assigned through *ptr1 since it read the value to get test it is tautologically true that *ptr1 == test.

Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile. But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

And that's the fundamental issue with rewinding to C89 rules - C89 implies very strongly that the only interface between things running on the "abstract C89 machine" and the real hardware are things that are marked as volatile in the C abstract machine. In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

Note, too, that I wasn't talking about optimization in either case - I'm simply looking at the semantics of the C abstract machine as defined in C89, and noting that they're not powerful enough to model a change that affects the abstract machine but happens outside it. I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

DeVault: Announcing the Hare programming language

khim — Tue, 10 May 2022 16:10:42 +0000

> Your resulting C compiler is not the GCC I grew up with (well, OK, the one that teenage me knew), or that with some minor optimisation passes disabled, it's an altogether different animal, perhaps closer to Python.

But that's is the language which Kernighan and Ritchie designed and used to write Unix. Their goal was not to create some crazy portable dream, they just wanted to keep supporting both 18-bit PDP-7 and 16-bit PDP-11 from the same codebase by rewriting some parts of code written in PDP-7 assembler in the higher-level language. They have been using B which had no types at all and improved it. By adding character types, then structs, arrays, pointers (yes, B conflated pointers and integers, it only had one type).

Yet malloc was not special, free was not special and not even all Unix programs used them (just look obn the source of original Bourne Shell some days).

> I don't believe there is or was an audience for this compiler.

How about “all the C users for the first decade of it's existence”? Initially C was used exclusively in Unix, but in 1980th it became used more widely. Yet I don't think any compilers of that era support anything even remotely resembling “pointer provenance”.

That craziness started after a failed attempt of C standard committee to redo the language. They then went back and replaced that with a simpler aliasing rules which prevented type puning, but even these weren't abused by compilers till XXI century.

> People weren't writing C because of the expressive syntax, the unsurpassed quality of the tooling or the comprehensive "batteries included" standard library, it didn't have any of those things - they were doing it because C compilers produce reasonable machine code, and this alternative interpretation of C89 doesn't do that any more.

Can you, please, stop rewriting history? C was quite popular way before ANSI C arrived and tried (but failed!) to introduce crazy aliasing rules. Yes, C compilers were restricted and couldn't do all the amazing optimizations… but C developers can do these, instead! When John Carmack was adopting his infamous 0x5f3759df-based trick he certainly haven't cared to think about the fact that there are some aliasing rules which may render code invalid and that was true for the majority of users who grew in an era before GCC started breaking good K&R programs.

> This is because under your preferred C89 "no provenance" model there isn't any provenance, CHERI isn't a fairytale spell it's just engineering.

It's engineering, yes, but you can add provenance to C89. It just has to be consistent. You can even model it with “poor man's CHERI” aka MS-DOS Large model by playing tricks with segment and offset. E.g. realloc could turn 0x0:0x1234 pointer into 0x1:0x1224 pointer if it decided not to move an object.

This way all the fast-path code would be negated and you would never have the situation where bitwise-identical pointers point to different objects. This may not be super-efficient but it is compatible with C89. Remember? Bitwise-different pointers can point to the same object, but the opposite is forbidden! Easy, no?

All these games where certain pointers can be compared but not others and its programmer's responsibility to remember all these unwritten rules… I don't know how that language can be used to development, sorry.

The advice I have gotten from our toolchain-support team is to ask clang developer about low-level constructs which I may wish to create!

So much for “standard is a treaty” talks…

DeVault: Announcing the Hare programming language

tialaramex — Tue, 10 May 2022 15:04:40 +0000

I don't have a Russell's Paradox equivalent under your constraints, but I think it's worth highlighting just how severe those constraints are.

Your resulting C compiler is not the GCC I grew up with (well, OK, the one that teenage me knew), or that with some minor optimisation passes disabled, it's an altogether different animal, perhaps closer to Python. In this language, pointers are all just indexes into an array containing all of memory - including the text segment and the stack, and so you can do some amazing acrobatics as a programmer, but your optimiser is badly hamstrung. C's already poor type checking is further reduced in power in the process, which again makes the comparison to Python seem appropriate.

I don't believe there is or was an audience for this compiler. People weren't writing C because of the expressive syntax, the unsurpassed quality of the tooling or the comprehensive "batteries included" standard library, it didn't have any of those things - they were doing it because C compilers produce reasonable machine code, and this alternative interpretation of C89 doesn't do that any more.

> (except something like CHERI where provenance actually exist at runtime and is 100% consistent)

You can only do this at all under CHERI via one of two equally useless routes:

1. The "Python" approach I describe where you declare that all "pointers" inherit a provenance with 100% system visibility, this obviously doesn't catch any bugs, and you might as well switch off CHERI, bringing us to...

2. The compatibility mode. As I understand it Morello provides a switch so you can say that now we don't enforce CHERI rules, the hidden "valid" bit is ignored and it behaves like a conventional CPU. Again you don't catch any bugs.

This is because under your preferred C89 "no provenance" model there isn't any provenance, CHERI isn't a fairytale spell it's just engineering.

DeVault: Announcing the Hare programming language

renox — Tue, 10 May 2022 13:55:05 +0000

> Why, oh why, do we need yet another programming language?
> We don't.

We do, because
1) unfortunately most languages have very serious design issues (C integer promotions rules for example)
2) languages are often never "changed", new features are added but design issues are left unchanged..

Unfortunately there are many C "would be replacer": Odin, Zig, V, Hare now (quite a few other in the Pascal family), this will spread thin contributors and delay the adoption of a "popular" C's replacement language.

DeVault: Announcing the Hare programming language

renox — Tue, 10 May 2022 13:44:52 +0000

Could you explain why you think Zig doesn't follow its initial goals ?

DeVault: Announcing the Hare programming language

khim — Tue, 10 May 2022 02:18:30 +0000

> C89 assumes that pointers to things must be different, which sounds reasonable but does not formally explain how this works.

No. It's the opposite. C89 doesn't assume that pointers to things must be different. But yes, it asserts that if pointers are equal then they are completely equivalent — you can compare any two pointers and go from there.

Note that it doesn't work in the other direction: it's perfectly legal to have pointers which are different yet point to the same object. That's easily observable in MS-DOS's large memory model.

> I do not believe that C89 is fine, and thus that we should or could just implement C89 as if provenance isn't a thing and be happy.

Show me a Russel's paradox, please. Not in a form “optimizer could not do X or Y, which is nice thing to be able to do and thus we must punish the software developer who assumes pointers are just memory addresses”, but “this valid C89 program can produce one of two outputs depending or how we read the rules and thus it's impossible to define how it should work”.

Then and only then you would have a point.

> That's my point here.

I think you are skipping one important step there. Yes, there was an attempt to add rules about how certain pointers have to point to different objects. But it failed spectacularly. It was never adopted and final text of C89 standard don't even mention it. In C89 pointers are just addresses, no crazy talks about pointer validity, object creation and disappearance and so on.

There are some inaccuracies from that failed attempt: C89 defines only two lifetimes: static and automatic… yet somehow the value of a pointer that refers to freed space is indeterminate. Yet if you just declare that pointers are addresses then it should be possible to fix these inaccuracies without much loss.

Where non-trivial lifetimes first appeared is C99, not C89. There yes, it has become impossible to read from "union member other than the last one stored into", there limitation on whether you can compare two pointers or not were added (previously it was possible to compare two arbitrary pointers and if they happen to be valid and equal, they would be equivalent), etc.

But I don't see why C89 memory model would be, somehow, unsound. Hard to optimize? Probably. Wasterful? Maybe so. But I don't see where it's inconsistent. Show me. Please.

Yes, it absolutely rejects pointer provenance in any shape or form (except something like CHERI where provenance actually exist at runtime and is 100% consistent). Yes, it may limit optimization opportunities. But where's the Russel's paradox, hmm?

DeVault: Announcing the Hare programming language

tialaramex — Mon, 09 May 2022 22:39:07 +0000

Maybe I didn't express myself very well.

As I understand it, the problem with assumptions in naive set theories and C89 (and various other things) is that because it's an assumption rather than an axiom you don't spot where the problems are. You never write it down at all and so have no opportunity to notice that's it's too vague whereas when you're writing an axiom you can see what you're doing.

The naive theories let Russell's paradox sneak in by creating this poorly defined set, but the axioms in Zermelo's theory oblige you to define a set more carefully to have a set at all, and in that process the paradox gets defused. In particular ZFC has a "specification" axiom which says in essence OK, so, tell me how to make this "set" using another set and first order logic. The sets naive set theories were created for can all be constructed this way no problem, but weird sets with paradoxical properties cannot.

C89 assumes that pointers to things must be different, which sounds reasonable but does not formally explain how this works. I believe that it's necessary (in order for a language like C89 to avoid its equivalent of paradoxes, the programs which abuse language semantics to do something hostile to optimisation) to define such rules, and they're going to look like provenance.

I do not believe that C89 is fine, and thus that we should or could just implement C89 as if provenance isn't a thing and be happy. That's my point here. C89 wasn't written in defiance of a known reality, but in ignorance of an unknown one, like the Ptolemaic system. Geocentrists today are different from Ptolemy, but not because Ptolemy was right.

DeVault: Announcing the Hare programming language

Wol — Mon, 09 May 2022 21:55:35 +0000

> Your naive theory _assumed_ things which Zermelo made into axioms.

Whoops. I know a lot of people think my view of maths is naive, but that's not what I understand an axiom to be. An axiom is something which is *assumed* to be true, because it can't be proven. Zermelo would have provided a proof, which would have changed your naive theory from axiom to proven false.

This is Euclid's axiom that parallel lines never meet. That axiom *defines* the special case of 3D geometry. but because in the general case it's false, it's not an axiom of geometry.

Cheers,
Wol

DeVault: Announcing the Hare programming language

tialaramex — Mon, 09 May 2022 18:58:34 +0000

> Naturally they also refuse to enable such options when you compile programs with -std=c89 (variant of the C standard which does not include any provisions for provenance whatsoever).

Suppose it is the year 1889, the twentieth century seems bright ahead, and you have learned everything there is to know (or so you think) about Set Theory. (This is called Naive Set Theory). It seems to you that this works just fine.

Fast forward a few years, and this chap named Ernst Zermelo says your theory is incomplete and that's why it suffers from some famous paradoxes. His new axiomatised Set Theory requires several exciting new axioms, including the infamous Axiom Of Choice and with these axioms Zermelo vanquishes the paradoxes.

Now, is Ernst wrong? Was your naive theory actually fine? Shouldn't you be allowed to go back to using that simpler, "better" set theory and ignore Ernst's stupid paradoxes and his obviously nonsensical Axiom of Choice ? No. Ernst was right, your theory was unsound, _and it was already unsound in 1889_, you just didn't know it yet. Your naive theory _assumed_ things which Zermelo made into axioms.

Likewise, the C89 compilers you have nostalgia for were actually unsound, and it would have been possible (or if you resurrect them, is possible today on those compilers) to produce nonsensical results because in fact pointer provenance may not have been mentioned in the C89 standard but it was relied upon by compilers anyway. It was silently assumed, and had been for many years.

The excellent index for K&R's Second Edition of "The C Programming Language" covering C89, doesn't even have an entry for the words "alias" or "provenance". Because there are _assumptions_ about these things baked in to the language, but they haven't been surfaced.

The higher level programming languages get to have lots of optimisations here because assumptions like "pointer provenance" are necessarily true in a language that only has references anyway. To keep those same optimisations (as otherwise they'd be slower!) C and C++ must make the assumptions too, and yet to deliver on their "low level" promise they cannot. Squaring this circle is difficult which is why the committees punt rather than do it over all these years.

I happen to think Rust (more by luck than judgement so far as I can see) got this right. If you define most of the program in a higher level ("safe" in Rust terms) language, you definitely can have all those optimisations and then compartmentalize the scary assumption-violating low level stuff. This is what Aria's Tower of Weakenings is about too. Aria proposes that even most of unsafe Rust can safely keep these assumptions, something like Haphazard (the Hazard Pointer implementation) doesn't do anything that risks provenance confusion and so it's safe to optimize with those assumptions and more stuff like that can be safely labelled safe, until only the type of code that really mints "pointers" from integers out of nowhere cannot justify the assumptions and accordingly cannot be optimised successfully.

It's OK if five machine instructions can't be optimised, probably even if they're in your tight inner loop, certainly it's better than accidentally optimising them to four *wrong* instructions. What's a problem for C and C++ is that the provenance problem is infectious and might spread from that inner loop to the entire program and then you're back to slower than Python.

DeVault: Announcing the Hare programming language

farnz — Mon, 09 May 2022 08:35:30 +0000

Or, going back further; by 1960 we had FORTRAN, LISP, ALGOL and COBOL. Why did we bother with C, FORTH, Smalltalk, BASIC, Prolog, SQL, ML, Pascal, Logo, Common Lisp, Ada, Objective-C, Haskell, Python, R, Ruby, Java, Delphi, PHP, JavaScript, C#, the ones you named, and more, when we had "enough" programming languages?

DeVault: Announcing the Hare programming language

rsidd — Mon, 09 May 2022 05:29:12 +0000

If people in 2002 had decided that we don't need new languages, we wouldn't have had Go, or Rust, or Scala, or Kotlin, or (my favourite for scientific programming) Julia. What has changed in 2022 that we don't need new languages any more? Or are you saying we didn't need the above languages either?

DeVault: Announcing the Hare programming language

farnz — Sun, 08 May 2022 16:02:32 +0000

I think you're taking this the wrong way - the point is that C programmers value the fact that C someone wrote 30+ years ago is still useful C far more than they value any new language feature that's not in C2x (or often, C99 or C90).

As a consequence of this, the bar for "new language that will attract people away from C" is very, very high. The natural instinct of the remaining people who prefer C when faced with a new language is not "oh, that feature is awesome and I will switch language to get it", but "will that language still be usable in N decades time? Can I implement the feature usably in a C library?".

If you do value that long-term stability over the language features, then a new language aiming to attract you has a big hill to climb - being new, it can't point to 30+ years of history like C can, and so how can it convince you that code written for the language as it exists today will still be useful code in 30 years time?

After all, we've established that that's a big chunk of why you use C - because you've seen so many attractive languages come and go in C's lifetime, and you don't want to start a project today that'll be impossible to compile (let alone use) in a decade. And that's a perfectly good reason to prefer C - it's the same reason that a lot of physicists prefer Fortran, because they want their code to still work for cross-checking results decades after a 20 year experiment comes to an end - but it does make it hard for a new language to attract you away from C.

DeVault: Announcing the Hare programming language

tialaramex — Sun, 08 May 2022 12:48:30 +0000

> My opinion is that the reason that language is in there, and has to be there, is so that malloc(), or anything else that works with a pointer, can return or check for an error. And the reason dereferencing a NULL pointer is undefined is because there is no telling how a platform behaves when you do so. See how none of this has anything to do with the compiler ?

You have muddled the NULL pointer (an abstract idea) with the all zeroes address on a typical CPU, these are intentionally not the same thing.

While it's obviously a bad idea, C has no trouble with using actual values from a type as sentinels, atoi("junk") and atoi("0") are both zero. So it wouldn't have been a problem to define that malloc() returning zero can be either an error or an actual zero address. And because C runs on the abstract machine, not some actual platform with whatever weird behaviour, the question of what happens if we try to do platform illegal operations never comes up.

Most platforms are likely to either not be phased at all by the all-zeroes address, or to be equally concerned with some other address values, including values beyond some logical "end of memory", ROMs, and memory mapped peripherals. We can observe that the C language does not define special behaviour for any of these, only NULL which means something in the abstract machine and *that* is why it's used as a sentinel value.

DeVault: Announcing the Hare programming language

Vipketsh — Sun, 08 May 2022 12:04:06 +0000

Thanks for clarifying and sorry about my misunderstanding.

DeVault: Announcing the Hare programming language

iustin — Sun, 08 May 2022 11:55:21 +0000

> No, C programmers are not degenerate psychologically challenged moronic narcissists who work day and night just to deliver security issues and bugs to you. Could we be civil and stop looking down on people who code in C, please ?

I think you misunderstood my reply entirely. I think the post I quoted was valid in the sense of - C programmers have seen enough fads coming by and passing, and yet - as you say - C still works. Not in the sense "they're dumb and cannot adjust".

DeVault: Announcing the Hare programming language

Vipketsh — Sun, 08 May 2022 11:53:16 +0000

I look at new programming languages as experiments: the important point isn't to create something wonderful, but to either prove or disprove the practicality of things. Personally, I have little interest in experimenting but I am interested in reading about the results when people do. Thus I find it valuable when people try new things. Actually, I definitely applaud people trying things by creating something new instead of making stuff up and then ramming it into an existing popular language where if it doesn't work out we are all stuck with garbage for evermore. If things don't work out for a new language, it dies, no one remembers and there is little lasting harm, but if it does something awesome and it shows it is awesome, those pieces can be cherry-picked into the popular languages.

DeVault: Announcing the Hare programming language

Vipketsh — Sun, 08 May 2022 11:49:29 +0000

I think many people who still use C do so for the same very simple reason I do: it works, has a reasonable community (many libraries) and, most importantly, code written today has a high chance of continuing to work unmodified far into the future. C is the only language which delivers all three, even though that last part is going down the drain these days because of continued exploitation of undefined behaviour. No matter what I find exactly no enjoyment being on a permanent treadmill having to jiffy my code around so that it works again with whatever "awesomeness" the language designers decided to change in the last few months or, even worse, to find that the language has been abandoned and now my code needs a full rewrite in something else. Is this really so bad ?

No, C programmers are not degenerate psychologically challenged moronic narcissists who work day and night just to deliver security issues and bugs to you. Could we be civil and stop looking down on people who code in C, please ?

DeVault: Announcing the Hare programming language

anton — Sat, 07 May 2022 20:35:15 +0000

In the decades for which Lisp and FORTH existed, why didn't they solve such a trivial problem?

What makes you think they didn't? There is CC64, although you may be unhappy with the language features.

The other answer is: What itch would a Lisp or Forth (rather than assembler) programmer scratch by writing a C compiler?