
DeVault: Announcing the Hare programming language

Posted May 4, 2022 13:56 UTC (Wed) by Vipketsh (guest, #134480)
In reply to: DeVault: Announcing the Hare programming language by felix.s
Parent article: DeVault: Announcing the Hare programming language

Why are we all talking about undefined behaviours as if they were laws of nature, set in stone? They are *not* set in stone and there is no (technical) reason they can't be changed. There is also no reason why compiler authors couldn't say "this makes no sense, we'll define it like this" (e.g. -fwrapv) and thereby force the standards committee's hand. In other words, I cannot accept arguments which end with "that's undefined behaviour, end of story" -- explain why it has to be undefined and/or what the "undefined" nature of the behaviour gets us (e.g. X% faster code, portability to architecture X, etc.) and then we can have a discussion about which is the better trade-off: the benefits of the undefined behaviour or the benefits of not having it.
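
To make the -fwrapv example concrete, here is a minimal sketch of my own: with signed overflow undefined, a compiler may fold the whole function below to "return 0"; with -fwrapv it has to perform the wrapping addition and the comparison.

int wraps(int x)
{
    return x + 1 < x;   /* true only if the addition wraps around */
}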

It would also be great if there were words to differentiate between undefined behaviour that cannot be avoided (e.g. use-after-free) and undefined behaviour we talk about only because compiler authors decided to explicitly add transformations based on the allowance (e.g. deleting NULL pointer checks due to NULL dereference being undefined). Lastly, I think we would be in a much better situation if 'undefined behaviour' were read as 'allowed by the underlying machine, forbidden for the compiler'.
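
The second kind can be illustrated with a minimal sketch (the function name is made up): because dereferencing NULL is undefined, a compiler may conclude from the dereference that p cannot be NULL and delete the later check as dead code.

int deref_then_check(int *p)
{
    int v = *p;     /* UB if p is NULL, so the compiler assumes p != NULL... */
    if (!p)         /* ...and may delete this check entirely */
        return -1;
    return v;
}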



DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:25 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

> (e.g. deleting NULL pointer checks due to NULL dereference being undefined)

Everyone keeps repeating this, but consider a set of inline-able utility functions that all safeguard against `NULL` internally (because that's the sensible thing to do). Do you not want to let the compiler optimize out those checks when inlining the bodies into a caller that, itself, has a NULL-then-bail check on that pointer (and a dereference is implicitly such a check as well)? If you want to remove this optimization, what kinds of optimizations are you willing to leave on the floor?
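
A minimal C sketch of what I mean (all names made up):

struct buf { int len; };

static inline int buf_len(const struct buf *b)
{
    if (b == NULL)          /* defensive check inside the utility */
        return 0;
    return b->len;
}

int caller(const struct buf *b)
{
    if (b == NULL)          /* caller already bails out on NULL */
        return -1;
    return buf_len(b);      /* after inlining, the inner check is provably dead */
}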

FWIW, once you get to LTO, the language it was written in may be long gone. Seeing something like this (forgive my lack of asm familiarity):

mov eax, [ecx]   ; int x = *ptr;
test ecx, ecx    ; if (!ptr) return;
jnz 1f           ; pointer known non-NULL: skip the early return
ret
1:

would you really want the optimizer not to remove that `test/jnz/ret` sequence?

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:48 UTC (Wed) by Vipketsh (guest, #134480) [Link]

> inline-able utility functions that all safeguard against `NULL` internally [...] Do you not want to let the compiler optimize out those checks when inlining the bodies into a caller that, itself, has a NULL-then-bail check on that pointer

It is a very reasonable optimisation, but you don't need "dereferencing NULL is undefined behaviour" to make it! The optimisation can be proven valid with range analysis: since you have the first check for NULL, you know that the pointer is not NULL afterwards, and thus any following checks against NULL cannot evaluate to true.

> what kinds of optimizations are you willing to leave on the floor?

If someone actually quantified how much one or another optimisation gets us, we could have a discussion. I would guess that in many cases it is something I could live with, but if people who do compiler benchmarks told me "well, that can lose you up to 10%" (or similar) I might very well concede that the undefined behaviour is worth keeping.

> FWIW, once you get to LTO, the language it was written in may be long gone

Yes, and no. LTO operates on the compiler's middle-end representation -- the same one on which almost all transformations occur. So, while the original language is indeed lost, all the information from it needed to decide whether one or another transformation is valid is still present.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:55 UTC (Wed) by khim (subscriber, #9252) [Link] (37 responses)

> They are *not* set in stone and there is no (technical) reason they can't be changed.

It can be done and it was done. Here is an example with undefined behavior, here is another with unspecified behavior.

> Why are we all talking about undefined behaviours as if they were laws of nature, set in stone?

They are not set in stone, yet they are the rules of the game. But most discussions about undefined behavior go like “why do you say I couldn't hold the basketball and run — I tried that and it works!”.

Yes, it may work if the referee looks the other way. But it's against the rules. If you want to make it rules-compliant then you have to change the rules first!

> I cannot accept arguments which end with "that's undefined behaviour, end of story"

It is the end of story for a compiler writer or a C developer. Just like “it's against the rules” is “the end of story” for the basketball player.

> explain why it has to be undefined and/or what the "undefined" nature of the behaviour gets us (e.g. X% faster code, portability to architecture X, etc.) and then we can have a discussion about which is the better trade-off: the benefits of the undefined behaviour or the benefits of not having it.

Sure. Raise that issue with the ISO/IEC JTC1/SC22/WG21 committee. If they agree, the rules will be changed. Or you can try to see what it takes to change the compiler, and report back.

But rules are rules; they are “the status quo”. Adherence to the rules doesn't need any justification, but changes to the rules do.

> Lastly, I think we would be in a much better situation if 'undefined behaviour' were read as 'allowed by the underlying machine, forbidden for the compiler'.

Such a thing already exists in the standard. It's called “implementation-defined behavior”. Undefined behavior is called undefined because it's, well… undefined.
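
A two-line sketch of the difference, using the shift operators:

int f(int x) { return x << 1; }   /* undefined for negative x or on overflow (C99) */
int g(int x) { return x >> 1; }   /* implementation-defined for negative x */

The second must do *something* the implementation documents; the first places no requirements on the implementation at all.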

P.S. There are cases where modern compilers break perfectly valid programs which don't, actually, trigger UB. That can only be called an act of sabotage, but those are different stories from what we are discussing here.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:03 UTC (Wed) by Vipketsh (guest, #134480) [Link] (36 responses)

You know full well that no individual can engage a committee with any meaningful hope of ever bringing about change, and you also know full well that committees are more about preserving the entrenchment of their members than about real technical progress. Thus any reference to them comes across as a simple means of shutting individuals out without discussion.

It's also not like compiler writers have never ignored standards when it suited them. I think it was you who mentioned in some thread a while ago that LLVM is doing all sorts of optimisations based on pointer provenance when there is no mention of them in any standard. So, no, I can't simply accept "it's against the rules" as the end of a discussion.

> “why do you say I couldn't hold the basketball and run — I tried that and it works!”

To take your analogy further, maybe basketball would be better that way. If I think so, I can gather up a bunch of people, go to some court and try it for a while and, if it works, maybe the basketball rules committee will take an interest and change the rules -- or maybe it's the birth of a new sport. So, yes, turning certain undefined behaviours into defined ones in a compiler is exactly the place to have this discussion and the rule changes can very well happen later.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:28 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (4 responses)

There are certainly individuals at C++'s standards committee; I don't know that much about C's, since I haven't been there, though. I don't think that "preserving the entrenchment" is at all how I'd describe it. There are certainly different pressures on things, but there is *not* a unified one anywhere other than "improve the language overall" (and not everyone agrees about every step taken). I've seen long-standing members get told "no, that idea doesn't work" about as much as I've seen new members get "oh, yeah, that's neat" (in rough proportion to their prevalence, at least). Implementors have their hobby horses, library designers have theirs, and users are there too to express their views on things. "Unified" is not how I would describe it in relation to any single feature-in-progress.

FWIW, I go because I work on CMake and would like to ensure that modules are buildable (as would my employer). But I go to other presentations that I am personally interested in (when not a schedule conflict with what I'm there to do) and participate.

With the pandemic making face-to-face less tenable, I expect it to be easier than ever for folks to attend committee meetings.

> I think it was you who mentioned in some thread a while ago that LLVM is doing all sorts of optimisations based on pointer provenance when there is no mention of them in any standard.

That was me, I believe. My understanding is that without provenance, pointers are extremely hard to actually reason about and make any kind of sensible optimizations around (this includes aliasing, thread safety, etc.). It was something encountered while implementing the language that turned out to underlie a lot of things but was never spelled out. There is work to figure out what these rules actually are and how to put them into standard-ese ("pointer zap" is the search term to use for the paper titles).

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:51 UTC (Wed) by Vipketsh (guest, #134480) [Link]

If the C/C++ committees are indeed that nice, I have at least some hope of sensible changes -- thanks for sharing your experience.

(For reference: long ago I engaged with the Unicode people as part of a research group, and that experience was one of dealing with megalomaniac cesspools, lobby groups, legal threats and the like -- truly awful.)

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:06 UTC (Wed) by khim (subscriber, #9252) [Link] (2 responses)

> That was me, I believe.

That was a different discussion.

> There is work to figure out what these rules actually are and how to put them into standard-ese ("pointer zap" is the search term to use for the paper titles).

All of this would be great if, while these rules are not finalized, compilers produced working programs by default.

Instead, clang doesn't just break programs by default, it even refuses to provide a -fno-provenance switch which could be used to stop miscompiling them!

And it's not as if the rules were actually designed, presented to the C/C++ committee (like happened with DR 236) and then rejected because of typical committee politics. At least then you could say “yes, changes to the standard are not accepted yet, but you can read about our rules here”.

Instead, for two decades compilers silently miscompiled valid programs while their developers offered no rules which software developers could follow to write programs that wouldn't be miscompiled!

That's an act of deliberate sabotage. And, worst of all, it has only gained prominence because of Rust: since Rust developers actually care about program correctness (and because LLVM miscompiled programs which they believed to be correct), they tried to find the actual list of rules that govern provenance in C/C++… and found nothing.

Yes, this provenance fiasco is an awful blow to C/C++ compiler developers. You can't say “the standard is a treaty, you have to follow it” and then turn around and add “oh, but it's too hard for me to follow it, thus I'll violate it whenever it suits me”. That's not a treaty anymore, it's a joke.

But even then: the way forward is not to ignore the rules but to attempt to patch them. Ideally, new compilers would be written which actually obey some rules, but that's too big of an endeavor; I don't think it'll happen.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:36 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

> Ideally, new compilers would be written which actually obey some rules, but that's too big of an endeavor; I don't think it'll happen.

There may be hope. Some docs: https://lightningcreations.github.io/lccc/xlang/pointer (though no mention of provenance yet(?)).

https://github.com/LightningCreations/lccc

DeVault: Announcing the Hare programming language

Posted May 5, 2022 9:59 UTC (Thu) by khim (subscriber, #9252) [Link]

I mean something like what was done for safe Rust: a formal mathematical model which can prove that the transformations done during optimizations are sound.

Things like https://www.ralfj.de/blog/2020/12/14/provenance.html should be impossible.

Today optimizations in compilers are a dark art: not only do they sometimes produce suboptimal code (that's probably something we will never be able to fix), we also discover, from time to time, that they are simply invalid (as in: they turn perfectly valid programs into invalid ones).

Ideally we should ensure that all optimizations leave valid programs valid (even if, perhaps, not optimal).

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:33 UTC (Wed) by khim (subscriber, #9252) [Link] (30 responses)

> I think it was you who mentioned in some thread a while ago that LLVM is doing all sorts of optimisations based on pointer provenance when there is no mention of them in any standard.

Indeed. That was a great violation of the rules and that's what finally pushed me to start using Rust. I had always been sitting on the fence about it, but when it turned out that with C/C++ I have to watch out not just for the UBs actually listed in the standard, but also for other, vague and uncertain requirements which were supposed to be added to it more than a decade ago… at that point it became apparent that C/C++ are not suitable for any purpose.

The number of UBs is just too vast, they are not enumerated (C tried but, as the provenance issue shows, failed to list them all; C++ hasn't tried at all) and, more importantly, there is no automated way to separate code where UB may happen (and which I'm supposed to rewrite when a new version of the compiler is released) from code where UB is guaranteed to be absent (and which I can explicitly trust).

But that was just the final straw. Even without this gross violation of rules it was becoming more and more apparent that UBs have to go: it's not clear if a low-level language without UB is possible in principle, but even if you eliminate them from 99% of code using automated tools, that would still be a great improvement.

> It's also not like compiler writers have never ignored standards when it suited them.

Very rarely. Except for that crazy provenance business (which is justified by DR260, and where the main issue lies with the fact that compilers started miscompiling programs without appropriate prior changes to the standard), I can only recall DR236 (where the committee acted as a committee and refused to offer any sane way to deal with the issue, just noting that the requirement of global knowledge is problematic).

And in cases where it was shown that UB requirements are just too onerous to bear, they changed the standard in favor of C++ developers, so it was definitely not a one-way street.

> If I think so, I can gather up a bunch of people, go to some court and try it for a while and, if it works, maybe the basketball rules committee will take an interest and change the rules -- or maybe it's the birth of a new sport.

You are perfectly free to do that with the compilers, too. Most are gcc/clang-based nowadays, thus you can just start by implementing your proposed changes, then measure things and promote your changes.

> So, yes, turning certain undefined behaviours into defined ones in a compiler is exactly the place to have this discussion and the rule changes can very well happen later.

Indeed, but it's your responsibility to change the compiler and show that it brings real benefits. Just like with most basketball players: they wouldn't even consider trying to play basketball with some changed rules unless you could show that other prominent players are trying that changed variant.

Some changes may even become the new -fwrapv, who knows? But it's changes to the language that need a justification; the rules are the rules by default.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:14 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> But that was just the final straw. Even without this gross violation of rules it was becoming more and more apparent that UBs have to go: it's not clear if a low-level language without UB is possible in principle, but even if you eliminate them from 99% of code using automated tools, that would still be a great improvement.

It's easy to eliminate UB from any language. A computer language is Maths, and as such you can build a complete and pretty much perfect MODEL.

When you run your program, it's Science (or technology), and you can't guarantee that your model and reality coincide, but any AND ALL UB should be "we have no control over the consequences of these actions, because we are relying on an outside agent". Be it a hardware random number generator, a network card receiving stuff over the network, a buggy CPU chip, etc. The language should be quite clear: "here we have control, therefore the following is *defined* as happening; there we have no control, therefore what happens is whatever that other system defines". If that other system doesn't bother to define it, it's not UB in your language.

And that's why Rust and Modula-2 and the like all have unsafe blocks - it says "here be dragons, we cannot guarantee language invariants".

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:32 UTC (Wed) by khim (subscriber, #9252) [Link]

> And that's why Rust and Modula-2 and the like all have unsafe blocks - it says "here be dragons, we cannot guarantee language invariants".

You just have to remember that not all unsafe blocks are marked with nice keywords. E.g. a presumably “safe” Rust program may open /proc/self/mem.

But even then: it's time to stop making languages which allow UB to happen in just any random piece of code which does no such crazy things!

> It's easy to eliminate UB from any language. A computer language is Maths, and as such you can build a complete and pretty much perfect MODEL.

Sometimes the model is just too restrictive. E.g. in Rust you have to go into the unsafe realm just to create a queue or a linked list.

But even that rigid and limited model allows one to remove UB from a surprising percentage of one's code. And it's the only way to write complex code. The C/C++ approach doesn't scale. We don't have enough DJBs.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 0:33 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (27 responses)

As I understand it, WG21 (i.e. C++) does have a sub-committee attempting to enumerate the Undefined Behaviour.

Provenance is difficult, I would agree with you that C++ didn't do a great job here by trying to kick this ball into the long grass rather than wrestle with the difficult problem, but I think we should be serious for a moment and consider that if C++11 had tried to do for real what Aria is merely proposing for Rust (and which, of course, is nowhere near having consensus to actually land in stable yet), it would have been stillborn.

The same exact people who are here talking about some hypothetical C dialect or new language where it does what they meant (whatever the hell that is) would have said Provenance is an abomination, just let us do things the old way. Even though "the old way" doesn't have any coherent meaning which is why this came up as a Defect Report not as a future feature proposal.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 10:30 UTC (Thu) by khim (subscriber, #9252) [Link] (26 responses)

> Even though "the old way" doesn't have any coherent meaning which is why this came up as a Defect Report not as a future feature proposal.

It was an abuse of the system, plain and simple.

One part of the standard was saying, back then, that only visible values matter and that if two pointers are identical they should behave identically.

That's the understanding of the vast majority of practical programmers, and this is what should have been kept as the default (even if it affected optimizations).

The other part of the standard was talking about pointer validity and contradicted the first: e.g. realloc in C99 (but, notably, not in C89) is permitted to return a different pointer which can be bitwise-identical to the original one.

That's what the saboteurs wanted to hear and that's what they used to sabotage the C/C++ community.

> Provenance is difficult, I would agree with you that C++ didn't do a great job here by trying to kick this ball into the long grass rather than wrestle with the difficult problem

It's not a “difficult problem” at all. If the sabotage had failed, then the rules for when identical pointers are not considered identical would have become not a “hidden”, “unwritten” part of the standard, but a non-standard mode, an extension.

If it had been proven that they help to produce much better code, then they would have been enabled explicitly in many projects. And people would have become aware of them.

Instead the language was silently, and without evidence, changed behind developers' backs. That's a real, serious issue IMNSHO. Laymen are not supposed to know all the fine points of the law. They are especially not supposed to know all the unwritten rules.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 12:47 UTC (Thu) by foom (subscriber, #14868) [Link] (25 responses)

> saboteurs
> sabotage

This is inflammatory and unhelpful language. That certainly doesn't actually describe anyone involved, as I suspect you're well aware.

I believe what you actually mean is that you disagree strongly with some of the decisions made, and consider them to have had harmful effects. Say that, not "saboteurs".

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:02 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (2 responses)

I think it accurately reflects the sentiment of the vast majority of C programmers, who nowadays are terrified to upgrade their toolchain on every system upgrade and then discover new silent breakage that brings absolutely no value at all, only frustration, wasted time and motivation, and costs.

The first rule should be not to break what has reliably worked for ages, *even if that was incorrect in the first place*. As Linus often explains, a bug may become a feature once everyone uses it; usage prevails over initial intent.

I'm pretty certain that most of the recent changes were driven exclusively by pride, to say "look how smart the compiler became after my changes", forgetting that their users would like it to be trustable instead of being smart.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:52 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

> The first rule should be not to break what has reliably worked for ages, *even if that was incorrect in the first place*. As Linus often explains, a bug may become a feature once everyone uses it; usage prevails over initial intent.

That's one possibility, yes. But there's another possibility: follow the standard. Dūra lēx, sed lēx.

Under that assumption you just follow the law. What law says… goes. Even if what the law says is nonsense.

That's what C/C++ compiler developers promoted for years.

But if you pick that approach you cannot then turn around and say “oh, sorry, the law is too harsh, I don't want to follow it”.

Either the law is the law, everyone has to follow it and following it is enough, or it's not the law.

> I'm pretty certain that most of the recent changes were driven exclusively by pride, to say "look how smart the compiler became after my changes", forgetting that their users would like it to be trustable instead of being smart.

Provenance rules are not like that. They allow clang and gcc to eliminate calls to malloc and free in some cases. This may bring amazing speedups. And if these things were an opt-in option I would have applauded these efforts; it would have been a great way to promote them and, eventually, add them to the standard.
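
A sketch of the kind of code involved (whether a given compiler actually elides the pair depends on its version and flags):

#include <stdlib.h>

int no_alloc_needed(void)
{
    int *p = malloc(sizeof *p);
    if (!p)
        return 0;
    *p = 42;
    int v = *p;
    free(p);
    return v;   /* the pointer never escapes, so clang can fold the whole
                   function to "return 42", with no malloc or free calls */
}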

Instead they were introduced in a submarine-patent way, without any options, not even opt-out ones. And they break standards-compliant programs.

That's an act of sabotage, sorry.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 15:26 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

> That's one possibility, yes. But there's another possibility: follow the standard. Dūra lēx, sed lēx.
> Under that assumption you just follow the law. What law says… goes. Even if what the law says is nonsense.
> That's what C/C++ compiler developers promoted for years.

I wouldn't deny that, but:
- the law is behind a paywall
- lots of modern abstractions in interfaces sadly make it almost impossible to follow. Using a lot of foo_t types everywhere, without even telling you whether they're signed/unsigned or 32/64-bit, causes lots of trouble when you have to perform operations on them, forcing you to cast them and enter the nasty area of type promotion. That's even worse when you try hard to avoid an overflow on a type you don't know, and the compiler, knowing better than you, manages to get rid of your check (see the sketch below).
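
A sketch of the foo_t trap (foo_t and the helper are made-up stand-ins; imagine the typedef is hidden in a header you don't control):

typedef unsigned short foo_t;   /* opaque in real life; shown so the sketch compiles */

void handle_overflow(void);

void add_checked(foo_t a, foo_t b)
{
    if (a + b < a)              /* intended wrap-around detection */
        handle_overflow();      /* never fires: a and b are promoted to int first */
}

If foo_t were instead a signed type, the addition itself would be undefined on overflow and the compiler could delete the check outright.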

We're really fighting *against* the compiler to keep our code safe these days. This tool was supposed to help us instead. And it failed.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:38 UTC (Thu) by khim (subscriber, #9252) [Link] (21 responses)

> That certainly doesn't actually describe anyone involved, as I suspect you're well aware.

This describes the majority of people I have talked with. I was unable to find anyone who honestly claimed that carefully reading the standard is enough to write code which wouldn't violate provenance rules.

On the contrary: people expressed regret or, in many cases, anger about the fact that the standard doesn't enable usage of provenance rules by the compilers, yet no one claimed that these rules follow from the existing standard.

If the standard is a treaty between compiler and programmer, then this was a deliberate breakage of the treaty. Worse: it placed the users of the compiler at a serious disadvantage. How are they supposed to follow rules which don't exist? Especially if even the people who are yet to present these rules agree that they are really complex and convoluted?

And when people, horrified, asked for a -fno-provenance option? They got the answer: although provenance is not defined by either standard, it is a real and valid emergent property that every compiler vendor ever agrees on.

Sorry, but if that is not an act of sabotage, then I don't know what is. And people who are doing the sabotage are called saboteurs.

> I believe what you actually mean is that you disagree strongly with some of the decisions made, and consider them to have had harmful effects.

No. It's not about me agreeing with something or disagreeing with something.

> Say that, not "saboteurs".

Let me summarize what happened:

  1. Compiler developers found out that certain parts of the standard allow language users to write certain questionable programs which are hard to optimize.
  2. After that was acknowledged, they didn't change the standard, yet decided to break these standard-compliant programs.
  3. They haven't mentioned that fact in the release notes and spent no effort to deliver that information to users in any way.
  4. They also refused to offer any options which would allow one to use the questionable constructs.
  5. Naturally they also refuse to enable such options when you compile programs with -std=c89 (a variant of the C standard which does not include any provisions for provenance whatsoever).

Sorry, but when you knowingly break standards-compliant programs and refuse to support them in any shape or form — that's an act of sabotage.

DeVault: Announcing the Hare programming language

Posted May 9, 2022 18:58 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (20 responses)

> Naturally they also refuse to enable such options when you compile programs with -std=c89 (a variant of the C standard which does not include any provisions for provenance whatsoever).

Suppose it is the year 1889, the twentieth century seems bright ahead, and you have learned everything there is to know (or so you think) about Set Theory. (This is called Naive Set Theory). It seems to you that this works just fine.

Fast forward a few years, and this chap named Ernst Zermelo says your theory is incomplete and that's why it suffers from some famous paradoxes. His new axiomatised Set Theory requires several exciting new axioms, including the infamous Axiom of Choice, and with these axioms Zermelo vanquishes the paradoxes.

Now, is Ernst wrong? Was your naive theory actually fine? Shouldn't you be allowed to go back to using that simpler, "better" set theory and ignore Ernst's stupid paradoxes and his obviously nonsensical Axiom of Choice? No. Ernst was right, your theory was unsound, _and it was already unsound in 1889_, you just didn't know it yet. Your naive theory _assumed_ things which Zermelo made into axioms.

Likewise, the C89 compilers you have nostalgia for were actually unsound, and it would have been possible (or, if you resurrect them, is possible today on those compilers) to produce nonsensical results: pointer provenance may not have been mentioned in the C89 standard, but it was relied upon by compilers anyway. It was silently assumed, and had been for many years.

The excellent index for K&R's Second Edition of "The C Programming Language" covering C89, doesn't even have an entry for the words "alias" or "provenance". Because there are _assumptions_ about these things baked in to the language, but they haven't been surfaced.

The higher level programming languages get to have lots of optimisations here because assumptions like "pointer provenance" are necessarily true in a language that only has references anyway. To keep those same optimisations (as otherwise they'd be slower!) C and C++ must make the assumptions too, and yet to deliver on their "low level" promise they cannot. Squaring this circle is difficult, which is why the committees have punted on it rather than doing it over all these years.

I happen to think Rust (more by luck than judgement so far as I can see) got this right. If you define most of the program in a higher level ("safe" in Rust terms) language, you definitely can have all those optimisations and then compartmentalize the scary assumption-violating low level stuff. This is what Aria's Tower of Weakenings is about too. Aria proposes that even most of unsafe Rust can safely keep these assumptions, something like Haphazard (the Hazard Pointer implementation) doesn't do anything that risks provenance confusion and so it's safe to optimize with those assumptions and more stuff like that can be safely labelled safe, until only the type of code that really mints "pointers" from integers out of nowhere cannot justify the assumptions and accordingly cannot be optimised successfully.

It's OK if five machine instructions can't be optimised, probably even if they're in your tight inner loop, certainly it's better than accidentally optimising them to four *wrong* instructions. What's a problem for C and C++ is that the provenance problem is infectious and might spread from that inner loop to the entire program and then you're back to slower than Python.

DeVault: Announcing the Hare programming language

Posted May 9, 2022 21:55 UTC (Mon) by Wol (subscriber, #4433) [Link] (19 responses)

> Your naive theory _assumed_ things which Zermelo made into axioms.

Whoops. I know a lot of people think my view of maths is naive, but that's not what I understand an axiom to be. An axiom is something which is *assumed* to be true, because it can't be proven. Zermelo would have provided a proof, which would have changed that part of your naive theory from an axiom to something proven false.

This is like Euclid's axiom that parallel lines never meet. That axiom *defines* the special case of flat (Euclidean) geometry, but because it's false in the general case, it's not an axiom of geometry as a whole.

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 9, 2022 22:39 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (18 responses)

Maybe I didn't express myself very well.

As I understand it, the problem with assumptions in naive set theories and C89 (and various other things) is that because it's an assumption rather than an axiom, you don't spot where the problems are. You never write it down at all and so have no opportunity to notice that it's too vague, whereas when you're writing an axiom you can see what you're doing.

The naive theories let Russell's paradox sneak in by allowing the creation of a poorly defined set, but the axioms in Zermelo's theory oblige you to define a set more carefully in order to have a set at all, and in that process the paradox gets defused. In particular, ZFC has a "specification" axiom which says, in essence: OK, tell me how to make this "set" using another set and first-order logic. The sets that naive set theories were created for can all be constructed this way, no problem, but weird sets with paradoxical properties cannot.

C89 assumes that pointers to things must be different, which sounds reasonable but does not formally explain how this works. I believe that it's necessary (in order for a language like C89 to avoid its equivalent of paradoxes, the programs which abuse language semantics to do something hostile to optimisation) to define such rules, and they're going to look like provenance.

I do not believe that C89 is fine, and thus that we should or could just implement C89 as if provenance isn't a thing and be happy. That's my point here. C89 wasn't written in defiance of a known reality, but in ignorance of an unknown one, like the Ptolemaic system. Geocentrists today are different from Ptolemy, but not because Ptolemy was right.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 2:18 UTC (Tue) by khim (subscriber, #9252) [Link] (17 responses)

> C89 assumes that pointers to things must be different, which sounds reasonable but does not formally explain how this works.

No. It's the opposite. C89 doesn't assume that pointers to things must be different. But yes, it asserts that if pointers are equal then they are completely equivalent — you can compare any two pointers and go from there.

Note that it doesn't work in the other direction: it's perfectly legal to have pointers which are different yet point to the same object. That's easily observable in MS-DOS's large memory model.
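
To illustrate (a sketch in Borland-style real-mode C; MK_FP comes from <dos.h>): the linear address is segment * 16 + offset, so two bitwise-different far pointers can name the same byte.

#include <dos.h>

char far *p = MK_FP(0x0000, 0x0010);   /* 0x0000 * 16 + 0x10 = linear 0x10 */
char far *q = MK_FP(0x0001, 0x0000);   /* 0x0001 * 16 + 0x00 = linear 0x10 */
/* p != q as bit patterns, yet *p and *q are the same byte of memory */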

> I do not believe that C89 is fine, and thus that we should or could just implement C89 as if provenance isn't a thing and be happy.

Show me a Russell's paradox, please. Not in the form “the optimizer could not do X or Y, which is a nice thing to be able to do, and thus we must punish the software developer who assumes pointers are just memory addresses”, but “this valid C89 program can produce one of two outputs depending on how we read the rules and thus it's impossible to define how it should work”.

Then and only then you would have a point.

> That's my point here.

I think you are skipping one important step there. Yes, there was an attempt to add rules about how certain pointers have to point to different objects. But it failed spectacularly. It was never adopted and the final text of the C89 standard doesn't even mention it. In C89 pointers are just addresses; no crazy talk about pointer validity, object creation and disappearance, and so on.

There are some leftover inaccuracies from that failed attempt: C89 defines only two lifetimes, static and automatic… yet somehow the value of a pointer that refers to freed space is indeterminate. But if you just declare that pointers are addresses, then it should be possible to fix these inaccuracies without much loss.

Non-trivial lifetimes first appeared in C99, not C89. There, yes, it became impossible to read from a "union member other than the last one stored into", limitations were added on whether you can compare two pointers (previously it was possible to compare two arbitrary pointers, and if they happened to be valid and equal, they were equivalent), etc.
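
A short illustration of the union point (a sketch; how questionable the read is depends on which revision of the wording you consult):

union pun { int i; float f; };

int bits_of_one(void)
{
    union pun u;
    u.f = 1.0f;
    return u.i;   /* under the C89 "pointers are addresses" view this just reads
                     the stored representation; under C99's original wording,
                     reading a member other than the last one stored is suspect */
}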

But I don't see why the C89 memory model would be, somehow, unsound. Hard to optimize? Probably. Wasteful? Maybe so. But I don't see where it's inconsistent. Show me. Please.

Yes, it absolutely rejects pointer provenance in any shape or form (except something like CHERI where provenance actually exists at runtime and is 100% consistent). Yes, it may limit optimization opportunities. But where's the Russell's paradox, hmm?

DeVault: Announcing the Hare programming language

Posted May 10, 2022 15:04 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (6 responses)

I don't have a Russell's Paradox equivalent under your constraints, but I think it's worth highlighting just how severe those constraints are.

Your resulting C compiler is not the GCC I grew up with (well, OK, the one that teenage me knew), or even that compiler with some minor optimisation passes disabled; it's an altogether different animal, perhaps closer to Python. In this language, pointers are all just indexes into an array containing all of memory - including the text segment and the stack - and so you can do some amazing acrobatics as a programmer, but your optimiser is badly hamstrung. C's already poor type checking is further reduced in power in the process, which again makes the comparison to Python seem appropriate.

I don't believe there is or was an audience for this compiler. People weren't writing C because of the expressive syntax, the unsurpassed quality of the tooling or the comprehensive "batteries included" standard library, it didn't have any of those things - they were doing it because C compilers produce reasonable machine code, and this alternative interpretation of C89 doesn't do that any more.

> (except something like CHERI where provenance actually exists at runtime and is 100% consistent)

You can only do this at all under CHERI via one of two equally useless routes:

1. The "Python" approach I describe where you declare that all "pointers" inherit a provenance with 100% system visibility, this obviously doesn't catch any bugs, and you might as well switch off CHERI, bringing us to...

2. The compatibility mode. As I understand it Morello provides a switch so you can say that now we don't enforce CHERI rules, the hidden "valid" bit is ignored and it behaves like a conventional CPU. Again you don't catch any bugs.

This is because under your preferred C89 "no provenance" model there isn't any provenance; CHERI isn't a fairytale spell, it's just engineering.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 16:10 UTC (Tue) by khim (subscriber, #9252) [Link] (3 responses)

> Your resulting C compiler is not the GCC I grew up with (well, OK, the one that teenage me knew), or even that compiler with some minor optimisation passes disabled; it's an altogether different animal, perhaps closer to Python.

But that is the language which Kernighan and Ritchie designed and used to write Unix. Their goal was not to create some crazy portable dream; they just wanted to keep supporting both the 18-bit PDP-7 and the 16-bit PDP-11 from the same codebase by rewriting some parts of the code written in PDP-7 assembler in a higher-level language. They had been using B, which had no types at all, and improved it by adding character types, then structs, arrays, pointers (yes, B conflated pointers and integers; it only had one type).

Yet malloc was not special, free was not special, and not even all Unix programs used them (just look at the source of the original Bourne Shell some day).

> I don't believe there is or was an audience for this compiler.

How about “all the C users for the first decade of its existence”? Initially C was used exclusively in Unix, but in the 1980s it came to be used more widely. Yet I don't think any compilers of that era supported anything even remotely resembling “pointer provenance”.

That craziness started after a failed attempt of the C standard committee to redo the language. They then went back and replaced that with simpler aliasing rules which prevented type punning, but even those weren't abused by compilers until the 21st century.

> People weren't writing C because of the expressive syntax, the unsurpassed quality of the tooling or the comprehensive "batteries included" standard library, it didn't have any of those things - they were doing it because C compilers produce reasonable machine code, and this alternative interpretation of C89 doesn't do that any more.

Can you, please, stop rewriting history? C was quite popular well before ANSI C arrived and tried (but failed!) to introduce crazy aliasing rules. Yes, C compilers were restricted and couldn't do all the amazing optimizations… but C developers could do them instead! When John Carmack was adopting the infamous 0x5f3759df-based trick he certainly didn't care to think about the fact that there are some aliasing rules which may render the code invalid, and that was true for the majority of users who grew up in an era before GCC started breaking good K&R programs.

> This is because under your preferred C89 "no provenance" model there isn't any provenance; CHERI isn't a fairytale spell, it's just engineering.

It's engineering, yes, but you can add provenance to C89. It just has to be consistent. You can even model it with a “poor man's CHERI”, aka the MS-DOS large model, by playing tricks with segment and offset. E.g. realloc could turn the pointer 0x0000:0x1234 into 0x0001:0x1224 if it decided not to move the object (both resolve to the same linear address, since linear = segment * 16 + offset).

This way all the fast-path pointer-equality comparisons would be defeated, and you would never have the situation where bitwise-identical pointers point to different objects. This may not be super-efficient, but it is compatible with C89. Remember? Bitwise-different pointers can point to the same object, but the opposite is forbidden! Easy, no?

All these games where certain pointers can be compared but not others, and where it's the programmer's responsibility to remember all the unwritten rules… I don't know how such a language can be used for development, sorry.

The advice I have gotten from our toolchain-support team is to ask clang developers about any low-level constructs which I may wish to create!

So much for “standard is a treaty” talks…

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:03 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

Early C did not have a formal specification - what the one and only implementation did was what the language did.

And the problem is that formally specified C - including K&R C, and C89 - left a huge amount unspecified; users of C assumed that the behaviour of their implementation of the language was C behaviour, and not merely the way their implementation behaved.

Up until the early 1990s, this wasn't a big deal. The tools needed for compilers to do much more than peephole optimizations simply didn't exist in usable form; remember that SSA, the first of the tools needed to start reasoning about blocks or even entire programs doesn't appear in the literature until 1988. As a result, most implementations happened, more by luck than judgement, to behave in similar ways where the specification was silent.

But then we got SSA, the polytope model, and other tools that allowed compilers to do significant optimizations beyond peephole optimizations on the source code, their IR, and the object code. And now we have a mess - the provenance model, for example, is compiler authors trying to find a rigorous way to model what users "expect" from pointers, not just what C89 permits users to assume, while C11's concurrent memory model is an effort to come up with a rigorous model for what users can expect when multiple threads of execution alter the state of the abstract machine.

Remember that all you are guaranteed about your code in C89 is that the code behaves as-if it made certain changes to the abstract machine for each specified operation (standard library as well as language), and that all volatile accesses are visible in the real machine in program order. Nothing else is promised to you - there's no such thing as a "compiler barrier" in C89, for example.
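
What practice uses instead is an extension. A sketch of the common GCC-style idiom:

#define barrier() __asm__ __volatile__("" ::: "memory")

It tells the compiler that memory may have changed across that point, which is precisely a promise C89 itself never makes.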

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:57 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

> And the problem is that formally specified C - including K&R C, and C89 - left a huge amount unspecified; users of C assumed that the behaviour of their implementation of the language was C behaviour, and not merely the way their implementation behaved.

True, but irrelevant. The most important part, the one we are discussing here, was specified in both: pointers are addresses, and if two pointers are equal they can be used interchangeably.

> But then we got SSA, the polytope model, and other tools that allowed compilers to do significant optimizations beyond peephole optimizations on the source code, their IR, and the object code.

Yes. And there was an attempt to inject the ideas that make these useful into C89. A failed one. The committee had created an unreal language that no one could or would actually use. It was ripped out (and replaced with crazy aliasing rules, but that's another story).

> And now we have a mess - the provenance model, for example, is compiler authors trying to find a rigorous way to model what users "expect" from pointers, not just what C89 permits users to assume

Can you, please, stop lying? Provenance models are trying to justify deliberate sabotage where fully standard-compliant programs are broken. Those are not my words; the provenance proposal itself laments:

These GCC and ICC outcomes would not be correct with respect to a concrete semantics, and so to make the existing compiler behaviour sound it is necessary for this program to be deemed to have undefined behaviour.
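
For reference, the kind of program being deemed undefined looks roughly like this (my reconstruction along the lines of the published provenance examples, not a quote from the proposal):

#include <stdio.h>
#include <string.h>

int x = 1, y = 2;

int main(void)
{
    int *p = &x + 1;   /* one-past-the-end of x */
    int *q = &y;
    if (memcmp(&p, &q, sizeof p) == 0) {
        *p = 11;       /* bitwise p == q, yet this store is the "UB" */
        printf("%d %d\n", x, y);
    }
    return 0;
}

Compilers have reportedly constant-folded the reads of x and y here even when the store executes; that is exactly the behaviour the proposal sets out to bless.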

To make the existing compiler behavior sound, my ass. The whole story of provenance started with sabotage: after the failed attempt to bring provenance-like properties to C89, the saboteurs returned in C99 and finally succeeded in adding some (rendering some C89 programs invalid in the process). But that wasn't enough: they got the infamous DR260 resolution, which was phrased like this: After much discussion, the UK C Panel came to a number of conclusions as to what it would be desirable for the Standard to mean.

Note: the resolution hasn't changed the standard. It hasn't allowed the saboteurs to break more programs. No. It was merely a suggestion to develop adjustments to the standard — and it listed three cases where such adjustments were supposed to cause certain outcomes.

Nothing like that happened. For more than two decades compilers invented more and more clever ways to screw the developers and used that resolution as a fig leaf.

And then… Rust happened. Since the Rust guys are pretty concerned about program correctness (and LLVM sometimes miscompiled IR programs they perceived as correct), they went to the C++ guys and asked “hey, what are the actual rules we have to follow?” And the answer was… “here is the defect report; we use it to screw the developers and miscompile their valid programs”. Rust developers weren't amused.

And that is when the lie was, finally, exposed.

So, please, don't liken problems with pointer provenance to problems with C11 memory model.

Indeed, C89 or C99 doesn't allow one to write valid multi-threaded programs. Everything is defined strictly for a single-threaded program. To support programs where two threads of execution can touch objects “simultaneously” you need to extend the language somehow.

The provenance excuse is used to break completely valid C and C++ programs. It's not about extending the language, it's about narrowing it. Certain formerly valid programs have to be deemed to have undefined behaviour. And after more than two decades we don't even have the rules which we are supposed to follow finalized!

And they express it in the form of: if you believe pointers are mere addresses, you are not writing C++; you are writing the C of K&R, which is a dead language. IOW: they know they sabotaged C developers — and they are proud of it.

> Nothing else is promised to you - there's no such thing as a "compiler barrier" in C89, for example.

Yes. And to express many things which would be needed to, e.g., write an OS kernel in C89, you need to extend the language in some way. This is deliberate: the idea was to make sure strictly-conforming C89 programs run everywhere, while conforming programs may require certain language extensions. Not ideal, but it works.

The saboteurs managed to screw it all up completely and categorically refuse to fix what they have done.

This looks like a good time to stop

Posted May 10, 2022 20:05 UTC (Tue) by corbet (editor, #1) [Link]

When we start seeing this type of language, and people accusing each other of lying, it's a strong signal that a thread may have outlived its useful lifetime. This article now has nearly 350 comments on it, so I may not be the only one who feels like that outliving happened a little while ago.

I would like to humbly suggest that we let this topic rest at this point.

Thank you.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:16 UTC (Tue) by Vipketsh (guest, #134480) [Link] (1 responses)

Where provenance makes sense is if the language you have has some higher level concept of an object which has some properties described in the program and known to the compiler. Most importantly object lifetime is known. This ends up bringing with it rules such as you can not just arbitrarily turn one object into another and back again (i.e. no pointer casts), you can not arbitrarily split one object into two (i.e. no pointer arithmetic) and you can not arbitrarily manufacture pointers out of random data.

Unfortunately C is not such a language and by forcing provenance rules on it, one is in essence trying to retrofit some kind of object model to it without any of the expressiveness and enforced rules that are needed for the programmer to not make programmes that are obviously wrong under the assumptions (i.e. provenance). Worse yet the rules, or better said heuristics*, that standards and compilers chose to signal lifetime are counter to what existing programmes expect and exploit.

Re: your mathematics analogy. I think you have taken the wrong view point there: pretty much all of mainstream mathematics is concerned with either extending existing and useful theory (e.g. how the rational numbers were extended to create the irrational numbers) or with putting existing theory on a more sound footing (e.g. Hilbert's axioms), possibly closing off various paradoxes. Realise how in pretty much all of the evolution of mathematics a very strong emphasis was placed on any new theory being backwards compatible -- no-one taken seriously ever wanted to end up with 1+1=3; instead they worked to solidify the intuition that 1+1=2. I think that if one wanted to paint mathematical evolution onto C, the definitions underpinning C would need to be changed in a way that (i) they are backwards compatible with existing programs and (ii) the loopholes exploited by compiler writers are closed instead of officially sanctioned. Right now, it's the opposite: people are trying to convince C programmers that the intuition they had all along was always false and reality is actually completely different.

*: Possibly the one with the most problems is the idea that realloc() will, in the absence of failure, always (i) destroy the object passed to it, and (ii) allocate a completely new one. This is counter to the intuition of many a programmer, and there is no enforced rule in C that prevents programmers from carrying pointers over the realloc() call -- an enforced rule is what would be needed to make the idea actually work. The reason people are annoyed is that such code exists, is used, and has worked for a long time, and there is no evidence that this idea has much, if any, benefit for the compiled program.
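
A sketch of the pattern this footnote describes (under the post-C99 reading, both the comparison and the store through the old pointer are formally invalid, which is exactly the complaint):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *p = malloc(sizeof *p);
    if (!p)
        return 1;
    *p = 1;
    int *q = realloc(p, sizeof *q);   /* may well return the same address */
    if (q == p) {      /* bitwise-equal, but p's object is formally "dead" */
        *p = 2;        /* carrying p across realloc: works in practice,
                          undefined under the provenance reading */
        printf("%d\n", *q);
    }
    free(q);
    return 0;
}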

DeVault: Announcing the Hare programming language

Posted May 11, 2022 16:09 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

> Where provenance makes sense is if the language you have has some higher level concept of an object which has some properties described in the program and known to the compiler.

I shall quote K&R themselves on this subject in their second edition:

"An object, sometimes called a variable, is a location in storage, and its interpretation depends on two main attributes ..."

> This ends up bringing with it rules such as you can not just arbitrarily turn one object into another and back again (i.e. no pointer casts), you can not arbitrarily split one object into two (i.e. no pointer arithmetic)

Nope, a language can (and some do, notably Rust of course, but also this is possible with care in C and C++) provide pointer casts and pointer arithmetic. Provenance works just fine for these operations.

Rust's Vec::split_at_spare_mut() isn't even unsafe. Most practical uses for this feature are unsafe, but the call itself is fine: it merely gives you back your Vec<T> (which will now need to re-allocate if grown, because any spare space at the end of it is gone) and the space which was not yet used, as a mutable slice of MaybeUninit<T>, to do with as you see fit.

> and you can not arbitrarily manufacture pointers out of random data.

But here's where your problem arises. Here provenance is conjured from nowhere. It's impossible magic.

> Unfortunately C is not such a language and by forcing provenance rules on it, one is in essence trying to retrofit some kind of object model to it without any of the expressiveness and enforced rules that are needed for the programmer to not make programmes that are obviously wrong under the assumptions (i.e. provenance)

As we saw C is in fact such a language after all. The fact that many of its staunchest proponents don't seem to understand it very well is a problem for them and for C.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 16:39 UTC (Tue) by farnz (subscriber, #17727) [Link] (9 responses)

When you say "if pointers are equal, then they are completely equivalent", are you talking at a single point in time, or over the course of execution of the program?

Given, for example, the following program, it is a fault to assume that ptr1 and ptr2 are equivalent throughout the runtime, because ptr1 is invalidated by the call to release_page:


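/* get_zeroed_page, get_ptr_to_handle and release_page are hypothetical
   page-allocator primitives, sketched for the sake of the example. */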
handle page_handle = get_zeroed_page();
int test;
int *ptr2;
int *ptr1 = get_ptr_to_handle(page_handle);
*ptr1 = -1; // legitimate - ptr1 points to a page, which is larger than an int in this case and correctly aligned etc.
test = *ptr1; // makes test -1
release_page(page_handle);
page_handle = get_zeroed_page();
ptr2 = get_ptr_to_handle(page_handle); // ptr2 could have the same numeric value as ptr1.
if (ptr2 == ptr1 && *ptr1 == test) {
    puts("-1 == 0");
} else {
    puts("-1 != 0");
}
release_page(page_handle);

This is the sort of code that you need to be clear about; C89's language leaves it unclear whether it's legitimate to assume that *ptr1 == test, even though the only assignments in the program are to *ptr1 (setting it to -1) and test. The thing that hurts here is that even if, in bitwise terms including hidden CHERI bits etc, ptr1 == ptr2, it's possible for the underlying machine to change state over time, and any definition of "completely equivalent" has to take that into account.

One way to handle that is to say that even though the volatile keyword does not appear anywhere in that code snippet, you give dereferencing a pointer volatile-like semantics (basically asserting that locations pointed to can change outside the changes done by the C program), and say that each time it's dereferenced, it could be referring to a new location in physical memory. In that case, this program cannot print "-1 == 0", because it has to dereference ptr1 to determine that.

Another is to follow the letter of the C89 spec, which says that the only things that can change in the abstract machine's view of the world other than via a C statement are things marked volatile. In that case, this program is allowed to print "-1 == 0" or "-1 != 0" depending on whether ptr1 == ptr2, because the implementation "knows" that it is the only thing that can assign a value to *ptr1, and thus it "knows" that because no-one has assigned through *ptr1 since it read the value to get test it is tautologically true that *ptr1 == test.
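
To illustrate (a sketch of one legal outcome under this reading, not the only one): the implementation could fold the second conjunct away and compile the test above as if it were written:

if (ptr2 == ptr1) {    /* "*ptr1 == test" folded to true: nothing was */
    puts("-1 == 0");   /* assigned through ptr1 since test was read   */
} else {
    puts("-1 != 0");
}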

Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile. But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

And that's the fundamental issue with rewinding to C89 rules - C89 implies very strongly that the only interface between things running on the "abstract C89 machine" and the real hardware are things that are marked as volatile in the C abstract machine. In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

Note, too, that I wasn't talking about optimization in either case - I'm simply looking at the semantics of the C abstract machine as defined in C89, and noting that they're not powerful enough to model a change that affects the abstract machine but happens outside it. I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:00 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile.

No. Calling fread and fwrite can certainly change things, too.

> But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

Yes and no. The change is dramatic, sure. But it's most definitely visible to C code.

By necessity such things have to be implemented either with volatile or by calling a system routine (which must be added to the list of functions like fread and fwrite as a system extension, or else you couldn't use them).

The place where you pass your pointer to invlpg would be the place where the compiler knows that an object may suddenly change value.

> In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

In practice people who do these things have to use volatile at some point in the kernel, or else it just wouldn't work. Thus I don't see what you are trying to prove.

The fact that real OSes have to expand the list of “special” functions which may do crazy things? That's obvious. In practice your functions are called mmap and munmap, and they should be treated by the compiler similarly to read and write: the compiler either has to know what they are doing, or it should assume they may touch and change any object they can legitimately reach given their arguments.

> I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

No. You couldn't do things like changing PTEs in a fully portable C89 program. It's pointless to talk about such programs since they don't exist.

The only way to do it is via asm and/or a call to a system routine which, by necessity, needs extensions to the C89 standard to be usable. In both cases everything is fully defined.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:10 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

fread and fwrite are poor examples, because they are C code defined in terms of the state change they make to the abstract machine, and with a QoI requirement that the same state change happens to the real machine. Indeed, everything that's defined in C89 has its impact on the abstract machine fully defined by the spec; the only get-out is that volatile marks something where all reads and writes through it must be visible in the real machine in program order.

But note that this is a very minimal promise; the only thing happening in the real machine that I can reason about in C89 is the program order of accesses to volatiles. Nothing else that happens in the abstract machine is guaranteed to be visible outside it - everything else is left to the implementation's discretion.

And no, the state change is not visible inside the C89 abstract machine; if I write through a volatile pointer to a PTE, the implementation must ensure that my write happens in the real machine as well as the abstract machine, but it does not have to assume that anything has changed in the abstract machine. That, in turn, means that it may not know that ptr1 now has changed in the "real" machine, because it's not volatile and thus changes in the real machine are irrelevant.

And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change. Now, depending on the architecture, that almost certainly is not enough to guarantee an instant change - e.g. on x86, the processor can use either the old or the new value of the PTE until the TLB is flushed, and I can force a TLB flush with invlpg to get deterministic behaviour - but I can bring the program into a non-deterministic state without calling into an assembly block or running a system routine, as long as I have the address of a PTE.

And there's no "list of system routines" in C89; the behaviour of fread, fwrite and other such functions is fully defined in the abstract machine by the spec, with a QoI requirement to have their behaviour be reflected in the "real" machine. By invoking the idea of a "list of system routines", you're extending the language beyond C89.

You're making the same mistake a lot of people make, of assuming that the behaviour of compilers in the early 1990s and earlier reflected the specification at the time, and wasn't just a consequence of limited compiler technology. If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible; provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89) while still allowing the compiler to reason about the meaning of your code in a way compatible with the C89 specification.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:01 UTC (Tue) by khim (subscriber, #9252) [Link]

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Which is the only way to have PTEs in C code.

> And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change.

If you have such an address then you have to extend the language somehow. Or, alternatively, don't touch it.

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Of course. Because it's impossible to write a C89 program which changes the PTEs, such a concept just couldn't exist in it. You have to extend the language to cover that use case.

> If compilers really did implement C89 to the letter of the specification

…then such compilers would have been as popular as ISO 7185 Pascal. Meaning: no one would have cared about them and no one would have mentioned their existence.

> If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible

Yes. But some programs would still be possible. Programs which do tricks with pointers would work just fine; programs which touch PTEs wouldn't.

> provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89)

Citation needed. Badly. Because, I would repeat once more, in C89 (not in C99 and newer) the notion of “pointers which have the same bitpattern yet are different” doesn't exist. If you add a few bits to the pointer converted to an integer and then clear those same bits, you get the exact same pointer back — guaranteed. The fact that these bits are the lowest bits of the converted integer is an implementation-specific thing; you can imagine, e.g., a case where they would live as the top bits instead. So yes, that requires some implementation-specific extension. But a pretty mild and limited one.
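
A minimal sketch of that round-trip (assuming, per the implementation-specific parts just mentioned, that pointers convert losslessly to unsigned long and that an int's alignment leaves the low bit clear):

#include <assert.h>

int x;

int main(void)
{
    unsigned long u;
    int *p;

    u = (unsigned long)&x; /* pointer -> integer                       */
    u |= 1UL;              /* stuff a tag into the (assumed) low bit   */
    u &= ~1UL;             /* clear the very same bit again            */
    p = (int *)u;          /* integer -> pointer                       */
    assert(p == &x);       /* the exact same pointer back, guaranteed  */
    return 0;
}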

Yes, provenance is an attempt to deal with the idea of C99+ that some pointers may be equal to others yet, somehow, still different — but that's not allowed in C89. If two pointers are equal then they are either both null pointers, or both point to the same object, end of story.

Sure, it makes some optimizations hard and/or impossible. So what? This just means that you cannot do such optimizations in C89 mode. Offer an -fno-provenance option, enable it for -std=c89, done.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:26 UTC (Tue) by Vipketsh (guest, #134480) [Link] (5 responses)

You are playing a nice sleight-of-hand here. Your code clearly shows function calls, and without knowing what's in those functions, one has no choice but to assume that *ptr1 has changed (within the C abstract machine), and thus you can never get the "-1 == 0" output. If, on the other hand, you do see the internals of those functions, you will see the modification of some memory that may alias ptr1, and because of that you again cannot get the "-1 == 0" output.

The only way I can see your reasoning working is if you are somehow allowed to assume that function calls are an elaborate way of saying "nop".

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:36 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

If I promise that the function calls are just names for what the code does, but its real behaviour is poking global volatile pointers, and those functions are implemented in pure C89, there's no difference in behaviour. Given the following C definitions of get_zeroed_page, get_ptr_to_handle and release_page, you still have the non-determinism, albeit I've introduced a platform dependency:


#include <assert.h>
#include <stddef.h>
#include <string.h>

const size_t PAGE_SIZE; // Likewise initialized by outside code
struct pt_entry {
    volatile char *v_addr;
    volatile char *p_addr;
};
volatile struct pt_entry *pte; // Initialized by outside code, with suitable guarantees on v_addr for the compiler implementation and on p_addr
int *page_location = pte->v_addr;

void *get_zeroed_page() {
    pte->p_addr += PAGE_SIZE;
    memset(pte->v_addr, 0, PAGE_SIZE);
    return pte->v_addr;
}

void release_page(void *handle) {
    assert(handle == pte->v_addr);
    pte->p_addr -= PAGE_SIZE;
}

void *get_ptr_to_handle(void *handle) {
    assert(handle == pte->v_addr);
    return page_location;
}

This has semantics on the real machine, because of the use of volatile - the writes to *pte are guaranteed to occur in program order. But the compiler does not have any way to know that volatile char *v_addr ends up with the same value between two separate calls to get_ptr_to_handle yet points to different memory.

Also, I'd note that C89 does not have language asserting what you claim - it actually says quite the opposite, that the compiler does not have to assume that *ptr1 has changed within the C abstract machine, since ptr1 is not volatile. It's just that early implementations made that assumption because to do otherwise would require them to analyse not just the function at hand, but also other functions, to determine if *ptr1 could change. Like khim, you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:22 UTC (Tue) by Vipketsh (guest, #134480) [Link] (2 responses)

I still don't see it. You have that memset here:

> void *get_zeroed_page()
> [...]
> memset(pte->v_addr, 0, PAGE_SIZE);

If you don't have to assume that this writes over data pointed to by some other pointer, it means that your aliasing rules say that no two pointers alias. Or, put another way, for all practical purposes, having two pointers pointing to the same thing is unworkable. By some reading of C89 that may be the conclusion, but quite clearly that was never the intent, and exactly no one expects things to work that way (including compiler writers, oddly enough).
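
A minimal sketch of the point (hypothetical names, separate from the snippet above):

#include <string.h>

static char page[4096];

/* Stand-in for a function whose body visibly writes the page. */
static void wipe(void)
{
    memset(page, 0, sizeof page);
}

int demo(void)
{
    char *ptr1 = page;

    *ptr1 = -1;    /* ptr1 aliases page                             */
    wipe();        /* the memset writes over what ptr1 points to... */
    return *ptr1;  /* ...so this must be reloaded: it is 0, not -1  */
}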

> compiler does not have to assume that *ptr1 has changed within the C abstract machine

You mean across a function call? That quite simply means that no data could ever be shared by any two functions (in different compilation units). Again, this would make the language completely unworkable and run counter to what anyone expects.

> [...] you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

No. The language is defined, first and foremost, by what existing programs expect. If the standard allows interpretations and compilers to do things counter to what a majority of these programs expect, it is the standard that is broken and not the majority of all programs. I firmly believe that the job of a standard is to document existing behaviour and not to be a tool to change all the programs out there.

p.s.: I find it fascinating that instead of arguing about actual behaviour, the C standard keeps coming up as if it were a bible handed down by some higher power and everything in it is completely infallible. Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 20:10 UTC (Tue) by khim (subscriber, #9252) [Link]

> Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

Rust hasn't needed a spec because until very recently there was just one compiler. Today we have 1.5: LLVM-based rustc and GCC-based rustc. One more compiler is in development, thus I assume a formal spec will be higher on the list of priorities now.

This being said, IMNSHO it's better to have no spec than to have one which is silently violated by actual compilers. At least when there is no spec you know that discussions between compiler developers and language users have to happen; when you have one which is ignored…

DeVault: Announcing the Hare programming language

Posted May 10, 2022 23:48 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

Rust doesn't have anything similar to ISO/IEC 14882:2020 (the C++ standard), a large written document which is the product of a lot of work but is of limited practical value, since it doesn't describe anything that actually exists today.

However, Rust does extensively document what is promised (and what is not) about the Rust language and its standard library, and especially the safe subset which Rust programmers should (and most do) spend the majority of their time working with.

For example, all that ISO document has to say about what happens if I've got two byte-sized signed integers which happen to have the value 100 in them and I add them together is that this is "Undefined Behaviour", offering no suggestion as to what to do about that besides trying to ensure it never happens. In Rust, the "no specification" tells us that this will panic in debug mode but, if it doesn't panic (because I'm not in debug mode and didn't enable this behaviour in release builds), it will wrap to -56 (100 + 100 = 200 doesn't fit in a signed byte, and 200 - 256 = -56). I don't know about you, but I feel like "absolutely anything might happen" is less specific than "the answer is exactly -56".

Rust also provides plenty of alternatives, including checked_add(), unchecked_add(), wrapping_add(), saturating_add() and overflowing_add(), depending on what you actually mean to happen on overflow, as well as the type wrappers Saturating and Wrapping, which are useful here (e.g. Saturating<i16> is probably the correct type for a 16-bit signed integer used to represent CD-style PCM audio samples).

DeVault: Announcing the Hare programming language

Posted May 10, 2022 23:48 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

The access to volatile memory at pte->v_addr through the non-volatile pointer ptr1 is UB because according to [C89]:

> If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.[57]

> [57] This applies to those objects that behave as if they were defined with qualified types, even if they are never actually defined as objects in the program (such as an object at a memory-mapped input/output address).

Objects in the page at pte->v_addr behave as if they were defined as volatile objects because the content changes in ways not described by the C abstract machine when pte->p_addr is updated. The same applies to passing a pointer to volatile object(s) to memset(), which takes a non-volatile pointer.
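
A minimal sketch of that footnote in action (MMIO_ADDR is a made-up memory-mapped I/O address):

#define MMIO_ADDR 0x40000000UL

volatile int *ok = (volatile int *)MMIO_ADDR; /* volatile-qualified lvalue  */
int *bad = (int *)MMIO_ADDR;                  /* non-volatile lvalue        */

int read_ok(void)  { return *ok;  } /* fine: the access stays volatile     */
int read_bad(void) { return *bad; } /* undefined behaviour per footnote 57 */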

The initializer for page_location (pte->v_addr) is also not a constant, but I assume this is just pseudo-code for the value being set by some initialization function not shown here.

[C89] http://port70.net/~nsz/c/c89/c89-draft.html

