
DeVault: Announcing the Hare programming language

Posted May 2, 2022 7:25 UTC (Mon) by roc (subscriber, #30627)
Parent article: DeVault: Announcing the Hare programming language

I don't want to use systems built in a "trust the programmer" language. I don't trust any programmers, including myself. (OK, I trust DJB, but him only.)

So I'm curious what the "robust" claim in the language blurb actually means. Programs written in this language will only be robust if the programmer is perfect, which makes it a stone-soup kind of claim. Does it mean that the language specification and the compiler are robust?



DeVault: Announcing the Hare programming language

Posted May 2, 2022 7:34 UTC (Mon) by ddevault (subscriber, #99589) [Link] (148 responses)

The key is in the point that follows "trust the programmer":

> Provide tools the programmer may use when they don’t trust themselves.

If you don't trust yourself, make use of the tools. The language does the right thing by default and trusts you when you tell it you know better, which is required to handle certain systems programming use-cases that are well supported by C but tend to get marginalized by other languages.

A lot of people seem to latch onto the first point and fire straight into criticism without considering the second. Believe me, we know programmers are untrustworthy, which is what these tools are for - but you need to trust them to get certain things done.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 7:44 UTC (Mon) by roc (subscriber, #30627) [Link] (143 responses)

I didn't see any tools for preventing use-after-free bugs.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 7:46 UTC (Mon) by ddevault (subscriber, #99589) [Link] (128 responses)

No, there are no tools to prevent use-after-free. Hare uses manual memory management. There are other tools, like bounds checked slices and arrays, mandatory error handling, nullable pointer types, and so on, but it's not as comprehensive in this respect as, say, Rust, by design.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 8:32 UTC (Mon) by atnot (subscriber, #124910) [Link] (127 responses)

That seems at least mildly concerning, considering the fact that e.g. the majority of 0days analyzed by google were use-after-frees: https://googleprojectzero.blogspot.com/2022/04/the-more-y...

Of course, the tooling for detecting uaf is complex. But it's complex because especially in concurrent programs, uaf is hard to get right, both for humans and computers. I don't find the idea of just glossing over that in 2022 particularly compelling.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 8:34 UTC (Mon) by ddevault (subscriber, #99589) [Link] (122 responses)

Security is just one of many traits we're balancing in Hare, and it's weighed against other trade-offs. We came away with a different answer than Rust. Let's see how it performs in practice before judging it too harshly based on speculation over what its vulnerabilities might look like.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:29 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (121 responses)

Do you have a theory for why use-after-free is less likely in Hare than in C? If not, what is the metric you're going to use for determining whether the trade-off is preferable to Rust's?

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:36 UTC (Mon) by ddevault (subscriber, #99589) [Link] (120 responses)

Use-after-free may or may not be less likely in Hare than C (I really couldn't say off-hand). The trade-offs to which I refer are broader than simply use-after-free, however: it's the entire domain of safety-oriented language design. Use-after-free is one issue we've chosen not to address, though we have addressed many others, and unlike many Rust advocates, I don't think that writes off the entire language.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:49 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (119 responses)

I think we're at the point in history where anyone who writes a compiler that permits use-after-free should be held liable for anyone who manages to fuck up as a result of that. Security issues aren't a matter of inconvenience now - we've reached a level of pervasive tech that results in people literally dying as a result of memory unsafety. If you're fine with that then hey, go with it, but you should be explicit about making design choices that increase the risk of actually awful outcomes instead of punting that to the people who use your language.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:52 UTC (Mon) by ddevault (subscriber, #99589) [Link] (117 responses)

There are significantly more non-life-critical use-cases than there are life-critical use-cases. At no point has anyone suggested that Hare be used to write pacemaker firmware. The introduction to the crypto module is also quite serious about security obligations:

> Cryptography is a difficult, high-risk domain of programming. The life and well-being of your users may depend on your ability to implement cryptographic applications with due care. Please carefully read all of the documentation, double-check your work, and seek second opinions and independent review of your code. Our documentation and API design aims to prevent easy mistakes from being made, but it is no substitute for a good background in applied cryptography.

Not everyone is working on airplane guidance systems.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:55 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

> Not everyone is working on airplane guidance systems.

No, some of them are writing chat apps or media decoders or other things that simultaneously contain private data and attempt to parse untrusted data. The set of useful apps you can write these days that are at zero risk of memory unsafety is tiny, and it doesn't need to be a traditionally safety-critical use case to risk deaths as a result.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:35 UTC (Mon) by roc (subscriber, #30627) [Link] (19 responses)

The problem is that any code exposed to potentially malicious input is a security attack surface. And even if you don't care about your device being compromised, it's still a hazard for others; e.g. any compromised network-attached device can be part of a botnet for DDoS or a relay for ransomware attacks.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:38 UTC (Mon) by ddevault (subscriber, #99589) [Link] (18 responses)

Aye, this is why Hare *does* offer some security features. We just don't offer all of the same features as Rust does, and that's fine. It's a different set of trade-offs.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 11:33 UTC (Mon) by khim (subscriber, #9252) [Link] (6 responses)

The majority of problems in real-world C/C++ programs come from issues with memory management. Unless you can explain what you have done to avoid this issue, other security features are not all that interesting.

It's like making a car and then deciding that adding wheels to it would be too hard. Such a car would still be useful for some things, but it wouldn't be useful to most people.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 11:36 UTC (Mon) by ddevault (subscriber, #99589) [Link] (4 responses)

I have already explained Hare's security features many times, some of which relate directly to memory management, such as nullable pointer types or bounds-checked slices. Memory issues are much more rare in Hare programs than in C programs.

Part of Hare's goal is to question the orthodoxy pushed by Rust and its proponents and to look for a different solution. No, it's not Rust. No, it's not going to be Rust. That does not rule it out as an interesting language.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 13:56 UTC (Mon) by ncm (guest, #165) [Link] (2 responses)

It does, when you demonstrate you have utterly failed to comprehend Rust's, let alone C++'s, value proposition.

It is 2022, not 1972. Bring something new and valuable to the table. Don't waste our time on 1970s retreads.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 13:58 UTC (Mon) by ddevault (subscriber, #99589) [Link] (1 responses)

Rest assured that I fully understand the value Rust supposes it offers, and C++ also. You have not invented the God language, and you are not His servant. If you aren't interested in Hare, then fine, more power to you, but bugger off while we build the language we want - not the language you want.

*Sigh*

Posted May 2, 2022 14:02 UTC (Mon) by corbet (editor, #1) [Link]

Someday we'll be able to post an item on programming languages without things degrading this way. Evidently not today. Let's all please back off a bit now.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 23:48 UTC (Mon) by roc (subscriber, #30627) [Link]

> Memory issues are much more rare in Hare programs than in C programs.

Do you have evidence this is true in practice? I guess you've fuzzed Hare programs and equivalent C programs and measured the bug rates?

DeVault: Announcing the Hare programming language

Posted May 2, 2022 16:40 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

> The majority of problems in real-world C/C++ programs come from issues with memory management

I disagree with that claim. At least it's not what I'm suffering the most from, though such issues are among the longest to debug. Most problems I'm seeing actually result from undefined behaviors (especially recently, since compilers started to abuse them to win the compiler war), and from integer type promotion, which is a real mess in C and started to turn something benign into real bugs with the arrival of size_t everywhere and 64-bit machines while dealing with plenty of APIs written for ints. These are often considered to be memory management issues because they result in buffer overflows and so on, but they actually are not at all.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 13:14 UTC (Mon) by HelloWorld (guest, #56129) [Link] (10 responses)

So what trade-off is that? What does it buy you not to prevent resource leaks and use-after-free bugs statically? Really the only answer I can think of is that it makes the language easier to learn. So you save perhaps a couple of weeks of fighting with the borrow checker in order to later spend months debugging leaks and use-after-free issues – assuming you even find them before they cause a production outage and cost you millions of dollars. That doesn't seem like a good trade-off to me.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 13:20 UTC (Mon) by ddevault (subscriber, #99589) [Link] (9 responses)

The trade-offs come in the form of the complexity of the language and its implementation, which in Rust's case is utterly out of control. I'm not convinced that borrow checking cannot be done more simply - we intend to research this for Hare - but it's not the holy grail of programming which creates a moral lower class among other languages, as the Rust adherents in this thread seem to believe.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:17 UTC (Mon) by linuxrocks123 (subscriber, #34648) [Link] (5 responses)

How about this: keep track of all the pointers in the runtime, and, when a memory region is freed, replace all the pointers that refer to it with NULL. Then, if anyone tries to do a use after free, they'll get an immediate crash. This wouldn't be a feature you'd want to have turned on for production code, but it could help a lot during development and testing, and it would avoid burdening the programmer with additional borrow-checking rules. I hope, also, that you'll allow bounds checking and similar features to be turned off after debugging and testing are done: I'd rather not pay the cost of those debugging aids in production.
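
As a rough illustration, the proposal amounts to something like this (a minimal sketch; all names are invented, and it also shows where the cost lies: every free has to scan the whole registry):

#include <stddef.h>
#include <stdint.h>

/* Registry of the addresses of all live pointer variables. */
#define MAX_TRACKED 4096
static void **tracked[MAX_TRACKED];
static size_t ntracked;

/* The runtime would call this whenever a pointer is stored somewhere. */
void track_pointer(void **slot) {
    if (ntracked < MAX_TRACKED)
        tracked[ntracked++] = slot;
}

/* On free, null out every registered pointer into the freed region,
 * so any later use-after-free faults immediately. */
void debug_free(void *p, size_t size) {
    uintptr_t lo = (uintptr_t)p, hi = lo + size;
    for (size_t i = 0; i < ntracked; i++) {
        uintptr_t v = (uintptr_t)*tracked[i];
        if (v >= lo && v < hi)
            *tracked[i] = NULL;
    }
    /* actual deallocation elided */
}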

The reception you've gotten from others in this comment section is very unfortunate, although, given the people involved, I'm also not terribly surprised they're acting this way. A good philosophy for handling negativity from ignorant, opinionated blowhards is to just ignore them and keep doing what you love. I'm sure you'll make something great with Hare.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:19 UTC (Mon) by ddevault (subscriber, #99589) [Link] (4 responses)

>How about this: keep track of all the pointers in the runtime

This is a non-starter, it's too expensive. Interesting idea, though. We are planning on writing an optional debug allocator which will address many memory-related bugs, such as use-after-free, in a similar manner to valgrind.

> The reception you've gotten from others in this comment section is very unfortunate, although, given the people involved, I'm also not terribly surprised they're acting this way. A good philosophy for handling negativity from ignorant, opinionated blowhards is to just ignore them and keep doing what you love. I'm sure you'll make something great with Hare.

Thanks :)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:53 UTC (Mon) by linuxrocks123 (subscriber, #34648) [Link] (3 responses)

> This is a non-starter, it's too expensive.

What if you used a pool allocator to separate the pointers from the non-pointers? That would add a level of indirection to structs that had pointers, since they'd have to be converted to pointers to pointers, and also to pointers on the stack, since ditto. But then you'd have a nice contiguous array of all your pointers that you'd just have to scan upon calls to free, and you might be able to use SIMD for that.

If you've thought of that and it's too expensive, I'll stop now, but I just thought I'd mention it in case you hadn't thought of it :)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:55 UTC (Mon) by ddevault (subscriber, #99589) [Link]

I don't have time to give this idea the thought it deserves right now, but thanks for sharing. Added it to my notes for improvements to allocator insights.

Use-after-free checking at low runtime cost

Posted May 4, 2022 1:12 UTC (Wed) by akkartik (guest, #158307) [Link] (1 responses)

Since you seem interested in this space, I'll throw out one idea I particularly like and have used in a project [1]: manage heap allocations using a fat pointer that includes an allocation id. The pointer contains the allocation id and so does the payload. Every dereference of the fat pointer compares the allocation id in the pointer and payload. Freeing an allocation resets its allocation id. Future allocations that reuse the allocation will never generate the same allocation id. A use-after-free dereference then leads to an immediate abort, which is easier to debug and more secure.

The overhead of this scheme is too great for most C/Rust programmers, but I think it's much lower than tracking all pointers or indirections in structs containing pointers.

[1] https://github.com/akkartik/mu
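
For concreteness, the scheme looks roughly like this in C (a sketch under the assumptions above; the names are invented, and it assumes freed blocks are recycled through a pool so their headers stay readable):

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t id; /* payload follows the header */ } header;
typedef struct { header *h; uint64_t id; } fatptr;

static uint64_t next_id = 1;

fatptr fat_alloc(size_t size) {
    header *h = malloc(sizeof(header) + size);
    if (h == NULL)
        abort();
    h->id = next_id++;              /* ids are never reused */
    return (fatptr){ h, h->id };
}

void *fat_deref(fatptr p) {
    if (p.h->id != p.id)
        abort();                    /* use-after-free: immediate and debuggable */
    return (void *)(p.h + 1);
}

void fat_free(fatptr p) {
    if (p.h->id != p.id)
        abort();                    /* catches double-free too */
    p.h->id = 0;                    /* stale fat pointers now mismatch */
    /* block returns to a pool; its next user gets a fresh id */
}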

Use-after-free checking at low runtime cost

Posted May 4, 2022 13:22 UTC (Wed) by HelloWorld (guest, #56129) [Link]

The best that a run-time check for this sort of thing can do is turn one bug into a different kind of bug, at a considerable performance cost. While that can be useful for legacy programming languages like C (primarily as a debugging tool), it's simply the wrong approach for new languages. Modern programming language design should be focused on statically preventing bugs, and messing around with run-time checks is simply a waste of time.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:54 UTC (Mon) by HelloWorld (guest, #56129) [Link] (1 responses)

> I'm not convinced that borrow checking cannot be done more simply - we intend to research this for Hare
I don't think something this fundamental can be retrofitted. It specifically says on the Hare website that you intend to place a strong emphasis on backward compatibility, which means that once programs with lifetime issues (leaks/use-after-free) are out there, the compiler needs to be able to compile them, and thus the same bugs can occur in new code as well.

I wish you well with your efforts regarding a simpler borrow checker.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:57 UTC (Mon) by ddevault (subscriber, #99589) [Link]

To be clear, we're only committed to backwards compatibility following 1.0. Borrow checker research will be done prior to 1.0.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 23:57 UTC (Mon) by roc (subscriber, #30627) [Link]

Optimizing for a simpler language and implementation makes sense when your audience is mostly the people working on the language and implementation. It makes less sense when your audience is mainly people developing in the language, and even less sense when you also consider users of software developed in the language.

Language simplicity is good for developers, of course, but absence of memory corruption and data races provides another kind of simplicity. (And I think Rust is relatively free of the kind of "accidental" complexity that C and C++ are full of.)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 16:49 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (95 responses)

> Not everyone is working on airplane guidance systems.

And bugs in such environments are usually much nastier, like NaN propagation due to a 0/0 or inf-inf somewhere in a calculation, int-to-float-to-int loss of precision, etc. There's actually a very wide class of bugs that result from people deliberately ignoring the target system and standards, and constantly relying on the compiler to hide problems is not going to make these classes of issues disappear, quite the opposite. I've seen developers ask me "what's so special with this value 4294967295?". You definitely never want to climb on a plane that uses that person's code. Even with the help of a compiler...

DeVault: Announcing the Hare programming language

Posted May 2, 2022 17:14 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (94 responses)

Combining this:

> There's actually a very wide class of bugs that result from people deliberately ignoring the target system and standards

with a sibling reply of yours[1]:

> Most problems I'm seeing actually result from undefined behaviors (especially recently when compilers started to abuse them to win the compilers war)

If one doesn't like the compiler (ab)using undefined behaviors in the standard, isn't that a consequence of ignoring the standard? Sure, one could argue that the *standard* is the silly thing here, but what compiler is going to give "what I meant" to you? And at that point, are you really coding in C anymore?

I feel like if you were to apply the sentiment expressed here to the complaints in the other one, you'd be cursing the developer for writing code that didn't adhere to the standard rather than the compiler for (arguably rightly) not understanding what the developer actually wanted, because it wasn't communicated properly.

[1] https://lwn.net/Articles/893500/

DeVault: Announcing the Hare programming language

Posted May 2, 2022 17:38 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (93 responses)

> Sure, one could argue that the *standard* is the silly thing here, but what compiler is going to give "what I meant" to you? And at that point, are you really coding in C anymore?

C code that only works with -fwrapv is still C code.
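
For instance, this classic post-hoc overflow test is UB under ISO C (the compiler may delete the check) but becomes reliable with -fwrapv (a minimal illustration, not from the thread):

int next_seq(int id) {
    if (id + 1 < id)    /* with -fwrapv, true exactly when id is INT_MAX */
        return 0;       /* wrap back to the start */
    return id + 1;
}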

DeVault: Announcing the Hare programming language

Posted May 2, 2022 17:51 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (92 responses)

Sure, if it only ever gets built in an environment where such a flag is provided. But vendoring involves the moral equivalent of `$(CC) subdir/*.c` often enough that claiming it as C in such a situation is quite dangerous.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:36 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (91 responses)

Let me be less flippant and more explicit: C was originally conceived as a more portable assembler. UB did not mean "the compiler twists your code into a pretzel," it meant "we don't know what will happen, because there's an obscure implementation from 1972 that traps, and then the OS gets involved, and then who the hell knows what happens after that, so therefore if you do this thing, weird stuff may happen on that one implementation." If you knew, for a fact, that you were not targeting this obscure implementation, then you could write UB just fine, and treat it as if it were implementation-defined instead.

Many of the UB rules were written to support architectures that, by modern standards, are just not things people use any more. Nobody uses ones' complement or sign-magnitude for integers. Hardly anybody uses sNaNs. Segmented architectures are very uncommon these days, as is the infamous NaT bit from Itanium. In a sane world, I'd be able to just say "well, I don't care about targeting any of those weird platforms; if you want to support them, you're on your own."

But I can't say that, because compiler writers have decided that UB means "the compiler twists your code into a pretzel." I understand that some of these optimizations do improve performance in various ways, but they also result in things like optimizing out NULL checks if you can prove that the pointer was previously dereferenced (a problem which has struck the kernel more than once, as I recall). I just wish we could have more of a happy medium, where you don't get UB unless you actually corrupt the heap, overflow the stack, or some similar catastrophe. All other forms of UB, IMHO, should have been specified as "implementation-defined, but the implementation may specify UB if it is unable to provide any guarantees." Then at least we could characterize this as a quality-of-implementation issue.
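
The pattern in question looks schematically like this (not the actual kernel code):

struct device { int flags; };

int device_flags(struct device *dev) {
    int flags = dev->flags;  /* dereference: the compiler may infer dev != NULL... */
    if (dev == NULL)         /* ...and delete this check as provably dead code */
        return -1;
    return flags;
}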

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:14 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> Let me be less flippant and more explicit

Thanks :) .

> In a sane world, I'd be able to just say "well, I don't care about targeting any of those weird platforms; if you want to support them, you're on your own."

I agree that some kind of:

std::static_assert(std::target::is_twos_complement()); // definitely required
std::warn_on(!std::target::os::any_of(std::target::os::macos, std::target::os::linux)); // Windows is untested, but allow tinkering

support would be wonderful. But that's missing and is currently only metadata.

> But I can't say that, because compiler writers have decided that UB means "the compiler twists your code into a pretzel."

The twos complement example gets used all the time, but I heard another tale of why signed overflow is undefined: type promotion. If I iterate over a `short` and overflow is important, promoting to an int to access faster instructions or whatever and just letting it overflow to `USHRT_MAX+1` is no longer a valid optimization. C type promotion is weird, and the differences between signed and unsigned promotions are likely reasons why one has defined extrema behaviors, but I don't know that much about that side of it. I admit that my grokking of the relevant rules is fuzzy at best and compiler warning-guided at worst.

Rust's `Wrapping` and `Saturating` types make this far better by making such assumptions explicit. The debug check vs. release free-for-all is an acknowledgement that such behavior is fine in most cases, but should be considered. Those that want specific assumptions recognized should really tell the compiler about them. Rust has ways of doing so; C and C++ still lack them.
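
For contrast, the closest C gets today is through compiler extensions rather than anything in the standard (a sketch using the GCC/Clang overflow builtins; the function names are made up):

#include <limits.h>

int saturating_add(int a, int b) {
    int r;
    if (__builtin_add_overflow(a, b, &r))   /* GCC/Clang builtin, not ISO C */
        return b > 0 ? INT_MAX : INT_MIN;
    return r;
}

unsigned wrapping_add(unsigned a, unsigned b) {
    return a + b;   /* unsigned arithmetic wraps by definition */
}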

> I understand that some of these optimizations do improve performance in various ways, but they also result in things like optimizing out NULL checks if you can prove that the pointer was previously dereferenced (a problem which has struck the kernel more than once, as I recall).

Yes. I'd love the compiler to signal "hey, I made UB-assumptions", but that is apparently very non-trivial in practice. At least with current compiler architectures. Maybe someone will implement better origin tracking when working on LLVM bitcode or GCC's GIMPLE to "see" what was actually written, but I don't see new C++ compiler infrastructure getting anywhere in the next 5 years seeing how many have "given up" and migrated to being LLVM/Clang reskins (though many were EDG skins, this consolidation is not promising IMO). I suspect the prevalence of macro-stamped code and optimizing inline code makes this difficult because you *want* such optimizations in those cases. Template instantiation almost certainly has similar problems.

Additionally, Ralf Jung's blogs[1] on[2] provenance[3] show that there are ways that different passes, while fine on their own, may *combine* to abuse UB into something unintended (though this is more about showing that provenance has to be a thing than UB in general, I would be surprised if there were no cases of such behaviors between various dead code and value analysis optimization passes). I have no idea how that is expected to be tracked and handled internally.

[1] https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
[2] https://www.ralfj.de/blog/2020/12/14/provenance.html
[3] https://www.ralfj.de/blog/2022/04/11/provenance-exposed.html

DeVault: Announcing the Hare programming language

Posted May 3, 2022 10:30 UTC (Tue) by nye (subscriber, #51576) [Link] (4 responses)

> things like optimizing out NULL checks if you can prove that the pointer was previously dereferenced

This sort of thing is exactly why we shouldn't be asking for languages which "trust the programmer". By dereferencing a pointer, according to the rules of C, the programmer is instructing the compiler that it can assume the pointer is not null. It seems churlish to then turn around and complain that the compiler took you at your word.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:01 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (3 responses)

Quite the opposite. In practice I *know* that dereferencing a NULL on any modern platform causes a SEGV, and I'm using it exactly for that purpose. In C that's UB, so compilers decided that since the program doesn't exist afterwards it's not their problem and they can eliminate the code. Except that my code was there precisely to provoke a panic and crash the program before it degenerates, while still preserving registers and frame pointer intact. Using abort() is not an option for this (it ruins everything, and sometimes you can't even unwind the stack when you're mailed a core produced with libs that are not exactly yours). Result: I had to cheat and dereference (int*)1 because the compiler didn't know it was NULL as well. It's constantly a cat-and-mouse game between C developers and compiler developers, with the former saying "let me use my processor and OS for the purpose they were built" and the latter saying "we don't want you to do that because that's stupid". It may be stupid from a compiler developer's point of view, but if all C developers were compiler developers we wouldn't need gcc nor clang: we would each develop our own compiler. So please let us dictate to the compiler what we want to do, so we can use our hardware in peace.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:44 UTC (Tue) by excors (subscriber, #95769) [Link]

> In practice I *know* that dereferencing a NULL on any modern platform causes a SEGV and I'm using it exactly for that purpose.

That depends on what you consider a modern platform - there are ARMv8-M microcontrollers where address 0 is the start of flash, and reading NULL will return some non-zero data (because that's where the vector table is stored). (Writing to NULL might raise an exception or actually modify flash, depending on the current configuration of the flash controller.)

> Except that my code was there precisely to provoke a panic and crash the program before it degenerates

If you're explicitly writing to NULL, Clang will optimise it away but helpfully tells you:

> warning: indirection of non-volatile null pointer will be deleted, not trap [-Wnull-dereference]
> *(int *)0 = 0;
> ^~~~~~~~~
> note: consider using __builtin_trap() or qualifying pointer with 'volatile'

volatile prevents the optimisation, so it will emit the write instruction. __builtin_trap typically emits an undefined instruction (ud2 on x86), which will crash even on systems where 0 is a valid address.
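
Side by side, the two suggested alternatives look like this (a small illustration):

void die_by_segv(void) {
    *(volatile int *)0 = 0;  /* volatile: the store must be emitted; faults
                                wherever address 0 is unmapped */
}

void die_by_trap(void) {
    __builtin_trap();        /* GCC/Clang builtin; ud2 on x86, so it crashes
                                even where address 0 is readable */
}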

DeVault: Announcing the Hare programming language

Posted May 3, 2022 15:18 UTC (Tue) by nye (subscriber, #51576) [Link]

Oh come on - you're talking about deliberately writing something that you know is invalid in the chosen language and then complaining that the compiler implementer didn't do what you mean. And *then* getting angry about them not trusting you! I don't see how anyone could consider this a defensible position.

Do What I Mean

Posted May 3, 2022 16:27 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

Rather than "let me use my processor and OS for the purpose they were built", what you actually seem to want, though you don't seem aware of it, is "Do What I Mean", which is not deliverable - which is why you're unsatisfied.

First of all you will need to write what you meant; if you can do that, compilers can produce programs that do what you wrote and you'll be satisfied. But most often unhappy C programmers find that they really struggle to write what they meant, and that's where the problem lies, because if you can't write what you meant, then a program which does what you wrote won't do what you meant and you'll be unhappy.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 13:55 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

Exactly what I meant, thanks Kevin ;-)

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:00 UTC (Tue) by felix.s (guest, #104710) [Link] (83 responses)

> Many of the UB rules were written to support architectures that, by modern standards, are just not things people use any more. Nobody uses ones' complement or sign-magnitude for integers. Hardly anybody uses sNaNs. Segmented architectures are very uncommon these days, as is the infamous NaT bit from Itanium.

The fact that the popularity of those architectures has faded recently doesn’t make them any less legitimate, and doesn’t preclude reusing the ideas they are based on in the future.

Do you want to condemn software to being either re-written from scratch whenever a new architecture appears or subjected to costly, painful emulation of behaviour that nobody really wants, because it’s just too hopeless to attempt to scrutinise the entire codebase for any unportable assumptions it may contain?

This is (in a case of poetic spite, I must admit) more or less the situation Rust finds itself in when it comes to porting to CHERI architectures. Rust made the blithe assumption that size_t is the same as uintptr_t because ‘come on, nobody uses segmented architectures any more’, and named the common type usize. And then Morello appeared and suddenly there is an architecture where C is easier to port than Rust, for this and a couple of other reasons.

Last I checked, the direction Rust seems to be going is to swallow the bitter pill and define usize to be uintptr_t, accepting the resulting memory bloat in situations where size_t happens to be smaller.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:07 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (79 responses)

> Do you want to condemn software to being either re-written from scratch whenever a new architecture appears

That's not the point. Right now, with compilers abusing UB, there's no way for you as a user to have that portable code, because it will work on neither architecture. If the compiler just translated your code into machine code, thinking "this developer does stupid things, but that's their problem", then you could deal with special cases when you face them, as has always been done when porting code to other platforms or operating systems.

The problem got worse with UB abuse because the tricks you have to use to work around the compiler's stubbornness are even less portable than the original code itself. Is it normal that in 2022 I'm using more and more asm() statements to prevent the compiler from lurking into what I'm doing? I don't think so. It feels like one day my whole C code will only be a bunch of macros based on asm() statements. That's not my goal when I'm using a C compiler, really.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 15:07 UTC (Tue) by Wol (subscriber, #4433) [Link] (3 responses)

As somebody pointed out quite some while ago, compiler writers seem to be redefining defined behaviour as undefined.

What they SHOULD be doing is turning undefined behaviour into implementation-defined. "On an x86_64 system, we don't check for addition overflow. You get what the hardware gives you." NOT "if you're stupid enough to add two integers both large enough for the high-order bit to be set, we'll multiply them together instead and then give you the middle bytes of the result". Okay, that example is facetious, but as people keep pointing out, when the programmer knows enough to put an "if ptr is null" guard in place, they do NOT want the compiler deleting it as undefined behaviour! FFS, the programmer clearly *knows* something could be wrong, and has put a test in there for it!

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 3, 2022 18:24 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

> when the programmer knows enough to put an "if ptr is null" guard in place, they do NOT want the compiler deleting it as undefined behaviour! FFS, the programmer clearly *knows* something could be wrong

And when this is stamped-out code from a macro or template instantiation, should it also not be removed? What a silly optimization to leave on the cutting room floor. Compilers could probably better track this stuff to know the difference between macros and template code, but it doesn't seem to be high on the priority list right now.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:01 UTC (Wed) by khim (subscriber, #9252) [Link]

> Compilers could probably better track this stuff to know the difference between macros and template code, but it doesn't seem to be high on the priority list right now.

It's not high on the priority list because people just couldn't agree on which things should be retained and which shouldn't.

Without clear, consistent rules there is nothing to discuss. No matter what the compiler does or doesn't do, there will always be someone who claims it's the wrong behavior.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:56 UTC (Wed) by khim (subscriber, #9252) [Link]

> FFS, the programmer clearly *knows* something could be wrong, and has put a test in there for it!

What makes that test any different from the many other tests which the programmer knows the compiler knows how to remove?

There have been attempts to define a friendly C dialect, and all failed because people just couldn't agree which checks are superfluous and should be removed from the program and which are important and should be retained.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 16:40 UTC (Tue) by felix.s (guest, #104710) [Link] (70 responses)

> Is it normal that in 2022 I'm using more and more asm() statements to prevent the compiler from lurking into what I'm doing?

I don’t think so. I happen to do just the opposite, write the dumbest possible pure ISO C code and watch in satisfaction as the optimizer turns it into a compact and performant opcode sequence.

> That's not the point. Right now, with compilers abusing UB, there's no way for you as a user to have that portable code, because it will work on neither architecture.

There is a way: you avoid the cases that trigger UB, and rely only on what the abstract machine guarantees. I can agree this is not always easy, but there are tools to help you with that. As long as the abstract machine is implemented correctly and its invariants are upheld, the program will work on any target it is compiled for.

And yes, this is very much the point. Either you insist that the C abstract machine map exactly to the primitives of the platform it’s implemented on, even in cases that are undefined on the abstract machine itself, or you don’t.

If you don’t, you forfeit any right to complain that compilers ‘abuse’ UB: if it’s undefined, it’s undefined, and it doesn’t even have to act deterministically. Undefined behaviour can change when the hardware changes, when the OS changes, when the compiler changes, when the placement of your program within the address space changes, when the day of the week changes, when the precise location of all electrons within the atmosphere changes. You are expected to prevent the situation triggering UB from happening in the first place. If you don’t, it’s your fault.

If you do so require, you give up on optimizations and portability, including portability to future versions of the same architecture. You accept that people are going to say ‘I have learned to write null pointer checks, that’s why there isn’t one present.’ and there is nothing you can say to convince them otherwise. You agree that software is going to do crazy shit like forging pointers to memory it has no right to assume is there, assuming that 640K is enough for anybody, and relying on open bus behaviour, and all that has to be preserved in perfect detail as long as you want to keep it running, even cases that were erroneous to begin with, until it’s rewritten to rely on another platform’s implementation details. A throwback to DOS days, if you ask me.

This is (a somewhat exaggerated version of) the dilemma you face. There is no third way. Based on your response, it seems you prefer the latter.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 18:22 UTC (Tue) by atnot (subscriber, #124910) [Link] (24 responses)

> you give up on optimizations and portability

I feel like people often mention "giving up on some optimizations" without expressing the full implications of what that actually means. It sort of makes it sound like if those annoying compiler authors just simply removed the bad passes, one could sacrifice a few percent here or there to get more predictable behavior. Now, as others have pointed out, the first problem with this idea is that these confusing non-obvious results often come from combining multiple obvious passes with no clear culprit.

But for this specific case, as far as I can tell, making it well-defined to turn arbitrary addresses into pointers and dereference them, as would be required to make dereferencing null pointers valid, is a complete non-starter. It makes it basically impossible to perform any optimizations at all. You can no longer rely on anything in memory still being the way it was across function calls, so no more storing things in registers. You have to spill everything to the stack. Removing an unused variable? Can't do it, something might be getting the address of it from somewhere. Just assigned something to a struct member? Well, you can't be sure what it is now, because there was a function call in between, and the implementation of free() might have held onto the address of that allocation and fiddled with it in between.

You can definitely argue about overflow UB and such, sure. But without some level of understanding of allocations and what a pointer can and can't point to, it is basically impossible to do anything at all. I'm sure some would prefer it that way, making C an actual "portable assembler" with no abstract machine of its own. But that has huge implications.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:02 UTC (Wed) by Vipketsh (guest, #134480) [Link] (23 responses)

> making it well-defined to turn arbitrary addresses into pointers and dereference them [...]. You can no longer rely on anything in memory still being the way it was across function calls

Compilers cannot rely on that today, at least not in general. Without the seldom-used 'pure' and 'const' attributes, the compiler has to assume that an (extern) function call has modified any and all memory accessible through some pointer. Furthermore, there are rules in the C standard for when the compiler has to assume things may have been indirectly modified through random pointers: the aliasing rules. These rules are generally so loose that many people make them much more strict with -fno-strict-aliasing, yet somehow we haven't seen a huge fallout from lack of optimisations as you would suggest. Being able to manufacture pointers out of random data need not affect any of those rules!
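
For reference, the attributes mentioned above look like this (the function names are invented for illustration):

__attribute__((const)) int iabs(int x);         /* result depends only on its
                                                   arguments; reads no memory */
__attribute__((pure)) int count_chars(const char *s); /* may read memory,
                                                         but writes none */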

It's interesting that in exactly *no* discussion of undefined behaviour have I ever seen any sort of numbers passed around along the lines of "if we would define that thing this way, we would lose an estimated X% of performance on some code bases"; instead it's all in the lines of your comment saying "Oh, the hysteria, quiver in fear because you could do exactly no optimisations". People arguing to remove some undefined behaviour tend to give examples of what that undefined behaviour makes a big pain or impossible, but there are few concrete arguments from the other side about what removing the undefined behaviour in question would lose. That makes discussions, awareness of the problem, and finding some sort of middle ground exceedingly difficult.

> you can't be sure what it is now, because there was a function call in between, and the implementation of free() might have held onto the address of that allocation and fiddled with it

Guess what ? Every implementation of free() "holds onto the address" given to it (puts it on some free list) and "fiddles with it" (marks the area as unallocated).

Why is everything always painted as if, unless we can fix any and all possible cases of a certain undefined behaviour without even a minimum of compromise, we may as well throw the baby out with the bath water? We don't have to make everything perfect and foolproof to make things better.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:36 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

> These rules are generally so loose that many people make them much more strict with -fno-strict-aliasing, yet somehow we haven't seen a huge fallout from lack of optimisations as you would suggest.

Surely you mean `-fstrict-aliasing` here. Or are you saying that people *loosen* the rules with `-fno-strict-aliasing` and still see no fallout?

> Every implementation of free() "holds onto the address" given to it (puts it on some free list) and "fiddles with it" (marks the area as unallocated).

That's an argument that `free` cannot be implemented in (ISO) C and ends up doing more platform-specific things with the pointer than C would normally allow. Just like `std::memmove` isn't technically possible (AFAIK) in ISO C++ (because of the rules around comparing pointers from separate allocations). See also `std::bless` in C++ as a way to inform the compiler "I did some memory shenanigans, the object there is now C++-okay". I suspect that compilers "know" when they're compiling these functions and act accordingly (probably through some compiler flag or pragma whatnots). Or very careful coding around the rules that C has, to make sure the intent is preserved across the abstract machine.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:57 UTC (Wed) by Vipketsh (guest, #134480) [Link]

> Surely you mean `-fstrict-aliasing` here.

Heh. I had a suspicion this would come up. The 'strict' in the compiler option refers to how strictly the compiler's alias analysis adheres to the standard. My use of 'strict' was referring to how many transformations are allowed by the standard. Wish I could have explained better.

> That's an argument that `free` cannot be implemented in (ISO) C

Indeed, but even so we don't have to disable all possible optimisations like the post I'm replying to is implying, while free() is routinely implemented in C.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:21 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (4 responses)

> It's interesting that in exactly *no* discussion of undefined behaviour have I ever seen any sort of numbers passed around along the lines of "if we would define that thing this way, we would loose an estimated X% of performance on some code bases",

I totally agree. Gcc 4.7 used to abuse UB way less than 6 and above, and I've yet to see a program run faster with gcc 11 than it used to with gcc 4, usually it's even the opposite!

I said a few times (probably in this thread, I don't remember) that if I knew how to do it and had enough time, I would be happy to create a new "standard" for gcc, such as "safe11" or something like it, that, next to gnu99 and friends, would be C11 with most (ideally all) UB defined to the most commonly expected behavior (it wouldn't be that far from the "linux kernel C").

And I'm quite sure it would be quickly adopted by many of us suffering from such jokes. Plus it would remove a ton of nonsensical warnings, such as the ones that force you to scratch your head for a moment when trying to implement a binary integer rotate operation without any warning (a shift by 32 doesn't work; you need to mask the opposite shift count to 31, and the compiler doesn't always recognize the pattern to optimize it into a single rol/ror operation).
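
For illustration, the UB-free rotate idiom alluded to here; both shift counts stay within 0..31, and most compilers recognize it as a single rol instruction:

#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << (n & 31)) | (x >> (-n & 31));
}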

DeVault: Announcing the Hare programming language

Posted May 4, 2022 16:30 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (3 responses)

> I would be happy to create a new "standard" for gcc such as "safe11" or something like this

There have been attempts[1]. I've not heard news about meaningful progress (though I've also not sought it out). I'd expect any announcements of such a thing to show up on LWN in some manner :) .

[1] https://blog.regehr.org/archives/1287

DeVault: Announcing the Hare programming language

Posted May 6, 2022 2:59 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (2 responses)

Ah interesting, thanks for the link!

Probably the mistake this person made was to try to reach a consensus. If the proposal worked for some old code base, surely it wasn't that bad, and ought to have been proposed as-is as a patch to gcc.

DeVault: Announcing the Hare programming language

Posted May 6, 2022 13:05 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

> If the proposal worked for some old code base, surely it wasn't that bad, and ought to have been proposed as-is as a patch to gcc.

And it would be promptly rejected, because the question asked would be simple: why do you think you are so special that you deserve separate treatment?

As was noted in the blog post: there are many programs typically compiled for ARM that would fail if this produced something besides 0, and there are also many programs typically compiled for x86 that would fail when this evaluates to something other than the original value… and both types can be rewritten to work within the limitations of standard C… so why should the compiler developers care?

More or less the only person they give special treatment is Linus: not only does he lead a huge and important project but, more importantly, it's obvious that said project needs to go beyond the boundaries of standard C sometimes.

Even then the leeway is extremely limited; Linus has to argue about things a lot for them to be accepted as GCC C extensions.

DeVault: Announcing the Hare programming language

Posted May 6, 2022 18:40 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

> As was noted in the blog post: there are many programs typically compiled for ARM that would fail if this produced something besides 0, and there are also many programs typically compiled for x86 that would fail when this evaluates to something other than the original value… and both types can be rewritten to work within limitations of standard C… so why should the compiler developers care?

I do have a response to this: just look at the code for each of them to see what it would take to adapt to the other one's behavior, figure out which choice has the least impact, and purposely break the other one, given that it currently is broken or about to break anyway during a future compiler upgrade. But at least this would be clearly documented. And when the cost is the same I'd choose x86 by default, since 1) it has accumulated way more older code (arm code tends to be more modern and less arch-specific), and 2) it's where users go when they want the highest performance level nowadays.

> More-or-less the only guy who they give special treatment is Linus: not only he leads a huge and important project, but, more importantly, it's obvious that said project need to go beyond boundaries of Standard C, sometimes. Even then leeway is extremely limited, Linus have to argue about things a lot for these to be accepted as GCC C extension.

Yes, I know, and that's sad.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:32 UTC (Wed) by excors (subscriber, #95769) [Link] (2 responses)

> These rules are generally so loose that many people make them much more strict with -fno-strict-alias, yet somehow we haven't seen a huge fallout from lack of optimisations as you would suggest. [...] It's interesting that in exactly *no* discussion of undefined behaviour have I ever seen any sort of numbers passed around along the lines of "if we would define that thing this way, we would loose an estimated X% of performance on some code bases", instead it's all in the lines of your comment saying "Oh, the hysteria, quiver in fear because you could do exactly no optimisations".

It's trivial to construct plausible examples where aliasing has a huge effect on performance, especially in C++ where you don't want everything to alias with 'this'. E.g. https://godbolt.org/z/zddTYK378 executes over 4x faster with -fstrict-aliasing on my CPU (because the compiler can autovectorize the loop when it realises the input and output don't alias). You can probably do similar with most other undefined-behaviour optimisations, but I'm not sure that would really prove much.
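
The loop is roughly of this shape (a sketch, since the linked code isn't reproduced here): with strict aliasing a float* and an int* cannot alias, so the accumulator can stay in a register and the loop can be vectorized; with -fno-strict-aliasing the compiler must assume the store to *sum may modify src[i] and reload on every iteration.

#include <stddef.h>

void accumulate(float *sum, const int *src, size_t n) {
    *sum = 0;
    for (size_t i = 0; i < n; i++)
        *sum += src[i];
}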

I think one major problem with trying to translate that into "X% of performance on some code bases" is that there's a massive range of code bases, and no benchmark suite is representative of them all, so it's impossible to get representative numbers. But even if it was: If an optimisation has no effect on 99% of programs, but it makes 1% of programs 4x faster, is that worth it? It seems the most common positions are "it's always worth it, regardless of the exact numbers" (modern compiler developers), and "it's never worth it, regardless of the exact numbers" (people who want C to be nicer syntax for assembly code), and the exact numbers probably won't change anyone's mind.

And in any code base where performance is important, the developer should have already profiled and optimised it around their current compiler's capabilities - e.g. if they had code like my example with -fno-strict-aliasing then they'd probably extract 'sum' into a local variable to help the compiler. Then a benchmark would show no benefit from -fstrict-aliasing, because the programmer has already paid the cost of working around aliasing problems. Optimisation isn't just a compiler algorithm, it's a feedback loop between compiler and programmer, so you can't evaluate it properly by running compilers on a static set of benchmarks.
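
That manual fix looks like this: a local accumulator cannot alias anything, so the optimization no longer depends on type-based aliasing rules.

#include <stddef.h>

void accumulate_fixed(float *sum, const int *src, size_t n) {
    float acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    *sum = acc;
}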

And it's a feedback loop that spans decades: e.g. compilers get really good at inlining and constant-folding and eliminating dead code, so people invent techniques like expression templates (where a C++ expression doesn't compute a value, it essentially computes a type that represents the AST of the expression, which can be manipulated at compile-time before eventually turning into hundreds of function calls that produce a single line of code), then they build a linear algebra library like Eigen using that technique, then applications start using the library, then compiler developers are motivated to improve autovectorization because there's all these applications doing linear algebra, etc.

At the end of that process, you can't just turn off one of the old compiler optimisations and expect to get meaningful results; too much code implicitly depends on it. And at the start of the process, you couldn't have predicted exactly what that optimisation would lead to; all you could predict is that if you had waited for quantifiable evidence of a major benefit then you'd never had made any progress.

(This argument mostly applies to C++, not C, but I think nobody cares enough about C to develop a serious compiler for it - you'll just get a C++ compiler with a cut-down parser, so you'll get the costs of these fancy optimisations without much of the benefit. That's the downside of sticking with a niche language like C.)

DeVault: Announcing the Hare programming language

Posted May 4, 2022 17:44 UTC (Wed) by Vipketsh (guest, #134480) [Link] (1 responses)

> that there's a massive range of code bases, and no benchmark suite is representative of them all,

That's exactly the kind of argument I was talking about in my last sentence, the kind that does not help these discussions. Somewhere along the way, someone put in a ton of work to write an optimisation pass to, I hope, produce more optimal output. Since everything is about optimising output, again, I would hope that there were at least *some* benchmarks published along with the new optimisation to show that maintaining the optimisation pass for the future is a good idea. Therefore when these discussions come up it should be pretty simple: "look, when this new NULL-check-deleting pass was added it brought X% to the table on this benchmark". At that point we would have a basis for discussion: maybe the code base in question isn't so important any more, maybe some other newer passes make the gains less relevant, or maybe we just decide that the gains are not a good trade-off. With random hand-waving and fear-mongering there is no way a meaningful discussion can be had.

> At the end of that process, you can't just turn off one of the old compiler optimisations and expect to get meaningful results;

On the flip side, optimisation passes can turn out to be meaningless because newer passes no longer produce the sequences that made them useful. It's also not like performance regressions are unheard of in compiler land. If we could have a discussion with numbers we could very well come to some tentative conclusion and disable the pass by default to see what falls out (let your users do the testing on a "massive range of code bases").

DeVault: Announcing the Hare programming language

Posted May 6, 2022 0:28 UTC (Fri) by khim (subscriber, #9252) [Link]

> Since everything is about optimising output, again, I would hope that there were at least *some* benchmarks published along with the new optimisation to show that maintaining the optimisation pass for the future is a good idea.

True, and you can find such benchmarks in the bugzilla (or github for clang). But nobody bothers to measure the impact of optimizations based on different UBs, because the assumption is that code doesn't have any.

In the end you have hundreds of passes and absolutely zero knowledge about which of them are applicable in which cases (except for a few, niche, UBs which are simple enough to deserve a dedicated flag).

> On the flip side optimisation passes can turn out to be meaningless because some other new passes don't create the sequences any more for it to be meaningful.

Sure. Compiler writers keep track of these things. What they don't keep is a mapping between UBs and optimizations (again: with the exception of explicitly created flags like -fstrict-aliasing or -fwrapv).

You can measure effect of different optimization passes, but you have absolutely no idea which of them are safe or not safe to use when you want to turn some UB into defined behavior.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 16:09 UTC (Wed) by atnot (subscriber, #124910) [Link] (12 responses)

> there are rules in the C standard for when the compiler has to assume things may have been indirectly modified through random pointers: the aliasing rules

Indeed. But those rely on the fact that just creating pointers to arbitrary memory is invalid. If de-referencing arbitrary addresses is valid, all bets are off.

> These rules are generally so loose that many people make them much more strict with -fno-strict-alias

It does the opposite, it makes them weaker, but only a bit. But that's kind of beside the point, which is that it is basically impossible to interface with memory at all without some kind of aliasing rules.

> People arguing to remove some undefined behavior tend to give examples of what that undefined behaviour makes a big pain or impossible, but there is little concrete arguments from the other side about what removing the undefined behaviour in question would loose.

Well, it's kind of impossible to know. There's not a single flag or pass you could turn off to e.g. reliably leave in null checks. The compiler might or might not have a specific code path for eliminating null pointer checks, but removing that doesn't mean those null checks won't be removed by other passes operating on similar assumptions. Or that something else critical won't be removed next time.

The thing is, even if it is phrased that way, the complaint is rarely actually "I would like this specific thing to be defined", it is "I would like the C abstract machine to behave exactly as simply as I think it does". But in a language as unconstrained as C, that's not really possible, nor would it really be a desirable slope to ride.

At the end of the day, that's what this is really about to me. I'm not personally elated when the compiler optimizes out my checks either. I'm not sitting here refreshing the gcc homepage, eagerly anticipating new optimization passes to break my code. But I recognize that these are the consequence of a language that desires to be both fast and accept programs that do arbitrary memory manipulations. And to me it's very clear that if we want to write programs that behave as we think they do, ones where our mental model and the compiler's model are one and the same, we have no choice but to give up one or the other. Just defining a few things won't be enough to make the problems go away.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 16:40 UTC (Wed) by excors (subscriber, #95769) [Link] (11 responses)

> There's not a single flag or pass you could turn off to e.g. reliably leave in null checks. The compiler might or might not have a specific code path for eliminating null pointer checks, but removing that doesn't mean those null checks won't be removed by other passes operating on similar assumptions.

There is -fno-delete-null-pointer-checks, which may not be reliable enough for security purposes but can easily pessimize code: https://godbolt.org/z/PGje44zna is autovectorized unless you enable that flag or remove the "*sum = 0;" line (which tells the compiler it can ignore the subsequent NULL checks).

(And for completeness a similar example with -fwrapv: https://godbolt.org/z/exzs7ocaj is only autovectorized when it can assume the loop does not overflow to negative values.)
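
For readers who don't want to click through, the first link is roughly this shape (a hedged sketch with my own identifiers, not the exact godbolt code): the initial store through sum lets the compiler conclude sum != NULL, so the in-loop check is dead and the loop can be vectorized; with -fno-delete-null-pointer-checks, or with the store removed, the check has to be assumed live.

/* Sketch (illustrative names): vectorizable by default because
 * `*sum = 0;` implies sum != NULL, making the in-loop check dead. */
void sum_array(const int *arr, int n, int *sum) {
    *sum = 0;                /* implies sum != NULL under default rules */
    for (int i = 0; i < n; i++) {
        if (!sum)            /* dead once sum != NULL is established */
            return;
        *sum += arr[i];
    }
}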

DeVault: Announcing the Hare programming language

Posted May 4, 2022 17:45 UTC (Wed) by Vipketsh (guest, #134480) [Link]

> remove the "*sum = 0;" line (which tells the compiler it can ignore the subsequent NULL checks).

I don't think that example demonstrates a case for "dereferencing NULL is undefined behaviour". The compiled code has the "if (!sum)" hoisted out of the loop, and once you do that optimisation the loop is no different than if the check were completely removed. Seems to me like the reason for the failed vectorisation is more an internal compiler issue, possibly because of the ordering of passes, and not because "dereferencing NULL being undefined behaviour" is vital to the vectorisation.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 2:54 UTC (Thu) by foom (subscriber, #14868) [Link] (9 responses)

> There is -fno-delete-null-pointer-checks

Yes, this flag has a remarkably poor name. In fact, the flag doesn't "turn off deleting null pointer checks" (whatever that might mean). Rather, the underlying behavior (at least as implemented in Clang -- I believe the same is true for GCC) is entirely principled: it informs the compiler that the null pointer might actually refer to valid memory that a program can successfully (potentially even intentionally!) access as an object.

A _consequence_ is that "*foo = 0;" doesn't imply "foo != nullptr", as it otherwise does (so it does have the effect of "not deleting" THAT null pointer check).
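
A minimal sketch of that consequence (the function name is hypothetical):

/* Under the default (-fdelete-null-pointer-checks), the store lets
 * the compiler infer foo != NULL and fold the return to 1; with
 * -fno-delete-null-pointer-checks the comparison must be kept. */
int check_after_store(int *foo) {
    *foo = 0;
    return foo != 0;
}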

DeVault: Announcing the Hare programming language

Posted May 5, 2022 17:28 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (8 responses)

> A _consequence_ is that "*foo = 0;" doesn't imply "foo != nullptr", as it otherwise does (so it does have the effect of "not deleting" THAT null pointer check).

A further consequence of enabling this flag is that you are no longer programming in ISO C:

> An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.

… or C++:

> A null pointer constant is an integer literal (5.13.2) with value zero or a prvalue of type std::nullptr_t. A null pointer constant can be converted to a pointer type; the result is the null pointer value of that type (6.8.2) and is distinguishable from every other value of object pointer or function pointer type.

… since a null pointer can no longer be distinguished from a pointer to an object or function.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 18:10 UTC (Thu) by farnz (subscriber, #17727) [Link] (5 responses)

I'm not sure I follow your reasoning, and I'd appreciate you expanding on it.

Take the following C++ code:


bool bad_code(bool deref_null) {
    int *foo;
    int real_val;
    int *bar = deref_null ? nullptr : ℜ_val;
    *bar = 0;
    return bar == nullptr;
}

I don't see how the snippets you've quoted make it impossible for this function's return value to differ from its deref_null parameter. The null pointer remains a unique value; the behaviour of *bar = 0 is undefined, but importantly, if I remove that line, the function behaves the same in both ISO C++ and C++ with -fno-delete-null-pointer-checks - the distinction is that in ISO C++, this function can be optimized to the equivalent of:


bool bad_code(bool) { return false; }

while with -fno-delete-null-pointer-checks, it can only be optimized to:

bool bad_code(bool deref_null) { return deref_null; }

Although, in both cases, it's perfectly reasonable to elide or not elide the write to pointer value 0, since that write is UB.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 19:12 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (4 responses)

> The null pointer remains a unique value; …

Unique, yes—there is only one null pointer value—but not distinct from any pointer to an object or function. With the -fno-delete-null-pointer-checks flag enabled you can have a pointer to a valid object which compares equal to a null pointer.

> I don't see how the snippets you've quoted make it impossible for this function's return value to differ from its deref_null parameter.

(I am assuming that "ℜ_val" in your example was supposed to be "&real_val". I'm not sure of the purpose of the unused pointer variable "foo".)

According to ISO C++, with the "*bar = 0" line deleted, the return value must be equal to "deref_null". The "bar" pointer can only be "nullptr" when deref_null is true or "&real_val" when deref_null is false, and "&real_val", as a pointer to an object, can never compare equal to "nullptr". With the "*bar = 0" line it's UB when deref_null is true and so could be optimized to just "return false", as you said.

However, with -fno-delete-null-pointer-checks enabled, we do not have the guarantees of ISO C++ and "nullptr" could in theory compare equal to a pointer to an object, e.g. if pointers are represented as byte addresses, "nullptr" is represented as byte address zero, and the object (in this case "real_val") happens to be placed at byte address zero. If this happened then "bar == nullptr" would be true even if deref_null is false, so the function cannot be optimized to just "return deref_null".

DeVault: Announcing the Hare programming language

Posted May 6, 2022 13:38 UTC (Fri) by farnz (subscriber, #17727) [Link] (3 responses)

Sorry about the bad code formatting - I have no idea how copying and pasting from Emacs did that.

I don't see how you get "not distinct from any pointer to an object or function" from the description of the -fno-delete-null-pointer-checks flag. As I read the documentation, -fno-delete-null-pointer-checks does not permit you to have a pointer to a valid object that compares equal to a null pointer; instead it says that the act of dereferencing a pointer implies nothing about its value. Without the flag, dereferencing a pointer implies the pointer value must not be a null pointer, since if it was a null pointer, the dereference would result in UB (since a null pointer cannot point to a valid object). With the flag, however, while the dereference itself is still UB (since a null pointer cannot point to a valid object), the compiler acts as-if each dereference of a nullptr was immediately followed by an assignment of an unknown value to the pointer.

Because the value is unknown, it could still be a null pointer, but it could also be a new pointer to a valid object - the compiler's analysis passes simply don't know at this point, and thus it cannot rely on the dereference to permit it to remove a nullptr check, since it does not know what the pointer's value is.

DeVault: Announcing the Hare programming language

Posted May 6, 2022 14:55 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (1 responses)

> I don't see how you get "not distinct from any pointer to an object or function" from the description of the -fno-delete-null-pointer-checks flag.

The default is -fdelete-null-pointer-checks, which has the description: "Assume that programs cannot safely dereference null pointers, and that no code or data element resides at address zero."[0] The -fno-delete-null-pointer-checks flag affects *both* of these assumptions, meaning that the compiler cannot assume "that no code or data element resides at address zero" (i.e. that no pointer to an object has the same representation as a null pointer).

As stated in the documentation the intended use of the -fno-delete-null-pointer-checks flag is platforms such as AVR where objects *can* be placed at address zero, which implies that &variable can be indistinguishable from a null pointer. Though this is more likely to be true for a global or static object than for a stack variable.

[0] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#...
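
As a hedged sketch of how that can happen (the section name and its placement at address zero are assumptions, standing in for a real linker script):

/* Assume the linker places section ".at_zero" at address 0, as is
 * possible on AVR-class targets. Then &reg0 has the same
 * representation as a null pointer, and a pointer to a real object
 * can compare equal to NULL. */
volatile unsigned char reg0 __attribute__((section(".at_zero")));

int object_at_null(void) {
    volatile unsigned char *p = &reg0;  /* pointer to a real object */
    return p == 0;                      /* may be true if reg0 is at 0 */
}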

DeVault: Announcing the Hare programming language

Posted May 6, 2022 16:52 UTC (Fri) by farnz (subscriber, #17727) [Link]

Thanks for clearing up my misunderstanding - for some reason, I was mentally skipping the second assumption (since on my platforms of choice, there cannot be a code or data element at address 0) and only focusing on the first assumption (that programs cannot safely dereference null pointers, which is the one that allows a compiler to deduce that if you dereference a pointer, it cannot be a null pointer).

DeVault: Announcing the Hare programming language

Posted May 6, 2022 17:26 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

> Sorry about the bad code formatting - I have no idea how copying and pasting from Emacs did that.

ampersand is the HTML/XML metacharacter for starting an entity, and although the standard says that entity references should include a final semicolon, HTML-handling software is more tolerant of the missing semicolon than is entirely ideal.

So it appears that the sequence &real gets overgenerously interpreted as the HTML entity &real;, equivalent to Unicode codepoint U+211C BLACK-LETTER CAPITAL R from the Letterlike Symbols block, which has an alias name of "Real part".

DeVault: Announcing the Hare programming language

Posted May 7, 2022 6:03 UTC (Sat) by Vipketsh (guest, #134480) [Link] (1 responses)

I have to ask: what were you trying to add to the discussion?

Sure, you are absolutely correct, but what relevance does it have? If the *compiler* has to assume that NULL points to a valid object, what programs would break? What other fallout would there be? The only thing I can think of is that when lawyering about "if (my_pointer == NULL)" you would have to say "Does my_pointer point to the object at address NULL?" instead of "Is my_pointer pointing to an invalid object?".

I think most interpretations of the standard, in the context of undefined behaviour, are simply done in bad faith. My opinion is that the reason that language is in there, and has to be there, is so that malloc(), or anything else that works with a pointer, can return or check for an error. And the reason dereferencing a NULL pointer is undefined is because there is no telling how a platform behaves when you do so. See how none of this has anything to do with the compiler?

It would do so much good for these discussions if the standard and what it says were put aside. Talk about how one or another change would affect real existing programs and/or platforms. Talk about possible fallout. Talk about potential issues. Talk about benefits. Because "oh, how terrible, now some standard does not match up if I squint at it this way" is completely meaningless and adds nothing. Standards, in general, should be looked upon as nothing more than an aid to achieving interoperability (they can and do contain falsehoods). We all know that standards are violated all the time and that to make things work one needs domain-specific experience. Lastly, if you are writing a standard your goal should be to document the status quo, most definitely not to attempt to change the world.

DeVault: Announcing the Hare programming language

Posted May 8, 2022 12:48 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

> My opinion is that the reason that language is in there, and has to be there, is so that malloc(), or anything else that works with a pointer, can return or check for an error. And the reason dereferencing a NULL pointer is undefined is because there is no telling how a platform behaves when you do so. See how none of this has anything to do with the compiler?

You have muddled the NULL pointer (an abstract idea) with the all zeroes address on a typical CPU, these are intentionally not the same thing.

While it's obviously a bad idea, C has no trouble with using actual values from a type as sentinels: atoi("junk") and atoi("0") are both zero. So it wouldn't have been a problem to define that malloc() returning zero can be either an error or an actual zero address. And because C runs on the abstract machine, not some actual platform with whatever weird behaviour, the question of what happens if we try to do platform-illegal operations never comes up.

Most platforms are likely to either not be fazed at all by the all-zeroes address, or to be equally concerned with some other address values, including values beyond some logical "end of memory", ROMs, and memory-mapped peripherals. We can observe that the C language does not define special behaviour for any of these, only NULL, which means something in the abstract machine, and *that* is why it's used as a sentinel value.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 8:50 UTC (Wed) by kleptog (subscriber, #1183) [Link] (1 responses)

> And yes, this is very much the point. Either you insist that the C abstract machine map exactly to the primitives of the platform it’s implemented on, even in cases that are undefined on the abstract machine itself, or you don’t.

Isn't that the conflict though? On the one hand you have claims that C is a better assembler and good for writing low-level software (like the Linux kernel). On the other hand, C works with an abstract machine and if you go outside that you get undefined behaviour.

When writing something like the Linux kernel you have to do things that go outside the C abstract machine and so you end up fighting the C compiler the whole way. It assumes you have a functional abstract machine, yet that is what the kernel is trying to create.

The conclusion would seem to be: C is good for writing low-level software, except for the low-level parts of kernels.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 9:15 UTC (Wed) by Wol (subscriber, #4433) [Link]

Which is why languages like Rust, and in earlier times Modula-2, provided ways to step outside the language invariants, with the caveat "here be dragons". Languages which try to force good programming practice with no such escape hatch turn out to be impractical in practice.

And more and more I get the impression that modern C is trying to enforce good programming practice on an ancient code base (and failing miserably ...)

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 4, 2022 9:22 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (1 responses)

> There is a way: you avoid the cases that trigger UB, and rely only on what the abstract machine guarantees. I can agree this is not always easy, but there are tools to help you with that. As long as the abstract machine is implemented correctly and its invariants are upheld, the program will work on any target it is compiled for.

That doesn't work at all in practice due to portability. Look at syscalls: some used to take an int a long time ago, which was replaced with a socklen_t or a size_t or ssize_t over time. Integer promotion in C is a disaster. You basically cannot use any single integer in a portable way without writing 1 or 2 consecutive casts, and even then you fear that it might be incorrectly mapped. And it's getting worse when the input data you have was also defined as one of these types.

Casts are a big cause of bugs and they're made more and more common due to all the crappy abstraction types everywhere. Try to pass a time_t over the network. Hmmm, does it need to be signed or unsigned? 32 or 64 bits? In doubt you might want to pass it as signed 64 bits. But then how do you reliably decode it on the other side? What if you picked the wrong type on the encoding side: don't you risk getting it decoded wrong for special values like -1, which could mean "forever" or "event not happened" for some syscalls?
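
For what it's worth, the least-bad dance I know of looks something like this (a sketch assuming a signed 64-bit big-endian wire format is the chosen convention; the function names are mine):

#include <stdint.h>
#include <time.h>

/* Encode time_t as signed 64-bit big-endian regardless of what the
 * platform's time_t actually is. Casting through uint64_t avoids
 * implementation-defined shifts of negative values. */
void encode_time(unsigned char out[8], time_t t) {
    uint64_t v = (uint64_t)(int64_t)t;  /* assumes time_t fits in 64 bits */
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

time_t decode_time(const unsigned char in[8]) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | in[i];
    return (time_t)(int64_t)v;          /* may truncate if time_t is 32-bit */
}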

> If you don’t, you forfeit any right to complain that compilers ‘abuse’ UB: if it’s undefined, it’s undefined, and it doesn’t even have to act deterministically

As someone said above, they used to be undefined in the sense that they were only hardware-dependent. Now it's a free pass for the compiler to say "awesome, this developer fell into my trap, so I can overoptimize that code and show my rival how much faster my code is without all these useless checks". In addition, let me remind you that the C spec isn't open, you have to pay for it, and you discover the undefined behaviors very late in your life as a developer. Sure, now some drafts are accessible that you may consider almost identical to the official spec. But this alone is a big problem.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 11:39 UTC (Wed) by farnz (subscriber, #17727) [Link]

C has, since at least ANSI C89, had both undefined behaviour and implementation defined behaviour. Implementation defined behaviour is the stuff that's hardware-defined - for implementation defined behaviour, the compiler must tell you how it implements it (but can say things like "arithmetic overflow for 32 bit integers is defined by the behaviour of ADD EAX, EBX on your CPU, while for doubles, it's defined by the behaviour of FADD ST0, STi with truncation to 64 bit only happening if the compiler chooses to store the value to memory"), while for undefined behaviour, the compiler can do anything it likes.

The underlying "gotcha" with C for us older programmers is that optimizing compilers weren't very good until the late 1980s; before then, it was reasonable to model compilers as translating what I wrote 1:1 to a lower level language, then peephole optimizing that language, then repeating the process until the lower level language is machine code.

That's not how modern compilers work, however. They do much more sophisticated analyses to drive optimization, and can thus easily detect many more opportunities to optimize, but those analyses come up with results that are surprising if you're thinking in terms of peephole optimization.
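
A standard illustration of the difference (a sketch; the function name is mine): under the 1:1 peephole model the code below is a loop whose trip count depends on whether n + 10 wraps, but a modern compiler, assuming signed overflow cannot happen, may prove the loop runs exactly ten times and reduce the whole function to a constant.

/* Peephole view: a loop that runs 0 times if n + 10 wraps around.
 * Modern view: n + 10 cannot overflow (that would be UB), so the
 * function is just `return 10;`. */
int count_ten(int n) {
    int c = 0;
    for (int i = n; i < n + 10; i++)
        c++;
    return c;
}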

DeVault: Announcing the Hare programming language

Posted May 4, 2022 13:56 UTC (Wed) by Vipketsh (guest, #134480) [Link] (40 responses)

Why are we all talking about undefined behaviour as if it were some law of nature set in stone? These behaviours are *not* set in stone and there is no (technical) reason they can't be changed. There is also no reason why compiler authors couldn't say "this makes no sense, we'll define it like this" (e.g. -fwrapv) and thereby force the standards committee's hand. In other words, I cannot accept arguments which end with "that's undefined behaviour, end of story" -- explain why it has to be undefined and/or what the "undefined" nature of the behaviour gets us (e.g. X% faster code, portability to X architecture, etc.) and then we can have a discussion about which is the better trade-off: the benefits of the undefined behaviour or the benefits of not having it.

It would also be great if there were words to differentiate between undefined behaviour that cannot be avoided (e.g. use-after-free) and undefined behaviour we talk about only because compiler authors decided to explicitly add transformations based on said allowances (e.g. deleting NULL pointer checks due to NULL dereference being undefined). Lastly, I think we would be in a much better situation if 'undefined behaviour' were read as 'allowed by the underlying machine, forbidden for the compiler'.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:25 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

> (e.g. deleting NULL pointer checks due to NULL dereference being undefined)

Everyone keeps repeating this, but consider something where you have some inline-able utility functions that all safeguard against `NULL` internally (because that's the sensible thing to do). Do you not want to let the compiler optimize out those checks when inlining the bodies into a caller that, itself, has a NULL-then-bail check on that pointer (which a dereference is implicitly a check for as well)? If you want to remove this optimization, what kinds of optimizations are you willing to leave on the floor?
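
Something like this sketch (names are illustrative):

/* Both functions defensively check for NULL. After inlining
 * get_len() into use(), the caller's bail-out already guarantees
 * b != NULL, so the inner check can be deleted. */
struct buf { int len; };

static inline int get_len(const struct buf *b) {
    if (!b)               /* defensive check inside the utility */
        return 0;
    return b->len;
}

int use(const struct buf *b) {
    if (!b)               /* caller's own NULL-then-bail check */
        return -1;
    return get_len(b);    /* inner check provably dead here */
}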

FWIW, once you get to LTO, the language it was written in may be long gone; would you see something like this (forgive my lack of asm familiarity):

mov eax, [ecx] ; int x = *ptr;

test ecx, ecx ; if (!ptr) return;
jnz .keep_going
ret
.keep_going:

and not remove that `test/jnz/ret` sequence?

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:48 UTC (Wed) by Vipketsh (guest, #134480) [Link]

> inline-able utility functions that all safeguard against `NULL` internally [...] Do you not want to let the compiler optimize out those checks when inlining the bodies into a caller that, itself, has a NULL-then-bail check on that pointer

It is a very reasonable optimisation, but you don't need "dereferencing NULL is undefined behaviour" to make it! Proving that optimisation valid can be done with range analysis: since you have the first check for NULL, you know that the pointer is not NULL afterwards, and thus any following checks against NULL cannot evaluate to true.
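
To make that concrete, a minimal sketch (illustrative names): the first check dominates everything below it, so the second check is provably dead by range analysis alone, with no appeal to "dereferencing NULL is UB".

/* The second `if (!p)` is dead by dominance: on the fall-through
 * path of the first check, p != NULL is a known fact. */
int first_elem(const int *p) {
    if (!p)
        return 0;
    if (!p)        /* removable without any UB-based reasoning */
        return -1;
    return *p;
}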

> what kinds of optimizations are you willing to leave on the floor?

If someone would actually quantify how much one or another optimisation gets us, we could have a discussion. I would guess that in many cases it is something I could live with, but if people who do compiler benchmarks would tell me "well, that can lose you up to 10%" (or similar) I may very well concede that the undefined behaviour is worth keeping.

> FWIW, once you get to LTO, the language it was written in may be long gone

Yes, and no. LTO operates on the compiler's middle-end representation -- the same one on which almost all transformations occur. So, while the original language is indeed lost, all information from it needed to decide if one or another transformation is valid is still present.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:55 UTC (Wed) by khim (subscriber, #9252) [Link] (37 responses)

> They are *not* set in stone and there is no (technical) reason they can't be changed.

It can be done and it was done. Here is an example with undefined behavior, here is another with unspecified behavior.

> Why are we all talking about undefined behaviour as if this stuff would be some laws-of-nature that is set in stone ?

They are not set in stone, yet they are the rules of the game. But most discussions about undefined behavior go like “why do you say I couldn't hold the basketball and run — I tried that and it works!”.

Yes, it may work if the referee looks the other way. But it's against the rules. If you want to make it rules-compliant then you have to change the rules first!

> I can not accept arguments which end with "that's undefined behaviour, end of story"

It is the end of story for a compiler writer or a C developer. Just like “it's against the rules” is “the end of story” for the basketball player.

> explain why it has to be undefined and/or what does the "undefined" nature of the behaviour get us (e.g. X% faster code, portability to X architecture, etc.) and then we can have a discussion about what is a better trade off: the benefits of the undefined behaviour or the benefits of not having it.

Sure. Raise that issue with the ISO/IEC JTC1/SC22 committees (WG14 for C, WG21 for C++). If they agree, the rules will be changed. Or you can try to see what it takes to change the compiler and report back.

But rules are rules, they are “the status quo”. Adherence to rules doesn't need any justifications. But changes to the rules need a justification, sure.

> Lastly, I think we would be in a much better situation if 'undefined behaviour' would be read as 'allowed by the underlying machine, forbidden for the compiler'.

Such a thing already exists in the standard. It's called “implementation-defined behavior”. Undefined behavior is called undefined because it's, well… undefined.

P.S. There are cases where modern compilers break perfectly valid programs which don't, actually, trigger UB. That can only be called an act of sabotage, but those are different stories from what we are discussing here.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:03 UTC (Wed) by Vipketsh (guest, #134480) [Link] (36 responses)

You know full well that no individual can engage a committee and have any meaningful hope of ever bringing about change, and you also know full well that committees are more about preserving the entrenchment of their members than real technical progress. Thus any reference to them comes across as a simple means of excluding individuals without discussion.

It's also not like compiler writers have never ignored standards when it suited them. I think it was yourself who mentioned in some thread from a while ago that LLVM is doing all sorts of optimisations based on pointer provenance when there is exactly no mention of them in any standard. So, no, I can't just simply accept "it's against the rules" to end a discussion.

> “why do you say I couldn't hold basketball and run — I tried that and it works!”

To take your analogy further, maybe basketball would be better that way. If I think so, I can gather up a bunch of people, go to some court, and try it for a while and, if it works, maybe the basketball rules committee will take an interest and change the rules, or maybe it's the birth of a new sport. So, yes, turning certain undefined behaviours into defined ones in a compiler is exactly the place to have this discussion, and the rule changes can very well happen later.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:28 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (4 responses)

There are certainly individuals at C++'s standards committee. I don't know that much about C since I haven't been there though. I don't think that "preserving the entrenchment" is at all how I'd describe it. There are certainly different pressures on things, but there is *not* a unified one anywhere other than "improve the language overall" (and not everyone agrees about every step taken). I've seen long-standing members get told "no, that idea doesn't work" about as much as I've seen new members get "oh, yeah, that's neat" (in rough proportion to their prevalence at least). Implementors have their hobby horses, library designers have theirs, and users are there too to express their views on things. "Unified" is not how I would describe it in relation to any single feature-in-progress.

FWIW, I go because I work on CMake and would like to ensure that modules are buildable (as would my employer). But I go to other presentations that I am personally interested in (when not a schedule conflict with what I'm there to do) and participate.

With the pandemic making face-to-face less tenable, I expect it to be easier than ever for folks to attend committee meetings.

> I think it was yourself who mentioned in some thread from a while ago that LLVM is doing all sorts of optimisations based on pointer provenance when there is exactly no mention of them in any standard.

That was me I believe. My understanding is that without provenance, pointers are extremely hard to actually reason about and make any kind of sensible optimizations around (this includes aliasing, thread safety, etc.). It was something that was encountered when implementing the language that turned out to be underlying a lot of things but was never spelled out. There is work to figure out what these rules actually are and how to put them into standard-ese ("pointer zap" is the search term to use for the paper titles).

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:51 UTC (Wed) by Vipketsh (guest, #134480) [Link]

If the C/C++ committees are indeed that nice, I have at least some hope of sensible changes -- thanks for sharing your experience.

(For reference, long ago I engaged with the Unicode people as part of a research group, and that experience was one of dealing with megalomaniacal cesspools, lobby groups, legal threats and the like -- truly awful)

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:06 UTC (Wed) by khim (subscriber, #9252) [Link] (2 responses)

> That was me I believe.

That was different discussion.

> There is work to figure out what these rules actually are and how to put them into standard-ese ("pointer zap" is the search term to use for the paper titles).

This would have been great if, while these rules are not finalized, compilers produced working programs by default.

Instead clang not only breaks programs by default, it even refuses to provide a -fno-provenance switch which could be used to stop miscompiling them!

And it's not as if rules were actually designed, presented to the C/C++ committee (like happened with DR 236), then rejected because of typical committee politics. At least then you could say “yes, changes to the standard are not accepted yet, but you can read about our rules here”.

Instead, for two decades compilers silently miscompiled valid programs and their developers offered no rules which software developers could follow to write programs which wouldn't be miscompiled!

That's an act of deliberate sabotage. And, worst of all, it has only gained prominence because of Rust: since Rust developers actually care about program correctness (and because LLVM miscompiled programs which they believed to be correct) they tried to find the actual list of rules that govern provenance in C/C++… and found nothing.

Yes, this provenance fiasco is an awful blow against C/C++ compiler developers. You couldn't say “the standard is a treaty, you have to follow it” and then turn around and add “oh, but it's too hard for me to follow it, thus I will violate it whenever it suits me”. That's not a treaty anymore, it's a joke.

But even then: the way forward is not to ignore the rules, but an attempt to patch them. Ideally new compilers should be written which would actually obey some rules, but that's too big of an endeavor, I don't think it'll happen.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:36 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

> Ideally new compilers should be written which would actually obey some rules, but that's too big of an endeavor, I don't think it'll happen.

There may be hope. Some docs: https://lightningcreations.github.io/lccc/xlang/pointer (though no mention of provenance yet(?)).

https://github.com/LightningCreations/lccc

DeVault: Announcing the Hare programming language

Posted May 5, 2022 9:59 UTC (Thu) by khim (subscriber, #9252) [Link]

I mean something like what was done for safe Rust: a formal mathematical model which can prove that the transformations done during optimization are sound.

Things like https://www.ralfj.de/blog/2020/12/14/provenance.html should be impossible.

Today, optimizations in compilers are a dark art: not only do they sometimes produce suboptimal code (that's probably something we will never be able to fix), we also discover, from time to time, that they are simply invalid (as in: they turn perfectly valid programs into invalid ones).

Ideally we should ensure that all optimizations leave valid programs valid (even if, perhaps, not optimal).

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:33 UTC (Wed) by khim (subscriber, #9252) [Link] (30 responses)

> I think it was yourself who mentioned in some thread from a while ago that LLVM is doing all sorts of optimisations based on pointer provenance when there is exactly no mention of them in any standard.

Indeed. That was a great violation of the rules and that's what pushed me to start using Rust, finally. I had always been sitting on the fence about it, but when it turned out that with C/C++ I have to watch not just for the UBs actually listed in the standard, but also for other, vague and uncertain requirements which were supposed to be added to it more than a decade ago… at this point it became apparent that C/C++ are not suitable for any purpose.

The number of UBs is just too vast, they are not enumerated (C tried but, as the provenance issue shows, failed to list them all; C++ hasn't tried at all) and, more importantly, there is no automated way to separate code where UB may happen (and which I'm supposed to rewrite when a new version of the compiler is released) from code where UB is guaranteed to be absent (and which I can explicitly trust).

But that was just the final straw. Even without this gross violation of the rules it was becoming more and more apparent that UBs have to go: it's not clear if a low-level language without UB is possible in principle, but even if you eliminate them from 99% of code using automated tools, that would still be a great improvement.

> It's also not like compiler writers have never ignored standards when it suited them.

Very rarely. Except for that crazy provenance business (which is justified by DR260, and where the main issue lies with the fact that compilers started miscompiling programs without appropriate prior changes to the standard) I can only recall DR236 (where the committee acted as a committee and refused to offer any sane way to deal with the issue, just noting that the requirement of global knowledge is problematic).

And in cases where it was shown that UB requirements are just too onerous to bear they changed the standard in favor of C++ developers, thus it was definitely not a one-way street.

> If I think so, I can gather up a bunch of people go to some court to try it for a while and, if it works, maybe the basketball rules committee will take an interest and change the rules or maybe its the birth of a new sport.

You are perfectly free to do that with the compilers, too. Most are gcc/clang-based nowadays, thus you can just start by implementing your proposed changes, then measure things and promote your changes.

> So, yes, turning certain undefined behaviours into defined ones in a compiler is exactly the place to have this discussion and the rule changes can very well happen later.

Indeed, but it's your responsibility to change the compiler and show that it brings real benefits. Just like with most basketball players: they wouldn't even consider trying to play basketball with some changed rules unless you can show that other prominent players are trying that changed variant.

Some changes may even become the new -fwrapv, who knows? But it's changes to the language that need a justification; the rules are the rules by default.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:14 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> But that was just the final straw. Even without this gross violation of the rules it was becoming more and more apparent that UBs have to go: it's not clear if a low-level language without UB is possible in principle, but even if you eliminate them from 99% of code using automated tools, that would still be a great improvement.

It's easy to eliminate UB from any language. A computer language is Maths, and as such you can build a complete and pretty much perfect MODEL.

When you run your program, it's Science (or technology), and you can't guarantee that your model and reality coincide, but any AND ALL UB should be "we have no control over the consequences of these actions, because we are relying on an outside agent". Be it a hardware random number generator, a network card receiving stuff over the network, a buggy CPU chip, etc. etc. The language should be quite clear - "here we have control therefore the following is *defined* as happening, there we have no control therefore what happens is whatever that other system defines". If that other system doesn't bother to define it, it's not UB in your language.

And that's why Rust and Modula-2 and that all have unsafe blocks - it says "here be dragons, we cannot guarantee language invariants".

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 4, 2022 19:32 UTC (Wed) by khim (subscriber, #9252) [Link]

> And that's why Rust and Modula-2 and that all have unsafe blocks - it says "here be dragons, we cannot guarantee language invariants".

You just have to remember that not all unsafe blocks are marked with nice keywords. E.g. a presumably “safe” Rust program may open /proc/self/mem.

But even then: it's time to stop making languages which allow UB to happen just in any random piece of code which does no such crazy things!

> It's easy to eliminate UB from any language. A computer language is Maths, and as such you can build a complete and pretty much perfect MODEL.

Sometimes the model is just too restrictive. E.g. in Rust you have to go into the unsafe realm just to create a queue or a linked list.

But even that rigid and limited model allows one to remove UB from a surprising percentage of your code. And it's the only way to write complex code. The C/C++ approach doesn't scale. We don't have enough DJBs.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 0:33 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (27 responses)

As I understand it, WG21 (i.e. C++) does have a sub-committee attempting to enumerate the Undefined Behaviour.

Provenance is difficult. I would agree with you that C++ didn't do a great job here by trying to kick this ball into the long grass rather than wrestle with the difficult problem, but I think we should be serious for a moment and consider that if C++11 had tried to do what Aria is merely proposing for Rust (and which of course is nowhere near close to having consensus to actually land in stable yet), it would have been stillborn.

The same exact people who are here talking about some hypothetical C dialect or new language where it does what they meant (whatever the hell that is) would have said provenance is an abomination, just let us do things the old way. Even though "the old way" doesn't have any coherent meaning which is why this came up as a Defect Report not as a future feature proposal.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 10:30 UTC (Thu) by khim (subscriber, #9252) [Link] (26 responses)

> Even though "the old way" doesn't have any coherent meaning which is why this came up as a Defect Report not as a future feature proposal.

It was abuse of the system, plain and simple.

One part of the standard was saying, back then, that only visible values matter and that if two pointers are identical they should behave identically.

That's the understanding of the vast majority of practical programmers, and this is what should have been kept as the default (even if it affected optimizations).

The other part of the standard was talking about pointer validity and contradicted the first; e.g. realloc in C99 (but, notably, not in C89) is permitted to return a different pointer which can be bitwise identical to the original one.

That's what the saboteurs wanted to hear and that's what they used to sabotage the C/C++ community.

> Provenance is difficult, I would agree with you that C++ didn't do a great job here by trying to kick this ball into the long grass rather than wrestle with the difficult problem

It's not a “difficult problem” at all. If the sabotage had failed, then the rules for when identical pointers are not considered to be identical would have become not a “hidden”, “unwritten” part of the standard, but a non-standard mode, an extension.

If it had been proven that they help to produce much better code, then they would have been enabled explicitly in many projects. And people would have become aware.

Instead the language was silently, and without evidence, changed behind developers' backs. That's a really serious issue IMNSHO. Laymen are not supposed to know all the fine points of the law. They especially are not supposed to know all the unwritten rules.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 12:47 UTC (Thu) by foom (subscriber, #14868) [Link] (25 responses)

> saboteurs
> sabotage

This is inflammatory and unhelpful language. That certainly doesn't actually describe anyone involved, as I suspect you're well aware.

I believe what you actually mean is that you disagree strongly with some of the decisions made, and consider them to have had harmful effects. Say that, not "saboteurs".

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:02 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (2 responses)

I think it accurately reflects the sentiment of the vast majority of C programmers who nowadays are terrified to upgrade their toolchain on every system upgrade, and to discover new silent breakage that brings absolutely no value at all except frustration, waste of time and motivation, and costs.

The first rule should be not to break what has reliably worked for ages. *even if that was incorrect in the first place*. As Linus often explains, a bug may become a feature once everyone uses it; usage prevails over initial intent.

I'm pretty certain that most of the recent changes were driven exclusively by pride, as in "look how smart the compiler became after my changes", forgetting that their users would like it to be trustworthy rather than smart.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:52 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

> The first rule should be not to break what has reliably worked for ages. *even if that was incorrect in the first place*. As Linus often explains, a bug may become a feature once everyone uses it; usage prevails over initial intent.

That's one possibility, yes. But there's another possibility: follow the standard. Dūra lēx, sed lēx (the law is harsh, but it is the law).

Under that assumption you just follow the law. What law says… goes. Even if what the law says is nonsense.

That's what C/C++ compiler developers promoted for years.

But if you pick that approach you cannot then turn around and say “oh, sorry, the law is too harsh, I don't want to follow it”.

Either the law is the law (everyone has to follow it, and following it is enough), or it's not the law.

> I'm pretty certain that most of the recent changes were driven exclusively by pride, as in "look how smart the compiler became after my changes", forgetting that their users would like it to be trustworthy rather than smart.

Provenance rules are not like that. They allow clang and gcc to eliminate calls to malloc and free in some cases. This can bring amazing speedups. And if these things were an opt-in option I would have applauded these efforts, and it would have been a great way to promote them and to, eventually, add them to the standard.
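
For example, something like this sketch (the function name is mine) can be folded to `return 42;` by a compiler doing provenance-based escape analysis, because the allocation provably never escapes:

#include <stdlib.h>

/* The allocation never escapes, so the malloc/free pair may be
 * deleted entirely and the function reduced to a constant. */
int no_escape(void) {
    int *p = malloc(sizeof *p);
    if (!p)
        return 0;
    *p = 42;
    int v = *p;
    free(p);
    return v;
}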

Instead they were introduced in a submarine patent way, without any options, not even opt-out options. And they break standards-compliant programs.

That's an act of sabotage, sorry.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 15:26 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

> That's one possibility, yes. But there's another possibility: follow the standard. Dūra lēx, sed lēx.
> Under that assumption you just follow the law. What law says… goes. Even if what the law says is nonsense.
> That's what C/C++ compiler developers promoted for years.

I wouldn't deny that, but:
- the law is behind a paywall
- lots of modern abstractions in interfaces sadly make it almost impossible to follow. Using lots of foo_t types everywhere, without even telling you whether they're signed or unsigned, 32- or 64-bit, causes lots of trouble when you have to perform operations on them, forcing you to cast them and enter the nasty area of type promotion. It's even worse when you try hard to avoid an overflow on a type you don't know, and the compiler, knowing better than you, manages to get rid of your check.

We're really fighting *against* the compiler to keep our code safe these days. This tool was supposed to help us instead. And it failed.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:38 UTC (Thu) by khim (subscriber, #9252) [Link] (21 responses)

> That certainly doesn't actually describe anyone involved, as I suspect you're well aware.

This describes the majority of people I have talked with. I was unable to find anyone who has honestly claimed that carefully reading the standard is enough to write code which wouldn't violate provenance rules.

On the contrary: people expressed regret or, in many cases, anger about the fact that the standard doesn't enable the use of provenance rules by the compilers, yet none claimed that those rules follow from the existing standard.

If the standard is the treaty between compiler and programmer then this was a deliberate breakage of the treaty. Worse: it placed the users of the compiler at a serious disadvantage. How are they supposed to follow rules which don't exist? Especially if even the people who are yet to present these rules agree that they are really complex and convoluted?

And when people, horrified, asked for a -fno-provenance option? They got this answer: although provenance is not defined by either standard, it is a real and valid emergent property that every compiler vendor ever agrees on.

Sorry, but if that is not an act of sabotage, then I don't know what is. And people who carry out sabotage are called saboteurs.

> I believe what you actually mean is that you disagree strongly with some of the decisions made, and consider them to have had harmful effects.

No. It's not about me agreeing with something or disagreeing with something.

> Say that, not "saboteurs".

Let me summarize what happened:

  1. Compiler developers have found out that certain parts of the standard allow language users to write certain questionable programs which are hard to optimize.
  2. After that was acknowledged, they didn't change the standard, yet decided to break these standards-compliant programs.
  3. They didn't mention that fact in the release notes and spent no effort to deliver that information to users in any way.
  4. They also refused to offer any options which would allow one to use the questionable constructs.
  5. Naturally they also refuse to enable such options when you compile programs with -std=c89 (variant of the C standard which does not include any provisions for provenance whatsoever).

Sorry, but when you knowingly break standards-compliant programs and refuse to support them in any shape or form — that's an act of sabotage.

DeVault: Announcing the Hare programming language

Posted May 9, 2022 18:58 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (20 responses)

> Naturally they also refuse to enable such options when you compile programs with -std=c89 (variant of the C standard which does not include any provisions for provenance whatsoever).

Suppose it is the year 1889, the twentieth century seems bright ahead, and you have learned everything there is to know (or so you think) about Set Theory. (This is called Naive Set Theory). It seems to you that this works just fine.

Fast forward a few years, and this chap named Ernst Zermelo says your theory is incomplete and that's why it suffers from some famous paradoxes. His new axiomatised Set Theory requires several exciting new axioms, including the infamous Axiom Of Choice and with these axioms Zermelo vanquishes the paradoxes.

Now, is Ernst wrong? Was your naive theory actually fine? Shouldn't you be allowed to go back to using that simpler, "better" set theory and ignore Ernst's stupid paradoxes and his obviously nonsensical Axiom of Choice? No. Ernst was right, your theory was unsound, _and it was already unsound in 1889_, you just didn't know it yet. Your naive theory _assumed_ things which Zermelo made into axioms.

Likewise, the C89 compilers you have nostalgia for were actually unsound, and it would have been possible (or if you resurrect them, is possible today on those compilers) to produce nonsensical results because in fact pointer provenance may not have been mentioned in the C89 standard but it was relied upon by compilers anyway. It was silently assumed, and had been for many years.

The excellent index for K&R's Second Edition of "The C Programming Language", covering C89, doesn't even have an entry for the words "alias" or "provenance". Because there are _assumptions_ about these things baked into the language, but they haven't been surfaced.

The higher level programming languages get to have lots of optimisations here because assumptions like "pointer provenance" are necessarily true in a language that only has references anyway. To keep those same optimisations (as otherwise they'd be slower!) C and C++ must make the assumptions too, and yet to deliver on their "low level" promise they cannot. Squaring this circle is difficult which is why the committees punt rather than do it over all these years.

I happen to think Rust (more by luck than judgement so far as I can see) got this right. If you define most of the program in a higher-level ("safe" in Rust terms) language, you definitely can have all those optimisations and then compartmentalize the scary assumption-violating low-level stuff. This is what Aria's Tower of Weakenings is about too. Aria proposes that even most of unsafe Rust can safely keep these assumptions: something like Haphazard (the Hazard Pointer implementation) doesn't do anything that risks provenance confusion, so it's safe to optimize with those assumptions, and more stuff like that can be safely labelled safe, until only the type of code that really mints "pointers" from integers out of nowhere cannot justify the assumptions and accordingly cannot be optimised successfully.

It's OK if five machine instructions can't be optimised, probably even if they're in your tight inner loop; certainly it's better than accidentally optimising them to four *wrong* instructions. What's a problem for C and C++ is that the provenance problem is infectious and might spread from that inner loop to the entire program, and then you're back to slower than Python.

DeVault: Announcing the Hare programming language

Posted May 9, 2022 21:55 UTC (Mon) by Wol (subscriber, #4433) [Link] (19 responses)

> Your naive theory _assumed_ things which Zermelo made into axioms.

Whoops. I know a lot of people think my view of maths is naive, but that's not what I understand an axiom to be. An axiom is something which is *assumed* to be true, because it can't be proven. Zermelo would have provided a proof, which would have changed your naive theory from axiom to proven false.

This is like Euclid's axiom that parallel lines never meet. That axiom *defines* the special case of Euclidean geometry, but because it's false in the general case, it's not an axiom of geometry as a whole.

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 9, 2022 22:39 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (18 responses)

Maybe I didn't express myself very well.

As I understand it, the problem with assumptions in naive set theories and C89 (and various other things) is that because it's an assumption rather than an axiom you don't spot where the problems are. You never write it down at all and so have no opportunity to notice that it's too vague, whereas when you're writing an axiom you can see what you're doing.

The naive theories let Russell's paradox sneak in by creating this poorly defined set, but the axioms in Zermelo's theory oblige you to define a set more carefully to have a set at all, and in that process the paradox gets defused. In particular ZFC has a "specification" axiom which says in essence OK, so, tell me how to make this "set" using another set and first order logic. The sets naive set theories were created for can all be constructed this way no problem, but weird sets with paradoxical properties cannot.

C89 assumes that pointers to things must be different, which sounds reasonable but does not formally explain how this works. I believe that it's necessary (in order for a language like C89 to avoid its equivalent of paradoxes, the programs which abuse language semantics to do something hostile to optimisation) to define such rules, and they're going to look like provenance.
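
To make that concrete, here is a sketch of the classic puzzle from the provenance literature: p is a valid one-past-the-end pointer for x, q points to y; the comparison is allowed and may well be true on a given implementation, yet whether the store through p may touch y is exactly the question provenance rules exist to answer.

#include <stdio.h>

int main(void) {
    int x = 1, y = 2;
    int *p = &x + 1;   /* valid one-past-the-end pointer */
    int *q = &y;
    if (p == q) {      /* result depends on object layout */
        *p = 10;       /* UB under provenance-based semantics,
                          despite p == q holding */
        printf("%d\n", y);
    }
    return 0;
}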

I do not believe that C89 is fine, and thus that we should or could just implement C89 as if provenance isn't a thing and be happy. That's my point here. C89 wasn't written in defiance of a known reality, but in ignorance of an unknown one, like the Ptolemaic system. Geocentrists today are different from Ptolemy, but not because Ptolemy was right.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 2:18 UTC (Tue) by khim (subscriber, #9252) [Link] (17 responses)

> C89 assumes that pointers to things must be different, which sounds reasonable but does not formally explain how this works.

No. It's the opposite. C89 doesn't assume that pointers to things must be different. But yes, it asserts that if pointers are equal then they are completely equivalent — you can compare any two pointers and go from there.

Note that it doesn't work in the other direction: it's perfectly legal to have pointers which are different yet point to the same object. That's easily observable in MS-DOS's large memory model.

> I do not believe that C89 is fine, and thus that we should or could just implement C89 as if provenance isn't a thing and be happy.

Show me a Russell's paradox, please. Not in the form “the optimizer could not do X or Y, which is a nice thing to be able to do and thus we must punish the software developer who assumes pointers are just memory addresses”, but “this valid C89 program can produce one of two outputs depending on how we read the rules and thus it's impossible to define how it should work”.

Then and only then you would have a point.

> That's my point here.

I think you are skipping one important step there. Yes, there was an attempt to add rules about how certain pointers have to point to different objects. But it failed spectacularly. It was never adopted and the final text of the C89 standard doesn't even mention it. In C89 pointers are just addresses, with no crazy talk about pointer validity, object creation and disappearance, and so on.

There are some inaccuracies from that failed attempt: C89 defines only two lifetimes: static and automatic… yet somehow the value of a pointer that refers to freed space is indeterminate. Yet if you just declare that pointers are addresses then it should be possible to fix these inaccuracies without much loss.

Non-trivial lifetimes first appeared in C99, not C89. There, yes, it became impossible to read from a "union member other than the last one stored into", there limitations on whether you can compare two pointers or not were added (previously it was possible to compare two arbitrary pointers, and if they happened to be valid and equal, they would be equivalent), etc.

But I don't see why the C89 memory model would be, somehow, unsound. Hard to optimize? Probably. Wasteful? Maybe so. But I don't see where it's inconsistent. Show me. Please.

Yes, it absolutely rejects pointer provenance in any shape or form (except something like CHERI, where provenance actually exists at runtime and is 100% consistent). Yes, it may limit optimization opportunities. But where's the Russell's paradox, hmm?

DeVault: Announcing the Hare programming language

Posted May 10, 2022 15:04 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (6 responses)

I don't have a Russell's Paradox equivalent under your constraints, but I think it's worth highlighting just how severe those constraints are.

Your resulting C compiler is not the GCC I grew up with (well, OK, the one that teenage me knew), nor that compiler with some minor optimisation passes disabled; it's an altogether different animal, perhaps closer to Python. In this language, pointers are all just indexes into an array containing all of memory - including the text segment and the stack - and so you can do some amazing acrobatics as a programmer, but your optimiser is badly hamstrung. C's already poor type checking is further reduced in power in the process, which again makes the comparison to Python seem appropriate.

I don't believe there is or was an audience for this compiler. People weren't writing C because of the expressive syntax, the unsurpassed quality of the tooling or the comprehensive "batteries included" standard library, it didn't have any of those things - they were doing it because C compilers produce reasonable machine code, and this alternative interpretation of C89 doesn't do that any more.

> (except something like CHERI where provenance actually exist at runtime and is 100% consistent)

You can only do this at all under CHERI via one of two equally useless routes:

1. The "Python" approach I describe where you declare that all "pointers" inherit a provenance with 100% system visibility, this obviously doesn't catch any bugs, and you might as well switch off CHERI, bringing us to...

2. The compatibility mode. As I understand it Morello provides a switch so you can say that now we don't enforce CHERI rules, the hidden "valid" bit is ignored and it behaves like a conventional CPU. Again you don't catch any bugs.

This is because under your preferred C89 "no provenance" model there isn't any provenance; CHERI isn't a fairytale spell, it's just engineering.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 16:10 UTC (Tue) by khim (subscriber, #9252) [Link] (3 responses)

> Your resulting C compiler is not the GCC I grew up with (well, OK, the one that teenage me knew), or that with some minor optimisation passes disabled, it's an altogether different animal, perhaps closer to Python.

But that is the language which Kernighan and Ritchie designed and used to write Unix. Their goal was not to create some crazy portable dream; they just wanted to keep supporting both the 18-bit PDP-7 and the 16-bit PDP-11 from the same codebase by rewriting some parts of the code written in PDP-7 assembler in a higher-level language. They had been using B, which had no types at all, and improved it by adding character types, then structs, arrays, pointers (yes, B conflated pointers and integers, it only had one type).

Yet malloc was not special, free was not special, and not even all Unix programs used them (just look at the source of the original Bourne Shell some day).

> I don't believe there is or was an audience for this compiler.

How about “all the C users for the first decade of its existence”? Initially C was used exclusively on Unix, but in the 1980s it came to be used more widely. Yet I don't think any compilers of that era supported anything even remotely resembling “pointer provenance”.

That craziness started after a failed attempt by the C standard committee to redo the language. They then went back and replaced that with simpler aliasing rules which prevented type punning, but even those weren't abused by compilers until the 21st century.

> People weren't writing C because of the expressive syntax, the unsurpassed quality of the tooling or the comprehensive "batteries included" standard library, it didn't have any of those things - they were doing it because C compilers produce reasonable machine code, and this alternative interpretation of C89 doesn't do that any more.

Can you, please, stop rewriting history? C was quite popular way before ANSI C arrived and tried (but failed!) to introduce crazy aliasing rules. Yes, C compilers were restricted and couldn't do all the amazing optimizations… but C developers could do them instead! When John Carmack was adopting his infamous 0x5f3759df-based trick he certainly didn't care to think about the fact that there are some aliasing rules which may render the code invalid, and that was true for the majority of users who grew up in an era before GCC started breaking good K&R programs.
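For readers who haven't seen it, that trick reads a float's bit pattern through an integer pointer - exactly the kind of type punning that later aliasing rules forbid. A lightly cleaned-up sketch of the widely circulated code (int32_t substituted for the original's long, which assumed 32 bits):

#include <stdint.h>

float Q_rsqrt(float number)
{
    const float threehalfs = 1.5F;
    float x2 = number * 0.5F;
    float y = number;
    int32_t i = *(int32_t *)&y;          /* read the float's bits as an integer: type punning */
    i = 0x5f3759df - (i >> 1);           /* the magic constant */
    y = *(float *)&i;                    /* reinterpret the bits as a float again */
    y = y * (threehalfs - (x2 * y * y)); /* one Newton-Raphson refinement step */
    return y;
}

Under a "pointers are addresses" reading this is perfectly sensible; under strict aliasing the two punning dereferences are undefined behaviour.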

> This is because under your preferred C89 "no provenance" model there isn't any provenance, CHERI isn't a fairytale spell it's just engineering.

It's engineering, yes, but you can add provenance to C89. It just has to be consistent. You can even model it with “poor man's CHERI” aka MS-DOS Large model by playing tricks with segment and offset. E.g. realloc could turn 0x0:0x1234 pointer into 0x1:0x1224 pointer if it decided not to move an object.

This way all the fast-path pointer-equality checks would be defeated, and you would never have the situation where bitwise-identical pointers point to different objects. This may not be super-efficient but it is compatible with C89. Remember? Bitwise-different pointers can point to the same object, but the opposite is forbidden! Easy, no?

All these games where certain pointers can be compared but not others, and it's the programmer's responsibility to remember all these unwritten rules… I don't know how that language can be used for development, sorry.

The advice I have gotten from our toolchain-support team is to ask the clang developers about any low-level constructs which I may wish to create!

So much for “standard is a treaty” talks…

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:03 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

Early C did not have a formal specification - what the one and only implementation did was what the language did.

And the problem is that formally specified C - including K&R C, and C89 - left a huge amount unspecified; users of C assumed that the behaviour of their implementation of the language was C behaviour, and not merely the way their implementation behaved.

Up until the early 1990s, this wasn't a big deal. The tools needed for compilers to do much more than peephole optimizations simply didn't exist in usable form; remember that SSA, the first of the tools needed to start reasoning about blocks or even entire programs, doesn't appear in the literature until 1988. As a result, most implementations happened, more by luck than judgement, to behave in similar ways where the specification was silent.

But then we got SSA, the polytope model, and other tools that allowed compilers to do significant optimizations beyond peephole optimizations on the source code, their IR, and the object code. And now we have a mess - the provenance model, for example, is compiler authors trying to find a rigorous way to model what users "expect" from pointers, not just what C89 permits users to assume, while C11's concurrent memory model is an effort to come up with a rigorous model for what users can expect when multiple threads of execution alter the state of the abstract machine.

Remember that all you are guaranteed about your code in C89 is that the code behaves as-if it made certain changes to the abstract machine for each specified operation (standard library as well as language), and that all volatile accesses are visible in the real machine in program order. Nothing else is promised to you - there's no such thing as a "compiler barrier" in C89, for example.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:57 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

> And the problem is that formally specified C - including K&R C, and C89 - left a huge amount unspecified; users of C assumed that the behaviour of their implementation of the language was C behaviour, and not merely the way their implementation behaved.

True, but irrelevant. The most important part, the one we are discussing here, was specified in both: pointers are addresses, and if two pointers are equal they can be used interchangeably.

> But then we got SSA, the polytope model, and other tools that allowed compilers to do significant optimizations beyond peephole optimizations on the source code, their IR, and the object code.

Yes. And there was an attempt to inject the ideas that make these useful into C89. A failed one. The committee had created an unreal language that no one could or would actually use. It was ripped out (and replaced with the crazy aliasing rules, but that's another story).

> And now we have a mess - the provenance model, for example, is compiler authors trying to find a rigorous way to model what users "expect" from pointers, not just what C89 permits users to assume

Can you, please, stop lying? Provenance models are trying to justify deliberate sabotage where fully standard-compliant programs are broken. These are not my words; the provenance proposal itself laments:

These GCC and ICC outcomes would not be correct with respect to a concrete semantics, and so to make the existing compiler behaviour sound it is necessary for this program to be deemed to have undefined behaviour.

To make the existing compiler behavior sound, my ass. The whole story of provenance started with sabotage: after the failed attempt to bring provenance-like properties to C89, the saboteurs returned in C99 and, finally, succeeded in adding some (and thus rendered some C89 programs invalid in the process). But that wasn't enough: they got the infamous DR260 resolution, which was phrased like this: After much discussion, the UK C Panel came to a number of conclusions as to what it would be desirable for the Standard to mean.

Note: the resolution hasn't changed the standard. It hasn't allowed saboteurs to break more programs. No. It was merely a suggestion to develop adjustments to the standards — and listed three cases where such adjustments were supposed to cause certain outcomes.

Nothing like that happened. For more than two decades compilers invented more and more clever ways to screw the developers and used that resolution as a fig leaf.

And then… Rust happened. Since the Rust guys are pretty concerned about program correctness (and LLVM sometimes miscompiled IR programs they perceived as correct) they went to the C++ guys and asked “hey, what are the actual rules we have to follow?” And the answer was… “here is the defect report, we use it to screw the developers and miscompile their valid programs”. Rust developers weren't amused.

And that is when the lie was, finally, exposed.

So, please, don't liken problems with pointer provenance to problems with C11 memory model.

Indeed, C89 or C99 doesn't allow one to write valid multi-threaded programs. Everything is defined strictly for single-threaded programs. To support programs where two threads of execution can touch objects “simultaneously” you need to extend the language somehow.

The provenance excuse is used to break completely valid C and C++ programs. It's not about extending the language, it's about narrowing it. Certain formerly valid programs have to be deemed to have undefined behaviour. And after more than two decades we don't even have the rules which we are supposed to follow finalized!

And they express it in the form of if you believe pointers are mere addresses, you are not writing C++; you are writing the C of K&R, which is a dead language. IOW: they know they sabotaged C developers, and they are proud of it.

> Nothing else is promised to you - there's no such thing as a "compiler barrier" in C89, for example.

Yes. And to express many things which would be needed to, e.g., write an OS kernel in C89, you need to extend the language in some way. This is deliberate: the idea was to make sure strictly-conforming C89 programs run everywhere, while conforming programs may require certain language extensions. Not ideal, but it works.

Saboteurs managed to screw it all completely and categorically refuse to fix what they have done.

This looks like a good time to stop

Posted May 10, 2022 20:05 UTC (Tue) by corbet (editor, #1) [Link]

When we start seeing this type of language, and people accusing each other of lying, it's a strong signal that a thread may have outlived its useful lifetime. This article now has nearly 350 comments on it, so I may not be the only one who feels like that outliving happened a little while ago.

I would like to humbly suggest that we let this topic rest at this point.

Thank you.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:16 UTC (Tue) by Vipketsh (guest, #134480) [Link] (1 responses)

Where provenance makes sense is if the language you have has some higher-level concept of an object, with properties described in the program and known to the compiler. Most importantly, object lifetime is known. This ends up bringing with it rules such as: you can not just arbitrarily turn one object into another and back again (i.e. no pointer casts), you can not arbitrarily split one object into two (i.e. no pointer arithmetic), and you can not arbitrarily manufacture pointers out of random data. Unfortunately C is not such a language, and by forcing provenance rules on it one is in essence trying to retrofit some kind of object model to it without any of the expressiveness and enforced rules that are needed for the programmer to not make programmes that are obviously wrong under the assumptions (i.e. provenance). Worse yet, the rules - or rather heuristics* - that standards and compilers chose to signal lifetime are counter to what existing programmes expect and exploit.

Re: your mathematics analogy. I think you have taken the wrong viewpoint there: pretty much all of mainstream mathematics is concerned with either extending existing and useful theory (e.g. how rational numbers were extended to create irrational numbers) or putting existing theory on a more sound footing (e.g. Hilbert's axioms), possibly closing off various paradoxes. Realise how in pretty much all of the evolution of mathematics a very strong emphasis was placed on any new theory being backwards compatible -- no-one taken seriously ever wanted to end up with 1+1=3; instead they worked to solidify the intuition that 1+1=2. I think that if one wanted to paint mathematical evolution onto C, the definitions underpinning C would need to be changed in a way that (i) they are backwards compatible with existing programs and (ii) the loopholes exploited by compiler writers are closed instead of officially sanctioned. Right now, it's the opposite: people are trying to convince C programmers that the intuition they had all along was always false and reality is actually completely different.

*: Possibly the one with the most problems is the idea that realloc() will, in the absence of failure, always (i) destroy the object passed to it, and (ii) allocate a completely new one. This is counter to the intuition of many a programmer, and there is no enforced rule in C that prevents programmers from carrying pointers over the realloc() call, which would make the idea actually work. The reason people are annoyed is that such code exists, is used, has worked for a long time, and there is no evidence that this idea has much, if any, benefit for the compiled program.
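To make the footnote concrete, here is a minimal sketch of the carry-over pattern (names are illustrative, not from any real codebase). Under a "pointers are addresses" reading it is fine; under the provenance reading, any use of a pointer derived from the old block after realloc() is undefined:

#include <stdlib.h>

char *grow(char *buf, size_t newsz, char **cursor)
{
    char *p = realloc(buf, newsz);
    if (p == NULL)
        return NULL;
    if (p != buf)
        *cursor = p + (*cursor - buf); /* provenance says even this arithmetic
                                          on the freed `buf` is UB */
    /* if p == buf, *cursor is left alone and keeps being used:
       exactly the carry-over described above */
    return p;
}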

DeVault: Announcing the Hare programming language

Posted May 11, 2022 16:09 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

> Where provenance makes sense is if the language you have has some higher level concept of an object which has some properties described in the program and known to the compiler.

I shall quote K&R themselves on this subject in their second edition:

"An object, sometimes called a variable, is a location in storage, and its interpretation depends on two main attributes ..."

> This ends up bringing with it rules such as you can not just arbitrarily turn one object into another and back again (i.e. no pointer casts), you can not arbitrarily split one object into two (i.e. no pointer arithmetic)

Nope, a language can (and some do - notably Rust of course, but also, with care, C and C++) provide pointer casts and pointer arithmetic. Provenance works just fine for these operations.

Rust's Vec::split_at_spare_mut() isn't even unsafe. Most practical uses for this feature are unsafe, but the call itself is fine: it merely gives you back your Vec<T>'s contents (which will now need to re-allocate if grown, because any spare space at the end of it is gone) and that space which was not used yet, as a mutable slice of MaybeUninit<T>, to do with as you see fit.

> and you can not arbitrarily manufacture pointers out of random data.

But here's where your problem arises. Here provenance is conjured from nowhere. It's impossible magic.

> Unfortunately C is not such a language and by forcing provenance rules on it, one is in essence trying to retrofit some kind of object model to it without any of the expressiveness and enforced rules that are needed for the programmer to not make programmes that are obviously wrong under the assumptions (i.e. provenance)

As we saw C is in fact such a language after all. The fact that many of its staunchest proponents don't seem to understand it very well is a problem for them and for C.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 16:39 UTC (Tue) by farnz (subscriber, #17727) [Link] (9 responses)

When you say "if pointers are equal, then they are completely equivalent", are you talking at a single point in time, or over the course of execution of the program?

Given, for example, the following program, it is a fault to assume that ptr1 and ptr2 are equivalent throughout the runtime, because ptr1 is invalidated by the call to release_page:


handle page_handle = get_zeroed_page();
int test;
int *ptr2;
int *ptr1 = get_ptr_to_handle(page_handle);
*ptr1 = -1; // legitimate - ptr1 points to a page, which is larger than an int in this case and correctly aligned etc.
test = *ptr1; // makes test -1
release_page(page_handle);
page_handle = get_zeroed_page();
ptr2 = get_ptr_to_handle(page_handle); // ptr2 could have the same numeric value as ptr1.
if (ptr2 == ptr1 && *ptr1 == test) {
    puts("-1 == 0");
} else {
    puts("-1 != 0");
}
release_page(page_handle);

This is the sort of code that you need to be clear about; C89's language leaves it unclear whether it's legitimate to assume that *ptr1 == test, even though the only assignments in the program are to *ptr1 (setting it to -1) and test. The thing that hurts here is that even if, in bitwise terms including hidden CHERI bits etc, ptr1 == ptr2, it's possible for the underlying machine to change state over time, and any definition of "completely equivalent" has to take that into account.

One way to handle that is to say that even though the volatile keyword does not appear anywhere in that code snippet, you give dereferencing a pointer volatile-like semantics (basically asserting that locations pointed to can change outside the changes done by the C program), and say that each time it's dereferenced, it could be referring to a new location in physical memory. In that case, this program cannot print "-1 == 0", because it has to dereference ptr1 to determine that.

Another is to follow the letter of the C89 spec, which says that the only things that can change in the abstract machine's view of the world other than via a C statement are things marked volatile. In that case, this program is allowed to print "-1 == 0" or "-1 != 0" depending on whether ptr1 == ptr2, because the implementation "knows" that it is the only thing that can assign a value to *ptr1, and thus it "knows" that because no-one has assigned through *ptr1 since it read the value to get test it is tautologically true that *ptr1 == test.

Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile. But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

And that's the fundamental issue with rewinding to C89 rules - C89 implies very strongly that the only interface between things running on the "abstract C89 machine" and the real hardware are things that are marked as volatile in the C abstract machine. In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

Note, too, that I wasn't talking about optimization in either case - I'm simply looking at the semantics of the C abstract machine as defined in C89, and noting that they're not powerful enough to model a change that affects the abstract machine but happens outside it. I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:00 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> Both are valid readings of this source under the rules set by C89, because C89 states explicitly that the only thing expected to change outside of explicit changes done by C code are things marked as volatile.

No. Calling fread and fwrite can certainly change things, too.

> But in this case, the get_zeroed_page and release_page pair change the machine state in a fairly dramatic way, but in a way that's not visible to C code - changing PTEs, for example.

Yes and no. Change is dramatic, sure. But it's most definitely visible to C code.

By necessity such things have to either be implemented with volatile or by calling a system routine (which must be added to the list of functions like fread and fwrite as a system extension, or else you couldn't use them).

The place where you pass your pointer to invlpg would be the place where the compiler would know that the object may suddenly change its value.

> In practice, nobody has bothered being that neat, and we accept that there's a whole world of murky, underdefined behaviour where the real hardware changes things that affect the behaviour of the C abstract machine, but it happens that C compilers have not yet exploited that.

In practice people who are doing these things have to use volatile at some point in the kernel, or else it just wouldn't work. Thus I don't see what you are trying to prove.

The fact that real OSes have to expand the list of “special” functions which may do crazy things? It's obvious. In practice your functions are called mmap and munmap, and they should be treated by the compiler similarly to read and write: the compiler either has to know what they are doing, or it should assume they may touch and change any object they can legitimately reach given their arguments.

> I find it very tough, within the C89 language, to find anything that motivates the position that *ptr1 != test given that ptr2 == ptr1 and *ptr2 != test - it's instead explicitly undefined behaviour.

No. You couldn't do things like changes to PTEs in a fully portable C89 program. It's pointless to talk about such programs since they don't exist.

The only way to do it is via asm and/or a call to a system routine which, by necessity, needs extensions to the C89 standard to be usable. In both cases everything is fully defined.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:10 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

fread and fwrite are poor examples, because they are C code defined in terms of the state change they make to the abstract machine, and with a QoI requirement that the same state change happens to the real machine. Indeed, everything that's defined in C89 has its impact on the abstract machine fully defined by the spec; the only get-out is that volatile marks something where all reads and writes through it must be visible in the real machine in program order.

But note that this is a very minimal promise; the only thing happening in the real machine that I can reason about in C89 is the program order of accesses to volatiles. Nothing else that happens in the abstract machine is guaranteed to be visible outside it - everything else is left to the implementation's discretion.

And no, the state change is not visible inside the C89 abstract machine; if I write through a volatile pointer to a PTE, the implementation must ensure that my write happens in the real machine as well as the abstract machine, but it does not have to assume that anything has changed in the abstract machine. That, in turn, means that it may not know that ptr1 now has changed in the "real" machine, because it's not volatile and thus changes in the real machine are irrelevant.

And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change. Now, depending on the architecture, that almost certainly is not enough to guarantee an instant change - e.g. on x86, the processor can use old or new value of the PTE until the TLB is flushed, and I can force a TLB flush with invlpg to get deterministic behaviour - but I can bring the program into a non-deterministic state without calling into an assembly block or running a system routine, as long as I have the address of a PTE.

And there's no "list of system routines" in C89; the behaviour of fread, fwrite and other such functions is fully defined in the abstract machine by the spec, with a QoI requirement to have their behaviour be reflected in the "real" machine. By invoking the idea of a "list of system routines", you're extending the language beyond C89.

You're making the same mistake a lot of people make, of assuming that the behaviour of compilers in the early 1990s and earlier reflected the specification at the time, and wasn't just a consequence of limited compiler technology. If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible; provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89) while still allowing the compiler to reason about the meaning of your code in a way compatible with the C89 specification.
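(To illustrate the bit-stuffing trick, here is a sketch using C99's uintptr_t for clarity - exactly the kind of code provenance models try to bless. malloc results are suitably aligned, so the low bit is known to be zero:)

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int *tag(int *p, int flag) { return (int *)((uintptr_t)p | (flag & 1)); }
int  get_flag(int *p)      { return (int)((uintptr_t)p & 1); }
int *untag(int *p)         { return (int *)((uintptr_t)p & ~(uintptr_t)1); }

int main(void)
{
    int *p = malloc(sizeof *p);
    int *t = tag(p, 1);        /* smuggle a flag into the low bit */
    assert(get_flag(t) == 1);
    *untag(t) = 42;            /* clearing the bit must give back a usable p */
    free(p);
    return 0;
}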

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:01 UTC (Tue) by khim (subscriber, #9252) [Link]

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Which is the only way to have PTEs in C code.

> And I absolutely can change a PTE without assembly or a system routine, using plain C code; all I need is something that gives me the address of the PTE I want to change.

If you have such an address then you have to extend the language somehow. Or, alternatively, don't touch it.

> By invoking the idea of a "list of system routines", you're extending the language beyond C89.

Of course. Because it's impossible to write a C89 program which changes the PTEs, such a concept just couldn't exist in it. You have to extend the language to cover that use case.

> If compilers really did implement C89 to the letter of the specification

…then such compilers would have been as popular as ISO 7185 Pascal. Meaning: no one would have cared about them and no one would have mentioned their existence.

> If compilers really did implement C89 to the letter of the specification, then much of what makes C useful wouldn't be possible

Yes. But some programs would still be possible. Programs which do tricks with pointers would work just fine, programs which touch PTEs wouldn't.

> provenance is not something that's new, but rather an attempt to let people do all the tricks like bit-stuffing into aligned pointers (which is technically UB in C89)

Citation needed. Badly. Because, I would repeat once more, in C89 (not in C99 and newer) the notion of “pointers which have the same bit pattern yet are different” doesn't exist. If you add a few bits to a pointer converted to an integer and then clear these same bits, you would get the exact same pointer back — guaranteed. The fact that these bits are the lowest bits of the converted integer is an implementation-specific thing; you can imagine a case where they would live as the top bits instead. So yes, that requires some implementation-specific extension. But a pretty mild and limited one.

Yes, provenance is an attempt to deal with the idea of C99+ that some pointers may be equal to others yet, somehow, still different — but that's not allowed in C89. If two pointers are equal then they are either both null pointers, or both point to the same object, end of story.

Sure, it makes some optimizations hard and/or impossible. So what? This just means that you cannot do such optimizations in C89 mode. Offer -fno-provenance option, enable it for -std=c89, done.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 17:26 UTC (Tue) by Vipketsh (guest, #134480) [Link] (5 responses)

You are playing a nice sleight-of-hand here. Your code clearly shows function calls, and without knowing what's in those called functions, one has no choice but to assume that *ptr1 has changed (within the C abstract machine), and thus you can never get to the output of "-1 == 0". If, on the other hand, you do see the internals of those functions, you will see the modification of some memory that may alias with ptr1, and then because of that you can not get the "-1 == 0" output.

The only way I can see your reasoning working is if you are somehow allowed to assume that function calls are an elaborate way of saying "nop".

DeVault: Announcing the Hare programming language

Posted May 10, 2022 18:36 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

If I promise that the function calls are just naming what the code does, but its real behaviour is poking global volatile pointers, and those functions are implemented in pure C89, there's no difference in behaviour. Given the following C definitions of get_zeroed_page, get_ptr_to_handle and release_page, you still have the non-determinism, albeit I've introduced a platform dependency:


const size_t PAGE_SIZE;
struct pt_entry {
    volatile char *v_addr;
    volatile char *p_addr;
};
volatile struct pt_entry *pte; // Initialized by outside code, with suitable guarantees on v_addr for the compiler implementation and on p_addr
int *page_location = pte->v_addr;

void *get_zeroed_page() {
   pte->p_addr += PAGE_SIZE;
   memset(pte->v_addr, 0, PAGE_SIZE);
   return pte->v_addr;
}

void release_page(void *handle) {
  assert(handle == pte->v_addr);
  pte->p_addr -= PAGE_SIZE;
}

void *get_ptr_to_handle(void* handle) {
  assert(handle == pte->v_addr);
  return page_location;
}

This has semantics on the real machine, because of the use of volatile - the writes to *pte are guaranteed to occur in program order. But the compiler does not have any way to know that volatile char *v_addr ends up with the same value between two separate calls to get_ptr_to_handle but points to different memory.

Also, I'd note that C89 does not have language asserting what you claim - it actually says quite the opposite, that the compiler does not have to assume that *ptr1 has changed within the C abstract machine, since ptr1 is not volatile. It's just that early implementations made that assumption because to do otherwise would require them to analyse not just the function at hand, but also other functions, to determine if *ptr1 could change. Like khim, you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 19:22 UTC (Tue) by Vipketsh (guest, #134480) [Link] (2 responses)

I still don't see it. You have that memset here:

> void *get_zeroed_page()
> [...]
> memset(pte->v_addr, 0, PAGE_SIZE);

If you don't have to assume that this writes over data pointed to by some other pointer, it means that your aliasing rules say that no two pointers alias. Or, put another way, for all practical purposes having two pointers pointing to the same thing is unworkable. By some reading of C89 that may be the conclusion, but quite clearly that was never the intent, and exactly no one expects things to work that way (including compiler writers, oddly enough).

> compiler does not have to assume that *ptr1 has changed within the C abstract machine

You mean across a function call? That quite simply means that no data could ever be shared by any two functions (in different compilation units). Again, this would make the language completely unworkable and be counter to what anyone expects.

> [...] you're picking up on a limitation of 1980s and early 1990s compilers, and assuming it's part of the language as defined, and not merely an implementation detail.

No. The language is defined, first and foremost, by what existing programs expect. If the standard allows interpretations and compilers to do things counter to what a majority of these programs expect, it is the standard that is broken and not the majority of all programs. I firmly believe that the job of a standard is to document existing behaviour and not to be a tool to change all programmes out there.

p.s.: I find it fascinating that instead of arguing about actual behaviour, the C standard keeps coming up as if it were a bible handed down by some higher power in which everything is completely infallible. Then the conclusion is that "See? It all sucks, so use Rust", because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

DeVault: Announcing the Hare programming language

Posted May 10, 2022 20:10 UTC (Tue) by khim (subscriber, #9252) [Link]

> Then the conclusion is that "See? It all sucks, so use Rust" because Rust is so excruciatingly well specified that, last I checked, it has no specification at all.

Rust hasn't needed any specs because until very recently there was just one compiler. Today we have 1.5: the LLVM-based rustc and the GCC-based one. One more is in development, thus I assume formal specs would be higher on the list of priorities now.

This being said, IMNSHO it's better to have no specs than to have ones which are silently violated by actual compilers. At least when there are no specs you know that discussions between compiler developers and language users have to happen; when you have one which is ignored…

DeVault: Announcing the Hare programming language

Posted May 10, 2022 23:48 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

Rust doesn't have anything similar to ISO/IEC 14882:2020, a large written document which is the product of a lot of work but is of limited practical value since it doesn't describe anything that actually exists today.

However, Rust does extensively document what is promised (and what is not) about the Rust language and its standard library, and especially the safe subset which Rust programmers should (and most do) spend the majority of their time working with.

For example, all that ISO document has to say about what happens if I've got two byte-sized signed integers which may happen to have the value 100 in them and I add them together is that this is "Undefined Behaviour" and offers no suggestions as to what to do about that besides try to ensure it never happens. In Rust the "no specification" tells us that this will panic in debug mode, but, if it doesn't panic (because I'm not in debug mode and I didn't enable this behaviour in release builds) it will wrap, to -56. I don't know about you, but I feel like "Absolutely anything might happen" is less specific than "The answer is exactly -56".

Rust also provides plenty of alternatives, including checked_add(), unchecked_add(), wrapping_add(), saturating_add() and overflowing_add(), depending on what you actually mean to happen on overflow, as well as the type wrappers Saturating and Wrapping, which are useful here (e.g. Saturating<i16> is probably the correct type for a 16-bit signed integer used to represent CD-style PCM audio samples)
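(For the C-minded, the wrapping behaviour Rust specifies can be spelled out explicitly even in C - a minimal sketch, noting that the final conversion back to int8_t is implementation-defined in C, though every mainstream compiler wraps:)

#include <stdint.h>
#include <stdio.h>

/* do the arithmetic in unsigned, where wraparound is fully defined */
int8_t wrapping_add_i8(int8_t a, int8_t b)
{
    return (int8_t)((uint8_t)a + (uint8_t)b);
}

int main(void)
{
    printf("%d\n", wrapping_add_i8(100, 100)); /* prints -56 */
    return 0;
}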

DeVault: Announcing the Hare programming language

Posted May 10, 2022 23:48 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

The access to volatile memory at pte->v_addr through the non-volatile pointer ptr1 is UB because according to [C89]:

> If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.[57]

> [57] This applies to those objects that behave as if they were defined with qualified types, even if they are never actually defined as objects in the program (such as an object at a memory-mapped input/output address).

Objects in the page at pte->v_addr behave as if they were defined as volatile objects because the content changes in ways not described by the C abstract machine when pte->p_addr is updated. The same applies to passing a pointer to volatile object(s) to memset(), which takes a non-volatile pointer.

The initializer for page_location (pte->v_addr) is also not a constant, but I assume this is just pseudo-code for the value being set by some initialization function not shown here.

[C89] http://port70.net/~nsz/c/c89/c89-draft.html

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:24 UTC (Wed) by khim (subscriber, #9252) [Link] (3 responses)

> I'm using more and more asm() statements to prevent the compiler from lurking into what I'm doing ?

Yes, absolutely.

> I don't think so.

Why not? You said you want to “use my processor and OS for the purpose they were built” and in C all such code has to live in asm blocks.

The fact that it took so long for you to realize that is kinda unfortunate, but why would you perceive it as a bad thing?

> It feels like one day my whole C code will only be a bunch of macroes based on asm() statements.

If you insist on using non-portable constructs in every line of code then sure, that's the proper outcome.

> That's not my goal when I'm using a C compiler, really.

That's the only proper way to write non-portable code in C. It makes non-portable code look like portable code, which is obviously a good thing.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 13:08 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (2 responses)

> You said you want to “use my processor and OS for the purpose they were built” and in C all such code have to live in asm block.

Ah no, sorry for not being clear. I have to use asm statements to prevent the compiler from being smart!

Typically stuff like this, which current compilers are not yet able to optimize away to produce stupid code:

#define GUESSWHAT(v) ({ typeof(v) _v = (v); asm volatile("" : "=rm"(_v) : "0"(_v)); _v; })

It usually only costs a move or two due to register allocation, so it's cheap. And using that in high-level code is ugly. But at least the compiler doesn't know my pointer's value and doesn't play games behind my back with it.
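(For illustration, a hypothetical use - pool_get() is a made-up allocator here - where the macro launders a pointer so the optimizer forgets whatever it inferred about it:)

void *p = pool_get();
p = GUESSWHAT(p); /* from here on, the compiler must treat p as an opaque value */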

DeVault: Announcing the Hare programming language

Posted May 5, 2022 14:08 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

> I have to use asm statements to prevent the compiler from being smart!

That's the wrong way of doing things and you know it. Code outside of asm blocks has to follow the rules.

> But at least it doesn't know my pointer's value and doesn't play games in my back with it.

GCC doesn't know anything. It just emits asm blocks blindly. Clang certainly does know what happens in your asm block; it has a built-in assembler specifically for such cases.

I think what you actually want is std::launder (in C you can just call __builtin_launder directly).

DeVault: Announcing the Hare programming language

Posted May 5, 2022 15:22 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

Thanks for the tip, but there isn't any trace of it in gcc's docs... Thus I'll keep that hack for a few more decades it seems.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 17:16 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (2 responses)

> Last I checked, the direction Rust seems to be going is to swallow the bitter pill and define usize to be uintptr_t, accepting the resulting memory bloat in situations where size_t happens to be smaller.

I would say that Aria's "experiment" in nightly suggests exactly the opposite. Rust may choose to promise only that usize is the same size as the address, not the pointer - delivering APIs that reflect this more sophisticated understanding of how provenance works, and telling everybody whose black magic now doesn't work on CHERI that's too bad, try again with new knowledge.

https://doc.rust-lang.org/nightly/core/primitive.pointer....

DeVault: Announcing the Hare programming language

Posted May 3, 2022 18:25 UTC (Tue) by felix.s (guest, #104710) [Link] (1 responses)

Let’s hope that it will be fruitful. Still, the conflation of those two is not the only such ‘all the world’s x86 and ARM’ assumption that I am saddened to see in Rust. I think it’s the ‘weird’ (segmented, non-twos’-complement, narrow address space, maybe even non-octet-based, etc.) architectures that are the ones that could benefit the most from a Rust port, because they are the ones starved the most for any good tooling. I would love to see some day a Rust port to Win16 or DOS, which is to say, to x86-16 with a ‘large’ or ‘huge’ memory model. And some may disagree with me, but I think a commitment to backwards compatibility is one of the few things that C ought to be applauded for.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 18:42 UTC (Tue) by joib (subscriber, #8541) [Link]

There is some ongoing work to make Rust's view of pointers more generic. That is, to break the assumptions that usize (roughly equivalent to size_t for you C-heads) is the same size as a pointer, and that a pointer->usize->pointer roundtrip doesn't lose information. AFAICS the motivation is not to work with all those weird old and obsolete architectures, but more to work with things like CHERI (including ARM Morello, which is an implementation of CHERI), but I guess some of that work might help with stuff like segmented architectures as well.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:34 UTC (Mon) by ballombe (subscriber, #9523) [Link]

Ironically, the two most serious security bugs this year were both in Java programs. Memory safety only brings you so far.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:06 UTC (Mon) by roc (subscriber, #30627) [Link] (3 responses)

To be fair, that blog post doesn't show "the majority of 0days analyzed by google were use-after-frees". Out of 58 zero-days, only 17 were use-after-free, i.e. about 30%.

Of course, I agree with you that encouraging people to write systems software in a language that doesn't prevent UaF --- when Rust is an option --- is a mistake.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:17 UTC (Mon) by atnot (subscriber, #124910) [Link] (2 responses)

> Out of 58 zero-days, only 17 were use-after-free, i.e. about 30%

I was looking at the percentage of memory corruption vulnerabilities, which is 54%. But both are very big numbers.

> Of course, I agree with you that encouraging people to write systems software in a language that doesn't prevent UaF --- when Rust is an option --- is a mistake.

I don't think Rust's answer has to be the only option here. There are a lot of options: Rust's lifetimes, Go-style escape analysis, Zig's allocator shenanigans, Google's MiraclePtr, etc. What I think is unfortunate, however, is saying "it's hard so we shouldn't try" and releasing a language in 2022 where UAF, the #1 memory corruption issue in modern systems, is not even a consideration. That's just very disappointing.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:27 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

I agree.

I'm not saying Rust is the only answer to UaF. E.g. Go or Java GC is another fine answer for many applications. I mention Rust only because it is the most likely to be compatible with whatever scenarios people dream up.

(And FWIW I don't think Zig's allocator shenanigans are a realistic answer to preventing UaF in production.)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 11:41 UTC (Mon) by khim (subscriber, #9252) [Link]

> (And FWIW I don't think Zig's allocator shenanigans are a realistic answer to preventing UaF in production.)

I also don't think so. But at least they have honestly tried to do something (even if that something is not very good).

To say that we haven't even tried? In 2022? Gosh.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 16:32 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (13 responses)

> I didn't see any tools for preventing use-after-free bugs.

Something that *prevents* use-after-free bugs will force you to do horrible things when what you're trying to do merely looks like a use-after-free. Never forget that a free() is just a matter of saying "I am no longer interested in that memory area in this scope". Nothing more. When you start to manage your own memory pools, for example, you realize that UAF is a totally gray area, because what's considered "free" at one level still has to be tampered with at a lower level. UAF prevention is nothing but a lie, or a way to force you to do extremely complicated things at a level that already suffers from extreme sensitivity, which easily guarantees you'll mess something up when doing low-level programming.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 17:53 UTC (Mon) by atnot (subscriber, #124910) [Link]

> When you start to manage your own memory pools for example, you realize so much as UAF is a totaly gray area, because what's considered "free" at a level still has to be tampered into at a lower level

I think that's only really true in languages like C where there's no real mechanism for handling arbitrary memory safely, so the various static analyzers are forced to guess in ways that will invariably turn out to be incorrect.

Not to bring up Rust again in this thread, but it does offer a good example here: You would implement freeing for your pool by converting your value into a MaybeUninit<MyType> and dropping (freeing) it in place. At this point, the original value no longer exists as far as the language is concerned, but you still have a write-only handle to its memory, which you can safely use as you please. Then, when the time comes to use that memory again, you can write to it and use an unsafe call to assume_init() to promise the memory is now valid again. This consumes your MaybeUninit<MyType> and gives you a shiny new value of MyType in return.

By using the type system cleverly in this way, you can uphold the guarantee that all values must always be valid and that UAFs are hence impossible, without losing the ability to tamper with freed memory at a lower level. I wonder if C static analyzers could be taught a similar thing.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 0:17 UTC (Tue) by roc (subscriber, #30627) [Link] (11 responses)

> Never forget that a free() is just a matter of saying "I am no longer interested in that memory area in this scope".

"And I promise to never access that memory again through this pointer". Making that promise and failing to honour it is where the problems arise.

> When you start to manage your own memory pools for example, you realize so much as UAF is a totaly gray area

When you recycle objects, e.g. using a pool, you may encounter "high level" UaF bugs, but language-level UaF prevention continues to prevent "low level" UaF bugs, where that memory could be reused for an entirely different purpose. That is important because it means that whatever type-system guarantees the language provides continue to hold. It rules out stuff like UaF leading to "wild write"/"wild read" primitives that are almost inevitably exploitable.

Figuring out those "high level" UaF bugs is also a lot easier than general UaF bugs, because only code with access to that pool can be involved in those bugs. With arbitrary memory corruption any code in that address space can be at fault.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 11:04 UTC (Tue) by nix (subscriber, #2304) [Link]

> Figuring out those "high level" UaF bugs is also a lot easier than general UaF bugs, because only code with access to that pool can be involved in those bugs. With arbitrary memory corruption any code in that address space can be at fault.

Even then a language could in theory improve on things, by, perhaps, annotating pointers with the memory allocator or pool they point to (and requiring that to be part of the pointer's type). Now a high-level pointer can go free while an identically-valued but differently-typed pointer to the same memory that happens to be part of the implementation of the allocator knows that, from its POV, it is not free, prohibiting access through the first pointer but not the second.
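(A rough dynamic approximation of this idea in C, using a generation counter - all names here are hypothetical, and a language could enforce the same check statically in the type system rather than at runtime:)

#include <assert.h>

struct pool { unsigned generation; }; /* bumped whenever the pool recycles */

struct pool_ptr {
    struct pool *owner;  /* the allocator this pointer belongs to */
    unsigned generation; /* pool generation at allocation time */
    void *addr;
};

static void *pool_deref(struct pool_ptr p)
{
    /* a stale pointer is caught here: the pool has moved on */
    assert(p.generation == p.owner->generation);
    return p.addr;
}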

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:14 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (9 responses)

> "And I promise to never access that memory again through this pointer". Making that promise and failing to honour it is where the problems arise.

No, that's not how it works. That's only the view of the end user who relies on libs but does not write them.

You only promise not to access that memory again between the *end* of the free() and the beginning of the next malloc()/free()/realloc(). Because free() itself and malloc() will use it a lot. And even calls to free() on other objects, or a malloc() returning another object, might touch that area to cut it into pieces, merge it with another one, or give it back to the system. That's why it is important to understand how memory allocation works and not to consider free() to be something that strict, because it is not (otherwise it would be impossible to write a memory allocator, and you would have to stop and restart your program once you had used all your system memory, since you wouldn't be allowed to reuse a previously used pointer).
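(A minimal free-list pool makes the point concrete - free_obj() itself writes into the "freed" block to link it into the free list, so the memory is very much used after free at the allocator level. Names are illustrative, not from any real allocator:)

#include <stddef.h>

union slot { union slot *next; char payload[32]; };

static union slot *free_list;

void free_obj(void *p)
{
    union slot *s = p;
    s->next = free_list; /* writes into the memory the caller just "freed" */
    free_list = s;
}

void *alloc_obj(void)
{
    union slot *s = free_list;
    if (s)
        free_list = s->next; /* reads that "freed" memory again */
    return s;
}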

DeVault: Announcing the Hare programming language

Posted May 3, 2022 16:49 UTC (Tue) by Tobu (subscriber, #24111) [Link] (6 responses)

You may get an address reused by a malloc() / free() / malloc() sequence, but the pointer won't have the same provenance and won't be the same from the point of view of the abstract machine that defines the operational semantics of the language. The compiler will either know about malloc or offer a building block below it.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 8:34 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (5 responses)

But how will the compiler distinguish:

free(foo);
printf("Just released %p\n", foo);

which doesn't dereference the pointer, from:

free(foo);
printf("Just released %s\n", foo);

especially if the example is made a bit more opaque by just calling a debug(void *p) function that takes the pointer as an argument, without telling whether it just uses its value or dereferences it?

DeVault: Announcing the Hare programming language

Posted May 4, 2022 14:05 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (4 responses)

The former is fine. The latter is definitely UB (as `MALLOC_PERTURB_` would show since it does a `memset` on `free`'d memory). FWIW, I run with `MALLOC_PERTURB_` at all times.

Provenance is what makes code like the following problematic:

free(foo);
char* new_foo = malloc(1);
if (foo == new_foo) {
    // by golly, we got lucky.
    *foo = 1; // UB
}

That comparison is misleading due to provenance. It can be assumed to be false, because `foo` is not allowed to *access* anything after that `free`, even if its integer representation happens to be the same as `new_foo`'s. See the C and C++ papers by Paul McKenney about "pointer zap" for how to finally put provenance into the standard (instead of it being something that implementers have had to craft to make sense of things as the languages have evolved).

Additionally, CHERI would show the folly of this code. C allows CHERI to exist. If you want to say "I don't care about CHERI", it'd be real nice if C would allow the code to have some marker that says "this code abuses pointer equality because we assume the target platform allows us to do this" so that any CHERI-like target can just say "this is broken" up front instead of waiting for whatever the optimizer does to finally trip up something in production.

As I said elsewhere: if you want to abuse C to be assembler for your target, it'd be real nice if that could be explicit instead of the doing "I'm using C for my list of targets, damn C's portability goals" and leaving "fun" landmines for others to run over later.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:28 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (3 responses)

I remember seeing this example somewhere but to be honest it really shocks me; it's purely forgetting that there is a real machine performing the operations underneath and using registers to hold addresses. If the types are the same and the pointers are the same, it is foolish to permit the compiler to decide that they can hold different values. That's definitely a way to create massive security issues.

My feeling over the last 10 years is that the people working on C are actually trying to kill it by making it purposely unsuitable for all the use cases that made its success. Harder to use, harder to code in without creating bugs, harder to read. How many programs break when upgrading gcc on a distro? A lot. The worst ones being those that break at runtime. This just shows that even if past expectations were wrong, they were based on something that made sense in a certain context and that was based on how a computer works. Now instead you have to imagine what a compiler developer thinks about the legitimacy of your use case and how much pain he wants to inflict on you to punish you for trying to do it. Not surprising to see such a high wave of new languages emerging in these conditions!

DeVault: Announcing the Hare programming language

Posted May 4, 2022 15:53 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (2 responses)

> … that's purely forgetting that there is a real machine performing the operations underneath and using registers to set addresses.

The error here lies in thinking that C makes any claims about pointers holding *addresses*. The representation of pointer values is not defined by the standard. In this area the ease with which most compilers permit integer/pointer conversions beyond what the standard defines is an attractive nuisance; pointers are not just "integers which can be dereferenced". Pointers are abstract references to memory locations (objects or storage for objects) which have associated lifetimes and can behave differently based on whether or not they've been initialized. There may or may not be a "real machine" underneath, as C can be compiled for an abstract virtual machine (like WASM) or even interpreted. Even when targeting real hardware (e.g. CHERI) there is no guarantee that you can freely convert between pointers and integers, or treat the representation of a pointer as a plain memory address. Pointer provenance may be something only the compiler is aware of, or it may have some representation at runtime (via tagged pointers).

Really the rules for using pointers in C without triggering undefined behavior are not that different from the rules for references in Rust. C just doesn't offer any help in tracking whether the requirements have been *met*, where Rust requires the compiler to take on most of that task.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 17:23 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (1 responses)

Let's say we disagree. Even the aliasing rules are contrary to the example above: both types are the same, so modifying the data through one pointer means the data at the other one *might* have been modified and must be reloaded before being used, unless the compiler knows that the pointers are the same, in which case it can directly use the just-written data, which is the case here.

This type of crap is *exactly* what makes people abandon C. The compiler betrays the programmer based on rules stretched to their maximum extent, just because a standard left a tiny bit of room for interpretation, for the sole purpose of "optimizing code" further, at the expense of reliability, trustability, reviewability and security.

I'm sorry, but a compiler that validates an equality test between two pointers of the same type and value, and then fails to see a change when one is overwritten through the other, is a moronic compiler, regardless of the language. You simply cannot trust anything produced by that crap at this point.

DeVault: Announcing the Hare programming language

Posted May 4, 2022 18:01 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

> Even the aliasing rules are contrary to the example above: both types are the same

I never gave a type for `foo`. I don't think it really matters what type `foo` is here (beyond allowing `1` to be assigned to its pointee type). Dereferencing it is verboten after passing it to `free`, regardless of its bitwise representation.

> just because a standard left a tiny bit of room for interpretation, for the sole purpose of "optimizing code" further, at the expense of reliability, trustability, reviewability and security.

I don't know that "optimizing code" is the end goal per se. I think it's actually about making things more consistent across C's target platforms.

If you write `int i = other_int << 32;` and expect 0, one target may be happy. Another might give you `other_int` back (let's say its shift instruction encodes the shift amount in 5 bits and the register index in the other 3, because why not). Mandating either behavior requires the other target to avoid using its built-in shift instruction. What do you want C to do here? Just say "implementation-defined" and leave porters to trip over it later? If you care about both targets, you *still* need to maintain the version that has an actual, defined meaning on each.

Now, if you want to say "I'm coding for x86 and screw anything else", that's great! Let's tell the *compiler* that so it can understand your intent. But putting your thoughts into code and then telling a compiler "I've got C here" when you actually mean "I've got x86-assuming C here" is just bad communication.

I'd like to see some way to express this in the C language itself. Not a pragma. Not a compiler flag. Not a preprocessor guard using `__x86_64__`, but some actual honest-to-K&R C syntax that communicates your intent to the compiler. FWIW, I don't know that maintaining such a preamble would be worth the work outside of the kernel's `arch/` directory, but hey, some people live there. I say that because you'll either:

- have divergent code anyway, to tell the compiler about each target's assumption set; or
- have to write the standard C code that checks how much you're shifting by before doing it anyway (see the sketch below).
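
For what it's worth, that second option usually looks something like this minimal sketch (the helper name `shl32` is mine, purely illustrative):

    #include <stdint.h>

    /* Shift left by any amount, giving shifts >= the type width a
     * defined result (0) instead of undefined behavior. */
    static uint32_t shl32(uint32_t value, unsigned amount)
    {
        /* `value << 32` is undefined in standard C for a 32-bit
         * type, so check the amount before using the operator. */
        if (amount >= 32)
            return 0;
        return value << amount;
    }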

So, as said elsewhere in this thread, improving C is fine. But coding in C, breaking its rules, and then complaining that the compiler isn't playing fair is just not reconciling beliefs with reality.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 18:04 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (1 responses)

As I understand it, that's actually not the same pointer!

The pointer I got from malloc() for my 400-byte allocation, and the pointer malloc itself relies on for managing the heap after I free() it, are not actually the same pointer, even though on a typical modern CPU they're the same magic number in a CPU register.

The pointer I had comes with provenance given to it by malloc(): it's a pointer to my 400-byte allocation. If I add 390 to it, that's a pointer to the last 10 bytes of the allocation, and we're all fine. But if I add -16 to it, that's not a pointer to anything.

The pointer used by the allocator has different provenance: it's a pointer into the heap, and you can totally add -16 to it, because that's just a different pointer into the heap and as such fine.
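
A minimal C sketch of that distinction (the 400/390/-16 offsets mirror the comment above; only the user-side pointer appears, since user code can't legally form the allocator's):

    #include <stdlib.h>

    int main(void)
    {
        char *p = malloc(400);   /* provenance: this 400-byte object */
        if (!p)
            return 1;

        char *tail = p + 390;    /* still inside the allocation: fine */
        *tail = 'x';

        /* p - 16 would point outside the object p derives from, so
         * even computing it is undefined behavior here, while the
         * allocator's own heap pointer may legitimately reach that
         * same address. */

        free(p);
        return 0;
    }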

DeVault: Announcing the Hare programming language

Posted May 3, 2022 20:02 UTC (Tue) by ssokolow (guest, #94568) [Link]

Exactly. The abstract machine that the compiler's optimizers (and tools like the LLVM sanitizers and Miri) operate on tags each pointer with its parent allocation; when you free(), that allocation is revoked, making dereferencing any pointer derived from it an invalid operation.
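
As a hedged illustration (assuming a compiler with AddressSanitizer support, such as GCC or Clang), those tools catch exactly this at runtime:

    /* Build with: cc -fsanitize=address -g uaf.c
     * ASan tracks each allocation much like the abstract machine's
     * tags, and reports the access below as a heap-use-after-free. */
    #include <stdlib.h>

    int main(void)
    {
        int *p = malloc(sizeof *p);
        free(p);    /* p's allocation is revoked here */
        return *p;  /* invalid: flagged by ASan at runtime */
    }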

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:20 UTC (Mon) by mpr22 (subscriber, #60784) [Link] (3 responses)

The programmer I am most concerned about trusting is not Present Me. Ultimately, there is no way to avoid trusting Present Me when I am programming: even if the language stops me committing memory-safety errors, it generally can't stop me committing all the other kinds of error that can make my application accidentally give the User's security details to the minions of Mob, God, and Cop without the User's permission (though some languages and runtime libraries are better than others at inconveniencing me when I try to make those errors).

The programmers I am concerned about trusting are the different people called Future Me (will I be able to extend this code safely?), Past Me (did I write my old code safely in the first place?), and Not Me (did they write their code safely, and will they be able to extend my code safely?)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:32 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

Those are all good points, but it's not about *completely avoiding* the need to trust programmers; that's a straw man. It's about *minimizing* the need to trust programmers wherever we have proven techniques to reduce such trust that aren't unduly burdensome. And Rust and Swift and other languages with good type systems have lots of ways to do that beyond just memory safety!

DeVault: Announcing the Hare programming language

Posted May 2, 2022 11:20 UTC (Mon) by mpr22 (subscriber, #60784) [Link]

Oh, absolutely; I think I've phrased myself poorly.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 15:08 UTC (Mon) by farnz (subscriber, #17727) [Link]

And specifically what I want, as a programmer with similar views to yours, is for Present Me to be able to write my code such that when Future Me or Not Me (lacking the context Present Me is immersed in right now) does something foolish with it, they get protests from the language implementation (compiler or interpreter) that tell them what context they need to acquire in order to do their job without the protests.

This is just because programming is hard, and there's a lot of context behind every decision you make while programming. If you lack that context, you'll make mistakes, and one of the things I now value in programming languages is telling you that you're missing context.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 8:23 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (1 responses)

> OK, I trust DJB, but him only.

You really shouldn't - https://www.qualys.com/2020/05/19/cve-2005-1513/remote-co...

DeVault: Announcing the Hare programming language

Posted May 2, 2022 8:54 UTC (Mon) by roc (subscriber, #30627) [Link]

Oh dear, thanks for clearing that up.

