
Insulating layer?

Posted Oct 12, 2024 5:44 UTC (Sat) by NYKevin (subscriber, #129325)
In reply to: Insulating layer? by smurf
Parent article: On Rust in enterprise kernels

I think this depends on how much of the problem you want to solve, how badly you want to solve it, and the extent to which you are empowered to point at language constructs and say "don't use that if you want to be safe." And I know that the latter sounds an awful lot like moving the goalposts, but if you have static linting for it (that actually works), then it's not materially different from Rust's unsafe blocks (aside from being uglier and harder for a human to identify).

For example, a large class of memory safety issues will magically vanish if you require that all buffers are statically allocated with some fixed size, and malloc et al. are not used. Nobody actually wants to write code like that, so you pretty much only see it in highly failure-sensitive domains like aerospace. But it is technically a "safe subset" of C.

But you're going to object that that is obviously not what the C++ folks are talking about. Fine. Let's talk about borrow checking. The modern Rust borrow checker did not emerge fully formed directly from Hoare's brain. It went through a lengthy period of iteration and refinement. It is possible to reinvent less powerful interpretations of borrow checking based on those earlier iterations. For example, you could just have lexically-scoped borrows (the borrow is not allowed to escape a given lexical scope) plus a syntax rule that prohibits moving a borrowed object within the borrow's lexical scope. This is not fully general, and fails to express many things that Rust can handle easily, but when you combine it with C++'s existing smart pointers, it is probably enough for many practical applications (that don't need ultra-low overhead).

Perhaps that's too weak. We probably do want the ability to put std::string_view inside of a data structure, and this would ban that as an escaping borrow. There may be ways to strengthen it that are still workable within the context of C++. But we could instead argue that std::string_view is the Wrong Abstraction for safe C++, and they should instead be using some kind of refcounted string view (i.e. each view owns a strong reference to the refcounted string), or perhaps a copy-on-delete string view of some kind (i.e. a string view that is informed when the string is about to be destructed, and allowed to copy the relevant portion of the string into a new heap allocation).

Or maybe they'll do something really crazy, such as deprecating std::string altogether and replacing it with a refcounted copy-on-write rope structure. I don't seriously think that will happen, but at least it would (probably) solve the view issue (and it would be very amusing watching people get upset about it on the internet).


Insulating layer?

Posted Oct 12, 2024 6:47 UTC (Sat) by mb (subscriber, #50428) [Link] (38 responses)

If you put refcounting everywhere, you are making the language a lot slower and it would still *not* be fully safe. Think about threads.

There is no such thing as a "safe subset" of C++ and there never will be. If you create a "safe subset", then it will be a completely different language that people will have to learn. Why bother? You would still end up with a language containing a massive amount of obsolete backward-compatibility stuff next to your new "safe subset" language.

Just switch to a safe language (for new developments). It's much simpler. And it's much safer.

Insulating layer?

Posted Oct 12, 2024 16:45 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

I'm not saying I think this is a good idea, and I'm certainly not trying to argue that it will be a serious competitor to Rust in terms of overall safety. I'm saying that the very same political factor you identify (massive legacy codebases) will also create great pressure for C++ to provide a "safe profile" for use in contexts where Compliance™ is mandatory. It is reasonable to assume that the smart people who are working on this will end up producing something that has at least some improvements in safety over the status quo (which is a very low bar for C++). Compare and contrast MISRA C, which has the advantage of starting from a much simpler language, but the disadvantage of not having smart pointers for simple "object X owns object Y" cases.

Insulating layer?

Posted Oct 14, 2024 9:31 UTC (Mon) by paulj (subscriber, #341) [Link] (36 responses)

Isn't Rust basically a ref-counting language, for long-lived programmes with non-trivial data relationships?

Sure, for a compute-and-complete programme, processing a finite input to give some result, Rust programmes can allocate from the stack, and/or have acyclically-linked heap objects with tight lifetimes. But... for anything long-lived with complex relationships between different data objects, isn't the pattern in Rust to use Rc/Arc? I.e., refcounting?

Insulating layer?

Posted Oct 14, 2024 10:26 UTC (Mon) by farnz (subscriber, #17727) [Link] (14 responses)

Your long-lived data is often refcounted (especially in async Rust) because that's simpler to get right than rearranging so that your data doesn't have complicated lifetime relationships. But your short-lived accesses to that data usually borrow it, rather than refcounting it, because their relationship to the bit of data they need is simple.

And that's something that's not easy to express in Python - the idea that I borrow this data in a simple fashion, even though the structure as a whole is complicated and refcounted.
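A minimal sketch of the pattern described above (the `Config` type and `greeting` function are made up for illustration): the long-lived state is refcounted once, while short-lived accesses just borrow it, so the count is never touched on the hot path.

```rust
use std::sync::Arc;

// Long-lived, shared-ownership state: refcounted once, at creation.
struct Config {
    name: String,
}

// Short-lived access borrows &Config; no refcount traffic here.
fn greeting(cfg: &Config) -> String {
    format!("hello, {}", cfg.name)
}

fn main() {
    let shared = Arc::new(Config { name: "world".to_string() });
    // Deref coercion turns &Arc<Config> into &Config for the call;
    // the borrow checker guarantees the borrow cannot outlive `shared`.
    let msg = greeting(&shared);
    assert_eq!(msg, "hello, world");
    // Only cloning the Arc itself touches the reference count.
    let _second_owner = Arc::clone(&shared);
}
```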

Insulating layer?

Posted Oct 14, 2024 11:00 UTC (Mon) by paulj (subscriber, #341) [Link] (13 responses)

I have tried playing with Rust. There's a lot to like. There's some stuff I dislike - the macro system especially makes me go "Wut?". Zig's approach of allowing the /same/ language to be used to compute at compile time looks to be much more ergonomic (though, I haven't tried Zig yet).

In the little I've played with Rust, the lifetime annotations make sense on simpler stuff. But it seems to get very baroque on more complex stuff. And I haven't gotten over a certain "barrier" on how to design the not-simpler kinds of inter-linked relationships that are unavoidable in real-world stuff. E.g. in a network protocol, you may have objects and state for "peers", "connections", "TLS contexts", "streams", "message entities" (a plurality of these) with m:n relationships between them. A peer may have multiple connections, messages may have a lifetime related to a /set/ of peers and not just the peer it was received from, etc. You have to use Rc for at least some of these - so now you're dealing with /both/ refcounting AND lifetime annotations for the refcounted objects (in other languages you would embed the RC information in the refcounted object, but in Rust the Rc object holds the refcounted object), and there are rules for getting data into and out of Rcs and cells and what not. And I end up with long screeds of compiler error messages that give me flashbacks to C++ - maybe Rust's are better, but I've been traumatised. The complexity of implementing simple linked lists in Rust, without unsafe, is a bit of a red flag to me.

I think the local stack variable lifetimes possibly could be dealt with in simpler ways, perhaps with more limited annotations, or perhaps none at all but just compiler analysis. It might be more limited than Rust, but be enough to deal with typical needs for automatic data in long-lived programmes.

What I'd like, for long-lived programmes with non-simple relationships in their data model (e.g., many network protocols) is a language that did ref-counting /really/ well - ergonomically, with low programming complexity, and minimal runtime overhead. Because refcounting is how all non-trivial, long-lived programmes that cannot tolerate scanning GC end up managing state.

Maybe I'm just too stupid for Rust. Like I'm too stupid for Haskell. Maybe refcounting is the right answer for my lack of intelligence - but then it seems many other programmers are like me. And it seems maybe even the best Rust programmers too, given how they always end up reaching for (A)RC too.

Insulating layer?

Posted Oct 14, 2024 11:28 UTC (Mon) by mb (subscriber, #50428) [Link]

>the macro system especially makes me go "Wut?", Zig's approach of allowing the /same/ language
>to be used to compute at compile time looks to be much more ergonomic (though, havn't tried Zig yet).

Rust can do that, too. It's called a proc macro.

Challenges learning how to model data structures in Rust

Posted Oct 14, 2024 12:26 UTC (Mon) by farnz (subscriber, #17727) [Link] (10 responses)

One of the challenges I found in moving from C++ to Rust is that in C++, you model ownership in your head; you sort-of know whether a given pointer represents owning something, shared ownership, or simply referring to something someone else owns. Rust doesn't let you play fast-and-loose with ownership like this; the borrow checker means that you have to be extremely clear about who owns what, and when, using T for "this T is embedded here", Box<T> for "this T is owned, but is stored out of line", and Arc/Rc<T> for "ownership of this T is shared".

You can also write a simple linked list in 40 lines of Rust without any unsafe at all - the use of unsafe is because you want a complicated linked list that meets more interesting properties than the simple linked list implementation does.
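A sketch of such a simple list (this is the standard safe singly-linked stack, not any particular poster's code): each node exclusively owns its successor through a Box, so neither unsafe nor refcounting is needed.

```rust
// A minimal singly-linked stack in safe Rust: each node exclusively
// owns the next one via Box, so no unsafe and no refcounting needed.
struct List<T> {
    head: Option<Box<Node<T>>>,
}

struct Node<T> {
    elem: T,
    next: Option<Box<Node<T>>>,
}

impl<T> List<T> {
    fn new() -> Self {
        List { head: None }
    }

    fn push(&mut self, elem: T) {
        // take() moves the old head out, leaving None behind,
        // so ownership of the rest of the list transfers cleanly.
        let node = Box::new(Node { elem, next: self.head.take() });
        self.head = Some(node);
    }

    fn pop(&mut self) -> Option<T> {
        self.head.take().map(|node| {
            self.head = node.next;
            node.elem
        })
    }
}

fn main() {
    let mut list = List::new();
    list.push(1);
    list.push(2);
    assert_eq!(list.pop(), Some(2));
    assert_eq!(list.pop(), Some(1));
    assert_eq!(list.pop(), None);
}
```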

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 9:09 UTC (Thu) by paulj (subscriber, #341) [Link] (9 responses)

I was referring more to "simple" doubly-linked lists on the complexity side.

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 9:39 UTC (Thu) by farnz (subscriber, #17727) [Link] (8 responses)

Making a doubly-linked list in safe Rust isn't that much more complicated, although it does bring in the need for Rc, since a doubly-linked list has complex ownership (the previous pointer cannot be owning).
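A minimal sketch of that ownership split (the `Node` type here is made up for illustration): the forward link owns its target through a strong Rc, while the back link is a non-owning Weak, since a strong back-pointer would create a refcount cycle and leak.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Doubly-linked node: `next` owns (strong Rc); `prev` must not own,
// or we'd create a refcount cycle and leak, so it's a Weak pointer.
struct Node {
    value: i32,
    next: Option<Rc<RefCell<Node>>>,
    prev: Option<Weak<RefCell<Node>>>,
}

fn main() {
    let first = Rc::new(RefCell::new(Node { value: 1, next: None, prev: None }));
    let second = Rc::new(RefCell::new(Node { value: 2, next: None, prev: None }));

    // Link them up: the forward edge is strong, the back edge is weak.
    first.borrow_mut().next = Some(Rc::clone(&second));
    second.borrow_mut().prev = Some(Rc::downgrade(&first));

    // Walk backwards via the weak pointer (upgrade fails if the
    // target has already been dropped).
    let back = second.borrow().prev.as_ref().unwrap().upgrade().unwrap();
    assert_eq!(back.borrow().value, 1);
}
```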

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 10:10 UTC (Thu) by paulj (subscriber, #341) [Link] (2 responses)

Yes, of course you can do this with Rc. But Rc is a "fuck it, we'll do it live!" escape hatch to runtime, out from the compile-time lifetime typing system.

To stay within the typed lifetime-scope of Rust, without unsafe, is somewhere between extremely complex and impossible, isn't it? (I thought impossible, but then I thought I saw a tutorial claiming it was possible once - I can't find it again right now.)

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 14:23 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

The difficult bit is the ownership model if you don't use refcounting; you want each node to be individually mutable, but you also want each node to be shared by both the previous and the next nodes. That's inevitably difficult in a "shared, or mutable, but not both" model as Rust imposes, unless you use refcounting.

Challenges learning how to model data structures in Rust

Posted Oct 18, 2024 14:38 UTC (Fri) by taladar (subscriber, #68407) [Link]

Something would probably be possible with interior mutability, but it would likely need locking or some other internal mechanism to prevent mutation through both access paths at the same time.

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 11:48 UTC (Thu) by excors (subscriber, #95769) [Link] (4 responses)

That looks like a poor example - it doesn't implement any operations beyond "new" and "append", so it never even uses its "prev" pointers. The much more thorough exploration at https://rust-unofficial.github.io/too-many-lists/ calls a similar Rc<RefCell<...>> design "A Bad Safe Deque" and notes "It was a nightmare to implement, leaks implementation details, and doesn't support several fundamental operations". Eventually it gets to "A Production Unsafe Deque", which looks pretty complicated. Singly-linked lists are much easier (though still not trivial), the double-linking is the big problem.

(It also explains:

> Linked lists are terrible data structures. Now of course there's several great use cases for a linked list: [...] But all of these cases are _super rare_ for anyone writing a Rust program. 99% of the time you should just use a Vec (array stack), and 99% of the other 1% of the time you should be using a VecDeque (array deque). These are blatantly superior data structures for most workloads due to less frequent allocation, lower memory overhead, true random access, and cache locality.
>
> Linked lists are as _niche_ and _vague_ of a data structure as a trie. Few would balk at me claiming a trie is a niche structure that your average programmer could happily never learn in an entire productive career -- and yet linked lists have some bizarre celebrity status.

Generally, you probably shouldn't judge a language on the basis of how easily it can implement linked lists. On the other hand, one of the listed "great use cases" is "You're writing a kernel/embedded thing and want to use an intrusive list", which is probably much more common amongst LWN readers than your average programmer; it's fair to have different priorities when judging languages.

But I think it's still important to consider whether your design uses (doubly-)linked lists because they really are the best data structure for your needs, or whether it's just because linked lists are really easy to implement in C (and an equivalent of Vec is hard) so they're your default choice for any kind of sequential container, and you're trying to apply the same mindset to Rust where they're really hard (but Vec is easy) and they should never be the default choice.)

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 12:22 UTC (Thu) by khim (subscriber, #9252) [Link]

> But I think it's still important to consider whether your design uses (doubly-)linked lists because they really are the best data structure for your needs, or whether it's just because linked lists are really easy to implement in C (and an equivalent of Vec is hard) so they're your default choice for any kind of sequential container, and you're trying to apply the same mindset to Rust where they're really hard (but Vec is easy) and they should never be the default choice.)

The tragedy of lists (and the reason Rust had to wait till 2015 to be viable) lies in the fact that while doubly-linked lists are a very bad data structure today, that wasn't the case half a century ago, when Computer Science was established and the first courses teaching people how to program were developed.

Back then computers were big, but their memories (and thus the programs written for them) were small, while RAM was fast and CPUs were slow.

Linked lists really shone in such an environment!

Today computers are tiny, but have memory measured in gigabytes, programs are crazy big, and CPUs are hundreds of times faster than RAM.

Today linked lists are almost never the right data structure for anything, and even when they actually are better than something like Vec, the difference is not large.

But people are creatures of habit: having learned to deal with linked lists once, they tend never to go back, and keep using them as long as they can.

Worse yet: while it's true, today, that Vec is the way to go and linked lists are [mostly] useless… there wasn't any point in time when linked lists stopped being useful and Vec started shining. The transition was slow and gradual: first Vec was awful and linked lists were great, then Vec became a tiny bit less inefficient and thus better for some tasks, while linked lists became less efficient and thus worse for those same tasks, etc.

Thus even today, long after linked lists and all the tricks associated with them stopped being important in 99% of cases, people try to apply what they know to Rust.

And yes, that inevitably sends us back to the one-funeral-at-a-time story.

Which is sad, because it means that a new generation will have to relearn everything again: while linked lists are no longer useful, many other lessons that the "old beards" learned are still useful… but if C/C++ is replaced with Rust (or any other safe language) one funeral at a time, then all those lessons will be forgotten… and thus repeated.

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 12:42 UTC (Thu) by smurf (subscriber, #17840) [Link]

Well, the kernel contains a veritable heap of linked lists, so we're stuck with them.

The key isn't to implement a doubly-linked-list in Safe Rust, which is not exactly a workable idea for obvious lifetime reasons, but to hide them behind an appropriate abstraction.

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 14:00 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

I agree linked-lists are a terrible data-structure. But that isn't really the point here.

Doubly linked lists are just a well-understood form of data-structure with cycles. Cycles in data-representations in non-trivial programmes are common, even if those programmes (rightly) eschew linked-lists. The linked-list question speaks to the ability to do common kinds of data modelling.

Focusing on the linked-list part misses the point.

Challenges learning how to model data structures in Rust

Posted Oct 17, 2024 15:57 UTC (Thu) by mb (subscriber, #50428) [Link]

> Focusing on the linked-list part misses the point.

I think that insisting everything must be implemented in safe Rust is missing the point of Rust.
Yes, many things *can* be implemented in safe Rust, but that doesn't mean they should be.

For implementing such basic principles as containers and lists, it's totally fine to use unsafe code.
If you can drop reference counting by implementing it in unsafe code and manually checking the rules, then do it. This is actively encouraged by the Rust community. The only thing that you must do is keep interfaces sound and do all of your internal manual safety checks.

Therefore, I don't really understand why reference cycles or linked lists are discussed here. They are not a problem. Use a safe construct such as Rc or use unsafe. It's fine.

Insulating layer?

Posted Oct 20, 2024 11:15 UTC (Sun) by ssokolow (guest, #94568) [Link]

> I have tried playing with Rust. There's a lot to like. There's some stuff I dislike - the macro system especially makes me go "Wut?", Zig's approach of allowing the /same/ language to be used to compute at compile time looks to be much more ergonomic (though, haven't tried Zig yet).
The declarative macro syntax is an interesting beast because it's somehow so effective at disguising that a macro is just a match statement to simulate function overloading with some Jinja/Liquid/Twig/Moustache/Handlebars/etc.-esque templating inside, disguised by a more Bourne-like choice of sigil.

...maybe because it's really the only place in the language where you have to code in a Haskell-like functional-recursive style instead of an imperative style if you want to do the fancier tricks?

> In the little I've played with Rust, the lifetime annotations make sense on simpler stuff. But it seems to get very baroque on more complex stuff.
True. On the plus side, they are still actively working to identify places where they can remove the need for weird tricks. See, for example, the Precise capturing use<..> syntax section in this week's announcement of Rust 1.82.0.
> Maybe I'm just too stupid for Rust. Like I'm too stupid for Haskell. Maybe refcounting is the right answer for my lack of intelligence - but then it seems many other programmers are like me. And it seems maybe even the best Rust programmers too, given how they always end up reaching for (A)RC too.
One of Rust's greatest weaknesses has always been that its "make costs explicit" philosophy has created a language that encourages premature optimization, and people in the community concerned with writing learning materials spend an inordinate amount of time encouraging people to write quick and dirty code (eg. String instead of &str, .clone() everywhere, etc.) first, learn the more advanced stuff later, and always profile before optimizing.

(Of course, to be fair, the same thing occurs more subtly at lower levels with people misjudging the cost of a given machine language opcode and how superscalar architecture will affect costs. It just doesn't result in compiler errors.)

There's no shame in reaching for a reference-counted pointer. (I'm no C++ programmer, but I've never seen any signs of this much agonizing over std::shared_ptr.) However, for better or for worse, having people averse to it does seem to contribute to the trend that Rust implementations of things have low and stable memory usage.

Insulating layer?

Posted Oct 14, 2024 10:31 UTC (Mon) by excors (subscriber, #95769) [Link] (18 responses)

A little bit, but that's not the same as "refcounting everywhere". You might store long-lived complex-ownership state in an Rc/Arc, but I think you typically wouldn't use the Rc much within your code - you'd (automatically) deref it into an uncounted &T reference before passing it down the call stack and operating on it, with the borrow checker ensuring the &T doesn't outlive the Rc<T> it originated from. That means you're rarely touching the reference count (avoiding the performance problems it causes in languages like Python), but you still get the lifetime guarantees (unlike C++ when you mix shared_ptr with basic references and risk use-after-free).
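A sketch of what "rarely touching the reference count" looks like in practice (the `sum` helper is made up for illustration): the Rc is dereferenced to a plain borrow at the call site, and `Rc::strong_count` shows the count is only bumped by an explicit clone.

```rust
use std::rc::Rc;

// Operates on a plain borrowed slice; knows nothing about Rc.
fn sum(v: &[i32]) -> i32 {
    v.iter().sum()
}

fn main() {
    let data = Rc::new(vec![1, 2, 3]);
    // Deref coercion takes &Rc<Vec<i32>> down to &[i32]: the count
    // is untouched, and the borrow cannot outlive `data`.
    let total = sum(&data);
    assert_eq!(total, 6);
    assert_eq!(Rc::strong_count(&data), 1); // still exactly one owner

    // Only an explicit clone of the Rc itself bumps the count.
    let other = Rc::clone(&data);
    assert_eq!(Rc::strong_count(&other), 2);
}
```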

Insulating layer?

Posted Oct 14, 2024 11:03 UTC (Mon) by paulj (subscriber, #341) [Link] (17 responses)

That's a nice comment on how RC and lifetimes can complement each other. Thanks.

I'm still left a bit uneasy about the complexity on all the rules on getting data in and out of RCs and cells and what not, and implementing non-trivial relationships in data, in Rust. It's not a simple language.

Insulating layer?

Posted Oct 15, 2024 8:20 UTC (Tue) by taladar (subscriber, #68407) [Link] (16 responses)

You are confusing the complexity of the problem with the complexity of the language.

The problem of ownership is complex, Rust helps you deal with it by making it explicit, other languages just let you make mistakes without providing any help with it at all.

Insulating layer?

Posted Oct 15, 2024 9:15 UTC (Tue) by paulj (subscriber, #341) [Link] (15 responses)

The way I code in C is that I reduce the complexity of the ownership problem by constraining myself to what I am allowed to do:

- references may be "owning" or "shared"
- An "owning" reference may only ever be stored to automatic, local variables.
- A "shared" reference is ref-counted, and can be stored to multiple variables (using whatever refcounting machinery/helpers I've put in place)

Rust automating these rules and enforcing them, AND allowing me to safely pass a "owning" reference to other functions, taking care of passing the ownership in the process, is nice. In my way of doing things, I will often have to use local helper functions that take "owning" references to do stuff with them. And I have to myself ensure I don't store those pointers somewhere. But, that's what I have to do anyway. Rust enforcing this, unless code is marked 'unsafe' seems nice.

But... for me... Rust's lifetime type system allows too much. I can see it would let me do more than what my own heavily constrained system allows, and with guarantees, but it also drags in a lot of complexity. Complexity I already made a conscious decision to avoid.

A language with a much more limited set of scoping rules for pointers - more like the system I've already been using for years in unsafe languages - could be simpler and just as effective, possibly.
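For what it's worth, the two-kind scheme described above maps fairly directly onto safe Rust, where the compiler enforces it (the `consume` function and `Registry` type are made up for illustration): an "owning" reference is plain ownership that is moved rather than stored, and a "shared" reference is an Rc.

```rust
use std::rc::Rc;

// "Owning" reference kept in a local: in Rust this is plain ownership,
// moved into the helper rather than stored anywhere long-lived.
fn consume(owned: String) -> usize {
    owned.len()
}

// "Shared" reference: refcounted, storable in multiple places.
struct Registry {
    entries: Vec<Rc<String>>,
}

fn main() {
    // Owning: the value moves into `consume`; the compiler rejects
    // any later use of `s`, enforcing the "local only" rule.
    let s = String::from("owned");
    assert_eq!(consume(s), 5);

    // Shared: multiple stored handles to the same allocation.
    let shared = Rc::new(String::from("shared"));
    let reg = Registry { entries: vec![Rc::clone(&shared), Rc::clone(&shared)] };
    assert_eq!(Rc::strong_count(&shared), 3);
    assert_eq!(reg.entries.len(), 2);
}
```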

Insulating layer?

Posted Oct 16, 2024 7:44 UTC (Wed) by taladar (subscriber, #68407) [Link] (14 responses)

As long as you are the only one working on your code and you don't make mistakes you might be able to rely on your assumptions but I think the strength of Rust (and similar strict languages) is that they can guarantee that as long as there aren't any compiler bugs all programmers working on your project, including you on a bad day where you do make mistakes, will follow the rules.

Insulating layer?

Posted Oct 17, 2024 9:19 UTC (Thu) by paulj (subscriber, #341) [Link] (13 responses)

Yes, I get that. Rust can enforce rules, and that's great.

My point is more that my own system of ad-hoc, non-compiler-enforced rules for managing the lifetimes of objects and their references is simpler than Rust's. In my system I basically try to have just 2 scopes for references - the very local, and the refcounted. The latter are "safe" from use-after-free issues (I don't usually use refcounting machinery that deals with concurrency, but you could).

Rust goes further than that, and provides for arbitrary scopes of lifetimes for references to different objects. Great, lot more powerful. But... it also becomes harder to reason about when you start to make use of that power, it seems to me. I can't keep track of more than a couple of scopes - that's why my ad-hoc safe-guards in lesser languages are so simple, and I think it's part of why I struggle to get my head around Rust's error messages when I try write more interlinked, complex data-representations, in my own toy test/learn coding attempts. The fact that a lot of code ends up going back to refcounted containers suggests I might not be alone, not sure.

What I'm saying is that simple programmers like me might be better off with a "safe" (i.e. enforcing the rules) language with a more constrained, simpler, object lifetime management philosophy.

Insulating layer?

Posted Oct 17, 2024 10:36 UTC (Thu) by Wol (subscriber, #4433) [Link] (12 responses)

This sounds a bit like alloca?

I wonder. Could you create a "head" object whose lifetime is the same as the function that created it, and then create a whole bunch of "body" objects that share the same lifetime?

These objects can now reference each other with much fewer restrictions, because the compiler knows they share the same lifetime. So there's no problem with the head owning the next link down the line etc etc, and each link having a reference to the link above, because they'll all pass out of scope together?

Of course that means you can't access that data structure outside the scope of the function that created it, but that would be enough for a lot of purposes?
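One common safe-Rust realization of this idea is an index-based arena (a sketch; the `Arena` and `Node` types are made up for illustration): all "body" objects live inside one "head" container, cross-references are indices rather than pointers, and the whole structure is dropped together.

```rust
// A tiny index-based arena: all nodes live inside one Vec (the "head"
// object), cross-references are indices, and everything is freed
// together when the arena goes out of scope.
struct Arena {
    nodes: Vec<Node>,
}

struct Node {
    value: i32,
    next: Option<usize>, // index into the arena, not a pointer
    prev: Option<usize>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Append a node, doubly linked to its predecessor by index.
    fn push(&mut self, value: i32) -> usize {
        let id = self.nodes.len();
        let prev = id.checked_sub(1);
        if let Some(p) = prev {
            self.nodes[p].next = Some(id);
        }
        self.nodes.push(Node { value, next: None, prev });
        id
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.push(1);
    let b = arena.push(2);
    // "Cycles" via indices are fine: no long-lived borrows are held.
    assert_eq!(arena.nodes[b].prev, Some(a));
    assert_eq!(arena.nodes[a].next, Some(b));
} // the whole structure drops here, all at once
```

The trade-off matches the one noted above: an index can dangle logically (point at a removed slot), but it can never cause memory unsafety, because every access goes through the arena's bounds-checked Vec.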

Cheers,
Wol

Insulating layer?

Posted Oct 17, 2024 11:45 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

> that would be enough for a lot of purposes?

That would also be rather pointless for a lot of other purposes (among them: freeing a bunch of body objects, oops you suddenly have scoping problems anyway), so what's the point?

Insulating layer?

Posted Oct 17, 2024 12:25 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

> so what's the point?

It feels pretty simple to me - cf. paulj's comment that full-blown Rust just seems to blow his mind.

Does Rust actually have a memory allocator, or does it have a "new"? That was the point of my mention of alloca - as the caller goes out of scope, any memory allocated by alloca goes out of scope as well. You don't free alloca memory.

Could the Rust compiler handle "this memory is now out of scope, just forget about it"? If standard Rust rules say an object cannot own an object with a lifetime longer than itself, just dropping the object can't do any harm, can it?

You're effectively setting up a heap, with its lifetime controlled by "head". So you just drop the entire heap, because every object in the heap has had the heap lifetime constraints imposed on it.

Cheers,
Wol

Insulating layer?

Posted Oct 20, 2024 11:11 UTC (Sun) by ssokolow (guest, #94568) [Link]

> Does Rust actually have a memory allocator, or does it have a "new"?

That's the concern of the data type, not the language.

If you want a new, you give the struct at least one private member so it can't be initialized directly, and write a public associated function (i.e. a public class method) which constructs and returns an instance (named new by convention only). But that doesn't automatically mean heap allocation. It's purely a matter of whether there are "correct by construction" invariants that need to be enforced.

If you want to heap-allocate, you either use a type which does it internally like Box<T> (std::unique_ptr in C++) or Vec<T> or you do as they do internally and use the unsafe wrappers around malloc/calloc/realloc/free in std::alloc.
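A sketch of the "correct by construction" pattern just described (the `Celsius` type is invented for illustration): the field is private, so the only way to get an instance is through `new`, which enforces the invariant, and no heap allocation is involved.

```rust
mod temperature {
    // The private field means callers outside this module can't
    // construct a Celsius directly...
    pub struct Celsius(f64);

    impl Celsius {
        // ...so they must go through `new`, which enforces the
        // "correct by construction" invariant. `new` is just a
        // naming convention, not a language keyword, and this
        // value lives on the stack: no heap allocation happens.
        pub fn new(degrees: f64) -> Option<Celsius> {
            if degrees >= -273.15 {
                Some(Celsius(degrees))
            } else {
                None
            }
        }

        pub fn degrees(&self) -> f64 {
            self.0
        }
    }
}

fn main() {
    let ok = temperature::Celsius::new(20.0);
    assert!(ok.is_some());
    assert_eq!(ok.unwrap().degrees(), 20.0);
    // Below absolute zero: the invariant rejects it.
    assert!(temperature::Celsius::new(-300.0).is_none());
}
```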

> Could the Rust compiler handle "this memory is now out of scope, just forget about it"?

It does. If you want a destructor, you impl Drop and, if you don't, a stack allocation will just be forgotten until the whole frame is popped and a heap allocation will go away when the owning stack object's Drop is run and frees the memory.

Rust's design does a good job of making its memory management appear more sophisticated than it really is. It's really just stack allocation, access control, destructors but no actual language-level support for constructors, and using RAII design patterns to implement everything else in library code on top of manual calls to malloc/realloc/free.

The borrow checker plays no role in machine code generation beyond rejecting invalid programs, which is why things like mrustc can exist.

Insulating layer?

Posted Oct 17, 2024 14:07 UTC (Thu) by paulj (subscriber, #341) [Link] (8 responses)

Samba has "talloc" which supports having heap objects allocated as a hierarchy. Freeing an object frees all its children in the hierarchy.

You could take that kind of tack and allocate stuff on the stack with alloca, sure.

Neither hierarchical allocation nor stack allocation addresses the issue of tracking the validity of references, though. As you're implying, that requires something else - be it an ad-hoc system of rules that enforces guarantees, assuming the programmers can hold themselves to applying those rules (and... they will fail to, every now and then), or rules in the language, enforced by the compiler.

The question for me is: What is the most programmer friendly system of rules to guarantee safety?

Rust is one example of that, with a very general lifetime typing system (the most general possible?). That generality brings complexity. Yet many Rust programmes have to step out of that compile-time lifetime-type system and use runtime ref-counting. So the question is: if the lesson from Rust is that many, many programmes simply go to runtime ref-counting for much of their scoping, would it be possible to have a less general, simpler lifetime-typing system?

E.g., perhaps it isn't necessary at all to even need to represent lifetime types. Perhaps it would be sufficient to have 2 kinds of references - local and refcounted, with well-defined and safe conversion semantics enforced by the language.

Insulating layer?

Posted Oct 17, 2024 16:00 UTC (Thu) by khim (subscriber, #9252) [Link]

> E.g., perhaps it isn't necessary at all to even need to represent lifetime types. Perhaps it would be sufficient to have 2 kinds of references - local and refcounted, with well-defined and safe conversion semantics enforced by the language.

That's where Rust have started, more-or-less. The language which used this approach, Cyclone, is listed among many languages that “influenced” Rust.

Only it's not practical: language that you are getting as a result doesn't resemble C at all! Or, more precisely: language would look like a C, but it's APIs would be entirely different. Even most trivial functions like strchr are designed around ability to pass reference to local variable somewhere. It's not even possible to pass buffer that you would fill with some values to another function!

I'm not entirely sure all the complexity that Rust has (with covariance and contravariance, reborrows and so on) is needed… technically – but it's 100% wanted. Till Rust 2018 introduced non-lexical lifetimes, Rust was famous not for its safety, but for its needless strictness. I think almost every article from that era included a mandatory part telling you tales about how your “fight with the borrow checker” was not in vain and how it enables safety, etc.

After the introduction of NLL, reborrows, HRTBs and so on, people stopped complaining about that… and started complaining about the complexity of the whole thing… but it's highly unlikely that people would accept anything less flexible: they want to write code, not fight with a borrow checker!

> So then the question is: if the lesson from Rust is that many programs simply go to runtime ref-counting for much of their scoping, would it be possible to have a less general, simpler lifetime-typing system?

Sure. Swift does that.

It has a quite significant performance penalty, but is still much faster than many other popular languages.

Insulating layer?

Posted Oct 17, 2024 17:10 UTC (Thu) by Wol (subscriber, #4433) [Link] (5 responses)

> Neither hierarchical allocation nor stack allocation addresses the issue of tracking the validity of references as such, though. As you're implying, that requires something else: either an ad-hoc system of rules that enforce guarantees, assuming the programmers can hold themselves to applying those rules (and... they will fail to every now and then), or rules in the language, enforced by the compiler.

But does it *have* to?

If you have a procedure-level heap, or some other data structure with a guaranteed lifetime, you apply exactly the same borrow-rules but on the heap level. If all references to the heap have the same or shorter lifetime than the heap, when the heap dies so will the references.

So you create a heap for your linked list as early as possible, and you can freely create references WITHIN that heap as much as you like. They'll die with the heap. It's not meant to be an all-singing-all-dancing structure, it's meant to have a couple of simple rules that make it easy to create moderately complex data structures. You probably wouldn't be able to store pointers in it that pointed outside of it, for example. You would have to be careful storing references to items with shorter lifetimes. But if you want to store something with loads of complex *internal* references, it would be fine precisely because of the "everything dies together" rule.

So you'd use the existing rust checking setup, just that it's a lot easier for you as the programmer to keep track of, precisely because "it's an internal self reference, I don't need to care about it".

Cheers,
Wol

Insulating layer?

Posted Oct 17, 2024 17:27 UTC (Thu) by daroc (editor, #160859) [Link] (1 responses)

You may be interested in Vale, a programming language that is trying a bunch of new things around lifetime-based memory safety, including compiler support for "regions", which work almost exactly how you describe. The project also has some interesting ideas about tradeoffs between different memory management strategies. I want to write an article about it at some point.

Insulating layer?

Posted Oct 18, 2024 16:33 UTC (Fri) by paulj (subscriber, #341) [Link]

That is very interesting. The generational references sound useful for performant weak references. Although you cannot use them for very frequently allocated objects (you can't reuse the memory past <generation size>_MAX).

Insulating layer?

Posted Oct 18, 2024 14:25 UTC (Fri) by taladar (subscriber, #68407) [Link] (2 responses)

That is essentially just an arena allocator. It solves the problem of forgetting to free something but doesn't solve e.g. accessing something that should not be used anymore after a certain operation.

Insulating layer?

Posted Oct 18, 2024 16:24 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> but doesn't solve e.g. accessing something that should not be used anymore after a certain operation.

And can't you just apply ordinary Rust rules to that? It's not intended as a way of escaping Rust's rules. It's just meant as a way of enabling the *programmer* to forget about a lot of the rules, on the basis that internal references, pointers, and objects will all go invalid at the exact same time. So if A points to B and B points to A and they're in this structure, you don't worry about cleanup because they both go poof and emit the magic smoke at the same time.

If A contains a pointer to C with a shorter (or longer) lifetime, Rust will need to check that A destroys the pointer as C goes out of scope, or alternatively that the entire heap cannot go out of scope until C is destroyed.

A simple structure for simple(ish) situations, and more complex structures for where simple doesn't work. And if those complex structures are hard to grasp, it helps if you've realised that the simple structures aren't up to the task (and why).

Cheers,
Wol

Rust has arena allocators

Posted Oct 18, 2024 16:35 UTC (Fri) by kleptog (subscriber, #1183) [Link]

There appear to be several arena allocators for Rust: https://manishearth.github.io/blog/2021/03/15/arenas-in-r...

The basic idea is you have a function which allocates an arena and keeps it alive. Within the arena, objects can reference each other as much as they want, including cycles. They can also reference other objects, as long as those live longer than the arena. When your function exits, the arena is cleaned up in one go. Incompatible with destructors (though there are tricks for that), but otherwise it looks like what you want.

I know them from PostgreSQL where they have an arena per query so interrupting the query safely throws away everything. There are plenty of use cases for them.

Insulating layer?

Posted Oct 18, 2024 14:23 UTC (Fri) by taladar (subscriber, #68407) [Link]

> many programs simply go to runtime ref-counting for much of their scoping

This is not my experience with Rust at all. Refcounting does happen occasionally, but the vast, vast majority of cases don't use ref-counting at all in Rust. Usually it is no more than a few values, even in large programs (logically speaking; each such value may of course exist in many copies, but it is referred to by the same name in the code and passed along the same code paths).

Insulating layer?

Posted Oct 14, 2024 13:08 UTC (Mon) by khim (subscriber, #9252) [Link]

It's true that some people (even among Rust developers) actively want to make it a ref-counting language… but currently… no, it's not a refcounting language.

It's very subtle and very much “something good done for entirely wrong reasons”, but the critical part is that in Rust to increment your counter you need to explicitly call Arc::clone or Rc::clone.

And if you pass your smart pointer around you don't really do any expensive operations at all. This affects efficiency very significantly.

Swift is trying to reach the same point from the other side and automatically eliminate ref-counting from places where compiler can prove it's not needed.

It's an interesting experiment, but I'm sceptical about that approach: the advantage of Rust (very much unplanned, as shown above!) is that pesky .clone that you don't want to see on every line.

That's simple psychology: if you see it, you want to eliminate it (when that's easy to do; it's not a crime to call .clone if you need to!); if it's invisible, you wouldn't even think about whether what you are doing is expensive or not.

Insulating layer?

Posted Oct 17, 2024 19:16 UTC (Thu) by sunshowers (guest, #170655) [Link]

Yes, it's common to sprinkle in refcounts where necessary. But most of the time you don't need them. So a common pattern is that the components all use refcounts to talk to each other, but use borrows internally.

Insulating layer?

Posted Oct 12, 2024 7:00 UTC (Sat) by smurf (subscriber, #17840) [Link] (67 responses)

> the extent to which you are empowered to point at language constructs and say "don't use that if you want to be safe."

That's the point right here. Rust has exactly one of these constructs. It's called "unsafe". You can find code that uses it with grep.

C++ has myriads of conditions that are documented as Undefined Behavior. Quite a few of them are invisible without deep code analysis that borders on solving the halting problem. They're not going to go away. There's too much code out there that depends on them and too much social weight behind keeping the status quo as it is; a significant subset of the C++ community still thinks that avoiding UB is the programmer's job, not the compiler's.

Insulating layer?

Posted Oct 12, 2024 20:55 UTC (Sat) by MaZe (subscriber, #53908) [Link] (66 responses)

Honestly, with C/C++ a large part of the problem is the compilers themselves (and/or their developers' mindsets) and their willingness to bend over backwards to not define UB, and to intentionally not make it do the obvious(ly correct/natural) thing.

For example you'd naively expect "foo x = {};" to simply be nicer syntax for "foo x; memset(&x, 0, sizeof(x));" (at least in the absence of non-zero default-initialized fields). But it's not... There are some ridiculous subtleties with structs/unions and padding, and some compilers can quite literally go out of their way to set some bits to 1. It's obvious/natural that 'foo x = {}' should mean zero-initialize the whole damn thing, including any and all padding, and then on top of that apply any non-standard constructors/initializations.

Basically compilers chase benchmarks instead of trying to make things a little bit more predictable/sane.

Insulating layer?

Posted Oct 13, 2024 10:23 UTC (Sun) by rschroev (subscriber, #4164) [Link] (19 responses)

> It's obvious/natural that 'foo x = {}' should mean zero initialize the whole damn thing including any and all padding

To me that is neither obvious nor natural at all. To me it feels much more natural that constructs like that serve to initialize the class/struct members; what happens to the padding is not my concern. I don't think I've ever cared about the bits in the padding, and I can't immediately think of a reason why I would; but if I did care, I would make it explicit and use something which operates not on the members but on the raw memory, i.e. memset or memcpy.

Insulating layer?

Posted Oct 13, 2024 10:52 UTC (Sun) by johill (subscriber, #25196) [Link] (18 responses)

It also seems false that the compilers "go out of their way to set bits to 1"; if anything, it would seem those bits are left untouched (stack memory)?

However, this has bitten me in conjunction with -ftrivial-auto-var-init=pattern, where for something like

union foo {
  char x;
  int y;
};

...

union foo f = { 0 };
clang created code to initialize x=0 and the other three bytes to the pattern, but ={} actually does initialize it all. The actual case was harder to understand because there was a struct involved, possibly with sub-structs, but ultimately the first member was the union.

I feel like the OP is perhaps misrepresenting this quirk, but of course they might have something else in mind.

Insulating layer?

Posted Oct 13, 2024 13:10 UTC (Sun) by MaZe (subscriber, #53908) [Link] (17 responses)

Honestly, I don't remember the precise details. But what you're writing rings multiple bells.

Furthermore, ={ 0 } and ={} aren't quite the same thing, but ={} isn't supported by older compilers...
= { 0 } used to mean "zero-init the whole thing", though...

As for why it matters? Take a look at the bpf kernel system call interface.
The argument to the system call is a union of structs for the different system call subcases.
The kernel requires everything past a certain point to be zero-filled; otherwise it assumes the non-zero values have meaning (i.e. future extensions) that it can't understand, and thus returns an error.

Another example is when a struct is used as a key to a (bpf) map.
Obviously the kernel doesn't know what portion of the key is relevant, so it hashes everything - padding included, so if padding isn't 0, stuff just doesn't work (lookups fail).
Yes, the obvious answer is to make *all* padding explicit.
Of course, this is also a good idea because userspace might be 32-bit and the BPF .o code is 64-bit, so you also don't have agreement on things like sizeof(long) or its alignment. You want (or rather need) asserts on all struct sizes.
Yes, once these gotchas get you, you know better, and there are workarounds...
But it's bloody annoying that you have to think about stuff like this.

There's ample other examples of C(++)-compiler-wants-to-get-you behaviours.
I just picked one single one...

[I guess some sort of 'sane C' dialect that actually says a byte is an unsigned 8 bits, arithmetic is two's complement, bitfields work like this, shifts work like this, alignment/padding works like this, here's a native/little/big-endian 16-bit integer, basically get rid of a lot of the undefined behaviour, add some sort of 'defer' keyword, etc., would help a lot]

Insulating layer?

Posted Oct 13, 2024 18:00 UTC (Sun) by marcH (subscriber, #57642) [Link] (16 responses)

> But it's bloody annoying that you have to think about stuff like this.

It's not just "bloody annoying": it's _buggy_ because there are and always will be many instances where the developer does not think about it.

Every single discussion about C++ safety always ends up in the very same "you're just holding it wrong" argument in one form or the other. Well, guess what: yes, there is _always_ someone somewhere "holding it wrong". That's because developers are humans. Even if some humans are perfect, many others are not. When humans don't want to make mistakes, they ask _computers_ to do the tedious bits and catch errors. You know, like... a Rust compiler.

And please don't get me started on Coverity and other external checkers: I've seen first hand these to be the pinnacle of "you're holding it wrong". Notably (but not just) because of the many false positives, which invariably lead some developers with an obvious "conflict of interest" to mark real issues in their own code as false positives while more experienced people have simply no time to look and must delegate blindly.

Advocates of C++ safety live in some sort of fantasy workplace / ivory tower where none of this happens and where they muse about technical and theoretical problems while being completely disconnected from the trenches where most coding happens. This is a constant failure to realize that usability comes first, technical details second. Same thing with security: the best design on paper is worth nothing at all if it's too complicated to use.

The only, last hope for C++ safety is for a large enough number of C++ experts to spend less time on benchmarks and agree on ONE[*] safe subset and implement the corresponding static analyzer and distribute it with LLVM by default. That's a very long shot and even then some users will still find ways to hold that static analysis "wrong" in order to make their deadline. A generous dose of "unsafe" keywords on legacy code will of course be the easiest way (cause some of these keywords will be actually required) but they will find many other ways.

I think it has gotten a bit better now, but until recently we were still hearing horror stories based on _SQL injection_! If our industry can't get past a problem that dumb and that easy to spot, catch and kill, then who's going to bet on "safe C++"?

[*] ONE subset? No, let's have C++ "profileS", plural! More choice!

Insulating layer?

Posted Oct 13, 2024 19:48 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (15 responses)

> [*] ONE subset? No, let's have C++ "profileS", plural! More choice!

I sit in on SG23 (the "Safety and Security" study group for ISO C++) and participate in discussions, but have not contributed any papers (mostly I try to help provide accuracy improvements when statements about Rust come up, as I have done a fair amount of Rust programming myself; something that is thankfully increasing among other members as well). It is my understanding that there are multiple profiles because:

- *some* progress has to be able to be made in a reasonable amount of time;
- waiting until all of C++ is covered by a single "safe" marker is likely to take until something like C++32 (never mind all of the things added between now and then); and
- even if a single "safe" marker were possible, deployment is nigh impossible if there's not some way to attack problems incrementally rather than "here are 3000 diagnostics for your TU, good luck".

Though Sean Baxter (of Circle) et al. are working on the second item there, it is *far* more radical than things like "make sure all array/vector indexing goes through bounds-checked codepaths" that one has a hope of applying incrementally to existing codebases.

I believe the plan is to lay down guidelines for profiles so that existing code can be checked against them as well as other SGs coming to SG23 to ask "what can I do to help conform to profile X?" with their work before it is merged.

Insulating layer?

Posted Oct 14, 2024 10:19 UTC (Mon) by farnz (subscriber, #17727) [Link] (14 responses)

From the outside, though, what seems to be happening is not "there is a goal of a 'one true safe profile', such that all code can be split into an 'unsafe' part and a 'safe' part, but to make progress, we're coming up with ways for you to say 'I comply only with this subset of the safety profile'", but rather "let 1,000 profiles conflict with each other, such that you cannot combine code written under different profiles safely". This has the worrying corollary that you expect 'safe' C++ to be impossible to safely compose, since there's no guarantee that two libraries will choose the same safety profile.

If it was clear that the intent is that there will eventually be one true profile that everyone uses, and that the reason for a lot of profiles existing is that we want to make progress towards the one true profile in stages, then there would be less concern.

Insulating layer?

Posted Oct 14, 2024 14:30 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (13 responses)

I don't know who is speculating about "1000 profiles", never mind them conflicting. AFAIK, there are *maybe* 5 under consideration. Bjarne's papers certainly have a limited number of them at least; anyone making such noise hasn't shown up to ISO.

Number of profiles is kinda irrelevant here

Posted Oct 14, 2024 16:34 UTC (Mon) by farnz (subscriber, #17727) [Link] (12 responses)

For all practical purposes, there's no difference between 2 conflicting profiles and 1,000 profiles. The key bit is that you're perceived (by people showing up to ISO) as adding more profiles to "resolve" conflicts between the ultimate goal, rather than working towards a single safe C++ profile, with multiple supersets of "allowed" C++ to let you get from today's "C++ is basically unsafe" to the future's "C++ is safe" in stages.

Ultimately, what's scary is that it looks like the idea is that I won't be allowed to write code in safe C++, and mix it with safe C++ from another team, because we might choose different profiles; I can only see two paths to that:

  1. Code that is safe according to profile A is definitionally safe according to profile B as well, for all values of A and B. This means that it doesn't matter which profile I use, I get the same outcome - it's just that different profiles will complain about different things.
  2. There is one, and only one, safe C++ profile. The remaining profiles are all stepping stones on the route from arbitrary C++ to safe C++, and it's understood that they are just that - stepping stones such that your code isn't safe C++ yet, but it's usefully safer under certain conditions.

Otherwise, you end up with the language fragmenting; if you have 3 profiles that conflict, and I write in profile 1, while you write in profile 2, the combination of our code is no longer safe C++. The only way to avoid this is to mandate that we all use the same profile; but then you've spent all that effort writing your profiles, only to see it wasted because all but 1 profile is not used.

Number of profiles is kinda irrelevant here

Posted Oct 14, 2024 17:34 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (11 responses)

Profiles are a module-local decision that doesn't affect the decision to use other profiles in any other modules. I believe there is the proposed ability to apply a profile to an import to do some kind of "make sure I don't ship my unsafe bits for this to another module" verification, but I don't believe that there's anything saying one cannot have code under different profiles in the same program (unless C++ is going to also kick out not-under-a-profile C code which seems…unlikely).

Where do you foresee any profiles conflicting in such an incompatible way? Sure, it could happen, but there are specific broad profiles proposed and I'm not aware of anything inherently being in conflict. I'd be surprised if anything like that were known during the design and discussion and not be addressed.

Number of profiles is kinda irrelevant here

Posted Oct 14, 2024 18:19 UTC (Mon) by smurf (subscriber, #17840) [Link]

> Profiles are a module-local decision that doesn't affect the decision to use other profiles in any other modules

What happens if I code to profile A and need to call some library function that changes the state of some variable to, let's say from shared to unshared (assuming that this is of concern for Profile A; could be anything else), but which isn't annotated with profile A's directives — because it's from library B which is coded according to profile C (or no profile at all) instead?

Answer: you spend half your productive time adding profile A's extensions to B's headers, and the other half arguing with your manager why you're even using B (and/or A) when you have to put so much nonproductive effort into it.

The result will be that you now have a fork of B. Alternately you spend even more time *really* fixing its profile-A-violations instead of just papering them over (assuming that you have its source code, which is not exactly guaranteed), sending its authors a pull request, and even more time convincing them that coding to Profile A is a good idea in the first place — which will be an uphill battle if the internals of B aren't easily convertible.

Contrast this kind of what-me-worry attitude with Rust's, where spending considerable effort to avoid "unsafe" (if at all possible) is something you do (and your manager expects you'll do, as otherwise you'd be coding in C++) because you fully expect that *every* nontrivial or not-carefully-reasoned-out use of it *will* bite you sooner or later.

Number of profiles is kinda irrelevant here

Posted Oct 14, 2024 18:32 UTC (Mon) by farnz (subscriber, #17727) [Link] (9 responses)

The versions I've heard about from SG23 members who pay attention to these things are cases where the different profiles assume different properties of "safe C++", such that if I, in module C, import modules A and B that each uses a different profile of "safe C++", the interactions between modules A and B through their safe interfaces, as intermediated by C, result in the safety guarantees made by their respective safety profiles being broken.

To put it in slightly more formal terms, each profile needs to be a consistent axiomatic system, such that anything that cannot be automatically proven safe using the profile's axioms is unsafe, and the human is on the hook for ensuring that the resulting code is consistent with the profile's axioms. The problem that multiple profiles introduce is that all pairs of profiles need to be relatively consistent with each other, or any complete program where two modules use different profiles has to be deemed unsafe.

I think we agree that "two modules using different profiles means your entire program is unsafe" is a bad outcome. But I'm arguing that fixing the problem of "we can't agree on axioms and theorems in our safe subset" by having multiple sets of axioms and theorems that must be relatively consistent with each other is simply expanding the amount of work you have to do, for no net gain to C++.

Number of profiles is kinda irrelevant here

Posted Oct 14, 2024 20:34 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (8 responses)

> the different profiles assume different properties of "safe C++"

Yes…they wouldn't be different if not. But I guess I'm having trouble seeing where these conflicts live given that no specific rules have been laid down to even compare yet. The boundaries of where guarantees happen is certainly something that needs to be considered. The situation is going to be messier than Rust, but it's a lot harder to get moving with billions of lines of code you can't rewrite.

I think it's more along the lines of "module A says it has bounds safety and module B says it has thread safety". Module C using them together doesn't mean B can't mess up bounds or A gets itself in a thread-related problem by the way they get used together, but the thread bugs shouldn't be in B and the bounds bugs shouldn't be in A.

Number of profiles is kinda irrelevant here

Posted Oct 14, 2024 21:04 UTC (Mon) by farnz (subscriber, #17727) [Link] (7 responses)

Unless the profiles are subsets of a "final" target, it's very hard to avoid accidental conflicts based on assumptions; for example, module A says it has bounds safety, but the assumptions in the "bounds safety" profile happen to include "no mutation of data, including via calls into module A, from any module that does not itself have bounds safety", and module B does not have bounds safety, thus causing module A to not have bounds safety because module B calls a callback in module C that calls code in module A in a way that the bounds safety profile didn't expect to ever happen unless you have both the bounds safety and thread safety profiles on module A. As a result, module A claims to have bounds safety, but module A has messed up bounds because it was "bounds safety on the assumption of no threading from modules without bounds safety".

Now, if it looked like SG23 were doing this because they knew it would make things a lot harder, and probably put off safe C++ until C++40 or later, I'd not be so concerned; but the mathematical nature here (of axiomatic systems) means that by splitting safety into "profiles", you've got all the work you'd have to do for a single "safe C++", plus all the work involved in ensuring that all the profiles are consistent with each other in any combination - and as the number of combinations of profiles grows, that work grows, too. If you have a "bounds safety" profile and a "thread safety" profile, then you need to ensure that the combination of bounds safety and thread safety is consistent. But you also need to ensure that bounds safety plus not thread safety is consistent, and that thread safety plus not bounds safety is consistent, and so on. Add in a third profile, and now you have to ensure consistency for all three profiles at once, all three cases of 1 profile, and all 3 pairs of profiles, and it just gets worse as the profile count goes up.

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 13:43 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (6 responses)

> but the assumptions in the "bounds safety" profile happen to include "no mutation of data, including via calls into module A, from any module that does not itself have bounds safety"

I don't think any profile can guarantee that it can't be subverted by such situations (e.g., even Rust's guarantees can be broken by FFI behaviors). The way I foresee it working is that profiles get enough traction to help avoid some kind of "no C++ anymore" regulation by instead allowing "C++ guarded by profiles X, Y, Z", with ratcheting requirements as time goes by. If you need bounds safety, you need bounds safety, and the bug needs to be addressed in the right location.

Thanks for the discussion, by the way. I'll keep this in mind as I participate in SG23 to help keep an eye on it.

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 14:50 UTC (Wed) by Wol (subscriber, #4433) [Link] (5 responses)

Wouldn't one of the easiest quick fixes be simply to say that all behaviour currently defined as UB must have flags that convert it to implementation-defined, and, like the optimisation levels, to have some standard "bulk" definitions?

So it's opt-in (so it won't break existing programs), but would get rid of a huge swathe of programmer logic errors. And it might even help different profiles work together ...

Cheers,
Wol

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 16:09 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (3 responses)

There is…a lot of UB. If that were done, implementations would now have to document what that behavior is (and I don't think they can say "UB" anymore). What does one even say happens in the case of data races, division by zero, accessing an invalid iterator, access outside the bounds, unaligned atomic operations, etc.?

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 17:48 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

Data races? I guess the obvious answer is "what gets stored last gets kept"? Or are you going to tell me that it's not that simple?

Division by zero I think is simple - "if the hardware supports it, set the value to IEEE Infinity. Otherwise if the compiler detects it it's compiler error else it's a runtime error. Allow the user to override the hardware and crash".

I'd be quite happy with an answer of "The compiler can't detect it, so not our problem".

But I'm thinking of the easy wins like "signed integer overflow is UB". At present, I gather, most compilers optimise away your naive programmer's test for it, on the grounds of "if it can't happen then the following code is dead code". Most compilers have a flag that defines it as two's complement, I believe, though that's not the default.

So I'm guessing that I should have worded it better as "All UB that the compiler can detect and act upon needs to have that action documented, and options provided to configure sane behaviour (for the skilled-in-the-arts end-developer's definition of sane). And standard bundles of such options could/should be provided".

As they say, the whole is greater than the sum of its parts, and you will get weird interactions between optimisations. But if the "safe profiles" emphasise the end-developer's ability to know and control how the compiler is interpreting UB, we (a) might get more predictable behaviour, (b) we might get more sensible conversations about what is and is not possible, and (c) we might get fewer unpleasant surprises from those weird interactions, because end-developers will disable optimisations they consider not worth the candle.

Effectively what the linux kernel is already doing to some extent. But we want to get rid of the masses of options that you need a Ph.D. to understand, and just have a few "here this gives you a sane bundle of options". It's rather telling that most of the options linux specifies are --no-f-this, --no-f-that ... Actually, that "sane bundle of options" sounds like what -O1, 2, 3 should be doing, but I strongly get the impression it isn't.

Cheers,
Wol

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 18:18 UTC (Wed) by smurf (subscriber, #17840) [Link]

> I'm thinking of the easy wins like "signed integer overflow is UB"

This is an "easy win" in the sense that it's a solved problem: both gcc and clang have a '-fwrapv' option.

Most UBs are *not* "easy wins". The past discussion here should have told you that already.

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 18:43 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

> Data races? I guess the obvious answer is "what gets stored last gets kept"? Or are you going to tell me that it's not that simple?

And if thread 1 wins on the first byte and thread 2 on the second byte, which is "last"? Reading it gives a value that was never written. I believe the JVM handles this by saying that *a* value is returned, but it doesn't have to be a value that was ever written.

> Division by zero I think is simple - "if the hardware supports it, set the value to IEEE Infinity. Otherwise if the compiler detects it it's compiler error else it's a runtime error. Allow the user to override the hardware and crash".

Sure…assuming IEEE support and things like `-fno-fast-math`. What do you propose for integer division here?

> I'd be quite happy with an answer of "The compiler can't detect it, so not our problem".

So…can the compiler optimize based on any assumptions around this behavior? I mean, if the behavior is implementation-defined to be "I dunno", what kinds of as-if code transformations are allowed?

> Actually, that "sane bundle of options" sounds like what -O1, 2, 3 should be doing, but I strongly get the impression it isn't.

In general, the "optimization level" flags are a grab bag of heuristics with wildly varying behavior across implementations. For example, Intel's toolchains turn on (their equivalent of) `-ffast-math` at `-O1` and above.

Number of profiles is kinda irrelevant here

Posted Oct 16, 2024 16:58 UTC (Wed) by marcH (subscriber, #57642) [Link]

I think this explains a bit why this is not possible

https://blog.llvm.org/2011/05/what-every-c-programmer-sho...

Insulating layer?

Posted Oct 13, 2024 11:08 UTC (Sun) by mb (subscriber, #50428) [Link] (38 responses)

>For example you'd naively expect "foo x = {};" to simply be nicer syntax for "foo x; memset(&x, 0, sizeof(x));"

I think this is also not the case in Rust, where =Default::default() is a rough equivalent of C's ={}.

Rust just never lets you access the padding without unsafe. But the padding is still uninitialized, and reading it is UB, unless you explicitly memset the whole thing to zero first.

Insulating layer?

Posted Oct 13, 2024 19:50 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (37 responses)

I think touching (reading or writing) padding is always UB (or, at best, indeterminate) because a containing `enum` may use the padding to store the discriminant.

Insulating layer?

Posted Oct 13, 2024 22:33 UTC (Sun) by khim (subscriber, #9252) [Link] (36 responses)

It's indeterminate, but not UB. Normally reading an uninitialized value in Rust is UB, but reading padding is specifically excluded: "Uninitialized memory is also implicitly invalid for any type that has a restricted set of valid values. In other words, the only cases in which reading uninitialized memory is permitted are inside unions and in 'padding' (the gaps between the fields/elements of a type)."

It's just too easy to write code that reads padding, for one reason or another, thus keeping it UB was considered to be too dangerous.

But while reading padding is not UB, the result is still some random value, and not zero.

Insulating layer?

Posted Oct 14, 2024 7:54 UTC (Mon) by Wol (subscriber, #4433) [Link] (35 responses)

> It's just too easy to write code that reads padding, for one reason or another, thus keeping it UB was considered to be too dangerous.

Imho, that's the wrong logic entirely. There is no reason it should be UB, therefore it shouldn't be.

Accessing the contents pointed to by some random address should be UB - you don't know what's there. It might be i/o, it might be some other program's data, it might not even exist (if the MMU hasn't mapped it). You can't reason about it, therefore it's UB.

But padding, uninitialised variables, etc etc are perfectly valid to dereference. You can reason about it, you're going to get random garbage back. So that shouldn't be undefined. But you can't reason about the consequences - you might want to assign it to an enum, do whatever with it that has preconditions you have no clue that this random garbage complies with. Therefore such access should be unsafe.

Principle of least surprise - can you reason about it? If so you should be able to do it.

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 13:08 UTC (Mon) by khim (subscriber, #9252) [Link] (34 responses)

> Imho, that's the wrong logic entirely. There is no reason it should be UB, therefore it shouldn't be.

Thanks for showing us, yet again, why C/C++ couldn't be salvaged. Note how you have entirely ignored everything except your own opinion (based on how the hardware works).

Just because there is no reason for it to be UB from your POV doesn't mean there is no reason for it to be UB from someone else's POV. And, indeed, that infamous be || !be is very much UB in both Rust and C.

Yes, it may not make much sense for the "we code for the hardware" guys, and it may be somewhat problematic, but in Rust it's not easy to hit that corner case (just look at how many contortions I had to go through to finally be able to shoot myself in the foot!) and it helps compiler developers, thus it was decided that access to a "normal" uninitialized variable is UB. Even if some folks think it shouldn't be.

> But padding, uninitialised variables, etc etc are perfectly valid to dereference.

Nope. That's not how either C or Rust works. And in Rust, reading uninitialised variables is UB while reading padding is not, because that's where the best compromise for all parties involved was found.

> Therefore such access should be unsafe.

Just how much unsafe? What about that pesky be || !be? Should it be guaranteed to return true, or is it permitted to return false or crash? Note that both C and Rust allow all three possibilities (although rustc tries to help the programmer and make it crash when it's easily detectable, but such behavior is very much not guaranteed).

> Principle of least surprise - can you reason about it? If so you should be able to do it.

Except that was tried for many decades and simply doesn't work. The answer to "can you reason about it?" very much depends on the person you are asking. But to write something in any language, at least two people have to give the same answer to that question: the person who writes the compiler (or interpreter) and the person who uses said compiler.

Simple application of the principle of least surprise only works reliably when the sole user of the language is also its developer – and in that case it's very much not needed.

Insulating layer?

Posted Oct 14, 2024 15:02 UTC (Mon) by Wol (subscriber, #4433) [Link] (32 responses)

> > But padding, uninitialised variables, etc etc are perfectly valid to dereference.

> Nope. That's not how both C and Rust work. And in Rust reading uninitialised variables is UB while reading padding is not. Because that's where the best compromise for all parties involved was found.

Are you saying that the memory location is not allocated until the variable is written? Because if the memory location of the variable is only created by writing to it, then fair enough. That however seems inefficient and stupid because you're using indirection, when you expect the compiler to allocate a location.

But (and yes maybe this is "code to the hardware" - but I would have thought it was "code to the compiler") I'm not aware of any compilers that wait until a variable is written before allocating the space. When I call a function, isn't it the norm for the COMPILER to allocate a stack frame for all the local variables? (yes I know you can delay declaration, which would delay allocation of stack space.) Which means that the COMPILER allocates a memory location before the variable is accessed for the first time. Which means I may be getting complete garbage if I do a "read before write", but fair enough.

Okay, I don't know how modern gcc/clang etc work, but I'm merely describing how EVERY compiler (not many) I've ever had to dig into deep understanding of, work.

So no, this is NOT "code to the hardware". It's "code to the mental model of the compiler", and if you insist it's code to the hardware, you need to EXPLAIN WHY.

Or are you completely misunderstanding me and thinking I mean "dereference a pointer to memory" - which is not what I said and not what I meant!!! OF COURSE that's UB if you are unfortunate/stupid/incompetent enough to do that - taking the contents of the address of the pointer is perfectly okay - the compiler SHOULD have made sure that the address of the pointer points to valid memory. USING that contents to access memory is UB because using garbage as pointer to memory is obviously a stupid idea.

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 15:21 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (17 responses)

IIUC, the way this is thought about is that there are additional values for a memory location beyond the 256 encoded by the bits directly. LLVM's model has `undef` and `poison` at least. The only way to clear them is to write to the location. So the memory is allocated, but may (logically) hold a language-unrepresentable value. Note that some hardware does have things like this: CHERI actually has 129 bits per pointer, the last bit not being addressable through the pointer value but instead managed with dedicated instructions (probably?) to indicate "is a valid pointer". So while in C (and Rust) one could write the equivalent of `T* ptr = (T*)some_u128;`, `ptr` might not actually be usable as a pointer (though when such casts obey the *language* rules, instructions to set that flag bit to the right state should be inserted).

Insulating layer?

Posted Oct 14, 2024 15:40 UTC (Mon) by Wol (subscriber, #4433) [Link] (16 responses)

But why does that make a difference to your ability to reason about it? It's basically the same problem I have in databases, where Pick has the empty string and SQL has NULL, and you're forever cursing the designer's decision to use those to mean about four different things.

Going back to the "to be or not to be", if uninitialised variables are defined as containing the value "undef", about which you cannot reason, then "be || !be" would refuse to compile. But it wouldn't be UB, it would be an illegal operation on an undefined variable.

If, however, the language allows you to operate on undefined variables inside an unsafe block, then to_be_or_not_to_be() would be an unsafe function, only callable from other unsafe functions, unless it actively asserted that it itself would not return a value of "undef".

(Like SQL allows logical operations on NULL, where any NULL in the expression means the result is NULL.)

And if you have a "convert random garbage to boolean" function, that can even handle undef and poison, then to_be_or_not_to_be() would just be like any normal function - guaranteed to return a boolean, just not necessarily a predictable one.

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 17:28 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (15 responses)

> But why does that make a difference to your ability to reason about it?

There are codepaths which make it impossible to know if something is actually initialized. Diagnostics to that effect are madness. Rust can avoid a lot of these issues because tuples are a language-level thing and functions wanting to return multiple values don't have to fight over which one "wins" the output while the rest live as output parameters. C++ at least has tuples, but they are…cumbersome to say the least.

bool be;
int status = do_init(&be);
// Is `be` initialized or not? `do_init` is a function call whose
// implementation is not visible at compile or link time.

Given that reading `be` while it is actually uninitialized is UB, the optimizer is allowed to assume that it is initialized. It may choose to emit a diagnostic that it is doing so, but it doesn't have to (and that may end up producing all kinds of unwanted noise). Since the valid representations are "0" or "1", `be || !be` can be as-if implemented as `be == 1 || be == 0`, which leads to the UB effect of "impossible branches" being taken when the bit pattern ends up being 2.

Insulating layer?

Posted Oct 14, 2024 18:01 UTC (Mon) by smurf (subscriber, #17840) [Link] (2 responses)

> There are codepaths which make it impossible to know if something is actually initialized

… in C++. There is no such thing in Rust — unless you're using "unsafe", that is. If you do, it's your responsibility to hide 100% of that unsafe-ness from your caller, at as low a level as possible, where reasoning about the state of a variable is still easy enough to do something about it.

Insulating layer?

Posted Oct 14, 2024 19:08 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

Exactly my point. to_be_or_not_to_be() knows it's undefined. So either it returns an unsafe boolean, or it has to fix the problem itself. If it knows it's returning undef in a "pure" boolean, it has to be a compiler error.

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 19:58 UTC (Mon) by smurf (subscriber, #17840) [Link]

to_be_or_not_to_be() knows it's undefined because its undefined-ness is expressed on exactly one line of C[++].

That argument no longer holds when it's spread over multiple functions, or even compilation units.

Insulating layer?

Posted Oct 14, 2024 19:17 UTC (Mon) by Wol (subscriber, #4433) [Link] (11 responses)

> int status = do_init(&be);

What does do_init() do? If it merely returns whether or not "be" is initialised, then how can the optimiser assume that it is initialised? That's a massive logic bug in the compiler!

If, on the other hand, it forces "be" to be a valid value, then of course the compiler can assume it's initialised. But that would be obvious from the type system, no?

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 19:42 UTC (Mon) by khim (subscriber, #9252) [Link] (10 responses)

> If it merely returns whether or not "be" is initialised, then how can the optimiser assume that it is initialised?

That's easy: because a correct program is not supposed to read an uninitialized variable, the compiler can conclude that on every branch where the variable is read it has been successfully initialized. It's then the developer's responsibility to fix the code to ensure that.

> But that would be obvious from the type system, no?

Nope. When you call read(2), nothing in the type system distinguishes return values that are greater than zero from those that are smaller.

Insulating layer?

Posted Oct 14, 2024 20:19 UTC (Mon) by Wol (subscriber, #4433) [Link] (9 responses)

> > If it merely returns whether or not "be" is initialised, then how can the optimiser assume that it is initialised?

> That's easy: because correct program is not supposed to read uninitialized variable it can conclude that on all branches where it's read it's successfully initilialized. Then it's responsibility of developer to fix code to ensure that

Circular reasoning !!! Actually, completely screwy reasoning. If do_init() does not alter the value of "be", then the compiler cannot assume that the value of "be" has changed!

Let's rephrase that - "Because a SAFE *function* is not supposed to read an uninitialised variable".

to_be_or_not_to_be() knows that "be" can be 'undef'. Therefore it either (a) can apply boolean logic to 'undef' and return a true boolean, or (b) it has to return an "unsafe boolean", or (c) it's a compiler error. Whichever route is chosen is irrelevant, the fact is it's just *logic*, not hardware, and it's enforceable by the compiler. In fact, as I understand Rust, normal compiler behaviour is to give the programmer the ability to choose whichever route he wants!

All that matters is that the body of to_be_or_not_to_be() is marked as unsafe code, and the return value is either a safe "true boolean", or an unsafe boolean that could be 'undef'. At which point the calling program can take the appropriate action because IT KNOWS.

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 20:31 UTC (Mon) by khim (subscriber, #9252) [Link]

> In fact, as I understand Rust, normal compiler behaviour is to give the programmer the ability to choose whichever route he wants!

Nope. Normal compiler behavior is still the same as in C: the language user has to ensure the program doesn't have any UB.

The big difference is that for the normal, safe subset of Rust it's ensured by the compiler. But for unsafe Rust it's still the responsibility of the developer to ensure that the program doesn't violate any rules WRT UB.

> to_be_or_not_to_be() knows that "be" can be 'undef'.

Nope. It couldn't be undef. In Rust MaybeUninit<bool> can be undef, but a regular bool has to be either true or false. Going from MaybeUninit<bool> to bool when it's undef (and not true or false) is instant UB.

> At which point the calling program can take the appropriate action because IT KNOWS.

How does it know? You can't look at a MaybeUninit<bool> and ask it whether it's initialized or not. It's still very much the responsibility of the Rust user to ensure that the program doesn't try to convert a MaybeUninit<bool> which contains undef into a normal bool.

Insulating layer?

Posted Oct 14, 2024 20:41 UTC (Mon) by daroc (editor, #160859) [Link] (7 responses)

I think that using three exclamation marks in a row might be a sign that this back-and-forth is not going anywhere in particular. This is a worthy discussion topic, but I'm not sure the last few comments have added anything new.

Insulating layer?

Posted Oct 14, 2024 23:43 UTC (Mon) by atnot (subscriber, #124910) [Link] (6 responses)

> This is a worthy discussion topic, but I'm not sure the last few comments have added anything new.

I agree it may have been a worthy topic once upon a time. But when the same two people (khim and wol) have the same near-identical drawn-out discussions for dozens of messages a week in every amenable thread, to the point of drowning out most other discussion on the site (at least without the filter), while making zero progress on their positions over a span of at least 2 years, perhaps some more action is necessary.

Insulating layer?

Posted Oct 15, 2024 10:13 UTC (Tue) by Wol (subscriber, #4433) [Link] (5 responses)

Apologies. I try not to respond to khim any more.

Unfortunately, I suspect the language barrier doesn't help. I think sometimes we end up arguing FOR the same thing, but because we don't understand what the other one is saying we end up arguing PAST each other.

Cheers,
Wol

Insulating layer?

Posted Oct 15, 2024 11:15 UTC (Tue) by khim (subscriber, #9252) [Link] (4 responses)

> Unfortunately, I suspect the language barrier doesn't help.

It could be a language barrier, but I have seen similar discussions go in circles endlessly between pairs of native speakers too, thus I suspect the problem is deeper.

My feeling is that it's related to the difference between how mathematicians apply logic and how laymen do.

> I think sometimes we end up arguing FOR the same thing, but because we don't understand what the other one is saying we end up arguing PAST each other.

No, that's the issue. The big problem with compiler development (and language development) lies in the fact that the compiler can't answer any interesting questions about the semantics of your program.

And that's why we go in circles. Wol arguments usually come in the form of:

  1. Compiler can do “the right thing” or “the wrong thing”
  2. “The wrong thing” is, well… wrong, it's bad thus compilers have to do “the right thing”

And my answer comes in the form of:

  1. Sure, we may imagine compilers that do “the right thing” or “the wrong thing”
  2. Except usually “the wrong thing” is possible to implement while “the right thing” is impossible to implement
  3. We are stuck with “the wrong thing”, thus talking about “the right thing” is pointless
  4. We may try to mitigate consequence of doing “the wrong thing” using some alternate approaches

And Wol ignores the most important #2 step from my answer and goes back to “compilers should do “the right thing”… even if I have no idea how can they do that”.

I have no idea why it is so hard for a layman to accept that compilers are not omnipotent and compiler developers are not omnipotent either, that there are things that compilers can just never do… but that's the core issue: the “we code for the hardware” guys somehow have a mental model of a compiler which is simultaneously very simple (as discussions about how modern compilers introduce shadow variables for each assignment show) and infinitely powerful (as discussions about what the compiler has to detect and report show).

I have no idea how a compiler can be simultaneously primitive and yet infinitely powerful; that's definitely a barrier right there, but it's not a “native speaker” vs “non-native speaker” barrier.

And I continue the discussion to see if that barrier can be broken, somehow – because in that case C/C++ would have a chance! If that barrier can be broken then there's a way to reform the C/C++ “we code for the hardware” community!

But so far it looks hopeless. Safe languages will still arrive, they will just have to arrive the sad way, one funeral at a time. And that will be the end of C/C++, because the “one funeral at a time” approach favors transition to a new language.

Insulating layer?

Posted Oct 15, 2024 16:18 UTC (Tue) by atnot (subscriber, #124910) [Link]

I think this is a very tone-deaf way to respond to someone agreeing to finally stop participating in a discussion that has been unproductive for years. Even if the powers that be continue to tolerate it, perhaps it's time to practice the (admittedly difficult) skill of letting someone be "wrong on the internet", for the sake of the rest of us.

Insulating layer?

Posted Oct 15, 2024 16:42 UTC (Tue) by intelfx (subscriber, #130118) [Link] (2 responses)

One has to have a non-trivial amount of gall to unapologetically continue to paint oneself as a "mathematician" and attack your opponent as a "layman"... immediately after you were asked to stop not just the attacks, but the entire argument.

Insulating layer?

Posted Oct 15, 2024 21:58 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

And to say I'm not a mathematician ... well, I don't necessarily understand symbolic logic, and I don't have a COMPUTER degree, but I've got an Honours in Science (Maths, Physics and Chemistry). And I've got a Masters, too (in technology, though I'm much less impressed with the value of that degree than the plain Bachelors). And I've got a very good intuition for "this argument sounds screwy" ...

And when I try to argue LOGICally, if people argue straight past my logic and don't try to show me what's wrong with it, I end up concluding they have to be attacking the messenger, because they are making no attempt to attack the message.

Khim is saying Science advances one funeral at a time, but it'll be his funeral not mine - if you look back over what I've written it's clear I've learnt stuff. Don't always remember it! But as I've said every now and then in the past, you need to try to understand what the other person is arguing. I get the impression khim finds that hard.

Cheers,
Wol

Insulating layer?

Posted Oct 15, 2024 22:13 UTC (Tue) by daroc (editor, #160859) [Link]

Alright, I think that's a fairly clear personal attack. Probably I should have said something after khim's message. Sorry, Wol, for not stepping in at that point. But in either case, this topic should end here.

Insulating layer?

Posted Oct 14, 2024 19:45 UTC (Mon) by dezgeg (subscriber, #92243) [Link] (3 responses)

See https://devblogs.microsoft.com/oldnewthing/20040119-00 for an example of a real architecture where attempting to use an uninitialized register variable might cause an exception.

Insulating layer?

Posted Oct 14, 2024 20:32 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

Interesting.

But I'm not arguing that attempting to dereference garbage is okay. That post explicitly says the programmer chose to return garbage. No problem there. The problem comes when you attempt to USE the garbage as if it was valid. Rust would - I assume - have marked the return value as "could be garbage", and when the caller attempted to dereference it without checking, Rust would have barfed with "compile error - you can't unconditionally dereference possible garbage".

The point is, the programmer can reason about it because Rust would force them to track the fact that the return value could be garbage.

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 20:43 UTC (Mon) by khim (subscriber, #9252) [Link]

> Rust would - I assume - have marked the return value as "could be garbage"

Nope. Rust doesn't do that. A Rust developer may use MaybeUninit<bool> to signal to the compiler that the value may be uninitialized. And then the Rust developer (not the compiler!) decides when to go from MaybeUninit<bool> to bool (which tells the compiler that at this point the value is initialized).

If I lie to the compiler (like I did) at that point – that's an instant UB.

IOW: Rust does the exact same thing C/C++ does, but mitigates the issue by making the transition from the “could be uninitialized” type to the “I believe it's initialized now” type explicit.

> Rust would have barfed with "compile error - you can't unconditionally dereference possible garbage".

This couldn't be a compiler error, but sure enough, if you violate these rules and the compiler can recognize it, then there will be a warning. It's a warning, not an error, because the compiler may recognize this situation but is not obliged to; it's something it does on a “best effort” basis.

Insulating layer?

Posted Oct 14, 2024 22:03 UTC (Mon) by dezgeg (subscriber, #92243) [Link]

The linked Itanium example doesn't dereference uninitialized pointers anywhere - it demonstrates that just attempting to store an uninitialized register value might fault. In essence, this could blow up:

int global;

void f() {
    int uninitialized;
    global = uninitialized;
}

I.e. a direct example of an architecture where what you wrote ("But padding, uninitialised variables, etc etc are perfectly valid to dereference. You can reason about it, you're going to get random garbage back.") doesn't apply.

Insulating layer?

Posted Oct 14, 2024 20:12 UTC (Mon) by khim (subscriber, #9252) [Link] (9 responses)

> Are you saying that the memory location is not allocated until the variable is written?

Worse. For decades already most compilers have created a new location for every store to a variable. GCC started doing it around 20 years ago; LLVM did from day one.

> That however seems inefficient and stupid because you're using indirection, when you expect the compiler to allocate a location.

Why would you use indirection? Tree SSA doesn't need that; you just create many copies of the variable (one for each store to it).

> When I call a function, isn't it the norm for the COMPILER to allocate a stack frame for all the local variables? (yes I know you can delay declaration, which would delay allocation of stack space.)

No. Consider the following example:

struct II {
    int x;
    int y;
};

int foo(struct II);

struct LL {
    long x;
    long y;
};

int bar(struct LL ll) {
    struct II ii = {
        .x = ll.x,
        .y = ll.y
    };
    return foo(ii);
}

Here the local variable ii is never allocated at all.

> It's "code to the mental model of the compiler", and if you insist it's code to the hardware, you need to EXPLAIN WHY.

It's “code to the hardware” because the “mental model of the compiler” can't explain how your program works without invoking the hardware that will be executing it. And it's only needed when you ignore the rules of the language as described in the specification, invent some new, entirely different specification of what your program is doing, and then assert that your POV is not just valid today, but will be the same 10, 20, 50 years down the road. That's… quite a presupposition.

> USING that contents to access memory is UB because using garbage as pointer to memory is obviously a stupid idea.

No, that's not about pointers, but about simple variables. Accessing them, if they are not initialized, is UB because a normal program shouldn't do that, and assuming that a correct program doesn't access uninitialized variables is beneficial for the compiler. Heck, there's a whole post dedicated to that issue on Ralf's blog.

Insulating layer?

Posted Oct 14, 2024 21:02 UTC (Mon) by Wol (subscriber, #4433) [Link] (8 responses)

> > Are you saying that the memory location is not allocated until the variable is written?

> Worse. For decades already most compilers create new location for every store to a variable. GCC started doing it around 20 years ago, LLVM did from the day one.

Ummmm ... my first reaction from your description was "Copy on write ??? What ... ???", but it's not even that!

The other thing to bear in mind is that this is a quirk of *optimising* compilers, and it completely breaks the mental model of "x = x = x", so it breaks the concept of "least surprise". And I don't buy that's "coding to the hardware". If I have a variable "x", I don't expect the compiler to create lots of "x"s behind my back!!! And while we might be dying off, there's probably plenty of people, like me! for whom this was Ph.D. research for our POST-decessors.

But. SERIOUSLY. What's wrong with either (a) pandering to the principle of least surprise and saying that dereferencing an uninitialised variable returns "rand()" (even if it's a different rand every time :-), or it returns 0x0000... The former makes intuitive sense, the latter would actually be useful, and if it's a compiler directive it's down to the programmer! which fits the Rust ethos.

If the hardware is my brain, after all, dereferencing an unitialised variable would be a runtime error, not an opportunity for optimisation and code elimination ...

Cheers,
Wol

Insulating layer?

Posted Oct 15, 2024 6:41 UTC (Tue) by khim (subscriber, #9252) [Link] (6 responses)

> And I don't buy that's "coding to the hardware".

How do you call it, then?

> If I have a variable "x", I don't expect the compiler to create lots of "x"s behind my back!!!

Just as you wouldn't expect the CPU to create a hundred accumulators in place of the one that the machine code uses? CPUs have been doing that for about 30 years.

Just as you wouldn't expect one piece of memory to contain different values simultaneously? CPUs started doing that even earlier.

I'm not telling you that to shame you for not knowing; I'm explaining why assuming that the compiler (or hardware) works in a certain “undocumented yet obvious” way doesn't lead to something you can trust.

There are many things that are implemented, both in hardware and in the compiler, via the as-if rule… and you don't even need to know about them if you are using the language in accordance with its specification. The same with hardware.

That's why I really like The Tower of Weakenings that Rust seems to embrace. With it normal developers have 90% of code that's “safe” and doesn't need to know anything about how hardware and compilers actually work. But in unsafe world there are also gradations. In Linux kernel there are certain tiny amount of code (related to RCU, e.g.) that touches everything simultaneously: the real hardware, compiler internals, and so on. But if said code is only 0.01% of the whole and the majority of developers work with the other 99.99% of code then precise rules used in that tiny part could be ignored by most developers.

> And while we might be dying off, there's probably plenty of people, like me!

Weren't we just talking about how major scientific breakthroughs arrive one funeral at a time? Looks like that's how safety in low-level languages will arrive, too.

And that's why C/C++ won't get it, in practice: the language can be changed to retrofit safety into it, but who would use it in that fashion, and why?

Insulating layer?

Posted Oct 15, 2024 10:09 UTC (Tue) by Wol (subscriber, #4433) [Link] (5 responses)

> > And I don't buy that's "coding to the hardware".

> What do you call it, then?

Brain-dead logic.

> And that's why C/C++ wouldn't get it, in practice: language can be changed to retrofit safety into it, but who would use it in that fashion and why?

So why am I left with the impression that it's PERFECTLY ACCEPTABLE for Rust to detect UB and not do anything about it? I take - was it smurf's - comment that "we can't promise to warn on UB because we can't guarantee to detect it", but if we know it's UB?

If a programmer accesses an uninitialised variable there's a whole bunch of reasons why they might have done it. The C(++) response is to go "I don't believe you meant that" and go and do a load of stuff the programmer didn't expect.

My understanding of the ethos of Rust is that if the compiler doesn't understand what you mean it's either unsafe, or an error. You seem to be advocating that Rust behave like C(++) and just ignore it. Sorry if I've got that wrong.

Imho (in this particular case) Rust should come back at the programmer (like any sensible human being) and ask "What do you mean?". Imho there's three simple scenarios - (1) you expect the data to come from somewhere the compiler doesn't know about, (2) you forgot to explicitly request all variables are zeroed on declaration / read-before-write, or (3) you're expecting random garbage. (If there's more, just add them to the list.)

So no you don't just dump stuff into UB, you should force the programmer to explain what they mean. And if they don't, it won't compile.

Cheers,
Wol

Insulating layer?

Posted Oct 15, 2024 10:33 UTC (Tue) by farnz (subscriber, #17727) [Link] (3 responses)

So why am I left with the impression that it's PERFECTLY ACCEPTABLE for Rust to detect UB and not do anything about it? I take - was it smurf's - comment that "we can't promise to warn on UB because we can't guarantee to detect it", but if we know it's UB?

Rust simply doesn't have UB outside of unsafe. There's no detection of UB involved at all - all the behaviour of safe Rust is supposed to be fully defined (and if it's not, that's a bug). For example, in the case of the Rust equivalent of bool be; int status = use_be(&be);, the defined behaviour is for the program to fail to compile because be is possibly used before it is initialized.
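A minimal sketch of the point above, where `use_be` is a hypothetical stand-in for the C function in the example (not from the original comment): safe Rust's definite-assignment analysis rejects the read of a possibly-uninitialized local at compile time, so there is no runtime behaviour left to define.

```rust
// Hypothetical stand-in for the C `use_be(&be)` from the example above.
fn use_be(be: &bool) -> i32 {
    if *be { 1 } else { 0 }
}

fn main() {
    let be: bool; // declared, but not yet initialized
    // let status = use_be(&be); // rejected: error[E0381], `be` possibly-uninitialized
    be = true; // the compiler tracks this definite assignment
    let status = use_be(&be); // fine: `be` is now known to be initialized
    assert_eq!(status, 1);
}
```

Uncommenting the rejected line turns the program into a compile error rather than UB, which is exactly the "defined behaviour is a refusal to compile" outcome described above.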

Insulating layer?

Posted Oct 15, 2024 21:40 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

So accessing the contents of be before you've initialised it is not UB, it's a fatal error. Thanks.

I don't necessarily think it's the best definition, but it IS defined and it IS in-character for Rust. Which UB would not be.

Cheers,
Wol

Insulating layer?

Posted Oct 16, 2024 9:46 UTC (Wed) by laarmen (subscriber, #63948) [Link] (1 responses)

Rust makes it *hard* for you to read uninitialized variables, but not impossible:

let b: bool = unsafe { MaybeUninit::uninit().assume_init() }; // undefined behavior! ⚠️

I lifted this from the MaybeUninit doc, including the comment. That will compile, but *is* UB.

Now, I'm of the opinion that it is perfectly reasonable for Rust to declare this UB, as the alternative makes a lot of assumptions about the underlying implementation, all for a use case that seems dubious to me.

Insulating layer?

Posted Oct 18, 2024 14:27 UTC (Fri) by taladar (subscriber, #68407) [Link]

Unsafe blocks in Rust basically mean that you get to use unsafe operations but you are also responsible for upholding safety guarantees in your code inside the block. Methods like assume_init() are meant to be used after you have verified that the value is initialized, otherwise your code is unsound.
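A small sketch of the sound pattern being described, assuming a trivial helper of my own invention (`init_bool`): the memory is written before `assume_init` is called, so the promise made inside the `unsafe` block is actually upheld.

```rust
use std::mem::MaybeUninit;

// Sound use of `assume_init`: initialize first, then claim initialization.
fn init_bool() -> bool {
    let mut slot: MaybeUninit<bool> = MaybeUninit::uninit();
    slot.write(true); // the memory is now genuinely initialized
    unsafe { slot.assume_init() } // OK: the safety condition holds
}

fn main() {
    assert!(init_bool());
}
```

Dropping the `slot.write(true)` line would make this the unsound example quoted earlier: the code would still compile, but the `unsafe` claim would be false and the read would be UB.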

Insulating layer?

Posted Oct 15, 2024 10:42 UTC (Tue) by khim (subscriber, #9252) [Link]

> You seem to be advocating that Rust behave like C(++) and just ignore it. Sorry if I've got that wrong.

I'm not “advocating” anything, I'm just explaining how things work. Not how they “should work”. But how they, inevitably, have to work (and thus how they actually work). If out of ten choices that one like or dislike only one is actually implementable then you are getting that one whether you like it or not.

> My understanding of the ethos of Rust is that if the compiler doesn't understand what you mean it's either unsafe, or an error.

We are talking the full Rust, not just “safe” Rust here. UB is UB, whether it's in safe code or unsafe code. And yes, you can trigger UB in safe Rust – and it would lead to the exact same outcome as in unsafe Rust.

> If a programmer accesses an uninitialised variable there's a whole bunch of reasons why they might have done it.

If a programmer accesses an uninitialized variable without using a special construct that is allowed to touch undef, then it's a bug. Period, end of story. If a program includes such an access then it has to be fixed; there is no other sensible choice.

The only difference between safe Rust and unsafe Rust is the decision of whose responsibility it is to fix such a bug. If it's in “safe” Rust then it's a bug in the compiler (currently there are around 100 such bugs) and the compiler developers have to fix it; if it's in unsafe Rust, then the developer has to fix it.

The compiler may warn about [potential] bugs in unsafe Rust, but ultimately it's the responsibility of the developer to fix them.

> Imho (in this particular case) Rust should come back at the programmer (like any sensible human being)

Impossible. Compilers are mindless (they literally have no mind and couldn't have one) and not sensible (they don't have “common sense”, and attempts to add it inevitably lead to even worse outcomes). That's something “we code for the hardware” people simply refuse to accept, for some unfathomable reason.

> (1) you expect the data to come from somewhere the compiler doesn't know about

In that case you have to use a volatile read or a volatile write.

> (2) you forgot to explicitly request all variables are zeroed on declaration / read-before-write

This is a bug and it should be fixed. If you managed to do that in normal, “safe” Rust then it's a bug in the compiler and it has to be fixed in the compiler; if you did that in unsafe Rust, then it's a bug in your code and you have to fix it.

> (3) you're expecting random garbage.

Currently that's also a bug, although there are discussions about adding such a capability to the language (to permit tricks like the one described in Using Uninitialized Memory for Fun and Profit). For now, Rust's only offering for such access is asm!.

> And if they don't, it won't compile.

Not possible, sorry. If you wrote the magic unsafe keyword then it's your responsibility to deal with UB now.

The compiler may still detect and report suspicious findings, but it can't be sure that it detected everything correctly, so such a thing can't be a compile-time error, only a compile-time warning.

Insulating layer?

Posted Oct 15, 2024 6:57 UTC (Tue) by smurf (subscriber, #17840) [Link]

> it completely breaks the mental model of "x = x = x",

No it doesn't. If the compiler knows at all times where your particular x lives at any given time, your mental model isn't violated. Your mental model of the apple you're going to have for tomorrow's breakfast doesn't change depending on which side of the table you put it on, does it?

Consider code like
a=fn1()
b=fn2(a)
c=fn3(a,b)
d=fn4(b,c)
return d

Now why should the compiler allocate space for four variables when c can easily be stored at a's location? a isn't needed any more. d doesn't even need to be stored anywhere: simply clean up the stack and jump to fn4 directly, further bypassing any sort of human-understanding-centered model (and causing much fun when debugging optimized code).

Constructing a case where a ends up in multiple locations instead of none whatsoever is left as an exercise to the reader.

Insulating layer?

Posted Oct 14, 2024 15:22 UTC (Mon) by Wol (subscriber, #4433) [Link]

> Just because there are no reason for it to be UB from your POV doesn't mean that there are not reason for it to be UB from someone's else POV. And, indeed, that infamous be || !be is very much UB in both Rust and C.

If bool is defined as a value that can ONLY contain 1 or 0, and the location of bool contains something else (for example it's not just a single bit-field), then as I pointed out you are assigning invalid garbage to a field that cannot contain it. Either you have a coercion rule that guarantees that "be" will be understood as a boolean whatever it contains, or yes it should be a compile error or unsafe or whatever.

But imho looking at that simple example, the problem boils down to interpreting random garbage as a boolean. If the rules of the language let you do that, then I would expect it to return true - after all, the expression "(garbage) or not (garbage)" must evaluate to true if "garbage" can be interpreted as a boolean.

It's all down to the language you're using, and whether that language lets you treat any random junk as a boolean. If it does, then those functions make mathematical sense and should not be UB. If the language DOESN'T let you, then it should fail to compile with an "illegal coercion" error.

And that has absolutely nothing to do with the hardware, and everything to do with mathematical logic. (Unless, of course, your DRAM behaves like a random number generator unless/until it's written to, such that (garbage) != (garbage) ...) And even there, I'd just say that you can guarantee the function will return a boolean, just not necessarily the boolean you expect ...
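A small sketch (mine, not from the thread) of how Rust actually resolves this: bool has a validity invariant of exactly the bit patterns 0 and 1, so "garbage interpreted as a boolean" simply isn't a state the language admits, and conjuring one is immediate UB rather than a value for which `be || !be` must hold.

```rust
use std::mem::{size_of, transmute};

fn main() {
    // Rust defines bool as one byte whose only valid bit patterns are 0 and 1.
    assert_eq!(size_of::<bool>(), 1);
    assert_eq!(false as u8, 0);
    assert_eq!(true as u8, 1);

    // Round-tripping a *valid* pattern is sound:
    let b: bool = unsafe { transmute::<u8, bool>(1u8) };
    assert!(b || !b); // trivially true for any valid bool

    // let bad: bool = unsafe { transmute::<u8, bool>(2u8) };
    // ^ immediate UB: 2 is not a valid bool, so the compiler may assume
    //   it never happens -- `bad || !bad` need not evaluate to true.
}
```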

Cheers,
Wol

Insulating layer?

Posted Oct 13, 2024 11:50 UTC (Sun) by khim (subscriber, #9252) [Link] (6 responses)

> Honestly with C/C++ a large part of the problem is the compilers themselves (and/or their developer's mindsets) and their willingness to bend over backwards to not define UB and intentionally not make it do the obvious(ly correct/natural) thing.

True, but I would put 10% of the blame on compiler developers and 90% on “we code to the hardware” people. And you show why extremely well.

> There's some ridiculous subtleties with structs/unions and padding and some compilers can quite literally go out of their way to set some bits to 1.

Which is not any different from what Rust is doing, isn't it?

> It's obvious/natural that 'foo x = {}' should mean zero initialize the whole damn thing including any and all padding, and then on top of that apply any non-standard constructors/initializations.

What is “obvious/natural” for one developer is not at all obvious/natural for another.

And when developers refuse to “play by the rules” and use the “who told you one couldn't carry the basketball and run – I tried it and I could do it just fine” approach, then safety remains a pipe dream.

Sure, the convoluted and strange rules of C and C++, with, literally, hundreds of UBs, don't help, but those rules could be changed. Except that wouldn't matter at all if people didn't accept them.

And that is why all attempts to make C or C++ safer would fail: you could change a language, but it's almost impossible to change developers' attitude.

The only known way is via Planck's principle: science progresses one funeral at a time… and it looks like language development happens in the exact same way. First we had Mel, who refused to adopt assemblers; then we had the debate about structured programming; today “old school” developers play Mel's tricks with C/C++, loudly complain about compilers, and refuse to adopt anything else… We can't predict whether Rust or something else will eventually replace C/C++, but we know it won't happen peacefully, by people adopting the successor willingly, and it won't happen because someone mandates that C/C++ be dropped… This leaves us with the “one funeral at a time” approach. That one is tried, tested, and works.

Insulating layer?

Posted Oct 13, 2024 12:29 UTC (Sun) by pizza (subscriber, #46) [Link] (5 responses)

> True, but I would put 10% of the blame on compiler developers and 90% on “we code to the hardware” people. And you show why extremely well.

....How dare hardware not perfectly conform to the spherical cow abstractions of higher level languages [1]

> And when developers refuse to “play by the rules” and use “who told you one couldn't carry the basketball and run – I tried it I could do that just fine” approach then safety remains a pipe dream.

Oh, you mean "if it's not explicitly disallowed then it by definition it must be allowed" attitude of nearly _every_ field of human endeavour? ("metrics" become the only thing that matter, because that's what you're judged/rewarded by..)

[1] "Higher-level" notably includes C _and_ Rust.

Insulating layer?

Posted Oct 13, 2024 14:12 UTC (Sun) by smurf (subscriber, #17840) [Link] (4 responses)

> Oh, you mean "if it's not explicitly disallowed then it by definition it must be allowed" attitude of nearly _every_ field of human endeavour?

Well, the workaround is easy[1]: you disallow everything, except under prescribed circumstances.

Like Rust's "you can't write *any* data (or free a struct, or …) unless you can prove that there's no concurrent access".

1) for some value of "easy", anyway.

Insulating layer?

Posted Oct 13, 2024 19:43 UTC (Sun) by Wol (subscriber, #4433) [Link] (3 responses)

> Well the workaround is easy[1], you disallow everything, except under proscribed circumstances.

Or you disallow UB inasmuch as you can! And don't create new UB!

If the standard doesn't define it, you have to impose a sensible default (e.g., on a 2's complement machine, you assume 2's complement) and allow a flag to change it, e.g. --assume-no-overflow.

At which point you get a logical machine that you can reason about without repeatedly getting bitten in the bum. And the compiler writers can optimise up the wazoo for the benchmarks, but they can't introduce nasty shocks and make them the default.

Cheers,
Wol

Insulating layer?

Posted Oct 13, 2024 22:33 UTC (Sun) by khim (subscriber, #9252) [Link] (2 responses)

> Or you disallow UB inasmuch as you can! And don't create new UB!

How would that work without buy-in on the compiler users side? The Rust handling of UB (which actually works fine so far) hinges not on one, but two pillars:

  1. Language developers try to reduce number of UBs as much as sensible
  2. Language users try to avoid triggering the remaining UBs as much as possible

But the C/C++ community is 100% convinced that doing something about UBs is the responsibility of “the other side”. Just read “What Every C Programmer Should Know About Undefined Behavior” (parts one, two, three) and “What every compiler writer should know about programmers”.

Developers demand that compiler developers ignore the language specifications and accept programs with “sensible code” as programs without UB and, as we saw, when asked what “sensible code” is, they immediately present something that Rust also declares unspecified. That's very symptomatic: that example shows, again (as if the articles above weren't enough), how both sides are entirely uninterested in working together toward a common goal in the C/C++ world.

Defining padding as containing zeros is extremely non-trivial, because many optimizations rely on the ability of the compiler to “separate” a struct (or union) into a set of fields and then “assemble” them back. And because predictable padding is very rarely needed, the decision in the Rust world was to make it unspecified. Note that this is a compromise: usually reading an uninitialized value is UB in Rust (to help the compiler), but reading padding is not UB. From the reference: Uninitialized memory is also implicitly invalid for any type that has a restricted set of valid values. In other words, the only cases in which reading uninitialized memory is permitted are inside unions and in “padding” (the gaps between the fields/elements of a type).
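A small illustration of where such padding lives, using a struct of my own invention (`Packet` is hypothetical): the layout rules force gap bytes whose contents the language deliberately leaves unspecified.

```rust
use std::mem::{align_of, size_of};

// Hypothetical struct with padding: `value` must be 4-byte aligned,
// so 3 padding bytes sit between `tag` and `value`.
#[repr(C)]
struct Packet {
    tag: u8,
    value: u32,
}

fn main() {
    assert_eq!(align_of::<Packet>(), 4);
    assert_eq!(size_of::<Packet>(), 8); // 1 (tag) + 3 (padding) + 4 (value)
    // Safe code cannot observe those 3 bytes at all; unsafe code that
    // reads them sees an unspecified value (though, per the reference,
    // not UB) -- a field-by-field copy is free to leave them behind.
}
```

That last point is exactly why "padding is zero" would be expensive to promise: the optimizer's field-by-field "separate and assemble" transformation would have to re-zero the gaps every time.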

This decision doesn't make any side 100% happy: compiler makers would like to make reading padding UB to simplify their job and compiler users would like to make it zero to simplify their job, but “unspecified but not UB” is good enough for both sides to, grudgingly, accept it.

Yet no such compromise can ever save a world where it's always the other side that has to do the job!

> At which point you get a logical machine that you can reason about without repeatedly getting bitten in the bum.

Nope. The logical machine that a high-level language uses to define the behavior of a program will always be different from “what the real hardware does”. That's what separates a high-level language from a low-level one, after all. You can make it easier or harder to understand, but someone who refuses to accept that the virtual machine used by the high-level language definition even exists will always find a way to get bitten in the bum.

Insulating layer?

Posted Oct 14, 2024 13:24 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

> How would that work without buy-in on the compiler users side? The Rust handling of UB (which actually works fine so far) hinges not on one, but two pillars:

Well, maybe, if we didn't have stupid rules like "don't use signed integer arithmetic because overflow is undefined", the user space developers might buy in. I'm far more sympathetic to the developers when they claim there's too much UB, than I am to compiler developers trying to introduce even more UB.

In that particular case, I believe we have -fwrapv (or whatever the option is) to tell the compiler what to do, but that should be the norm, not the exception, and it should be the default.

If the compiler devs change to saying "this is UB. This is what a user-space developer would expect, therefore we'll add a (default) option to do that, and an option that says 'take advantage of UB when compiling'", then I suspect the two sides would come together pretty quickly.

Compiler devs know all about UB because they take advantage of it. User devs don't realise it's there until they get bitten. The power is in the compiler guys' hands to make the first move.

And if it really is *undefinable* data (like reading from a write-only i/o port), then user space deserves all the brickbats they'll inevitably get ... :-)

Cheers,
Wol

Insulating layer?

Posted Oct 14, 2024 15:19 UTC (Mon) by khim (subscriber, #9252) [Link]

> If the compiler devs change to saying "this is UB. This is what a user-space developer would expect, therefore we'll add a (default) option to do that, and an option that says 'take advantage of UB when compiling'", then I suspect the two sides would come together pretty quickly.

This was tried, too. And, of course, what the people who tried it found was that every “we code for the hardware” developer has their own idea about what the compiler should do (mostly because they code for different hardware).

> The power is in the compiler guys hands to make the first move.

Why? They have already made the first move and created a specification that lists what is or isn't UB. If developers have concrete proposals, they can offer changes to it. And yes, some small percentage of the C/C++ community works with compiler developers. But most “we code for the hardware” guys don't even bother to look at and read it… so how would compiler developers know that changes to that list would be treated any better than what they have already given?

> And if it really is *undefinable* data (like reading from a write-only i/o port), then user space deserves all the brickbats they'll inevitably get ... :-)

They would probably find a way to complain even in that case…

> User devs don't realise it's there until they get bitten.

And that's precisely the problem: if a language has UBs and users of said language “only realise it's there when they get bitten”, then such a language can't be made safe. Ever.

Either developers have to accept and handle UBs proactively, or the language should have no UBs at all. The latter case is limiting because certain low-level things can't be expressed in the language without UBs; thus, for low-level stuff, a safe and reliable language can't be made unless developers think about UBs before they are bitten.

And the only known way to introduce such a radical non-technical change is to create a new community. And that's the end of the story: C/C++ can't be made safe not because there are some unsolvable technical issues, but because its community refuses to accept the fact that there are unsolvable technical issues (ones that have to be worked around on the social level).

Insulating layer?

Posted Oct 17, 2024 8:51 UTC (Thu) by peter-b (guest, #66996) [Link] (1 responses)

> For example, you could just have lexically-scoped borrows (the borrow is not allowed to escape a given lexical scope) plus a syntax rule that prohibits moving a borrowed object within the borrow's lexical scope. This is not fully general, and fails to express many things that Rust can handle easily, but when you combine it with C++'s existing smart pointers, it is probably enough for many practical applications (that don't need ultra-low overhead).
Please read this excellent recent WG21 paper which addresses your suggestions in detail: P3444 "Memory Safety without Lifetime Parameters".

Insulating layer?

Posted Oct 17, 2024 16:39 UTC (Thu) by kleptog (subscriber, #1183) [Link]

I think one of the conclusions in that paper is on the mark:

> It’s not surprising that the C++ community hasn’t discovered a better way to approach safe references than the lifetime parameter model. After all, there isn’t a well-funded effort to advance C++ language-level lifetime safety. But there is in the Rust community. Rust has made valuable improvements to its lifetime safety design. Lots of effort goes into making borrow checking more permissive: The integration of mid-level IR and non-lexical lifetimes in 2016 revitalized the toolchain. Polonius[polonius] approaches dataflow analysis from the opposite direction, hoping to shake loose more improvements. Ideas like view types[view-types] and the sentinel pattern[sentinel-pattern] are being investigated. But all this activity has not discovered a mechanism that’s superior to lifetime parameters for specifying constraints. If something had been discovered, it would be integrated into the Rust language and I’d be proposing to adopt that into C++. For now, lifetime parameters are the best solution that the world has to offer.

Basically, if there was a simpler way then likely Rust developers would have already found it and implemented it, because they're actually looking for ways to make things simpler. The C++ community is barely engaging with the topic. The likely end result seems fairly obvious to me.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds