Too complex for my small brain
Posted Nov 9, 2024 0:57 UTC (Sat) by khim (subscriber, #9252)
In reply to: Too complex for my small brain by NYKevin
Parent article: Safety in an unsafe world
> Which, again, is the whole point of the exercise.
The whole point of the exercise was automatic deadlock prevention. If a deadlock is prevented (as in: my code that could actually deadlock on the user's machine) – I'm happy; if a false positive is detected I'm annoyed. It's as simple as that.
Of course we all know that achieving zero false positives and zero false negatives at the same time is simply mathematically impossible, but that doesn't make false positives any less annoying.
As the advocacy of lockdep here shows, some people even prefer to accept a small probability of false negatives – mostly to avoid annoying and disruptive false positives.
If that couldn't lead to an actual bug in a program then that's a false positive. And false positives are a really big deal. They literally make or break tool adoption. TSAN was the first widely adopted data race detector precisely because it has so few false positives. Some very rare corner cases exist, sure, but, most of the time, if TSAN detects an error – you can be sure it's a real error. You spend time fixing real bugs instead of making tools happy. That's a good trade-off. If tools force you to change code because they report possible errors where none exist… you are spending time tilting at windmills… that's annoying and not productive. That's a bad trade-off.
Heck, it took a decade and a half for the Rust borrow checker to become intelligent enough not to make people run away screaming after interacting with it. Precisely because originally it had too many false positives.
If the compiler tells me my program has an issue, I want to be sure the compiler is right at least 90% of the time (99% is better, and past 99.9% I would probably stop caring); otherwise I would prefer not to even bother with such a useless set of checks.
> If you reorder things so that the argument may not be locked after Foo anymore, then the function no longer makes sense and needs to be fixed or rewritten so that it does not attempt to lock the argument after locking a Foo.
And who would ensure that? AFAICS the compiler wouldn't care. I'm not sure whether there is a clippy lint, and even if such a lint does exist it may very easily be wrong if I'm taking or not taking Foo depending on what kind of callback I'm accepting.
So we have found another annoying problem with that approach – one that doesn't even exist in the alternatives! Was it even discussed when traits were chosen as the way to detect potential deadlocks?
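For concreteness, here is a minimal sketch of the kind of trait-based scheme in question – the names are illustrative, not netstack3's actual API:

struct Foo; // one lock level
struct Bar; // a lock level that is allowed after Foo

// "Self may be locked while a lock of level L is held."
trait LockAfter<L> {}
impl LockAfter<Foo> for Bar {}

// The bound documents "callers must be entitled to lock Bar after Foo",
// but nothing ties it to the body: if a refactoring removes the inner
// locking of Bar, the now-superfluous bound silently lingers.
fn touch_bar<C: LockAfter<Foo>>(_proof: &C, bar: &std::sync::Mutex<i32>) -> i32 {
    *bar.lock().unwrap()
}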
> Type aliases and associated types can allow a significant degree of compile-time indirection that makes the overall strategy much less painful than what you seem to be describing.
Sigh. Sure, you can invent many tricks to help someone who is forced to participate in a potato sack race: some kinds of rollers, running poles and many other things… but why the heck is removal of the potato sack not even discussed?
All your suggestions to reduce the virality and the pain related to it… they all rely on an ordering of locks from "least important" to "most important" – and if you have such an ordering then the only objection to the simple, naïve, const-based correctness checker suddenly just disappears.
Can we, please, first discuss the tangible benefits that all that complexity provides (the reason to participate in the potato sack race instead of a regular race) and only then start talking about mitigation strategies?
Because, in my book, for two approaches to be worth comparing, they both need to have some benefits!
If one gives you flexibility but also a small percentage of false negatives (lockdep), and the other gives you zero false negatives but somewhat annoying syntax (C++ templates, or const-based deadlock prevention in Rust), then I can understand how one or the other may be picked.
But if one approach has strictly more negatives than the other then it doesn't even matter whether those negatives are large or small: they exist, ergo it is the worse approach, ergo it shouldn't be used.
From what you are saying, one of the positives that you perceive as "nice to have" is the fact that &impl LockAfter<Foo> acts, in addition to other things… also as documentation, not just as an automatic deadlock detector.
Fair point. But if that is the advantage over const-based deadlock prevention then… was that advantage ever measured or at least discussed?
We may have our locking scheme documentation spread over the hundreds of files that we have written, instead of keeping it in one place somewhat removed from the rest of the code… well… hmm… an interesting proposal. It would mean that it's easier to see how our locking scheme works, but harder to modify… and then we would use type aliases, which reduce the ease of discovering it but also reduce the pain of modifying it… hmm… is that a good trade-off or not?
A discussion structured in that way would at least make some sense… but I don't see anything like that being discussed in the article that we are talking about, I don't see it being discussed in the netstack3 documentation… I don't see it anywhere! It's as if no one even tried to think about whether using a complicated hierarchy of traits is better than just a simple enum with 77 items!
This is what baffles me: the road that more-or-less guarantees a lot of work and a lot of complexity was chosen seemingly on a whim, without ever discussing the alternative roads… and now only mitigation strategies are discussed, without ever explaining why that road was picked in the first place! Why? If something hurts us, shouldn't we first try to remove that something instead of discussing which painkillers work best?
There should be some reason for it! Do we know it or not? Do we have at least an idea about what the reason was?
Too complex for my small brain
Posted Nov 9, 2024 2:58 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
That's not quite true. You can achieve automatic verification with zero false negatives, although not in the general case.
Too complex for my small brain
Posted Nov 9, 2024 11:53 UTC (Sat) by khim (subscriber, #9252)
> That's not quite true. You can achieve automatic verification with zero false negatives, although not in the general case.
Not sure what you are talking about. It's trivial to achieve "zero false negatives": return true; guarantees that. It's also trivial to achieve "zero false positives": return false; guarantees that. You don't even need to know the question! Not a very useful implementation, but it works, doesn't it? Thus saying that it's not possible to guarantee zero false negatives would be stupid. What's not possible is to precisely separate the two: you have to have either false positives or false negatives or both. And since the ideal solution is not possible… we have many different solutions with different trade-offs.
Too complex for my small brain
Posted Nov 10, 2024 5:09 UTC (Sun) by mathstuf (subscriber, #69389)
Isn't Fuchsia a Google project? I'm sure there is non-public discussion around the project, so it may well have been discussed internally and this is the result. I mean, I'm all for Google being more transparent about things, but there's a *long* way to go before that happens.
Too complex for my small brain
Posted Nov 10, 2024 13:14 UTC (Sun) by khim (subscriber, #9252)
The problem with that reasoning is that they documented a great many things… except the most basic question of why this complicated scheme with traits is even needed when a much simpler one is obviously possible. It's as if someone decided to go from Eilat to Dubai around Africa, without any explanation, instead of using Suez. We don't need to know the minute details of those talks to know that the Houthis were mentioned and were deemed a real enough problem to justify that. Similarly here: sure, we don't have transcripts of all the talks Google had internally… but they had to have some reason to do what they did, didn't they? What could that be? If documentation was the primary reason then it would have been mentioned, somewhere, I assume.
Too complex for my small brain
Posted Nov 10, 2024 13:30 UTC (Sun) by mathstuf (subscriber, #69389)
Filing an issue with the Fuchsia team to improve documentation sounds a lot more productive than arguments with other non-Fuchsia participants here, no?
Too complex for my small brain
Posted Nov 10, 2024 14:18 UTC (Sun) by khim (subscriber, #9252)
> Filing an issue with the Fuchsia team to improve documentation sounds a lot more productive than arguments with other non-Fuchsia participants here, no?
Depends on your goals, ultimately. If I wanted to improve Fuchsia then, sure, that would have been the best way. But I'm not interested in Fuchsia. Multics, Taligent, Sailfish or FirefoxOS… funny OSes come and fail regularly. Fuchsia looks like one of these, so far. But they do include many interesting ideas – and it's the question of whether I'm missing any of them that I'm after.
In particular, I use various tricks that help me turn runtime errors into compile-time errors in C++ – and, naturally, I'm very interested in using them in Rust, too. Except that the lack of duck typing for generics makes them very hard to use. It matters because my goal here is to pick the best solution that would work for me, not for Fuchsia. Also to help others do the same, but that's a secondary goal.
That's why I'm not, really, interested in knowing why Fuchsia, as part of Google, made that choice. Even if there is some deeply internal reason to pick that approach… why would I care? To discuss "tradeoffs" we need two approaches with pluses and minuses. We don't have that yet. Traits don't even work all that well for the documentation purpose, because it's way too easy to leave annotations in place when they are no longer needed. And then they become a pure nuisance.
Could be. I, too, went the traits route when I wanted to implement a similar scheme – because I had done that many times in C++ and it works beautifully there. But in Rust… that was a disaster. The viral nature of trait declarations created an insane mess. I started looking for a solution… and found the const trick. It worked beautifully, made it possible to adopt C++17-style metaprogramming without turning my code into an "insane mess" (just into a "somewhat ugly mess") – and thus the idea that someone may just not know it never occurred to me in that discussion.
But I guess that's the best explanation that we have, so far.
Too complex for my small brain
Posted Nov 11, 2024 12:14 UTC (Mon) by daroc (editor, #160859)
Too complex for my small brain
Posted Nov 11, 2024 15:49 UTC (Mon) by khim (subscriber, #9252)
You mean the code in production? That's not "my" code, just code I helped to develop. And I would need to spend a lot of time recalling what dependencies we are tracking, how, and why (it wasn't about lock tracking, the rules there were different; that's why a simple pair of before and after traits was not enough).
I think it would be better to concentrate on the simplified example that started the whole discussion. I kinda assumed that it would be trivial to follow yet, apparently, it's not… thus I'll try to explain the ingredients that are used there and show how they can be combined to detect violations of lock ordering.
The ingredients:
1.1. Zero-sized types: PhantomData<&'a mut T> costs nothing at runtime, yet carries a type and a lifetime.
1.2. The borrow checker: as long as a mutable borrow of something (or anything carrying its lifetime) is alive, the original is inaccessible.
1.3. Const generics: a number can be attached to a type and used in compile-time computations.
Now combine them. Take a &'a mut T and convert it into PhantomData<&'a mut T>: this zero-sized type acts precisely like a real reference… except you couldn't use it to dereference anything (but we don't have anything interesting in the context value itself, the only interesting things are in the context type, thus we don't need to dereference anything) – in particular, as long as PhantomData<&'a mut T> exists the original context is not accessible (by using 1.2, obviously).
Define lock priorities in an enum and then simply use LockPriority::IPV6LOCK as usize (by using 1.3, obviously).
To compare priorities, write const { assert!(…) } or const { if … { panic!("…") } } and that's it. That's a bit strange because it looks as if it's a runtime check, but no, it's a compile-time check. So… you have two constants, you compare them, and you know whether the lock is OK to take or not (ask me if you need to backport this to older versions of Rust; there you needed a somewhat more complicated dance, but it's also doable).
Note that you have access to all the types in the chain here, even if indirectly, and could implement an arbitrarily complex verification scheme… it's only hindered by the fact that const functions couldn't be part of a trait, so you would probably want to put your machinery in a separate module to not fill the namespace with helper functions.
And that's… essentially it. We are done. Just need some syntax sugar to make it prettier. I guess you would want to add something like a LockPriority trait:
pub trait LockPriority : sealed::LockPriority {
    const PRIORITY: usize;
}
impl<'a, T: ?Sized, const P: usize> LockPriority
    for TakenLockPriority<'a, T, P> {
    const PRIORITY: usize = P;
}
mod sealed {
    pub trait LockPriority {}
    impl<'a, T: ?Sized, const P: usize> LockPriority
        for super::TakenLockPriority<'a, T, P> {}
}
Would also need some macro, because you would be writing things like
let (mut ctx, guard) = mutex.lock(UsePriority(&mut ctx));
pretty often. Or maybe not, maybe it's better to keep that explicit. Attempts to use the wrong context would be detected and stopped by the compiler, anyway.
P.S. The whole thing relies on the fact that correctness checking for the const expression – and thus detection of errors – happens not when you define some generic function, but when it's instantiated. And it's obvious why it has to happen like this: correctness checking of something like const { assert!(…); } couldn't happen at definition time, because what goes into assert!(…); may depend on constants that are passed as arguments. And doing that at runtime wouldn't make any sense: that would mean your constant value would, suddenly, depend on something at runtime…
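To make the scheme concrete, here is a minimal, self-contained sketch along the lines khim describes – the names (PrioMutex, LockContext, unlocked) are invented for illustration, and it assumes Rust 1.79+ for const blocks that mention generic parameters:

use std::marker::PhantomData;
use std::sync::{Mutex, MutexGuard};

// Zero-sized token recording the priority of the most recently taken lock.
// Its lifetime keeps the previous token mutably borrowed, so only the
// newest token is usable at any given moment.
pub struct LockContext<'a, const PRIORITY: usize>(PhantomData<&'a mut ()>);

// The starting token: nothing is locked yet.
pub fn unlocked() -> LockContext<'static, 0> {
    LockContext(PhantomData)
}

// A mutex annotated with a compile-time priority.
pub struct PrioMutex<T, const PRIORITY: usize> {
    inner: Mutex<T>,
}

impl<T, const PRIORITY: usize> PrioMutex<T, PRIORITY> {
    pub fn new(value: T) -> Self {
        Self { inner: Mutex::new(value) }
    }

    // Locking consumes a mutable borrow of the current context. The const
    // block is evaluated at monomorphization time, so an ordering violation
    // is a compile-time error, not a runtime panic.
    pub fn lock<'a, const HELD: usize>(
        &'a self,
        _ctx: &'a mut LockContext<'_, HELD>,
    ) -> (LockContext<'a, PRIORITY>, MutexGuard<'a, T>) {
        const { assert!(HELD < PRIORITY, "locks must be taken in priority order") };
        (LockContext(PhantomData), self.inner.lock().unwrap())
    }
}

fn main() {
    let a = PrioMutex::<i32, 1>::new(1);
    let b = PrioMutex::<i32, 2>::new(2);

    let mut ctx = unlocked();
    let (mut ctx_a, _guard_a) = a.lock(&mut ctx);
    let (_ctx_b, _guard_b) = b.lock(&mut ctx_a); // OK: 1 < 2
    // let _ = a.lock(&mut _ctx_b);              // rejected: 2 < 1 fails the const assert
}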
Too complex for my small brain
Posted Nov 11, 2024 16:18 UTC (Mon) by daroc (editor, #160859)
It seems like the big difference between the two schemes is really just about the use of a partial order [1] versus a total order. Fuchsia's scheme also involves carrying around a zero-sized context that gets compared to see whether one element is less than another element at compile time — the difference is that the comparison has to be encoded into traits, because Rust doesn't support using custom types as const generics (yet).
I personally think that there are reasons to prefer a partial order, but that a scheme using const generics to pass around a type with a custom PartialOrd instance would probably be the best of both worlds.
[1]: https://en.wikipedia.org/wiki/Partially_ordered_set#Parti...
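For reference, using a custom type as a const generic parameter is, at the time of this discussion, only possible on nightly; a rough sketch of what daroc describes, assuming the unstable adt_const_params feature:

#![feature(adt_const_params)] // nightly-only as of late 2024
use std::marker::ConstParamTy;

// A lock level with structural equality, usable as a const parameter.
#[derive(ConstParamTy, PartialEq, Eq, Clone, Copy)]
enum Level { Net, Storage, Fs }

// A token parameterized by a custom type instead of a bare usize. The
// comparison itself would still need a hand-written const fn, since
// trait methods such as PartialOrd::lt cannot be called in const context.
struct LockToken<const L: Level>;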
Too complex for my small brain
Posted Nov 11, 2024 17:24 UTC (Mon) by khim (subscriber, #9252)
> It seems like the big difference between the two schemes is really just about the use of a partial order versus a total order.
That's actually a pretty minor issue. As I have said: I developed this scheme to validate a more complicated situation, where neither total nor partial order works. I just used a simple number for illustration purposes. And yes, we had a function that accepted two enum numbers and then answered whether they are compatible. A pretty obvious modification: const functions can do arbitrary calculations. If you are worried that locks would be accepted in some random arbitrary order while you want a compile-time error (and an explicit decision by the developer, who would declare whether storage or network has the higher priority, instead), then this can be done via a simple proc-macro which would generate an appropriate table. netstack3 is defining its locks in one place with a pile of macros, anyway.
The main difference is when and how it's checked, not how it's encoded. With netstack3, a function declares what it accepts in its signature… and that's it. But how would it know? Think about a function that stores something… if it sends it to USB storage there would be one set of locks, if it saves it on an SSD there would be different locks, and if that file is on an NFS server then you hit yet another situation. And you need to specify one, single markup that would cover everything. Sure, instead of a simple callback you may accept another trait which embeds information about what locks that callback would need… but what if you have two callbacks? What if they could be called in a different order? Instead of one, simple rule you start developing elaborate schemes that may describe what you are really trying to achieve… but you are doing it in the pretty limited world of traits and generics… which is total: if a generic function accepts some type of callback – then it always accepts that type of callback! And if the calling function couldn't prove, at declaration time, that it has the right to call that callback… oh, well… now we need another trait… and if we use two callbacks… we need to decide (and declare) the order that we plan to use in the function declaration, too! And if the order can be different… well, that's another indirection! Traits multiply and fill your code with something that it doesn't really need, that's only there to help with lock order checking! As I have said: this leads to an insane mess… and insane compilation times.
Now, constants are validated after instantiation. That means you simply don't even need to care about combinations that don't happen in real code. And if you use a function with a callback… the callback determines the validity of your function automatically. You don't need to create an additional trait, put the callback and lock in there and pass that around. Any generic function or generic callback already carries the required data; you don't need to duplicate it in separate, manually written traits!
Sadly that's not possible in Rust, because traits couldn't carry const functions, and PartialOrd couldn't be used in a const context either. You may create a helper function, though. If your locks are defined in one place then it's trivial, and if they are not defined in one place then the RED/BLACK scheme can be used (it's more-or-less inevitable with the netstack3 approach, too, as was already discussed).
P.S. One mode that's trivial with const-checking and which is, essentially, impossible to implement (without additional const-coding, heh) with the netstack3 approach: "just make me happy". I have 2 (3, 4, 5, N) locks… how can I lock them all without causing a deadlock? How would you implement that? That's a powerful capability, but it's not compatible with the "just leave this machinery out" compile-speedup enhancement (where you just map everything to () and the compiler accepts everything until you compile with the checking-enabled flag)…
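A sketch of that "two enum numbers" helper – invented level names, showing how a partial order can live in a const fn even though PartialOrd itself is unusable in const context:

// Lock levels; 0 is reserved for "nothing held yet".
const UNLOCKED: usize = 0;
const NET: usize = 1;
const STORAGE: usize = 2;
const FS: usize = 3;

// A hand-written partial order: NET and STORAGE are deliberately left
// unordered relative to each other (neither may nest inside the other),
// while FS may nest inside either. Everything else is rejected.
const fn may_lock(held: usize, next: usize) -> bool {
    matches!(
        (held, next),
        (UNLOCKED, NET) | (UNLOCKED, STORAGE) | (UNLOCKED, FS)
            | (NET, FS) | (STORAGE, FS)
    )
}

// Inside a lock method, this would replace the total-order comparison:
//     const { assert!(may_lock(HELD, PRIORITY)) };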
Too complex for my small brain
Posted Nov 11, 2024 18:18 UTC (Mon) by daroc (editor, #160859)
As for why the Fuchsia team went with a trait-based approach — your comment about when panics in const blocks stabilized raises another theory: netstack3 was started in 2019. I'm not sure exactly when the LockBefore solution was added, but it could have been before the approach you described became possible.
Too complex for my small brain
Posted Nov 11, 2024 19:21 UTC (Mon) by khim (subscriber, #9252)
> I'm not sure exactly when the LockBefore solution was added, but it could have been before the approach you described became possible.
I think that was possible even in 2019, but it sure looked awfully ugly back then. Panics in const were always possible (just divide by zero and you get a panic at compile time) and, indeed, in Rust 1.51 everything already works. But without const generics all these checks would have to be based on One and Zero types and associated consts in traits, and that looks pretty awful. When Rust 1.57 stabilized panics it became easier, and with Rust 1.79 the strange dance with an extra struct and a const implementation for it is no longer needed.
I don't remember where I discovered the use of const for invariant checking (I think it was on URLO, but it could be reddit, too), but I remember that the current version of Rust was 1.5x, which meant that I didn't need to do the crazy dance with types to encode numbers and could just simply use const generics, but still needed the "divide by zero" trick. The discussion, which I vaguely recall, about the fact that panic in const would be stabilized "soon" means that was the middle of 2021.
Now I suspect that without const generics, with the need to encode numbers in One and Zero types and then use a pile of associated constants for calculations, it was just too ugly to really contemplate, and one needed a leap of faith to believe that it all would eventually become much simpler and more usable, to bet an implementation on that.
But for me, since I knew about that approach essentially from my first attempt to help some of my friends write their first large Rust program… it just feels much more natural than trait-based checks. Closer to how one would do it in C++17 and thus more… "traditional", I would say. I guess it never occurred to me that it may be a very obscure technique that is not known to everyone.
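For the curious, the pre-1.57 "divide by zero" trick khim mentions looks roughly like this (a reconstruction, for Rust 1.51+, where const generics exist but panicking in const does not):

// A guard type whose associated constant only evaluates cleanly when the
// lock order is respected; otherwise evaluation divides by zero, which is
// a compile-time error.
struct Assert<const HELD: usize, const NEXT: usize>;

impl<const HELD: usize, const NEXT: usize> Assert<HELD, NEXT> {
    const ORDER_OK: usize = 1 / ((HELD < NEXT) as usize);
}

fn check_order<const HELD: usize, const NEXT: usize>() {
    // Mentioning the constant forces its evaluation at monomorphization;
    // check_order::<1, 2>() compiles, check_order::<2, 1>() does not.
    let _ = Assert::<HELD, NEXT>::ORDER_OK;
}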