
Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 2:21 UTC (Thu) by roc (subscriber, #30627)
In reply to: Zig heading toward a self-hosting compiler by khim
Parent article: Zig heading toward a self-hosting compiler

It is insanely difficult to handle out-of-memory conditions reliably in a complex application. First you have to figure out what to do in every situation where an allocation fails. Usually there is nothing reasonable you *can* do other than give up on some request. Then you have to implement the failure handling --- make sure that your error handling (including RAII cleanup!) doesn't itself try to allocate memory! Then you have to test all that code, which requires the ability to inject failures at every allocation point. Then you have to run those tests continuously because otherwise things will certainly regress.
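
(To make that last point concrete, here is a minimal sketch of the kind of failure-injection harness this implies; the allocator and the driver loop are hypothetical, not taken from any real project.)

    #include <cstddef>
    #include <cstdlib>

    // Hypothetical counting allocator: fails the Nth allocation, succeeds otherwise.
    static int g_alloc_count = 0;
    static int g_fail_at = -1;            // -1 means "never fail"

    void *test_malloc(std::size_t size) {
        if (g_fail_at >= 0 && g_alloc_count++ == g_fail_at)
            return nullptr;               // simulated OOM at this allocation point
        return std::malloc(size);
    }

    // Re-run one operation, failing allocation 0, then 1, then 2, ... in turn.
    // Each run must either succeed or fail cleanly; a crash or a leak is a bug
    // in the operation's OOM handling.
    bool check_oom_paths(bool (*operation)()) {
        for (int n = 0; ; ++n) {
            g_alloc_count = 0;
            g_fail_at = n;
            bool ok = operation();
            if (g_alloc_count <= n) {     // injection point n was never reached:
                g_fail_at = -1;           // every allocation has now been failed once
                return ok;
            }
            // allocation n was failed; try the next allocation point
        }
    }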

For almost every application, it simply isn't worth handling individual allocation failures. It *does* make sense to handle allocation failure as part of large-granularity failure recovery, e.g. by isolating large chunks of your application in separate processes and restarting them when they die. That works just fine with "fatal" OOM handling.
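
(A sketch of what that looks like at its crudest, assuming a POSIX system and a hypothetical run_worker() doing the real work: the supervisor treats any death of the worker, fatal OOM included, the same way.)

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    static void run_worker() {
        // Placeholder: in a real application this would serve requests until
        // it finishes, crashes, or aborts on OOM.
    }

    int main() {
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {               // child: do the real work in isolation
                run_worker();
                _exit(0);
            }
            if (pid < 0) {                // couldn't even fork: back off and retry
                sleep(1);
                continue;
            }
            int status = 0;
            waitpid(pid, &status, 0);     // wait for the worker to die or finish...
            std::fprintf(stderr, "worker exited (status %d), restarting\n", status);
            // ...and just start a fresh one; a fatal OOM costs one worker,
            // not the whole application.
        }
    }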

In theory, C and C++ support handling of individual allocation failures. In practice, it's very hard to find any C or C++ application that reliably does so. The vast majority don't even try and most of the rest pretend to try but actually crash in any OOM situation because OOM recovery is not adequately tested.

Adding OOM errors to every library API just in case one of those unicorn applications wants to use the library adds API complexity just where you don't want it. In particular, a lot of API calls that normally can't fail now have a failure case that needs to be handled/propagated.

Therefore, Rust made the right call here, and Zig --- although it has some really cool ideas --- made the wrong call.



Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 8:54 UTC (Thu) by smcv (subscriber, #53363) [Link]

> In theory, C and C++ support handling of individual allocation failures. In practice, it's very hard to find any C or C++ application that reliably does so. The vast majority don't even try and most of the rest pretend to try but actually crash in any OOM situation because OOM recovery is not adequately tested.

dbus, the reference implementation of D-Bus, is perhaps a good example: it's meant to handle individual allocation failures, and has been since 2003, with test infrastructure to verify that it does (which makes the test suite annoyingly slow to run, and makes tests awkward to write, because every "assert success" in a test that exercises OOM turns into "if OOM occurred, end test successfully, else assert success"). Despite all that, we're *still* occasionally finding and fixing places where OOM isn't handled correctly.
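
(For readers who haven't seen that idiom, a generic sketch of it, not dbus's actual test macros: under injected OOM a step is allowed to fail, but only with an out-of-memory error.)

    #include <cstdio>
    #include <cstdlib>

    enum class Result { Ok, OutOfMemory, OtherError };

    // Hypothetical "assert success, unless the injected OOM hit" helper: an
    // out-of-memory failure ends the test early and counts as a pass, any
    // other failure is a real bug.
    #define ASSERT_OK_OR_OOM(expr)                              \
        do {                                                    \
            Result r_ = (expr);                                 \
            if (r_ == Result::OutOfMemory)                      \
                return;                                         \
            if (r_ != Result::Ok) {                             \
                std::fprintf(stderr, "FAIL: %s\n", #expr);      \
                std::abort();                                   \
            }                                                   \
        } while (0)

    static Result open_connection() { return Result::Ok; }  // stub under test

    void test_open_connection() {
        ASSERT_OK_OR_OOM(open_connection());
        // ...further steps, each of which may legitimately hit the injected OOM...
    }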

The original author's article on this from 2008 <https://blog.ometer.com/2008/02/04/out-of-memory-handling...> makes interesting reading, particularly these:

> I wrote a lot of the code thinking OOM was handled, then later I added testing of most OOM codepaths (with a hack to fail each malloc, running the code over and over). I would guess that when I first added the tests, at least 5% of mallocs were handled in a buggy way

> When adding the tests, I had to change the API in several cases in order to fix the bugs. For example adding dbus_connection_send_preallocated() or DBUS_DISPATCH_NEED_MEMORY.

Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 15:22 UTC (Thu) by khim (subscriber, #9252) [Link] (12 responses)

It's true that OOM handling can't easily be retrofitted onto an existing codebase. I'm not so sure it's as hard as you describe if you design everything from scratch.

It's like exception safety: it's insanely hard to rework an existing codebase to make it exception-safe (the Google Style Guide even expressly forbids exceptions for that reason). Yet if you use certain idioms and libraries, it becomes manageable.

If you want or need to handle the OOM case, the situation is similar: you change your code structure to accommodate it… and suddenly it becomes much less troublesome to deal with.

I'm not sure Zig will manage to pull it off… but I wouldn't dismiss it for trying to solve that issue: lots of the OOM-handling problems in existing applications and libraries come from the fact that their APIs were designed for the usual “memory is infinite” world… and OOM handling was then bolted on top… that doesn't work.

But go and look at old MS-DOS apps, which had to deal with limited memory. They handle it just fine, and it's not hard to get them to show a “couldn't allocate memory” message without crashing. Please don't say that people were different back then and could do that, while today we have lost that art. That's just not true.

Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 22:28 UTC (Thu) by roc (subscriber, #30627) [Link] (11 responses)

C and C++ and their standard libraries *were* designed from scratch to allow handling of individual allocation failures. Lots of people built libraries and applications on top of them that they thought would handle allocation failures. That didn't work out.

MS-DOS apps were a lot simpler than what we have today and often did misbehave when you ran out of memory. Those that did not misbehave often just allocated a fixed amount of memory at startup and were simple enough that they could ensure they worked within that limit, without handling individual allocation failures. For example, if you look up 'New' in the Turbo Pascal manual (chapter 15), you can see it doesn't even *mention* New returning OOM or how to handle it. The best you can do is call MaxAvail before every allocation, which I don't recall anyone doing. http://bitsavers.trailing-edge.com/pdf/borland/turbo_pasc...

Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 23:27 UTC (Thu) by khim (subscriber, #9252) [Link] (10 responses)

It's funny that you have picked Turbo Pascal 3.0 — the last version without proper care for the out-of-memory case. Even then it had the $K+ option, which was enabled by default and would generate a runtime error if memory was exhausted.

If you go to the very site you linked and look at the manual for Turbo Pascal 4.0, you'll find the HeapError error-handling routine there. The Turbo Vision manual even has a whole chapter 6 named “Writing safe programs”, complete with a “safety pool”, a “LowMemory” condition and so on. It worked.

Turbo Pascal itself used it, and many other programs did, too. I'm not quite sure when the notion of “safe programming” was abandoned, but I suspect it was when Windows arrived: partly because Windows itself handles OOM conditions poorly (why bother making your program robust if the whole OS will come crashing down on you when you run out of memory?) and partly because it brought to the PC many new programmers who were happy to ship programs that worked only sometimes and didn't care about making them robust.

Ultimately there's nothing mystical about writing such programs. Sure, you need tests. Sure, you need a proper API. But hey, it's not as if you can handle other kinds of failures properly without tests and it's not as if you don't need to think about your API if you want to satisfy other kinds of requirements.

It's kind of a pity that Unix, with its fork/exec model, basically pushed us down the road of not caring about OOM errors. It's really elegant… yet really flawed. Once you go down that road, the only way to use all the available memory efficiently is overcommit, and once you have overcommit, malloc stops returning NULL and your process can be killed at a random time… you can no longer write reliable programs, so people stop writing reliable libraries, too.

Your only hope at that point is something like what smartphones and routers are doing: split your hardware into two parts, put the “reliable” piece on one and the “fail-happy” piece on the other, and let people live with having to do a hard reset at times.

But is that a good way to go for ubiquitous computing, where a failure and a watchdog-induced reset may literally be a matter of life and death? Maybe this two-part approach will scale, maybe not. IDK. Time will tell.

Zig heading toward a self-hosting compiler

Posted Oct 9, 2020 3:08 UTC (Fri) by roc (subscriber, #30627) [Link] (1 responses)

Thanks for the TP4/Vision references. The vogue for OOM-safe application programming in DOS, to the extent it happened, must have been quite brief.

> But hey, it's not as if you can handle other kinds of failures properly without tests and it's not as if you don't need to think about your API if you want to satisfy other kinds of requirements.

It sounds like you're arguing "You have to have *some* tests and *some* API complexity so why not just make those a lot more work".

Zig heading toward a self-hosting compiler

Posted Oct 9, 2020 7:04 UTC (Fri) by khim (subscriber, #9252) [Link]

> It sounds like you're arguing "You have to have *some* tests and *some* API complexity so why not just make those a lot more work".

No. It's “a lot more work” only if you don't think about it upfront. It's funny that this link is used as an example of how hard it is to handle OOM, because it really shows how easy it is to do: 5% of mallocs being handled in a buggy way means that 95% of them were handled correctly on the first try. That's a success rate much higher than for most other design decisions.

Handling OOM conditions is not hard, really. It's only hard if you already have finished code designed for the “memory is infinite” world and want to retrofit OOM handling into it; then it's really hard. The situation is very analogous to thread safety, exception safety and many other such things: design primitives which handle 95% of the work for you, and write tests to cover the remaining 5%.
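
(As an illustration of what such a primitive might look like, a hypothetical C++ container whose growth is explicitly fallible, so every caller sees the OOM case and usually just propagates it; this is a sketch, not a complete container.)

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Hypothetical fallible vector: push() reports allocation failure instead
    // of aborting or throwing, so the OOM path is part of the normal control
    // flow of every caller.
    template <typename T>
    class FallibleVec {
        T *data_ = nullptr;
        std::size_t size_ = 0, cap_ = 0;
    public:
        FallibleVec() = default;
        FallibleVec(const FallibleVec &) = delete;
        FallibleVec &operator=(const FallibleVec &) = delete;
        ~FallibleVec() {
            for (std::size_t i = 0; i < size_; ++i) data_[i].~T();
            std::free(data_);
        }
        [[nodiscard]] bool push(const T &value) {
            if (size_ == cap_) {
                std::size_t new_cap = cap_ ? cap_ * 2 : 8;
                T *p = static_cast<T *>(std::malloc(new_cap * sizeof(T)));
                if (!p) return false;          // OOM: tell the caller, nothing is lost
                for (std::size_t i = 0; i < size_; ++i) {
                    new (&p[i]) T(data_[i]);   // copy existing elements over
                    data_[i].~T();
                }
                std::free(data_);
                data_ = p;
                cap_ = new_cap;
            }
            new (&data_[size_]) T(value);
            ++size_;
            return true;
        }
        std::size_t size() const { return size_; }
        T &operator[](std::size_t i) { return data_[i]; }
    };

A caller then writes something like "if (!vec.push(x)) return out-of-memory" and ordinary error propagation does the rest.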

Zig heading toward a self-hosting compiler

Posted Oct 10, 2020 15:16 UTC (Sat) by dvdeug (guest, #10998) [Link] (7 responses)

I suspect it was when Windows arrived, because that's when the first serious multiprocess computing happened on the PC and when the first complex OS interactions started happening. If I have several applications open when I hit the memory limit, what fails will be more or less random; it might be Photoshop, or the music player, or some random background program. It might also be some lower-level OS code that has little option but to invoke the OOM killer or crash the system. It's quite possible you can't open a dialog box to tell the user of the problem without memory, nor save anything. And your multithreaded code (pretty much all GUI programs should run their interface on a separate thread) may be hitting this problem on multiple threads at once. What was once one program running on an OS simple enough to avoid memory allocation is now a complex collection of individually more complicated programs on a complex OS.

Zig heading toward a self-hosting compiler

Posted Oct 10, 2020 22:50 UTC (Sat) by khim (subscriber, #9252) [Link] (2 responses)

>It's quite possible you can't open a dialog box to tell the user of the problem without memory,

MacOS classic solved that by setting aside some memory for that dialog box.
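
(The same trick works anywhere; a hypothetical sketch: set aside a reserve block at startup and release it only when it's time to report the failure.)

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical emergency reserve, set aside at startup so that the
    // "out of memory" report itself never needs a fresh allocation.
    static void *g_reserve = nullptr;

    void init_oom_reserve() {
        g_reserve = std::malloc(64 * 1024);   // size is arbitrary for the sketch
    }

    void report_oom() {
        std::free(g_reserve);                 // give the UI/logging code room to run
        g_reserve = nullptr;
        std::fputs("couldn't allocate memory\n", stderr);
        // a real application would show a dialog and try to save state here
    }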

>nor save anything.

Again: not a problem on classic MacOS, since there an application requests its memory upfront and then has to live within it. Other apps couldn't “steal” it.

>I suspect it was when Windows arrived

And made it impossible to reliably handle OOM, yes. Most likely.

>What was once one program running on an OS simple enough to avoid memory allocation is now a complex collection of individually more complicated programs on a complex OS.

More complex than a typical z/OS installation, which handles OOM just fine?

I don't think so.

No, I think you are right: when Windows (the original one, not Windows NT 3.1, which handles OOM properly, too) and Unix (because of its fork/exec model) made it impossible to reliably handle OOM conditions — people stopped caring.

SMP or general complexity had nothing to do with it. Just general Rise of Worse is Better.

As I've said: it's not impossible to handle, and not even especially hard… but in a world where people have been trained to accept that programs may fail randomly for no apparent reason, such care is just seen as entirely unnecessary.

Zig heading toward a self-hosting compiler

Posted Oct 11, 2020 4:25 UTC (Sun) by dvdeug (guest, #10998) [Link] (1 responses)

> Again: not a problem on classic MacOS, since there an application requests its memory upfront and then has to live within it.

You could do that anywhere. Go ahead and allocate all the memory you need upfront.

> More complex than a typical z/OS installation, which handles OOM just fine?

If it does, it's because it keeps things in nice neat boxes and runs on a closed set of IBM hardware, in a way that a desktop OS can't and doesn't. A kindergarten class at recess is more complex in some ways than a thousand military men marching in formation, because you never know when a kindergartner is going to punch another one or make a break for freedom.

> SMP or general complexity had nothing to do with it.

That's silly. If you're writing a game for a Nintendo or a Commodore 64, you know how much memory you have and you will be the only program running. MS-DOS was slightly more complicated, with TSRs, but not a whole lot. Things nowadays are complex; a message box calls into a windowing system and needs fonts loaded into memory and text shapers loaded; your original MacOS didn't handle Arabic or Hindi or anything beyond 8-bit charsets. Modern systems have any number of processes popping up and going away, and even if you're, say, a word processor, that web browser or PDF reader may be as important as you. Memory amounts will vary all over the place and memory usage will vary all over the place, and checking a function telling you how much memory you have left won't tell you anything particularly useful about what's going to be happening sixty seconds from now. What was once a tractable problem of telling how much memory is available is now completely unpredictable.

> Just general Rise of Worse is Better.

To quote that essay: "However, I believe that worse-is-better, even in its strawman form, has better survival characteristics than the-right-thing, and that the New Jersey approach when used for software is a better approach than the MIT approach." The simple fact is you're adding a lot of complexity to your system; there's a reason why so much code is written in memory-managed languages like Python, Go, Java, C# and friends. You're spending a lot of programmer time to solve a problem that rarely comes up and that you can't do much about when it does. (If it might be important, regularly autosave a recovery file; OOM is not the only or even most frequent reason your program or the system as a whole might die.)

> in a world where people have been trained to accept that programs may fail randomly for no apparent reason

How, exactly, is issuing a message box saying "ERROR: Computer jargon" going to help with that? Because that's all most people are going to read. There is no way to avoid the fact that failing to open a new tab or a file because the program is out of memory will still be seen as "failing randomly for no apparent reason" by most people.

I fully believe you could do better, but it's like BeOS: it was a great OS, but when it became widely available in 1998, given the choice between Windows 98 and an OS without a browser that could deal with the Web as it was in 1998, people went with Windows 98. Worse-is-better in a nutshell.

Zig heading toward a self-hosting compiler

Posted Oct 11, 2020 19:49 UTC (Sun) by Wol (subscriber, #4433) [Link]

> "However, I believe that worse-is-better, even in its strawman form, has better survival characteristics than the-right-thing, and that the New Jersey approach when used for software is a better approach than the MIT approach."

Like another saying - "the wrong decision is better than no decision". Just making a decision NOW can be very important - if you don't pick a direction to run - any direction - when a bear is after you then you very quickly won't need to make any further decisions!

Cheers,
Wol

16-bit Windows applications tried to deal with OOM

Posted Oct 11, 2020 12:38 UTC (Sun) by quboid (subscriber, #54017) [Link] (3 responses)

Perhaps it was when 32-bit Windows arrived that applications stopped caring about running out of memory.

The 16-bit Windows SDK had a tool called STRESS.EXE which, among other things, could cause memory allocation failures in order to check that your program coped with them correctly.

16-bit Windows required large memory allocations (GlobalAlloc) to be locked when being used and unlocked when not so that Windows could move the memory around without an MMU. It was even possible to specify that allocated memory was discardable and you didn't know whether you'd still have the memory when you tried to lock it to use it again - this was great for caches and is a feature I wish my web browser had today. :-)
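
(Roughly the pattern, written from memory rather than from the SDK docs, so treat it as Win16-era pseudocode; the same calls still compile on Win32, but there GMEM_DISCARDABLE is ignored.)

    #include <windows.h>

    static HGLOBAL g_cache = NULL;

    void use_cache(void) {
        if (g_cache == NULL)
            g_cache = GlobalAlloc(GMEM_MOVEABLE | GMEM_DISCARDABLE, 32 * 1024);

        LPVOID p = GlobalLock(g_cache);   // pin the block while we use it
        if (p == NULL) {
            // The block was discarded (or never allocated): recreate and refill it.
            g_cache = GlobalReAlloc(g_cache, 32 * 1024, GMEM_MOVEABLE | GMEM_DISCARDABLE);
            p = GlobalLock(g_cache);
            if (p == NULL)
                return;                   // genuinely out of memory: degrade gracefully
            /* ... regenerate the cached data into p ... */
        }
        /* ... use the cached data at p ... */
        GlobalUnlock(g_cache);            // let Windows move or discard it again
    }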

Mike.

16-bit Windows applications tried to deal with OOM

Posted Oct 11, 2020 21:14 UTC (Sun) by dtlin (subscriber, #36537) [Link]

Android has discardable memory - ashmem can be unpinned, and the system may purge it under memory pressure. I think you can simulate this with madvise(MADV_FREE), but ashmem will tell you whether it was purged and MADV_FREE won't (the pages will just be silently zeroed).
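
(A sketch of the madvise() variant, assuming Linux 4.5 or later for MADV_FREE: the kernel may take the pages back at any time, and the only way to notice is that the data is gone.)

    #include <sys/mman.h>
    #include <cstddef>

    void *make_discardable_cache(std::size_t len) {
        void *cache = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (cache == MAP_FAILED)
            return nullptr;
        // Fill the cache here, then mark the contents expendable.  Under
        // memory pressure the kernel may drop the pages at any time; the next
        // read sees zeroes, with no notification (unlike ashmem's pin/unpin,
        // which reports whether a purge happened).
        madvise(cache, len, MADV_FREE);
        return cache;
    }

    // To reuse the cache safely, the application has to detect the zeroed
    // pages itself, e.g. with a checksum or a "valid" marker it re-checks.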

16-bit Windows applications tried to deal with OOM

Posted Oct 11, 2020 22:28 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

You should be glad browsers don't have that today. If they did, people would use it, and browsers on developer machines would rarely discard memory, so when your machine did discard memory, applications would break.

16-bit Windows applications tried to deal with OOM

Posted Oct 15, 2020 16:19 UTC (Thu) by lysse (guest, #3190) [Link]

Better they break by themselves than freeze up the entire system while it tries to page every single executable VM page through a single 4K page of physical RAM, because the rest of it has been overcommitted to memory that just got written to.

Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 19:23 UTC (Thu) by excors (subscriber, #95769) [Link] (1 responses)

For embedded-style development, there is a third option beyond handling every individual allocation failure or restarting the whole application/OS on any allocation failure: don't do dynamic allocation. There are simple things like replacing std::vector<T> with std::array<T, UPPER_BOUND> if you can work out the bounds statically. Or whenever an API function allocates an object, change it to just initialise an object that has been allocated by the caller. That caller can get the memory from its own caller (recursively), or from a static global variable, or from its stack (which is sort of dynamic but it's not too hard to be confident you won't run out of stack space), or from a memory pool when you can statically determine the maximum number of objects needed, or in rare cases it can allocate dynamically from a shared heap and be very careful about OOM handling.
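
(A small sketch of that caller-provides-the-storage style; the names here are made up for illustration.)

    #include <array>
    #include <cstddef>

    constexpr std::size_t MAX_SENSORS = 16;   // upper bound worked out statically

    struct Sample { int sensor_id; int value; };

    // Instead of an API that allocates and returns readings, initialise
    // storage the caller already owns and report how much of it was used.
    std::size_t read_all(std::array<Sample, MAX_SENSORS> &out) {
        std::size_t n = 0;
        // ... fill out[n++] for each sensor present, never past MAX_SENSORS ...
        return n;
    }

    // The caller can keep the storage in a static, on its stack, or inside a
    // struct provided by its own caller; no heap, so no OOM path to get wrong.
    static std::array<Sample, MAX_SENSORS> g_samples;

    void poll() {
        std::size_t n = read_all(g_samples);
        (void)n;   // process the n samples...
    }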

E.g. FreeRTOS can be used with partly or entirely static allocation (https://www.freertos.org/Static_Vs_Dynamic_Memory_Allocat...). Your application can implement a new thread as a struct/class that contains an array for the stack, a StaticTask_t, and a bunch of queues and timers and mutexes and whatever. You pass the memory into FreeRTOS APIs which connect it to other threads with linked lists, so FreeRTOS doesn't do any allocation itself but doesn't impose any hardcoded bounds. And since you know your application will only have one instance of that thread, it can be statically allocated and the linker will guarantee there's enough RAM for it.
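
(Roughly what that looks like with FreeRTOS's static-allocation API, assuming configSUPPORT_STATIC_ALLOCATION is enabled in the build; see the FreeRTOS docs for the exact requirements.)

    #include "FreeRTOS.h"
    #include "task.h"

    // All memory for the thread lives in statically allocated objects, so the
    // linker proves at build time that it fits in RAM.
    static StackType_t worker_stack[256];
    static StaticTask_t worker_tcb;

    static void worker_main(void *params) {
        (void)params;
        for (;;) {
            /* ... do the work ... */
        }
    }

    void start_worker(void) {
        // xTaskCreateStatic() never allocates: we hand it the stack and the
        // task control block, so it cannot fail for lack of heap.
        xTaskCreateStatic(worker_main, "worker",
                          256,                  /* stack depth, in StackType_t words */
                          nullptr,              /* parameters */
                          tskIDLE_PRIORITY + 1,
                          worker_stack, &worker_tcb);
    }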

In terms of the application's call graph, you want to move the allocations (and therefore the possibility of allocation failure) as far away from the leaf functions as possible. Just do a few big allocations at a high level where it's easier to unwind. Leaf functions include the OS and the language's standard library and logging functions etc, so you really need them to be designed to not do dynamic allocation themselves, otherwise you have no hope of making this work.

The C++ standard library is bad at that, but the language gives you reasonable tools to implement your own statically-allocated containers (in particular using templates for parameterised sizes; it's much more painful in C without templates). From an extremely brief look at Zig, it appears to have similar tools (generics with compile-time sizes) and at least some of the standard library is designed to work with memory passed in by the caller (and the rest lets the caller provide the dynamic allocator). Rust presumably has similar tools, but I get the impression a lot of the standard library relies on a global allocator and has little interest in providing non-allocating APIs.
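
(For instance, a hypothetical fixed-capacity vector along those lines: the capacity is a template parameter, the storage is part of the object, and no allocator is ever involved.)

    #include <cstddef>

    // Hypothetical statically-sized vector: storage is embedded in the object,
    // so it can live in a static, on the stack, or inside another such object.
    template <typename T, std::size_t N>
    class StaticVec {
        T items_[N];
        std::size_t size_ = 0;
    public:
        // push() can still "fail", but only against the compile-time bound,
        // never because the heap ran out.
        bool push(const T &value) {
            if (size_ == N) return false;
            items_[size_++] = value;
            return true;
        }
        std::size_t size() const { return size_; }
        T &operator[](std::size_t i) { return items_[i]; }
    };

    // e.g. a parser that never needs more than 32 tokens:
    static StaticVec<int, 32> g_tokens;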

It's not always easy to write allocation-free code, and it's not always the most memory-efficient (because if your program uses objects A and B at non-overlapping times, it'll statically allocate A+B instead of dynamically allocating max(A,B)), but sometimes it is feasible and it's really nice to have the guarantee that you will never have to debug an out-of-memory crash. And even if you can't do it for the whole application, you still get some benefit from making large parts of it allocation-free.

(This is for code that's a long way below "complex applications" from a typical Linux developer's perspective. But nowadays there's still a load of development for e.g. IoT devices where memory is limited to KBs or single-digit MBs, implementing complicated protocols across a potentially hostile network, so it's a niche where a language that's nicer and safer than C/C++ but no less efficient would be very useful.)

Zig heading toward a self-hosting compiler

Posted Oct 8, 2020 22:33 UTC (Thu) by roc (subscriber, #30627) [Link]

Yes, not allocating at all is a good option for some applications --- probably a lot more applications than those for which "check every allocation" is suitable.

There is a growing ecosystem of no-allocation Rust libraries, and Rust libraries that can optionally be configured to not allocate. These are "no-std" (but still use "core", which doesn't allocate). https://lib.rs/no-std

Rust const generics (getting closer!) will make static-allocation code easier to write in Rust.

