Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 17:07 UTC (Wed) by dgm (subscriber, #49227)
In reply to: Brian Kernighan on the origins of Unix by epa
Parent article: Brian Kernighan on the origins of Unix

The moment you have per-user everything (programs, libraries, permissions, filesystems, network, ...), what you have is, in essence, a per-user kernel. I think that the best solution for running per-user kernels is good ol' virtual machines (in the VirtualBox sense), so it is a solved problem indeed.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 19:55 UTC (Wed) by epa (subscriber, #39769)

If you don't want the users to communicate with each other (except over network links) then yes. You give each user their own system, whether by physical hardware or a virtual machine. My vision of a multi-user system is that it has the properties I mentioned, but at the same time you can communicate between users in the lightweight way you do on a current Linux system, provided you've granted permission. The multi-user interface provided by a set of virtual machines is pretty basic, because usually the only way for them to communicate is via TCP/IP.

We have approximations such as sandboxes used by web browsers to isolate one tab from another. Again, I see those as a kind of workaround or a hazy approximation to the right answer: creating a new user on the fly for each tab, giving it its own memory and CPU budget, and then communicating via signals, Unix domain sockets, or a shared filesystem. But yeah, talk is cheap. The point I want to make is that if it's considered a solved problem, that may be because we're too used to the limitations of the current systems. Before git, software version control might have been seen as a solved problem by some.

Brian Kernighan on the origins of Unix

Posted Jan 20, 2022 11:16 UTC (Thu) by farnz (subscriber, #17727)

The other thing that would be nice to solve that VMs don't is good resource utilization; a VM's memory allocation is hard to vary dynamically based on overall system load, as is its filesystem size. The only host resource that's demand-shared in a VM setup is CPU.

It would be nice if we didn't have to preallocate resources to processes, but instead could rely on the system sharing all resources fairly and reasonably for the use cases in question. That's a hard problem, but it's one that (e.g.) filesystem quotas and cgroups aim to solve for containerized systems.

Brian Kernighan on the origins of Unix

Posted Jan 28, 2022 6:05 UTC (Fri) by wahern (subscriber, #37304)

cgroups doesn't solve the memory sharing problem any better, though admittedly VMs do currently compound the problem. virtio-balloon, which is widely supported (including by typical Linux hosts and guests), permits a guest to return memory (e.g. filesystem buffer caches) to the host. The problem (AFAIU) is that currently there's no implementation on either side (host or guest) that communicates memory pressure back and forth, opportunistically shifting memory around. See the defunct auto-ballooning project at https://www.linux-kvm.org/page/Projects/auto-ballooning. So, just as with cgroups, if the memory accounting budget is overcommitted and the VM manager hits a hard OOM condition, it's already too late to guarantee ideal behavior; that is, it's too late to first recover caches and any other pages across all processes/guests that aren't strictly necessary for correct continuation (if at all possible) of every process/guest.

The alternative in each case--VMs and cgroups--is to preallocate memory so that the sum is no more than available backing space. But the Kubernetes clusters I've worked on, for example, all overcommitted cgroups memory--the sum of all pod cgroup memory limits was greater than physical memory (and machines had no swap space, despite my protests). Use of cgroups couldn't prevent the OOM killer from unnecessarily shooting down processes when under high aggregate load; it only helped blunt edge cases where individual pods allocated much more memory than ever expected (e.g. because of a memory leak). This was especially true if a JVM was running in a pod, as for whatever reasons the Java applications people liked to run on those clusters aggressively used anonymous memory. Whenever some other team complained about their pod being OOM killed despite not going over their cgroup budget, the host invariably had a JVM running in another, unrelated pod. (This was especially problematic for the period, 2019-2020, when Linux kernels had a regression that made it *much* more likely that the OOM killer couldn't flush buffer caches faster than processes would dirty them. Worse, the result was the OOM killer spinning, sleeping, and eventually timing out in its flush routine; and often multiple processes could end up in that routine--or end up trying to take a lock held by that routine--effectively resulting in a situation where the entire system could lock up for an indeterminate amount of time. Even if you could get an SSH session, poke the wrong process and your session would lock up, too.)
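To make the overcommit condition above concrete, here is a minimal sketch that compares the sum of per-pod memory limits against physical memory. The function name `overcommit_ratio` and the example figures are made up for illustration; they do not come from any real cluster.

```python
def overcommit_ratio(pod_limits_bytes, physical_bytes):
    """Return the sum of pod memory limits divided by physical memory.

    A result above 1.0 means the limits are overcommitted: if every pod
    used its full limit at once, the host could not back all of it.
    """
    if physical_bytes <= 0:
        raise ValueError("physical_bytes must be positive")
    return sum(pod_limits_bytes) / physical_bytes


GIB = 2**30

# Hypothetical host: 8 GiB of RAM, three pods limited to 4 GiB each.
ratio = overcommit_ratio([4 * GIB, 4 * GIB, 4 * GIB], 8 * GIB)
is_overcommitted = ratio > 1.0
```

With these assumed numbers the limits add up to one and a half times physical memory, which is the kind of setup where per-pod limits alone cannot keep the OOM killer away under high aggregate load.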

Brian Kernighan on the origins of Unix

Posted Jan 28, 2022 14:48 UTC (Fri) by farnz (subscriber, #17727)

Cgroups have two things that a virtual machine setup does not have:

Limits to determine which containers swap more than others. By setting memory.low to your "fair" share without overcommitting, and memory.high to your limit with overcommit, you bias the kernel's paging to page out memory hogs (both to swap and paging out their executable pages), and to slow down processes that can't reclaim memory to stay in their lane.
An interface for a management process to use to identify cgroups that are using more than their fair share. You can monitor memory.events for changes, and use the notifications to tell containers to reduce their memory use.
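As a rough sketch of the first point, assuming an equal split between containers: memory.low is set to the fair share of physical RAM and memory.high to that share scaled by an overcommit factor. The helper names `plan_limits` and `apply_limits`, the equal-share policy, and the factor are assumptions for illustration; writing to a real cgroup v2 directory needs suitable privileges.

```python
from pathlib import Path


def plan_limits(total_bytes, n_containers, overcommit_factor):
    """Compute byte values for memory.low and memory.high.

    memory.low  = equal "fair" share, with no overcommit
    memory.high = fair share scaled by the (assumed) overcommit factor
    """
    fair_share = total_bytes // n_containers
    return {
        "memory.low": fair_share,
        "memory.high": int(fair_share * overcommit_factor),
    }


def apply_limits(cgroup_dir, limits):
    """Write each limit into the matching control file under cgroup_dir.

    cgroup_dir is assumed to be a cgroup v2 directory, for example one
    somewhere below /sys/fs/cgroup (the exact path depends on the host).
    """
    for name, value in limits.items():
        (Path(cgroup_dir) / name).write_text(f"{value}\n")
```

For example, `plan_limits(8 * 2**30, 4, 1.5)` gives each of four containers a 2 GiB memory.low and a 3 GiB memory.high on an 8 GiB host.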

Now, if you're dealing with people who overcommit physical RAM but who've not read and understood "In defence of swap", you're in trouble anyway - they're working on a faulty model of how the OS allocates memory to processes to begin with, and both models are going to break because they're expecting "magic" to save them. But with competent people, cgroups already does better than VMs; and because cgroups has more visibility into what the workload is doing, it's more likely that cgroups will improve over time.

As an example, EPOC32 had low memory signals it sent to the applications to let them know that the system was low on memory, and that it was out of memory. Applications were expected to handle these signals by discarding caches, performing an early GC etc - and could also (ab)use them for better performance (e.g. don't run GC until the system is low on memory). You can get similar signals in a cgroup-aware application by having the application monitor memory.events, and treating a breach of the "low" threshold as "memory is running low" and a breach of the "high" threshold as "out of memory". Then, your JVM could be aware of this, and whenever it breaches the "low" threshold (container using too much memory), it does a GC run, while when it breaches the "high" threshold, it also discards any cached JITted sections etc.

This interacts in a good way with the kernel's preference to swap stuff that breaches low more than it swaps stuff that breaches high - when you breach low, you try to come back under control, when you breach high, you try to avoid being killed.
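The memory.events monitoring described above could be sketched roughly as follows. The code parses the "key value" lines of a cgroup v2 memory.events file and maps an increase in the "low" or "high" counter to the reactions suggested in the text. The function names and the action strings are hypothetical, and the mapping follows this comment's reading of the counters; the precise meaning of each counter is defined in the kernel's cgroup v2 documentation and is worth checking before relying on it.

```python
def parse_memory_events(text):
    """Parse the contents of memory.events into a dict of counters."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def choose_action(previous, current):
    """Pick a reaction based on which counters grew since the last read.

    "high" growing is treated as "out of memory": GC and drop caches.
    "low" growing is treated as "memory is running low": GC only.
    """
    def grew(key):
        return current.get(key, 0) > previous.get(key, 0)

    if grew("high"):
        return "gc_and_drop_caches"
    if grew("low"):
        return "gc"
    return "none"
```

An application would read its own cgroup's memory.events file whenever it is notified of a change, call `parse_memory_events` on the contents, and compare the result with the previous snapshot via `choose_action`.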

Brian Kernighan on the origins of Unix

Posted Jan 20, 2022 17:10 UTC (Thu) by ermo (subscriber, #86690)

I can't help but think that what you're asking for is a micro-kernel-like design from the ground up, where e.g. the user graphical session is its own set of processes (possibly related via a cgroup or namespace-like idiom), each browser tab is its own process and each container is its own process, and where all these processes use the fundamental micro-kernel message-passing paradigm (possibly wrapped) to communicate with each other?

If all processes used the same form of containerisation/namespacing/layering and the same means of IPC (possibly with some faster-than-TCP-but-still-reliable network transport available), this might function as seamlessly as one could ideally expect.

At least, that's the endpoint my imagination extrapolates to from your problem statement. I could be (very) wrong of course.

I don't know how close Plan 9 comes to this hypothetical scenario or whether it's even close. But perhaps Zircon/Fuchsia might very well become the closest modern approximation of what I outline?

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 14:01 UTC (Fri) by smurf (subscriber, #17840)

… or something like the GNU Hurd.