Brian Kernighan on the origins of Unix
Posted Jan 19, 2022 17:07 UTC (Wed) by dgm (subscriber, #49227)
In reply to: Brian Kernighan on the origins of Unix by epa
Parent article: Brian Kernighan on the origins of Unix
Posted Jan 19, 2022 19:55 UTC (Wed)
by epa (subscriber, #39769)
[Link] (4 responses)
We have approximations such as sandboxes used by web browsers to isolate one tab from another. Again, I see those as a kind of workaround or a hazy approximation to the right answer: creating a new user on the fly for each tab, giving it its own memory and CPU budget, and then communicating via signals, Unix domain sockets, or a shared filesystem. But yeah, talk is cheap. The point I want to make is that if it's considered a solved problem, that may be because we're too used to the limitations of the current systems. Before git, software version control might have been seen as a solved problem by some.
Posted Jan 20, 2022 11:16 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (2 responses)
The other thing that would be nice to solve that VMs don't is good resource utilization; a VM's memory allocation is hard to vary dynamically based on overall system load, as is its filesystem size. The only host resource that's demand-shared in a VM setup is CPU.
It would be nice if we didn't have to preallocate resources to processes, but instead could rely on the system sharing all resources fairly and reasonably for the use cases in question. That's a hard problem, but it's one that (e.g.) filesystem quotas and cgroups aim to solve for containerized systems.
Posted Jan 28, 2022 6:05 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (1 responses)
The alternative in each case--VMs and cgroups--is to preallocate memory so that the sum is no more than available backing space. But the Kubernetes clusters I've worked on, for example, all overcommitted cgroups memory--the sum of all pod cgroup memory limits was greater than physical memory (and machines had no swap space, despite my protests). Use of cgroups couldn't prevent the OOM killer from unnecessarily shooting down processes when under high aggregate load; it only helped blunt edge cases where individual pods allocated much more memory than ever expected (e.g. because of a memory leak). This was especially true if a JVM was running in a pod, as for whatever reason the Java applications people liked to run on those clusters aggressively used anonymous memory. Whenever some other team complained about their pod being OOM killed despite not going over their cgroup budget, the host invariably had a JVM running in another, unrelated pod. (This was especially problematic for the period, 2019-2020, when Linux kernels had a regression that made it *much* more likely that the OOM killer couldn't flush buffer caches faster than processes would dirty them. Worse, the result was the OOM killer spinning, sleeping, and eventually timing out in its flush routine; and often multiple processes could end up in that routine--or end up trying to take a lock held by that routine--effectively resulting in a situation where the entire system could lock up for an indeterminate amount of time. Even if you could get an SSH session, poke the wrong process and your session would lock up, too.)
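To make the overcommit arithmetic above concrete, here is a minimal sketch. The function name and the pod and host sizes are made up for illustration; they are not figures from the clusters described. A ratio above 1.0 means the limits sum to more than physical memory, so the limits alone cannot keep the host out of an OOM situation.

```python
GIB = 1024 ** 3


def overcommit_ratio(pod_limits_bytes, physical_bytes):
    """Return the sum of the pod memory limits divided by physical RAM.

    Hypothetical helper: a result above 1.0 means the host is
    overcommitted, i.e. every pod can stay within its own limit while
    the host as a whole still runs out of memory.
    """
    if physical_bytes <= 0:
        raise ValueError("physical_bytes must be positive")
    return sum(pod_limits_bytes) / physical_bytes


# Illustrative numbers only: three pods limited to 8 GiB each
# on a host with 16 GiB of RAM and no swap.
ratio = overcommit_ratio([8 * GIB, 8 * GIB, 8 * GIB], 16 * GIB)
print(f"overcommit ratio: {ratio:.2f}")
```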
Posted Jan 28, 2022 14:48 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
Cgroups have two things that a virtual machine setup does not have:
Now, if you're dealing with people who overcommit physical RAM but who've not read and understood "In defence of swap", you're in trouble anyway - they're working on a faulty model of how the OS allocates memory to processes to begin with, and both models are going to break because they're expecting "magic" to save them. But with competent people, cgroups already does better than VMs; and because cgroups has more visibility into what the workload is doing, it's more likely that cgroups will improve over time.
As an example, EPOC32 had low-memory signals that it sent to applications to let them know that the system was low on memory, and that it was out of memory. Applications were expected to handle these signals by discarding caches, performing an early GC, etc. - and could also (ab)use them for better performance (e.g. don't run GC until the system is low on memory). You can get similar signals in a cgroup-aware application by having the application monitor memory.events, and treating a breach of the "low" threshold as "memory is running low" and a breach of the "high" threshold as "out of memory". Then, your JVM could be aware of this, and whenever it breaches the "low" threshold (container using too much memory), it does a GC run, while when it breaches the "high" threshold, it also discards any cached JITted sections etc.
This interacts in a good way with the kernel's preference to swap stuff that breaches low more than it swaps stuff that breaches high - when you breach low, you try to come back under control, when you breach high, you try to avoid being killed.
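A rough sketch of the monitoring idea, assuming cgroup v2, where memory.events is a flat file of "key value" lines with counters such as low and high. The function names and the action labels are hypothetical, and the cgroup directory passed to read_events is whatever path your container sees (often under /sys/fs/cgroup, but that is an assumption about the setup). A real runtime would probably wait for file-change notifications rather than poll, and would map the actions onto its own GC and cache hooks.

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def choose_action(previous, current):
    """Compare two snapshots of the counters and pick a reaction.

    Follows the scheme described above: a rise in "high" is treated as
    "out of memory" (GC and drop caches), a rise in "low" as "memory is
    running low" (GC only). The action names are made up for this sketch.
    """
    if current.get("high", 0) > previous.get("high", 0):
        return "gc_and_discard_caches"
    if current.get("low", 0) > previous.get("low", 0):
        return "gc"
    return None


def read_events(cgroup_dir):
    """Read and parse memory.events from the given cgroup directory."""
    return parse_memory_events(Path(cgroup_dir, "memory.events").read_text())
```

In use, an application would keep the last snapshot from read_events(), take a new one periodically, and call choose_action(previous, current) to decide whether to collect garbage or also discard caches.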
Posted Jan 20, 2022 17:10 UTC (Thu)
by ermo (subscriber, #86690)
[Link]
If all processes used the same form of containerisation/namespacing/layering and the same means of IPC (possibly with some faster-than-TCP-but-still-reliable network transport available), this might function as seamlessly as one could ideally expect.
At least, that's the endpoint my imagination extrapolates to from your problem statement. I could be (very) wrong of course.
I don't know how close Plan 9 comes to this hypothetical scenario, or whether it's even close. But perhaps Zircon/Fuchsia will become the closest modern approximation of what I outline?
Posted Jan 21, 2022 14:01 UTC (Fri)
by smurf (subscriber, #17840)
[Link]