Brian Kernighan on the origins of Unix

Posted Jan 18, 2022 7:47 UTC (Tue) by NYKevin (subscriber, #129325)
In reply to: Brian Kernighan on the origins of Unix by zdzichu
Parent article: Brian Kernighan on the origins of Unix

The problem here is not (necessarily) that the tools don't exist, it is that none of them are as ubiquitous and widely-supported as Unix. I want containerization to be a Solved Problem™, in the same way that multi-user operating systems are a Solved Problem™, and right now it feels like we're still in the "everybody is throwing things at the wall to see what sticks" phase.

(Sure, at the low level, Linux-based systems have mostly standardized on "everything is a frontend to cgroups," but orchestration is an entirely different story. There's no single standard way of describing a bunch of generic containers, or machines, their associated resource requirements and constraints, etc., much less a standard piece of software that will take those constraints and schedule your containers for you. As I mentioned, you *can* do this stuff with k8s, but nobody regards k8s as The One True Container Solution in the same way that Unix is The One True Operating System.)

Brian Kernighan on the origins of Unix

Posted Jan 18, 2022 8:52 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

It’s not a technical problem; it’s an organisational/funding problem.

All those new tools are proposed by organisations that do not want to contribute to existing distribution networks, but are unable to set up alternatives (because that’s hard and expensive).

When everyone is fed up with the current mess, things will consolidate violently (like they did at the end of the unix wars, with most of the companies that had refused team play on the losing side).

Brian Kernighan on the origins of Unix

Posted Jan 18, 2022 15:36 UTC (Tue) by epa (subscriber, #39769) [Link] (7 responses)

Multi-user systems are not a solved problem. If they were, there would hardly be any need for containers. Most container setups work around the inherently single-user setup of typical Linux distributions or Microsoft Windows.

In theory each user can have an independent set of software, sharing only the kernel; in practice it's damned awkward to run your own software stack from your home directory, and there aren't package management tools to make it easy.

In theory each user can have their own share of CPU time and I/O budget; in practice it's far too easy for one user to take more than their fair share, or to bust global limits with fork bombs and the like.
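For reference, the classic knob against fork bombs is a per-process resource limit. The following is only a minimal sketch, assuming a Linux host and Python's standard `resource` module; `run_with_process_cap` is a hypothetical helper name. On Linux, RLIMIT_NPROC counts processes per real user ID, so it is a coarse cap rather than a fair share, which is roughly the complaint above.

```python
import resource
import subprocess
import sys


def run_with_process_cap(argv, max_processes):
    """Run argv in a child whose RLIMIT_NPROC soft limit is capped.

    Hypothetical helper: the cap is applied in the child just before exec,
    so the calling process keeps its own limits.
    """
    def cap():
        soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
        if hard == resource.RLIM_INFINITY:
            new_soft = max_processes
        else:
            # An unprivileged process cannot raise the soft limit above the hard limit.
            new_soft = min(max_processes, hard)
        resource.setrlimit(resource.RLIMIT_NPROC, (new_soft, hard))

    return subprocess.run(argv, preexec_fn=cap, capture_output=True, text=True)


if __name__ == "__main__":
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NPROC)[0])"
    result = run_with_process_cap([sys.executable, "-c", probe], 64)
    print("soft process limit seen by the child:", result.stdout.strip())
```

The limit is inherited by whatever the child spawns, but nothing here divides CPU time or I/O between users; that part needs something like cgroups.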

Users and permissions are themselves not per-user; there's a single global set of users and groups and it does not nest. You can have subdirectories in the file system but you can't have sub-accounts under your user account or make your own groups. It all has to go through a single global administrator. A true multi-user system would let you manage your own permissions within your space and delegate to sub-users.

Can you set up a multiuser Linux system where each user gets their own IP address, so they can run their own httpd on port 80 serving that address, and not interfere with other users? I am sure it's possible with enough hackery, but it's far from being straightforward or manageable.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 17:07 UTC (Wed) by dgm (subscriber, #49227) [Link] (6 responses)

The moment you have per-user everything (programs, libraries, permissions, filesystems, network, ...) what you have is, in essence, a per-user kernel. I think that the best solution for running per-user kernels is good ol' Virtual Machines (in the VirtualBox sense), so it is a solved problem indeed.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 19:55 UTC (Wed) by epa (subscriber, #39769) [Link] (4 responses)

If you don't want the users to communicate with each other (except over network links) then yes. You give each user their own system, whether by physical hardware or a virtual machine. My vision of a multi-user system is that it has the properties I mentioned, but at the same time you can communicate between users in the lightweight way you do on a current Linux system, provided you've granted permission. The multi-user interface provided by a set of virtual machines is pretty basic, because usually the only way for them to communicate is via TCP/IP.

We have approximations such as sandboxes used by web browsers to isolate one tab from another. Again, I see those as a kind of workaround or a hazy approximation to the right answer: creating a new user on the fly for each tab, giving it its own memory and CPU budget, and then communicating via signals, Unix domain sockets, or a shared filesystem. But yeah, talk is cheap. The point I want to make is that if it's considered a solved problem, that may be because we're too used to the limitations of the current systems. Before git, software version control might have been seen as a solved problem by some.
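A rough approximation of the "one budgeted worker per tab" idea can be sketched with existing primitives. This is only an illustration under stated assumptions: it forks a process instead of creating a new user (that would need privileges), it uses RLIMIT_AS as a stand-in for a memory budget, and `run_in_budgeted_child` is a hypothetical name. Communication goes over a Unix domain socket pair.

```python
import os
import resource
import socket


def run_in_budgeted_child(handler, request, memory_bytes):
    """Fork a worker with an address-space cap and talk to it over a
    Unix domain socket pair. Returns the worker's reply as bytes."""
    parent_sock, child_sock = socket.socketpair()  # AF_UNIX by default
    pid = os.fork()
    if pid == 0:
        # Child: apply the budget, then serve exactly one request.
        try:
            parent_sock.close()
            _, hard = resource.getrlimit(resource.RLIMIT_AS)
            if hard == resource.RLIM_INFINITY:
                limit = memory_bytes
            else:
                limit = min(memory_bytes, hard)
            resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
            data = child_sock.recv(65536)
            child_sock.sendall(handler(data))
        finally:
            os._exit(0)  # never fall back into the parent's code path
    # Parent: send the request, wait for the reply, reap the child.
    child_sock.close()
    parent_sock.sendall(request)
    reply = parent_sock.recv(65536)
    parent_sock.close()
    os.waitpid(pid, 0)
    return reply


if __name__ == "__main__":
    print(run_in_budgeted_child(lambda d: d.upper(), b"render tab", 4 * 1024**3))
```

What this cannot express is exactly what the comment asks for: the worker still runs as the same user, with the same permissions, and there is no way to delegate a sub-account to it.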

Brian Kernighan on the origins of Unix

Posted Jan 20, 2022 11:16 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

The other thing that would be nice to solve that VMs don't is good resource utilization; a VM's memory allocation is hard to vary dynamically based on overall system load, as is its filesystem size. The only host resource that's demand-shared in a VM setup is CPU.

It would be nice if we didn't have to preallocate resources to processes, but instead could rely on the system sharing all resources fairly and reasonably for the use cases in question. That's a hard problem, but it's one that (e.g.) filesystem quotas and cgroups aim to solve for containerized systems.

Brian Kernighan on the origins of Unix

Posted Jan 28, 2022 6:05 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

cgroups doesn't solve the memory sharing problem any better, though admittedly VMs do currently compound the problem. virtio-balloon, which is widely supported (including by typical Linux hosts and guests) permits a guest to return memory (e.g. filesystem buffer caches) to the host. The problem (AFAIU) is that currently there's no implementation on either side (host or guest) that communicates memory pressure back-and-forth, opportunistically shifting memory around. See the defunct auto-ballooning project at https://www.linux-kvm.org/page/Projects/auto-ballooning. So, just as with cgroups, if the memory accounting budget is overcommitted and the VM manager hits a hard OOM condition, it's already too late to guarantee ideal behavior; that is, it's too late to first recover caches and any other pages across all processes/guests that aren't strictly necessary for correct continuation (if at all possible) of every process/guest.

The alternative in each case--VMs and cgroups--is to preallocate memory so that the sum is no more than available backing space. But the Kubernetes clusters I've worked on, for example, all overcommitted cgroups memory--the sum of all pod cgroup memory limits was greater than physical memory (and machines had no swap space, despite my protests). Use of cgroups couldn't prevent the OOM killer from unnecessarily shooting down processes when under high aggregate load; it only helped blunt edge cases where individual pods allocated much more memory than ever expected (e.g. because of a memory leak). This was especially true if a JVM was running in a pod, as for whatever reasons the Java applications people liked to run on those clusters aggressively used anonymous memory. Whenever some other team complained about their pod being OOM killed despite not going over their cgroup budget, the host invariably had a JVM running in another, unrelated pod. (This was especially problematic for the period, 2019-2020, when Linux kernels had a regression that made it *much* more likely that the OOM killer couldn't flush buffer caches faster than processes would dirty them. Worse, the result was the OOM killer spinning, sleeping, and eventually timing out in its flush routine; and often times multiple processes could end up in that routine--or end up trying to take a lock held by that routine--effectively resulting in a situation where the entire system could lock-up for an indeterminate amount of time. Even if you could get an SSH session, poke the wrong process and your session would lock-up, too.)
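To make the overcommit arithmetic concrete, here is a small sketch. The pod limits and the host size are hypothetical numbers chosen for illustration, not figures from the clusters described above.

```python
GiB = 1024 ** 3


def overcommit_ratio(limits_bytes, physical_bytes):
    """Sum of per-pod memory limits divided by physical memory.
    Anything above 1.0 means the limits cannot all be honoured at once."""
    return sum(limits_bytes) / physical_bytes


def worst_case_shortfall(limits_bytes, physical_bytes):
    """Bytes that would be missing if every pod used its full limit."""
    return max(0, sum(limits_bytes) - physical_bytes)


if __name__ == "__main__":
    pod_limits = [8 * GiB, 8 * GiB, 16 * GiB, 4 * GiB]  # hypothetical pods
    physical = 24 * GiB                                  # hypothetical host, no swap
    print(f"overcommit ratio: {overcommit_ratio(pod_limits, physical):.2f}")
    print(f"shortfall: {worst_case_shortfall(pod_limits, physical) // GiB} GiB")
```

With these numbers every pod can stay under its own limit while the host as a whole is 12 GiB short, which is the situation where the OOM killer picks victims that never exceeded their budget.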

Brian Kernighan on the origins of Unix

Posted Jan 28, 2022 14:48 UTC (Fri) by farnz (subscriber, #17727) [Link]

Cgroups have two things that a virtual machine setup does not have:

Limits to determine which containers swap more than others. By setting memory.low to your "fair" share without overcommitting, and memory.high to your limit with overcommit, you bias the kernel's paging to page out memory hogs (both to swap and paging out their executable pages), and to slow down processes that can't reclaim memory to stay in their lane.
An interface for a management process to use to identify cgroups that are using more than their fair share. You can monitor memory.events for changes, and use the notifications to tell containers to reduce their memory use.
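The first point can be sketched as follows. This is a minimal illustration, assuming a cgroup v2 hierarchy in which each container has its own directory containing the memory.low and memory.high interface files; the equal-share policy, the container names, and the helper names `plan_limits` and `apply_limits` are assumptions made for the example.

```python
import os


def plan_limits(physical_bytes, containers, overcommit_factor):
    """Return {name: (memory_low, memory_high)} in bytes.

    memory.low is an equal "fair" share, so the shares never sum to more
    than physical memory; memory.high is that share scaled by
    overcommit_factor, so the highs may sum to more than physical memory.
    """
    fair_share = physical_bytes // len(containers)
    high = int(fair_share * overcommit_factor)
    return {name: (fair_share, high) for name in containers}


def apply_limits(cgroup_root, plan):
    """Write the planned values into each container's cgroup directory.
    cgroup_root would typically be a cgroup v2 mount point (assumption)."""
    for name, (low, high) in plan.items():
        base = os.path.join(cgroup_root, name)
        with open(os.path.join(base, "memory.low"), "w") as f:
            f.write(str(low))
        with open(os.path.join(base, "memory.high"), "w") as f:
            f.write(str(high))


if __name__ == "__main__":
    GiB = 1024 ** 3
    for name, (low, high) in plan_limits(16 * GiB, ["web", "db", "batch", "cache"], 1.5).items():
        print(name, low // GiB, "GiB low,", high // GiB, "GiB high")
```

Writing to a real cgroup hierarchy needs the appropriate permissions; pointing `apply_limits` at a scratch directory is enough to see what would be written.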

Now, if you're dealing with people who overcommit physical RAM but who've not read and understood "In defence of swap", you're in trouble anyway - they're working on a faulty model of how the OS allocates memory to processes to begin with, and both models are going to break because they're expecting "magic" to save them. But with competent people, cgroups already does better than VMs; and because cgroups has more visibility into what the workload is doing, it's more likely that cgroups will improve over time.

As an example, EPOC32 had low memory signals it sent to the applications to let them know that the system was low on memory, and that it was out of memory. Applications were expected to handle these signals by discarding caches, performing an early GC etc - and could also (ab)use them for better performance (e.g. don't run GC until the system is low on memory). You can get similar signals in a cgroup-aware application by having the application monitor memory.events, and treating a breach of the "low" threshold as "memory is running low" and a breach of the "high" threshold as "out of memory". Then, your JVM could be aware of this, and whenever it breaches the "low" threshold (container using too much memory), it does a GC run, while when it breaches the "high" threshold, it also discards any cached JITted sections etc.

This interacts in a good way with the kernel's preference to swap stuff that breaches low more than it swaps stuff that breaches high - when you breach low, you try to come back under control, when you breach high, you try to avoid being killed.
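A cgroup-aware application along these lines could look roughly like the sketch below. It assumes the cgroup v2 memory.events file, which holds one "name count" pair per line (low, high, max, oom, oom_kill). Mapping a growing "low" counter to "memory is running low" and a growing "high" counter to "out of memory" follows the proposal above and is an assumption of this sketch; the kernel documentation defines the counters more precisely. The function names and the cache object are hypothetical.

```python
import gc


def parse_memory_events(text):
    """Parse the contents of a memory.events file into {name: count}."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value)
    return events


def read_events(path):
    """Read and parse a memory.events file at the given path."""
    with open(path) as f:
        return parse_memory_events(f.read())


def react(previous, current, caches):
    """Compare two snapshots and shed memory accordingly.

    A "low" increase triggers an early GC run; a "high" increase also
    drops the application's caches. Returns "high", "low" or None.
    """
    low_hit = current.get("low", 0) > previous.get("low", 0)
    high_hit = current.get("high", 0) > previous.get("high", 0)
    if high_hit:
        caches.clear()
    if low_hit or high_hit:
        gc.collect()
    return "high" if high_hit else "low" if low_hit else None


if __name__ == "__main__":
    before = parse_memory_events("low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\n")
    after = parse_memory_events("low 1\nhigh 0\nmax 0\noom 0\noom_kill 0\n")
    print(react(before, after, {"jit": b"cached code"}))
```

A real runtime would call `read_events` periodically, or wait for change notifications on the file, and feed consecutive snapshots to `react`.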

Brian Kernighan on the origins of Unix

Posted Jan 20, 2022 17:10 UTC (Thu) by ermo (subscriber, #86690) [Link]

I can't help but think that what you're asking for is a micro-kernel-like design from the ground up, where e.g. the user graphical session is its own set of processes (possibly related via a cgroup or namespace-like idiom), each browser tab is its own process and each container is its own process and where all these processes use the fundamental micro-kernel message-passing paradigm (possibly wrapped) to communicate with each other?

If all processes used the same form of containerisation/namespacing/layering and the same means of IPC (possibly with some faster-than-TCP-but-still-reliable network transport available), this might function as seamlessly as one could ideally expect.

At least, that's the endpoint my imagination extrapolates to from your problem statement. I could be (very) wrong of course.

I don't know how close Plan 9 comes to this hypothetical scenario or whether it's even close. But perhaps Zircon/Fuchsia might very well become the closest modern approximation of what I outline?

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 14:01 UTC (Fri) by smurf (subscriber, #17840) [Link]

… or something like the GNU Hurd.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 0:29 UTC (Wed) by bartoc (guest, #124262) [Link] (5 responses)

I bet many do see k8s as the "one true container solution". I also bet around 100% of those people work for Google.

I've tried to understand k8s multiple times and it's like they refuse to _ever_ document what anything actually does, just some high-level abstract theory of what the results might be. It feels like reading a formal standard but more arbitrary.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 5:36 UTC (Wed) by jkingweb (subscriber, #113039) [Link] (4 responses)

This has been my experience with all Google documentation I've ever seen.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 12:10 UTC (Wed) by cyperpunks (subscriber, #39406) [Link] (3 responses)

Yep, sometimes I am puzzled by how Google can earn all that money and have all these brilliant minds, yet the software they produce and release lacks both proper documentation and a sane release policy. If they sold software the Microsoft way, they would have been out of business many years ago.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 13:56 UTC (Wed) by farnz (subscriber, #17727) [Link]

Technical writing is a very different discipline to software writing, and needs a different set of skills. Unfortunately, too many companies fail to value tech writing, and instead expect that the software authors will write good documentation themselves.

A really good tech writer knows how to take a developer's documentation (written from a place of deep familiarity with the code) and turn it into something useful for people unfamiliar with the code. A software developer just doesn't have that skill.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 17:40 UTC (Wed) by mpr22 (subscriber, #60784) [Link] (1 responses)

The trick is that Google doesn't sell software at all.

It sells advertising space (and possibly trademark licences?), and uses a slice of the revenue to pay the salaries of its technology division.

Brian Kernighan on the origins of Unix

Posted Jan 19, 2022 17:42 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

(Also it charges rent on business-use access to its services.)

Brian Kernighan on the origins of Unix

Posted Jan 20, 2022 22:37 UTC (Thu) by ceplm (subscriber, #41334) [Link] (12 responses)

One thing which is very much not A Solved Problem™ is a really working networked system: something which would be networked but would allow seamless work offline (or when the network is not really as fast as we would like it to be). We have either local systems (BTRFS, XFS, etc.) or network systems (NFS, Samba, etc.), but nothing in between.

I believe that Plan9 was more on the way towards this ideal, but Plan9 unfortunately died.

Brian Kernighan on the origins of Unix

Posted Jan 20, 2022 22:44 UTC (Thu) by sfeam (subscriber, #2841) [Link] (11 responses)

> We have either local systems (BTRFS, XFS, etc.) or network systems (NFS, Samba, etc.), but nothing in between.

Well, there were VAXClusters. Start a job on the cluster and it would migrate from machine to machine or node to node as they connected or disconnected themselves from the cluster. So far as I know, unix/linux clusters have never reached this level of flexibility. Or maybe you were focusing specifically on filesystems?

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 11:16 UTC (Fri) by ceplm (subscriber, #41334) [Link] (3 responses)

Yes, VMSClusters look like something I was thinking about (as far as I can get from the Wikipedia article on it). I speak mostly out of my frustration with the current state of affairs.

What I was thinking was that all my data are somewhere in The Cloud, and whatever machine I use to access them (workstation, my hobby home laptop, tablet, or phone), I get access to the same data. Perhaps I don’t have applications for all the data (thinking of a tablet or phone), in which case I cannot access those, but for the data I do have an application for, I can access the exact same data. And of course all that over the Internet, so fully secure, with authorizations and all that crap. And all that cached on the local disk, because otherwise latency will kill any user experience. It should be the normal state of everyday affairs, not something like pulling your teeth slowly.

If you try something like that today, you have NFS or Samba or stuff like that, which is unusable for everyday work over an even slightly non-local network (not to mention a potential security disaster, so one switches to something even more obscure like sshfs, and that’s even slower; NFSv4 may be more secure). Or you have some ultra-high-level stuff like Coda/OpenAFS etc., which requires really high-end hardware (no chance of running it on my tablet, and a phone is out of the question), and even there I am not sure how well it works. Or you have a series of mutually incompatible ad-hoc synchronization hacks (offline-IMAP, similar for CardDAV/CalDAV/etc., some weird in-browser proprietary caching for Google Docs, scripts using rsync, etc.).

http://ninetimes.cat-v.org/ seems to claim that Plan9 was supposed to be capable of something like that, but I have my deepest doubts, and I will believe it when I see it.

If you have some system which fulfils my requirements, then you are either God or a liar, and I bet on the latter.

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 23:54 UTC (Fri) by intgr (subscriber, #39733) [Link] (2 responses)

> What I was thinking was that all my data are somewhere in The Cloud and whatever machine I use to access them (workstation, my hobby home laptop, tablet, or phone) I get access to the same data.

Perkeep (previously Camlistore) is an attempt to solve this problem.

Brian Kernighan on the origins of Unix

Posted Jan 22, 2022 19:28 UTC (Sat) by ceplm (subscriber, #41334) [Link] (1 responses)

OK, an interesting idea; I will try to learn more about it. Two problems seem obvious: the first, and the reason I think keeping it at the lower level of a filesystem makes more sense, is that nobody wants to use any format they don’t already use (https://xkcd.com/743/); the second is that I have never heard of it, so I don’t expect it to take over the world anytime soon.

Brian Kernighan on the origins of Unix

Posted Jan 22, 2022 21:57 UTC (Sat) by ceplm (subscriber, #41334) [Link]

Hmm, I have to correct myself a bit. I was talking about a completely independent system and then I was talking about using “normal” formats, which are mutually exclusive. Damn.

Concerning Perkeep, it looks interesting, but more as a secondary store for backup and archiving (something similar to https://github.com/ThinkUpLLC/ThinkUp ?), not as something where you actually work.

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 12:01 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

People tried that with Linux (famous Beowulf clusters!) but this really doesn't buy you a lot. Why would you migrate jobs between hosts in the first place?

You can do full machine migration easily either via VM checkpoint/restore or via CRIU, and this is useful in virtualization scenarios. You can also do high-availability via Xen where the same code runs on two computers at once, in case one of them goes down.

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 13:58 UTC (Fri) by malmedal (subscriber, #56172) [Link] (5 responses)

I don't think Beowulf clusters had that capability. You could schedule your job on any cluster member, but not move it. Similarly with VMS, you could maintain the illusion of 100% uptime of a specially written application even while replacing all hardware. (Admittedly I've only used VMS on VAX; maybe later versions could actually move jobs.)

MOSIX however, could move arbitrary processes between cluster machines.

Another option is Condor, which uses CRIU to move simple processes.

Beowulf

Posted Jan 21, 2022 14:01 UTC (Fri) by corbet (editor, #1) [Link]

Yeah...Beowulf clusters still exist, they are just called "data centers" now...

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 18:14 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Yes, they did. You could migrate processes between nodes: https://web.archive.org/web/20031002104248/http://www.ope... with some measure of automatic load balancing.

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 18:36 UTC (Fri) by malmedal (subscriber, #56172) [Link] (2 responses)

I believe I said Beowulf couldn't but MOSIX could migrate processes, they are different things.

Brian Kernighan on the origins of Unix

Posted Jan 21, 2022 18:38 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Sorry, I'm a writer, not a reader. 🤦

Yes, of course you're correct. I used a Beowulf cluster with MOSIX back in the day, so that's why I'm confusing the two of them.

Brian Kernighan on the origins of Unix

Posted Jan 22, 2022 13:03 UTC (Sat) by malmedal (subscriber, #56172) [Link]

My own memory is also rather lacking at this point, but I think the deal-breaker for me was that the migrated process would die if the node it originally ran on rebooted.

Did MOSIX ever fix that issue?