A "kill" button for control groups

Posted May 3, 2021 22:06 UTC (Mon) by zblaxell (subscriber, #26385)
Parent article: A "kill" button for control groups

This doesn't seem controversial to me. Just look at what it's replacing: the systemd-style while loop that scrapes PIDs out of /sys/fs/cgroup, dodging processes stuck in high-latency kernel syscalls while trying to win a game of whack-a-mole against high-priority fork-bombs on low-core-count systems, with potentially unbounded run time and no clear exit condition. It is an abomination familiar to anyone who has had to troubleshoot it or, worse, implement it.
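For anyone who hasn't seen it, the userspace loop described above can be sketched roughly as follows. This is a minimal illustration, not systemd's actual code: the function names, the injectable send_signal parameter and the max_rounds cap are all made up here, and the only real interface assumed is the cgroup.procs file that lists one PID per line.

```python
import os
import signal
from pathlib import Path


def read_pids(cgroup_dir):
    """Return the PIDs currently listed in the cgroup's cgroup.procs file."""
    text = Path(cgroup_dir, "cgroup.procs").read_text()
    return [int(field) for field in text.split()]


def kill_loop(cgroup_dir, send_signal=os.kill, max_rounds=100):
    """Keep sending SIGKILL to every listed PID until the cgroup looks empty.

    Returns True if the cgroup was observed empty, False if we gave up.
    max_rounds is an arbitrary (hypothetical) cap: without one, nothing
    guarantees that the loop ever exits against a fork-bomb.
    """
    for _ in range(max_rounds):
        pids = read_pids(cgroup_dir)
        if not pids:
            return True
        for pid in pids:
            try:
                send_signal(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited between the read and the kill
    return False
```

Every pass races against whatever is still forking inside the cgroup, which is why no fixed number of rounds is ever provably enough.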

It was possible to get a similar effect by freezing a cgroup before enumerating PIDs to kill it (i.e. first stop the fork-bombs from running, then kill the frozen processes), but the freezer cgroup has its own set of gotchas. We have to wait for the freeze to take effect to be sure we've captured every PID, and that waiting puts us back at "potentially unbounded run time" and at distinguishing new processes popping up from old processes that refuse to die. With that extra complexity we are doing only slightly better than the abomination.

If the kernel implements this capability, it can lock out new processes from being created, while it terminates old ones. Userspace can do one write(2) syscall with running time proportional to the task list size, then forget anything in the cgroup existed (unless it chooses to wait until all killed processes blocked in syscalls exit, in which case cgroup2 has an API for that). Simple, elegant, and effective.
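As a rough sketch of what that looks like from userspace: the snippet below assumes the control file is called cgroup.kill and accepts the string "1" (neither detail is spelled out in this comment, so treat both as assumptions), and it uses the "populated" key of cgroup2's cgroup.events file for the optional wait. The helper names are hypothetical.

```python
from pathlib import Path


def kill_cgroup(cgroup_dir):
    """Ask the kernel to SIGKILL everything in the cgroup with one write.

    Assumes a control file named cgroup.kill that accepts "1".
    """
    Path(cgroup_dir, "cgroup.kill").write_text("1")


def is_populated(cgroup_dir):
    """Report whether cgroup.events still shows live processes."""
    for line in Path(cgroup_dir, "cgroup.events").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "populated":
            return value.strip() == "1"
    raise ValueError("no 'populated' key in cgroup.events")
```

A caller that doesn't care about stragglers blocked in syscalls can stop after kill_cgroup(); one that does can watch is_populated() until it turns False.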

Of course, writing something to some file named "kill" will likely wipe out some processes. No lesser result should be expected by a human writing to a file on a control filesystem with such a name. Software blindly writing numbers into new cgroup files without knowing what the numbers mean is already asking for a world of trouble--it's best to make such software notorious, so it can be removed from the world.

But...maybe it would be better if writing, say, 0x4321fedc (LINUX_REBOOT_CMD_POWER_OFF, defined in linux/reboot.h) killed the cgroup, and other values didn't? Or the string "kill" or "-9" or really anything other than the first non-zero integer.

I also wonder why it only sends SIGKILL? People often want to send SIGTERM first, and since the kill file takes a numeric value anyway, it might as well be the signal ID.

A "kill" button for control groups

Posted May 3, 2021 22:14 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

> If the kernel implements this capability, it can lock out new processes from being created, while it terminates old ones.
You can already do that with the pids controller.

A "kill" button for control groups

Posted May 3, 2021 23:19 UTC (Mon) by zblaxell (subscriber, #26385)

systemd doesn't seem to know that. The pids controller was invented in 2015; systemd's cg_kill functions were last significantly updated in 2011. OTOH the freezer cgroup was invented in 2008, and systemd didn't use that to derace cg_kill either.

It looks like there are some ways to escape from the pids controller which the kill button closes off: a process that is running fork() can evade some of the limits that are imposed after fork() starts and before it finishes, or escape by migrating to another cgroup. The kill-button patch leaves a note to smack that process with a SIGKILL just before fork() returns.

A "kill" button for control groups

Posted May 3, 2021 23:52 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

Well, if you can escape to another cgroup then you can also avoid the kill controller. Normal processes don't have the rights for it.

Personally, I would prefer a reliable handle-based API for processes instead of trying to plug leaks in a dam with fingers.

A "kill" button for control groups

Posted May 4, 2021 22:07 UTC (Tue) by zblaxell (subscriber, #26385)

> if you can escape to another cgroup then you can also avoid the kill controller. Normal processes don't have the rights for it.

Rights can be delegated. That's one of the central features of cgroups: you don't need to be root to use it.

A process can move around within its delegation hierarchy and evade a (naive, non-looping) userspace terminator--that was part of what made looping (and possibly also recursive search) in userspace necessary. Processes can hold the controller FDs open so they can give themselves their rights back even if the control files are chmod-ed. There are probably a hundred other holes I haven't bothered to think about and, with this patch set, no longer have to.

A "kill" button for control groups

Posted May 4, 2021 22:46 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

Realistically, systemd will kill processes faster than they can migrate within their subtree. It's a theoretical problem, but not a practical one.

A "kill" button for control groups

Posted May 4, 2021 16:50 UTC (Tue) by mezcalero (subscriber, #45103)

So, regarding use of the freezer in systemd for killing: the cgroupsv1 freezer is really broken API-wise, i.e. you need to have a timed sleep loop to see when it is done. It's not usable for any code that strives to be reasonably clean. We never supported it in systemd for anything for this reason. I mean, there are limits to how much ugly low-level code we are willing to accept...

The cgroupsv2 freezer makes a ton more sense, and hence we expose it with high-level operations (systemctl freeze + systemctl thaw), but we don't use it to make killing race-free. We could do that, but it doesn't feel ideal to me, since freezing is slow, i.e. we need to initiate the freeze, then wait until the kernel tells us it is done (poll()), then enqueue the signal, and then unfreeze and wait again. And blocking syscalls can delay the freeze for a long time. Thus killing would become a "slow" operation in the worst case (at least that's my understanding), and that kinda sucks. After all, we want this as a clean-up operation that gets rid of broken stuff, i.e. SIGKILL is the unfriendly way to abort stuff, and if things are not abortive anymore when we use the freezer, that defeats half the point.
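To make the sequence above concrete, here is a rough sketch of freeze, wait, signal, unfreeze against the cgroup2 files cgroup.freeze, cgroup.events and cgroup.procs. It is not systemd code: the function names, the injectable send_signal parameter and the timeout are made up for illustration, and for brevity the wait step simply re-reads cgroup.events on a short sleep where real code would poll() the file for change notifications.

```python
import os
import signal
import time
from pathlib import Path


def read_event(cgroup_dir, key):
    """Return the value of one key from the cgroup's cgroup.events file."""
    for line in Path(cgroup_dir, "cgroup.events").read_text().splitlines():
        name, _, value = line.partition(" ")
        if name == key:
            return value.strip()
    raise KeyError(key)


def freeze_and_kill(cgroup_dir, send_signal=os.kill, timeout=5.0):
    """Freeze the cgroup, wait for it, SIGKILL every listed PID, then thaw.

    Returns the list of PIDs that were signalled. Raises TimeoutError if the
    freeze does not complete within `timeout` seconds (a hypothetical knob).
    """
    freeze = Path(cgroup_dir, "cgroup.freeze")
    freeze.write_text("1")  # step 1: initiate the freeze

    # Step 2: wait until the kernel reports "frozen 1". Simplified re-read
    # loop; this is the part that can take arbitrarily long.
    deadline = time.monotonic() + timeout
    while read_event(cgroup_dir, "frozen") != "1":
        if time.monotonic() >= deadline:
            freeze.write_text("0")  # give up and undo the freeze request
            raise TimeoutError("cgroup did not freeze in time")
        time.sleep(0.01)

    # Step 3: nothing can fork now, so the PID list is stable.
    pids = [int(p) for p in Path(cgroup_dir, "cgroup.procs").read_text().split()]
    for pid in pids:
        try:
            send_signal(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # already gone

    freeze.write_text("0")  # step 4: unfreeze
    return pids
```

The wait in step 2 is exactly where the "slow" worst case comes from: a task stuck in a blocking syscall can hold up the whole operation.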

I love Christian's work on this, since it fixes the race for us *and* is always a quick operation. We don't have to wait for anything "slow". (I mean, it internally also iterates through all processes, so it's not O(1), but that's not what I mean by "slow"...) It just enqueues the SIGKILL for each process in a race-free fashion, and that's all we need.

So, yeah, I am looking forward to Christian's work landing, and we'll happily make use of it in systemd once it does. It fixes a real problem for us.

Lennart