Not a "bloated, monolithic system"?

Posted Jan 30, 2019 22:21 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
In reply to: Not a "bloated, monolithic system"? by zblaxell
Parent article: Systemd as tragedy

> No, cgroups solve that issue cleanly. systemd's cgroup controller is showing its age and doesn't solve the issue in some cases.
Well, systemd is not set in stone. What needs to be changed to fix it?

> Systemd's cgroup killer plays whack-a-mole with processes in a tight loop, instead of freezing the cgroup then picking off processes at leisure without looping.
PID controller might be even better suited for this, but in practice systemd is much faster at killing processes than the kernel is at forking.

> The concrete bug that we hit in production is that systemd assumes processes respond instantly to SIGKILL, and can be ignored after the signal is sent. On Linux, SIGKILL can take nontrivial amounts of time to process, especially if processes have big temporary files or use lots of RAM (e.g. databases). If a service restarts, and there was a timeout during the stop, systemd can spawn a new service instance immediately, which will try to allocate its own large memory before its predecessor has released it, OOMing the machine to death.
This doesn't sound right. systemd will wait until the cgroup is empty, by which time all the resources should be freed.

And I think you can increase the timeout time and/or disable it completely in this case.

Not a "bloated, monolithic system"?

Posted Jan 31, 2019 21:08 UTC (Thu) by zblaxell (subscriber, #26385) [Link] (22 responses)

> Well, systemd is not set in stone. What needs to be changed to fix it?

systemd is not written in stone, but it is unusually difficult to analyze, modify and deploy. Any day that starts with "understand how hundreds or thousands of units interact at runtime with a dynamic dependency resolver to produce some result" usually ends with "I need to do these 20 things, let's write a 25-line program that does those 20 things in the right order and run that program instead of systemd." I can never figure out how to turn that sort of day into a systemd pull request.

> in practice systemd is much faster at killing processes than the kernel is at forking.

That may be true, assuming no scheduler shenanigans; however, the loop in systemd that kills processes terminates when the cgroup contains no processes _that systemd has not already killed once_, so it's vulnerable to pid reuse attacks. If a forkbomb manages to get a reused pid during the kill loop, systemd will not attempt to kill it.

> This doesn't sound right. systemd will wait until the cgroup is empty, by which time all the resources should be freed.

It writes "Processes still around after SIGKILL. Ignoring." on the log and ignores the process(es). It might send SIGKILL to the same process again when the service restarts, but additional SIGKILLs aren't helpful.

There's no single right answer here, and no documented systemd configuration option I can find to address this case. The documentation talks a lot about waiting for various subsets of processes, but the code revolves around actions taken during state transitions for systemd services. These are not the same thing.

If the killed process is truly stuck, e.g. due to a kernel bug, then it will never exit. Processes that can't ever exit pollute systemd's service dependency model (e.g. you can't get to target states that want that service to be dead). systemd solves that problem by ignoring processes whose behavior doesn't fit into its dependency model.

If the killed process isn't stuck, but just doing something in the kernel that takes a long time, then to fit systemd's state model, we need to continue to wait for the process after we sent it SIGKILL. We only need to wait for the process to exit if we're going to do something where the process exit matters (e.g. start the process again, or umount a filesystem the process was using), but systemd doesn't have a place for such optimizations in its data model.

It's probably possible to do this in systemd with a helper utility in ExecStartPre to block restarts until the service cgroup is completely empty except for itself, but that's no longer "clean", and you'd have to configure it separately for every service that might need it.

> And I think you can increase the timeout time and/or disable it completely in this case.

I can't find a post-KILL timeout in the documentation or code, and I've already spent more time looking for it than I typically spend implementing application-specific service managers.

Not a "bloated, monolithic system"?

Posted Jan 31, 2019 22:16 UTC (Thu) by pizza (subscriber, #46) [Link] (15 responses)

> systemd is not written in stone, but it is unusually difficult to analyze, modify and deploy. Any day that starts with "understand how hundreds or thousands of units interact at runtime with a dynamic dependency resolver to produce some result" usually ends with "I need to do these 20 things, let's write a 25-line program that does those 20 things in the right order and run that program instead of systemd." I can never figure out how to turn that sort of day into a systemd pull request.

If you don't get a deterministic dependency graph for the things you need to happen in a given order, then you haven't sufficiently specified your dependencies. No dependency resolver can read your mind.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 3:56 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (14 responses)

> No dependency resolver can read your mind.

We don't want it to read our minds, or block devices, or assorted device buses, or any data source we didn't explicitly authorize it to consume. We want systems with predictable (or at least highly repeatable) behavior so they behave the same way in test and production.

Yes, we can translate our 25-line startup program into a simple dependency graph, lock down all the places where systemd could pick up extra nodes the graph, and dodge the parts of systemd that have builtin exceptions to the dependency evaluation rules. That's a lot of project risk, though, and sometimes significant cost, and--most important of all--nobody is paying us to do any of that work.

If we're doing safety or security audit on the system, the systemd dependency resolver (and any external code it executes) gets sucked into the audit since it's executing our dependency graph. If the choice comes down to "audit 25 lines of code" or "audit 25 lines of code and also systemd", well, one of those is cheaper.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 12:15 UTC (Fri) by pizza (subscriber, #46) [Link] (13 responses)

> If we're doing safety or security audit on the system, the systemd dependency resolver (and any external code it executes) gets sucked into the audit since it's executing our dependency graph. If the choice comes down to "audit 25 lines of code" or "audit 25 lines of code and also systemd", well, one of those is cheaper.

You're being disingenuous. That first statement should read "Audit 25 lines of code and also the shell interpreter [1] and also everything else the script invokes. [2]"

[1] ie bash dash or csh or busybox or perl or whatever it is that actually parses and executes that that script.
[2] Likely to include grep, psutils, and util-linux as well [3]
[3] Don't forget libreadline, glibc, libstdc++, and everything else the shell and those utilities depends on!

(When all of that "external code" is factored in, I suspect systemd will come out way, way ahead on the "least amount of total code that needs auditing" metric)

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 14:12 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (3 responses)

> Audit 25 lines of code and also the shell interpreter [1] and also everything else the script invokes. [2]

Already done for other projects, no need to do them again. Arguably if we ever did a systemd audit then we could reuse the result for multiple projects, but nobody wants to be the first.

> grep, psutils, and util-linux as well [3] Don't forget libreadline, ..., libstdc++,

We don't use 'em. It's mostly shell builtins, flock, maybe a couple of C wrappers for a couple of random kernel API calls. 'echo' and /sys/fs/cgroup are sufficient for a lot of use cases. Our scope ends once the application is execve()ed in a correctly configured environment and resumes when the application exits.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 15:18 UTC (Fri) by pizza (subscriber, #46) [Link] (2 responses)

> Already done for other projects, no need to do them again. Arguably if we ever did a systemd audit then we could reuse the result for multiple projects, but nobody wants to be the first.

What exactly do you mean when you say "audit"?

That could mean anything between "reviewing the license and patent overlap" to "line-by-line inspection/verification" of the sort that's needed to certify rocket avionics. Are you auditing algorithms and state machines to ensure all states and transitions are sane under any possible input? Are you auditing all input processing to make sure it can't result in buffer overflows or other sorts of security problems? Are your audits intended to ensure there's no leakage of personal information (eg in logs) that could run afoul of things like the GPDR?

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 22:31 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (1 responses)

> That could mean anything between "reviewing the license and patent overlap" to "line-by-line inspection/verification" of the sort that's needed to certify rocket avionics.

Closer to rocket avionics. Part of it is running the intended code under test and verifying that all executed lines behave correctly with a coverage analyzer. Sometimes you can restrict scope by locking down the input and verifying only the parts of the shell that execute, other times you swap in a simpler shell to interpret the service management code and audit 100% of the simpler shell.

> Are you auditing algorithms and state machines to ensure all states and transitions are sane under any possible input?

Attack the problem the other way: precompute an execution plan for the service dependency graph, ordered and annotated for parallel execution. The service manager just executes that at runtime. Conceptually, "shell script" is close enough to get the idea across to people, and correct enough to use in prototyping, but it might be a compiled or interpreted representation of the shell script by the time it gets certified. Add a digital signature somewhere to verify it before executing it.

Most of the time the execution plan can be computed and verified by humans: you need storage, then start networking and UI in parallel, then your application runs until it crashes, then reboot. Someone checks that UI and networking do not in fact depend on each other. In more complicated cases you'd want a tool to do the ordering task, so you provide the auditors with evidence your tool is suitable for the way you use it.

The execution plan does not look anything like sysvinit scripts. It does not have all the capabilities of systemd, since the whole point is to avoid having to audit the parts of systemd you didn't use. Only the required functionality ends up on the target system.

Normally these do not change at runtime except in tightly constrained ways (e.g. a template service can have a variable number of children). If for some reason you need a crazy-high amount of flexibility at runtime in a certified system, there is theoretically a point where the complexity curves cross over and it's cheaper to just sit down and audit systemd.

> Are you auditing all input processing to make sure it can't result in buffer overflows or other sorts of security problems?

If auditing something like a shell, you get a report that says things like "can accept input lines up to the size of the RAM in the system, but don't do that, that would be bad" and "don't let random users control the input of the shell, that would be bad".

So the problem for the service manager script reduces to proving that the shell, as used for the specific service manager input in the specific service manager environment, behaves correctly. This can be as little as some light code review and coverage testing (plus a copy of the shell audit report and a checklist of all the restrictions you observed).

> Are your audits intended to ensure there's no leakage of personal information (eg in logs) that could run afoul of things like the GPDR?

It's never come up. The safety-critical systems don't have any personal information in them, and the security-critical ones have their own logging requirements that are out of scope of the service manager.

In some cases you can mix safety-critical and non-safety-critical services on a system provided there is appropriate isolation between them. Then you audit the safety-critical service, the service manager, and their dependencies, and mostly ignore the rest.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 23:15 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> It's never come up. The safety-critical systems don't have any personal information in them, and the security-critical ones have their own logging requirements that are out of scope of the service manager.
I'm sorry. If your safety-critical systems have programs that can't be SIGKILL-ed cleanly and have several hundred of tightly-interconnected modules, then I want to run in the opposite direction from them.

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 18:55 UTC (Mon) by jccleaver (guest, #127418) [Link] (7 responses)

> You're being disingenuous. That first statement should read "Audit 25 lines of code and also the shell interpreter [1] and also everything else the script invokes. [2]"

And you're not living in the real world. Your shell is already going to be audited, or an upstream audit is taking place if it matters.

Shell scripts can be complex. A 25 line shell script that just imperatively executes 20 commands and then exec's into a final daemon is not complex, and is in fact the simplest possible way to accomplish a deterministic goal on a *nix system.

It boggles my mind that there are administrators out there that would consider some other solution as simpler. Why are people so scared of procedural programming using a *nix shell?

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 19:08 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> Shell scripts can be complex. A 25 line shell script that just imperatively executes 20 commands and then exec's into a final daemon is not complex, and is in fact the simplest possible way to accomplish a deterministic goal on a *nix system.
Nope. How do you track the daemon state? Are you sure PID files are correct? How do you kill the daemon in case it's stuck? What if you want to allow unprivileged users to terminate the service? Do you need a separate SUID binary? ...

Writing a robust shell initscript is incredibly hard. It's not 25 lines of code, many initscripts are hundreds lines of code (and are still buggy as hell).

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 19:20 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (1 responses)

> How do you track the daemon state?

cgroups, watchdogs, service pings...

> Are you sure PID files are correct?

Don't use 'em, because cgroups.

> How do you kill the daemon in case it's stuck?

cgroups

> What if you want to allow unprivileged users to terminate the service?

We don't.

> Do you need a separate SUID binary?

No.

> Writing a robust shell initscript is incredibly hard.

No.

> It's not 25 lines of code, many initscripts are hundreds lines of code (and are still buggy as hell).

No.

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 19:31 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> cgroups, watchdogs, service pings...
OK. Can you show me an init script that is 25 lines long AND uses cgroups to guaranteed the process termination?

During one of the systemd flamewars some years ago I tried to find an actual init script that uses cgroups for comparison with systemd units. I was not able to do it.

Writing it was also decidedly non-trivial, I had to create my own cgroup hierarchy convention and make sure that there are no race conditions in cgroups manipulation. Both are not easy at all, and I was not even trying to use cgroups controllers to limit the resource use.

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 20:30 UTC (Mon) by jccleaver (guest, #127418) [Link] (2 responses)

> Nope. How do you track the daemon state? Are you sure PID files are correct? How do you kill the daemon in case it's stuck? What if you want to allow unprivileged users to terminate the service? Do you need a separate SUID binary? ...

The OP was concerned about doing dependency ordering, the 25 line script was not the init script itself.

> Writing a robust shell initscript is incredibly hard. It's not 25 lines of code, many initscripts are hundreds lines of code (and are still buggy as hell).

No, it's not. On RedHat systems it's this (which, other than the URL, has not changed since basically 2004 - EL3/EL4):

1) Cut/paste this: https://fedoraproject.org/wiki/EPEL:SysVInitScripts#Inits...
2) Edit primary process's name and path
3) Add any additional custom logic your daemon needs

If your initscript for a basic daemon has a more complex structure than this, then you're probably doing something wrong. If your distribution forces you to do something more complex than this, then I'm sorry for you.

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 20:37 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> The OP was concerned about doing dependency ordering, the 25 line script was not the init script itself.
So now it's 25 lines for deps, then another 1k lines for cgroups manipulation in Bash.

Noted.

> No, it's not. On RedHat systems it's this (which, other than the URL, has not changed since basically 2004 - EL3/EL4)
Incorrect. This doesn't have PID file management, for starters.

RH /etc/init.d/functions tooling

Posted Feb 4, 2019 22:05 UTC (Mon) by jccleaver (guest, #127418) [Link]

> > The OP was concerned about doing dependency ordering, the 25 line script was not the init script itself.
> So now it's 25 lines for deps, then another 1k lines for cgroups manipulation in Bash. Noted.

cgroup *definition* is out of scope. To *assign* this process to a cgroup just set
CGROUP_DAEMON="cpu,memory:test1" (or whatever) in /etc/sysconfig/foo

> Incorrect. This doesn't have PID file management, for starters.

Incorrect. You can see in the start/stop sections that it's recommended to use the 'daemon' and 'killproc' functions, which handle PID file management for you and fall back to looking for the process's exec if not found. If your daemon does something weird in launch, you can have your daemon make the pidfile and pass that file name into the functions with -p.

The default 'status' function handles pid files automatically too.

All of this is in /etc/init.d/functions.
I'm sorry that you don't appear to be using a RedHat system, which is clearly the better distribution.

Not a "bloated, monolithic system"?

Posted Feb 4, 2019 19:10 UTC (Mon) by pizza (subscriber, #46) [Link]

> And you're not living in the real world. Your shell is already going to be audited, or an upstream audit is taking place if it matters.

So why is systemd being held to a different standard?

> It boggles my mind that there are administrators out there that would consider some other solution as simpler. Why are people so scared of procedural programming using a *nix shell?

Shell is used for the same reason that folks use a screwdriver as a chisel or a punch. Sure it's convenient, and often gets the job done, but there's a much higher chance of unintended, and often quite painful, consequences.

Not a "bloated, monolithic system"?

Posted Mar 5, 2019 7:26 UTC (Tue) by immibis (subscriber, #105511) [Link]

Those would be there whether you use systemd or not... also, who said it was a 25-line shell script?

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 6:12 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> It writes "Processes still around after SIGKILL. Ignoring." on the log and ignores the process(es). It might send SIGKILL to the same process again when the service restarts, but additional SIGKILLs aren't helpful.
I tried to replicate it by creating an unresponsive NFS share.

Systemd failed to kill the process (as expected) and the service entered the "failed" state. It was not restarted.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 14:14 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (4 responses)

> It was not restarted.

Not the bug I was looking for, but a bug nonetheless.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 20:15 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Not the bug I was looking for, but a bug nonetheless.
Uhm, why? A service that can't be SIGKILL-ed is clearly not safe to be restarted.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 22:11 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (2 responses)

>A service that can't be SIGKILL-ed is clearly not safe to be restarted.

Look up the page a little: long-running processes that take a long time to exit after SIGKILL, but eventually get there. You want to restart them, but only after they exit, and there's a big time gap between KILL and exit.

Not a "bloated, monolithic system"?

Posted Feb 1, 2019 23:23 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Then create a separate service responsible for the restart. Activate it on initial service's failure.

Duh.

Not a "bloated, monolithic system"?

Posted Feb 9, 2019 22:38 UTC (Sat) by nix (subscriber, #2304) [Link]

I'm sorry, but could you possibly be a little more contemptuous? It's not unpleasant enough to read this comment thread yet.