LWN: Comments on "Another Debian init system vote called"

Another Debian init system vote called

hummassa — Mon, 17 Feb 2014 13:21:22 +0000

That, per se, would mean it was a terrible joke...

:-(

iabervon — Thu, 13 Feb 2014 18:43:24 +0000

It's essentially a standard filibuster mechanism: if a slight minority wants to block passage, they can do so, but it costs them reputation to do so if all of the substantive issues have been raised. That is, the upstart supporters on the TC could vote FD, but if they transparently do so to prevent systemd from winning, then the community sees the most prominent upstart supporters within the community obstructing progress, which would hurt them in a GR later.

Regardless of the default init system for jessie, there's the question of which init systems will be available, now and in the future. Supporters of upstart are limited in what they can do tactically on this vote due to the long-term need to be not too much trouble to have in Debian.

Another Debian init system vote called

Jonno — Thu, 13 Feb 2014 14:34:43 +0000

> you mean in upstart you cannot depend on a service started via init script?

In upstart, sysvinit scripts aren't first-class citizens; instead there is a single upstart job that runs all enabled sysvinit scripts that don't have a corresponding upstart job.

You can order other upstart jobs before or after that job, but there is no (easy) way to know what sysvinit scripts it starts or if any of them failed.

It is also plain impossible to have a sysvinit script that depends on an upstart job that depends on a sysvinit script.

Another Debian init system vote called

nix — Thu, 13 Feb 2014 14:32:56 +0000

I read somewhere that he contributed to Debian for 13 years.

I'm reasonably sure it's longer than that. Thirteen years ago would be 2001. He was *DPL* in 1998. He rewrote dpkg in 1994/1995... so call it twenty years, minimum.

Another Debian init system vote called

nix — Thu, 13 Feb 2014 14:27:58 +0000

Ian suddenly (and sensibly) disappeared himself for several days, to calm down. Does that count?

Another Debian init system vote called

hummassa — Wed, 12 Feb 2014 17:48:12 +0000

you mean in upstart you cannot depend on a service started via init script?

Another Debian init system vote called

smurf — Wed, 12 Feb 2014 17:20:22 +0000

inetd is a half-assed solution which doesn't even try to solve all the other problems you commonly encounter when administrating a service; in fact, it made that more difficult to accomplish.

For instance, sometimes you really need to stop a service temporarily. inetd? Edit inetd.conf and reload inetd?? Forget it, the first thing I'd need to do is to figure out which signal to use for that …

Or restart/reload. With inetd you can only do that by killing the service manually; there's no integrated system here, and, again, there is wild inconsistency over whether you use SIGHUP or SIGUSR1 or whatever to tell a daemon to reload its config. At least "/etc/init.d/apache reload" actually works without remembering which signal to use. Or which file the PID is stored in.
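To make the "which signal, which PID file" point concrete, here is a minimal Python sketch of what a "reload" action has to do by hand. The function name and the default of SIGHUP are assumptions for illustration only; as noted above, each daemon may expect a different signal and keep its PID somewhere else.

```python
import os
import signal


def signal_from_pidfile(pidfile, sig=signal.SIGHUP):
    """Read a PID from `pidfile` and send `sig` to that process.

    Hypothetical helper: the caller must already know both the right
    PID file and the right signal for the daemon in question.
    """
    with open(pidfile) as f:
        pid = int(f.read().strip())
    # Signal 0 performs only the existence/permission check.
    os.kill(pid, sig)
    return pid
```

Note that nothing here guards against a stale PID file, so the signal can reach an unrelated process that has since reused the PID.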

In addition, some services listen on multiple sockets … or these days dbus … or they must not start prematurely. For instance, I cannot socket-activate Apache because it depends on mysql to run beforehand, and I cannot socket-activate mysql (this would solve the dependency problem by way of simply stalling Apache) because local connections use a unix-domain socket.

systemd just does away with (almost) all of these problems.

Another Debian init system vote called

paulj — Wed, 12 Feb 2014 16:18:15 +0000

I never said inetd was an argument against systemd. Indeed, if systemd manages to bring back socket-activated services, that'd be great.

I'm just curious as to why socket activation didn't take over the world the last time, in inetd. If there were clear reasons, it'd be useful to understand those - particularly any technical ones.

Another Debian init system vote called

anselm — Wed, 12 Feb 2014 16:10:57 +0000

Inetd did allow children to keep running and handle multiple requests, and it would have been trivial to have extended it to hand on further sockets, e.g. the listen socket for TCP, had inetd become more widely used. Why didn't it, and its socket activation, become the default choice for managing services?

People probably thought it was bloated, overengineered, and against the One True Unix Philosophy™ …

I keep failing to see how »we could do everything systemd can do with SysV init and inetd if only we wrote a little missing code« is an argument in favour of the status quo and against systemd. The status quo will still be inconsistent, badly documented, and difficult to get to grips with, what with the different configuration mechanisms, file formats, and so on.

Inetd, for example, doesn't let you control which interface out of several a service binds to, so you don't, say, get to run an inetd-based service on localhost only. This is a pretty obvious and useful feature and could probably be added fairly easily – Postfix's »master« process, which basically amounts to a glorified inetd, supports it, for one –, but the truth of the matter is that during the last three decades or so of inetd's existence nobody actually condescended to do so. (Nowadays there is xinetd, of course, but there you have yet another piece of software that comes with its own configuration file format and is just as badly integrated with the rest of the SysV init world as inetd was.)

Another Debian init system vote called

paulj — Wed, 12 Feb 2014 15:30:57 +0000

What other drawbacks? Given systemd socket activation is basically extending on what was there in inetd, what drawbacks could it have that do not apply to systemd as well?

(In another comment on another, earlier story, I asked why it was that inetd fell out of popularity, after a number of early network services were programmed to its style, and Unix programmes went back to daemonisation - I'm still curious to hear views on this. Inetd did allow children to keep running and handle multiple requests, and it would have been trivial to have extended it to hand on further sockets, e.g. the listen socket for TCP, had inetd become more widely used. Why didn't it, and its socket activation, become the default choice for managing services?)

Another Debian init system vote called

mathstuf — Tue, 11 Feb 2014 19:33:06 +0000

I wonder if there could be one cgroup manager per namespace (specifically PID namespace since that seems like the best qualified one). When creating a new one, the kernel asks the parent namespace's manager to portion off a subtree to delegate to the container's manager then passes that off to the container (if requested). Of course, I don't use the cgroup API directly and have no idea how feasible/useful this is :) .

Another Debian init system vote called

anselm — Tue, 11 Feb 2014 11:50:57 +0000

The main goal of the double-fork method is to let the actual service process disassociate itself from the process that started it (usually a shell, whether it has been started from the command line or from an init script). This makes it immune from signals that are sent to the parent and then propagated to its children, for example if a login shell session ends.

This works because if a process exits, its children are implicitly adopted by the init process (PID 1). Hence the actual service process becomes a direct child of PID 1, and as such PID 1 is notified if it exits. Systemd can use this information much more profitably than Sys-V init, because Sys-V init doesn't even know (or care) what sort of process it is dealing with, so it is in no position to, e.g., restart the service.

The downside is that it is very difficult to figure out the PID of the actual service process from the outside, since the shell that launched the daemon is only told the PID of the intermediate process which forks the actual service process. This is why daemons often write their eventual PID to a file, which sort-of works but is frankly not a great way of dealing with the problem due to possible race conditions, collisions between multiple instances of the same daemon wanting to write to the same file, etc.
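A minimal Python sketch of the double-fork sequence described above, for illustration only. The function name is hypothetical, and the pipe used to report the service PID back is an assumption standing in for the PID file that daemons traditionally write; it is not part of the classic recipe.

```python
import os


def double_fork_daemon(service):
    """Run service() in a grandchild; return (intermediate_pid, service_pid)."""
    read_fd, write_fd = os.pipe()
    intermediate = os.fork()
    if intermediate == 0:
        # First child: detach from the launcher's session, then fork again.
        os.close(read_fd)
        os.setsid()
        grandchild = os.fork()
        if grandchild == 0:
            # Grandchild: this is the actual service process.
            os.close(write_fd)
            try:
                service()
            finally:
                os._exit(0)
        # Report the real service PID, then exit so that the grandchild
        # is re-parented (normally to PID 1).
        os.write(write_fd, str(grandchild).encode())
        os.close(write_fd)
        os._exit(0)
    # Launcher: fork() only told us the intermediate PID.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(intermediate, 0)
    return intermediate, int(b"".join(chunks))
```

The launcher learns only `intermediate` from `fork()`; the service PID has to travel over some side channel, which is exactly the gap that PID files try to fill.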

Another Debian init system vote called

smurf — Tue, 11 Feb 2014 11:01:12 +0000

> But why write a sub-optimal daemon in the first place? Why would you want to use this setup/fork/exit model?

Precisely because the setup occurs first. Thus an exit code =0 means "all OK, I'm up", whereas !=0 tells the startup manager "something went wrong, halt the sequence". (Not that SysV init actually cared …)

Systemd has sd_notify() which can be used to convey other information ("I'm alive but still checking the database") without, unlike SysV init, blocking the rest of the startup sequence.
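For readers who have not seen it, the notification protocol behind sd_notify() is small: the daemon sends a datagram such as "READY=1" or "STATUS=..." to the AF_UNIX socket named in the NOTIFY_SOCKET environment variable. The following is a rough Python sketch of that protocol, not the libsystemd implementation; the Python function is a stand-in written for illustration.

```python
import os
import socket


def sd_notify(state):
    """Send a state string (e.g. "READY=1") to the service manager.

    Returns False when NOTIFY_SOCKET is unset, i.e. when the process was
    not started by a notification-aware manager.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

A daemon might call `sd_notify("STATUS=checking the database")` while it is still initialising and `sd_notify("READY=1")` once it can accept requests.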

the GR will be ugly

cesarb — Tue, 11 Feb 2014 10:38:24 +0000

If they are the same ones I read, these were three posts by the same guy, repeating himself. I'd count them as one.

A single outlier, who is probably not even a Debian Developer, tells nothing about how badly the GR will go.

Another Debian init system vote called

smurf — Tue, 11 Feb 2014 10:31:26 +0000

You know what? I'm going to go forward under the assumption that systemd will win the ‹censored› GR, if and when there is one.

Enough time has been wasted already.

Another Debian init system vote called

anselm — Tue, 11 Feb 2014 10:29:11 +0000

If you look at Unix systems programming books from the 1990s (e.g., the books by W. Richard Stevens) you'll find that this actually used to be the recommended method of writing Unix daemons -- probably because the support for service startup and tracking from System-V init was so dismal.

Thanks to modern init systems like systemd, we are now in the fortunate position of being able to do away with much of the low-level support stuff that used to be part and parcel of the C code of Unix daemons in the old days. (You could do this to a certain degree with inetd, but that had other drawbacks.) Even on Debian, start-stop-daemon is considered something of a hack, and you can't rely on its presence on other systems.

Another Debian init system vote called

paulj — Tue, 11 Feb 2014 09:01:35 +0000

And whatever those security problems are, they do not get fixed by exporting this API again via a user-space IPC API. To fix them requires knowing which ones are sensitive and limiting access. This could surely be done just as easily by restricting the permissions on whichever knobs in the fs-based API are security-sensitive - once you know which!

Another Debian init system vote called

tzafrir — Tue, 11 Feb 2014 08:40:53 +0000

Ubuntu package maintainers generally have a good track record of merging their changes back to Debian, where applicable.

Another Debian init system vote called

tzafrir — Tue, 11 Feb 2014 08:37:21 +0000

> Also, while the PIDFile option is recommended, it defaults to
> the child process, which should be correct for most daemons
> using the setup/fork/exit model.

But why write a sub-optimal daemon in the first place? Why would you want to use this setup/fork/exit model?

If you want compatibility with SystemV: on Debian, and with busybox, you can use start-stop-daemon to fork into background and avoid having that code in your daemon.

Another Debian init system vote called

kugel — Tue, 11 Feb 2014 08:19:00 +0000

I tend to agree, if that's true. It violates encapsulation of containers, too.

Another Debian init system vote called

iq-0 — Tue, 11 Feb 2014 08:13:21 +0000

The fd per task is just the basic handle that you can use to track a process and get a handle for its descendants. You'd obviously need a way to get handles for its descendants in a corresponding fashion. Then you have enough to actually track a whole group (and even report if some descendant is killed if you'd want that).

I understand that the current cgroup functionality can be used to effectively infer that same information, but this mechanism (and as it appears to be present in FreeBSD) can be used to get the same information in a generic way (even when you're not the cgroup controller, think apache tracking its children and any subprocesses they spawn in a correct way).

Another Debian init system vote called

smurf — Tue, 11 Feb 2014 07:04:36 +0000

Given Canonical's track record WRT working with upstream (NOT), my first impulse would be for Debian to use anything but upstart. Why should Debian do their work? They don't give anything back, thanks but no thanks to their restrictive CLA.

Fortunately there are other – technical – reasons to use systemd instead …

Another Debian init system vote called

Cyberax — Tue, 11 Feb 2014 06:26:37 +0000

Lots of systems regularly monitor cgroups for stuff like memory usage or disk IO bandwidth monitoring. Pulling all this information through FUSE is not exactly super-fast.

Besides, lots of operations would require multiple roundtrips. I'm experimenting with it right now, actually. And there ARE race conditions with this FUSE interface - it's very hard to reliably intercept changes in the cgroups from a userspace filesystem driver.

Another Debian init system vote called

smurf — Tue, 11 Feb 2014 06:13:59 +0000

Sure, FUSE has a context switching overhead. So what? You're not going to call it after set-up.

Another Debian init system vote called

mchapman — Tue, 11 Feb 2014 02:49:21 +0000

> So? If there are security issues then won't users be able to exploit them through systemd?

That depends on what the security issues are, which -- as I think you rightly point out -- have not been clearly explained anywhere.

I do not doubt for a second that Tejun and the other kernel developers have very good reasons for making these changes. "It's too complex in its current form" is a good reason. "It has security problems" would be as well... if we knew what those security problems actually were.

Another Debian init system vote called

Cyberax — Tue, 11 Feb 2014 02:06:22 +0000

So you admit that there are no real reasons for the single writer.

> Look there is a very clear separation in usage cases.
Nope. It's quite common to have very mixed systems.

> And then, there's aws-like container as a service wild west...where you bloody well can't trust nobody to do nothing but try to steal each other's magic beans or bitcoins or system resources.
So? If there are security issues then won't users be able to exploit them through systemd?

>Choose the API that works best for your usage case.
So if I want to use one API then I'll have to stop using systemd. Great!

Another Debian init system vote called

jspaleta — Tue, 11 Feb 2014 01:54:59 +0000

I am so very overjoyed to hear that you are tired of this crap. That's awesome. I look forward to a lack of followups from you if this is true. Though I must admit, I do harbor a great fear that you are in fact not tired of this and will find within yourself the strength to continue discussing this topic again and again months and months into the future.

But on to the points:
2) uhm it would be more correct to say _the_ cgroup hierarchy...even the cgmanager defined one...once that manager is available as a usable tool. I very much doubt that the cgmanager hierarchy will fare any better when being meddled with arbitrarily than a systemd managed one.

3) No one says that such issues would be inherently unfixable. I believe the reasoning is that putting the policy mechanism for all the controllers into a manager codebase is going to make it easier to effectively mitigate security concerns given the complexity of the controllers...for now. It's a judgement call on how to best deal with the problem given the state of the code right now.

And no the old deprecated API doesn't _need_ to be fixed if the new API is meant to address the use cases where hostile containers need to be factored in.

Look there is a very clear separation in usage cases. There is the centrally cultivated environment case, HPC and Google, where the containers are well groomed upstanding and forthright citizens of their respective universes and there's no expectation that containers are going to cross the multiverse boundaries and find themselves running on an arbitrary system.

And then, there's aws-like container as a service wild west...where you bloody well can't trust nobody to do nothing but try to steal each other's magic beans or bitcoins or system resources.

Choose the API that works best for your usage case. If you are only running containers you've designed, and unless you suffer from the same multiple personality disorder that I do, then you probably don't have to worry about your hand crafted containers being maliciously designed to disrupt your own system. But if you are like me, and I pray that you are not too similar, because really the world couldn't handle another me, then you will probably want to move to the new API as soon as you can, to protect yourself from yourself.

Another Debian init system vote called

paulj — Tue, 11 Feb 2014 01:37:09 +0000

If certain memcg knobs are not to be handed to untrusted users, then why not just set the fs permissions to not be writeable by the untrusted users???

Another Debian init system vote called

Cyberax — Tue, 11 Feb 2014 01:23:43 +0000

Ok, let's see. Basically, it all boils down to two issues:

1) Bad controllers that do not nest properly (blkio). They are being reworked for the unified tree.

2) "We don't trust users not to mess our precious systemd hierarchy". No comments.

3)
>There are also security implications. memcg control knobs directly
>regulate the operation of memory reclaim and writeback. I wouldn't be
>surprised if there are pretty easy ways to make them go bonkers while
>staying inside the limits from the parent. Again, think of sysctl.
>You don't wanna hand these out to untrusted entities.

Yet again, the mysterious unfixable security problems that no-one knows about.

Guess what? If the old API is still going to be supported then these issues are called "security bugs" and must be fixed.

And there are no other real issues. I'm really 100% tired of that crap. I'm serious, all the justifications are:
>I think it generally is a good idea to have a buffer layer between the kernel interface and individual consumers for cgroup

Really?

Another Debian init system vote called

jspaleta — Tue, 11 Feb 2014 01:04:22 +0000

At best, you have experience using the old API and relying on applications to exercise PAXControlGroup-like self-restraint and not muck around with the hierarchies and controllers. And when you need to take actions which break the rules laid out in PAXControlGroups, this is done manually or via locally developed scripts that are crafted specifically for your environment and are not considered generally usable (like Google's case).

All this proves is that multi-hierarchy allows well-behaved multiple-writer applications to choose to forgo certain cgroup capabilities as part of automation. And that's giving you the benefit of the doubt that you actually have multiple applications acting as concurrent cgroup writers in your HPC configuration at all and your setup isn't entirely confined to a single tool doing the manipulation.

But I can only guess as to the details of your configuration, and whether your containers are all managed internally by your in-house admin team (like Google's use case) and thus considered non-hostile. Or if you are allowing for externally maintained containers to be spun up by external admins (containers as a service) and thus need to consider individual containers as potentially hostile to other containers.

Regardless of the details of your configuration, I do not think experience with the old API has been shown to translate into relevant experience with the new API that enforces a single hierarchy model.

Kernel developers are keeping the old API around.

https://lkml.org/lkml/2013/4/5/535
Tejun speaks to filesystem delegation quite specifically in last year's status-quo post. There are 3 or 4 paragraphs as to why he thinks it's not going to work. You disagree with him... noted. But he did speak to it.

And in that discussion container delegation was specifically brought up:
https://lkml.org/lkml/2013/4/9/176
Tejun responds:
https://lkml.org/lkml/2013/4/9/581

Now you can brush aside his provided example of memcg control knob impact if you want. And you can dismiss his desire for people to think of cgroups moving forward as a sysctl interface instead of as a filesystem or more like virtual machine boundaries from a security perspective. But he does make an effort to point out the impact of what delegation could mean for usage patterns that have to consider untrusted hostile containers running on the system.

Now I think if your experience so far is grounded in a well groomed, centrally administered container environment, then certainly it's understandable that you may not need to worry about hostile containers and you might not be able to muster the empathy to care about other use cases. So the particular security impacts might feel a little contrived for you and you might have some difficulty understanding wtf he's trying to talk about with regard to untrusted containers. But it's not hard to imagine how someone using a public container as a service configuration, one which lets someone as evil as myself spin up containers on big iron in the cloud, might want to take advantage of the sort of problems Tejun is speaking to.

Now he seems to think that making filesystem-like delegation work properly in cgroupfs as it is is going to be the harder way forward. So he's making a judgement call on implementation design. You disagree with the call. And as you see in the thread, he acknowledges that if someone can show the delegation stuff is going to work out in the workman project, he'll reconsider. But they have to do the work and _prove_ it to him. Note, workman shut down as a project... so I take that as meaning no one disagreed with him enough to do the work to get delegation working reliably.

So here we are. Can I please have 50 cents?

Another Debian init system vote called

Cyberax — Tue, 11 Feb 2014 00:25:03 +0000

So we'll chalk it up as: "No".

And yes, I did research and found nothing useful.

Another Debian init system vote called

fandingo — Tue, 11 Feb 2014 00:16:48 +0000

Can you do your own research for once? I've done plenty for you.

Another Debian init system vote called

Cyberax — Tue, 11 Feb 2014 00:01:18 +0000

> You need a policy that allows the namespaced root (based on the outer UID) to modify the cgroup configuration for that cgroup. Then, you need to connect to DBus and interact with the API just like normal.
How?

With delegation it's simple "mount --bind /sys/fs/cgroup/<path>/something /containers/<mycontainer>/sys/fs/cgroup". That's it.

Can you provide a similar policy or whatever for DBUS?

Another Debian init system vote called

Cyberax — Mon, 10 Feb 2014 23:58:49 +0000

> You've shown they are orthogonal? I believe you've stated they are orthogonal, I've seen no testable proof that they are.
What kind of proof do you need? I've been using cgroups in production for more than two years, mostly for HPC. So I do know a little about their implementation details and I just don't see the issues.

> And to be clear when you say cgmanagerd you mean cgmanager that started development in second half of 2013, based on libnih and has yet to have a public release, nor any stable API documentation made publicly available?
It might be anything. Including a distribution that keeps away from systemd on purpose but still wants to use cgroups for its own purposes. Google does this, for example.

> Expecting the systemd developers to anticipate the development of cgmanager with a competing API, seems a bit...silly.
Kernel developers should expect it. If there were some kind of delegation then it'd be a non-issue - a container would simply receive a bind-mounted group as the root of its cgroup tree.

Then the containerized application can use whatever tools it wants to manipulate its subtree. And the parent container still owns everything above that subtree and can do whatever it wants to do. Including terminating the child or moving it into another tree partition.

We're doing it that way right now, except we need to bind several controllers (memcg, cpu, freezer, etc.) instead of just one unified tree.

With the single writer mode this scenario becomes impossible. And for no good reason.
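As a rough illustration of what "use whatever tools it wants to manipulate its subtree" amounts to, here is a Python sketch of a process inside the container creating a sub-group under its bind-mounted root and moving a task into it. The function name and the example path are hypothetical; the sketch assumes the container has write permission on the delegated subtree and that the controller exposes the usual `cgroup.procs` file.

```python
import os


def create_child_cgroup(delegated_root, name, pid):
    """Create a sub-cgroup under a delegated subtree and move `pid` into it.

    On a real cgroup filesystem, mkdir creates the group and the kernel
    populates its control files; writing a PID to cgroup.procs moves that
    process into the group.
    """
    path = os.path.join(delegated_root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return path
```

Inside a container this might be called as `create_child_cgroup("/sys/fs/cgroup", "workers", os.getpid())`, with everything above the bind mount remaining under the parent's control.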

Another Debian init system vote called

paulj — Mon, 10 Feb 2014 23:52:05 +0000

So you need to have the container know it's in a container, so that the manager inside it can know to connect via IPC through the kernel to an outside manager, so the inside manager, rather than talk to the kernel directly to manage the cgroups for processes inside its container, can have the outside manager manage those cgroups instead, on behalf of the inside manager? The outside manager has to place sockets inside the container and its namespace, so it can be connected to of course.

Why is this better than having the kernel, which is always there, in every namespace, in every container, provide the API directly? Arbitrating between processes for access to resources is the kernel's entire point for existing.

It seems a very very strange "kernel" API.

Another Debian init system vote called

fandingo — Mon, 10 Feb 2014 23:40:21 +0000

I don't understand why you're being so obtuse about this whole thing. Just go write your own version already and quit the complaining that's been going on for months. It's clear that you won't be happy until you do so.

> how do I expose cgroups API in a namespaced container?

You need a policy that allows the namespaced root (based on the outer UID) to modify the cgroup configuration for that cgroup. Then, you need to connect to DBus and interact with the API just like normal.

> Oh, and it might use cgmanagerd instead of systemd.

Of course if the API is different, you will have to use that different API. It's useless to keep bringing this point up. Everyone is aware of it, and there's no indication that it will actually pose a problem.

> And no answers.

Only if you require that every answer is ponies, rainbows, and cgroupfs.

Another Debian init system vote called

jspaleta — Mon, 10 Feb 2014 23:30:46 +0000

You've shown they are orthogonal? I believe you've stated they are orthogonal, I've seen no testable proof that they are.

And to be clear when you say cgmanagerd you mean cgmanager that started development in second half of 2013, based on libnih and has yet to have a public release, nor any stable API documentation made publicly available? Or are you talking about another, older project that I am not aware of that predates April 2013, that systemd maintainers could have known about and anticipated when building their slice model and associated API? Because if you are talking about libnih cgmanager, I'm not sure how anyone could adequately support its API yet, considering it's still in development. I'm not aware of any public documentation for a stable API to delegate to.

Expecting the systemd developers to anticipate the development of cgmanager with a competing API, seems a bit...silly. Especially since Canonical/Ubuntu devs had previously shown a willingness to replicate logind API as is. It's more reasonable to expect that a second implementation would have re-used systemd's API...and then well there's only one API and there's no need to...delegate per se... you just use the one API and you don't have to care what is implementing it. Meh.

I still don't fully grok how cgmanager is going to deal with the sane_behavior rework.. from my admittedly skimmed look at the code, it looks like it's written to work with the old API exclusively. My understanding of what cgmanager is designed to achieve is somewhat hampered by a lack of documentation. But it's still in early development, and has not had a public release, so I'm more than willing to give them the benefit of the doubt as to technical sufficiency until it's got at least a first stable release out and the devs expect it to be widely testable for general workloads. As far as I can tell, they are nowhere near that point, so meh.

-jef

Another Debian init system vote called

Cyberax — Mon, 10 Feb 2014 23:06:36 +0000

> I think you missed something subtle in Tejun's original discussion concerning pie-in-the-sky end goal of merging cgroups into the process hierarchy. I think that end goal is very much something that everyone will enjoy.
The correct word, I think, is 'tolerate'.

> However, I think the getting from existing cgroups api to that end goal, is going to require some forbearance with regard to allowing less than optimal situations between now and that potential end goal.
What _is_ the end goal? As I've shown, single hierarchy and single writer are completely orthogonal.

And systemd resource API is shit. Right now there's no way to delegate to cgmanagerd, for example. Or must _everyone_ use systemd even inside containers?

> And remember, the existing API is still there, its not going away any time soon.
Yeah, sure. There'll be two choices: current API with lots of missing features like blkio nesting and new insane API with all the features but one writer.

Another Debian init system vote called

Cyberax — Mon, 10 Feb 2014 23:02:39 +0000

Nope, I'm saying that I want a transparent and easy to use API. That is, filesystem delegation.

Now, can you try to answer this simple question - how do I expose cgroups API in a namespaced container? I.e. the 'root' user in that container should be able to do any cgroup changes in it.

Oh, and it might use cgmanagerd instead of systemd.

Lots of questions, isn't it? And no answers.

Another Debian init system vote called

nybble41 — Mon, 10 Feb 2014 22:52:50 +0000

I did know about "Type=forking", but as far as I can tell the only downside to using the default "Type=simple" for such daemons is that systemd won't wait for the communication channels to be configured--just like running the same daemon in the foreground. Is there some other side-effect I'm missing?

The foreground daemon model with either socket activation or notifications ("Type=notify") is certainly preferable if you can use it, but either type requires support from the daemon. Putting the daemon in foreground mode *without* socket activation or sd_notify() leaves systemd without any way to see when the daemon is ready to accept requests. "Type=forking" is pretty much the only race-free way to start a daemon not specifically modified for systemd.

Also, while the PIDFile option is recommended, it defaults to the child process, which should be correct for most daemons using the setup/fork/exit model.
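To give a sense of how small the daemon-side support for socket activation is, here is a Python sketch of the environment-variable convention that sd_listen_fds() implements: the manager sets LISTEN_PID and LISTEN_FDS, and the inherited descriptors start at fd 3. The Python function below is an illustrative stand-in, not the libsystemd code.

```python
import os

# First inherited descriptor under the sd_listen_fds() convention.
SD_LISTEN_FDS_START = 3


def listen_fds():
    """Return the file descriptors passed in by the service manager.

    Returns an empty list when the environment was not set up for this
    process, e.g. when the daemon is started by hand in the foreground.
    """
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []
    count = int(os.environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))
```

A daemon would then wrap each descriptor with something like `socket.socket(fileno=fd)` instead of creating and binding its own listening socket, and fall back to the traditional path when the list is empty.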