How Debian managed the systemd transition

Posted Sep 16, 2015 22:54 UTC (Wed) by luto (guest, #39314)
In reply to: How Debian managed the systemd transition by josh
Parent article: How Debian managed the systemd transition

> What's the *advantage* of trying to launch dbus-daemon from the initramfs?

It has a big advantage over the current scheme: you can start it early and still don't need to worry about restarting it. It has no advantage over kdbus in terms of where the code is loaded from or ease of initialization, but I don't think it has a disadvantage either.

AIUI, the current scheme comes from the idea that all userspace code running after startup must reside on a non-initramfs mount. I've heard people say that it's not even possible to keep an initramfs program running after pivot_root. This is simply incorrect. Back when initramfs was actually ramfs, it wasted unpageable memory (just like kernel code), but initramfs is tmpfs nowadays.

Heck, there's no fundamental need for systemd to re-exec itself after pivot_root either, although, given that daemon-reexec is well-supported, it's probably a good idea from a forced testing and memory conservation perspective.

As a concrete, if dubious, benefit, udevd really could depend on dbus even without kdbus. Just require that dbus-daemon be started before udevd. (If this happened, I would drop udevd as part of the virtme minimal guest and I'd seriously consider busybox's udev as an alternative, but that's a bit off-topic.)

How Debian managed the systemd transition

Posted Sep 16, 2015 23:08 UTC (Wed) by josh (subscriber, #17465) [Link] (26 responses)

> It has a big advantage over the current scheme: you can start it early and still don't need to worry about restarting it. It has no advantage over kdbus in terms of where the code is loaded from or ease of initialization, but I don't think it has a disadvantage either.

It does have at least two disadvantages there. First, getting dbus-daemon and all of its dependencies into the initramfs would prove rather annoying. Statically linking it isn't a solution (distros, dependency management, and static linking don't mix well), and adding a pile of libraries to the initramfs doesn't appeal. But even after doing that, which is certainly doable, "don't need to worry about restarting it" is a bug, not a feature; dbus-daemon is apparently utterly incapable of handling a restart, but it needs to restart on upgrade. kdbus doesn't have that problem, because it doesn't need a userspace daemon. (It needs some initial setup, but systemd does that, and systemd handles upgrades just fine.)

How Debian managed the systemd transition

Posted Sep 16, 2015 23:24 UTC (Wed) by luto (guest, #39314) [Link] (25 responses)

> dbus-daemon is apparently utterly incapable of handling a restart, but it needs to restart on upgrade. kdbus doesn't have that problem, because it doesn't need a userspace daemon. (It needs some initial setup, but systemd does that, and systemd handles upgrades just fine.)

I think your argument here is a bit confused. dbus-daemon is indeed apparently utterly incapable of handling a restart, so you can't upgrade it without rebooting (or blowing up everything that depends on it). But the kernel and, hence, kdbus, is utterly and completely incapable of being upgraded without rebooting, so the behavior is similar.

I would argue that the real issue with current distros is that they might actually try to upgrade dbus-daemon on disk *and restart it without rebooting*, which is doomed unless a *userspace* dbus daemon gets a major rewrite.

So I still don't see how kdbus is any better at all in this regard, aside from the fact that distros have already figured out how to build the kernel as a self-contained thing but might have trouble building a minimal static dbus-daemon. (It would work fine as a dynamic library with eager binding, too, but that's ugly.)

I'll grant that kdbus is probably a much more streamlined, self-contained piece of code than dbus-daemon, but that's more or less irrelevant wrt this issue.

Also, the userspace approach has a huge advantage here: you can run different versions of it in different containers.

How Debian managed the systemd transition

Posted Sep 17, 2015 0:27 UTC (Thu) by einstein (guest, #2052) [Link] (3 responses)

> But the kernel and, hence, kdbus, is utterly and completely incapable of being upgraded without rebooting, so the behavior is similar.

Actually, the kernel can be upgraded without a reboot. I was using ksplice for that back in 2009 or so, and the feature is coming together in mainline.

How Debian managed the systemd transition

Posted Sep 17, 2015 0:32 UTC (Thu) by luto (guest, #39314) [Link] (2 responses)

If someone implements a reboot-less upgrade from x.y to x.(y+1), and it actually works, I will personally buy them the beer or tasty non-alcoholic beverage of their choice*. Snapshotting the world and restoring using CRIU or similar tools doesn't count.

I've gotten emails from the ksplice team asking me how the heck they're supposed to handle a small number of individual entry changes I've made, and those are tiny compared to replacing the whole kernel.

* Within some reasonable limits.

How Debian managed the systemd transition

Posted Sep 17, 2015 0:47 UTC (Thu) by josh (subscriber, #17465) [Link]

> Snapshotting the world and restoring using CRIU or similar tools doesn't count.

I'd argue that if you can successfully save userspace, kexec a new kernel, and seamlessly reload userspace, that's a huge accomplishment that counts as a "live" kernel upgrade.

How Debian managed the systemd transition

Posted Sep 22, 2015 16:20 UTC (Tue) by jejb (subscriber, #6654) [Link]

> If someone implements a reboot-less upgrade from x.y to x.(y+1), and it actually works, I will personally buy them the beer or tasty non-alcoholic beverage of their choice*. Snapshotting the world and restoring using CRIU or similar tools doesn't count.

Hey, that's not fair: to go from n to n+1 you know the only way is to save and restore the kernel state in a version independent manner, so you're trying to define the only possible method out of your challenge. The problem with the method is the time it takes, but there are people working on it:

https://sslab.gtisc.gatech.edu/pages/kup.html

How Debian managed the systemd transition

Posted Sep 17, 2015 1:07 UTC (Thu) by josh (subscriber, #17465) [Link] (20 responses)

From your comment, you seem to think of kdbus as "dbus-daemon in the kernel", which explains why you consider it analogous that dbus-daemon can't handle live upgrades and that the kernel can't. I was commenting from the point of view that kdbus isn't "dbus-daemon moved into the kernel", but rather "DBus without dbus-daemon". The only userspace setup it needs (ignoring temporary compatibility shims for dbus-daemon) is to mount it. By contrast, dbus-daemon 1) offers bus services of its own, which gain new methods over time, 2) has a pile of evolving userspace configuration bits, and most critically 3) doesn't always function properly when new libraries run against old dbus-daemon or vice versa. None of those issues apply to kdbus.

(I'm going to ignore the case of unloading and reloading kdbus.ko, here, because I doubt you can do that without stopping all dbus users, so that doesn't count either. It does mean you could upgrade kdbus without upgrading the kernel, but that won't make sense once kdbus gets merged into the kernel. It also doesn't address your point.)

So, my contention is that if you ran dbus-daemon from the initramfs, then in addition to the pain of building a dbus-daemon that can run from the initramfs, while handling services and configuration files both from the initramfs *and* from the root filesystem, you'd also have cases where you need to reboot to upgrade dbus-daemon, because you want to upgrade the corresponding userspace and your userspace can't cope with an old dbus-daemon. (It *especially* can't cope with the dbus package getting upgraded on the filesystem but the running version being older than the installed package.)

How Debian managed the systemd transition

Posted Sep 17, 2015 1:32 UTC (Thu) by luto (guest, #39314) [Link] (19 responses)

I'm thinking of kdbus as "dbus-daemon in the kernel" where dbus-daemon is a hypothetical non-crufty daemon.

Sure, kdbus doesn't read config files, but there is no reason whatsoever that a userspace dbus daemon should need to read config files, especially if it's aiming for feature parity with kdbus. Similarly, kdbus claims ABI compatibility, but a userspace dbus daemon really ought to do the same.

I get kind of annoyed when kdbus gets compared to dbus-daemon-as-it-exists and the favorable comparisons are used as an argument for why kdbus is a good idea. Dbus-daemon has all kinds of problems, but, after reading far too many emails about it and thinking about it for far too long, I'm having trouble believing that there is a single respect in which kdbus solves a problem that a simple, streamlined userspace daemon can't easily solve.

If current dbus-daemon barfs when its package is upgraded under it, that's *pathetic*, but it's still not a good reason why distros should be excited about kdbus.

(The streamlined userspace daemon would need help from an improved AF_UNIX credential mechanism, but that's easy.)

How Debian managed the systemd transition

Posted Sep 17, 2015 2:07 UTC (Thu) by josh (subscriber, #17465) [Link] (15 responses)

> I'm thinking of kdbus as "dbus-daemon in the kernel" where dbus-daemon is a hypothetical non-crufty daemon.

That hypothetical non-crufty daemon would almost never need upgrading, sure. And neither does kdbus, so the comparison works. But the dbus-daemon we have *today* doesn't belong in an initramfs, and that's where this discussion started. And I see a distinct lack of people working on a hypothetical non-crufty dbus-daemon, hence why it remains hypothetical.

Apart from that, I can think of several things kdbus can do that an arbitrarily lightweight dbus-daemon can't, which explains part of why nobody seems to want to work on a hypothetical non-crufty dbus-daemon. Most notably, it eliminates a context switch from every message passed (two from every round-trip). If you had a "non-crufty" dbus-daemon that didn't need to touch the actual messages, what remaining non-cruft purpose does the daemon serve? Even having dbus-daemon involved in setup or broadcasts represents unnecessary overhead.

How Debian managed the systemd transition

Posted Sep 17, 2015 2:32 UTC (Thu) by josh (subscriber, #17465) [Link] (1 responses)

Note, though, that if the overhead could be entirely eliminated (context switches included) there *are* things I'd love to see moved out of the kernel. The vast majority of filesystems, for instance: a giant pile of C code, running at the highest possible security level, used to parse what should be arbitrary untrusted data, that we're increasingly exposing to arbitrary unprivileged userspace. There's no good reason for, for instance, isofs, freevxfs, or hfs to live in the kernel.

How Debian managed the systemd transition

Posted Sep 17, 2015 5:55 UTC (Thu) by luto (guest, #39314) [Link]

FUSE does pretty well for itself despite context switches. I've never profiled it, but I bet that context switches account for very little of its overhead. I would imagine that inefficient use of page cache is the main problem.

Dbus is a nasty model for things like filesystems, though. Some kind of fast capability-based transport would be much better suited, especially since a file descriptor (or directory reference or whatever) maps quite nicely to a capability.

How Debian managed the systemd transition

Posted Sep 17, 2015 5:52 UTC (Thu) by luto (guest, #39314) [Link] (12 responses)

I'm not really convinced by this context switch thing. For a messaging system, users are likely to care about latency and about throughput. Certainly, to send a single message via a central daemon, two context switches are required, whereas sending a message via kdbus or any other direct-through-the-kernel system only needs one context switch.

But context switches should be decently under 2 µs on a modern system. (The atrocious performance of libgdbus + dbus-daemon has *nothing* to do with the extra context switch.) With some optimization, which certainly could be done, I bet we can significantly improve context switch performance.

In any event, for applications that care about throughput, the extra context switch is a red herring. Under load, a good central daemon will process many messages per time slice, so the throughput bottleneck is much more likely to be message routing and such rather than context switches. Under that type of load, having a central daemon shouldn't be much slower than doing everything in the kernel. Kdbus is IMO unlikely to be particularly fast in terms of CPU time used per message because the per-message processing is rather complex.
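The amortization argument above can be put into rough numbers. This is only a back-of-the-envelope sketch in Python: the 2 µs figure is the upper bound mentioned above, while the function name and the batch size of 100 are assumptions made up for illustration, not measurements.

```python
def switch_overhead_per_message_us(switches_per_wakeup: int,
                                   messages_per_wakeup: int,
                                   switch_cost_us: float = 2.0) -> float:
    """Context-switch time charged to each message, in microseconds."""
    if messages_per_wakeup < 1:
        raise ValueError("need at least one message per wakeup")
    return switches_per_wakeup * switch_cost_us / messages_per_wakeup


# Latency case: one message per wakeup.
direct = switch_overhead_per_message_us(1, 1)       # sender -> receiver
via_daemon = switch_overhead_per_message_us(2, 1)   # sender -> daemon -> receiver

# Throughput case: the daemon drains a batch of queued messages per time
# slice (a batch of 100 is a hypothetical value).
batched = switch_overhead_per_message_us(2, 100)

print(direct, via_daemon, batched)
```

Under these assumptions the extra hop costs a couple of microseconds for a lone message and close to nothing per message once the daemon is batching, which leaves routing and buffering as the likelier bottlenecks.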

With a userspace mechanism built on top of a serious IPC primitive, the extra context switch goes away because the central daemon can easily introduce parties for direct communication. Linux has no such mechanism (other than SCM_RIGHTS). seL4 does, and I suspect (although I don't know for sure) that the other L4 systems do as well. Binder also looks reasonable for such uses, even though it's rather crufty in other respects.
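For what it's worth, here is a minimal Python sketch of the "introduction" pattern using SCM_RIGHTS, the one Linux mechanism named above. The helper names `introduce` and `accept_introduction` are made up for this example, and everything runs in a single process, so it only demonstrates the descriptor hand-off, not a real daemon. It relies on `socket.send_fds` and `socket.recv_fds`, which need Python 3.9 or later on a Unix system.

```python
import socket


def introduce(broker_to_a: socket.socket, broker_to_b: socket.socket) -> None:
    """Create a private channel and hand one end to each party via SCM_RIGHTS."""
    end_a, end_b = socket.socketpair()
    try:
        socket.send_fds(broker_to_a, [b"peer"], [end_a.fileno()])
        socket.send_fds(broker_to_b, [b"peer"], [end_b.fileno()])
    finally:
        # The in-flight descriptors hold their own references, so the broker
        # can close its copies and drop out of the data path.
        end_a.close()
        end_b.close()


def accept_introduction(from_broker: socket.socket) -> socket.socket:
    """Receive the descriptor sent by introduce() and wrap it as a socket."""
    _data, fds, _flags, _addr = socket.recv_fds(from_broker, 16, 1)
    return socket.socket(fileno=fds[0])


# Single-process demonstration: each "client" is just a socket here.
broker_a, client_a = socket.socketpair()
broker_b, client_b = socket.socketpair()
introduce(broker_a, broker_b)

direct_a = accept_introduction(client_a)
direct_b = accept_introduction(client_b)
direct_a.sendall(b"hello without the broker")
reply = direct_b.recv(64)
print(reply)
```

After the hand-off, messages between the two parties no longer pass through the broker at all, so the extra context switch only applies to the setup step.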

For dbus in particular (userspace or kernel), I think that good performance under load will be tough, because dbus has a reliable in-order broadcast model. If everyone can broadcast to everyone in order, then the overall system needs to buffer each message until every receiver has read it. Since the senders and receivers are all asynchronous, that can be a lot of buffering. For kdbus in particular, the fancy "pool" model means (AFAICT) that all of the broadcast messages need to be buffered *separately* for each receiver. IMO this will work considerably worse than just doing it with a lightweight userspace daemon. Realistically, though, the fully-ordered broadcast model seems unlikely to hold up under load with *any* implementation whatsoever.
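To make the buffering cost concrete, here is a toy Python model of the two strategies. `SharedLogBus` and `PerReceiverBus` are hypothetical names for this sketch; neither is meant to describe how kdbus pools or dbus-daemon actually store messages.

```python
from collections import deque


class SharedLogBus:
    """Ordered broadcast with one shared buffer and a read cursor per receiver."""

    def __init__(self, receivers):
        self.log = deque()                       # messages not yet read by everyone
        self.base = 0                            # sequence number of log[0]
        self.cursor = {r: 0 for r in receivers}  # next sequence number per receiver

    def broadcast(self, msg):
        self.log.append(msg)

    def read(self, receiver):
        seq = self.cursor[receiver]
        if seq - self.base >= len(self.log):
            return None                          # nothing new for this receiver
        msg = self.log[seq - self.base]
        self.cursor[receiver] = seq + 1
        # Drop messages that every receiver has now consumed.
        while self.log and min(self.cursor.values()) > self.base:
            self.log.popleft()
            self.base += 1
        return msg

    def buffered(self):
        return len(self.log)


class PerReceiverBus:
    """Ordered broadcast that queues a separate copy for each receiver."""

    def __init__(self, receivers):
        self.queues = {r: deque() for r in receivers}

    def broadcast(self, msg):
        for q in self.queues.values():
            q.append(msg)

    def read(self, receiver):
        q = self.queues[receiver]
        return q.popleft() if q else None

    def buffered(self):
        return sum(len(q) for q in self.queues.values())


receivers = ["a", "b", "c"]
shared, copies = SharedLogBus(receivers), PerReceiverBus(receivers)
for i in range(1000):
    shared.broadcast(i)
    copies.broadcast(i)
before_shared, before_copies = shared.buffered(), copies.buffered()

# Receivers "a" and "b" keep up; "c" never reads.
for name in ("a", "b"):
    for _ in range(1000):
        shared.read(name)
        copies.read(name)
after_shared, after_copies = shared.buffered(), copies.buffered()
print(before_shared, before_copies, after_shared, after_copies)
```

In this model the per-receiver design holds three times as many queued entries while everyone is behind, and a single stalled receiver pins the whole backlog in either design, which is the part that seems unlikely to hold up under load.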

How Debian managed the systemd transition

Posted Sep 23, 2015 9:33 UTC (Wed) by paulj (subscriber, #341) [Link] (10 responses)

The problem is some people already went and implemented a kernel DBUS, presumably without having thought too deeply about things and not having questioned the notion that the performance problems with dbus-daemon were to do with kernel-userspace transitions. So given it exists and does improve performance over the inefficient user-space implementation, and given those people (like any others) aren't keen to have their work wasted, there will now be pressure to integrate it.

That pressure will be hard to deflect by pointing out the correct solution to an inefficient user-space implementation is not a very $FAVOURED_IPC_OF_THIS_DECADE-specific kernel implementation, but instead to implement an efficient user-space implementation + whatever generalised kernel services are needed for IPC problems in the abstract. To deflect that pressure for good requires coming up with that efficient user-space implementation really.

How Debian managed the systemd transition

Posted Sep 23, 2015 9:47 UTC (Wed) by lgeorget (guest, #99972) [Link] (2 responses)

> The problem is some people already went and implemented a kernel DBUS, presumably without having thought too deeply about things and not having questioned the notion that the performance problems with dbus-daemon were to do with kernel-userspace transitions.

Actually, if I recall correctly the discussions on that matter, the main advantage of the in-kernel implementation of dbus was not that it reduces the number of context switches but that it reduces the number of memory copies because for the kernel, unlike a user-space daemon, copying memory can be as simple as mapping the same pages in two processes.

> those people (like any others) aren't keen to have their work wasted, there will now be pressure to integrate it.

As far as I can tell from reading the mails on the Linux mailing list, Greg Kroah-Hartman has shown himself to be very professional. He would surely be pleased to see his work in the mainline kernel, but not to the point of "pressuring" anyone.

How Debian managed the systemd transition

Posted Sep 23, 2015 15:06 UTC (Wed) by luto (guest, #39314) [Link]

Indeed, kdbus saves a memory copy in the common case if the receiver is able to consume data straight from the "pool" without copying the data itself.

For small messages, this barely matters, and for large messages, both kdbus and AF_UNIX users can use memfds, which does even less copying.

Actually, for small messages, I'll only believe that the kdbus approach is faster if someone benchmarks it cleanly. The saved copy is only possible because the kernel writes to the receiver's pool when the message is sent, and that means that the kernel has to map the receiver's pool, and that's not free. (In fact it can be very slow -- modern CPUs are very good at mapping things, but at least x86 makes *unmapping* extremely expensive.)

How Debian managed the systemd transition

Posted Sep 23, 2015 15:51 UTC (Wed) by dlang (guest, #313) [Link]

Linus has pointed out that the performance wins of kdbus have far more to do with horribly inefficient userspace dbus code than any advantage of being in the kernel (context switches or memory copies)

So the 'official' justification for kdbus is no longer performance, but rather security and/or reliability

How Debian managed the systemd transition

Posted Sep 23, 2015 16:29 UTC (Wed) by raven667 (subscriber, #5198) [Link] (6 responses)

> presumably without having thought too deeply about things

I've been on the sidelines, following development on LWN, but that doesn't seem representative of the people involved or the effort which has gone into this, so I wouldn't presume that at all.

> not having questioned the notion that the performance problems with dbus-daemon were to do with kernel-userspace transitions

I believe there was awareness that the existing dbus-daemon implementation was not performant but also awareness that even a perfectly implemented userspace daemon has an upper limit on what it can do because of serializing, memory copying and context switches. Experience with the X Window protocol is instructive here as it sits in a very similar place in the software stack and there was a desire for dbus to be able to scale to the point of handling graphics data, which has already been demonstrated with X that a userspace daemon cannot do this without kernel support. Less copying and fewer context switches are also a boon for power usage, which is becoming more important every year, both for battery-powered and datacenter devices.

> efficient user-space implementation + whatever generalised kernel services are needed for IPC problems in the abstract.

This was the original goal and implementation many years ago but was flatly rejected by the kernel developers who would have needed to approve it which is why we have the kdbus implementation we have today as opposed to some other design. The original thought would be for a multicast AF_UNIX type socket that a userspace daemon could control which would be capable of zero-copy message delivery but the network subsystem maintainers refused to entertain the changes required to make something like that work and be supportable, so a different design which is much more self-contained is being proposed instead.

How Debian managed the systemd transition

Posted Oct 9, 2015 23:29 UTC (Fri) by nix (subscriber, #2304) [Link] (5 responses)

> Experience with the X Window protocol is instructive here as it sits in a very similar place in the software stack and there was a desire for dbus to be able to scale to the point of handling graphics data, which has already been demonstrated with X that a userspace daemon cannot do this without kernel support.

X was doing just that without kernel support for nearly two decades. The MIT-SHM extension is worth noting.

You don't need to be the kernel to share memory... and with memfds, you don't even need to be the kernel to share memory with untrusted partners.

How Debian managed the systemd transition

Posted Oct 10, 2015 1:24 UTC (Sat) by raven667 (subscriber, #5198) [Link] (4 responses)

Shared memory is a kernel feature that gets you some of the way there but doesn't have the access control interface that these applications require, and the DRI/DRM interfaces in the kernel were created for graphics applications like X, much like memfd, which was created for kdbus, so I don't think it's fair to say that X runs undegraded without special kernel support.

How Debian managed the systemd transition

Posted Oct 13, 2015 13:50 UTC (Tue) by nix (subscriber, #2304) [Link] (3 responses)

What? X ran undegraded without kernel support for literally a decade plus, until hardware 3D stuff started turning up. MIT-SHM provided everything needed.

How Debian managed the systemd transition

Posted Oct 13, 2015 14:45 UTC (Tue) by nybble41 (subscriber, #55106) [Link] (2 responses)

> X ran undegraded without kernel support for literally a decade plus

I think one could argue that being given direct access to the graphics hardware, and thus effectively unlimited access to the entire system, should count as "kernel support". Sure, the driver code was inside the X server rather than compiled into the kernel or a loadable module, but it still required special interfaces used primarily by X, and it wasn't possible to run the X server as an ordinary, non-root user process.

How Debian managed the systemd transition

Posted Oct 13, 2015 15:09 UTC (Tue) by raven667 (subscriber, #5198) [Link] (1 responses)

That's a good point, but even if you don't consider allowing the userspace app to just bang away at /dev/mem "kernel support" because that really isn't a defined API, we can certainly say that being limited to the performance and capabilities of the 1990s X stack would be considered "degraded" by modern standards and applications. Making this behave safely without degraded performance required the addition of dedicated APIs, to talk to the graphics co-processor and to share memory buffers, beyond the 1980s UNIX standard ones.

We've already gone down the route of adding dedicated IPC APIs for SysV, for Netlink, for X/Wayland and now for DBUS, which I see as following the evolution of OS design and the needs of the applications of the era when these interfaces were designed.

How Debian managed the systemd transition

Posted Oct 13, 2015 22:49 UTC (Tue) by nix (subscriber, #2304) [Link]

Oh, I agree it would be bad by modern standards -- however, it was quite clearly capable of scaling to the point of handling graphics data with no more kernel support than that. To get back to the original point: unless you think D-Bus is not just going to be asked to handle graphics data but the full graphics flow of a 3D game, I think the volume of data involved in graphics should not serve as an argument for needing kernel support just to handle that.

How Debian managed the systemd transition

Posted Sep 25, 2015 22:24 UTC (Fri) by oak (guest, #2786) [Link]

Yea, the buffering is a much larger performance issue than context switching. All it takes is some message that is generated very frequently, and a client that has subscribed to the message but isn't reading its messages (e.g. because it's suspended for a few days while in the background).

The result is that the daemon's message buffers grow until they take all your memory, your system message transport goes to swap (with everything else) and things become *really* slow until the problematic client is killed. If the client is woken up, the daemon and client can spend many minutes (or hours, depending on how much swap & buffering you have) during which the bus isn't very responsive. If allocations were mixed well enough, emptying the message buffer in the daemon doesn't actually free its dirtied memory because it's gotten fragmented.

This is D-BUS experience from 5-10 years ago on a semi-embedded device. Even worse, the user-space daemon gets its memory fragmented very easily and doesn't return memory to the system once it has been allocated. So, a local DoS is trivial to do with any client that can connect to the bus.

Some of the things where kernel *might* be able to improve on this are:
* Assigning message buffers memory cost to corresponding client, so that admin can identify who's the culprit
* Better allocator that guarantees that after processing the messages, the emptied buffer can actually be freed for other purposes (i.e. allocation blocks don't mix data with unrelated life-times, e.g. send and receive messages or messages from/to different clients)
* If message is status broadcast, maybe having some mechanism where only last status update is buffered
* Suspending message sending if receiver isn't processing the messages
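
Several of these ideas can be sketched outside the kernel too. Below is a toy Python per-client queue, with the made-up name `ClientQueue`, that charges buffered messages to the receiving client, keeps only the latest update for a given status key, and tells the sender to back off when the queue is full instead of growing without bound. It is a sketch under those assumptions and does not reflect how dbus-daemon or kdbus behave.

```python
from collections import OrderedDict


class ClientQueue:
    """Bounded per-receiver queue that keeps only the newest status per key."""

    def __init__(self, limit: int):
        self.limit = limit
        self.pending = OrderedDict()   # internal key -> payload, oldest first
        self.seq = 0                   # numbers ordinary (non-status) messages

    def offer(self, payload, status_key=None) -> bool:
        """Queue a message; return False when the sender should back off."""
        if status_key is not None:
            key = ("status", status_key)
            if key in self.pending:
                self.pending[key] = payload    # overwrite the stale status
                return True
        else:
            key = ("msg", self.seq)
        if len(self.pending) >= self.limit:
            return False                       # backpressure instead of growth
        if status_key is None:
            self.seq += 1
        self.pending[key] = payload
        return True

    def take(self):
        """Pop the oldest queued payload, or None if the queue is empty."""
        if not self.pending:
            return None
        _key, payload = self.pending.popitem(last=False)
        return payload


q = ClientQueue(limit=3)
for level in range(1000):
    q.offer(level, status_key="battery")   # a frequently repeated status update
depth = len(q.pending)                      # memory charged to this client

q.offer("a")
q.offer("b")
full = not q.offer("c")                     # queue is at its limit
first = q.take()
print(depth, full, first)
```

Note that coalescing status updates deliberately gives up strict ordering for those messages, so it only fits broadcasts where the latest value is all a receiver cares about.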

How Debian managed the systemd transition

Posted Sep 17, 2015 6:11 UTC (Thu) by alison (subscriber, #63752) [Link] (2 responses)

> Dbus-daemon has all kinds of problems, but, after reading far too many emails about it and thinking about it for far too long, I'm having trouble believing that there is a single respect in which kdbus solves a problem that a simple, streamlined userspace daemon can't easily solve.

Performance of Dbus-daemon aside, what about the more abstract question of whether a new message-passing API inside the kernel makes sense? From the sheer design point of view, why does the kernel provide 3 notification services for userspace via fanotify, dnotify and inotify? Presumably the rationale for adding fanotify to dnotify and inotify was that fanotify was superior. Why does that rationale not apply to kdbus?

Both kdbus and Dbus-daemon will continue to evolve. The issue of whether the kernel should have a new feature would logically be decided on the basis of what the kernel's rightful role is. Mostly the kernel's job is to abstract away the details of hardware and to provide userspace with services (e.g. boot) that it would have difficulty managing itself. Is IPC like that provided by kdbus such a service, or no? If not, why is it fundamentally different from notification, to which it seems logically related?

How Debian managed the systemd transition

Posted Sep 17, 2015 11:46 UTC (Thu) by lsl (subscriber, #86508) [Link] (1 responses)

> Presumably the rationale for adding fanotify to dnotify and inotify was that fanotify was superior.

Wasn't the "rationale" more like "we hope it makes snake oil vendors stop torturing our enterprise kernels with horrible out-of-tree modules"? At least that's what I remember from it. It wasn't any less drama than kdbus. Also, it didn't get merged until attempts were made to rework it to be more generally useful, for tasks other than implementing snake oil products.

How Debian managed the systemd transition

Posted Sep 23, 2015 20:35 UTC (Wed) by foom (subscriber, #14868) [Link]

... Except, it failed to do so. (Man, what a *waste* of a third attempt to have functional fs watching functionality...)