The Debian init system general resolution returns

Posted Oct 25, 2014 2:31 UTC (Sat) by viro (subscriber, #7872)
In reply to: The Debian init system general resolution returns by zblaxell
Parent article: The Debian init system general resolution returns

*snort*

More interesting class of bugs stems from the systemd propensity to spew^Wdistribute tons of information to the rest of the system. Extra complexity in a sensitive place is only a part of the problem - much worse is that if dbus-daemon can't keep pace with it, the backlog is stored in memory of PID 1 until it can be sent. Now, recall what makes PID 1 special from the OOM killer point of view...

And no, it's not a pure theory - I've run into one of the bugs in that class; a lot of umount activity going on (e.g. on shutdown with several thousand bindings present in the system) ended up with quadratic amount of dbus traffic. The damn thing kept resending the entire mount table every time it saw a change. Welcome to 8G of dirty memory held by PID 1... Basically, they were too lazy to compare the old and the new tables and send a proper delta. Sure, fixing that one hadn't been hard, but the underlying architectural deficiency is still there.
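To make the quadratic claim concrete, here is a toy Python model. It is not systemd's code, and the function names are made up for illustration. It counts how many mount entries get sent while N mounts are torn down one at a time: first when the whole remaining table is re-sent after every change, then when only a delta is sent.

```python
def full_resend_traffic(n_mounts: int) -> int:
    """Entries sent if the whole remaining table is re-sent after each umount."""
    sent = 0
    remaining = n_mounts
    while remaining > 0:
        remaining -= 1      # one umount happens
        sent += remaining   # every entry still mounted is sent again
    return sent


def delta_traffic(n_mounts: int) -> int:
    """Entries sent if each umount is reported as a single 'removed' record."""
    return n_mounts


def diff_tables(old: set, new: set) -> tuple:
    """Return (added, removed) between two snapshots of mount points."""
    return new - old, old - new


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, full_resend_traffic(n), delta_traffic(n))
```

The full-resend count is N(N-1)/2, so ten thousand bind mounts produce roughly fifty million entries on the wire, while the delta version stays at ten thousand. If the receiver drains slower than the sender produces, that difference is what ends up queued in the sender's memory.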

_Anything_ that convinces systemd to generate a major spew is likely to take the system down. From what I've heard, they had an earlier bug of the same kind, that one with the spew consisting of network device information.

It's not so much systemd codebase fault as one of the reasons why dbus is not suitable for high-volume traffic, combined with systemd using it for potentially huge amounts of such with PID 1 as originator. And no, bringing that festering shitpile of protocol into the kernel won't make the things any better - dbus is broken by design.



The Debian init system general resolution returns

Posted Oct 25, 2014 2:48 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

What exactly is broken in DBUS? It's a dead-simple messaging protocol for structured messages.

PS: no, Plumber is not in any way better.

The Debian init system general resolution returns

Posted Oct 25, 2014 3:22 UTC (Sat) by viro (subscriber, #7872) [Link] (16 responses)

Assumption that it's suitable for high-volume traffic without dropped packets is a recipe for DoS. That assumption is very clearly made by systemd and I really doubt that Lennart et al. would enjoy getting rid of it - AFAICS, there's a lot of code assuming that dbus delivery is reliable and even more assuming that PID 1 isn't throttled by congestion. I certainly would hate the scope of code review and redesign it would take to avoid that fun...

PID 1 is in extremely *bad* position for keeping track of everything and sending reliable notifications of everything. No mechanism would really help with that; it's just that dbus pretends to provide one that will cope with such demands. It can't. Neither can kdbus.

Every time somebody finds a way to trigger a huge amount of traffic originating at PID 1, it's a serious bug. And they insist on keeping track of a lot of system state in PID 1, with all kinds of traffic being sent. Asking for trouble...

The Debian init system general resolution returns

Posted Oct 25, 2014 3:59 UTC (Sat) by raven667 (subscriber, #5198) [Link] (7 responses)

But the makers and major users of dbus know very well that it's not suitable for high-volume traffic; that is the main driver of kdbus: to increase the effective volume threshold and make better-defined guarantees about what is and is not suitable.

The Debian init system general resolution returns

Posted Oct 25, 2014 4:47 UTC (Sat) by viro (subscriber, #7872) [Link] (6 responses)

I'm sorry, but this is hogwash. In that DoS only a few percent of memory footprint had been outside of dbus packets *payloads*. You have a traffic producer (systemd), hosing the bus with many gigabytes of data, and consumers (dbus-daemon) trying to chew through that. I don't have the profile data, but it should be easy to reproduce - take Fedora userland circa last February, cd /tmp; for i in `seq 10000`; do touch $i; mount --bind /dev/null $i; done, followed either by shutdown or by explicit loop doing umount of the same (easier to collect profiling information). And watch the memory footprint in process... Sure, some overhead could be eliminated, but dbus-daemon is nowhere near keeping up; the same pileup will happen, and I would be very surprised if kdbus would manage to shave much off.

And yes, _that_ particular bug had been fixed - current systemd (since March or so) doesn't produce quadratic amount of traffic in that situation. My point is that _anything_ that tricks it into hosing the bus with shitloads of traffic will cause the same kind of problem, kdbus or no kdbus.

IOW, it's something they have to watch out for, and sending a lot of stuff (a lot of kinds of stuff, even) as part of normal operation, expected by the rest of the system, is seriously asking for trouble.

The Debian init system general resolution returns

Posted Oct 25, 2014 6:15 UTC (Sat) by viro (subscriber, #7872) [Link] (5 responses)

PS: I'm not saying that this bug was something earth-shattering - nobody has seriously stressed the system with really large number of mounts and namespaces until the docker folks started to play with scalability, and that has certainly exposed a bunch of rather embarrassing issues, quite a few of them in the kernel (percpu allocator eating O(N^2) for N allocations and freeings, propagate_mnt() case when O(N^2) mounts had been created *and* destroyed before anyone could see them, leaving only O(N) added, sequential read() /proc/self/mounts costing O(previously read entries), leading to cat /proc/self/mounts being O(size^2) and constant (and very small) sized mount hash). Had been an interesting couple of weeks hunting them down; result was near-linear by number of created docker instances, instead of obscene O(N^4). systemd bug got caught at the same time - O(N^2) traffic and easily triggered OOM. Another bug had been in umount(8) with long argument lists, IIRC (rereading the mount table after each umount(2))...

The point being, bugs happen; the architectural mistake there is what's making them a lot more severe. Namely, the use of dbus to send notifications of many kinds of system state changes as they are happening, with PID 1 as sender. And one *still* can trigger obscene amount of traffic there - mount tmpfs on /tmp/a, create a bunch of bindings in /tmp/a/* and then keep doing mount --move /tmp/a /tmp/b; mount --move /tmp/b /tmp/a. Nowhere near as bad as "it panics on shutdown when there's a lot of mounts", but the same "let's keep telling dbus-daemon about those changes, no matter what" easily translates into severe slowdowns and OOMs. Single syscall, done in constant time kernel-side, ends up with massive dbus spam, and no throttling. Moreover, the processes receiving those notifications can bloody well get the same information themselves, and do it cheaper...

The Debian init system general resolution returns

Posted Oct 25, 2014 14:56 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Now imagine that someone used something like udev rules for that. The quadratic behavior would still be there and init manager really does need to look into mount points.

The Debian init system general resolution returns

Posted Nov 2, 2014 15:27 UTC (Sun) by nix (subscriber, #2304) [Link] (3 responses)

This also seems problematic from another perspective: fs namespaces. What good is it sending info out about init's mount namespace when there's no guarantee at all that it corresponds to the view anything else has of the filesystem? Any process that's looking at mounts other than by looking at /proc/self/mounts (or a symlink to it, e.g. /proc/mounts) and then considering that it need have any relevance to *its* state, rather than to the state of the process that did that read(), is just asking for subtle and horrible bugs.

The same applies to anything at all related to network interfaces.

Presumably this traffic is meant to be consumed by udev, i.e. it's a replacement for the existing in-kernel uevent messages over the netlink socket. Seems like a rather baroque, ludicrous, and bug-prone change to me.

The Debian init system general resolution returns

Posted Nov 2, 2014 16:33 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I doubt the implementation is so dumb that it leaks inappropriate information across namespaces. systemd PID 1 manages the namespaces for services, so it should have awareness of what goes where, but I don't think either of us is an expert on the implementation. There isn't anything here that prevents a process from reading this information directly if it wants to. The real use case is applications which already use dbus for IPC subscribing to a new structured message rather than implementing their own opening, polling, and parsing logic, or their own uevent subscription, which seems like a win to me.

The Debian init system general resolution returns

Posted Nov 2, 2014 17:08 UTC (Sun) by viro (subscriber, #7872) [Link] (1 responses)

As far as I can see, it boils down to wanting to be The Authority And The Source Of All Information. Never mind that subjects^Wmanaged^Weverybody else can get the same thing easily; it seems to go against some very strong instincts. Same style as spamming everyone in the company with pointless memos on every thinkable topic, relevant or not...

There's a very strong smell of PHB all over the design. Worse, a PHB that had been told by some conslutant about Web 2.0 and social media being The Thing for millennial generation and decided to have a local equivalent of twitter built for communication with the plebes. It doesn't work well? Why, let's move it to the critical servers; those are on beefier intertubes, or something... Still doesn't work well? Too fucking bad for those who maintain those servers - it's their responsibility now (and of course, any questions regarding the basic design of the damn thing are countered with generous loads of "we had it behave that way before, therefore it must behave the same").

And yes, I am talking about dbus and plans of moving it kernel-side ;-/

The Debian init system general resolution returns

Posted Nov 2, 2014 21:35 UTC (Sun) by johannbg (guest, #65743) [Link]

Observing the kernel communication regarding the kdbus submission it's pretty clear that Eric would have nacked that proposal before the actual submission if that was possible.

That nack of his was a bit weird if you ask me but I guess I need to sacrifice a chicken, dance on one foot and drink some of that kernel koolaid to get my mind into the kernel cult and communication.

The Debian init system general resolution returns

Posted Oct 25, 2014 6:03 UTC (Sat) by johannbg (guest, #65743) [Link] (3 responses)

Interesting criticism of an architectural design which can never be better than the underlying architecture it's built upon and the interfaces it provides (the Linux kernel itself).

The underlying problem is basically the same: parallelizing X, where X can be d-bus, sockets, file system jobs, you name it.

"PID 1 is in extremely *bad* position for keeping track of everything and sending reliable notifications of everything. No mechanism would really help with that"

So let's hear it: based on the function of PID 1, how would you do it, if not in PID 1?

What architectural design do you have in mind to solve this?

The Debian init system general resolution returns

Posted Oct 25, 2014 6:38 UTC (Sat) by viro (subscriber, #7872) [Link] (2 responses)

Look, mount table is trivial to build from scratch - open /proc/self/mountinfo and read it; check the code in systemd that does this work. Moreover, keeping it updated is also easy - check the same code. It's not really worth bothering with any IPC, let alone one of the push rather than pull variety.
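As a rough illustration of how little is involved in the pull approach, here is a minimal Python sketch that builds a mount table from /proc/self/mountinfo. It is not the systemd code referred to above; the `Mount` type and function names are invented for this example, and it keeps only a handful of the fields the kernel exposes.

```python
import re
from typing import NamedTuple


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    mount_point: str
    fs_type: str
    source: str


def _unescape(field: str) -> str:
    # The kernel writes space, tab, newline and backslash as \ooo octal escapes.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mountinfo(text: str) -> list:
    """Parse the contents of a mountinfo file into a list of Mount records."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fixed fields, then optional fields, then " - ", then fs type,
        # mount source and per-superblock options.
        left, _, right = line.partition(" - ")
        head = left.split()
        tail = right.split()
        mounts.append(Mount(
            mount_id=int(head[0]),
            parent_id=int(head[1]),
            mount_point=_unescape(head[4]),
            fs_type=tail[0],
            source=_unescape(tail[1]),
        ))
    return mounts


def read_mount_table(path: str = "/proc/self/mountinfo") -> list:
    """Read the calling process's own view of the mount table."""
    with open(path) as f:
        return parse_mountinfo(f.read())
```

Because the file is read through /proc/self, the result is the caller's own namespace view. For keeping it current, proc(5) documents that the file is pollable, with a mount or unmount reported as a priority event (POLLPRI), so a consumer can wait on the descriptor and re-read on wakeup instead of being pushed updates.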

And if you do insist on IPC, for some reason, you could bloody well start a caching daemon on demand (with e.g. timeout for inactivity). There's no reason whatsoever to keep that in PID 1 or anywhere near it.

We already have mechanisms for parallelizing. Had them for more than four decades. Called "processes"...

Having systemd forwarding a bunch of stuff it gleans from the kernel is asking for bottlenecks, for no good reason. System calls are not going away; not unless you want a truly monumental bottleneck with systemd playing the role of Mach server. Even then read(2) and open(2) wouldn't disappear, including those of /proc/self/mountinfo...

The Debian init system general resolution returns

Posted Oct 25, 2014 8:40 UTC (Sat) by johannbg (guest, #65743) [Link] (1 responses)

You do realize that when I spoke about parallelizing I was referring to daemon start-ups and states etc. (and signal handling thereof)?

We have to agree to disagree on the push vs pull implementation.

The Debian init system general resolution returns

Posted Oct 25, 2014 19:23 UTC (Sat) by dlang (guest, #313) [Link]

systemd is not the only way to address parallel startup

The Debian init system general resolution returns

Posted Oct 25, 2014 15:13 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Assumption that it's suitable for high-volume traffic without dropped packets is a recipe for DoS. That assumption is very clearly made by systemd and I really doubt that Lennart et al. would enjoy getting rid of it - AFAICS, there's a lot of code assuming that dbus delivery is reliable and even more assuming that PID 1 isn't throttled by congestion.

Systemd uses a protected DBUS interface, so if you're able to DDoS it then you already have enough capabilities to wreck the system. It's not a fatal flaw in the design.

Also, it looks like KDBUS can help to throttle the senders - the regular DBUS daemon can't really do it cleanly.

The Debian init system general resolution returns

Posted Nov 2, 2014 15:28 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

Throttle the senders? So now consumers of this information get a randomly inaccurate mount table! That's ever so much better.

Or they could read /proc/self/mounts or /proc/self/mountinfo and get an always-reliable, namespace-aware view with none of this nonsense.

The Debian init system general resolution returns

Posted Nov 2, 2014 17:09 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Any asynchronous interface (yes, including udev over netlink sockets) is susceptible to receiving stale data. And even reading /proc/mounts is racy, because anything can happen after you do a 'read' call.

Besides, netlink sockets can block senders just as well as kdbus.

The Debian init system general resolution returns

Posted Nov 2, 2014 17:31 UTC (Sun) by viro (subscriber, #7872) [Link]

... and PID 1 is never ever going to do _blocking_ send, regardless of the transport. Think what happens with the system where PID 1 is fast asleep... There's a damn good reason why systemd doesn't do blocking sendmsg(). The problem isn't transport one; it's that in congested situations you need to change the traffic you are generating. And sending the mount table updates is completely pointless, congested situation or not.
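One way to read "change the traffic you are generating" is coalescing: when the transport would block, keep only the latest state per topic instead of queueing every intermediate update. The Python sketch below is a toy model of that idea under stated assumptions. The `CoalescingNotifier` class and the `send` callback are hypothetical; `send` stands in for a non-blocking send that returns False when the transport is congested. Nothing here reflects how systemd or dbus actually behave.

```python
from collections import OrderedDict


class CoalescingNotifier:
    """Never blocks; backlog is bounded by the number of topics, not events."""

    def __init__(self, send):
        # send(topic, payload) -> bool; False means "would block, try later".
        self._send = send
        self._pending = OrderedDict()

    def publish(self, topic, payload):
        # A newer payload for the same topic replaces the queued one.
        self._pending[topic] = payload
        self.flush()

    def flush(self):
        # Deliver as much as the transport accepts, then give up until
        # the caller invokes flush() again (e.g. when the socket is writable).
        while self._pending:
            topic, payload = next(iter(self._pending.items()))
            if not self._send(topic, payload):
                return
            del self._pending[topic]

    def backlog(self):
        return len(self._pending)
```

With a stalled receiver, a thousand updates to the same topic leave exactly one pending entry, and only the most recent state goes out once the receiver catches up. That bounds the sender's memory, at the cost that receivers see the latest state rather than every transition.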

