I know library size isn't your point, but it was my point:
A huge library size for simple functionality is a clear sign of badly written or badly designed code, with all the downsides that come with that: inefficient, unnecessarily complex code which is hard to debug and hard to optimise properly.
Context switching may be very fast, until you thrash all your caches on every switch; then it all goes down the drain.
So the main reason why improving the IPC works is probably that less of dbus's crappy code gets run. If dbus does the multicasting, it sends the messages one by one. The processes receiving them were most likely idling, while dbus-daemon, the bloated pig that it is, probably did enough jumping around its own code to eat up its timeslice. So the process receiving the message gets scheduled, does its thing, and only then does dbus-daemon get a new timeslice and can send the message to the next process waiting for it (SMP should help a lot, though). This cycle repeats itself until all processes have received the message. If dbus-daemon were lean and mean, this extra ping-ponging wouldn't be very noticeable and wouldn't happen as much. By pushing the multicasting into the kernel, this particular problem is avoided.
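To make the one-by-one sending concrete, here is a toy sketch in Python. It has nothing to do with the actual dbus-daemon code: `fan_out` and `demo` are made-up names, and the socket pairs merely stand in for client connections. All it shows is that a user-space fan-out costs one send per receiver, so each send is a point where a receiver can wake up and the sender can lose the CPU.

```python
import socket


def fan_out(message, receivers):
    """Send `message` to every receiver in turn; return the number of sends.

    Hypothetical stand-in for a user-space multicast loop: each receiver
    gets its own send call, so N receivers cost N trips into the kernel.
    """
    sends = 0
    for sock in receivers:
        sock.sendall(message)  # one separate send per receiver
        sends += 1
    return sends


def demo(n_receivers=4, message=b"ping"):
    """Build n connected socket pairs, fan the message out, read it back."""
    pairs = [socket.socketpair() for _ in range(n_receivers)]
    daemon_ends = [a for a, _ in pairs]   # the "daemon" side of each link
    client_ends = [b for _, b in pairs]   # the "client" side of each link

    sends = fan_out(message, daemon_ends)
    received = [c.recv(len(message)) for c in client_ends]

    for a, b in pairs:
        a.close()
        b.close()
    return sends, received
```

Calling `demo(3, b"hi")` does three separate sends for one logical message. A kernel-side multicast would, in principle, take the message once and do the copying itself, which is the part that avoids the ping-pong.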
Sending a short message to multiple processes should be very fast, and we agree that this isn't what makes dbus so slow. What makes it slow is all the other things it does for no good reason, but what exactly those are, I don't know. It could be a bug too.