Fast interprocess communication revisited
Slightly over a year ago, LWN reported on a couple of different kernel patches aimed at providing fast, or at least faster, interprocess communication (IPC): Cross Memory Attach (CMA) and kernel-dbus (kdbus). In one of the related email threads on the linux-kernel list, a third (pre-existing) kernel patch called KNEM was discussed. Meanwhile, yet another kernel module - "binder", used by the Android platform - is in use in millions of devices worldwide to provide fast IPC, and Linus recently observed that code that actually is used is code that is actually worth something, so maybe more of the Android code should be merged despite objections from some corners. Binder wasn't explicitly mentioned in that discussion, but could reasonably be assumed to be included.
This article is not about whether any of these should be merged or not. That is largely an engineering and political decision in which this author claims no expertise; in any case, one of them - CMA - has already been merged. Rather, we start with the observation that this many attempts to solve essentially the same problem suggests that something is lacking in Linux. There is, in other words, a real need for fast IPC that Linux doesn't address. The current approaches to filling this gap seem piecemeal: each patchset clearly addresses the needs of a specific IPC model without obvious consideration for others. While this may solve current problems, it may not solve future ones, and one of the strengths of the design of Unix, and hence Linux, is the full exploitation of a few key ideas rather than the ad hoc accumulation of many distinct (though related) ideas.
So, motivated by that observation, we will explore these various implementations to try to discover and describe the commonality they share and to highlight the key design decisions each one makes. Hopefully this will lead to a greater understanding of both the problem space and the solution space. Such understanding may be our best weapon against chaos in the kernel.
What's your address?
One of the interesting differences between the different IPC schemes is their mechanism for specifying the destination for a message.
CMA uses a process ID (PID) combined with offsets in the address space of that process - a message is simply copied to that location. This has the advantage of being very simple and efficient; PIDs are already managed by the kernel, and piggy-backing on that facility is certainly attractive. The obvious disadvantage is that there is no room for any sophistication in access control, so messages can only be sent to processes with exactly the same credentials. This will not suit every context, but it is not a problem for the target area (MPI, the Message Passing Interface), which is aimed at massively parallel applications in which all the processes are working together on one task. In that case, having uniform credentials is an obvious choice.
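For concreteness, CMA was merged as the process_vm_readv() and process_vm_writev() system calls. Below is a minimal sketch of the addressing model, assuming a glibc new enough to provide the wrappers (older systems can invoke the calls via syscall()); the PID and remote address would have to be learned by some out-of-band means.

```c
/* Sketch: writing into another process's address space with CMA.
 * Assumes the process_vm_writev() glibc wrapper is available.
 * Both processes must have the same credentials (or the caller
 * needs ptrace-like privileges). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <remote-hex-address>\n", argv[0]);
                return 1;
        }
        pid_t pid = atoi(argv[1]);
        void *remote_addr = (void *)strtoul(argv[2], NULL, 16);

        char msg[] = "hello from another address space";
        struct iovec local  = { .iov_base = msg,         .iov_len = sizeof(msg) };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(msg) };

        /* One copy, directly from our buffer into the target's memory. */
        ssize_t n = process_vm_writev(pid, &local, 1, &remote, 1, 0);
        if (n < 0) {
                perror("process_vm_writev");
                return 1;
        }
        printf("copied %zd bytes into pid %d\n", n, pid);
        return 0;
}
```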
KNEM uses a "cookie" which is a byte string provided by the kernel and which can be copied between processes. One process registers a region of memory with KNEM and receives a cookie in return. It can then pass this cookie to other processes as a simple byte string; the recipients can then copy to or from the registered region using that cookie as an address. Here again there is an assumption that the processes are co-operating and not a threat to each other (KNEM is also used for MPI). KNEM does not actually check process credentials directly, so any process that registers a region with KNEM is effectively allowing any other process that is able to use KNEM (i.e. able to open a specific character device file) to freely access that memory.
Kdbus follows the model of D-Bus and uses simple strings to direct messages. It monitors all D-Bus traffic to find out which endpoints own which names and then, when it sees a message sent to a particular name, routes the message itself rather than letting it go through the D-Bus daemon for routing.
Binder takes a very different approach from the other three. Rather than using names that appear the same to all processes, binder uses a kernel-internal object for which different processes see different object descriptors: small integers much like file descriptors. Each object is owned by a particular process (which can create new objects quite cheaply) and a message sent to an object is routed to the owning process. As each process is likely to have a different descriptor (or none at all) for the same object, descriptors cannot be passed as byte strings. However, they can be passed along with binder messages, much as file descriptors can be passed using Unix-domain sockets.
The main reason for using descriptors rather than names appears to involve reference counting. Binder is designed to work in an object-oriented system which (unsurprisingly) involves passing messages to objects, where the messages can contain references to other objects. This is exactly the pattern seen in the kernel module. Any such system needs some way of determining when an object is no longer referenced, the typical approaches being garbage collection and reference counting. Garbage collection across multiple different processes is unlikely to be practical, so reference counting is the natural choice. As binder allows communication between mutually suspicious processes, there needs to be some degree of enforcement: a process should not be able to send a message when it doesn't own a reference to the target, and when a process dies, all its references should be released. To ensure these rules are met, it is hard to come up with any scheme much simpler than the one binder uses.
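To make the descriptor model concrete, here is a small user-space sketch of the core idea: a per-process table of small integers, each holding one counted reference to a shared object. This is an illustration only, not the binder driver's code, and all of the names (ipc_object, handle_table and so on) are invented.

```c
/* Conceptual sketch of binder-style addressing: each process has its own
 * table of small-integer descriptors, each holding one counted reference
 * to a shared object.  Invented names; not the binder driver's code. */
#include <stdio.h>
#include <stdlib.h>

struct ipc_object {
        int refs;               /* how many descriptors, anywhere, refer to it */
        int owner_pid;          /* messages to this object are routed here */
};

#define MAX_HANDLES 16

struct handle_table {           /* one per process */
        struct ipc_object *slot[MAX_HANDLES];
};

/* Install a reference to obj in tbl, returning a process-local descriptor. */
static int handle_install(struct handle_table *tbl, struct ipc_object *obj)
{
        for (int i = 0; i < MAX_HANDLES; i++) {
                if (!tbl->slot[i]) {
                        tbl->slot[i] = obj;
                        obj->refs++;
                        return i;  /* small integer, meaningful only to this table */
                }
        }
        return -1;
}

/* Drop a descriptor; free the object when the last reference goes away. */
static void handle_release(struct handle_table *tbl, int desc)
{
        struct ipc_object *obj = tbl->slot[desc];
        if (!obj)
                return;
        tbl->slot[desc] = NULL;
        if (--obj->refs == 0)
                free(obj);
}

int main(void)
{
        struct ipc_object *obj = calloc(1, sizeof(*obj));
        obj->owner_pid = 42;

        struct handle_table sender = { { 0 } }, receiver = { { 0 } };
        int d1 = handle_install(&sender, obj);   /* e.g. descriptor 0 */
        int d2 = handle_install(&receiver, obj); /* also 0, but in a different table */
        printf("sender sees %d, receiver sees %d, refs=%d\n", d1, d2, obj->refs);

        handle_release(&sender, d1);
        handle_release(&receiver, d2);           /* last reference: object is freed */
        return 0;
}
```

The integer itself means nothing outside the table it came from, and dropping the last table entry anywhere - including when a process dies and its whole table is torn down - is exactly the moment the object can be reclaimed, which is the enforcement described above.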
Possibly the most interesting observation here is that two addressing schemes used widely in Linux are completely missing in these implementations: file descriptors and socket addresses (struct sockaddr).
File descriptors are used for pipes (the original UNIX IPC), for socket pairs and other connected sockets, for talking to devices, and much more. It is not hard to imagine them being used by CMA, and by binder too. They are appealing as they can be used with simple read() and write() calls and similar standard interfaces. The most likely reason that they are regularly avoided is their cost - they are not exactly lightweight. On an x86_64 system a struct file - the minimum needed for each file descriptor - is 288 bytes. Of these, maybe 64 are relevant to many novel use cases; the rest is dead weight. This weight could possibly be reduced by a more object-oriented approach to struct file, but such a change would be very intrusive and is unlikely to happen. So finding other approaches is likely to become common. We see that already in the inotify subsystem, which has "watch descriptors"; we see it here in binder too.
The avoidance of socket addresses does not seem to admit such a neat answer. For CMA, kdbus, and binder it simply doesn't seem to fit the need, for differing reasons. For KNEM it seems best explained as an arbitrary choice: the developer chose to write a new character device rather than a new networking domain (a.k.a. address family) and so used ioctl() and ad hoc addresses rather than sendmsg()/recvmsg() and socket addresses.
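For reference, this is the style of interface that was passed over: a socket address carried to sendto() or sendmsg(). The Unix-domain datagram sketch below uses an arbitrary example path; it is only meant to show what addressing via struct sockaddr looks like.

```c
/* Sketch: addressing a message with a socket address (struct sockaddr),
 * here a Unix-domain datagram.  The path "/tmp/ipc-demo.sock" is an
 * arbitrary example and will only work if a receiver has bound it. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
        int sock = socket(AF_UNIX, SOCK_DGRAM, 0);

        struct sockaddr_un dest = { .sun_family = AF_UNIX };
        strncpy(dest.sun_path, "/tmp/ipc-demo.sock", sizeof(dest.sun_path) - 1);

        const char msg[] = "message routed by socket address";
        if (sendto(sock, msg, sizeof(msg), 0,
                   (struct sockaddr *)&dest, sizeof(dest)) < 0)
                perror("sendto");       /* fails unless a receiver bound the path */

        close(sock);
        return 0;
}
```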
The conclusion here seems to be that there is a constant tension between protection and performance. Every step we take to control what one process can do to another by building meaning into an address adds extra setup cost and management cost. Possibly the practical approach is not to try to choose between them but to unify them and allow each client to choose. So a client could register itself with an external address that any other process can use if it knows it, or with an internal address (like the binder objects) which can only be used by a process that has explicitly been given it. Further, a registered address may only accept explicit messages, or may be bound to a memory region that other processes can read and write directly. If such addresses and messages could be used interchangeably in the one domain it might allow a lot more flexibility for innovation.
Publish and subscribe
One area where kdbus stands out from the rest is in its support for a publish/subscribe interface. Each of the higher-level IPC services (MPI, Binder, D-Bus) has some sort of multicast or broadcast facility, but only kdbus tries to bring it into the kernel. This could simply reflect the fact that multicast does not need to be optimized and can be adequately handled in user space. Alternatively, it could mean that implementing it in the kernel is too hard, so few people try.
There are two ways we can think about implementing a publish/subscribe mechanism. The first follows the example of IP multicast, where a certain class of addresses is defined to be multicast addresses and sockets can request to receive multicasts sent to selected addresses. Binder actually has a very limited form of this: any binder client can ask to be notified when a particular object dies; when a client closes its handle on the binder (e.g. when it exits), all the objects it owns die and notifications are accordingly published to every client that has subscribed to any of them. It would be tempting to turn this into a more general publish/subscribe scheme.
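As a concrete reference point for that first approach, this is how a subscriber expresses interest in an IP multicast address. The group 239.0.0.1 and port 5000 below are arbitrary choices for illustration, and error handling is omitted.

```c
/* Sketch of address-based publish/subscribe, IP-multicast style: the
 * subscriber binds a UDP socket and joins a group address, after which
 * anything published to that group is delivered to it. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(5000),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        /* Subscribe: anything published to 239.0.0.1:5000 is now delivered. */
        struct ip_mreq mreq = {
                .imr_multiaddr.s_addr = inet_addr("239.0.0.1"),
                .imr_interface.s_addr = htonl(INADDR_ANY),
        };
        setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[1500];
        ssize_t n = recv(sock, buf, sizeof(buf), 0);
        if (n >= 0)
                printf("received %zd bytes of multicast\n", n);
        close(sock);
        return 0;
}
```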
The second way to implement publish/subscribe is through a mechanism like the Berkeley packet filter provided by the networking layer. This allows a socket to request to receive all messages, with a filter removing some of them based on their content under the control of an almost arbitrary filter program (which can now be JIT-compiled). This is more in line with the approach that kdbus uses: D-Bus allows clients to present "match" rules so that they receive all messages whose content matches the rules, and kdbus extracts those rules by monitoring D-Bus traffic and uses them to perform multicast routing in the kernel.
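The underlying filter mechanism is easy to demonstrate. The sketch below attaches a tiny classic BPF program to a packet socket (so it needs CAP_NET_RAW) that accepts only UDP packets before the socket ever sees them; kdbus applies the same general idea to D-Bus messages rather than IP packets.

```c
/* Sketch: attaching a classic BPF filter to a socket.  The filter keeps
 * only IPv4 packets whose protocol field says UDP and drops the rest in
 * the kernel.  Needs CAP_NET_RAW because it uses a packet socket. */
#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/if_ether.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        /* SOCK_DGRAM packet socket: received data starts at the IP header. */
        int sock = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
        if (sock < 0) {
                perror("socket (needs CAP_NET_RAW)");
                return 1;
        }

        struct sock_filter prog[] = {
                /* Load the one-byte IPv4 protocol field (offset 9). */
                BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 9),
                /* If it is UDP, fall through to "accept"; otherwise skip to "drop". */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, IPPROTO_UDP, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, 0xffffffff),  /* accept whole packet */
                BPF_STMT(BPF_RET | BPF_K, 0),           /* drop */
        };
        struct sock_fprog fprog = {
                .len = sizeof(prog) / sizeof(prog[0]),
                .filter = prog,
        };
        if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER,
                       &fprog, sizeof(fprog)) < 0) {
                perror("SO_ATTACH_FILTER");
                return 1;
        }

        char buf[2048];
        ssize_t n = recv(sock, buf, sizeof(buf), 0);  /* only UDP gets this far */
        printf("received %zd filtered bytes\n", n);
        close(sock);
        return 0;
}
```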
Alban Crequy, the author of kdbus, appears to have been exploring both of these approaches. Any new fast-IPC mechanism introduced into Linux would do well to take this effort into consideration, to ensure that it meets these use cases well.
Single copy
A recurring goal in many efforts at improving communication speed is to reduce the number of times that message data is copied in transit. "Zero-copy" is sometimes seen as the holy grail and, while it is usually impractical to reach, single-copy can be attained; three of our four examples do achieve it. The fourth, kdbus, doesn't really try: the standard D-Bus mechanism is four copies - sender to kernel to daemon to kernel to receiver - and kdbus reduces this to two copies (and, more particularly, reduces the number of context switches to one), which is quite an improvement. The others all aim for single-copy operation.
CMA and KNEM achieve single-copy performance by providing an operation - a new system call for CMA, an ioctl() for KNEM - which simply copies from one address space to the other, with the various restrictions we have already seen. This is simple, but not secure in a hostile environment. Binder is, again, quite different. With binder, part of the address space of each process is managed by the binder module: the process calls mmap() on the binder file descriptor, and binder then allocates pages and places them in that area as required.
This mapped memory is read-only to the process; all writing is performed by the kernel. When a message is sent from one process to another, the kernel allocates some space in the destination process's mapped area, copies the message directly from the sending process, and then queues a short message to the receiving process telling it where the received message is. The recipient can then access that message directly and will ultimately tell the binder module that it is finished with the message and that the memory can be reused.
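A sketch of just the setup step described above, assuming an Android-style kernel that exposes the driver as /dev/binder; the buffer size is an arbitrary but conventional choice, and the point is only that the mapping is created read-only because the driver does all the writing.

```c
/* Sketch: setting up a binder receive area.  Assumes the Android binder
 * driver is present as /dev/binder.  The driver rejects writable mappings;
 * it copies incoming messages into this area itself. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/binder", O_RDWR);
        if (fd < 0) {
                perror("open /dev/binder");
                return 1;
        }

        /* Roughly 1MB of receive area, mapped read-only. */
        size_t size = 1024 * 1024 - 2 * 4096;
        void *area = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (area == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Incoming messages will be placed somewhere inside 'area' by the
         * kernel; a real client would now loop on the BINDER_WRITE_READ
         * ioctl() to send commands and learn where each message landed. */
        printf("binder buffer mapped read-only at %p\n", area);

        munmap(area, size);
        close(fd);
        return 0;
}
```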
While this approach may seem a little complex - having the kernel effectively provide a malloc() implementation (best-fit, as it happens) for the receiving process - it has the particular benefit of requiring no synchronization between the sender and the recipient. The copy happens immediately for the sender, which can then move on, treating the transfer as complete. The receiver doesn't need to know anything about the message until it is all there, ready and waiting (much better to have the message waiting than the processes waiting).
This asynchronous behavior is common to all the single-copy mechanisms, which makes one wonder whether Linux's AIO (Asynchronous Input/Output) subsystem might provide another possible approach. The sender could submit an asynchronous write, the recipient an asynchronous read, and when the second of the two arrives the copy is performed and each party is notified. One unfortunate, though probably minor, issue with this approach is that, while Linux AIO can submit multiple read and write requests in a single system call and can receive multiple completion notifications in another, it cannot do both in one. This contrasts with binder, which has a WRITE_READ ioctl() command that sends messages and then waits for the reply, allowing an entire transaction to happen in a single system call. As we have seen with the addition of recvmmsg() and, more recently, sendmmsg(), doing multiple things in a single system call has real advantages, a point Dave Miller has also made.
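For readers unfamiliar with the interface, here is a minimal sketch of the Linux AIO pattern using the raw system calls (glibc does not wrap them; libaio would also do). It writes one buffer to an arbitrary file, but the point is the shape of the interface: one system call to submit, a separate one to collect the completion.

```c
/* Sketch of the Linux AIO interface referred to above, via raw syscalls.
 * On a buffered regular file the write may in fact complete during
 * submission, but the two-system-call shape of the interface is the same. */
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        aio_context_t ctx = 0;
        if (syscall(SYS_io_setup, 8, &ctx) < 0) {       /* create a queue */
                perror("io_setup");
                return 1;
        }

        /* Arbitrary demo file. */
        int fd = open("/tmp/aio-demo", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        char buf[] = "hello, asynchronous world\n";
        struct iocb cb = {
                .aio_lio_opcode = IOCB_CMD_PWRITE,
                .aio_fildes     = fd,
                .aio_buf        = (unsigned long)buf,
                .aio_nbytes     = sizeof(buf) - 1,
                .aio_offset     = 0,
        };
        struct iocb *cbs[1] = { &cb };

        /* System call 1: queue the request. */
        if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) {
                perror("io_submit");
                return 1;
        }

        /* System call 2: wait for (at least) one completion. */
        struct io_event ev;
        if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
                printf("write completed, result %lld\n", (long long)ev.res);

        syscall(SYS_io_destroy, ctx);
        close(fd);
        return 0;
}
```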
Tracking transactions
All of the high-level APIs for IPC make a distinction between requests and replies, connecting them in some way to form a single transaction. Most of the in-kernel support for messaging doesn't preserve this distinction with any real clarity. Messages are just messages and it is up to user space to determine how they are interpreted. The binder module is again an exception; understanding why helps expose an important aspect of the binder approach.
Though the code and the API do not present it exactly like this, the easiest way to think about the transaction tracking in binder is to imagine that each message has a "transaction ID" label. A request and its reply will have the same label. Further, if the recipient of the message finds that it needs to make another IPC before it can generate a final reply, it will use the same label on this intermediate IPC, and will obviously expect it on the intermediate reply.
With this labeling in place, binder allows (and in fact requires) a thread which has sent a message, and which is waiting for the reply, to receive only further messages carrying the same transaction ID. This rule allows a thread to respond to recursive calls, and thus lets its own original request progress, but causes it to ignore any new calls until the current one is complete. If a process is multithreaded, each thread can work on independent transactions separately, but a single thread is tied to one complex transaction at a time.
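A conceptual sketch of that rule follows; the types and names are invented for illustration and this is not the binder driver's logic. A thread waiting on transaction 7 accepts a recursive call or the reply carrying that label, but holds back an unrelated new request.

```c
/* Conceptual sketch (invented names, not binder code) of the rule above:
 * while blocked waiting for a reply, a thread accepts only messages
 * carrying its current transaction ID, so recursive calls back into it
 * can proceed while unrelated new requests stay queued. */
#include <stdio.h>

struct msg {
        int txn_id;              /* label shared by request, nested calls, reply */
        int is_reply;
        const char *payload;
};

/* Decide whether a queued message may be delivered to this thread now. */
static int deliverable(const struct msg *m, int waiting_txn)
{
        if (waiting_txn == 0)            /* idle thread: anything may arrive */
                return 1;
        return m->txn_id == waiting_txn; /* otherwise only the same transaction */
}

int main(void)
{
        /* This thread has sent a request labelled transaction 7 and waits. */
        int waiting_txn = 7;

        struct msg queue[] = {
                { .txn_id = 9, .is_reply = 0, .payload = "new unrelated request" },
                { .txn_id = 7, .is_reply = 0, .payload = "recursive call back to us" },
                { .txn_id = 7, .is_reply = 1, .payload = "final reply" },
        };

        for (unsigned i = 0; i < sizeof(queue) / sizeof(queue[0]); i++) {
                const struct msg *m = &queue[i];
                if (!deliverable(m, waiting_txn)) {
                        printf("held back: %s (txn %d)\n", m->payload, m->txn_id);
                        continue;
                }
                printf("delivered: %s (txn %d)\n", m->payload, m->txn_id);
                if (m->is_reply)
                        waiting_txn = 0;         /* transaction complete */
        }
        return 0;
}
```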
Apart from possibly simplifying the user-space programming model, this rule allows the transaction as a whole to have a single CPU scheduling priority, inherited from the originating process. Binder presents a model in which there is just one thread of control involved in a method call, but that thread may wander from one address space to another to carry out different parts of the task. This migration of process priority allows that model to be more fully honored.
While many of the things that binder does are "a bit different", this is probably the most unusual. Having the same open file descriptor behave differently in different threads is not what most of us would expect. Yet it seems to be a very effective way to implement an apparently useful feature. Whether this feature is truly of general use, and whether there is a more idiomatic way to provide it in Linux, are difficult questions. However, they are questions that need to be addressed if we want the best possible high-speed IPC in our kernel of choice.
Inter-Programmer Communication
There is certainly no shortage of interesting problems to solve in the Linux kernel, and equally no shortage of people with innovative and creative solutions. Here we have seen four quite different approaches to one particular problem and how each brings value of one sort or another. However each could probably be improved by incorporating ideas and approaches from one of the others, or by addressing needs that others present.
My hope is that by exposing and contrasting the different solutions and the problems they address, we can take a step closer to finding unifying solutions that address both today's needs and the needs of our grandchildren.
Index entries for this article:
Kernel: Message passing
GuestArticles: Brown, Neil
Comments

Fast interprocess communication revisited
Posted Nov 10, 2011 4:34 UTC (Thu) by karim (subscriber, #114)

Fast interprocess communication revisited
Posted Nov 10, 2011 4:58 UTC (Thu) by dlang (guest, #313)

If you have not already looked at the zeroMQ guide (0MQ, http://zguide.zeromq.org/page:all ), you should probably do so. There is a very interesting situation that shows up about halfway through where you take a standard 'request-response' sequence and flip it around, so that the server process that's going to provide the service sends the 'request' and the 'response' is from the client asking the server process to do something.

Think carefully about what terminology you use; when the technology gets twisted in interesting ways, the words you pick may end up causing confusion.

The 0MQ guide has a lot of discussion about different ways to use IPC messages, while still leaving the contents of the messages (outside the routing) as having meaning only to the application.

Fast interprocess communication revisited
Posted Nov 18, 2011 6:00 UTC (Fri) by sustrik (guest, #62161)

Other Linux IPC mechanisms
Posted Nov 10, 2011 20:28 UTC (Thu) by abacus (guest, #49001)

I'm afraid the above overview is incomplete. At least the following inter-process communication methods are supported by the Linux kernel but were not mentioned in the above article:

Other Linux IPC mechanisms
Posted Nov 12, 2011 2:07 UTC (Sat) by drag (guest, #31333)

FIFOs, files, pipes, POSIX message queues, POSIX semaphores, signals, and sockets are all forms of IPC.

I believe that the pthread stuff uses the POSIX version of shared memory, but you can also use the "System V"-style shared memory, System V message queues, and System V semaphores for other things.

:-)

Other Linux IPC mechanisms
Posted Nov 12, 2011 3:37 UTC (Sat) by neilbrown (subscriber, #359)

I think that a survey of all the different IPC mechanisms, comparing and contrasting their features, discussing the appropriate use cases and actual libraries/apps that use them, together with various performance measurements (with code), would make a very interesting and valuable article.

But this was not that article.

Other Linux IPC mechanisms
Posted Nov 18, 2011 2:12 UTC (Fri) by jd (guest, #26381)

Fast interprocess communication revisited
Posted Nov 11, 2011 0:31 UTC (Fri) by cmccabe (guest, #60281)

I didn't realize Binder had the transaction ID semantics you mentioned.

It might be worth mentioning that Binder allows the callee to identify the caller's UID and PID. Since Android gives each application its own UID, this is sufficient to implement a relatively sophisticated security policy.

Also, since Android 2.0, bionic supports mutexes and semaphores with PTHREAD_PROCESS_SHARED. (It's just futexes, after all.) So you can roll your own zero-copy solution if you're so inclined. As best as I understand, Binder always does one copy, so if you're passing an enormous amount of data, you might want to do this.

Fast interprocess communication revisited
Posted Nov 12, 2011 3:32 UTC (Sat) by neilbrown (subscriber, #359)

The passing of the uid and pid is no different from Unix-domain sockets, except that they pass the gid as well, where binder doesn't. As I wanted to report on novelty rather than just pick holes in things there seemed little point in mentioning it - but yes, it could certainly be useful.

I wonder about the wisdom of using shared-memory locks (such as a futex) between processes with different uids. Presumably the reason that they have different uids is that they don't trust each other. And as a rogue process can corrupt a futex, leaving the other processes unable to work effectively with them, it doesn't seem safe. Between processes in the same security domain (with the same uid) it does make sense, but I'm not sure that applies to Android.

Ah yes - you might be able to create a zero-copy solution with shared mappings (if you trust the other processes), but my understanding from the CMA discussion was that this isn't as easy as it sounds. It means that you need to keep all of the data that you might want to send to some other process in shared memory, which comes close to having your entire heap in shared memory. There might be reasons not to want to do this. But I'm not speaking from experience, just from what I've read, so this is probably an over-simplification.

Fast interprocess communication revisited
Posted Nov 14, 2011 11:27 UTC (Mon) by abacus (guest, #49001)

> I wonder about the wisdom of using shared-memory locks (such as a futex) between processes with different uids. Presumably the reason that they have different uids is that they don't trust each other.

One way of setting up shared memory is by mapping a tmpfs file in memory. One can control which processes can map such a file by assigning proper (group) security attributes to it.

It's absolutely impossible.
Posted Nov 14, 2011 17:31 UTC (Mon) by khim (subscriber, #9252)

> And as a rogue process can corrupt a futex, leaving the other processes unable to work effectively with them, it doesn't seem safe.

It's impossible to corrupt a futex, no matter "what" and "if". The futex exists entirely in kernel space; it does not matter whether the processes trust each other or not. You can corrupt an interprocess mutex (built on top of a futex), but that is a question of the mutex implementation.

Fast interprocess communication revisited
Posted Nov 14, 2011 19:23 UTC (Mon) by cmccabe (guest, #60281)

I'm curious whether it is safe in theory to share a futex with a malicious process. Obviously, the malicious guy can cause a denial of service (for example, by refusing to release the lock), but can he do more than that? I've read through some of the futex READMEs, including Ulrich Drepper's, but I still don't feel confident enough to answer this question.

(This is ignoring the practical reality that when you share enough memory, you'll inevitably forget to validate something somewhere in one process or the other.)

Fast interprocess communication revisited
Posted Nov 17, 2011 14:24 UTC (Thu) by renox (guest, #23785)

I don't think that Binder is for 'untrusted' usage. If I read the article correctly, when you send a message it immediately becomes part of the memory of the destination process, so it seems easy to create a DoS (increase the memory usage of the destination process too much and it would be killed) - but maybe there are safeguards which weren't mentioned.

Fast interprocess communication revisited
Posted Nov 17, 2011 19:53 UTC (Thu) by neilbrown (subscriber, #359)

So the worst one process can do to another is fill up its incoming queue so that it cannot get real work done.

publish/subscribe in kernel
Posted Nov 17, 2011 10:17 UTC (Thu) by sustrik (guest, #62161)

https://gist.github.com/1372839

The implementation can be found here:

https://github.com/250bpm/linux-2.6

https://github.com/250bpm/sp-userland

LINX at sourceforge.
Posted Nov 17, 2011 15:58 UTC (Thu) by nysan (guest, #81015)