November 9, 2011
This article was contributed by Neil Brown
Slightly over a year ago, LWN reported on a
couple of different kernel patches aimed at providing fast, or at least
faster, interprocess communication (IPC): Cross Memory Attach (CMA) and
kernel-dbus (kdbus). In one of the related email threads on
the linux-kernel list, a third (pre-existing) kernel patch called KNEM was discussed.
Meanwhile, yet another kernel module - "binder" used by the Android
platform - is in use in millions of devices worldwide to provide fast IPC,
and Linus recently observed that code that
actually is used is the code that is actually worth something, so maybe
more of the Android code should be merged despite objections from some
corners. Binder wasn't explicitly mentioned in that discussion but could
reasonably be assumed to be included.
This article is not about whether any of these should be merged or not. That
is largely an engineering and political decision in which this author claims no
expertise, and in any case one of them - CMA - has already been
merged.
Rather, we start with the observation that this many attempts to solve
essentially the same problem suggests that something is lacking in Linux.
There is, in other words, a real need for fast IPC that Linux doesn't address. The
current approaches to filling this gap seem piecemeal: each patchset
clearly addresses the needs of a specific IPC model without obvious
consideration for others. While this may solve current problems, it may not
solve future problems, and one of the strengths of the design of Unix and hence
Linux is the
full exploitation of a few key ideas rather
than the ad hoc accumulation of many distinct (though related)
ideas.
So, motivated by that observation, we will explore these various implementations
to try to discover and describe the commonality they share and to highlight the
key design decisions each one makes. Hopefully this will lead to a greater
understanding of both the problem space and the solution space. Such
understanding may be our best weapon against chaos in the kernel.
What's your address?
One of the interesting differences between the different IPC schemes is
their mechanism for specifying the destination for a message.
CMA uses a process ID (PID) combined with offsets in the address space
of that process - a message is simply copied to that location. This has the
advantage of being very simple and efficient. PIDs are already managed by the
kernel and piggy-backing on that facility is certainly attractive. The obvious
disadvantage is that there is no room for any sophistication in access control,
so messages can only be sent to processes with exactly the same credentials.
This will not suit every context, but it is not a problem for the target area
(MPI, the message passing interface), which is aimed at massively parallel
implementations in which all the processes are working together on one task.
In that case having uniform credentials is an obvious choice.
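For reference, CMA was merged in the form of the process_vm_readv() and
process_vm_writev() system calls. A minimal sketch of the read side, assuming
the peer's PID and buffer address have been shared through some out-of-band
channel:

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Read 'len' bytes from address 'remote_addr' in process 'pid' into 'buf'.
     * The PID and remote address must be communicated out of band, and the
     * caller needs matching credentials (or ptrace rights over the target). */
    static ssize_t read_peer(pid_t pid, void *remote_addr, void *buf, size_t len)
    {
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* One system call, one copy: the kernel copies straight from the
         * remote address space into our buffer. */
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }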
KNEM uses a "cookie" which is a byte string provided by the kernel and
which can be
copied between processes. One process registers a region of memory with KNEM
and receives a cookie in return. It can then pass this cookie to other
processes as a simple byte string; the recipients can then copy to or from the
registered region using that cookie as an address. Here again there is an
assumption that the processes are co-operating and not a threat to each other
(KNEM is also used for MPI). KNEM does not actually check process credentials
directly, so any process that registers a region with KNEM is effectively
allowing any other process that is able to use KNEM (i.e. able to open a
specific character device file) to freely access that memory.
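In outline, the pattern looks something like the sketch below. To be clear,
these structure and ioctl names are invented for illustration and are not
KNEM's actual ABI; they simply show the register-then-copy-by-cookie flow
described above.

    /* Hypothetical sketch only: these names are not KNEM's real interface. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct region_register {         /* process A: register a buffer */
        void     *base;
        size_t    length;
        uint64_t  cookie;            /* filled in by the kernel */
    };

    struct region_copy {             /* process B: copy using the cookie */
        uint64_t  cookie;            /* opaque byte string received from A */
        uint64_t  offset;            /* offset within the registered region */
        void     *local_buf;
        size_t    length;
        int       write;             /* 0 = read from region, 1 = write to it */
    };

    #define REGION_REGISTER _IOWR('k', 1, struct region_register)
    #define REGION_COPY     _IOW('k', 2, struct region_copy)

    /* A: ioctl(fd, REGION_REGISTER, &reg); then send reg.cookie to B.
     * B: ioctl(fd, REGION_COPY, &cp); any process able to open the character
     *    device and holding the cookie can access the registered region. */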
Kdbus follows the model of D-Bus and uses simple strings to direct messages. It
monitors all D-Bus traffic to find out which endpoints own which names and then,
when it sees a message sent to a particular name, it routes it accordingly
rather than letting it go through the D-Bus daemon for routing.
Binder takes a very different approach from the other three. Rather than using
names that appear the same to all processes, binder uses a kernel-internal
object for which different processes see different object descriptors: small
integers much like file descriptors. Each object is owned by a particular
process (which can create new objects quite cheaply) and a message sent to an
object is routed to the owning process. As each process is likely to have a
different descriptor (or none at all) for the one object, descriptors cannot be
passed as byte strings. However they can be passed along with binder messages
much like file descriptors can be passed using Unix-domain sockets.
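The comparison with file-descriptor passing is apt: passing an fd over a
Unix-domain socket also requires the kernel to translate the sender's
descriptor into a new one in the receiver's table, which is essentially the
service binder provides for its object descriptors. For reference, the
familiar SCM_RIGHTS pattern:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send file descriptor 'fd' across the connected Unix-domain socket 'sock'.
     * The kernel installs a fresh descriptor in the receiving process - much
     * as binder translates object descriptors between processes. */
    static ssize_t send_fd(int sock, int fd)
    {
        char dummy = '*';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov        = &iov,
            .msg_iovlen     = 1,
            .msg_control    = u.buf,
            .msg_controllen = sizeof(u.buf),
        };

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0);
    }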
The main reason for using descriptors rather than names appears to
involve reference counting. Binder is designed to work in an object-oriented
system which (unsurprisingly) involves passing messages to objects,
where the messages can contain references to other objects.
This is exactly the pattern seen in the kernel module. Any such system
needs some way
of determining when an object is no longer referenced, the typical
approaches being garbage collection and reference counting. Garbage collection
across multiple different processes is unlikely to be practical, so reference
counting is the natural choice. As binder allows communication between
mutually suspicious processes, there needs to be some degree of enforcement: a
process should not be able to send a message when it doesn't own a reference to
the target, and when a process dies, all its references should be released. To
ensure these rules are met it is hard to come up with any scheme much simpler
than the one used by binder.
Possibly the most interesting observation here is that two addressing schemes
used widely in Linux are completely missing in these implementations: file
descriptors and socket addresses (struct sockaddr).
File descriptors are used for pipes (the original UNIX IPC), for socket pairs
and other connected sockets, for talking to devices, and much more. It is not
hard to imagine them being used by CMA, and binder too. They are
appealing as they can be used with simple read() and write()
calls and similar standard interfaces. The most likely reason that they are
regularly avoided is their cost - they are not exactly lightweight. On
an x86_64 system a struct file - the minimum needed for each
file descriptor - is 288 bytes. Of these, maybe 64 are relevant to many novel
use cases; the rest is dead weight. This weight could possibly be reduced by a
more object-oriented approach to struct
file but such a change would be very intrusive and is unlikely to happen.
So finding other approaches is likely to become common. We see that already
in the inotify subsystem, which has "watch descriptors"; we see it here
in binder too.
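The inotify case shows what the lightweight alternative looks like in
practice: one real file descriptor for the whole instance, with cheap integer
watch descriptors layered on top of it. For example:

    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        /* One struct file for the whole inotify instance... */
        int fd = inotify_init1(IN_CLOEXEC);
        if (fd < 0) {
            perror("inotify_init1");
            return 1;
        }

        /* ...and cheap integer watch descriptors on top of it, avoiding the
         * cost of a full struct file per watched object. */
        int wd1 = inotify_add_watch(fd, "/tmp", IN_CREATE | IN_DELETE);
        int wd2 = inotify_add_watch(fd, "/etc", IN_MODIFY);

        printf("watch descriptors: %d %d\n", wd1, wd2);
        close(fd);
        return 0;
    }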
The avoidance of socket addresses does not seem to admit such a neat
answer. In the cases of CMA, kdbus, and binder, socket addresses simply don't
seem to fit the need, for various different reasons. For KNEM it seems best
explained as an arbitrary choice. The
developer chose to write a new character device rather than a new networking
domain (aka address family) and so used ioctl() and ad
hoc addresses
rather than sendmsg()/recvmsg() and socket addresses.
The conclusion here seems to be that there is a constant tension between
protection and performance. Every step we take to control what one process can
do to another by building meaning into an address adds extra setup cost and
management cost. Possibly the practical approach is not to try to choose
between them but to unify them and allow each client to choose. So a
client could register itself with an external address that any other process
can use if it knows it, or with an internal address (like the binder objects)
which can only be used by a process that has explicitly been given it. Further,
a registered address may only accept explicit messages, or may be bound to a
memory region that other processes can read and write directly. If such
addresses and messages could be used interchangeably in the one domain it might
allow a lot more flexibility for innovation.
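To make the suggestion a little more concrete, a purely hypothetical
registration interface (nothing like this exists; all the names are invented)
might let a client choose, per address, between the two visibility models and
the two binding models:

    /* Hypothetical sketch only - no such interface exists. */
    #include <stddef.h>

    enum addr_visibility {
        ADDR_EXTERNAL,      /* well-known name any process may use */
        ADDR_INTERNAL,      /* handle that must be explicitly passed, binder-style */
    };

    enum addr_binding {
        BIND_MESSAGES,      /* address accepts explicit messages only */
        BIND_MEMORY,        /* address backed by a region peers may read/write */
    };

    struct ipc_register {
        enum addr_visibility visibility;
        enum addr_binding    binding;
        const char          *name;        /* used only for ADDR_EXTERNAL */
        void                *region;      /* used only for BIND_MEMORY */
        size_t               region_len;
    };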
Publish and subscribe
One area where kdbus stands out from the rest is in support for a
publish/subscribe interface.
Each of the higher-level IPC services (MPI, Binder, D-Bus) has some sort of
multicast or broadcast facility, but only kdbus tries to bring it into the
kernel. This could simply reflect the fact that multicast
does not need to be optimized and can be adequately handled in user space.
Alternatively, it could mean that implementing it in the kernel is too hard, so
few people try.
There are two ways we can think about implementing a publish/subscribe
mechanism. The first follows the example of IP multicast where a certain class
of addresses is defined to be multicast addresses and sockets can
request to receive multicasts to selected addresses. Binder does actually
have a very
limited form of this. Any binder client can ask to be notified when a
particular object dies; when a client closes its handle on the binder
(e.g. when it exits), all the objects it owns die and messages are accordingly
published to all clients who have subscribed to that object. It would be
tempting to turn this into a more general publish/subscribe scheme.
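For comparison, the IP version of this first model is just a group-membership
request on a socket; a more general binder-style publish/subscribe scheme
would presumably need an analogous "subscribe to this address" operation:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Subscribe a UDP socket to the multicast group 239.0.0.1; anything sent
     * to that address is then delivered to every subscribed socket. */
    static int join_group(int sock)
    {
        struct ip_mreq mreq;

        memset(&mreq, 0, sizeof(mreq));
        mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);

        return setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                          &mreq, sizeof(mreq));
    }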
The second way to implement publish/subscribe is through a mechanism like
the Berkeley packet filter that the networking layer
provides. This allows a socket to request all messages, but to have a filter
remove some of them based on their content by running an almost arbitrary
program (which can now be JIT compiled). This
is more in line with the approach that kdbus uses. D-Bus allows clients to
present "match" rules such that they receive all messages with content that
matches the rules. kdbus extracts those rules by monitoring D-Bus traffic and
uses them to perform multicast routing in the kernel.
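For reference, attaching a classic BPF filter to a socket looks like this
today. The program below is a trivial accept-everything placeholder, but a
kdbus-style filter would instead inspect the message content and drop anything
that matches no subscription:

    #include <linux/filter.h>
    #include <sys/socket.h>

    /* Attach a (trivial) classic BPF program to 'sock'.  A real
     * publish/subscribe filter would examine the message and return 0
     * to drop messages that match no subscription. */
    static int attach_accept_all(int sock)
    {
        struct sock_filter code[] = {
            /* Return a large length: accept the whole packet. */
            BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
        };
        struct sock_fprog prog = {
            .len    = sizeof(code) / sizeof(code[0]),
            .filter = code,
        };

        return setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER,
                          &prog, sizeof(prog));
    }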
Alban Crequy, the author of kdbus, appears to have been
exploring
both of these approaches. It would be well worth considering this effort
in any new
fast-IPC mechanism introduced into Linux to ensure it meets all use cases well.
Single copy
A recurring goal in many efforts at improving communication speed is to reduce
the number of times that message data is copied in transit. "Zero-copy" is
sometimes seen as the holy grail and, while it is usually
impractical to reach that, single-copy can be attained; three of our four
examples do achieve it.
The fourth,
kdbus, doesn't really try to achieve single-copy. The standard D-Bus mechanism
is four copies - sender to kernel to daemon to kernel to receiver. Kdbus
reduces this to two copies (and, more particularly, reduces context switches
to one), which is quite an improvement. The others all aim for single-copy
operation.
CMA and KNEM achieve single-copy performance by providing a system call
which simply copies
from one address space to the other with various restrictions as we have
already seen. This is simple, but not secure in a hostile environment.
Binder is, again, quite different. With binder, part of the address space of
each process is managed by the binder module through the process calling
mmap()
on the binder file descriptor. Binder then allocates pages and places them in
the address space as required.
This mapped memory is read-only to the process; all writing is performed by the
kernel. When a message is sent from one process to another the kernel
allocates some space in the destination process's mapped area, copies the
message directly
from the sending process, and then queues a short message to the receiving
process telling it where the received message is. The recipient can then access
that message directly and will ultimately tell the binder module that it is
finished with the message and that the memory can be reused.
While this approach may seem a little complex - having the kernel effectively
provide a malloc() implementation (best fit as it happens) for the
receiving process - it has the particular benefit that it requires no
synchronization between the sender and the recipient. The copy happens
immediately from the sender's point of view, so it can move on knowing the
transfer is complete.
The receiver doesn't need to know anything about the message until it is all
there ready and waiting (much better to have the message waiting than
the processes waiting).
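In user-space terms the receive path looks roughly like the sketch below. It
follows the binder header of the time (the header location and field types
vary between kernel versions) and omits all error handling and the encoding of
individual commands:

    /* Rough sketch of the user-space side; details and error handling omitted. */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/android/binder.h>

    #define MAP_SIZE (1024 * 1024)

    static int binder_setup(void **mapped)
    {
        int fd = open("/dev/binder", O_RDWR);

        /* The receive area: read-only to us; the kernel allocates message
         * buffers out of it as transactions arrive. */
        *mapped = mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
        return fd;
    }

    static void binder_transact(int fd, void *out, size_t outlen,
                                void *in, size_t inlen)
    {
        struct binder_write_read bwr = {
            .write_buffer = (unsigned long)out,  /* commands/payload to send   */
            .write_size   = outlen,
            .read_buffer  = (unsigned long)in,   /* returned commands refer to */
            .read_size    = inlen,               /* buffers in the mmap() area */
        };

        /* Send, then wait for the reply, in a single system call. */
        ioctl(fd, BINDER_WRITE_READ, &bwr);
    }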
This asynchronous behavior is common to all the single-copy mechanisms, which
makes one wonder if using Linux's AIO (Asynchronous Input/Output) subsystem
might provide
another possible approach. The sender could submit an asynchronous write, the
recipient an asynchronous read, and when the second of the two arrives the copy
is performed and each is notified. One unfortunate, though probably minor,
issue with this approach is that, while Linux AIO can submit multiple read and
write requests in a single system call and can receive multiple completion
notifications in another system call, it cannot do both in one. This contrasts
with binder, which has a WRITE_READ ioctl() command
that sends messages and then
waits for the reply, allowing an entire transaction to happen in a single system
call. As we have seen with the addition of
recvmmsg() and, more recently,
sendmmsg(), doing multiple things in a single
system call has real advantages. As Dave Miller
once observed:
The old adage about syscalls being cheap no longer holds when
we're talking about traversing all the way into the protocol
stack socket code every call, taking the socket lock every
time, etc.
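For illustration, recvmmsg() batches only the receive side, but it shows the
general pattern of amortizing the system call: several messages collected in
one trip into the kernel. A rough sketch:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 8
    #define MSGSZ 1500

    /* Receive up to BATCH datagrams from 'sock' in a single system call. */
    static int receive_batch(int sock)
    {
        static char bufs[BATCH][MSGSZ];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len  = MSGSZ;
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* One syscall collects up to BATCH messages; msgs[i].msg_len reports
         * how much data each one carried. */
        int n = recvmmsg(sock, msgs, BATCH, 0, NULL);
        for (int i = 0; i < n; i++)
            printf("message %d: %u bytes\n", i, msgs[i].msg_len);
        return n;
    }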
Tracking transactions
All of the high-level APIs for IPC make a distinction between requests and
replies, connecting them in some way to form a single transaction. Most of the
in-kernel support for messaging doesn't preserve this distinction with any real
clarity. Messages are just messages and it is up to user space to
determine how they are
interpreted. The binder module is again an exception; understanding why
helps expose an important aspect of the binder approach.
Though the code and the API do not present it exactly like this, the easiest
way to think about the transaction tracking in binder is to imagine that each
message has a "transaction ID" label. A request and its reply will
have the same label. Further, if the recipient of the message finds that it
needs to make another IPC before it can generate a final reply, it will use the
same label on this intermediate IPC, and will obviously expect it on the
intermediate reply.
With this labeling in place, Binder allows (and in fact requires) a thread
which has sent a message, and which is waiting for a reply to that message,
to receive only further messages with the same transaction ID. This rule
allows a thread to respond to recursive calls, thus allowing its own original
request to progress, but causes it to ignore any new
calls until the current one is complete. If a process is multithreaded,
each thread can work on independent transactions separately, but a single
thread is tied to one complex transaction at a time.
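A user-space dispatch loop built on that rule might look roughly like the
sketch below. The message structure and helper functions are invented
stand-ins for whatever a binder-style library would provide, not binder's real
protocol:

    /* Hypothetical sketch: the types and helpers here are invented to
     * illustrate the rule, not binder's actual interface. */
    struct msg {
        unsigned long txn_id;    /* label shared by a request and its reply */
        int           is_reply;
        /* ... payload ... */
    };

    /* Assumed helpers (implementations not shown). */
    extern void        send_msg(struct msg *m);
    extern struct msg *wait_for_msg(unsigned long txn_id);   /* filtered by ID */
    extern struct msg *handle_request(struct msg *request);

    /* Send 'req' and block until its reply arrives.  While waiting, this
     * thread may only be handed nested requests carrying the same
     * transaction ID; anything else stays queued for other threads. */
    struct msg *call(struct msg *req)
    {
        send_msg(req);

        for (;;) {
            struct msg *m = wait_for_msg(req->txn_id);

            if (m->is_reply)
                return m;                    /* our transaction is complete */

            /* A recursive call made on our behalf: handle it on this same
             * thread and reply with the same transaction ID. */
            struct msg *reply = handle_request(m);
            reply->txn_id   = m->txn_id;
            reply->is_reply = 1;
            send_msg(reply);
        }
    }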
Apart from possibly simplifying the user-space programming model, this allows
the transaction as a whole to have a single CPU scheduling priority inherited
from the originating process. Binder presents a model in which there is just one
thread of control involved in a method call, but that thread may wander
from one address
space to another to carry out different parts of the task. This migration of
process priority allows that model to be more fully honored.
While many of the things that binder does are "a bit different", this is
probably the most unusual. Having the same open file descriptor behave
differently in different threads is not what most of us would expect. Yet it
seems to be a very effective way to implement an apparently useful feature.
Whether this feature is truly generally useful, and whether or not there is a
more idiomatic way to provide it in Linux, are difficult questions. However,
they are questions that need to be addressed if we want the best possible
high-speed IPC in our kernel of choice.
Inter-Programmer Communication
There is certainly no shortage of interesting problems to solve in the Linux
kernel, and equally no shortage of people with innovative and creative
solutions. Here we have seen four quite different approaches to one particular
problem and how each brings value of one sort or another. However, each could
probably be improved by incorporating ideas and approaches from one of the
others, or by addressing needs that others present.
My hope is that by exposing and contrasting the different solutions and the
problems they address, we can take a step closer to finding unifying solutions
that address both today's needs and the needs of our grandchildren.