Fast interprocess messaging
The first of these patches is motivated by a desire to make MPI faster. Intra-node communications in MPI are currently handled with shared memory, but that is still not fast enough for some users. Rather than copy messages through a shared segment, they would rather deliver messages directly into another process's address space. To this end, Christopher Yeoh has posted a patch implementing what he calls cross memory attach.
This patch implements a pair of new system calls:
    int copy_from_process(pid_t pid, unsigned long addr, unsigned long len,
                          char *buffer, int flags);
    int copy_to_process(pid_t pid, unsigned long addr, unsigned long len,
                        char *buffer, int flags);
A call to copy_from_process() will attempt to copy len bytes, starting at addr in the address space of the process identified by pid into the given buffer. The current implementation does not use the flags argument. As would be expected, copy_to_process() writes data into the target process's address space. Either both processes must have the same ownership or the copying process must have the CAP_SYS_PTRACE capability; otherwise the copy will not be allowed.
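As a rough illustration of how an MPI library might use the proposed calls, here is a minimal sketch. It assumes user-space wrappers with the prototypes shown above (the names come from the patch, but no such wrappers exist in any released C library), and it assumes the receiver has advertised the address of its posted receive buffer through some out-of-band channel; both assumptions are the author's, not part of the patch.

    /* Sketch only: deliver a message straight into a peer's posted
     * receive buffer using the proposed cross-memory-attach calls. */
    #include <sys/types.h>   /* pid_t */
    #include <stddef.h>      /* size_t */

    /* Prototypes as proposed in the patch (assumed wrappers). */
    int copy_from_process(pid_t pid, unsigned long addr, unsigned long len,
                          char *buffer, int flags);
    int copy_to_process(pid_t pid, unsigned long addr, unsigned long len,
                        char *buffer, int flags);

    /* Sender side: one copy, from our buffer into the receiver's address
     * space.  The receiver's buffer address (remote_buf) must have been
     * exchanged out of band, e.g. through a small shared control block.
     * Both processes must share ownership, or the sender needs
     * CAP_SYS_PTRACE. */
    static int send_direct(pid_t receiver, unsigned long remote_buf,
                           const char *msg, size_t len)
    {
        /* The proposed prototype takes a non-const buffer. */
        return copy_to_process(receiver, remote_buf, (unsigned long)len,
                               (char *)msg, 0);
    }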
The patch includes benchmark numbers showing significant improvement with a variety of different tests. The reaction to the concept was positive, though some problems with the specific patch have been pointed out. Ingo Molnar suggested that an iovec-based interface (like readv() and writev()) might be preferable; he also suggested naming the new system calls sys_process_vm_read() and sys_process_vm_write(). Nobody has expressed opposition to the idea, so we might just see these system calls in a future kernel.
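An iovec-based variant along the lines Molnar suggested would look more like readv() and writev(), with scatter/gather lists on both the local and the remote side. The declarations below are only a sketch of that shape (the exact names and types are this editor's guess, not the posted patch), though it is similar in spirit to the process_vm_readv()/process_vm_writev() calls that later reached the mainline kernel.

    #include <sys/types.h>   /* pid_t, ssize_t */
    #include <sys/uio.h>     /* struct iovec */

    /* Hypothetical iovec-based shape: copy between liovcnt local segments
     * and riovcnt segments in the target process's address space. */
    ssize_t process_vm_read(pid_t pid,
                            const struct iovec *local_iov, unsigned long liovcnt,
                            const struct iovec *remote_iov, unsigned long riovcnt,
                            unsigned long flags);
    ssize_t process_vm_write(pid_t pid,
                             const struct iovec *local_iov, unsigned long liovcnt,
                             const struct iovec *remote_iov, unsigned long riovcnt,
                             unsigned long flags);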
Many of us do not run MPI on our systems, but the use of D-Bus is rather more common. D-Bus was not designed for performance in quite the same way as MPI, so its single-system operation is somewhat slower. There is a central daemon which routes all messages, so a message going from one process to another must pass through the kernel twice; it is also necessary to wake the D-Bus daemon in the middle. That's not ideal from a performance standpoint.
Alban Crequy has written about an alternative: performing D-Bus processing in the kernel. To that end, the "kdbus" kernel module introduces a new AF_DBUS socket type. These sockets behave much like the AF_UNIX variety, but the kernel listens in on the message traffic to learn about the names associated with every process on the "bus"; once it has that information recorded, it is able to deliver much of the D-Bus message traffic without involving the daemon (which still exists to handle things the kernel doesn't know what to do with).
When the daemon can be shorted out, a message can be delivered with only one pass through the kernel and only one copy. Once again, significant performance improvements have been measured, even though larger messages must still be routed through the daemon. People have occasionally complained about the performance of D-Bus for years, so there may be real value in improving the system in this way.
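On the client side, this design implies code that looks almost exactly like today's AF_UNIX connection setup. The sketch below is a guess at what talking to the kdbus prototype might look like: the AF_DBUS constant comes from the out-of-tree patches rather than mainline headers, the numeric value used here is purely a placeholder so the sketch compiles, and the bus address is assumed to be the usual system bus socket path.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Placeholder so the sketch compiles; a real build would take the
     * address-family number from the kdbus module's own headers. */
    #ifndef AF_DBUS
    #define AF_DBUS 40
    #endif

    int main(void)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_DBUS, SOCK_STREAM, 0);  /* like AF_UNIX, but snooped */

        if (fd < 0) {
            perror("socket(AF_DBUS)");
            return 1;
        }
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_DBUS;
        strncpy(addr.sun_path, "/var/run/dbus/system_bus_socket",
                sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }
        /* From here the ordinary D-Bus wire protocol is spoken; the kernel
         * watches the name registrations and delivers directly when it can,
         * falling back to the daemon for anything it does not understand. */
        close(fd);
        return 0;
    }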
It may be some time, though, before this code lands on our desktops. There is a git tree available with the patches, but they have never been cleaned up and posted to the lists for review. The patch set is not small, so chances are good that there will be a lot of things to fix before it can be considered for mainline inclusion. The D-Bus daemon, it seems, will be busy for a little while yet.
Posted Sep 16, 2010 4:12 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (4 responses)
Open /proc/$PID/mem with O_DIRECT (maybe) and use pread / pwrite.
I don't see the need to invent a new syscall (unless maybe preadv/pwritev would be helpful).
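A minimal sketch of that approach, using only interfaces that exist today (open(), pread()); note that, as pointed out in a reply below, access to another process's mem file is subject to ptrace permission checks, and whether O_DIRECT buys anything here is speculation.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Read 'len' bytes at address 'addr' in process 'pid' via /proc/$PID/mem.
     * Sketch only: on kernels of this era the read will normally fail unless
     * the caller is ptrace-attached to the target. */
    static ssize_t read_remote(pid_t pid, unsigned long addr,
                               void *buf, size_t len)
    {
        char path[64];
        int fd;
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* The file offset is the virtual address in the target process. */
        n = pread(fd, buf, len, (off_t)addr);
        close(fd);
        return n;
    }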
Posted Sep 16, 2010 6:13 UTC (Thu)
by nikanth (guest, #50093)
[Link] (2 responses)
Posted Sep 16, 2010 6:47 UTC (Thu)
by nikanth (guest, #50093)
[Link] (1 responses)
Posted Sep 16, 2010 10:26 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Another problem with opening /proc/*/mem is that every process needs to keep a file handle open for every other process it wants to communicate with. So if you have N processes communicating with each other, they will need N² file handles in total. Now I'm not sure if this actually matters in the HPC world; they have tons of memory anyway... just a thought. The alternative is opening the mem file for each message, sending it, and closing it again. Maybe that works sufficiently well with the VFS scalability patches, but it still seems inefficient.
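As a back-of-the-envelope illustration (these numbers are the editor's, not the commenter's): with 64 ranks on a node, one descriptor per peer works out to 64 × 63 = 4,032 open mem files across the node, and the total grows quadratically with the rank count.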
Posted Sep 16, 2010 8:30 UTC (Thu)
by mjthayer (guest, #39183)
[Link]
I suppose that an advantage of copy_*_process would be that it would be more convenient to implement on other systems.
Posted Sep 16, 2010 8:35 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (2 responses)
Posted Jan 6, 2012 19:49 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (1 responses)
Actually, an optimized sockets implementation could accomplish a single copy if both sender and receiver are in the kernel. The kernel could just copy from one buffer directly to the next. But perhaps the code for such an optimization would be too odious.
Posted Jan 6, 2012 20:03 UTC (Fri)
by wahern (subscriber, #37304)
[Link]
Posted Sep 16, 2010 12:03 UTC (Thu)
by Np237 (guest, #69585)
[Link] (6 responses)
With copy_to_process() you have 1 copy instead of 2, but there should be 0 copies for MPI to behave better.
Posted Sep 16, 2010 12:14 UTC (Thu)
by Trelane (subscriber, #56877)
[Link]
Posted Sep 16, 2010 12:50 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Sep 16, 2010 12:58 UTC (Thu)
by ejr (subscriber, #51652)
[Link] (1 responses)
That is more or less what happens in message passing using any shared memory mechanism. What you are describing is plain shared memory. It's perfectly fine to use within a single node, and I've done such a thing within MPI jobs working off large, read-only data sets to good success. (Transparent memory scaling of the data set when you're using multiple MPI processes on one node.) But it's not so useful for implementing MPI.
The interface here would help MPI when the receiver has already posted its receive when the send occurs. You then have the one necessary copy rather than two. Also, this interface has the *potential* of being smart with cache invalidation by avoiding caching the output on the sending processor! That is a serious cost; a shared buffer ends up bouncing between processors.
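Put concretely (this arithmetic is the editor's, not the commenter's): for an n-byte message, the shared-segment path writes the data into the segment and then copies it out again, moving roughly 2n bytes and bouncing the intermediate buffer's cache lines between the two CPUs, while a posted-receive path served by copy_to_process() moves n bytes once, directly into the destination buffer.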
Posted Sep 16, 2010 13:20 UTC (Thu)
by Np237 (guest, #69585)
[Link]
Posted Sep 17, 2010 5:38 UTC (Fri)
by rusty (guest, #26)
[Link] (1 responses)
I'm not so sure. I believe that a future implementation could well remap pages, but playing with mappings is not as cheap as you might think, especially if you want the same semantics: the process the pages come from will need to mark them R/O so it can COW them if it tries to change them.
I'm pretty sure we'd need a TLB flush on every cpu either task has run on. Nonetheless, you know how much data is involved so if it turns out to be dozens of pages it might make sense. With huge pages or only KB of data, not so much.
And if you're transferring MB of data over MPI, you're already in a world of suck, right?
Cheers,
Rusty.
Posted Jan 7, 2012 0:13 UTC (Sat)
by wahern (subscriber, #37304)
[Link]
http://lists.freebsd.org/pipermail/freebsd-arch/2006-Apri...
Posted Sep 16, 2010 15:13 UTC (Thu)
by eduard.munteanu (guest, #66641)
[Link]
Besides this, it's really good to see IPC performance improvements in the kernel.
Any thoughts?
Posted Sep 16, 2010 21:17 UTC (Thu)
by cma (guest, #49905)
[Link] (4 responses)
Posted Sep 16, 2010 23:53 UTC (Thu)
by vomlehn (guest, #45588)
[Link] (2 responses)
At least at a quick glance, I don't see why any of the other ideas add to the mix.
Posted Sep 16, 2010 23:56 UTC (Thu)
by vomlehn (guest, #45588)
[Link] (1 responses)
Posted Sep 17, 2010 3:09 UTC (Fri)
by cma (guest, #49905)
[Link]
Posted Sep 18, 2010 6:19 UTC (Sat)
by njs (subscriber, #40338)
[Link]
I think this is the precise definition of the word "pipe"?
Fast interprocess messaging
Or maybe readv/writev.
Fast interprocess messaging
Waiting for your mail in that thread in LKML.
Fast interprocess messaging
To use /proc/$pid/mem the process needs to be ptraced. Maybe that restriction will be removed instead of adding new syscalls.
Fast interprocess messaging
And why not implement something like corosync (http://www.kernel.org/doc/ols/2009/ols2009-pages-61-68.pdf), focusing on performance and scalability?
I mean, it would be great to have a very scalable Linux IPC with file I/O semantics. It would be very nice to abstract a "shared memory"-like IPC using async-I/O back-ends, with syscalls like epoll, or even libevent or libev on top.
I'm very interested in making a Java-based app talk with very low latencies to a C/C++ app, via NIO on the Java side and libevent/libev on the C/C++ side. The point is that no TCP stack (or UNIX sockets) would be used; instead, a memory area would be mapped to a "file descriptor" hook on both sides (Java and C/C++). Is that possible?
Any thoughts/ideas?
Fast IPC
Yep ;)
Fast IPC
The problem is that with this kind of "shared-memory"-based IPC it would be possible to code a self-contained app that would not depend on typical shared memory, which pure Java code cannot implement (I'm not talking about a JNI-based solution). Semaphores, locks and so on would not be needed here, since with this "new IPC model" we would just stick with file/socket I/O programming, making it possible to obtain really awesome inter-process communication latency and throughput using a single programming model, like async I/O on top of NIO, epoll, or even libevent/libev.
The trick is that the kernel should be doing all the complex stuff like cache and NUMA affinities, exposing just what we need: a file descriptor ;)
Regards