Fast interprocess messaging
The first of these patches is motivated by a desire to make MPI faster. Intra-node communications in MPI are currently handled with shared memory, but that is still not fast enough for some users. Rather than copy messages through a shared segment, they would rather deliver messages directly into another process's address space. To this end, Christopher Yeoh has posted a patch implementing what he calls cross memory attach.
This patch implements a pair of new system calls:
    int copy_from_process(pid_t pid, unsigned long addr, unsigned long len,
                          char *buffer, int flags);
    int copy_to_process(pid_t pid, unsigned long addr, unsigned long len,
                        char *buffer, int flags);
A call to copy_from_process() will attempt to copy len bytes, starting at addr in the address space of the process identified by pid into the given buffer. The current implementation does not use the flags argument. As would be expected, copy_to_process() writes data into the target process's address space. Either both processes must have the same ownership or the copying process must have the CAP_SYS_PTRACE capability; otherwise the copy will not be allowed.
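As a rough illustration of how an MPI library might use the proposed calls, here is a minimal sketch. It assumes user-space wrappers with the prototypes shown above (the names come from the patch, but no such wrappers exist in any released C library), and it assumes the receiver has advertised the address of its posted receive buffer through some out-of-band channel; both assumptions are the author's, not part of the patch.

    /* Sketch only: deliver a message straight into a peer's posted
     * receive buffer using the proposed cross-memory-attach calls. */
    #include <sys/types.h>   /* pid_t */
    #include <stddef.h>      /* size_t */

    /* Prototypes as proposed in the patch (assumed wrappers). */
    int copy_from_process(pid_t pid, unsigned long addr, unsigned long len,
                          char *buffer, int flags);
    int copy_to_process(pid_t pid, unsigned long addr, unsigned long len,
                        char *buffer, int flags);

    /* Sender side: one copy, from our buffer into the receiver's address
     * space.  The receiver's buffer address (remote_buf) must have been
     * exchanged out of band, e.g. through a small shared control block.
     * Both processes must share ownership, or the sender needs
     * CAP_SYS_PTRACE. */
    static int send_direct(pid_t receiver, unsigned long remote_buf,
                           const char *msg, size_t len)
    {
        /* The proposed prototype takes a non-const buffer. */
        return copy_to_process(receiver, remote_buf, (unsigned long)len,
                               (char *)msg, 0);
    }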
The patch includes benchmark numbers showing significant improvement with a variety of different tests. The reaction to the concept was positive, though some problems with the specific patch have been pointed out. Ingo Molnar suggested that an iovec-based interface (like readv() and writev()) might be preferable; he also suggested naming the new system calls sys_process_vm_read() and sys_process_vm_write(). Nobody has expressed opposition to the idea, so we might just see these system calls in a future kernel.
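An iovec-based variant along the lines Molnar suggested would look more like readv() and writev(), with scatter/gather lists on both the local and the remote side. The declarations below are only a sketch of that shape (the exact names and types are this editor's guess, not the posted patch), though it is similar in spirit to the process_vm_readv()/process_vm_writev() calls that later reached the mainline kernel.

    #include <sys/types.h>   /* pid_t, ssize_t */
    #include <sys/uio.h>     /* struct iovec */

    /* Hypothetical iovec-based shape: copy between liovcnt local segments
     * and riovcnt segments in the target process's address space. */
    ssize_t process_vm_read(pid_t pid,
                            const struct iovec *local_iov, unsigned long liovcnt,
                            const struct iovec *remote_iov, unsigned long riovcnt,
                            unsigned long flags);
    ssize_t process_vm_write(pid_t pid,
                             const struct iovec *local_iov, unsigned long liovcnt,
                             const struct iovec *remote_iov, unsigned long riovcnt,
                             unsigned long flags);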
Many of us do not run MPI on our systems, but the use of D-Bus is rather more common. D-Bus was not designed for performance in quite the same way as MPI, so its single-system operation is somewhat slower. There is a central daemon which routes all messages, so a message going from one process to another must pass through the kernel twice; it is also necessary to wake the D-Bus daemon in the middle. That's not ideal from a performance standpoint.
Alban Crequy has written about an alternative: performing D-Bus processing in the kernel. To that end, the "kdbus" kernel module introduces a new AF_DBUS socket type. These sockets behave much like the AF_UNIX variety, but the kernel listens in on the message traffic to learn about the names associated with every process on the "bus"; once it has that information recorded, it is able to deliver much of the D-Bus message traffic without involving the daemon (which still exists to handle things the kernel doesn't know what to do with).
When the daemon can be shorted out, a message can be delivered with only one pass through the kernel and only one copy. Once again, significant performance improvements have been measured, even though larger messages must still be routed through the daemon. People have occasionally complained about the performance of D-Bus for years, so there may be real value in improving the system in this way.
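On the client side, this design implies code that looks almost exactly like today's AF_UNIX connection setup. The sketch below is a guess at what talking to the kdbus prototype might look like: the AF_DBUS constant comes from the out-of-tree patches rather than mainline headers, the numeric value used here is purely a placeholder so the sketch compiles, and the bus address is assumed to be the usual system bus socket path.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Placeholder so the sketch compiles; a real build would take the
     * address-family number from the kdbus module's own headers. */
    #ifndef AF_DBUS
    #define AF_DBUS 40
    #endif

    int main(void)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_DBUS, SOCK_STREAM, 0);  /* like AF_UNIX, but snooped */

        if (fd < 0) {
            perror("socket(AF_DBUS)");
            return 1;
        }
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_DBUS;
        strncpy(addr.sun_path, "/var/run/dbus/system_bus_socket",
                sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }
        /* From here the ordinary D-Bus wire protocol is spoken; the kernel
         * watches the name registrations and delivers directly when it can,
         * falling back to the daemon for anything it does not understand. */
        close(fd);
        return 0;
    }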
It may be some time, though, before this code lands on our desktops. There is a git tree available with the patches, but they have never been cleaned up and posted to the lists for review. The patch set is not small, so chances are good that there will be a lot of things to fix before it can be considered for mainline inclusion. The D-Bus daemon, it seems, will be busy for a little while yet.
Posted Sep 16, 2010 4:12 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (4 responses)
Open /proc/$PID/mem with O_DIRECT (maybe) and use pread / pwrite.
I don't see the need to invent a new syscall (unless maybe preadv/pwritev would be helpful).
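A minimal sketch of that approach, using only interfaces that exist today (open(), pread()); note that, as pointed out in a reply below, access to another process's mem file is subject to ptrace permission checks, and whether O_DIRECT buys anything here is speculation.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Read 'len' bytes at address 'addr' in process 'pid' via /proc/$PID/mem.
     * Sketch only: on kernels of this era the read will normally fail unless
     * the caller is ptrace-attached to the target. */
    static ssize_t read_remote(pid_t pid, unsigned long addr,
                               void *buf, size_t len)
    {
        char path[64];
        int fd;
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* The file offset is the virtual address in the target process. */
        n = pread(fd, buf, len, (off_t)addr);
        close(fd);
        return n;
    }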
Posted Sep 16, 2010 6:13 UTC (Thu)
by nikanth (guest, #50093)
[Link] (2 responses)
Posted Sep 16, 2010 6:47 UTC (Thu)
by nikanth (guest, #50093)
[Link] (1 responses)
Posted Sep 16, 2010 10:26 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Another problem with opening /proc/*/mem is that every process needs to keep a file handle open for every other process it wants to communicate with. So if you have N processes communicating with each other, they will need N² file handles in total. Now I'm not sure if this actually matters in the HPC world; they have tons of memory anyway... just a thought. The alternative is opening the mem file for each message, sending it, and closing it again. Maybe that works sufficiently well with the VFS scalability patches, but it still seems inefficient.
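As a back-of-the-envelope illustration (these numbers are the editor's, not the commenter's): with 64 ranks on a node, one descriptor per peer works out to 64 × 63 = 4,032 open mem files across the node, and the total grows quadratically with the rank count.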
Posted Sep 16, 2010 8:30 UTC (Thu)
by mjthayer (guest, #39183)
[Link]
I suppose that an advantage of copy_*_process would be that it would be more convenient to implement on other systems.
Posted Sep 16, 2010 8:35 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (2 responses)
Posted Jan 6, 2012 19:49 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (1 responses)
Actually, an optimized sockets implementation could accomplish a single copy if both sender and receiver are in the kernel. The kernel could just copy from one buffer directly to the next. But perhaps the code for such an optimization would be too odious.
Posted Jan 6, 2012 20:03 UTC (Fri)
by wahern (subscriber, #37304)
[Link]
Posted Sep 16, 2010 12:03 UTC (Thu)
by Np237 (guest, #69585)
[Link] (6 responses)
With copy_to_process() you have 1 copy instead of 2, but there should be 0 copies for MPI to behave better.
Posted Sep 16, 2010 12:14 UTC (Thu)
by Trelane (subscriber, #56877)
[Link]
Posted Sep 16, 2010 12:50 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Sep 16, 2010 12:58 UTC (Thu)
by ejr (subscriber, #51652)
[Link] (1 responses)
That is more or less what happens in message passing using any shared memory mechanism. What you are describing is plain shared memory. It's perfectly fine to use within a single node, and I've done such a thing within MPI jobs working off large, read-only data sets to good success. (Transparent memory scaling of the data set when you're using multiple MPI processes on one node.) But it's not so useful for implementing MPI.
The interface here would help MPI when the receiver has already posted its receive when the send occurs. You then have the one necessary copy rather than two. Also, this interface has the *potential* of being smart with cache invalidation by avoiding caching the output on the sending processor! That is a serious cost; a shared buffer ends up bouncing between processors.
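Put concretely (this arithmetic is the editor's, not the commenter's): for an n-byte message, the shared-segment path writes the data into the segment and then copies it out again, moving roughly 2n bytes and bouncing the intermediate buffer's cache lines between the two CPUs, while a posted-receive path served by copy_to_process() moves n bytes once, directly into the destination buffer.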
Posted Sep 16, 2010 13:20 UTC (Thu)
by Np237 (guest, #69585)
[Link]
Posted Sep 17, 2010 5:38 UTC (Fri)
by rusty (guest, #26)
[Link] (1 responses)
I'm not so sure. I believe that a future implementation could well remap pages, but playing with mappings is not as cheap as you might think, especially if you want the same semantics: the process the pages come from will need to mark them R/O so it can COW them if it tries to change them.
I'm pretty sure we'd need a TLB flush on every cpu either task has run on. Nonetheless, you know how much data is involved so if it turns out to be dozens of pages it might make sense. With huge pages or only KB of data, not so much.
And if you're transferring MB of data over MPI, you're already in a world of suck, right?
Cheers,
Rusty.
Posted Jan 7, 2012 0:13 UTC (Sat)
by wahern (subscriber, #37304)
[Link]
http://lists.freebsd.org/pipermail/freebsd-arch/2006-Apri...
Posted Sep 16, 2010 15:13 UTC (Thu)
by eduard.munteanu (guest, #66641)
[Link]
Besides this, it's really good to see IPC performance improvements in the kernel.
Any thoughts?
Posted Sep 16, 2010 21:17 UTC (Thu)
by cma (guest, #49905)
[Link] (4 responses)
Posted Sep 16, 2010 23:53 UTC (Thu)
by vomlehn (guest, #45588)
[Link] (2 responses)
At least at a quick glance, I don't see why any of the other ideas add to the mix.
Posted Sep 16, 2010 23:56 UTC (Thu)
by vomlehn (guest, #45588)
[Link] (1 responses)
Posted Sep 17, 2010 3:09 UTC (Fri)
by cma (guest, #49905)
[Link]
Posted Sep 18, 2010 6:19 UTC (Sat)
by njs (subscriber, #40338)
[Link]
I think this is the precise definition of the word "pipe"?
Fast interprocess messaging
Or maybe readv/writev.
Fast interprocess messaging
Waiting for your mail in that thread in LKML.
Fast interprocess messaging
To use /proc/$pid/mem the process needs to be ptraced. Maybe that restriction will be removed instead of adding new syscalls.
Fast interprocess messaging
And why not implement something like corosync (http://www.kernel.org/doc/ols/2009/ols2009-pages-61-68.pdf), focusing on performance and scalability?
I mean, it would be great to have a very scalable Linux IPC with file I/O semantics. It would be very nice to abstract a "shared memory"-like IPC using async-I/O back-ends, with syscalls like epoll, or even libevent or libev on top.
I'm very interested in making a Java-based app talk with very low latencies to a C/C++ app, via NIO on the Java side and libevent/libev on the C/C++ side. The point is that no TCP stack (or UNIX sockets) would be used; instead, a memory area would be mapped to a "file descriptor" hook on both sides (Java and C/C++). Is that possible?
Any thoughts/ideas?
Fast IPC
Yep ;)
Fast IPC
The problem is that with this kind of "shared-memory"-based IPC it would be possible to code a self-contained app that would not depend on typical shared memory, which pure Java code cannot implement (I'm not talking about a JNI-based solution). Semaphores, locks and so on would not be needed here, since with this "new IPC model" we would just stick with file/socket I/O programming, making it possible to obtain really awesome inter-process communication latency and throughput using a single programming model, like async I/O on top of NIO, epoll, or even libevent/libev.
The trick is that the kernel should be doing all the complex stuff like cache and NUMA affinities, exposing just what we need: a file descriptor ;)
Regards