LWN: Comments on "Fast interprocess messaging" https://lwn.net/Articles/405346/ This is a special feed containing comments posted to the individual LWN article titled "Fast interprocess messaging". en-us Thu, 25 Sep 2025 22:56:26 +0000 Thu, 25 Sep 2025 22:56:26 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Still far from proprietary MPI implementations https://lwn.net/Articles/474607/ https://lwn.net/Articles/474607/ wahern <div class="FormattedComment"> I believe that was also Linus' analysis 6 years ago:<br> <p> <a href="http://lists.freebsd.org/pipermail/freebsd-arch/2006-April/005120.html">http://lists.freebsd.org/pipermail/freebsd-arch/2006-Apri...</a><br> <p> <p> </div> Sat, 07 Jan 2012 00:13:09 +0000 Fast interprocess messaging https://lwn.net/Articles/474593/ https://lwn.net/Articles/474593/ wahern <div class="FormattedComment"> Oops. I should've RTFA. They considered vmsplice() and thought it too much trouble to keep the processes synchronized so messages don't end up buffered in the kernel. I don't see it. Either way, signaling needs to occur between sender and receiver. I suppose all of this really cries out for a proper kernel AIO implementation (assuming there isn't one). The sender or receiver needs a way to queue an op with an associated buffer.<br> <p> </div> Fri, 06 Jan 2012 20:03:42 +0000 Fast interprocess messaging https://lwn.net/Articles/474591/ https://lwn.net/Articles/474591/ wahern <div class="FormattedComment"> Like the reverse of vmsplice(2)? I think vmsplice() would suffice as-is. Even better, it only requires the sender to know about the interface; the receiver can keep using read().<br> <p> Actually, an optimized sockets implementation could accomplish a single copy if both sender and receiver are in the kernel. The kernel could just copy from one buffer directly to the next. But perhaps the code for such an optimization would be too odious.<br> <p> <p> </div> Fri, 06 Jan 2012 19:49:41 +0000 Corosync anyone? https://lwn.net/Articles/406015/ https://lwn.net/Articles/406015/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; a memory area mapped to a "file descriptor" hook on both sides</font><br> <p> I think this is the precise definition of the word "pipe"?<br> </div> Sat, 18 Sep 2010 06:19:26 +0000 Still far from proprietary MPI implementations https://lwn.net/Articles/405851/ https://lwn.net/Articles/405851/ rusty <div class="FormattedComment"> <font class="QuotedText">&gt; With copy_to_process you have 1 copy instead of 2, but there should be 0</font><br> <font class="QuotedText">&gt; copy for MPI to behave better.</font><br> <p> I'm not so sure. I believe that a future implementation could well remap pages, but playing with mappings is not as cheap as you might think, especially if you want the same semantics: the process the pages come from will need to mark them R/O so it can COW them if it tries to change them.<br> <p> I'm pretty sure we'd need a TLB flush on every CPU either task has run on. Nonetheless, you know how much data is involved, so if it turns out to be dozens of pages it might make sense. 
With huge pages or only KB of data, not so much.<br> <p> And if you're transferring MB of data over MPI, you're already in a world of suck, right?<br> <p> Cheers,<br> Rusty.<br> </div> Fri, 17 Sep 2010 05:38:23 +0000 Fast IPC https://lwn.net/Articles/405838/ https://lwn.net/Articles/405838/ cma Yep ;)<br><br> The problem is that with this kind of shared-memory-based IPC it would be possible to code a "self-contained" app that would not depend on typical shared memory, which plain Java code cannot use (I'm not talking about a JNI-based solution). Semaphores, locks and so on would not be needed here, since with this "new IPC model" we would just stick with file/socket I/O programming, making it possible to obtain really awesome inter-process communication latency and throughput using a single programming semantics, like async I/O on top of NIO, epoll, or even libevent/libev.<br><br> The trick is that the kernel should be doing all the complex stuff like cache awareness, NUMA affinity and so on, exposing just what we need: a file descriptor ;) Regards Fri, 17 Sep 2010 03:09:43 +0000 Fast IPC https://lwn.net/Articles/405821/ https://lwn.net/Articles/405821/ vomlehn <div class="FormattedComment"> Ah, I get it. The idea is to copy into memory not visible from the other process. Never mind.<br> </div> Thu, 16 Sep 2010 23:56:18 +0000 Fast IPC https://lwn.net/Articles/405819/ https://lwn.net/Articles/405819/ vomlehn <div class="FormattedComment"> Not sure why you wouldn't just use shared memory, which ensures zero copies, and one of a number of synchronization primitives, depending on your particular needs. If not that, then a vmsplice()/splice() variant could be cooked up.<br> <p> At least at a quick glance, I don't see what any of the other ideas add to the mix.<br> </div> Thu, 16 Sep 2010 23:53:11 +0000 Corosync anyone? https://lwn.net/Articles/405791/ https://lwn.net/Articles/405791/ cma And why not implement something like corosync (http://www.kernel.org/doc/ols/2009/ols2009-pages-61-68.pdf) focusing on performance and scalability?<br><br> I mean, it would be great to have a very scalable Linux IPC with file I/O semantics. It would be very nice to abstract a "shared memory"-like IPC using async I/O back-ends with syscalls like epoll, or even using libevent or libev on top.<br><br> I'm very interested in making a Java-based app talk with very low latency to a C/C++ app, via NIO on the Java side and libevent/libev on the C/C++ side. The point is that no TCP stack (or UNIX sockets) would be used; instead, a memory area mapped to a "file descriptor" hook on both sides (Java and C/C++). Is that possible?<br><br> Any thoughts/ideas?<br> Thu, 16 Sep 2010 21:17:17 +0000 Fast interprocess messaging https://lwn.net/Articles/405688/ https://lwn.net/Articles/405688/ eduard.munteanu <div class="FormattedComment"> I'm not sure why the ownership restriction is needed. Ideally, such an interface would let a process tell the kernel "I'm allowing somebody else to send messages to me". That is, the copy would occur only if a copy_to_process() pairs up with a copy_from_process() and the buffers match. In effect, the processes would negotiate a communication channel; it doesn't really matter who owns them. 
Though yes, I can see that looking at the PID isn't enough to prevent issues; perhaps another authentication scheme is in order?<br> <p> Besides this, it's really good to see IPC performance improvements in the kernel.<br> <p> Any thoughts?<br> <p> </div> Thu, 16 Sep 2010 15:13:26 +0000 Still far from proprietary MPI implementations https://lwn.net/Articles/405646/ https://lwn.net/Articles/405646/ Np237 <div class="FormattedComment"> Indeed, that makes the performance much less predictable. I wonder how well this behaves on real-life codes, though. At least Bull claims their MPI implementation does that, and the single-node performance is impressive.<br> </div> Thu, 16 Sep 2010 13:20:03 +0000 Still far from proprietary MPI implementations https://lwn.net/Articles/405635/ https://lwn.net/Articles/405635/ ejr <div class="FormattedComment"> There's a definition of zero copy floating around often attributed to Don Becker: Zero copy means someone *else* makes the copy.<br> <p> That is more or less what happens in message passing using any shared memory mechanism. What you are describing is plain shared memory. It's perfectly fine to use within a single node, and I've done such a thing within MPI jobs working off large, read-only data sets to good success. (Transparent memory scaling of the data set when you're using multiple MPI processes on one node.) But it's not so useful for implementing MPI.<br> <p> The interface here would help MPI when the receiver has already posted its receive when the send occurs. You then have the one necessary copy rather than two. Also, this interface has the *potential* to be smart with cache invalidation by avoiding caching the output on the sending processor! That is a serious cost; a shared buffer ends up bouncing between processors.<br> </div> Thu, 16 Sep 2010 12:58:10 +0000 Still far from proprietary MPI implementations https://lwn.net/Articles/405634/ https://lwn.net/Articles/405634/ nix <div class="FormattedComment"> Wasn't this sort of thing what the old skas patch for user-mode-linux used to do?<br> <p> </div> Thu, 16 Sep 2010 12:50:23 +0000 Still far from proprietary MPI implementations https://lwn.net/Articles/405625/ https://lwn.net/Articles/405625/ Trelane <div class="FormattedComment"> I wonder if there's some way to swap a page or a number of pages between processes.<br> </div> Thu, 16 Sep 2010 12:14:23 +0000 Still far from proprietary MPI implementations https://lwn.net/Articles/405620/ https://lwn.net/Articles/405620/ Np237 <div class="FormattedComment"> Ideally the kernel should not copy the data at all, but provide a way to map memory pages belonging to one process into the other process, marking them copy-on-write.<br> <p> With copy_to_process you have 1 copy instead of 2, but there should be 0 copy for MPI to behave better.<br> </div> Thu, 16 Sep 2010 12:03:55 +0000 Fast interprocess messaging https://lwn.net/Articles/405597/ https://lwn.net/Articles/405597/ intgr <p>Another problem with opening <tt>/proc/*/mem</tt> is that every process needs to keep a file handle open for every <i>other</i> process that it wants to communicate with. So if you have N processes communicating with each other, they will need N<sup>2</sup> file handles total. Now I'm not sure if this actually matters in the HPC world; they have tons of memory anyway... Just a thought.</p> <p>The alternative is opening the mem file for each message, sending it, and closing it again. 
Maybe it works sufficiently well with the VFS scalability patches, but it still seems inefficient.</p> Thu, 16 Sep 2010 10:26:28 +0000 Fast interprocess messaging https://lwn.net/Articles/405578/ https://lwn.net/Articles/405578/ mjthayer <div class="FormattedComment"> That is nice for processes with the same owner of course, but a limited version for processes with different owners could be even nicer. For instance, if it were possible for a process to open access to a section of its memory to, say, the process at the other end of a socket.<br> </div> Thu, 16 Sep 2010 08:35:22 +0000 Fast interprocess messaging https://lwn.net/Articles/405575/ https://lwn.net/Articles/405575/ mjthayer <div class="FormattedComment"> <font class="QuotedText">&gt; I'm wondering why you cannot achieve copy_*_process using pread and pwrite.</font><br> <font class="QuotedText">&gt; Open /proc/$PID/mem with O_DIRECT (maybe) and use pread / pwrite.</font><br> <p> I suppose that an advantage of copy_*_process would be that it would be more convenient to implement on other systems.<br> </div> Thu, 16 Sep 2010 08:30:16 +0000 Fast interprocess messaging https://lwn.net/Articles/405543/ https://lwn.net/Articles/405543/ nikanth <div class="FormattedComment"> Ah.. this was already discussed.<br> To use /proc/$pid/mem the process needs to be ptraced.<br> Maybe that restriction will be removed instead of adding new syscalls.<br> </div> Thu, 16 Sep 2010 06:47:25 +0000 Fast interprocess messaging https://lwn.net/Articles/405533/ https://lwn.net/Articles/405533/ nikanth <div class="FormattedComment"> I also don't see any inefficiency in using /proc/$pid/mem.<br> Waiting for your mail in that thread on LKML.<br> </div> Thu, 16 Sep 2010 06:13:17 +0000 Fast interprocess messaging https://lwn.net/Articles/405528/ https://lwn.net/Articles/405528/ neilbrown <div class="FormattedComment"> I'm wondering why you cannot achieve copy_*_process using pread and pwrite.<br> <p> Open /proc/$PID/mem with O_DIRECT (maybe) and use pread / pwrite.<br> Or maybe readv/writev.<br> <p> I don't see the need to invent a new syscall (unless maybe preadv/pwritev would be helpful).<br> </div> Thu, 16 Sep 2010 04:12:12 +0000
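For reference, here is a minimal C sketch of the pread()-on-/proc/$PID/mem approach neilbrown describes in the last comment above. It assumes the caveat nikanth raises (the caller needs ptrace permission on, and on kernels of that era typically an active ptrace attachment to, the target) and that the remote address and length were exchanged out of band; read_remote() and its parameters are hypothetical names used only for illustration, not an existing API.
<pre>
/* Sketch of the /proc/$PID/mem idea from the thread.  Assumptions: the
 * caller may ptrace the target, and the remote virtual address and length
 * were exchanged out of band (e.g. over a pipe set up earlier). */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t read_remote(pid_t pid, void *local_buf, size_t len,
                           unsigned long remote_addr)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset is simply the remote virtual address. */
    ssize_t n = pread(fd, local_buf, len, (off_t)remote_addr);
    close(fd);
    return n;
}
</pre>
The write direction would use pwrite() on a writable descriptor, subject to the same ptrace-based permission rules discussed above; intgr's N<sup>2</sup> file-handle concern comes from keeping one such descriptor open per peer rather than reopening the file for every message.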
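The vmsplice(2) path wahern mentions, where only the sender needs to know about the special interface and the receiver keeps calling read() on the other end of a pipe, might look roughly like the sketch below. It deliberately leaves out the synchronization problem the article and comments point to: nothing here tells the sender when the gifted pages may be reused or freed.
<pre>
/* Sketch of the sender-only vmsplice() idea: the sender pushes its buffer
 * into a pipe, the receiver just read()s the pipe.  Not shown: the
 * signaling needed so the sender knows when the buffer may be reused. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

static ssize_t send_buffer(int pipe_write_fd, void *buf, size_t len)
{
    struct iovec iov = {
        .iov_base = buf,
        .iov_len  = len,
    };

    /* SPLICE_F_GIFT marks the pages as movable rather than copyable;
     * whether a copy is actually avoided depends on what happens at the
     * other end of the pipe. */
    return vmsplice(pipe_write_fd, &iov, 1, SPLICE_F_GIFT);
}
</pre>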
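Finally, the plain-shared-memory baseline that vomlehn and ejr weigh the new interface against is the familiar shm-plus-semaphore pattern. A rough sketch follows; SHM_NAME, the 4 KiB size, and open_channel() are arbitrary choices for illustration, and it shows where the one copy ejr mentions ends up (the memcpy() into the shared segment).
<pre>
/* Sketch of the conventional shared-memory baseline: a POSIX shm segment
 * plus a process-shared semaphore used as a "message ready" signal.
 * Error handling trimmed; link with -lrt -lpthread. */
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/fast-ipc-demo"   /* arbitrary name for illustration */
#define SHM_SIZE 4096

struct channel {
    sem_t ready;                           /* posted by the sender */
    char  data[SHM_SIZE - sizeof(sem_t)];  /* message payload */
};

static struct channel *open_channel(int create)
{
    int fd = shm_open(SHM_NAME, create ? O_CREAT | O_RDWR : O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (create && ftruncate(fd, SHM_SIZE) < 0) {
        close(fd);
        return NULL;
    }

    struct channel *ch = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    close(fd);
    if (ch == MAP_FAILED)
        return NULL;
    if (create)
        sem_init(&ch->ready, 1 /* process-shared */, 0);
    return ch;
}

/* Sender: memcpy() the message into ch->data, then sem_post(&ch->ready).
 * Receiver: sem_wait(&ch->ready), then read ch->data.  That memcpy() is
 * the copy that "someone else" makes in any shared-memory transport, per
 * the zero-copy definition quoted above. */
</pre>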