> With copy_to_process you have 1 copy instead of 2, but there should be 0
> copy for MPI to behave better.
I'm not so sure. I believe that a future implementation could well remap pages, but playing with mappings is not as cheap as you might think, especially if you want the same semantics: the pages have to be marked R/O in the process they come from, so that they get COWed if that process tries to change them.
I'm pretty sure we'd need a TLB flush on every CPU either task has run on. Nonetheless, you know how much data is involved, so if it turns out to be dozens of pages, remapping might make sense. With huge pages or only a few KB of data, not so much.
And if you're transferring MB of data over MPI, you're already in a world of suck, right?
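
To make the size argument concrete, here is a rough sketch of the kind of check I have in mind. The function names and the min_pages threshold are made up for illustration; nothing here comes from the copy_to_process code, and the real break-even point would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by the byte range [addr, addr + len). */
size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size != 0);
    if (len == 0)
        return 0;

    uintptr_t first = addr / page_size;
    uintptr_t last = (addr + len - 1) / page_size;

    return (size_t)(last - first + 1);
}

/*
 * Hypothetical heuristic: only consider remapping when the transfer
 * covers at least min_pages pages, so the cost of marking them R/O and
 * flushing TLBs is spread over enough data. Otherwise just copy.
 * Returns 1 for "remap", 0 for "copy".
 */
int should_remap(uintptr_t addr, size_t len, size_t page_size,
                 size_t min_pages)
{
    return pages_spanned(addr, len, page_size) >= min_pages;
}
```

With 4 KB pages and a threshold of 32, a 256 KB transfer spans 64 pages and would qualify. The same 256 KB inside a single 2 MB huge page counts as one page, and a 2 KB message counts as one page as well, so both fall back to a plain copy.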