my first reaction on reading this is that the processes are stalling, and the change you describe that improves performance seems to point the same way.
did you try tweaking the buffer sizes that rsync uses to see if larger buffers may smooth things out a bit?
I would still expect rsync to take significantly more CPU time than a plain cp, and I'm not bothered by that (rsync is really designed for a different job that it excels at, and this is a degenerate corner case for it), but the throughput should be higher.
while rsync's approach is less efficient than a read()/write() in a tight loop, if it can keep the buffers between the processes full it should be able to do fairly well (as your testing shows when you tweak things so that the threads aren't waiting for each other much), so perhaps larger buffers can avoid the stalls.
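
for reference, here is a rough sketch of the read()/write() tight loop I'm comparing against. this is not cp's or rsync's actual code, just a minimal Python illustration; the function name copy_file and the buf_size parameter are my own inventions, so that different buffer sizes can be tried as a baseline.

```python
import os


def copy_file(src_path, dst_path, buf_size=64 * 1024):
    """Copy src_path to dst_path with a plain read()/write() loop.

    Returns the number of bytes copied. buf_size is a hypothetical knob
    for experimenting; it is not a setting taken from rsync or cp.
    """
    total = 0
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(src, buf_size)
                if not chunk:
                    # an empty read means end of file
                    break
                view = memoryview(chunk)
                while len(view) > 0:
                    # os.write may write fewer bytes than requested
                    written = os.write(dst, view)
                    view = view[written:]
                    total += written
        finally:
            os.close(dst)
    finally:
        os.close(src)
    return total
```

there is only one process here and nothing to wait on except the disk, which is why a loop like this is hard to beat for a straight local copy. timing it at a few values of buf_size would give a baseline to hold the rsync numbers against.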