I don't have any test data in front of me but I'm not sure your statement is entirely true. On Linux threads are not significantly lighter than processes because processes are extremely light weight. Both are scheduled the same in the kernel and with copy on write memory management there isn't that much difference for the in-kernel housekeeping between creating a thread or forking (cloning) a process.
On a modern multi-socket multi-core machine each socket is its own largely independent computer and the whole machine is a NUMA cluster. That means that each process is assigned to a particular node and that's where its memory lives, splitting memory between nodes or bouncing a process between different nodes reduces performance. My guess is that threading will scale weirdly when you get beyond what can be handled by all the cores in one socket whereas a multi-process model can keep more memory local to the socket the process is running on.
I would suggest starting with a multi-process model because you get better fault tolerance with memory protection then consider threading if that doesn't test out for concurrent performance