2) Message queues between threads, where each data structure is owned by a single thread, avoiding the need for locks.
3) Message queues in the form of pipes and sockets between processes.
1) is clearly the hardest to code. In principle it scales to an unlimited number of CPUs, but in practice it often ends in lock contention, where every thread is waiting for one specific resource. There is also a huge runtime overhead in managing the locks and the logic around them.
2) and 3) are almost the same. 3) has the downside, as you say, that everything has to be serialized, which is a (huge) overhead. The upside is that individual processes can crash independently, which makes the system a lot easier to debug. You can also use the sockets as a debug output, so that the system can be inspected while running. 3) also makes it possible to move parts of the system to another machine.
But with a well-written framework, where the communication channels are abstracted, 3) can be turned into 2) without changing the core application code. Whether you use 2) or 3), the application has to use the same state machines to handle the concurrency problems.
But once the application is written with explicit locks, as under 1), there is no way back.
My advice is thus: start with 3). With templates and a well-written framework it is really not that hard to avoid threads. Then, if the overhead of serialization turns out to be too high, merge the processes as in 2).
What I object to is a C++ runtime that assumes the program is multi-threaded and uses atomic operations, when I want to stick to single-threaded programs. And what I argue for is that most people ought to stick to single-threaded programs anyway!