If we're using terms like BSP, STM, shared-everything, etc., then we're talking about what semantics your runtime provides, not how it's implemented in terms of OS primitives.
Right now Python on Linux has two easy options for parallelism. You can fork(), which gives you shared-nothing parallelism (which is great, and very easy to reason about) with somewhat cumbersome and expensive IPC (serialized objects copied explicitly over sockets or shared memory, explicit chunks-of-shared-bytes via mmap, that sort of thing). Or you can use "threads" and get shared-everything parallelism (which is horrible) with fast IPC, except that the GIL mostly kills the parallelism.
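
To make the first option concrete, here's roughly what the shared-nothing path looks like today with the stdlib's multiprocessing, which fork()s a worker and pickles every object you send it over a pipe. (The worker/inbox/outbox names are just for illustration.)

    from multiprocessing import Process, Queue

    def worker(inbox, outbox):
        data = inbox.get()           # arrives as a deserialized copy
        data.append("from child")    # mutating it can't affect the parent
        outbox.put(data)             # pickled and copied back

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        payload = ["from parent"]
        inbox.put(payload)           # serialized, copied into the child
        print(outbox.get())          # ['from parent', 'from child']
        print(payload)               # ['from parent'] (parent's copy untouched)
        p.join()

Every put() there pays the full serialize-copy-deserialize cost, which is the "cumbersome and expensive IPC" part.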
But you can imagine PyPy implementing fork()-like shared-nothing semantics *with* fast IPC. It'd use pthreads underneath, but give each thread its own completely separate Python namespace, and when you passed an object to another thread it would just mark it copy-on-write instead of actually copying it. This is how languages designed for parallelism, like Erlang or Rust, are intended to work.
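
Just to sketch what those semantics would feel like from Python, here's a runnable approximation using today's threads plus an explicit deep copy on send. The copy.deepcopy() call is standing in for the copy-on-write marking the hypothetical runtime would do, which would make sending effectively free; send/worker/chan are made-up names, not a real API.

    import copy
    import queue
    import threading

    def send(chan, obj):
        # Shared-nothing discipline: the receiver gets its own copy, so
        # neither side can observe the other's mutations. The imagined
        # CoW runtime would make this copy lazy and effectively O(1).
        chan.put(copy.deepcopy(obj))

    def worker(chan, results):
        data = chan.get()
        data.append("mutated in worker")  # invisible to the sender
        results.put(len(data))

    chan, results = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(chan, results))
    t.start()
    payload = ["from main"]
    send(chan, payload)
    print(results.get())  # 2
    print(payload)        # ['from main'] (sender's list untouched)
    t.join()

Same isolation guarantees as the fork() version, but everything lives in one process, so the runtime is free to cheat on the copy.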
Implementing this on top of CPython would be harder, and you'd need a whole transition period while people audited their existing C modules to handle GIL-less operation and the new copy-on-write machinery, but it'd be possible. It also seems more likely to me than designing a new C-compatible language, as Armin is suggesting. I doubt it's worth it for a relatively minor IPC speedup, though...