Why not just use shared memory for anything performance critical, such as data uploads to the GPU?
As for context switches, most modern CPUs are multicore, so you might not need any actual context switches at all (just some cacheline bouncing).
Hardware 3D usually already communicates with a remote GPU via a DMA-based FIFO and uploads, so having one additional mechanism (faster, since it uses shared memory instead of DMA) shouldn't be the end of the world.
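To make the shared-memory idea concrete, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. The function names (`create_upload_buffer`, `write_upload`, `read_upload`) are made up for illustration, and this is not how any particular sandbox or graphics driver does it. It only shows one side writing a payload into a named segment and the other side attaching to that segment by name, with no pipe or socket copy in between.

```python
from multiprocessing import shared_memory


def create_upload_buffer(size):
    """Create a named shared segment that another process can attach to."""
    return shared_memory.SharedMemory(create=True, size=size)


def write_upload(shm, payload):
    """Producer side (hypothetical): place the payload at the start of the segment."""
    shm.buf[:len(payload)] = payload
    return len(payload)


def read_upload(name, length):
    """Consumer side (hypothetical): attach by name and read the payload back."""
    peer = shared_memory.SharedMemory(name=name)
    try:
        # Copy out only so the result outlives the mapping; a real consumer
        # could work on peer.buf directly.
        return bytes(peer.buf[:length])
    finally:
        peer.close()
```

In real use the consumer would be a separate, more privileged process that is handed the segment name and payload length over some small control channel, and the two sides would still need some form of synchronization (which is where the cacheline bouncing comes in).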
I'm not sure this additional IPC overhead would actually be higher than the performance degradation imposed by limiting the instruction set (for example, memory accesses seem to carry extra overhead because of it).
Of course, you could also in principle trust the OS to be secure, and run arbitrary code in a security context with limited privileges, but with access to the GPU and other useful stuff; unfortunately, the history of local root holes on all OSes (not to mention the graphics drivers...) probably makes this an unwise choice.