As for the Cell SPEs or GPU shaders, they probably aren't capable enough to be worth having the OS run on them directly. I only brought them up because there's no automatic cache coherence with them: they read data into local store (the 256 KB), operate on it, then write the results back out, with both the read and the write done explicitly.
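To make the "explicit read, explicit write" point concrete, here's a toy Python sketch of software-managed memory. The names and sizes are invented for illustration; real SPE code would issue MFC DMA commands, but the shape is the same: copy in, compute locally, copy out, with no coherence machinery in between.

```python
# Hypothetical sketch of software-managed memory, Cell-SPE style:
# nothing is cached automatically, so the code must explicitly copy
# its working set into a small local store and copy results back.

LOCAL_STORE_SIZE = 256 * 1024  # bytes, like an SPE's local store

def process_chunk(main_memory, offset, length):
    """Explicit read -> compute locally -> explicit write-back."""
    assert length <= LOCAL_STORE_SIZE
    local_store = main_memory[offset:offset + length]  # explicit "get" (DMA in)
    local_store = [x * 2 for x in local_store]         # compute on local copy
    main_memory[offset:offset + length] = local_store  # explicit "put" (DMA out)

ram = list(range(8))
process_chunk(ram, 2, 4)   # only the copied-back range is updated
```

Until the final write-back happens, main memory still holds the stale values, which is exactly why coherence has to be managed by hand.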
No one runs an MP system with different instruction sets, true, but it isn't impossible. The OS would need to be built twice, once for each instruction set, though it could share the data structures.
I wonder if Intel ever played around with this for an Itanium server? I seem to remember that Itanium once shared the same socket layout as a Xeon, so this would have been possible at the hardware level.
Now, if you got very tricky and required that all software ship in an intermediate language, the OS could use LLVM, Java, .NET or whatever to compile the binary for whichever CPU was going to execute the process. That would make core switching really expensive! You'd also need some way to mark where task switches were allowed to happen, and perhaps run the task up to the next switch point so you could resume at the equivalent point on the other architecture.
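The switch-point idea can be sketched with a toy interpreter. Everything here is invented for illustration: tasks ship as a tiny stack IR, the OS "compiles" (here, just interprets) for whichever core runs them, and execution only pauses at marked switch points, so every architecture agrees on where a migration can happen.

```python
# Toy sketch: tasks ship as IR; migration is only legal at switch points.
IR_PROGRAM = [
    ("push", 2), ("push", 3), ("add", None),
    ("switch_point", None),            # task may migrate here
    ("push", 10), ("mul", None),
]

def run_until_switch_point(program, pc, stack):
    """Execute IR ops until the next switch point (or the end)."""
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "switch_point":
            break                      # safe point: recompile and migrate
    return pc, stack

# Run on "core A" up to the switch point, then resume on "core B":
pc, stack = run_until_switch_point(IR_PROGRAM, 0, [])
pc, stack = run_until_switch_point(IR_PROGRAM, pc, stack)
```

Because the stack and program counter are the only state carried across the switch point, either "core" can pick the task up there after compiling the IR for its own instruction set.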
A bit more realistic would be cores with the same base instruction set but specialized instructions on different cores. That could work really well: when a program hit an "Illegal Instruction" fault, the OS could inspect the faulting instruction and either find a core in the system that supports it and migrate the task there, or emulate the instruction. Or launch a user-space helper to download a new core design off the internet and program the FPGA! That would let programmers use specialized vector, GPU, audio, or regular-expression instructions without worrying about which cores support them.
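The trap-and-migrate scheme above can be sketched in a few lines. The core names and the `vector_mul` instruction are invented for illustration; the point is the fault-handling order: run locally if the instruction is supported, otherwise migrate to a core that has it, otherwise fall back to emulation.

```python
# Hypothetical trap-and-migrate sketch: all cores share a base ISA,
# some cores add extra instructions; an "illegal instruction" fault
# makes the OS migrate the task or emulate the instruction.

BASE_ISA = {"add", "load", "store"}
CORES = {
    "core0": BASE_ISA,
    "core1": BASE_ISA | {"vector_mul"},   # invented specialized instruction
}

def emulate(instruction):
    return f"emulated {instruction}"

def execute(instruction, core):
    """Return (what happened, which core the task now runs on)."""
    if instruction in CORES[core]:
        return f"{instruction} on {core}", core
    # Illegal-instruction fault: look for a core that has the instruction.
    for other, isa in CORES.items():
        if instruction in isa:
            return f"{instruction} on {other} (migrated)", other
    return emulate(instruction), core     # no core has it: emulate in software

result, core = execute("vector_mul", "core0")
```

From the program's point of view the instruction just works; whether it ran natively, after a migration, or under emulation is the OS's problem.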