I do not understand what you two are squabbling about.
If an application builds a MIME message for each Cut/Copy operation, and receives a MIME message for each Paste, why does this not work?
The creator of the message (a PDF reader, for example) can offer both text and PBM versions; the receiver can take the text/plain part (if it is an xterm) or the image/pbm part (if it is an image editor), or prompt the user for the format they want if it so desires.
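To make the idea concrete, here is a minimal sketch in Python using the standard `email` package. It assumes the clipboard payload is a single multipart/alternative message; the function names `build_clipboard_message` and `paste` are hypothetical, the tiny bitmap is a made-up stand-in for a rendered selection, and `image/pbm` is simply the label used above rather than a checked registered type.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# A made-up 2x2 plain-format PBM, standing in for a rendered selection.
PBM_BYTES = b"P1\n2 2\n1 0\n0 1\n"


def build_clipboard_message(text, pbm):
    """Cut/Copy side (hypothetical): offer the selection as text and as PBM."""
    msg = EmailMessage()
    msg.set_content(text)  # becomes the text/plain part
    # Adding an alternative turns the message into multipart/alternative.
    msg.add_alternative(pbm, maintype="image", subtype="pbm")
    return msg.as_bytes()


def paste(raw, preferred):
    """Paste side (hypothetical): return the first part the receiver wants.

    `preferred` is an ordered list of content types; None means that no
    offered format matched and the receiver could prompt the user instead.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    for wanted in preferred:
        for part in msg.iter_parts():
            if part.get_content_type() == wanted:
                return part.get_content()  # str for text, bytes for image
    return None


if __name__ == "__main__":
    raw = build_clipboard_message("selected words", PBM_BYTES)
    print(paste(raw, ["text/plain"]))               # what an xterm would ask for
    print(paste(raw, ["image/pbm", "text/plain"]))  # what an image editor would ask for
```

The serialized message is plain bytes, so it can sit in a disk file or on a network server without either application caring. One wrinkle in this sketch is that the `email` package terminates the text part with a newline, so a real receiver would probably want to trim it.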
How the message is stored is immaterial (disk file, network server, whatever). Why would this "text only" interface not work?