Yeah, funnily enough, that's exactly what X does now.
And it still doesn't work out that well, because you've got a few big-ticket issues. First, you need to develop and maintain that network-aware protocol, which is actually pretty hard. Image-based network transparency needs a single protocol for transmitting image deltas and input events, and nothing else. An accelerated protocol either needs a generic RPC framework (which has historically proven heinously inefficient) or a custom-tailored protocol that has to be updated every few months for new OpenGL versions and extensions.

Plus, the network protocol is _still_ transmitting images all over the place, because modern apps do a ton of image processing client-side and rely on really fast buses between the client app and the GPU to keep things running smoothly, so you still end up needing a fast image-transfer portion of the protocol anyway.

Plus there are the latency issues -- and this is the real killer -- because apps are using the GPU more and more for non-graphical processing, and data needs to cross back and forth between the GPU and CPU many times when carrying out even relatively simple tasks. For instance, picking an object when you click a mouse button can be (and often is) implemented by rendering silhouettes in fixed per-object colors to an offscreen buffer, then reading back the color at the texel under the cursor. Separating the CPU and GPU across a network link makes this god-awful slow.
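The picking trick described above can be sketched in plain Python with a toy CPU "framebuffer" standing in for a GPU render target; none of the names here come from any real framework. The point is the readback at the end -- on a real GPU that's a `glReadPixels`-style round trip, and it's exactly the step that dies when the GPU lives on the other end of a network:

```python
# Color-based picking: render each object's silhouette in a unique flat
# "color" (here just its integer ID) to an offscreen buffer, then look up
# the value under the cursor.

WIDTH, HEIGHT = 64, 48

def render_id_buffer(objects):
    """Render axis-aligned rects, each filled with its object ID."""
    buf = [[0] * WIDTH for _ in range(HEIGHT)]   # 0 = background
    for obj_id, (x, y, w, h) in objects.items():
        for row in range(y, min(y + h, HEIGHT)):
            for col in range(x, min(x + w, WIDTH)):
                buf[row][col] = obj_id           # "flat color" = ID
    return buf

def pick(buf, cursor_x, cursor_y):
    """Read back the single texel under the cursor."""
    return buf[cursor_y][cursor_x]

objects = {1: (10, 10, 20, 15), 2: (25, 20, 20, 15)}  # id -> (x, y, w, h)
buf = render_id_buffer(objects)
print(pick(buf, 12, 12))   # -> 1 (inside object 1)
print(pick(buf, 40, 30))   # -> 2 (inside object 2, drawn later, on top)
print(pick(buf, 0, 0))     # -> 0 (background)
```

Locally, that readback costs microseconds over PCIe. Put a network between `render_id_buffer` and `pick` and every click now pays a full round-trip latency.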
Turns out that image-based remoting works damn fast. Modern image and video compression techniques blow away what we had when VNC and X were still the top dogs in remoting. It's really quite trivial to push 60 frames per second of 1080p content reliably over a pipe of just a few megabits per second (and that's FAR more than any app other than a video player or game would ever need), and then all you need is the input events being sent back.
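Back-of-the-envelope numbers for that claim (the 8 Mbit/s figure is a typical ballpark for decent-quality H.264 at 1080p60, not a measurement):

```python
# Raw vs. compressed bandwidth for a 1080p60 stream.
width, height, bytes_per_pixel, fps = 1920, 1080, 3, 60

raw_bits_per_sec = width * height * bytes_per_pixel * 8 * fps
print(f"raw: {raw_bits_per_sec / 1e9:.2f} Gbit/s")        # -> raw: 2.99 Gbit/s

# Assumed ballpark H.264 bitrate for 1080p60 desktop content.
encoded_bits_per_sec = 8_000_000
print(f"encoded: {encoded_bits_per_sec / 1e6:.0f} Mbit/s")
print(f"ratio: ~{raw_bits_per_sec // encoded_bits_per_sec}x")  # -> ratio: ~373x
```

Uncompressed you'd need a 3 Gbit/s pipe; a modern codec gets you two-plus orders of magnitude under that, and typical desktop content (mostly static, small dirty regions) compresses far better still.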
And there's no reason this has to be a whole-desktop affair rather than a trivially easy-to-use, per-application transparent setup. Wayland, in fact, makes this way easier than X does! Redirect the app to an offscreen buffer (just like the compositor already does), but instead of rendering it to the screen, you compress, motion-diff, and encode the data and push it across a channel in your SSH session (just like X already does), and the remote end decodes and displays the result. Send input events back. Super easy, and there's no reason it would be any more difficult for an end-user than what X gives you today. And it works better, is forward-compatible with whatever advances come about in GPU technology, etc.
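A minimal sketch of the diff-and-compress step of that pipeline, treating frames as raw byte buffers and using zlib as a crude stand-in for a real video codec (motion estimation, entropy coding, rate control are all elided; the function names are mine, not from any protocol):

```python
import zlib

def encode_frame(prev: bytes, cur: bytes) -> bytes:
    """Diff the new frame against the previous one, then compress.
    XOR of two mostly-identical frames is mostly zeros, which
    deflates extremely well -- a toy version of motion-diffing."""
    delta = bytes(a ^ b for a, b in zip(prev, cur))
    return zlib.compress(delta)

def decode_frame(prev: bytes, payload: bytes) -> bytes:
    """Remote end: decompress the delta and apply it to its own copy
    of the previous frame to reconstruct the new one."""
    delta = zlib.decompress(payload)
    return bytes(a ^ b for a, b in zip(prev, delta))

# Two 64 KiB "frames" that differ only in a small region,
# like a cursor moving over an otherwise static window.
frame0 = bytes(64 * 1024)
frame1 = bytearray(frame0)
frame1[100:116] = b"cursor moved here"[:16]
frame1 = bytes(frame1)

payload = encode_frame(frame0, frame1)
assert decode_frame(frame0, payload) == frame1
print(f"{len(frame1)} bytes -> {len(payload)} bytes on the wire")
```

Both ends keep the previous frame as shared state, so only deltas cross the wire -- same idea a real codec's P-frames use, just without any of the cleverness.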
About the only thing you'll lose is the ability for a headless box with no GPU to get accelerated rendering on your desktop, but as I pointed out above, you don't have that anyway -- at least not for anything beyond the incredibly anemic and borderline-useless 1.x versions of OpenGL.
Oh, and the Wayland-based network transparency could reuse an existing image-based protocol so it actually becomes easy to display those networked UNIX/Linux apps on a Windows/OSX/iOS/whatever machine without needing to jump through the hoops of getting a huge crazy X server installed. In fact, it can transparently work with any app from any OS. Neat.
Is all this written and working yet? No, of course not. That's no reason to claim it can't be written, though, or that it won't. In the end, the people who actually need something better than X and are willing to put in that effort are going to get what they want, and the people who want to keep esoteric X features but aren't willing to do the work to keep them on the modern architecture will lose out. That is the way of Open Source, as it is.