The X server may or may not know about scrolling (it depends on which APIs the client is using), and the interface to expose this would be a nightmare... plus scrolling is an extremely trivial sort of motion to detect (integer number of pixels with absolutely unchanged contents). Why not just write some motion detection code?
Both sides of the connection have the old window contents available (though the server may not be keeping them in memory right now). A cheap-but-effective algorithm would be: break the old contents into fixed-size chunks (say 16x16 pixels), discard the solid-color ones, and hash the rest with a rolling hash. When you get a new region to send, first run the rolling hash over that region to see if any 16x16 window in it matches one of the stored chunks. If so, align the new data with the matching old data, subtract, and send the offset plus the difference (which will be highly compressible). The client then inverts this to reconstruct: it shifts its own copy of the old contents by the offset and adds the difference back. This stays effective even if some parts of the region are changing at the same time they scroll, new bits of the window are scrolling onto the screen, etc., and the computational overhead would still be absolutely swamped by gzip, never mind vpx/x264.
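To make that concrete, here is a rough Python sketch of the idea, not a tuned implementation. It makes several assumptions that are mine rather than anything above: frames are equal-sized 2D lists of 8-bit single-channel pixels, the function names (`detect_offset`, `encode`, `decode`) are made up for illustration, one global offset is picked by majority vote over all matches, and for brevity it rehashes the block at every pixel position instead of updating a true rolling hash incrementally.

```python
from collections import Counter

BLOCK = 16  # chunk size suggested in the text; any fixed size works


def block_at(frame, x, y, size):
    """Return the size x size block with top-left corner (x, y) as a hashable tuple."""
    return tuple(tuple(row[x:x + size]) for row in frame[y:y + size])


def is_solid(block):
    """True if every pixel in the block has the same value."""
    first = block[0][0]
    return all(p == first for row in block for p in row)


def index_old_frame(old, size=BLOCK):
    """Map hash(block) -> (x, y) for grid-aligned, non-solid blocks of the old frame."""
    h, w = len(old), len(old[0])
    table = {}
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            blk = block_at(old, x, y, size)
            if not is_solid(blk):
                table.setdefault(hash(blk), (x, y))
    return table


def detect_offset(old, new, size=BLOCK):
    """Vote for the (dx, dy) that moves old blocks onto new ones; (0, 0) if nothing matches."""
    table = index_old_frame(old, size)
    h, w = len(new), len(new[0])
    votes = Counter()
    # Assumption: a plain hash at every pixel position stands in for a rolling hash.
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            blk = block_at(new, x, y, size)
            hit = table.get(hash(blk))
            # Confirm the match pixel by pixel so a hash collision cannot vote.
            if hit is not None and block_at(old, hit[0], hit[1], size) == blk:
                votes[(x - hit[0], y - hit[1])] += 1
    return votes.most_common(1)[0][0] if votes else (0, 0)


def predict(old, dx, dy):
    """Shift the old frame by (dx, dy); pixels with no source are predicted as 0."""
    h, w = len(old), len(old[0])
    return [[old[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else 0
             for x in range(w)] for y in range(h)]


def encode(old, new, size=BLOCK):
    """Sender side: return the offset and the residual (new minus shifted old, mod 256)."""
    dx, dy = detect_offset(old, new, size)
    pred = predict(old, dx, dy)
    residual = [[(n - p) % 256 for n, p in zip(nr, pr)] for nr, pr in zip(new, pred)]
    return (dx, dy), residual


def decode(old, offset, residual):
    """Receiver side: shift the old frame by the offset and add the residual back."""
    pred = predict(old, *offset)
    return [[(p + r) % 256 for p, r in zip(pr, rr)] for pr, rr in zip(pred, residual)]
```

With a frame that scrolls up by three rows, `encode` should report an offset of `(0, -3)` and a residual that is zero everywhere except the freshly exposed rows and any pixels that genuinely changed, which is exactly the kind of data gzip handles well. A real version would probably track an offset per matched area rather than one vote for the whole region, and would update the hash incrementally as the window slides.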