The real performance killer is that the zero-page check costs almost nothing CPU-wise, but it has to wait synchronously for each successive cacheline to arrive from slow RAM. Your CPU will be stalled most of the time, waiting for the data to arrive from RAM or a lower-level cache. (And let's not forget that it only makes sense to compress rarely-accessed data, which is unlikely to already be in cache.)
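To make the point concrete, here is a minimal sketch of what such a zero-page check might look like. It is not taken from any particular kernel; the name `page_is_zero` and the 4096-byte `PAGE_SIZE_BYTES` are assumptions for illustration. The loop does one load and one compare per word, so there is almost no computation to overlap with the memory fetches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096  /* assumed page size, for illustration only */

/*
 * Hypothetical helper: returns true if every byte of the page is zero.
 * `page` must point to PAGE_SIZE_BYTES bytes aligned for uint64_t.
 */
static bool page_is_zero(const void *page)
{
    const uint64_t *words = page;
    size_t n = PAGE_SIZE_BYTES / sizeof(uint64_t);

    for (size_t i = 0; i < n; i++) {
        /* One load and one compare per word: the loop can only go as
         * fast as memory delivers the data it is reading. */
        if (words[i] != 0)
            return false;
    }
    return true;
}
```

On a page that is not in cache, nearly all of the time in this loop would be spent waiting for loads rather than executing instructions.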
In contrast, any compression algorithm is likely to be slow enough that by the time it needs to access the next cacheline, the hardware prefetcher will already have that data in cache, or will at least have reduced the time your CPU is stalled waiting on memory. Thus RAM (pre)fetching and CPU compression can work in parallel.
Hardware prefetchers are also so good these days that even explicit *asynchronous* prefetch instructions can frequently hurt performance: https://lwn.net/Articles/444336/
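For readers who have not seen one, this is roughly what an explicit prefetch hint looks like in C. It is only a sketch: `struct node` and `sum_list` are hypothetical names, and `__builtin_prefetch` is the GCC/Clang builtin, assumed to be available. Whether the hint helps or hurts on a given CPU is something that has to be measured.

```c
#include <stddef.h>

struct node {               /* hypothetical list node, for illustration */
    struct node *next;
    long value;
};

/* Sums a singly linked list, hinting the next node on each step. */
static long sum_list(const struct node *head)
{
    long total = 0;

    for (const struct node *n = head; n != NULL; n = n->next) {
        /* Non-blocking hint: request the next node while this one is
         * processed. The builtin does not fault on a NULL address. */
        __builtin_prefetch(n->next);
        total += n->value;
    }
    return total;
}
```

The hint costs an instruction on every iteration, so if the hardware prefetcher was already going to fetch the line (or the data is already cached), it is pure overhead.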