
# Specialised compression method from 1999 paper

Posted Feb 16, 2013 7:50 UTC (Sat) by CChittleborough (subscriber, #60775)
Parent article: The zswap compressed swap cache

Some academics tried this technique back in 1999 using three compression methods, including one designed for memory contents rather than file contents. See "The Case for Compressed Caching in Virtual Memory Systems" by Paul R. Wilson, Scott F. Kaplan, Yannis Smaragdakis, Proc. 1999 Usenix Annual Tech. Conf. (PDF, CiteSeerX), especially section 2.

It might be worth adding one of these word-oriented, recency-based compressors to the kernel and trying it with zswap. (One problem: Wilson's compressor design assumes a 32-bit machine.)

## Specialised compression method from 1999 paper (WKdm et al.)

Posted Oct 28, 2013 6:56 UTC (Mon) by phil (guest, #91719)

Actually, Wilson's algorithms don't assume 32 bits, and work even better on 64-bit machines. That's one reason they're used in the new version of OSX ("Mavericks") now.

They assume that most stuff is more or less aligned on AT LEAST 32 bit boundaries, or can be passably approximated that way.

So for example, they'll notice if a 32 bit word is all zeroes, or if it's zeroes in the upper 22 bits; many integers are.

For a 64 bit int that's zeroes in all but the low 10 bits, what will happen is that the first 32 bits will be encoded as a big zero, in two bits, and second 32 bits will be encoded as an "almost zero," in 2 + 10 = 12 bits, so you get 64 bits crammed into 2 + 12 = 14 bits, for a compression ratio greater than 4.
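The bit accounting above can be sketched in a few lines. This is a rough illustration of the costs described in the comment, not the actual WKdm code; the function name and the exact tag widths are assumptions for the example.

```python
# Sketch of the per-word bit costs described above (illustrative only,
# not the real WKdm implementation).
def encoded_bits(word32):
    TAG = 2                      # every 32-bit word pays a 2-bit tag
    if word32 == 0:
        return TAG               # all-zero word: tag only
    if word32 < (1 << 10):       # zeroes in the upper 22 bits
        return TAG + 10          # tag plus the 10 low bits
    return TAG + 32              # no match: tag plus the literal word

# A 64-bit int with only its low 10 bits set, split into two 32-bit
# halves (the low half carries the 10 bits):
value = 0x3FF
low, high = value & 0xFFFFFFFF, value >> 32
total = encoded_bits(high) + encoded_bits(low)
print(total)                     # 2 + 12 = 14 bits for 64 bits of input
```

With 14 bits of output for 64 bits of input, the ratio works out to 64/14, a little over 4.5, matching the "greater than 4" figure above.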

Similar things happen with 64-bit patterns that aren't mostly zeroes, but are similar to other 64-bit patterns nearby in memory, as happens with many integers and most pointers. Often integers are numerically similar to other integers nearby (e.g., in arrays that represent discretized smooth functions) and pointers are similar to other pointers nearby (e.g., other pointers into the same few kilobytes, megabytes or gigabytes holding related elements of some data structure).

When you have one of those similar-to-something-nearby matches, a 32 bit word gets compressed to 2 tag bits plus 4 selector bits saying which recently-seen pattern it's similar to, plus the 10 low bits that differ. For a 32-bit partial match, that gives you a compression ratio of 2 (2 + 4 + 10 = 16 bits versus 32).

But if you're actually looking at 64 bit pointers or ints, you're likely to do much better in the case of partial all-but-low-bits matches like that: the upper 4 bytes will compress to 6 bits and the lower 4 bytes to 16, for a compression ratio of 64 / (6 + 16 = 22), or nearly 3.
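The recency-based matching described in the last few paragraphs can be sketched as a small direct-mapped dictionary of recently seen words. This is modelled loosely on WKdm's 16-entry dictionary, but the function name, indexing scheme, and bit widths here are assumptions chosen to mirror the numbers in the comment, not the real kernel or Mavericks code.

```python
# Illustrative sketch of recency-based word matching (loosely modelled
# on WKdm's 16-entry dictionary; not the real implementation).
def compress_cost(words):
    dictionary = [0] * 16            # 16 recently seen 32-bit patterns
    bits = 0
    for w in words:
        i = (w >> 10) % 16           # index by the upper 22 bits
        if w == 0:
            bits += 2                # all-zero word: tag only
        elif dictionary[i] == w:
            bits += 2 + 4            # exact match: tag + 4-bit selector
        elif dictionary[i] >> 10 == w >> 10:
            bits += 2 + 4 + 10       # partial match: tag + selector + low bits
            dictionary[i] = w
        else:
            bits += 2 + 32           # miss: tag + the literal word
            dictionary[i] = w
    return bits

# Two nearby pointers sharing their upper 22 bits: the first is a miss
# (34 bits), the second a partial match (16 bits).
print(compress_cost([0x8004A123, 0x8004A2FF]))  # 34 + 16 = 50
```

The second pointer costs 16 bits instead of 34 purely because something with the same upper bits was seen recently, which is exactly the "similar to other pointers nearby" effect described above.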

The short version is that if you guess 32-bit alignment, you get pretty good compression very, very fast on 32-bit machines, and even better compression "for free" on 64-bit machines, a shade faster still, because 64-bit ints and pointers typically have even less information in their upper bytes than 32-bit ints and pointers do.