
In-kernel memory compression

April 3, 2013

This article was contributed by Dan Magenheimer

Amdahl's law tells us that there is always a bottleneck in any computing system. Historically, the bottleneck in many workloads on many systems has been the CPU, so system designers have made CPUs faster and more efficient while continuing to increase the number of CPU cores even in low-end systems. Now, increasingly, RAM is the bottleneck: CPUs wait idly while data is moved from disk to RAM and back again. Adding more RAM is not always a timely or cost-effective option, and sometimes not an option at all. Faster I/O buses and solid-state disks reduce the bottleneck but don't eliminate it.

Wouldn't it be nice if it were possible to increase the effective amount of data stored in RAM? And, since those CPUs are waiting anyway, perhaps we could use those spare CPU cycles to contribute towards that objective? This is the goal of in-kernel compression: We keep more data — compressed — in RAM and use otherwise idle CPU cycles to execute compression and decompression algorithms.

With the recent posting of zswap, there are now three in-kernel compression solutions proposed for merging into the kernel memory management (MM) subsystem: zram, zcache, and zswap. While a casual observer might think that only one is necessary, the three have significant differences and may target different user bases. So, just as there are many filesystems today co-existing in the kernel, someday there may be multiple compression solutions. Or maybe not... this will be decided by Linus and key kernel developers. To help inform that decision, this article compares and contrasts the different solutions, which we will call, for brevity, the "zprojects". We will first introduce some of the key concepts and challenges of compression. Then we will identify three design layers and elaborate on the different design choices made by the zprojects. Finally, we will discuss how the zprojects may interact with the rest of the kernel and then conclude.

Compression basics

For in-kernel compression to work, the kernel must take byte sequences in memory and compress them, and then keep the compressed version in RAM until some point in the future when the data is needed again. While the data is in a compressed state, it is impossible to read any individual bytes from it or write any individual bytes to it. Then, when the data is needed again, the compressed sequence must be decompressed so that individual bytes can again be directly accessed.

It is possible to compress any number of sequential bytes but it is convenient to use a fixed "unit" of compression. A standard storage unit used throughout the kernel is a "page", which consists of PAGE_SIZE bytes (4KB on most architectures supported by Linux). If a page is aligned at a PAGE_SIZE address boundary, it is called a "page frame"; the kernel maintains a corresponding "struct page" for every page frame in system RAM. All three zprojects use a page as the unit of compression and allocate and manage page frames to store compressed pages.

There are many possible compression algorithms. In general, achieving a higher "compression ratio" takes a larger number of CPU cycles, whereas less-effective compression can be done faster, so it is important to achieve a good balance between time and compression ratio. All three zprojects, by default, use the LZO(1X) algorithm in the kernel's lib/ directory, which is known to deliver a good balance. However, it is important that the choice of algorithm remain flexible; the software algorithm might, in some cases, be replaced by an architecture-specific hardware compression engine.
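
To make the mechanics concrete, here is a minimal sketch of how a zproject might compress a single page with the kernel's LZO1X routines from include/linux/lzo.h; error handling and buffer management are simplified, and the helper name is invented for illustration:

    #include <linux/lzo.h>
    #include <linux/highmem.h>

    /*
     * Compress the data in @page and return its zsize, or 0 on failure.
     * The caller preallocates @dst (lzo1x_worst_compress(PAGE_SIZE) bytes,
     * since LZO can expand incompressible input) and @wrkmem
     * (LZO1X_MEM_COMPRESS bytes), typically as per-CPU buffers.
     */
    static size_t compress_one_page(struct page *page, u8 *dst, void *wrkmem)
    {
            u8 *src = kmap_atomic(page);
            size_t zsize = 0;
            int ret;

            ret = lzo1x_1_compress(src, PAGE_SIZE, dst, &zsize, wrkmem);
            kunmap_atomic(src);
            return ret == LZO_E_OK ? zsize : 0;
    }

Decompression goes through lzo1x_decompress_safe() in the same fashion, copying the result directly into the destination page frame.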

In general, the number of cycles required to compress, and to later decompress, a sequence of bytes is roughly proportional to the number of bytes in the sequence. Since the size of a page is fairly large, page compression and decompression are expensive operations and, so, we wish to limit the number of those operations. Thus we must carefully select which pages to compress, choosing pages that are likely to be used again but not likely to be used in the near future, lest we spend all the CPU's time repeatedly compressing and then decompressing pages. Since individual bytes in a compressed page are inaccessible, we must not only ensure that normal CPU linear byte addressing is not attempted on any individual byte, but also ensure that the kernel can clearly identify the compressed page so that it can be found and decompressed when the time comes to do so.

When a page is compressed, the compression algorithm is applied and the result is a sequence of bytes which we can refer to as a "zpage". The size of a zpage, its "zsize", is highly data-dependent, hard to predict, and highly variable. Indeed the expected distribution of zsize drives a number of key zproject design decisions so it is worthwhile to understand it further. The "compression ratio" for a page is zsize divided by PAGE_SIZE. For nearly all pages, zsize is less than PAGE_SIZE so the compression ratio is less than one, but odd cases may occur where compression actually results in the zsize being larger than PAGE_SIZE. A good compression solution must have a contingency plan to deal with such outliers. At the other extreme, a data page containing mostly zeroes or ones may compress by a factor of 100x or more. For example, LZO applied to an all-zero page results in a zpage with zsize equal to 28, for a compression ratio of 0.0068.

As a rough rule of thumb, "on average", compression has a ratio of about 0.5 (i.e. it reduces a page of data by about a factor of two), so across a wide set of workloads, it is useful to envision the distribution as a bell curve centered at PAGE_SIZE/2. We will refer to zpages with zsize less than PAGE_SIZE/2 as "thin" zpages, and zpages with zsize greater than PAGE_SIZE/2 as "fat" zpages. For any given workload, of course, the distribution may "skew thin" (if many mostly-zero pages are compressed, for example) or "skew fat", as is true if the data is difficult to compress (already-compressed data like a JPEG image would be an example). A good compression solution must thus be able to handle zsize distributions from a wide range of workloads.
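
To make the terminology concrete, a tiny helper (hypothetical; the names are invented here) that classifies a zpage by its zsize and flags the outlier case might look like this:

    #include <linux/mm.h>   /* PAGE_SIZE */

    enum zpage_kind { ZPAGE_THIN, ZPAGE_FAT, ZPAGE_INCOMPRESSIBLE };

    /*
     * A zsize at or above PAGE_SIZE means compression gained nothing;
     * the caller must fall back to storing the page uncompressed or
     * rejecting the store entirely, as discussed later in this article.
     */
    static enum zpage_kind classify_zpage(size_t zsize)
    {
            if (zsize >= PAGE_SIZE)
                    return ZPAGE_INCOMPRESSIBLE;
            return zsize < PAGE_SIZE / 2 ? ZPAGE_THIN : ZPAGE_FAT;
    }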

With this overview and terminology, we can now compare and contrast the three zprojects. At a high level, each uses a "data source layer" to provide pages to compress and a "data management layer" to organize and store the zpages. This data management layer has two sublayers, the first for data organization, which we will call the "metadata layer", and the second for determining how best to utilize existing kernel memory and kernel allocation code (such as alloc_page()) to optimally store the zpages, which we will call the "zpage allocator layer".

Data source layer — overview

As previously noted, the nature of compression limits the types and quantities of pages to which it can be applied. It doesn't make sense to compress a page for which there is a high probability of imminent reuse. Nor does it make sense to compress a page for which there is a high probability that the data it contains will never be used again. All three zprojects assume one or more data sources that provide a "pre-qualified" sequence of data pages. In one zproject, an existing kernel subsystem is the data source. For the other two zprojects, the data page stream is sourced by kernel hooks, known as cleancache and frontswap, that were previously added as part of the "transcendent memory" project.

Each data source must also provide a unique identifier, or "key", for each page of data so that the page, once stored and associated with that unique key, can be later retrieved by providing the key. It's also important to note that different data sources may result in streams of pages with dramatically different contents and thus, when compressed, different zsize distributions.

Swap/anonymous pages as a data source

The Linux swap subsystem provides a good opportunity for compression because, when the system is under memory pressure, swap acts as a gatekeeper between (a) frequently used anonymous pages which are kept in RAM, and (b) presumably less frequently used anonymous pages that "overflow", or can be "swapped", to a very-much-slower-than-RAM swap disk. If, instead of using a slow swap disk, we can cause the gatekeeper to store the overflow pages in RAM, compressed, we may be able to avoid swapping them entirely. Further, the swap subsystem already provides a unique key for each page so that the page can be fetched from the swap disk. So, unsurprisingly, all three zprojects consider swap pages as an ideal data source for compression.

One zproject, zram, utilizes the existing swap device model. A zram swap device is explicitly created and enabled in user space (via mkswap and swapon) and this zram device is prioritized (presumably highest) against other configured swap devices. When the swap subsystem sends a data page to the zram device, the page goes through the block I/O subsystem to the zram "driver". All swap pages sent to zram are compressed and associated with a key created by concatenating the swap device identifier and page offset. Later, when the swap subsystem determines (via a page fault) that the page must be brought back in from the swap device to fill a specific page frame, the zram device is notified; it then decompresses the zpage matching the swap device number and page offset into that page frame.

The other two zprojects, zcache and zswap, use the kernel's frontswap hooks (available since Linux 3.5) in the swap subsystem. Frontswap acts as a "fronting store" — a type of a cache — to an existing swap device. It avoids the block I/O subsystem completely; swapped pages instead go directly to zcache/zswap where they are compressed. On the fetch path, when the page fault occurs, the block I/O subsystem is again bypassed and then, as with zram, the page is decompressed directly into the faulting page frame.
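
For reference, the hooks a frontswap backend registers look roughly like this (the structure follows include/linux/frontswap.h in the Linux 3.5–3.9 era; the comments are added here):

    /*
     * The hooks a frontswap backend (zcache or zswap) registers; the
     * unsigned "type" identifies the swap device and the pgoff_t gives
     * the page's offset within it -- together they form the key.
     */
    struct frontswap_ops {
            void (*init)(unsigned);                          /* swapon */
            int (*store)(unsigned, pgoff_t, struct page *);  /* compress and keep in RAM;
                                                                nonzero means "rejected" */
            int (*load)(unsigned, pgoff_t, struct page *);   /* decompress into page */
            void (*invalidate_page)(unsigned, pgoff_t);
            void (*invalidate_area)(unsigned);               /* swapoff */
    };

A rejected store falls back to the normal swap path, a property the next section relies on.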

There are some subtle but important side effects of the different approaches:

  • The block I/O subsystem is designed to work with fixed-size block devices and is not very happy when a block write results in a failure. As a result, when a poorly-compressible page ("very fat" zpage) is presented to zram, all PAGE_SIZE bytes of the uncompressed page are stored by zram in RAM, for no net space savings. The frontswap-based zprojects have more flexibility; if a very fat zpage is presented, zcache/zswap can simply reject the request, in which case the swap subsystem will pass on the original data page to the real swap disk.

  • Since zram presents itself as a swap disk, user-space configuration is required, but this also means that zram can work on systems with no physical swap device, a configuration common for embedded Linux systems. On the other hand, zcache and zswap depend entirely on already-configured swap devices and so are not functional unless at least one physical swap device is configured and sized adequately. As a result, one might think of zram as being a good match for an embedded environment, while the others are better suited for a more traditional server/data-center environment.

  • It is important to note that all three zprojects must never discard any of these data pages unless explicitly instructed to do so; otherwise user-space programs may experience data loss. Since all three zprojects have a finite capacity on any given system, it is prudent to have a "pressure relief valve". Both zcache and zswap have the ability to write back data to the swap disk to which the data was originally intended to be written; this cannot be done with zram.

Clean page cache pages as a data source

It is not uncommon for the vast majority of page frames in a running system to contain data pages that are identical to pages already existing on a disk in a filesystem. In Linux, these pages are stored in the page cache; the data in these pages is kept in RAM in anticipation that it will be used again in the future. When memory pressure occurs and the kernel is looking to free RAM, the kernel will be quick to discard some of this duplicate "clean" data because it can be fetched from the filesystem later if needed. A "cleancache" hook, present in the kernel since Linux 3.0, can divert the data in these page frames, sending the pages to zcache where the data can be compressed and preserved in RAM. Then, if the kernel determines the data is again needed in RAM, a page fault will be intercepted by another cleancache hook which results in the decompression of the data.

Because modern filesystems can be large, unique identification of a page cache page is more problematic than is unique identification of swap pages. Indeed, the key provided via cleancache must uniquely identify: a filesystem; an "exportable" inode value representing a file within that filesystem; and a page offset within the file. This combination requires up to 240 bits.
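
One plausible C rendering of that key, with invented field names (this is an illustration of the 240-bit figure above, not the cleancache structures verbatim):

    #include <linux/types.h>

    struct pagecache_key {
            u16 pool_id;       /* which filesystem */
            u64 object_id[3];  /* "exportable" inode: up to 192 bits */
            u32 index;         /* page offset within the file */
    };                         /* 16 + 192 + 32 = 240 bits */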

Zcache was designed to handle cleancache pages, including the full range of required keys. As a result, its data management layer is much more complex, consisting of a combination of different data structures which we will describe below. Further, the location of some cleancache hooks in the VFS part of the kernel results in some calls to the data management layer being made with interrupts disabled, which must be properly handled.

Even more than with swap data, filesystem data can proliferate rapidly and, even compressed, can quickly fill up RAM. So, as with swap data, it is prudent to have a pressure-relief valve for page cache data. Unlike swap data, however, we know that page cache data can simply be dropped whenever necessary. Since zcache was designed from the beginning to manage page cache pages, its data management layer was also designed to efficiently handle the eviction of page cache zpages.

Zram can maintain entire fixed-size filesystems in compressed form in RAM, but not as a cache for a non-RAM-based filesystem. In essence, part of RAM can be pre-allocated as a "fast disk" for a filesystem; even filesystem metadata is compressed and stored in zram. While zram may be useful for small filesystems that can entirely fit in RAM, the caching capability provided via cleancache combined with the compression provided by zcache is useful even for large filesystems.

Zswap is singularly focused on swap and so does not use page cache or filesystem data as a data source at all. As a result, its design is simpler because it can ignore the complexities required to handle the tougher page cache data source.

The metadata layer

Once a zproject receives a data page for compression, it must set up and maintain data structures so that the zpage can be stored and associated with a key for that page, and then later found and decompressed. Concurrency must also be considered to ensure that unnecessary serialization bottlenecks do not occur. The three zprojects use very different data structures for different reasons, driven in part by the requirements of the data sources each must support.

Since the number of swap devices in a system is small, and the number of pages in each is predetermined and fixed, a small, unsigned integer representing the swap device combined with a page offset is sufficient to identify a specific page. So zram simply uses a direct table lookup (one table per configured zram swap device) to find and manage its data. For example, for each data page stored, the zram table contains a zsize, since zsize is useful for decompression validation, as well as information to get to the stored data. Concurrency is restricted, per-device, by a reader-writer semaphore.
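
In outline (simplified from the staging-tree zram of that era, not a verbatim copy), the per-device metadata looks like this:

    #include <linux/rwsem.h>

    struct zram_table_entry {
            unsigned long handle;   /* where the allocator put the zpage */
            u16 zsize;              /* compressed length, used to validate decompression */
            u8 flags;               /* e.g. "this slot is a zero page" */
    };

    struct zram_device {
            struct zram_table_entry *table;  /* indexed directly by page offset */
            struct rw_semaphore lock;        /* per-device reader-writer semaphore */
            /* ... allocator pool, statistics, ... */
    };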

Zswap must manage the same swap-driven address space, but it utilizes one dynamic red-black tree (rbtree) per swap device to manage its data. During tree access, a spinlock on the tree must be held; this is not a bottleneck because simultaneous access within any one swap device is greatly limited by other swap subsystem factors. Zswap's metadata layer is intentionally minimalist.
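
Zswap's structures are correspondingly small; roughly (again simplified from the posted patches):

    #include <linux/rbtree.h>
    #include <linux/spinlock.h>

    struct zswap_entry {
            struct rb_node rbnode;
            pgoff_t offset;          /* the key within this swap device */
            unsigned long handle;    /* allocator handle for the zpage */
            unsigned int length;     /* zsize */
    };

    struct zswap_tree {              /* one per swap device */
            struct rb_root rbroot;
            spinlock_t lock;
    };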

Zcache must manage both swap data and pagecache data, and the latter has a much more extensive key space which drives the size and complexity of zcache data structures. For each filesystem, zcache creates a "pool" with a hash table of red-black tree root pointers; each node in the rbtree represents a filesystem inode — the inode space in an exportable filesystem is often very sparsely populated which lends itself to management by an rbtree. Then each rbtree node contains the head of a radix tree — the data structure used throughout the kernel for page offsets — and this radix tree is used to look up a specific page offset. The leaf of the radix tree points to a descriptor which, among other things, references the zpage. Each hash table entry has a spinlock to manage concurrency; unlike swap and frontswap, filesystem accesses via cleancache may be highly concurrent so contention must be aggressively avoided.
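
The zcache lookup path can be pictured, top down, something like this (an invented outline rather than the actual zcache declarations):

    #include <linux/radix-tree.h>
    #include <linux/rbtree.h>
    #include <linux/spinlock.h>

    struct zcache_obj {                   /* one per inode */
            struct rb_node rb_node;       /* keyed by the exportable object id */
            struct radix_tree_root pages; /* page offset -> zpage descriptor */
    };

    struct zcache_hashbucket {
            struct rb_root obj_root;      /* sparse inode space fits an rbtree well */
            spinlock_t lock;              /* per-bucket lock limits contention */
    };

    struct zcache_pool {                  /* one per filesystem */
            struct zcache_hashbucket bucket[256];   /* hash size invented here */
    };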

Impact of writeback on the metadata layer

Earlier it was noted that zcache and zswap support "writeback"; this adds an important twist to the data structures in that, ideally, we'd like to write back pages in some reasonable order, such as least recently used (LRU). To that end, both zcache and zswap maintain a queue to order the stored data; zswap queues each individual zpage and zcache queues the page frames used to store the zpages. While lookup requires a key and searches the data structures from the root downward, writeback chooses one or more zpages from the leaves of the data structure and then must not only remove or decompress the data, but must also remove its metadata from the leaf upward. This requirement has two ramifications: First, the leaf nodes of data structures must retain the key along with information necessary to remove the metadata. Second, if writeback is allowed to execute concurrently with other data structure accesses (lookup/insert/remove), challenging new race conditions arise which must be understood and avoided.

Zswap currently implements writeback only as a side effect of storing data (when kernel page allocation fails) which limits the possible race conditions. Zcache implements writeback as a shrinker routine that can be run completely independently and, thus, must handle a broader set of potential race conditions. Zcache must also handle this same broader set when discarding page cache pages.
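
Schematically, then, the leaf descriptor must carry enough information to walk back up; the structure below is illustrative, with invented names:

    #include <linux/list.h>

    struct zpage_desc {
            struct list_head lru;    /* zswap queues zpages; zcache queues the
                                        page frames that hold them */
            unsigned type;           /* swap device: one half of the key */
            pgoff_t offset;          /* page offset: the other half */
            unsigned long handle;    /* where the compressed data lives */
    };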

Other metadata layer topics

One clever data management technique is worth mentioning here: Zram deduplicates "zero pages" — pages that contain only zero bytes — requiring no additional storage. In some zsize distributions, such pages may represent a significant portion of the zpages to be stored. While the data of the entire page may need to be scanned to determine if it is a zero page, this overhead may be small relative to that of compression and, in any case, it pre-populates the cache with bytes that the compression algorithm would need anyway. Zcache and zswap should add this feature. More sophisticated deduplication might also be useful. Patches are welcome.
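
The check itself is short; a sketch along the lines of what zram does, scanning the page one word at a time:

    #include <linux/highmem.h>

    static bool page_is_zero_filled(struct page *page)
    {
            unsigned long *p = kmap_atomic(page);
            size_t i, words = PAGE_SIZE / sizeof(*p);
            bool zero = true;

            for (i = 0; i < words; i++) {
                    if (p[i]) {
                            zero = false;
                            break;
                    }
            }
            kunmap_atomic(p);
            return zero;
    }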

Zpage allocator layer

Memory allocation for zpages can present some unique challenges. First, we wish to maximize the number of zpages stored in a given number of kernel page frames, a ratio which we will call "density". Maximizing the density can be challenging as we do not know a priori the number of zpages to be stored, the distribution of zsize of those zpages, or the access patterns which insert and remove those zpages. As a result, fragmentation can be a concern. Second, allocating memory to store zpages may often occur in an environment of high-to-severe memory pressure; an effective allocator must handle frequent failure and should minimize large (i.e. multiple-contiguous-page) allocations. Third, if possible, we wish the allocator to support some mechanism to enable ordered writeback and/or eviction as described above.

One obvious alternative is to use an allocator already present and widely used in the kernel: kmalloc(), which is optimized for allocations that nicely fit into default sizes of 2^N-byte chunks. For allocations of size 2^N + x, however, as much as 50% of the allocated memory is wasted. kmalloc() provides an option for "odd" sizes but this requires pre-allocated caches, often of contiguous sets of pages ("higher order" pages) which are difficult to obtain under memory pressure. Early studies showed that kmalloc() allocation failures were unacceptably frequent and density was insufficient, so kmalloc() was discarded and other options were pursued, all based on alloc_page(), which always allocates a single page frame.

To improve density, the xvmalloc allocator was written, based on the Two-Level Segregated Fit (TLSF) algorithm. Xvmalloc provided high density for some workloads and was the initial default allocator for zram and for the frontswap-driven portion of the initial zcache. It was found, however, that the high density led to poor fragmentation characteristics, especially for zsize distributions that skewed fat. Xvmalloc has since been discarded (though remnants of it survive through Linux 3.8).

A new allocator, zbud, was written for the cleancache-driven portion of the original zcache code. The initial version of zbud had less focus on high density and more on the ability to efficiently evict cleancache-provided zpages. With zbud, no more than two zpages are contained in a page frame, buddying up a pair of zpages, one fat and one thin. Page frames, rather than zpages, are entered into an LRU queue so that entire page frames can be easily "freed" to the system. On workloads where zsize skews thin, a significant amount of space is wasted, but fragmentation is limited, and eviction is still easy.
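
Zbud's per-page-frame bookkeeping can be sketched as follows (a deliberate simplification; the real zbud accounts for space in fixed-size chunks rather than raw byte counts):

    #include <linux/list.h>

    struct zbud_frame {
            struct list_head lru;       /* whole frames are LRU-queued and evicted */
            unsigned int first_zsize;   /* buddy filling from the front, 0 if empty */
            unsigned int last_zsize;    /* buddy filling from the back, 0 if empty */
    };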

Zsmalloc, an allocator intended to "rule them all", was then written. Zsmalloc takes the best of kmalloc() but adds the ability to provide the complete range of "size classes" suitable for storing zpages; zsmalloc also never requires higher-order allocations. All zpages with zsize within a "class" (e.g. between 33-48 bytes) are grouped together and stored in the same "zspage". A clever feature (especially useful for fat zpages) allows discontiguous page frames to be "stitched" together, so that the last bytes in the first page frame contain the first part of the zpage and the first bytes of the second page frame contain the second part. A group of N of these discontiguous pages is allocated on demand for each size class, with N chosen by the zsmalloc code to optimize density.
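
From the caller's side, zsmalloc hides all of this behind opaque handles; a rough usage sketch follows (the function names come from the staging-tree zsmalloc, but treat the exact signatures as approximate):

    #include <linux/errno.h>
    #include <linux/string.h>
    /* zsmalloc.h lived under drivers/staging/zsmalloc at the time */

    /*
     * Store an already-compressed buffer.  zs_malloc() returns an opaque
     * handle rather than a pointer because the object may be stitched
     * across two discontiguous page frames; it must be mapped before use.
     */
    static int store_zpage(struct zs_pool *pool, const u8 *zbuf, size_t zsize,
                           unsigned long *out_handle)
    {
            unsigned long handle = zs_malloc(pool, zsize);
            void *dst;

            if (!handle)
                    return -ENOMEM;
            dst = zs_map_object(pool, handle, ZS_MM_WO);  /* write-only mapping */
            memcpy(dst, zbuf, zsize);
            zs_unmap_object(pool, handle);
            *out_handle = handle;
            return 0;
    }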

As a result of these innovations, zsmalloc achieves great density across a wide range of zsize distributions: balanced, fat, and thin. Alas, zsmalloc's strengths also prove to be its Achilles' heel: high density and page crossings make it very difficult for eviction and writeback to free page frames concurrently, resulting in a different kind of fragmentation (and consequently lower density) and rendering it unsuitable for managing cleancache-provided pages in zcache. So zbud was revised to utilize some of the techniques pioneered by zsmalloc. Zbud is still limited to two zpages per page frame but, when under heavy eviction or writeback, its density compares favorably with zsmalloc's. The decision by zcache to use zbud for swap pages sadly led to a fork of zcache, first into "old zcache" (aka zcache1) and "new zcache" (aka zcache2), and then to a separate zproject entirely, zswap, because zswap's author prefers the density of zsmalloc over the ability to support cleancache pages and the predictability and page-frame reclaim of zbud.

So, today, there are two zpage allocators, both still evolving. The choice for best zpage allocator depends on use models and unpredictable workloads. It may be possible that some hybrid of the two will emerge, or perhaps some yet unwritten allocator may arise to "rule them all".

Memory management subsystem interaction

In some ways, a zproject is like a cache for pages temporarily removed from the MM subsystem. However, it is a rather unusual cache in that, to store its data, it steals capacity (page frames) from its backing store, the MM subsystem. From the point of view of the MM subsystem, the RAM is simply gone, just as if the page frames had been absorbed by a RAM-voracious device driver. This is a bit unfortunate, given that the zproject and the MM subsystem are respectively, but separately, managing compressed and uncompressed anonymous pages (and possibly compressed and uncompressed page cache pages as well). So it is educational to consider how a zproject may "load balance" its memory needs with the MM subsystem, and with the rest of the kernel.

Currently, all three zprojects obtain page frames using the standard kernel alloc_page() call, with the "GFP flags" parameter chosen to ensure that the kernel's emergency reserve pages are never used. So each zproject must be (and is) resilient to the fairly frequent situation where an alloc_page() call returns failure.
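
Concretely, the allocation looks something like the following; the exact flag combinations differ among the three zprojects, so treat this one as plausible rather than quoted:

    #include <linux/gfp.h>

    /*
     * __GFP_NOMEMALLOC keeps us out of the emergency reserves;
     * __GFP_NORETRY and __GFP_NOWARN make failure cheap and quiet.
     * Callers must cope with NULL, which is common under memory pressure.
     */
    static struct page *zproject_alloc_frame(void)
    {
            return alloc_page(__GFP_NORETRY | __GFP_NOWARN |
                              __GFP_NOMEMALLOC | __GFP_HIGHMEM);
    }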

Zram makes no other attempt to limit its total use of page frames but, as we've seen, it is configured by the administrator as a swap device, and swap parameters do limit the number of swap pages, and thus zpages, that can be accepted. This indirectly but unpredictably limits the number of page frames used. Also, the configuration is made with no awareness of the total RAM installed in the system, so configuration errors are quite possible.

Zswap uses a slightly modified version of zsmalloc to track the total number of page frames allocated. It ensures this number does not exceed a certain percentage of total RAM in the system; the limit is configurable via sysfs. As previously noted, using writeback, zswap can reduce the number of (anonymous) zpages stored and this, indirectly, can reduce the number of page frames used, though at the likely cost of fragmentation.
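
The limit check is simple; schematically (the real zswap counts pages inside its modified zsmalloc and exposes the percentage as a tunable, details elided here):

    #include <linux/mm.h>   /* totalram_pages */

    static unsigned int zswap_max_pool_percent = 20;   /* sysfs-tunable */
    static unsigned long zswap_pool_pages;             /* page frames currently held */

    static bool zswap_is_full(void)
    {
            return zswap_pool_pages >
                   totalram_pages * zswap_max_pool_percent / 100;
    }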

Zcache and zbud were designed from the beginning with much more dynamic constraints on page frame utilization in mind, intended eventually to flexibly mesh with the dramatically time-varying memory balancing algorithms that the MM subsystem uses. While the exact interaction model and best future control policy are not yet determined, zcache's and zbud's underlying page frame-based writeback and eviction mechanisms support a shrinker-like interface that can be told to, for example, reduce the number of page frames used for storing anonymous zpages from 10000 to 8937 and the number of page frames used for storing page cache zpages from 3300 to 463, and zcache can do that. Since zbud is much more predictable (two zpages per page frame), the MM subsystem can estimate and manage both the total number of anonymous and pagecache pages (compressed and uncompressed) in the system and the total number of page frames used to manage them.

Status and conclusions

Zram, zcache, and zswap are all advancing the concept of in-kernel compression in different ways. Zram and zcache are in-tree but both still in the staging tree. While the staging tree has served to expose the concept of in-kernel compression to a wide group of kernel developers and even to the users of a few leading-edge distributions, as Andrew Morton says, the staging tree "is where code goes to be ignored." Staging tree maintainer Greg Kroah-Hartman has become reluctant to accept any new zproject functionality into the staging tree. This creates a conundrum for these zprojects: evolution in the staging tree has resulted in rapid improvements in the design and implementation and exposure of the zprojects, but because of this evolution they are not stable enough for promotion into the core kernel and further evolution is now stymied.

Zswap is attempting to circumvent the staging tree entirely by proposing what is essentially a simplified, frontswap-only fork of zcache using zsmalloc instead of zbud, for direct merging into the MM subsystem. Since it is also much simpler than zcache, it has garnered more reviewers, which is a valuable advantage for to-be-merged code. But it is entirely dependent on the still-in-staging zsmalloc, does not support page cache pages at all, and does not support page-frame-based writeback, so its simplicity comes at a cost. If zswap is merged, it remains to be seen whether it will ever be extended adequately.

So, in-kernel compression has clear advantages and real users. It remains unclear though when (or if) it will be merged into the core kernel and how it should interact with the core kernel. If you are intrigued, the various zproject developers encourage your ideas and contributions.

Acknowledgments

Nitin Gupta was the original author of compcache and I was the original author of transcendent memory. While both projects were initially targeted to improve memory utilization in a multi-tenant virtualized environment, their application to compression in a single kernel quickly became apparent. Nitin designed and wrote the Linux code for zram, xvmalloc, and zsmalloc, and wrote an early prototype implementation of zcache. I wrote frontswap and cleancache (with the cleancache hook placement originally done by Chris Mason), and the Linux code for zcache, zbud, and ramster. Seth Jennings contributed a number of improvements to zcache and is the author of zswap. Kernel mm developer Minchan Kim was a helpful influence for all the zprojects, and none of the zprojects would have been possible without Greg Kroah-Hartman's support — and occasional whippings — as maintainer of the staging drivers tree.

As with any open-source project, many others have contributed ideas, bug fixes, and code improvements and we are thankful for everyone's efforts.



LZO

Posted Apr 4, 2013 8:45 UTC (Thu) by SLi (subscriber, #53131) [Link]

LZO indeed delivers a good balance for this and many other use cases. It generally bothers me that people tend to treat compression algorithms in a one-size-fits-all fashion. Especially LZO tends to be underrated, but the important thing is choosing the right algorithm for the job.

LZO excels in one thing, and one thing only: It is blazingly fast. Specifically, it is fast enough that in most applications it is not the bottleneck. LZO-compressing 1 GiB of zeros (or most other data) takes less than a second on a modern computer, while gzip takes closer to 10 seconds. The result, when compressing zeros, is about five times the size of the gzip output; generally expect a much worse compression ratio for LZO compared to any slower algorithm.

This has an interesting consequence: In very many cases it pays off to LZO-compress your data, even temporarily. With current disk and CPU speeds, it usually pays off to LZO-compress data before writing it to disk and uncompress it after reading: The time of reading the compressed data from the disk and uncompressing it is in most cases less than the time to read the uncompressed data from the disk. Another thing LZO (and the gzip-like filter program 'lzop') is great for is moving large pieces of data over the network. Even with modern CPUs and gzip you still are close to the point where gzip may end up being a bottleneck on a 100 Mbit/s network. Doing LZO compression on the sending end and decompression on the receiving end, on the other hand, is usually a pure win, unless your data is entirely uncompressible, in which case it hardly makes a difference.

LZO

Posted Apr 4, 2013 14:08 UTC (Thu) by djm1021 (guest, #31130) [Link]

Thanks for the interesting addition to the article! As some have pointed out though, LZO doesn't seem to do well with text (~0.6 compression ratio as opposed to ~0.4 for gzip). While the ~10x speed may still be worth it, it would be *really* nice if there were a way to do a quick analysis on the first N bytes of an uncompressed page and, from that analysis, adaptively select one of the several kernel compression algorithms to use.

Separately, regarding LZO: Markus Oberhumer (the "O" in "LZO") has provided an updated, better LZO for the Linux kernel, but so far no maintainer has volunteered to shepherd it through to merging (including the syntactic changes needed to get it past checkpatch). If someone would step up and help with that (and serve as its maintainer), it would be very helpful to the long-term success of the zprojects. Volunteers?

LZO

Posted Apr 4, 2013 14:53 UTC (Thu) by SLi (subscriber, #53131) [Link]

On the contrary, from the numbers you quote, it seems to me that LZO is doing great with text! I would have expected an even larger discrepancy between the compression ratios, something like 0.7 to 0.4, and my gut feeling is that 0.4 to 0.6 is as good as it gets for nearly any compressible data in a competition between LZO and gzip.

LZO is just not optimized for compression ratio so much as for speed. That's why it's good we have both algorithms. One is good for cases where speed matters more, and the other is better for cases where compression ratio matters more.

It's simply a question of what bounds your performance. The more you are constrained by I/O speed, whether disk or network or something else, the better choice a CPU-heavy algorithm is. But even with modern CPUs, gzip is going to tax your CPU at disk or network speeds, while lzop <file1 >file2 is generally faster than cp file1 file2 for most storage media if the file in question is compressible.

Here are some quick numbers from a few rounds of compressing a kernel tarball (50.6 MB uncompressed):

lzop: 1.62 seconds to compress, 18.0 MB, ratio=.355
gzip: 19.13 seconds, 10.8 MB, ratio=.213
bzip2: 60.81 seconds, 8.46 MB, ratio=.176
xz: 311.65 seconds, 7.33 MB, ratio=.145
...

It's always possible to compress better by using more time, and you cannot just claim that one of the algorithms here is "better" than any of the others without reference to a specific scenario. Some algorithms are also designed to have significantly faster decompression than compression.

LZO

Posted Apr 4, 2013 15:58 UTC (Thu) by djm1021 (guest, #31130) [Link]

Right... I was specifically referring to the impact in the context of the zprojects. There's a big difference between a page compressing to a zsize of (PAGE_SIZE/2)-1 with algorithm A, which is slow, and a zsize of (PAGE_SIZE/2)+1 with algorithm B, which is fast:
In the former case, two zpages can be stored in one pageframe; in the latter, only three zpages can be stored in TWO pageframes.
So if one could know these respective zsizes without running BOTH algorithms, running A might be worth the extra cycles.
Exact knowledge of zsizes is highly unlikely, but there might be some useful patterns that could drive adaptive algorithm selection. OTOH, pattern analysis might take more cycles than the difference between A and B!

Related, compression algorithms may have very different strengths and weaknesses when the uncompressed size is limited to a smaller value (such as PAGE_SIZE) as opposed to a larger value (such as the 50MB in a kernel tarball) so be careful drawing conclusions from the latter.

LZO

Posted Apr 5, 2013 9:05 UTC (Fri) by epa (subscriber, #39769) [Link]

http://code.google.com/p/snappy/ looks like an interesting alternative to LZO.

LZO

Posted Apr 5, 2013 15:38 UTC (Fri) by djm1021 (guest, #31130) [Link]

See Andi Kleen's comment at https://lkml.org/lkml/2012/1/24/498 and the long thread before it.

If snappy can be shown to be consistently better than the existing LZO (on PAGE_SIZE-sized streams, specifically anonymous pages and file pages) AND if it is merged into the kernel, the zprojects should be able to switch to it very easily. Actually, snappy should also be compared to the not-yet-merged (see other comment about not having a LZO maintainer) newer version of LZO, see https://lkml.org/lkml/2012/1/18/232

LZO

Posted Apr 9, 2013 22:29 UTC (Tue) by zt (guest, #90306) [Link]

LZO 2.05 uses the same tricks as Google Snappy to achieve roughly the same performance. A C-only port of Snappy was submitted to the kernel a long time ago, before LZO was sped up, but didn't get much traction. With LZO 2.05 getting into the kernel (eventually), I don't see a reason to bother.

http://driverdev.linuxdriverproject.org/pipermail/devel/2...

LZO

Posted Apr 9, 2013 9:24 UTC (Tue) by etienne (subscriber, #25256) [Link]

LZO is very good for big files, but does it fit the problem space? We are talking about compressing 4KB pages.
Compressing basically means building a data dictionary and then building an index of references into that dictionary - so any repeated sequence is reduced because the second and third times you just insert a reference to the first sequence.
If you have a big file and you compress each of its 4KB blocks independently, then you basically rewrite the same dictionary time and time again.
Now if you re-write the wheel^Walgorithm to compress a 4KB buffer, you would use references (i.e. windows) that max out at 4KB.
Because there is a way out if the compression is not too effective, and because speed is paramount, maybe it is better to find repeated sequences starting at a 32-bit boundary instead of 8 bits for a standard algorithm or 1 bit for a slow-as-hell algorithm.
IMHO LZO would be very good for large pages (2MB), or if someone finds a way to keep an out-of-band dictionary... but any general compression algorithm will not be so good for 4KB.

LZO

Posted Apr 22, 2013 16:24 UTC (Mon) by turrini (subscriber, #90157) [Link]

Why not just LZ4? It's extremely fast.

http://code.google.com/p/lz4/

In-kernel memory compression

Posted Apr 4, 2013 17:06 UTC (Thu) by ssam (subscriber, #46587) [Link]

It seems to me that there are two ways that compression could give speed benefits. Here, by compressing pages in RAM we can fit more into it, and so avoid costly fallbacks to using a slow disk. This will give a huge speedup on setups where you hit swap often, or where you benefit from having files in the RAM cache.

But compression can also (in some cases) give you speedups by increasing bandwidth. For example, if my hard disk can manage 100MB/s and I enable the compression option in my filesystem, then I might be able to get 200MB/s of data out of it (assuming an average 0.5 compression ratio). This works when the interface is slower than (de)compression.

So I wonder if there is any opportunity for compression to help get around the problem of limited memory bandwidth. If memory bandwidths are on the order of tens of GB/s, shared between multiple CPUs, can the CPU decompress fast enough for this to be of any use? Or is latency to RAM more of an issue than bandwidth?

In-kernel memory compression

Posted Apr 4, 2013 19:50 UTC (Thu) by alankila (subscriber, #47141) [Link]

It seems to me that tricks with compression will always lose when dealing with limited memory bandwidth, because the compressed form generally enters the main memory somehow first, and must be read, and then written in uncompressed form again. Even if the compressed data didn't have to be read from the main memory (somehow), then you'd still end up writing the uncompressed form once, so the memory bandwidth utilization would be unchanged with or without compression.

In-kernel memory compression

Posted Apr 4, 2013 22:18 UTC (Thu) by ssam (subscriber, #46587) [Link]

Maybe the uncompressed data would be written to the CPU cache, compressed there, and then pushed out to RAM. This may be impossible with current cache architectures.

In-kernel memory compression

Posted Apr 5, 2013 2:45 UTC (Fri) by djm1021 (guest, #31130) [Link]

Probably not quite what you had in mind, but it's been suggested that for microblade clusters (like Facebook's Group Hug), moving compressed pages between systems is kind of the next step beyond NUMA... obviously NOT cache-coherent though.
Ramster (see link in the article) is a layer built on top of zcache which is a first step in this direction... it is in the staging tree in 3.9 as a subdirectory of zcache and works today over kernel sockets.

In-kernel memory compression

Posted Apr 6, 2013 22:10 UTC (Sat) by jospoortvliet (subscriber, #33164) [Link]

A question triggered by the thread parent: on a system with a compressed filesystem (think Btrfs with LZO), does the cleancache side of zcache make much sense? Is data from the filesystem cached in compressed or uncompressed form? Thinking about it, the answer is probably 'uncompressed'...

In-kernel memory compression

Posted Apr 6, 2013 23:28 UTC (Sat) by djm1021 (guest, #31130) [Link]

Uncompressed. Pages in the page cache are usually mapped into processes so must be uncompressed. When kswapd decides to evict (reclaim) a page from the page cache, the page either goes the cleancache->zcache route or filesystem->blockio route. For the former, the page is compressed and kept in RAM, for the latter it is compressed by btrfs before being sent to blockio and to the disk. (Note I am somewhat ignorant of btrfs internals so btrfs experts can correct me if I got the btrfs part wrong.)

Fast zero page deduplication?

Posted Apr 5, 2013 14:51 UTC (Fri) by cesarb (subscriber, #6266) [Link]

Instead of comparing the whole page to zero and then compressing it, why not do the opposite?

If the compression is deterministic, all zero pages will compress to the exact same byte string. So first do the compression, then check the size of the result. If it is the same as the size of a compressed zero page, compare its compressed form with the compressed form of a zero page. Since it is a small byte string (around 28 bytes as mentioned in the article), the comparison will be fast.

I do not know which way would be faster. If the comparison to a zero page bails out fast (as soon as the first non-zero byte is found), most non-zero pages have the non-zero bytes early, and zero pages are common, it can be much faster to compare first and compress; if the non-zero bytes are at the end, or zero pages are rare, it might be faster to compress first and compare the compressed result. Only a benchmark could tell.

Fast zero page deduplication?

Posted Apr 5, 2013 15:49 UTC (Fri) by djm1021 (guest, #31130) [Link]

It's been a while since I measured LZO compression and it was on a Core 2 Duo. I should probably re-measure on a newer machine, but the numbers I vaguely remember are that, on average, compression was about 20K cycles and decompression about 8K(?) (x86_64, so PAGE_SIZE=4K).

However, if you think about it, to compress a page, any algorithm must certainly examine every byte and then do some other action which must include at least some arithmetic and/or a table lookup. So I can't imagine a case where compression would be faster than comparing every byte (or word or doubleword) against a register containing zero.

Fast zero page deduplication?

Posted Apr 5, 2013 18:21 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

True, but given that the CPU is so much faster than the RAM, it probably makes sense to do this zero checking as part of the compression routine. Something along the lines of

set a flag allzero=true

then in your loop

if data
then allzero=false

this should be a very fast (and small) check on modern CPUs.

Fast zero page deduplication?

Posted Apr 5, 2013 23:12 UTC (Fri) by djm1021 (guest, #31130) [Link]

I agree it would be more convenient (and possibly faster) if the check for zero-filled could be done in the compression algorithm. I'm not signing up for that though... the LZO algorithm/code is incredibly dense!

But even though the CPU is much faster than the memory bus, much of the cost for scanning a page (or even copying it) is due to the cache misses incurred getting all the data bytes in the page in from RAM to cache. So scanning a page for zero-filled also serves as an effective pre-fetch. Once the page has been scanned, and assuming the data cache is not tiny, the compression algorithm will be reading pre-cached data.

Lots of good discussion, would be good to measure the different options.

Fast zero page deduplication?

Posted Apr 5, 2013 23:20 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

The CPU caches are rather small, especially if there are other active processes on the box that you have to share it with.

I don't think that considering it a pre-fetch is really a good way of looking at it.

Fast zero page deduplication?

Posted Apr 6, 2013 23:22 UTC (Sat) by djm1021 (guest, #31130) [Link]

Hmmm... on what recent architectures is the data cache small relative to PAGE_SIZE? (Maybe on ARM or embedded archs? I'm not very familiar with those.)

I'm not sure I understand why you wouldn't think the check for zero-filled is a good pre-fetch? The code looks basically like:

if (!is_zero_filled(page))
compress(page);

which is high in both address locality and temporal locality for all the bytes in the page. Seems like a perfect match for what data caches are designed for.

Fast zero page deduplication?

Posted Apr 7, 2013 0:20 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

the problem is that !is_zero_filled(page) is really not that simple

when a page is 4K bytes this boils down to something more like

while !endofpage
for 1..32
fetch 128 bytes
wait for the fetch to complete
check if they are zero

and during the time that this is running, other programs are also running, fetching data from other places in RAM that may evict some of these 128-byte chunks.

if you have a multi-core chip, processes running on these other cores are fetching memory that can evict chunks

If you have a hyperthreading chip, while you are stalled waiting for the fetch to complete, other programs may run on the same core, generating more fetches that can evict chunks of the page

And then the OS can decide that you have used your share of time and let several other programs run for a while before you get the CPU back, and they all can generate memory fetches that could evict this from the cache.

so, you may luck out and find that the data is still in the cache when you come back to re-use it for the compression, but it's not a reliable thing to depend on.

But if you instead do

while !done
fetch 128 bytes
check if they are zero
compress them

the odds of having to fetch data multiple times are much lower (not zero, but much lower)

Fast zero page deduplication?

Posted Apr 7, 2013 1:47 UTC (Sun) by djm1021 (guest, #31130) [Link]

Heh... good discussion, but I still have to disagree with you.

Any data cache that can't hold onto the first cache line of a sequentially-read page until after the last cache line of a page has been read is either very very small or was poorly designed for its CPU. Data caches for multi-core hyperthreaded CPUs should be either large or N-way associative or both.

And the is_zero_filled() and compress() calls are in the kernel, so kernel pre-emption must occur to cause cache evictions from the same hyperthread. Possible but presumably rare.

And in your algorithm, if is_zero_filled() proves true, you've spent CPU cycles compressing almost the entire page which is totally unnecessary.

So IMHO for some mostly older or tiny CPUs you may be right. And for some workloads you may be right. If you want to provide a patch to LZO to add is_zero_filled() checking and you can demonstrate non-synthetic workloads where your patch is faster, I will owe you a beer. Else I think the algorithm in zram and the zcache patch recently posted by Wanpeng Li should work just fine.

P.S. If your understanding of the LZO code in the kernel is good enough to provide the patch, please also update LZO to the latest version from Markus Oberhumer and volunteer to serve as kernel maintainer for lzo.c!

Fast zero page deduplication?

Posted Apr 7, 2013 2:05 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

caches are getting larger, but last I looked, the speed ratio between the CPU and memory was continuing to grow (although not as fast as it was).

you can spend quite a few CPU cycles between the time that you tell the cpu to fetch the next chunk of ram into the cache and the time it's available. I don't know how many, and it will vary from chip to chip (with the increasing impact of ARM/MIPS and other non-X86 chips, I expect that this is getting to be a more interesting concern than it ever has been). The ability to vary the speed of the CPU without varying the speed of the memory makes this even more interesting :)

I don't know how many CPU cycles you spend in the compression. It would be interesting to find out how this compares to the 'free' cycles that are available on current CPUs when walking though memory.

I have never looked at any LZO code, so I'm not in the position to make modifications there.

> Else I think the algorithm in zram and the zcache patch recently posted by Wanpeng Li should work just fine.

I expect they work just fine as well, I'm just suggesting a possible improvement (and then explaining why I'm doing so when it sounds like I'm not getting the point across)

Fast zero page deduplication?

Posted Apr 8, 2013 14:50 UTC (Mon) by intgr (subscriber, #39733) [Link]

I think cache eviction is a red herring, but I agree that this talk about "let's write a fast loop to synchronously prefetch the data, then process this data in a slow loop" is a big fallacy.

The real performance killer is that the zero-page check is very fast CPU-wise, but it has to wait synchronously for the next cacheline to arrive from the slow RAM. Your CPU will be stalled most of the time, waiting for the data to arrive from RAM or a lower-level cache. (And let's not forget that it only makes sense to compress rarely-accessed data -- which is unlikely to already be in cache).

In contrast, any compression algorithm is likely to be slow enough that by the time it needs to access the next cacheline, the hardware prefetcher will already have that data in cache -- or at least reduced the time your CPU is stalled behind memory. Thus RAM (pre)fetching and CPU compression can work in parallel.

And let's not forget that hardware prefetchers are so good these days that even *asynchronous* prefetch instructions can frequently hurt performance: https://lwn.net/Articles/444336/

Fast zero page deduplication?

Posted Apr 10, 2013 12:25 UTC (Wed) by eras (guest, #90319) [Link]

How about combining the check for zero-pages and compression?

First check if the first 8 bytes are zero (or whatever convenient size); if not, goto compression. If they are zero, check the next 8 bytes.

And here is the different step: if they are not zero, look up pre-generated state representing the compressed 8 zeroes. Otherwise keep on doing the check (possibly later looking up the 8*n-byte zero compression data).

This would require 511 lookup table entries, and it would be blazingly fast to compress pages that begin with zeroes - or pages that are fully zero, in which case you jump to the compression routine. You also never throw any work away.

Fast zero page deduplication?

Posted Apr 10, 2013 15:39 UTC (Wed) by etienne (subscriber, #25256) [Link]

> How about combining the check for zero-pages and compression?

Note that under Linux and a filesystem which supports holes, the check for pages full of zero bytes will not often succeed because they are mapped in memory as the special zero-page (and marked copy on write).
But the ultra-simplified compression algorithm looks like: read the first byte, read the second byte and see if it is equal to the first byte, read the third byte and see if it is equal to either the first one or the second one or both, ... so you will quickly catch a page full of the same byte.

Fast zero page deduplication?

Posted Apr 9, 2013 18:43 UTC (Tue) by engla (guest, #47454) [Link]

No, not every algorithm out there examines every byte. If you see a long stretch of incompressible data you'll skip forward in increasing steps, skipping over some parts. If you do this, you'll scan through incompressible files pretty fast, and you have a reasonable go at skipping over incompressible parts and landing in compressible parts where you go back to normal scan.

In-kernel memory compression

Posted Apr 12, 2013 10:27 UTC (Fri) by Duncan (guest, #6647) [Link]

Thanks for answering the simple sysadmin's perspective question "Compressed memory, great, but which zproject should I enable in my kernel config and why would I choose it over the others?"

LWN has previously covered the three in various levels of detail, but this was the first coverage that really had a practical answer to that simple question. =:^)

Duncan

In-kernel memory compression

Posted Apr 17, 2013 7:49 UTC (Wed) by dag- (subscriber, #30207) [Link]

It's probably too early to do performance comparisons (especially when done by one of the authors of one of the projects). However as a sysadmin, that's one of the metrics that would influence my decision ;-)
