
Compcache: in-memory compressed swapping

May 26, 2009

This article was contributed by Nitin Gupta

The idea of memory compression—compress relatively unused pages and store them in memory itself—is simple and has been around for a long time. By eliminating expensive disk I/O, compressing a page is far faster than swapping it to secondary storage. When the page is needed again, it is decompressed and handed back, which is, again, much faster than reading it back from a swap device.

An implementation of this idea on Linux is currently under development as the compcache project. It creates a virtual block device (called ramzswap) which acts as a swap disk. Pages swapped to this disk are compressed and stored in memory itself. The project home contains use cases, performance numbers, and other related bits. The whole aim of the project is not just performance — on swapless setups, it allows running applications that would otherwise simply fail due to lack of memory. For example, Edubuntu included compcache to lower the RAM requirements of its installer.

The performance page on the project wiki shows numbers for configurations that closely match netbooks, thin clients, and embedded devices. These initial results look promising. For example, in the thin-client benchmark, ramzswap gives nearly the same effect as doubling the memory. Another benchmark shows that the average time required to complete swap requests drops drastically with ramzswap. With a swap partition located on a 10,000 RPM disk, the average times for swap read and write requests were found to be 168ms and 355ms, respectively; with ramzswap, the corresponding numbers were a mere 12µs and 7µs — and that includes the time spent checking for zero-filled pages and compressing or decompressing all non-zero pages.

The approach of using a virtual block device is a major simplification over earlier attempts. The previous implementation required changes to the swap write path, page fault handler, and page cache lookup functions (find_get_page() and friends). Those patches did not gain widespread acceptance due to their intrusive nature. The new approach is far less intrusive, but at a cost: compcache has lost the ability to compress page cache (filesystem backed) pages. It can now compress swap cache (anonymous) pages only. At the same time, this simplicity and non-intrusiveness got it included in Ubuntu, ALT Linux, LTSP (Linux Terminal Server Project) and maybe other places as well.

It should be noted that, when used at the hypervisor level, compcache can compress any part of guest memory, for any kind of guest OS (Linux, Windows, etc.) — this should allow running more virtual machines for a given amount of total host memory. For example, in KVM the guest physical memory is simply anonymous memory for the host (the Linux kernel in this case). Also, with the recent MMU notifier support included in the Linux kernel, nearly the entire guest physical memory is now swappable [PDF].

Implementation

All of the individual components are separate kernel modules:

  • LZO compressor: lzo_compress.ko, lzo_decompress.ko (already in mainline)
  • xvMalloc memory allocator: xvmalloc.ko
  • compcache block device driver: ramzswap.ko
Once these modules are loaded, one can just enable the ramzswap swap device:
    swapon /dev/ramzswap0
Note that ramzswap cannot be used as a generic block device. It can only handle page-aligned I/O, which is sufficient for use as a swap device. No use case has yet come to light that would justify the effort to make it a generic compressed read-write block device. Also, to minimize block layer overhead, ramzswap uses the "no queue" mode of operation. Thus, it accepts requests directly from the block layer and avoids all overhead due to request queue logic.
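
To illustrate, here is a minimal sketch of how a driver of that era bypasses the request queue by registering its own make_request function with the block layer; the function names are illustrative, not the actual ramzswap source:

    /* Illustrative sketch: accept bios directly from the block layer,
       with no request queue logic in between. */
    static int ramzswap_make_request(struct request_queue *queue,
                                     struct bio *bio)
    {
        /* Swap I/O is always page-aligned and one page per bio, so
           this handler is far simpler than a generic block driver. */
        if (bio->bi_size != PAGE_SIZE ||
            (bio->bi_sector & ((PAGE_SIZE >> 9) - 1))) {
            bio_io_error(bio);
            return 0;
        }
        /* ... compress on write, decompress on read ... */
        bio_endio(bio, 0);
        return 0;
    }

    static int __init ramzswap_init_sketch(void)
    {
        struct request_queue *queue = blk_alloc_queue(GFP_KERNEL);

        if (!queue)
            return -ENOMEM;
        blk_queue_make_request(queue, ramzswap_make_request);
        /* ... allocate the gendisk, set capacity, add_disk() ... */
        return 0;
    }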

The ramzswap module accepts parameters for "disk" size, memory limit, and backing swap partition. The optional backing swap partition parameter names the physical disk swap partition to which ramzswap will forward read/write requests for pages that compress to larger than PAGE_SIZE/2 — so we keep only highly compressible pages in memory. Additionally, pages that are entirely zero-filled are detected, and no memory at all is allocated for them. For "generic" desktop workloads (Firefox, email client, editor, media player, etc.), we typically see 4000-5000 zero-filled pages.
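
The zero-page check itself amounts to a word-by-word scan of the page before compression is even attempted; a minimal sketch of the idea (not necessarily the actual ramzswap code):

    /* Sketch: return 1 if the page contains only zero words. */
    static int page_is_zero_filled(struct page *page)
    {
        unsigned long *mem = kmap_atomic(page, KM_USER0);
        unsigned int pos;
        int zero = 1;

        /* Bail out at the first non-zero word. */
        for (pos = 0; pos < PAGE_SIZE / sizeof(*mem); pos++) {
            if (mem[pos]) {
                zero = 0;
                break;
            }
        }
        kunmap_atomic(mem, KM_USER0);
        return zero;
    }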

Memory management

One of the biggest challenges in this project is managing variable-sized compressed chunks. For this, ramzswap uses a memory allocator called xvmalloc, developed specifically for this project. It has O(1) malloc/free, very low fragmentation (within 10% of ideal in all tests), and can use highmem (useful on 32-bit systems with more than 1GB of memory). It exports a non-standard allocator interface:

    struct xv_pool *xv_create_pool(void);
    void xv_destroy_pool(struct xv_pool *pool);

    int xv_malloc(struct xv_pool *pool, u32 size, u32 *pagenum, u32 *offset, gfp_t flags);
    void xv_free(struct xv_pool *pool, u32 pagenum, u32 offset);

xv_malloc() returns a <pagenum, offset> pair. It is then up to the caller to map this page (with kmap()) to get a valid kernel-space pointer.
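
Putting it together, storing one compressed page might look roughly like this; a sketch that assumes xv_malloc() returns zero on success (the interface above does not spell that out), with error handling trimmed:

    /* Sketch: stash clen bytes of compressed data in an xvmalloc pool
       and report where it ended up via <pagenum, offset>. */
    static int store_compressed(struct xv_pool *pool, void *cbuf,
                                u32 clen, u32 *pagenum, u32 *offset)
    {
        void *base;

        /* Assumption: xv_malloc() returns 0 on success. */
        if (xv_malloc(pool, clen, pagenum, offset, GFP_NOIO))
            return -ENOMEM;

        /* The caller must map the returned page itself to get a
           usable kernel-space pointer. */
        base = kmap(pfn_to_page(*pagenum));
        memcpy(base + *offset, cbuf, clen);
        kunmap(pfn_to_page(*pagenum));
        return 0;
    }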

The justification for the use of a custom memory allocator was provided when the compcache patches were posted to linux-kernel. Both the SLOB and SLUB allocators were found to be unsuitable for this project. SLOB targets embedded devices and claims good space efficiency, but it turns out to have some major problems: it has O(n) alloc/free behavior and can lead to large amounts of wasted space, as detailed in this LKML post.

On the other hand, SLUB has a different set of problems. The first is the usual fragmentation issue: the data presented here shows that kmalloc uses ~43% more memory than xvmalloc. Another problem is that SLUB depends on allocating higher-order pages to reduce fragmentation. This is not acceptable for ramzswap, which is used in tight-memory situations where higher-order allocations are almost guaranteed to fail. The xvmalloc allocator, on the other hand, always allocates zero-order pages when it needs to expand a memory pool.

Also, both SLUB and SLOB are limited to allocating from low memory. This limitation applies only to 32-bit systems with more than 1GB of memory; on such systems, neither allocator can allocate from the high memory zone. This restriction is not acceptable for the compcache project: users with such configurations reported memory allocation failures from ramzswap (before xvmalloc was developed) even when plenty of high memory was available. The xvmalloc allocator, on the other hand, is able to allocate from the high memory region.

Considering the above points, xvmalloc could potentially replace the SLOB allocator. However, this would involve a lot of additional work, as xvmalloc provides a non-standard malloc/free interface. Also, xvmalloc is not scalable in its current state (neither is SLOB) and hence cannot be considered as a replacement for SLUB.

The memory needed for compressed pages is not pre-allocated; it grows and shrinks on demand. On initialization, ramzswap creates an xvmalloc memory pool. When the pool does not have enough memory to satisfy an allocation request, it grows by allocating single (0-order) pages from the kernel page allocator. When an object is freed, xvmalloc merges it with adjacent free blocks in the same page. If the resulting free block size equals PAGE_SIZE (i.e. the page no longer contains any objects), the page is released back to the kernel.
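
As a hedged sketch of that free path, with a hypothetical chunk structure standing in for the real xvmalloc internals:

    /* Hypothetical sketch of the free path described above; the chunk
       layout and helpers are illustrative, not xvmalloc's own. */
    struct chunk {
        struct chunk *prev, *next;   /* physically adjacent chunks */
        u32 size;
        int free;
    };

    static struct chunk *coalesce(struct chunk *a, struct chunk *b)
    {
        /* Absorb b into a; both are adjacent within the same page. */
        a->size += b->size;
        a->next = b->next;
        if (b->next)
            b->next->prev = a;
        return a;
    }

    static void chunk_free(struct page *page, struct chunk *c)
    {
        c->free = 1;
        if (c->prev && c->prev->free)
            c = coalesce(c->prev, c);
        if (c->next && c->next->free)
            c = coalesce(c, c->next);

        if (c->size == PAGE_SIZE) {
            /* The page holds no objects any more: give it back. */
            __free_page(page);
            return;
        }
        /* ... otherwise reinsert c into the pool's freelists ... */
    }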

This allocation and freeing of objects can lead to fragmentation of the ramzswap memory. Consider the case where a lot of objects are freed in a short period of time and, subsequently, there are very few swap write requests: the xvmalloc pool can end up with a lot of partially filled pages, each containing only a small number of live objects. Handling this case would require some sort of xvmalloc memory defragmentation scheme, which could be done by relocating objects from almost-empty pages to other pages in the pool. It should be noted, though, that in practice, after months of use on several desktop machines, waste due to xvmalloc memory fragmentation never exceeded 7%.

Swap limitations and tools

Being a block device, ramzswap can never know when a compressed page is no longer required — say, when the owning process has exited. Such stale (compressed) pages simply waste memory. With the recent "swap discard" support, though, this is no longer as much of a problem: swap discard sends a BIO_RW_DISCARD bio request when it finds a free swap cluster during swap allocation. Although compcache does not get the callback immediately after a page becomes stale, this is still better than keeping those pages in memory until they are overwritten by another page. Support for the swap discard mechanism was added in compcache-0.5.
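
Inside the bio handler, a discard can be recognized and the stale object freed on the spot; a sketch assuming the BIO_RW_DISCARD flag of that era, where free_stale_page() is a hypothetical helper that would xv_free() whatever is stored for the slot:

    /* Sketch: free a stale compressed object as soon as the discard
       request arrives, instead of waiting for the slot to be
       overwritten. free_stale_page() is hypothetical. */
    static int ramzswap_handle_bio(struct request_queue *queue,
                                   struct bio *bio)
    {
        u32 index = bio->bi_sector >> (PAGE_SHIFT - 9);

        if (bio->bi_rw & (1 << BIO_RW_DISCARD)) {
            free_stale_page(index);
            bio_endio(bio, 0);
            return 0;
        }
        /* ... normal read/write handling ... */
        return 0;
    }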

In general, the discard request comes a long time after a page has become stale. Consider a case where a memory-intensive workload terminates and there is no further swapping activity: ramzswap will end up holding lots of stale pages, and no discard requests will arrive, since no further swap allocations are being done. Once swapping activity starts again, discard requests can be expected for some of these stale pages. So, to make ramzswap more effective, changes are required in the kernel (not yet done) to scan the swap bitmap more aggressively for freed swap clusters — at least in the case of RAM-backed swap devices. An adaptive compressed cache resizing policy would also be useful: monitor accesses to the compressed cache and move relatively unused pages to a physical swap device. Currently, ramzswap can simply forward incompressible pages to a backing swap disk, but it cannot swap out memory allocated by xvmalloc.

Another interesting sub-project is the SwapReplay infrastructure, a tool meant to make it easy to test memory allocator behavior under real swapping conditions. It consists of a kernel module and a set of userspace tools to replay swap events in userspace. The kernel module stacks a pseudo block device (/dev/sr_relay) over a physical swap device. When the kernel swaps over this pseudo device, the module dumps a <sector number, R/W bit, compress length> tuple to userspace, then forwards the I/O request to the backing swap device (provided as a swap_replay module parameter). This data can then be parsed using a parser library, which provides a callback interface for swap events. Clients of this library can take any action on these events — show compressed-length histograms, simulate ramzswap behavior, etc. No kernel patching is required for this functionality.
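
A client of the parser library might look something like this userspace sketch; the replay_parse() entry point and the callback signature are assumptions for illustration, not the library's actual interface:

    /* Userspace sketch: histogram of compressed lengths from a swap
       trace. replay_parse() stands in for the parser library entry
       point, which invokes the callback once per swap event. */
    #include <stdio.h>
    #include <stdint.h>

    extern void replay_parse(const char *trace_file,
                             void (*cb)(uint64_t sector, int is_write,
                                        uint32_t clen));

    #define NBUCKETS (4096 / 256 + 1)
    static unsigned long histogram[NBUCKETS];

    static void on_swap_event(uint64_t sector, int is_write, uint32_t clen)
    {
        if (is_write)                 /* bucket by 256-byte steps */
            histogram[clen / 256]++;
    }

    int main(int argc, char *argv[])
    {
        unsigned int i;

        if (argc < 2)
            return 1;
        replay_parse(argv[1], on_swap_event);
        for (i = 0; i < NBUCKETS; i++)
            printf("%4u-%4u bytes: %lu\n", i * 256, i * 256 + 255,
                   histogram[i]);
        return 0;
    }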

The swap replay infrastructure has been very useful throughout ramzswap development. The ability to replay swap traces allows easy, consistent simulation of any workload without the need to set it up and run it again and again. So, if a user is suffering from high memory fragmentation under some workload, he can simply send me a swap trace for it, and I then have all the data needed to reproduce the condition on my side — without having to set up the same workload.

Clients for the parser library were written to simulate ramzswap behavior over traces from a variety of workloads, allowing easier evaluation of different memory allocators and, ultimately, the development and tuning of the xvmalloc allocator. In the future, it will also help in testing a variety of eviction policies to support adaptive compressed cache resizing.

Conclusion

The compcache project is currently under active development; some of the additional features planned are: adaptive compressed cache resizing, swapping of xvmalloc memory to a physical swap disk, memory defragmentation by relocating compressed chunks within memory, and compressed swapping to disk (4-5 pages swapped out with a single disk I/O). Later, it might be extended to compress page-cache pages too (as the earlier patches did) — for now, it includes just the ramzswap component to handle anonymous memory compression.

The last time the ramzswap patches were submitted for review, only LTSP performance data was provided as justification for the feature, and Andrew Morton was not satisfied with that data. There is now much more data uploaded to the performance page on the project wiki showing performance improvements with ramzswap. Andrew also pointed out the lack of data for cases where ramzswap can cause a performance loss:

We would also be interested in seeing the performance _loss_ from these patches. There must be some cost somewhere. Find a worstish-case test case and run it and include its results in the changelog too, so we better understand the tradeoffs involved here.

The project still lacks data for such cases. However, it should be available by the 2.6.32 time frame, when these patches will be posted again for possible inclusion in mainline.


Compcache: in-memory compressed swapping

Posted May 26, 2009 20:15 UTC (Tue) by JoeF (guest, #4486) [Link] (16 responses)

I have noticed compcache on my Ubuntu-based EasyPeasy installation on my EeePC, which loads it by default.
I had some issues with it wrt hibernation, though. When I tried to hibernate, the system would complain about not enough swap space being available, even though the disk-based swap is big enough. I "fixed" that for now with swapoff /dev/ramzswap.

Compcache: in-memory compressed swapping

Posted May 26, 2009 20:17 UTC (Tue) by BrucePerens (guest, #2510) [Link] (15 responses)

Oops. You need disk-based backing store to hibernate. Not yet a feature?

Compcache: in-memory compressed swapping

Posted May 26, 2009 20:21 UTC (Tue) by BrucePerens (guest, #2510) [Link] (1 responses)

I see it is a feature. Maybe it's not configured correctly? Also, the code would have to be hibernation-aware, because it has to push its entire RAM out to backing store before hibernating.

Compcache: in-memory compressed swapping

Posted May 26, 2009 20:29 UTC (Tue) by JoeF (guest, #4486) [Link]

"Maybe it's not configured correctly?"

Could be. I haven't checked with the EasyPeasy people yet. I am using all defaults, though.
I did notice that the priority of /dev/ramzswap is being set to 100. I tried setting the priority of the on-disk swap space higher than that, but it didn't help with hibernation.

Compcache: in-memory compressed swapping

Posted May 26, 2009 20:23 UTC (Tue) by JoeF (guest, #4486) [Link] (12 responses)

For hibernate, the ram-based swap of course should be bypassed.
But I guess the hibernate code just takes the first swap device and goes with that, assuming that all swap space is disk-based.

Compcache: in-memory compressed swapping

Posted May 26, 2009 20:29 UTC (Tue) by BrucePerens (guest, #2510) [Link] (5 responses)

> For hibernate, the ram-based swap of course should be bypassed.
It's worse than that. Memory belonging to the ram-based swap medium would be marked as not itself swappable. Otherwise, you would get in a loop. So, it has to back itself up to its private backing store device before hibernating, and restore itself before the OS is allowed to resume. It can't just stand by and passively allow another swap device to take care of its pages.

Compcache: in-memory compressed swapping

Posted May 27, 2009 1:44 UTC (Wed) by nitingupta (guest, #53817) [Link] (2 responses)

Yes, true. Currently, swapping compressed memory to private swap disk is under development. It can be made hibernation aware once this is done.

Compcache: in-memory compressed swapping

Posted May 27, 2009 6:17 UTC (Wed) by avik (guest, #704) [Link] (1 responses)

swapoff /dev/ramzswap

Should allow hibernation.

Compcache: in-memory compressed swapping

Posted May 27, 2009 13:41 UTC (Wed) by JoeF (guest, #4486) [Link]

Yup. That's what I do right now.

Could compcache improve restore after STD responsiveness?

Posted May 27, 2009 6:50 UTC (Wed) by rvfh (guest, #31018) [Link] (1 responses)

Interestingly, storing compressed memory to swap to hibernate reminds me a lot of TuxOnIce (and maybe now uswsusp, but that needs user space magic), which could save much more than the default swsusp thanks to compression...

Could ramzswap help have a more responsive system after restore too? Maybe with a little tweaking?

Could compcache improve restore after STD responsiveness?

Posted Apr 15, 2010 10:13 UTC (Thu) by dgm (subscriber, #49227) [Link]

Also, apparently compression would add very little overhead to the swap-to-disk case, allowing for faster I/O and better use of swap space.

Maybe adding this feature to the generic swap code should be considered?

Compcache: in-memory compressed swapping

Posted May 27, 2009 7:23 UTC (Wed) by macc (guest, #510) [Link] (5 responses)

Shouldn't that be layered?

mem -> compressed swap|blockdev --> to disk

snitching on disk IO should work for hibernation too
as long as compression is faster than disk access (true)

MACC

Compcache: in-memory compressed swapping

Posted May 28, 2009 10:24 UTC (Thu) by rvfh (guest, #31018) [Link] (4 responses)

Maybe the swap partition should be compressed too, so we don't

mem -> compressed swap -> uncompressed swap on disk

Compcache: in-memory compressed swapping

Posted May 29, 2009 4:15 UTC (Fri) by nitingupta (guest, #53817) [Link] (2 responses)

> Maybe the swap partition should be compressed too, so we don't
> mem -> compressed swap -> uncompressed swap on disk

This is the idea I'm working on. Swap out entire xvmalloc pages -- each containing multiple compressed pages -- to the swap disk. The aim here is not to save disk space but to improve performance.

Would xvmalloc and swap readahead play nice?

Posted Jun 6, 2009 7:37 UTC (Sat) by gmatht (guest, #58961) [Link] (1 responses)

Wouldn't swapping out xvmalloc pages prevent swap readahead from being of any use, given that adjacent pages are unlikely to be allocated in adjacent positions by xvmalloc? On a conventional HDD, reading an uncompressed page should take only ~0.1ms while seeking to the page should take ~10ms. My concern is that optimizing the 0.1ms while forcing a 10ms seek for every page would be a big performance loss.

There seems to be a big difference between the optimal layout for a memory allocator, where seek is not a problem, and the optimal layout on a conventional hard disk, where seek times dwarf virtually everything else.

If, OTOH, adjacent pages were written out in adjacent positions on disk this could *halve* the cost of swap readahead; both halving the time required to read in the extra pages and also halving the memory used by pages that were read from disk but not used.

(I can see just swapping out xvmalloc pages being a win for SSD, where seek is not a problem for random reads. Also, clearly, if you are writing out an xvmalloc page there should be very little overhead, and you know you will get 4k of real memory back for each page swapped out. Even so, wouldn't you still have to read in the entire 4K xvmalloc page just to access one of the compressed pages stored on that page?)

Would xvmalloc and swap readahead play nice?

Posted Jun 7, 2009 5:22 UTC (Sun) by nitingupta (guest, #53817) [Link]

> Wouldn't swapping out xvmalloc pages prevent swap readahead from being of any use, given that adjacent pages are unlikely to be allocated in adjacent positions by xvmalloc? On a conventional HDD, reading an uncompressed page should take only ~0.1ms while seeking to the page should take ~10ms. My concern is that optimizing the 0.1ms while forcing a 10ms seek for every page would be a big performance loss.

With compressed swapping to disk, the seek times will also be reduced, as pages will be spread over a smaller area of the disk. Still, in general, swapping out xvmalloc pages is expected to incur higher swap read overhead due to the greater number of seeks involved - an xvmalloc page contains mostly unrelated pages.

> There seems to be a big difference between the optimal layout for a memory allocator where seek is not a problem and the optimal layout on a conventional hard disk where seek times dwarf virtually everything else.

Yes, this is the whole problem. Theoretically, this problem could be solved by first collecting together physically contiguous pages (w.r.t. disk sectors) in a single memory page and then swapping that page to disk. However, when pages are swapped out this way, we are not guaranteed to be able to free even a single page. Also, this will increase in-memory fragmentation, as these pages will be taken out of random xvmalloc pages. So, after lots of such pages are swapped out, we have to do some in-memory defragmentation (not yet implemented) to bring down fragmentation and free pages.

> If, OTOH, adjacent pages were written out in adjacent positions on disk this could *halve* the cost of swap readahead; both halving the time required to read in the extra pages and also halving the memory used by pages that were read from disk but not used.

In general, swap readahead in its present state is almost meaningless when most of the pages are in (compressed) memory. Decompressing pages is almost instant. More useful would be to implement some sort of prefetch ioctl for ramzswap so that it prefetches pages from the backing swap and keeps them compressed in memory. But which pages to prefetch? That will need more study and experimentation.

> (I can see just swapping out xvmalloc pages being a win for SSD, where seek is not a problem for random reads. Also clearly if you are writing out an xvmalloc page there should be very little overhead, and you know you will get 4k of real memory back for each page swapped out. Even so, wouldn't you still have to read in the entire 4K xvmalloc page just to access one of the compressed pages stored on that page?)

Yes, reading in a single xvmalloc page will bring a bunch of unrelated pages into memory. These additional pages may be kept or discarded based on a configurable/hardcoded policy in ramzswap.

Compcache: in-memory compressed swapping

Posted May 29, 2009 5:57 UTC (Fri) by zmi (guest, #4829) [Link]

> mem -> compressed swap -> uncompressed swap on disk
> [and from the article:]
> allow swapping of xvmalloc memory to physical swap disk

That was my immediate idea when reading the article. I'd love it to be a
layer inserted just before normal swap disks, absolutely transparent. Like
this, (compressed) pages not used for a long time can be put to disk swap
at low I/O rates (or low I/O times, if that's easily measurable). And when
too much real mem is used, ramzswap can move pages to disk swap (maybe just
as a last resort to recover before OOM conditions).

The disk swap should support compressed pages directly, and you can also
drop (or at least loosen) the "if not enough compression gain, store
uncompressed to disk" rule, and just store pages that don't compress well
to disk swap, but in their compressed state. That should help lower I/O,
which is never a failure :-)

If this feature arrives, the vm.swappiness can be increased to more quickly
swap. Currently I lower it to 10 on my desktop (8GB RAM) because the disk
swapping in the morning after nightly backup used to be a nightmare with
the default value, the system very much unresponsive for quite a long time
(at least it feels like before the first coffee *g*). And that's already on
a 10krpm VelociRaptor drive.

I wonder if ramzswap will help on my 8GB desktop, and want to test it.
(already running now)

BTW: shouldn't there be a compressed name also? Like zap or just zp ;-)

Compcache: in-memory compressed swapping

Posted May 28, 2009 8:23 UTC (Thu) by jimparis (guest, #38647) [Link] (3 responses)

How does compcache differ in concept from, say:
- use mtdram driver to make a MTD device out of ram
- use jffs2 with compression enabled on the new mtd device
- put a swapfile on that filesystem and swap to it
Is there a fundamental reason that this wouldn't work and compcache is needed, or is compcache just trying to remove some of those layers?

Compcache: in-memory compressed swapping

Posted May 28, 2009 8:45 UTC (Thu) by amikins (guest, #451) [Link]

More layers of indirection will result in notably more processing per action. Making a specialty 'device' with fewer features and simpler assumptions about usage allows you to cut a lot of significant corners.

I'd be -very- interested in some measurements of the difference between the 'existing possible' approach and compcache..

Compcache: in-memory compressed swapping

Posted May 29, 2009 4:29 UTC (Fri) by nitingupta (guest, #53817) [Link]

> How does compcache differ in concept from, say:

> - use mtdram driver to make a MTD device out of ram

The mtdram driver simply simulates an MTD device in RAM - no compression and no memory management (it simply preallocates all the memory). Also, there are unnecessary overheads involved -- simulating eraseblock erases and such.

> - use jffs2 with compression enabled on the new mtd device
> - put a swapfile on that filesystem and swap to it

The base of this indirection hierarchy is the in-RAM MTD device, which has the problems mentioned above. Also, as amikins pointed out, additional levels of indirection mean more overhead.

Compcache: in-memory compressed swapping

Posted Jun 12, 2009 19:10 UTC (Fri) by bluefoxicy (guest, #25366) [Link]

I've tried using the device mapper. It tends to deadlock if you try to device map a file on a tmpfs and create swap partitions on it.

compcache and vm_deadlock for thin client stability

Posted Jun 6, 2009 19:19 UTC (Sat) by gvy (guest, #11981) [Link]

Thanks for both this article and compcache of course!

It (and vm_deadlock patches by Peter Zijlstra) saved me and colleagues quite a lot of frustration while eatin' our own dog food, that is working on thin clients with 64M RAM.

Hope #ltsp folks did get around to integration on TCs (the article mentions installer but the culprit with school terminal networks is usually the client not the server), at least we did "sell" both patches to them back then. :)

So thanks again, it was nice to communicate with you and it is nice to read up on current developments, too.

Compcache: in-memory compressed swapping

Posted Jun 7, 2009 12:41 UTC (Sun) by seeg (guest, #58966) [Link]

Does this mean that Windows would run faster because of more memory under Linux hypervising?

Compcache: in-memory compressed swapping

Posted Jun 14, 2009 20:04 UTC (Sun) by alankila (guest, #47141) [Link]

An amazing piece of technology. This makes my 128 MB laptop suddenly usable for more than a single task at a time. Kudos.

Compcache: in-memory compressed swapping

Posted Jun 25, 2009 11:38 UTC (Thu) by phil42 (guest, #5175) [Link]

Do you get a free memory test as part of the compression/decompression process?


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds