April 3, 2013
This article was contributed by Dan Magenheimer
Amdahl's law tells us that there is always a bottleneck in any computing
system. Historically,
the bottleneck in many workloads on many systems has been the CPU,
so system designers have made CPUs faster and more efficient,
and continue to increase the number of CPU cores even in low-end
systems. So now, increasingly, RAM is
the bottleneck; CPUs wait idly while data is moved
from disk to RAM and back again. Adding
more RAM is not always a timely or cost-effective option and sometimes
not an option at all. Faster I/O buses and solid-state disks reduce
the bottleneck but don't eliminate it.
Wouldn't it be nice if it were possible to increase the effective amount
of data stored in RAM? And, since those CPUs are waiting anyway,
perhaps we could use those spare CPU cycles to contribute towards
that objective? This is the goal of in-kernel compression: We keep more
data — compressed — in RAM and use otherwise idle CPU cycles to execute
compression and decompression algorithms.
With the recent posting of zswap, there are
now three in-kernel compression
solutions proposed for merging into the kernel memory management (MM)
subsystem: zram, zcache, and zswap. While a casual observer might
think that only one is necessary, the three have significant differences
and may target different user bases. So, just as there are many filesystems
today co-existing in the kernel, someday there may be multiple
compression solutions. Or maybe not... this will be decided by Linus
and key kernel developers. To help inform that decision,
this article compares and contrasts the different solutions, which
we will call, for brevity, the "zprojects". We will first introduce
some of the key concepts and challenges of compression. Then
we will identify three design layers and elaborate on the different
design choices made by the zprojects. Finally, we will discuss how
the zprojects may interact with the rest of the kernel and then
conclude.
Compression basics
For in-kernel compression to work, the kernel must take byte sequences
in memory and compress them, and then keep the compressed version in RAM
until some point in the future when the data is needed again.
While the data is in a compressed state, it is impossible to
read any individual bytes from it or write any individual bytes
to it. Then, when the data is needed again,
the compressed sequence must be decompressed so that individual bytes
can again be directly accessed.
It is possible to compress any number of sequential bytes but
it is convenient to use a fixed "unit" of compression. A
standard storage unit used throughout the kernel is a "page",
which consists of PAGE_SIZE bytes, where PAGE_SIZE is a fixed constant
(4KB on most architectures supported by Linux). When a page is aligned on a
PAGE_SIZE address boundary, it is called a "page frame";
the kernel maintains a corresponding "struct page" for every
page frame in system RAM. All three zprojects use a page as
the unit of compression and allocate and manage page frames
to store compressed pages.
There are many possible compression algorithms. In general,
achieving a higher "compression ratio" takes a larger number
of CPU cycles, whereas less-effective compression can be done faster.
It is important to achieve a good balance between time and
compression ratio. All three zprojects, by default, use the
LZO(1X) algorithm in the kernel's lib/ directory, which is
known to deliver a good balance. However, it is important that
the choice of algorithm remain flexible; the software algorithm
might, in some cases, be replaced by an architecture-specific
hardware compression engine.
In general, the number of cycles required to compress, and to
later decompress, a sequence of bytes is roughly proportional
to the number of bytes in the sequence. Since the size of a page
is fairly large, page compression and decompression are expensive
operations and, so, we wish to limit the number of those operations. Thus
we must carefully select which pages
to compress, choosing pages that are likely to be used again but
not likely to be used in the near future, lest we spend all the
CPU's time repeatedly compressing and then decompressing pages.
Since individual bytes in a compressed page are inaccessible,
we must not only ensure that normal CPU linear byte addressing
is not attempted on any individual byte, but also ensure that the
kernel can clearly identify the compressed page so that it can
be found and decompressed when the time comes to do so.
When a page is compressed, the compression algorithm is applied
and the result is a sequence of bytes which we can refer to as a
"zpage".
The size of a zpage, its "zsize", is highly data-dependent,
hard to predict, and highly variable. Indeed the expected distribution of
zsize drives a number of key zproject design decisions so it
is worthwhile to understand it further. The "compression ratio"
for a page is zsize divided by PAGE_SIZE. For nearly all pages,
zsize is less than PAGE_SIZE so the compression ratio is less than one,
but odd cases may occur where compression actually results in
the zsize being larger than PAGE_SIZE. A good compression solution must
have a contingency plan to deal
with such outliers. At the other extreme, a data page containing
mostly zeroes or ones may compress by a factor of 100x or more.
For example, LZO applied to an all-zero page results in a
zpage with zsize equal to 28, for a compression ratio of 0.0068.
As a rough rule of thumb, "on average", compression has a ratio of about
0.5 (i.e. it reduces a page of data by about a factor of two),
so across a wide set of workloads, it is useful to envision the distribution
as a bell curve, centered at PAGE_SIZE/2.
We will refer to zpages with zsize less than PAGE_SIZE/2 as "thin"
zpages, and zpages with zsize greater than PAGE_SIZE/2 as "fat" zpages.
For any given workload, of course, the distribution may "skew thin",
(if many mostly-zero pages are compressed, for example) or "skew fat",
as is true if the data is difficult to compress (already-compressed data
like a JPEG image would be an example).
A good compression
solution must thus be able to handle zsize distributions from a wide range of
workloads.
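To make the zsize arithmetic concrete, here is a minimal kernel-style
sketch (not code from any of the zprojects) that compresses one page
with the LZO1X routines from the kernel's lib/ directory and reports
whether the result is thin or fat. The function name, the return
conventions, and the requirement that the caller supply the
destination and scratch buffers are our own assumptions for the
example.

    #include <linux/errno.h>
    #include <linux/highmem.h>
    #include <linux/lzo.h>

    /*
     * Illustrative only: compress one page and classify the resulting
     * zpage as "thin" or "fat".  The caller must supply a destination
     * buffer of at least lzo1x_worst_compress(PAGE_SIZE) bytes and an
     * LZO1X_1_MEM_COMPRESS-byte scratch area in wrkmem.
     */
    static int classify_zpage(struct page *page, void *dst, void *wrkmem)
    {
            unsigned char *src = kmap_atomic(page);
            size_t zsize;
            int ret;

            ret = lzo1x_1_compress(src, PAGE_SIZE, dst, &zsize, wrkmem);
            kunmap_atomic(src);

            if (ret != LZO_E_OK)
                    return -EINVAL;
            if (zsize >= PAGE_SIZE)         /* the "very fat" outlier case */
                    return -E2BIG;

            pr_info("zsize %zu: %s zpage\n", zsize,
                    zsize < PAGE_SIZE / 2 ? "thin" : "fat");
            return 0;
    }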
With this overview and terminology, we can now compare and
contrast the three zprojects. At a high level, each uses
a "data source layer" to provide pages to compress and a
"data management layer" to organize and store the zpages.
This data management layer has two sublayers, the first
for data organization, which we will call the "metadata layer",
and the second for determining how best to utilize existing
kernel memory and kernel allocation code (such as alloc_page())
to optimally store the zpages, which we will call the
"zpage allocator layer".
Data source layer — overview
As previously noted, the nature of compression limits the
types and quantities of pages to which it can be applied.
It doesn't make sense to compress a page for which there is
a high probability of imminent reuse. Nor does it make
sense to compress a page for which there is a high probability
the data it contains will never again be reused. All three
zprojects assume one or more data sources that provide
a "pre-qualified" sequence of data pages. In one zproject,
an existing kernel subsystem is the data source. For
the other two zprojects, kernel hooks previously added as part of
the "transcendent memory" project,
known as cleancache and frontswap,
source the data page stream.
Each data source must also provide a unique identifier,
or "key", for each page of data so that the page, once stored
and associated with that unique key, can be later retrieved by
providing the key.
It's also important to note that different data sources
may result in streams of pages with dramatically different
contents and thus, when compressed, different
zsize distributions.
Swap/anonymous pages as a data source
The Linux swap subsystem provides a good opportunity for compression
because, when the system is under memory pressure, swap acts as a
gatekeeper between (a) frequently used anonymous pages which are kept
in RAM, and (b) presumably less frequently used anonymous pages that
"overflow", or can be "swapped", to a very-much-slower-than-RAM swap disk.
If, instead of using a slow swap disk, we can cause the gatekeeper to
store the overflow pages in RAM, compressed, we may be
able to avoid swapping them entirely. Further, the swap subsystem already
provides a unique key for each page so that the page
can be fetched from the swap disk.
So, unsurprisingly, all three zprojects consider swap pages as
an ideal data source for compression.
One zproject, zram, utilizes the existing swap device model. A zram swap
device is explicitly created and enabled in
user space (via mkswap and swapon) and this zram device is prioritized
(presumably highest) against other configured swap devices. When the swap subsystem sends
a data page to the zram device, the page goes through the
block I/O subsystem to the zram "driver". All swap pages sent
to zram are compressed and associated with a key created by concatenating
the swap device identifier and page offset. Later, when the swap subsystem
determines (via a page fault) that the page must be brought back
in from the swap device to fill a specific page frame, the zram
device is notified; it then decompresses the zpage matching the swap
device number and page offset into that page frame.
The other two zprojects, zcache and zswap, use the kernel's frontswap
hooks (available since Linux 3.5) in the swap subsystem. Frontswap
acts as a "fronting store" — a type of a cache — to an existing
swap device. It avoids the block I/O subsystem completely;
swapped pages instead go directly to
zcache/zswap where they are compressed. On the fetch path, when the
page fault occurs, the block I/O subsystem is again bypassed and then,
as with zram, the page is decompressed directly into the faulting page frame.
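For reference, a frontswap backend plugs in roughly as in the sketch
below. The callback bodies are stubs, and the precise registration
interface and structure layout have varied a little from one kernel
version to the next, so treat this as an outline rather than
copy-and-paste code.

    #include <linux/frontswap.h>

    /* Skeletal sketch of a frontswap backend such as zcache or zswap. */

    static void sketch_init(unsigned type)
    {
            /* the swap device identified by 'type' was just enabled */
    }

    static int sketch_store(unsigned type, pgoff_t offset, struct page *page)
    {
            /*
             * Compress 'page' and file it under the key (type, offset).
             * Returning nonzero rejects the page; the swap subsystem then
             * writes the uncompressed page to the real swap device instead.
             */
            return -1;
    }

    static int sketch_load(unsigned type, pgoff_t offset, struct page *page)
    {
            /* find (type, offset) and decompress directly into 'page' */
            return -1;
    }

    static void sketch_invalidate_page(unsigned type, pgoff_t offset) { }
    static void sketch_invalidate_area(unsigned type) { }

    static struct frontswap_ops sketch_frontswap_ops = {
            .init            = sketch_init,
            .store           = sketch_store,
            .load            = sketch_load,
            .invalidate_page = sketch_invalidate_page,
            .invalidate_area = sketch_invalidate_area,
    };

    /* at initialization: frontswap_register_ops(&sketch_frontswap_ops); */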
There are some subtle but important side effects of the different
approaches:
- The block I/O subsystem is designed to work with fixed-size block
devices and is not very happy when a block write results in a failure.
As a result, when a poorly-compressible page ("very fat" zpage) is
presented to zram, all PAGE_SIZE bytes of the uncompressed page are
stored by zram in RAM, for no net space savings. The frontswap-based
zprojects have more flexibility; if a very fat zpage is presented,
zcache/zswap can simply reject the request, in which case the swap subsystem
will pass on the original data page to the real swap disk.
- Since zram presents itself as a swap disk, user-space configuration is
required, but this also means that zram can work on systems with no
physical swap device, a configuration common for embedded Linux
systems. On the other hand, zcache and zswap depend entirely on
already-configured swap devices and so are not functional unless at
least one physical swap device is configured and sized adequately. As
a result, one might think of zram as being a good match for an
embedded environment, while the others are better suited for a more
traditional server/data-center environment.
- It is important to note that all three zprojects must never discard
any of these data pages unless explicitly instructed; otherwise,
user-space programs may experience data loss. Since all three
zprojects have a finite capacity on any given system, it is prudent to
have a "pressure relief valve". Both zcache and zswap have the
ability to "writeback" data to the swapdisk to which the data was
originally intended to be written; this cannot be done with zram.
Clean page cache pages as a data source
It is not uncommon for the vast majority of page frames in a running
system to contain data pages that are identical
to pages already existing on
a disk in a filesystem. In Linux, these pages are stored in
the page cache; the data in these pages is kept in RAM in anticipation
that it will be used again in the future.
When memory pressure occurs and the kernel is looking to free
RAM, the kernel will be quick to
discard some of this duplicate "clean" data because it can be fetched from
the filesystem later if needed.
A "cleancache" hook, present in the kernel since Linux 3.0, can
divert the data in these page frames, sending the pages to zcache
where the data can be compressed and preserved in RAM. Then, if the
kernel determines the data is again needed in RAM, a page fault
will be intercepted by another cleancache hook
which results in the decompression of the data.
Because modern filesystems can be large, unique identification
of a page cache page is more problematic than is unique identification of
swap pages. Indeed, the key provided via cleancache must uniquely
identify: a filesystem; an "exportable" inode value representing a file
within that filesystem; and a page offset within the file. This
combination requires up to 240 bits.
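As a purely hypothetical illustration (the names below are ours, not
the actual cleancache interface), such a key has roughly the following
shape.

    #include <linux/types.h>

    /*
     * Hypothetical illustration of a cleancache key; not the real
     * structure from <linux/cleancache.h>, but the same general shape.
     * Compare it with the simple (swap device, page offset) pair that
     * suffices as a key for swap pages.
     */
    struct sketch_cleancache_key {
            int     pool_id;        /* which filesystem ("pool") */
            u32     filekey[6];     /* exportable inode / file handle */
            pgoff_t offset;         /* page offset within the file */
    };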
Zcache was designed to handle cleancache pages, including the full
range of required keys. As a result, the data management layer
is much more complex, consisting of a combination of different data
structures which we will describe below. Further, the location
of some cleancache hooks in the VFS part of the kernel results
in some calls to the data management layer being made with interrupts
disabled, which must be properly handled.
Even more than with swap data, filesystem data can proliferate
rapidly and, even compressed, can quickly fill up RAM. So,
as with swap data, it is prudent to have a pressure-relief valve
for page cache data. Unlike swap data, however, we know that
page cache data can simply be
dropped whenever necessary. Since zcache
was designed from the beginning to manage page cache pages, its
data management layer was also designed to efficiently handle
the eviction of page cache zpages.
Zram can maintain entire fixed-size filesystems in compressed form
in RAM, but not as a cache for a non-RAM-based filesystem. In essence,
part of RAM can be pre-allocated as a "fast disk" for a filesystem;
even filesystem metadata is compressed and stored in zram.
While zram may be useful for small filesystems that can entirely fit
in RAM, the caching capability provided via cleancache combined with
the compression provided by zcache is useful even for large filesystems.
Zswap is singularly focused on swap and so does not use page cache
or filesystem data as a data source at all. As a result, its
design is simpler because it can ignore the complexities required
to handle the tougher page cache data source.
The metadata layer
Once the zproject receives a data page for compression, it must
set up and maintain data structures so that the zpage can be stored
and associated with a key for that page, and then later found
and decompressed. Concurrency must also be considered to ensure
that unnecessary serialization bottlenecks do not occur. The three
zprojects use very different data structures for different reasons,
in part driven by the requirements of the data sources they support.
Since the number of swap devices in a system is small, and the
number of pages in each is predetermined and fixed, a small unsigned
integer representing the swap device, combined with a page offset, is sufficient to identify
a specific page. So zram simply uses a direct table lookup (one
table per configured zram swap device) to find and manage its data.
For example, for each data page stored, the zram table contains a
zsize, since zsize is useful for decompression validation, as well
as information to get to the stored data. Concurrency is restricted,
per-device, by a reader-writer semaphore.
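In other words, zram's metadata is conceptually just a flat array.
The sketch below is a deliberately simplified, hypothetical rendering;
the real zram entry also carries flags (for the zero-page case
described later, among other things) and its handle comes from the
zpage allocator.

    #include <linux/rwsem.h>
    #include <linux/types.h>

    /* Hypothetical, simplified sketch of zram's per-device metadata. */
    struct sketch_zram_entry {
            unsigned long handle;   /* where the zpage is stored */
            u16           zsize;    /* compressed size, kept for validation */
    };

    struct sketch_zram_device {
            struct sketch_zram_entry *table; /* one entry per swap slot */
            struct rw_semaphore       lock;  /* per-device reader/writer lock */
    };

    static struct sketch_zram_entry *
    sketch_zram_lookup(struct sketch_zram_device *dev, pgoff_t index)
    {
            return &dev->table[index];  /* direct indexing, no search needed */
    }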
Zswap must handle the same swap-driven address space, but it uses
one dynamic red-black tree (rbtree) per swap device
to manage its
data. During tree access, a spinlock on the tree must be held; this
is not a bottleneck because simultaneous access within any one
swap device is greatly limited by other swap subsystem factors.
Zswap's metadata layer is intentionally minimalist.
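A lookup in that scheme is a textbook rbtree walk keyed on the swap
offset. The sketch below uses the kernel's generic rbtree API, with
the entry layout and all names invented for the example.

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /*
     * Illustrative sketch of zswap-style metadata: one rbtree per swap
     * device, keyed by page offset.  The tree is protected by a spinlock
     * held by the caller (not shown).
     */
    struct sketch_zswap_entry {
            struct rb_node rbnode;
            pgoff_t        offset;   /* key within this swap device */
            unsigned long  handle;   /* zpage location in the allocator */
            unsigned int   zsize;
    };

    static struct sketch_zswap_entry *
    sketch_zswap_search(struct rb_root *root, pgoff_t offset)
    {
            struct rb_node *node = root->rb_node;

            while (node) {
                    struct sketch_zswap_entry *entry =
                            rb_entry(node, struct sketch_zswap_entry, rbnode);

                    if (offset < entry->offset)
                            node = node->rb_left;
                    else if (offset > entry->offset)
                            node = node->rb_right;
                    else
                            return entry;
            }
            return NULL;
    }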
Zcache must manage both swap data and pagecache data, and the latter
has a much more extensive key space which drives the size and
complexity of zcache data structures. For each filesystem,
zcache creates a "pool" with a hash table of red-black tree root
pointers; each node in the rbtree represents a filesystem
inode — the inode space in an exportable filesystem is
often very sparsely populated, which lends itself to management
by an rbtree. Each rbtree node then contains the head of a radix
tree — the data structure used throughout the kernel for page
offsets — and this radix tree is used to look up a specific page
offset. The leaf of the radix tree points to a descriptor which,
among other things, references the zpage.
Each hash table entry has a spinlock to manage concurrency; unlike
swap and frontswap, filesystem
accesses via cleancache may be highly concurrent, so contention must
be aggressively avoided.
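Putting those pieces together, a cleancache lookup in zcache descends
through three structures: hash bucket, per-inode rbtree node, and
finally a radix tree. The sketch below shows only the shape of that
walk; every name in it is invented, and the real zcache code is
considerably more involved.

    #include <linux/hash.h>
    #include <linux/radix-tree.h>
    #include <linux/rbtree.h>
    #include <linux/spinlock.h>

    #define SKETCH_HASH_BITS 8

    /* Invented names throughout; only the shape of the lookup matters. */
    struct sketch_inode_node {              /* one per inode, in an rbtree */
            struct rb_node         rbnode;
            u64                    inode_key;
            struct radix_tree_root pages;   /* offset -> zpage descriptor */
    };

    struct sketch_pool {                    /* one per filesystem */
            struct {
                    spinlock_t     lock;
                    struct rb_root inodes;
            } bucket[1 << SKETCH_HASH_BITS];
    };

    static void *sketch_lookup(struct sketch_pool *pool, u64 inode_key,
                               pgoff_t offset)
    {
            unsigned int b = hash_64(inode_key, SKETCH_HASH_BITS);
            struct rb_node *node;
            void *desc = NULL;

            spin_lock(&pool->bucket[b].lock);
            node = pool->bucket[b].inodes.rb_node;
            while (node) {
                    struct sketch_inode_node *in =
                            rb_entry(node, struct sketch_inode_node, rbnode);

                    if (inode_key < in->inode_key)
                            node = node->rb_left;
                    else if (inode_key > in->inode_key)
                            node = node->rb_right;
                    else {
                            desc = radix_tree_lookup(&in->pages, offset);
                            break;
                    }
            }
            spin_unlock(&pool->bucket[b].lock);
            return desc;    /* descriptor referencing the zpage, or NULL */
    }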
Impact of writeback on the metadata layer
Earlier it was noted that zcache and zswap support "writeback";
this adds an important twist to the data structures in that,
ideally, we'd like to write back pages in some reasonable order, such as
least recently
used (LRU). To that end, both zcache and
zswap maintain a queue to order the stored data; zswap queues
each individual zpage and zcache queues the page frames used to store
the zpages. While lookup requires a key and searches the data
structures from the root downward, writeback chooses one or more zpages from
the leaves of the data structure and then must not only remove or
decompress the data, but must also remove its metadata from the leaf upward.
This requirement has two ramifications: First, the leaf nodes of
data structures must retain the key along with information necessary
to remove the metadata. Second, if writeback is allowed to execute
concurrently with other data structure accesses (lookup/insert/remove),
challenging new race conditions arise which must be understood
and avoided.
Zswap currently implements writeback only as a side effect of
storing data (when kernel page allocation fails) which limits the
possible race conditions. Zcache implements writeback as a shrinker
routine that can be run completely independently and, thus,
must handle a broader set of potential race conditions. Zcache must also
handle this same broader set when discarding page cache pages.
Other metadata layer topics
One clever data management technique is worth mentioning here: Zram
deduplicates "zero pages" — pages that contain only zero bytes —
requiring no additional storage. In some zsize distributions,
such pages may represent a significant portion of the zpages
to be stored. While the data of the entire page may need to be
scanned to determine if it is a zero page, this overhead may
be small relative to that of compression and, in any case,
it pre-populates the CPU cache with bytes that the compression algorithm
would need anyway. Zcache and zswap should add this feature.
More sophisticated deduplication might also be useful. Patches are
welcome.
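The zero-page scan itself can be as simple as the following sketch,
which uses the kernel's memchr_inv() helper; zram's actual check
differs in detail but has the same effect.

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Sketch of zero-page detection along the lines of what zram does. */
    static bool sketch_page_is_zero_filled(struct page *page)
    {
            void *addr = kmap_atomic(page);
            bool zero = memchr_inv(addr, 0, PAGE_SIZE) == NULL;

            kunmap_atomic(addr);
            return zero;
    }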
Zpage allocator layer
Memory allocation for zpages can present some unique challenges.
First, we wish to maximize the number of zpages stored in a given
number of kernel page frames, a ratio which we will call "density".
Maximizing the density
can be challenging as we do not know a priori the number of
zpages to be stored, the distribution of zsize of those zpages,
or the access patterns which insert and remove those zpages. As a result,
fragmentation can be a concern. Second, allocating memory to store
zpages may often occur in an environment of high-to-severe memory
pressure; an effective allocator must handle frequent failure and
should minimize large (i.e. multiple-contiguous-page) allocations. Third,
if possible, we wish the allocator to support some mechanism to enable
ordered writeback and/or eviction as described above.
One obvious alternative is to use an allocator already present and
widely used in the kernel: kmalloc(), which is optimized for
allocations that nicely fit into default sizes of 2^N-byte chunks.
For allocations of size 2^N+x, however, as much as 50% of the
allocated memory is wasted. kmalloc() provides an option for
"odd" sizes, but this requires pre-allocated caches, often of
contiguous sets of pages ("higher-order" pages), which are difficult
to obtain under memory pressure. Early studies showed that kmalloc()
allocation failures were unacceptably frequent and density was insufficient,
so kmalloc() was discarded and other options were pursued, all based
on alloc_page(), which always allocates a single page frame.
To improve density, the xvmalloc allocator was written, based on
the two-level sequential fit (TLSF) algorithm. Xvmalloc provided
high density
for some workloads and was the initial default allocator for zram and for
the frontswap-driven portion of the initial zcache. It was found, however,
that the high density led to poor fragmentation characteristics,
especially for zsize distributions that skewed fat. Xvmalloc has
since been discarded (though remnants of it survive through Linux 3.8).
A new allocator, zbud, was written for the cleancache-driven portion of the
original zcache code. The initial version of zbud had less focus on high
density and more on the ability to efficiently evict cleancache-provided
zpages. With zbud, no more than two zpages are contained
in a page frame, buddying up a pair of zpages, one fat and one thin.
Page frames, rather than zpages, are entered into an LRU queue so
that entire page frames can be easily "freed" to the system.
On workloads where zsize skews thin, a significant amount of space
is wasted, but fragmentation is limited, and eviction is still easy.
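Conceptually, each zbud page frame looks something like the
hypothetical sketch below (this is not zbud's real layout), with one
zpage filled in from the front of the frame and the other from the
back.

    #include <linux/list.h>
    #include <linux/types.h>

    /*
     * Hypothetical sketch of the zbud idea: at most two buddied zpages
     * per page frame, and whole page frames (not individual zpages) on
     * the LRU list used for eviction and writeback.
     */
    struct sketch_zbud_frame {
            struct list_head lru;        /* page-frame LRU */
            unsigned int     first_size; /* zsize of the zpage stored from the front */
            unsigned int     last_size;  /* zsize of the zpage stored from the back */
            /* the compressed data occupies the remainder of the page frame */
    };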
Zsmalloc, an allocator intended to "rule them all", was then written.
Zsmalloc combines the best aspects of kmalloc() with the ability to provide
a complete range
of "size classes" suitable for storing zpages; zsmalloc also never requires
higher-order allocations. All zpages with zsize within
a "class" (e.g. between 33-48 bytes) are grouped together and stored
in the same "zspage". A clever feature (especially useful for fat zpages)
allows discontiguous page frames to be "stitched" together, so that
the last bytes in the first page frame contain the first part of the zpage
and the first bytes of the second page frame contain the second part.
A group of N of these discontiguous pages is allocated on demand
for each size class, with N chosen by the zsmalloc code
to optimize density.
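The size-class arithmetic implied by the "33-48 bytes" example is
simple; the sketch below assumes a uniform 16-byte class spacing for
illustration, which is not zsmalloc's actual class layout.

    #include <linux/kernel.h>

    /*
     * Illustration only: with uniform 16-byte size classes, a zpage is
     * stored in a slot of its zsize rounded up to the next multiple of
     * 16.  A 40-byte zpage lands in the 33-48 byte class and occupies a
     * 48-byte slot, wasting 8 bytes.
     */
    static unsigned int sketch_class_slot_size(unsigned int zsize)
    {
            return roundup(zsize, 16);
    }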
As a result of these innovations, zsmalloc achieves great density
across a wide range of zsize distributions: balanced, fat, and thin.
Alas, zsmalloc's strengths also prove to be its Achilles heel:
high density and page crossings make it very difficult for eviction
and writeback to concurrently free page frames, resulting in a different
kind of fragmentation (and consequent lower density) and rendering it
unsuitable for management of cleancache-provided pages in zcache.
So zbud was revised to utilize some of the techniques pioneered
by zsmalloc.
Zbud is still limited to two zpages per page frame, but when under
heavy eviction or writeback, its density compares favorably
with zsmalloc. The decision by zcache to use zbud for swap pages
sadly led to a fork of zcache, first into "old zcache" (aka zcache1)
and "new zcache" (aka zcache2), and then to a separate zproject
entirely, zswap, because zswap's author prefers the density of zsmalloc
over the ability to support cleancache pages and the predictability
and page-frame reclaim of zbud.
So, today, there are two zpage allocators, both still evolving.
The choice of the best zpage allocator depends
on use models and unpredictable workloads. It is possible that
some hybrid of the two will emerge, or perhaps some as-yet-unwritten
allocator will arise to "rule them all".
Memory management subsystem interaction
In some ways, a zproject is like a cache for pages temporarily removed
from the MM subsystem. However, it is a rather unusual cache in that,
to store its data, it steals capacity (page frames) from its backing
store, the MM subsystem. From the point of view of the MM subsystem,
that RAM is simply gone, just as if the page frames were absorbed by
a RAM-voracious device driver. This is a bit unfortunate, given that
both the zproject and the MM subsystem are respectively, but separately,
managing compressed and uncompressed anonymous pages (and possibly
compressed and uncompressed page cache pages as well).
So it is educational to consider how
a zproject may "load balance" its memory needs with the MM subsystem,
and with the rest of the kernel.
Currently, all three zprojects obtain page frames using the standard
kernel alloc_page()
call, with the "GFP flags" parameter chosen to ensure that the kernel's
emergency reserve pages are never used.
So each zproject must be (and is) resilient to the fairly frequent
situation where an alloc_page() call returns failure.
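The allocation itself looks something like the sketch below; the exact
flag combination varies from one zproject to the next, but the common
intent is to stay away from the emergency reserves and to treat
failure as a routine event.

    #include <linux/gfp.h>

    /*
     * Sketch of the kind of page-frame allocation the zprojects make.
     * The flags keep the allocation away from the emergency reserves,
     * and a NULL return is an expected, frequently handled outcome.
     */
    static struct page *sketch_alloc_zpage_frame(void)
    {
            return alloc_page(__GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC);
    }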
Zram makes no other attempt to limit its total use of page frames but,
as we've seen,
it is configured by the administrator as a swap device, and swap parameters do
limit the number of swap pages, and thus zpages, that can be accepted.
This indirectly
but unpredictably limits the number of page frames used. Also,
the configuration is neither derived from nor aware of the total RAM
installed in the system, so configuration errors are quite possible.
Zswap uses a slightly modified version of zsmalloc to track the
total number of page frames allocated. It ensures this number does
not exceed a certain percent of total RAM in the system, and this
limit is configurable via sysfs. As previously noted, using
writeback, zswap can reduce the number of (anonymous) zpages stored
and this, indirectly, can reduce the number of page frames,
though at the likely cost of fragmentation.
Zcache and zbud were designed from the beginning with much more dynamic
constraints on page frame utilization in mind, intended eventually to
flexibly mesh with the dramatically time-varying memory balancing algorithms
that the MM subsystem uses.
While the exact interaction model and best future control policy are not
yet determined, zcache's and zbud's underlying page frame-based writeback
and eviction mechanisms support a shrinker-like interface that can
be told to, for example, reduce the number of page frames used for
storing anonymous zpages from 10000 to 8937 and the number of page frames
used for storing page cache zpages from 3300 to 463, and zcache can do that.
Since zbud is much more predictable (two zpages per page frame), the
MM subsystem can estimate and manage both the total number of anonymous
and pagecache pages (compressed and uncompressed) in the system and the
total number of page frames used to manage them.
Status and conclusions
Zram, zcache, and zswap are all advancing the concept of in-kernel compression
in different ways. Zram and zcache are in-tree but both still in
the staging tree. While the staging tree has served to expose
the concept of in-kernel compression to a wide group of kernel
developers and even to the users of a few leading-edge distributions,
as Andrew Morton says,
the staging tree "is where code goes to be ignored."
Staging tree maintainer Greg Kroah-Hartman has become reluctant to accept any new
zproject functionality into the staging tree. This creates a
conundrum for these zprojects: evolution in the staging tree
has resulted in rapid improvements in the design, implementation,
and exposure of the zprojects, but, because of this continuing evolution,
they are not considered stable enough for promotion into the core kernel,
and further evolution is now stymied.
Zswap is attempting to circumvent the staging tree entirely by
proposing what is essentially a simplified frontswap-only fork of
zcache using zsmalloc instead of zbud, for direct merging into the
MM subsystem. Since it is also much simpler than zcache, it has
garnered more reviewers
which is a valuable advantage for to-be-merged code. But it is
entirely dependent on the still-in-staging zsmalloc, does not
support page cache pages at all, and does not support page-frame-based
writeback, so its simplicity comes at a cost. If zswap is
merged, it remains to be seen if it will ever be extended
adequately.
So, in-kernel compression has clear advantages and real users.
It remains unclear, though, when (or if) it will be merged
into the core kernel and how it should interact with
the core kernel. If you are intrigued, the various zproject
developers encourage your ideas and contributions.
Acknowledgments
Nitin Gupta was the original author of compcache and I
was the original author of transcendent memory. While both projects
were initially targeted to improve memory utilization in a multi-tenant
virtualized environment, their application to compression in a single
kernel quickly became apparent. Nitin designed and wrote the Linux
code for zram, xvmalloc, and zsmalloc, and wrote an early prototype
implementation of zcache. I wrote frontswap and cleancache (with the
cleancache hook placement originally done by Chris Mason), and the
Linux code for zcache, zbud, and ramster.
Seth Jennings contributed
a number of improvements to zcache and is the author of zswap.
Kernel mm developer Minchan Kim was a helpful influence for all the
zprojects, and none of the zprojects would have been possible without
Greg Kroah-Hartman's support — and occasional whippings — as
maintainer of the staging drivers tree.
As with any open-source project, many others have contributed ideas,
bug fixes, and code improvements and we are thankful for everyone's
efforts.