August 12, 2011
This article was contributed by Dan Magenheimer
The Linux kernel carefully enumerates and tracks all of its memory
and, for the most part, it can individually access every byte of
it. The purpose of transcendent memory ("tmem") is to provide the
kernel with the capability to utilize memory that it cannot enumerate,
sometimes cannot track, and cannot directly address. This may sound
counterintuitive, or even silly, to core kernel developers, but as we
will see it is actually quite useful; indeed it adds a level of
flexibility to the kernel that allows some rather complex functionalities
to be implemented and layered on a handful of tiny changes
to the core kernel. The end goal is that memory can be more
efficiently utilized by one kernel and/or load-balanced between
multiple kernels (in a virtualized OR a non-virtualized environment),
resulting in higher performance and/or lower RAM costs in a system or
across a data center. This article will provide an overview of
transcendent memory and how it is used in the Linux kernel.
Exactly how the kernel talks to tmem will be described in Part 2, but there
are certain classes of data maintained by the kernel that are suitable for it.
Two of these are known to kernel developers as "clean pagecache pages"
and "swap pages". The patch that deals with the former is known
as "cleancache"; it was merged into the 3.0 kernel. The patch that
deals with swap pages is known as frontswap and is still being reviewed
on the Linux kernel mailing list, with a target of linux-3.2.
There may well be other classes of data that will also work well with
tmem. Collectively these sources of suitable data for tmem can be
referred to as "frontends" for tmem and we will detail them in Part 3.
There are multiple implementations of tmem which store data using
different methods. We can refer to these data stores as "backends"
for tmem, and all frontends can be used by all backends (possibly
using a shim to connect them). The initial tmem implementation, known as
"Xen tmem," allows Xen hypervisor memory to be
used to store data for one or more tmem-enabled guest kernels.
Xen tmem has been implemented in Xen for over two years and has
been shipping in Xen since Xen 4.0; the in-kernel shim for Xen tmem
was merged into 3.0 (for cleancache only, updated to also support
frontswap in 3.1). Another Xen driver component, the Xen
self-ballooning driver, which helps encourage a guest kernel to use
tmem efficiently, was merged for 3.1 and also includes
the "frontswap-selfshrinker". See Appendix A for more information
about these.
The second tmem implementation does not involve virtualization at
all and is known as "zcache"; it is an in-kernel driver that stores
compressed pages. Zcache essentially "doubles RAM" for any class
of kernel memory that the kernel can provide via a tmem frontend
(e.g. cleancache, frontswap), thus reducing memory requirements in,
for example, embedded kernels. Zcache was merged as a staging driver
in 2.6.39 (though it depends on the cleancache and frontswap
patchsets, which were not yet upstream at the time).
A third tmem implementation is underway; it is known as "RAMster."
In RAMster, a "closely-connected" set of kernels effectively pool
their RAM so that a RAM-hungry workload on one machine can
temporarily and transparently utilize RAM on another machine
which is presumably idle or running a non RAM-hungry workload.
RAMster has also been dubbed "peer-to-peer transcendent memory"
and is intended for non-virtualized kernels but is being tested
also with virtualized kernels. While RAMster is best suited to an
environment where multiple systems are connected by a high-speed
"exofabric", in which one system can directly address another
system's memory, the initial prototype is built on a standard
Ethernet connection.
Other tmem implementations have been proposed: For example,
there has been some discussion about how useful tmem might be
for KVM and/or for containers. With recent changes to zcache
merged in 3.1, it should be relatively easy to implement the
necessary shims and try these out, but nobody has yet stepped
up to do so.
As another example, it has been observed that the tmem protocols
may be ideal for certain kinds of RAM-like technologies such as
"phase-change" memory (PRAM); most of these technologies have
certain idiosyncrasies, such as limited write-cycles, that can
be managed effectively through a software interface such as tmem.
Discussions have begun with certain vendors of such RAM-like
technologies. Yet another example is a variation of RAMster:
a single machine in a cluster acts as a "memory server" and memory
is added solely to that machine; the memory may be RAM, may be
RAM-like, or perhaps may be a fast SSD.
The existing tmem implementations will be described in Part 4
along with some speculation about future implementations.
2: How the kernel talks to transcendent memory
The kernel "talks" to tmem through a carefully defined interface,
which was crafted to provide maximum flexibility for the tmem
implementation while incurring low impact on the core kernel.
The tmem interface may appear odd but there are good reasons for
its peculiarities. Note that in some cases the
tmem interface is completely internal to the kernel and is
thus an "API"; in other cases it defines the boundary between
two independent software components (e.g. Xen and a guest Linux
kernel) so is properly called an "ABI".
(Casual readers may wish to skip this section.)
Tmem should be thought of as another "entity" that "owns"
some memory. The entity might be an in-kernel driver, another
kernel, or a hypervisor/host. As previously noted, tmem cannot be
enumerated by the kernel; the size of tmem is unknowable to the
kernel, may change dynamically, and may at any time be "full".
As a result, the kernel must "ask" tmem, on a page-by-page basis,
to accept new data or to retrieve previously stored data.
Tmem is not byte-addressable -- only large chunks of data (exactly
or approximately a page in size) are copied between kernel memory and
tmem. Since the kernel cannot "see" tmem, it is the tmem side of the
API/ABI that copies the data from/to kernel memory. Tmem organizes
related chunks of data in a pool; within a pool, the kernel chooses
a unique "handle" to represent the equivalent of an address for
the chunk of data. When the kernel requests the creation of
a pool, it specifies certain attributes to be described below.
If pool creation is successful, tmem provides a "pool id".
Handles are unique within pools, not across pools, and consist
of a 192-bit "object id" and a 32-bit "index." The rough equivalent
of an object is a "file" and the index is the rough equivalent of
a page offset into the file.
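To make the addressing scheme concrete, a handle could be modeled
with a structure along the following lines. This is a sketch only;
the field names are invented here, but the sizes match the 192-bit
object id and 32-bit index just described:

    #include <stdint.h>

    /* Hypothetical model of a tmem handle; names are illustrative. */
    struct tmem_oid {
        uint64_t oid[3];        /* 192-bit "object id" (roughly: a file) */
    };

    struct tmem_handle {
        int32_t pool_id;        /* returned by tmem at pool creation */
        struct tmem_oid oid;    /* which object within the pool */
        uint32_t index;         /* roughly: a page offset into the file */
    };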
The two basic operations of tmem are "put" and "get". If the
kernel wishes to save a chunk of data in tmem, it uses the "put"
operation, providing a pool id, a handle, and the location of the
data; if the put returns success, tmem has copied the data. If the kernel
wishes to retrieve data, it uses the "get" operation and provides the
pool id, the handle, and a location for tmem to place the data; if
the get succeeds, on return, the data will be present at the specified
location. Note that, unlike I/O, the copying performed by tmem is fully
synchronous. As a result, arbitrary locks can (and, to avoid races,
often should!) be held by the caller.
There are two basic pool types: ephemeral and persistent.
Pages successfully put to an ephemeral pool may or may not be
present later when the kernel uses a subsequent get with a matching
handle. Pages successfully put to a persistent pool are guaranteed
to be present for a subsequent get. (Additionally, a pool may
be "shared" or "private".)
The kernel is responsible for maintaining coherency between tmem
and the kernel's own data, and tmem has two types of "flush" operations
to assist with this: To disassociate a handle from any tmem data, the
kernel uses a "flush" operation. To disassociate all chunks of data in
an object, the kernel uses a "flush object" operation. After a flush,
subsequent gets will fail. A get on an (unshared) ephemeral pool is
destructive, i.e. implies a flush; otherwise, the get is non-destructive
and an explicit flush is required. (There are two additional coherency
guarantees that are described in Appendix B.)
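Pulling these pieces together, the whole interface can be sketched
as a small table of operations. The names and exact signatures
below are illustrative only (the real frontends and shims differ in
detail), but the semantics follow the description above:

    #include <stdint.h>

    struct tmem_oid;                    /* 192-bit object id, sketched earlier */

    /* Hypothetical flags chosen when a pool is created. */
    #define TMEM_POOL_PERSISTENT  0x1   /* successful puts are guaranteed retrievable */
    #define TMEM_POOL_SHARED      0x2   /* pool may be shared between clients */

    struct tmem_ops {
        /* Returns a pool id on success, a negative value on failure. */
        int  (*new_pool)(uint32_t flags);

        /* Copies one page-sized chunk INTO tmem; may be rejected at any time. */
        int  (*put_page)(int pool_id, struct tmem_oid *oid,
                         uint32_t index, void *page);

        /* Copies one chunk OUT of tmem; destructive (implies a flush)
         * for private ephemeral pools. */
        int  (*get_page)(int pool_id, struct tmem_oid *oid,
                         uint32_t index, void *page);

        /* Disassociate one handle, or every chunk in an object. */
        void (*flush_page)(int pool_id, struct tmem_oid *oid, uint32_t index);
        void (*flush_object)(int pool_id, struct tmem_oid *oid);
    };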
3: Transcendent memory frontends: frontswap and cleancache
While other frontends are possible, the two existing tmem frontends,
frontswap and cleancache, cover two of the primary types of kernel
memory that are sensitive to memory pressure. These two frontends
are complementary: cleancache handles (clean) mapped pages that
would otherwise be reclaimed by the kernel; frontswap handles
(dirty) anonymous pages that would otherwise be swapped out by
the kernel. When a successful cleancache_get happens, a disk
read has been avoided. When a successful frontswap_put (or get)
happens, a swap device write (or read) has been avoided. Together,
assuming tmem is significantly faster than disk paging/swapping,
substantial performance gains may be obtained in a memory-constrained
environment.
Frontswap
The total amount of "virtual memory" in a Linux system is the
size of the physical RAM plus the size of all configured swap devices.
When the "working set" of a workload exceeds the size of physical
RAM, swapping occurs -- swap devices are essentially used to emulate
physical RAM. But, generally, a swap device is several orders of
magnitude slower than RAM so swapping has become synonymous with
horrible performance. As a result, wise system administrators increase
physical RAM and/or redistribute workloads to ensure that swapping
is avoided. But what if swapping isn't always slow?
Frontswap allows the Linux swap subsystem to use transcendent memory,
when available, in place of sending data to and from a swap device.
Frontswap is not in itself a swap device and, thus, requires no
swap-device-like configuration. It does not change the total virtual
memory in the system; it just results in faster swapping... some/most/nearly
all of the time, but not necessarily always. Remember that the
quantity of transcendent memory is unknowable and dynamic. With
frontswap, whenever a page needs to be swapped out, the swap subsystem asks
tmem if it is willing to take the page of data. If tmem rejects it,
the swap subsystem writes the page, as normal, to the swap device.
If tmem accepts it, the swap subsystem can request the page of
data back at any time and it is guaranteed to be retrievable from
tmem. And, later, if the swap subsystem is certain the data is no
longer valid (e.g. if the owning process has exited), it can flush the
page of data from tmem.
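In rough code (a sketch, not the actual swap subsystem; the helper
names are invented), the swap-out decision with frontswap enabled
looks something like this:

    /* Illustrative only: a simplified model of the swap-out path with
     * frontswap in place.  The types and helpers are stand-ins, not
     * real kernel interfaces. */
    typedef unsigned long swap_slot_t;
    struct page;

    extern int  frontswap_try_put(swap_slot_t slot, struct page *page); /* 0 = accepted */
    extern void write_to_swap_device(swap_slot_t slot, struct page *page);

    static void swap_out_page(swap_slot_t slot, struct page *page)
    {
        if (frontswap_try_put(slot, page) == 0)
            return;             /* tmem copied the page; no disk write needed */

        /* tmem rejected the page (it is allowed to, at any time), so
         * fall back to the ordinary swap device write. */
        write_to_swap_device(slot, page);
    }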
Note that tmem can reject any or every frontswap "put". Why would it?
One example is if tmem is a resource shared between multiple kernels
(aka tmem "clients"), as is the case for Xen tmem or for RAMster;
another kernel may have already claimed the space, or perhaps this
kernel has exceeded some tmem-managed quota. Another example is
if tmem is compressing data as it does in zcache and it determines
that the compressed page of data is too large; in this case, tmem might
reject any page that isn't sufficiently compressible, or perhaps even
reject pages when the mean compression ratio becomes unacceptable.
The frontswap patchset is non-invasive and does not impact the
behavior of the swap subsystem at all when frontswap is disabled.
Indeed, a key kernel maintainer has observed that frontswap appears
to be "bolted on" to the swap subsystem. That is a good thing as
the existing swap subsystem code is very stable, infrequently used
(because swapping is so slow), yet critical to system correctness;
dramatic change to the swap subsystem is probably unwise and frontswap
only touches the fringes of it.
A few implementation notes: Frontswap requires one bit
of metadata per page of enabled swap. (The Linux swap subsystem
until recently required 16 bits of metadata per page and now requires
eight bits, so frontswap increases this by 12.5%.) This
bit-per-page records whether the page is in tmem or is on the physical
swap device. Since, at any time, some pages may be in frontswap and
some on the physical device, the swap subsystem "swapoff" code also
requires some modification. And because in-use tmem is more valuable
than swap device space, some additional modifications are provided
by frontswap so that a "partial swapoff" can be performed. And,
of course, hooks are added at the read-page and write-page routines
to divert data into tmem, and a hook is added to flush the data
when it is no longer needed. All told, the patch parts that affect
core kernel components add up to less than 100 lines.
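A minimal model of that bit-per-page bookkeeping might look like the
following; this is just a sketch with invented names, while the real
frontswap code keeps one such bitmap per enabled swap area:

    #include <stdlib.h>
    #include <limits.h>

    #define BITS_PER_WORD   (CHAR_BIT * sizeof(unsigned long))

    /* One bit per page of enabled swap: set if the page currently lives
     * in tmem, clear if it is on the physical swap device. */
    struct frontswap_map {
        unsigned long *bits;
        unsigned long nr_pages;
    };

    static struct frontswap_map *frontswap_map_create(unsigned long nr_pages)
    {
        struct frontswap_map *map = malloc(sizeof(*map));

        if (!map)
            return NULL;
        map->nr_pages = nr_pages;
        map->bits = calloc((nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD,
                           sizeof(unsigned long));
        if (!map->bits) {
            free(map);
            return NULL;
        }
        return map;
    }

    /* Mark a swap offset as "stored in tmem". */
    static void frontswap_map_set(struct frontswap_map *map, unsigned long offset)
    {
        map->bits[offset / BITS_PER_WORD] |= 1UL << (offset % BITS_PER_WORD);
    }

    /* Is this offset in tmem (1) or on the physical swap device (0)? */
    static int frontswap_map_test(struct frontswap_map *map, unsigned long offset)
    {
        return (map->bits[offset / BITS_PER_WORD] >> (offset % BITS_PER_WORD)) & 1;
    }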
Cleancache
In most workloads, the kernel fetches pages from a slow disk and, when
RAM is plentiful, the kernel retains copies of many of these pages in
memory, assuming that a disk page used once is likely to be used again.
There's no sense incurring two disk reads when one will do and
there's nothing else to do with that plentiful RAM anyway. If any data
is written to one of those pages, the changes must be written to disk
but, in anticipation of future changes, the (now clean) page continues to
be retained in memory. As a result, the number of clean pages in this
"page cache" often grows to fill the vast majority of memory. Eventually,
when memory is nearly filled, or perhaps if the workload grows to require
more memory, the kernel "reclaims" some of those clean pages; the
data is discarded and the page frames are used for something else.
No data is lost because a clean page in memory is identical to the
same page on disk. However, if the kernel later determines that it does
need that page of data after all, it must again be fetched from disk,
which is called a "refault." Since the kernel can't predict the future,
some pages are retained that will never be used again and some pages
are reclaimed that soon result in a refault.
Cleancache allows tmem to be used to store clean page cache pages resulting
in fewer refaults. When the kernel reclaims a page, rather than
discard the data, it places the data into tmem, tagged as "ephemeral",
which means that the page of data may be discarded if tmem chooses.
Later, if the kernel determines it needs that page of data after all,
it asks tmem to give it back. If tmem has retained the page, it
gives it back; if tmem hasn't retained the page, the kernel proceeds
with the refault, fetching the data from the disk as usual.
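The exchange just described can be modeled with two small hooks;
this is a sketch with invented names, not the kernel's actual
cleancache API:

    /* Illustrative stand-ins for the real page cache and tmem calls. */
    struct page;
    struct file_key;        /* the per-file unique name ("handle"), see below */

    extern int  cleancache_try_put(int pool_id, struct file_key *key,
                                   unsigned long index, struct page *page);
    extern int  cleancache_try_get(int pool_id, struct file_key *key,
                                   unsigned long index, struct page *page);
    extern void read_page_from_disk(struct page *page);

    /* Called on the reclaim path, just before a clean page is dropped. */
    static void on_reclaim(int pool_id, struct file_key *key,
                           unsigned long index, struct page *page)
    {
        /* Ephemeral: tmem is free to discard this data later. */
        (void)cleancache_try_put(pool_id, key, index, page);
    }

    /* Called on the refault path, before issuing the disk read. */
    static void on_refault(int pool_id, struct file_key *key,
                           unsigned long index, struct page *page)
    {
        if (cleancache_try_get(pool_id, key, index, page) == 0)
            return;                     /* disk read avoided */
        read_page_from_disk(page);      /* tmem no longer had the page */
    }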
To function properly, cleancache "hooks" are placed where pages are
reclaimed and where the refault occurs. The kernel is also responsible
for ensuring coherency between the page cache, disk, and tmem, so hooks
are also present wherever the kernel might invalidate the data.
Since cleancache affects the kernel's VFS layer, and since not all
filesystems use all VFS features, a filesystem must "opt in" to
cleancache at mount time.
One interesting note about cleancache is that clean pages may be retained
in tmem for a file that has no pages remaining in the kernel's page cache.
Thus the kernel must provide a name ("handle") for the page which is unique
across the entire filesystem. For some filesystems, the inode number is
sufficient, but for modern filesystems, the 192-bit "exportfs" handle is used.
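As an illustration of that naming, the 192-bit key and the
mount-time opt-in might be modeled like this; the field and function
names here are hypothetical:

    #include <stdint.h>
    #include <string.h>

    /* 192-bit per-file key, as described above. */
    struct cleancache_key {
        uint32_t key[6];
    };

    /* For simple filesystems, the inode number alone is sufficient. */
    static void key_from_inode_number(struct cleancache_key *key, uint64_t ino)
    {
        memset(key, 0, sizeof(*key));
        memcpy(key->key, &ino, sizeof(ino));
    }

    /* At mount time, a filesystem "opts in" and obtains a tmem pool id
     * that tags all of its subsequent cleancache puts and gets. */
    extern int tmem_new_pool(uint32_t flags);   /* hypothetical; see earlier sketch */

    static int cleancache_opt_in(void)
    {
        return tmem_new_pool(0);        /* a private, ephemeral pool */
    }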
Other tmem frontends
A common question is: can user code use tmem? For example, can enterprise
applications that otherwise circumvent the pagecache use tmem? Currently
the answer is no, but one could implement "tmem syscalls" to allow this.
Coherency issues may arise, and it remains to be seen if they could be
managed in user space.
What about other in-kernel uses? Some have suggested that the kernel dcache
might provide a useful source of data for tmem. This too deserves further
investigation.
4: Transcendent memory backends
The tmem interface allows multiple frontends to function with
different backends. Currently only one backend may be configured at a
time, though, in the future, some form of layering may be possible.
Tmem backends share some
common characteristics: Although a tmem backend might seem similar to
a block device, it does not perform I/O and does not use the
block I/O (bio) subsystem. In fact, a tmem backend must perform its functions
fully synchronously, that is, it must not sleep and the scheduler
may not be called. When a "put" completes, the kernel's page of data
has been copied. And a successful "get" does not complete
until the page of data has been copied to the kernel's data page.
While these constraints create some difficulty for tmem backends,
they also ensure that the backend meets the tmem interface
requirements while minimizing changes to the core kernel.
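The practical consequence is that a backend's put must finish all of
its work, including the copy, before returning, and it may not sleep
along the way. A sketch of the shape such a routine takes follows;
the names are invented, and a real kernel backend would use a
non-sleeping allocation (or a pre-allocated reserve) rather than
malloc():

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Illustrative only: the entire copy happens before the put
     * returns, and nothing here may sleep or invoke the scheduler. */
    static int backend_put_page(void **slot, const void *kernel_page)
    {
        void *copy = malloc(PAGE_SIZE);   /* stand-in for a non-sleeping allocation */

        if (!copy)
            return -1;                    /* reject the put; the kernel copes */
        memcpy(copy, kernel_page, PAGE_SIZE);
        *slot = copy;
        return 0;                         /* the kernel's page has been copied */
    }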
Zcache
Although tmem was conceived as a way to share a fixed resource (RAM)
among a number of clients with constantly varying memory appetites,
it also works nicely when the amount of RAM needed by a single kernel
to store some number, N, of pages of data is less than N*PAGE_SIZE
and when those pages of data need only be accessed at a page
granularity. So zcache combines an in-kernel implementation of tmem
with in-kernel compression code to reduce the space requirements
for data provided through a tmem frontend. As a result, when the
kernel is under memory pressure, zcache can substantially increase
the number of clean page cache pages and swap cache pages stored in
RAM and thus significantly decrease disk I/O.
The zcache implementation is currently a staging driver, so it is subject
to change; it handles both persistent pages (from frontswap) and ephemeral
pages (from cleancache) and, in both cases, uses the in-kernel lzo1x
routines to compress/decompress the data contained in the pages.
Space for persistent pages is obtained through a shim to xvmalloc,
a memory allocator in the zram staging driver designed to store
compressed pages. Space for ephemeral pages is obtained through
standard kernel get_free_page() calls, then pairs of compressed
ephemeral pages are matched using an algorithm called "compression
buddies".
This algorithm ensures that physical page frames containing two
compressed ephemeral pages can easily be reclaimed when necessary;
zcache provides a standard "shrinker" routine so those whole page frames
can be reclaimed when required by the kernel using the existing
kernel shrinker mechanism.
Zcache nicely demonstrates one of the flexibility features of tmem:
Recall that, although data may often compress nicely (i.e. by a factor of
two or more), it is possible that some workloads may produce long
sequences of data that compress poorly. Since tmem allows any page
to be rejected at the time of put, zcache policy (adjustable with
sysfs tuneables in 3.1) avoids storing this poorly compressible data, instead
passing it on to the original swap device for storage, thus dynamically
optimizing the density of pages stored in RAM.
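That policy is easy to express in code. The sketch below uses
invented names and an arbitrary threshold (the real driver uses the
in-kernel lzo1x routines and sysfs tuneables), but it shows the idea
of rejecting a put whose compressed result is too large:

    #include <stddef.h>

    #define PAGE_SIZE             4096
    /* Hypothetical tuneable: refuse pages that compress to more than
     * three-quarters of a page. */
    #define MAX_COMPRESSED_SIZE   (PAGE_SIZE * 3 / 4)

    /* Stand-in for the real compressor (lzo1x in zcache); returns the
     * compressed length written into the workspace. */
    extern size_t compress_page(const void *src, void *workspace);

    static int zcache_like_put(const void *page, void *workspace)
    {
        size_t clen = compress_page(page, workspace);

        if (clen > MAX_COMPRESSED_SIZE)
            return -1;  /* reject: the page goes to the swap device instead */

        /* ... otherwise store the clen compressed bytes in the
         *     compressed-memory allocator ... */
        return 0;
    }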
RAMster
RAMster is still under development but a proof-of-concept exists today.
RAMster assumes that we have a cluster-like set of systems with some
high-speed communication layer, or "exofabric", connecting them. The
collected RAM of all the systems in the "collective" is the shared RAM
resource used by tmem. Each cluster node acts as both a tmem client
and a tmem server, and decides how much of its RAM to provide to the
collective. Thus RAMster is a "peer-to-peer" implementation of tmem.
Ideally this exofabric allows some form of synchronous remote DMA to allow
one system to read or write the RAM on another system, but in the initial
RAMster proof-of-concept (RAMster-POC), a standard Ethernet connection is
used instead. As long as the exofabric is sufficiently faster than
disk reads/writes, there is still a net performance win.
Interestingly, RAMster-POC demonstrates a useful dimension of tmem:
Once pages have been placed in tmem, the data can be transformed in
various ways as long as the pages can be reconstituted when required.
When pages are put to RAMster-POC, they are first compressed and cached
locally using a zcache-like tmem backend. As local memory constraints
increase, an asynchronous process attempts to "remotify" pages to another
cluster node; if one node rejects the attempt, another node can be
used as long as the local node tracks where the remote data resides.
Although the current RAMster-POC doesn't implement this, one could
even remotify multiple copies to achieve higher availability (i.e.
to recover from node failures).
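Conceptually (a sketch only, with invented names), a RAMster-style
put lands in a local compressed cache first; moving the data to a
peer node happens later and asynchronously:

    /* Illustrative model of RAMster's two-level put: the synchronous
     * part only touches local memory, and "remotification" is deferred
     * to a background thread. */
    struct page;

    extern int  local_compressed_store(unsigned long handle, struct page *page);
    extern void queue_for_remotification(unsigned long handle);

    static int ramster_like_put(unsigned long handle, struct page *page)
    {
        if (local_compressed_store(handle, page) != 0)
            return -1;  /* reject; the kernel falls back as usual */

        /* A background thread may later push this page to a peer node
         * if local memory becomes tight, remembering where the remote
         * copy lives so a later get can fetch it back. */
        queue_for_remotification(handle);
        return 0;
    }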
While this multi-level mechanism in RAMster works nicely for puts,
there is currently no counterpart for gets. When a tmem frontend
requests a persistent get, the data must be fetched immediately
and synchronously; the thread requiring the data must busy-wait
for the data to arrive and the scheduler must not be called. As
a result, the current RAMster-POC is best suited to many-core processors,
where it is unusual for all cores to be simultaneously active.
Transcendent memory for Xen
Tmem was originally conceived for Xen and so the Xen implementation
is the most mature. The tmem backend in Xen utilizes spare hypervisor
memory to store data, supports a large number of guests, and optionally
implements both compression and deduplication (both within a guest
and across guests) to maximize the volume of data that can be stored.
The tmem frontends are converted to Xen hypercalls using a shim.
Individual guests may be equipped with "self-ballooning" and
"frontswap-self-shrinking" (both in Linux 3.1) to optimize their
interaction with Xen tmem. Xen tmem also supports shared ephemeral
pools, so that guests co-located on a physical server that share a
cluster filesystem need only keep one copy of a cleancache page in tmem.
The Xen control plane also fully implements tmem: an extensive set
of statistics is available; live migration and save/restore of
tmem-using guests are fully supported; and limits, or "weights", may be
applied to tmem guests to avoid denial-of-service.
Transcendent memory for kvm
The in-kernel tmem code included in zcache has been updated in 3.1 to
support multiple tmem clients. With this in place, a KVM implementation
of tmem should be fairly easy to complete, at least in prototype form.
As with Xen, a shim would need to be placed in the guest to convert
the cleancache and frontswap frontend calls to KVM hypercalls. On the host
side, these hypercalls would need to be interfaced with the in-kernel
tmem backend code. Some additional control plane support would also be
necessary for this to be used in a KVM distribution.
Future tmem backends
The flexibility and dynamic nature of tmem suggest that it may be useful
for other storage needs, and other backends have been proposed. The
idiosyncrasies of some RAM-extension technologies, such as SSDs and
phase-change memory (PRAM), have been observed to be a possible fit;
since page-size quantities are always used, writes can be
carefully controlled and accounted for, and user code never
writes to tmem directly, memory technologies that could previously only be
used as a fast I/O device could now instead be used as slow RAM.
Some of these ideas are already under investigation.
Appendix A: Self-ballooning and frontswap-selfshrinking
After a system has been running for a while, it is not uncommon for
the vast majority of its memory to be filled with clean pagecache
pages. With some tmem backends, especially Xen, it may make sense
for those pages to reside in tmem instead of in the guest. To achieve
this, Xen implements aggressive "self-ballooning", which artificially
creates memory pressure by driving the Xen balloon driver to claim
page frames, thus forcing the kernel to reclaim pages, which sends
them to tmem. The algorithm essentially uses control theory to drive
towards a memory target that approximates the current "working set"
of the workload using the "Committed_AS" kernel variable. Since
Committed_AS doesn't account for clean, mapped pages, these pages
end up residing in Xen tmem where, queueing theory assures us, Xen
can manage the pages more efficiently.
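The control loop can be sketched roughly as follows; this is a
simplification with invented names and an arbitrary slack factor,
whereas the real driver is driven by sysfs tuneables (see below):

    /* Illustrative only: drive the balloon toward a target derived from
     * Committed_AS so that "extra" clean pages are reclaimed and flow
     * into tmem instead of lingering in the guest's page cache. */
    extern unsigned long committed_as_pages(void);  /* the Committed_AS figure */
    extern void balloon_to(unsigned long target_pages);

    static void selfballoon_tick(void)
    {
        /* Hypothetical slack factor so the target is not too aggressive. */
        unsigned long target = committed_as_pages() + committed_as_pages() / 8;

        /* Shrinking toward the target creates memory pressure, so the
         * kernel reclaims clean pages (which cleancache sends to tmem);
         * growing toward it hands memory back as the working set expands. */
        balloon_to(target);
    }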
If the working set increases unexpectedly and faster than the
self-balloon driver is able to (or chooses to) provide usable RAM,
swapping occurs, but, in most cases, frontswap is able to absorb
this swapping into Xen tmem. However, because the
kernel swap subsystem assumes that swapping occurs to a disk,
swapped pages may sit on the "disk" for a very long time, even if
the kernel knows the page will never be used again, because the
disk space costs very little and can be overwritten when necessary.
When such stale pages are in frontswap, however, they are taking
up valuable space.
Frontswap-self-shrinking works to resolve this problem:
when frontswap activity is stable and the guest kernel returns to
a state where it is not under memory pressure, pressure is provided
to remove some pages from frontswap, using a "partial" swapoff
interface, and return them to kernel memory, thus freeing tmem
space for more urgent needs, i.e. other guests that are currently
memory-constrained.
Both self-ballooning and frontswap-self-shrinking provide sysfs
tuneables to drive their control processes. Further experimentation
will be necessary to optimize them.
Appendix B: Subtle tmem implementation requirements
Although tmem places most coherency responsibility on its clients, a
tmem backend itself must enforce two coherency requirements. These are
called "get-get" coherency and "put-put-get" coherency. For the
former, a tmem backend guarantees that if a get fails, a subsequent
get to the same handle will also fail (unless, of course, there is an
intermediate put). For the latter, if a put places data "A" into
tmem and a subsequent put with the same handle places data "B" into
tmem, a subsequent "get" must never return "A".
This second coherency requirement results in an unusual corner case
that affects the API/ABI specification: If a put with a handle "X" of
data "A" is accepted, and then a subsequent put is done to handle "X" with
data "B", this is referred to as a "duplicate put". In this case,
the API/ABI allows the backend implementation two options, and the
frontend must be prepared for either: (1) the duplicate put is
accepted, the backend replaces data "A" with data "B", and success
is returned; or (2) the duplicate put is failed, and the backend
must flush the data associated with "X" so that a subsequent get
will fail. This is the only case where a persistent get of
a previously accepted put may fail; fortunately in this case
the frontend has the new data "B" which would have overwritten the
old data "A" anyway.