
Transcendent memory in a nutshell

August 12, 2011

This article was contributed by Dan Magenheimer

The Linux kernel carefully enumerates and tracks all of its memory and, for the most part, it can individually access every byte of it. The purpose of transcendent memory ("tmem") is to provide the kernel with the capability to utilize memory that it cannot enumerate, sometimes cannot track, and cannot directly address. This may sound counterintuitive, or even silly, to core kernel developers, but as we will see it is actually quite useful; indeed it adds a level of flexibility to the kernel that allows some rather complex functionalities to be implemented and layered on a handful of tiny changes to the core kernel. The end goal is that memory can be more efficiently utilized by one kernel and/or load-balanced between multiple kernels (in a virtualized OR a non-virtualized environment), resulting in higher performance and/or lower RAM costs in a system or across a data center. This article will provide an overview of transcendent memory and how it is used in the Linux kernel.

Exactly how the kernel talks to tmem will be described in Part 2, but there are certain classes of data maintained by the kernel that are suitable. Two of these are known to kernel developers as "clean pagecache pages" and "swap pages". The patch that deals with the former is known as "cleancache"; it was merged into the 3.0 kernel. The patch that deals with swap pages is known as frontswap and is still being reviewed on the Linux kernel mailing list, with a target of linux-3.2. There may well be other classes of data that will also work well with tmem. Collectively these sources of suitable data for tmem can be referred to as "frontends" for tmem and we will detail them in Part 3.

There are multiple implementations of tmem which store data using different methods. We can refer to these data stores as "backends" for tmem, and all frontends can be used by all backends (possibly using a shim to connect them). The initial tmem implementation, known as "Xen tmem," allows Xen hypervisor memory to be used to store data for one or more tmem-enabled guest kernels. Xen tmem has been implemented in Xen for over two years and has been shipping in Xen since Xen 4.0; the in-kernel shim for Xen tmem was merged into 3.0 (for cleancache only, updated to also support frontswap in 3.1). Another Xen driver component, the Xen self-ballooning driver, which helps encourage a guest kernel to use tmem efficiently, was merged for 3.1 and also includes the "frontswap-selfshrinker". See Appendix A for more information about these.

The second tmem implementation does not involve virtualization at all; known as "zcache," it is an in-kernel driver that stores compressed pages. Zcache essentially "doubles RAM" for any class of kernel memory that the kernel can provide via a tmem frontend (e.g. cleancache, frontswap), thus reducing memory requirements in, for example, embedded kernels. Zcache was merged as a staging driver in 2.6.39 (though it depends on the cleancache and frontswap patchsets, which were not yet upstream at the time).

A third tmem implementation is underway; it is known as "RAMster." In RAMster, a "closely-connected" set of kernels effectively pool their RAM so that a RAM-hungry workload on one machine can temporarily and transparently utilize RAM on another machine, which is presumably idle or running a non-RAM-hungry workload. RAMster has also been dubbed "peer-to-peer transcendent memory" and is intended for non-virtualized kernels, but it is also being tested with virtualized kernels. While RAMster is best suited to an environment where multiple systems are connected by a high-speed "exofabric", in which one system can directly address another system's memory, the initial prototype is built on a standard Ethernet connection.

Other tmem implementations have been proposed: For example, there has been some argument about how useful tmem might be for KVM and/or for containers. With recent changes to zcache merged in 3.1, it may be very easy to simply implement the necessary shims and try these out; nobody has yet stepped up to do it. As another example, it has been observed that the tmem protocols may be ideal for certain kinds of RAM-like technologies such as "phase-change" memory (PRAM); most of these technologies have certain idiosyncrasies, such as limited write-cycles, that can be managed effectively through a software interface such as tmem. Discussions have begun with certain vendors of such RAM-like technologies. Yet another example is a variation of RAMster: a single machine in a cluster acts as a "memory server" and memory is added solely to that machine; the memory may be RAM, may be RAM-like, or perhaps may be a fast SSD.

The existing tmem implementations will be described in Part 4 along with some speculation about future implementations.

2: How the kernel talks to transcendent memory

The kernel "talks" to tmem through a carefully defined interface, which was crafted to provide maximum flexibility for the tmem implementation while incurring low impact on the core kernel. The tmem interface may appear odd but there are good reasons for its peculiarities. Note that in some cases the tmem interface is completely internal to the kernel and is thus an "API"; in other cases it defines the boundary between two independent software components (e.g. Xen and a guest Linux kernel) so is properly called an "ABI".

(Casual readers may wish to skip this section.)

Tmem should be thought of as another "entity" that "owns" some memory. The entity might be an in-kernel driver, another kernel, or a hypervisor/host. As previously noted, tmem cannot be enumerated by the kernel; the size of tmem is unknowable to the kernel, may change dynamically, and may at any time be "full". As a result, the kernel must "ask" tmem on every individual page to accept data or to retrieve data.

Tmem is not byte-addressable -- only large chunks of data (exactly or approximately a page in size) are copied between kernel memory and tmem. Since the kernel cannot "see" tmem, it is the tmem side of the API/ABI that copies the data from/to kernel memory. Tmem organizes related chunks of data in a pool; within a pool, the kernel chooses a unique "handle" to represent the equivalent of an address for the chunk of data. When the kernel requests the creation of a pool, it specifies certain attributes to be described below. If pool creation is successful, tmem provides a "pool id". Handles are unique within pools, not across pools, and consist of a 192-bit "object id" and a 32-bit "index." The rough equivalent of an object is a "file" and the index is the rough equivalent of a page offset into the file.
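As a rough illustration (the structure names here are simplified sketches, not the exact kernel or Xen definitions), a handle can be pictured like this:

    /* Illustrative sketch only -- not the actual tmem definitions. */

    #include <linux/types.h>

    /* A 192-bit object id -- the rough equivalent of "which file". */
    struct tmem_oid {
            u64 oid[3];
    };

    /*
     * A handle names one page-sized chunk of data within a pool: the
     * object id selects the "file" and the 32-bit index is the rough
     * equivalent of a page offset into that file.
     */
    struct tmem_handle {
            s32 pool_id;            /* returned by tmem at pool creation */
            struct tmem_oid oid;    /* unique within the pool */
            u32 index;              /* page offset within the object */
    };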

The two basic operations of tmem are "put" and "get". If the kernel wishes to save a chunk of data in tmem, it uses the "put" operation, providing a pool id, a handle, and the location of the data; if the put returns success, tmem has copied the data. If the kernel wishes to retrieve data, it uses the "get" operation and provides the pool id, the handle, and a location for tmem to place the data; if the get succeeds, on return, the data will be present at the specified location. Note that, unlike I/O, the copying performed by tmem is fully synchronous. As a result, arbitrary locks can (and, to avoid races, often should!) be held by the caller.
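A caller-side sketch of that sequence might look like the following; tmem_put() and tmem_get() are illustrative names standing in for the frontend-specific hooks described in Part 3:

    /* Hypothetical usage sketch; tmem_put()/tmem_get() are illustrative names. */
    static int save_then_restore(s32 pool_id, struct tmem_oid oid, u32 index,
                                 struct page *src, struct page *dst)
    {
            /* Ask tmem to take a copy of 'src'; it may refuse for any reason. */
            if (tmem_put(pool_id, oid, index, src) != 0)
                    return -1;      /* rejected: caller falls back, e.g. to disk */

            /* The copy was fully synchronous, so 'src' may be reused at once. */

            /* Later: ask tmem to copy the data back into 'dst'. */
            if (tmem_get(pool_id, oid, index, dst) != 0)
                    return -1;      /* gone (always possible for ephemeral pools) */

            return 0;               /* 'dst' now holds the data */
    }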

There are two basic pool types: ephemeral and persistent. Pages successfully put to an ephemeral pool may or may not be present later when the kernel uses a subsequent get with a matching handle. Pages successfully put to a persistent pool are guaranteed to be present for a subsequent get. (Additionally, a pool may be "shared" or "private".)

The kernel is responsible for maintaining coherency between tmem and the kernel's own data, and tmem has two types of "flush" operations to assist with this: To disassociate a handle from any tmem data, the kernel uses a "flush" operation. To disassociate all chunks of data in an object, the kernel uses a "flush object" operation. After a flush, subsequent gets will fail. A get on an (unshared) ephemeral pool is destructive, i.e. implies a flush; otherwise, the get is non-destructive and an explicit flush is required. (There are two additional coherency guarantees that are described in Appendix B.)
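Taken together, the operations a tmem backend must supply can be pictured as an operations table along the following lines (a sketch; the real in-kernel and Xen structures differ in detail):

    /* Sketch of the full tmem operation set (illustrative, not the real ABI). */
    struct tmem_backend_ops {
            /* Pool management: returns a pool id, or a negative value on failure. */
            int  (*new_pool)(u32 flags);            /* ephemeral/persistent, shared/private */
            void (*destroy_pool)(int pool_id);

            /* Data path: all copying is synchronous and done by the backend. */
            int  (*put_page)(int pool_id, struct tmem_oid oid, u32 index,
                             struct page *page);    /* may refuse any page */
            int  (*get_page)(int pool_id, struct tmem_oid oid, u32 index,
                             struct page *page);    /* destructive on private ephemeral pools */

            /* Coherency: after these, gets on the affected handles must fail. */
            void (*flush_page)(int pool_id, struct tmem_oid oid, u32 index);
            void (*flush_object)(int pool_id, struct tmem_oid oid);
    };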

3: Transcendent memory frontends: frontswap and cleancache

While other frontends are possible, the two existing tmem frontends, frontswap and cleancache, cover two of the primary types of kernel memory that are sensitive to memory pressure. These two frontends are complementary: cleancache handles (clean) mapped pages that would otherwise be reclaimed by the kernel; frontswap handles (dirty) anonymous pages that would otherwise be swapped out by the kernel. When a successful cleancache_get happens, a disk read has been avoided. When a successful frontswap_put (or get) happens, a swap device write (or read) has been avoided. Together, assuming tmem is significantly faster than disk paging/swapping, substantial performance gains may be obtained in a memory-constrained environment.

Frontswap

The total amount of "virtual memory" in a Linux system is the sum of the physical RAM plus the sum of all configured swap devices. When the "working set" of a workload exceeds the size of physical RAM, swapping occurs -- swap devices are essentially used to emulate physical RAM. But, generally, a swap device is several orders of magnitude slower than RAM so swapping has become synonymous with horrible performance. As a result, wise system administrators increase physical RAM and/or redistribute workloads to ensure that swapping is avoided. But what if swapping isn't always slow?

Frontswap allows the Linux swap subsystem to use transcendent memory, when available, in place of sending data to and from a swap device. Frontswap is not in itself a swap device and, thus, requires no swap-device-like configuration. It does not change the total virtual memory in the system; it just results in faster swapping... some/most/nearly all of the time, but not necessarily always. Remember that the quantity of transcendent memory is unknowable and dynamic. With frontswap, whenever a page needs to be swapped out the swap subsystem asks tmem if it is willing to take the page of data. If tmem rejects it, the swap subsystem writes the page, as normal, to the swap device. If tmem accepts it, the swap subsystem can request the page of data back at any time and it is guaranteed to be retrievable from tmem. And, later, if the swap subsystem is certain the data is no longer valid (e.g. if the owning process has exited), it can flush the page of data from tmem.
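In simplified form, the swap-out hook behaves like the sketch below; frontswap_put_page() is the frontend call, while the surrounding function and the fallback helper are illustrative stand-ins for the real swap-write path:

    /* Simplified sketch of the swap-out decision (helper names are illustrative). */
    static int swap_out_one_page(struct page *page)
    {
            /* Offer the page to tmem via frontswap; tmem may refuse it. */
            if (frontswap_put_page(page) == 0)
                    return 0;       /* accepted: guaranteed retrievable later, no disk I/O */

            /* Rejected: write the page to the physical swap device as usual. */
            return write_page_to_swap_device(page);
    }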

Note that tmem can reject any or every frontswap "put". Why would it? One example is if tmem is a resource shared between multiple kernels (aka tmem "clients"), as is the case for Xen tmem or for RAMster; another kernel may have already claimed the space, or perhaps this kernel has exceeded some tmem-managed quota. Another example is if tmem is compressing data as it does in zcache and it determines that the compressed page of data is too large; in this case, tmem might reject any page that isn't sufficiently compressible OR perhaps even if the mean compression ratio is growing unacceptably.

The frontswap patchset is non-invasive and does not impact the behavior of the swap subsystem at all when frontswap is disabled. Indeed, a key kernel maintainer has observed that frontswap appears to be "bolted on" to the swap subsystem. That is a good thing as the existing swap subsystem code is very stable, infrequently used (because swapping is so slow), yet critical to system correctness; dramatic change to the swap subsystem is probably unwise and frontswap only touches the fringes of it.

A few implementation notes: Frontswap requires one bit of metadata per page of enabled swap. (The Linux swap subsystem until recently required 16 bits, and now requires eight bits, of metadata per page, so frontswap increases this by 12.5%.) This bit per page records whether the page is in tmem or on the physical swap device. Since, at any time, some pages may be in frontswap and some on the physical device, the swap subsystem's "swapoff" code also requires some modification. And because in-use tmem is more valuable than swap device space, frontswap provides some additional modifications so that a "partial swapoff" can be performed. And, of course, hooks are added at the read-page and write-page routines to divert data into tmem, and a hook is added to flush the data when it is no longer needed. All told, the patch parts that affect core kernel components add up to less than 100 lines.
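The bit-per-page bookkeeping mentioned above can be sketched as follows; the real patch keeps a bitmap in the swap subsystem's per-area structure, and the field and helper names below are illustrative:

    /* Sketch of the one-bit-per-page map (field and helper names are illustrative). */
    #include <linux/types.h>
    #include <linux/bitops.h>

    struct swap_area_meta {
            unsigned long *frontswap_map;   /* one bit per page of this swap area */
    };

    /* Record that the page at 'offset' now lives in tmem rather than on the device. */
    static inline void mark_in_frontswap(struct swap_area_meta *sa, pgoff_t offset)
    {
            set_bit(offset, sa->frontswap_map);
    }

    /* Consult the bit before deciding whether to read from tmem or from the device. */
    static inline bool is_in_frontswap(struct swap_area_meta *sa, pgoff_t offset)
    {
            return test_bit(offset, sa->frontswap_map);
    }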

Cleancache

In most workloads, the kernel fetches pages from a slow disk and, when RAM is plentiful, the kernel retains copies of many of these pages in memory, assuming that a disk page used once is likely to be used again. There's no sense incurring two disk reads when one will do and there's nothing else to do with that plentiful RAM anyway. If any data is written to one of those pages, the changes must be written to disk but, in anticipation of future changes, the (now clean) page continues to be retained in memory. As a result, the number of clean pages in this "page cache" often grows to fill the vast majority of memory. Eventually, when memory is nearly filled, or perhaps if the workload grows to require more memory, the kernel "reclaims" some of those clean pages; the data is discarded and the page frames are used for something else. No data is lost because a clean page in memory is identical to the same page on disk. However, if the kernel later determines that it does need that page of data after all, it must again be fetched from disk, which is called a "refault." Since the kernel can't predict the future, some pages are retained that will never be used again and some pages are reclaimed that soon result in a refault.

Cleancache allows tmem to be used to store clean page cache pages resulting in fewer refaults. When the kernel reclaims a page, rather than discard the data, it places the data into tmem, tagged as "ephemeral", which means that the page of data may be discarded if tmem chooses. Later, if the kernel determines it needs that page of data after all, it asks tmem to give it back. If tmem has retained the page, it gives it back; if tmem hasn't retained the page, the kernel proceeds with the refault, fetching the data from the disk as usual.
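The two sides of that flow can be sketched as follows; cleancache_put_page() and cleancache_get_page() are the real frontend calls, while the surrounding helpers are illustrative simplifications of the reclaim and read paths:

    /* Simplified sketch of the two cleancache hooks (surrounding helpers are illustrative). */
    #include <linux/cleancache.h>

    /* Reclaim: instead of simply dropping a clean page, offer it to tmem. */
    static void reclaim_clean_page(struct page *page)
    {
            cleancache_put_page(page);      /* ephemeral: tmem may discard it later */
            free_the_page_frame(page);      /* illustrative: the frame is reused */
    }

    /* Refault: before issuing a disk read, ask tmem whether it kept the page. */
    static int read_page_on_refault(struct page *page)
    {
            if (cleancache_get_page(page) == 0)
                    return 0;               /* hit: the disk read is avoided */

            return issue_disk_read(page);   /* miss: normal refault from disk */
    }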

For cleancache to function properly, "hooks" are placed where pages are reclaimed and where refaults occur. The kernel is also responsible for ensuring coherency between the page cache, disk, and tmem, so hooks are also present wherever the kernel might invalidate the data. Since cleancache affects the kernel's VFS layer, and since not all filesystems use all VFS features, a filesystem must "opt in" to cleancache when it is mounted.

One interesting note about cleancache is that clean pages may be retained in tmem for a file that has no pages remaining in the kernel's page cache. Thus the kernel must provide a name ("handle") for the page which is unique across the entire filesystem. For some filesystems, the inode number is sufficient, but for modern filesystems, the 192-bit "exportfs" handle is used.
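The key is modeled on the in-kernel struct cleancache_filekey, which looks roughly like this (treat the details as approximate):

    /* Approximate form of the cleancache file key (six 32-bit words = 192 bits). */
    #include <linux/types.h>

    #define CLEANCACHE_KEY_MAX 6

    struct cleancache_filekey {
            union {
                    ino_t ino;                      /* enough for simple filesystems */
                    u32 fh[CLEANCACHE_KEY_MAX];     /* exportfs file handle otherwise */
                    u32 key[CLEANCACHE_KEY_MAX];
            } u;
    };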

Other tmem frontends

A common question is: can user code use tmem? For example, can enterprise applications that otherwise circumvent the pagecache use tmem? Currently the answer is no, but one could implement "tmem syscalls" to allow this. Coherency issues may arise, and it remains to be seen if they could be managed in user space.

What about other in-kernel uses? Some have suggested that the kernel dcache might provide a useful source of data for tmem. This too deserves further investigation.

4: Transcendent memory backends

The tmem interface allows multiple frontends to function with different backends. Currently only one backend may be configured, though, in the future, some form of layering may be possible. Tmem backends share some common characteristics: although a tmem backend might seem similar to a block device, it does not perform I/O and does not use the block I/O (bio) subsystem. In fact, a tmem backend must perform its functions fully synchronously; that is, it must not sleep and the scheduler may not be called. When a "put" completes, the kernel's page of data has been copied. And a successful "get" may not complete until the page of data has been copied to the kernel's data page. While these constraints create some difficulty for tmem backends, they also ensure that the backend meets the tmem interface requirements while minimizing changes to the core kernel.
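A backend plugs itself in by filling an ops structure and registering it with the frontend; the sketch below follows the 3.0-era cleancache interface, but the exact signatures should be treated as approximate:

    /* Sketch of a backend wiring into cleancache (signatures are approximate). */
    #include <linux/init.h>
    #include <linux/cleancache.h>

    static int my_backend_get_page(int pool_id, struct cleancache_filekey key,
                                   pgoff_t index, struct page *page)
    {
            /* Copy the data into 'page' synchronously, or return < 0 if absent. */
            return -1;
    }

    static void my_backend_put_page(int pool_id, struct cleancache_filekey key,
                                    pgoff_t index, struct page *page)
    {
            /* Copy the data out of 'page' synchronously; may silently refuse. */
    }

    static struct cleancache_ops my_backend_ops = {
            .get_page = my_backend_get_page,
            .put_page = my_backend_put_page,
            /* .init_fs, .flush_page, .flush_inode and .flush_fs are also required */
    };

    static int __init my_backend_init(void)
    {
            cleancache_register_ops(&my_backend_ops);
            return 0;
    }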

Zcache

Although tmem was conceived as a way to share a fixed resource (RAM) among a number of clients with constantly varying memory appetites, it also works nicely when the amount of RAM needed by a single kernel to store some number, N, of pages of data is less than N*PAGE_SIZE and when those pages of data need only be accessed at page granularity. So zcache combines an in-kernel implementation of tmem with in-kernel compression code to reduce the space requirements for data provided through a tmem frontend. As a result, when the kernel is under memory pressure, zcache can substantially increase the number of clean page cache pages and swap cache pages stored in RAM and thus significantly decrease disk I/O.

The zcache implementation is currently a staging driver so it is subject to change; it handles both persistent pages (from frontswap) and ephemeral pages (from cleancache) and, in both cases, uses the in-kernel lzo1x routines to compress/decompress the data contained in the pages. Space for persistent pages is obtained through a shim to xvmalloc, a memory allocator in the zram staging driver designed to store compressed pages. Space for ephemeral pages is obtained through standard kernel get_free_page() calls, then pairs of compressed ephemeral pages are matched using an algorithm called "compression buddies". This algorithm ensures that physical page frames containing two compressed ephemeral pages can easily be reclaimed when necessary; zcache provides a standard "shrinker" routine so those whole page frames can be reclaimed when required by the kernel using the existing kernel shrinker mechanism.

Zcache nicely demonstrates one of the flexibility features of tmem: Recall that, although data may often compress nicely (i.e. by a factor of two or more), it is possible that some workloads may produce long sequences of data that compress poorly. Since tmem allows any page to be rejected at the time of put, zcache policy (adjustable with sysfs tuneables in 3.1) avoids storing this poorly compressible data, instead passing it on to the original swap device for storage, thus dynamically optimizing the density of pages stored in RAM.
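A sketch of that compress-and-maybe-reject step is shown below; lzo1x_1_compress() is the in-kernel routine zcache uses, while the threshold and function name here are illustrative (the real policy lives in drivers/staging/zcache and is governed by the sysfs tuneables mentioned above):

    /*
     * Illustrative sketch of zcache's compress-and-maybe-reject step.
     * 'wrkmem' must be LZO1X_MEM_COMPRESS bytes; 'dst' must be large
     * enough for worst-case (incompressible) output.
     */
    #include <linux/mm.h>
    #include <linux/lzo.h>
    #include <linux/errno.h>

    static int compress_for_tmem(const unsigned char *src, unsigned char *dst,
                                 size_t *dst_len, void *wrkmem)
    {
            int ret = lzo1x_1_compress(src, PAGE_SIZE, dst, dst_len, wrkmem);

            if (ret != LZO_E_OK)
                    return -EINVAL;

            /* Reject pages that do not compress well enough to be worth keeping. */
            if (*dst_len > PAGE_SIZE / 2)   /* illustrative threshold */
                    return -ENOSPC;         /* the page then goes to the swap device instead */

            return 0;
    }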

RAMster

RAMster is still under development but a proof-of-concept exists today. RAMster assumes that we have a cluster-like set of systems with some high-speed communication layer, or "exofabric", connecting them. The collected RAM of all the systems in the "collective" is the shared RAM resource used by tmem. Each cluster node acts as both a tmem client and a tmem server, and decides how much of its RAM to provide to the collective. Thus RAMster is a "peer-to-peer" implementation of tmem.

Ideally this exofabric allows some form of synchronous remote DMA to allow one system to read or write the RAM on another system, but in the initial RAMster proof-of-concept (RAMster-POC), a standard Ethernet connection is used instead. As long as the exofabric is sufficiently faster than disk reads/writes, there is still a net performance win.

Interestingly, RAMster-POC demonstrates a useful dimension of tmem: Once pages have been placed in tmem, the data can be transformed in various ways as long as the pages can be reconstituted when required. When pages are put to RAMster-POC, they are first compressed and cached locally using a zcache-like tmem backend. As local memory constraints increase, an asynchronous process attempts to "remotify" pages to another cluster node; if one node rejects the attempt, another node can be used as long as the local node tracks where the remote data resides. Although the current RAMster-POC doesn't implement this, one could even remotify multiple copies to achieve higher availability (i.e. to recover from node failures).

While this multi-level mechanism in RAMster works nicely for puts, there is currently no counterpart for gets. When a tmem frontend requests a persistent get, the data must be fetched immediately and synchronously; the thread requiring the data must busy-wait for the data to arrive and the scheduler must not be called. As a result current RAMster-POC is best suited for many-core processors, where it is unusual for all cores to be simultaneously active.

Transcendent memory for Xen

Tmem was originally conceived for Xen and so the Xen implementation is the most mature. The tmem backend in Xen utilizes spare hypervisor memory to store data, supports a large number of guests, and optionally implements both compression and deduplication (both within a guest and across guests) to maximize the volume of data that can be stored. The tmem frontends are converted to Xen hypercalls using a shim. Individual guests may be equipped with "self-ballooning" and "frontswap-self-shrinking" (both in Linux 3.1) to optimize their interaction with Xen tmem. Xen tmem also supports shared ephemeral pools, so that guests co-located on a physical server that share a cluster filesystem need only keep one copy of a cleancache page in tmem. The Xen control plane also fully implements tmem: an extensive set of statistics is available; live migration and save/restore of tmem-using guests are fully supported; and limits, or "weights", may be applied to tmem guests to avoid denial-of-service.

Transcendent memory for kvm

The in-kernel tmem code included in zcache has been updated in 3.1 to support multiple tmem clients. With this in place, a KVM implementation of tmem should be fairly easy to complete, at least in prototype form. As with Xen, a shim would need to be placed in the guest to convert the cleancache and frontswap frontend calls to KVM hypercalls. On the host side, these hypercalls would need to be interfaced with the in-kernel tmem backend code. Some additional control plane support would also be necessary for this to be used in a KVM distribution.

Future tmem backends

The flexibility and dynamicity of tmem suggest that it may be useful for other storage needs, and other backends have been proposed. The idiosyncrasies of some RAM-extension technologies, such as SSDs and phase-change memory (PRAM), have been observed to be a possible fit; since page-size quantities are always used, writes can be carefully controlled and accounted for, and user code never writes to tmem, memory technologies that could previously only be used as a fast I/O device could now instead be used as slow RAM. Some of these ideas are already under investigation.

Appendix A: Self-ballooning and frontswap-selfshrinking

After a system has been running for a while, it is not uncommon for the vast majority of its memory to be filled with clean pagecache pages. With some tmem backends, especially Xen, it may make sense for those pages to reside in tmem instead of in the guest. To achieve this, Xen implements aggressive "self-ballooning", which artificially creates memory pressure by driving the Xen balloon driver to claim page frames, thus forcing the kernel to reclaim pages, which sends them to tmem. The algorithm essentially uses control theory to drive towards a memory target that approximates the current "working set" of the workload using the "Committed_AS" kernel variable. Since Committed_AS doesn't account for clean, mapped pages, these pages end up residing in Xen tmem where, queueing theory assures us, Xen can manage the pages more efficiently.
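A heavily simplified sketch of one step of that control loop appears below; the real driver is drivers/xen/xen-selfballoon.c, and the constants and names here are illustrative. Committed_AS is read from the kernel's vm_committed_as counter:

    /* Illustrative sketch of one self-ballooning control step (constants are made up). */
    #include <linux/mman.h>             /* vm_committed_as */
    #include <linux/percpu_counter.h>

    #define RESERVED_PAGES  1024        /* illustrative safety margin */
    #define DOWN_HYSTERESIS 8           /* illustrative: give memory back slowly */
    #define UP_HYSTERESIS   1           /* illustrative: take memory back quickly */

    static unsigned long selfballoon_target(unsigned long current_pages)
    {
            unsigned long committed = percpu_counter_read_positive(&vm_committed_as);
            unsigned long goal = committed + RESERVED_PAGES;

            /* Move only part of the way each interval to avoid thrashing. */
            if (goal < current_pages)
                    goal = current_pages - (current_pages - goal) / DOWN_HYSTERESIS;
            else
                    goal = current_pages + (goal - current_pages) / UP_HYSTERESIS;

            return goal;    /* the balloon driver is then driven toward this value */
    }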

If the working set increases unexpectedly and faster than the self-balloon driver is able to (or chooses to) provide usable RAM, swapping occurs, but, in most cases, frontswap is able to absorb this swapping into Xen tmem. However, because the kernel swap subsystem assumes that swapping occurs to a disk, swapped pages may sit on the "disk" for a very long time, even if the kernel knows the page will never be used again, because the disk space costs very little and can be overwritten when necessary. When such stale pages are in frontswap, however, they are taking up valuable space.

Frontswap-self-shrinking works to resolve this problem: when frontswap activity is stable and the guest kernel returns to a state where it is not under memory pressure, pressure is provided to remove some pages from frontswap, using a "partial" swapoff interface, and return them to kernel memory, thus freeing tmem space for more urgent needs, i.e. other guests that are currently memory-constrained.

Both self-ballooning and frontswap-self-shrinking provide sysfs tuneables to drive their control processes. Further experimentation will be necessary to optimize them.

Appendix B: Subtle tmem implementation requirements

Although tmem places most coherency responsibility on its clients, a tmem backend itself must enforce two coherency requirements. These are called "get-get" coherency and "put-put-get" coherency. For the former, a tmem backend guarantees that if a get fails, a subsequent get to the same handle will also fail (unless, of course, there is an intermediate put). For the latter, if a put places data "A" into tmem and a subsequent put with the same handle places data "B" into tmem, a subsequent "get" must never return "A".

This second coherency requirement results in an unusual corner case which affects the API/ABI specification: if a put with a handle "X" of data "A" is accepted, and then a subsequent put is done to handle "X" with data "B", this is referred to as a "duplicate put". In this case, the API/ABI allows the backend implementation two options, and the frontend must be prepared for either: (1) if the duplicate put is accepted, the backend replaces data "A" with data "B" and success is returned; or (2) the duplicate put may be failed, in which case the backend must flush the data associated with "X" so that a subsequent get will fail. This is the only case where a persistent get of a previously accepted put may fail; fortunately, in this case, the frontend has the new data "B", which would have overwritten the old data "A" anyway.
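Seen from the backend side, the duplicate-put rule can be sketched like this (all names are illustrative):

    /* Illustrative sketch of the two permitted "duplicate put" behaviors. */
    static int handle_duplicate_put(struct tmem_pool *pool, struct tmem_handle h,
                                    struct page *new_data)
    {
            if (backend_can_replace(pool, h)) {
                    /* Option 1: accept the put, replacing old data "A" with new data "B". */
                    replace_data(pool, h, new_data);
                    return 0;
            }

            /*
             * Option 2: fail the put, but then the old data must be flushed so
             * that a later get fails rather than returning the stale data "A".
             */
            flush_handle(pool, h);
            return -1;
    }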


Transcendent memory in a nutshell

Posted Aug 18, 2011 11:39 UTC (Thu) by Velmont (guest, #46433) [Link] (1 responses)

Truly interesting stuff! :-)

But I wonder, what's the difference between compcache and zcache then? Is it only that zcache uses tmem? Are they the same?

Transcendent memory in a nutshell

Posted Aug 18, 2011 14:36 UTC (Thu) by djm1021 (guest, #31130) [Link]

Compcache was only for swap pages (doesn't handle clean pagecache pages) and required a swap "device" of a fixed size to be pre-configured. So essentially zcache is the next step in the evolution of compcache, using the tmem interface.


Transcendent memory in a nutshell

Posted Aug 20, 2011 21:34 UTC (Sat) by eyal (subscriber, #949) [Link]

Excellent article: clearly written, from the general idea down to the gritty details.

Thanks!

Eyal.

