Kernel development

Brief items

Kernel release status

The current development kernel is 3.1-rc2, released on August 14. "Hey, nice calm first week after the merge window. Good job. Or maybe people are just being lazy, and everybody is on vacation. Whatever. Don't tell me. I'm reasonably happy, I want to stay that way." Details can be found in the full changelog. The code name for this kernel, incidentally, has been changed to "wet seal."

Stable updates: three stable updates, including 3.0.2, were released on August 15. They contain the usual pile of fixes. All three updates also include a change to how TCP sequence numbers are generated; a (relatively) insecure 24-bit MD4 algorithm has been replaced by 32-bit MD5. 3.0.3 was released on August 17 with another set of useful fixes.

Quotes of the week

The truth to realize is that we have grown really good at decimating our user-base every year or so.
-- Ingo Molnar

As far as long-term kernels goes, from the Android perspective we strongly prefer to snap up to the most recent released kernel on every platform/device release. I prefer to be as up to date on bugfixes and features from mainline as possible and minimize the deltas on our stack 'o patches as much as possible.
-- Brian Swetland

Possible changes to longterm kernel maintenance

Greg Kroah-Hartman has posted a proposal for some changes to how the stable and (especially) longterm kernels are maintained. The changes are being driven by users other than the enterprise distributors. "Now that 2.6.32 is over a year and a half, and the enterprise distros are off doing their thing with their multi-year upgrade cycles, there's no real need from the distros for a new longterm kernel release. But it turns out that the distros are not the only user of the kernel, other groups and companies have been approaching me over the past year, asking how they could pick the next longterm kernel, or what the process is in determining this." The core idea is to pick a new longterm kernel once a year; that kernel would then be maintained for two years thereafter. There is some discussion on Google+; it should move to the mailing list around August 15.

Kernel development news

Sharing buffers between devices

By Jonathan Corbet
August 15, 2011
CPUs may not have gotten hugely faster in recent years, but they have gained in other ways; a typical system-on-chip (SoC) device now has a number of peripherals which would qualify as reasonably powerful CPUs in their own right. More powerful devices with direct access to the memory bus can take on more demanding tasks. For example, an image frame captured from a camera device can often be passed directly to the graphics processor for display without all of the user-space processing that was once necessary. Increasingly, the CPU's job looks like that of a shop foreman whose main concern is keeping all of the other processors busy.

The foreman's job will be easier if the various devices under its control can communicate easily with each other. One useful addition in this area might be the buffer sharing patch set recently posted by Marek Szyprowski. The idea here is to make it possible for multiple kernel subsystems to share buffers under the control of user space. With this type of feature, applications could wire kernel subsystems together in problem-specific ways then get out of the way, letting the devices involved process the data as it passes through.

There are (at least) a couple of challenges which must be dealt with to make this kind of functionality safe to export to applications. One is that the application should not be able to "create" buffers at arbitrary kernel addresses. Indeed, kernel-space addresses should not be visible to user space at all, so the kernel must provide some other way for an application to refer to a specific buffer. The other is that a shared buffer must not go away until all of its users have let go of it. A buffer may be created by a specific device driver, but it must persist, even if the device is closed, until nobody else expects it to be there.

The mechanism added in this patch set (this part in particular is credited to Tomasz Stanislawski) is relatively simple - though it will probably get more complex in the future. Kernel code wanting to make a buffer available to other parts of the kernel via user space starts by filling in one of these structures:

    struct shrbuf {
    	void (*get)(struct shrbuf *);
    	void (*put)(struct shrbuf *);
    	unsigned long dma_addr;
    	unsigned long size;
    };

One could immediately raise a number of complaints about this structure: the address should be a dma_addr_t, there's no reason not to put the kernel virtual address there, only physically-contiguous buffers are allowed, etc. It also seems like there could be value in the ability to annotate the state of the buffer (filled or empty, for example) and possibly signal another thread when that state changes. But it's worth remembering that this is an explicitly proof-of-concept patch posting and a lot of things will change. In particular, the eventual plan is to pass a scatterlist around instead of a single physical address.

The get() and put() functions are important: they manage reference counts to the buffer, which must continue to exist until that count goes to zero. Any subsystem depending on a buffer's continued existence should hold a reference to that buffer. The put() function should release the buffer when the last reference is dropped.
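
The reference-counting contract can be sketched in user-space C. This is only an illustration under stated assumptions, not code from the patch set: struct my_buffer, my_buffer_create(), the plain integer counter, and the freed_flag field are all hypothetical, and a real driver would use the kernel's own reference-counting and allocation primitives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The structure as shown in the article. */
struct shrbuf {
	void (*get)(struct shrbuf *);
	void (*put)(struct shrbuf *);
	unsigned long dma_addr;
	unsigned long size;
};

/* Hypothetical driver-private wrapper that carries the reference count. */
struct my_buffer {
	struct shrbuf sb;
	int refcount;
	int *freed_flag;	/* set to 1 on release; for demonstration only */
};

/* Recover the wrapper from the embedded shrbuf. */
static struct my_buffer *to_my_buffer(struct shrbuf *sb)
{
	return (struct my_buffer *)((char *)sb - offsetof(struct my_buffer, sb));
}

static void my_buffer_get(struct shrbuf *sb)
{
	to_my_buffer(sb)->refcount++;
}

static void my_buffer_put(struct shrbuf *sb)
{
	struct my_buffer *buf = to_my_buffer(sb);

	assert(buf->refcount > 0);
	if (--buf->refcount == 0) {
		/* Last reference dropped: release the buffer. */
		if (buf->freed_flag)
			*buf->freed_flag = 1;
		free(buf);
	}
}

/* Create a buffer holding one reference on behalf of its creator. */
static struct my_buffer *my_buffer_create(unsigned long size, int *freed_flag)
{
	struct my_buffer *buf = calloc(1, sizeof(*buf));

	if (!buf)
		return NULL;
	buf->sb.get = my_buffer_get;
	buf->sb.put = my_buffer_put;
	buf->sb.size = size;
	buf->refcount = 1;
	buf->freed_flag = freed_flag;
	return buf;
}
```

A second user that calls get() keeps the buffer alive after the creator's put(); the memory is released only when the final put() brings the count to zero.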

Once this structure exists, it can be passed to:

    int shrbuf_export(struct shrbuf *sb);

The return value (if all goes well) will be an integer file descriptor which can be handed to user space. This file descriptor embodies a reference to the buffer, which now will not be released before the file descriptor is closed. Apart from closing the descriptor, there is very little that the application can do with it other than give it to another kernel subsystem; attempts to read from or write to it will fail, for example.

If a kernel subsystem receives a file descriptor which is purported to represent a kernel buffer, it can pass that descriptor to:

    struct shrbuf *shrbuf_import(int fd);

The return value will be the same shrbuf structure (or an ERR_PTR() error value for a file descriptor of the wrong type). A reference is taken on the structure before returning it, so the recipient should call put() at some future time to release it.
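
The lifetime rules for export and import can be mimicked with a small user-space mock. Everything prefixed with mock_ or counting_ here is hypothetical: the array stands in for the kernel's file descriptor table, mock_import() returns NULL where the real shrbuf_import() would return an ERR_PTR() value, and the assumption that exporting takes a reference via get() is an inference from the description above, not something checked against the patch.

```c
#include <assert.h>
#include <stddef.h>

struct shrbuf {
	void (*get)(struct shrbuf *);
	void (*put)(struct shrbuf *);
	unsigned long dma_addr;
	unsigned long size;
};

#define MOCK_MAX_FDS 16

/* Hypothetical stand-in for the kernel's file descriptor table. */
static struct shrbuf *mock_fd_table[MOCK_MAX_FDS];

/* Mimics shrbuf_export(): the new descriptor holds its own reference. */
static int mock_export(struct shrbuf *sb)
{
	for (int fd = 0; fd < MOCK_MAX_FDS; fd++) {
		if (!mock_fd_table[fd]) {
			sb->get(sb);
			mock_fd_table[fd] = sb;
			return fd;
		}
	}
	return -1;	/* table full */
}

/* Mimics shrbuf_import(): returns the buffer with an extra reference. */
static struct shrbuf *mock_import(int fd)
{
	struct shrbuf *sb;

	if (fd < 0 || fd >= MOCK_MAX_FDS || !mock_fd_table[fd])
		return NULL;	/* wrong or closed descriptor */
	sb = mock_fd_table[fd];
	sb->get(sb);
	return sb;
}

/* Mimics close(): drops the reference held by the descriptor. */
static int mock_close(int fd)
{
	struct shrbuf *sb;

	if (fd < 0 || fd >= MOCK_MAX_FDS || !mock_fd_table[fd])
		return -1;
	sb = mock_fd_table[fd];
	mock_fd_table[fd] = NULL;
	sb->put(sb);
	return 0;
}

/* A trivial buffer whose callbacks only count references. */
static int counting_refs;

static void counting_get(struct shrbuf *sb)
{
	(void)sb;
	counting_refs++;
}

static void counting_put(struct shrbuf *sb)
{
	(void)sb;
	assert(counting_refs > 0);
	counting_refs--;
}
```

With these rules, an importer that still holds its reference keeps the buffer alive even after the application has closed the descriptor.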

The patch set includes a new Video4Linux2 ioctl() command (VIDIOC_EXPBUF) enabling the exporting of buffers as file descriptors; a couple of capture drivers have been augmented to support this functionality. No examples of the other side (importing a buffer) have been posted yet.

There has been relatively little commentary on the patch set so far, possibly because it was posted to a couple of relatively obscure mailing lists. It has the look of functionality that could be useful beyond one or two kernel subsystems, though. It would probably make sense for the next iteration, which presumably will have more of the anticipated functionality built into it, to be distributed more widely for review.

Avoiding the OS abstraction trap

August 12, 2011

This article was contributed by Dan J. Williams

It is an innocent idea. After all, "all problems in computer science can be solved by another level of indirection." However, when the problem is developing a device driver for acceptance into the current mainline Linux kernel, OS abstraction (using a level of indirection to hide a kernel's internal API) is taking things a level too far. Seasoned Linux kernel developers will have already cringed at the premise of this article. But they are not my intended readers; instead, this text is aimed at those that find themselves in a similar position as the original authors of the isci driver: a team new to the process of getting a large driver accepted into the mainline, and tasked with enabling several environments at once. The isci driver fell into the OS abstraction trap. These are the lessons learned and attitudes your author developed about this trap while leading the effort to rework the driver for upstream acceptance.

As mentioned above, one would be hard pressed to find an experienced Linux kernel developer willing to accept OS abstraction as a general approach to driver design. So, a simplistic rule of thumb for those wanting to avoid the pain of reworking large amounts of code would be to not go it alone. Arrange for a developer with at least 100 upstream commits to be permanently assigned to the development team, and seek the advice of a developer with at least 500 commits early in the design phase. After the fact, it was one such developer, Arjan van de Ven, who set the expectation for the magnitude of rework effort. When your author was toying with ideas of Coccinelle and other automated ways to unwind the driver's abstractions, Arjan presciently noted (paraphrasing): "it isn't about the specific abstractions, it's about the wider assumptions that lead to the abstractions."

The fundamental problem with OS abstraction techniques is that they actively defeat the purpose of having an open driver in the first place. As a community we want drivers upstream so that we can refactor common code into generic infrastructure and drive uniformity across drivers of the same class. OS abstraction, in contrast, implies the development of driver-specific translations of common kernel constructs, "lowest common denominator" programming to avoid constructs that do not have a clear analogue in all environments, and overlooking the subtleties of interfaces that appear similar but have important semantic differences.

So what were the larger problematic assumptions that led to the rework effort? It comes down to the following list that, given the recurrence of OS-abstracted drivers, may be expected among other developers new to the Linux mainline acceptance process.

  1. The programming interface of the Linux kernel is a static contract to third party developers. The documentation is up to date, and the recourse for upper layer bugs is to add workarounds to the driver.

  2. The "community" is external to the development team. Many conversations about Linux kernel requirements reference the "community" as an anonymous body of developers and norms external to the driver's development process.

  3. An OS abstraction layer can cleanly handle the differences between operating systems.

In the case of the isci driver these assumptions resulted in a nearly half-year effort to rework the driver as measured from the first public release until the driver was ultimately deemed acceptable.

Who fixes the platform?

The kernel community runs on trust and reciprocation. To get things done in a timely manner one needs to build up a cache of trust capital. One quick way to build this capital is to fix bugs or otherwise lower the overall maintenance burden of the kernel. Fixing a bug in a core library, or spotting some duplicated patterns that can be unified are golden opportunities to demonstrate proficiency and build trust.

The attitude of aiming to become a co-implementer of common kernel infrastructure is an inherently foreign idea for a developer with a proprietary environment background. The proprietary OS vendor provides an interface sandbox for the driver to play in that is assumed to be rigid, documented and supported. A similar assumption was carried to the isci driver; for example, it initially contained workarounds for bugs (real and perceived) in libsas and the other upper layers. The assumption behind those workarounds seems to be that the "vendor's" (maintainer's) interface is broken and the vendor is on the hook for a fix. This is, of course, the well-known "platform problem."

In the particular case of libsas, SCSI maintainer James Bottomley noted: "there's no overall maintainer, it's jointly maintained by its users." Internal kernel interfaces evolve to meet the needs of their users, but the users that engender the most trust tend to have an easier time getting their needs met. In this case, root-causing the bugs or allowing time to clarify the documentation for libsas ahead of the driver's introduction might have streamlined the acceptance process; it certainly would have saved the effort of developing local workarounds to global problems.

We the community

Similar to how internal kernel interfaces evolve at a different pace than their documentation, so too do the expectations of the community for mainline-acceptable code versus the documented submission requirements. However, in contrast to the interface question where the current code can be used to clarify interface details, the same cannot be done to determine the current set of requirements for mainline acceptance. The reality is that code exists in the mainline tree that would not be acceptable if it were being re-submitted for inclusion today.

A driver with an OS-abstracted core can be found in the tree, but over time the maintenance burden incurred by that architecture has precluded future drivers from taking the same approach. As a result, attempting to understand "the community" from an external position is a sure-fire way to underestimate the current set of requirements for acceptable code. The only way to acquire this knowledge is ongoing participation. Read other drivers, read the code reviews from other developers, and try to answer the question "would someone external to the development team have a chance at maintaining the driver without assistance?".

One clear "no" answer to this question from the isci driver experience came from the simple usage of C99 structure initializers. The common core was targeted for reuse in environments where there was no compiler support for this syntax. However, the state machine implementation in the driver had dozens of tables filled with, in some cases, hundreds of function pointers. Modifying such tables by counting commas and trusting comments is error prone. The conversion to C99-style struct initialization made the code safer to edit (compiler verifiable), more readable and, consequently, allowed many of those tables to be removed. These initializations were a simple example of lowest-common-denominator programming and a nuance that an "external" Linux kernel developer need not care to understand when modifying the driver, especially when the next round of cleanups is enabled by the change.
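
The difference between the two initialization styles is easy to see in a small example. The states, handler names, and table layout here are invented for illustration and are not taken from the isci driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states and handlers for a small state machine. */
enum demo_state {
	DEMO_IDLE,
	DEMO_STARTING,
	DEMO_READY,
	DEMO_STOPPING,
	DEMO_STATE_MAX
};

typedef int (*demo_handler_t)(void);

static int demo_start(void) { return 1; }
static int demo_stop(void)  { return 2; }

struct demo_state_ops {
	demo_handler_t start;
	demo_handler_t stop;
	demo_handler_t reset;
};

/* Positional style: correctness depends on counting commas and trusting comments. */
static const struct demo_state_ops positional_table[DEMO_STATE_MAX] = {
	{ demo_start, NULL, NULL },	/* DEMO_IDLE */
	{ NULL, NULL, NULL },		/* DEMO_STARTING */
	{ NULL, demo_stop, NULL },	/* DEMO_READY */
	{ NULL, NULL, NULL },		/* DEMO_STOPPING */
};

/* C99 style: entries are named, order is irrelevant, omitted fields are NULL. */
static const struct demo_state_ops designated_table[DEMO_STATE_MAX] = {
	[DEMO_READY] = { .stop = demo_stop },
	[DEMO_IDLE]  = { .start = demo_start },
};
```

Both tables hold the same contents, but in the second one a handler cannot silently land in the wrong slot when a field or a state is added, and the all-NULL rows simply disappear.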

Can the OS be hidden?

OS abstraction defenders may look at that last example and propose automated ways to convert the code for different environments. The dangerous assumptions of automated abstraction replacement engines are the same as listed above. The Linux kernel internal interface is not static, so the abstraction needs to keep up with the pace of API change, but that effort is better spent participating in the Linux interface evolution.

More dangerous is the assumption that similar looking interfaces from different environments can be properly captured by a unified abstraction. The prominent example from the isci driver was memory mapping: converting between virtual and physical addresses. As far as your author knows, Linux is one of the few environments that utilizes IOMMUs (I/O memory management units) to protect standard streaming DMA mappings requested by device drivers (via the DMA API). The isci abstraction had a virtual-to-physical abstraction that mapped to virt_to_phys() (broken but straightforward to fix), but it also had a physical-to-virtual abstraction mapped to phys_to_virt() which was not straightforward to fix. The assumption that physical-to-virtual translation was a viable mechanism led to an implementation that not only missed the DMA API requirements, but also the need to use kmap() when accessing pages that may be located in high memory. The lesson is that convenient interfaces in other environments can lead to the diversionary search for equivalent functionality in Linux and magnify the eventual rework effort.


The initial patch added 60,896 lines to the kernel over 159 files. Once the rework was done, the number of new files was cut down to 34 and overall diffstat for the effort was:

    192 files changed, 23575 insertions(+), 60895 deletions(-)

There is no question that adherence to current Linux coding principles resulted in a simpler implementation of the isci driver. The community's mainline acceptance criteria are designed to maximize the portability and effectiveness of a kernel developer's skills across drivers. Any locally-developed convenience mechanisms that diminish that global portability will almost always result in requests for changes and prevent mainline acceptance. In the end, participation and getting one's hands dirty in the evolution of the native interfaces is the requirement for mainline acceptance, and it is very difficult to achieve that through a level of indirection.

I want to thank Christoph Hellwig and James Bottomley for their review and recognize Dave Jiang, Jeff Skirvin, Ed Nadolski, Jacek Danecki, Maciej Trela, Maciej Patelczyk and the rest of the isci development team that accomplished this rework.

Transcendent memory in a nutshell

August 12, 2011

This article was contributed by Dan Magenheimer

The Linux kernel carefully enumerates and tracks all of its memory and, for the most part, it can individually access every byte of it. The purpose of transcendent memory ("tmem") is to provide the kernel with the capability to utilize memory that it cannot enumerate, sometimes cannot track, and cannot directly address. This may sound counterintuitive, or even silly, to core kernel developers, but as we will see it is actually quite useful; indeed it adds a level of flexibility to the kernel that allows some rather complex functionalities to be implemented and layered on a handful of tiny changes to the core kernel. The end goal is that memory can be more efficiently utilized by one kernel and/or load-balanced between multiple kernels (in a virtualized OR a non-virtualized environment), resulting in higher performance and/or lower RAM costs in a system or across a data center. This article will provide an overview of transcendent memory and how it is used in the Linux kernel.

Exactly how the kernel talks to tmem will be described in Part 2, but there are certain classes of data maintained by the kernel that are suitable. Two of these are known to kernel developers as "clean pagecache pages" and "swap pages". The patch that deals with the former is known as "cleancache"; it was merged into the 3.0 kernel. The patch that deals with swap pages is known as frontswap and is still being reviewed on the Linux kernel mailing list, with a target of linux-3.2. There may well be other classes of data that will also work well with tmem. Collectively these sources of suitable data for tmem can be referred to as "frontends" for tmem and we will detail them in Part 3.

There are multiple implementations of tmem which store data using different methods. We can refer to these data stores as "backends" for tmem, and all frontends can be used by all backends (possibly using a shim to connect them). The initial tmem implementation, known as "Xen tmem," allows Xen hypervisor memory to be used to store data for one or more tmem-enabled guest kernels. Xen tmem has been implemented in Xen for over two years and has been shipping in Xen since Xen 4.0; the in-kernel shim for Xen tmem was merged into 3.0 (for cleancache only, updated to also support frontswap in 3.1). Another Xen driver component, the Xen self-ballooning driver, which helps encourage a guest kernel to use tmem efficiently, was merged for 3.1 and also includes the "frontswap-selfshrinker". See Appendix A for more information about these.

The second tmem implementation does not involve virtualization at all and is known as "zcache"; it is an in-kernel driver that stores compressed pages. Zcache essentially "doubles RAM" for any class of kernel memory that the kernel can provide via a tmem frontend (e.g. cleancache, frontswap), thus reducing memory requirements in, for example, embedded kernels. Zcache was merged as a staging driver in 2.6.39 (though dependent on the cleancache and frontswap patchsets which were not yet upstream).

A third tmem implementation is underway; it is known as "RAMster." In RAMster, a "closely-connected" set of kernels effectively pool their RAM so that a RAM-hungry workload on one machine can temporarily and transparently utilize RAM on another machine which is presumably idle or running a non-RAM-hungry workload. RAMster has also been dubbed "peer-to-peer transcendent memory" and is intended for non-virtualized kernels but is also being tested with virtualized kernels. While RAMster is best suited to an environment where multiple systems are connected by a high-speed "exofabric", in which one system can directly address another system's memory, the initial prototype is built on a standard ethernet connection.

Other tmem implementations have been proposed: For example, there has been some argument about how useful tmem might be for KVM and/or for containers. With recent changes to zcache merged in 3.1, it may be very easy to simply implement the necessary shims and try these out; nobody has yet stepped up to do it. As another example, it has been observed that the tmem protocols may be ideal for certain kinds of RAM-like technologies such as "phase-change" memory (PRAM); most of these technologies have certain idiosyncrasies, such as limited write-cycles, that can be managed effectively through a software interface such as tmem. Discussions have begun with certain vendors of such RAM-like technologies. Yet another example is a variation of RAMster: a single machine in a cluster acts as a "memory server" and memory is added solely to that machine; the memory may be RAM, may be RAM-like, or perhaps may be a fast SSD.

The existing tmem implementations will be described in Part 4 along with some speculation about future implementations.

2: How the kernel talks to transcendent memory

The kernel "talks" to tmem through a carefully defined interface, which was crafted to provide maximum flexibility for the tmem implementation while incurring low impact on the core kernel. The tmem interface may appear odd but there are good reasons for its peculiarities. Note that in some cases the tmem interface is completely internal to the kernel and is thus an "API"; in other cases it defines the boundary between two independent software components (e.g. Xen and a guest Linux kernel) so is properly called an "ABI".

(Casual readers may wish to skip this section.)

Tmem should be thought of as another "entity" that "owns" some memory. The entity might be an in-kernel driver, another kernel, or a hypervisor/host. As previously noted, tmem cannot be enumerated by the kernel; the size of tmem is unknowable to the kernel, may change dynamically, and may at any time be "full". As a result, the kernel must "ask" tmem on every individual page to accept data or to retrieve data.

Tmem is not byte-addressable -- only large chunks of data (exactly or approximately a page in size) are copied between kernel memory and tmem. Since the kernel cannot "see" tmem, it is the tmem side of the API/ABI that copies the data from/to kernel memory. Tmem organizes related chunks of data in a pool; within a pool, the kernel chooses a unique "handle" to represent the equivalent of an address for the chunk of data. When the kernel requests the creation of a pool, it specifies certain attributes to be described below. If pool creation is successful, tmem provides a "pool id". Handles are unique within pools, not across pools, and consist of a 192-bit "object id" and a 32-bit "index." The rough equivalent of an object is a "file" and the index is the rough equivalent of a page offset into the file.

The two basic operations of tmem are "put" and "get". If the kernel wishes to save a chunk of data in tmem, it uses the "put" operation, providing a pool id, a handle, and the location of the data; if the put returns success, tmem has copied the data. If the kernel wishes to retrieve data, it uses the "get" operation and provides the pool id, the handle, and a location for tmem to place the data; if the get succeeds, on return, the data will be present at the specified location. Note that, unlike I/O, the copying performed by tmem is fully synchronous. As a result, arbitrary locks can (and, to avoid races, often should!) be held by the caller.

There are two basic pool types: ephemeral and persistent. Pages successfully put to an ephemeral pool may or may not be present later when the kernel uses a subsequent get with a matching handle. Pages successfully put to a persistent pool are guaranteed to be present for a subsequent get. (Additionally, a pool may be "shared" or "private".)

The kernel is responsible for maintaining coherency between tmem and the kernel's own data, and tmem has two types of "flush" operations to assist with this: To disassociate a handle from any tmem data, the kernel uses a "flush" operation. To disassociate all chunks of data in an object, the kernel uses a "flush object" operation. After a flush, subsequent gets will fail. A get on an (unshared) ephemeral pool is destructive, i.e. implies a flush; otherwise, the get is non-destructive and an explicit flush is required. (There are two additional coherency guarantees that are described in Appendix B.)
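
The operations described in this section can be modeled with a toy, user-space pool. This is a sketch of the semantics only, not the kernel's tmem code: all toy_ names, the fixed slot array, and the small capacity are assumptions made for illustration. The toy treats every ephemeral pool as unshared, and it never discards ephemeral pages on its own, which a real backend is free to do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SLOTS     8	/* tiny capacity so that a put can be rejected */

struct toy_handle {
	uint64_t oid[3];	/* 192-bit object id */
	uint32_t index;		/* 32-bit index within the object */
};

struct toy_slot {
	int used;
	struct toy_handle h;
	unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pool {
	int ephemeral;		/* 1: ephemeral, 0: persistent */
	struct toy_slot slots[TOY_SLOTS];
};

static struct toy_slot *toy_find(struct toy_pool *p, const struct toy_handle *h)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		struct toy_slot *s = &p->slots[i];

		if (s->used && s->h.index == h->index &&
		    memcmp(s->h.oid, h->oid, sizeof(h->oid)) == 0)
			return s;
	}
	return NULL;
}

/* "put": tmem copies the page in; returns 0 on success, -1 if rejected. */
static int toy_put(struct toy_pool *p, const struct toy_handle *h, const void *page)
{
	struct toy_slot *s = toy_find(p, h);

	for (int i = 0; !s && i < TOY_SLOTS; i++)
		if (!p->slots[i].used)
			s = &p->slots[i];
	if (!s)
		return -1;	/* "full": the caller must cope with rejection */
	s->used = 1;
	s->h = *h;
	memcpy(s->data, page, TOY_PAGE_SIZE);
	return 0;
}

/* "flush": disassociate the handle from any data. */
static void toy_flush(struct toy_pool *p, const struct toy_handle *h)
{
	struct toy_slot *s = toy_find(p, h);

	if (s)
		s->used = 0;
}

/* "flush object": drop every chunk belonging to one object id. */
static void toy_flush_object(struct toy_pool *p, const uint64_t oid[3])
{
	for (int i = 0; i < TOY_SLOTS; i++)
		if (p->slots[i].used &&
		    memcmp(p->slots[i].h.oid, oid, 3 * sizeof(uint64_t)) == 0)
			p->slots[i].used = 0;
}

/* "get": tmem copies the page out; destructive on an ephemeral pool. */
static int toy_get(struct toy_pool *p, const struct toy_handle *h, void *page)
{
	struct toy_slot *s = toy_find(p, h);

	if (!s)
		return -1;
	memcpy(page, s->data, TOY_PAGE_SIZE);
	if (p->ephemeral)
		s->used = 0;	/* the get implies a flush */
	return 0;
}
```

Note that it is the toy_ functions, standing in for the tmem side of the interface, that perform every copy; the caller only supplies a handle and a buffer.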

3: Transcendent memory frontends: frontswap and cleancache

While other frontends are possible, the two existing tmem frontends, frontswap and cleancache, cover two of the primary types of kernel memory that are sensitive to memory pressure. These two frontends are complementary: cleancache handles (clean) mapped pages that would otherwise be reclaimed by the kernel; frontswap handles (dirty) anonymous pages that would otherwise be swapped out by the kernel. When a successful cleancache_get happens, a disk read has been avoided. When a successful frontswap_put (or get) happens, a swap device write (or read) has been avoided. Together, assuming tmem is significantly faster than disk paging/swapping, substantial performance gains may be obtained in a memory-constrained environment.


The total amount of "virtual memory" in a Linux system is the sum of the physical RAM plus the sum of all configured swap devices. When the "working set" of a workload exceeds the size of physical RAM, swapping occurs -- swap devices are essentially used to emulate physical RAM. But, generally, a swap device is several orders of magnitude slower than RAM so swapping has become synonymous with horrible performance. As a result, wise system administrators increase physical RAM and/or redistribute workloads to ensure that swapping is avoided. But what if swapping isn't always slow?

Frontswap allows the Linux swap subsystem to use transcendent memory, when available, in place of sending data to and from a swap device. Frontswap is not in itself a swap device and, thus, requires no swap-device-like configuration. It does not change the total virtual memory in the system; it just results in faster swapping... some/most/nearly all of the time, but not necessarily always. Remember that the quantity of transcendent memory is unknowable and dynamic. With frontswap, whenever a page needs to be swapped out the swap subsystem asks tmem if it is willing to take the page of data. If tmem rejects it, the swap subsystem writes the page, as normal, to the swap device. If tmem accepts it, the swap subsystem can request the page of data back at any time and it is guaranteed to be retrievable from tmem. And, later, if the swap subsystem is certain the data is no longer valid (e.g. if the owning process has exited), it can flush the page of data from tmem.

Note that tmem can reject any or every frontswap "put". Why would it? One example is if tmem is a resource shared between multiple kernels (aka tmem "clients"), as is the case for Xen tmem or for RAMster; another kernel may have already claimed the space, or perhaps this kernel has exceeded some tmem-managed quota. Another example is if tmem is compressing data as it does in zcache and it determines that the compressed page of data is too large; in this case, tmem might reject any page that isn't sufficiently compressible OR perhaps even if the mean compression ratio is growing unacceptably.

The frontswap patchset is non-invasive and does not impact the behavior of the swap subsystem at all when frontswap is disabled. Indeed, a key kernel maintainer has observed that frontswap appears to be "bolted on" to the swap subsystem. That is a good thing as the existing swap subsystem code is very stable, infrequently used (because swapping is so slow), yet critical to system correctness; dramatic change to the swap subsystem is probably unwise and frontswap only touches the fringes of it.

A few implementation notes: Frontswap requires one bit of metadata per page of enabled swap. (The Linux swap subsystem until recently required 16 bits, and now requires eight bits of metadata per page so frontswap increases this by 12.5%.) This bit-per-page records whether the page is in tmem or is on the physical swap device. Since, at any time, some pages may be in frontswap and some on the physical device, the swap subsystem "swapoff" code also requires some modification. And because in-use tmem is more valuable than swap device space, some additional modifications are provided by frontswap so that a "partial swapoff" can be performed. And, of course, hooks are added at the read-page and write-page routines to divert data into tmem and a hook is added to flush the data when it is no longer needed. All told, the patch parts that affect core kernel components add up to less than 100 lines.
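
The swap-out decision and the bit-per-page bookkeeping can be sketched as follows. This is a user-space model, not the frontswap patch: the fs_ names, the callback structure, and the demonstration backend (which accepts only even offsets) are all hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define FS_SWAP_PAGES 64

/* One bit per swap page: set if the page is in tmem, clear if on the device. */
static unsigned char fs_in_tmem[(FS_SWAP_PAGES + CHAR_BIT - 1) / CHAR_BIT];

static void fs_set_bit(size_t n)
{
	fs_in_tmem[n / CHAR_BIT] |= (unsigned char)(1u << (n % CHAR_BIT));
}

static void fs_clear_bit(size_t n)
{
	fs_in_tmem[n / CHAR_BIT] &= (unsigned char)~(1u << (n % CHAR_BIT));
}

static int fs_test_bit(size_t n)
{
	return (fs_in_tmem[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1;
}

/* Hypothetical hooks supplied by the caller. */
struct fs_backend {
	int (*tmem_put)(size_t offset, const void *page);	/* 0 on success */
	void (*device_write)(size_t offset, const void *page);
};

/* Swap out one page: offer it to tmem first, fall back to the swap device. */
static void fs_swap_out(const struct fs_backend *be, size_t offset, const void *page)
{
	assert(offset < FS_SWAP_PAGES);
	if (be->tmem_put(offset, page) == 0) {
		fs_set_bit(offset);
	} else {
		fs_clear_bit(offset);
		be->device_write(offset, page);
	}
}

/* Demonstration backend: tmem "accepts" only even offsets; writes are counted. */
static int fs_device_writes;

static int fs_demo_tmem_put(size_t offset, const void *page)
{
	(void)page;
	return (offset % 2 == 0) ? 0 : -1;
}

static void fs_demo_device_write(size_t offset, const void *page)
{
	(void)offset;
	(void)page;
	fs_device_writes++;
}
```

On swap-in, the same bit tells the swap subsystem whether to ask tmem for the page or to read it from the device.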


In most workloads, the kernel fetches pages from a slow disk and, when RAM is plentiful, the kernel retains copies of many of these pages in memory, assuming that a disk page used once is likely to be used again. There's no sense incurring two disk reads when one will do and there's nothing else to do with that plentiful RAM anyway. If any data is written to one of those pages, the changes must be written to disk but, in anticipation of future changes, the (now clean) page continues to be retained in memory. As a result, the number of clean pages in this "page cache" often grows to fill the vast majority of memory. Eventually, when memory is nearly filled, or perhaps if the workload grows to require more memory, the kernel "reclaims" some of those clean pages; the data is discarded and the page frames are used for something else. No data is lost because a clean page in memory is identical to the same page on disk. However, if the kernel later determines that it does need that page of data after all, it must again be fetched from disk, which is called a "refault." Since the kernel can't predict the future, some pages are retained that will never be used again and some pages are reclaimed that soon result in a refault.

Cleancache allows tmem to be used to store clean page cache pages resulting in fewer refaults. When the kernel reclaims a page, rather than discard the data, it places the data into tmem, tagged as "ephemeral", which means that the page of data may be discarded if tmem chooses. Later, if the kernel determines it needs that page of data after all, it asks tmem to give it back. If tmem has retained the page, it gives it back; if tmem hasn't retained the page, the kernel proceeds with the refault, fetching the data from the disk as usual.
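
The reclaim and refault paths can be modeled in a few lines of user-space C. The cc_ names, the one-entry store standing in for an ephemeral pool, the integer key, and the fake disk are assumptions made for this sketch; none of them come from the cleancache patch.

```c
#include <assert.h>
#include <string.h>

#define CC_PAGE_SIZE 64	/* kept small for the demonstration */

/* One-entry stand-in for an ephemeral tmem pool. */
static struct {
	int valid;
	unsigned long key;
	unsigned char data[CC_PAGE_SIZE];
} cc_tmem;

static int cc_disk_reads;

/* Reclaim hook: hand the clean page to tmem instead of just discarding it. */
static void cc_reclaim(unsigned long key, const unsigned char *page)
{
	cc_tmem.valid = 1;
	cc_tmem.key = key;
	memcpy(cc_tmem.data, page, CC_PAGE_SIZE);
}

/* tmem is free to drop ephemeral data whenever it chooses. */
static void cc_tmem_discard(void)
{
	cc_tmem.valid = 0;
}

/* Stand-in for a real disk read; the page contents are derived from the key. */
static void cc_read_from_disk(unsigned long key, unsigned char *page)
{
	cc_disk_reads++;
	memset(page, (int)(key & 0xff), CC_PAGE_SIZE);
}

/* Refault path: ask tmem first, read from disk only on a miss. */
static void cc_refault(unsigned long key, unsigned char *page)
{
	if (cc_tmem.valid && cc_tmem.key == key) {
		memcpy(page, cc_tmem.data, CC_PAGE_SIZE);
		cc_tmem.valid = 0;	/* a get on an ephemeral pool is destructive */
		return;
	}
	cc_read_from_disk(key, page);
}
```

Either way the caller ends up with the same data; the only difference is whether a disk read was needed to obtain it.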

To function properly, cleancache "hooks" are placed where pages are reclaimed and where the refault occurs. The kernel is also responsible for ensuring coherency between the page cache, disk, and tmem, so hooks are also present wherever the kernel might invalidate the data. Since cleancache affects the kernel's VFS layer, and since not all filesystems use all VFS features, a filesystem must "opt in" to use cleancache whenever it mounts a filesystem.

One interesting note about cleancache is that clean pages may be retained in tmem for a file that has no pages remaining in the kernel's page cache. Thus the kernel must provide a name ("handle") for the page which is unique across the entire filesystem. For some filesystems, the inode number is sufficient, but for modern filesystems, the 192-bit "exportfs" handle is used.
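As a rough illustration of such a name, the sketch below builds a fixed-width, 192-bit file key either from an inode number or from an opaque filesystem-provided handle. The Handle layout (a pool identifier, the file key, and a page index within the file) and both helper functions are assumptions made for this example, not the kernel's data structures.

```python
from typing import NamedTuple

KEY_BYTES = 24  # 192 bits


class Handle(NamedTuple):
    pool_id: int     # assumed: one pool per mounted filesystem
    file_key: bytes  # names the file even when it has no cached pages
    index: int       # page offset within the file


def key_from_inode(ino: int) -> bytes:
    """Simple filesystems: the inode number, widened to 192 bits."""
    return ino.to_bytes(KEY_BYTES, "little")


def key_from_export_handle(fh: bytes) -> bytes:
    """Other filesystems: an opaque handle, zero-padded to 192 bits."""
    if len(fh) > KEY_BYTES:
        raise ValueError("handle does not fit in 192 bits")
    return fh.ljust(KEY_BYTES, b"\0")
```

The point of the fixed-width key is that tmem can look a page up without consulting any in-kernel state for the file, which may no longer exist in the page cache.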

Other tmem frontends

A common question is: can user code use tmem? For example, can enterprise applications that otherwise circumvent the pagecache use tmem? Currently the answer is no, but one could implement "tmem syscalls" to allow this. Coherency issues may arise, and it remains to be seen if they could be managed in user space.

What about other in-kernel uses? Some have suggested that the kernel dcache might provide a useful source of data for tmem. This too deserves further investigation.

4: Transcendent memory backends

The tmem interface allows multiple frontends to function with different backends. Currently only one backend may be configured, though in the future some form of layering may be possible. Tmem backends share some common characteristics: although a tmem backend might seem similar to a block device, it does not perform I/O and does not use the block I/O (bio) subsystem. In fact, a tmem backend must perform its functions fully synchronously; that is, it must not sleep and the scheduler may not be called. When a "put" completes, the kernel's page of data has been copied. And a successful "get" may not complete until the page of data has been copied to the kernel's data page. While these constraints create some difficulty for tmem backends, they also ensure that the backend meets tmem's interface requirements while minimizing changes to the core kernel.


Zcache

Although tmem was conceived as a way to share a fixed resource (RAM) among a number of clients with constantly varying memory appetites, it also works nicely when the amount of RAM needed by a single kernel to store some number, N, of pages of data is less than N*PAGE_SIZE and when those pages of data need only be accessed at a page granularity. So zcache combines an in-kernel implementation of tmem with in-kernel compression code to reduce the space requirements for data provided through a tmem frontend. As a result, when the kernel is under memory pressure, zcache can substantially increase the number of clean page cache pages and swap cache pages stored in RAM and thus significantly decrease disk I/O.

The zcache implementation is currently a staging driver so it is subject to change; it handles both persistent pages (from frontswap) and ephemeral pages (from cleancache) and, in both cases, uses the in-kernel lzo1x routines to compress/decompress the data contained in the pages. Space for persistent pages is obtained through a shim to xvmalloc, a memory allocator in the zram staging driver designed to store compressed pages. Space for ephemeral pages is obtained through standard kernel get_free_page() calls, then pairs of compressed ephemeral pages are matched using an algorithm called "compression buddies". This algorithm ensures that physical page frames containing two compressed ephemeral pages can easily be reclaimed when necessary; zcache provides a standard "shrinker" routine so those whole page frames can be reclaimed when required by the kernel using the existing kernel shrinker mechanism.

Zcache nicely demonstrates one of the flexibility features of tmem: Recall that, although data may often compress nicely (i.e. by a factor of two or more), it is possible that some workloads may produce long sequences of data that compress poorly. Since tmem allows any page to be rejected at the time of put, zcache policy (adjustable with sysfs tuneables in 3.1) avoids storing this poorly compressible data, instead passing it on to the original swap device for storage, thus dynamically optimizing the density of pages stored in RAM.
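A minimal sketch of such a put-time policy follows. It is not zcache's code: zlib stands in for the in-kernel lzo1x routines, the half-page threshold is a hypothetical stand-in for the sysfs tuneable, and the CompressingBackend and swap_out names are invented for the example.

```python
import zlib

PAGE_SIZE = 4096
MAX_COMPRESSED = PAGE_SIZE // 2   # hypothetical policy threshold


class CompressingBackend:
    """Toy backend that stores pages compressed and may reject a put."""

    def __init__(self, max_compressed=MAX_COMPRESSED):
        self.max_compressed = max_compressed
        self.store = {}   # handle -> compressed bytes

    def put(self, handle, page):
        if len(page) != PAGE_SIZE:
            raise ValueError("tmem works only in whole pages")
        # Drop any older copy first so stale data can never be returned.
        self.store.pop(handle, None)
        blob = zlib.compress(page)
        if len(blob) > self.max_compressed:
            return False              # poorly compressible: reject
        self.store[handle] = blob
        return True

    def get(self, handle):
        blob = self.store.get(handle)
        return None if blob is None else zlib.decompress(blob)


def swap_out(backend, swap_device, handle, page):
    """Frontswap-like caller: fall back to the real swap device on reject."""
    if not backend.put(handle, page):
        swap_device[handle] = page
```

The caller needs no knowledge of why a put failed; a rejection simply means the page takes the path it would have taken without tmem.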


RAMster

RAMster is still under development but a proof-of-concept exists today. RAMster assumes that we have a cluster-like set of systems with some high-speed communication layer, or "exofabric", connecting them. The collected RAM of all the systems in the "collective" is the shared RAM resource used by tmem. Each cluster node acts as both a tmem client and a tmem server, and decides how much of its RAM to provide to the collective. Thus RAMster is a "peer-to-peer" implementation of tmem.

Ideally this exofabric allows some form of synchronous remote DMA to allow one system to read or write the RAM on another system, but in the initial RAMster proof-of-concept (RAMster-POC), a standard Ethernet connection is used instead. As long as the exofabric is sufficiently faster than disk reads/writes, there is still a net performance win.

Interestingly, RAMster-POC demonstrates a useful dimension of tmem: Once pages have been placed in tmem, the data can be transformed in various ways as long as the pages can be reconstituted when required. When pages are put to RAMster-POC, they are first compressed and cached locally using a zcache-like tmem backend. As local memory constraints increase, an asynchronous process attempts to "remotify" pages to another cluster node; if one node rejects the attempt, another node can be used as long as the local node tracks where the remote data resides. Although the current RAMster-POC doesn't implement this, one could even remotify multiple copies to achieve higher availability (i.e. to recover from node failures).
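The "remotify" step can be pictured with the following toy model, in which a node keeps pages locally, later tries each peer in turn, and records where the remote copy lives. Everything here (PeerNode, LocalNode, the slot counts, the in-process method calls standing in for the exofabric) is a hypothetical illustration; local compression and the asynchronous worker are left out.

```python
class PeerNode:
    """A remote cluster node donating a fixed number of page slots."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots
        self.pages = {}

    def remote_put(self, handle, blob):
        if len(self.pages) >= self.slots:
            return False              # this peer rejects the attempt
        self.pages[handle] = blob
        return True

    def remote_get(self, handle):
        return self.pages.get(handle)

    def remote_flush(self, handle):
        self.pages.pop(handle, None)


class LocalNode:
    def __init__(self, peers):
        self.peers = peers
        self.local = {}      # handle -> data cached on this node
        self.location = {}   # handle -> peer holding the remote copy

    def put(self, handle, blob):
        # A new put supersedes any copy that was already moved away.
        old_peer = self.location.pop(handle, None)
        if old_peer is not None:
            old_peer.remote_flush(handle)
        self.local[handle] = blob

    def remotify(self, handle):
        """Try to move one locally cached page to some peer."""
        blob = self.local[handle]
        for peer in self.peers:
            if peer.remote_put(handle, blob):
                self.location[handle] = peer   # remember where it went
                del self.local[handle]
                return True
        return False                           # every peer refused; keep it

    def get(self, handle):
        if handle in self.local:
            return self.local[handle]
        peer = self.location.get(handle)
        return None if peer is None else peer.remote_get(handle)
```

Storing copies on more than one peer would only require the location table to hold a list of peers rather than a single one.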

While this multi-level mechanism in RAMster works nicely for puts, there is currently no counterpart for gets. When a tmem frontend requests a persistent get, the data must be fetched immediately and synchronously; the thread requiring the data must busy-wait for the data to arrive and the scheduler must not be called. As a result, the current RAMster-POC is best suited for many-core processors, where it is unusual for all cores to be simultaneously active.

Transcendent memory for Xen

Tmem was originally conceived for Xen and so the Xen implementation is the most mature. The tmem backend in Xen utilizes spare hypervisor memory to store data, supports a large number of guests, and optionally implements both compression and deduplication (both within a guest and across guests) to maximize the volume of data that can be stored. The tmem frontends are converted to Xen hypercalls using a shim. Individual guests may be equipped with "self-ballooning" and "frontswap-self-shrinking" (both in Linux 3.1) to optimize their interaction with Xen tmem. Xen tmem also supports shared ephemeral pools, so that guests co-located on a physical server that share a cluster filesystem need only keep one copy of a cleancache page in tmem. The Xen control plane also fully implements tmem: an extensive set of statistics is available; live migration and save/restore of tmem-using guests are fully supported; and limits, or "weights", may be applied to tmem guests to avoid denial-of-service.

Transcendent memory for KVM

The in-kernel tmem code included in zcache has been updated in 3.1 to support multiple tmem clients. With this in place, a KVM implementation of tmem should be fairly easy to complete, at least in prototype form. As with Xen, a shim would need to be placed in the guest to convert the cleancache and frontswap frontend calls to KVM hypercalls. On the host side, these hypercalls would need to be interfaced with the in-kernel tmem backend code. Some additional control plane support would also be necessary for this to be used in a KVM distribution.

Future tmem backends

The flexibility and dynamicity of tmem suggest that it may be useful for other storage needs, and other backends have been proposed. The idiosyncrasies of some RAM-extension technologies, such as SSD and phase-change memory (PRAM), have been observed to be a possible fit: since page-size quantities are always used, writes can be carefully controlled and accounted, and user code never writes to tmem, memory technologies that could previously only be used as a fast I/O device could now instead be used as slow RAM. Some of these ideas are already under investigation.

Appendix A: Self-ballooning and frontswap-self-shrinking

After a system has been running for a while, it is not uncommon for the vast majority of its memory to be filled with clean pagecache pages. With some tmem backends, especially Xen, it may make sense for those pages to reside in tmem instead of in the guest. To achieve this, Xen implements aggressive "self-ballooning", which artificially creates memory pressure by driving the Xen balloon driver to claim page frames, thus forcing the kernel to reclaim pages, which sends them to tmem. The algorithm essentially uses control theory to drive towards a memory target that approximates the current "working set" of the workload using the "Committed_AS" kernel variable. Since Committed_AS doesn't account for clean, mapped pages, these pages end up residing in Xen tmem where, queueing theory assures us, Xen can manage the pages more efficiently.
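One simple way to drive a guest's size toward such a goal is to move only a fraction of the remaining distance on each interval, shrinking cautiously and growing quickly. The sketch below shows that idea; the function name, the two hysteresis parameters, their default values, and the minimum-size floor are assumptions chosen for illustration rather than a description of the actual driver.

```python
def next_target(cur_pages, goal_pages, down_hysteresis=8,
                up_hysteresis=1, min_pages=0):
    """Return the balloon target for the next interval, in pages.

    goal_pages plays the role of the Committed_AS-derived estimate.
    Larger hysteresis values mean smaller, more cautious steps.
    """
    if goal_pages < cur_pages:
        # Shrink slowly so a transient dip does not starve the guest.
        target = cur_pages - (cur_pages - goal_pages) // down_hysteresis
    else:
        # Grow quickly so a rising working set is satisfied promptly.
        target = cur_pages + (goal_pages - cur_pages) // up_hysteresis
    return max(target, min_pages)
```

With these defaults, a guest at 1000 pages and a goal of 200 pages would be given a target of 900 pages on the first interval, then move steadily closer on each later one; a guest below its goal would jump to the goal in a single step.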

If the working set increases unexpectedly and faster than the self-balloon driver is able to (or chooses to) provide usable RAM, swapping occurs, but, in most cases, frontswap is able to absorb this swapping into Xen tmem. However, because the kernel swap subsystem assumes that swapping occurs to a disk, swapped pages may sit on the "disk" for a very long time, even if the kernel knows the page will never be used again, because the disk space costs very little and can be overwritten when necessary. When such stale pages are in frontswap, however, they are taking up valuable space.

Frontswap-self-shrinking works to resolve this problem: when frontswap activity is stable and the guest kernel returns to a state where it is not under memory pressure, pressure is provided to remove some pages from frontswap, using a "partial" swapoff interface, and return them to kernel memory, thus freeing tmem space for more urgent needs, i.e. other guests that are currently memory-constrained.

Both self-ballooning and frontswap-self-shrinking provide sysfs tuneables to drive their control processes. Further experimentation will be necessary to optimize them.

Appendix B: Subtle tmem implementation requirements

Although tmem places most coherency responsibility on its clients, a tmem backend itself must enforce two coherency requirements. These are called "get-get" coherency and "put-put-get" coherency. For the former, a tmem backend guarantees that if a get fails, a subsequent get to the same handle will also fail (unless, of course, there is an intermediate put). For the latter, if a put places data "A" into tmem and a subsequent put with the same handle places data "B" into tmem, a subsequent "get" must never return "A".

This second coherency requirement results in an unusual corner-case which affects the API/ABI specification: if a put with a handle "X" of data "A" is accepted, and then a subsequent put is done to handle "X" with data "B", this is referred to as a "duplicate put". In this case, the API/ABI allows the backend implementation two options, and the frontend must be prepared for either: (1) the duplicate put is accepted, the backend replaces data "A" with data "B", and success is returned; or (2) the duplicate put fails, and the backend must flush the data associated with "X" so that a subsequent get will fail. This is the only case where a persistent get of a previously accepted put may fail; fortunately in this case the frontend has the new data "B" which would have overwritten the old data "A" anyway.
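Both permitted behaviors for a duplicate put can be captured in a few lines. The following toy persistent backend is a sketch, not any real backend's code; the class name, the capacity limit, and the replace_duplicates switch are hypothetical, and a get here leaves the page in place.

```python
class PersistentBackend:
    """Toy persistent pool honoring get-get and put-put-get coherency."""

    def __init__(self, capacity, replace_duplicates=True):
        self.capacity = capacity
        self.replace_duplicates = replace_duplicates
        self.pages = {}

    def put(self, handle, data):
        if handle in self.pages:                 # duplicate put
            if self.replace_duplicates:
                self.pages[handle] = data        # option 1: "B" replaces "A"
                return True
            del self.pages[handle]               # option 2: fail, but flush
            return False                         # "A" so it is never returned
        if len(self.pages) >= self.capacity:
            return False                         # ordinary rejection
        self.pages[handle] = data
        return True

    def get(self, handle):
        # A miss keeps missing until another put for this handle succeeds.
        return self.pages.get(handle)
```

In neither mode can a get after the second put return "A", which is all that put-put-get coherency demands.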



Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds