
Kernel development

Brief items

Kernel release status

The current development kernel is 3.9-rc5, released on March 31. Linus says: "Nothing really peculiar stands out. Exynos DRM updates, IBM RamSan driver updates are a bit larger, l2tp update... The rest is pretty much small patches spread out all over. Mostly drivers (block, net, media, tty, usb), networking, and some filesystem updates (btrfs, nfs). Some arch updates (x86, arc). Things seem to be calming down a bit, and everything seems largely on track for a 3.9 release in a few weeks."

Stable updates: 3.8.5, 3.4.38, and 3.0.71 were released on March 28 with the usual set of fixes. 3.5.7.9 was also released on the 28th. Another big set of fixes is coming with 3.8.6, 3.4.39, and 3.0.72, which are due on or after April 4.


Quotes of the week

It would be nice if we can avoid the feature-removal.txt step. I mean, I heard some doleful whispers once when my pointer selected that file. Instead of opening the file I just closed the directory instantly. I never told anybody about that before, this is the first time. I also heard that somebody added an entry to that file to schedule the removal of that file? This is devilry.
Frederic Weisbecker

< system in single state : everyone sees cat = alive >

insert_into_box(cat);

< system in dual state : new calls see cat == dead, but
  current calls see cat == alive >

open_box();

< system is back to single state: everyone sees cat = dead >

funeral(cat);
Steven Rostedt explains RCU


ZFS on Linux 0.6.1

On behalf of the ZFS-on-Linux project, Brian Behlendorf has announced the availability of version 0.6.1 of this Solaris-derived filesystem. "Over two years of use by real users has convinced us ZoL is ready for wide scale deployment on everything from desktops to super computers." The project's home page offers binary modules for a wide variety of distributions. (See the FAQ for the project's take on licensing issues.)


Kernel development news

Avoiding game-score loss with per-process reclaim

By Michael Kerrisk
April 3, 2013

Minchan Kim's recent patch series to provide user-space-triggered reclaim of a process's pages represents one more point in a spectrum that increasingly sees memory management on Linux as a task that is indirectly influenced or even directly controlled from user space.

Approaches such as Android's low-memory killer represent one end of the page-reclaim spectrum, where memory management is primarily under kernel control. When memory is low, the low-memory killer picks a victim process and kills it outright; applications that live in such an environment have to work with the possibility that they may disappear from one moment to the next. As Minchan pointed out via an amusing example, the effects of the low-memory killer's approach to page reclaim can be extreme:

[Having a process killed to free memory] was really terrible experience because I lost my best score of game I had ever after I switch the phone call while I enjoyed the game

John Stultz's volatile ranges work and Minchan's own work on a similar feature (both described in this article) represent a middle point in the spectrum. The volatile ranges approach, inspired by Android's ashmem, provides a process with a way to inform the kernel that a certain range of its own virtual address space can be preferentially reclaimed if memory pressure is high. Under this approach, the kernel takes no responsibility for the contents of the reclaimed pages: if the kernel needs the memory, the page contents are discarded, and it is assumed that the application has sufficient information available to re-create those pages with the right contents if they are needed. As with the low-memory killer, the decision about if and when to reclaim the pages remains with the kernel.

By contrast, Minchan's patch set places the decision about when to reclaim pages directly under the control of user space. The interface provided by these patches is simple. A /proc/PID/reclaim file is provided for each process. A process with suitable permissions—that is, a process owned by root or one with the same user ID as the process PID—can write one of the following values to the file, to cause some or all of the process's pages to be reclaimed:

  1. Reclaim all process pages in file-backed mappings.
  2. Reclaim all process pages in anonymous (MAP_ANONYMOUS) mappings.
  3. Reclaim all pages of the process (i.e., the combination of 1 and 2).

As currently implemented, all of the process's pages that match the specified criterion are reclaimed. Your editor wondered whether there might be benefit in allowing some control over the range of pages that are reclaimed from the target process, by allowing an address range to be written to the /proc/PID/reclaim file.
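Driving the interface from user space amounts to writing a token to the per-process file. The sketch below shows a hypothetical helper; the token spellings ("file", "anon", "all", corresponding to options 1-3 above) are assumptions based on the description of the patch, not a merged kernel interface:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <sys/types.h>

/* Hypothetical helper around /proc/PID/reclaim. The tokens "file",
 * "anon", and "all" are assumed spellings for options 1-3 above. */
static int build_reclaim_path(char *buf, size_t len, pid_t pid)
{
    int n = snprintf(buf, len, "/proc/%d/reclaim", (int)pid);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

static int reclaim_pages(pid_t pid, const char *what)
{
    char path[64];

    if (build_reclaim_path(path, sizeof(path), pid))
        return -1;
    /* Needs root, or the same user ID as the target process. */
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fputs(what, f) >= 0 && fflush(f) == 0);
    fclose(f);
    return ok ? 0 : -1;
}
```

A task manager would then call, say, reclaim_pages(pid, "all") against a background process it had decided to shrink.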

By contrast with volatile ranges and the low-memory killer, modifications in pages reclaimed via /proc/PID/reclaim are not lost. Modified pages are written to the underlying file in the case of shared (MAP_SHARED) file mappings or to swap in other cases. Thus, if the process touches the reclaimed page later, it will be faulted into memory with the contents at the time it was reclaimed. The patches also include some logic to handle the case where multiple processes are sharing the same pages; in that case, the pages are reclaimed only after all of the processes have marked them for reclaim. Like the low-memory killer, /proc/PID/reclaim can be used to reclaim all of the pages in a process, but without needing to kill the process to do so.

The idea behind Minchan's proposal is that a user-space task manager could take over some part of the job of memory management. In some cases, this may be more effective than allowing the kernel to make memory-management decisions, since the user-space task manager can bring application-specific intelligence to decisions about whether to reclaim a process's pages. For example, some application processes may be in the foreground while others are in the background. It may be desirable to preferentially reclaim pages from one of the background processes, even if it has some frequently accessed pages. Of course, the task manager would somehow need to know when the system is under memory pressure. To that end, a mechanism like Anton Vorontsov's proposed vmpressure_fd() API might be useful.

Minchan's patches apply against Michal Hocko's MMOTM (memory management of the moment) tree. The patches came out on March 25, but have so far seen little review. Nevertheless, they present an idea that will probably be of particular interest for the developers of mobile and embedded devices and thus it seems likely that they will get some attention at some point in the future.


A VFS deadlock post-mortem

By Michael Kerrisk
April 3, 2013

Dave Jones continues to exercise his Trinity fuzz tester and uncover interesting bugs in kernel code. One recent find was a long-standing bug in the implementation of network namespaces.

The discussion of the bug began when Dave posted a note to the linux-kernel mailing list with stack traces that showed a kernel deadlock in the VFS code. Dave's report prompted Al Viro to wonder how a Trinity instance was managing to sit blocked on two locks (a situation that should never be able to happen), as shown in the lockdep output posted by Dave (note the two "blocked" entries, and that the last two locks report the same instance address):

    Showing all locks held in the system:
    4 locks on stack by trinity-child2/7669:
     #0: blocked:  (sb_writers#4){.+.+.+}, 
         instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
     #1: held:     (&type->s_vfs_rename_key){+.+.+.}, 
         instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
     #2: held:     (&type->i_mutex_dir_key#2/1){+.+.+.}, 
         instance: ffff880110b3a858, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
     #3: blocked:  (&type->i_mutex_dir_key#2/2){+.+.+.}, 
         instance: ffff880110b3a858, at: [<ffffffff811c7034>] lock_rename+0x114/0x120

Al also noted that the output suggested that a directory inode in the inode cache was mapped by two different dentries, since lockdep showed two i_mutex_dir_key locks on the same address. A dentry (directory entry) is a data structure representing a filename in the kernel directory entry cache (dcache); a brief overview of dentries and the dcache can be found in this article. As will become clear shortly, it should normally never happen that a directory inode is mapped twice in the dcache.

Some suggestions ensued regarding suitable debugging statements to add to the kernel's lock_rename() function to further investigate the problem. In particular, when two locks were held on the same inode address, Linus wanted to see the filenames corresponding to the inode and Al was interested to know the name of the filesystem holding the two inodes.

Further runs of Trinity with those debugging statements in place revealed that the locks in question were occurring for various entries under the /proc tree. At that point Linus refined the observation to note that the entries in question were for directories under /proc/net, but, like Al, he was puzzled as to how that could occur.

Here, a little background is probably in order. Once upon a time, /proc/net was a single directory. But, with the invention of network namespaces, it is now a symbolic link to the /proc/self/net directory; in other words, each process now has its own network-namespace-specific view of networking information under /proc.

With the output from the kernel debugging statements, the pieces started falling rapidly into place. Dave realized that he had started seeing the Trinity failure reports after he had enabled kernel namespaces support following a recent bug fix by Eric Biederman. Al began looking more closely at some of the subdirectories under the /proc/PID/net directories, and made an unhappy discovery:

    al@duke:~/linux/trees/vfs$ ls -lid /proc/{1,2}/net/stat
    4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/1/net/stat
    4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/2/net/stat

That discovery prompted a small explosion:

WE CAN NOT HAVE SEVERAL DENTRIES OVER THE SAME DIRECTORY INODE. […] Sigh... Namespace kinds - there should've been only one...

Those with a long memory, or at least those who read a recent LWN article attentively, might smile with the realization that, to begin with and for many years thereafter, there was only one class of namespace—mount namespaces, as implemented by one Al Viro.

Humor aside, Al had discovered the origin of the problem. The directory listing above shows two directory entries linked to the same inode. More generally, for all of the processes that share a network namespace, each of the corresponding entries in /proc/PID/net is implemented as a hard link to the same (virtual) /proc file.

Implementing corresponding /proc entries as hard links to the same inode is a technique used in various places in the implementation of namespaces. Indeed, allowing multiple hard links to a file is a normal feature of UNIX-type systems. Except in one case: Linux, like other UNIX systems, forbids multiple hard links to a directory. The reliability of various pieces of kernel and user-space code is predicated on that assumption. However, a Linux 2.6.25 patch made early in the implementation of network namespaces set in train some changes that quietly broke the assumption for the directories under /proc/PID/net.
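Al's ls -lid check amounts to comparing device and inode numbers for two paths, which is easy to express with stat(2). A small sketch (it works for any two paths; only /proc's aliased entries would ever make two *directories* compare equal):

```c
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <sys/stat.h>

/* Return 1 if the two paths resolve to the same inode on the same
 * device (the situation "ls -lid" revealed for /proc/{1,2}/net/stat),
 * 0 if not, and -1 on error. */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For regular files, a result of 1 just means an ordinary hard link; for directories, it signals exactly the aliasing that is supposed to be impossible.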

Having determined the cause of the problem, the developers then needed to devise a suitable fix. At this point, pragmatic factors came into play, since the task was not only to fix the kernel going forward, but also going backward. In other words, the ideal solution would be one that could be applied not only to the current kernel source tree but also to the stable and long-term kernel series. That led Linus to speculate about the possibility of allowing an exception to the rule that directory inodes are not allowed to have multiple links. Since the locks in question are placed at the inode level, why not change lock_rename() to replace the check on whether that function is dealing with the same dentries with a check on whether it is dealing with the same inodes?

However, Al was quick to point out that while modifying the check would solve the particular deadlock problem found by Dave, other problems would remain. The kernel code that deals with those locks depends upon a topological sort based on the hierarchical relationship between entries in the dcache; the presence of multiple directory entries that link to the same inode renders that sort unreliable.
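The discipline being defeated here is the classic one of acquiring multiple locks in a single global order. A user-space sketch using pthreads illustrates it, ordering by address where lock_rename() orders by ancestry in the dcache; the a == b check is precisely what the aliased /proc directories undermined, since lock_rename() compared dentries and so could lock one inode twice:

```c
#include <pthread.h>
#include <stdint.h>
#include <assert.h>

/* Take two locks in a fixed global order (by address here). If the
 * two arguments secretly alias the same lock, it must be taken only
 * once; a non-recursive mutex locked twice deadlocks on itself. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) {               /* aliased: take the lock only once */
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *t = a; a = b; b = t;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}
```

This is only an illustration of the general technique, not the kernel's code; the kernel's ordering additionally depends on the dcache topology being a tree, which is what multiple dentries per directory inode break.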

Al went on to describe what he considered to be the full and proper solution: creating /proc/PID/net files as symbolic links to per-network-namespace directories of the form /proc/netns-ID/net, where netns-ID is a per-namespace identifier. Alternatively, the existing /proc/PID/net trees could be kept, but the subdirectories could be created as duplicate subtrees rather than hard links to a single directory subtree. Al was, however, unsure about the feasibility of implementing this solution as a patch that could be backported to past stable kernel series.

In the meantime, Linus came up with another proposal. proc_get_inode(), the kernel function for allocating inodes in the /proc filesystem, has the following form:

    struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
    {
        struct inode *inode = iget_locked(sb, de->low_ino);

        if (inode && (inode->i_state & I_NEW)) {

            ...
            /* Populate fields in newly allocated cache entry pointed
               to by 'inode' */
            ...

            unlock_new_inode(inode);
        } else
            pde_put(de);
        return inode;
    }

The iget_locked() function searches the kernel's inode cache for an inode whose number corresponds to that recorded in the proc_dir_entry structure de. It returns either a pointer to an existing entry or, if no entry could be found, it allocates a new uninitialized cache entry that it returns to the caller. The proc_get_inode() function then populates the fields of the newly allocated inode cache entry using information from the proc_dir_entry.

The deadlock problem is a result of the fact that—because multiple dentries map to the same inode—multiple locks may be placed on the same entry in the inode cache. Conversely, deadlocks could be avoided if it was possible to avoid placing multiple locks on the inode entries returned from the cache. As Linus noted, in the case of /proc files, it is not really necessary to find an existing entry in the cache, because there is no on-disk representation for the inodes under /proc. Instead, proc_get_inode() could simply always create a new cache entry via a call to new_inode_pseudo() and populate that cache entry. Since a new cache entry is always created, it will not be visible to any other process, so that there will be no possibility of lock conflicts and deadlocks. In other words, the logic of proc_get_inode() can be modified to be:

    struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
    {
        struct inode *inode = new_inode_pseudo(sb);

        if (inode) {
            inode->i_ino = de->low_ino;

            ...
            /* Populate fields in newly allocated cache entry pointed
               to by 'inode' */
            ...

        } else
            pde_put(de);
        return inode;
    }

Here, it is worth noting that the kernel uses two different allocation schemes for the inodes under /proc: one scheme that is generally employed for inodes under the /proc/PID directories and another for the inodes in the remainder of /proc. Linus's patch affects only inode allocations for entries in the second category. However, as a consequence of the implementation history, whereby /proc/net was migrated to /proc/PID/net, the inodes under /proc/PID/net are allocated in the same fashion as inodes outside /proc/PID, and so the patch also affects those inodes.

In the subsequent commit message, Linus noted that the patch could have been refined so that the new behavior was applied only to directory entries, rather than all entries, under /proc. However, in the interests of keeping the change simple, no such differentiation was made.

The effect of Linus's patch is to prevent multiple locks (and thus deadlocks) on the same inode. Al agreed that the change should not be a problem from a correctness perspective. On the other hand, this change also has the effect of nullifying the benefits of inode caching for /proc files outside /proc/PID. Al wondered about the performance impact of that change. However, some casual instrumentation of the kernel suggested that the benefits of inode caching for /proc are low in any case. In addition, Dave reported that with the fix applied, Trinity was no longer hitting the deadlock problem.


In-kernel memory compression

April 3, 2013

This article was contributed by Dan Magenheimer

Amdahl's law tells us that there is always a bottleneck in any computing system. Historically, the bottleneck in many workloads on many systems is the CPU and so system designers have made CPUs faster and more efficient and also continue to increase the number of CPU cores even in low-end systems. So now, increasingly, RAM is the bottleneck; CPUs wait idly while data is moved from disk to RAM and back again. Adding more RAM is not always a timely or cost-effective option and sometimes not an option at all. Faster I/O buses and solid-state disks reduce the bottleneck but don't eliminate it.

Wouldn't it be nice if it were possible to increase the effective amount of data stored in RAM? And, since those CPUs are waiting anyway, perhaps we could use those spare CPU cycles to contribute towards that objective? This is the goal of in-kernel compression: We keep more data — compressed — in RAM and use otherwise idle CPU cycles to execute compression and decompression algorithms.

With the recent posting of zswap, there are now three in-kernel compression solutions proposed for merging into the kernel memory management (MM) subsystem: zram, zcache, and zswap. While a casual observer might think that only one is necessary, the three have significant differences and may target different user bases. So, just as there are many filesystems today co-existing in the kernel, someday there may be multiple compression solutions. Or maybe not... this will be decided by Linus and key kernel developers. To help inform that decision, this article compares and contrasts the different solutions, which we will call, for brevity, the "zprojects". We will first introduce some of the key concepts and challenges of compression. Then we will identify three design layers and elaborate on the different design choices made by the zprojects. Finally, we will discuss how the zprojects may interact with the rest of the kernel and then conclude.

Compression basics

For in-kernel compression to work, the kernel must take byte sequences in memory and compress them, and then keep the compressed version in RAM until some point in the future when the data is needed again. While the data is in a compressed state, it is impossible to read any individual bytes from it or write any individual bytes to it. Then, when the data is needed again, the compressed sequence must be decompressed so that individual bytes can again be directly accessed.

It is possible to compress any number of sequential bytes but it is convenient to use a fixed "unit" of compression. A standard storage unit used throughout the kernel is the "page", which consists of PAGE_SIZE bytes (4KB on most architectures supported by Linux). If a page is aligned at a PAGE_SIZE address boundary, it is called a "page frame"; the kernel maintains a corresponding "struct page" for every page frame in system RAM. All three zprojects use the page as the unit of compression and allocate and manage page frames to store compressed pages.
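Because PAGE_SIZE is a power of two, whether an address starts a page frame is a single mask test. A minimal sketch, with the common 4KB value hard-coded:

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SIZE 4096u         /* 4KB, typical for Linux; not universal */

/* An aligned, page-sized region of RAM is a page frame; all of the
 * low bits of its starting address must be zero. */
static int page_frame_aligned(uintptr_t addr)
{
    return (addr & (PAGE_SIZE - 1)) == 0;
}
```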

There are many possible compression algorithms. In general, achieving a higher "compression ratio" takes a larger number of CPU cycles, whereas less-effective compression can be done faster. It is important to achieve a good balance between time and compression ratio. All three zprojects, by default, use the LZO(1X) algorithm in the kernel's lib/ directory, which is known to deliver a good balance. However, it is important that the choice of algorithm remain flexible; possibly the in-CPU algorithm might be replaced in some cases by an architecture-specific hardware compression engine.

In general, the number of cycles required to compress, and to later decompress, a sequence of bytes is roughly proportional to the number of bytes in the sequence. Since the size of a page is fairly large, page compression and decompression are expensive operations and, so, we wish to limit the number of those operations. Thus we must carefully select which pages to compress, choosing pages that are likely to be used again but not likely to be used in the near future, lest we spend all the CPU's time repeatedly compressing and then decompressing pages. Since individual bytes in a compressed page are inaccessible, we must not only ensure that normal CPU linear byte addressing is not attempted on any individual byte, but also ensure that the kernel can clearly identify the compressed page so that it can be found and decompressed when the time comes to do so.

When a page is compressed, the compression algorithm is applied and the result is a sequence of bytes which we can refer to as a "zpage". The size of a zpage, its "zsize", is highly data-dependent, hard to predict, and highly variable. Indeed the expected distribution of zsize drives a number of key zproject design decisions so it is worthwhile to understand it further. The "compression ratio" for a page is zsize divided by PAGE_SIZE. For nearly all pages, zsize is less than PAGE_SIZE so the compression ratio is less than one, but odd cases may occur where compression actually results in the zsize being larger than PAGE_SIZE. A good compression solution must have a contingency plan to deal with such outliers. At the other extreme, a data page containing mostly zeroes or ones may compress by a factor of 100x or more. For example, LZO applied to an all-zero page results in a zpage with zsize equal to 28, for a compression ratio of 0.0068.

As a rough rule of thumb, "on average", compression has a ratio of about 0.5 (i.e. it reduces a page of data by about a factor of two), so across a wide set of workloads, it is useful to envision the distribution as a bell curve, centered at PAGE_SIZE/2. We will refer to zpages with zsize less than PAGE_SIZE/2 as "thin" zpages, and zpages with zsize greater than PAGE_SIZE/2 as "fat" zpages. For any given workload, of course, the distribution may "skew thin", (if many mostly-zero pages are compressed, for example) or "skew fat", as is true if the data is difficult to compress (already-compressed data like a JPEG image would be an example). A good compression solution must thus be able to handle zsize distributions from a wide range of workloads.
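The spread of zsize values is easy to demonstrate with a toy compressor. The run-length encoder below stands in for LZO purely for illustration; an all-zero page comes out extremely thin, while alternating bytes produce the "very fat" outlier case where the output exceeds PAGE_SIZE and a contingency plan is needed:

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

#define PAGE_SIZE 4096

/* Toy run-length encoder (not LZO) showing how widely zsize varies
 * with page contents. Output is (run length, byte) pairs; 'out' must
 * hold 2*PAGE_SIZE bytes so that incompressible data, which expands,
 * still fits. Returns the resulting zsize. */
static size_t toy_compress(const unsigned char *page, unsigned char *out)
{
    size_t zsize = 0, i = 0;

    while (i < PAGE_SIZE) {
        unsigned char b = page[i];
        size_t run = 1;

        while (i + run < PAGE_SIZE && page[i + run] == b && run < 255)
            run++;
        out[zsize++] = (unsigned char)run;
        out[zsize++] = b;
        i += run;
    }
    return zsize;
}
```

With this encoder, an all-zero page compresses to 34 bytes (very thin), while a page of alternating bytes expands to 8192 bytes (very fat), bracketing the bell curve described above.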

With this overview and terminology, we can now compare and contrast the three zprojects. At a high level, each uses a "data source layer" to provide pages to compress and a "data management layer" to organize and store the zpages. This data management layer has two sublayers, the first for data organization, which we will call the "metadata layer", and the second for determining how best to utilize existing kernel memory and kernel allocation code (such as alloc_page()) to optimally store the zpages, which we will call the "zpage allocator layer".

Data source layer — overview

As previously noted, the nature of compression limits the types and quantities of pages to which it can be applied. It doesn't make sense to compress a page for which there is a high probability of imminent reuse. Nor does it make sense to compress a page for which there is a high probability that the data it contains will never be used again. All three zprojects assume one or more data sources that provide a "pre-qualified" sequence of data pages. In one zproject, an existing kernel subsystem is the data source. For the other two zprojects, kernel hooks added earlier as part of the "transcendent memory" project, known as cleancache and frontswap, provide the data page stream.

Each data source must also provide a unique identifier, or "key", for each page of data so that the page, once stored and associated with that unique key, can be later retrieved by providing the key. It's also important to note that different data sources may result in streams of pages with dramatically different contents and thus, when compressed, different zsize distributions.

Swap/anonymous pages as a data source

The Linux swap subsystem provides a good opportunity for compression because, when the system is under memory pressure, swap acts as a gatekeeper between (a) frequently used anonymous pages which are kept in RAM, and (b) presumably less frequently used anonymous pages that "overflow", or can be "swapped", to a very-much-slower-than-RAM swap disk. If, instead of using a slow swap disk, we can cause the gatekeeper to store the overflow pages in RAM, compressed, we may be able to avoid swapping them entirely. Further, the swap subsystem already provides a unique key for each page so that the page can be fetched from the swap disk. So, unsurprisingly, all three zprojects consider swap pages as an ideal data source for compression.

One zproject, zram, utilizes the existing swap device model. A zram swap device is explicitly created and enabled in user space (via mkswap and swapon) and this zram device is prioritized (presumably highest) against other configured swap devices. When the swap subsystem sends a data page to the zram device, the page goes through the block I/O subsystem to the zram "driver". All swap pages sent to zram are compressed and associated with a key created by concatenating the swap device identifier and page offset. Later, when the swap subsystem determines (via a page fault) that the page must be brought back in from the swap device to fill a specific page frame, the zram device is notified; it then decompresses the zpage matching the swap device number and page offset into that page frame.
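The key zram constructs can be pictured as a simple concatenation. In this sketch, the 32/32 bit split is an illustrative assumption, not zram's actual layout; in practice zram indexes a per-device table directly by offset:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative zram-style key: swap device number concatenated with
 * the page offset within that device. Field widths are assumptions. */
static uint64_t make_swap_key(uint32_t dev, uint32_t offset)
{
    return ((uint64_t)dev << 32) | offset;
}

static uint32_t key_dev(uint64_t key)    { return (uint32_t)(key >> 32); }
static uint32_t key_offset(uint64_t key) { return (uint32_t)key; }
```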

The other two zprojects, zcache and zswap, use the kernel's frontswap hooks (available since Linux 3.5) in the swap subsystem. Frontswap acts as a "fronting store" — a type of a cache — to an existing swap device. It avoids the block I/O subsystem completely; swapped pages instead go directly to zcache/zswap where they are compressed. On the fetch path, when the page fault occurs, the block I/O subsystem is again bypassed and then, as with zram, the page is decompressed directly into the faulting page frame.
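The shape of the frontswap hook table can be mocked in a few lines. The callback names below echo the frontswap API of that era, but the signatures are simplified guesses for illustration; the property that matters here is that a store may fail, pushing the page on toward the real swap device:

```c
#include <assert.h>

/* User-space mock of a frontswap-style backend; names echo the
 * frontswap API, signatures are simplified for illustration. A
 * nonzero return from store rejects the page, and the swap subsystem
 * then writes it to the real swap device instead. */
struct mock_frontswap_ops {
    int  (*store)(unsigned type, unsigned long offset, const void *page);
    int  (*load)(unsigned type, unsigned long offset, void *page);
    void (*invalidate_page)(unsigned type, unsigned long offset);
    void (*invalidate_area)(unsigned type);
};

/* A backend that is "full": every store is rejected, so every page
 * falls through to the configured swap disk. */
static int full_store(unsigned type, unsigned long offset, const void *page)
{
    (void)type; (void)offset; (void)page;
    return -1;
}
```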

There are some subtle but important side effects of the different approaches:

  • The block I/O subsystem is designed to work with fixed-size block devices and is not very happy when a block write results in a failure. As a result, when a poorly-compressible page ("very fat" zpage) is presented to zram, all PAGE_SIZE bytes of the uncompressed page are stored by zram in RAM, for no net space savings. The frontswap-based zprojects have more flexibility; if a very fat zpage is presented, zcache/zswap can simply reject the request, in which case the swap subsystem will pass on the original data page to the real swap disk.

  • Since zram presents itself as a swap disk, user-space configuration is required, but this also means that zram can work on systems with no physical swap device, a configuration common for embedded Linux systems. On the other hand, zcache and zswap depend entirely on already-configured swap devices and so are not functional unless at least one physical swap device is configured and sized adequately. As a result, one might think of zram as being a good match for an embedded environment, while the others are better suited for a more traditional server/data-center environment.

  • It is important to note that all three zprojects must never discard any of these data pages unless explicitly instructed to do so; otherwise, user-space programs may experience data loss. Since all three zprojects have a finite capacity on any given system, it is prudent to have a "pressure relief valve". Both zcache and zswap have the ability to write back data to the swap disk to which it was originally intended to be written; this cannot be done with zram.

Clean page cache pages as a data source

It is not uncommon for the vast majority of page frames in a running system to contain data pages that are identical to pages already existing on a disk in a filesystem. In Linux, these pages are stored in the page cache; the data in these pages is kept in RAM in anticipation that it will be used again in the future. When memory pressure occurs and the kernel is looking to free RAM, the kernel will be quick to discard some of this duplicate "clean" data because it can be fetched from the filesystem later if needed. A "cleancache" hook, present in the kernel since Linux 3.0, can divert the data in these page frames, sending the pages to zcache where the data can be compressed and preserved in RAM. Then, if the kernel determines the data is again needed in RAM, a page fault will be intercepted by another cleancache hook which results in the decompression of the data.

Because modern filesystems can be large, unique identification of a page cache page is more problematic than is unique identification of swap pages. Indeed, the key provided via cleancache must uniquely identify: a filesystem; an "exportable" inode value representing a file within that filesystem; and a page offset within the file. This combination requires up to 240 bits.

Zcache was designed to handle cleancache pages, including the full range of required keys. As a result, the data management layer is much more complex, consisting of a combination of different data structures which we will describe below. Further, the location of some cleancache hooks in the VFS part of the kernel results in some calls to the data management layer with interrupts disabled which must be properly handled.

Even more than with swap data, filesystem data can proliferate rapidly and, even compressed, can quickly fill up RAM. So, as with swap data, it is prudent to have a pressure-relief valve for page cache data. Unlike swap data, however, we know that page cache data can simply be dropped whenever necessary. Since zcache was designed from the beginning to manage page cache pages, its data management layer was also designed to efficiently handle the eviction of page cache zpages.

Zram can maintain entire fixed-size filesystems in compressed form in RAM, but not as a cache for a non-RAM-based filesystem. In essence, part of RAM can be pre-allocated as a "fast disk" for a filesystem; even filesystem metadata is compressed and stored in zram. While zram may be useful for small filesystems that can entirely fit in RAM, the caching capability provided via cleancache combined with the compression provided by zcache is useful even for large filesystems.

Zswap is singularly focused on swap, and so does not use page cache or filesystem data as a data source at all. As a result, its design is simpler because it can ignore the complexities required to handle the tougher page cache data source.

The metadata layer

Once the zproject receives a data page for compression, it must set up and maintain data structures so that the zpage can be stored and associated with a key for that page, and then later found and decompressed. Concurrency must also be considered to ensure that unnecessary serialization bottlenecks do not occur. The three zprojects use very different data structures for different reasons, in part driven by the requirements of data sources to be maintained.

Since the number of swap devices in a system is small, and the number of pages in each is predetermined and fixed, a small, unsigned integer representing the swap device combined with a page offset is sufficient to identify a specific page. So zram simply uses a direct table lookup (one table per configured zram swap device) to find and manage its data. For example, for each data page stored, the zram table contains a zsize, since zsize is useful for decompression validation, as well as information to get to the stored data. Concurrency is restricted, per-device, by a reader-writer semaphore.
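The direct-table scheme can be sketched as follows. This is a conceptual Python sketch only (the kernel's zram is C and stores compressed data through a zpage allocator, not Python objects); a `threading.Lock` stands in for the per-device reader-writer semaphore, and `zlib` stands in for the kernel's compression backend.

```python
import threading
import zlib

PAGE_SIZE = 4096

class ZramDevice:
    """Conceptual per-device table: one slot per page offset."""
    def __init__(self, num_pages):
        self.table = [None] * num_pages   # slot holds (zsize, compressed data)
        self.lock = threading.Lock()      # stand-in for the per-device rwsem

    def store(self, offset, page):
        zdata = zlib.compress(page)
        with self.lock:
            self.table[offset] = (len(zdata), zdata)

    def load(self, offset):
        with self.lock:
            zsize, zdata = self.table[offset]
        assert zsize == len(zdata)        # zsize sanity-checks decompression input
        page = zlib.decompress(zdata)
        assert len(page) == PAGE_SIZE     # must decompress to exactly one page
        return page

dev = ZramDevice(num_pages=64)
dev.store(7, b"A" * PAGE_SIZE)
print(dev.load(7) == b"A" * PAGE_SIZE)    # True
```

The point is that lookup is a single array index per device; no tree walk is ever needed because the key space is small, dense, and fixed at configuration time.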

Zswap must manage the same swap-driven address space, but it utilizes one dynamic red-black tree (rbtree) per swap device to manage its data. During tree access, a spinlock on the tree must be held; this is not a bottleneck because simultaneous access within any one swap device is greatly limited by other swap subsystem factors. Zswap's metadata layer is intentionally minimalist.

Zcache must manage both swap data and pagecache data, and the latter has a much more extensive key space which drives the size and complexity of zcache data structures. For each filesystem, zcache creates a "pool" with a hash table of red-black tree root pointers; each node in the rbtree represents a filesystem inode — the inode space in an exportable filesystem is often very sparsely populated which lends itself to management by an rbtree. Then each rbtree node contains the head of a radix tree — the data structure used throughout the kernel for page offsets — and this radix tree is used to look up a specific page offset. The leaf of the radix tree points to a descriptor which, among other things, references the zpage. Each hash table entry has a spinlock to manage concurrency; unlike swap and frontswap, filesystem accesses via cleancache may be highly concurrent so contention must be aggressively avoided.
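The nesting of zcache's lookup structures can be sketched like this. Python dicts stand in for the hash table, the per-bucket rbtree, and the per-inode radix tree; only the shape of the (pool, inode, offset) key hierarchy is the point, and the bucket count is an arbitrary illustrative value.

```python
class ZcachePool:
    """Sketch of zcache's per-filesystem pool: hash buckets, each holding
    an inode-keyed tree, each inode holding an offset-keyed tree."""
    HASH_BUCKETS = 256   # illustrative; each real bucket also has a spinlock

    def __init__(self):
        self.buckets = [dict() for _ in range(self.HASH_BUCKETS)]

    def _bucket(self, inode):
        return self.buckets[hash(inode) % self.HASH_BUCKETS]

    def insert(self, inode, offset, zpage_desc):
        inode_tree = self._bucket(inode)                 # hash table step
        offset_tree = inode_tree.setdefault(inode, {})   # "rbtree" node per inode
        offset_tree[offset] = zpage_desc                 # "radix tree" leaf

    def lookup(self, inode, offset):
        offset_tree = self._bucket(inode).get(inode)
        if offset_tree is None:
            return None
        return offset_tree.get(offset)

pool = ZcachePool()
pool.insert(inode=1234, offset=0, zpage_desc="compressed-page-handle")
print(pool.lookup(1234, 0))   # compressed-page-handle
print(pool.lookup(1234, 1))   # None
```

Because concurrency is managed per hash bucket rather than with one global lock, highly concurrent cleancache traffic against different inodes rarely contends.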

Impact of writeback on the metadata layer

Earlier it was noted that zcache and zswap support "writeback"; this adds an important twist to the data structures in that, ideally, we'd like to write back pages in some reasonable order, such as least recently used (LRU). To that end, both zcache and zswap maintain a queue to order the stored data; zswap queues each individual zpage and zcache queues the page frames used to store the zpages. While lookup requires a key and searches the data structures from the root downward, writeback chooses one or more zpages from the leaves of the data structure and then must not only remove or decompress the data, but must also remove its metadata from the leaf upward. This requirement has two ramifications: First, the leaf nodes of data structures must retain the key along with information necessary to remove the metadata. Second, if writeback is allowed to execute concurrently with other data structure accesses (lookup/insert/remove), challenging new race conditions arise which must be understood and avoided.
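The leaf-retains-its-key requirement can be illustrated with a zswap-style per-zpage LRU sketch (zcache's queue holds page frames instead, and the real trees are far richer than the dict used here):

```python
from collections import OrderedDict

class WritebackCache:
    """Sketch: each stored zpage is queued in LRU order, and the queue entry
    retains the key so writeback can unwind the metadata from the leaf up."""
    def __init__(self):
        self.tree = {}             # key -> zpage; stand-in for the real trees
        self.lru = OrderedDict()   # keys in LRU order, oldest first

    def insert(self, key, zpage):
        self.tree[key] = zpage
        self.lru[key] = True

    def lookup(self, key):
        if key in self.lru:
            self.lru.move_to_end(key)   # mark as recently used
        return self.tree.get(key)

    def writeback_one(self):
        # evict the least-recently-used zpage; the retained key lets us
        # remove the metadata without a root-downward search
        key, _ = self.lru.popitem(last=False)
        zpage = self.tree.pop(key)
        return key, zpage    # caller decompresses/writes to the swap device

c = WritebackCache()
c.insert("a", b"z1")
c.insert("b", b"z2")
c.lookup("a")                 # touch "a" so "b" becomes the eviction victim
print(c.writeback_one()[0])   # b
```

In a single-threaded sketch the races disappear, which is exactly why they are easy to overlook: the hard part in the kernel is that `lookup()` and `writeback_one()` may run simultaneously against the same key.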

Zswap currently implements writeback only as a side effect of storing data (when kernel page allocation fails) which limits the possible race conditions. Zcache implements writeback as a shrinker routine that can be run completely independently and, thus, must handle a broader set of potential race conditions. Zcache must also handle this same broader set when discarding page cache pages.

Other metadata layer topics

One clever data management technique is worth mentioning here: Zram deduplicates "zero pages" — pages that contain only zero bytes — requiring no additional storage. In some zsize distributions, such pages may represent a significant portion of the zpages to be stored. While the data of the entire page may need to be scanned to determine if it is a zero page, this overhead may be small relative to that of compression and, in any case, it pre-populates the cache with bytes that the compression algorithm would need anyway. Zcache and zswap should add this feature. More sophisticated deduplication might also be useful. Patches are welcome.
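The zero-page check is simple enough to sketch directly; here `zlib` again stands in for the real compression backend, and the table layout is illustrative:

```python
import zlib

PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)   # a page of all zero bytes

def store_page(table, offset, page):
    # A zero-filled page needs no compressed storage at all; just flag it.
    if page == ZERO_PAGE:
        table[offset] = ("zero", None)
    else:
        table[offset] = ("data", zlib.compress(page))

def load_page(table, offset):
    kind, payload = table[offset]
    if kind == "zero":
        return ZERO_PAGE       # reconstructed for free, no decompression
    return zlib.decompress(payload)

table = {}
store_page(table, 0, bytes(PAGE_SIZE))
store_page(table, 1, b"x" * PAGE_SIZE)
print(table[0][1] is None)                       # True: zero page stored for free
print(load_page(table, 1) == b"x" * PAGE_SIZE)   # True
```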

Zpage allocator layer

Memory allocation for zpages can present some unique challenges. First, we wish to maximize the number of zpages stored in a given number of kernel page frames, a ratio which we will call "density". Maximizing the density can be challenging as we do not know a priori the number of zpages to be stored, the distribution of zsize of those zpages, or the access patterns which insert and remove those zpages. As a result, fragmentation can be a concern. Second, allocating memory to store zpages may often occur in an environment of high-to-severe memory pressure; an effective allocator must handle frequent failure and should minimize large (i.e. multiple-contiguous-page) allocations. Third, if possible, we wish the allocator to support some mechanism to enable ordered writeback and/or eviction as described above.

One obvious alternative is to use an allocator already present and widely used in the kernel: kmalloc(), which is optimized for allocations that nicely fit into default sizes of 2^N-byte chunks. For allocations of size 2^N+x, however, as much as 50% of the allocated memory is wasted. kmalloc() provides an option for "odd" sizes but this requires pre-allocated caches, often of contiguous sets of pages ("higher order" pages) which are difficult to obtain under memory pressure. Early studies showed that kmalloc() allocation failures were unacceptably frequent and density was insufficient, so kmalloc() was discarded and other options were pursued, all based on alloc_page(), which always allocates a single page frame.
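The waste of power-of-two rounding is easy to quantify; this small sketch rounds a few representative zsizes the way a power-of-two allocator would:

```python
def kmalloc_bucket(size):
    # round up to the next power of two, as a power-of-two allocator would
    n = 1
    while n < size:
        n *= 2
    return n

# one byte over a bucket boundary nearly doubles the allocation
for zsize in (1024, 1025, 2048, 2049, 3000):
    bucket = kmalloc_bucket(zsize)
    waste = 100 * (bucket - zsize) / bucket
    print(f"zsize {zsize:5d} -> bucket {bucket:5d}, {waste:4.1f}% wasted")
```

A 1025-byte zpage lands in a 2048-byte bucket, wasting almost half the space, and typical zsize distributions cluster just past such boundaries often enough to make this ruinous for density.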

To improve density, the xvmalloc allocator was written, based on the two-level sequential fit (TLSF) algorithm. Xvmalloc provided high density for some workloads and was the initial default allocator for zram and for the frontswap-driven portion of the initial zcache. It was found, however, that the high density led to poor fragmentation characteristics, especially for zsize distributions that skewed fat. Xvmalloc has since been discarded (though remnants of it survive through Linux 3.8).

A new allocator, zbud, was written for the cleancache-driven portion of the original zcache code. The initial version of zbud had less focus on high density and more on the ability to efficiently evict cleancache-provided zpages. With zbud, no more than two zpages are contained in a page frame, buddying up a pair of zpages, one fat and one thin. Page frames, rather than zpages, are entered into an LRU queue so that entire page frames can be easily "freed" to the system. On workloads where zsize skews thin, a significant amount of space is wasted, but fragmentation is limited, and eviction is still easy.
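The buddying idea can be sketched as follows; the real zbud is C and tracks frames in unbuddied/buddied lists rather than scanning, so this linear search is purely illustrative:

```python
PAGE_SIZE = 4096

class ZbudFrame:
    """One page frame holding at most two zpages, one from each end."""
    def __init__(self):
        self.first = None    # fills from the front of the frame
        self.last = None     # fills from the back

    def free_space(self):
        used = len(self.first or b"") + len(self.last or b"")
        return PAGE_SIZE - used

def zbud_store(frames, zpage):
    # try to buddy the new zpage into a half-full frame...
    for f in frames:
        if (f.first is None or f.last is None) and f.free_space() >= len(zpage):
            if f.first is None:
                f.first = zpage
            else:
                f.last = zpage
            return f
    # ...otherwise start a new frame (this is where the LRU entry would be made)
    f = ZbudFrame()
    f.first = zpage
    frames.append(f)
    return f

frames = []
zbud_store(frames, b"F" * 3000)   # a fat zpage
zbud_store(frames, b"t" * 900)    # a thin zpage buddies into the same frame
print(len(frames))                # 1
```

Because the unit of LRU accounting is the frame, evicting the oldest frame frees a whole page back to the kernel in one step, at the cost of the slack space visible above.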

Zsmalloc, an allocator intended to "rule them all", was then written. Zsmalloc takes the best of kmalloc() but with the ability to provide the complete range of "size classes" suitable to store zpages; zsmalloc also never requires higher-order allocations. All zpages with zsize within a "class" (e.g. between 33-48 bytes) are grouped together and stored in the same "zspage". A clever feature (especially useful for fat zpages) allows discontiguous page frames to be "stitched" together, so that the last bytes in the first page frame contain the first part of the zpage and the first bytes of the second page frame contain the second part. A group of N of these discontiguous pages are allocated on demand for each size class, with N chosen by the zsmalloc code to optimize density.
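Size-class grouping and zspage sizing can be sketched like this. The class granularity and the maximum zspage size below are illustrative assumptions, not zsmalloc's actual constants:

```python
PAGE_SIZE = 4096
CLASS_STEP = 16   # assumed class granularity, for illustration only

def size_class(zsize):
    # round zsize up to the next multiple of CLASS_STEP
    return -(-zsize // CLASS_STEP) * CLASS_STEP

def best_zspage_pages(cls_size, max_pages=4):
    # choose how many (possibly discontiguous) page frames to stitch into
    # one zspage so that per-zspage waste, (n * PAGE_SIZE) % cls_size,
    # is minimized
    return min(range(1, max_pages + 1),
               key=lambda n: (n * PAGE_SIZE) % cls_size)

print(size_class(33), size_class(48))   # both land in the 48-byte class
print(size_class(49))                   # next class: 64
print(best_zspage_pages(48))            # 3 frames: 12288 bytes is an exact fit
```

With stitching, a 48-byte class can use three frames per zspage and waste nothing, whereas confining each zpage to a single frame would strand the tail of every page.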

As a result of these innovations, zsmalloc achieves great density across a wide range of zsize distributions: balanced, fat, and thin. Alas, zsmalloc's strengths also prove to be its Achilles heel: high density and page crossings make it very difficult for eviction and writeback to concurrently free page frames, resulting in a different kind of fragmentation (and consequent lower density) and rendering it unsuitable for management of cleancache-provided pages in zcache. So zbud was revised to utilize some of the techniques pioneered by zsmalloc. Zbud is still limited to two zpages per page frame, but when under heavy eviction or writeback, its density compares favorably with zsmalloc's. The decision by zcache to use zbud for swap pages sadly led to a fork of zcache, first into "old zcache" (aka zcache1) and "new zcache" (aka zcache2), and then into a separate zproject entirely, zswap, because zswap's author prefers the density of zsmalloc over the cleancache support, predictability, and page-frame reclaim of zbud.

So, today, there are two zpage allocators, both still evolving. The choice for best zpage allocator depends on use models and unpredictable workloads. It may be possible that some hybrid of the two will emerge, or perhaps some yet unwritten allocator may arise to "rule them all".

Memory management subsystem interaction

In some ways, a zproject is like a cache for pages temporarily removed from the MM subsystem. However it is a rather unusual cache in that, to store its data, it steals capacity (page frames) from its backing store, the MM subsystem. From the point-of-view of the MM subsystem, the RAM is just gone, just as if the page frames were absorbed by a RAM-voracious device driver. This is a bit unfortunate, given that both the zproject and the MM subsystem are respectively, but separately, managing compressed and uncompressed anonymous pages (and possibly compressed and uncompressed page cache pages as well). So it is educational to consider how a zproject may "load balance" its memory needs with the MM subsystem, and with the rest of the kernel.

Currently, all three zprojects obtain page frames using the standard kernel alloc_page() call, with the "GFP flags" parameter chosen to ensure that the kernel's emergency reserve pages are never used. So each zproject must be (and is) resilient to the fairly frequent situation where an alloc_page() call returns failure.

Zram makes no other attempt to limit its total use of page frames but, as we've seen, it is configured by the administrator as a swap device, and swap parameters do limit the number of swap pages, and thus zpages, that can be accepted. This indirectly but unpredictably limits the number of page frames used. Also, the configuration is made without knowledge of the total RAM installed in the system, so configuration errors are entirely possible.

Zswap uses a slightly modified version of zsmalloc to track the total number of page frames allocated. It ensures this number does not exceed a certain percent of total RAM in the system, and this limit is configurable via sysfs. As previously noted, using writeback, zswap can reduce the number of (anonymous) zpages stored and this, indirectly, can reduce the number of page frames, though at the likely cost of fragmentation.
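The percent-of-RAM cap amounts to a simple admission check, sketched below. The variable `max_percent` mimics zswap's sysfs tunable (the real tunable name, accounting, and fallback path live in the C code):

```python
TOTAL_RAM_PAGES = 1_000_000   # illustrative system size, in page frames
max_percent = 20              # sysfs-tunable in real zswap; a variable here

class ZswapPool:
    def __init__(self):
        self.frames_used = 0

    def alloc_frame(self):
        limit = TOTAL_RAM_PAGES * max_percent // 100
        if self.frames_used >= limit:
            return False      # refuse; caller falls back to the real swap device
        self.frames_used += 1
        return True

pool = ZswapPool()
pool.frames_used = 199_999
print(pool.alloc_frame())     # True: still under the 200,000-frame limit
print(pool.alloc_frame())     # False: limit reached
```

When the check fails, the page being swapped simply bypasses zswap and goes to the backing swap device uncompressed, so the cap bounds memory use without ever losing data.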

Zcache and zbud were designed from the beginning with much more dynamic constraints on page frame utilization in mind, intended eventually to flexibly mesh with the dramatically time-varying memory balancing algorithms that the MM subsystem uses. While the exact interaction model and best future control policy are not yet determined, zcache's and zbud's underlying page frame-based writeback and eviction mechanisms support a shrinker-like interface that can be told to, for example, reduce the number of page frames used for storing anonymous zpages from 10000 to 8937 and the number of page frames used for storing page cache zpages from 3300 to 463, and zcache can do that. Since zbud is much more predictable (two zpages per page frame), the MM subsystem can estimate and manage both the total number of anonymous and pagecache pages (compressed and uncompressed) in the system and the total number of page frames used to manage them.

Status and conclusions

Zram, zcache, and zswap are all advancing the concept of in-kernel compression in different ways. Zram and zcache are in-tree but both still in the staging tree. While the staging tree has served to expose the concept of in-kernel compression to a wide group of kernel developers and even to the users of a few leading-edge distributions, as Andrew Morton says, the staging tree "is where code goes to be ignored". Staging tree maintainer Greg Kroah-Hartman has become reluctant to accept any new zproject functionality into the staging tree. This creates a conundrum for these zprojects: evolution in the staging tree has resulted in rapid improvements in the design and implementation and exposure of the zprojects, but because of this evolution they are not stable enough for promotion into the core kernel and further evolution is now stymied.

Zswap is attempting to circumvent the staging tree entirely by proposing what is essentially a simplified frontswap-only fork of zcache using zsmalloc instead of zbud, for direct merging into the MM subsystem. Since it is also much simpler than zcache, it has garnered more reviewers which is a valuable advantage for to-be-merged code. But it is entirely dependent on the still-in-staging zsmalloc, does not support page cache pages at all, and does not support page-frame-based writeback, so its simplicity comes at a cost. If zswap is merged, it remains to be seen if it will ever be extended adequately.

So, in-kernel compression has clear advantages and real users. It remains unclear though when (or if) it will be merged into the core kernel and how it should interact with the core kernel. If you are intrigued, the various zproject developers encourage your ideas and contributions.

Acknowledgments

Nitin Gupta was the original author of compcache and I was the original author of transcendent memory. While both projects were initially targeted to improve memory utilization in a multi-tenant virtualized environment, their application to compression in a single kernel quickly became apparent. Nitin designed and wrote the Linux code for zram, xvmalloc, and zsmalloc, and wrote an early prototype implementation of zcache. I wrote frontswap and cleancache (with the cleancache hook placement originally done by Chris Mason), and the Linux code for zcache, zbud, and ramster. Seth Jennings contributed a number of improvements to zcache and is the author of zswap. Kernel mm developer Minchan Kim was a helpful influence for all the zprojects, and none of the zprojects would have been possible without Greg Kroah-Hartman's support — and occasional whippings — as maintainer of the staging drivers tree.

As with any open-source project, many others have contributed ideas, bug fixes, and code improvements and we are thankful for everyone's efforts.

Comments (31 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.9-rc5
Greg KH Linux 3.8.5
Steven Rostedt Linux 3.6.11.1
Luis Henriques Linux 3.5.7.9
Greg KH Linux 3.4.38
Steven Rostedt 3.4.37-rt51
Steven Rostedt 3.2.42-rt62
Greg KH Linux 3.0.71
Steven Rostedt 3.0.71-rt98

Documentation

Michael Kerrisk (man-pages) For review (v2): user_namespaces(7) man page

Security-related

Steve Dickson lnfs: 3.9-rc5 release

Page editor: Jonathan Corbet


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds