Kernel development

Brief items

Kernel release status

The current development kernel is 3.8-rc7, released on February 8. Linus says: "Anyway, here it is. Mostly driver updates (usb, networking, radeon, regulator, sound) with a random smattering of other stuff (btrfs, networking, so on). And most everything is pretty small."

Stable updates: the 3.7.7, 3.4.30 and 3.0.63 updates were released on February 11; 3.5.7.5 was released on February 8.

The 3.7.8, 3.4.31, and 3.0.64 updates are in the review process as of this writing; they can be expected on or after February 14.

Quotes of the week

As far as I'm concerned, *the* *only* interface stability warranties in VFS are those for syscalls. Which means that no tracepoints are going to be acceptable there. End of story.
Al Viro

Hey ARM!

We are not going away, we are here to stay. We cannot be silenced or stopped anymore, and we are becoming harder and harder to ignore.

It is only a matter of time before we produce an open source graphics driver stack which rivals your binary in performance. And that time is measured in weeks and months now. The requests from your own customers, for support for this open source stack, will only grow louder and louder.

So please, stop fighting us. Embrace us. Work with us. Your customers and shareholders will love you for it.

Luc Verhaegen

Yeah, a plan, I know it goes against normal kernel development procedures, but hey, we're in our early 20's now, it's about time we started getting responsible.
Greg Kroah-Hartman

Kroah-Hartman: AF_BUS, D-Bus, and the Linux kernel

Greg Kroah-Hartman writes about plans to get D-Bus functionality into the kernel (a topic last covered here in July, 2012). "Our goal (and I use 'goal' in a very rough term, I have 8 pages of scribbled notes describing what we want to try to implement here), is to provide a reliable multicast and point-to-point messaging system for the kernel, that will work quickly and securely. On top of this kernel feature, we will try to provide a 'libdbus' interface that allows existing D-Bus users to work without ever knowing the D-Bus daemon was replaced on their system."

Kernel development news

Some 3.8 development statistics

By Jonathan Corbet
February 13, 2013
The release of 3.8-rc7 suggests that the 3.8 development cycle is nearing its close. This has been a busy cycle indeed, with, as of this writing, just over 12,300 non-merge changesets finding their way into the mainline. That makes 3.8 the most active development cycle ever, edging out 2.6.25 and its mere 12,243 changesets. Like it or not, the time for the traditional statistics article has come around; this time, though, your editor has tried looking at things in a different way.

But, before getting to that, here are the usual numbers. As of this writing, some 1,253 developers have contributed code to the 3.8 kernel. The most active of those were:

Most active 3.8 developers

By changesets
    H Hartley Sweeten        426   3.5%
    Bill Pemberton           381   3.1%
    Philipp Reisner          238   1.9%
    Andreas Gruenbacher      210   1.7%
    Lars Ellenberg           146   1.2%
    Mark Brown               143   1.2%
    Sachin Kamat             135   1.1%
    Al Viro                  127   1.0%
    Tomi Valkeinen           115   0.9%
    Wei Yongjun              114   0.9%
    Axel Lin                 112   0.9%
    Johannes Berg            104   0.8%
    Kevin McKinney           103   0.8%
    YAMANE Toshiaki          101   0.8%
    Ben Skeggs               100   0.8%
    Paulo Zanoni             100   0.8%
    Ian Abbott                98   0.8%
    Mauro Carvalho Chehab     91   0.7%
    Andrei Emeltchenko        84   0.7%
    Daniel Vetter             82   0.7%

By changed lines
    Greg Kroah-Hartman     42448   5.8%
    Sreekanth Reddy        30415   4.2%
    H Hartley Sweeten      22581   3.1%
    Naresh Kumar Inna      19378   2.7%
    Larry Finger           16798   2.3%
    Paul Walmsley          16720   2.3%
    Jaegeuk Kim            13470   1.9%
    Rajendra Nayak         10398   1.4%
    David Howells           9946   1.4%
    Wei WANG                9775   1.3%
    Ben Skeggs              9395   1.3%
    Jussi Kivilinna         8784   1.2%
    Philipp Reisner         8596   1.2%
    Eunchul Kim             8533   1.2%
    Bill Pemberton          8293   1.1%
    Nobuhiro Iwamatsu       7795   1.1%
    Peter Hurley            7671   1.1%
    Laxman Dewangan         6898   0.9%
    Lars-Peter Clausen      6537   0.9%
    Lars Ellenberg          6320   0.9%

H. Hartley Sweeten's position at the top of the changeset list should be unsurprising by now; he continues the seemingly endless task of cleaning up the Comedi data acquisition drivers. Bill Pemberton has been working to rid the kernel of the __devinit markings (and variants), reflecting the fact that we all live in a hotplug world now. Philipp Reisner, Andreas Gruenbacher, and Lars Ellenberg all contributed long lists of changes to the DRBD distributed block driver; the resulting code dump caused block maintainer Jens Axboe to promise Linus that "Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved."

On the lines-changed side, Greg Kroah-Hartman worked on the __devinit removal, but also removed over 37,000 lines of code from the staging tree. Sreekanth Reddy made a number of additions to the mpt3sas SCSI driver, Naresh Kumar Inna contributed the Chelsio FCoE offload driver, and Larry Finger added the rtl8723ae wireless driver.

Some 205 employers (that we know about) supported development on the 3.8 kernel. The most active of these were:

Most active 3.8 employers

By changesets
    (None)                      1580   12.8%
    Red Hat                     1112    9.0%
    Intel                       1076    8.7%
    (Unknown)                    917    7.4%
    LINBIT                       595    4.8%
    Linaro                       572    4.6%
    Texas Instruments            492    4.0%
    Vision Engraving Systems     426    3.5%
    Samsung                      410    3.3%
    SUSE                         310    2.5%
    IBM                          287    2.3%
    Google                       254    2.1%
    Broadcom                     190    1.5%
    (Consultant)                 171    1.4%
    Wolfson Microelectronics     161    1.3%
    Freescale                    129    1.0%
    Free Electrons               128    1.0%
    Parallels                    123    1.0%
    NVidia                       121    1.0%
    NetApp                       121    1.0%

By lines changed
    (None)                     79954   11.0%
    Red Hat                    60515    8.3%
    Intel                      46326    6.4%
    Linux Foundation           43190    5.9%
    (Unknown)                  41097    5.7%
    Samsung                    36596    5.0%
    (Consultant)               33175    4.6%
    LSI Logic                  30415    4.2%
    Linaro                     29030    4.0%
    Vision Engraving Systems   26074    3.6%
    LINBIT                     22487    3.1%
    Chelsio                    21534    3.0%
    Texas Instruments          21276    2.9%
    IBM                        14233    2.0%
    Broadcom                   12236    1.7%
    Renesas Electronics        11570    1.6%
    NVidia                     10369    1.4%
    Realsil Microelectronics    9797    1.3%
    Qualcomm                    9345    1.3%
    SUSE                        9139    1.3%

Red Hat remains in its traditional position at the top of the list — but not by much. Perhaps more significant is that some companies that have long shown up in the top 20 have fallen off the list this time; those companies include AMD and Oracle. Meanwhile, we continue to see an increasingly strong showing from companies in the mobile and embedded area.

What are they working on?

Many of the companies in the above list have obvious objectives for their work in the kernel; LINBIT, for example, is a business built around DRBD, and Wolfson Microelectronics is in the business of selling a lot of audio hardware. But if companies just focused on driver work, there would be nobody left to do the core kernel work; thus, a look at what parts of the kernel any specific company is working on will say something about how broad its objectives are. To that end, your editor set out to hack on the gitdm tool to focus on one company at a time. So, for example, from the 3.3 kernel onward (essentially, from the beginning of 2012 to the present), Red Hat's changes clustered in these areas:

Red Hat
     %    Subsystem       Notes
    34%   drivers/        9% gpu, 6% media, 6% net, 3% md
    20%   fs/             3% xfs, 3% nfsd, 2% cifs, 2% gfs2, 1% btrfs, 1% ext4
    14%   include/
     8%   net/
     8%   tools/
     7%   arch/x86/
     7%   kernel/
     2%   mm/

(Patches touching more than one subsystem are counted in each, so the percentages can add up to over 100%.)

Red Hat puts a lot of effort into making drivers work, but also has a strong interest in the filesystem subtree. The large proportion of patches going into tools/ reflects Red Hat's continued development of the perf tool.

Intel's focus during the same time period is somewhat different:

Intel
     %    Subsystem       Notes
    66%   drivers/        22% net, 17% gpu, 4% scsi, 3% acpi, 3% usb
    17%   net/            7% bluetooth, 5% mac80211, 3% nfc
    13%   include/
     7%   arch/x86
     3%   fs/

Intel is a hardware company, so the bulk of its effort is focused on making its products work well in the Linux kernel. Improving memory management or general-purpose filesystems is mostly left for others.

Google's presence in the kernel development community has grown considerably in the last few years. In this case, the pattern of development is different yet again:

Google
     %    Subsystem       Notes
    27%   drivers/        4% net, 4% pci, 3% staging, 3% input, 3% gpu
    22%   net/            11% ipv4, 5% core, 5% ipv6
    21%   include/
    11%   mm/
    10%   fs/             6% ext4, 1% proc
     8%   kernel/
     6%   arch/arm
     5%   arch/x86
     4%   Documentation/

Google has an obvious interest in making the Internet work better, and much of its work in the kernel is aimed toward that goal. But the company also wants Android to work better (thus more driver work, ARM architecture work) and better scalability in general, leading to a lot of core kernel work. Much of Google's work is visible to the outside world in one way or another, so it is nice to see that the company has been reasonably diligent about keeping the relevant documentation current.

While we are on the subject of ARM, what about Linaro? This consortium is very much about hardware enablement, so it would not be surprising to see a focus on the ARM architecture subsystem. And, indeed, that's how it looks:

Linaro
     %    Subsystem       Notes
    47%   drivers/        5% pinctrl, 4% clk, 4% mmc, 4% mfd, 3% gpu, 3% media
    36%   arch/arm
    12%   include/
     9%   kernel/
     6%   sound/
     5%   Documentation/
     2%   fs/             1.5% pstore

Almost everything Linaro does is focused on making the hardware work better; even much of the work on the core kernel is dedicated to timekeeping. And while lots of work in Documentation/ is always welcome, in this case, it mostly consists of device tree snippets.

Finally, what about the largest group of all — developers who are working on their own time? Here is where those developers put their energies:

Unaffiliated developers
     %    Subsystem       Notes
    68%   drivers/        13% staging, 12% net, 10% gpu, 8% media, 6% usb, 2% hid
    14%   arch/           5% arm, 2% mips, 2% x86, 2% sparc
     8%   include/
     6%   net/            2% batman-adv
     3%   fs/
     2%   Documentation/
     2%   sound/
     1%   kernel/

Volunteer developers, it seems, share a strong interest in making their own hardware work; they are also the source of many of the patches going into the staging tree. That suggests that, in a time when much of the kernel is becoming more complex and less approachable, the staging tree is providing a way for new developers to get into the kernel and learn the ropes in a relatively low-pressure setting. The continued health of the community depends on a steady flow of new developers, so providing an easy path for developers to get into kernel development can only be a good thing.

And, certainly, from the information found here, one should be able to conclude that the development community remains in good health overall. We are about to complete our busiest development cycle ever with no real signs of strain. For the time being, things seem to be functioning quite well.

Rationalizing CPU hotplugging

By Jonathan Corbet
February 12, 2013
One of the leading sources of code churn in the 3.8 development cycle was the removal of the __devinit family of macros. These macros marked code and data that were only needed during device initialization and which, thus, could be disposed of once initialization was complete. These macros are being removed for a simple reason: hardware has become so dynamic that initialization is never complete; something new can always show up, and there is no longer any point in building a kernel that cannot cope with transient devices. Even in this world, though, CPUs are generally seen as being static. But CPUs, too, can come and go, and that is motivating changes in how the kernel manages them.

Hotplugging is a familiar concept when one thinks about keyboards, printers, or storage devices, but it is a bit less so for CPUs: USB-attached add-on processors are still relatively rare in the market. Even so, the kernel has had support for CPU hotplug for some time; the original version of Documentation/cpu-hotplug.txt was added in 2006 for the 2.6.16 kernel. That document mentioned a couple of use cases for this feature: high-end NUMA hardware that truly has runtime-pluggable processors, and the ability to disable a faulty CPU in a high-reliability system. Other uses have since come along, including system suspend operations (where all CPUs but one are "unplugged" prior to suspending the system) and virtualization, where virtual CPUs can be given to (or taken from) guests at will.

So CPU hotplug is a useful feature, but the current implementation in the kernel is not well loved; in a recent patch set intended to improve the situation, Thomas Gleixner remarked that "the current CPU hotplug implementation has become an increasing nightmare full of races and undocumented behaviour." CPU hotplug shows a lot of the signs of a feature that has evolved significantly over time without high-level oversight; among other things, the sequence of steps followed for an unplug operation is not the reverse of the steps to plug in a new CPU. But much of the trouble associated with CPU hotplug is blamed on its extensive use of notifiers.

The kernel's notifier mechanism is a way for kernel code to request a callback when an event of interest happens. Notifiers are, in a sense, general-purpose hooks that anybody in the kernel can use, and, it seems, just about everybody does. There have been a lot of complaints about notifiers, as is typified by this comment from Linus in response to Thomas's patch set:

Notifiers are a disgrace, and almost all of them are a major design mistake. They all have locking problems, [they] introduce internal arbitrary API's that are hard to fix later (because you have random people who decided to hook into them, which is the whole *point* of those notifier chains).

Notifiers also make the code hard to understand because there is no easy way to know what will happen when a notifier chain (which is a run-time construct) is invoked: there could be an arbitrary set of notifiers in the chain, in any order. The ordering requirements of specific notifiers can add some fun challenges of their own.
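
For illustration, here is a minimal sketch of what hooking into the CPU hotplug notifier chain looks like in the 3.8 era; the subsystem and the callback body are hypothetical:

    #include <linux/cpu.h>
    #include <linux/notifier.h>

    /* a hypothetical subsystem's hotplug callback */
    static int example_cpu_callback(struct notifier_block *nb,
                                    unsigned long action, void *hcpu)
    {
        unsigned int cpu = (unsigned long)hcpu;

        switch (action) {
        case CPU_UP_PREPARE:
            /* allocate per-CPU resources before CPU cpu runs */
            break;
        case CPU_ONLINE:
            /* the CPU is running; start per-CPU work */
            break;
        case CPU_DOWN_PREPARE:
            /* returning NOTIFY_BAD here vetoes the unplug */
            break;
        case CPU_DEAD:
            /* the CPU is gone; release its resources */
            break;
        }
        return NOTIFY_OK;
    }

    static struct notifier_block example_cpu_notifier = {
        .notifier_call = example_cpu_callback,
        /* ordering against other callbacks is just a bare .priority value */
    };

    static int __init example_init(void)
    {
        register_cpu_notifier(&example_cpu_notifier);
        return 0;
    }

Nothing about the chain itself says which other callbacks exist, what they do, or how they are ordered relative to this one; that knowledge lives only in the heads of the developers involved.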

The process of unplugging a CPU requires a surprisingly long list of actions. The scheduler must be informed so it can migrate processes off the affected CPU and shut down the relevant run queue. Per-CPU kernel threads need to be told to exit or "park" themselves. CPU frequency governors need to be told to stop worrying about that processor. Almost anything with per-CPU variables will need to make arrangements for one CPU to go away. Timers running on the outgoing CPU need to be relocated. The read-copy-update subsystem must be told to stop tracking the CPU and to ensure that any RCU callbacks for that CPU get taken care of. Every architecture has its own low-level details to take care of. The perf events subsystem has an impressive set of requirements of its own. And so on; this list is nowhere near comprehensive.

All of these actions are currently accomplished by way of a set of notifier callbacks which, with luck, get called in the right order. Meanwhile, plugging in a new CPU requires an analogous set of operations, but those are handled in an asymmetric manner with a different set of callbacks. The end result is that the mechanism is fragile and that few people have any real understanding of all the steps needed to plug or unplug a CPU.

Thomas's objective is not to rewrite all those notifier functions or fundamentally change what is done to implement a CPU hotplug operation — at least, not yet. Instead, he is focused on imposing some order on the whole process so that it can be understood by looking at the code. To that end, he has replaced the current set of notifier chains with a linear sequence of states to be worked through when bringing up or shutting down a CPU. There is a single array of cpuhp_step structures, one per state:

    struct cpuhp_step {
	int (*startup)(unsigned int cpu);	/* called while bringing the CPU up */
	int (*teardown)(unsigned int cpu);	/* called while shutting it down */
    };

The startup() function will be called when passing through the state as a new CPU is brought online, while teardown() is called when things are moving in the other direction. Many states only have one function or the other in the current implementation; the eventual goal is to make the process more symmetrical. In the initial patch set, the set of states is:

    CPUHP_CREATE_THREADS
    CPUHP_PERF_X86_UNCORE_PREP
    CPUHP_PERF_X86_PREPARE
    CPUHP_PERF_BFIN
    CPUHP_PERF_POWER
    CPUHP_PERF_SUPERH
    CPUHP_PERF_PREPARE
    CPUHP_SCHED_MIGRATE_PREP
    CPUHP_WORKQUEUE_PREP
    CPUHP_RCUTREE_PREPARE
    CPUHP_HRTIMERS_PREPARE
    CPUHP_TIMERS_PREPARE
    CPUHP_PROFILE_PREPARE
    CPUHP_X2APIC_PREPARE
    CPUHP_SMPCFD_PREPARE
    CPUHP_SLAB_PREPARE
    CPUHP_NOTIFY_PREPARE
    CPUHP_NOTIFY_DEAD
    CPUHP_CPUFREQ_DEAD
    CPUHP_SCHED_DEAD
    CPUHP_CLOCKEVENTS_DEAD
    CPUHP_BRINGUP_CPU
    CPUHP_AP_OFFLINE                      (application processor states)
    CPUHP_AP_SCHED_STARTING
    CPUHP_AP_PERF_X86_UNCORE_STARTING
    CPUHP_AP_PERF_X86_AMD_IBS_STARTING
    CPUHP_AP_PERF_X86_STARTING
    CPUHP_AP_PERF_ARM_STARTING
    CPUHP_AP_ARM_VFP_STARTING
    CPUHP_AP_ARM64_TIMER_STARTING
    CPUHP_AP_KVM_STARTING
    CPUHP_AP_X86_TBOOT_DYING
    CPUHP_AP_S390_VTIME_DYING
    CPUHP_AP_CLOCKEVENTS_DYING
    CPUHP_AP_RCUTREE_DYING
    CPUHP_AP_SCHED_NOHZ_DYING
    CPUHP_AP_SCHED_MIGRATE_DYING
    CPUHP_AP_MAZ                          (end marker for AP states)
    CPUHP_TEARDOWN_CPU
    CPUHP_PERCPU_THREADS
    CPUHP_SCHED_ONLINE
    CPUHP_PERF_ONLINE
    CPUHP_SCHED_MIGRATE_ONLINE
    CPUHP_WORKQUEUE_ONLINE
    CPUHP_CPUFREQ_ONLINE
    CPUHP_RCUTREE_ONLINE
    CPUHP_NOTIFY_ONLINE
    CPUHP_PROFILE_ONLINE
    CPUHP_SLAB_ONLINE
    CPUHP_NOTIFY_DOWN_PREPARE
    CPUHP_PERF_X86_UNCORE_ONLINE
    CPUHP_PERF_X86_ONLINE
    CPUHP_PERF_S390_ONLINE
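
The intent is easier to see in code form. Here is a hypothetical sketch (not taken from Thomas's patches) of how bringup can walk such an array; teardown would simply walk it in the opposite direction:

    /* one entry per state above; CPUHP_MAX is a hypothetical end marker */
    static struct cpuhp_step cpuhp_steps[CPUHP_MAX];

    /* hypothetical: bring a CPU through states 0..target, in order */
    static int cpuhp_bring_up(unsigned int cpu, int target)
    {
        int state, ret;

        for (state = 0; state <= target; state++) {
            if (!cpuhp_steps[state].startup)
                continue;
            ret = cpuhp_steps[state].startup(cpu);
            if (ret) {
                /* roll back the states already passed, in reverse order */
                while (--state >= 0)
                    if (cpuhp_steps[state].teardown)
                        cpuhp_steps[state].teardown(cpu);
                return ret;
            }
        }
        return 0;
    }

Because the states form a simple ordered array, both the sequence of operations and the rollback path on failure can be read directly from the code; that is exactly what the runtime-assembled notifier chains obscure.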

Looking at that list, one begins to see why the current CPU hotplug mechanism is hard to understand. Things are messy enough that Thomas is not really trying to change anything fundamental in how CPU hotplug works; most of the existing notifier callbacks are still there, but they are now invoked in a different way. The purpose of the exercise, Thomas said, was:

It's about making the ordering constraints clear. It's about documenting the existing horror in a way, that one can understand the hotplug process w/o hallucinogenic drugs.

Once some high-level order has been brought to the CPU hotplug mechanism, one can think about trying to clean things up. The eventual goal is to have a much smaller set of externally visible states; for drivers and filesystems, there will only be "prepare" and "enable" states available, with no ordering between subsystems. Also, notably, drivers and filesystems will not be allowed to cause a hotplug operation (in either direction) to fail. When the process is complete, the hotplug subsystem should be much more predictable, with a lot more of the details hidden from the rest of the kernel.

That is all work for a future series, though; the first step is to get the infrastructure set up. Chances are that will require at least one more iteration of Thomas's "Episode 1" patch set, meaning that it is unlikely to be 3.9 material. Starting around 3.10, though, we may well see significant changes to how CPU hotplugging is handled; the result should be more comprehensible and reliable code.

The zswap compressed swap cache

February 12, 2013

This article was contributed by Seth Jennings

Swapping is one of the biggest threats to performance. The latency gap between RAM and swap, even on a fast SSD, can be four orders of magnitude. The throughput gap is two orders of magnitude. In addition to the speed gap, storage on which a swap area resides is becoming more shared and virtualized, which can cause additional I/O latency and nondeterministic workload performance. The zswap subsystem exists to mitigate these undesirable effects of swapping through a reduction in I/O activity.

Zswap is a lightweight, write-behind compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. If this process is successful, the writeback to the swap device is deferred and, in many cases, avoided completely. This results in a significant I/O reduction and performance gains for systems that are swapping.

Zswap basics

Zswap intercepts pages in the middle of swap writeback and caches them using the frontswap API. Frontswap has been in the kernel since v3.5 and has been covered by LWN before. It allows a backend driver, like zswap, to intercept both swap page writeback and the page faults for swapped out pages. Zswap also makes use of the "zsmalloc" allocator (discussed below) for compressed page storage.
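
As a sketch of that plumbing (simplified from the 3.5-era frontswap interface; the zswap_frontswap_* names follow the naming pattern of the zswap patches), a backend registers a set of operations:

    /* the hooks a backend like zswap registers with frontswap */
    static struct frontswap_ops zswap_frontswap_ops = {
        .init            = zswap_frontswap_init,    /* per-swap-device setup */
        .store           = zswap_frontswap_store,   /* from swap_writepage() */
        .load            = zswap_frontswap_load,    /* from swap_readpage() */
        .invalidate_page = zswap_frontswap_invalidate_page,
        .invalidate_area = zswap_frontswap_invalidate_area,
    };

    static int __init zswap_init(void)
    {
        frontswap_register_ops(&zswap_frontswap_ops);
        return 0;
    }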

Zswap seeks to be as simple as possible in its structure and operation. There are two primary data structures. The first is the zswap_entry structure, which contains information about a single compressed page stored in zswap:

    struct zswap_entry {
	struct rb_node rbnode;	/* node in the per-device red-black tree */
	int refcount;		/* guards against concurrent free (see below) */
	pgoff_t offset;		/* swap offset; the lookup key */
	unsigned long handle;	/* zsmalloc allocation */
	unsigned int length;	/* size of the compressed data */
	/* ... */
    };

The second is the zswap_tree structure which contains a red-black tree of zswap entries indexed by the offset value:

    struct zswap_tree {
	struct rb_root rbroot;	/* compressed pages, indexed by offset */
	struct list_head lru;	/* eviction order for writeback */
	spinlock_t lock;	/* protects lookups and modifications */
	struct zs_pool *pool;	/* compressed storage for this device */
    };

At the highest level, there is an array of zswap_tree structures indexed by the swap device number.
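
In code, that top level amounts to something like the following sketch (MAX_SWAPFILES is the kernel's limit on the number of swap devices):

    /* one tree per swap device; the swap "type" is the device number */
    static struct zswap_tree zswap_trees[MAX_SWAPFILES];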

There is a single lock per zswap_tree to protect the tree structure during lookups and modifications. The higher-level swap code provides certain protections that simplify the zswap implementation by not having to design for concurrent store, load, and invalidate operations on the same swap entry. While this single-lock design might seem like a likely source for contention, actual execution demonstrates that the swap path is largely bottlenecked by other locks at higher levels, such as the anon_vma mutex or swap_lock. In comparison, the zswap_tree lock is very lightly contended. Writeback support, covered in the next section, also led to this single-lock design.

For page compression, zswap uses compressor modules provided by the kernel's cryptographic API. This allows users to select the compressor dynamically at boot time, and gives easy access to hardware compression accelerators or any other future compression engines.
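
A sketch of that setup, assuming the default lzo compressor (the function and variable names here are illustrative, not zswap's actual ones):

    #include <linux/crypto.h>

    static struct crypto_comp *zswap_comp;  /* compression transform */

    static int __init zswap_comp_init(const char *name)  /* e.g. "lzo" */
    {
        /* look the compressor up by name through the crypto API */
        zswap_comp = crypto_alloc_comp(name, 0, 0);
        if (IS_ERR(zswap_comp))
            return PTR_ERR(zswap_comp);  /* unavailable: disable zswap */
        return 0;
    }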

A zswap store operation occurs when a page is selected for swapping by the reclaim system and frontswap intercepts the page in swap_writepage(). The operation begins by compressing the page into a per-CPU temporary buffer. Compressing into the temporary buffer is required because the compressed size, and thus the size of the permanent allocation needed to hold it, isn't known until the compression is actually done. Once the compressed size is known, an object is allocated and the temporary buffer is copied into the object. Lastly, a zswap_entry structure is allocated, populated, and inserted into the tree for that swap device.

If the store fails for any reason, most likely because of an object allocation failure, zswap returns an error which is propagated up through frontswap into swap_writepage(). The page is then swapped out to the swap device as usual.
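
Putting those steps together, the store path condenses to roughly the following sketch. The zswap_entry_alloc() and zswap_rb_insert() helpers and the zswap_dstmem per-CPU buffer are illustrative names, not necessarily the real ones; it builds on the earlier sketches:

    static int zswap_frontswap_store(unsigned type, pgoff_t offset,
                                     struct page *page)
    {
        struct zswap_tree *tree = &zswap_trees[type];
        struct zswap_entry *entry;
        unsigned int dlen = PAGE_SIZE;
        unsigned long handle;
        u8 *buf, *src, *dst;
        int ret;

        /* compress into the per-CPU buffer; the final size is unknown
           until the compression has actually been done */
        buf = get_cpu_var(zswap_dstmem);
        src = kmap_atomic(page);
        ret = crypto_comp_compress(zswap_comp, src, PAGE_SIZE, buf, &dlen);
        kunmap_atomic(src);
        if (ret)
            goto reject;

        /* the size is now known; make the permanent allocation */
        handle = zs_malloc(tree->pool, dlen);
        if (!handle)
            goto reject;  /* caller falls back to the swap device */
        dst = zs_map_object(tree->pool, handle, ZS_MM_WO);
        memcpy(dst, buf, dlen);
        zs_unmap_object(tree->pool, handle);
        put_cpu_var(zswap_dstmem);

        /* record the compressed page in this device's tree */
        entry = zswap_entry_alloc(offset, handle, dlen);
        spin_lock(&tree->lock);
        zswap_rb_insert(&tree->rbroot, entry);
        spin_unlock(&tree->lock);
        return 0;

    reject:
        put_cpu_var(zswap_dstmem);
        return -ENOMEM;  /* propagates up into swap_writepage() */
    }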

A load operation occurs when a program page faults on a page table entry (PTE) that contains a swap entry and is intercepted by frontswap in swap_readpage(). The swap entry contains the device and offset information needed to look up the zswap entry in the appropriate tree. Once the entry is located, the data is decompressed directly into the page allocated by the page fault code. The entry is not removed from the tree during a load; it remains up-to-date until the entry is invalidated.
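
The load path is the mirror image of the store sketch above (zswap_rb_search() is again an illustrative helper):

    static int zswap_frontswap_load(unsigned type, pgoff_t offset,
                                    struct page *page)
    {
        struct zswap_tree *tree = &zswap_trees[type];
        struct zswap_entry *entry;
        unsigned int dlen = PAGE_SIZE;
        u8 *src, *dst;
        int ret;

        spin_lock(&tree->lock);
        entry = zswap_rb_search(&tree->rbroot, offset);
        if (entry)
            entry->refcount++;  /* hold the entry while decompressing */
        spin_unlock(&tree->lock);
        if (!entry)
            return -ENOENT;

        /* decompress directly into the page allocated by the fault code */
        src = zs_map_object(tree->pool, entry->handle, ZS_MM_RO);
        dst = kmap_atomic(page);
        ret = crypto_comp_decompress(zswap_comp, src, entry->length,
                                     dst, &dlen);
        kunmap_atomic(dst);
        zs_unmap_object(tree->pool, entry->handle);

        /* the entry stays in the tree; just drop the reference */
        spin_lock(&tree->lock);
        entry->refcount--;
        spin_unlock(&tree->lock);
        return ret;
    }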

An invalidate operation occurs when the reference count for a particular swap offset becomes zero in swap_entry_free(). In this case, the zswap entry is removed from the appropriate tree, and the entry and the zsmalloc allocation that it references are freed.

To be preemption-friendly, interrupts are never disabled. Preemption is disabled only during compression, while the per-CPU temporary buffer page is being accessed, and during decompression, while a mapped zsmalloc allocation is being accessed.

Zswap writeback

To operate optimally as a cache, zswap should hold the most recently used pages. With frontswap, there is, unfortunately, a real potential for an inverse least recently used (LRU) condition in which the cache fills with older pages, and newer pages are forced out to the slower swap device. To address this, zswap is designed with "resumed" writeback in mind.

As background, the process for swapping pages follows these steps:

  1. First, an anonymous memory page is selected for swapping and a slot is allocated in the swap device.

  2. Next, the page is unmapped from all processes using that page. The PTEs referencing that page are filled with the swap entry that consists of the swap type and offset where the page can be found.

  3. Lastly, the page is scheduled for writeback to the swap device.

When frontswap_store() in swap_writepage() is successful, the writeback step is not performed. However, the slot in the swap device has been allocated and remains reserved for the page even though the page resides only in the frontswap backend. Resumed writeback in zswap forces pages out of the compressed cache into their previously reserved swap slots on the swap device. Currently, the policy is basic and forces pages out of the cache in two cases: (1) when the cache has reached its maximum size, according to the max_pool_percent sysfs tunable, or (2) when zswap is unable to allocate new space for the compressed pool.

During resumed writeback, zswap decompresses the page, adds it back to the swap cache, and schedules writeback into the swap slot that was previously reserved. By splitting swap_writepage() into two functions after frontswap_store() is called, zswap can resume writeback from the point where the initial writeback terminated in frontswap. The new function is called __swap_writepage().
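
In outline, the split looks like this (simplified; the real __swap_writepage() also takes a completion callback so that zswap can reuse it for resumed writeback):

    int swap_writepage(struct page *page, struct writeback_control *wbc)
    {
        if (frontswap_store(page) == 0) {
            /* stored compressed; complete writeback without any I/O */
            set_page_writeback(page);
            unlock_page(page);
            end_page_writeback(page);
            return 0;
        }
        /* the normal swap-out path and zswap's resumed writeback
           both end up here */
        return __swap_writepage(page, wbc);
    }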

Freeing zswap entries becomes more complex with writeback. Without writeback, pages would only be freed during invalidate operations (zswap_frontswap_invalidate_page()). With writeback, pages can also be freed in zswap_writeback_pages(). These invalidate and writeback functions can run concurrently for the same zswap entry. To guarantee that entries are not freed while being accessed by another thread, a reference count field (called refcount) is used in the zswap_entry structure.

Zsmalloc rationale

One really can't talk about zswap without mentioning zsmalloc, the allocator it uses for compressed page storage, which currently resides in the Linux Staging tree.

Zsmalloc is a slab-based allocator used by zswap; it provides more reliable allocation of large objects in a memory-constrained environment than the kernel slab allocator does. Zsmalloc has already been discussed on LWN, so this section will focus on why zsmalloc is needed in the presence of the kernel slab allocator.

The objects that zswap stores are compressed pages. The default compressor is lzo1x-1, which is known for speed, but not so much for high compression. As a result, zswap objects can frequently be large relative to typical slab objects (>1/8th PAGE_SIZE). This is a problem for the kernel slab allocator under memory pressure.

The kernel slab allocator requires high-order page allocations to back slabs for large objects. For example, on a system with a 4K page size, the kmalloc-512 cache has slabs that are backed by two contiguous pages. kmalloc-2048 requires eight contiguous pages per slab. These high-order page allocations are very likely to fail when the system is under memory pressure.

Zsmalloc addresses this problem by allowing the pages backing a slab (or “size class” in zsmalloc terms) to be both non-contiguous and variable in number; a slab may be composed of fewer than the target number of backing pages. A set of non-contiguous pages backing a slab is stitched together using fields of struct page to create a “zspage”. This allows zsmalloc to service large object allocations, up to PAGE_SIZE, without requiring high-order page allocations.

Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are no kmalloc() cache sizes between PAGE_SIZE/2 and PAGE_SIZE. Zswap frequently needs allocations in this range, however. Using the kernel slab allocator causes the memory savings achieved through compression to be lost in fragmentation.

In order to satisfy these larger allocations while not wasting an entire page, zsmalloc allows objects to span page boundaries at the cost of having to map the allocations before accessing them. This mapping is needed because the object might be contained in two non-contiguous pages. For example, in a zsmalloc size class for objects that are 2/3 of PAGE_SIZE, three objects could be stored in a zspage with two non-contiguous backing pages with no waste. The object stored in the second of the three object positions in the zspage would be split between two different pages.

Zsmalloc is a good fit for zswap. Zswap was evaluated using the kernel slab allocator and these issues did have a significant impact on the frontswap_store() success rate. This was due to kmalloc() allocation failures and a need to reject pages that compressed to sizes greater than PAGE_SIZE/2.

Performance

In order to produce a performance comparison, kernel builds were conducted with an increasing number of threads per run in a constant and constrained amount of memory. The results indicate a runtime reduction of up to 53% and an I/O reduction of up to 76% with zswap compared to normal swapping. The testing system was configured with:

  • Gentoo running v3.7-rc7
  • Quad-core i5-2500 @ 3.3GHz
  • 512MB DDR3 1600MHz (limited with mem=512m on boot)
  • Filesystem and swap on 80GB HDD (about 58MB/s with hdparm -t)

The table below summarizes the test runs.

               Baseline                          zswap                     Change
     N   pswpin  pswpout  majflt  I/O sum   pswpin pswpout  majflt  I/O sum    %I/O     MB
     8        1      335     291      627        0       0     249      249    -60%      1
    12     3688    14315    5290    23293      123     860    5954     6937    -70%     64
    16    12711    46179   16803    75693     2936    7390   46092    56418    -25%     75
    20    42178   133781   49898   225857     9460   28382   92951   130793    -42%    371
    24    96079   357280  105242   558601     7719   18484  109309   135512    -76%   1653

The 'N' column indicates the maximum number of concurrent threads for the kernel build (make -jN) for each run. The next four columns are the statistics for the baseline run without zswap, followed by the same for the zswap run. The I/O sum column for each run is a sum of pswpin (pages swapped in), pswpout (pages swapped out), and majflt (major page faults). The difference between the baseline and zswap runs is shown both in relative terms, as a percentage of I/O reduction, and in absolute terms, as a reduction of X megabytes of I/O related to swapping activity.

A compressed swap cache reduces the efficiency of the page reclaim process: for any store operation, the cache may itself allocate pages to hold the compressed page. This loss of reclaim efficiency puts additional shrinking pressure on the page cache, causing an increase in major page faults, where pages must be re-read from disk. To get a complete picture of the I/O impact, the major page faults must therefore be included in the I/O sum.

The next table shows the total runtime of the kernel builds:

Runtime (in seconds)
     N   base   zswap   %change
     8    107     107        0%
    12    128     110      -14%
    16    191     179       -6%
    20    371     240      -35%
    24    570     267      -53%

Comparing runs with the same number of threads, zswap reduces the runtime impact of swap activity; as memory becomes increasingly constrained, the baseline runs degrade much faster than the zswap runs.

The measurements of average CPU utilization during the builds are:

%CPU utilization (out of 400% on 4 CPUs)
     N   base   zswap   %change
     8    317     319        1%
    12    267     311       16%
    16    179     191        7%
    20     94     143       52%
    24     60     128      113%

The CPU utilization table shows that with zswap, the kernel build is able to make more productive use of the CPUs, as is expected from the runtime results.

Additional performance testing was performed using SPECjbb. Metrics on the performance improvements and I/O reductions that can be achieved using zswap on both x86 and Power7+ (with and without hardware compression acceleration) can be found on this page.

Conclusion

Zswap is a compressed swap cache that can evict pages, on an LRU basis, to the backing swap device when the compressed pool reaches its size limit or is unable to obtain additional pages from the buddy allocator. Its approach trades CPU cycles for reduced swap I/O. That trade-off can result in a significant performance improvement, as reads from and writes to the compressed cache are almost always faster than reading from a swap device, which incurs the latency of an asynchronous block I/O read.

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds