Kernel development
Brief items
Kernel release status
The current development kernel is 3.8-rc7, released on February 8. Linus says: "Anyway, here it is. Mostly driver updates (usb, networking, radeon, regulator, sound) with a random smattering of other stuff (btrfs, networking, so on. And most everything is pretty small."
Stable updates: the 3.7.7, 3.4.30 and 3.0.63 updates were released on February 11; 3.5.7.5 was released on February 8.
The 3.7.8, 3.4.31, and 3.0.64 updates are in the review process as of this writing; they can be expected on or after February 14.
Quotes of the week
We are not going away, we are here to stay. We cannot be silenced or stopped anymore, and we are becoming harder and harder to ignore.
It is only a matter of time before we produce an open source graphics driver stack which rivals your binary in performance. And that time is measured in weeks and months now. The requests from your own customers, for support for this open source stack, will only grow louder and louder.
So please, stop fighting us. Embrace us. Work with us. Your customers and shareholders will love you for it.
Kroah-Hartman: AF_BUS, D-Bus, and the Linux kernel
Greg Kroah-Hartman writes about plans to get D-Bus functionality into the kernel (a topic last covered here in July, 2012). "Our goal (and I use 'goal' in a very rough term, I have 8 pages of scribbled notes describing what we want to try to implement here), is to provide a reliable multicast and point-to-point messaging system for the kernel, that will work quickly and securely. On top of this kernel feature, we will try to provide a 'libdbus' interface that allows existing D-Bus users to work without ever knowing the D-Bus daemon was replaced on their system."
Kernel development news
Some 3.8 development statistics
The release of 3.8-rc7 suggests that the 3.8 development cycle is nearing its close. This has been a busy cycle indeed, with, as of this writing, just over 12,300 non-merge changesets finding their way into the mainline. That makes 3.8 the most active development cycle ever, edging out 2.6.25 and its mere 12,243 changesets. Like it or not, the time for the traditional statistics article has come around; this time, though, your editor has tried looking at things in a different way.
But, before getting to that, here are the usual numbers. As of this writing, some 1,253 developers have contributed code to the 3.8 kernel. The most active of those were:
Most active 3.8 developers
By changesets
H Hartley Sweeten            426    3.5%
Bill Pemberton               381    3.1%
Philipp Reisner              238    1.9%
Andreas Gruenbacher          210    1.7%
Lars Ellenberg               146    1.2%
Mark Brown                   143    1.2%
Sachin Kamat                 135    1.1%
Al Viro                      127    1.0%
Tomi Valkeinen               115    0.9%
Wei Yongjun                  114    0.9%
Axel Lin                     112    0.9%
Johannes Berg                104    0.8%
Kevin McKinney               103    0.8%
YAMANE Toshiaki              101    0.8%
Ben Skeggs                   100    0.8%
Paulo Zanoni                 100    0.8%
Ian Abbott                    98    0.8%
Mauro Carvalho Chehab         91    0.7%
Andrei Emeltchenko            84    0.7%
Daniel Vetter                 82    0.7%

By changed lines
Greg Kroah-Hartman         42448    5.8%
Sreekanth Reddy            30415    4.2%
H Hartley Sweeten          22581    3.1%
Naresh Kumar Inna          19378    2.7%
Larry Finger               16798    2.3%
Paul Walmsley              16720    2.3%
Jaegeuk Kim                13470    1.9%
Rajendra Nayak             10398    1.4%
David Howells               9946    1.4%
Wei WANG                    9775    1.3%
Ben Skeggs                  9395    1.3%
Jussi Kivilinna             8784    1.2%
Philipp Reisner             8596    1.2%
Eunchul Kim                 8533    1.2%
Bill Pemberton              8293    1.1%
Nobuhiro Iwamatsu           7795    1.1%
Peter Hurley                7671    1.1%
Laxman Dewangan             6898    0.9%
Lars-Peter Clausen          6537    0.9%
Lars Ellenberg              6320    0.9%
H. Hartley Sweeten's position at the top of the changeset list should be unsurprising by now; he continues the seemingly endless task of cleaning up the Comedi data acquisition drivers. Bill Pemberton has been working to rid the kernel of the __devinit markings (and variants), reflecting the fact that we all live in a hotplug world now. Philipp Reisner, Andreas Gruenbacher, and Lars Ellenberg all contributed long lists of changes to the DRBD distributed block driver; the resulting code dump caused block maintainer Jens Axboe to promise Linus that "Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved."
On the lines-changed side, Greg Kroah-Hartman worked on the __devinit removal, but also removed over 37,000 lines of code from the staging tree. Sreekanth Reddy made a number of additions to the mpt3sas SCSI driver, Naresh Kumar Inna contributed the Chelsio FCoE offload driver, and Larry Finger added the rtl8723ae wireless driver.
Some 205 employers (that we know about) supported development on the 3.8 kernel. The most active of these were:
Most active 3.8 employers
By changesets
(None)                      1580   12.8%
Red Hat                     1112    9.0%
Intel                       1076    8.7%
(Unknown)                    917    7.4%
LINBIT                       595    4.8%
Linaro                       572    4.6%
Texas Instruments            492    4.0%
Vision Engraving Systems     426    3.5%
Samsung                      410    3.3%
SUSE                         310    2.5%
IBM                          287    2.3%
                             254    2.1%
Broadcom                     190    1.5%
(Consultant)                 171    1.4%
Wolfson Microelectronics     161    1.3%
Freescale                    129    1.0%
Free Electrons               128    1.0%
Parallels                    123    1.0%
NVidia                       121    1.0%
NetApp                       121    1.0%

By lines changed
(None)                     79954   11.0%
Red Hat                    60515    8.3%
Intel                      46326    6.4%
Linux Foundation           43190    5.9%
(Unknown)                  41097    5.7%
Samsung                    36596    5.0%
(Consultant)               33175    4.6%
LSI Logic                  30415    4.2%
Linaro                     29030    4.0%
Vision Engraving Systems   26074    3.6%
LINBIT                     22487    3.1%
Chelsio                    21534    3.0%
Texas Instruments          21276    2.9%
IBM                        14233    2.0%
Broadcom                   12236    1.7%
Renesas Electronics        11570    1.6%
NVidia                     10369    1.4%
Realsil Microelectronics    9797    1.3%
Qualcomm                    9345    1.3%
SUSE                        9139    1.3%
Red Hat remains in its traditional position at the top of the list — but not by much. Perhaps more significant is that some companies that have long shown up in the top 20 have fallen off the list this time; those companies include AMD and Oracle. Meanwhile, we continue to see an increasingly strong showing from companies in the mobile and embedded area.
What are they working on?
Many of the companies in the above list have obvious objectives for their work in the kernel; LINBIT, for example, is a business built around DRBD, and Wolfson Microelectronics is in the business of selling a lot of audio hardware. But if companies just focused on driver work, there would be nobody left to do the core kernel work; thus, a look at what parts of the kernel any specific company is working on will say something about how broad its objectives are. To that end, your editor set out to hack on the gitdm tool to focus on one company at a time. So, for example, from the 3.3 kernel onward (essentially, from the beginning of 2012 to the present), Red Hat's changes clustered in these areas:
Red Hat
   %   Subsystem        Notes
  34%  drivers/         9% gpu, 6% media, 6% net, 3% md
  20%  fs/              3% xfs, 3% nfsd, 2% cifs, 2% gfs2, 1% btrfs, 1% ext4
  14%  include/
   8%  net/
   8%  tools/
   7%  arch/x86/
   7%  kernel/
   2%  mm/
(Patches touching more than one subsystem are counted in each, so the percentages can add up to over 100%.)
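To see why the columns can sum past 100%, consider the following minimal sketch of the counting rule. It is not gitdm's code; the subsystem flags and the function name are hypothetical, and a real tool would derive the subsystems from the paths each patch touches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subsystem flags, one bit each (illustration only). */
enum { SUB_DRIVERS = 1 << 0, SUB_FS = 1 << 1, SUB_INCLUDE = 1 << 2, SUB_COUNT = 3 };

/*
 * For each subsystem bit, compute the percentage of changesets touching it.
 * A changeset that touches several subsystems is counted once in each,
 * which is why the percentages may add up to more than 100.
 */
void subsystem_percentages(const unsigned *touched, size_t n, double pct[SUB_COUNT])
{
    assert(n > 0);
    for (int s = 0; s < SUB_COUNT; s++) {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++)
            if (touched[i] & (1u << s))
                hits++;
        pct[s] = 100.0 * (double)hits / (double)n;
    }
}
```

With four changesets, three of which touch both drivers/ and include/ or pair include/ with fs/, the per-subsystem shares come out at 75%, 25%, and 75%.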
Red Hat puts a lot of effort into making drivers work, but also has a strong interest in the filesystem subtree. The large proportion of patches going into tools/ reflects Red Hat's continued development of the perf tool.
Intel's focus during the same time period is somewhat different:
Intel
   %   Subsystem        Notes
  66%  drivers/         22% net, 17% gpu, 4% scsi, 3% acpi, 3% usb
  17%  net/             7% bluetooth, 5% mac80211, 3% nfc
  13%  include/
   7%  arch/x86
   3%  fs/
Intel is a hardware company, so the bulk of its effort is focused on making its products work well in the Linux kernel. Improving memory management or general-purpose filesystems is mostly left for others.
Google's presence in the kernel development community has grown considerably in the last few years. In this case, the pattern of development is different yet again:
Google
   %   Subsystem        Notes
  27%  drivers/         4% net, 4% pci, 3% staging, 3% input, 3% gpu
  22%  net/             11% ipv4, 5% core, 5% ipv6
  21%  include/
  11%  mm/
  10%  fs/              6% ext4, 1% proc
   8%  kernel/
   6%  arch/arm
   5%  arch/x86
   4%  Documentation/
Google has an obvious interest in making the Internet work better, and much of its work in the kernel is aimed toward that goal. But the company also wants Android to work better (thus more driver work, ARM architecture work) and better scalability in general, leading to a lot of core kernel work. Much of Google's work is visible to the outside world in one way or another, so it is nice to see that the company has been reasonably diligent about keeping the relevant documentation current.
While we are on the subject of ARM, what about Linaro? This consortium is very much about hardware enablement, so it would not be surprising to see a focus on the ARM architecture subsystem. And, indeed, that's how it looks:
Linaro
   %   Subsystem        Notes
  47%  drivers/         5% pinctrl, 4% clk, 4% mmc, 4% mfd, 3% gpu, 3% media
  36%  arch/arm
  12%  include/
   9%  kernel/
   6%  sound/
   5%  Documentation/
   2%  fs/              1.5% pstore
Almost everything Linaro does is focused on making the hardware work better; even much of the work on the core kernel is dedicated to timekeeping. And while lots of work in Documentation/ is always welcome, in this case, it mostly consists of device tree snippets.
Finally, what about the largest group of all — developers who are working on their own time? Here is where those developers put their energies:
Unaffiliated developers
   %   Subsystem        Notes
  68%  drivers/         13% staging, 12% net, 10% gpu, 8% media, 6% usb, 2% hid
  14%  arch/            5% arm, 2% mips, 2% x86, 2% sparc
   8%  include/
   6%  net/             2% batman-adv
   3%  fs/
   2%  Documentation/
   2%  sound/
   1%  kernel/
Volunteer developers, it seems, share a strong interest in making their own hardware work; they are also the source of many of the patches going into the staging tree. That suggests that, in a time when much of the kernel is becoming more complex and less approachable, the staging tree is providing a way for new developers to get into the kernel and learn the ropes in a relatively low-pressure setting. The continued health of the community depends on a steady flow of new developers, so providing an easy path for developers to get into kernel development can only be a good thing.
And, certainly, from the information found here, one should be able to conclude that the development community remains in good health overall. We are about to complete our busiest development cycle ever with no real signs of strain. For the time being, things seem to be functioning quite well.
Rationalizing CPU hotplugging
One of the leading sources of code churn in the 3.8 development cycle was the removal of the __devinit family of macros. These macros marked code and data that were only needed during device initialization and which, thus, could be disposed of once initialization was complete. These macros are being removed for a simple reason: hardware has become so dynamic that initialization is never complete; something new can always show up, and there is no longer any point in building a kernel that cannot cope with transient devices. Even in this world, though, CPUs are generally seen as being static. But CPUs, too, can come and go, and that is motivating changes in how the kernel manages them.
Hotplugging is a familiar concept when one thinks about keyboards, printers, or storage devices, but it is a bit less so for CPUs: USB-attached add-on processors are still relatively rare in the market. Even so, the kernel has had support for CPU hotplug for some time; the original version of Documentation/cpu-hotplug.txt was added in 2006 for the 2.6.16 kernel. That document mentioned a couple of use cases for this feature: high-end NUMA hardware that truly has runtime-pluggable processors, and the ability to disable a faulty CPU in a high-reliability system. Other uses have since come along, including system suspend operations (where all CPUs but one are "unplugged" prior to suspending the system) and virtualization, where virtual CPUs can be given to (or taken from) guests at will.
So CPU hotplug is a useful feature, but the current implementation in the kernel is not well loved; in a recent patch set intended to improve the situation, Thomas Gleixner remarked that "the current CPU hotplug implementation has become an increasing nightmare full of races and undocumented behaviour." CPU hotplug shows a lot of the signs of a feature that has evolved significantly over time without high-level oversight; among other things, the sequence of steps followed for an unplug operation is not the reverse of the steps to plug in a new CPU. But much of the trouble associated with CPU hotplug is blamed on its extensive use of notifiers.
The kernel's notifier mechanism is a way for kernel code to request a callback when an event of interest happens. Notifiers are, in a sense, general-purpose hooks that anybody in the kernel can use, and, it seems, just about anybody does. There have been a lot of complaints about notifiers, as is typified by this comment from Linus in response to Thomas's patch set:
Notifiers also make the code hard to understand because there is no easy way to know what will happen when a notifier chain (which is a run-time construct) is invoked: there could be an arbitrary set of notifiers in the chain, in any order. The ordering requirements of specific notifiers can add some fun challenges of their own.
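For readers who have not met the pattern, the following user-space sketch models a notifier chain. It is a simplified stand-in, not the kernel's notifier API: the structure, the function names, and the example callbacks are all hypothetical. It does show the property complained about above: the order of execution is determined at run time by who registered, and with what priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of a notifier chain (not the kernel API). */
struct notifier {
    int (*call)(unsigned long event, void *data);
    int priority;               /* higher priority runs first */
    struct notifier *next;
};

/* Insert into the chain, keeping it sorted by descending priority. */
void chain_register(struct notifier **head, struct notifier *n)
{
    assert(n && n->call);
    while (*head && (*head)->priority >= n->priority)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

/* Invoke every callback in chain order; stop at the first non-zero return. */
int chain_call(struct notifier *head, unsigned long event, void *data)
{
    for (struct notifier *n = head; n; n = n->next) {
        int ret = n->call(event, data);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callbacks: each appends one letter to a log passed in 'data'. */
struct call_log { char buf[8]; size_t len; };

static void log_put(void *data, char c)
{
    struct call_log *l = data;
    if (l->len < sizeof(l->buf) - 1)
        l->buf[l->len++] = c;
}

int sched_cb(unsigned long event, void *data) { (void)event; log_put(data, 'S'); return 0; }
int timer_cb(unsigned long event, void *data) { (void)event; log_put(data, 'T'); return 0; }
int rcu_cb(unsigned long event, void *data)   { (void)event; log_put(data, 'R'); return 0; }
```

Nothing at the call site of chain_call() reveals which callbacks will run or in what order; that information lives only in the registrations scattered around the code base.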
The process of unplugging a CPU requires a surprisingly long list of actions. The scheduler must be informed so it can migrate processes off the affected CPU and shut down the relevant run queue. Per-CPU kernel threads need to be told to exit or "park" themselves. CPU frequency governors need to be told to stop worrying about that processor. Almost anything with per-CPU variables will need to make arrangements for one CPU to go away. Timers running on the outgoing CPU need to be relocated. The read-copy-update subsystem must be told to stop tracking the CPU and to ensure that any RCU callbacks for that CPU get taken care of. Every architecture has its own low-level details to take care of. The perf events subsystem has an impressive set of requirements of its own. And so on; this list is nowhere near comprehensive.
All of these actions are currently accomplished by way of a set of notifier callbacks which, with luck, get called in the right order. Meanwhile, plugging in a new CPU requires an analogous set of operations, but those are handled in an asymmetric manner with a different set of callbacks. The end result is that the mechanism is fragile and that few people have any real understanding of all the steps needed to plug or unplug a CPU.
Thomas's objective is not to rewrite all those notifier functions or fundamentally change what is done to implement a CPU hotplug operation — at least, not yet. Instead, he is focused on imposing some order on the whole process so that it can be understood by looking at the code. To that end, he has replaced the current set of notifier chains with a linear sequence of states to be worked through when bringing up or shutting down a CPU. There is a single array of cpuhp_step structures, one per state:
    struct cpuhp_step {
        int (*startup)(unsigned int cpu);
        int (*teardown)(unsigned int cpu);
    };
The startup() function will be called when passing through the state as a new CPU is brought online, while teardown() is called when things are moving in the other direction. Many states only have one function or the other in the current implementation; the eventual goal is to make the process more symmetrical. In the initial patch set, the set of states is:
State startup teardown
CPUHP_CREATE_THREADS ✔
CPUHP_PERF_X86_UNCORE_PREP ✔ ✔
CPUHP_PERF_X86_PREPARE ✔ ✔
CPUHP_PERF_BFIN ✔
CPUHP_PERF_POWER ✔
CPUHP_PERF_SUPERH ✔
CPUHP_PERF_PREPARE ✔ ✔
CPUHP_SCHED_MIGRATE_PREP ✔ ✔
CPUHP_WORKQUEUE_PREP ✔
CPUHP_RCUTREE_PREPARE ✔ ✔
CPUHP_HRTIMERS_PREPARE ✔ ✔
CPUHP_TIMERS_PREPARE ✔ ✔
CPUHP_PROFILE_PREPARE ✔ ✔
CPUHP_X2APIC_PREPARE ✔ ✔
CPUHP_SMPCFD_PREPARE ✔ ✔
CPUHP_SMPCFD_PREPARE ✔
CPUHP_SLAB_PREPARE ✔ ✔
CPUHP_NOTIFY_PREPARE ✔
CPUHP_NOTIFY_DEAD ✔
CPUHP_CPUFREQ_DEAD ✔
CPUHP_SCHED_DEAD ✔
CPUHP_CLOCKEVENTS_DEAD ✔
CPUHP_BRINGUP_CPU ✔
CPUHP_AP_OFFLINE Application processor states
CPUHP_AP_SCHED_STARTING ✔
CPUHP_AP_PERF_X86_UNCORE_STARTING ✔
CPUHP_AP_PERF_X86_AMD_IBS_STARTING ✔ ✔
CPUHP_AP_PERF_X86_STARTING ✔ ✔
CPUHP_AP_PERF_ARM_STARTING ✔
CPUHP_AP_ARM_VFP_STARTING ✔ ✔
CPUHP_AP_ARM64_TIMER_STARTING ✔ ✔
CPUHP_AP_KVM_STARTING ✔ ✔
CPUHP_AP_X86_TBOOT_DYING ✔
CPUHP_AP_S390_VTIME_DYING ✔
CPUHP_AP_CLOCKEVENTS_DYING ✔
CPUHP_AP_RCUTREE_DYING ✔
CPUHP_AP_SCHED_NOHZ_DYING ✔
CPUHP_AP_SCHED_MIGRATE_DYING ✔
CPUHP_AP_MAZ End marker for AP states
CPUHP_TEARDOWN_CPU ✔
CPUHP_PERCPU_THREADS ✔ ✔
CPUHP_SCHED_ONLINE ✔ ✔
CPUHP_PERF_ONLINE ✔ ✔
CPUHP_SCHED_MIGRATE_ONLINE ✔
CPUHP_WORKQUEUE_ONLINE ✔ ✔
CPUHP_CPUFREQ_ONLINE ✔ ✔
CPUHP_RCUTREE_ONLINE ✔ ✔
CPUHP_NOTIFY_ONLINE ✔
CPUHP_PROFILE_ONLINE ✔
CPUHP_SLAB_ONLINE ✔ ✔
CPUHP_NOTIFY_DOWN_PREPARE ✔
CPUHP_PERF_X86_UNCORE_ONLINE ✔ ✔
CPUHP_PERF_X86_ONLINE ✔
CPUHP_PERF_S390_ONLINE ✔ ✔
Looking at that list, one begins to see why the current CPU hotplug mechanism is hard to understand. Things are messy enough that Thomas is not really trying to change anything fundamental in how CPU hotplug works; most of the existing notifier callbacks are still there, they are just invoked in a different way. The purpose of the exercise, Thomas said, was:
Once some high-level order has been brought to the CPU hotplug mechanism, one can think about trying to clean things up. The eventual goal is to have a much smaller set of externally visible states; for drivers and filesystems, there will only be "prepare" and "enable" states available, with no ordering between subsystems. Also, notably, drivers and filesystems will not be allowed to cause a hotplug operation (in either direction) to fail. When the process is complete, the hotplug subsystem should be much more predictable, with a lot more of the details hidden from the rest of the kernel.
That is all work for a future series, though; the first step is to get the infrastructure set up. Chances are that will require at least one more iteration of Thomas's "Episode 1" patch set, meaning that it is unlikely to be 3.9 material. Starting around 3.10, though, we may well see significant changes to how CPU hotplugging is handled; the result should be more comprehensible and reliable code.
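As a rough illustration of the linear state walk described above, here is a user-space sketch built around the same cpuhp_step shape. It is not taken from Thomas's patches: the function names, the example callbacks, and in particular the policy of unwinding already-completed states when a startup() call fails are assumptions made for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the cpuhp_step structure shown in the article. */
struct cpuhp_step {
    int (*startup)(unsigned int cpu);
    int (*teardown)(unsigned int cpu);
};

/*
 * Walk the states in order, calling each startup() that exists. On failure
 * (assumed policy), undo the completed states in reverse order.
 */
int cpu_up_sketch(const struct cpuhp_step *steps, size_t nsteps, unsigned int cpu)
{
    assert(steps != NULL);
    for (size_t i = 0; i < nsteps; i++) {
        int ret = steps[i].startup ? steps[i].startup(cpu) : 0;
        if (ret) {
            while (i-- > 0)
                if (steps[i].teardown)
                    steps[i].teardown(cpu);
            return ret;
        }
    }
    return 0;
}

/* Taking a CPU down is the same walk in the opposite direction. */
void cpu_down_sketch(const struct cpuhp_step *steps, size_t nsteps, unsigned int cpu)
{
    assert(steps != NULL);
    for (size_t i = nsteps; i-- > 0; )
        if (steps[i].teardown)
            steps[i].teardown(cpu);
}

/* Hypothetical callbacks that record the order in which they run. */
char trace[16];
size_t trace_len;

int record(char c)
{
    if (trace_len < sizeof(trace) - 1)
        trace[trace_len++] = c;
    return 0;
}

int threads_up(unsigned int cpu) { (void)cpu; return record('A'); }
int sched_up(unsigned int cpu)   { (void)cpu; return record('B'); }
int sched_down(unsigned int cpu) { (void)cpu; return record('b'); }
int fail_up(unsigned int cpu)    { (void)cpu; record('X'); return -1; }
```

The point of the array is that the ordering is visible in one place: reading the table top to bottom gives the bring-up sequence, and reading it bottom to top gives the shutdown sequence.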
The zswap compressed swap cache
Swapping is one of the biggest threats to performance. The latency gap between RAM and swap, even on a fast SSD, can be four orders of magnitude. The throughput gap is two orders of magnitude. In addition to the speed gap, storage on which a swap area resides is becoming more shared and virtualized, which can cause additional I/O latency and nondeterministic workload performance. The zswap subsystem exists to mitigate these undesirable effects of swapping through a reduction in I/O activity.
Zswap is a lightweight, write-behind compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. If this process is successful, the writeback to the swap device is deferred and, in many cases, avoided completely. This results in a significant I/O reduction and performance gains for systems that are swapping.
Zswap basics
Zswap intercepts pages in the middle of swap writeback and caches them using the frontswap API. Frontswap has been in the kernel since v3.5 and has been covered by LWN before. It allows a backend driver, like zswap, to intercept both swap page writeback and the page faults for swapped out pages. Zswap also makes use of the "zsmalloc" allocator (discussed below) for compressed page storage.
Zswap seeks to be as simple as possible in its structure and operation. There are two primary data structures. The first is the zswap_entry structure, which contains information about a single compressed page stored in zswap:
    struct zswap_entry {
        struct rb_node rbnode;
        int refcount;
        pgoff_t offset;
        unsigned long handle; /* zsmalloc allocation */
        unsigned int length;
        /* ... */
    };
The second is the zswap_tree structure which contains a red-black tree of zswap entries indexed by the offset value:
    struct zswap_tree {
        struct rb_root rbroot;
        struct list_head lru;
        spinlock_t lock;
        struct zs_pool *pool;
    };
At the highest level, there is an array of zswap_tree structures indexed by the swap device number.
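The two-level lookup (swap device number first, then offset within the device) can be sketched in user space as follows. This is only an illustration: a plain unbalanced binary search tree stands in for the kernel's red-black tree, locking is omitted, and MAX_SWAP_DEVICES, struct entry, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SWAP_DEVICES 4      /* hypothetical limit for this sketch */

/* Simplified entry: a plain binary search tree node stands in for rb_node. */
struct entry {
    unsigned long offset;       /* page offset within the swap device */
    unsigned int length;        /* compressed length in bytes */
    struct entry *left, *right;
};

/* One tree root per swap device, indexed by swap device number. */
struct entry *trees[MAX_SWAP_DEVICES];

/* Insert an entry; returns -1 if the offset is already present. */
int tree_insert(unsigned int type, struct entry *e)
{
    assert(type < MAX_SWAP_DEVICES);
    struct entry **link = &trees[type];
    while (*link) {
        if (e->offset == (*link)->offset)
            return -1;
        link = e->offset < (*link)->offset ? &(*link)->left : &(*link)->right;
    }
    e->left = e->right = NULL;
    *link = e;
    return 0;
}

/* Find the entry for (device, offset), or NULL if the page is not cached. */
struct entry *tree_search(unsigned int type, unsigned long offset)
{
    assert(type < MAX_SWAP_DEVICES);
    struct entry *e = trees[type];
    while (e && e->offset != offset)
        e = offset < e->offset ? e->left : e->right;
    return e;
}
```

The same offset may appear once in each device's tree, since the device number selects the tree before the offset is ever compared.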
There is a single lock per zswap_tree to protect the tree structure during lookups and modifications. The higher-level swap code provides certain protections that simplify the zswap implementation by not having to design for concurrent store, load, and invalidate operations on the same swap entry. While this single-lock design might seem like a likely source of contention, actual execution demonstrates that the swap path is largely bottlenecked by other locks at higher levels, such as the anon_vma mutex or swap_lock. In comparison, the zswap_tree lock is very lightly contended. Writeback support, covered in the next section, also led to this single-lock design.
For page compression, zswap uses compressor modules provided by the kernel's cryptographic API. This allows users to select the compressor dynamically at boot time, and gives easy access to hardware compression accelerators or any other future compression engines.
A zswap store operation occurs when a page is selected for swapping by the reclaim system and frontswap intercepts the page in swap_writepage(). The operation begins by compressing the page into a per-CPU temporary buffer. Compressing into the temporary buffer is required because the compressed size, and thus the size of the permanent allocation needed to hold it, isn't known until the compression is actually done. Once the compressed size is known, an object is allocated and the temporary buffer is copied into the object. Lastly, a zswap_entry structure is allocated, populated, and inserted into the tree for that swap device.
If the store fails for any reason, most likely because of an object allocation failure, zswap returns an error which is propagated up through frontswap into swap_writepage(). The page is then swapped out to the swap device as usual.
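The store path (compress into a scratch buffer, allocate exactly what is needed, fall back on failure) might be sketched in user space like this. Several things here are assumptions rather than zswap's code: a trivial run-length encoder stands in for the kernel compressor, malloc() stands in for zsmalloc, a static buffer stands in for the per-CPU buffer, and the rejection threshold, names, and SKETCH_PAGE_SIZE are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/*
 * Stand-in compressor (byte-wise run-length encoding), NOT the compressor
 * the kernel uses. Output is (count, value) pairs; worst case is 2 * len.
 */
size_t rle_compress(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; ) {
        size_t run = 1;
        while (i + run < len && src[i + run] == src[i] && run < 255)
            run++;
        dst[out++] = (unsigned char)run;
        dst[out++] = src[i];
        i += run;
    }
    return out;
}

struct stored_page {
    unsigned char *data;        /* exact-size copy of the compressed bytes */
    size_t length;
};

/*
 * Compress into a temporary buffer first, because the size of the permanent
 * allocation is unknown until compression is done. Returns 0 on success;
 * -1 tells the caller to write the page to the swap device as usual.
 */
int store_sketch(const unsigned char page[SKETCH_PAGE_SIZE], struct stored_page *out)
{
    static unsigned char tmp[2 * SKETCH_PAGE_SIZE];
    assert(out != NULL);
    size_t clen = rle_compress(page, SKETCH_PAGE_SIZE, tmp);

    if (clen >= SKETCH_PAGE_SIZE)   /* no savings: reject (assumed policy) */
        return -1;
    out->data = malloc(clen);
    if (!out->data)                 /* allocation failure: also fall back */
        return -1;
    memcpy(out->data, tmp, clen);
    out->length = clen;
    return 0;
}
```

A page of zeroes compresses to a few dozen bytes and is stored, while a page with no repeated bytes expands under this toy compressor and is rejected, leaving the caller to take the normal swap path.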
A load operation occurs when a program page faults on a page table entry (PTE) that contains a swap entry and is intercepted by frontswap in swap_readpage(). The swap entry contains the device and offset information needed to look up the zswap entry in the appropriate tree. Once the entry is located, the data is decompressed directly into the page allocated by the page fault code. The entry is not removed from the tree during a load; it remains up-to-date until the entry is invalidated.
An invalidate operation occurs when the reference count for a particular swap offset becomes zero in swap_entry_free(). In this case, the zswap entry is removed from the appropriate tree, and the entry and the zsmalloc allocation that it references are freed.
To be preemption-friendly, zswap never disables interrupts. Preemption is only disabled during compression while accessing the per-CPU temporary buffer page, and during decompression while accessing a mapped zsmalloc allocation.
Zswap writeback
To operate optimally as a cache, zswap should hold the most recently used pages. With frontswap, there is, unfortunately, a real potential for an inverse least recently used (LRU) condition in which the cache fills with older pages, and newer pages are forced out to the slower swap device. To address this, zswap is designed with "resumed" writeback in mind.
As background, the process for swapping pages follows these steps:
- First, an anonymous memory page is selected for swapping and a slot is
allocated in the swap device.
- Next, the page is unmapped from all processes using that page. The
PTEs referencing that page are filled with the swap entry that consists of
the swap type and offset where the page can be found.
- Lastly, the page is scheduled for writeback to the swap device.
When frontswap_store() in swap_writepage() is successful, the writeback step is not performed. However, the slot in the swap device has been allocated and is still reserved for the page even though the page only resides in the frontswap backend. Resumed writeback in zswap forces pages out of the compressed cache into their previously reserved swap slots in the swap device. Currently, the policy is basic and forces pages out from the cache in two cases: (1) when the cache has reached its maximum size according to the max_pool_percent sysfs tunable or (2) when zswap is unable to allocate new space for the compressed pool.
During resumed writeback, zswap decompresses the page, adds it back to the swap cache, and schedules writeback into the swap slot that was previously reserved. By splitting swap_writepage() into two functions after frontswap_store() is called, zswap can resume writeback from the point where the initial writeback terminated in frontswap. The new function is called __swap_writepage().
Freeing zswap entries becomes more complex with writeback. Without writeback, pages would only be freed during invalidate operations (zswap_frontswap_invalidate_page()). With writeback, pages can also be freed in zswap_writeback_pages(). These invalidate and writeback functions can run concurrently for the same zswap entry. To guarantee that entries are not freed while being accessed by another thread, a reference count field (called refcount) is used in the zswap_entry structure.
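The reference-counting idea can be shown with a small user-space sketch. The structure and function names below are hypothetical, and the sketch assumes that callers serialize access to the count (in zswap, that role would fall to the tree lock); it is not the zswap implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified entry; in zswap the count lives in struct zswap_entry. */
struct rc_entry {
    int refcount;
    unsigned char *data;
};

/* Allocate an entry holding one reference (the tree's own). */
struct rc_entry *entry_new(size_t len)
{
    struct rc_entry *e = malloc(sizeof(*e));
    if (!e)
        return NULL;
    e->data = malloc(len);
    if (!e->data) {
        free(e);
        return NULL;
    }
    e->refcount = 1;
    return e;
}

/* Take a reference; the caller is assumed to hold the protecting lock. */
void entry_get(struct rc_entry *e)
{
    assert(e->refcount > 0);
    e->refcount++;
}

/* Drop a reference; frees the entry on the last put. Returns 1 if freed. */
int entry_put(struct rc_entry *e)
{
    assert(e->refcount > 0);
    if (--e->refcount > 0)
        return 0;
    free(e->data);
    free(e);
    return 1;
}
```

If writeback takes a reference before using an entry, a concurrent invalidate only drops the tree's reference; the memory is released by whichever path performs the final put.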
Zsmalloc rationale
One really can't talk about zswap without mentioning zsmalloc, the allocator it uses for compressed page storage, which currently resides in the Linux Staging tree.
Zsmalloc is a slab-based allocator used by zswap; it provides more reliable allocation of large objects in a memory constrained environment than does the kernel slab allocator. Zsmalloc has already been discussed on LWN, so this section will focus more on the need for zsmalloc in the presence of the kernel slab allocator.
The objects that zswap stores are compressed pages. The default compressor is lzo1x-1, which is known for speed, but not so much for high compression. As a result, zswap objects can frequently be large relative to typical slab objects (>1/8th PAGE_SIZE). This is a problem for the kernel slab allocator under memory pressure.
The kernel slab allocator requires high-order page allocations to back slabs for large objects. For example, on a system with a 4K page size, the kmalloc-512 cache has slabs that are backed by two contiguous pages. kmalloc-2048 requires eight contiguous pages per slab. These high-order page allocations are very likely to fail when the system is under memory pressure.
Zsmalloc addresses this problem by allowing the pages backing a slab (or “size class” in zsmalloc terms) to be both non-contiguous and variable in number. They are variable in number because zsmalloc allows a slab to be composed of fewer than the target number of backing pages. A set of non-contiguous pages backing a slab are stitched together using fields of struct page to create a “zspage”. This allows zsmalloc to service large object allocations, up to PAGE_SIZE, without requiring high-order page allocations.
Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are no kmalloc() cache sizes between PAGE_SIZE/2 and PAGE_SIZE. Zswap frequently needs allocations in this range, however. Using the kernel slab allocator causes the memory savings achieved through compression to be lost in fragmentation.
In order to satisfy these larger allocations while not wasting an entire page, zsmalloc allows objects to span page boundaries at the cost of having to map the allocations before accessing them. This mapping is needed because the object might be contained in two non-contiguous pages. For example, in a zsmalloc size class for objects that are 2/3 of PAGE_SIZE, three objects could be stored in a zspage with two non-contiguous backing pages with no waste. The object stored in the second of the three object positions in the zspage would be split between two different pages.
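The arithmetic behind that example can be checked with a few lines of C. The sketch assumes 4096-byte pages and uses 2730 bytes as the nearest whole-byte approximation of 2/3 of a page (so two bytes per zspage are left over); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed 4K pages */

/* Index of the backing page holding the first byte of object 'idx'. */
size_t first_page(size_t idx, size_t obj_size)
{
    return idx * obj_size / SKETCH_PAGE_SIZE;
}

/* Index of the backing page holding the last byte of object 'idx'. */
size_t last_page(size_t idx, size_t obj_size)
{
    return (idx * obj_size + obj_size - 1) / SKETCH_PAGE_SIZE;
}

/* Nonzero if the object straddles a page boundary and must be mapped. */
int spans_pages(size_t idx, size_t obj_size)
{
    return first_page(idx, obj_size) != last_page(idx, obj_size);
}

/*
 * Bytes wasted per page when objects may not cross a page boundary,
 * as with the kernel slab allocator.
 */
size_t slab_waste_per_page(size_t obj_size)
{
    assert(obj_size > 0 && obj_size <= SKETCH_PAGE_SIZE);
    size_t per_page = SKETCH_PAGE_SIZE / obj_size;
    return SKETCH_PAGE_SIZE - per_page * obj_size;
}
```

With 2730-byte objects, the first and third objects each sit within a single page while the second straddles the boundary. Without spanning, only one such object fits per page and 1366 bytes go unused; for an object of PAGE_SIZE/2 + 1 bytes the loss is 2047 bytes, or just under half the page.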
Zsmalloc is a good fit for zswap. Zswap was evaluated using the kernel slab allocator and these issues did have a significant impact on the frontswap_store() success rate. This was due to kmalloc() allocation failures and a need to reject pages that compressed to sizes greater than PAGE_SIZE/2.
Performance
In order to produce a performance comparison, kernel builds were conducted with an increasing number of threads per run in a constant and constrained amount of memory. The results indicate a runtime reduction of 53% and an I/O reduction of 76% with zswap compared to normal swapping. The testing system was configured with:
- Gentoo running v3.7-rc7
- Quad-core i5-2500 @ 3.3GHz
- 512MB DDR3 1600MHz (limited with mem=512m on boot)
- Filesystem and swap on 80GB HDD (about 58MB/s with hdparm -t)
The table below summarizes the test runs.
                  Baseline                               zswap                       Change
 N   pswpin  pswpout   majflt  I/O sum     pswpin  pswpout   majflt  I/O sum     %I/O     MB
 8        1      335      291      627          0        0      249      249     -60%      1
12     3688    14315     5290    23293        123      860     5954     6937     -70%     64
16    12711    46179    16803    75693       2936     7390    46092    56418     -25%     75
20    42178   133781    49898   225857       9460    28382    92951   130793     -42%    371
24    96079   357280   105242   558601       7719    18484   109309   135512     -76%   1653
The 'N' column indicates the maximum number of concurrent threads for the kernel build (make -jN) for each run. The next four columns are the statistics for the baseline run without zswap, followed by the same for the zswap run. The I/O sum column for each run is a sum of pswpin (pages swapped in), pswpout (pages swapped out), and majflt (major page faults). The difference between the baseline and zswap runs is shown both in relative terms, as a percentage of I/O reduction, and in absolute terms, as a reduction of X megabytes of I/O related to swapping activity.
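The derived columns can be reproduced with the following sketch. It assumes 4096-byte pages when converting page counts to megabytes and rounds both results to the nearest integer, which matches the figures in the table; the structure and function names are hypothetical.

```c
#include <assert.h>

/* Counters taken from one test run, all in units of pages. */
struct swap_stats {
    unsigned long long pswpin, pswpout, majflt;
};

/* The "I/O sum" column: pages swapped in and out, plus major faults. */
unsigned long long io_sum(struct swap_stats s)
{
    return s.pswpin + s.pswpout + s.majflt;
}

/* Percentage reduction in I/O, rounded to the nearest integer. */
unsigned io_reduction_pct(struct swap_stats base, struct swap_stats zswap)
{
    unsigned long long b = io_sum(base), z = io_sum(zswap);
    assert(b > 0 && z <= b);
    return (unsigned)((100 * (b - z) + b / 2) / b);
}

/* I/O saved in megabytes, rounded, assuming 4096-byte pages. */
unsigned long long io_saved_mb(struct swap_stats base, struct swap_stats zswap)
{
    unsigned long long b = io_sum(base), z = io_sum(zswap);
    assert(z <= b);
    unsigned long long bytes = (b - z) * 4096ULL;
    return (bytes + (1ULL << 19)) >> 20;
}
```

For the N=24 row, the baseline sum is 558601 pages and the zswap sum is 135512 pages, a difference of 423089 pages: a 76% reduction, or about 1653MB.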
A compressed swap cache reduces the efficiency of the page reclaim process. For any store operation, the cache may allocate some pages to store the compressed page, so a reclaimed page does not free a full page of memory. This reduction in efficiency puts additional shrinking pressure on the page cache, causing an increase in major page faults where pages must be re-read from disk. In order to have a complete picture of the I/O impact, the major page faults must be considered in the sum of I/O.
The next table shows the total runtime of the kernel builds:
Runtime (in seconds)
 N    base   zswap   %change
 8     107     107        0%
12     128     110      -14%
16     191     179       -6%
20     371     240      -35%
24     570     267      -53%
The runtime impact of swap activity is decreased when comparing runs with the same number of threads. The rate of degradation is reduced for increasingly constrained runs when comparing baseline and zswap.
The measurements of average CPU utilization during the builds are:
%CPU utilization (out of 400% on 4 cpus)
 N    base   zswap   %change
 8     317     319        1%
12     267     311       16%
16     179     191        7%
20      94     143       52%
24      60     128      113%
The CPU utilization table shows that with zswap, the kernel build is able to make more productive use of the CPUs, as is expected from the runtime results.
Additional performance testing was performed using SPECjbb. Metrics regarding the performance improvements and I/O reductions that can be achieved using zswap on both x86 and Power7+ (with and without hardware compression acceleration), can be found on this page.
Conclusion
Zswap is a compressed swap cache, able to evict pages from the compressed cache, on an LRU basis, to the backing swap device when the compressed pool reaches its size limit or the pool is unable to obtain additional pages from the buddy allocator. Its approach trades CPU cycles for reduced swap I/O. This trade-off can result in a significant performance improvement, as reads from and writes to the compressed cache are almost always faster than reading from a swap device, which incurs the latency of an asynchronous block I/O read.
Page editor: Jonathan Corbet
