Brief items
The current development kernel is 3.8-rc7,
released on February 8. Linus says:
"
Anyway, here it is. Mostly driver updates (usb, networking, radeon,
regulator, sound) with a random smattering of other stuff (btrfs,
networking, so on. And most everything is pretty small."
Stable updates: the 3.7.7,
3.4.30, and 3.0.63 updates were released on
February 11; 3.5.7.5 was released on
February 8.
The 3.7.8, 3.4.31, and 3.0.64 updates are in the review process as of
this writing; they can be expected on or after February 14.
As far as I'm concerned, *the* *only* interface stability
warranties in VFS are those for syscalls. Which means that no
tracepoints are going to be acceptable there. End of story.
— Al Viro
Hey ARM!
We are not going away, we are here to stay. We cannot be silenced
or stopped anymore, and we are becoming harder and harder to
ignore.
It is only a matter of time before we produce an open source
graphics driver stack which rivals your binary in performance. And
that time is measured in weeks and months now. The requests from
your own customers, for support for this open source stack, will
only grow louder and louder.
So please, stop fighting us. Embrace us. Work with us. Your
customers and shareholders will love you for it.
— Luc Verhaegen
Yeah, a plan, I know it goes against normal kernel development
procedures, but hey, we're in our early 20's now, it's about time
we started getting responsible.
— Greg Kroah-Hartman
Greg Kroah-Hartman
writes about plans to
get D-Bus functionality into the kernel (a topic last
covered here in July 2012). "
Our goal
(and I use 'goal' in a very rough term, I have 8 pages of scribbled notes
describing what we want to try to implement here), is to provide a reliable
multicast and point-to-point messaging system for the kernel, that will
work quickly and securely. On top of this kernel feature, we will try to
provide a 'libdbus' interface that allows existing D-Bus users to work
without ever knowing the D-Bus daemon was replaced on their system."
Kernel development news
By Jonathan Corbet
February 13, 2013
The release of
3.8-rc7 suggests that the
3.8 development cycle is nearing its close. This has been a busy cycle
indeed, with, as of this writing, just over 12,300 non-merge changesets
finding their way into the mainline. That makes 3.8 the most active
development cycle ever, edging out 2.6.25 and its mere 12,243 changesets.
Like it or not, the time for the traditional statistics article has come
around; this time, though, your editor has tried looking at things in a
different way.
But, before getting to that, here are the usual numbers. As of this writing,
some 1,253 developers have contributed code to the 3.8 kernel. The most
active of those were:
| Most active 3.8 developers |
| By changesets |
| H Hartley Sweeten | 426 | 3.5% |
| Bill Pemberton | 381 | 3.1% |
| Philipp Reisner | 238 | 1.9% |
| Andreas Gruenbacher | 210 | 1.7% |
| Lars Ellenberg | 146 | 1.2% |
| Mark Brown | 143 | 1.2% |
| Sachin Kamat | 135 | 1.1% |
| Al Viro | 127 | 1.0% |
| Tomi Valkeinen | 115 | 0.9% |
| Wei Yongjun | 114 | 0.9% |
| Axel Lin | 112 | 0.9% |
| Johannes Berg | 104 | 0.8% |
| Kevin McKinney | 103 | 0.8% |
| YAMANE Toshiaki | 101 | 0.8% |
| Ben Skeggs | 100 | 0.8% |
| Paulo Zanoni | 100 | 0.8% |
| Ian Abbott | 98 | 0.8% |
| Mauro Carvalho Chehab | 91 | 0.7% |
| Andrei Emeltchenko | 84 | 0.7% |
| Daniel Vetter | 82 | 0.7% |
| By changed lines |
| Greg Kroah-Hartman | 42448 | 5.8% |
| Sreekanth Reddy | 30415 | 4.2% |
| H Hartley Sweeten | 22581 | 3.1% |
| Naresh Kumar Inna | 19378 | 2.7% |
| Larry Finger | 16798 | 2.3% |
| Paul Walmsley | 16720 | 2.3% |
| Jaegeuk Kim | 13470 | 1.9% |
| Rajendra Nayak | 10398 | 1.4% |
| David Howells | 9946 | 1.4% |
| Wei WANG | 9775 | 1.3% |
| Ben Skeggs | 9395 | 1.3% |
| Jussi Kivilinna | 8784 | 1.2% |
| Philipp Reisner | 8596 | 1.2% |
| Eunchul Kim | 8533 | 1.2% |
| Bill Pemberton | 8293 | 1.1% |
| Nobuhiro Iwamatsu | 7795 | 1.1% |
| Peter Hurley | 7671 | 1.1% |
| Laxman Dewangan | 6898 | 0.9% |
| Lars-Peter Clausen | 6537 | 0.9% |
| Lars Ellenberg | 6320 | 0.9% |
H. Hartley Sweeten's position at the top of the changeset list should be
unsurprising by now; he continues the seemingly endless task of cleaning up
the Comedi data acquisition drivers. Bill Pemberton has been working to
rid the kernel of the __devinit markings (and variants),
reflecting the fact that we all live in a hotplug world now. Philipp
Reisner, Andreas Gruenbacher, and Lars Ellenberg all contributed long lists
of changes to the DRBD distributed block
driver; the resulting code dump caused block maintainer Jens Axboe to
promise Linus: "Following that, it was both made
perfectly clear that there is going to be no more over-the-wall pulls and
how the situation on individual pulls can be improved."
On the lines-changed side, Greg Kroah-Hartman worked on the
__devinit removal, but also removed over 37,000 lines of code from
the staging tree. Sreekanth Reddy made a number of additions to the
mpt3sas SCSI driver, Naresh Kumar Inna contributed the Chelsio FCoE offload
driver, and Larry Finger added the rtl8723ae wireless driver.
Some 205 employers (that we know about) supported development on the 3.8
kernel. The most active of these were:
| Most active 3.8 employers |
| By changesets |
| (None) | 1580 | 12.8% |
| Red Hat | 1112 | 9.0% |
| Intel | 1076 | 8.7% |
| (Unknown) | 917 | 7.4% |
| LINBIT | 595 | 4.8% |
| Linaro | 572 | 4.6% |
| Texas Instruments | 492 | 4.0% |
| Vision Engraving Systems | 426 | 3.5% |
| Samsung | 410 | 3.3% |
| SUSE | 310 | 2.5% |
| IBM | 287 | 2.3% |
| Google | 254 | 2.1% |
| Broadcom | 190 | 1.5% |
| (Consultant) | 171 | 1.4% |
| Wolfson Microelectronics | 161 | 1.3% |
| Freescale | 129 | 1.0% |
| Free Electrons | 128 | 1.0% |
| Parallels | 123 | 1.0% |
| NVidia | 121 | 1.0% |
| NetApp | 121 | 1.0% |
| By lines changed |
| (None) | 79954 | 11.0% |
| Red Hat | 60515 | 8.3% |
| Intel | 46326 | 6.4% |
| Linux Foundation | 43190 | 5.9% |
| (Unknown) | 41097 | 5.7% |
| Samsung | 36596 | 5.0% |
| (Consultant) | 33175 | 4.6% |
| LSI Logic | 30415 | 4.2% |
| Linaro | 29030 | 4.0% |
| Vision Engraving Systems | 26074 | 3.6% |
| LINBIT | 22487 | 3.1% |
| Chelsio | 21534 | 3.0% |
| Texas Instruments | 21276 | 2.9% |
| IBM | 14233 | 2.0% |
| Broadcom | 12236 | 1.7% |
| Renesas Electronics | 11570 | 1.6% |
| NVidia | 10369 | 1.4% |
| Realsil Microelectronics | 9797 | 1.3% |
| Qualcomm | 9345 | 1.3% |
| SUSE | 9139 | 1.3% |
Red Hat remains in its traditional position at the top of the list — but
not by much. Perhaps more significant is that some companies that have
long shown up in the top 20 have fallen off the list this time; those
companies include AMD and Oracle. Meanwhile, we continue to see an
increasingly strong showing from companies in the mobile and embedded
area.
What are they working on?
Many of the companies in the above list have obvious objectives for their
work in the kernel; LINBIT, for example, is a business built around DRBD,
and Wolfson Microelectronics is in the business of selling a lot of audio
hardware. But if companies just focused on driver work, there would be
nobody left to do the core kernel work; thus, a look at what parts of the
kernel any specific company is working on will say something about how
broad its objectives are. To that end, your editor set out to hack on the
gitdm tool to focus on one company at a time. So, for example, from the
3.3 kernel onward (essentially, from the beginning of 2012 to the present),
Red Hat's changes clustered in these areas:
| Red Hat |
| % | Subsystem | Notes |
| 34% | drivers/ | 9% gpu, 6% media, 6% net, 3% md |
| 20% | fs/ | 3% xfs, 3% nfsd, 2% cifs, 2% gfs2, 1% btrfs, 1% ext4 |
| 14% | include/ | |
| 8% | net/ | |
| 8% | tools/ | |
| 7% | arch/x86/ | |
| 7% | kernel/ | |
| 2% | mm/ | |
(Patches touching more than one subsystem are counted in each, so the
percentages can add up to over 100%.)
Red Hat puts a lot of effort into making drivers work, but also has a
strong interest in the filesystem subtree. The large proportion of patches
going into tools/ reflects Red Hat's continued development
of the perf tool.
Intel's focus during the same time period is somewhat different:
| Intel |
| % | Subsystem | Notes |
| 66% | drivers/ | 22% net, 17% gpu, 4% scsi, 3% acpi, 3% usb |
| 17% | net/ | 7% bluetooth, 5% mac80211, 3% nfc |
| 13% | include/ | |
| 7% | arch/x86 | |
| 3% | fs/ | |
Intel is a hardware company, so the bulk of its effort is focused
on making its products work well in the Linux kernel. Improving memory
management or general-purpose filesystems is mostly left for others.
Google's presence in the kernel development community has grown
considerably in the last few years. In this case, the pattern of
development is different yet again:
| Google |
| % | Subsystem | Notes |
| 27% | drivers/ | 4% net, 4% pci, 3% staging, 3% input, 3% gpu |
| 22% | net/ | 11% ipv4, 5% core, 5% ipv6 |
| 21% | include/ | |
| 11% | mm/ | |
| 10% | fs/ | 6% ext4, 1% proc |
| 8% | kernel/ | |
| 6% | arch/arm | |
| 5% | arch/x86 | |
| 4% | Documentation/ | |
Google has an obvious interest in making the Internet work better, and much
of its work in the kernel is aimed toward that goal. But the company also
wants Android to work better (thus more driver work, ARM architecture work)
and better scalability in general, leading to a lot of core kernel work.
Much of Google's work is visible to the outside world in one way or
another, so it is nice to see that the company has been reasonably diligent
about keeping the relevant documentation current.
While we are on the subject of ARM, what about Linaro? This
consortium is very much about hardware
enablement, so it would not be surprising to see a focus on the ARM
architecture subsystem. And, indeed, that's how it looks:
| Linaro |
| % | Subsystem | Notes |
| 47% | drivers/ | 5% pinctrl, 4% clk, 4% mmc, 4% mfd, 3% gpu, 3% media |
| 36% | arch/arm | |
| 12% | include/ | |
| 9% | kernel/ | |
| 6% | sound/ | |
| 5% | Documentation/ | |
| 2% | fs/ | 1.5% pstore |
Almost everything Linaro does is focused on making the hardware work
better; even much of the work on the core kernel is dedicated to timekeeping. And
while lots of work in Documentation/ is always welcome, in this
case, it mostly consists of device tree snippets.
Finally, what about the largest group of all — developers who are working
on their own time? Here is where those developers put their energies:
| Unaffiliated developers |
| % | Subsystem | Notes |
| 68% | drivers/ | 13% staging, 12% net, 10% gpu, 8% media, 6% usb, 2% hid |
| 14% | arch/ | 5% arm, 2% mips, 2% x86, 2% sparc |
| 8% | include/ | |
| 6% | net/ | 2% batman-adv |
| 3% | fs/ | |
| 2% | Documentation/ | |
| 2% | sound/ | |
| 1% | kernel/ | |
Volunteer developers, it seems, share a strong interest in making their own
hardware work; they are also the source of many of the patches going into
the staging tree. That suggests that, in a time when much of the kernel is
becoming more complex and less approachable, the staging tree is providing
a way for new developers to get into the kernel and learn the ropes in a
relatively low-pressure setting. The continued health of the community
depends on a steady flow of new developers, so providing an easy path for
developers to get into kernel development can only be a good thing.
And, certainly, from the information found here, one should be able to
conclude that the development community remains in good health overall. We
are about to complete our busiest development cycle ever with no real signs
of strain. For the time being, things seem to be functioning quite well.
By Jonathan Corbet
February 12, 2013
One of the leading sources of code churn in the 3.8 development cycle was
the removal of the
__devinit family of macros. These macros
marked code and data that were needed only during device initialization and
which, thus, could be disposed of once initialization was complete. They
are being removed for a simple reason: hardware has become so dynamic that
initialization is
never complete; something new can always show up,
and there is no longer any point in building a kernel that cannot cope with
transient devices. Even in this world, though, CPUs are generally seen as
being static. But CPUs, too, can come and go, and that is motivating
changes in how the kernel manages them.
Hotplugging is a familiar concept when one thinks about keyboards,
printers, or storage devices, but it is a bit less so for CPUs:
USB-attached add-on processors are still relatively rare in the market.
Even so, the kernel has had support for CPU hotplug for some time; the
original version of Documentation/cpu-hotplug.txt was added in
2006 for the 2.6.16 kernel. That document mentioned a couple of use cases
for this feature: high-end NUMA hardware that truly has runtime-pluggable
processors, and the ability to disable a faulty CPU in a high-reliability
system. Other uses have since come along, including system suspend operations (where all
CPUs but one are "unplugged" prior to suspending the system) and
virtualization, where virtual CPUs can be given to (or taken from) guests
at will.
So CPU hotplug is a useful feature, but the current implementation in the
kernel is not well loved; in a recent patch
set intended to improve the situation, Thomas Gleixner remarked that
"the current CPU hotplug implementation has become an increasing
nightmare full of races and undocumented behaviour." CPU hotplug
shows a lot of the signs of a feature that has evolved significantly over
time without high-level oversight; among other things, the sequence of
steps followed for an unplug
operation is not the reverse of the steps to plug in a new CPU. But much
of the trouble associated with CPU hotplug is blamed on its extensive use
of notifiers.
The kernel's notifier mechanism is a way
for kernel code to request a callback when an event of interest happens.
Notifiers are, in a sense, general-purpose hooks that anybody in the kernel can
use — and, it seems, just about anybody does. There have been a lot of
complaints about notifiers, as is typified by this comment from Linus in response to
Thomas's patch set:
Notifiers are a disgrace, and almost all of them are a major design
mistake. They all have locking problems, [they] introduce internal
arbitrary API's that are hard to fix later (because you have random
people who decided to hook into them, which is the whole *point* of
those notifier chains).
Notifiers also make the code hard to understand because there is no easy
way to know what will happen when a notifier chain (which is a run-time
construct) is invoked: there could be an arbitrary set of notifiers in the
chain, in any order. The
ordering requirements of specific notifiers can add some fun challenges of
their own.
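For contrast with the state-machine approach described below, here is a
minimal sketch of the notifier-based interface being replaced (illustrative
code, not taken from any particular subsystem):

/* Called for every CPU hotplug event; the action code identifies which
   step of the plug or unplug sequence is happening. */
static int example_cpu_callback(struct notifier_block *nb,
				unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:
		pr_debug("setting up per-CPU state for cpu %u\n", cpu);
		break;
	case CPU_DEAD:
		pr_debug("cpu %u is gone; releasing its resources\n", cpu);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block example_cpu_notifier = {
	.notifier_call = example_cpu_callback,
	/* ordering relative to other notifiers is controlled only by an
	   optional .priority value, which is one source of the problems
	   described above */
};

Such a block is registered with register_cpu_notifier(); nothing in the
code reveals what else is on the chain or in what order it will run.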
The process of unplugging a CPU requires a surprisingly long list of actions. The
scheduler must be informed so it can migrate processes off the affected CPU
and shut down the relevant run queue. Per-CPU kernel threads need to be
told to exit or "park" themselves. CPU frequency governors need to be told
to stop worrying about that processor. Almost anything with per-CPU
variables will need to make arrangements for one CPU to go away. Timers
running on the outgoing CPU need to be relocated. The read-copy-update
subsystem must be told to stop tracking the CPU and to ensure that any RCU
callbacks for that CPU get taken care of. Every architecture has its own
low-level details to take care of. The perf events subsystem has an
impressive set of requirements of its own. And so on; this list is nowhere
near comprehensive.
All of these actions are currently accomplished by way of a set of notifier
callbacks which, with luck, get called in the right order.
Meanwhile, plugging in a new CPU requires an analogous set of operations,
but those are handled in an asymmetric manner with a different set of
callbacks. The end result is that the mechanism is fragile and that few
people have any real understanding of all the steps needed to plug or
unplug a CPU.
Thomas's objective is not to rewrite all those notifier functions or
fundamentally change what is done to implement a CPU hotplug operation — at
least, not yet. Instead, he is focused on imposing some order on the whole
process so that it can be understood by looking at the code. To that end,
he has replaced the current set of notifier chains with a linear sequence
of states to be worked through when bringing up or shutting down a CPU.
There is a single array of cpuhp_step structures, one per state:
struct cpuhp_step {
	int (*startup)(unsigned int cpu);
	int (*teardown)(unsigned int cpu);
};
The startup() function will be called when passing through the
state as a new CPU is brought online, while teardown() is called
when things are moving in the other direction. Many states only have one
function or the other in the current implementation; the eventual goal is
to make the process more symmetrical. In the initial patch set, the set of
states is:
| State | startup | teardown |
| CPUHP_CREATE_THREADS | ✔ | |
| CPUHP_PERF_X86_UNCORE_PREP | ✔ | ✔ |
| CPUHP_PERF_X86_PREPARE | ✔ | ✔ |
| CPUHP_PERF_BFIN | ✔ | |
| CPUHP_PERF_POWER | ✔ | |
| CPUHP_PERF_SUPERH | ✔ | |
| CPUHP_PERF_PREPARE | ✔ | ✔ |
| CPUHP_SCHED_MIGRATE_PREP | ✔ | ✔ |
| CPUHP_WORKQUEUE_PREP | ✔ | |
| CPUHP_RCUTREE_PREPARE | ✔ | ✔ |
| CPUHP_HRTIMERS_PREPARE | ✔ | ✔ |
| CPUHP_TIMERS_PREPARE | ✔ | ✔ |
| CPUHP_PROFILE_PREPARE | ✔ | ✔ |
| CPUHP_X2APIC_PREPARE | ✔ | ✔ |
| CPUHP_SMPCFD_PREPARE | ✔ | ✔ |
| CPUHP_SMPCFD_PREPARE | ✔ | |
| CPUHP_SLAB_PREPARE | ✔ | ✔ |
| CPUHP_NOTIFY_PREPARE | ✔ | |
| CPUHP_NOTIFY_DEAD | | ✔ |
| CPUHP_CPUFREQ_DEAD | | ✔ |
| CPUHP_SCHED_DEAD | | ✔ |
| CPUHP_CLOCKEVENTS_DEAD | | ✔ |
| CPUHP_BRINGUP_CPU | ✔ | |
| CPUHP_AP_OFFLINE | | | Application processor states |
| CPUHP_AP_SCHED_STARTING | ✔ | |
| CPUHP_AP_PERF_X86_UNCORE_STARTING | ✔ | |
| CPUHP_AP_PERF_X86_AMD_IBS_STARTING | ✔ | ✔ |
| CPUHP_AP_PERF_X86_STARTING | ✔ | ✔ |
| CPUHP_AP_PERF_ARM_STARTING | ✔ | |
| CPUHP_AP_ARM_VFP_STARTING | ✔ | ✔ |
| CPUHP_AP_ARM64_TIMER_STARTING | ✔ | ✔ |
| CPUHP_AP_KVM_STARTING | ✔ | ✔ |
| CPUHP_AP_X86_TBOOT_DYING | | ✔ |
| CPUHP_AP_S390_VTIME_DYING | | ✔ |
| CPUHP_AP_CLOCKEVENTS_DYING | | ✔ |
| CPUHP_AP_RCUTREE_DYING | | ✔ |
| CPUHP_AP_SCHED_NOHZ_DYING | | ✔ |
| CPUHP_AP_SCHED_MIGRATE_DYING | | ✔ |
| CPUHP_AP_MAZ | | | End marker for AP states |
| CPUHP_TEARDOWN_CPU | | ✔ |
| CPUHP_PERCPU_THREADS | ✔ | ✔ |
| CPUHP_SCHED_ONLINE | ✔ | ✔ |
| CPUHP_PERF_ONLINE | ✔ | ✔ |
| CPUHP_SCHED_MIGRATE_ONLINE | ✔ | |
| CPUHP_WORKQUEUE_ONLINE | ✔ | ✔ |
| CPUHP_CPUFREQ_ONLINE | ✔ | ✔ |
| CPUHP_RCUTREE_ONLINE | ✔ | ✔ |
| CPUHP_NOTIFY_ONLINE | ✔ | |
| CPUHP_PROFILE_ONLINE | ✔ | |
| CPUHP_SLAB_ONLINE | ✔ | ✔ |
| CPUHP_NOTIFY_DOWN_PREPARE | | ✔ |
| CPUHP_PERF_X86_UNCORE_ONLINE | ✔ | ✔ |
| CPUHP_PERF_X86_ONLINE | ✔ | |
| CPUHP_PERF_S390_ONLINE | ✔ | ✔ |
Looking at that list, one begins to see why the current CPU hotplug
mechanism is hard to understand. Things are messy enough that Thomas is
not really trying to change anything fundamental in how CPU hotplug works;
most of the existing notifier callbacks are still there; they are just
invoked in a different way. The purpose of the exercise, Thomas said, was:
It's about making the ordering constraints clear. It's about
documenting the existing horror in a way, that one can understand
the hotplug process w/o hallucinogenic drugs.
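The payoff of replacing notifier chains with an ordered array is that the
core logic becomes simple enough to read directly. A rough illustration
(assuming the array is named cpuhp_steps[]; this is not code from the
patch set):

static int cpuhp_bringup(unsigned int cpu, int target_state)
{
	int state, ret = 0;

	for (state = 0; state <= target_state; state++) {
		if (!cpuhp_steps[state].startup)
			continue;
		ret = cpuhp_steps[state].startup(cpu);
		if (ret)
			break;
	}
	if (ret) {
		/* undo the states already entered, in reverse order;
		   the failed state itself is skipped */
		for (state--; state >= 0; state--)
			if (cpuhp_steps[state].teardown)
				cpuhp_steps[state].teardown(cpu);
	}
	return ret;
}

Taking a CPU down is the same walk in the opposite direction, invoking the
teardown() callbacks.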
Once some high-level order has been brought to the CPU hotplug mechanism,
one can think about trying to clean things up. The eventual goal is to
have a much smaller set of externally visible states; for drivers and
filesystems, there will only be "prepare" and "enable" states available,
with no ordering between subsystems. Also, notably, drivers and
filesystems will not be allowed to cause a hotplug operation (in either
direction) to fail. When the process is complete, the hotplug subsystem should be
much more predictable, with a lot more of the details hidden from the rest
of the kernel.
That is all work for a future series, though; the first step is to get the
infrastructure set up. Chances are that will require at least one more
iteration of Thomas's "Episode 1" patch set, meaning that it is
unlikely to be 3.9 material. Starting around 3.10, though, we may well see
significant changes to how CPU hotplugging is handled; the result should be
more comprehensible and reliable code.
February 12, 2013
This article was contributed by Seth Jennings
Swapping is one of the biggest threats to performance. The latency gap
between RAM and swap, even on a fast SSD, can be four orders of magnitude. The
throughput gap is two orders of magnitude. In addition to the speed gap,
storage on which a swap area resides is becoming more shared and
virtualized, which can cause additional I/O latency and nondeterministic
workload performance. The zswap subsystem exists to mitigate these
undesirable effects of swapping through a reduction in I/O activity.
Zswap is a lightweight, write-behind compressed cache for swap pages. It
takes pages that are in the process of being swapped out and attempts to
compress them into a dynamically allocated RAM-based memory pool. If this
process is successful, the writeback to the swap device is deferred and, in
many cases, avoided completely. This results in a significant I/O
reduction and performance gains for systems that are swapping.
Zswap basics
Zswap intercepts pages in the middle of swap writeback and caches them
using the frontswap API. Frontswap has been in the kernel since v3.5 and
has been covered by LWN before. It allows a
backend driver, like zswap, to intercept both swap page writeback and the
page faults for swapped out pages. Zswap also makes use of
the "zsmalloc" allocator (discussed below) for compressed page storage.
Zswap seeks to be as simple as possible in its structure and operation.
There are two primary data structures. The first is the zswap_entry
structure, which contains information about a single compressed page stored
in zswap:
struct zswap_entry {
	struct rb_node rbnode;
	int refcount;
	pgoff_t offset;
	unsigned long handle;	/* zsmalloc allocation */
	unsigned int length;
	/* ... */
};
The second is the zswap_tree structure, which contains a red-black tree of
zswap entries indexed by the offset value:
struct zswap_tree {
	struct rb_root rbroot;
	struct list_head lru;
	spinlock_t lock;
	struct zs_pool *pool;
};
At the highest level, there is an array of zswap_tree structures indexed by
the swap device number.
There is a single lock per zswap_tree to protect the tree
structure during lookups and modifications. The higher-level swap code
provides certain protections that simplify the zswap implementation by not
having to design for concurrent store, load, and invalidate operations on
the same swap entry. While this single-lock design might seem like a
likely source for contention, actual execution demonstrates that the swap
path is largely bottlenecked by other locks at higher levels, such as the
anon_vma mutex or swap_lock. In comparison, the
zswap_tree lock
is very lightly contended. Writeback support, covered in the next section,
also led to this single-lock design.
For page compression, zswap uses compressor modules provided by the kernel's
cryptographic API. This allows users to select the compressor dynamically
at boot time, and gives easy access to hardware compression accelerators or
any other future compression engines.
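As a sketch of what that looks like (a minimal example, not zswap's actual
code, which keeps the transform allocated instead of creating one per
page):

static int compress_one_page(const char *comp_name, const u8 *src,
			     u8 *dst, unsigned int *dlen)
{
	struct crypto_comp *tfm;
	int ret;

	/* the compressor is selected purely by name, e.g. "lzo" */
	tfm = crypto_alloc_comp(comp_name, 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	/* *dlen holds the destination buffer size on entry and the
	   compressed length on successful return */
	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, dlen);
	crypto_free_comp(tfm);
	return ret;
}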
A zswap store operation occurs when a page is selected for swapping by the
reclaim system and frontswap intercepts the page in
swap_writepage(). The operation begins by compressing the page
into a per-CPU temporary buffer. Compressing into the temporary buffer is
required because the compressed size, and thus the size of the permanent
allocation needed to hold it, isn't known until the compression is actually
done. Once the compressed size is known, an object is allocated and the
temporary buffer is copied into the object. Lastly, a zswap_entry
structure is allocated, populated, and inserted into the tree for that swap
device.
If the store fails for any reason, most likely because of an object
allocation failure, zswap returns an error which is propagated up through
frontswap into swap_writepage(). The page is then swapped out to
the swap device as usual.
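Condensed into code, the store path looks roughly like this. It is a
sketch: zswap_compress(), zswap_rb_insert(), and the zswap_trees[] array
name are stand-ins for the real helpers, and error handling is simplified:

/* per-CPU temporary buffer for compression output (name assumed) */
static DEFINE_PER_CPU(u8 *, zswap_dstmem);

static int zswap_store_sketch(unsigned type, pgoff_t offset,
			      struct page *page)
{
	struct zswap_tree *tree = zswap_trees[type];  /* per-device tree */
	struct zswap_entry *entry;
	unsigned int dlen = PAGE_SIZE;
	unsigned long handle;
	u8 *buf, *dst;

	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
	if (!entry)
		return -ENOMEM;

	/* compress into the per-CPU buffer; the compressed size, and thus
	   the size of the permanent allocation, is unknown until now */
	buf = get_cpu_var(zswap_dstmem);
	if (zswap_compress(page, buf, &dlen))
		goto reject;

	/* allocate a right-sized object and copy the result into it */
	handle = zs_malloc(tree->pool, dlen);
	if (!handle)
		goto reject;
	dst = zs_map_object(tree->pool, handle, ZS_MM_WO);
	memcpy(dst, buf, dlen);
	zs_unmap_object(tree->pool, handle);
	put_cpu_var(zswap_dstmem);

	/* populate the entry and insert it into this device's tree */
	entry->offset = offset;
	entry->handle = handle;
	entry->length = dlen;
	entry->refcount = 1;
	spin_lock(&tree->lock);
	zswap_rb_insert(&tree->rbroot, entry);
	spin_unlock(&tree->lock);
	return 0;

reject:
	put_cpu_var(zswap_dstmem);
	kfree(entry);
	return -ENOMEM;	/* propagates up; the page goes to the swap device */
}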
A load operation occurs when a program page faults on a page table entry
(PTE) that contains a swap entry and is intercepted by frontswap in
swap_readpage(). The swap entry contains the device and offset
information needed to look up the zswap entry in the appropriate tree.
Once the entry is located, the data is decompressed directly into the page
allocated by the page fault code. The entry is not removed from the tree
during a load; it remains up-to-date until the entry is invalidated.
An invalidate operation occurs when the reference count for a
particular swap offset becomes zero in swap_entry_free(). In this case,
the zswap entry is removed from the appropriate tree, and the entry and the
zsmalloc allocation that it references are freed.
To be preemption-friendly, interrupts are never disabled. Preemption is
only disabled during compression while accessing the per-CPU temporary
buffer page, and during decompression while accessing a mapped
zsmalloc allocation.
Zswap writeback
To operate optimally as a cache, zswap should hold the most recently used pages. With
frontswap, there is, unfortunately, a real potential for an inverse least
recently used
(LRU) condition in which the cache fills with older pages, and newer pages
are forced out to the slower swap device. To address this, zswap is
designed with "resumed" writeback in mind.
As background, the process for swapping pages follows these steps:
- First, an anonymous memory page is selected for swapping and a slot is
allocated in the swap device.
- Next, the page is unmapped from all processes using that page. The
PTEs referencing that page are filled with the swap entry that consists of
the swap type and offset where the page can be found.
- Lastly, the page is scheduled for writeback to the swap device.
When frontswap_store() in swap_writepage() is successful,
the writeback step is not performed. However, the slot in the swap device has been
allocated and is still reserved for the page even though the page only
resides in the frontswap backend. Resumed writeback in zswap forces pages
out of the compressed cache into their previously reserved swap slots in
the swap device. Currently, the policy is basic and forces pages out from
the cache in two cases: (1) when the cache has reached its maximum size
according to the max_pool_percent sysfs tunable, or (2) when zswap is
unable to allocate new space for the compressed pool.
During resumed writeback, zswap decompresses the page, adds it back to the
swap cache, and schedules writeback into the swap slot that was previously
reserved. By splitting swap_writepage() into two functions after
frontswap_store() is called, zswap can resume writeback from the point
where the initial writeback terminated in frontswap. The new function is
called __swap_writepage().
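In sketch form, the result looks like this (simplified; the real functions
take additional arguments, such as the writeback completion callback):

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	if (frontswap_store(page) == 0) {
		/* the page now lives, compressed, in zswap; complete the
		   "writeback" without issuing any device I/O */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}
	/* the part below the frontswap hook, callable on its own so that
	   zswap can later resume writeback into the reserved swap slot */
	return __swap_writepage(page, wbc);
}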
Freeing zswap entries becomes more complex with writeback. Without
writeback, pages would only be freed during invalidate operations
(zswap_frontswap_invalidate_page()). With writeback, pages can also be
freed in zswap_writeback_pages(). These invalidate and writeback functions
can run concurrently for the same zswap entry. To guarantee that entries
are not freed while being accessed by another thread, a reference count
field (called refcount) is used in the zswap_entry structure.
Zsmalloc rationale
One really can't talk about zswap without mentioning zsmalloc, the
allocator it uses for compressed page storage, which currently resides in
the Linux staging tree.
Zsmalloc is a slab-based allocator; it
provides more reliable allocation of large objects in a memory-constrained
environment than does the kernel slab allocator. Zsmalloc has already been
discussed on LWN, so this section will
focus more on the need for zsmalloc in the presence of the kernel slab
allocator.
The objects that zswap stores are compressed pages. The default compressor
is lzo1x-1, which is known for speed, but not so much for high compression. As a
result, zswap objects can frequently be large relative to typical slab
objects (>1/8th PAGE_SIZE). This is
a problem for the kernel slab allocator under memory pressure.
The kernel slab allocator requires high-order page allocations to back
slabs for large objects. For example, on a system with a 4K page size, the
kmalloc-512 cache has slabs that are backed by two contiguous
pages. kmalloc-2048 requires eight contiguous pages per slab. These high-order
page allocations are very likely to fail when the system is under memory
pressure.
Zsmalloc addresses this problem by allowing the pages backing a
slab (or “size class” in zsmalloc terms) to be both non-contiguous and
variable in number. They are variable in number because zsmalloc allows a
slab to be composed of fewer than the target number of backing pages. A set
of non-contiguous pages backing a slab is stitched together using fields
of struct page to create a “zspage”. This allows zsmalloc to service large
object allocations, up to PAGE_SIZE, without requiring high-order page
allocations.
Additionally, the kernel slab allocator does not allow objects that are
less than a page in size to span a page boundary. This means that if an
object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire
page, resulting in ~50% waste. Hence there are no kmalloc() cache sizes
between PAGE_SIZE/2 and PAGE_SIZE. Zswap frequently needs allocations in
this range, however. Using the kernel slab allocator causes the memory
savings achieved through compression to be lost in fragmentation.
In order to satisfy these larger allocations while not wasting an entire
page, zsmalloc allows objects to span page boundaries at the
cost of having to map the allocations before accessing them. This mapping
is needed because the object might be contained in two non-contiguous
pages. For example, in a zsmalloc size class for objects that
are 2/3 of PAGE_SIZE, three objects could be stored in a zspage with two
non-contiguous backing pages with no waste. The object stored in the
second of the three object positions in the zspage would be split between
two different pages.
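The visible cost is the usage pattern, sketched below: allocations are
opaque handles rather than pointers, and they must be mapped around each
access. (Function signatures follow the staging-tree zsmalloc of the time;
treat the details as illustrative.)

static int zsmalloc_example(const void *src, size_t len)
{
	struct zs_pool *pool;
	unsigned long handle;
	void *dst;

	pool = zs_create_pool("example", GFP_NOWAIT);
	if (!pool)
		return -ENOMEM;

	handle = zs_malloc(pool, len);	/* len may exceed PAGE_SIZE/2 */
	if (!handle) {
		zs_destroy_pool(pool);
		return -ENOMEM;
	}

	/* the object may span two non-contiguous pages, so it must be
	   mapped before it can be touched */
	dst = zs_map_object(pool, handle, ZS_MM_WO);	/* write-only */
	memcpy(dst, src, len);
	zs_unmap_object(pool, handle);

	zs_free(pool, handle);
	zs_destroy_pool(pool);
	return 0;
}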
Zsmalloc is a good fit for zswap. Zswap was evaluated using the
kernel slab allocator, and these issues did have a significant impact on
the frontswap_store() success rate. This was due to kmalloc() allocation
failures and a need to reject pages that compressed to sizes greater than
PAGE_SIZE/2.
Performance
In order to produce a performance comparison, kernel builds were
conducted with an increasing number of threads per run in a constant and
constrained amount of memory. The results indicate a runtime reduction of
53% and an I/O reduction of 76% with zswap compared to normal swapping.
The testing system was configured with:
- Gentoo running v3.7-rc7
- Quad-core i5-2500 @ 3.3GHz
- 512MB DDR3 1600MHz (limited with mem=512m on boot)
- Filesystem and swap on 80GB HDD (about 58MB/s with hdparm -t)
The table below summarizes the test runs.
| | Baseline | zswap | Change |
| N | pswpin | pswpout | majflt | I/O sum | pswpin | pswpout | majflt | I/O sum | %I/O | MB |
| 8 | 1 | 335 | 291 | 627 | 0 | 0 | 249 | 249 | -60% | 1 |
| 12 | 3688 | 14315 | 5290 | 23293 | 123 | 860 | 5954 | 6937 | -70% | 64 |
| 16 | 12711 | 46179 | 16803 | 75693 | 2936 | 7390 | 46092 | 56418 | -25% | 75 |
| 20 | 42178 | 133781 | 49898 | 225857 | 9460 | 28382 | 92951 | 130793 | -42% | 371 |
| 24 | 96079 | 357280 | 105242 | 558601 | 7719 | 18484 | 109309 | 135512 | -76% | 1653 |
The 'N' column indicates the
maximum number of concurrent threads for the kernel build (make -jN) for
each run. The next four columns are the statistics for the baseline run
without zswap, followed by the same for the zswap run. The I/O sum column
for each run is a sum of pswpin (pages swapped in), pswpout (pages swapped
out), and majflt (major page faults). The difference between the baseline
and zswap runs is shown both in relative terms, as a percentage of I/O
reduction, and in absolute terms, as a reduction of X megabytes of I/O
related to swapping activity.
A compressed swap cache reduces the efficiency of the page reclaim process.
For any store operation, the cache may allocate some pages to store the
compressed page, resulting in a reduction of overall page reclaim
efficiency. That reduction, in turn, puts additional shrinking
pressure on the page cache, causing an increase in major page faults, where
pages must be re-read from disk. In order to have a complete picture of
the I/O impact, the major page faults must be considered in the sum of I/O.
The next table shows the total runtime of the kernel builds:
| Runtime (in seconds) |
| N | base | zswap | %change |
| 8 | 107 | 107 | 0% |
| 12 | 128 | 110 | -14% |
| 16 | 191 | 179 | -6% |
| 20 | 371 | 240 | -35% |
| 24 | 570 | 267 | -53% |
For runs with the same number of threads, zswap reduces the runtime impact
of swap activity; as memory becomes increasingly constrained, the baseline
runs degrade much faster than the zswap runs do.
The measurements of
average CPU utilization during the builds are:
| %CPU utilization (out of 400% on 4 cpus) |
| N | base | zswap | %change |
| 8 | 317 | 319 | 1% |
| 12 | 267 | 311 | 16% |
| 16 | 179 | 191 | 7% |
| 20 | 94 | 143 | 52% |
| 24 | 60 | 128 | 113% |
The CPU utilization table shows that with zswap, the kernel build is able
to make more productive use of the CPUs, as is expected from the runtime
results.
Additional performance testing was performed using SPECjbb. Metrics
regarding the performance improvements and I/O reductions that can be
achieved using zswap on both x86 and Power7+ (with and without hardware
compression acceleration) can be found on this page.
Conclusion
Zswap is a compressed swap cache, able to evict pages from the compressed
cache, on an LRU basis, to the backing swap device when the compressed pool
reaches its size limit or the pool is unable to obtain additional pages from
the buddy allocator. Its approach trades CPU cycles for reduced swap I/O.
This trade-off can result in a significant performance improvement, as reads
from and writes to the compressed cache are almost always faster than reading
from a swap device, which incurs the latency of an asynchronous block I/O
read.