Kernel development [LWN.net]

Kernel release status

The current development kernel is 3.6-rc7, released on September 23. This one includes a codename change to "Terrified Chipmunk." Linus says: "So if everything works out well, and the upcoming week is calmer still, I suspect I can avoid another -rc. Fingers crossed."

Stable updates: 3.2.30 was released on September 20.


Quotes of the week

It is an accepted fact that memcg sucks. But can it suck faster?

— Glauber Costa

This was motivated by a bug report last February complaining about 200-microsecond latency spikes from RCU grace-period initialization. On systems with 4096 CPUs.

Real-time response. It is far bigger than I thought.

— Paul McKenney

I only mentioned it to see if your virtual crap detector is still working. Looks like it's still in top condition, low latency and 100% hit rate.

— Avi Kivity


Adding a huge zero page

By Jonathan Corbet
September 26, 2012

The transparent huge pages feature allows applications to take advantage of the larger page sizes supported by most contemporary processors without the need for explicit configuration by administrators, developers, or users. It is mostly a performance-enhancing feature: huge pages reduce the pressure on the system's translation lookaside buffer (TLB), making memory accesses faster. It can also save a bit of memory, though, as the result of the elimination of a layer of page tables. But, as it turns out, transparent huge pages can actually increase the memory usage of an application significantly under certain conditions. The good news is that a solution is at hand; it is as easy as a page full of zeroes.

Transparent huge pages are mainly used for anonymous pages — pages that are not backed by a specific file on disk. These are the pages forming the data areas of processes. When an anonymous memory area is created or extended, no actual pages of memory are allocated (whether transparent huge pages are enabled or not). That is because a typical program will never touch many of the pages that are part of its address space; allocating pages before there is a demonstrated need would waste a considerable amount of time and memory. So the kernel will wait until the process tries to access a specific page, generating a page fault, before allocating memory for that page.

But, even then, there is an optimization that can be made. New anonymous pages must be filled with zeroes; to do anything else would be to risk exposing whatever data was left in the page by its previous user. Programs often depend on the initialization of their memory; since they know that memory starts zero-filled, there is no need to initialize that memory themselves. As it turns out, a lot of those pages may never be written to; they stay zero-filled for the life of the process that owns them. Once that is understood, it does not take long to see that there is an opportunity to save a lot of memory by sharing those zero-filled pages. One zero-filled page looks a lot like another, so there is little value in making too many of them.

So, if a process instantiates a new (non-huge) page by trying to read from it, the kernel still will not allocate a new memory page. Instead, it maps a special page, called simply the "zero page," into the process's address space. Thus, all unwritten anonymous pages, across all processes in the system, are, in fact, sharing one special page. Needless to say, the zero page is always mapped read-only; it would not do to have some process changing the value of zero for everybody else. Whenever a process attempts to write to the zero page, it will generate a write-protection fault; the kernel will then (finally) get around to allocating a real page of memory and substitute it into the process's address space at the right spot.

This behavior is easy to observe. As Kirill Shutemov described, a process executing a bit of code like this:

    posix_memalign((void **)&p, 2 * MB, 200 * MB);
    for (i = 0; i < 200 * MB; i+= 4096)
        assert(p[i] == 0);
    pause();

will have a surprisingly small resident set at the time of the pause() call. It has just worked through 200MB of memory, but that memory is all represented by a single zero page. The system works as intended.
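For those who want to watch the effect, here is a minimal sketch of a fuller test. It is not Kirill's program; the names resident_pages(), run_fault_test(), and struct fault_report are invented for illustration, and it assumes a Linux system where /proc/self/statm is readable (the second field of that file is the resident set size in pages). It read-faults every page of an anonymous mapping, then write-faults every page, recording the resident set size at each step.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Resident set size in pages (second field of /proc/self/statm),
 * or -1 if the file cannot be read or parsed. */
long resident_pages(void)
{
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}

/* Hypothetical report structure for this sketch. */
struct fault_report {
    long rss_start;       /* before any page is touched */
    long rss_after_read;  /* after reading one byte per page */
    long rss_after_write; /* after writing one byte per page */
    int all_zero;         /* 1 if every byte read was zero */
};

/* Map len bytes of anonymous memory, read-fault every page, then
 * write-fault every page. Returns 0 on success, -1 on failure. */
int run_fault_test(size_t len, struct fault_report *r)
{
    long page_size = sysconf(_SC_PAGESIZE);
    volatile char *p;
    size_t i, step;

    if (page_size <= 0)
        return -1;
    step = (size_t)page_size;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    r->rss_start = resident_pages();

    r->all_zero = 1;
    for (i = 0; i < len; i += step)   /* read faults only */
        if (p[i] != 0)
            r->all_zero = 0;
    r->rss_after_read = resident_pages();

    for (i = 0; i < len; i += step)   /* write faults: real pages needed */
        p[i] = 1;
    r->rss_after_write = resident_pages();

    munmap((void *)p, len);
    return 0;
}
```

Calling run_fault_test() with a 200MB length and printing the three fields of the report should show where the memory goes on a given system; the numbers will depend on the kernel version and on whether transparent huge pages are enabled, so none are quoted here.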

Or, it does until the transparent huge pages feature is enabled; then that process will show the full 200MB of allocated memory. A growth of memory usage by two orders of magnitude is not the sort of result users are typically looking for when they enable a performance-enhancing feature. So, Kirill says, some sites are finding themselves forced to disable transparent huge pages in self-defense.

The problem is simple enough: there is no huge zero page. The transparent huge pages feature tries to use huge pages whenever possible; when a process faults in a new page, the kernel will try to put a huge page there. Since there is no huge zero page, the kernel will simply allocate a real, zero-filled huge page instead. This behavior leads to correct execution, but it also causes the allocation of a lot of memory that would otherwise not have been needed. Transparent huge page support, in other words, has turned off another important optimization that has been part of the kernel's memory management subsystem for many years.

Once the problem is understood, the solution isn't that hard. Kirill's patch adds a special, zero-filled huge page to function as the huge zero page. Only one such page is needed, since the transparent huge pages feature only uses one size of huge page. With this page in place and used for read faults, the expansion of memory use simply goes away.
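The arithmetic behind the 200MB example can be sketched with a toy model. The code below is not kernel code; it simply counts the physical memory needed to back a region that has only been read, under three policies. It assumes 4KB base pages and 2MB huge pages (the sizes implied by the test snippet), a huge-page-aligned region, and it ignores page tables; the enum and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy names for this toy model. */
enum zero_policy {
    SMALL_ZERO_PAGE,    /* 4KB pages, shared zero page on read faults */
    THP_NO_ZERO_PAGE,   /* huge pages, a real huge page per read fault */
    THP_HUGE_ZERO_PAGE  /* huge pages, shared huge zero page on read faults */
};

#define SMALL_PAGE ((size_t)4096)
#define HUGE_PAGE  ((size_t)2 << 20)

/* Bytes of physical memory backing a region of len bytes in which
 * every page has been read but none has been written. */
size_t resident_after_reads(size_t len, enum zero_policy policy)
{
    size_t huge_pages = (len + HUGE_PAGE - 1) / HUGE_PAGE;

    switch (policy) {
    case SMALL_ZERO_PAGE:
        return len ? SMALL_PAGE : 0;    /* one shared 4KB page */
    case THP_NO_ZERO_PAGE:
        return huge_pages * HUGE_PAGE;  /* every fault allocates 2MB */
    case THP_HUGE_ZERO_PAGE:
        return len ? HUGE_PAGE : 0;     /* one shared 2MB page */
    }
    return 0;
}
```

In this model a 200MB region costs 100 separate huge pages when there is no huge zero page, against a single shared page (small or huge) in the other two cases. The shared page is also shared between processes, so the per-process cost is lower still.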

As always, there are complications: the page is large enough that it would be nice to avoid allocating it if transparent huge pages are not in use. So there's a lazy allocation scheme; Kirill also added a reference count so that the huge zero page can be returned if there is no longer a need for it. That reference counting slows a read-faulting benchmark by 1%, so it's not clear that it is worthwhile; in the end, the developers might conclude that it's better to just keep the huge zero page around once it has been allocated and not pay the reference counting cost. This is, after all, a situation that has come about before with the (small) zero page.
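The general shape of a lazily allocated, reference-counted shared page can be sketched in user space. This is an illustration of the idea only, not the code in Kirill's patch: the hzp_get(), hzp_put(), and hzp_is_allocated() names are invented, the page is an ordinary heap buffer, and the sketch is single-threaded (a kernel implementation would need atomic operations, which is where the measured overhead comes from).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HZP_SIZE ((size_t)2 << 20)   /* assumed 2MB huge page size */

static void *hzp_page;               /* NULL until the first user appears */
static unsigned long hzp_refs;       /* number of current users */

/* Take a reference, allocating the zero-filled page on first use.
 * Returns NULL if the allocation fails. */
const void *hzp_get(void)
{
    if (!hzp_page) {
        hzp_page = calloc(1, HZP_SIZE);   /* calloc returns zeroed memory */
        if (!hzp_page)
            return NULL;
    }
    hzp_refs++;
    return hzp_page;
}

/* Drop a reference; free the page when the last user goes away. */
void hzp_put(void)
{
    assert(hzp_refs > 0);
    if (--hzp_refs == 0) {
        free(hzp_page);
        hzp_page = NULL;
    }
}

/* Nonzero while the shared page is allocated. */
int hzp_is_allocated(void)
{
    return hzp_page != NULL;
}
```

The alternative mentioned above, keeping the page forever once it exists, amounts to dropping hzp_refs and hzp_put() entirely: the fast path becomes a single pointer check, at the cost of 2MB that is never given back.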

There have not been a lot of comments on this patch; the implementation is relatively straightforward and, presumably, does not need a lot in the way of changes. Given the obvious and measurable benefits from the addition of a huge zero page, it should be added to the kernel sometime in the fairly near future; the 3.8 development cycle seems like a reasonable target.


Supervisor mode access prevention

By Jonathan Corbet
September 26, 2012

Operating system designers and hardware designers tend to put a lot of thought into how the kernel can be protected from user-space processes. The security of the system as a whole depends on that protection. But there can also be value in protecting user space from the kernel. The Linux kernel will soon have support for a new Intel processor feature intended to make that possible.

Under anything but the strangest (out of tree) memory configurations, the kernel's memory is always mapped, so user-space code could conceivably read and modify it. But the page protections are set to disallow that access; any attempt by user space to examine or modify the kernel's part of the address space will result in a segmentation violation (SIGSEGV) signal. Access in the other direction is rather less controlled: when the processor is in kernel mode, it has full access to any address that is valid in the page tables. Or nearly full access; the processor will still not normally allow writes to read-only memory, but that check can be disabled when the need arises.
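The user-to-kernel half of that protection is easy to poke at from user space. The sketch below tries to read a single byte from an arbitrary address and reports whether the read raised a signal. The read_faults() name is invented, and the kernel address used in the usage note is an assumption that only makes sense on a 64-bit system where the upper part of the address space belongs to the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

static sigjmp_buf probe_env;

/* Signal handler: abandon the faulting read and jump back to the probe. */
static void probe_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Try to read one byte at addr. Returns 1 if the read raised SIGSEGV
 * or SIGBUS, 0 if it succeeded, -1 if the handlers could not be set. */
int read_faults(const void *addr)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int faulted = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGSEGV, &sa, &old_segv) < 0)
        return -1;
    if (sigaction(SIGBUS, &sa, &old_bus) < 0) {
        sigaction(SIGSEGV, &old_segv, NULL);
        return -1;
    }

    /* Saving the signal mask (second argument 1) lets siglongjmp()
     * restore it, so the signal is unblocked again afterward. */
    if (sigsetjmp(probe_env, 1) == 0)
        (void)*(const volatile unsigned char *)addr;
    else
        faulted = 1;

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return faulted;
}
```

Passing the address of a local variable should report no fault. Passing something like (const void *)(uintptr_t)0xffffffff80000000, an address in the kernel's half of the address space on x86-64, should report one; that constant is an illustrative assumption and not something to rely on across architectures.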

Intel's new "Supervisor Mode Access Prevention" (SMAP) feature changes that situation; those wanting the details can find them starting on page 408 of this reference manual [PDF]. This extension defines a new SMAP bit in the CR4 control register; when that bit is set, any attempt to access user-space memory while running in a privileged mode will lead to a page fault. Linux support for this feature has been posted by H. Peter Anvin to generally positive reviews; it could show up in the mainline as early as 3.7.
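Once hardware and kernel support are both present, the processor's feature list is the obvious place to look. The sketch below assumes a Linux system where /proc/cpuinfo has a "flags" line and where the feature is reported under the name "smap", as kernels with SMAP support do; the line_has_flag() and cpu_has_flag() helpers are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if flag appears in line as a whole whitespace-delimited
 * word, 0 otherwise. */
int line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' ||
                   p[n] == '\t' || p[n] == '\n';

        if (starts && ends)
            return 1;
        p += n;   /* partial match (e.g. inside a longer word); keep looking */
    }
    return 0;
}

/* Returns 1 if the first "flags" line of /proc/cpuinfo lists flag,
 * 0 if it does not, -1 if the file or the line cannot be found. */
int cpu_has_flag(const char *flag)
{
    char line[8192];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int found = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0) {
            found = line_has_flag(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

A call to cpu_has_flag("smap") then says whether the kernel reports the feature as available on the first processor; it says nothing about processors whose flags differ, which this sketch ignores.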

Naturally, there are times when the kernel needs to work with user-space memory. To that end, Intel has defined a separate "AC" flag that controls the SMAP feature. If the AC flag is clear, SMAP protection is in force; if it is set, access to user-space memory is allowed. Two new instructions, STAC (which sets the flag) and CLAC (which clears it), are provided to manipulate that flag relatively quickly. Unsurprisingly, much of Peter's patch set is concerned with adding STAC and CLAC instructions in the right places. User-space access functions (get_user(), for example, or copy_to_user()) clearly need to have user-space access enabled. Other places include transitions between kernel and user mode, futex operations, floating-point unit state saving, and so on. Signal handling, as usual, has special requirements; Peter had to make some significant changes to allow signal delivery to happen without excessive overhead.
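STAC and CLAC cannot be executed from user space, but the rule they implement is simple enough to model. In the toy below, STAC sets the AC flag (opening a window in which user-space memory may be touched) and CLAC clears it again; a supervisor access to user memory faults when SMAP is enabled and AC is clear. Everything here is a simulation with invented names; none of it is code from Peter's patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the two bits of processor state involved. */
struct cpu_model {
    bool smap;  /* models the SMAP bit in CR4 */
    bool ac;    /* models the AC flag */
};

void stac(struct cpu_model *cpu) { cpu->ac = true; }   /* allow user access */
void clac(struct cpu_model *cpu) { cpu->ac = false; }  /* forbid it again */

/* A supervisor-mode access to user memory faults when SMAP is enabled
 * and the AC flag is clear. */
bool user_access_faults(const struct cpu_model *cpu)
{
    return cpu->smap && !cpu->ac;
}

/* Model of a proper user-access helper: open the window, copy, close
 * the window. Returns 0 on success, -1 if the access would fault. */
int copy_from_user_model(struct cpu_model *cpu, void *dst,
                         const void *user_src, size_t n)
{
    int ret = 0;

    stac(cpu);
    if (user_access_faults(cpu))
        ret = -1;
    else
        memcpy(dst, user_src, n);
    clac(cpu);
    return ret;
}

/* Model of buggy code that dereferences a user pointer directly,
 * without opening the window first. */
int direct_deref_model(const struct cpu_model *cpu, char *dst,
                       const char *user_src)
{
    if (user_access_faults(cpu))
        return -1;   /* a fault here; in a real kernel, an oops */
    *dst = *user_src;
    return 0;
}
```

With smap set in the model, copy_from_user_model() succeeds and leaves AC clear afterward, while direct_deref_model() fails; with smap clear, the direct dereference quietly works, which is exactly how such bugs go unnoticed on hardware without the feature.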

Speaking of overhead, support for this feature will clearly have its costs. User-space access functions tend to be expanded inline, so there will be a lot of STAC and CLAC instructions spread around the kernel. The "alternatives" mechanism is used to patch them out if the SMAP feature is not in use (either not supported by the kernel or disabled with the nosmap boot flag), but the kernel will grow a little regardless. The STAC and CLAC instructions also require a little time to execute. Thus far, no benchmarks have been posted to quantify what the cost is; one assumes that it is small but not nonexistent.

The kernel will treat SMAP violations like it treats any other bad pointer access: the result will be an oops.

One might well ask what the value of this protection is, given that the kernel can turn it off at will. The answer is that it can block a whole class of exploits where the kernel is fooled into reading from (or writing to) user-space memory by mistake. The set of null pointer vulnerabilities exposed a few years ago is one obvious example. There have been many situations where an attacker has found a way to get the kernel to use a bad pointer, while the cases where the attacker could execute arbitrary code in kernel space (before exploiting the bad pointer) have been far less common. SMAP should block the more common attacks nicely.

The other benefit, of course, is simply finding kernel bugs. Driver writers (should) know that they cannot dereference user-space pointers directly from the kernel, but code that does so tends to work on some architectures anyway. With SMAP enabled, that kind of mistake will be found and fixed earlier, before the bad code is shipped in a mainline kernel. As is so often the case, there is real value in having the system enforce the rules that developers are supposed to be following.

Linus liked the patch set and nobody else has complained, so the changes have found their way into the "tip" tree. That makes it quite likely that we will see them again quite soon, probably once the 3.7 merge window opens. It will take a little longer, though, to get processors that support this feature; SMAP is set to first appear in the Haswell line, which should start shipping in 2013. But, once the hardware is available, Linux will be able to take advantage of this new feature.


Where the 3.6 kernel came from

By Jonathan Corbet
September 26, 2012

As of this writing, the 3.6 development cycle is nearing its close with the 3.6-rc7 prepatch having been released on September 23. There may or may not be a 3.6-rc8 before the final release, but, either way, the real 3.6 kernel is not far away. It thus seems like an appropriate time for our traditional look at what happened in this cycle and who the active participants were.

At the release of -rc7, Linus had pulled 10,153 non-merge changesets from 1,216 developers into the mainline. That makes this release cycle just a little slower than its immediate predecessors, but, with over 10,000 changesets committed, the development community has certainly not been idle. This development cycle is already slightly longer than 3.5 (which required 62 days) and may be as much as two weeks longer by the end, if another prepatch release is required. Almost 523,000 lines of code were added and almost 252,000 were removed this time around for a net growth of about 271,000 lines.

Most active 3.6 developers

By changesets

H Hartley Sweeten 460 4.5%

Mark Brown 175 1.7%

David S. Miller 154 1.5%

Axel Lin 152 1.5%

Johannes Berg 115 1.1%

Al Viro 113 1.1%

Hans Verkuil 111 1.1%

Lars-Peter Clausen 90 0.9%

Sachin Kamat 84 0.8%

Daniel Vetter 83 0.8%

Eric Dumazet 79 0.8%

Rafael J. Wysocki 77 0.8%

Guenter Roeck 76 0.7%

Alex Elder 76 0.7%

Guennadi Liakhovetski 75 0.7%

Sven Eckelmann 75 0.7%

Ian Abbott 74 0.7%

Arik Nemtsov 74 0.7%

Dan Carpenter 72 0.7%

Shawn Guo 70 0.7%

By changed lines

Greg Kroah-Hartman 113897 18.3%

Mark Brown 18761 3.0%

H Hartley Sweeten 14362 2.3%

John W. Linville 14177 2.3%

Chris Metcalf 11419 1.8%

Hans Verkuil 9493 1.5%

Alex Williamson 7335 1.2%

Pavel Shilovsky 6226 1.0%

Sven Eckelmann 5694 0.9%

Johannes Berg 5518 0.9%

Alexander Block 5465 0.9%

Kevin McKinney 5211 0.8%

David S. Miller 4600 0.7%

Christoph Hellwig 4512 0.7%

Yan, Zheng 4481 0.7%

Felix Fietkau 4433 0.7%

Ola Lilja 4191 0.7%

Johannes Goetzfried 4129 0.7%

Vaibhav Hiremath 4087 0.7%

Nicolas Royer 3989 0.6%

H. Hartley Sweeten is at the top of the changesets column this month as the result of a seemingly unending series of patches to get the Comedi subsystem ready for graduation from the staging tree. Mark Brown continues work on audio drivers and related code. David Miller naturally has patches all over the networking subsystem; his biggest contribution this time around was the long-desired removal of the IPv4 routing cache. Axel Lin made lots of changes to drivers in the regulator and MTD subsystems, among others, and Johannes Berg continues his wireless subsystem work.

Greg Kroah-Hartman pulled the CSR wireless driver into the staging tree to get to the top of the "lines changed" column, even though his 69 changesets weren't quite enough to show up in the changesets column. John Linville removed some old, unused drivers, making him the developer who removed the most code from the kernel this time around. Chris Metcalf added a number of new features to the Tile architecture subtree.

The list of developers credited for reporting problems is worth a look:

Top 3.6 bug reporters

Fengguang Wu 44 7.7%

Martin Hundebøll 21 3.7%

David S. Miller 19 3.3%

Dan Carpenter 17 3.0%

Randy Dunlap 14 2.4%

Bjørn Mork 11 1.9%

Al Viro 10 1.7%

Ian Abbott 9 1.6%

Stephen Rothwell 9 1.6%

Eric Dumazet 8 1.4%

What we are seeing here is clearly the result of Fengguang Wu's build and boot testing work. As Fengguang finds problems, he reports them and they get fixed before the wider user community has to deal with them. Coming up with 44 bug reports in just over 60 days is a good bit of work.

Some 208 companies (that we know of) contributed to the 3.6 kernel. The most active of these were:

Most active 3.6 employers

By changesets

(None) 1124 11.1%

Red Hat 1035 10.2%

Intel 884 8.7%

(Unknown) 828 8.2%

Vision Engraving Systems 460 4.5%

Texas Instruments 418 4.1%

Linaro 409 4.0%

IBM 286 2.8%

SUSE 282 2.8%

Google 243 2.4%

Wolfson Microelectronics 180 1.8%

(Consultant) 167 1.6%

Freescale 152 1.5%

Ingics Technology 152 1.5%

Samsung 143 1.4%

Qualcomm 135 1.3%

Cisco 127 1.3%

Wizery Ltd. 125 1.2%

NVidia 124 1.2%

Oracle 122 1.2%

By lines changed

Linux Foundation 122520 19.7%

(None) 63608 10.2%

Red Hat 59662 9.6%

Intel 37556 6.0%

(Unknown) 25719 4.1%

Texas Instruments 25533 4.1%

Wolfson Microelectronics 23020 3.7%

Vision Engraving Systems 14876 2.4%

(Consultant) 12830 2.1%

Linaro 11677 1.9%

Tilera 11436 1.8%

Cisco 11223 1.8%

IBM 11006 1.8%

Freescale 9630 1.6%

SUSE 9035 1.5%

Marvell 7984 1.3%

Samsung 7621 1.2%

OMICRON Electronics 7259 1.2%

Etersoft 6236 1.0%

Google 5673 0.9%

Greg Kroah-Hartman's move to the Linux Foundation has caused a bit of a shift in the numbers; the Foundation has moved up in the rankings at SUSE's expense. Beyond that, we see the continued growth of the embedded industry's participation, the continuing slow decline of hobbyist contributions, and an equally slow decline in contributions from "big iron" companies like Oracle and IBM.

Taking a quick look at maintainer signoffs — "Signed-off-by" tags applied by somebody other than the author — the picture is this:

Non-author Signed-off-by tags

By developer

Greg Kroah-Hartman 1232 14.1%

David S. Miller 754 8.6%

John W. Linville 376 4.3%

Mauro Carvalho Chehab 323 3.7%

Mark Brown 291 3.3%

Andrew Morton 280 3.2%

Ingo Molnar 173 2.0%

Luciano Coelho 132 1.5%

Johannes Berg 128 1.5%

Gustavo Padovan 124 1.4%

By company

Red Hat 2323 26.6%

Linux Foundation 1278 14.6%

Intel 592 6.8%

Google 428 4.9%

(None) 411 4.7%

Texas Instruments 359 4.1%

Wolfson Microelectronics 292 3.3%

SUSE 270 3.1%

Samsung 230 2.6%

IBM 189 2.2%

The last time LWN put up a version of this table was for 2.6.34 in May, 2010. At that time, over half the patches heading into the kernel passed through the hands of somebody at Red Hat or SUSE. That situation has changed a bit since then, though the list of developers contains mostly the same names. Once again, we are seeing the mobile and embedded industry on the rise.
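The definition used for this table, a "Signed-off-by" tag applied by somebody other than the author, is easy to state in code. The sketch below is not the tool used to generate the numbers above; it is a minimal illustration, with an invented function name, that assumes a raw commit message with one tag per line, no leading indentation, and identities written exactly the same way in the author field and in the tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the "Signed-off-by:" lines in msg whose identity differs from
 * author (given as "Name <email>"). */
size_t non_author_signoffs(const char *msg, const char *author)
{
    static const char tag[] = "Signed-off-by:";
    const size_t tag_len = sizeof tag - 1;
    size_t count = 0, author_len = strlen(author);
    const char *line = msg;

    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        if (len >= tag_len && strncmp(line, tag, tag_len) == 0) {
            const char *who = line + tag_len;
            size_t who_len = len - tag_len;

            /* Skip the spaces between the tag and the identity. */
            while (who_len && *who == ' ') {
                who++;
                who_len--;
            }
            if (who_len != author_len ||
                strncmp(who, author, author_len) != 0)
                count++;
        }

        line += len;        /* move to the end of this line ... */
        if (*line == '\n')
            line++;         /* ... and past the newline, if any */
    }
    return count;
}
```

A patch written by one developer and applied by a subsystem maintainer will typically carry two tags and count once here; summing that count over every changeset, grouped by the person or employer behind each non-author tag, gives a table of the kind shown above.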

All told, it looks like business as usual. There are a lot of problems to be solved in the kernel space, so vast numbers of developers are working to solve them. There appears to be little danger that Andrew Morton's famous 2005 prediction that "we have to finish this thing one day" will come true anytime in the near future. But, if we can't manage to finish the job, at least we seem to have the energy and resources to keep trying.


Patches and updates

Linus Torvalds Linux 3.6-rc7

Ben Hutchings Linux 3.2.30

morten.rasmussen@arm.com sched: Task placement for heterogeneous MP systems

H. Peter Anvin x86: Supervisor Mode Access Prevention

Joerg Roedel Improve IRQ remapping abstraction in x86 core code

Eric W. Biederman userns subsystem conversions v2

Paul E. McKenney [PATCH tip/core/rcu 0/23] v2 Improvements to RT response on big systems and expedited functions

Kees Cook module: add syscall to load module from fd

Sasha Levin hashtable: introduce a small and naive hashtable

Anton Vorontsov KGDB/KDB FIQ (NMI) debugger

Namhyung Kim perf report: Add suppport for event group view (v2)

David Herrmann input: Dynamic Minor Numbers

Rajagopal Venkat devfreq: Add support for devices which can idle

Damian Hobson-Garcia Add UIO device supporting dynamic memory allocation

Patil, Rachna Support for TSC/ADC MFD driver

Jeff Layton vfs: getname/putname overhaul

zwu.kernel@gmail.com vfs: hot data tracking

Miklos Szeredi overlay filesystem: request for inclusion (v15)

Eric W. Biederman userns: Simple filesystems conversions

Richard Weinberger UBI: Fastmap request for inclusion (v18)

Kent Overstreet Prep work for immutable bio vecs

Paolo Bonzini block: add queue-private command filter, editable via sysfs

Miao Xie Btrfs: introduce extent buffer cache to btrfs

Mel Gorman Reduce compaction scanning and lock contention

Glauber Costa bypass charges if memcg is not used

John Stultz Android netfilter patches

Stephen Hemminger VXLAN driver

Matthew Garrett Second attempt at kernel secure boot support

Dmitry Kasatkin dm-integrity: integrity protection device-mapper target

David Howells Asymmetric keys and module signing

Pavel Emelyanov Checkpoint-restore tool v0.2

Theodore Ts'o Release of e2fsprogs 1.42.6
