Kernel development

Brief items

Kernel release status

The current development kernel is 3.6-rc7, released on September 23. This one includes a codename change to "Terrified Chipmunk." Linus says: "So if everything works out well, and the upcoming week is calmer still, I suspect I can avoid another -rc. Fingers crossed."

Stable updates: 3.2.30 was released on September 20.

Quotes of the week

It is an accepted fact that memcg sucks. But can it suck faster?
Glauber Costa

This was motivated by a bug report last February complaining about 200-microsecond latency spikes from RCU grace-period initialization. On systems with 4096 CPUs.

Real-time response. It is far bigger than I thought.

Paul McKenney

I only mentioned it to see if your virtual crap detector is still working. Looks like it's still in top condition, low latency and 100% hit rate.
Avi Kivity

Kernel development news

Adding a huge zero page

By Jonathan Corbet
September 26, 2012
The transparent huge pages feature allows applications to take advantage of the larger page sizes supported by most contemporary processors without the need for explicit configuration by administrators, developers, or users. It is mostly a performance-enhancing feature: huge pages reduce the pressure on the system's translation lookaside buffer (TLB), making memory accesses faster. It can also save a bit of memory, though, as the result of the elimination of a layer of page tables. But, as it turns out, transparent huge pages can actually increase the memory usage of an application significantly under certain conditions. The good news is that a solution is at hand; it is as easy as a page full of zeroes.

Transparent huge pages are mainly used for anonymous pages — pages that are not backed by a specific file on disk. These are the pages forming the data areas of processes. When an anonymous memory area is created or extended, no actual pages of memory are allocated (whether transparent huge pages are enabled or not). That is because a typical program will never touch many of the pages that are part of its address space; allocating pages before there is a demonstrated need would waste a considerable amount of time and memory. So the kernel will wait until the process tries to access a specific page, generating a page fault, before allocating memory for that page.

But, even then, there is an optimization that can be made. New anonymous pages must be filled with zeroes; to do anything else would be to risk exposing whatever data was left in the page by its previous user. Programs often depend on the initialization of their memory; since they know that memory starts zero-filled, there is no need to initialize that memory themselves. As it turns out, a lot of those pages may never be written to; they stay zero-filled for the life of the process that owns them. Once that is understood, it does not take long to see that there is an opportunity to save a lot of memory by sharing those zero-filled pages. One zero-filled page looks a lot like another, so there is little value in making too many of them.

So, if a process instantiates a new (non-huge) page by trying to read from it, the kernel still will not allocate a new memory page. Instead, it maps a special page, called simply the "zero page," into the process's address space. Thus, all unwritten anonymous pages, across all processes in the system, are, in fact, sharing one special page. Needless to say, the zero page is always mapped read-only; it would not do to have some process changing the value of zero for everybody else. Whenever a process attempts to write to the zero page, it will generate a write-protection fault; the kernel will then (finally) get around to allocating a real page of memory and substitute it into the process's address space at the right spot.
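The read-fault and write-fault behavior described above can be modeled in a few lines of user-space C. This is only a toy sketch, not kernel code: `struct toy_mm`, `toy_read()`, and `toy_write()` are hypothetical names invented for illustration, and the "page table" is just a pair of pointer arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NPAGES    8

/* One shared, never-written page of zeroes (stands in for the zero page). */
static const unsigned char toy_zero_page[TOY_PAGE_SIZE];

/* Hypothetical per-process state: one slot per virtual page. */
struct toy_mm {
    const unsigned char *ro[TOY_NPAGES]; /* read-only mapping, or NULL */
    unsigned char *rw[TOY_NPAGES];       /* private writable page, or NULL */
    size_t allocated;                    /* private pages allocated so far */
};

/* Read "fault": map the shared zero page instead of allocating memory. */
static unsigned char toy_read(struct toy_mm *mm, size_t page, size_t off)
{
    assert(page < TOY_NPAGES && off < TOY_PAGE_SIZE);
    if (mm->rw[page])
        return mm->rw[page][off];
    if (!mm->ro[page])
        mm->ro[page] = toy_zero_page;
    return mm->ro[page][off];
}

/* Write "fault": swap the zero page for a private, zero-filled page. */
static bool toy_write(struct toy_mm *mm, size_t page, size_t off,
                      unsigned char value)
{
    assert(page < TOY_NPAGES && off < TOY_PAGE_SIZE);
    if (!mm->rw[page]) {
        mm->rw[page] = calloc(1, TOY_PAGE_SIZE);
        if (!mm->rw[page])
            return false;
        mm->ro[page] = NULL;
        mm->allocated++;
    }
    mm->rw[page][off] = value;
    return true;
}
```

Starting from a zeroed `struct toy_mm`, any number of `toy_read()` calls leaves `allocated` at zero; only the first `toy_write()` to a given page causes memory to be allocated for it.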

This behavior is easy to observe. As Kirill Shutemov described, a process executing a bit of code like this:

    posix_memalign((void **)&p, 2 * MB, 200 * MB);
    for (i = 0; i < 200 * MB; i += 4096)
        assert(p[i] == 0);
    pause();

will have a surprisingly small resident set at the time of the pause() call. It has just worked through 200MB of memory, but that memory is all represented by a single zero page. The system works as intended.
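For readers who want to try this themselves, here is a self-contained variation on the fragment above. It is a sketch under some assumptions: it uses mmap() with MAP_ANONYMOUS rather than posix_memalign(), so the zero fill is guaranteed but the 2MB alignment is not controlled, and it reads the resident set size from /proc/self/statm, which exists only on Linux. The helper names `resident_pages()` and `read_touch()` are made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MB (1024UL * 1024UL)

/* Resident set size in pages, from /proc/self/statm; -1 on failure. */
static long resident_pages(void)
{
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}

/*
 * Map len bytes of anonymous memory and read one byte from each base
 * page.  Returns the number of non-zero bytes seen (expected to be 0),
 * or -1 if the mapping could not be created.  The mapping is left in
 * place on purpose so that its effect on the resident set can be seen.
 */
static long read_touch(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    long nonzero = 0;
    volatile unsigned char *p;

    assert(page > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < len; i += (size_t)page)
        if (p[i] != 0)
            nonzero++;
    return nonzero;
}
```

Calling `resident_pages()` before and after `read_touch(200 * MB)` shows how much the resident set grew; how large the difference is depends on whether transparent huge pages are enabled on the system in question.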

Or, it does until the transparent huge pages feature is enabled; then that process will show the full 200MB of allocated memory. A growth of memory usage by two orders of magnitude is not the sort of result users are typically looking for when they enable a performance-enhancing feature. So, Kirill says, some sites are finding themselves forced to disable transparent huge pages in self-defense.

The problem is simple enough: there is no huge zero page. The transparent huge pages feature tries to use huge pages whenever possible; when a process faults in a new page, the kernel will try to put a huge page there. Since there is no huge zero page, the kernel will simply allocate a real, zero-filled huge page instead. This behavior leads to correct execution, but it also causes the allocation of a lot of memory that would otherwise not have been needed. Transparent huge page support, in other words, has turned off another important optimization that has been part of the kernel's memory management subsystem for many years.

Once the problem is understood, the solution isn't that hard. Kirill's patch adds a special, zero-filled huge page to function as the huge zero page. Only one such page is needed, since the transparent huge pages feature only uses one size of huge page. With this page in place and used for read faults, the expansion of memory use simply goes away.

As always, there are complications: the page is large enough that it would be nice to avoid allocating it if transparent huge pages are not in use. So there's a lazy allocation scheme; Kirill also added a reference count so that the huge zero page can be returned if there is no longer a need for it. That reference counting slows a read-faulting benchmark by 1%, so it's not clear that it is worthwhile; in the end, the developers might conclude that it's better to just keep the huge zero page around once it has been allocated and not pay the reference counting cost. This is, after all, a situation that has come about before with the (small) zero page.
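The lazy-allocation and reference-counting idea can be sketched in user space as follows. This is an illustration of the general technique only, not Kirill's implementation: the names are hypothetical, the 2MB size is an assumption matching the x86-64 huge page size, and the code is single-threaded where a real kernel version would need atomic operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed 2MB huge page */

static unsigned char *toy_hzp;      /* the shared page; NULL until needed */
static unsigned long toy_hzp_refs;  /* number of current users */

/* Take a reference, allocating the zero-filled page on first use. */
static const unsigned char *toy_get_huge_zero_page(void)
{
    if (!toy_hzp) {
        toy_hzp = calloc(1, TOY_HPAGE_SIZE);
        if (!toy_hzp)
            return NULL;  /* caller would fall back to a private page */
    }
    toy_hzp_refs++;
    return toy_hzp;
}

/* Drop a reference; the page is freed when the last user goes away. */
static void toy_put_huge_zero_page(void)
{
    assert(toy_hzp_refs > 0);
    if (--toy_hzp_refs == 0) {
        free(toy_hzp);
        toy_hzp = NULL;
    }
}
```

Dropping the counter and never freeing the page would remove the bookkeeping from every get and put, at the price of holding on to one huge page for good; that is the tradeoff weighed in the paragraph above.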

There have not been a lot of comments on this patch; the implementation is relatively straightforward and, presumably, does not need a lot in the way of changes. Given the obvious and measurable benefits from the addition of a huge zero page, it should be added to the kernel sometime in the fairly near future; the 3.8 development cycle seems like a reasonable target.

Supervisor mode access prevention

By Jonathan Corbet
September 26, 2012
Operating system designers and hardware designers tend to put a lot of thought into how the kernel can be protected from user-space processes. The security of the system as a whole depends on that protection. But there can also be value in protecting user space from the kernel. The Linux kernel will soon have support for a new Intel processor feature intended to make that possible.

Under anything but the strangest (out of tree) memory configurations, the kernel's memory is always mapped, so user-space code could conceivably read and modify it. But the page protections are set to disallow that access; any attempt by user space to examine or modify the kernel's part of the address space will result in a segmentation violation (SIGSEGV) signal. Access in the other direction is rather less controlled: when the processor is in kernel mode, it has full access to any address that is valid in the page tables. Or nearly full access; the processor will still not normally allow writes to read-only memory, but that check can be disabled when the need arises.

Intel's new "Supervisor Mode Access Prevention" (SMAP) feature changes that situation; those wanting the details can find them starting on page 408 of this reference manual [PDF]. This extension defines a new SMAP bit in the CR4 control register; when that bit is set, any attempt to access user-space memory while running in a privileged mode will lead to a page fault. Linux support for this feature has been posted by H. Peter Anvin to generally positive reviews; it could show up in the mainline as early as 3.7.

Naturally, there are times when the kernel needs to work with user-space memory. To that end, Intel has defined a separate "AC" flag that controls the SMAP feature. If the AC flag is clear, SMAP protection is in force; if it is set, access to user-space memory is allowed. Two new instructions, STAC (set AC) and CLAC (clear AC), are provided to manipulate that flag relatively quickly. Unsurprisingly, much of Peter's patch set is concerned with adding STAC and CLAC instructions in the right places. User-space access functions (get_user(), for example, or copy_to_user()) clearly need to have user-space access enabled. Other places include transitions between kernel and user mode, futex operations, floating-point unit state saving, and so on. Signal handling, as usual, has special requirements; Peter had to make some significant changes to allow signal delivery to happen without excessive overhead.
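The interaction between the CR4 bit, the AC flag, and the two instructions can be captured in a small user-space model. It is a sketch for illustration, not kernel code: `struct toy_cpu` and the `toy_*` functions are invented names, and the model follows Intel's definition in which STAC sets the AC flag (opening a window for user-space access) and CLAC clears it (closing the window again). The bit positions used are CR4 bit 21 for SMAP and EFLAGS bit 18 for AC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_SMAP   (UINT64_C(1) << 21)  /* CR4.SMAP enable bit */
#define X86_EFLAGS_AC  (UINT64_C(1) << 18)  /* EFLAGS.AC flag */

/* Hypothetical, heavily simplified processor state. */
struct toy_cpu {
    uint64_t cr4;
    uint64_t eflags;
    bool supervisor;  /* true while running in kernel mode */
};

static void toy_stac(struct toy_cpu *cpu) { cpu->eflags |= X86_EFLAGS_AC; }
static void toy_clac(struct toy_cpu *cpu) { cpu->eflags &= ~X86_EFLAGS_AC; }

/* Would a data access to a user-space page fault because of SMAP? */
static bool toy_smap_faults(const struct toy_cpu *cpu)
{
    return cpu->supervisor &&
           (cpu->cr4 & X86_CR4_SMAP) &&
           !(cpu->eflags & X86_EFLAGS_AC);
}

/* Shape of a user-access helper: open the window, access, close it. */
static bool toy_get_user(struct toy_cpu *cpu, int *dst, const int *user_src)
{
    bool ok;

    assert(cpu && dst && user_src);
    toy_stac(cpu);
    ok = !toy_smap_faults(cpu);
    if (ok)
        *dst = *user_src;
    toy_clac(cpu);
    return ok;
}
```

In this model a stray dereference of a user-space pointer outside of `toy_get_user()` would fault whenever the SMAP bit is set, which is the property the feature is meant to provide.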

Speaking of overhead, support for this feature will clearly have its costs. User-space access functions tend to be expanded inline, so there will be a lot of STAC and CLAC instructions spread around the kernel. The "alternatives" mechanism is used to patch them out if the SMAP feature is not in use (either not supported by the processor or disabled with the nosmap boot flag), but the kernel will grow a little regardless. The STAC and CLAC instructions also require a little time to execute. Thus far, no benchmarks have been posted to quantify what the cost is; one assumes that it is small but not nonexistent.

The kernel will treat SMAP violations like it treats any other bad pointer access: the result will be an oops.

One might well ask what the value of this protection is, given that the kernel can turn it off at will. The answer is that it can block a whole class of exploits where the kernel is fooled into reading from (or writing to) user-space memory by mistake. The set of null pointer vulnerabilities exposed a few years ago is one obvious example. There have been many situations where an attacker has found a way to get the kernel to use a bad pointer, while the cases where the attacker could execute arbitrary code in kernel space (before exploiting the bad pointer) have been far less common. SMAP should block the more common attacks nicely.

The other benefit, of course, is simply finding kernel bugs. Driver writers (should) know that they cannot dereference user-space pointers directly from the kernel, but code that does so tends to work on some architectures anyway. With SMAP enabled, that kind of mistake will be found and fixed earlier, before the bad code is shipped in a mainline kernel. As is so often the case, there is real value in having the system enforce the rules that developers are supposed to be following.

Linus liked the patch set and nobody else has complained, so the changes have found their way into the "tip" tree. That makes it likely that we will see them again quite soon, probably once the 3.7 merge window opens. It will take a little longer, though, to get processors that support this feature; SMAP is set to first appear in the Haswell line, which should start shipping in 2013. But, once the hardware is available, Linux will be able to take advantage of this new feature.

Where the 3.6 kernel came from

By Jonathan Corbet
September 26, 2012
As of this writing, the 3.6 development cycle is nearing its close with the 3.6-rc7 prepatch having been released on September 23. There may or may not be a 3.6-rc8 before the final release, but, either way, the real 3.6 kernel is not far away. It thus seems like an appropriate time for our traditional look at what happened in this cycle and who the active participants were.

At the release of -rc7, Linus had pulled 10,153 non-merge changesets from 1,216 developers into the mainline. That makes this release cycle just a little slower than its immediate predecessors, but, with over 10,000 changesets committed, the development community has certainly not been idle. This development cycle is already slightly longer than 3.5 (which required 62 days) and may be as much as two weeks longer by the end, if another prepatch release is required. Almost 523,000 lines of code were added and almost 252,000 were removed this time around for a net growth of about 271,000 lines.
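The arithmetic behind these figures is simple enough to check by hand or with a few lines of C. The helpers below are hypothetical names used only for this sketch; `share_tenths()` expresses a count as a share of a total in tenths of a percent, using integer rounding to the nearest tenth.

```c
#include <assert.h>

/* Share of count in total, in tenths of a percent, rounded to nearest. */
static unsigned long share_tenths(unsigned long count, unsigned long total)
{
    assert(total > 0);
    return (count * 1000 + total / 2) / total;
}

/* Net growth in lines: lines added minus lines removed. */
static long net_lines(long added, long removed)
{
    return added - removed;
}
```

For example, `net_lines(523000, 252000)` gives the net growth of about 271,000 lines quoted above, and `share_tenths(460, 10153)` gives 45, meaning that 460 changesets amount to 4.5% of the 10,153 merged so far.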

Most active 3.6 developers

By changesets
Developer                 Changesets  Percent
H Hartley Sweeten                460     4.5%
Mark Brown                       175     1.7%
David S. Miller                  154     1.5%
Axel Lin                         152     1.5%
Johannes Berg                    115     1.1%
Al Viro                          113     1.1%
Hans Verkuil                     111     1.1%
Lars-Peter Clausen                90     0.9%
Sachin Kamat                      84     0.8%
Daniel Vetter                     83     0.8%
Eric Dumazet                      79     0.8%
Rafael J. Wysocki                 77     0.8%
Guenter Roeck                     76     0.7%
Alex Elder                        76     0.7%
Guennadi Liakhovetski             75     0.7%
Sven Eckelmann                    75     0.7%
Ian Abbott                        74     0.7%
Arik Nemtsov                      74     0.7%
Dan Carpenter                     72     0.7%
Shawn Guo                         70     0.7%

By changed lines
Developer                      Lines  Percent
Greg Kroah-Hartman            113897    18.3%
Mark Brown                     18761     3.0%
H Hartley Sweeten              14362     2.3%
John W. Linville               14177     2.3%
Chris Metcalf                  11419     1.8%
Hans Verkuil                    9493     1.5%
Alex Williamson                 7335     1.2%
Pavel Shilovsky                 6226     1.0%
Sven Eckelmann                  5694     0.9%
Johannes Berg                   5518     0.9%
Alexander Block                 5465     0.9%
Kevin McKinney                  5211     0.8%
David S. Miller                 4600     0.7%
Christoph Hellwig               4512     0.7%
Yan, Zheng                      4481     0.7%
Felix Fietkau                   4433     0.7%
Ola Lilja                       4191     0.7%
Johannes Goetzfried             4129     0.7%
Vaibhav Hiremath                4087     0.7%
Nicolas Royer                   3989     0.6%

H. Hartley Sweeten is at the top of the changesets column this month as the result of a seemingly unending series of patches to get the Comedi subsystem ready for graduation from the staging tree. Mark Brown continues work on audio drivers and related code. David Miller naturally has patches all over the networking subsystem; his biggest contribution this time around was the long-desired removal of the IPv4 routing cache. Axel Lin made lots of changes to drivers in the regulator and MTD subsystems, among others, and Johannes Berg continues his wireless subsystem work.

Greg Kroah-Hartman pulled the CSR wireless driver into the staging tree to get to the top of the "lines changed" column, even though his 69 changesets weren't quite enough to show up in the by-changesets list. John Linville removed some old, unused drivers, making him the developer who removed the most code from the kernel this time around. Chris Metcalf added a number of new features to the Tile architecture subtree.

The list of developers credited for reporting problems is worth a look:

Top 3.6 bug reporters
Reporter                     Reports  Percent
Fengguang Wu                      44     7.7%
Martin Hundebøll                  21     3.7%
David S. Miller                   19     3.3%
Dan Carpenter                     17     3.0%
Randy Dunlap                      14     2.4%
Bjørn Mork                        11     1.9%
Al Viro                           10     1.7%
Ian Abbott                         9     1.6%
Stephen Rothwell                   9     1.6%
Eric Dumazet                       8     1.4%

What we are seeing here is clearly the result of Fengguang Wu's build and boot testing work. As Fengguang finds problems, he reports them and they get fixed before the wider user community has to deal with them. Coming up with 44 bug reports in just over 60 days is a good bit of work.

Some 208 companies (that we know of) contributed to the 3.6 kernel. The most active of these were:

Most active 3.6 employers

By changesets
Employer                  Changesets  Percent
(None)                          1124    11.1%
Red Hat                         1035    10.2%
Intel                            884     8.7%
(Unknown)                        828     8.2%
Vision Engraving Systems         460     4.5%
Texas Instruments                418     4.1%
Linaro                           409     4.0%
IBM                              286     2.8%
SUSE                             282     2.8%
Google                           243     2.4%
Wolfson Microelectronics         180     1.8%
(Consultant)                     167     1.6%
Freescale                        152     1.5%
Ingics Technology                152     1.5%
Samsung                          143     1.4%
Qualcomm                         135     1.3%
Cisco                            127     1.3%
Wizery Ltd.                      125     1.2%
NVidia                           124     1.2%
Oracle                           122     1.2%

By lines changed
Employer                       Lines  Percent
Linux Foundation              122520    19.7%
(None)                         63608    10.2%
Red Hat                        59662     9.6%
Intel                          37556     6.0%
(Unknown)                      25719     4.1%
Texas Instruments              25533     4.1%
Wolfson Microelectronics       23020     3.7%
Vision Engraving Systems       14876     2.4%
(Consultant)                   12830     2.1%
Linaro                         11677     1.9%
Tilera                         11436     1.8%
Cisco                          11223     1.8%
IBM                            11006     1.8%
Freescale                       9630     1.6%
SUSE                            9035     1.5%
Marvell                         7984     1.3%
Samsung                         7621     1.2%
OMICRON Electronics             7259     1.2%
Etersoft                        6236     1.0%
Google                          5673     0.9%

Greg Kroah-Hartman's move to the Linux Foundation has caused a bit of a shift in the numbers; the Foundation has moved up in the rankings at SUSE's expense. Beyond that, we see the continued growth of the embedded industry's participation, the continuing slow decline of hobbyist contributions, and an equally slow decline in contributions from "big iron" companies like Oracle and IBM.

Taking a quick look at maintainer signoffs — "Signed-off-by" tags applied by somebody other than the author — the picture is this:

Non-author Signed-off-by tags

By developer
Developer                   Signoffs  Percent
Greg Kroah-Hartman              1232    14.1%
David S. Miller                  754     8.6%
John W. Linville                 376     4.3%
Mauro Carvalho Chehab            323     3.7%
Mark Brown                       291     3.3%
Andrew Morton                    280     3.2%
Ingo Molnar                      173     2.0%
Luciano Coelho                   132     1.5%
Johannes Berg                    128     1.5%
Gustavo Padovan                  124     1.4%

By company
Company                     Signoffs  Percent
Red Hat                         2323    26.6%
Linux Foundation                1278    14.6%
Intel                            592     6.8%
Google                           428     4.9%
(None)                           411     4.7%
Texas Instruments                359     4.1%
Wolfson Microelectronics         292     3.3%
SUSE                             270     3.1%
Samsung                          230     2.6%
IBM                              189     2.2%

The last time LWN put up a version of this table was for 2.6.34 in May, 2010. At that time, over half the patches heading into the kernel passed through the hands of somebody at Red Hat or SUSE. That situation has changed a bit since then, though the list of developers contains mostly the same names. Once again, we are seeing the mobile and embedded industry on the rise.

All told, it looks like business as usual. There are a lot of problems to be solved in the kernel space, so vast numbers of developers are working to solve them. There appears to be little danger that Andrew Morton's famous 2005 prediction that "we have to finish this thing one day" will come true anytime in the near future. But, if we can't manage to finish the job, at least we seem to have the energy and resources to keep trying.

Patches and updates

Kernel trees

Linus Torvalds: Linux 3.6-rc7
Ben Hutchings: Linux 3.2.30

Networking

Stephen Hemminger: VXLAN driver

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds