The current development kernel is 3.6-rc7,
released on September 23. This one
includes a codename change to "Terrified Chipmunk." Linus says: "So
if everything works out well, and the upcoming week is calmer still, I
suspect I can avoid another -rc. Fingers crossed."
Stable updates: 3.2.30 was released
on September 20.
It is an accepted fact that memcg sucks. But can it suck faster?
— Glauber Costa
This was motivated by a bug report last February complaining about
200-microsecond latency spikes from RCU grace-period
initialization. On systems with 4096 CPUs.
Real-time response. It is far bigger than I thought.
— Paul McKenney
I only mentioned it to see if your virtual crap detector is still
working. Looks like it's still in top condition, low latency and
100% hit rate.
— Avi Kivity
Kernel development news
The transparent huge pages feature allows
applications to take advantage of the larger page sizes supported by most
contemporary processors without the need for explicit configuration by
administrators, developers, or users. It is mostly a performance-enhancing
feature: huge pages reduce the pressure on the system's translation
lookaside buffer (TLB), making memory accesses faster. It can also save
a bit of memory, though, as the result of the elimination of a layer of
page tables. But, as it turns out, transparent huge pages can
actually increase the memory usage of an application significantly under
certain conditions. The good news is that a
solution is at hand; it is as easy as a page full of zeroes.
Transparent huge pages are mainly used for anonymous pages — pages that are
not backed by a specific file on disk. These are the pages forming the
data areas of processes. When an anonymous memory area is created or
extended, no actual pages of memory are allocated (whether transparent huge
pages are enabled or not). That is because a
typical program will never touch many of the pages that are part of its
address space; allocating pages before there is a demonstrated need would
waste a considerable amount of time and memory. So the kernel will wait until the
process tries to access a specific page, generating a page fault, before
allocating memory for that page.
But, even then, there is an optimization that can be made. New anonymous
pages must be filled with zeroes; to do anything else would be to risk
exposing whatever data was left in the page by its previous user. Programs
often depend on the initialization of their memory; since they know that
memory starts zero-filled, there is no need to initialize that memory
themselves. As it turns out, a lot of those pages may never be written to;
they stay zero-filled for the life of the process that owns them. Once
that is understood, it does not take long to see that there is an
opportunity to save a lot of memory by sharing those zero-filled pages.
One zero-filled page looks a lot like another, so there is little value in
making too many of them.
So, if a process instantiates a new (non-huge) page by trying to read from
it, the kernel still will not allocate a new memory page. Instead, it maps a
special page, called simply the "zero page," into the process's address
space. Thus, all unwritten anonymous pages, across all processes
in the system, are, in fact, sharing one special page. Needless to say,
the zero page is always mapped read-only; it would not do to have some
process changing the value of zero for everybody else. Whenever a process
attempts to write to the zero page, it will generate a write-protection
fault; the kernel will then (finally) get around to
allocating a real page of memory and substitute it into the process's
address space at the right spot.
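The logic just described can be modeled in a few lines of user-space C. This is a sketch only: fault_read() and fault_write() are illustrative names, not kernel functions, and a plain array stands in for the page tables.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NPAGES    8

/* Toy model of anonymous-page fault handling.  A read fault on an
 * untouched page maps the single shared zero page read-only; a write
 * fault breaks that sharing by allocating a real, zero-filled page. */
static unsigned char  zero_page[PAGE_SIZE];   /* the one shared zero page */
static unsigned char *page_table[NPAGES];     /* NULL = never touched */
static int            mapped_readonly[NPAGES];

static unsigned char *fault_read(int pfn)
{
    if (page_table[pfn] == NULL) {    /* first read: share the zero page */
        page_table[pfn] = zero_page;
        mapped_readonly[pfn] = 1;
    }
    return page_table[pfn];
}

static unsigned char *fault_write(int pfn)
{
    if (page_table[pfn] == NULL || mapped_readonly[pfn]) {
        page_table[pfn] = calloc(1, PAGE_SIZE);  /* allocate a real page */
        mapped_readonly[pfn] = 0;
    }
    return page_table[pfn];
}
```

Every read fault returns the same shared page; only a write fault allocates real memory, which is why a process that merely reads its anonymous memory can have a tiny resident set.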
This behavior is easy to observe.
As Kirill Shutemov described, a process
executing a bit of code like this:
posix_memalign((void **)&p, 2 * MB, 200 * MB);
for (i = 0; i < 200 * MB; i += 4096)
    assert(p[i] == 0);
pause();
will have a surprisingly small resident set at the time of the
pause() call. It has just worked through 200MB of memory, but
that memory is all represented by a single zero page. The system works as
intended.
Or, it does until the transparent huge pages feature is enabled; then that
same program will
show the full 200MB of allocated memory. A growth of memory usage by two
orders of magnitude is not the sort of result users are typically looking
for when they enable a performance-enhancing feature. So, Kirill says,
some sites are finding themselves forced to disable transparent huge pages
in self defense.
The problem is simple enough: there is no huge zero page. The transparent
huge pages feature tries to use huge pages whenever possible; when a
process faults in a new page, the kernel will try to put a huge page
there. Since there is no huge zero page, the kernel will simply allocate a
real zero page instead. This behavior leads to correct execution, but it
also causes the allocation of a lot of memory that would otherwise not have
been needed. Transparent huge page support, in other words, has turned off
another important optimization that has been part of the kernel's memory
management subsystem for many years.
Once the problem is understood, the solution isn't that hard. Kirill's
patch adds a special, zero-filled huge page to function as the huge zero
page. Only one such page is needed, since the transparent huge pages
feature only uses one size of huge page. With this page in place and used
for read faults, the expansion of memory use simply goes away. As
always, there are complications: the page is large enough that it would be
nice to avoid allocating it if transparent huge pages are not in use. So
there's a lazy allocation scheme; Kirill also added a reference count so
that the huge zero page can be returned if there is no longer a need for
it. That reference counting slows a read-faulting benchmark by 1%, so it's
not clear that it is worthwhile; in the end, the developers might conclude
that it's better to just keep the zero huge page around once it has been
allocated and not pay the reference counting cost. This is, after all, a
situation that has come about before with
the (small) zero page.
There have not been a lot of comments on this patch; the implementation is
relatively straightforward and, presumably, does not need a lot in the way
of changes. Given the obvious and measurable benefits from the addition of
a huge zero page, it should be added to the kernel sometime in the fairly
near future; the 3.8 development cycle seems like a reasonable target.
Operating system designers and hardware designers tend to put a lot of
thought into how the kernel can be protected from user-space processes.
The security of the system as a whole depends on that protection. But
there can also be value in protecting user space from the kernel. The
Linux kernel will soon have support for a new Intel processor feature
intended to make that possible.
Under anything but the strangest (out of tree) memory configurations, the
kernel's memory is always mapped, so user-space code could
conceivably read and modify it. But the page protections are set to
disallow that access; any attempt by user space to examine or modify the
kernel's part of the address space will result in a segmentation violation
(SIGSEGV) signal. Access in the other direction is rather less
controlled: when the
processor is in kernel mode, it has full access to any address that is
valid in the page tables. Or nearly full access; the processor will still
not normally allow writes to read-only memory, but that check can be
disabled when the need arises.
Intel's new "Supervisor Mode Access Prevention" (SMAP) feature changes that
situation; those wanting the details can find them starting on
page 408 of this
reference manual [PDF]. This extension defines a new SMAP bit in the
CR4 control register; when that bit is set, any attempt to access
user-space memory while running in a privileged mode will lead to a page
fault. Linux support for this feature has been posted by H. Peter Anvin to generally positive
reviews; it could show up in the mainline as early as 3.7.
Naturally, there are times when the kernel needs to work with user-space
memory. To that end, Intel has defined a separate "AC" flag that controls
the SMAP feature. If the AC flag is set, access to user-space memory is
allowed; otherwise SMAP protection is in force. Two new instructions
(STAC and CLAC) are provided to manipulate that flag relatively quickly.
Unsurprisingly, much of Peter's patch set is concerned with adding STAC and
CLAC instructions in the right places. User-space access functions
(get_user(), for example, or copy_to_user()) clearly need
to have user-space access enabled. Other places include transitions
between kernel and user mode, futex operations, floating-point unit state
saving, and so on. Signal handling, as usual, has special requirements;
Peter had to make some significant changes to allow signal delivery to
happen without excessive overhead.
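The pattern is easy to illustrate with a small user-space model. This is a sketch, not kernel code: ac_flag stands in for the processor's AC flag, stac()/clac() for the real instructions, and copy_to_user_model() is an invented name, not the kernel's copy_to_user().

```c
#include <stdbool.h>
#include <string.h>

/* Model of the STAC/CLAC discipline: access to user memory is only
 * legal while the (modeled) AC flag is set. */
static bool ac_flag = false;                 /* access forbidden by default */
static void stac(void) { ac_flag = true; }   /* permit user-space access */
static void clac(void) { ac_flag = false; }  /* forbid it again */

/* A copy_to_user()-style helper brackets the actual access, keeping
 * the window in which user memory may be touched as narrow as possible. */
static int copy_to_user_model(void *udst, const void *ksrc, size_t n)
{
    stac();
    memcpy(udst, ksrc, n);   /* the only point where user memory is touched */
    clac();
    return 0;
}
```

The point of the discipline is visible in the helper's shape: the window opened by stac() is closed immediately after the one access that needs it, so a stray dereference anywhere else would fault.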
Speaking of overhead, support for this feature will clearly have its
costs. User-space access functions tend to be expanded inline, so there
will be a lot of STAC and CLAC instructions spread around the kernel. The
"alternatives" mechanism is used to patch
them out if the SMAP feature is
not in use (either not supported by the kernel or disabled with the
nosmap boot flag), but the kernel will grow a little regardless.
The STAC and CLAC instructions also require a little time to execute. Thus
far, no benchmarks have been posted to quantify what the cost is; one
assumes that it is small but not nonexistent.
The kernel will treat SMAP violations like it treats any other bad
pointer access: the result will be an oops.
One might well ask what the value of this protection is, given that the
kernel can turn it off at will. The answer is that it can block a whole
class of exploits where the kernel is fooled into reading from (or writing
to) user-space memory by mistake. The set of null pointer vulnerabilities exposed a few
years ago is one obvious example. There have been many situations where an
attacker has found a way to get the kernel to use a bad pointer, while the
cases where the attacker could execute arbitrary code in kernel space (before
exploiting the bad pointer) have been far less common. SMAP should block
the more common attacks nicely.
The other benefit, of course, is simply finding kernel bugs. Driver
writers (should) know that they cannot dereference user-space pointers
directly from the kernel, but code that does so tends to work on some
architectures anyway. With SMAP enabled, that kind of mistake will be found and fixed
earlier, before the bad code is shipped in a mainline kernel. As is so
often the case, there is real value in having the system enforce the rules
that developers are supposed to be following.
Linus liked the patch set and nobody else has
complained, so the changes have found their way into the "tip" tree. That
makes it quite likely that we will see them in the mainline soon, probably
once the 3.7 merge window opens. It will take a little longer, though, to get
processors that support this feature; SMAP is set to first appear in the Haswell
line, which should start shipping in 2013. But, once the hardware is
available, Linux will be able to take advantage of this new feature.
As of this writing, the 3.6 development cycle is nearing its close with the 3.6-rc7
prepatch having been released on
September 23. There may or may not be a
3.6-rc8 before the final release, but, either way, the real 3.6 kernel is
not far away. It thus seems like an appropriate time for our traditional
look at what happened in this cycle and who the active participants were.
At the release of -rc7, Linus had pulled 10,153 non-merge changesets from
1,216 developers into the mainline. That makes this release cycle just a
little slower than
its immediate predecessors, but, with over 10,000 changesets committed, the
development community has certainly not been idle. This development cycle
is already slightly longer than 3.5 (which required 62 days) and may
be as much as two weeks longer by the end, if another prepatch release is
required. Almost 523,000 lines of code were added and
almost 252,000 were removed this time around, for a net growth of about
271,000 lines.
|Most active 3.6 developers|
|By changesets|
|H Hartley Sweeten||460||4.5%|
|David S. Miller||154||1.5%|
|Rafael J. Wysocki||77||0.8%|
|By changed lines|
|H Hartley Sweeten||14362||2.3%|
|John W. Linville||14177||2.3%|
|David S. Miller||4600||0.7%|
H. Hartley Sweeten is at the top of the changesets column this month as the
result of a seemingly unending series of patches to get the Comedi
subsystem ready for graduation from the staging tree. Mark Brown continues
work on audio drivers and related code. David Miller naturally has patches
all over the networking subsystem; his biggest contribution this time
around was the long-desired removal of the IPv4 routing cache. Axel Lin
made lots of changes to drivers in the regulator and MTD subsystems, among
others, and Johannes Berg continues his wireless subsystem work.
Greg Kroah-Hartman pulled the CSR wireless driver into the staging tree to
get to the top of the "lines changed" column, even though his 69 changesets
weren't quite enough to show up in the left column. John Linville removed
some old, unused drivers, making him the developer who removed the most
code from the kernel this time around. Chris Metcalf added a number of new
features to the Tile architecture subtree.
The list of developers credited for reporting problems is worth a look:
|Top 3.6 bug reporters|
|David S. Miller||19||3.3%|
What we are seeing here is clearly the result of Fengguang Wu's build and boot testing work. As Fengguang
finds problems, he reports them and they get fixed before the wider user
community has to deal with them. Coming up with 44 bug reports in just
over 60 days is a good bit of work.
Some 208 companies (that we know of) contributed to the 3.6 kernel. The
most active of these were:
|Most active 3.6 employers|
|By changesets|
|Vision Engraving Systems||460||4.5%|
|By lines changed|
|Vision Engraving Systems||14876||2.4%|
Greg Kroah-Hartman's move to the Linux Foundation has caused a bit of a
shift in the numbers; the Foundation has moved up in the rankings at
SUSE's expense. Beyond that, we see the continued growth of the embedded
industry's participation, the continuing slow decline of hobbyist
contributions, and an equally slow decline in contributions from "big iron"
companies like Oracle and IBM.
Taking a quick look at maintainer signoffs — "Signed-off-by" tags applied
by somebody other than the author — the picture is this:
|Non-author Signed-off-by tags|
|David S. Miller||754||8.6%|
|John W. Linville||376||4.3%|
|Mauro Carvalho Chehab||323||3.7%|
The last time LWN put up a version of this table was for 2.6.34 in May, 2010. At that time, over half
the patches heading into the kernel passed through the hands of somebody at
Red Hat or SUSE. That situation has changed a bit since then, though
the list of developers contains mostly the same names. Once
again, we are seeing the mobile and embedded industry on the rise.
All told, it looks like business as usual. There are a lot of problems to
be solved in the kernel space, so vast numbers of developers are working to
solve them. There appears to be little danger that Andrew Morton's famous
2005 prediction that
"we have to finish this thing one day" will come true anytime
in the near future. But, if we can't manage to finish the job, at least we
seem to have the energy and resources to keep trying.
Page editor: Jonathan Corbet