Brief items
The current development kernel is 3.12-rc2,
released on September 23. Linus said:
"Things have been fairly quiet, probably because lots of people were
traveling for LinuxCon and Linux Plumbers conference last week. So nothing
very exciting stands out. It's mainly driver updates/fixes (gpu drivers
stand out, but there's networking too, and smaller stuff all over). Apart
from drivers there's arch updates (tile/arm/mips) and some filesystem noise
(mainly btrfs)."
Stable updates: no stable updates have been released in the last week.
The 3.11.2, 3.10.13, 3.4.63, and 3.0.97 updates are in the review process
as of this writing; they can be expected on or after September 27.
I clearly need to be more aware of Andrew's racing schedule.
—
Linus
Torvalds plans the next merge window
A pox on whoever thought up huge pages. Words cannot express how
much of a godawful mess they have made of Linux MM. And it hasn't
ended yet.
—
Andrew Morton
John scoured the research literature for ideas that might save his
dreams of infinite scaling. He discovered several papers that
described software-assisted hardware recovery. The basic idea was
simple: if hardware suffers more transient failures as it gets
smaller, why not allow software to detect erroneous computations
and re-execute them? This idea seemed promising until John realized
THAT IT WAS THE WORST IDEA EVER. Modern software barely works when
the hardware is correct, so relying on software to correct hardware
errors is like asking Godzilla to prevent Mega-Godzilla from
terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES
IN TOKYO.
—
James
Mickens [PDF]
Matthew Garrett has finally implemented what we all really wanted in the
first place: direct boot into the Zork game from UEFI. "But despite having
a set of
functionality that makes it look much more like an OS than a boot
environment, UEFI doesn't actually expose a standard C library. The EFI
Application Development Kit solves this particular design decision."
Nouveau is the reverse-engineered driver for NVIDIA GPUs; it has been
developed for a number of years with no assistance from NVIDIA. Now,
though, an NVIDIA developer has surfaced on the Nouveau list with an offer
to help: "NVIDIA is releasing public documentation on certain aspects
of our GPUs, with the intent to address areas that impact the
out-of-the-box usability of NVIDIA GPUs with Nouveau. We intend to provide
more documentation over time, and guidance in additional areas as we are
able."
This would appear to be a big step in the right direction.
Kernel development news
By Jonathan Corbet
September 25, 2013
Once upon a time, the standard response to scalability problems in the
kernel was the introduction of finer-grained locking. That approach has
its problems, though: the cache-line bouncing that locking
activity creates can be a scalability problem in its own right. So much of
the scalability work in the kernel has, in recent years, been focused on
lockless algorithms instead. But,
sometimes, there is little alternative to the introduction of finer-grained
locks; a current memory management patch set illustrates one of those
situations, with some additional complications.
Page tables hold the mapping between a virtual address in some process's
address space and the physical location of the memory behind that address.
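It is easy to think of the page table as a simple linear array indexed by
the virtual page number, but the reality is more complicated: page tables are
implemented as a sparse tree with up to four levels. Successive subfields of
a virtual address select an entry at each level in turn: the page global
directory (PGD) at the top, then the page upper directory (PUD), the page
middle directory (PMD), and finally the page of page table entries (PTEs),
with the lowest-order bits giving the offset within the page itself.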
Some systems do not have all four levels; no 32-bit system has the PUD
("page upper directory") level, for example, and some 32-bit systems may
still get by with two-level page tables. Kernel code is written to deal
with all four levels, though; the extra code will vanish in the compilation
stage for configurations with fewer levels.
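As a rough illustration of how subfields of an address select an entry at each level, the sketch below assumes the common x86-64 arrangement: 4 KB pages, a 12-bit page offset, and 9 index bits for each of the four levels (PGD at the top, then PUD, PMD, and PTE). The macro, structure, and function names are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (x86-64, 4 KB pages): 9 index bits per level, 12 offset bits. */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_LEVEL_BITS  9
#define SKETCH_LEVEL_MASK  ((UINT64_C(1) << SKETCH_LEVEL_BITS) - 1)

/* Hypothetical container for the pieces of a virtual address. */
struct va_parts {
    unsigned pgd;     /* index into the top-level directory */
    unsigned pud;     /* index into the upper directory */
    unsigned pmd;     /* index into the middle directory */
    unsigned pte;     /* index into the PTE page */
    unsigned offset;  /* byte offset within the 4 KB page */
};

/* Level 0 is the PTE level; level 3 is the PGD. */
static unsigned level_index(uint64_t va, unsigned level)
{
    assert(level < 4);
    unsigned shift = SKETCH_PAGE_SHIFT + level * SKETCH_LEVEL_BITS;
    return (unsigned)((va >> shift) & SKETCH_LEVEL_MASK);
}

static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pgd = level_index(va, 3);
    p.pud = level_index(va, 2);
    p.pmd = level_index(va, 1);
    p.pte = level_index(va, 0);
    p.offset = (unsigned)(va & ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1));
    return p;
}
```

Calling split_va() on an address yields the four table indices a page-table walk would use, plus the offset into the final page. On a configuration with fewer levels, the corresponding index would simply not exist.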
Changes to page tables can be made frequently; every movement of a page
into or out of RAM must be reflected there, as must changes to the virtual
address
space (such as those made via an mmap() call). If the page table
is not shared
across processes, there is little potential for contention (and, thus, for
scalability problems), since only one process will be making changes
there. Sharing of the page tables, as happens most frequently in threaded
workloads, changes the picture, though; it is not uncommon for threads to
be making concurrent page table changes. The more concurrently running
threads there are, the higher the potential for contention becomes.
In some configurations, the entire page table is protected by a single
spinlock (called page_table_lock) in the process's
mm_struct structure. That lock was recognized as a scalability
problem years ago; in response, locking for the lowest level of the page
table tree (the PTE — "page table entry" — pages) was made per-PTE-page for
multiprocessor
configurations. But all of the other layers of the page table tree are
still protected by page_table_lock; in general, changes at those
levels are rare enough that more sophisticated locking is not worth the
trouble.
There is only one problem: as Kirill A. Shutemov has pointed out, that is not always true. When
huge pages are in use, the PTE level of the page table tree is omitted.
Instead, the entry in the next level up — the "page middle directory" or
PMD — points directly to a much larger page. So,
in effect, huge pages prune the page table tree back to three levels, with
the PMD becoming the lowest level. The elimination of one level of
translation is one of the reasons why huge pages can improve performance,
though this effect is likely overshadowed by the large increase in the
coverage of the translation lookaside buffer (TLB), which avoids a lot of
address translations altogether.
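The TLB-coverage argument can be made concrete with a little arithmetic. The sketch below assumes x86-64 sizes (4 KB base pages, 512 entries per PTE page, so one PMD entry maps a 2 MB huge page) and a purely hypothetical 64-entry TLB; real TLB sizes vary by processor and often differ between page sizes.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE  (UINT64_C(1) << 12)              /* 4 KB, assumed */
#define PTES_PER_PAGE   UINT64_C(512)                    /* entries in one PTE page */
#define HUGE_PAGE_SIZE  (BASE_PAGE_SIZE * PTES_PER_PAGE) /* area mapped by one PMD entry */

/* Memory reachable without a TLB miss, given an entry count and a page size. */
static uint64_t tlb_reach(unsigned entries, uint64_t page_size)
{
    return (uint64_t)entries * page_size;
}

/* Number of translations needed to map a working set, rounding up. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

Under these assumptions, 64 entries cover 256 KB with base pages but 128 MB with huge pages, and a 1 GB working set needs 262,144 translations with base pages against 512 with huge pages.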
What Kirill has noted is that highly threaded workloads slow down
considerably when the transparent huge
pages feature is in use. Given that huge pages are meant to increase
performance, this result is seen as surprising and undesirable. The
problem is contention for the page_table_lock; the use of lots of
huge pages greatly increases the number of changes made at the PMD level
and, thus, increases contention. To address this problem, Kirill has put
together a
patch set that pushes the locking down to the PMD level, eliminating much
of that contention.
Locks are normally embedded within the data structures they protect, so one
might be inclined to put a spinlock into the PMD. But the PMD is a
hardware-defined structure; it is simply a page full of pointers to PTE
pages or huge pages, with some status bits. There is no place there for an
added spinlock, so that lock must go somewhere else. When fine-grained
locking was implemented at the PTE level, the same problem was encountered;
the solution was to shoehorn the lock into the already overcrowded
struct page, which is the core structure for tracking the
system's physical memory. (See this article
for details on how struct page is put together). Kirill's
patch replicates the approach used at the PTE level, putting the lock into
struct page.
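The idea of keeping the lock beside the table rather than inside it can be sketched in user space. Everything here is a simplified stand-in: struct page_desc plays the role of struct page, the spinlock is a bare C11 atomic_flag, and none of the names correspond to the kernel's actual helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512

/* Hardware-defined table: exactly one 4 KB page of entries, no spare room. */
struct hw_table {
    uint64_t entry[ENTRIES_PER_TABLE];
};

/* Hypothetical stand-in for struct page: per-page metadata kept elsewhere. */
struct page_desc {
    atomic_flag lock;        /* the per-table lock lives here, not in the table */
    struct hw_table *table;  /* the table this descriptor tracks */
};

static void desc_init(struct page_desc *d, struct hw_table *t)
{
    atomic_flag_clear(&d->lock);
    d->table = t;
}

static void desc_lock(struct page_desc *d)
{
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

static void desc_unlock(struct page_desc *d)
{
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Update one entry while holding only this table's lock. */
static void table_set_entry(struct page_desc *d, unsigned idx, uint64_t val)
{
    assert(idx < ENTRIES_PER_TABLE);
    desc_lock(d);
    d->table->entry[idx] = val;
    desc_unlock(d);
}
```

Because each table has its own descriptor and lock, threads updating different tables do not contend with one another, which is the effect the split locks aim for.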
The results would appear to be reasonably convincing. A benchmark designed
to demonstrate the problem runs in 36.5 seconds with transparent huge pages
off. When transparent huge pages are turned on in an unmodified kernel,
the number of
page faults drops from over 24 million to 50,000, but the run time
increases to 49.9 seconds — not the speed improvement that one might hope
for. Adding the patch, though, cuts the run time to 33.9 seconds, about 7%
less than the 36.5 seconds of an unmodified kernel without transparent huge
pages and about 32% less than the unpatched result with them. By getting rid of the cost of the
locking contention at the PMD level, Kirill's patch allows the benchmark to
enjoy the performance benefits that come from using huge pages.
There is one remaining problem, as pointed
out by Peter Zijlstra: the patch as written will not work with the
realtime preemption patch set. In the realtime world, spinlocks are
sleeping locks; that makes them bigger, to the point that they will no
longer fit into the tight space available in struct page.
That structure will grow to accommodate the larger lock, but, given that
there is one page structure for every page in the system, the
memory cost of that growth is difficult to accept. The realtime developers
resolved this problem at the PTE level by allocating the lock separately
and putting a pointer into struct page.
Something similar can certainly be done for the PMD-level locking. But, as
Peter pointed out, the lock allocation means that the initialization of PMD
pages is now subject to out-of-memory failures, complicating the code
considerably. He hoped that the new code could be written with the
assumption that PMD construction could fail so that the realtime tree would
not have to carry a complicated add-on patch. Kirill is not required to
cater to the needs of an out-of-tree patch set, but it's still nicer to
avoid making life difficult for the realtime people if it can be avoided.
So chances are, there will be another version of this set coming in the
near future.
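A minimal user-space sketch of the separately allocated lock shows where the new failure path comes from. The types and function names are hypothetical, the "large" lock is just a padded structure, and the allocator is passed in as a parameter only so that the failure case can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical lock that is too large to embed in the page descriptor. */
struct big_lock {
    atomic_flag held;
    char padding[60];
};

/* Hypothetical page descriptor: it carries only a pointer to the lock. */
struct page_desc_rt {
    struct big_lock *ptl;
};

/* Allocator that always fails, used to exercise the error path. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Table setup can now fail: the lock has to be allocated first. */
static int table_lock_init(struct page_desc_rt *d, void *(*alloc)(size_t))
{
    assert(d != NULL);
    d->ptl = alloc(sizeof(*d->ptl));
    if (!d->ptl)
        return -ENOMEM;
    atomic_flag_clear(&d->ptl->held);
    return 0;
}

static void table_lock_destroy(struct page_desc_rt *d)
{
    free(d->ptl);
    d->ptl = NULL;
}
```

The descriptor stays pointer-sized regardless of how large the lock becomes, but every caller that sets up a table now has to handle an -ENOMEM return and unwind whatever it had already done.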
Beyond that, though, this work appears to be mostly complete and in good
shape. It could, thus, find its way into a mainline kernel sometime in the
relatively near future.
By Jonathan Corbet
September 24, 2013
It is often said that the kernel developers are committed to avoiding ABI
breaks at almost any cost. But ABI problems can, at times, be hard to
avoid. Some have argued that the perf events interface is particularly
subject to incompatible ABI changes because the perf tool is part of the
kernel tree itself; since perf can evolve with the kernel, there is a
possibility that developers might not even notice a break. So the recent
discovery of a perf ABI issue is worth looking at as an example of how
compatibility problems are handled in that code.
The perf_event_open() system call returns a file descriptor that,
among other things, may be used to map a ring buffer into a process's
address space with mmap(). The first page of that buffer contains
various bits of housekeeping information represented by
struct perf_event_mmap_page, defined in
<uapi/linux/perf_event.h>. Within that structure (in a 3.11
kernel) one finds this bit of code:
    union {
        __u64 capabilities;
        __u64 cap_usr_time  : 1,
              cap_usr_rdpmc : 1,
              cap_____res   : 62;
    };
For the curious, cap_usr_rdpmc indicates that the RDPMC
instruction (which reads the performance monitoring counters directly) is
available to user-space code, while cap_usr_time indicates that
the time stamp counter can be read with RDTSC. When these
features (described as "capabilities," though they have nothing to do with
the security-oriented capabilities implemented by the kernel) are
available, code which is monitoring itself can eliminate the
kernel middleman and get performance data more efficiently.
The intent of the above union declaration is clear enough: the developers
wanted to be able to deal with the full set of capabilities as a single
quantity, or to be able to access the bits individually via the
cap_ fields. One need not look at it for too long, though, to see
the error: each of the cap_ fields is a separate member of the
enclosing union, so they will all map to the same bit. This interface,
thus, has never worked as intended. But, in a testament to the
thoroughness of our code review, it was merged
for 3.4 and persisted through the 3.11 release.
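The overlap is easy to reproduce outside the kernel. The sketch below is a local copy of the declaration with uint64_t standing in for __u64 and a hypothetical union name; 64-bit bit-fields are a compiler extension that GCC and Clang accept.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the 3.11-era declaration; the union name is hypothetical. */
union broken_caps {
    uint64_t capabilities;
    /* Each bit-field below is its own union member, so all start at the same bit. */
    uint64_t cap_usr_time  : 1,
             cap_usr_rdpmc : 1,
             cap_____res   : 62;
};

static_assert(sizeof(union broken_caps) == sizeof(uint64_t),
              "the union should occupy a single 64-bit word");

/* Set only the "time" capability and report what the "rdpmc" field reads. */
static int rdpmc_after_setting_time(void)
{
    union broken_caps c = { .capabilities = 0 };
    c.cap_usr_time = 1;
    return c.cap_usr_rdpmc;
}
```

rdpmc_after_setting_time() returns 1 even though only cap_usr_time was written, and clearing either field clears the other as well.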
Once the problem was noticed, Adrian Hunter quickly posted the obvious fix, grouping the cap_
fields into a separate structure. But it didn't take long for Vince Weaver
to find a new problem: code that
worked with the broken structure definition no longer does with the fixed
version. The fix moved
cap_usr_rdpmc from bit 0 to bit 1 (while leaving
cap_usr_time in bit 0), with the result that
binaries built for older kernels look for it in the wrong place. If a
program is,
instead, built with the newer definition, then run on an older kernel, it
will, once again, look in the wrong place and come to the wrong conclusion.
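The mismatch can be demonstrated by giving the two layouts side by side and letting code compiled against one read a word produced with the other. The union and function names are hypothetical; only the field names come from the header.

```c
#include <assert.h>
#include <stdint.h>

/* Layout seen by binaries built against the original (broken) header. */
union old_layout {
    uint64_t capabilities;
    uint64_t cap_usr_time  : 1,
             cap_usr_rdpmc : 1,
             cap_____res   : 62;
};

/* Layout after the first fix: the bit-fields are grouped in a struct. */
union first_fix_layout {
    uint64_t capabilities;
    struct {
        uint64_t cap_usr_time  : 1,
                 cap_usr_rdpmc : 1,
                 cap_____res   : 62;
    };
};

static_assert(sizeof(union old_layout) == sizeof(union first_fix_layout),
              "both layouts must describe the same 64-bit word");

/* What a kernel using the fixed layout would publish. */
static uint64_t fixed_kernel_caps(int have_time, int have_rdpmc)
{
    union first_fix_layout k = { .capabilities = 0 };
    k.cap_usr_time = have_time ? 1 : 0;
    k.cap_usr_rdpmc = have_rdpmc ? 1 : 0;
    return k.capabilities;
}

/* What a binary built with the old header concludes about RDPMC. */
static int old_binary_sees_rdpmc(uint64_t capabilities)
{
    union old_layout u = { .capabilities = capabilities };
    return u.cap_usr_rdpmc;
}
```

With the fixed layout, an old binary misses RDPMC when only that capability is set, and wrongly believes it is present when only the time capability is set.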
After some discussion, it became clear that it would not be possible to fix
this problem in an entirely transparent way or to
hide the fix from newer code. At that point, Peter Zijlstra suggested that a version number field be used;
applications could explicitly check the ABI version and react accordingly.
But Ingo Molnar rejected that approach as
"really fragile" and came up with a fix of his own. After a
few rounds of discussion, the union came to
look like this:
    union {
        __u64 capabilities;
        struct {
            __u64 cap_bit0               : 1,
                  cap_bit0_is_deprecated : 1,
                  cap_user_rdpmc         : 1,
                  cap_user_time          : 1,
                  cap_user_time_zero     : 1,
                  cap_____res            : 59;
        };
    };
In the new ABI, cap_bit0 is always zero, while
cap_bit0_is_deprecated is always one. So code that is aware of
the shift can test cap_bit0_is_deprecated to determine which
version of the interface it is using; if it detects a newer kernel, it will
know that the various cap_user_ (changed from cap_usr_)
fields are valid and can be used.
Code built for older kernels will, instead, see all of the old capability
bits (both of which mapped onto bit 0) as being set to zero.
(For the curious, the new cap_user_time_zero field was added in an
independent 3.12 change).
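A consumer of the new layout might check the marker bit along the following lines. This is a sketch under stated assumptions: the union is a local copy with a hypothetical name, the enum and function are invented for illustration, and treating a set bit 0 from an older kernel as "unknown" is one possible policy rather than anything the article prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the new layout; the union name is hypothetical. */
union perf_caps_sketch {
    uint64_t capabilities;
    struct {
        uint64_t cap_bit0               : 1,
                 cap_bit0_is_deprecated : 1,
                 cap_user_rdpmc         : 1,
                 cap_user_time          : 1,
                 cap_user_time_zero     : 1,
                 cap_____res            : 59;
    };
};

static_assert(sizeof(union perf_caps_sketch) == sizeof(uint64_t),
              "the union should occupy a single 64-bit word");

/* Hypothetical result type for the check below. */
enum rdpmc_state { RDPMC_NO = 0, RDPMC_YES = 1, RDPMC_UNKNOWN = 2 };

static enum rdpmc_state rdpmc_available(uint64_t capabilities)
{
    union perf_caps_sketch c = { .capabilities = capabilities };

    /* New ABI: the marker bit is set and the cap_user_ fields are valid. */
    if (c.cap_bit0_is_deprecated)
        return c.cap_user_rdpmc ? RDPMC_YES : RDPMC_NO;

    /* Old ABI: both capabilities shared bit 0, so a set bit is ambiguous. */
    return c.cap_bit0 ? RDPMC_UNKNOWN : RDPMC_NO;
}
```

The check needs no kernel version number; the word read from the mmap()ed page says which interpretation applies.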
One could argue that this change still constitutes an ABI break, in that
older code may
conclude that RDPMC is unavailable when it is, in fact, supported
by the system it is running on. Such code will not perform as well as it
would have with an older kernel. But it will perform correctly, which is
the biggest concern here. More annoying to some might be the fact that
code written for one version of the interface will fail to compile with the
other; it is an API break, even if the ABI continues to
work. This will doubtless be irritating for some users or packagers, but it
was seen as being better than continuing to allow code to use an interface that
was known to be broken. Vince Weaver, who has sometimes been critical of
how the perf ABI is managed, conceded that
"this seems to be about as reasonable a solution to this
problem as we can get."
One other important aspect to this change is the fact that the structure
itself describes which interpretation should be given to the capability
bits. It can be tempting to just make the change and document somewhere
that, as of 3.12, code must use the new bits. But that kind of check is
easy for developers to overlook or forget, even in this simple situation.
If the fix is backported into stable kernels, though, then simple kernel version
number checks are no longer good enough. With the special
cap_bit0_is_deprecated bit, code can figure out the right thing to
do regardless of which kernel the fix appears in.
In the end, it would be hard to complain that the perf developers have
failed to respond to ABI concerns in this situation. There will be an API
shift in 3.12 (assuming Ingo's patch is merged, which had not happened as
of this writing), but all combinations of newer and older kernels and
applications will continue to work. The ABI break caused by the initial fix
went in during the 3.12 merge window, but never found its way into a stable
kernel release. The
key there is early testing; by catching this issue at the beginning of the
development cycle, Vince helped to ensure that it would be fixed by the
time the stable release happened. The kernel developers do not want to
create ABI problems, but extensive user testing of development kernels is a
crucial part of the process that keeps ABI breaks from happening.
Page editor: Jonathan Corbet