Kernel development

Brief items

Kernel release status

The current development kernel is 3.12-rc2, released on September 23. Linus said: "Things have been fairly quiet, probably because lots of people were traveling for LinuxCon and Linux Plumbers conference last week. So nothing very exciting stands out. It's mainly driver updates/fixes (gpu drivers stand out, but there's networking too, and smaller stuff all over). Apart from drivers there's arch updates (tile/arm/mips) and some filesystem noise (mainly btrfs)."

Stable updates: no stable updates have been released in the last week. The 3.11.2, 3.10.13, 3.4.63, and 3.0.97 updates are in the review process as of this writing; they can be expected on or after September 27.

Quotes of the week

I clearly need to be more aware of Andrew's racing schedule.
Linus Torvalds plans the next merge window

A pox on whoever thought up huge pages. Words cannot express how much of a godawful mess they have made of Linux MM. And it hasn't ended yet.
Andrew Morton

John scoured the research literature for ideas that might save his dreams of infinite scaling. He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.
James Mickens [PDF]

Garrett: Implementing UEFI Boot to Zork

Matthew Garrett has finally implemented what we all really wanted in the first place: direct boot into the Zork game from UEFI. "But despite having a set of functionality that makes it look much more like an OS than a boot environment, UEFI doesn't actually expose a standard C library. The EFI Application Development Kit solves this particular design decision."

NVIDIA to provide documentation for Nouveau

Nouveau is the reverse-engineered driver for NVIDIA GPUs; it has been developed for a number of years with no assistance from NVIDIA. Now, though, an NVIDIA developer has surfaced on the Nouveau list with an offer to help: "NVIDIA is releasing public documentation on certain aspects of our GPUs, with the intent to address areas that impact the out-of-the-box usability of NVIDIA GPUs with Nouveau. We intend to provide more documentation over time, and guidance in additional areas as we are able." This would appear to be a big step in the right direction.

Kernel development news

Split PMD locks

By Jonathan Corbet
September 25, 2013
Once upon a time, the standard response to scalability problems in the kernel was the introduction of finer-grained locking. That approach has its problems, though: the cache-line bouncing that locking activity creates can be a scalability problem in its own right. So much of the scalability work in the kernel has, in recent years, been focused on lockless algorithms instead. But, sometimes, there is little alternative to the introduction of finer-grained locks; a current memory management patch set illustrates one of those situations, with some additional complications.

Page tables hold the mapping between a virtual address in some process's address space and the physical location of the memory behind that address. It is easy to think of the page table as a simple linear array indexed by the virtual page number, but the reality is more complicated: page tables are implemented as a sparse tree with up to four levels. Various subfields of a virtual address are used to progress through the tree as shown here:

[Virtual address translation]

Some systems do not have all four levels; no 32-bit system has the PUD ("page upper directory") level, for example, and some 32-bit systems may still get by with two-level page tables. Kernel code is written to deal with all four levels, though; the extra code will vanish in the compilation stage for configurations with fewer levels.
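The way the subfields of an address select an entry at each level can be sketched in a few lines of C. This is only an illustration, assuming the x86-64 layout with 4 KiB pages (a 12-bit offset and four 9-bit indexes); the va_indices structure and split_va() function are hypothetical names, not kernel interfaces.

```c
#include <stdint.h>

/* Assumed layout: x86-64, 4 KiB pages, four levels of 9 bits each. */
#define PAGE_SHIFT  12
#define LEVEL_BITS  9
#define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)

/* Hypothetical helper type: one index per level of the tree. */
struct va_indices {
    unsigned pgd;     /* index into the top-level directory */
    unsigned pud;     /* index into the page upper directory */
    unsigned pmd;     /* index into the page middle directory */
    unsigned pte;     /* index into the PTE page */
    unsigned offset;  /* byte offset within the final page */
};

/* Split a virtual address into the indexes used at each level. */
static struct va_indices split_va(uint64_t va)
{
    struct va_indices ix;

    ix.offset = (unsigned)(va & ((1u << PAGE_SHIFT) - 1));
    ix.pte = (unsigned)((va >> PAGE_SHIFT) & LEVEL_MASK);
    ix.pmd = (unsigned)((va >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK);
    ix.pud = (unsigned)((va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK);
    ix.pgd = (unsigned)((va >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK);
    return ix;
}
```

A configuration with fewer levels would simply drop one or more of the shifts; the kernel achieves the same effect by folding the unused levels away at compile time.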

Changes to page tables can be made frequently; every movement of a page into or out of RAM must be reflected there, as must changes to the virtual address space (such as those made via an mmap() call). If the page table is not shared across processes, there is little potential for contention (and, thus, for scalability problems), since only one process will be making changes there. Sharing of the page tables, as happens most frequently in threaded workloads, changes the picture, though; it is not uncommon for threads to be making concurrent page table changes. The more concurrently running threads there are, the higher the potential for contention becomes.

In some configurations, the entire page table is protected by a single spinlock (called page_table_lock) in the process's mm_struct structure. That lock was recognized as a scalability problem years ago; in response, locking for the lowest level of the page table tree (the PTE — "page table entry" — pages) was made per-PTE-page for multiprocessor configurations. But all of the other layers of the page table tree are still protected by page_table_lock; in general, changes at those levels are rare enough that more sophisticated locking is not worth the trouble.

There is only one problem: as Kirill A. Shutemov has pointed out, that is not always true. When huge pages are in use, the PTE level of the page table tree is omitted. Instead, the entry in the next level up (the "page middle directory" or PMD) points directly to a much larger page. So, in effect, huge pages prune the page table tree back to three levels, with the PMD becoming the lowest level. The elimination of one level of translation is one of the reasons why huge pages can improve performance, though this effect is likely overshadowed by the large increase in the coverage of the translation lookaside buffer (TLB), which avoids a lot of address translations altogether.
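The TLB coverage argument comes down to simple arithmetic. The sketch below assumes x86-64 page sizes (4 KiB base pages, 2 MiB huge pages) and a made-up TLB of 512 entries; real processors often keep separate, smaller TLBs for huge pages, so the numbers are illustrative only.

```c
#include <stdint.h>

/* Bytes of address space that can be translated without a TLB miss,
 * given a number of TLB entries and the page size each entry maps. */
static uint64_t tlb_coverage(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

With the assumed 512 entries, tlb_coverage(512, 4096) is 2 MiB, while tlb_coverage(512, 2 * 1024 * 1024) is 1 GiB: the same number of entries reaches 512 times as much memory.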

What Kirill has noted is that highly threaded workloads slow down considerably when the transparent huge pages feature is in use. Given that huge pages are meant to increase performance, this result is seen as surprising and undesirable. The problem is contention for the page_table_lock; the use of lots of huge pages greatly increases the number of changes made at the PMD level and, thus, increases contention. To address this problem, Kirill has put together a patch set that pushes the locking down to the PMD level, eliminating much of that contention.

Locks are normally embedded within the data structures they protect, so one might be inclined to put a spinlock into the PMD. But the PMD is a hardware-defined structure; it is simply a page full of pointers to PTE pages or huge pages, with some status bits. There is no place there for an added spinlock, so that lock must go somewhere else. When fine-grained locking was implemented at the PTE level, the same problem was encountered; the solution was to shoehorn the lock into the already overcrowded struct page, which is the core structure for tracking the system's physical memory. (See this article for details on how struct page is put together). Kirill's patch replicates the approach used at the PTE level, putting the lock into struct page.
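The idea of keeping the lock in per-page metadata rather than in the table itself can be shown with a user-space analogy. This is a sketch only: table_page and page_meta are hypothetical stand-ins for a PMD page and struct page, and a C11 atomic_flag stands in for the kernel's spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512

/* Stand-in for a hardware-defined table page: nothing but entries,
 * so there is no spare room for a lock inside it. */
struct table_page {
    uint64_t entry[ENTRIES_PER_TABLE];
};

/* Stand-in for struct page: metadata kept outside the table. */
struct page_meta {
    atomic_flag lock;  /* one lock per table page */
};

static void meta_init(struct page_meta *m)
{
    atomic_flag_clear(&m->lock);
}

static void meta_lock(struct page_meta *m)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
        ;
}

static void meta_unlock(struct page_meta *m)
{
    atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

/* Update one entry while holding only this table's lock. */
static void set_entry(struct table_page *t, struct page_meta *m,
                      unsigned idx, uint64_t val)
{
    if (idx >= ENTRIES_PER_TABLE)
        return;
    meta_lock(m);
    t->entry[idx] = val;
    meta_unlock(m);
}
```

Threads updating entries in different tables take different locks and do not contend with each other, which is the effect the patch set is after at the PMD level.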

The results would appear to be reasonably convincing. A benchmark designed to demonstrate the problem runs in 36.5 seconds with transparent huge pages off. When transparent huge pages are turned on in an unmodified kernel, the number of page faults drops from over 24 million to 50,000, but the run time increases to 49.9 seconds — not the speed improvement that one might hope for. Adding the patch, though, cuts the run time to 33.9 seconds, significantly faster than an unmodified kernel without transparent huge pages. By getting rid of the cost of the locking contention at the PMD level, Kirill's patch allows the benchmark to enjoy the performance benefits that come from using huge pages.

There is one remaining problem, as pointed out by Peter Zijlstra: the patch as written will not work with the realtime preemption patch set. In the realtime world, spinlocks are sleeping locks; that makes them bigger, to the point that they will no longer fit into the tight space available in struct page. That structure will grow to accommodate the larger lock, but, given that there is one page structure for every page in the system, the memory cost of that growth is difficult to accept. The realtime developers resolved this problem at the PTE level by allocating the lock separately and putting a pointer into struct page.
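To get a feel for why growing struct page is hard to accept, the cost can be estimated with a one-line calculation. The function name and the example numbers (64 GiB of RAM, 4 KiB pages, 8 extra bytes per structure) are assumptions chosen for illustration, not figures from the discussion.

```c
#include <stdint.h>

/* Extra memory consumed if every per-page structure grows by
 * extra_bytes: there is one structure per page of RAM. */
static uint64_t struct_page_growth_cost(uint64_t ram_bytes,
                                        uint64_t page_size,
                                        uint64_t extra_bytes)
{
    return (ram_bytes / page_size) * extra_bytes;
}
```

Under those assumptions, struct_page_growth_cost(64 GiB, 4096, 8) comes to 128 MiB, paid by every system whether or not it benefits from the larger lock.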

Something similar can certainly be done for the PMD-level locking. But, as Peter pointed out, the lock allocation means that the initialization of PMD pages is now subject to out-of-memory failures, complicating the code considerably. He hoped that the new code could be written with the assumption that PMD construction could fail so that the realtime tree would not have to carry a complicated add-on patch. Kirill is not required to cater to the needs of an out-of-tree patch set, but it's still nicer to avoid making life difficult for the realtime people if it can be avoided. So chances are, there will be another version of this set coming in the near future.
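Peter's concern can be illustrated with a constructor that allocates its lock separately and, as a result, can fail. This is a user-space sketch with hypothetical names (rt_page_meta, rt_table_ctor(), rt_table_dtor()), using malloc() and a C11 atomic_flag in place of the kernel's allocator and lock types.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page in a realtime kernel: the
 * lock is too big to embed, so only a pointer is stored. */
struct rt_page_meta {
    atomic_flag *ptl;  /* separately allocated page table lock */
};

/* Returns 0 on success, -1 if the lock could not be allocated.
 * Every caller must now be prepared to unwind on failure. */
static int rt_table_ctor(struct rt_page_meta *m)
{
    m->ptl = malloc(sizeof(*m->ptl));
    if (!m->ptl)
        return -1;
    atomic_flag_clear(m->ptl);
    return 0;
}

/* Release the lock when the table page is torn down. */
static void rt_table_dtor(struct rt_page_meta *m)
{
    free(m->ptl);
    m->ptl = NULL;
}
```

If the mainline constructor already returns an error code that callers check, the realtime tree only has to change the body of the constructor rather than every call site.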

Beyond that, though, this work appears to be mostly complete and in good shape. It could, thus, find its way into a mainline kernel sometime in the relatively near future.

A perf ABI fix

By Jonathan Corbet
September 24, 2013
It is often said that the kernel developers are committed to avoiding ABI breaks at almost any cost. But ABI problems can, at times, be hard to avoid. Some have argued that the perf events interface is particularly subject to incompatible ABI changes because the perf tool is part of the kernel tree itself; since perf can evolve with the kernel, there is a possibility that developers might not even notice a break. So the recent discovery of a perf ABI issue is worth looking at as an example of how compatibility problems are handled in that code.

The perf_event_open() system call returns a file descriptor that, among other things, may be used to map a ring buffer into a process's address space with mmap(). The first page of that buffer contains various bits of housekeeping information represented by struct perf_event_mmap_page, defined in <uapi/linux/perf_event.h>. Within that structure (in a 3.11 kernel) one finds this bit of code:

    union {
        __u64   capabilities;
        __u64   cap_usr_time  : 1,
                cap_usr_rdpmc : 1,
                cap_____res   : 62;
    };

For the curious, cap_usr_rdpmc indicates that the RDPMC instruction (which reads the performance monitoring counters directly) is available to user-space code, while cap_usr_time indicates that the time stamp counter can be read with RDTSC. When these features (described as "capabilities," though they have nothing to do with the security-oriented capabilities implemented by the kernel) are available, code which is monitoring itself can eliminate the kernel middleman and get performance data more efficiently.

The intent of the above union declaration is clear enough: the developers wanted to be able to deal with the full set of capabilities as a single quantity, or to be able to access the bits individually via the cap_ fields. One need not look at it for too long, though, to see the error: each of the cap_ fields is a separate member of the enclosing union, so they will all map to the same bit. This interface, thus, has never worked as intended. But, in a testament to the thoroughness of our code review, it was merged for 3.4 and persisted through the 3.11 release.

Once the problem was noticed, Adrian Hunter quickly posted the obvious fix, grouping the cap_ fields into a separate structure. But it didn't take long for Vince Weaver to find a new problem: code that worked with the broken structure definition no longer does with the fixed version. The fix moved cap_usr_rdpmc from bit 0 to bit 1 (while leaving cap_usr_time in bit 0), with the result that binaries built for older kernels look for it in the wrong place. If a program is, instead, built with the newer definition, then run on an older kernel, it will, once again, look in the wrong place and come to the wrong conclusion.
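Both the overlap and the effect of the first fix can be reproduced in user space. The sketch below uses uint64_t in place of __u64 and invented union names; 64-bit bit-fields and their layout are implementation-defined in C, so the behavior described is what one would expect from GCC or Clang rather than a guarantee of the language.

```c
#include <stdint.h>

/* Same shape as the 3.11 declaration: each bit-field is its own
 * union member, and every union member starts at offset zero. */
union broken_caps {
    uint64_t capabilities;
    uint64_t cap_usr_time  : 1,
             cap_usr_rdpmc : 1,
             cap_____res   : 62;
};

/* The first fix: group the bit-fields in an anonymous structure so
 * that they are laid out one after another. */
union grouped_caps {
    uint64_t capabilities;
    struct {
        uint64_t cap_usr_time  : 1,
                 cap_usr_rdpmc : 1,
                 cap_____res   : 62;
    };
};

/* Set only the RDPMC flag, then read the time flag back. */
static int broken_time_after_setting_rdpmc(void)
{
    union broken_caps u = { .capabilities = 0 };

    u.cap_usr_rdpmc = 1;
    return u.cap_usr_time;  /* same bit as cap_usr_rdpmc */
}

static int grouped_time_after_setting_rdpmc(void)
{
    union grouped_caps u = { .capabilities = 0 };

    u.cap_usr_rdpmc = 1;
    return u.cap_usr_time;  /* a different bit, still clear */
}
```

With the broken layout, setting one flag sets the other as well; with the grouped layout the flags are independent, but cap_usr_rdpmc has moved to a new position, which is exactly the incompatibility that Vince ran into.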

After some discussion, it became clear that it would not be possible to fix this problem in an entirely transparent way or to hide the fix from newer code. At that point, Peter Zijlstra suggested that a version number field be used; applications could explicitly check the ABI version and react accordingly. But Ingo Molnar rejected that approach as "really fragile" and came up with a fix of his own. After a few rounds of discussion, the union came to look like this:

    union {
        __u64   capabilities;
        struct {
            __u64   cap_bit0                : 1,
                    cap_bit0_is_deprecated  : 1,
                    cap_user_rdpmc          : 1,
                    cap_user_time           : 1,
                    cap_user_time_zero      : 1,
                    cap_____res             : 59;
        };
    };

In the new ABI, cap_bit0 is always zero, while cap_bit0_is_deprecated is always one. So code that is aware of the shift can test cap_bit0_is_deprecated to determine which version of the interface it is using; if it detects a newer kernel, it will know that the various cap_user_ (changed from cap_usr_) fields are valid and can be used. Code built for older kernels will, instead, see all of the old capability bits (both of which mapped onto bit 0) as being set to zero. (For the curious, the new cap_user_time_zero field was added in an independent 3.12 change).
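An application that wants to cope with both layouts might decode the capabilities word along the following lines. This is a sketch under stated assumptions: the mask values assume that bit-fields are allocated starting from the least significant bit (as GCC and Clang do on x86), and the CAP_ macros, cap_view structure, and decode_caps() function are hypothetical names, not part of the perf headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, counted from the least significant bit. */
#define CAP_BIT0                (UINT64_C(1) << 0)
#define CAP_BIT0_IS_DEPRECATED  (UINT64_C(1) << 1)
#define CAP_USER_RDPMC          (UINT64_C(1) << 2)
#define CAP_USER_TIME           (UINT64_C(1) << 3)

/* Hypothetical decoded view of the capabilities word. */
struct cap_view {
    bool new_layout;   /* kernel uses the fixed bit assignments */
    bool rdpmc;        /* valid only when new_layout is true */
    bool time;         /* valid only when new_layout is true */
    bool legacy_bit0;  /* old kernels: the one overlapping bit */
};

static struct cap_view decode_caps(uint64_t capabilities)
{
    struct cap_view v = { false, false, false, false };

    if (capabilities & CAP_BIT0_IS_DEPRECATED) {
        /* New ABI: the individual bits can be trusted. */
        v.new_layout = true;
        v.rdpmc = (capabilities & CAP_USER_RDPMC) != 0;
        v.time = (capabilities & CAP_USER_TIME) != 0;
    } else {
        /* Old ABI: both flags landed on bit 0, so the value cannot
         * say which of the two capabilities is really present. */
        v.legacy_bit0 = (capabilities & CAP_BIT0) != 0;
    }
    return v;
}
```

Because the test looks at the structure's own contents rather than at a kernel version number, the same code keeps working if the fix is backported to an older kernel.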

One could argue that this change still constitutes an ABI break, in that older code may conclude that RDPMC is unavailable when it is, in fact, supported by the system it is running on. Such code will not perform as well as it would have with an older kernel. But it will perform correctly, which is the biggest concern here. More annoying to some might be the fact that code written for one version of the interface will fail to compile with the other; it is an API break, even if the ABI continues to work. This will doubtless be irritating for some users or packagers, but it was seen as being better than continuing to allow code to use an interface that was known to be broken. Vince Weaver, who has sometimes been critical of how the perf ABI is managed, conceded that "this seems to be about as reasonable a solution to this problem as we can get."

One other important aspect to this change is the fact that the structure itself describes which interpretation should be given to the capability bits. It can be tempting to just make the change and document somewhere that, as of 3.12, code must use the new bits. But that kind of check is easy for developers to overlook or forget, even in this simple situation. If the fix is backported into stable kernels, though, then simple kernel version number checks are no longer good enough. With the special cap_bit0_is_deprecated bit, code can figure out the right thing to do regardless of which kernel the fix appears in.

In the end, it would be hard to complain that the perf developers have failed to respond to ABI concerns in this situation. There will be an API shift in 3.12 (assuming Ingo's patch is merged, which had not happened as of this writing), but all combinations of newer and older kernels and applications will continue to work. Just as importantly, this ABI break went in during the 3.12 merge window but never found its way into a stable kernel release. The key there is early testing; by catching this issue at the beginning of the development cycle, Vince helped to ensure that it would be fixed by the time the stable release happened. The kernel developers do not want to create ABI problems, but extensive user testing of development kernels is a crucial part of the process that keeps ABI breaks from happening.

Page editor: Jonathan Corbet
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds