Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc3, released on September 29. Linus's summary was: "On the whole, nothing really appears very scary. Go forth and test."
Stable updates: 3.11.2, 3.10.13, 3.4.63, and 3.0.97 were released on September 26; they were followed by 3.11.3, 3.10.14, 3.4.64, and 3.0.98 on October 1.
Quotes of the week
/*
 * Sent from my mobile phone. Please pardon brevity and lack of formatting.
 */
Exchanging two files
The renameat() system call changes the name of the file given as an argument, possibly replacing an existing file in the process. This operation is atomic; the view of the filesystem from user space will reflect the situation before or after the renameat() call, but it will never expose an intermediate state. Things work well when one file is involved, but what happens when multiple rename operations need to be run as a single atomic operation? That is a big problem, but, thanks to a patch from Miklos Szeredi, we might have a solution to a smaller subset of it.
The problem Miklos is trying to solve is that of exchanging two files — both files continue to exist, but their names have been swapped. To achieve this, he has posted a patch set adding a new renameat2() system call:
    int renameat2(int olddir, const char *oldname,
                  int newdir, const char *newname,
                  unsigned int flags);
This system call differs from renameat() in that it has the new flags argument; if flags is zero, renameat2() behaves exactly as renameat(). If, instead, flags contains RENAME_EXCHANGE, an existing file at newname will not be deleted; instead, it will be renamed to oldname. Thus, with this flag, renameat2() can be used to atomically exchange two files. The main use case for renameat2() is to support union filesystems, where it is often desirable to atomically replace files or directories with "whiteouts" indicating that they have been deleted. One could imagine other possibilities as well; Miklos suggests atomically replacing a directory with a symbolic link as one of them.
No review comments have been posted as of this writing.
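As an illustration, here is a sketch of how user space might exercise the proposed call; it is not taken from the patch set. Since no glibc wrapper would exist yet, the call goes through syscall(), and the __NR_renameat2 and RENAME_EXCHANGE values are assumed to match the posted patches:

    /* Hypothetical example: atomically swap two file names with the
     * proposed renameat2() system call. */
    #define _GNU_SOURCE
    #include <fcntl.h>          /* AT_FDCWD */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef RENAME_EXCHANGE
    #define RENAME_EXCHANGE (1 << 1)    /* assumed flag value */
    #endif

    int main(void)
    {
        /* After this call, each file appears under the other's name,
         * with no intermediate state ever visible to user space. */
        long ret = syscall(__NR_renameat2, AT_FDCWD, "config.new",
                           AT_FDCWD, "config.current", RENAME_EXCHANGE);
        if (ret < 0)
            perror("renameat2");
        return ret < 0 ? 1 : 0;
    }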
Kernel development news
Transactional memory in the dentry cache
LWN recently described the "lockref" mechanism merged into the 3.12 kernel. Lockrefs aim to reduce locking overhead in situations where all that needs to happen is the modification of a reference count, but the data structure as a whole must be held invariant while that count is changed. This new locking primitive has helped to reduce the overhead of locking in the system's dentry cache — the extensive data structure that caches mappings between file names and inodes. But dentry cache overhead strongly affects the performance of the system as a whole, so it is not surprising that there is a desire to improve things even more; that appears to be especially true in the 3.12 development cycle, which has seen a lot of attention paid to the reduction of core locking overhead.
One might think that it would be hard to improve on the performance of a lockref in this area, but there is one technology that might just help in this regard: transactional memory. For some years now, transactional memory has been one of those hardware features that was destined to solve a number of our scalability problems. As is often the case, that future has been slow to arrive in the real world. But some manufacturers are now shipping CPUs with some transactional memory support built into them, and software is starting to take advantage of this feature. See, for example, this article describing the work done to add transactional memory support to the GNU C library.
Linus recently got an itch to upgrade to a newer, faster desktop computer; when he chose his new processor, he made a point of getting one that provided transactional memory support. So, he decided, the time had come to try to use that support to further speed dentry cache locking. The result was a short, proof-of-concept patch that shows how things could work. The core part of the patch is worth showing directly:
    asm goto("xbegin %l[repeat]": : :"memory","ax":repeat);
    if (unlikely(d_unhashed(dentry)))
        goto xabort;
    if (unlikely(!is_simple_dput(dentry)))
        goto xabort;
    if (unlikely(!arch_spin_value_unlocked(dentry->d_lock.rlock.raw_lock)))
        goto xabort;
    dentry->d_lockref.count--;
    asm volatile("xend");
    return;
xabort:
    asm volatile("xabort $0");
repeat:
    /* Fallback to lockref code */
The first line adds the xbegin instruction that starts the transaction. This instruction behaves a little strangely, in that its influence extends over the code that follows. Execution will continue with the code after the xbegin, but, should the transaction abort for any reason, all changes made during the transaction will be rolled back and control will jump to the address provided to xbegin (the label repeat in this case).
What follows is a set of tests to determine whether it is truly safe to decrement the reference count without holding the dentry spinlock. In particular, the dentry must be hashed (have a name associated with it, essentially), be on the LRU list, and not be locked. Reading these fields of the dentry structure adds them to the transaction; should some other code make a change to one of them, the transaction will be aborted by the processor. If any of the tests show that the dentry is not in a suitable state for a quick reference count decrement, the code uses the xabort instruction to make the transaction fail. Otherwise, the reference count will be decremented and the xend instruction will bring the transaction to a successful conclusion. The reference count, too, will become part of the transaction; should any other processor also try to change it, the transaction will abort.
The code as written is clearly not intended for merging; one does not put
late-model x86 assembly code directly into generic filesystem code, after all.
But it is enough to get a sense for how well transactional memory works in
this situation. According to Linus, "profiles with this look
beautiful", with the locking cost associated with the
dput() operation reduced essentially to zero. So there may well
be a place for transactional memory within the dentry cache.
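For those who want to experiment with the same pattern outside the kernel, the following user-space sketch uses GCC's RTM intrinsics (build with -mrtm on a TSX-capable processor) to elide a simple spinlock around a reference-count decrement. The structure and helpers here are invented for illustration; this is not the kernel code itself:

    /* User-space sketch of a transactional reference-count drop. */
    #include <immintrin.h>
    #include <stdatomic.h>

    struct counted {
        atomic_int lock;        /* 0 = unlocked, 1 = locked */
        int count;
    };

    static void put_ref(struct counted *c)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Reading the lock word pulls it into the transaction's read
             * set; if another CPU takes the lock, the transaction aborts
             * and _xbegin() returns an error status instead. */
            if (atomic_load_explicit(&c->lock, memory_order_relaxed))
                _xabort(0);
            c->count--;
            _xend();
            return;
        }
        /* Transaction aborted or unsupported: fall back to the lock. */
        while (atomic_exchange_explicit(&c->lock, 1, memory_order_acquire))
            ;       /* spin */
        c->count--;
        atomic_store_explicit(&c->lock, 0, memory_order_release);
    }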
Whether this technique will see much wider use in the kernel remains to be
seen, though. Andi Kleen posted a patch using
transactional memory for most kernel locks back in March, but that work
did not go very far, mostly because he could not provide any benchmark
numbers showing what kind of performance improvements could be expected.
In his patch, Linus made it clear that he suspects those improvements will be
small, saying "I doubt the intel TSX patches are going to be useful (if they
were, Intel would be crowing about performance numbers now that the CPU's are
out, and they aren't)." Instead, he has suggested that this feature will only
be useful in a few places:
So as far as I'm concerned, transactional memory is going to be
useful - *if* it is useful - only for specialized code. Some of
that might be architecture-internal lock implementations, other
things might be exactly the dput() kind of situation.
The end result is that Andi's patches may be a good starting point for
transactional memory support in a more architecture-independent way, but we
may not see transactional memory used as a general locking mechanism (or
lock-elision mechanism) in the kernel.
Along the lines of architecture-independence, it is worth noting that
Intel was not the first to ship a processor with transactional memory
support; the PowerPC
architecture has had similar functionality for a couple of years.
Naturally, the feature works a little differently in that architecture, but
the basic functionality is the same. So it should be possible to create
some sort of wrappers around transactional memory that can be used in
common code. That is, assuming kernel-space transactional memory can be made
to work efficiently on PowerPC at all; there are some challenges in the way
of getting that functionality working reliably.
What one might conclude is that, while transactional memory is a
useful technology for speeding up specific bits of high-profile kernel
code, it does not look like a general solution to the locking problem at
this point. That could change in the future, of course, as hardware
transactional memory gets faster and more capable. But, for now,
transactional memory looks mostly like a useful trick to squeeze a little
more performance out of a few highly optimized code paths.
How much memory power management is useful?
In most contemporary systems, the easiest ways to reduce power consumption
have long since been taken advantage of. But any system component that uses
power is worth looking at to see whether it can be made to use less.
In the case of main memory, the potential power savings are said to be
significant, but they come at the expense of complicating the (already
complicated) memory management subsystem. Now, with the latest in a long
series of memory power management patches, some developers are questioning
whether all of that complexity is called for.
Srivatsa S. Bhat's memory power management
patch set was last covered here in
April 2013. Many of the core principles remain the same, but some of the
details have changed significantly.
The patch set is still based on the notion of dividing main memory into
"regions" that match the power-management boundaries of the underlying
hardware. At the hardware level, a region is a range of memory that can be
powered up or down as a unit. The fundamental goal of the memory power
management patches is to keep as many of these regions free as possible,
enabling them to be kept in a low-power state.
Regions have some similarity to the "zones" maintained by the memory
management subsystem. But zone boundaries are defined by the kernel, and
have little to do with the hardware architecture (though zones will not
span multiple NUMA nodes). Regions are, instead, a hardware concept.
Since there is little commonality between regions and zones,
the patch set ends up defining regions as an independent structure, almost
entirely distinct from zones.
The first step is then simple: when allocating pages, do so from the
lowest-numbered region with available pages. That will tend to concentrate
allocations at one end of the memory range, keeping the higher-numbered
regions free. To facilitate allocation in this mode, the page allocator's
free list
is modified to keep free pages sorted by region; that allows pages from the
appropriate regions to be found quickly, with no need to scan through the
list. The allocation logic is also changed a bit. Currently, the kernel
will try to avoid splitting up large contiguous areas to satisfy an
allocation request if a suitable (and smaller) chunk of memory is
available. The memory power management patches sacrifice that heuristic to
the extent that large chunks in low-numbered regions will be split rather
than allocating from a smaller chunk in a higher-numbered region.
The next step is somewhat more invasive: a "region allocator" is added to
the memory management subsystem; it manages memory in large blocks
corresponding to hardware regions. In current kernels, the page
allocator works directly with the system's available memory;
in the new scheme, instead,
the page allocator allocates memory from regions of memory obtained from
the region allocator. In other words, the region allocator has been
placed between the page allocator and physical memory.
If the page allocator needs more memory, it will request a new region from the
region allocator. If, instead, the page allocator realizes that it has a
full region on the free list, that region can be returned to the region
allocator.
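The layering may be easier to see in a small sketch; the structures and function names below are hypothetical stand-ins for illustration, not the patch set's actual interfaces:

    /* Toy model of the split described above: the page allocator hands out
     * pages from the lowest-numbered region it holds, asks the region
     * allocator for a new region when it runs dry, and gives back any
     * region whose pages have all been freed. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_REGIONS        8
    #define PAGES_PER_REGION  1024

    static bool region_given_out[NR_REGIONS];   /* region allocator state */
    static int  pages_free[NR_REGIONS];         /* page allocator state */

    static int region_alloc(void)               /* region allocator */
    {
        for (int r = 0; r < NR_REGIONS; r++)
            if (!region_given_out[r]) {
                region_given_out[r] = true;
                return r;
            }
        return -1;
    }

    static void region_free(int r)              /* region allocator */
    {
        region_given_out[r] = false;            /* may now be powered down */
    }

    static int alloc_page(void)                 /* page allocator */
    {
        /* Take from the lowest-numbered region with free pages, keeping
         * higher-numbered regions empty. */
        for (int r = 0; r < NR_REGIONS; r++)
            if (region_given_out[r] && pages_free[r] > 0) {
                pages_free[r]--;
                return r;
            }
        /* Out of pages: get another region from the region allocator. */
        int r = region_alloc();
        if (r < 0)
            return -1;
        pages_free[r] = PAGES_PER_REGION - 1;
        return r;
    }

    static void free_page(int r)                /* page allocator */
    {
        if (++pages_free[r] == PAGES_PER_REGION)
            region_free(r);                     /* fully free: hand it back */
    }

    int main(void)
    {
        int r = alloc_page();
        printf("page allocated from region %d\n", r);
        free_page(r);
        return 0;
    }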
A determined effort is made to ensure that all allocations from any given
region have the same "migration type." A page's migration type describes
how easily the contents of that page could be moved elsewhere. Anonymous
pages for user-space use are relatively easy to move; all that is required
is to copy the data and change any page table entries that point to the
page. Pages used in the kernel's slab allocators, instead, are firmly
nailed into place; moving such a page would require finding and changing
all of the kernel-space pointers referring to objects allocated from that
page — not an easily solved problem. The purpose here is straightforward
enough: it only takes one
non-movable page to keep an entire region powered up. If all of those
unmovable pages can be kept separate from pages that are more easily
relocated, it will be easier to free regions of memory.
The final part of the patch set takes advantage of movable pages to
actively free regions of memory when the conditions are right. For example,
if a region containing movable pages is mostly empty and free pages are
available elsewhere, the kernel will attempt to relocate data out of that
region and, once the region is empty, return it to the region allocator.
In other cases (clean
page-cache pages, for example), the pages can simply be reclaimed.
In this way, it is hoped,
the inevitable fragmentation that occurs over time can be cleaned up,
keeping the maximum number of regions free.
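The policy decision involved might look roughly like the following sketch; the threshold and the helper name are invented for illustration and are not taken from the patch set:

    /* Evacuate a region only if it is nearly empty and the rest of the
     * system has room to absorb its remaining pages. */
    #include <stdbool.h>

    #define EVACUATE_THRESHOLD_PCT  10      /* hypothetical cut-off */

    static bool should_evacuate(unsigned long used_pages,
                                unsigned long region_pages,
                                unsigned long free_pages_elsewhere)
    {
        if (used_pages * 100 > region_pages * EVACUATE_THRESHOLD_PCT)
            return false;                   /* too full to be worth moving */
        return free_pages_elsewhere >= used_pages;
    }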
Most of this 40-part patch set is relatively uncontroversial, but there are
some worries about the active defragmentation measures. If they are not
applied carefully, they could end up increasing power consumption rather
than decreasing it: moving pages around takes power, as does replacing data
in the page cache that was reclaimed prematurely. So there needs to be
some clear evidence that the active measures help to reduce power
consumption; thus far, that evidence does not exist, since no power savings
benchmark results have been posted for some time.
More to the point, there are concerns that the active measures may reflect
a bit of a mismatch
between the design of the patch set and how memory power management
actually happens on current hardware. The core idea is to keep as many
memory regions entirely free of data as possible; that will allow those
regions to be powered down — losing all data stored therein — without
causing problems. In this model, a single active page can keep a region
from being powered down, so regions must be completely evacuated for the
power savings to be realized.
But, as Arjan van de Ven explained, memory
power management, on current Intel hardware at least, works a bit
differently. It has a number of reduced-power modes, most of which are
entirely transparent to the operating system and preserve the data stored
in memory. Transitions to and from the reduced-power states are quite fast,
to the point that latency problems are not a real concern.
For this type of power management to kick in, all that is
required is that the relevant region of memory be idle most of the time.
As Arjan put it, if a significant number of regions can be kept mostly
empty, it doesn't matter that much if the last few pages are still in
occasional use.
That suggests that the parts of Srivatsa's patch set that control the
allocation patterns are all that's needed to reap most of the power-savings
rewards — on Intel hardware at least. The active clearing of regions may
turn out to be a futile effort to gain the last few percent of the
available savings. Of course, not all hardware is made
by Intel; results for other processors and architectures may vary.
Even on Intel systems, there may well be situations where it makes sense to
support
full, content-destroying memory power-down. But the fact that most of the
benefits can apparently be obtained without actively moving pages around
argues for that feature being turned off by default, at least.
Note the use of the word "apparently" above, though. What is lacking from
the discussion at this point is any sort of comprehensive measurements of
actual power use with and without these patches. Real measurements, which
generally trump more theoretical discussions, should be the deciding factor when
the memory management maintainers decide how much of this work to merge and
whether the active defragmentation functionality should be enabled by
default or not. Either way, the bulk of the patch set seems to be
relatively uncontroversial, so we should see some sort of movement on
memory power management, finally, in the near future.
NUMA scheduling progress
NUMA balancing was a topic of fierce debate through much of 2012; that
discussion culminated with the merging of Mel Gorman's NUMA balancing
infrastructure patch set into the 3.8 kernel. Those patches provided the
basic structure upon which a NUMA balancing solution could be built, but
did not attempt to solve the problem in a comprehensive way. Since then,
one might be
forgiven for thinking that the developers involved have lost interest; not
much NUMA-related code has found its way into the mainline. But, as can be
seen in Mel's basic scheduler support for NUMA
balancing patch set, which weighs in at 63 individual changesets, quite
a bit of work has been happening in this area.
The NUMA balancing problem is one of overcoming a fundamental
characteristic of NUMA hardware: accesses to memory local to any given NUMA
node will be much faster than accesses to remote memory. If processes and
their memory live on the same NUMA node, things will run quickly; otherwise
performance will be significantly worse. NUMA performance on Linux has
long been deemed to be worse than it should be, so there is clearly some
room for improvement in this area.
The kernel has, for years, attempted to allocate memory on the same node
that the allocating process is running on, but that turns out to be
inadequate. Over time, it seems to be inevitable that processes and memory
will move around the system, leading to poor performance. Recent efforts to
improve the situation have thus been focused on cleaning things up as the
system runs, moving memory to the nodes that are actually using it and
moving processes to be closer to their memory. This particular patch set,
as suggested by the "scheduler support" name, is mainly concerned with the
latter task — ensuring that the scheduler runs processes on the nodes that
host the bulk of their memory.
Locating a process's memory
If the scheduler is to keep processes near their memory, it must, clearly,
have a sense for where that memory is placed. This information is not as
straightforward to come by as one might expect, especially if one is
concerned mostly with the memory that is actually being used, as opposed to
memory which has been allocated but is not in active use. The kernel needs
to continually observe a process's memory access patterns to be able to
optimize its placement in the system.
One of the core changes in the current NUMA patch set is to enable this
observation, using the virtual memory mechanism. The scheduler
will periodically scan through each process's address space, revoking all
access permissions to the pages that are currently resident in RAM. The
next time the affected process tries to access that memory, a page fault
will result. The scheduler will trap that fault and restore access to the
page in question; it will also increment an access counter in a per-process
array indexed
by the NUMA node number. Over time, these counts (which are maintained as
a sort of running average) will provide a picture of
where the process's memory accesses are going. At that point, it is a
simple matter of looping through the array to find the node with the most
accesses; that becomes the "preferred node" for the process.
If the process is running on a node other than the preferred one, the
scheduler will attempt to migrate it to the node where its memory lives.
That should result in more node-local memory accesses and, thus, better
performance. Of course, the picture is a bit more complicated than this,
starting with the fact that the migration of the
process could fail for any of several reasons. So one of the patches in the
series will periodically retry the attempt to migrate a process to its
preferred node if things didn't work out the first time. Even if the
initial migration attempt fails, the process should eventually end up on
the preferred node.
This process-follow-memory migration can also run counter to another one of
the scheduler's core goals: distributing the load across the system to make the
best use of the available CPU time. If a process that has been migrated
closer to its memory cannot actually run because the destination NUMA node
is overloaded, the goal of improved performance is unlikely to be
achieved. So it may be necessary to migrate other processes off the
overloaded node. That complicates the picture somewhat. The patch set
does not try to address the full problem, but it does take on one specific
subproblem: cases where swapping two processes between two CPUs will make both
processes run more efficiently. Even that task is fraught with hazards:
moving two processes at once is more complicated than migrating a single
process. To make it work with a minimum of fuss, the patch set adds a
variant of stop_machine() that effectively halts work on the two
processors involved while the exchange is made.
It also became necessary to avoid tracking NUMA access information for
shared executable pages. Including those pages will tend to pull processes
toward the nodes where pages holding shared library code can be found. But
there is relatively little value in migrating processes near their
executable pages, as it turns out, because much of the needed data is in
the CPU caches much of the time anyway. So patch 35 in the series avoids
tracking faults from those pages.
NUMA groups
As described thus far, the patch set adds a basic mechanism for putting a
process near its (exclusively owned)
pages. But the concept of "a process's memory" is not necessarily simple
either; processes often share pages with each other. That is especially
true of threaded
applications where all of the threads run in the same address space and have
access to the same memory. Threaded applications written with NUMA
performance in mind will partition their memory accesses, but they can
still end up sharing pages, especially if a feature like transparent huge
pages is in use. In other cases, entirely separate processes may still
share a large memory area; neither "owns" it, but both can benefit by being
placed close to it. Either way, there can
be performance benefits from placing such processes on the same node
whenever possible.
To improve performance in the shared memory case, the patch set adds a
mechanism that tries to detect when processes are sharing memory with each
other — and, importantly, when they are accessing the same pages. Doing so
requires tracking which processes are accessing each page. Andrea
Arcangeli's AutoNUMA patch set, one of the
contenders in the 2012 discussions, added an extensive tracking
infrastructure for this purpose, but the resulting memory overhead was
deemed to be too high. So, as is so often the case, this patch set tries
to shoehorn the needed data into struct page
instead.
In particular, the data is put into the flags field, which is
already rather more complex than a simple
set of flags. The ID of the NUMA node containing each page is stored there in
current kernels;
this patch set changes that to the CPU ID, and adds the process ID of the
last process to access the page as well. Needless to say, the full process
ID does not fit into a subfield of flags, so some compromises need
to be made. In this case, only the bottom eight bits of the process ID are
used, with the understanding that some collisions will be unavoidable.
Whenever the system handles a page fault, this "cpupid" number is stored in
the relevant page structure.
With that infrastructure in place, the scheduler can respond to a NUMA scanning
fault by checking whether the process currently accessing the page is the
same as the last process to do so. If not, the scheduler will consider
putting the two processes into a "NUMA group." But first, it must find the
other process that accessed the page — not a trivial matter, given that the
full process ID for that process is not available. This problem is handled
by looking at whichever process is currently running on the CPU that last
accessed the page; if the relevant part of the process ID matches, the
scheduler assumes (after applying a few sanity tests) that it is the same
process as the one that accessed the
page. In this case, the processes are placed into a NUMA group to indicate
that they are sharing pages; if both processes are already in groups, those
two groups will be coalesced into one larger group.
Once again, shared library pages threw a wrench into the works. Since
every process shares access to, for example, the C library, all processes
in the system tended to get pulled together into a single NUMA group.
Deeming that to be less than fully useful, Peter Zijlstra tossed in a patch
avoiding group tracking for read-only pages.
Once a process is part of a NUMA group, page faults for shared pages will
be tracked at the group level. That gives the scheduler two somewhat
independent sets of statistics from which to make decisions: accesses to
private pages, and accesses to shared pages. If both sets of numbers agree
on the preferred node, the decision is easy. If, instead, there is a
disagreement, the preferred node will be chosen by iterating through the
arrays and picking the node where the sum of private and shared faults is
the highest. That sum is distorted, though, by weighting shared faults a
bit more heavily in an attempt to push the system toward decisions that put
processes with shared pages on the same node.
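Conceptually, that fallback decision is a weighted sum over the two arrays; in the sketch below the weighting factor is an invented value, not the one used in the patch set:

    /* Combine private and shared fault counts, weighting shared faults a
     * bit more heavily, and pick the node with the highest total. */
    #define MAX_NUMNODES   4
    #define SHARED_WEIGHT  2        /* hypothetical weighting factor */

    static int group_preferred_node(const unsigned long priv[MAX_NUMNODES],
                                    const unsigned long shared[MAX_NUMNODES])
    {
        int best = 0;
        unsigned long best_score = 0;

        for (int nid = 0; nid < MAX_NUMNODES; nid++) {
            unsigned long score = priv[nid] + SHARED_WEIGHT * shared[nid];

            if (score > best_score) {
                best_score = score;
                best = nid;
            }
        }
        return best;
    }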
There is a lot more than this to the patch set, including a number of
algorithmic tweaks and fixes, but the above text describes the most
significant themes. Mel has included quite a bit of benchmark data with
the patch set; there is a lot of information there, but the bottom line, as
he described it, was: "some reports indicate that the performance
is getting close to manual bindings for some workloads but your mileage
will vary." A number of details are yet to be addressed; in
particular, support for CPU hotplugging is not yet there. Even so, Mel
thinks that the code is getting ready for movement into the "tip" tree,
and, thus, linux-next. That suggests it could be ready for merging in the
3.13 or (perhaps more likely) 3.14 development cycle.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet