Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc3, released on September 29. Linus's summary was: "On the whole, nothing really appears very scary. Go forth and test."
Stable updates: 3.11.2, 3.10.13, 3.4.63, and 3.0.97 were released on September 26; they were followed by 3.11.3, 3.10.14, 3.4.64, and 3.0.98 on October 1.
Quotes of the week
/*
 * Sent from my mobile phone. Please pardon brevity and lack of formatting.
 */
Exchanging two files
The renameat() system call changes the name of the file given as an argument, possibly replacing an existing file in the process. This operation is atomic; the view of the filesystem from user space will reflect the situation before or after the renameat() call, but it will never expose an intermediate state. Things work well when one file is involved, but what happens when multiple rename operations need to be run as a single atomic operation? That is a big problem, but, thanks to a patch from Miklos Szeredi, we might have a solution to a smaller subset of it.
The problem Miklos is trying to solve is that of exchanging two files — both files continue to exist, but their names have been swapped. To achieve this, he has posted a patch set adding a new renameat2() system call:
    int renameat2(int olddir, const char *oldname,
                  int newdir, const char *newname,
                  unsigned int flags);
This system call differs from renameat() in that it has the new flags argument; if flags is zero, renameat2() behaves exactly as renameat(). If, instead, flags contains RENAME_EXCHANGE, an existing file at newname will not be deleted; instead, it will be renamed to oldname. Thus, with this flag, renameat2() can be used to atomically exchange two files. The main use case for renameat2() is to support union filesystems, where it is often desirable to atomically replace files or directories with "whiteouts" indicating that they have been deleted. One could imagine other possibilities as well; Miklos suggests atomically replacing a directory with a symbolic link as one of them.
No review comments have been posted as of this writing.
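As an illustration, here is a sketch of how user space might exercise the proposed call; it is not taken from the patch set. Since no glibc wrapper would exist yet, the call goes through syscall(), and the __NR_renameat2 and RENAME_EXCHANGE values are assumed to match the posted patches:

    /* Hypothetical example: atomically swap two file names with the
     * proposed renameat2() system call. */
    #define _GNU_SOURCE
    #include <fcntl.h>          /* AT_FDCWD */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef RENAME_EXCHANGE
    #define RENAME_EXCHANGE (1 << 1)    /* assumed flag value */
    #endif

    int main(void)
    {
        /* After this call, each file appears under the other's name,
         * with no intermediate state ever visible to user space. */
        long ret = syscall(__NR_renameat2, AT_FDCWD, "config.new",
                           AT_FDCWD, "config.current", RENAME_EXCHANGE);
        if (ret < 0)
            perror("renameat2");
        return ret < 0 ? 1 : 0;
    }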
Kernel development news
Transactional memory in the dentry cache
LWN recently described the "lockref" mechanism merged into the 3.12 kernel. Lockrefs aim to reduce locking overhead in situations where all that needs to happen is the modification of a reference count, but the data structure as a whole must be held invariant while that count is changed. This new locking primitive has helped to reduce the overhead of locking in the system's dentry cache — the extensive data structure that caches mappings between file names and inodes. But dentry cache overhead strongly affects the performance of the system as a whole, so it is not surprising that there is a desire to improve things even more; that appears to be especially true in the 3.12 development cycle, which has seen a lot of attention paid to the reduction of core locking overhead.
One might think that it would be hard to improve on the performance of a lockref in this area, but there is one technology that might just help in this regard: transactional memory. For some years now, transactional memory has been one of those hardware features that was destined to solve a number of our scalability problems. As is often the case, that future has been slow to arrive in the real world. But some manufacturers are now shipping CPUs with some transactional memory support built into them, and software is starting to take advantage of this feature. See, for example, this article describing the work done to add transactional memory support to the GNU C library.
Linus recently got an itch to upgrade to a newer, faster desktop computer; when he chose his new processor, he made a point of getting one that provided transactional memory support. So, he decided, the time had come to try to use that support to further speed dentry cache locking. The result was a short, proof-of-concept patch that shows how things could work. The core part of the patch is worth showing directly:
    asm goto("xbegin %l[repeat]": : :"memory","ax":repeat);
    if (unlikely(d_unhashed(dentry)))
        goto xabort;
    if (unlikely(!is_simple_dput(dentry)))
        goto xabort;
    if (unlikely(!arch_spin_value_unlocked(dentry->d_lock.rlock.raw_lock)))
        goto xabort;
    dentry->d_lockref.count--;
    asm volatile("xend");
    return;
xabort:
    asm volatile("xabort $0");
repeat:
    /* Fallback to lockref code */
The first line adds the xbegin instruction that starts the transaction. This instruction behaves a little strangely, in that its influence extends over the code that follows. Execution will continue with the code after the xbegin, but, should the transaction abort for any reason, all changes made during the transaction will be rolled back and control will jump to the address provided to xbegin (the label repeat in this case).
What follows is a set of tests to determine whether it is truly safe to decrement the reference count without holding the dentry spinlock. In particular, the dentry must be hashed (have a name associated with it, essentially), be on the LRU list, and not be locked. Reading these fields of the dentry structure adds them to the transaction; should some other code make a change to one of them, the transaction will be aborted by the processor. If any of the tests show that the dentry is not in a suitable state for a quick reference count decrement, the code uses the xabort instruction to make the transaction fail. Otherwise, the reference count will be decremented and the xend instruction will bring the transaction to a successful conclusion. The reference count, too, will become part of the transaction; should any other processor also try to change it, the transaction will abort.
The code as written is clearly not intended for merging; one does not put
late-model x86 assembly code directly into generic filesystem code, after all.
But it is enough to get a sense for how well transactional memory works in
this situation. According to Linus, "profiles with this look
beautiful", with the locking cost associated with the
dput() operation reduced essentially to zero. So there may well
be a place for transactional memory within the dentry cache.
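For those who want to experiment with the same pattern outside the kernel, the following user-space sketch uses GCC's RTM intrinsics (build with -mrtm on a TSX-capable processor) to elide a simple spinlock around a reference-count decrement. The structure and helpers here are invented for illustration; this is not the kernel code itself:

    /* User-space sketch of a transactional reference-count drop. */
    #include <immintrin.h>
    #include <stdatomic.h>

    struct counted {
        atomic_int lock;        /* 0 = unlocked, 1 = locked */
        int count;
    };

    static void put_ref(struct counted *c)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Reading the lock word pulls it into the transaction's read
             * set; if another CPU takes the lock, the transaction aborts
             * and _xbegin() returns an error status instead. */
            if (atomic_load_explicit(&c->lock, memory_order_relaxed))
                _xabort(0);
            c->count--;
            _xend();
            return;
        }
        /* Transaction aborted or unsupported: fall back to the lock. */
        while (atomic_exchange_explicit(&c->lock, 1, memory_order_acquire))
            ;       /* spin */
        c->count--;
        atomic_store_explicit(&c->lock, 0, memory_order_release);
    }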
Whether this technique will see much wider use in the kernel remains to be
seen, though. Andi Kleen posted a patch using
transactional memory for most kernel locks back in March, but that work
did not go very far, mostly because he could not provide any benchmark
numbers showing what kind of performance improvements could be expected.
In his patch, Linus made it clear that he suspects those improvements will be
small, saying "I doubt the intel TSX patches are going to be useful (if they
were, Intel would be crowing about performance numbers now that the CPU's are
out, and they aren't)." Instead, he has suggested that this feature will only
be useful in a few places:
So as far as I'm concerned, transactional memory is going to be
useful - *if* it is useful - only for specialized code. Some of
that might be architecture-internal lock implementations, other
things might be exactly the dput() kind of situation.
The end result is that Andi's patches may be a good starting point for
transactional memory support in a more architecture-independent way, but we
may not see transactional memory used as a general locking mechanism (or
lock-elision mechanism) in the kernel.
Along the lines of architecture-independence, it is worth noting that
Intel was not the first to ship a processor with transactional memory
support; the PowerPC
architecture has had similar functionality for a couple of years.
Naturally, the feature works a little differently in that architecture, but
the basic functionality is the same. So it should be possible to create
some sort of wrappers around transactional memory that can be used in
common code. That is, assuming kernel-space transactional memory can be made
to work efficiently on PowerPC at all; there are some challenges in the way
of getting that functionality working reliably.
What one might conclude is that, while transactional memory is a
useful technology for speeding up specific bits of high-profile kernel
code, it does not look like a general solution to the locking problem at
this point. That could change in the future, of course, as hardware
transactional memory gets faster and more capable. But, for now,
transactional memory looks mostly like a useful trick to squeeze a little
more performance out of a few highly optimized code paths.
How much memory power management is useful?
In most contemporary systems, the easiest ways to reduce power consumption
have long since been taken advantage of. But any system component that uses
power is worth looking at to see whether it can be made to use less.
In the case of main memory, the potential power savings are said to be
significant, but they come at the expense of complicating the (already
complicated) memory management subsystem. Now, with the latest in a long
series of memory power management patches, some developers are questioning
whether all of that complexity is called for.
Srivatsa S. Bhat's memory power management
patch set was last covered here in
April 2013. Many of the core principles remain the same, but some of the
details have changed significantly.
The patch set is still based on the notion of dividing main memory into
"regions" that match the power-management boundaries of the underlying
hardware. At the hardware level, a region is a range of memory that can be
powered up or down as a unit. The fundamental goal of the memory power
management patches is to keep as many of these regions free as possible,
enabling them to be kept in a low-power state.
Regions have some similarity to the "zones" maintained by the memory
management subsystem. But zone boundaries are defined by the kernel, and
have little to do with the hardware architecture (though zones will not
span multiple NUMA nodes). Regions are, instead, a hardware concept.
Since there is little commonality between regions and zones,
the patch set ends up defining regions as an independent structure, almost
entirely distinct from zones.
The first step is then simple: when allocating pages, do so from the
lowest-numbered region with available pages. That will tend to concentrate
allocations at one end of the memory range, keeping the higher-numbered
regions free. To facilitate allocation in this mode, the page allocator's
free list
is modified to keep free pages sorted by region; that allows pages from the
appropriate regions to be found quickly, with no need to scan through the
list. The allocation logic is also changed a bit. Currently, the kernel
will try to avoid splitting up large contiguous areas to satisfy an
allocation request if a suitable (and smaller) chunk of memory is
available. The memory power management patches sacrifice that heuristic to
the extent that large chunks in low-numbered regions will be split rather
than allocating from a smaller chunk in a higher-numbered region.
The next step is somewhat more invasive: a "region allocator" is added to
the memory management subsystem; it manages memory in large blocks
corresponding to hardware regions. In current kernels, the page
allocator works directly with the system's available memory;
in the new scheme, instead,
the page allocator allocates memory from regions of memory obtained from
the region allocator. In other words, the region allocator has been
placed between the page allocator and physical memory.
If the page allocator needs more memory, it will request a new region from the
region allocator. If, instead, the page allocator realizes that it has a
full region on the free list, that region can be returned to the region
allocator.
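The layering may be easier to see in a small sketch; the structures and function names below are hypothetical stand-ins for illustration, not the patch set's actual interfaces:

    /* Toy model of the split described above: the page allocator hands out
     * pages from the lowest-numbered region it holds, asks the region
     * allocator for a new region when it runs dry, and gives back any
     * region whose pages have all been freed. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_REGIONS        8
    #define PAGES_PER_REGION  1024

    static bool region_given_out[NR_REGIONS];   /* region allocator state */
    static int  pages_free[NR_REGIONS];         /* page allocator state */

    static int region_alloc(void)               /* region allocator */
    {
        for (int r = 0; r < NR_REGIONS; r++)
            if (!region_given_out[r]) {
                region_given_out[r] = true;
                return r;
            }
        return -1;
    }

    static void region_free(int r)              /* region allocator */
    {
        region_given_out[r] = false;            /* may now be powered down */
    }

    static int alloc_page(void)                 /* page allocator */
    {
        /* Take from the lowest-numbered region with free pages, keeping
         * higher-numbered regions empty. */
        for (int r = 0; r < NR_REGIONS; r++)
            if (region_given_out[r] && pages_free[r] > 0) {
                pages_free[r]--;
                return r;
            }
        /* Out of pages: get another region from the region allocator. */
        int r = region_alloc();
        if (r < 0)
            return -1;
        pages_free[r] = PAGES_PER_REGION - 1;
        return r;
    }

    static void free_page(int r)                /* page allocator */
    {
        if (++pages_free[r] == PAGES_PER_REGION)
            region_free(r);                     /* fully free: hand it back */
    }

    int main(void)
    {
        int r = alloc_page();
        printf("page allocated from region %d\n", r);
        free_page(r);
        return 0;
    }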
A determined effort is made to ensure that all allocations from any given
region have the same "migration type." A page's migration type describes
how easily the contents of that page could be moved elsewhere. Anonymous
pages for user-space use are relatively easy to move; all that is required
is to copy the data and change any page table entries that point to the
page. Pages used in the kernel's slab allocators, instead, are firmly
nailed into place; moving such a page would require finding and changing
all of the kernel-space pointers referring to objects allocated from that
page — not an easily solved problem. The purpose here is straightforward
enough: it only takes one
non-movable page to keep an entire region powered up. If all of those
unmovable pages can be kept separate from pages that are more easily
relocated, it will be easier to free regions of memory.
The final part of the patch set takes advantage of movable pages to
actively free regions of memory when the conditions are right. For example,
if a region containing movable pages is mostly empty and free pages are
available elsewhere, the kernel will attempt to relocate data out of that
region and, once the region is empty, return it to the region allocator.
In other cases (clean
page-cache pages, for example), the pages can simply be reclaimed.
In this way, it is hoped,
the inevitable fragmentation that occurs over time can be cleaned up,
keeping the maximum number of regions free.
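The policy decision involved might look roughly like the following sketch; the threshold and the helper name are invented for illustration and are not taken from the patch set:

    /* Evacuate a region only if it is nearly empty and the rest of the
     * system has room to absorb its remaining pages. */
    #include <stdbool.h>

    #define EVACUATE_THRESHOLD_PCT  10      /* hypothetical cut-off */

    static bool should_evacuate(unsigned long used_pages,
                                unsigned long region_pages,
                                unsigned long free_pages_elsewhere)
    {
        if (used_pages * 100 > region_pages * EVACUATE_THRESHOLD_PCT)
            return false;                   /* too full to be worth moving */
        return free_pages_elsewhere >= used_pages;
    }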
Most of this 40-part patch set is relatively uncontroversial, but there are
some worries about the active defragmentation measures. If they are not
applied carefully, they could end up increasing power consumption rather
than decreasing it: moving pages around takes power, as does replacing data
in the page cache that was reclaimed prematurely. So there needs to be
some clear evidence that the active measures help to reduce power
consumption; thus far, that evidence does not exist, since no power savings
benchmark results have been posted for some time.
More to the point, there are concerns that the active measures may reflect
a bit of a mismatch
between the design of the patch set and how memory power management
actually happens on current hardware. The core idea is to keep as many
memory regions entirely free of data as possible; that will allow those
regions to be powered down — losing all data stored therein — without
causing problems. In this model, a single active page can keep a region
from being powered down, so regions must be completely evacuated for the
power savings to be realized.
But, as Arjan van de Ven explained, memory
power management, on current Intel hardware at least, works a bit
differently. It has a number of reduced-power modes, most of which are
entirely transparent to the operating system and preserve the data stored
in memory. Transitions to and from the reduced-power states are quite fast,
to the point that latency problems are not a real concern.
For this type of power management to kick in, all that is
required is that the relevant region of memory be idle most of the time.
As Arjan put it, if a significant number of regions can be kept mostly
empty, it doesn't matter that much if the last few pages are still in
occasional use.
That suggests that the parts of Srivatsa's patch set that control the
allocation patterns are all that's needed to reap most of the power-savings
rewards — on Intel hardware at least. The active clearing of regions may
turn out to be a futile effort to gain the last few percent of the
available savings. Of course, not all hardware is made
by Intel; results for other processors and architectures may vary.
Even on Intel systems, there may well be situations where it makes sense to
support
full, content-destroying memory power-down. But the fact that most of the
benefits can apparently be obtained without actively moving pages around
argues for that feature being turned off by default, at least.
Note the use of the word "apparently" above, though. What is lacking from
the discussion at this point is any sort of comprehensive measurements of
actual power use with and without these patches. Real measurements, which
generally trump more theoretical discussions, should be the deciding factor when
the memory management maintainers decide how much of this work to merge and
whether the active defragmentation functionality should be enabled by
default or not. Either way, the bulk of the patch set seems to be
relatively uncontroversial, so we should see some sort of movement on
memory power management, finally, in the near future.
NUMA scheduling progress
NUMA balancing was a topic of fierce debate through much of 2012; that
discussion culminated with the merging of Mel Gorman's NUMA balancing
infrastructure patch set into the 3.8 kernel. Those patches provided the
basic structure upon which a NUMA balancing solution could be built, but
did not attempt to solve the problem in a comprehensive way. Since then,
one might be
forgiven for thinking that the developers involved have lost interest; not
much NUMA-related code has found its way into the mainline. But, as can be
seen in Mel's basic scheduler support for NUMA
balancing patch set, which weighs in at 63 individual changesets, quite
a bit of work has been happening in this area.
The NUMA balancing problem is one of overcoming a fundamental
characteristic of NUMA hardware: accesses to memory local to any given NUMA
node will be much faster than accesses to remote memory. If processes and
their memory live on the same NUMA node, things will run quickly; otherwise
performance will be significantly worse. NUMA performance on Linux has
long been deemed to be worse than it should be, so there is clearly some
room for improvement in this area.
The kernel has, for years, attempted to allocate memory on the same node
that the allocating process is running on, but that turns out to be
inadequate. Over time, it seems to be inevitable that processes and memory
will move around the system, leading to poor performance. Recent efforts to
improve the situation have thus been focused on cleaning things up as the
system runs, moving memory to the nodes that are actually using it and
moving processes to be closer to their memory. This particular patch set,
as suggested by the "scheduler support" name, is mainly concerned with the
latter task — ensuring that the scheduler runs processes on the nodes that
host the bulk of their memory.
Locating a process's memory
If the scheduler is to keep processes near their memory, it must, clearly,
have a sense for where that memory is placed. This information is not as
straightforward to come by as one might expect, especially if one is
concerned mostly with the memory that is actually being used, as opposed to
memory which has been allocated but is not in active use. The kernel needs
to continually observe a process's memory access patterns to be able to
optimize its placement in the system.
One of the core changes in the current NUMA patch set is to enable this
observation, using the virtual memory mechanism. The scheduler
will periodically scan through each process's address space, revoking all
access permissions to the pages that are currently resident in RAM. The
next time the affected process tries to access that memory, a page fault
will result. The scheduler will trap that fault and restore access to the
page in question; it will also increment an access counter in a per-process
array indexed
by the NUMA node number. Over time, these counts (which are maintained as
a sort of running average) will provide a picture of
where the process's memory accesses are going. At that point, it is a
simple matter of looping through the array to find the node with the most
accesses; that becomes the "preferred node" for the process.
If the process is running on a node other than the preferred one, the
scheduler will attempt to migrate it to the node where its memory lives.
That should result in more node-local memory accesses and, thus, better
performance. Of course, the picture is a bit more complicated than this,
starting with the fact that the migration of the
process could fail for any of several reasons. So one of the patches in the
series will periodically retry the attempt to migrate a process to its
preferred node if things didn't work out the first time. Even if the
initial migration attempt fails, the process should eventually end up on
the preferred node.
This process-follow-memory migration can also run counter to another one of
the scheduler's core goals: distributing the load across the system to make the
best use of the available CPU time. If a process that has been migrated
closer to its memory cannot actually run because the destination NUMA node
is overloaded, the goal of improved performance is unlikely to be
achieved. So it may be necessary to migrate other processes off the
overloaded node. That complicates the picture somewhat. The patch set
does not try to address the full problem, but it does take on one specific
subproblem: cases where swapping two processes between two CPUs will make both
processes run more efficiently. Even that task is fraught with hazards:
moving two processes at once is more complicated than migrating a single
process. To make it work with a minimum of fuss, the patch set adds a
variant of stop_machine() that effectively halts work on the two
processors involved while the exchange is made.
It also became necessary to avoid tracking NUMA access information for
shared executable pages. Including those pages will tend to pull processes
toward the nodes where pages holding shared library code can be found. But
there is relatively little value in migrating processes near their
executable pages, as it turns out, because much of the needed data is in
the CPU caches much of the time anyway. So patch 35 in the series avoids
tracking faults from those pages.
NUMA groups
As described thus far, the patch set adds a basic mechanism for putting a
process near its (exclusively owned)
pages. But the concept of "a process's memory" is not necessarily simple
either; processes often share pages with each other. That is especially
true of threaded
applications where all of the threads run in the same address space and have
access to the same memory. Threaded applications written with NUMA
performance in mind will partition their memory accesses, but they can
still end up sharing pages, especially if a feature like transparent huge
pages is in use. In other cases, entirely separate processes may still
share a large memory area; neither "owns" it, but both can benefit by being
placed close to it. Either way, there can
be performance benefits from placing such processes on the same node
whenever possible.
To improve performance in the shared memory case, the patch set adds a
mechanism that tries to detect when processes are sharing memory with each
other — and, importantly, when they are accessing the same pages. Doing so
requires tracking which processes are accessing each page. Andrea
Arcangeli's AutoNUMA patch set, one of the
contenders in the 2012 discussions, added an extensive tracking
infrastructure for this purpose, but the resulting memory overhead was
deemed to be too high. So, as is so often the case, this patch set tries
to shoehorn the needed data into struct page
instead.
In particular, the data is put into the flags field, which is
already rather more complex than a simple
set of flags. The ID of the NUMA node containing each page is stored there in
current kernels;
this patch set changes that to the CPU ID, and adds the process ID of the
last process to access the page as well. Needless to say, the full process
ID does not fit into a subfield of flags, so some compromises need
to be made. In this case, only the bottom eight bits of the process ID are
used, with the understanding that some collisions will be unavoidable.
Whenever the system handles a page fault, this "cpupid" number is stored in
the relevant page structure.
With that infrastructure in place, the scheduler can respond to a NUMA scanning
fault by checking whether the process currently accessing the page is the
same as the last process to do so. If not, the scheduler will consider
putting the two processes into a "NUMA group." But first, it must find the
other process that accessed the page — not a trivial matter, given that the
full process ID for that process is not available. This problem is handled
by looking at whichever process is currently running on the CPU that last
accessed the page; if the relevant part of the process ID matches, the
scheduler assumes (after applying a few sanity tests) that it is the same
process as the one that accessed the
page. In this case, the processes are placed into a NUMA group to indicate
that they are sharing pages; if both processes are already in groups, those
two groups will be coalesced into one larger group.
Once again, shared library pages threw a wrench into the works. Since
every process shares access to, for example, the C library, all processes
in the system tended to get pulled together into a single NUMA group.
Deeming that to be less than fully useful, Peter Zijlstra tossed in a patch
avoiding group tracking for read-only pages.
Once a process is part of a NUMA group, page faults for shared pages will
be tracked at the group level. That gives the scheduler two somewhat
independent sets of statistics from which to make decisions: accesses to
private pages, and accesses to shared pages. If both sets of numbers agree
on the preferred node, the decision is easy. If, instead, there is a
disagreement, the preferred node will be chosen by iterating through the
arrays and picking the node where the sum of private and shared faults is
the highest. That sum is distorted, though, by weighting shared faults a
bit more heavily in an attempt to push the system toward decisions that put
processes with shared pages on the same node.
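Conceptually, that fallback decision is a weighted sum over the two arrays; in the sketch below the weighting factor is an invented value, not the one used in the patch set:

    /* Combine private and shared fault counts, weighting shared faults a
     * bit more heavily, and pick the node with the highest total. */
    #define MAX_NUMNODES   4
    #define SHARED_WEIGHT  2        /* hypothetical weighting factor */

    static int group_preferred_node(const unsigned long priv[MAX_NUMNODES],
                                    const unsigned long shared[MAX_NUMNODES])
    {
        int best = 0;
        unsigned long best_score = 0;

        for (int nid = 0; nid < MAX_NUMNODES; nid++) {
            unsigned long score = priv[nid] + SHARED_WEIGHT * shared[nid];

            if (score > best_score) {
                best_score = score;
                best = nid;
            }
        }
        return best;
    }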
There is a lot more than this to the patch set, including a number of
algorithmic tweaks and fixes, but the above text describes the most
significant themes. Mel has included quite a bit of benchmark data with
the patch set; there is a lot of information there, but the bottom line, as
he described it, was: "some reports indicate that the performance
is getting close to manual bindings for some workloads but your mileage
will vary." A number of details are yet to be addressed; in
particular, support for CPU hotplugging is not yet there. Even so, Mel
thinks that the code is getting ready for movement into the "tip" tree,
and, thus, linux-next. That suggests it could be ready for merging in the
3.13 or (perhaps more likely) 3.14 development cycle.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet