Brief items
The current development kernel is 3.10-rc1,
released on May 11. All told, nearly
12,000 changesets were pulled into the mainline during the merge window,
making it the busiest such window ever. See the separate article below for a
summary of the final changes merged for the 3.10 development cycle.
Stable updates:
3.9.2, 3.8.13, 3.4.45, and 3.0.78 were released on May 11;
3.2.45 came out on May 14.
In the 3.8.13 announcement
Greg Kroah-Hartman said: "NOTE, this is the LAST 3.8.y kernel
release, please move to the 3.9.y kernel series at this time. It is
end-of-life, dead, gone, buried, and put way behind us never to be spoken
of again. Seriously, move on, it's just not worth it anymore."
But the folks at Canonical, having shipped 3.8 in the Ubuntu 13.04 release,
are not moving on; they have announced support for this kernel until August
2014.
The amount of broken code I just encountered is mind boggling.
I've added comments explaining what is broken, but I fear that some
of the code would be best dealt with by being dragged behind the
bike shed, buried in mud up to its neck and then run over
repeatedly with a blunt lawn mower.
— Dave Chinner: not impressed by driver shrinker code
    choice
	prompt "BogoMIPs setting"
	default BOGOMIPS_MEDIUM
	help
	  The BogoMIPs value reported by Linux is exactly what it sounds
	  like: totally bogus. It is used to calibrate the delay loop,
	  which may be backed by a timer clocked completely independently
	  of the CPU.

	  Unfortunately, that doesn't stop marketing types (and even people
	  who should know better) from using the number to compare machines
	  and then screaming if it's less than some fictitious, expected
	  value.

	  So, this option can be used to avoid the inevitable amount of
	  pain and suffering you will endure when the chaps described above
	  start parsing /proc/cpuinfo.

    config BOGOMIPS_SLOW
	bool "Slow (older machines)"
	help
	  If you're comparing a faster machine with a slower machine,
	  then you might want this option selected on one of them.

    config BOGOMIPS_MEDIUM
	bool "Medium (default)"
	help
	  A BogoMIPS value for the masses.

    config BOGOMIPS_FAST
	bool "Fast (marketing)"
	help
	  Some people believe that software runs faster with this
	  setting so, if you're one of them, say Y here.

    config BOGOMIPS_RANDOM
	bool "Random (increased Bogosity)"
	help
	  Putting the Bogo back into BogoMIPs.
— Will Deacon
Canonical has announced that the Ubuntu kernel team will be providing
stable updates for the 3.8 kernel now that Greg Kroah-Hartman has moved
on. This support will last as long as support for the Ubuntu 13.04
release: through August 2014. "We welcome any feedback and contribution
to this effort. We will be posting the first review cycle patch set in a
week or two."
By Jonathan Corbet
May 15, 2013
Copying a file is a common operation on any system. Some filesystems have
the ability to accelerate copy operations considerably; for example, Btrfs
can just add another set of copy-on-write references to the file data, and
the NFS protocol allows a client to request that a copy be done on the
server, avoiding moving the data over the net twice. But, for the most
part, copying is still done the old-fashioned way, with the most
sophisticated applications possibly using
splice().
There have been various proposals over the years for ways to speed up copy
operations (reflink(), for example), but nothing has
ever made it into the mainline. The latest attempt is Zach Brown's copy_range() patch. It adds a new
system call:
    int copy_range(int in_fd, loff_t *in_offset,
		   int out_fd, loff_t *out_offset, size_t count);
The intent of the system call is fairly clear: copy count bytes
from the input file to the output. That is not stated explicitly, but it is
implicit in the patch that the two files should be on the same filesystem.
Inside the kernel, a new copy_range() member is added to the
file_operations structure; each filesystem is meant to implement
that operation to provide a fast copy operation. There is no fallback at
the VFS layer if copy_range() is unavailable, but that looks like
the sort of omission that would be fixed before mainline merging. Whether
merging will ever happen remains to be seen; this is an area that is
littered with abandoned code from previous failed attempts.
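As a rough illustration of how the new call might be used from user space, consider the following sketch; since copy_range() is not in the mainline, the syscall number and wrapper below are placeholders rather than a real interface:

    /*
     * Hypothetical sketch only: copy_range() is not in the mainline, so there
     * is no glibc wrapper or official syscall number.  __NR_copy_range is a
     * placeholder; a kernel carrying the patch would assign the real one.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define __NR_copy_range 9999	/* placeholder, not a real number */

    static int copy_range(int in_fd, loff_t *in_off,
			  int out_fd, loff_t *out_off, size_t count)
    {
	return syscall(__NR_copy_range, in_fd, in_off, out_fd, out_off, count);
    }

    int main(int argc, char **argv)
    {
	loff_t in_off = 0, out_off = 0;
	int in_fd, out_fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s SRC DST BYTES\n", argv[0]);
		return 1;
	}
	in_fd = open(argv[1], O_RDONLY);
	out_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (in_fd < 0 || out_fd < 0) {
		perror("open");
		return 1;
	}
	/* Ask the kernel (and, ultimately, the filesystem) to copy the bytes
	   without routing them through user space; both files are assumed to
	   live on the same filesystem. */
	if (copy_range(in_fd, &in_off, out_fd, &out_off, atoll(argv[3])) < 0) {
		perror("copy_range");
		return 1;
	}
	return 0;
    }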
Kernel development news
By Jonathan Corbet
May 12, 2013
By the time Linus
announced the 3.10-rc1 kernel, he had pulled just
short of 12,000 non-merge changesets into the mainline kernel. That makes
3.10 the busiest merge window ever, by over 1,000 patches. The list of
changes merged since
the previous 3.10 merge
window summary is relatively short, but it includes some significant
work. The most important of those changes are:
- The bcache caching layer has been
merged. Bcache allows a fast device (like an SSD) to act as a cache
in front of a slower device; it is designed for good performance given
the constraints of contemporary solid-state
devices. See Documentation/bcache.txt
for more information.
- The on-disk representation of extents in Btrfs has been changed to
make the structure significantly smaller. "In practice this
results in a 30-35% decrease in the size of our extent tree, which
means we COW less and can keep more of the extent tree in memory which
makes our heavy metadata operations go much faster." It is an
incompatible format change that must be explicitly enabled when the
filesystem is created (or after the fact with btrfstune).
- The MIPS architecture has gained basic support for virtualization with
KVM. MIPS kernels can also now be built using the new "microMIPS"
instruction set, with significant space savings.
- New hardware support includes
Abilis TB10x processors,
Freescale ColdFire 537x processors,
Freescale M5373EVB boards,
Broadcom BCM6362 processors,
Ralink RT2880, RT3883, and MT7620 processors, and
Armada 370/XP thermal management controllers.
Changes visible to kernel developers include:
- The block layer has gained basic power management support; it is
primarily intended to control which I/O requests can pass through to
a device while it is suspending or resuming. To that end,
power-management-related requests should be marked with the new
__REQ_PM flag.
- A lot of work has gone into the block layer in preparation for
"immutable biovecs," a reimplementation of the low-level structure
used to represent ranges of blocks for I/O operations. One of the key
advantages here seems to be that it becomes possible to create a new
biovec that contains a subrange of an existing biovec, leading to fast
and efficient request splitting. The completion of this work will
presumably show up in 3.11.
- The dedicated thread pool implementation used to implement writeback
in the memory management subsystem has been replaced by a workqueue.
If this development cycle follows the usual pattern, the final 3.10 kernel
release can be expected in early July. Between now and then, though, there
will certainly be a lot of bugs to fix.
By Jonathan Corbet
May 14, 2013
One of the kernel's core functions is the management of caching; by
maintaining caches at various levels, the kernel is able to improve
performance significantly. But caches cannot be allowed to grow without
bound or they will eventually consume all of memory. The kernel's answer
to this problem is the "shrinker" interface, a mechanism by which the
memory management subsystem can request that cached items be discarded and
their memory freed for other uses. One of the recurring topics at the
2013 Linux Storage, Filesystem, and Memory
Management Summit was the need to improve the shrinker interface. The
proposed replacement is out for review, so
it seems like time for a closer look.
A new shrinker API
In current kernels, a cache would implement a shrinker function that
adheres to this interface:
    #include <linux/shrinker.h>

    struct shrink_control {
	gfp_t gfp_mask;
	unsigned long nr_to_scan;
    };

    int (*shrink)(struct shrinker *s, struct shrink_control *sc);
The shrink() function is packaged up inside a shrinker
structure (along with some ancillary information); the whole thing is then
registered with a call to register_shrinker().
When memory gets tight, the shrink() function will be called from
the memory management subsystem. The gfp_mask will reflect the
type of allocation that was being attempted when the shrink() call
was made; the shrinker should avoid any actions that contradict that mask.
So, for example, if a GFP_NOFS allocation is in progress, a
filesystem shrinker cannot initiate filesystem activity to free memory.
The nr_to_scan field tells the shrinker how many objects it should
examine and free if possible; if, however, nr_to_scan is zero, the
call is really a
request to know how many objects currently exist in the cache.
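As a concrete (and purely hypothetical) illustration, an old-style shrinker for an imaginary object cache might look like the sketch below; my_cache_count, my_cache_lock, and my_cache_prune() are invented names, not code from any real subsystem:

    /* Illustrative sketch of an old-style shrinker for a hypothetical cache. */
    static int my_cache_shrink(struct shrinker *s, struct shrink_control *sc)
    {
	if (sc->nr_to_scan == 0)
		/* Just a query: report how many objects could be freed. */
		return my_cache_count;

	/* Respect the allocation context of the caller. */
	if (!(sc->gfp_mask & __GFP_FS))
		return -1;	/* cannot free without filesystem activity */

	spin_lock(&my_cache_lock);
	/* Walk the cache's own LRU list, freeing up to nr_to_scan items. */
	my_cache_prune(sc->nr_to_scan);
	spin_unlock(&my_cache_lock);

	return my_cache_count;	/* objects remaining after the scan */
    }

    static struct shrinker my_cache_shrinker = {
	.shrink = my_cache_shrink,
	.seeks = DEFAULT_SEEKS,
    };
    /* ... register_shrinker(&my_cache_shrinker) at initialization time ... */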
The use of a single callback function for two purposes (counting objects
and freeing them) irks some developers; it also makes the interface harder
to implement. So, one of the first steps in the new shrinker patch set is
to redefine the shrinker API to look like this:
    long (*count_objects)(struct shrinker *s, struct shrink_control *sc);
    long (*scan_objects)(struct shrinker *s, struct shrink_control *sc);
The roughly two-dozen shrinker implementations in the kernel have been
updated to use this new API.
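Converted to the split API, the same imaginary cache might look like the following sketch; the semantics assumed here are that count_objects() reports how many objects could be freed, while scan_objects() frees up to nr_to_scan of them and returns the number actually freed:

    /* Sketch only: the my_cache_* names are hypothetical, as above. */
    static long my_cache_count_objects(struct shrinker *s,
				       struct shrink_control *sc)
    {
	return my_cache_count;
    }

    static long my_cache_scan_objects(struct shrinker *s,
				      struct shrink_control *sc)
    {
	long freed;

	if (!(sc->gfp_mask & __GFP_FS))
		return -1;	/* assumption: same "cannot proceed" convention
				   as the old API */

	spin_lock(&my_cache_lock);
	freed = my_cache_prune(sc->nr_to_scan);
	spin_unlock(&my_cache_lock);
	return freed;
    }

    static struct shrinker my_cache_shrinker = {
	.count_objects = my_cache_count_objects,
	.scan_objects = my_cache_scan_objects,
	.seeks = DEFAULT_SEEKS,
    };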
The current shrinker API is not NUMA-aware. In an effort to improve that
situation, the shrink_control structure has been augmented with
a new field:
    nodemask_t nodes_to_scan;
On NUMA systems, memory pressure is often not a global phenomenon.
Instead, some nodes will have plenty of free memory while others are
running low. The current shrinker interface will indiscriminately free
memory objects; it pays no attention to which NUMA node any given object is
local to. As a result, it can
dump a lot of cached data without necessarily helping to address the real
problem. In the new scheme, shrinkers should observe the
nodes_to_scan field and only free memory from the indicated NUMA
nodes.
LRU lists
A maintainer of an existing shrinker implementation may well look at the
new NUMA awareness requirement with dismay. Most shrinker implementations
are buried deep within filesystems and certain drivers; these subsystems do
not normally track their cached items by which NUMA node holds them. So it
appears that shrinker implementations could get more complicated, but that
turns out not to be the case.
While looking at the shrinker code, Dave Chinner realized that most
implementations look very much the same: they maintain a
least-recently-used (LRU) list of cached items. When the shrinker is
called, a pass is made over the list in an attempt to satisfy the request.
Much of that code looked well suited for a generic replacement; that
replacement, in the form of a new type of linked list, is part of the
larger shrinker patch set.
The resulting "LRU list" data structure encapsulates a lot of the details
of object cache management; it goes well beyond a simple ordered list.
Internally, it is represented by a set of regular list_head
structures (one per node), a set of per-node object counts, and per-node
spinlocks to control access. The inclusion of the spinlock puts the LRU
list at odds with normal kernel conventions: low-level data structures do
not usually include their own locking mechanism, since that locking is
often more efficiently done at a higher level. In this case, putting the
lock in the data structure allows it to provide per-node locking without
the need for NUMA awareness in higher-level callers.
The basic API for the management of LRU lists is pretty much as one might
expect:
    #include <linux/list_lru.h>

    int list_lru_init(struct list_lru *lru);
    int list_lru_add(struct list_lru *lru, struct list_head *item);
    int list_lru_del(struct list_lru *lru, struct list_head *item);
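As a sketch of how a cache might use just these calls (the my_object structure and the surrounding code are hypothetical, for illustration only):

    /* Hypothetical cached object; the lru field links it into the list. */
    struct my_object {
	struct list_head lru;
	/* ... cached data ... */
    };

    static struct list_lru my_object_lru;

    /* Once, at initialization time: */
    list_lru_init(&my_object_lru);

    /* When an object becomes unused, make it a reclaim candidate; the LRU
       list code works out which NUMA node it lives on. */
    list_lru_add(&my_object_lru, &obj->lru);

    /* When the object is referenced again, take it back off the list. */
    list_lru_del(&my_object_lru, &obj->lru);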
A count of the number of items on a list can be had with
list_lru_count(). There is also a mechanism for walking through
an LRU list that is aimed at the needs of shrinker implementations:
    unsigned long list_lru_walk(struct list_lru *lru,
				list_lru_walk_cb isolate,
				void *cb_arg,
				unsigned long nr_to_walk);

    unsigned long list_lru_walk_nodemask(struct list_lru *lru,
					 list_lru_walk_cb isolate,
					 void *cb_arg,
					 unsigned long nr_to_walk,
					 nodemask_t *nodes_to_walk);
Either function will wander through the list, calling the
isolate() callback and, possibly, modifying the list in
response to the callback's return value. As one would expect,
list_lru_walk() will pass through the entire LRU list, while
list_lru_walk_nodemask() limits itself to the specified
nodes_to_walk. The callback's prototype looks like this:
    typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
						spinlock_t *lock,
						void *cb_arg);
Here, item is an item from the list to be examined, lock
is the spinlock controlling access to the list, and cb_arg is
specified by the original caller. The return value can be one of four
possibilities, depending on how the callback deals with the given
item:
- LRU_REMOVED indicates that the callback removed the item from
the list; the number of items on the list will be decremented
accordingly. In this case, the callback does the actual removal of
the item.
- LRU_ROTATE says that the given item should be moved to the
("most recently used") end of the list. The LRU list code will
perform the move operation.
- LRU_RETRY indicates that the callback should be called again
with the same item. A second LRU_RETRY return will cause the
item to be skipped. A potential use for this return value is if the
callback notices a potential deadlock situation.
- LRU_SKIP causes the item to be passed over with no changes.
With this infrastructure in place, a lot of shrinker implementations come
down to a call to list_lru_walk_nodemask() and a callback to
process individual items.
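Continuing the hypothetical my_object example, such an implementation might look roughly like the sketch below, which moves reclaimable items to a private dispose list and frees them after the walk, outside the per-node lock:

    /* Isolate callback: called for each item with the per-node lock held.
       Freeable objects are moved to a dispose list passed in via cb_arg. */
    static enum lru_status my_object_isolate(struct list_head *item,
					     spinlock_t *lock, void *cb_arg)
    {
	struct list_head *dispose = cb_arg;
	struct my_object *obj = container_of(item, struct my_object, lru);

	if (my_object_is_busy(obj))
		return LRU_SKIP;	/* leave it where it is */

	/* LRU_REMOVED: the callback itself takes the item off the list. */
	list_move(&obj->lru, dispose);
	return LRU_REMOVED;
    }

    static long my_object_scan_objects(struct shrinker *s,
				       struct shrink_control *sc)
    {
	LIST_HEAD(dispose);
	struct my_object *obj, *next;
	unsigned long freed;

	/* Only visit the NUMA nodes the memory management code cares about. */
	freed = list_lru_walk_nodemask(&my_object_lru, my_object_isolate,
				       &dispose, sc->nr_to_scan,
				       &sc->nodes_to_scan);

	list_for_each_entry_safe(obj, next, &dispose, lru)
		my_object_free(obj);
	return freed;
    }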
Memcg-aware LRU lists
While an improved shrinker interface is well worth the effort on its own,
much of the work described above has been driven by an additional need: better
support for memory control groups (memcgs). In particular, memcg developer
Glauber Costa would like to be able to use the shrinker mechanism to free
only memory that is associated with a given memcg. All that is needed to
reach this goal is to expand the LRU list concept to include memcg
awareness along with NUMA node awareness.
The result is a significant reworking of the LRU list API. What started as
a simple list with some helper functions has now become a two-dimensional
array of lists, indexed by node and memcg ID. A call to
list_lru_add() will now determine which memcg the item belongs to
and put it onto the relevant sublist. There is a new function —
list_lru_walk_nodemask_memcg() — that will walk through an LRU
list, picking out only the elements found on the given node(s) and
belonging to the given memcg. The more generic functions described above
have been reimplemented as wrappers around the memcg-specific versions.
At this point, the "LRU list" is no longer a
generic data structure (though one could still use it that way); it is,
instead, a core component of the memory management subsystem.
Closing notes
A review of the current shrinker implementations in the kernel reveals that
not all of them manage simple object caches. In many cases, what is
happening is that the code in question wanted a way to know when the system
is under memory pressure. In current kernels, the only way to get that
information is to register a shrinker and see when it gets called. Such
uses are frowned upon; they end up putting marginally related code into the
memory reclaim path.
The shrinker patch set seeks to eliminate those users by providing a
different mechanism for code that wants to learn about memory pressure. It
essentially hooks into the vmpressure
mechanism to set up an in-kernel notification mechanism, albeit one
that does not use the kernel's usual notifier infrastructure. Interested
code can call:
    int vmpressure_register_kernel_event(struct cgroup *cg, void (*fn)(void));
The given fn() will be called at the same time that pressure
notifications are sent out to user space. The concept of "pressure levels"
has not been implemented for the kernel-side interface, though.
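For code that only needs to know when memory is getting tight, usage would presumably look something like this sketch; the callback and the cgroup argument semantics are assumptions based on the prototype above, since none of this is in the mainline:

    /* Hypothetical consumer of the proposed in-kernel vmpressure events;
       my_cache_trim() is a made-up stand-in for whatever the subsystem does
       to shed cached data. */
    static void my_cache_pressure_event(void)
    {
	/* No pressure level is passed in; just drop some cached data. */
	my_cache_trim();
    }

    static int __init my_cache_init(void)
    {
	/* Assumption: passing NULL requests notification for global
	   (root cgroup) memory pressure. */
	return vmpressure_register_kernel_event(NULL, my_cache_pressure_event);
    }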
Most of this code is relatively new, and it touches a fair amount of core
memory management code. The latter stages of the patch set, where memcg
awareness is added, could be controversial, but, then, it could be that
developers have resigned themselves to memcg code being invasive and
expensive. One way or another, most or all of this code will probably find
its way into the mainline; the benefits of the shrinker API improvements
will be nice to have. But the path to the mainline could be long, and this
patch set has just begun, so it may be a while before it is merged.
By Jonathan Corbet
May 14, 2013
Page fault handling is normally the kernel's responsibility. When a
process attempts to access an address that is not currently mapped to a
location in RAM, the kernel responds by mapping a page to that location
and, if needed, filling that page with data from secondary storage. But
what if that data is not in a location that is easily reachable by the
kernel? Then, perhaps, it's time to outsource the responsibility for
handling the fault to user space.
One situation where user-space page fault handling can be useful is for the
live migration of virtual machines from one physical host to another.
Migration can be done by stopping the machine, copying its full address
space to the new host, and restarting the machine there. But address
spaces may be large and sparsely used; copying a full address space can
result in a lot of unnecessary work and a noticeable pause
before the migrated system restarts. If, instead, the virtual machine's
address space could be demand-paged from the old host to the new, it could
restart more quickly and the copying of unused data could be avoided.
Live migration with KVM is currently managed with an
out-of-tree char device. This scheme works, but, once the device takes
over a range of memory, that memory is removed from the memory management
subsystem. So it cannot be swapped out, transparent huge pages don't work,
and so on. Clearly it would be better to come up with a solution that,
while allowing user-space handling of demand paging, does not remove the
affected memory from the kernel's management altogether. A patch set recently posted by Andrea Arcangeli
aims to resolve those issues with a couple of new system call options.
The first of those is to extend the madvise() system call, adding
a new command called MADV_USERFAULT. Processes can use this
operation to tell the kernel that user space will handle page faults on a
range of memory. After this call, any access to an unmapped address in the
given range will result in a SIGBUS signal; the process is then
expected to respond by mapping a real page into the unmapped space as
described below. The madvise(MADV_USERFAULT) call should be made
immediately after the memory range is created; user-space fault handling
will not work if the kernel handles any page faults before it is told that
user space will be doing the job.
The SIGBUS signal handler's job is to handle the page fault by
mapping a real page to the faulting address. That can be done in current
kernels with the mremap() system call. The problem with
mremap() is that it works by splitting the virtual memory area
(VMA) structure used to describe the memory range within the kernel.
Frequent mremap() calls will result in the kernel having to manage
a large number of VMAs, which is an expensive proposition. mremap() will
also happily overwrite existing memory mappings, making it harder to detect
errors (or race conditions) in user-space handlers. For these reasons,
mremap() is not an ideal solution to the problem.
Andrea's answer to this problem is a new system call:
    int remap_anon_pages(void *dest, void *source, unsigned long len);
This call will cause the len bytes of memory starting at
source to be mapped into the process's address space starting at
dest. At the same time, the source memory range will be
unmapped — the pages previously found there will be atomically moved to the
dest range.
Andrea has posted a small test program that
demonstrates how these APIs are meant to be used.
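In outline, the intended flow looks something like the sketch below; since neither MADV_USERFAULT nor remap_anon_pages() exists in mainline kernels, the constants used here are placeholders and the whole program is illustrative only:

    /* Sketch of the intended flow; MADV_USERFAULT and the syscall number
       below are placeholders, not values from the actual patch set. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define MADV_USERFAULT		0x1000	/* placeholder value */
    #define __NR_remap_anon_pages	9998	/* placeholder value */

    static long page_size;
    static void *staging;	/* one always-mapped page used as the source */

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
	void *fault_page = (void *)((unsigned long)si->si_addr &
				    ~(page_size - 1));

	/* Fill the staging page (e.g. with data fetched from the migration
	   source), then atomically move it to the faulting address. */
	memset(staging, 0, page_size);
	syscall(__NR_remap_anon_pages, fault_page, staging, page_size);

	/* Create a fresh staging page for the next fault. */
	staging = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    int main(void)
    {
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
				.sa_flags = SA_SIGINFO };
	size_t len = 1024 * 1024;
	char *area;

	page_size = sysconf(_SC_PAGESIZE);
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	staging = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	area = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Tell the kernel, before any fault, that user space handles faults
	   in this range. */
	madvise(area, len, MADV_USERFAULT);

	area[0] = 1;	/* first touch raises SIGBUS -> handler above */
	return 0;
    }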
As one might expect, some restrictions apply:
source and dest must be page-aligned, len should
be a multiple of the page size, the dest range must be completely
unmapped, and the source range must be fully mapped. The mapping
requirements exist to catch bugs in user-space fault handlers; remapping
pages on top of existing memory has a high risk of causing memory
corruption.
One nice feature of the patch set is that, on systems where transparent huge pages are enabled, huge pages
can be remapped with remap_anon_pages() without the need to split
them apart. For that to work, of course, the length and alignment of the
range to move must be compatible with huge pages.
There are a number of limitations in the current patch set. The
MADV_USERFAULT option can only be used on anonymous (swap-backed)
memory areas. A more complete implementation could conceivably support
this feature for file-backed pages as well. The mechanism offers support
for demand paging of data into RAM, but there is no user-controllable
mechanism for pushing data back out; instead, those pages are swapped out
along with all other anonymous pages. So it is not a complete user-space paging
mechanism; it's more of a hook for loading the initial contents of
anonymous pages from an outside source.
But, even with those limitations, the feature is useful for the intended
virtualization use case. Andrea suggests it could possibly have other uses
as well; remote RAM applications come to mind. First, though, it needs to
get into the mainline, and that, in turn, suggests that the proposed ABI
needs to be reviewed carefully. Thus far, this patch set has not gotten a
lot of review attention; that will need to change before it can be
considered for mainline merging.