Kernel development

Brief items

Kernel release status

The current development kernel is 3.10-rc1, released on May 11. All told, nearly 12,000 changesets were pulled into the mainline during the merge window, making it the busiest merge window ever. See the separate article below for a summary of the final changes merged for the 3.10 development cycle.

Stable updates: 3.9.2, 3.8.13, 3.4.45, and 3.0.78 were released on May 11; 3.2.45 came out on May 14.

In the 3.8.13 announcement Greg Kroah-Hartman said: "NOTE, this is the LAST 3.8.y kernel release, please move to the 3.9.y kernel series at this time. It is end-of-life, dead, gone, buried, and put way behind us never to be spoken of again. Seriously, move on, it's just not worth it anymore." But the folks at Canonical, having shipped 3.8 in the Ubuntu 13.04 release, are not moving on; they have announced support for this kernel until August 2014.

Quotes of the week

The amount of broken code I just encountered is mind boggling. I've added comments explaining what is broken, but I fear that some of the code would be best dealt with by being dragged behind the bike shed, burying in mud up to it's neck and then run over repeatedly with a blunt lawn mower.
Dave Chinner: not impressed by driver shrinker code

choice
	prompt "BogoMIPs setting"
	default BOGOMIPS_MEDIUM
	help
	  The BogoMIPs value reported by Linux is exactly what it sounds
	  like: totally bogus. It is used to calibrate the delay loop,
	  which may be backed by a timer clocked completely independently
	  of the CPU.

	  Unfortunately, that doesn't stop marketing types (and even people
	  who should know better) from using the number to compare machines
	  and then screaming if it's less than some fictitious, expected
	  value.

	  So, this option can be used to avoid the inevitable amount of
	  pain and suffering you will endure when the chaps described above
	  start parsing /proc/cpuinfo.

	config BOGOMIPS_SLOW
		bool "Slow (older machines)"
		help
		  If you're comparing a faster machine with a slower machine,
		  then you might want this option selected on one of them.

	config BOGOMIPS_MEDIUM
		bool "Medium (default)"
		help
		  A BogoMIPS value for the masses.

	config BOGOMIPS_FAST
		bool "Fast (marketing)"
		help
		  Some people believe that software runs faster with this
		  setting so, if you're one of them, say Y here.

	config BOGOMIPS_RANDOM
		bool "Random (increased Bogosity)"
		help
		  Putting the Bogo back into BogoMIPs.
Will Deacon

Extended stable support for the 3.8 kernel

Canonical has announced that the Ubuntu kernel team will be providing stable updates for the 3.8 kernel now that Greg Kroah-Hartman has moved on. This support will last as long as support for the Ubuntu 13.04 release: through August 2014. "We welcome any feedback and contribution to this effort. We will be posting the first review cycle patch set in a week or two."

copy_range()

By Jonathan Corbet
May 15, 2013
Copying a file is a common operation on any system. Some filesystems have the ability to accelerate copy operations considerably; for example, Btrfs can just add another set of copy-on-write references to the file data, and the NFS protocol allows a client to request that a copy be done on the server, avoiding moving the data over the net twice. But, for the most part, copying is still done the old-fashioned way, with the most sophisticated applications possibly using splice().

There have been various proposals over the years for ways to speed up copy operations (reflink(), for example), but nothing has ever made it into the mainline. The latest attempt is Zach Brown's copy_range() patch. It adds a new system call:

    int copy_range(int in_fd, loff_t *in_offset,
		   int out_fd, loff_t *out_offset, size_t count);

The intent of the system call is fairly clear: copy count bytes from the input file to the output file. Nothing says so explicitly, but it is implicit in the patch that the two files should reside on the same filesystem.
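
For the curious, here is a minimal sketch of how an application might invoke the proposed system call; there is no C library wrapper, so the call must go through syscall(), and the __NR_copy_range number used below is purely a placeholder since no number has been assigned:

    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define __NR_copy_range 350	/* placeholder; no number has been assigned */

    static int copy_range(int in_fd, loff_t *in_off,
			  int out_fd, loff_t *out_off, size_t count)
    {
	return syscall(__NR_copy_range, in_fd, in_off, out_fd, out_off, count);
    }

    int main(void)
    {
	int in = open("source", O_RDONLY);
	int out = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	loff_t in_off = 0, out_off = 0;

	/* Both files are assumed to live on the same filesystem. */
	if (copy_range(in, &in_off, out, &out_off, 1024 * 1024) < 0)
		perror("copy_range");
	return 0;
    }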

Inside the kernel, a new copy_range() member is added to the file_operations structure; each filesystem is meant to implement that operation to provide a fast copy operation. There is no fallback at the VFS layer if copy_range() is unavailable, but that looks like the sort of omission that would be fixed before mainline merging. Whether merging will ever happen remains to be seen; this is an area that is littered with abandoned code from previous failed attempts.

Kernel development news

The conclusion of the 3.10 merge window

By Jonathan Corbet
May 12, 2013
By the time Linus announced the 3.10-rc1 kernel, he had pulled just short of 12,000 non-merge changesets into the mainline kernel. That makes 3.10 the busiest merge window ever, by over 1,000 patches. The list of changes merged since the previous 3.10 merge window summary is relatively small, but it includes some significant changes. The most significant of those changes are:

  • The bcache caching layer has been merged. Bcache allows a fast device (an SSD, for example) to act as a cache in front of a slower device; it is designed to perform well given the constraints of contemporary solid-state devices. See Documentation/bcache.txt for more information.

  • The on-disk representation of extents in Btrfs has been changed to make the structure significantly smaller. "In practice this results in a 30-35% decrease in the size of our extent tree, which means we COW less and can keep more of the extent tree in memory which makes our heavy metadata operations go much faster." It is an incompatible format change that must be explicitly enabled when the filesystem is created (or after the fact with btrfstune).

  • The MIPS architecture has gained basic support for virtualization with KVM. MIPS kernels can also now be built using the new "microMIPS" instruction set, with significant space savings.

  • New hardware support includes Abilis TB10x processors, Freescale ColdFire 537x processors, Freescale M5373EVB boards, Broadcom BCM6362 processors, Ralink RT2880, RT3883, and MT7620 processors, and Armada 370/XP thermal management controllers.

Changes visible to kernel developers include:

  • The block layer has gained basic power management support; it is primarily intended to control which I/O requests can pass through to a device while it is suspending or resuming. To that end, power-management-related requests should be marked with the new __REQ_PM flag.

  • A lot of work has gone into the block layer in preparation for "immutable biovecs," a reimplementation of the low-level structure used to represent ranges of blocks for I/O operations. One of the key advantages here seems to be that it becomes possible to create a new biovec that contains a subrange of an existing biovec, leading to fast and efficient request splitting. The completion of this work will presumably show up in 3.11.

  • The dedicated thread pool implementation used to implement writeback in the memory management subsystem has been replaced by a workqueue.

If this development cycle follows the usual pattern, the final 3.10 kernel release can be expected in early July. Between now and then, though, there will certainly be a lot of bugs to fix.

Smarter shrinkers

By Jonathan Corbet
May 14, 2013
One of the kernel's core functions is the management of caching; by maintaining caches at various levels, the kernel is able to improve performance significantly. But caches cannot be allowed to grow without bound or they will eventually consume all of memory. The kernel's answer to this problem is the "shrinker" interface, a mechanism by which the memory management subsystem can request that cached items be discarded and their memory freed for other uses. One of the recurring topics at the 2013 Linux Storage, Filesystem, and Memory Management Summit was the need to improve the shrinker interface. The proposed replacement is out for review, so it seems like time for a closer look.

A new shrinker API

In current kernels, a cache implements a shrinker function that adheres to this interface:

    #include <linux/shrinker.h>

    struct shrink_control {
	gfp_t gfp_mask;
	unsigned long nr_to_scan;
    };

    int (*shrink)(struct shrinker *s, struct shrink_control *sc);

The shrink() function is packaged up inside a shrinker structure (along with some ancillary information); the whole thing is then registered with a call to register_shrinker().

When memory gets tight, the shrink() function will be called from the memory management subsystem. The gfp_mask will reflect the type of allocation that was being attempted when the shrink() call was made; the shrinker should avoid any actions that contradict that mask. So, for example, if a GFP_NOFS allocation is in progress, a filesystem shrinker cannot initiate filesystem activity to free memory. The nr_to_scan field tells the shrinker how many objects it should examine and free if possible; if, however, nr_to_scan is zero, the call is really a request to know how many objects currently exist in the cache.
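
As an illustration, a shrinker for a hypothetical object cache might look something like the sketch below; my_cache_count() and my_cache_free_some() are invented stand-ins for whatever bookkeeping a real cache does:

    #include <linux/shrinker.h>

    /* Hypothetical helpers standing in for a real object cache: */
    static unsigned long my_cache_count(void);
    static unsigned long my_cache_free_some(unsigned long nr);

    static int my_cache_shrink(struct shrinker *s, struct shrink_control *sc)
    {
	/* A zero nr_to_scan is really a query for the cache size. */
	if (sc->nr_to_scan == 0)
		return my_cache_count();

	/* Respect the allocation context: no filesystem activity under GFP_NOFS. */
	if (!(sc->gfp_mask & __GFP_FS))
		return -1;

	my_cache_free_some(sc->nr_to_scan);
	return my_cache_count();
    }

    static struct shrinker my_cache_shrinker = {
	.shrink	= my_cache_shrink,
	.seeks	= DEFAULT_SEEKS,
    };

    /* Registered once at initialization time with register_shrinker(&my_cache_shrinker). */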

The use of a single callback function for two purposes (counting objects and freeing them) irks some developers; it also makes the interface harder to implement. So, one of the first steps in the new shrinker patch set is to redefine the shrinker API to look like this:

    long (*count_objects)(struct shrinker *s, struct shrink_control *sc);
    long (*scan_objects)(struct shrinker *s, struct shrink_control *sc);

The roughly two-dozen shrinker implementations in the kernel have been updated to use this new API.
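
Converted to the proposed interface, the hypothetical shrinker shown earlier splits into its two halves; as before, my_cache_count() and my_cache_free_some() are invented helpers, and the .count_objects and .scan_objects fields exist only with the patch set applied:

    static long my_cache_count_objects(struct shrinker *s,
				       struct shrink_control *sc)
    {
	return my_cache_count();
    }

    static long my_cache_scan_objects(struct shrinker *s,
				      struct shrink_control *sc)
    {
	if (!(sc->gfp_mask & __GFP_FS))
		return 0;	/* nothing can be freed in this context */
	return my_cache_free_some(sc->nr_to_scan);
    }

    static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count_objects,
	.scan_objects	= my_cache_scan_objects,
	.seeks		= DEFAULT_SEEKS,
    };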

The current shrinker API is not NUMA-aware. In an effort to improve that situation, the shrink_control structure has been augmented with a new field:

    nodemask_t nodes_to_scan;

On NUMA systems, memory pressure is often not a global phenomenon. Instead, some nodes will have plenty of free memory while others are running low. The current shrinker interface will indiscriminately free memory objects; it pays no attention to which NUMA node any given object is local to. As a result, it can dump a lot of cached data without necessarily helping to address the real problem. In the new scheme, shrinkers should observe the nodes_to_scan field and only free memory from the indicated NUMA nodes.

LRU lists

A maintainer of an existing shrinker implementation may well look at the new NUMA awareness requirement with dismay. Most shrinker implementations are buried deep within filesystems and certain drivers; these subsystems do not normally track their cached items by which NUMA node holds them. So it appears that shrinker implementations could get more complicated, but that turns out not to be the case.

While looking at the shrinker code, Dave Chinner realized that most implementations look very much the same: they maintain a least-recently-used (LRU) list of cached items. When the shrinker is called, a pass is made over the list in an attempt to satisfy the request. Much of that code looked well suited for a generic replacement; that replacement, in the form of a new type of linked list, is part of the larger shrinker patch set.

The resulting "LRU list" data structure encapsulates a lot of the details of object cache management; it goes well beyond a simple ordered list. Internally, it is represented by a set of regular list_head structures (one per node), a set of per-node object counts, and per-node spinlocks to control access. The inclusion of the spinlock puts the LRU list at odds with normal kernel conventions: low-level data structures do not usually include their own locking mechanism, since that locking is often more efficiently done at a higher level. In this case, putting the lock in the data structure allows it to provide per-node locking without the need for NUMA awareness in higher-level callers.

The basic API for the management of LRU lists is pretty much as one might expect:

    #include <linux/list_lru.h>

    int list_lru_init(struct list_lru *lru);
    int list_lru_add(struct list_lru *lru, struct list_head *item);
    int list_lru_del(struct list_lru *lru, struct list_head *item);

A count of the number of items on a list can be had with list_lru_count(). There is also a mechanism for walking through an LRU list that is aimed at the needs of shrinker implementations:

    unsigned long list_lru_walk(struct list_lru *lru,
				list_lru_walk_cb isolate,
				void *cb_arg,
				unsigned long nr_to_walk);
    unsigned long list_lru_walk_nodemask(struct list_lru *lru,
				list_lru_walk_cb isolate,
				void *cb_arg,
				unsigned long nr_to_walk,
				nodemask_t *nodes_to_walk);

Either function will wander through the list, calling the isolate() callback and, possibly, modifying the list in response to the callback's return value. As one would expect, list_lru_walk() will pass through the entire LRU list, while list_lru_walk_nodemask() limits itself to the specified nodes_to_walk. The callback's prototype looks like this:

    typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
						spinlock_t *lock,
						void *cb_arg);

Here, item is an item from the list to be examined, lock is the spinlock controlling access to the list, and cb_arg is specified by the original caller. The return value can be one of four possibilities, depending on how the callback deals with the given item:

  • LRU_REMOVED indicates that the callback removed the item from the list; the number of items on the list will be decremented accordingly. In this case, the callback does the actual removal of the item.

  • LRU_ROTATE says that the given item should be moved to the ("most recently used") end of the list. The LRU list code will perform the move operation.

  • LRU_RETRY indicates that the callback should be called again with the same item. A second LRU_RETRY return will cause the item to be skipped. One use for this return value is a callback that notices a potential deadlock situation.

  • LRU_SKIP causes the item to be passed over with no changes.

With this infrastructure in place, a lot of shrinker implementations come down to a call to list_lru_walk_nodemask() and a callback to process individual items.
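
To make that concrete, the scan_objects() callback of the hypothetical shrinker above might be reduced to something like the following sketch; struct my_object and my_objects_free() are invented for illustration, and isolating objects onto a private list before freeing them avoids doing the actual freeing while the per-node list locks are held:

    #include <linux/list.h>
    #include <linux/list_lru.h>
    #include <linux/shrinker.h>

    struct my_object {
	struct list_head lru;		/* links the object into the list_lru */
	/* ... cached data ... */
    };

    static struct list_lru my_cache_lru;	/* set up with list_lru_init() */
    static void my_objects_free(struct list_head *dispose);	/* hypothetical */

    /* Isolate callback: move reclaimable items onto a private dispose list. */
    static enum lru_status my_isolate(struct list_head *item,
				      spinlock_t *lock, void *cb_arg)
    {
	struct list_head *dispose = cb_arg;

	list_move(item, dispose);	/* the callback itself removes the item */
	return LRU_REMOVED;
    }

    static long my_cache_scan_objects(struct shrinker *s,
				      struct shrink_control *sc)
    {
	LIST_HEAD(dispose);
	long removed;

	removed = list_lru_walk_nodemask(&my_cache_lru, my_isolate, &dispose,
					 sc->nr_to_scan, &sc->nodes_to_scan);
	/* Free the isolated objects after the walk, outside the list locks. */
	my_objects_free(&dispose);
	return removed;
    }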

Memcg-aware LRU lists

While an improved shrinker interface is well worth the effort on its own, much of the work described above has been driven by an additional need: better support for memory control groups (memcgs). In particular, memcg developer Glauber Costa would like to be able to use the shrinker mechanism to free only memory that is associated with a given memcg. All that is needed to reach this goal is to expand the LRU list concept to include memcg awareness along with NUMA node awareness.

The result is a significant reworking of the LRU list API. What started as a simple list with some helper functions has now become a two-dimensional array of lists, indexed by node and memcg ID. A call to list_lru_add() will now determine which memcg the item belongs to and put it onto the relevant sublist. There is a new function — list_lru_walk_nodemask_memcg() — that will walk through an LRU list, picking out only the elements found on the given node(s) and belonging to the given memcg. The more generic functions described above have been reimplemented as wrappers around the memcg-specific versions. At this point, the "LRU list" is no longer a generic data structure (though one could still use it that way); it is, instead, a core component of the memory management subsystem.

Closing notes

A review of the current shrinker implementations in the kernel reveals that not all of them manage simple object caches. In many cases, what is happening is that the code in question wanted a way to know when the system is under memory pressure. In current kernels, the only way to get that information is to register a shrinker and see when it gets called. Such uses are frowned upon; they end up putting marginally related code into the memory reclaim path.

The shrinker patch set seeks to eliminate those users by providing a different mechanism for code that wants to learn about memory pressure. It essentially hooks into the vmpressure mechanism to set up an in-kernel notification mechanism, albeit one that does not use the kernel's usual notifier infrastructure. Interested code can call:

    int vmpressure_register_kernel_event(struct cgroup *cg, void (*fn)(void));

The given fn() will be called at the same time that pressure notifications are sent out to user space. The concept of "pressure levels" has not been implemented for the kernel-side interface, though.
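
A kernel-side user might then look something like this sketch; my_drop_caches() is a made-up callback, the <linux/vmpressure.h> include is a guess at where the declaration would live, and the NULL cgroup argument assumes that a global (non-cgroup-specific) registration is permitted:

    #include <linux/init.h>
    #include <linux/vmpressure.h>

    /* Hypothetical callback invoked when memory pressure is signaled. */
    static void my_drop_caches(void)
    {
	/* Trim whatever non-essential cached state this subsystem keeps. */
    }

    static int __init my_subsystem_init(void)
    {
	/* NULL is assumed here to request a global (non-cgroup) registration. */
	return vmpressure_register_kernel_event(NULL, my_drop_caches);
    }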

Most of this code is relatively new, and it touches a fair amount of core memory management code. The latter stages of the patch set, where memcg awareness is added, could be controversial, but, then, it could be that developers have resigned themselves to memcg code being invasive and expensive. One way or another, most or all of this code will probably find its way into the mainline; the benefits of the shrinker API improvements will be nice to have. But the path to the mainline could be long, and this patch set has just begun, so it may be a while before it is merged.

User-space page fault handling

By Jonathan Corbet
May 14, 2013
Page fault handling is normally the kernel's responsibility. When a process attempts to access an address that is not currently mapped to a location in RAM, the kernel responds by mapping a page to that location and, if needed, filling that page with data from secondary storage. But what if that data is not in a location that is easily reachable by the kernel? Then, perhaps, it's time to outsource the responsibility for handling the fault to user space.

One situation where user-space page fault handling can be useful is for the live migration of virtual machines from one physical host to another. Migration can be done by stopping the machine, copying its full address space to the new host, and restarting the machine there. But address spaces may be large and sparsely used; copying a full address space can result in a lot of unnecessary work and a noticeable pause before the migrated system restarts. If, instead, the virtual machine's address space could be demand-paged from the old host to the new, it could restart more quickly and the copying of unused data could be avoided.

Live migration with KVM is currently managed with an out-of-tree char device. This scheme works, but, once the device takes over a range of memory, that memory is removed from the memory management subsystem. So it cannot be swapped out, transparent huge pages don't work, and so on. Clearly it would be better to come up with a solution that, while allowing user-space handling of demand paging, does not remove the affected memory from the kernel's management altogether. A patch set recently posted by Andrea Arcangeli aims to resolve those issues with a couple of new system call options.

The first of those is to extend the madvise() system call, adding a new command called MADV_USERFAULT. Processes can use this operation to tell the kernel that user space will handle page faults on a range of memory. After this call, any access to an unmapped address in the given range will result in a SIGBUS signal; the process is then expected to respond by mapping a real page into the unmapped space as described below. The madvise(MADV_USERFAULT) call should be made immediately after the memory range is created; user-space fault handling will not work if the kernel handles any page faults before it is told that user space will be doing the job.

The SIGBUS signal handler's job is to handle the page fault by mapping a real page to the faulting address. That can be done in current kernels with the mremap() system call. The problem with mremap() is that it works by splitting the virtual memory area (VMA) structure used to describe the memory range within the kernel. Frequent mremap() calls will result in the kernel having to manage a large number of VMAs, which is an expensive proposition. mremap() will also happily overwrite existing memory mappings, making it harder to detect errors (or race conditions) in user-space handlers. For these reasons, mremap() is not an ideal solution to the problem.

Andrea's answer to this problem is a new system call:

    int remap_anon_pages(void *dest, void *source, unsigned long len);

This call will cause the len bytes of memory starting at source to be mapped into the process's address space starting at dest. At the same time, the source memory range will be unmapped — the pages previously found there will be atomically moved to the dest range.

Andrea has posted a small test program that demonstrates how these APIs are meant to be used.

As one might expect, some restrictions apply: source and dest must be page-aligned, len must be a multiple of the page size, the dest range must be completely unmapped, and the source range must be fully mapped. The mapping requirements exist to catch bugs in user-space fault handlers; remapping pages on top of existing memory has a high risk of causing memory corruption.
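
Putting the pieces together, the intended flow might look something like the sketch below. Since neither the madvise() flag nor the new system call exists in mainline kernels, the MADV_USERFAULT and __NR_remap_anon_pages values are pure placeholders (on a mainline kernel the madvise() call will simply fail); Andrea's test program mentioned above shows the real usage.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    #define MADV_USERFAULT		18	/* placeholder value */
    #define __NR_remap_anon_pages	351	/* placeholder value */

    static long page_size;
    static char *region;		/* range handled from user space */
    static char *staging;		/* source page filled by the handler */

    static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
    {
	char *fault_page = (char *)((unsigned long)info->si_addr &
				    ~(page_size - 1));

	/* Fill the staging page (e.g. from the network), then move it in place. */
	memset(staging, 0x42, page_size);
	syscall(__NR_remap_anon_pages, fault_page, staging, page_size);
    }

    int main(void)
    {
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};

	page_size = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, 16 * page_size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	staging = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	sigaction(SIGBUS, &sa, NULL);
	/* Register immediately, before any fault touches the region. */
	madvise(region, 16 * page_size, MADV_USERFAULT);

	return region[0];	/* faults; the handler supplies the page */
    }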

One nice feature of the patch set is that, on systems where transparent huge pages are enabled, huge pages can be remapped with remap_anon_pages() without the need to split them apart. For that to work, of course, the length and alignment of the range to move must be compatible with huge pages.

There are a number of limitations in the current patch set. The MADV_USERFAULT option can only be used on anonymous (swap-backed) memory areas. A more complete implementation could conceivably support this feature for file-backed pages as well. The mechanism offers support for demand paging of data into RAM, but there is no user-controllable mechanism for pushing data back out; instead, those pages are swapped out along with all other anonymous pages. So it is not a complete user-space paging mechanism; it's more of a hook for loading the initial contents of anonymous pages from an outside source.

But, even with those limitations, the feature is useful for the intended virtualization use case. Andrea suggests it could possibly have other uses as well; remote RAM applications come to mind. First, though, it needs to get into the mainline, and that, in turn, suggests that the proposed ABI needs to be reviewed carefully. Thus far, this patch set has not gotten a lot of review attention; that will need to change before it can be considered for mainline merging.

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds