Kernel development
Brief items
Kernel release status
The current development kernel is 3.5-rc1, announced on June 3. "It's a pretty normal release - roughly 60% drivers, 20% arch updates, and 20% "all over" - filesystems, documentation, tools, you name it. [...] Depending on what your interest is, you might be excited about the CoDel packet scheduler, or about the GPU driver updates, or the new scsi targets. There is something in here for pretty much anybody."
Stable updates: the 3.0.33, 3.2.19, 3.3.8, and 3.4.1 stable kernels were all released on June 4; each contains a long list of fixes. 3.3.8 is the final update for the 3.3 series.
Quotes of the week
it would impudently suggest that I might conceive of you being wrong.
I sigh for your heavy burden.
Kernel development news
The conclusion of the 3.5 merge window
Linus closed the 3.5 merge window on June 2; the announcement for the 3.5 prepatch came one day later. Just under 1,000 patches were pulled into the mainline after the writing of last week's merge window summary, for a total of 9,534 for the merge window as a whole. Some of the more significant changes pulled at the end of the merge window include:
- The HFS filesystem has gained native language support (NLS)
capabilities, with codepages added for several languages.
- A process's directory under /proc now includes a
children file containing the IDs of its child processes.
- The kcmp() system call has been added. Its purpose is to
help user space checkpoint/restore utilities to determine whether two
processes share a given resource or not; see this article for a description of the
interface.
- Also for checkpoint/restore: the prctl() system call has
gained options to set the beginning and end of the argv and
environment areas and the executable file a process is running.
- The ext4 filesystem now has support for metadata checksumming, a
feature which should help to catch metadata corruption that might
otherwise escape notice. This feature requires an on-disk format
change (to store the checksum), so it must be turned on explicitly.
Once this feature is enabled, the filesystem can only be mounted
read-only on older kernels. See this
article for more information on the metadata checksum feature.
- A lot of changes have gone in to improve the handling of
system-specific features on Sony laptops.
- The new skew_tick= boot option controls whether the system
skews timer ticks on a per-CPU basis to minimize contention on the
xtime_lock lock. It defaults to off; turning it on can
reduce jitter on some workloads, but will also increase power
consumption.
- The "frontswap" mechanism, part of the transcendent memory family of
technologies, sneaked its way into the mainline just after the -rc1
release.
- The FUSE filesystem API has added an operation to implement the
fallocate() system call.
- Two new drivers were added at the end of the merge window, supporting AUO K1900 and K1901 e-paper display controllers and Emma Mobile STI timers.
Changes visible to kernel developers include:
- The task_work_add() function, useful for requesting that a
function be run in the context of a specific process, has been added.
See this article for a description of
the task_work_add() API.
- The SUNRPC code has a new utility function:
int svc_bind(struct svc_serv *serv, struct net *net);

Its purpose is to handle service registration in the given network context; svc_bind() is exported GPL-only.
- struct inode_operations has a new update_time() function whose job is to provide any needed special handling for changes to any of the file timestamps. The file_update_time() prototype has been changed: it now returns an int that can indicate that the operation failed. Failures to update the last-access time are now explicitly ignored; this is done to ensure that atime update failures don't make the filesystem unreadable. This work has been generalized out of the btrfs filesystem; see this article for a description of how atime updates can go wrong otherwise.
At this point, the addition of features for the 3.5 development cycle should be at an end unless something sneaks in before -rc2. If the usual pattern holds, the final 3.5 release can be expected right around the beginning of August.
Generic red-black trees
Red-black trees (or "rbtrees") are used throughout the kernel to maintain a sorted list of related items. For example, the block I/O subsystem uses an rbtree to maintain a list of outstanding requests sorted by sector number, and the scheduler has an rbtree of runnable tasks sorted by a quantity known as "virtual runtime." The rbtree interface was described in this 2006 article; it has since been extended, but the core features of the API remain the same. In particular, users must provide their own functions for inserting nodes into the tree and performing searches; that allows the creation of highly optimized rbtrees containing arbitrary types of structures.

There is some appeal to being able to hand-code the search and insertion functions, but there would also be value in generic implementations: the amount of code in the kernel would shrink slightly, and the task of debugging those functions would, hopefully, only have to be done once. So it is arguably surprising that nobody has proposed a generic rbtree implementation in all these years. Just as surprising is the fact that two independent generic implementations surfaced at about the same time.
The simpler of the two can be found in this patch set from Kent Overstreet. In this version, rbtree users are required to provide a function to compare two rbtree nodes with this prototype:
typedef int (rb_cmp_t) (struct rb_node *l, struct rb_node *r);
With that in place, the following new functions become available:
int rb_insert(struct rb_root *root, struct rb_node *new, rb_cmp_t cmp);
void rb_insert_allow_dup(struct rb_root *root, struct rb_node *new,
rb_cmp_t cmp);
struct rb_node *rb_search(struct rb_root *root, struct rb_node *search,
rb_cmp_t cmp);
struct rb_node *rb_greater(struct rb_root *root, struct rb_node *search,
rb_cmp_t cmp);
As can be seen from the prototypes, all of these functions deal directly with rb_node structures. Only the comparison function needs to know about what sort of structure the rb_node is embedded within. There is no compile-time mechanism to ensure that the comparison function expects the actual structures found in the tree, but one assumes any errors along those lines will show themselves quickly at run time.
One potential problem is that rb_search() and rb_greater() need to know what is being searched for; that, in turn, requires creating and passing in one of the structures stored in the tree. In some situations, that structure may need to be created on the stack, which is a clear problem if the structure is large. Unfortunately, in the block subsystem example mentioned above, that structure (struct request) is large indeed. Kent has tried to work around this problem by providing inlined versions (called _rb_search() and _rb_greater()) that, with luck, will cause the stack allocation to be optimized away. That depends on all supported versions of the compiler doing the right thing on all architectures, though, which may make some people nervous.
The alternative patch set, posted by Daniel Santos, is significantly more complicated. It can be thought of as a set of tricky preprocessor macros and inline functions that serve as a template for the creation of type-specific rbtree implementations. Here, too, one starts with the creation of a comparison function:
typedef long (*rb_compare_f)(const void *a, const void *b);
In this case, the comparison function will be passed pointers to the actual key value stored in the rbtree node. One then defines an "rbtree interface" with this daunting macro:
RB_DEFINE_INTERFACE(prefix, container_type, root_member,
left_pointer, right_pointer,
object_type, rbnode_member, key_member,
flags, comparison_function, augment_func);
Eleven arguments are a lot to keep track of, so it makes sense to discuss them in the context of an example (taken from Daniel's patch). The CPU scheduler defines a type (struct cfs_rq in kernel/sched/sched.h) to represent a run queue; each run queue contains a red-black tree called tasks_timeline. To optimize the location of the highest-priority task to run, the scheduler keeps a pointer to the leftmost tree node in rb_leftmost. The entries in the red-black tree are of type struct sched_entity (defined in <linux/sched.h>); the embedded rb_node structure is called run_node, and the key used to sort the tree is the u64 value vruntime.
To create the scheduler's tree using the generic mechanism, Daniel adds this declaration:
RB_DEFINE_INTERFACE(
fair_tree,
struct cfs_rq, tasks_timeline, rb_leftmost, /* no rightmost */,
struct sched_entity, run_node, vruntime,
0, compare_vruntime, 0)
Here, fair_tree is the "prefix" value used to generate the names of the tree functions—see below. The next line describes the structure containing the tree (struct cfs_rq), the name of the tree, and the name of the leftmost pointer used by the scheduler; there is no rightmost pointer (nobody cares about the lowest-priority task), so that parameter is simply left blank. The line after that describes the structure contained within the tree (struct sched_entity), the name of its embedded rb_node structure, and the name of the key value. Finally, no flags are given, compare_vruntime() is the comparison function, and, since this is not an augmented tree, there is no augmented callback function. Yes, it lacks a semicolon—the macro supplies that itself.
The result is a new set of functions like:
struct sched_entity *fair_tree_insert(struct cfs_rq *runqueue,
				      struct sched_entity *entity);
void fair_tree_remove(struct cfs_rq *runqueue, struct sched_entity *entity);
struct sched_entity *fair_tree_find(struct cfs_rq *runqueue, u64 *key);
/* ... */
These functions are all defined with the proper type, so the compiler will ensure that they are always called with the proper argument types. Everything is defined as __always_inline, so the implementations will be inlined at the place where they are called. That should eliminate any performance penalty caused by out-of-line implementations (as seen in Kent's patch), but, so far, nobody seems to have tried to measure that penalty.
There has been little in the way of review of these patches so far. They represent different approaches to the task of creating a generic red-black tree implementation, one emphasizing simplicity while the other emphasizes explicitness and type safety. Which version, if either, might make it into the mainline kernel is entirely unclear at this point.
Volatile ranges with fallocate()
Last November LWN looked at the volatile ranges patch set from John Stultz. The patch set is intended to bring an Android feature into the mainline, but it reimplements that feature in a form more deeply tied into the memory management subsystem. The patch has now returned, but its API has changed, so another look is warranted.

A "volatile range" is a set of pages in memory containing data that might be useful to an application at some point in the future; the key point is that, if the need arises, the application is able to reacquire (or regenerate) that data from another source. A web browser's in-RAM image cache is a classic example: keeping the images around can reduce network traffic and improve page rendering times but, should the cached images vanish, the browser can request a new copy from the source. Thus, while it makes sense to keep this data around, it also makes sense to get rid of it if a more pressing need for memory arises.
If the kernel knew about this sort of cached data, it could dump that data during times of memory stress and quickly reclaim the underlying memory. In such a situation, applications could cache more data than they otherwise would, knowing that there are limits to how much that caching can affect the system as a whole. The result would be better utilization of memory and a system that performs better overall.
Google's Robert Love implemented such a mechanism for Android as "ashmem." There is a desire to get the ashmem functionality into the mainline kernel, but the implementation and API were not to everybody's taste. To get around that problem, John took the core ashmem code, reworked the virtual memory integration, and hooked it into the posix_fadvise() system call; that is the version of the patch that was described last November.
Dave Chinner subsequently pointed out that this functionality might be better suited to the fallocate() system call. That call looks like this:
int fallocate(int fd, int mode, off_t offset, off_t len);
This system call is meant to operate on ranges of data within a file. Of particular interest, perhaps, is the FALLOC_FL_PUNCH_HOLE operation, which removes a block of data from an arbitrary location within a file. Declaring a volatile range can be thought of as a form of hole punching, but with a kernel-determined delay. If memory is tight, the hole could be punched immediately; otherwise the operation could complete at some later time, or not at all. Given the similarity between these two operations, it made sense to implement them within the same system call; John duly reworked the patch along those lines.
With the new patch set, to declare that a range of a file's contents is volatile, an application would call:
fallocate(fd, FALLOCATE_FL_MARK_VOLATILE, offset, len);
Where offset and len describe the actual range to be marked. After the call completes, the kernel is not obligated to keep that range in memory, and is not obligated to write that range to backing store before reclaiming it. The application should not attempt to access that portion of the file while it has been marked volatile, since the contents could disappear at any time. Instead, if and when the data turns out to be useful, a call should be made to:
fallocate(fd, FALLOCATE_FL_UNMARK_VOLATILE, offset, len);
If the indicated range is still present in memory, the call will return zero and the application can proceed to work with the data. If, instead, any part of the range has been purged by the kernel since it was marked volatile, a non-zero return value will inform the application that it needs to find that data somewhere else.
Any filesystem could conceivably implement this functionality, but, in practice, it only makes sense for a RAM-based filesystem like tmpfs, so it is only implemented there.
The patch is in its third revision as of this writing, having gotten a number of comments in its first two iterations. The number of complaints has fallen off considerably, though, suggesting that most reviewers are happy now. So this feature may just find its way into the 3.6 kernel.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Virtualization and containers
Page editor: Jonathan Corbet