|
|
Log in / Subscribe / Register

Kernel development

Brief items

Kernel release status

The current development kernel is 3.5-rc1, announced on June 3. "It's a pretty normal release - roughly 60% drivers, 20% arch updates, and 20% "all over" - filesystems, documentation, tools, you name it. [...] Depending on what your interest is, you might be excited about the CoDel packet scheduler, or about the GPU driver updates, or the new scsi targets. There is something in here for pretty much anybody."

Stable updates: the 3.0.33, 3.2.19, 3.3.8, and 3.4.1 stable kernels were all released on June 4; each contains a long list of fixes. 3.3.8 is the final update for the 3.3 series.

Comments (none posted)

Quotes of the week

I might be misreading and missing details, though - that code is really as pleasant to read as using warm stale beer to deal with industrial-strength hangover. The kind when you end up spitting out a fly or two, if not a cigarette butt.
Al Viro

You wouldn't want me to say that I think you're right,
it would impudently suggest that I might conceive of you being wrong.
I sigh for your heavy burden.
Hugh Dickins

Complexity is my Sun, and I am the planet that orbits around it.
Steven Rostedt

Thing is, first thing on Monday morning my brain don't work too fast. If I then get to basically reverse engineer a patch because people forgot to mention all the important bits, I get annoyed.
Peter Zijlstra doesn't accept poorly-described patches before Tuesday

Comments (3 posted)

Kernel development news

The conclusion of the 3.5 merge window

By Jonathan Corbet
June 5, 2012
Linus closed the 3.5 merge window on June 2; the announcement for the 3.5 prepatch came one day later. Just under 1,000 patches were pulled into the mainline after the writing of last week's merge window summary, for a total of 9,534 for the merge window as a whole. Some of the more significant changes pulled at the end of the merge window include:

  • The HFS filesystem has gained native language support (NLS) capabilities, with codepages added for several languages.

  • A process's directory under /proc now includes a children file containing the IDs of its child processes.

  • The kcmp() system call has been added. Its purpose is to help user space checkpoint/restore utilities to determine whether two processes share a given resource or not; see this article for a description of the interface.

  • Also for checkpoint/restore: the prctl() system call has gained options to set the beginning and end of the argv and environment areas and the executable file a process is running.

  • The ext4 filesystem now has support for metadata checksumming, a feature which should help to catch metadata corruption that might otherwise escape notice. This feature requires an on-disk format change (to store the checksum), so it must be turned on explicitly. Once this feature is enabled, the filesystem can only be mounted read-only on older kernels. See this article for more information on the metadata checksum feature.

  • A lot of changes have gone in to improve the handling of system-specific features on Sony laptops.

  • The new skew_tick= boot option controls whether the system skews timer ticks on a per-CPU basis to minimize contention on the xtime_lock lock. It defaults to off; turning it on can reduce jitter on some workloads, but will also increase power consumption.

  • The "frontswap" mechanism, part of the transcendent memory family of technologies, sneaked its way into the mainline just after the -rc1 release.

  • The FUSE filesystem API has added an operation to implement the fallocate() system call.

  • Two new drivers were added at the end of the merge window; they enable AUO K1900 and K1901 epaper display controllers and Emma Mobile STI timers.

Changes visible to kernel developers include:

  • The task_work_add() function, useful for requesting that a function be run in the context of a specific process, has been added. See this article for a description of the task_work_add() API.

  • The SUNRPC code has a new utility function:

        int svc_bind(struct svc_serv *serv, struct net *net);
    
    Its purpose is to handle service registration in the given network context; svc_bind() is exported GPL-only.

  • struct inode_operations has a new update_time() function whose job is to provide any needed special handling for changes to any of the file timestamps. The file_update_time() prototype has been changed: it now returns an int that can indicate that the operation failed. Failures to update the last-access time are now explicitly ignored; this is done to ensure that atime update failures don't make the filesystem unreadable. This work has been generalized out of the btrfs filesystem; see this article for a description of how atime updates can go wrong otherwise.

At this point, the addition of features for the 3.5 development cycle should be at an end unless something sneaks in before -rc2. If the usual pattern holds, the final 3.5 release can be expected right around the beginning of August.

Comments (none posted)

Generic red-black trees

By Jonathan Corbet
June 5, 2012
Red-black trees (or "rbtrees") are used throughout the kernel to maintain a sorted list of related items. For example, the block I/O subsystem uses an rbtree to maintain a list of outstanding requests sorted by sector number, and the scheduler has an rbtree of runnable tasks sorted by a quantity known as "virtual runtime." The rbtree interface was described in this 2006 article; it has since been extended, but the core features of the API remain the same. In particular, users must provide their own functions for inserting nodes into the tree and performing searches; that allows the creation of highly-optimized rbtrees containing arbitrary types of structures.

There is some appeal to being able to hand-code the search and insertion functions, but there would also be value in generic implementations. The amount of code in the kernel would shrink slightly and the task of debugging those functions would, hopefully, only have to be done once. So it is arguably surprising that nobody has proposed a generic rbtree implementation for all these years. Just as surprising is the fact that two independent generic implementations surfaced at about the same time.

The simpler of the two can be found in this patch set from Kent Overstreet. In this version, rbtree users are required to provide a function to compare two rbtree nodes with this prototype:

    typedef int (rb_cmp_t) (struct rb_node *l, struct rb_node *r);

With that in place, the following new functions become available:

    int rb_insert(struct rb_root *root, struct rb_node *new, rb_cmp_t cmp);
    void rb_insert_allow_dup(struct rb_root *root, struct rb_node *new,
			     rb_cmp_t cmp);
    struct rb_node *rb_search(struct rb_root *root, struct rb_node *search,
			      rb_cmp_t cmp);
    struct rb_node *rb_greater(struct rb_root *root, struct rb_node *search,
			       rb_cmp_t cmp);

As can be seen from the prototypes, all of these functions deal directly with rb_node structures. Only the comparison function needs to know about what sort of structure the rb_node is embedded within. There is no compile-time mechanism to ensure that the comparison function expects the actual structures found in the tree, but one assumes any errors along those lines will show themselves quickly at run time.

One potential problem is that rb_search() and rb_greater() need to know what is being searched for; that, in turn, requires creating and passing in one of the structures stored in the tree. In some situations, that structure may need to be created on the stack, which is a clear problem if the structure is large. Unfortunately, in the block subsystem example mentioned above, that structure (struct request) is large indeed. Kent has tried to work around this problem by providing inlined versions (called _rb_search() and _rb_greater()) that, with luck, will cause the stack allocation to be optimized away. That depends on all supported versions of the compiler doing the right thing on all architectures, though, which may make some people nervous.

The alternative patch set, posted by Daniel Santos, is significantly more complicated. It can be thought of as a set of tricky preprocessor macros and inline functions that serve as a template for the creation of type-specific rbtree implementations. Here, too, one starts with the creation of a comparison function:

    typedef long (*rb_compare_f)(const void *a, const void *b);

In this case, the comparison function will be passed pointers to the actual key value stored in the rbtree node. One then defines an "rbtree interface" with this daunting macro:

    RB_DEFINE_INTERFACE(prefix, container_type, root_member,
			left_pointer, right_pointer,
			object_type, rbnode_member, key_member,
			flags, comparison_function, augment_func);

Eleven arguments are a lot to keep track of, so it makes sense to discuss them in the context of an example (taken from Daniel's patch). The CPU scheduler defines a type (struct cfs_rq in kernel/sched/sched.h) to represent a run queue; each run queue contains a red-black tree called tasks_timeline. To optimize the location of the highest-priority task to run, the scheduler keeps a pointer to the leftmost tree node in rb_leftmost. The entries in the red-black tree are of type struct sched_entity (defined in <linux/sched.h>); the embedded rb_node structure is called run_node, and the key used to sort the tree is the u64 value vruntime.

To create the scheduler's tree using the generic mechanism, Daniel adds this declaration:

    RB_DEFINE_INTERFACE(
	fair_tree,
	struct cfs_rq, tasks_timeline, rb_leftmost, /* no rightmost */,
	struct sched_entity, run_node, vruntime,
	0, compare_vruntime, 0)

Here, fair_tree is the "prefix" value used to generate the names of the tree functions—see below. The next line describes the structure containing the tree (struct cfs_rq), the name of the tree, and the name of the leftmost pointer used by the scheduler; there is no rightmost pointer (nobody cares about the lowest-priority task), so that parameter is simply left blank. The line after that describes the structure contained within the tree (struct sched_entity), the name of its embedded rb_node structure, and the name of the key value. Finally, no flags are given, compare_vruntime() is the comparison function, and, since this is not an augmented tree, there is no augmented callback function. Yes, it lacks a semicolon—the macro supplies that itself.

The result is a new set of functions like:

    struct sched_entity *fair_tree_insert(struct cfs_rq runqueue, 
					  struct sched_entity *entity);
    void fair_tree_remove(struct cfs_rq runqueue, struct sched_entity *entity);
    struct sched_entity *fair_tree_find(struct cfs_rq runqueue, u64 *key);
    /* ... */

These functions are all defined with the proper type, so the compiler will ensure that they are always called with the proper argument types. Everything is defined as __always_inline, so the implementations will be inlined at the place where they are called. That should eliminate any performance penalty caused by out-of-line implementations (as seen in Kent's patch), but, so far, nobody seems to have tried to measure that penalty.

There has been little in the way of review of these patches in general. They represent different approaches to the task of creating a generic red-black implementation, one emphasizing simplicity while the other emphasizes explicitness and type safety. Which version might be inserted into the mainline kernel—if either goes in—is entirely unclear at this point.

Comments (16 posted)

Volatile ranges with fallocate()

By Jonathan Corbet
June 5, 2012
Last November LWN looked at the volatile ranges patch set from John Stultz. This patch is intended to bring an Android feature into the mainline, but it is a reimplemented feature that is more deeply tied into the memory management subsystem. That patch has now returned, but the API has changed so another look is warranted.

A "volatile range" is a set of pages in memory containing data that might be useful to an application at some point in the future; a key point is that, if the need arises, the application is able to reacquire (or regenerate) that data from another source. A web browser's in-RAM image cache is a classic example. Keeping the images around can reduce net traffic and improve page rendering times, but, should the cached images vanish, the browser can request a new copy from the source. Thus, while it makes sense to keep this data around, it also makes sense to get rid of it if a more pressing need for memory arises.

If the kernel knew about this sort of cached data, it could dump that data during times of memory stress and quickly reclaim the underlying memory. In such a situation, applications could cache more data than they otherwise would, knowing that there are limits to how much that caching can affect the system as a whole. The result would be better utilization of memory and a system that performs better overall.

Google's Robert Love implemented such a mechanism for Android as "ashmem." There is a desire to get the ashmem functionality into the mainline kernel, but the implementation and API were not to everybody's taste. To get around that problem, John took the core ashmem code, reworked the virtual memory integration, and hooked it into the posix_fadvise() system call; that is the version of the patch that was described last November.

Dave Chinner subsequently pointed out that this functionality might be better suited to the fallocate() system call. That call looks like this:

    int fallocate(int fd, int mode, off_t offset, off_t len);

This system call is meant to operate on ranges of data within a file. Of particular interest, perhaps, is the FALLOCATE_FL_PUNCH_HOLE operation, which removes a block of data from an arbitrary location within a file. Declaring a volatile range can be thought of as a form of hole punching, but with a kernel-determined delay. If memory is tight, the hole could be punched immediately; otherwise the operation could complete at some later time, or not at all. Given the similarity between these two operations, it made sense to implement them within the same system call; John duly reworked the patch along those lines.

With the new patch set, to declare that a range of a file's contents is volatile, an application would call:

    fallocate(fd, FALLOCATE_FL_MARK_VOLATILE, offset, len);

Where offset and len describe the actual range to be marked. After the call completes, the kernel is not obligated to keep that range in memory, and is not obligated to write that range to backing store before reclaiming it. The application should not attempt to access that portion of the file while it has been marked volatile, since the contents could disappear at any time. Instead, if and when the data turns out to be useful, a call should be made to:

    fallocate(fd, FALLOCATE_FL_UNMARK_VOLATILE, offset, len);

If the indicated range is still present in memory, the call will return zero and the application can proceed to work with the data. If, instead, any part of the range has been purged by the kernel since it was marked volatile, a non-zero return value will inform the application that it needs to find that data somewhere else.

Any filesystem could conceivably implement this functionality, but, in practice, it only makes sense for a RAM-based filesystem like tmpfs, so it is only implemented there.

The patch is in its third revision as of this writing, having gotten a number of comments in its first two iterations. The number of complaints has fallen off considerably, though, suggesting that most reviewers are happy now. So this feature may just find its way into the 3.6 kernel.

Comments (13 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.5-rc1 ?
Greg KH Linux 3.4.1 ?
Thomas Gleixner 3.4-rt8 ?
Greg KH Linux 3.3.8 ?
Ben Hutchings Linux 3.2.19 ?
Steven Rostedt 3.2.19-rt30 ?
Greg KH Linux 3.0.33 ?

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Virtualization and containers

Page editor: Jonathan Corbet
Next page: Distributions>>


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds