The current development kernel is 3.5-rc1,
released on June 3. "It's a
pretty normal release - roughly 60% drivers, 20% arch updates, and 20% "all
over" - filesystems, documentation, tools, you name it. [...] Depending on
what your interest is, you might be excited about the CoDel packet
scheduler, or about the GPU driver updates, or the new scsi targets. There
is something in here for pretty much anybody."
Stable updates: the
3.3.8 and 3.4.1 stable kernels were released on
June 4; each contains a long list of fixes. 3.3.8 is the final update
for the 3.3 series.
Comments (none posted)
I might be misreading and missing details, though - that code is
really as pleasant to read as using warm stale beer to deal with
industrial-strength hangover. The kind when you end up spitting
out a fly or two, if not a cigarette butt.
— Al Viro
You wouldn't want me to say that I think you're right,
it would impudently suggest that I might conceive of you being wrong.
I sigh for your heavy burden.
— Hugh Dickins
Complexity is my Sun, and I am the planet that orbits around it.
— Steven Rostedt
Thing is, first thing on Monday morning my brain don't work too
fast. If I then get to basically reverse engineer a patch because
people forgot to mention all the important bits, I get annoyed.
— Peter Zijlstra
Comments (3 posted)
Kernel development news
Linus closed the 3.5 merge window on June 2; the announcement
for the 3.5 prepatch came one day
later. Just under 1,000 patches were pulled into the mainline after the
writing of last week's merge window summary,
for a total of 9,534 for the merge window as a whole. Some of
the more significant changes pulled at the end of the merge window include:
- The HFS filesystem has gained native language support (NLS)
capabilities, with codepages added for several languages.
- A process's directory under /proc now includes a
children file containing the IDs of its child processes.
- The kcmp() system call has been added. Its purpose is to
help user space checkpoint/restore utilities to determine whether two
processes share a given resource or not; see this article for a description of this new system call.
- Also for checkpoint/restore: the prctl() system call has
gained options to set the beginning and end of the argv and
environment areas and the executable file a process is running.
- The ext4 filesystem now has support for metadata checksumming, a
feature which should help to catch metadata corruption that might
otherwise escape notice. This feature requires an on-disk format
change (to store the checksum), so it must be turned on explicitly.
Once this feature is enabled, the filesystem can only be mounted
read-only on older kernels. See this
article for more information on the metadata checksum feature.
- A lot of changes have gone in to improve the handling of
system-specific features on Sony laptops.
- The new skew_tick= boot option controls whether the system
skews timer ticks on a per-CPU basis to minimize contention on the
xtime_lock lock. It defaults to off; turning it on can
reduce jitter on some workloads, but will also increase power
consumption.
- The "frontswap" mechanism, part of the transcendent memory family of
technologies, sneaked its way into the mainline just after the -rc1 release.
- The FUSE filesystem API has added an operation to implement the
fallocate() system call.
- Two new drivers were added at the end of the merge window; they enable
AUO K1900 and K1901 epaper display controllers and Emma Mobile STI timers.
A number of changes visible only to kernel developers were also merged.
At this point, the addition of features for the 3.5 development cycle
should be at an end unless something sneaks in before -rc2. If the usual
pattern holds, the final 3.5 release can be expected right around the
beginning of August.
Comments (none posted)
Red-black trees (or "rbtrees") are used throughout the kernel to maintain a
sorted list of related items. For example, the block I/O subsystem uses an
rbtree to maintain a list of outstanding requests sorted by sector number,
and the scheduler has an rbtree of runnable tasks sorted by a quantity known as "virtual
runtime." The rbtree interface was described in this 2006 article
; it has since been extended,
but the core features of the API remain the same. In particular, users
must provide their own functions for inserting nodes into the tree and
performing searches; that allows the creation of highly-optimized rbtrees
containing arbitrary types of structures.
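The flavor of the hand-coded approach can be seen in a small user-space sketch. Here a plain (and, for brevity, unbalanced) binary tree node stands in for struct rb_node, and my_insert() and my_search() play the role of the caller-written functions; all of the names are hypothetical, not actual kernel code:

```c
#include <stddef.h>

/* Stand-in for struct rb_node: an embedded, type-blind link structure. */
struct node { struct node *left, *right; };

/* Recover the containing structure from an embedded node pointer. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The structure actually stored in the tree. */
struct request {
    unsigned long sector;   /* the sort key */
    struct node tree_node;  /* the embedded link node */
};

/* Hand-coded, type-specific insertion (no rebalancing, for brevity). */
static void my_insert(struct node **root, struct request *req)
{
    while (*root) {
        struct request *cur = container_of(*root, struct request, tree_node);
        root = (req->sector < cur->sector) ? &(*root)->left : &(*root)->right;
    }
    req->tree_node.left = req->tree_node.right = NULL;
    *root = &req->tree_node;
}

/* Hand-coded, type-specific search. */
static struct request *my_search(struct node *root, unsigned long sector)
{
    while (root) {
        struct request *cur = container_of(root, struct request, tree_node);
        if (sector == cur->sector)
            return cur;
        root = (sector < cur->sector) ? root->left : root->right;
    }
    return NULL;
}
```

Because the insert and search loops know the containing type, the comparison can be inlined and specialized; that is the optimization opportunity the current API preserves, at the cost of every user writing these loops anew.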
There is some appeal to being able to hand-code the search and insertion
functions, but there would also be value in generic implementations. The
amount of code in the kernel would shrink slightly and the task of
debugging those functions would, hopefully, only have to be done once. So
it is arguably surprising that nobody has proposed a generic rbtree
implementation for all these years. Just as surprising is the fact that
two independent generic implementations surfaced at about the same time.
The simpler of the two can be found in this
patch set from Kent Overstreet. In this version, rbtree users are
required to provide a function to compare two rbtree nodes with this prototype:
typedef int (rb_cmp_t) (struct rb_node *l, struct rb_node *r);
With that in place, the following new functions become available:
int rb_insert(struct rb_root *root, struct rb_node *new, rb_cmp_t cmp);
void rb_insert_allow_dup(struct rb_root *root, struct rb_node *new,
                         rb_cmp_t cmp);
struct rb_node *rb_search(struct rb_root *root, struct rb_node *search,
                          rb_cmp_t cmp);
struct rb_node *rb_greater(struct rb_root *root, struct rb_node *search,
                           rb_cmp_t cmp);
As can be seen from the prototypes, all of these functions deal directly
with rb_node structures. Only the comparison function needs to
know about what sort of structure the rb_node is embedded within.
There is no compile-time mechanism to ensure that the comparison function
expects the actual structures found in the tree, but one assumes any errors
along those lines will show themselves quickly at run time.
One potential problem is that rb_search() and
rb_greater() need to know what is being searched for; that, in
turn, requires creating and passing in one of the structures stored in the
tree. In some situations, that structure may need to be created on the
stack, which is a clear problem if the structure is large. Unfortunately,
in the block subsystem example mentioned above, that structure (struct
request) is large indeed. Kent has tried to work around this problem
by providing inlined versions (called _rb_search() and
_rb_greater()) that, with luck, will cause the stack allocation to
be optimized away. That depends on all supported versions of the compiler
doing the right thing on all architectures, though, which may make some
developers nervous.
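The callback-based approach, and the on-stack search key it requires, can be sketched in the same simplified user-space setting; again, the unbalanced tree and all of the names are illustrative only, not Kent's actual code:

```c
#include <stddef.h>

struct node { struct node *left, *right; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct request {
    unsigned long sector;
    struct node tree_node;
};

/* The single comparison callback that knows the containing type. */
typedef int (cmp_t)(struct node *l, struct node *r);

static int request_cmp(struct node *l, struct node *r)
{
    unsigned long ls = container_of(l, struct request, tree_node)->sector;
    unsigned long rs = container_of(r, struct request, tree_node)->sector;
    return (ls > rs) - (ls < rs);
}

/* Generic insert: works for any embedded node, given a comparison. */
static void generic_insert(struct node **root, struct node *new, cmp_t *cmp)
{
    while (*root)
        root = (cmp(new, *root) < 0) ? &(*root)->left : &(*root)->right;
    new->left = new->right = NULL;
    *root = new;
}

/* Generic search: the caller passes a node embedded in a dummy object. */
static struct node *generic_search(struct node *root, struct node *key,
                                   cmp_t *cmp)
{
    while (root) {
        int c = cmp(key, root);
        if (c == 0)
            return root;
        root = (c < 0) ? root->left : root->right;
    }
    return NULL;
}
```

Note that searching for a given sector requires building an entire struct request on the stack just to hold the key; that is exactly the cost that becomes painful when the containing structure is large.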
The alternative patch set, posted by Daniel
Santos, is significantly more complicated. It can be thought of as a
set of tricky preprocessor macros and inline functions that serve as a
template for the creation of type-specific rbtree implementations. Here,
too, one starts with the creation of a comparison function:
typedef long (*rb_compare_f)(const void *a, const void *b);
In this case, the comparison function will be passed pointers to the actual
key value stored in the rbtree node.
One then defines an "rbtree interface" with this daunting macro:
RB_DEFINE_INTERFACE(prefix, container_type, root_member,
                    leftmost_member, rightmost_member,
                    object_type, rbnode_member, key_member,
                    flags, comparison_function, augment_func);
Eleven arguments are a lot to keep track of, so it makes sense to discuss
them in the context of an example (taken from Daniel's patch). The CPU
scheduler defines a type (struct cfs_rq in
kernel/sched/sched.h) to represent a run queue; each run queue
contains a red-black tree called tasks_timeline. To optimize the
location of the highest-priority task to run, the scheduler keeps a pointer
to the leftmost tree node in rb_leftmost. The entries in the
red-black tree are of type struct sched_entity (defined in
<linux/sched.h>); the embedded rb_node structure is
called run_node, and the key used to sort the tree is the
u64 value vruntime.
To create the scheduler's tree using the generic mechanism, Daniel adds:
RB_DEFINE_INTERFACE(fair_tree,
    struct cfs_rq, tasks_timeline, rb_leftmost, /* no rightmost */,
    struct sched_entity, run_node, vruntime,
    0, compare_vruntime, 0)
Here, fair_tree is the "prefix" value used to generate the names
of the tree functions—see below. The next line describes the structure
containing the tree (struct cfs_rq), the name of the tree, and the
name of the leftmost pointer used by the scheduler; there is no rightmost
pointer (nobody cares about the lowest-priority task), so that parameter is
simply left blank. The line after that describes the structure contained
within the tree (struct sched_entity), the name of its embedded
rb_node structure, and the name of the key value. Finally, no
flags are given, compare_vruntime() is the comparison function,
and, since this is not an augmented tree,
there is no augmented callback function. Yes, it lacks a semicolon—the
macro supplies that itself.
The result is a new set of functions like:
struct sched_entity *fair_tree_insert(struct cfs_rq *runqueue,
                                      struct sched_entity *entity);
void fair_tree_remove(struct cfs_rq *runqueue, struct sched_entity *entity);
struct sched_entity *fair_tree_find(struct cfs_rq *runqueue, u64 *key);
/* ... */
These functions are all defined with the proper
type, so the compiler will ensure that they are always called with the
proper argument types. Everything is defined as __always_inline,
so the implementations will be inlined at the place where they are called.
That should eliminate any performance penalty caused by out-of-line
implementations (as seen in Kent's patch), but, so far, nobody seems to
have tried to measure that penalty.
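The template idea itself can be illustrated with a heavily simplified toy macro; this is nothing like Daniel's actual eleven-argument interface (it generates lookups over a fixed array rather than a red-black tree), but it shows how the preprocessor can emit a fully typed API from one invocation:

```c
#include <stddef.h>

/* A toy "interface generator": given a prefix, an object type, a key type,
 * and a key member, emit typed insert/find functions.  The real
 * RB_DEFINE_INTERFACE generates rbtree operations instead; only the
 * name-pasting and type-safety mechanism is the same in spirit. */
#define DEFINE_TREE_INTERFACE(prefix, obj_type, key_type, key_member)   \
    static obj_type *prefix##_slots[64];                                \
    static int prefix##_count;                                          \
    static void prefix##_insert(obj_type *obj)                          \
    {                                                                   \
        prefix##_slots[prefix##_count++] = obj;                         \
    }                                                                   \
    static obj_type *prefix##_find(key_type key)                        \
    {                                                                   \
        for (int i = 0; i < prefix##_count; i++)                        \
            if (prefix##_slots[i]->key_member == key)                   \
                return prefix##_slots[i];                               \
        return NULL;                                                    \
    }

/* A stand-in for struct sched_entity, with its sort key. */
struct sched_entity_demo { unsigned long long vruntime; };

/* One invocation produces fair_tree_insert() and fair_tree_find(), both
 * fully typed: passing the wrong structure is a compile-time error. */
DEFINE_TREE_INTERFACE(fair_tree, struct sched_entity_demo,
                      unsigned long long, vruntime)
```

The generated functions are static, so each invocation of the macro creates a private, type-specific interface in the file where it appears; that matches the pattern of the typed fair_tree_*() functions shown above.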
There has been little in the way of review of these patches in general.
They represent different approaches to the task of creating a generic
red-black implementation, one emphasizing simplicity while the other
emphasizes explicitness and type safety. Which version might be inserted
into the mainline kernel—if either goes in—is entirely unclear at this point.
Comments (16 posted)
Last November LWN looked at the volatile ranges patch
from John Stultz. This patch is intended to bring an Android
feature into the mainline, but in a reimplemented form that is more
deeply tied into the memory management subsystem. That patch has now
returned with a changed API, so another look is warranted.
A "volatile range" is a set of pages in memory containing data that might
be useful to an application at some point in the future; a key point is
that, if the need arises, the application is able to reacquire (or
regenerate) that data from another source. A web browser's in-RAM image
cache is a classic example. Keeping the images around can reduce net
traffic and improve page rendering times, but, should the cached images
vanish, the browser can request a new copy from the source. Thus, while it
makes sense to keep this data around, it also makes sense to get rid of it
if a more pressing need for memory arises.
If the kernel knew about this sort of cached data, it could dump that data
during times of memory stress and quickly reclaim the underlying memory.
In such a situation, applications could cache more data than they otherwise
would, knowing that there are limits to how much that caching can affect
the system as a whole. The result would be better utilization of memory
and a system that performs better overall.
Google's Robert Love implemented such a mechanism for Android as "ashmem."
There is a
desire to get the ashmem functionality into the mainline kernel, but the
implementation and API were not to everybody's taste. To get around that
problem, John took the core ashmem code, reworked the virtual memory
integration, and hooked it into the posix_fadvise() system call;
that is the version of the patch that was described last November.
Dave Chinner subsequently pointed out that this functionality might be
better suited to the fallocate() system call. That call looks like:
int fallocate(int fd, int mode, off_t offset, off_t len);
This system call is meant to operate on ranges of data within a file. Of
particular interest, perhaps, is the FALLOC_FL_PUNCH_HOLE
operation, which removes a block of data from an arbitrary location within
a file. Declaring a volatile range can be thought of as a form of hole
punching, but with a kernel-determined delay. If memory is tight, the hole
could be punched immediately; otherwise the operation could complete at
some later time, or not at all. Given the similarity between these two
operations, it made sense to implement them within the same system call;
John duly reworked the patch along those lines.
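Hole punching, unlike the volatile-range operations, is already available to user space; a minimal Linux sketch (error handling abbreviated) uses the kernel constant FALLOC_FL_PUNCH_HOLE, which must be combined with FALLOC_FL_KEEP_SIZE so that the file's length is unchanged:

```c
#define _GNU_SOURCE
#include <fcntl.h>

/* Punch a hole in an open file; returns 0 on success, -1 (with errno set)
 * if, for example, the filesystem does not support hole punching. */
static int punch(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

After a successful call, reads from the punched range return zeroes and the underlying blocks can be reclaimed; the volatile-range proposal can be seen as letting the kernel decide whether and when such a punch happens.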
With the new patch set, to declare that a range of a file's contents is
volatile, an application would call:
fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);
Where offset and len describe the actual range to be
marked. After the call completes, the kernel is not obligated to keep that
range in memory, and is not obligated to write that range to backing store
before reclaiming it. The application should not attempt to access that
portion of the file while it has been marked volatile, since the contents
could disappear at any time. Instead, if and when the data turns out to be
useful, a call should be made to:
fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);
If the indicated range is still present in memory, the call will return
zero and the application can proceed to work with the data. If, instead,
any part of the range has been purged by the kernel since it was marked
volatile, a non-zero return value will inform the application that it needs
to find that data somewhere else.
Any filesystem could conceivably implement this functionality, but, in
practice, it only makes sense for a RAM-based filesystem like tmpfs, so it
is only implemented there.
The patch is in its third revision as of this writing, having gotten a
number of comments in its first two iterations. The number of complaints
has fallen off considerably, though, suggesting that most reviewers are
happy now. So this feature may just find its way into the 3.6 kernel.
Comments (13 posted)
Patches and updates
- Thomas Gleixner: 3.4-rt8. (June 4, 2012)
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet