The current stable 2.6 release is 2.6.15.1, released
on January 14.
This one contains a dozen or so patches for kernel crashes and security problems.
The current 2.6 prepatch is 2.6.16-rc1, announced by Linus on
January 17. The details of what's in this release can be found in last week's summary and
this week's update (see below). In brief: 2.6.16 will include
the OCFS2 filesystem, the swap migration patches, various
new drivers, the mutex
changeover, high-resolution timers, a transparent inter-process communication
(TIPC) protocol implementation, a big netfilter update, a new
batch-level scheduler class, and more.
See the short log file for an overview
(note that "short" is relative) or the
long-format log for the details.
Linus's post-rc1 git repository contains a pile of network driver updates
and a few other fixes. The 2.6.16 merge window has closed, so one would
not expect to see a whole lot of new features going in at this point.
There will apparently be one exception, however: Andrew Morton intends to
merge the openat() series of system calls, along with the
pselect() and ppoll() implementations. These new system
calls were covered here last week.
The current -mm tree is 2.6.16-rc1-mm1. Recent changes
to -mm include a bunch of reiser3 work ("Please test with caution,
but please test."), several more semaphore-to-mutex conversion
patches, multi-column oops stack backtraces for i386, and a new software
suspend API intended to help move some of the image save/restore work to
user space.
Kernel development news
A fair number of patches have been merged since the looking forward to 2.6.16
article was published. In addition to everything listed there, the
following patches are part of 2.6.16-rc1, starting with the user-visible
changes:
- A big XFS update which should improve performance.
- A big direct rendering update. The Video4Linux and DVB code have
also seen large updates.
- An implementation of the Transparent
Inter-Process Communication (TIPC) protocol. TIPC is used for
communication within clusters.
- Harald Welte's massive "x_tables" patch, which unifies much of the
code for various types of tables used in the netfilter code.
- A big PowerPC update including experimental Mac G5 support. There is also
a new virtual "spufs" filesystem providing access to "synergistic
processing units" on the Cell architecture.
- A framework for "serial peripheral interface" (SPI) devices has been added,
along with a handful of drivers.
- A new SCHED_BATCH scheduler class. Processes in this class are
scheduled normally, with the exception that they get no "interactivity"
bonus when they sleep. Unprivileged processes are allowed to move between
SCHED_NORMAL and SCHED_BATCH at will.
- The tmpfs filesystem has acquired a new set of mount options allowing the
  system administrator to specify how memory should be allocated across a
  NUMA system's nodes.
Changes visible to kernel developers include:
- Many parts of the kernel have been converted over to the new mutex type.
- Old-timers who automatically type "make bzImage" will find that it
no longer works; just type "make" instead.
- The device probe() and remove() methods have been moved
from struct device_driver to struct bus_type. The
bus-level methods will override any remaining driver methods.
- When the kernel is configured to be optimized for size, gcc (if it's
version 4.x) is given the freedom to decide whether inline
functions should really be inlined. The __always_inline
attribute now truly forces inlining in all cases. This is an outcome
from the discussion on
inline functions held a couple of weeks ago.
- Another outcome from that discussion: many kernel functions have had
  the inline attribute removed. One of the more significant of
  these is capable(), which has also been moved to kernel/capability.c.
- The old inter_module functions are only built if the one in-kernel
user (the MTD drivers) is present; otherwise they are unavailable.
The merge window for 2.6.16 is effectively closed, so there should not be a
whole lot more in the way of significant changes being merged in this
development cycle.
Last September, this page featured an article on the ktimers patch
by Thomas Gleixner. The new timer abstraction was designed to enable the
provision of high-resolution timers in the kernel and to address some of
the inefficiencies encountered when the current timer code is used in this mode.
Since then, there has been a large amount of
discussion, and the code has seen significant work. The end product of
that work, now called "hrtimers," was merged for the 2.6.16 release.
At its core, the hrtimer mechanism remains the same. Rather than using the
"timer wheel" data structure, hrtimers live on a time-sorted linked list,
with the next timer to expire being at the head of the list. A separate
red/black tree is also used to enable the insertion and removal of timer
events without scanning through the list. But while the core
remains the same, just about everything else has changed, at least in
appearance.
There is a new type, ktime_t, which is used to store a time value
in nanoseconds. This type, found in <linux/ktime.h>, is
meant to be used as an opaque structure. And, interestingly, its
definition changes depending on the underlying architecture. On 64-bit
systems, a ktime_t is really just a 64-bit integer value in
nanoseconds. On 32-bit machines, however, it is a two-field structure: one
32-bit value holds the number of seconds, and the other holds nanoseconds.
The order of the two fields depends on whether the host architecture is
big-endian or not; they are always arranged so that the two values can,
when needed, be treated as a single, 64-bit value. Doing things this way
complicates the header files, but it provides for efficient time value
manipulation on all architectures.
A whole set of functions and macros has been provided for working with
ktime_t values, starting with the traditional two ways to declare
and initialize them:
DEFINE_KTIME(name); /* Initialize to zero */
kt = ktime_set(long secs, long nanosecs);
Various other functions exist for changing ktime_t values; all of
these treat their arguments as read-only and return a ktime_t
value as their result:
ktime_t ktime_add(ktime_t kt1, ktime_t kt2);
ktime_t ktime_sub(ktime_t kt1, ktime_t kt2); /* kt1 - kt2 */
ktime_t ktime_add_ns(ktime_t kt, u64 nanoseconds);
Finally, there are some type conversion functions:
ktime_t timespec_to_ktime(struct timespec tspec);
ktime_t timeval_to_ktime(struct timeval tval);
struct timespec ktime_to_timespec(ktime_t kt);
struct timeval ktime_to_timeval(ktime_t kt);
clock_t ktime_to_clock_t(ktime_t kt);
u64 ktime_to_ns(ktime_t kt);
The interface for hrtimers can be found in
<linux/hrtimer.h>. A timer is represented by struct
hrtimer, which must be initialized with:
void hrtimer_init(struct hrtimer *timer, clockid_t which_clock);
Every hrtimer is bound to a specific clock. The system currently
supports two clocks:
- CLOCK_MONOTONIC: a clock which is guaranteed always to move
forward in time, but which does not reflect "wall clock time" in any
specific way. In the current implementation, CLOCK_MONOTONIC
resembles the jiffies tick count in that it starts at zero
when the system boots and increases monotonically from there.
- CLOCK_REALTIME: a clock which matches the current real-world time.
The difference between the two clocks can be seen when the system time is
adjusted, perhaps as a result of administrator action, tweaking by the
network time protocol code, or suspending and resuming the system. In any
of these situations, CLOCK_MONOTONIC will tick forward as if
nothing had happened, while CLOCK_REALTIME may see discontinuous
changes. Which clock should be used will depend mainly on whether the
timer needs to be tied to time as the rest of the world sees it or not.
The call to hrtimer_init() will tie an hrtimer to a specific
clock, but that clock can be changed with:
void hrtimer_rebase(struct hrtimer *timer, clockid_t new_clock);
Most of the hrtimer fields should not be touched. Two of them,
however, must be set by the user:
int (*function)(void *);
void *data;
As one might expect, function() will be called when the timer
expires, with data as its parameter.
Actually setting a timer is accomplished with:
int hrtimer_start(struct hrtimer *timer, ktime_t time,
enum hrtimer_mode mode);
The mode parameter describes how the time parameter should be
interpreted. A mode of HRTIMER_ABS indicates that
time is an absolute value, while HRTIMER_REL indicates
that time should be interpreted relative to the current time.
Under normal operation, function() will be called after (at least)
the requested expiration time. The hrtimer code implements a shortcut for
situations where the sole purpose of a timer is to wake up a process on
expiration: if function() is NULL, the process whose task
structure is pointed to by data will be awakened. In most cases,
however, code which uses hrtimers will provide a callback
function(). That function has an integer return value, which
should be either HRTIMER_NORESTART (for a one-shot timer which
should not be started again) or HRTIMER_RESTART for a recurring timer.
In the restart case, the callback must set a new expiration time before
returning. Usually, restarting timers are used by kernel subsystems which
need a callback at a regular interval. The hrtimer code provides a
function for advancing the expiration time to the next such interval:
unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t interval);
This function will advance the timer's expiration time by the given
interval. If necessary, the interval will be added more than once
to yield an expiration time in the future. Generally, the need to add the
interval more than once means that the system has overrun its timer
period, perhaps as a result of high system load. The return value from
hrtimer_forward() is the number of missed intervals, allowing code
which cares to detect and respond to the situation.
Outstanding timers can be canceled with either of:
int hrtimer_cancel(struct hrtimer *timer);
int hrtimer_try_to_cancel(struct hrtimer *timer);
When hrtimer_cancel() returns, the caller can be sure that the
timer is no longer active, and that its expiration function is not running
anywhere in the system. The return value will be zero if the timer was not
active (meaning it had already expired, normally), or one if the timer was
successfully canceled. hrtimer_try_to_cancel() does the same,
but will not wait if the timer function is running; it will, instead,
return -1 in that situation.
A canceled timer can be restarted by passing it to hrtimer_restart().
Finally, there is a small set of query functions.
hrtimer_get_remaining() returns the amount of time left before a
timer expires. A call to hrtimer_active() returns nonzero if the
timer is currently on the queue. And a call to:
int hrtimer_get_res(clockid_t which_clock, struct timespec *tp);
will return the true resolution of the given clock, in nanoseconds.
The folks at IBM would like to add a "container" capability to the Linux
kernel. Containers are a way of walling a group of processes off from the
rest of the system; a process within a container will only see its fellow
inmate processes and whatever resources are made accessible to that
container. This feature has some obvious security-related applications.
IBM's plans, evidently, also include the ability to pack up a container and
move it to another physical host without disrupting the processes trapped
inside.
The patches which have been circulating so far fall short of the final plan,
but they already disturb enough code to have attracted some skeptical
criticism. In particular, the 34-part PID virtualization patch
creates a simple container type, and implements a separate process ID space
within containers. But, as we'll see, doing even that much involves some
significant kernel changes.
The containers themselves are fairly simple. The patches create a virtual
file called /proc/container. If a process writes a string to that
file, a new container is created for that process, using the string as its
name. The namespace is global, so every container on the system must have
a unique name. Any child processes created by the newly-contained process
will also be trapped within the container, with no way out.
At this point, being inside a container does not affect a process's life
that much. The one thing that does change, however, is that each container
has its own process ID (PID) space. Processes within the container can
only see others in the same container. There is nothing particularly
controversial about that behavior, but the developers have another
objective in mind: they want to be able to change the PIDs of contained
processes without the processes themselves noticing. In particular, they
would like to be able to migrate a container to a different system, which
will certainly assign new PIDs to every process within the container. Code
written for Unix-like systems does not normally expect its PID to change
over time, however; so switching PIDs underneath a process could lead to all
kinds of strange behavior. To avoid this problem, the plan is that PIDs remain
constant within the container, even if those PIDs change in the real world.
Implementing constant PIDs (from a viewpoint inside the container) is not a
straightforward task; it involves adding a whole new virtualization layer
inside the kernel. There are two types of PIDs now, "real" PIDs and the
virtual PIDs used by contained processes. Any place in the kernel which
deals with PID values must become aware of which type of PID it is using,
and convert to the other type when necessary. So, as a general rule, any
code which exchanges PIDs with user space must use the virtual variety,
while PIDs handled within the kernel are real.
The PID logic is complicated by a few little details, like: what happens
when containers are nested? A process living within a container has a real
PID and a virtual PID associated with the container. If that process
creates a container of its own, it will acquire yet another PID associated
with the new container. So it is not possible to simply convert a real PID
to a virtual PID; such questions require a "context" so that the kernel
knows which virtual PID is wanted.
The result of all this is that PID handling within the kernel changes
significantly. Code which used to get the current process's PID with
current->pid must now use tsk_pid(current) for the
real PID, or tsk_vpid(current) for the virtual PID - and it must
know which one it wants. In situations where more than one virtual PID
might be appropriate, tsk_vpid_ctx() must be used to supply the
context. Much of the patch set is concerned simply with
making these conversions; for good measure, it also renames the pid
field of struct task_struct to catch any code still trying to
access it directly.
Behind all of this is a concept called "pidspaces." The patch carves up
the global PID space: it takes the upper 9 bits of the 32-bit PID value
and puts the pidspace number there. A virtual PID as seen within a container
is turned into a real kernel PID by stuffing the pidspace number in those
upper bits. Since the contained processes only see virtual PIDs, they
never see the pidspace number, and they will not notice if that number
changes.
All of this code seems to work, but there is a certain amount of opposition
to merging it. As Alan Cox put it:
This is an obscure, weird piece of functionality for some special
case usages most of which are going to be eliminated by Xen. I
don't see the kernel side justification for it at all.
The developers answer that the ability to checkpoint and restart process
trees, possibly moving them between machines, will be highly useful. Some other
virtualization projects also require this capability - not everybody wants
to use Xen. So the pressure for
PID virtualization probably won't just go away.
What might happen is that the hiding of current->pid might be
taken out, greatly reducing the size of the patch. Another idea which has
been floated is to eliminate, to the greatest degree possible, the use of
PIDs within the kernel. Almost any in-kernel use of a PID can be replaced
with a direct pointer to the task structure. If a PID eventually is
reduced to little more than a process-identifying cookie used for
communication with user space, it will be easier to virtualize without
complicating large amounts of kernel code.
Patches and updates
Core kernel code
- Junio C Hamano: GIT 1.1.2.
(January 14, 2006)
Filesystems and block I/O
Page editor: Jonathan Corbet