
Kernel development

Brief items

Kernel release status

The current stable 2.6 release is 2.6.15.1, announced on January 14. This one contains a dozen or so patches for kernel crashes and security problems.

The current 2.6 prepatch is 2.6.16-rc1, announced by Linus on January 17. The details of what's in this release can be found in last week's summary and this week's update (see below). In brief: 2.6.16 will include the OCFS2 filesystem, the swap migration patches, various new drivers, the mutex changeover, high-resolution timers, a transparent inter-process communication (TIPC) protocol implementation, a big netfilter update, a new batch-level scheduler class, and more. See the short log file for an overview (note that "short" is relative) or the long-format log for the details.

Linus's post-rc1 git repository contains a pile of network driver updates and a few other fixes. The 2.6.16 merge window has closed, so one would not expect to see a whole lot of new features going in at this point. There will apparently be one exception, however: Andrew Morton intends to merge the openat() series of system calls, along with the pselect() and ppoll() implementation. These new system calls were covered here last December.

The current -mm tree is 2.6.16-rc1-mm1. Recent changes to -mm include a bunch of reiser3 work ("Please test with caution, but please test."), several more semaphore-to-mutex conversion patches, multi-column oops stack backtraces for i386, and a new software suspend API intended to help move some of the image save/restore work to user space.


Kernel development news

More changes for 2.6.16

A fair number of patches have been merged since the "looking forward to 2.6.16" article was published. In addition to everything listed there, the following patches are part of 2.6.16-rc1, starting with user-visible changes:

  • A big XFS update which should improve performance.

  • A big direct rendering update. The Video4Linux and DVB code have also seen large updates.

  • An implementation of the Transparent Inter-Process Communication (TIPC) protocol. TIPC is used for communication within clusters.

  • Harald Welte's massive "x_tables" patch, which unifies much of the code for various types of tables used in the netfilter code.

  • A big PowerPC update including experimental Mac G5 support. There is also a new virtual "spufs" filesystem providing access to "synergistic processing units" on the Cell architecture.

  • A framework for "serial peripheral interface" (SPI) devices has been added, along with a handful of drivers.

  • A new SCHED_BATCH scheduler class. Processes in this class are scheduled normally, with the exception that they get no "interactivity" bonus when they sleep. Unprivileged processes are allowed to move between SCHED_NORMAL and SCHED_BATCH at will.

  • The tmpfs filesystem has acquired a new set of mount options allowing the system administrator to specify how memory should be allocated across a NUMA system.
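For the SCHED_BATCH item in the list above, a process would presumably select the new class through the existing sched_setscheduler() system call. The following is a minimal user-space sketch under that assumption, not code from the patch; enter_batch_class() and leave_batch_class() are hypothetical helper names, and the fallback value for SCHED_BATCH is only there for C libraries whose headers do not define it. User space knows SCHED_NORMAL as SCHED_OTHER.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for SCHED_BATCH in <sched.h> on glibc */
#endif
#include <assert.h>
#include <sched.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3          /* Linux value, in case the headers lack it */
#endif

/*
 * Move the calling process into the batch class.  As with the normal
 * class, the static priority must be zero.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int enter_batch_class(void)
{
    struct sched_param param = { .sched_priority = 0 };

    return sched_setscheduler(0, SCHED_BATCH, &param);
}

/* Return to the normal time-sharing class (SCHED_OTHER in user space). */
static int leave_batch_class(void)
{
    struct sched_param param = { .sched_priority = 0 };

    return sched_setscheduler(0, SCHED_OTHER, &param);
}
```

A CPU-bound job might call enter_batch_class() once at startup; since no privilege is needed to move in either direction, it can call leave_batch_class() later if it turns interactive.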

Changes visible to kernel developers include:

  • Many parts of the kernel have been converted over to the new mutex type.

  • Old-timers who automatically type "make bzImage" will find that it no longer works; just type "make" instead.

  • The device probe() and remove() methods have been moved from struct device_driver to struct bus_type. The bus-level methods will override any remaining driver methods.

  • When the kernel is configured to be optimized for size, gcc (if it's version 4.x) is given the freedom to decide whether inline functions should really be inlined. The __always_inline attribute now truly forces inlining in all cases. This is an outcome from the discussion on inline functions held a couple of weeks ago.

  • Another outcome from that discussion: many kernel functions have had the inline attribute removed. One of the more significant of these is capable(), which has also been moved to <linux/capability.h>.

  • The old inter_module functions are only built if the one in-kernel user (the MTD drivers) is present; otherwise they are unavailable.

The merge window for 2.6.16 is effectively closed, so there should not be a whole lot more in the way of significant changes being merged in this cycle.


The high-resolution timer API

Last September, this page featured an article on the ktimers patch by Thomas Gleixner. The new timer abstraction was designed to enable the provision of high-resolution timers in the kernel and to address some of the inefficiencies encountered when the current timer code is used in this mode. Since then, there has been a large amount of discussion, and the code has seen significant work. The end product of that work, now called "hrtimers," was merged for the 2.6.16 release.

At its core, the hrtimer mechanism remains the same. Rather than using the "timer wheel" data structure, hrtimers live on a time-sorted linked list, with the next timer to expire being at the head of the list. A separate red/black tree is also used to enable the insertion and removal of timer events without scanning through the list. But while the core remains the same, just about everything else has changed, at least superficially.
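The time-sorted list is easy to model in user space. What follows is only an illustrative sketch, not the kernel's code: struct model_timer and model_enqueue() are hypothetical names, and the red/black tree is left out, so insertion here has to scan the list (which is exactly the cost the tree is there to avoid).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a queued timer. */
struct model_timer {
    int64_t expires;              /* expiration time, in nanoseconds */
    struct model_timer *next;
};

/*
 * Insert t so that the list stays sorted by expiration time; the
 * earliest timer is always at *head.  Timers with equal expiration
 * times keep their insertion order.
 */
static void model_enqueue(struct model_timer **head, struct model_timer *t)
{
    struct model_timer **link = head;

    assert(t != NULL);
    while (*link != NULL && (*link)->expires <= t->expires)
        link = &(*link)->next;
    t->next = *link;
    *link = t;
}
```

With this arrangement, finding the next timer to expire is a matter of looking at the head of the list, no matter how many timers are queued.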

There is a new type, ktime_t, which is used to store a time value in nanoseconds. This type, found in <linux/ktime.h>, is meant to be used as an opaque structure. And, interestingly, its definition changes depending on the underlying architecture. On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds. On 32-bit machines, however, it is a two-field structure: one 32-bit value holds the number of seconds, and the other holds nanoseconds. The order of the two fields depends on whether the host architecture is big-endian or not; they are always arranged so that the two values can, when needed, be treated as a single, 64-bit value. Doing things this way complicates the header files, but it provides for efficient time value manipulation on all architectures.
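The field-ordering trick can be sketched in user-space C. This is a hypothetical model, not the definition from <linux/ktime.h>: model_ktime_t and its helpers are invented names, the union is simply how this sketch expresses "two fields that can be treated as one 64-bit value," and the byte-order test relies on a GCC/Clang predefined macro (falling back to the little-endian layout when it is absent).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the two-field layout.  The field order follows
 * the byte order so that seconds always land in the upper half of tv64.
 */
typedef union {
    int64_t tv64;
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        int32_t sec, nsec;
#else
        int32_t nsec, sec;        /* little-endian (assumed if unknown) */
#endif
    } tv;
} model_ktime_t;

static model_ktime_t model_ktime_set(int32_t secs, int32_t nanosecs)
{
    model_ktime_t kt;

    assert(nanosecs >= 0 && nanosecs < 1000000000);
    kt.tv.sec = secs;
    kt.tv.nsec = nanosecs;
    return kt;
}

/* One 64-bit comparison orders values by seconds, then nanoseconds. */
static int model_ktime_before(model_ktime_t a, model_ktime_t b)
{
    return a.tv64 < b.tv64;
}
```

Setting a value touches only two 32-bit fields, while comparisons (the operation a sorted timer queue performs most often) need just one 64-bit test.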

A whole set of functions and macros has been provided for working with ktime_t values, starting with the traditional two ways to declare and initialize them:

    DEFINE_KTIME(name);   /* Initialize to zero */

    ktime_t kt;
    kt = ktime_set(long secs, long nanosecs);

Various other functions exist for changing ktime_t values; all of these treat their arguments as read-only and return a ktime_t value as their result:

    ktime_t ktime_add(ktime_t kt1, ktime_t kt2);
    ktime_t ktime_sub(ktime_t kt1, ktime_t kt2);  /* kt1 - kt2 */
    ktime_t ktime_add_ns(ktime_t kt, u64 nanoseconds);

Finally, there are some type conversion functions:

    ktime_t timespec_to_ktime(struct timespec tspec);
    ktime_t timeval_to_ktime(struct timeval tval);
    struct timespec ktime_to_timespec(ktime_t kt);
    struct timeval ktime_to_timeval(ktime_t kt);
    clock_t ktime_to_clock_t(ktime_t kt);
    u64 ktime_to_ns(ktime_t kt);
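To show how helpers of this kind compose, here is a small user-space model of the 64-bit flavor, where a time value is just a count of nanoseconds. It is a sketch only: model_kt and the kt_*() functions are hypothetical stand-ins that mirror a few of the kernel names above, and negative values are deliberately left outside its scope.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for the 64-bit flavor of ktime_t. */
typedef struct { int64_t ns; } model_kt;

static model_kt kt_set(long secs, long nanosecs)
{
    model_kt kt = { (int64_t)secs * MODEL_NSEC_PER_SEC + nanosecs };
    return kt;
}

/* Arguments are taken by value and never modified. */
static model_kt kt_add(model_kt a, model_kt b)
{
    model_kt kt = { a.ns + b.ns };
    return kt;
}

static model_kt kt_sub(model_kt a, model_kt b)   /* a - b */
{
    model_kt kt = { a.ns - b.ns };
    return kt;
}

static int64_t kt_to_ns(model_kt kt)
{
    return kt.ns;
}

static struct timespec kt_to_timespec(model_kt kt)
{
    struct timespec ts;

    assert(kt.ns >= 0);           /* negative values are not modeled */
    ts.tv_sec = (time_t)(kt.ns / MODEL_NSEC_PER_SEC);
    ts.tv_nsec = (long)(kt.ns % MODEL_NSEC_PER_SEC);
    return ts;
}
```

Adding 1.5 seconds to 0.7 seconds in this model yields 2,200,000,000 nanoseconds, which converts to a timespec of two seconds and 200,000,000 nanoseconds; the carry into the seconds field falls out of the conversion rather than the addition.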

The interface for hrtimers can be found in <linux/hrtimer.h>. A timer is represented by struct hrtimer, which must be initialized with:

    void hrtimer_init(struct hrtimer *timer, clockid_t which_clock);

Every hrtimer is bound to a specific clock. The system currently supports two clocks:

  • CLOCK_MONOTONIC: a clock which is guaranteed always to move forward in time, but which does not reflect "wall clock time" in any specific way. In the current implementation, CLOCK_MONOTONIC resembles the jiffies tick count in that it starts at zero when the system boots and increases monotonically from there.

  • CLOCK_REALTIME: a clock which matches the current real-world time.

The difference between the two clocks can be seen when the system time is adjusted, perhaps as a result of administrator action, tweaking by the network time protocol code, or suspending and resuming the system. In any of these situations, CLOCK_MONOTONIC will tick forward as if nothing had happened, while CLOCK_REALTIME may see discontinuous changes. Which clock should be used will depend mainly on whether the timer needs to be tied to time as the rest of the world sees it or not. The call to hrtimer_init() will tie an hrtimer to a specific clock, but that clock can be changed with:

    void hrtimer_rebase(struct hrtimer *timer, clockid_t new_clock);
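The same two clock IDs are available to user space through the POSIX clock_gettime() call, which makes the distinction easy to experiment with. The sketch below is user-space code, not part of the hrtimer API; read_clock_ns() and elapsed_ns() are hypothetical helper names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock; returns nanoseconds, or -1 on error. */
static int64_t read_clock_ns(clockid_t which_clock)
{
    struct timespec ts;

    if (clock_gettime(which_clock, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Nanoseconds elapsed since start_ns, a value previously obtained from
 * read_clock_ns(CLOCK_MONOTONIC).  The monotonic clock is used because
 * a step in the wall-clock time between the two readings would corrupt
 * a CLOCK_REALTIME difference.  Returns -1 on error.
 */
static int64_t elapsed_ns(int64_t start_ns)
{
    int64_t now = read_clock_ns(CLOCK_MONOTONIC);

    assert(start_ns >= 0);
    return now < 0 ? -1 : now - start_ns;
}
```

The rule of thumb is the same as for hrtimers: intervals and timeouts want CLOCK_MONOTONIC, while "wake me at 9:00" wants CLOCK_REALTIME.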

Most of the hrtimer fields should not be touched. Two of them, however, must be set by the user:

    int  (*function)(void *);
    void *data;

As one might expect, function() will be called when the timer expires, with data as its parameter.

Actually setting a timer is accomplished with:

    int hrtimer_start(struct hrtimer *timer, ktime_t time,
                      enum hrtimer_mode mode);

The mode parameter describes how the time parameter should be interpreted. A mode of HRTIMER_ABS indicates that time is an absolute value, while HRTIMER_REL indicates that time should be interpreted relative to the current time.

Under normal operation, function() will be called after (at least) the requested expiration time. The hrtimer code implements a shortcut for situations where the sole purpose of a timer is to wake up a process on expiration: if function() is NULL, the process whose task structure is pointed to by data will be awakened. In most cases, however, code which uses hrtimers will provide a callback function(). That function has an integer return value, which should be either HRTIMER_NORESTART (for a one-shot timer which should not be started again) or HRTIMER_RESTART for a recurring timer.
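The callback contract can be illustrated with a small user-space model. None of this is kernel code: struct model_hrtimer carries only the two user-set fields, the MODEL_* constants stand in for HRTIMER_NORESTART and HRTIMER_RESTART (their numeric values here are an assumption), and model_expire() is a hypothetical stand-in for whatever the hrtimer core does at expiration. The NULL-function wakeup shortcut is not modeled.

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_NORESTART = 0, MODEL_RESTART = 1 };

/* Hypothetical stand-in for the two user-set fields of struct hrtimer. */
struct model_hrtimer {
    int  (*function)(void *);
    void *data;
};

/*
 * Expiration in this model: call function(data) and report whether the
 * timer asked to be queued again (1) or is finished (0).
 */
static int model_expire(struct model_hrtimer *timer)
{
    assert(timer != NULL && timer->function != NULL);
    return timer->function(timer->data) == MODEL_RESTART;
}

/* Example callback: fire three times in total, then stop. */
static int count_to_three(void *data)
{
    int *count = data;

    return ++*count < 3 ? MODEL_RESTART : MODEL_NORESTART;
}
```

A timer set up with count_to_three() and a pointer to a zeroed counter asks to be restarted on its first two expirations and declines on the third.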

In the restart case, the callback must set a new expiration time before returning. Usually, restarting timers are used by kernel subsystems which need a callback at a regular interval. The hrtimer code provides a function for advancing the expiration time to the next such interval:

    unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t interval);

This function will advance the timer's expiration time by the given interval. If necessary, the interval will be added more than once to yield an expiration time in the future. Generally, the need to add the interval more than once means that the system has overrun its timer period, perhaps as a result of high system load. The return value from hrtimer_forward() is the number of missed intervals, allowing code which cares to detect and respond to the situation.
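The arithmetic involved can be sketched in user space as follows. This is a model under stated assumptions rather than the kernel implementation: times are plain nanosecond counts, the current time is passed in explicitly instead of being read from the timer's clock, and model_forward() is a hypothetical name. The model returns the number of intervals it added, so any value greater than one indicates skipped periods; whether the kernel counts overruns in exactly this way is not something the text above settles.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Advance *expires by whole intervals until it lies after "now".
 * Returns the number of intervals added; zero means the timer had not
 * yet expired and was left alone.
 */
static unsigned long model_forward(int64_t *expires, int64_t now,
                                   int64_t interval)
{
    int64_t delta = now - *expires;
    unsigned long added;

    assert(interval > 0);
    if (delta < 0)
        return 0;                 /* not yet expired: nothing to do */

    /* Smallest number of intervals that moves *expires past now. */
    added = (unsigned long)(delta / interval) + 1;
    *expires += (int64_t)added * interval;
    return added;
}
```

For a timer due at 100 with an interval of 50, a callback running at time 260 gets back 4 and a new expiration of 300; the periods ending at 150, 200, and 250 have been lost.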

Outstanding timers can be canceled with either of:

    int hrtimer_cancel(struct hrtimer *timer);
    int hrtimer_try_to_cancel(struct hrtimer *timer);

When hrtimer_cancel() returns, the caller can be sure that the timer is no longer active, and that its expiration function is not running anywhere in the system. The return value will be zero if the timer was not active (meaning it had already expired, normally), or one if the timer was successfully canceled. hrtimer_try_to_cancel() does the same, but will not wait if the timer function is running; it will, instead, return -1 in that situation.

A canceled timer can be restarted by passing it to hrtimer_restart().

Finally, there is a small set of query functions. hrtimer_get_remaining() returns the amount of time left before a timer expires. A call to hrtimer_active() returns nonzero if the timer is currently on the queue. And a call to:

    int hrtimer_get_res(clockid_t which_clock, struct timespec *tp);

will return the true resolution of the given clock, with nanosecond precision, in the timespec structure pointed to by tp.
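User space has a call of the same shape in POSIX clock_getres(), which also takes a clock ID and a struct timespec pointer. The sketch below uses that user-space call, not hrtimer_get_res(); clock_resolution_ns() is a hypothetical helper, and the value it reports is whatever the running system claims for the clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Resolution of a POSIX clock in nanoseconds, or -1 on error. */
static int64_t clock_resolution_ns(clockid_t which_clock)
{
    struct timespec res;

    if (clock_getres(which_clock, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Calling clock_resolution_ns(CLOCK_MONOTONIC) on a system without high-resolution support would be expected to report something on the order of the timer tick rather than a handful of nanoseconds.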


Containers and PID virtualization

The folks at IBM would like to add a "container" capability to the Linux kernel. Containers are a way of walling a group of processes off from the rest of the system; a process within a container will only see its fellow inmate processes and whatever resources are made accessible to that container. This feature has some obvious security-related applications. IBM's plans, evidently, also include the ability to pack up a container and move it to another physical host without disrupting the processes trapped inside.

The patches which have been circulating so far fall short of the final plan, but they already disturb enough code to have attracted some skeptical criticism. In particular, the 34-part PID virtualization patch creates a simple container type, and implements a separate process ID space within containers. But, as we'll see, doing even that much involves some significant kernel changes.

The containers themselves are fairly simple. The patches create a virtual file called /proc/container. If a process writes a string to that file, a new container is created for that process, using the string as its name. The namespace is global, so every container on the system must have a unique name. Any child processes created by the newly-contained process will also be trapped within the container, with no way out.
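From the description above, creating a container would amount to a single write. The sketch below is a guess at what that looks like from user space, not code from the patch set: create_container() is a hypothetical helper, the control file path is a parameter so the function can be tried against an ordinary file, and whether the real interface wants a trailing newline is unknown (none is written here).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a container name to the control file, which would be
 * "/proc/container" with the patches applied.
 * Returns 0 on success, -1 on failure.
 */
static int create_container(const char *control_path, const char *name)
{
    FILE *f;
    int ok;

    assert(control_path != NULL && name != NULL && strlen(name) > 0);
    f = fopen(control_path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(name, f) >= 0;
    /* fclose() flushes; a failed flush means the write did not land. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A process calling create_container("/proc/container", "web1") on a patched kernel would, if the call succeeds, find itself (and any children it creates afterward) inside a container named "web1".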

At this point, being inside a container does not affect a process's life that much. The one thing that does change, however, is that each container has its own process ID (PID) space. Processes within the container can only see others in the same container. There is nothing particularly controversial about that behavior, but the developers have another objective in mind: they want to be able to change the PIDs of contained processes without the processes themselves noticing. In particular, they would like to be able to migrate a container to a different system, which will certainly assign new PIDs to every process within the container. Code written for Unix-like systems does not normally expect its PID to change over time, however; so switching PIDs underneath a process could lead to all kinds of strange behavior. To avoid this problem, the plan is that PIDs remain constant within the container, even if those PIDs change in the real world.

Implementing constant PIDs (from a viewpoint inside the container) is not a straightforward task; it involves adding a whole new virtualization layer inside the kernel. There are two types of PIDs now, "real" PIDs and the virtual PIDs used by contained processes. Any place in the kernel which deals with PID values must become aware of which type of PID it is using, and convert to the other type when necessary. So, as a general rule, any code which exchanges PIDs with user space must use the virtual variety, while PIDs handled within the kernel are real.

The PID logic is complicated by a few little details, like: what happens when containers are nested? A process living within a container has a real PID and a virtual PID associated with the container. If that process creates a container of its own, it will acquire yet another PID associated with the new container. So it is not possible to simply convert a real PID to a virtual PID; such questions require a "context" so that the kernel knows which virtual PID is wanted.

The result of all this is that PID handling within the kernel changes significantly. Code which used to get the current process's PID with current->pid must now use tsk_pid(current) for the real PID, or tsk_vpid(current) for the virtual PID - and it must know which one it wants. In situations where more than one virtual PID might be appropriate, tsk_vpid_ctx() must be used to supply the context. Much of the patch set is concerned simply with making these conversions; for good measure, it also renames the pid field of struct task_struct to catch any code still trying to access it directly.

Behind all of this is a concept called "pidspaces." The patch carves up the global PID space: it takes the upper 9 bits of the 32-bit PID value and puts the pidspace number there. A virtual PID as seen within a container is turned into a real kernel PID by stuffing the pidspace number in those upper bits. Since the contained processes only see virtual PIDs, they never see the pidspace number, and they will not notice if that number changes.
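The bit manipulation involved is simple enough to sketch. The code below is a user-space model, not the patch's implementation: the function and macro names are hypothetical, and treating the PID as an unsigned 32-bit quantity is an assumption made to keep the shifts well defined.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PIDSPACE_BITS 9
#define MODEL_VPID_BITS     (32 - MODEL_PIDSPACE_BITS)        /* 23 */
#define MODEL_VPID_MASK     ((UINT32_C(1) << MODEL_VPID_BITS) - 1)

/* Combine a pidspace number and a virtual PID into a real PID. */
static uint32_t model_real_pid(uint32_t pidspace, uint32_t vpid)
{
    assert(pidspace < (UINT32_C(1) << MODEL_PIDSPACE_BITS));
    assert(vpid <= MODEL_VPID_MASK);
    return (pidspace << MODEL_VPID_BITS) | vpid;
}

/* The virtual PID is what a contained process gets to see. */
static uint32_t model_vpid(uint32_t real_pid)
{
    return real_pid & MODEL_VPID_MASK;
}

static uint32_t model_pidspace(uint32_t real_pid)
{
    return real_pid >> MODEL_VPID_BITS;
}
```

In this model, virtual PID 1234 in pidspace 3 has the real PID 25167058. If the container is moved and lands in pidspace 7, the real PID changes, but model_vpid() still yields 1234, which is all the contained process ever sees.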

All of this code seems to work, but there is a certain amount of opposition to merging it. As Alan Cox put it:

    This is an obscure, weird piece of functionality for some special case usages most of which are going to be eliminated by Xen. I don't see the kernel side justification for it at all.

The developers answer that the ability to checkpoint and restart process trees, possibly moving them in between, will be highly useful. Some other virtualization projects also require this capability - not everybody wants to use Xen. So the pressure for PID virtualization probably won't just go away.

What might happen is that the hiding of current->pid might be taken out, greatly reducing the size of the patch. Another idea which has been floated is to eliminate, to the greatest degree possible, the use of PIDs within the kernel. Almost any in-kernel use of a PID can be replaced with a direct pointer to the task structure. If a PID eventually is reduced to little more than a process-identifying cookie used for communication with user space, it will be easier to virtualize without complicating large amounts of kernel code.


Patches and updates

Development tools

  • Junio C Hamano: GIT 1.1.2. (January 14, 2006)

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds