Kernel development
Brief items
Kernel release status
The current stable 2.6 release is 2.6.13.2, released on September 16.
The current 2.6 prepatch is 2.6.14-rc2, released by Linus on September 19. "Not a whole lot o' excitement, ye scurvy dogs, but it has t' ALSA, LSM, audit and watchdog merges that be missed from -rc1, and a merge series with Andrew." Some specific additions which came in after -rc1 include a new virtual filesystem for security modules, some DCCP additions, a number of audit subsystem patches, some netfilter enhancements, and an ALSA update. See the long-format changelog for the details.
Linus's git repository currently contains a SCSI update, some netfilter patches, an InfiniBand update, and various fixes.
The current -mm tree is 2.6.14-rc1-mm1. Recent changes to -mm include per-task write throttling, a conversion of the input subsystem to sysfs (includes some driver model changes which will need reworking prior to merging), a big reiser4 update meant to address various review comments, the removal of the perfctr patches (the maintainer is moving on and recommending perfmon instead), and some page allocator tweaks.
Kernel development news
Quotes of the week
if (is_key_possessed(keyref)) {
I'm inevitably mentally going "Linda Blair! It is spewing pea-soup and rotating its head!"
A new approach to kernel timers
The kernel internal API includes a flexible mechanism for requesting that events happen at some point in the future. This timer subsystem is relatively easy to work with and efficient, but it has always suffered from a fundamental limitation: it is tied to the kernel clock interrupt, with the result that the resolution of timers is limited to the clock interrupt period. For a 2.6.13 kernel, on the i386 architecture, using the default clock interval, timers can be no more precise than 4ms. For many applications, that resolution is adequate, but some others (including real time work and some desktop multimedia applications) require the ability to sleep reliably for shorter periods. Thus, a number of developers have produced high-resolution timer patches over the years, but none of them have been merged into the mainline.

Ingo Molnar's recently-released 2.6.13-rt6 tree, which contains the realtime preemption patch set, brought a surprise in the form of a new high-resolution timer implementation by Thomas Gleixner. Ingo has stated his intention to merge this new code ("ktimers") upstream, so it merits a look.
The ktimer implementation starts with the view that there are two fundamentally different types of timers used in the system. They are (using the terms adopted by the patch):
- Timeouts. Timeouts are used primarily by networking and device drivers to detect when an event (I/O completion, for example) does not occur as expected. They have low resolution requirements, and they are almost always removed before they actually expire.
- Timers. Timers are used to sequence ongoing events; they can have high resolution requirements, and they usually do expire.
The current kernel timer implementation is heavily oriented toward timeouts. To see how, consider the following diagram which, with sufficient imagination, can be construed as a model of the data structure used inside the kernel to manage timers:
![Timer wheel diagram](https://static.lwn.net/images/ns/kernel/Timers.png)
At the right side of the diagram is an array (tv1) containing a set of 256 (in most configurations) linked lists of upcoming timer events. This array is indexed directly by the bottom bits of a jiffies value to find the next set of events to execute. When the kernel has, over the course of 256 jiffies, cycled through the entire tv1 array, that array must be replenished with the next 256 jiffies worth of events. That is done by using the next set of jiffies bits (six, normally) to index into the next array (tv2), which points to those 256 jiffies of timer entries. Those entries are "cascaded" down to tv1 and distributed into the appropriate slots depending on their expiration times. When tv2 is exhausted, it is replenished from tv3 in the same way. This process continues up to tv5. The final entry in tv5 is special, in that it holds all of the far-future events which do not otherwise fit into this hierarchy.
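As a rough sketch of the indexing scheme (not the kernel's actual code - the real logic lives in kernel/timer.c and is more involved), the slot calculations look something like this; the helper names are invented, while the constants match the common configuration described above:

    /*
     * Simplified model of timer-wheel indexing: 256 slots in tv1,
     * 64 slots in each of tv2..tv5.
     */
    #define TVR_BITS 8                        /* tv1: 256 slots */
    #define TVR_SIZE (1 << TVR_BITS)
    #define TVN_BITS 6                        /* tv2..tv5: 64 slots each */
    #define TVN_SIZE (1 << TVN_BITS)

    /* tv1 slot: the bottom eight bits of the expiration time (in jiffies) */
    static unsigned int tv1_slot(unsigned long expires)
    {
            return expires & (TVR_SIZE - 1);
    }

    /* tv2 slot: the next six bits; each slot covers 256 jiffies */
    static unsigned int tv2_slot(unsigned long expires)
    {
            return (expires >> TVR_BITS) & (TVN_SIZE - 1);
    }

The cascade described above then amounts to walking the list found in the current tv2 slot and redistributing each entry into tv1 according to the low-order bits of its expiration time.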
This structure has some distinct advantages. It can retrieve all of the events to execute with a simple array lookup. Insertion of events is cheap, since their location in the structure is easy to calculate. Importantly, the removal of events is also cheap; there is no need to search through a long list of events to find a specific one to take out. Since most timeouts are removed before they expire, quick removal is a useful feature.
On the other hand, this data structure is firmly tied to jiffies values, and cannot easily cope with timers with sub-jiffies resolution. The cascade process, which moves events from the higher arrays to the lower ones, can be expensive if there are a lot of events to work with. Events which are removed prior to expiration will often not have to be cascaded at all, while those which survive through to expiration will have to work their way through the structure. If the clock interrupt frequency is raised (to get better timer resolution), these cascades will happen more often, and the cost of the data structure goes up.
The ktimers patch makes no changes to the existing API or data structure, which are deemed to be adequate and efficient for use with timeouts. Instead, it adds an entirely new API (and internal implementation) aimed at the needs of high-resolution timers. Ktimers are described entirely in human time units - nanoseconds, in particular. They are kept in a sorted, per-CPU list, implemented as a red-black tree. This structure provides relatively quick insertion and removal - slower than the timeout structure shown above, but with no need for the cascade operation.
The core structure for ktimers is, unsurprisingly, struct ktimer. They must be initialized before use with one of the following functions:
    void init_ktimer_mono(struct ktimer *timer);
    void init_ktimer_real(struct ktimer *timer);
Internally, each ktimer is tied to a "base," being the clock by which it is run. The ktimer patch provides two such clocks. The "monotonic" clock is similar to jiffies in that it is a straightforward, always-increasing count. The "realtime" clock, instead, tries to match time as known outside of the system; that clock can be corrected by the kernel or by the system administrator. A ktimer with a 5ms expiration will, if initialized with init_ktimer_mono(), expire 5ms in the future (with the usual proviso that delays can happen). That same timer, if initialized with init_ktimer_real(), will expire when the realtime clock says that 5ms have passed. But, since the realtime clock may be adjusted in the meantime, the actual elapsed time could differ.
There are some caller-accessible fields in struct ktimer:
    void (*function)(void *);
    void *data;
    nsec_t expired;
    nsec_t interval;
When the timer expires, function() will be called with data as its argument. The expired field will contain the time at which the timer actually expired, which might be later than requested. Interestingly, the high-resolution version of the ktimers patch does not set this field. Finally, interval is used for periodic timers.
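Putting the description together, struct ktimer presumably looks something like the sketch below; only the four fields listed above come from the patch as described here, while the tree node, base pointer, and expiration field are guesses added for illustration:

    /* Illustrative sketch only - not the actual declaration from the patch. */
    struct ktimer {
            struct rb_node node;              /* entry in the per-CPU red-black tree (assumed) */
            void *base;                       /* clock base: monotonic or realtime (assumed) */
            nsec_t expires;                   /* absolute expiration time, in ns (assumed) */

            /* caller-accessible fields described above */
            void (*function)(void *);         /* called on expiration */
            void *data;                       /* passed to function() */
            nsec_t expired;                   /* when the timer actually expired */
            nsec_t interval;                  /* for periodic timers */
    };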
A timer is set with a call to:
    int start_ktimer(struct ktimer *timer, nsec_t *time, int mode);
Here, time is the expiration time in nanoseconds, and mode describes how that time is to be interpreted. The possible mode values are:
- KTIMER_ABS: the timer will expire at an absolute time.
- KTIMER_REL: the given time value is a relative time, which must be added to the current time to get an absolute expiration time.
- KTIMER_INCR: for timers which have been used before, the time value is added to the previous expiration time.
- KTIMER_FORWARD: like KTIMER_INCR, except that the time value will be added repeatedly, if necessary, to obtain an expiration time in the future.
- KTIMER_REARM: like KTIMER_FORWARD, except that the interval value stored in the timer is added.
- KTIMER_RESTART: the expiration time of the timer is not changed at all.
For KTIMER_FORWARD and KTIMER_REARM, the ktimer code also maintains an integer overrun field in the ktimer structure. If a timer is started after the next expected expiration time (in other words, the system fell behind and did not restart the timer soon enough), overrun will be incremented to allow the calling code to compensate.
The return value will be zero unless the timer has already expired, in which case the timer will not be started and the return value will be negative. If, however, the mode argument contains the bit KTIMER_NOCHECK, the timer will be started and executed normally, regardless of whether it has already expired.
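As a brief usage sketch built only from the calls described above (the handler and the 5ms delay are invented for illustration), arming a one-shot monotonic timer might look like:

    /* Hypothetical example: arm a one-shot 5ms timer on the monotonic clock. */
    static struct ktimer my_timer;

    static void my_timeout_handler(void *data)
    {
            /* runs when the timer expires */
    }

    static void arm_example_timer(void)
    {
            nsec_t delay = 5 * 1000 * 1000;   /* 5ms, expressed in nanoseconds */

            init_ktimer_mono(&my_timer);      /* run against the monotonic clock */
            my_timer.function = my_timeout_handler;
            my_timer.data = NULL;

            /* KTIMER_REL: "delay" is relative to the current time */
            if (start_ktimer(&my_timer, &delay, KTIMER_REL) < 0)
                    printk("example: expiration time has already passed\n");
    }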
Most of the other ktimer functions are reasonably self-explanatory for those who have seen the current timer API:
    int modify_ktimer(struct ktimer *timer, nsec_t *time, int mode);
    int try_to_stop_ktimer(struct ktimer *timer);
    int stop_ktimer(struct ktimer *timer);
There is also a convenience function to make a process sleep on a ktimer:
    nsec_t schedule_ktimer(struct ktimer *timer, nsec_t *time, int state, int mode);
The additional argument here (state) should be TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether the sleep should be interrupted by signals or not. The return value is the number of nanoseconds remaining in the requested sleep time; it will be zero except when the sleep is ended prematurely.
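A similarly hedged sketch of sleeping on a ktimer, again using only the names given above, might look like:

    /* Hypothetical example: sleep for 100 microseconds, interruptibly. */
    static int short_sleep(void)
    {
            struct ktimer sleeper;
            nsec_t delay = 100 * 1000;        /* 100us in nanoseconds */
            nsec_t remaining;

            init_ktimer_mono(&sleeper);
            remaining = schedule_ktimer(&sleeper, &delay,
                                        TASK_INTERRUPTIBLE, KTIMER_REL);

            /* a nonzero return means the sleep ended early (a signal, perhaps) */
            return remaining ? -EINTR : 0;
    }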
The standalone ktimers patch posted by Thomas is the version most likely to be merged. This patch runs ktimers from the normal clock interrupt, with the result that it provides no better resolution than the existing timer API. All of the structure is there to do better, however, once the low-level timer code and architecture specific support is in place. A separate patch exists which enables ktimers to provide high-resolution timers on the i386 architecture.
So far, the largest objection to the ktimer implementation is the use of nanoseconds for time values. Nanosecond timekeeping requires 64-bit variables, which will slow things down a little on 32-bit systems. The response from the developers is that the additional overhead is almost zero and not worth worrying about. So, unless some other surprise turns up, ktimers could find their way into the kernel not too long after 2.6.14 comes out.
(See also: this posting from Thomas, which describes the motivation behind ktimers and its relation to other timing patches in detail).
ZONE_DMA32
Linux systems typically divide main memory into three zones. Most memory fits into the "normal" zone, ZONE_NORMAL. At the low end, however, the first 16MB of memory are partitioned into the DMA zone, ZONE_DMA; this memory is then reserved for situations where it is specifically needed. The most common user of DMA memory is older peripherals which can only address 24 bits of memory. Finally, on the high end, ZONE_HIGHMEM contains all memory which cannot be directly addressed by the kernel.

Not all systems implement all of these zones. Some newer architectures do not support ancient peripherals and leave out ZONE_DMA. In general, 64-bit systems have no addressing problems and do not need ZONE_HIGHMEM. The ia64 architecture settled on a different implementation of ZONE_DMA, defining it to cover all memory addressed below 4GB.
As it turns out, there are uses for a 4GB zone. Quite a few devices have trouble accessing memory which cannot be addressed with 32 bits. Drivers for such devices have been forced to use ZONE_DMA, the I/O memory management unit (on systems which have one), or bounce buffers. None of those solutions is ideal: ZONE_DMA is a small and scarce resource, IOMMU space can also be scarce, and bounce buffers are slow. All of these problems could be avoided if DMA memory could be reliably allocated below the 4GB boundary.
Andi Kleen has decided that the time has come for the x86-64 architecture to support a 32-bit DMA zone. So his patch adds a new zone (ZONE_DMA32) and an associated GFP flag (GFP_DMA32) for allocations. According to Andi, the reason which prevented the addition of this zone in the first place (the fact that the virtual memory subsystem had a very hard time balancing memory between zones) has gone away. Meanwhile, the lack of this zone is causing real problems.
This patch does not actually add the new zone for any architecture except x86-64. For ia64, it causes GFP_DMA to mean the same thing as GFP_DMA32, with the idea that GFP_DMA should, once again, be restricted to the older, 24-bit meaning. The patch also causes the generic DMA code to use the new zone when it makes sense, making it available to properly-written drivers with no additional work required.
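As a sketch of what a driver-level allocation might look like once the patch is merged (the helper below, and the combination of GFP_DMA32 with GFP_KERNEL, are assumptions based on how GFP_DMA is used today):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Hypothetical helper: allocate one page guaranteed to lie below 4GB. */
    static void *alloc_dma32_page(void)
    {
            /* On x86-64 this should be satisfied from ZONE_DMA32; on ia64,
             * GFP_DMA32 behaves like today's GFP_DMA (below 4GB). */
            struct page *page = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);

            return page ? page_address(page) : NULL;
    }

Drivers which already use the generic DMA API with a 32-bit mask should not need even that much; as noted above, the generic DMA code will pick the new zone when it makes sense.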
This patch has come too late for inclusion into 2.6.14, but expect to see it in a mainline kernel shortly thereafter.
Predictive per-task write throttling
Memory-intensive tasks can be the bane of many a system administrator. One task which plows through vast numbers of pages can make the system thrash for everybody. The problem is especially acute when the memory hog is writing pages. Since each page dirtied by the process must be written to backing store before it can be reclaimed, a write-intensive task can quickly take a large portion of the system's memory out of commission. Often, a simple large file copy can noticeably impact a system's performance for some time after the copy apparently completes.

The Linux VM subsystem attempts to address this problem with a simple form of write throttling. When the number of dirty pages gets too large, a process caught in the act of dirtying a page will be sent off to write out a few pages before being allowed to proceed. This technique slows the dirtying of pages while simultaneously helping to reclaim pages which have already been written to. This write throttling code makes no attempt to penalize any specific process, however; it will happily throttle any process which dirties a page at the wrong time.
Andrea Arcangeli has decided to improve the situation with a per-task predictive write throttling patch, currently found in the -mm tree. The patch is surprisingly simple - especially after noting that the bulk of it is involved with setting up the /proc and sysctl control interfaces.
At its core, the patch adds a simple accumulator which keeps an approximate count of the number of pages dirtied by each process over the last five seconds. It then assumes that each process will continue to dirty pages at about the same rate into the future. The "are there too many dirty pages?" calculation is then changed to take this rate into account. The code, thus, is making a guess at what the dirty memory situation will be like in the future, based on what each process is doing. Any process which looks like it will cause too much memory to be dirtied gets to perform writeback for a while, while processes which are not writing to lots of pages are not given that particular chore.
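In very rough terms - this is a simplified model of the idea, not code from Andrea's patch, and every name and constant below is invented - the decision looks something like:

    /*
     * Simplified model of per-task predictive write throttling.
     * All identifiers and numbers here are illustrative.
     */
    #define RATE_WINDOW_SECS 5

    struct task_dirty_rate {
            unsigned long dirtied_in_window;  /* pages dirtied over the last 5 seconds */
    };

    /* Called when a task dirties a page: should it be made to do writeback? */
    static int should_throttle(struct task_dirty_rate *rate,
                               unsigned long dirty_pages,   /* dirty pages system-wide */
                               unsigned long dirty_limit)   /* "too many dirty pages" threshold */
    {
            /* Assume the task keeps dirtying pages at its recent rate... */
            unsigned long predicted = rate->dirtied_in_window;

            /* ...and throttle it if that prediction would push the system
             * past the dirty-memory limit. */
            return dirty_pages + predicted > dirty_limit;
    }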
Andrea's preliminary results show that, with this patch in place, small, interactive tasks running in competition with a large copy task will run more quickly. Since the copy operation is being made to perform writeback (when it would otherwise have been dirtying more pages), more memory is available for the other tasks in the system. The interesting part of the result is that the copy task runs no slower with the patch in place. A process which is bound by the system's ability to write pages to disk will not benefit from being allowed to dirty the bulk of the system's memory, and it will not suffer by being throttled. So this little patch looks like it could be a winner for everybody involved.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet