Brief items
The current stable 2.6 release is 2.6.13.2,
released on September 16.
The current 2.6 prepatch is 2.6.14-rc2, released by Linus on
September 19. "Not a whole lot o' excitement, ye scurvy dogs,
but it has t' ALSA, LSM, audit and watchdog merges that be missed from
-rc1, and a merge series with Andrew." Some specific additions
which came in after -rc1 include a new virtual filesystem for security
modules, some DCCP additions, a number of audit subsystem patches, some
netfilter enhancements, and an ALSA update. See the long-format changelog for the details.
Linus's git repository currently contains a SCSI update, some netfilter patches, an InfiniBand update, and various fixes.
The current -mm tree is 2.6.14-rc1-mm1. Recent changes
to -mm include per-task write throttling, a conversion of the input
subsystem to sysfs (includes some driver model changes which will need
reworking prior to merging), a big reiser4 update meant to address various
review comments, the removal of the perfctr patches (the maintainer is moving
on and recommending perfmon instead), and some page allocator tweaks.
Comments (2 posted)
Kernel development news
Insulting is ok, but I personally get really pissed off [when] a tool
is both confused and insulting. At least be _correct_ and insulting.
-- Linus Torvalds
Ok, maybe I'm just strange, but when I see code like
if (is_key_possessed(keyref)) {
I'm inevitably mentally going "Linda Blair! It is spewing pea-soup and
rotating its head!"
-- Linus Torvalds
Comments (none posted)
The kernel internal API includes a flexible mechanism for requesting that
events happen at some point in the future. This timer subsystem is
relatively easy to work with and efficient, but it has always suffered from
a fundamental limitation: it is tied to the kernel clock interrupt, with
the result that the resolution of timers is limited to the clock interrupt
period. For a 2.6.13 kernel, on the i386 architecture, using the default
clock interval, timers can be no more precise than 4ms. For many
applications, that resolution is adequate, but some others (including real
time work and some desktop multimedia applications) require the ability to
sleep reliably for shorter periods. Thus, a number of developers have
produced high-resolution timer patches over the years, but none of them
have been merged into the mainline.
Ingo Molnar's recently-released 2.6.13-rt6 tree, which contains
the realtime preemption patch set, brought a surprise in the form of a new
high-resolution timer implementation by Thomas Gleixner. Ingo has stated
his intention to merge this new code ("ktimers") upstream, so it merits a
look.
The ktimer implementation starts with the view that there are two
fundamentally different types of timers used in the system. They are
(using the terms adopted by the patch):
- Timeouts. Timeouts are used primarily by networking and device
drivers to detect when an event (I/O completion, for example) does not
occur as expected. They have low resolution requirements, and they
are almost always removed before they actually expire.
- Timers are used to sequence ongoing events. They can have high
resolution requirements, and usually expire.
The current kernel timer implementation is heavily oriented toward
timeouts. To see how, consider the following diagram which, with
sufficient imagination, can be construed as a model of the data structure
used inside the kernel to manage timers:
At the right side of the diagram is an array (tv1) containing a
set of 256 (in most configurations) linked lists of upcoming timer events.
This array is indexed directly by the bottom bits of a
jiffies value to find the next set of events to execute. When the
kernel has, over the course of 256 jiffies, cycled through the entire
tv1 array, that array must be replenished with the next 256
jiffies worth of events. That is done by using the next set of jiffies bits (six,
normally) to index into the next array (tv2), which points to
those 256 jiffies of timer entries. Those entries are "cascaded" down to
tv1 and distributed into the appropriate slots depending on their
expiration times.
When tv2 is exhausted, it is
replenished from tv3 in the same way. This process continues up
to tv5. The final entry in tv5 is special, in that it
holds all of the far-future events which do not otherwise fit into this
hierarchy.
This structure has some distinct advantages. It can retrieve all of the
events to execute with a simple array lookup. Insertion of events is
cheap, since their location in the structure is easy to calculate.
Importantly, the removal of events is also cheap; there is no need to
search through a long list of events to find a specific one to take out.
Since most timeouts are removed before they expire, quick removal is a
useful feature.
On the other hand, this data structure is firmly tied to jiffies
values, and cannot easily cope with timers with sub-jiffies
resolution. The cascade process, which moves events from the higher arrays
to the lower ones, can be expensive if there are a lot of events to work
with. Events which are removed prior to expiration will often not have to
be cascaded at all, while those which survive through to expiration will
have to work their way through the structure. If the clock interrupt
frequency is raised (to get better timer resolution), these cascades will
happen more often, and the cost of the data structure goes up.
The ktimers patch makes no changes to the existing API or data structure,
which are deemed to
be adequate and efficient for use with timeouts. Instead, it adds an
entirely new API (and internal implementation) aimed at the needs of
high-resolution timers. So ktimers are described entirely with human time
units - nanoseconds, in particular. They are kept in a sorted, per-CPU
list, implemented as a red-black tree. This structure provides for
relatively quick insertion or removal, though it will be slower than the
timeout structure shown above - but there is no need for the cascade
operation.
The core structure for ktimers is, unsurprisingly,
struct ktimer. They must be initialized before use with one
of the following functions:
void init_ktimer_mono(struct ktimer *timer);
void init_ktimer_real(struct ktimer *timer);
Internally, each ktimer is tied to a "base," being the clock by which it is
run. The ktimer patch provides two such clocks. The "monotonic" clock is
similar to jiffies in that it is a straightforward,
always-increasing count. The "realtime" clock, instead, tries to match
time as known outside of the system; that clock can be corrected by the
kernel or by the system administrator. A ktimer with a 5ms expiration
will, if initialized with init_ktimer_mono(), expire 5ms in the
future (with the usual proviso that delays can happen). That same timer,
if initialized with init_ktimer_real(), will expire when the
realtime clock says that 5ms have passed. But, since the realtime clock
may be adjusted in the meantime, the actual elapsed time could differ.
There are some caller-accessible fields in struct ktimer:
void (*function)(void *);
void *data;
nsec_t expired;
nsec_t interval;
When the timer expires, function() will be called with
data as its argument. The expired field will contain the
time at which the timer actually expired, which might be later than
requested. Interestingly, the high-resolution version of the ktimers patch
does not set this field. Finally, interval is used for
periodic timers.
A timer is set with a call to:
int start_ktimer(struct ktimer *timer, nsec_t *time, int mode);
Here, time is the expiration time in nanoseconds, and
mode describes how
that time is to be interpreted. The possible mode values are:
- KTIMER_ABS: the timer will expire at an absolute time.
- KTIMER_REL: the given time value is a relative time,
which must be added to the current time to get an absolute expiration
time.
- KTIMER_INCR: for timers which have been used before, the
time value is added to the previous expiration time.
- KTIMER_FORWARD: like KTIMER_INCR, except that the
time value will be added repeatedly, if necessary, to obtain
an expiration time in the future.
- KTIMER_REARM: like KTIMER_FORWARD, except that the
interval value stored in the timer is added.
- KTIMER_RESTART: the expiration time of the timer is not
changed at all.
For KTIMER_FORWARD and KTIMER_REARM, the ktimer code also
maintains an integer overrun field in the ktimer structure. If a
timer is started after the next expected expiration time (in other words,
the system fell behind and did not restart the timer soon enough),
overrun will be incremented to allow the calling code to
compensate.
The return value will be zero, unless the timer is already expired, in
which case the timer will not be started and the return value will
be negative. If, however, the mode argument contains the bit
KTIMER_NOCHECK, the timer will be started and executed normally,
regardless of whether it is already expired.
Most of the other ktimer functions are reasonably self-explanatory for
those who have seen the current timer API:
int modify_ktimer(struct ktimer *timer, nsec_t *time, int mode);
int try_to_stop_ktimer(struct ktimer *timer);
int stop_ktimer(struct ktimer *timer);
There is also a convenience function to make a process sleep on a ktimer:
nsec_t schedule_ktimer(struct ktimer *timer, nsec_t *time,
int state, int mode);
The additional argument here (state) should be
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on
whether the sleep should be interrupted by signals or not. The return
value is the number of nanoseconds remaining in the requested sleep time;
it will be zero except when the sleep is ended prematurely.
The standalone ktimers patch
posted by Thomas is the version most likely to be merged. This patch runs
ktimers from the normal clock interrupt, with the result that it provides
no better resolution than the existing timer API. All of the structure is
there to do better, however, once the low-level timer code and architecture
specific support is in place. A separate patch exists which enables
ktimers to provide high-resolution timers on the i386 architecture.
So far, the largest objection to the ktimer implementation is the use of
nanoseconds for time values. Nanosecond timekeeping requires 64-bit
variables, which will slow things down a little on 32-bit systems. The
response from the developers is that the additional overhead is almost zero
and not worth worrying about. So, unless some other surprise turns up,
ktimers could find their way into the kernel not too long after 2.6.14
comes out.
(See also: this posting from
Thomas, which describes the motivation behind ktimers and its relation
to other timing patches in detail).
Comments (5 posted)
Linux systems typically divide main memory into three zones. Most memory
fits into the "normal" zone,
ZONE_NORMAL. At the low end,
however, there are 16MB of memory which are partitioned into the DMA zone
ZONE_DMA; this memory is then reserved for situations where it is
specifically needed. The most common user of DMA memory is older
peripherals which can only address 24 bits of memory. Finally, on the high
end,
ZONE_HIGHMEM contains all memory which cannot be directly
addressed by the kernel.
Not all systems implement all of these zones. Some newer architectures do
not support ancient peripherals and leave out ZONE_DMA. In
general, 64-bit systems have no addressing problems and do not need
ZONE_HIGHMEM. The ia64 architecture settled on a different
implementation of ZONE_DMA, defining it to cover all memory
addressed below 4GB.
As it turns out, there are uses for a 4GB zone. Quite a few devices have
trouble accessing memory which cannot be addressed with 32 bits. Drivers
for such devices have been forced to use ZONE_DMA, the I/O memory
management unit (on systems which have one), or bounce buffers. None of
those solutions is ideal: ZONE_DMA is a small and scarce resource,
IOMMU space can also be scarce, and bounce buffers are slow. All of these
problems could be avoided if DMA memory could be reliably allocated below
the 4GB boundary.
Andi Kleen has decided that the time has come for the x86-64 architecture
to support a 32-bit DMA zone. So his patch adds a new zone
(ZONE_DMA32) and an associated GFP flag (GFP_DMA32) for
allocations. According to Andi, the reason which prevented the addition of
this zone in the first place (the fact that the virtual memory subsystem
had a very hard time balancing memory between zones) has gone away.
Meanwhile, the lack of this zone is causing real problems.
This patch does not actually add the new zone for any architecture except
x86-64. For ia64, it causes GFP_DMA to mean the same thing as
GFP_DMA32, with the idea that GFP_DMA should, once again,
be restricted to the older, 24-bit meaning. The patch also causes the
generic DMA code to use the new zone when it makes sense, making it
available to properly-written drivers with no additional work required.
This patch has come too late for inclusion into 2.6.14, but expect to see
it in a mainline kernel shortly thereafter.
Comments (none posted)
Memory-intensive tasks can be the bane of many a system administrator. One
task which plows through vast numbers of pages can make the system thrash
for everybody. The problem is especially acute when the memory hog is
writing pages. Since each page dirtied by the process must be written to
backing store before it can be reclaimed, a write-intensive task can
quickly take a large portion of the system's memory out of commission.
Often, a simple large file copy can noticeably impact a system's
performance for some time after the copy apparently completes.
The Linux VM subsystem attempts to address this problem with a simple form
of write throttling. When the number of dirty pages gets too large, a
process caught in the act of dirtying a page will be sent off to write out
a few pages before being allowed to proceed. This technique slows the
dirtying of pages while simultaneously helping to reclaim pages which have
already been written to. This write throttling code makes no attempt to
penalize any specific process, however; it will happily throttle any
process which dirties a page at the wrong time.
Andrea Arcangeli has decided to improve the situation with a per-task predictive write
throttling patch, currently found in the -mm tree. The patch is
surprisingly simple - especially after noting that the bulk of it is
involved with setting up the /proc and sysctl control interfaces.
At its core, the patch adds a simple accumulator which keeps an approximate
count of the number of pages dirtied by each process over the last five
seconds. It then assumes that each process will continue to dirty pages at
about the same rate into the future. The "are there too many dirty pages?"
calculation is then changed to take this rate into account. The code,
thus, is making a guess at what the dirty memory situation will be like in
the future, based on what each process is doing. Any process which looks
like it will cause too much memory to be dirtied gets to perform writeback
for a while, while processes which are not writing to lots of pages are not
given that particular chore.
Andrea's preliminary results show that, with this patch in place, small,
interactive tasks run in competition with a large copy task will run more
quickly. Since the copy operation is being made to perform writeback (when
it would have otherwise been dirtying more pages), more memory is available
for the other tasks in the system. The interesting part of the result is
that the copy task runs no slower with this patch in place. A process
which is bound by the system's ability to write pages to disk will not
benefit from being allowed to dirty the bulk of the system's memory, and it
will not suffer by being throttled. So this little patch looks like it
could be a winner for everybody involved.
Comments (6 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>