Brief items
The current development kernel is 2.5.63,
released by Linus on February 24. It
includes an ISAPnP update, some IDE changes (see
last week's Kernel Page), an ACPI update,
various architecture updates, a new x86 "double fault" handler, a bluetooth
update, and the inevitable set of spelling fixes. The
the long-format changelog has the details.
Linus's BitKeeper tree includes some more loadable module fixes, more
spelling fixes ("A 'wether' is a castrated goat"), a uClinux
update, an XFS update, a software suspend update, and various other fixes
and performance improvements.
The current stable kernel is 2.4.20. Marcelo has promised a new
2.4.21 prepatch soon, but it had not appeared as of this writing.
The current 2.4 prepatch from Alan Cox is 2.4.21-pre4-ac6; it adds mostly driver
updates.
Comments (3 posted)
Kernel development news
The reverse-mapping VM (RMAP) was merged into 2.5 to solve a specific problem:
there was no easy way for the kernel to find out which page tables referred
to a given physical page. Certain activities - swapping being at the top
of the list - require making changes to all relevant page tables. You
simply can not swap a page to disk until all of the page table entries
pointing to it have been invalidated. The 2.4 kernel handles swapping by
scanning through the page tables, one process at a time, and invalidating
entries for
pages that look like suitable victims. If it happens to find all of the
page table entries in time, the page can then be evicted to disk.
In 2.5, a new data structure was added to make this process easier.
Initially each page in the system (as represented by its
struct page structure in the system memory map) had a linked
list of reverse mapping entries pointing to every page table entry
referencing that page. That worked, but it introduced some problems of its
own. The reverse mapping entries took up a lot of memory, and quite a bit
of time to maintain. Operations which required working with a lot of pages
slowed down. And the fork() system call, which must add a new
reverse mapping entry for every page in the process's address space, slowed
significantly. As a result, there has been an ongoing effort to mitigate
RMAP's costs.
Now a new technique, as embodied in this
patch by Dave McCracken, has been proposed. This approach, called
"object-based reverse mapping," is based on the realization that, in some
cases at least, there are other paths from a struct page to a
page table entry. If those paths can be used, the full RMAP overhead is
unnecessary and can be cut out.
By one reckoning, there are two basic types of user-mode page in a Linux
system. Anonymous pages are just plain memory, the kind a process
would get from malloc(). Most other pages are file-backed
in some way; this means that, behind the scenes, the contents of that page
are associated with a file somewhere in the system. File-backed pages
include program code and files mapped in with
mmap(). For these pages, it is possible to find their page table
entries without using RMAP entries. To see how, let us refer to the
following low-quality graphic, the result of your editor's nonexistent
drawing skills:
The struct page structure for a given page is in the upper left
corner. One of the fields of that structure is called mapping; it
points to an address_space structure describing the object which
backs up that page. That structure includes the inode for the file,
various data structures for managing the pages belonging to the file, and
two linked lists (i_mmap and i_mmap_shared) containing
the vm_area_struct structures for each process which has a mapping
into the file. The vm_area_struct (usually called a "VMA")
describes how the mapping appears in a particular process's address space;
the file /proc/pid/maps lists out the VMAs for the process
with ID pid. The VMA provides the information needed to
find out what a given page's virtual address is in that process's address
space, and that, in turn, can be used to find the correct page table
entry.
So all the object-based RMAP patch does is remove the direct reverse
mapping entry (pointing from the page structure directly to the
page table entry). When it is necessary to find that entry, the virtual
memory subsystem simply takes the longer way around, via the
address_space and vm_area_struct structures. Finding a
page table entry this way certainly will take longer than following a
direct pointer, but it should come out cheaper when one considers all of
the RMAP information that no longer needs to be maintained.
The object-based RMAP patch does not change the handling of anonymous
pages, which do not have an associated address_space structure.
Martin Bligh has posted some initial
benchmarks showing some moderate improvement in the all-important
kernel compilation test. The object-based approach does seem to help with
some of the worst RMAP performance regressions. Andrew Morton pointed out a worst-case performance scenario for
this approach, but it is not clear how big a problem it would really be.
Andrew has included this patch in his 2.5.62-mm3 tree.
Assuming that this patch goes in (it's late in the development process, but
that hasn't stopped Linus from taking rather more disruptive VM patches
before...), one might wonder if a complete object-based implementation
might follow. The answer is "probably not." Anonymous pages tend to be
private to individual processes, so there is no long chain of reverse
mappings to manage in any case. So even if such pages came to look like
file-backed pages (as could happen, say, with a rework of the swapping
code), there isn't necessarily much to be gained from the object-based
approach.
Comments (3 posted)
The object-based RMAP patch is one approach to reducing the overhead of the
virtual memory subsystem. William Lee Irwin has
posted another: page clustering. Much of the
VM subsystem's overhead is per-page; each page requires a memory map entry,
possibly RMAP chains, etc. One way of reducing that overhead, clearly,
would be to have fewer pages. Since most users will react poorly to
suggestions that they remove memory from their systems, the only
feasible way of reducing the page count is to make the pages themselves bigger.
The page clustering patch (based on work originally done by Hugh Dickins)
works by taking physical pages (as seen by the hardware) and grouping them
into larger, virtual pages as seen by the kernel. x86 hardware works
(normally) with 4K pages; with page clustering the kernel can work with
pages as large as 32K (according to the comments in page.h or 64K
(according to what the code is actually doing). Thus, the page count (and
associated overhead) can be reduced by a factor of up to 16.
This idea is not particularly new; early versions of BSD clustered the
512-byte pages provided by VAX systems into 1024-byte internal pages.
Still, it's a bit tricky to implement inside the Linux kernel. Much kernel
code thinks it understands the concept of the "page size," but, with this
patch, there are two different page sizes. Code dealing with the hardware
memory management unit (MMU) must work on the MMU's terms, while code
working with kernel pages should see the larger size. The result is a
great deal of work trying to figure out whether each bit of code should be
working with PAGE_SIZE units, or the new MMUPAGE_SIZE.
It is not a job for the faint of heart.
This patch is, for now, not for casual users; by William's admission a
number of things are still broken. But, fear not: "I've yet to
encounter non-fsck-recoverable filesystem corruption with remotely current
sources." Even when the problems are fixed, this patch looks fairly
involved for 2.5 at this point. But, one never knows.
Comments (1 posted)
One result of all the work that was done with improved threading support in
the 2.5 kernel is that threads stopped showing up in the
/proc
filesystem. Most people don't miss them, but there are reasons for wanting
to be able to deal with individual threads through
/proc. The
main problems have been useability and performance. If you are running a
system with thousands of threads,
/proc becomes rather large and
difficult to work with. It's also slow. Ingo Molnar found that, with
16,000 threads in
/proc, the
top utility took 22 seconds
to work through them all.
The result of Ingo's work, of course, is a
patch improving the situation. The first thing Ingo did was to create
a "lookup cursor" that gets stashed into the file structure for a
process that is digging through /proc. That cursor caches the
current state of the directory read operation, greatly speeds the process
of reading through a large /proc directory. Ingo also added some
new process information so that the thread group leader can be queried for
cumulative information on the whole group.
Nobody complained much about those changes; there was one other, though,
that was a bit more controversial. With Ingo's patch, threads show up in
/proc with a period in front of the process ID. Thus, a normal
process might be represented as /proc/1234, while a thread would,
instead, be /proc/.1234. That change makes it easy for
applications to distinguish threads from "full" processes; it also has the
effect of hiding threads from a casual /proc directory listing.
Unsurprisingly, a number of developers (including Linus) see the period as being a bit
of a hack. Wouldn't it be better to put threads in a subdirectory under
the thread group leader's ID? Linus even posted a quick patch showing how he thought it could
be done. A new patch from Ingo has not yet appeared, but it seems likely
that the next revision will put threads into subdirectories. At that
point, threads will probably return to /proc in the 2.5 kernel.
And /proc will remain fast even with large numbers of threads;
Ingo's 16,000-thread top case went from 22 seconds to 0.16
seconds.
Comments (none posted)
Driver porting
Below you'll find two new articles in the LWN driver porting series; they
deal with timekeeping and safe sleeping. Since last week we have also
added an article on working with the preemptible kernel and an updated
description of the 2.5 workqueue interface. Those articles, and all the
others, can be found on the
driver
porting page.
Comments (none posted)
One might be tempted to think that the basic task of keeping track of the
time would not change that much from one kernel to the next. And, in fact,
most kernel code which worries about times (and time intervals) will likely
work unchanged in the 2.6 kernel. Code which gets into the details of how
the kernel manages time may well need to adapt to some changes, however.
Internal clock frequency
One change which
shouldn't be problematic for most code is the
change in the internal clock rate on the x86 architecture. In previous
kernels,
HZ was 100; in 2.6 it has been bumped up to 1000. If
your code makes any assumptions about what HZ really was (or, by extension,
what
jiffies really signified), you may have to make some changes
now. For what it's worth, as of 2.6.0-test9, the default values of HZ in the
mainline kernel source (which sometimes lags the architecture-specific
trees) is as follows:
Alpha: 1024/1200;
ARM: 100/128/200/1000;
CRIS: 100;
i386: 1000;
IA-64: 1024;
M68K: 100;
M68K-nommu: 50-1000;
MIPS: 100/128/1000;
MIPS64: 100;
PA-RISC: 100/1000;
PowerPC32: 100;
PowerPC64: 1000;
S/390: 100;
SPARC32: 100;
SPARC64: 100;
SuperH: 100/1000;
UML: 100;
v850: 24-100;
x86-64: 1000.
Kernel time variables
When the internal clock rate on a 32-bit system is set to 1000, the classic
32-bit
jiffies variable will overflow in just over 49 days.
Overflows could always happen on systems with a long uptime, but, when it
took well over a year of uptime, it was a relatively rare occurrence - even
on Linux systems. It is not uncommon at all, however, for a system to be
up for more than 50 days. In most cases, having
jiffies wrap
around is not a real problem; it can be inconvenient for tasks like process
accounting, however. So the 2.5 kernel has a new counter called
jiffies_64. With 64 bits to work with,
jiffies_64 will
not wrap around in a time frame that need concern most of us - at least
until some future kernel starts using a gigahertz internal clock.
For what it's worth, on most architectures, the classic, 32-bit
jiffies variable is now just the least significant half of
jiffies_64.
Note that, on 32-bit systems, a 64-bit jiffies value raises
concurrency issues. It is deliberately not declared as a volatile
value (for performance reasons), so the possibility exists that code like:
u64 my_time = jiffies_64;
could get an inconsistent version of the variable, where the top and bottom
halves do not match. To avoid this possibility, code accessing
jiffies_64 should use xtime_lock, which is the new
seqlock type as of 2.5.60. In most cases,
though, it will be easier to
just use the convenience function provided by the kernel:
#include <linux/jiffies.h>
u64 my_time = get_jiffies_64();
Users of the internal xtime variable will notice a couple of
similar changes. One is that xtime, too, is now protected by
xtime_lock (as it is in 2.4 as of 2.4.10), so any code which plays
around with disabling interrupts or such before accessing xtime
will need to change. The best solution is probably to use:
struct timespec current_kernel_time(void);
which takes care of locking for you.
xtime also now is a
struct timespec rather than struct timeval; the
difference being that the sub-second part is called tv_nsec, and
is in nanoseconds.
Timers
The kernel timer interface is essentially unchanged since 2.4, with one
exception. The new function:
void add_timer_on(struct timer_list *timer, int cpu);
will cause the timer function to run on the given CPU with the expiration
time hits.
Delays
The 2.5 kernel includes a new macro
ndelay(), which delays for a
given number of nanoseconds. It can be useful for interactions with
hardware which insists on very short delays between operations. On most
architectures, however,
ndelay(n) is equal to
udelay(1)
for waits of less than one microsecond.
POSIX clocks
The POSIX clocks patch (merged into 2.5.63) is beyond the scope of this
article. If you are working with a device which can provide an interesting
time service (high resolution or high accuracy), you may want to consider
using it to drive a POSIX clock. Look into
kernel/posix-timers.c
for more information.
Comments (2 posted)
Contrary to expectations, the classic functions
sleep_on() and
interruptible_sleep_on() were not removed in the 2.5 series. It
seems that they are still needed in a few places where (1) taking them
out is quite a bit of work, and (2) they are actually used in a way
that is safe. Most authors of kernel code should, however, pretend that
those functions no longer exist. There are very few situations in which
they can be used safely, and better alternatives exist.
wait_event() and friends
Most of those alternatives have been around since 2.3 or earlier. In many
situations, one can use the
wait_event() macros:
DECLARE_WAIT_QUEUE_HEAD(queue);
wait_event(queue, condition);
int wait_event_interruptible (queue, condition);
These macros work the same as in 2.4: condition is a boolean
condition which will be tested within the macro; the wait will end when the
condition evaluates true.
It is worth noting that these macros have moved from
<linux/sched.h> to <linux/wait.h>, which
seems a more sensible place for them. There is also a new one:
int wait_event_interruptible_timeout(queue, condition, timeout);
which will terminate the wait if the timeout expires.
prepare_to_wait() and friends
In many situations,
wait_event() does not provide enough
flexibility - often because tricky locking is involved.
The alternative in those cases has been to do a full "manual"
sleep, which involves the following steps (shown here in a sort of
pseudocode, of course):
DECLARE_WAIT_QUEUE_HEAD(queue);
DECLARE_WAITQUEUE(wait, current);
for (;;) {
add_wait_queue(&queue, &wait);
set_current_state(TASK_INTERRUPTIBLE);
if (condition)
break;
schedule();
remove_wait_queue(&queue, &wait);
if (signal_pending(current))
return -ERESTARTSYS;
}
set_current_state(TASK_RUNNING);
A sleep coded in this manner is safe against missed wakeups. It is also a
fair amount of error-prone boilerplate code for a very common situation.
In 2.6, a set of helper functions has been added which makes this task
easier. The modern equivalent of the above code would look like:
DECLARE_WAIT_QUEUE_HEAD(queue);
DEFINE_WAIT(wait);
while (! condition) {
prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
if (! condition)
schedule();
finish_wait(&queue, &wait)
}
prepare_to_wait_exclusive() should be used when an exclusive wait
is needed. Note that the new macro DEFINE_WAIT() is used here,
rather than DECLARE_WAITQUEUE(). The former should be used when
the wait queue entry is to be used with prepare_to_wait(), and
should probably not be used in other situations unless you
understand what it is doing (which we'll get into next).
Wait queue changes
In addition to being more concise and less error prone,
prepare_to_wait() can yield higher performance in situations where
wakeups happen frequently. This improvement is obtained by causing the
process to be removed from the wait queue immediately upon wakeup; that
removal keeps the process from seeing multiple wakeups if it doesn't
otherwise get around to removing itself for a bit.
The automatic wait queue removal is implemented via a change in the wait
queue mechanism. Each wait queue entry now includes its own "wake
function," whose job it is to handle wakeups. The default wake function
(which has the surprising name default_wake_function()), behaves
in the customary way: it sets the waiting task into the
TASK_RUNNING state and handles scheduling issues. The
DEFINE_WAIT() macro creates a wait queue entry with a different
wake function, autoremove_wake_function(), which automatically
takes the newly-awakened task out of the queue.
And that, of course, is how DEFINE_WAIT() differs from
DECLARE_WAITQUEUE() - they set different wake functions. How the
semantics of the two differ is not immediately evident from their names,
but that's how it goes. (The new runtime initialization function
init_wait() differs from the older init_waitqueue_entry()
in exactly the same way).
If need be, you can define your own wake function - though the need for
that should be quite rare (about the only user, currently, is the support
code for the epoll() system calls). The wake function has this
prototype:
typedef int (*wait_queue_func_t)(wait_queue_t *wait,
unsigned mode, int sync);
A wait queue entry can be given a different wakeup function with:
void init_waitqueue_func_entry(wait_queue_t *queue,
wait_queue_func_t func);
One other change that most programmers won't notice: a bunch of wait queue
cruft from 2.4 (two different kinds of wait queue lock, wait queue
debugging) has been removed from 2.6.
Comments (7 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>