Brief items
The current stable 2.6 kernel is 2.6.16.1, released on March 27. 2.6.15.7
was released at the same time. Both patches contain a fair number of
important fixes, some of which are security-related.
There has been no 2.6 development prepatch released over the last week.
Patches are flowing into the mainline git repository at a high rate,
however; see below for a list.
The current -mm tree is 2.6.16-mm2. Recent changes to
-mm include the ability to call poll() on sysfs files (LWN coverage), support for
64-bit I/O and memory resources, priority-inheriting futex support, and a new
set of central time management patches.
Kernel development news
The flood of patches heading into the mainline continues at full rate -
though the merge window should be closing soon. The following are the
highlights from code merged since last week's summary, starting with the
user-visible changes:
- The lightweight robust
futexes patch.
- The software RAID (MD) layer can now handle on-the-fly resizing of
RAID5 arrays.
- Support for devfs has been removed from the SCSI subsystem, though it
remains in many other parts of the kernel.
- The user-space software
suspend patch.
- A big XFS update.
- An 802.11 software MAC implementation for wireless networking stacks.
Version 20 of the wireless extensions API was also merged.
- The reverse-engineered Broadcom
43xx driver has been merged. As a result, the list of wireless
network cards supported by Linux has just grown considerably.
- A "memory spreading" mechanism which can be used to spread page cache
and filesystem buffer allocations across all nodes of a NUMA system.
- Two new fadvise()
operations for controlling asynchronous file writeout behavior.
- Support for reordering functions in the linked kernel image. The idea
here is to put the highly-used bits of kernel code together so that
the highly-trafficked part of the kernel fits within a single TLB
entry. Currently, only x86-64 has the infrastructure for reordering.
- Multiple-block allocation and mapping has been added to the ext3
filesystem, improving performance for sequential file access patterns.
- A new scheduling domain has been added to represent multi-core
systems.
- A new RTC subsystem has been added, providing support for a variety of
real-time hardware clocks.
Internal kernel API changes merged include:
- A new utility function has been added:
      int execute_in_process_context(void (*fn)(void *data),
                                     void *data,
                                     struct execute_work *work);
This function will arrange for fn() to be called in process
context (where it can sleep). Depending on when
execute_in_process_context() is called, fn() could
be invoked immediately or delayed by way of a work queue.
- The SMP alternatives
patch.
- A rework of the relayfs API - but the sysfs interface has been left
out for now.
- A tracing mechanism for developers debugging block subsystem code.
- There is a new internal flag (FMODE_EXEC) used to indicate
that a file has been opened for execution.
- The obsolete MODULE_PARM() macro is gone forevermore.
- A new function, flush_anon_page(), can be used in conjunction
with get_user_pages() to safely perform DMA to anonymous
pages in user space.
- Zero-filled memory can now be allocated from slab caches with
kmem_cache_zalloc(). There is also a new slab debugging
option to produce a /proc/slab_allocators file with detailed
allocation information.
- There are four new ways of creating mempools:
      mempool_t *mempool_create_page_pool(int min_nr, int order);
      mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size);
      mempool_t *mempool_create_kzalloc_pool(int min_nr, size_t size);
      mempool_t *mempool_create_slab_pool(int min_nr,
                                          struct kmem_cache *cache);
The first creates a pool which allocates whole pages (the number of
which is determined by order), while the second and third create a
pool backed by kmalloc() and kzalloc(),
respectively. The fourth is a shorthand form of creating slab-backed
pools.
- The prototype for hrtimer_forward() has changed:
      unsigned long hrtimer_forward(struct hrtimer *timer,
                                    ktime_t now, ktime_t interval);
The new now argument is expected to be the current time.
This change allows some calls to be optimized. The data
field has also been removed from the hrtimer structure.
- A whole set of generic bit operations (find first set, count set bits,
etc.) has been added, helping to unify this code across architectures
and subsystems.
- The inode f_ops pointer - which refers to the
file_operations structure for the open file - has been marked
const. Quite a bit of code, which used to change that
structure, has been changed to compensate. Similar changes have been
made in many filesystems. "The goal is both to increase
correctness (harder to accidentally write to shared datastructures)
and reducing the false sharing of cachelines with things that get
dirty in .data (while .rodata is nicely read only and thus cache
clean)."
If the usual pattern holds, the merging of new features will stop sometime
around the end of the month, with 2.6.17-rc1 being released shortly
thereafter.
"Holy cow."
That was Andrew Morton's reaction to a
34-part patch, posted by Peter Zijlstra, which creates an abstract API for
page replacement policies. The page replacement code is at the core of the
virtual memory system; it is, essentially, a set of heuristics which must
decide which pages should be evicted from main memory and made available
for other uses. Page replacement is a bit of a black art; it is easy to
see when a system is managing memory poorly, but the path to improvements
is often far from clear. Memory management in Linux was a sore point
for many years, but it seems to work well for most loads now. Given that
all this tricky code has finally been beaten into reasonably good shape,
why would anybody want to mess with it now?
The answer is that there is quite a bit of research work going into
alternative page replacement mechanisms, and Linux might just be able to
benefit from some of that work. After all, few people would say that Linux
virtual memory works so well that there is no room for improvement.
This massive patch set creates an API for page replacement
algorithms, allowing them to be changed at will. Or, at least, changed at
reboot; there is currently no provision for loading replacement algorithms
as modules or swapping them out on the fly. But, by selecting a page
replacement scheme at kernel configuration time, system administrators can
choose one which best suits their workload. Virtual memory hackers and
others can play with different algorithms to see how they work out. And
there is no need to pick one in particular as the page replacement
algorithm for the Linux kernel.
To work with this API, a page replacement algorithm must define a set of
specific functions. Thus, for example, there is a pair of initialization
functions:
    void page_replace_init(void);
    void page_replace_init_zone(struct zone *zone);
These functions, called at boot time, prepare the page replacement code to
work with the system it finds itself running on.
When the core kernel knows something about the use of specific pages, it
can tell the replacement algorithm with these calls:
    void page_replace_hint_active(struct page *page);
    void page_replace_hint_use_once(struct page *page);
The first is called when the kernel notes that the page is in active use,
while the second indicates that the page is unlikely to be used again in
the near future.
There are various other functions for helping with the housekeeping, but
the core of the API is this function here:
    void page_replace_candidates(struct zone *zone, int count,
                                 struct list_head *list);
This function must select up to count pages from the given zone
as candidates for eviction. This is where the page replacement code will
gaze into its crystal ball to figure out which pages will not be used again
anytime soon; those are the ones which will be singled out and passed back
to the core kernel.
Quite a few other functions exist. They deal with issues like page
migration, tracking of non-resident pages, printing out information from
the page replacement code, and more. See the
documentation file for a full list and brief explanation of those other
functions.
The patch set also contains four different page replacement mechanisms.
One is the modified least-recently-used (LRU) code found in current
kernels, reworked to use the new API. Another is the CLOCK-PRO
algorithm, covered here last
August. There is an implementation of the CART technique, discussed in this paper
[PDF]. Then there is a simple random replacement scheme, seemingly
just for the fun of it. Actually, the random
replacement patch is, due to its simplicity, a good place to start for
somebody interested in seeing what a modularized page replacement algorithm
looks like.
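To give a feel for what "select up to count candidates" means in the
simplest case, here is a minimal user-space toy in C. It is not code from
the patch set: the toy_* names, the fixed-size zone, and the table of
function pointers are all assumptions made for illustration (the real
patches choose the algorithm at kernel configuration time rather than
through a run-time ops table).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_PAGES 16

/* Hypothetical stand-ins for struct page and struct zone. */
struct toy_page {
    int id;
    int resident;               /* 1 while the page is still in memory */
};

struct toy_zone {
    struct toy_page pages[TOY_NR_PAGES];
};

/* Hypothetical ops table; an illustration device, not the patch's design. */
struct toy_replace_ops {
    const char *name;
    void (*init_zone)(struct toy_zone *zone);
    size_t (*candidates)(struct toy_zone *zone, size_t count,
                         struct toy_page **out);
};

static void random_init_zone(struct toy_zone *zone)
{
    for (int i = 0; i < TOY_NR_PAGES; i++) {
        zone->pages[i].id = i;
        zone->pages[i].resident = 1;
    }
}

/*
 * Pick up to count distinct resident pages at random and store them in
 * out[], which must have room for at least TOY_NR_PAGES entries.
 * Toy simplification: a selected page is immediately marked non-resident.
 * Returns the number of pages actually selected.
 */
static size_t random_candidates(struct toy_zone *zone, size_t count,
                                struct toy_page **out)
{
    struct toy_page *pool[TOY_NR_PAGES];
    size_t n = 0, picked = 0;

    assert(zone != NULL && out != NULL);

    for (int i = 0; i < TOY_NR_PAGES; i++)
        if (zone->pages[i].resident)
            pool[n++] = &zone->pages[i];

    while (picked < count && n > 0) {
        size_t j = (size_t)rand() % n;

        out[picked] = pool[j];
        out[picked]->resident = 0;
        picked++;
        pool[j] = pool[--n];    /* remove the chosen page from the pool */
    }
    return picked;
}

static const struct toy_replace_ops toy_random_policy = {
    .name = "random",
    .init_zone = random_init_zone,
    .candidates = random_candidates,
};
```

A caller would initialize a zone with toy_random_policy.init_zone() and
then ask for victims with toy_random_policy.candidates(); the policy may
return fewer pages than requested once the zone runs dry.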
This patch looks somewhat similar to the pluggable CPU schedulers patch,
which allows the scheduling algorithm to be changed. That patch continues
to be maintained, but, since its initial posting in 2004, it has never been
seriously considered for inclusion into the mainline kernel. There is a
strong preference toward figuring out what's wrong - if anything - with the
current code and fixing it, rather than creating a mechanism for playing
with entirely different implementations. Thus, Andrew Morton followed his
initial response with:
Rather than replacing the whole lot four times I'd really prefer to
see precise descriptions of these problems, see if we can improve
the situation incrementally rather than wholesale slash-n-burn...
Linus has a similar opinion, and,
additionally, is not convinced that page replacement is really an issue
needing a great deal of attention. "It smells like university
research to me."
The proponents of this patch respond that there are, indeed, situations
where the current code falls apart. Given that, the next logical step
would seem to be gathering information on the cases where Linux memory
management fails. Then the developers can start to think about what needs
to be done to address those failures. Even if the page replacement
framework patches are never merged, it looks like they may help to drive
forward the next phase of work in Linux memory management algorithms.
That should be a good thing regardless.
Applications like network servers that need to monitor multiple file
descriptors using select(), poll(), or (on Linux) epoll_wait() sometimes
face a problem: how to wait until either one of the file descriptors
becomes ready, or a signal (say, SIGINT) is delivered. These system calls,
as it turns out, do not interact entirely well with signals.
A seemingly obvious solution would be to write an empty handler for the
signal, so that the signal delivery interrupts the select() call:
    static void handler(int sig) { /* do nothing */ }

    int main(int argc, char *argv[])
    {
        fd_set readfds;
        struct sigaction sa;
        int nfds, ready;

        sa.sa_handler = handler;    /* Establish signal handler */
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGINT, &sa, NULL);

        /* ... */

        ready = select(nfds, &readfds, NULL, NULL, NULL);

        /* ... */
After select() returns we can determine what happened by looking
at the function result and errno. If errno comes back as
EINTR, we know that the select() call was interrupted by
a signal, and can act accordingly. But this solution suffers from a race
condition: if the SIGINT signal is delivered after the call to
sigaction(), but before the call to
select(), it will fail to interrupt that select() call
and will thus be lost.
We can try playing various games, like setting a global flag within the
signal handler and monitoring that flag in the main program, and using
sigprocmask() to block the signal until just before the select() call.
However, none of these techniques can entirely eliminate the race
condition: there is always some interval, no matter how brief, where the
signal could be handled before the select() call is started.
The traditional solution to this problem is the so-called self-pipe trick,
often credited to D J Bernstein. Using this technique, a program
establishes a signal handler that writes a byte to a specially created
pipe whose read end is also monitored by the select() call. The self-pipe
trick cleverly solves the problem of safely waiting either for a file
descriptor to become ready or a signal to be delivered. However, it
requires a relatively large amount of code to implement a requirement that
is essentially simple. (For example, a robust solution requires marking
both the read and write ends of the pipe non-blocking.)
For this reason, the POSIX.1g committee devised an enhanced version of
select(), called pselect(). The major difference between select() and
pselect() is that the latter call has a signal mask (sigset_t) as an
additional argument:
    int pselect(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
                const struct timespec *timeout, const sigset_t *sigmask);
The sigmask argument specifies a set of signals that should be blocked
during the pselect() call; it overrides the current signal mask for the
duration of that call. So, when we make the following call:
    ready = pselect(nfds, &readfds, &writefds, &exceptfds,
                    timeout, &sigmask);
the kernel performs a sequence of steps that is equivalent to atomically
performing the following system calls:
    sigset_t sigsaved;

    sigprocmask(SIG_SETMASK, &sigmask, &sigsaved);
    ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
    sigprocmask(SIG_SETMASK, &sigsaved, NULL);
For some time now, glibc has provided a library implementation of
pselect() that actually uses the above sequence of system calls.
The problem is that this implementation remains vulnerable to the very race
condition that pselect() was designed to avoid, because the
separate system calls are not executed as an atomic unit.
Using pselect(), we can safely wait for either a signal to be delivered or
a file descriptor to become ready, by replacing the first part of our
example program with the following code:
    sigset_t emptyset, blockset;

    sigemptyset(&blockset);         /* Block SIGINT */
    sigaddset(&blockset, SIGINT);
    sigprocmask(SIG_BLOCK, &blockset, NULL);

    sa.sa_handler = handler;        /* Establish signal handler */
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    /* Initialize nfds and readfds, and perhaps do other work here */

    /* Unblock signal, then wait for signal or ready file descriptor */

    sigemptyset(&emptyset);
    ready = pselect(nfds, &readfds, NULL, NULL, NULL, &emptyset);
    ...
This code works because the SIGINT signal is only unblocked once control
has passed to the kernel. As a result, there is no point where the signal
can be delivered before pselect() executes. If the signal is generated
while pselect() is blocked, then, as with select(), the system call is
interrupted, and the signal is delivered before the system call returns.
Although pselect() was conceived several years ago, and was already
publicized in 1998 by W. Richard Stevens in his Unix Network Programming,
vol. 1, 2nd ed., actual implementations have been slow to appear. Their
eventual appearance in recent releases of various Unix implementations has
been driven in part by the fact that the 2001 revision of the POSIX.1
standard requires a conforming system to support pselect(). With the
2.6.16 kernel release, and the required wrapper function that appears in
the recently released glibc 2.4, pselect() also becomes available on
Linux.
Linux 2.6.16 also includes a new (but nonstandard) ppoll() system call,
which adds a signal mask argument to the traditional poll() interface:
    int ppoll(struct pollfd *fds, nfds_t nfds, const struct timespec *timeout,
              const sigset_t *sigmask);
This system call adds the same functionality to poll() that pselect() adds
to select(). Not to be left in the cold, the epoll maintainer has patches
in the pipeline to add similar functionality in the form of a new
epoll_pwait() system call.
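A ppoll() version of the same pattern might look like the sketch below.
It assumes a glibc that exposes ppoll() when _GNU_SOURCE is defined; the
helper names and the choice of SIGUSR1 are made up for the example. Rather
than passing an empty mask, it starts from the current mask and removes
only the one signal of interest.

```c
#define _GNU_SOURCE             /* ppoll() is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <time.h>

static volatile sig_atomic_t got_sigusr1 = 0;

static void on_sigusr1(int sig)
{
    (void)sig;
    got_sigusr1 = 1;
}

/* Block SIGUSR1, then install the handler. Returns 0 or -1. */
static int setup_sigusr1(void)
{
    sigset_t blockset;
    struct sigaction sa;

    sigemptyset(&blockset);
    sigaddset(&blockset, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &blockset, NULL) == -1)
        return -1;

    sa.sa_handler = on_sigusr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/*
 * poll() the given descriptors for up to timeout_ms milliseconds, with
 * SIGUSR1 unblocked only for the duration of the call. Returns what
 * ppoll() returns: the number of ready descriptors, 0 on timeout, or
 * -1 with errno set (EINTR if a signal was caught).
 */
static int poll_with_sigusr1(struct pollfd *fds, nfds_t nfds,
                             long timeout_ms)
{
    struct timespec ts;
    sigset_t mask;

    assert(timeout_ms >= 0);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* Fetch the current mask and remove only SIGUSR1 from it. */
    if (sigprocmask(SIG_BLOCK, NULL, &mask) == -1)
        return -1;
    sigdelset(&mask, SIGUSR1);

    return ppoll(fds, nfds, &ts, &mask);
}
```

As with poll(), passing no descriptors at all turns the call into a
sleep that can be cut short by the signal.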
There are a few other, minor differences between pselect() and ppoll() and
their traditional counterparts. For example, the type of the timeout
argument is:
    struct timespec {
        long tv_sec;     /* Seconds */
        long tv_nsec;    /* Nanoseconds */
    };
This allows the timeout interval to be specified with greater precision than
is available with the older system calls.
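Code being converted from the older calls usually has its timeouts as a
struct timeval (for select()) or a millisecond count (for poll()). The two
small helpers below, whose names are invented for this example, show the
conversion to a struct timespec.

```c
#include <assert.h>
#include <sys/time.h>           /* struct timeval */
#include <time.h>               /* struct timespec */

/* Convert a select()-style timeout (microseconds) to nanoseconds. */
static struct timespec timeval_to_timespec(struct timeval tv)
{
    struct timespec ts;

    ts.tv_sec = tv.tv_sec;
    ts.tv_nsec = tv.tv_usec * 1000L;
    return ts;
}

/* Convert a non-negative poll()-style millisecond timeout. */
static struct timespec ms_to_timespec(long ms)
{
    struct timespec ts;

    assert(ms >= 0);            /* "wait forever" is a NULL pointer here */
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    return ts;
}
```

Note that the negative "infinite" timeout of poll() has no timespec
equivalent; pselect() and ppoll() take a NULL timeout pointer for that.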
The glibc wrappers for pselect() and ppoll() also hide a couple of details
of the underlying system calls. First, the system calls actually expect
the signal mask argument to be described by two arguments, one of which is
a pointer to a sigset_t structure, while the other is an integer that
indicates the size of that structure in bytes. This allows for the
possibility of a larger sigset_t type in the future. The underlying system
calls also modify their timeout argument so that on an early return
(because a file descriptor became ready, or a signal was delivered), the
caller knows how much of the timeout remained. However, the respective
wrapper functions hide this detail by making a local copy of the timeout
argument and passing that copy to the underlying system calls.
(The Linux select() system call also modifies its timeout argument, and
this behavior is visible to applications. However, many other select()
implementations don't modify this argument. POSIX.1 permits either
behavior in a select() implementation.)
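The difference in timeout handling is easy to observe from user space.
The sketch below, with invented function names, uses each call as a plain
sleep (no descriptors) and reports what is left in the timeout structure
afterward. On Linux, one would expect the select() helper to report
little or nothing remaining, while the pselect() helper should report the
value that was passed in.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/*
 * Sleep with select() for timeout_usec microseconds and return the
 * number of microseconds left in the timeval afterward, or -1 on error.
 */
static long long select_remaining_usec(long timeout_usec)
{
    struct timeval tv;

    assert(timeout_usec >= 0);
    tv.tv_sec = timeout_usec / 1000000L;
    tv.tv_usec = timeout_usec % 1000000L;

    if (select(0, NULL, NULL, NULL, &tv) == -1)
        return -1;
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

/*
 * Sleep with pselect() for timeout_nsec nanoseconds and return the
 * number of nanoseconds left in the timespec afterward, or -1 on error.
 * The timeout parameter is const, so the caller's value is untouched.
 */
static long long pselect_remaining_nsec(long timeout_nsec)
{
    struct timespec ts;

    assert(timeout_nsec >= 0);
    ts.tv_sec = timeout_nsec / 1000000000L;
    ts.tv_nsec = timeout_nsec % 1000000000L;

    if (pselect(0, NULL, NULL, NULL, &ts, NULL) == -1)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Portable code that calls select() in a loop should therefore reinitialize
the timeout before every call rather than rely on either behavior.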
Further details of pselect() and ppoll() can be found in the latest
versions of the select(2) and poll(2) man pages, which can be found here.
Page editor: Jonathan Corbet