Brief items
The current 2.6 prepatch remains 2.6.11-rc4. The slow trickle of
fixes into Linus's BitKeeper repository continues, with the final 2.6.11
release likely to happen before too long.
The current -mm prepatch is 2.6.11-rc4-mm1. Recent changes to -mm include
device mapper multipath support (see below), the cpushare
"secure computing" patch, a SCSI changer driver, a new set of BIO
support functions, some performance counter updates, and various fixes.
The current 2.4 prepatch is 2.4.30-pre2, released by Marcelo on
February 23. This prepatch adds a new set of fixes (mostly in the networking
subsystem) and a few filesystem and driver updates.
Kernel development news
Martin Hicks recently posted a patch which adds a new degree of user-space
control over memory management policy. In particular, it creates a new
/proc entry:
/proc/sys/vm/toss_page_cache_nodes
If a suitably privileged process writes one or more NUMA node numbers to
that file, all page cache pages belonging to those nodes will be flushed
out. Essentially, this operation causes a node to
forget about all locally-cached pages from files in the filesystem.
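A small user-space helper for driving this interface might look like the sketch below. The path is the one given above, and it exists only on kernels carrying the patch. The write format (decimal node numbers separated by spaces and ended by a newline) is an assumption, as is the helper name toss_page_cache(); the patch itself is the authority on both.

```c
#include <assert.h>
#include <stdio.h>

/* Path named in the article; present only on kernels carrying the patch. */
#define TOSS_NODES_PATH "/proc/sys/vm/toss_page_cache_nodes"

/*
 * Write `count` NUMA node numbers to `path`.
 * ASSUMPTION: nodes are written as decimal numbers separated by spaces and
 * terminated by a newline; the article does not specify the format.
 * Returns 0 on success, -1 on any I/O error.
 */
int toss_page_cache(const char *path, const int *nodes, int count)
{
    assert(path != NULL && nodes != NULL && count > 0);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = 1;
    for (int i = 0; i < count && ok; i++)
        ok = fprintf(f, "%s%d", i == 0 ? "" : " ", nodes[i]) > 0;
    if (ok)
        ok = fputc('\n', f) != EOF;

    /* fclose() flushes the buffer; a failed flush means the write was rejected. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A privileged caller would pass TOSS_NODES_PATH and an array such as {0, 1}; taking the path as a parameter also allows the helper to be exercised against an ordinary file.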
Clearing the page cache in this way would normally be bad for performance.
The page cache exists to allow the kernel to satisfy common filesystem
requests without going to the disk; clearing the cache defeats that
functionality. There are exceptions to
everything, however. This patch is aimed at large-scale high-performance
computing tasks running in a cluster environment. Such jobs typically do
best if they can start with a clean system; they have no real use for
whatever may have been cached for the previous user. More to the point, a
full page cache can cause memory allocations to be satisfied with non-local
(slower) memory, resulting in significantly worse performance. By clearing
the cache before starting a new job, a system administrator can ensure that
local memory is available for that job.
Not everybody likes the patch. Ingo Molnar thinks that this capability will create
confusion and make the debugging of memory problems even harder.
How are we supposed to debug VM problems where one player
periodically flushes the whole pagecache? ... Providing APIs to
flush system caches, sysctl or syscall, is the road to VM madness.
Andrew Morton, instead, sees the value of the patch for some users, but doesn't like the implementation. He would
like to see this capability made useful for other classes of users, such as
kernel developers who want to put the system into a known state before
running tests. He also doesn't like the /proc interface, and
argues for a new system call instead. His suggestion was:
sys_free_node_memory(long node_id, long pages_to_make_free,
long what_to_free);
This form of the call would allow the clearing of something less than the
entire page cache, making the tool a bit less crude. The
what_to_free argument would be a bitmask specifying which types of
memory to free; beyond the page cache, this call could cause the kernel to
reclaim anonymous memory or slab caches.
The system call approach would seem to make sense; there is one remaining
glitch, however: SUSE already shipped the /proc interface in
SLES9. That revelation drew a complaint
from Andrew:
This is why you should target kernel.org kernels first. Now we
risk ending up with poor old suse carrying an obsolete interface
and application developers have to be able to cater for both
interfaces.
An explicit purpose behind the 2.6 development model is to get patches into
the mainline quickly so that their form can be stabilized before
distributors ship them. As the developers become used to this mode of
operation, this sort of issue should become relatively rare.
Multipath connectivity is a feature of high-end storage systems. A storage box
packed with disks will be connected to multiple transport paths, any one
of which can be used to submit I/O requests. A computer will be connected
to more than one of these transport interconnects, and can choose among
them when it has an I/O request for the storage server. This sort of
arrangement is expensive, but it provides for higher reliability (things
continue to work if an interconnect fails) and better performance.
Support for multipath in Linux has traditionally been spotty, at best.
Some low-level block drivers have included support for their specific
devices, but support at that level leads to duplicated functionality and
difficulties for administrators. Some thought has gone into how multipath
is best supported: does that logic belong at the driver layer, the SCSI
mid-layer, the block layer, or somewhere else? The conclusion that was
reached at last year's Kernel Summit was that the device mapper was the
best place for multipath support.
That support has now been coded up and posted for review; it was added to the
2.6.11-rc4-mm1 kernel. When used with the user-space multipath tools distribution,
the device mapper can now provide proper multipath support - for some
hardware, at least.
Internally, the multipath code creates a data structure, attached to a
device mapper target, which consists of an ordered list of priority groups;
each group holds a path selector and a set of paths.
When the time comes to transfer blocks to or from a device mapper target
representing a multipath device, the code goes to the first priority group
in the list. Each group represents a set of paths to the device, each of
which is considered equal to the others; the preferred paths (being the
fastest and/or most reliable) should be contained in the first group in the
list. Priority groups include a path selector - a function which
determines which path should be used for each I/O request. The current
patches include a round-robin selector which simply rotates through the
paths to balance the load across them.
Should situations arise which require more complicated policies, it should
not be tremendously difficult to create an appropriate path selector.
If a given path starts to generate errors, it is marked as failed and the
path selector will pass over it. Should all paths in a priority group
fail, the next group in the list (if it exists) will be used. The
multipath tools include a management daemon which is informed of failed
paths; its job is to scream for help and retry the failed paths. If a path
starts to work again, the daemon will inform the device mapper, which will
resume using that path.
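The selection and failover logic described above can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the device mapper code: the structure names (struct mpath, struct priority_group), the fixed-size path array, and the helper functions are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PATHS 4

/* One route to the storage device (hypothetical model, not the dm struct). */
struct mpath {
    const char *name;
    int failed;
};

/* A set of equally-preferred paths plus round-robin selector state. */
struct priority_group {
    struct mpath paths[MAX_PATHS];
    int nr_paths;
    int next;   /* round-robin cursor */
};

/* Round-robin selector: next usable path in the group, or NULL if none. */
static struct mpath *rr_select(struct priority_group *pg)
{
    for (int tried = 0; tried < pg->nr_paths; tried++) {
        struct mpath *p = &pg->paths[pg->next];
        pg->next = (pg->next + 1) % pg->nr_paths;
        if (!p->failed)
            return p;   /* failed paths are passed over */
    }
    return NULL;
}

/* Walk the groups in priority order, falling through when a group is dead. */
struct mpath *choose_path(struct priority_group *groups, int nr_groups)
{
    assert(groups != NULL && nr_groups >= 0);
    for (int g = 0; g < nr_groups; g++) {
        struct mpath *p = rr_select(&groups[g]);
        if (p != NULL)
            return p;
    }
    return NULL;   /* no path available at all */
}

/* Error handling marks a path failed; the daemon's retry can reinstate it. */
void fail_path(struct mpath *p)      { p->failed = 1; }
void reinstate_path(struct mpath *p) { p->failed = 0; }
```

With two healthy paths in the first group, successive calls to choose_path() alternate between them; the second group is only consulted once every path in the first has been marked failed.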
There may be times when no paths are available; this can happen, for
example, when a new priority group has been selected and is in the process
of initializing itself. In this situation, the multipath target will
maintain a queue of pending BIO structures. Once a path becomes available,
a special worker thread works through the pending I/O list and sees to it
that all requests are executed.
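The queueing behavior can be sketched as a simple FIFO that is drained once a path appears. The names here (struct io_req standing in for a BIO, struct mp_target, submit_io(), drain_pending()) are hypothetical, and "dispatching" a request is reduced to bumping a counter; the real target and its worker thread are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a BIO: an identifier and a queue link. */
struct io_req {
    int id;
    struct io_req *next;
};

/* Hypothetical model of the multipath target's queueing state. */
struct mp_target {
    int paths_available;           /* usable paths right now */
    struct io_req *head, *tail;    /* FIFO of pending requests */
    int dispatched;                /* requests handed to a path so far */
    int last_id;                   /* id of the most recent dispatch */
};

/* "Send" a request down a path; here we only record that it happened. */
static void dispatch(struct mp_target *t, struct io_req *r)
{
    t->dispatched++;
    t->last_id = r->id;
}

/* Submit I/O: dispatch at once if possible, otherwise queue it. */
void submit_io(struct mp_target *t, struct io_req *r)
{
    assert(t != NULL && r != NULL);
    if (t->paths_available > 0 && t->head == NULL) {
        dispatch(t, r);
        return;
    }
    /* No path (or older requests still waiting): keep FIFO order. */
    r->next = NULL;
    if (t->tail != NULL)
        t->tail->next = r;
    else
        t->head = r;
    t->tail = r;
}

/* What the worker does once a path is available; returns requests issued. */
int drain_pending(struct mp_target *t)
{
    int issued = 0;
    while (t->paths_available > 0 && t->head != NULL) {
        struct io_req *r = t->head;
        t->head = r->next;
        if (t->head == NULL)
            t->tail = NULL;
        dispatch(t, r);
        issued++;
    }
    return issued;
}
```

Queueing behind existing pending requests, even when a path is up, is a design choice of this sketch that preserves submission order.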
At the lower level, the multipath code includes a set of hardware hooks for dealing with
hardware-specific events. These hooks include a status function, an
initialization function, and an error handler. The patch set includes a hardware handler for EMC CLARiiON devices.
Comments on the patches have been relatively few, and have dealt mostly
with trivial issues. The multipath patches are unintrusive; they add new
functionality, but do not make significant changes to existing code. So
chances are good that they could find their way into the 2.6.12 kernel.
The FUTEX code implements lightweight mutual exclusion primitives for user
space. It is intended to be used in situations - such as multi-threaded
programs - where mutual exclusion is needed, but where the implementation must be fast.
Olof Johansson recently stumbled across a case where the FUTEX code can
deadlock the system (thus failing the "fast" test), which shows how hard it
can be to get concurrency issues right.
One of the many locking primitives provided by the kernel is the
reader-writer semaphore, or "rwsem". An rwsem can be obtained for either
read or write access. Any number of readers will be allowed to hold the
semaphore concurrently. Any thread which must change the protected data
structures must, however, obtain the semaphore for write access. Only one
writer is allowed at any given time, and no readers may be in the critical
section while the writer is at work.
If a thread tries to obtain an rwsem for write access, and that semaphore
is currently held (by somebody else) for read access, the writer will be
put to sleep. Once the writer gets in line, however, no more readers will be
allowed in; they queue up behind the writer. Once
the existing readers have gotten out of the way, the writer will be allowed
to proceed. The queued readers will only wake up after the writer is
done. This implementation makes rwsems fair, in that readers cannot starve
writers indefinitely. It also makes certain types of subtle faults
possible, however.
If a process might have to wait on a FUTEX, the kernel must obtain that
process's memory map semaphore (mmap_sem). This semaphore, which
is an rwsem, controls access to the internal FUTEX data structures; it is
taken for read access. The kernel must also query the value of the FUTEX
itself, which is done through a call to get_user(). Should that
access generate a page fault, the fault handler will obtain
mmap_sem for read access a second time. This double access works
just fine; the second down_read() call simply looks like another
reader, which can run concurrently with the first.
Life gets complicated, however, when other processes share the same address
space. Since the FUTEX mechanism is aimed at threads, this is a situation
which comes about frequently. Consider the following series of events:
| Thread 1 | Thread 2 |
| Call sys_futex() | |
| down_read(&current->mm->mmap_sem); | |
| | call mmap() |
| | down_write(&current->mm->mmap_sem); |
| | (goes to sleep) |
| call get_user() | |
| (everything comes to a halt) | |
When the second thread calls mmap(), it must obtain
mmap_sem for write access. Since the first thread is already a
reader, the down_write() call is queued and the thread is put to
sleep. When the first thread makes its get_user() call and that access
faults, the fault handler tries
to obtain the rwsem for read access for the second time. Since there is
now a writer waiting, however, the first thread also is put to sleep.
Since the first thread is the one holding the initial read lock, this
situation will never resolve itself; it is a deadlock. This particular
type of deadlock is nasty in that it requires a race condition to become
visible; things usually just work.
Several possible solutions have been proposed. The rwsem "lock depth"
could be explicitly tracked so that a
second attempt to obtain read access simply increments a counter and does
not sleep. Processes holding mmap_sem could be marked with a
special PF_MMAP_SEM flag; the page fault code would see that flag,
realize that the semaphore is already held, and not take it again. Olof's
initial report included a patch which tries to explicitly fault in the page
before taking the semaphore so that the get_user() call would not
generate a fault.
The solution which will eventually be adopted will likely take a different
approach, however. Conventional wisdom has long said that functions like
get_user() cannot be called in atomic context (in an interrupt
handler or when a spinlock is held), since they might sleep. In fact, if
the user-space access functions generate a page fault in atomic context,
the fault handler simply refuses to bring in the page and the function
returns an error code. So the solution, first suggested by Linus, is to put the process into
an atomic mode (by calling inc_preempt_count()) just before the
get_user() call. If get_user() fails, the page must be
faulted in. So the mmap_sem is dropped, the page is explicitly
faulted, and the whole process starts over again.
As often happens, the full solution turned out to be a bit more complicated
than initially thought. So Olof put together a patch implementing a new
user-space access function:
int get_user_inatomic(value, user_pointer);
This function is atomic; it may succeed or fail, but it will always return
without sleeping. Like get_user(), it is implemented as a macro
which tries to do the right thing regardless of the data type of the value
to be fetched. That implementation drew a complaint from one developer, who
would rather see new interfaces done in a more strongly-typed manner. So the
details of the patch that eventually
gets merged (presumably after 2.6.11) may change, but it will likely follow
this approach.
Page editor: Jonathan Corbet