Brief items
The current 2.6 prepatch is 2.6.21-rc6, released by Linus on April 5. It
contains a fair number of fixes. Says Linus: "We should be getting close to
a 2.6.21 release, so please update any regression reports you've done."
A few dozen patches have been merged into the mainline git repository since
-rc6 was released. Your editor guesses that one more -rc will be needed
before 2.6.21 is done.
The current -mm tree is 2.6.21-rc6-mm1. Recent changes
to -mm include a number of tweaks for Sony laptops, an enlarged set of
paravirt_ops hooks, a new set of /proc files for learning
about process memory, a rework of the NFS file locking code, and the signalfd() patches.
Andrew notes that -mm is now a "rather large" 25MB patch against the
mainline.
The current stable 2.6 kernel is 2.6.20.6, released on April 6; 2.6.20.5 had been released
moments earlier. The two patches contain a fair number of fixes, including
one for a remotely exploitable crash in the AppleTalk code.
For older kernels: 2.6.16.47-rc1 was released on
April 11 with about a dozen fixes.
Kernel development news
But being a
subsystem maintainer requires that you trust contributors to some
degree, and you just can't trust contributors when you're a
perfectionist. This means that the maintainer should be less of a
perfectionist than the contributors, otherwise he/she ends up doing
everything by him/herself.
-- Jean Delvare
The story of sysfs (and the device model in general) is a long and
complicated one. The creation of a single data structure to represent the
system's hardware and software configuration was long overdue; many tasks
(power management, for
example) cannot be done properly without it. Sysfs adds value to that
structure by representing it to user space. This structure is useful in
many ways, but it has also brought its share of hassles. Exposing kernel
data structures to user space makes it hard to change those structures
without breaking the user-space API; it also exposes every one of them to
user-space initiated lifecycle problems.
Internally, the core building block for the device model is the kobject.
Objects represented in sysfs - devices, for example - each contain a
kobject which, among other things, is the focal point for sysfs access.
The kobject also contains a reference count for the containing object which
is used to manage its lifecycle. A given kobject and its containing data
structure can be deleted when the reference count goes to zero - and not
before. Reference counting works, but it can lead to surprises.
As an example, consider a USB device - a mouse, say. When this device is
plugged into the system, a suitable device structure (containing a kobject)
is created and registered with the kernel. When the mouse is unplugged,
that structure is released. But imagine what happens if a user-space
process opens a sysfs file associated with the mouse device while it is
present, and keeps that file open long after the physical device goes
away. The kernel must be able to handle operations on that open sysfs
file, even though the driver thinks that the device it represents is long
gone. The reference counting in the kobject makes this work - most of the
time. The potential for confusion is high, though, especially with drivers
which have not been written with this sort of lifecycle management in
mind.
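The mouse scenario can be modeled in a few lines of user-space C. What follows is only a sketch under stated assumptions: toy_kobject, toy_mouse, and the helper functions are hypothetical names invented for illustration, not the kernel's kobject API, and the plain integer counter is not safe against concurrent access the way the real reference count is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kobject: a reference count and a release hook. */
struct toy_kobject {
    int refcount;
    void (*release)(struct toy_kobject *kobj);
};

/* Hypothetical containing structure, standing in for the mouse device. */
struct toy_mouse {
    struct toy_kobject kobj;  /* first member, so the cast below is valid */
    bool released;            /* real code would free the structure instead */
};

static void toy_mouse_release(struct toy_kobject *kobj)
{
    struct toy_mouse *mouse = (struct toy_mouse *)kobj;

    mouse->released = true;
}

void toy_mouse_init(struct toy_mouse *mouse)
{
    mouse->kobj.refcount = 1;  /* the driver's own reference */
    mouse->kobj.release = toy_mouse_release;
    mouse->released = false;
}

struct toy_kobject *toy_kobject_get(struct toy_kobject *kobj)
{
    kobj->refcount++;
    return kobj;
}

void toy_kobject_put(struct toy_kobject *kobj)
{
    assert(kobj->refcount > 0);
    /* The containing object may be released only when the count hits zero. */
    if (--kobj->refcount == 0)
        kobj->release(kobj);
}
```

If a "file open" takes a reference with toy_kobject_get() and the "unplug" then drops the driver's reference, the release hook does not run until the file's reference is dropped as well. That window is exactly where a driver can be surprised.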
Back at the end of March, Tejun Heo posted a
discussion of device model lifecycle issues which points out this
problem and a few others. His argument is that the need to manage objects
with different lifecycles makes programming with the device model hard -
something developers have known for some time. Even the core device model
maintainers will admit that it's easy to get things wrong.
More recently, Tejun has followed up with a patch set which attempts to
simplify the situation. There is a great deal of cleanup work in these
patches, and one small API change, but the core change is this: it enables
a clean separation of the lifecycles of sysfs objects and the underlying
data structures they represent. As a result, code outside of sysfs no longer
needs to be concerned about the fact that its data structures may have a
shorter life than the sysfs objects representing those structures.
A sysfs directory (which represents a kobject) is represented within the
kernel by struct
sysfs_dirent. In current kernels, if the sysfs_dirent
structure exists, its underlying kobject is expected to exist as well. It
is not possible for the kobject to go away as long as the
sysfs_dirent structure exists; that means that the structure
containing the kobject must continue to exist as long as any references to
the sysfs files exist. Tejun's patch works by eliminating that requirement.
In the modified sysfs, each sysfs_dirent contains a new counter
called s_active. This counter tracks the number of active
references to the object; these references are the ones which involve the
associated kobject at the current moment. A user-space process which is
holding a sysfs file open will not increase the s_active count
until it performs an actual operation on that file, and the reference
remains only for as long as it takes to complete the operation. Since most
sysfs operations are quite fast, active references will not normally be
held for long.
The active count, as it happens, is maintained with an rwsem - a reader/writer
semaphore. Active references are tracked as readers, so there can be any
number of them outstanding at a given time. An active reference is obtained
with a call to down_read_trylock(), which takes the "lock" (a reference) if
one is available and returns failure immediately, without blocking, if it
is not. All of the relevant
sysfs operations have been changed to obtain active references before
referencing the kobject - and they make sure that the reference was
granted. If an attempt to obtain an active reference fails, sysfs fails
the higher-level operation with -ENODEV.
The only way
down_read_trylock() will fail is if another thread holds a writer
lock on the semaphore - or is in the process of waiting for the readers to
get out of the way so it can get that lock.
Should something happen which causes the underlying kobject to go away, the
cleanup code will call down_write() on the s_active rwsem
in the sysfs_dirent entry, thus taking a writer lock. This call
will cause any future
attempts to obtain an active reference to fail; it will also block until
all currently-existing active references are released.
The end result of all this is that, once the final kobject_put()
call has completed for a given kobject, there will be no further attempts
to access that kobject from sysfs. The kobject (and its containing data
structure) can be safely deleted, and the driver need worry no more about
it.
As an added bonus, there is no longer any need to increase module reference
counts when sysfs attributes are being accessed. A driver which is being
unloaded will release all of its devices, meaning that sysfs will no longer
make any calls into the driver module anyway; the module reference count
becomes superfluous. So Tejun's patch removes the owner field
from attribute structures - a change which ripples through a significant
amount of driver code.
There have been some comments on how the patches are implemented, but no
disagreement with the ultimate goal; these changes could go in as soon as
2.6.22. Tejun would appear to have more improvements in mind, but, even
with no further changes, the current patches go a long way toward making
sysfs safer and easier to work with.
Part of the fun of working with truly large machines is that one gets to
discover new scalability surprises before anybody else. So the SGI folks
often have more fun than many of the rest of us. Their latest discovery
has to do with the number of kernel threads which, on a 4096-processor
system, leads to some interesting kernel behavior.
To begin, they found out that they could not even boot a kernel with the
default configuration. Linux systems normally have a limit of 32768 active
processes at any given time. Anybody who has run "ps" will have noted that
kernel threads are taking up an increasing number of those slots; your
editor's single-processor desktop is running 39 of them. In fact, there
are now enough kernel threads on a
typical system that they will fill that entire space - and more - on a
4096-CPU machine. This problem is relatively easy to take care of by
raising the limit on the number of processes. But it gets more interesting
from there.
The init process is the parent of last resort for every other process on
the system, including kernel threads. So, on a big system, init has a
lot of child processes. These children live on a big linked list;
that list must be searched by various functions, including the variants of
wait(). If the process being searched for is toward the end of
the list, that search can take a long time. Since (1) most kernel
threads are long-lived, and (2) new processes are put at the end of
the list, chances are that a search will, indeed, be looking for a process
at the end.
Then, for the ultimate in fun, load a module into the kernel. The module
loading process calls stop_machine_run() when the new module is
being linked in; this function creates a high-priority kernel thread for
each processor on the system. That thread will grab its assigned CPU and
simply sit there until told to exit; while all CPUs are locked up in this
way the linking process can be performed. Calling a function like
stop_machine_run() is a somewhat antisocial act in the best of
times. But, in the 4096-processor system, stop_machine_run() will
create 4096 threads, each of which goes on the end of init's child list,
and each of which must be searched for when the time comes to clean it up.
The result is a system which simply stops for an extended period of time.
One could argue that people with systems that large simply should not load
modules, but there is a possibility of pushback from the user community.
So other solutions need to be found. Robin Holt's problem report included a simple patch which
moves exiting processes to the beginning of the child list. This change
solves the immediate problem by making searches for those children find
them without having to iterate through all of the long-lived processes
which are not going anywhere.
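The effect of that reordering is easy to see in a toy model. The sketch below is an illustration only: toy_task, toy_children, and the functions operating on them are hypothetical names, and the kernel's real child list uses its own list primitives rather than this hand-rolled doubly-linked list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced task structure. */
struct toy_task {
    int pid;
    struct toy_task *prev, *next;
};

/* Hypothetical child list of a parent such as init. */
struct toy_children {
    struct toy_task *head, *tail;
};

/* New children go on the end of the list. */
void toy_add_child_tail(struct toy_children *list, struct toy_task *t)
{
    t->next = NULL;
    t->prev = list->tail;
    if (list->tail)
        list->tail->next = t;
    else
        list->head = t;
    list->tail = t;
}

/* Move an exiting child to the front so that a later search finds it fast. */
void toy_move_to_front(struct toy_children *list, struct toy_task *t)
{
    assert(list->head != NULL);
    if (list->head == t)
        return;
    /* Unlink; t is not the head, so t->prev is non-NULL. */
    t->prev->next = t->next;
    if (t->next)
        t->next->prev = t->prev;
    else
        list->tail = t->prev;
    /* Relink at the head. */
    t->prev = NULL;
    t->next = list->head;
    list->head->prev = t;
    list->head = t;
}

/* Number of entries visited before the pid is found, or -1 if absent. */
int toy_search_cost(const struct toy_children *list, int pid)
{
    int visited = 0;

    for (const struct toy_task *t = list->head; t; t = t->next) {
        visited++;
        if (t->pid == pid)
            return visited;
    }
    return -1;
}
```

With thousands of long-lived kernel threads ahead of it, a newly created child costs a full list walk to find; after toy_move_to_front() the same search ends at the first entry.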
Linus had a couple of alternatives. One
was to create a separate list for zombie processes, eliminating that search
altogether. Another was to stop making kernel threads be children of the
init process since they have little to do with user space in any case.
But some developers feel that the real solution might be to start cutting
back on the number of kernel threads.
The biggest culprit for kernel thread creation will certainly be
workqueues, which, by default, create one thread for every CPU on the
system. There are situations which can benefit from multiple threads and
CPU locality, but there are undoubtedly many places where all of those
threads are not needed. Cleaning them up would help to solve some of the
scalability issues; as an added bonus it would remove some of the clutter
from ps listings.
In many cases, a workqueue may not be necessary at all. Instead, kernel
subsystems could just use the "generic" keventd workqueue (which runs as the
events/n threads). There are some issues with using
keventd, including indeterminate latency and a small possibility
of deadlocks, but, for many situations, it may work well enough.
In other cases, using a thread makes sense. Tasks involving long delays
are one example; running a function with multi-second delays in
keventd is considered impolite. Work requiring complicated
context also benefits from its own thread. But, in a number of cases,
those threads need not be created until there is actually some work to be
done. A quick ps run on most systems will show threads related to error
handling, asynchronous I/O, Bluetooth, and more. In the current scheme,
they are created at boot (or module load) time and many of them may never
do any real work before the system shuts down. Thread creation is cheap,
so many of these threads could instead be created on demand.
There are probably some real improvements to be made in this area; all
that's needed is somebody with the time and motivation to do the work. In
the meantime, those of you with 4096-way systems may need to apply a patch
or two.
The slab allocator has been at the core of the kernel's memory management
for many years. This allocator (sitting on top of the low-level page
allocator) manages caches of objects of a specific size, allowing for fast
and space-efficient allocations. Kernel hackers tend not to wander into
the slab code because it's complex and because, for the most part, it
works quite well.
Christoph Lameter is one of those people for whom the slab allocator does
not work quite so well. Over time, he has come up with a list of
complaints that is getting impressively long. The slab allocator maintains
a number of queues of objects; these queues can make allocation fast but
they also add quite a bit of complexity. Beyond that, the storage overhead
tends to grow with the size of the system:
SLAB Object queues exist per node, per CPU. The alien cache queue
even has a queue array that contain a queue for each processor on
each node. For very large systems the number of queues and the
number of objects that may be caught in those queues grows
exponentially. On our systems with 1k nodes / processors we have
several gigabytes just tied up for storing references to objects
for those queues This does not include the objects that could be on
those queues. One fears that the whole memory of the machine could
one day be consumed by those queues.
Beyond that, each slab (a group of one or more contiguous pages from which
objects are allocated) contains a chunk of metadata at the beginning which
makes alignment of objects harder. The code for cleaning up caches when
memory gets tight adds another level of complexity. And so on.
Christoph's response is the SLUB
allocator, a drop-in replacement for the slab code. SLUB promises
better performance and scalability by dropping most of the queues and
related overhead and simplifying the slab structure in general, while
retaining the current slab allocator interface.
In the SLUB allocator, a slab is simply a group of one or more pages neatly
packed with objects of a given size. There is no metadata within the slab
itself, with the exception that free objects are formed into a simple
linked list. When an allocation request is made, the first free object is
located, removed from the list, and returned to the caller.
Given the lack of per-slab metadata, one might well wonder just how that
first free object is found. The answer is that the SLUB allocator stuffs
the relevant information into the system memory map - the page
structures associated with the pages which make up the slab. Making
struct page larger is frowned upon in a big way, so the SLUB
allocator makes this complicated structure even more so with the addition
of another union. The end result is that struct page gets three
new fields which only have meaning when the associated page is part of a
slab:
void *freelist;
short unsigned int inuse;
short unsigned int offset;
For slab use, freelist points to the first free object within a
slab, inuse is the number of objects which have been allocated
from the slab, and offset tells the allocator where to find the
pointer to the next free object. The SLUB allocator can use RCU to free
objects, but, to do so, it must be able to put the "next object" pointer
outside of the object itself; the offset field is the
allocator's way of tracking where that pointer was put.
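How a free list can live inside the free objects themselves is perhaps easiest to see in code. The following is a user-space sketch, not the SLUB implementation: struct toy_slab mirrors the three fields above, but the toy_* functions are hypothetical and the sketch assumes that offset is a byte offset within each object's slot, which may not match the units the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the three fields added to struct page. */
struct toy_slab {
    void *freelist;         /* first free object, or NULL if none */
    unsigned short inuse;   /* number of objects handed out */
    unsigned short offset;  /* assumed: byte offset of the "next" pointer */
};

/* Read the "next free object" pointer stored inside a free object. */
static void *toy_get_next(const struct toy_slab *s, void *obj)
{
    void *next;

    memcpy(&next, (char *)obj + s->offset, sizeof(next));
    return next;
}

/* Store the "next free object" pointer inside a free object. */
static void toy_set_next(const struct toy_slab *s, void *obj, void *next)
{
    memcpy((char *)obj + s->offset, &next, sizeof(next));
}

/* Thread every object in mem onto the free list, first object at the head. */
void toy_slab_init(struct toy_slab *s, void *mem, size_t obj_size,
                   size_t count, unsigned short offset)
{
    assert(offset + sizeof(void *) <= obj_size);
    s->freelist = NULL;
    s->inuse = 0;
    s->offset = offset;
    for (size_t i = count; i-- > 0; ) {
        void *obj = (char *)mem + i * obj_size;

        toy_set_next(s, obj, s->freelist);
        s->freelist = obj;
    }
}

/* Pop the first free object; NULL means the slab is full. */
void *toy_slab_alloc(struct toy_slab *s)
{
    void *obj = s->freelist;

    if (!obj)
        return NULL;
    s->freelist = toy_get_next(s, obj);
    s->inuse++;
    return obj;
}

/* Push an object back onto the head of the free list. */
void toy_slab_free(struct toy_slab *s, void *obj)
{
    assert(s->inuse > 0);
    toy_set_next(s, obj, s->freelist);
    s->freelist = obj;
    s->inuse--;
}
```

With an offset of zero the link overlays the start of each free object, costing no extra space. A nonzero offset pointing past the object's payload corresponds to the RCU case, in which the object's contents must stay intact after it has been freed.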
When a slab is first created by the allocator, it has no objects allocated
from it. Once an object has been allocated, the slab becomes a "partial" slab
which is stored on a list in the kmem_cache structure. Since this
is a patch aimed at scalability, there is, in fact, one "partial" list for
each NUMA node on the system. The allocator tries to keep allocations
node-local, but it will reach across nodes before filling the system with
partial slabs.
There is also a per-CPU array of active slabs, intended to prevent cache
line bouncing even within a NUMA node. A special thread, run via a
workqueue, monitors the usage of per-CPU slabs; if a per-CPU slab is not
being used, it gets put back onto the partial list for use by other
processors.
If all objects within a slab are allocated, the allocator simply forgets
about the slab altogether. Once an object in a full slab is freed, the
allocator can locate the containing slab again via the system memory map and
put it back onto the appropriate partial list. If all of the objects
within a given slab (as tracked by the inuse counter) are freed,
the entire slab is given back to the page allocator for reuse.
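The bookkeeping decision made when an object is freed can be summarized as a small function. This is a sketch of the rules as described here, not code from the patch; the enum and function names are hypothetical, and per-CPU slabs are ignored.

```c
#include <assert.h>

/* Hypothetical names for the three outcomes of freeing one object. */
enum toy_free_action {
    TOY_KEEP_ON_PARTIAL,          /* slab was partial and stays partial */
    TOY_ADD_TO_PARTIAL,           /* slab was full (and forgotten) until now */
    TOY_RETURN_TO_PAGE_ALLOCATOR  /* the last allocated object was freed */
};

/*
 * Decide what happens to a slab after one object is freed, given the
 * inuse count before the free and the number of objects the slab holds.
 */
enum toy_free_action toy_after_free(unsigned int inuse_before,
                                    unsigned int objects)
{
    assert(inuse_before > 0 && inuse_before <= objects);
    if (inuse_before == 1)
        return TOY_RETURN_TO_PAGE_ALLOCATOR;
    if (inuse_before == objects)
        return TOY_ADD_TO_PARTIAL;
    return TOY_KEEP_ON_PARTIAL;
}
```

Note that full slabs need no list of their own: the inuse counter in the page structure is enough to tell when a slab has just stopped being full.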
One interesting feature of the SLUB allocator is that it can combine slabs
with similar object sizes and parameters. The result is fewer slab caches
in the system (a 50% reduction is claimed), better locality of slab
allocations, and less fragmentation of slab memory. The patch does note:
Note that merging can expose heretofore unknown bugs in the kernel
because corrupted objects may now be placed differently and corrupt
differing neighboring objects. Enable sanity checks to find those.
Causing bugs to stand out is generally considered to be a good thing, but
wider use of the SLUB allocator could lead to some quirky behavior until
those new bugs are stamped out.
Wider use may be in the cards: the SLUB allocator is in the -mm tree now
and could hit the mainline as soon as 2.6.22. The simplified code is
attractive, as is the claimed 5-10% performance increase. If merged, SLUB
is likely to coexist with the current slab allocator (and the SLOB
allocator intended for small systems) for some time. In the longer term,
the current slab code may be approaching the end of its life.
Patches and updates
Networking
- Dmitry Torokhov: RF Kill. (April 10, 2007)
Virtualization and containers
- Rusty Russell: lguest. (April 10, 2007)
Page editor: Jonathan Corbet