Release status
Kernel release status
The current 2.6 prepatch is 2.6.22-rc3,
released on May 25.
"The geeks with embedded hardware can consider themselves doubly
special (and not just because your mothers told you you are), because we've
got updates to ARM, SH and Blackfin. What more could you possibly want?
Some ATA updates? USB suspend problem solving? Infiniband? DVB and MMC
updates? Network drivers and some fixes for silly network problems? Yeah we
got them!" The long-format changelog has the details.
The current stable 2.6 release is 2.6.21.3, released on May 24 with a
single patch: a security fix for the geode-aes driver. 2.6.21.2 came out on May 23
with a rather longer set of fixes.
For older kernels: 2.6.20.12 was released on
May 24 with the geode-aes fix. 2.6.16.52-rc1 showed up on
May 25 with a handful of fixes.
Kernel development news
Quote of the week
Over the years, we've done lots of nice "extended functionality"
stuff. Nobody ever uses them. The only thing that gets used is the
standard stuff that everybody else does too.
-- Linus Torvalds
The return of syslets
Things have been quiet on the syslet/threadlet/fibril
front for some time. Part of the reason for that, it would seem, is that
Ingo Molnar has been busy with the completely fair scheduler work and has
not been able to get back to this other little project. This work is not
dead, though; instead it has been picked up by Zach Brown (who came up with
the original "fibril" concept). Zach has released
an updated patch bringing this
work back to the foreground. He has not made a whole lot of changes to the
syslet code - yet - but that does not mean that the patch is uninteresting.
Zach's motivation for this work, remember, was to make it easier to
implement and maintain proper asynchronous I/O (AIO) support in the kernel. His
current work continues toward that goal:
For the time being I'm focusing on simplifying the mechanisms that
support the sys_io_*() interface so I never ever have to debug
fs/aio.c (also known as chewing glass to those of us with the
scars) again.
In particular, one
part of the new syslet patch is a replacement for the
io_submit() system call, which is the core of the current AIO
implementation. Rather than start the I/O and return, the new
io_submit() uses the syslet mechanism, eliminating a lot of
special-purpose AIO code in the process. Zach's stated goal is to
get rid of the internal kiocb structure altogether. The current
code is more of a proof of concept, however, with many details yet to be
filled in. Some benchmarks have been posted, but, as Zach says,
"They haven't wildly regressed, that's about as much as can be said
with confidence so far." It is worth noting that, with this patch,
the kernel is able to do asynchronous buffered I/O through
io_submit(), something which the mainline has never yet supported.
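For readers who have not used the interface under discussion, the following is a minimal sketch of how an application drives the existing AIO system calls from user space. It assumes a Linux system providing <linux/aio_abi.h> and invokes the system calls directly through syscall() rather than through a wrapper library. The helper names aio_read_once() and aio_demo() and the scratch file under /tmp are invented for illustration; none of this comes from Zach's patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read up to len bytes from the start of path using the kernel AIO calls.
 * Returns the byte count reported in the completion event, or -1. */
static long aio_read_once(const char *path, char *buf, size_t len)
{
    aio_context_t ctx = 0;           /* must be zero before io_setup() */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long res = -1;
    int fd = open(path, O_RDONLY);   /* buffered: no O_DIRECT */

    if (fd < 0)
        return -1;
    if (syscall(SYS_io_setup, 1L, &ctx) < 0) {
        close(fd);
        return -1;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    /* Queue one request, then wait for exactly one completion. */
    if (syscall(SYS_io_submit, ctx, 1L, list) == 1 &&
        syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) == 1)
        res = (long)ev.res;

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return res;
}

/* Usage: write a small scratch file, then read it back through AIO.
 * Returns 0 if the data round-trips, -1 otherwise. */
static int aio_demo(void)
{
    char path[] = "/tmp/aio_demo_XXXXXX";
    char buf[16] = { 0 };
    int fd = mkstemp(path);
    long n;

    if (fd < 0)
        return -1;
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);
    n = aio_read_once(path, buf, sizeof(buf));
    unlink(path);
    return (n == 5 && memcmp(buf, "hello", 5) == 0) ? 0 : -1;
}
```

Because the file is opened without O_DIRECT, this is the buffered case mentioned above. The calling sequence is the same whether or not the kernel manages to complete the request asynchronously; only the time spent inside io_submit() changes.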
The biggest area of discussion, though, has been over Jeff Garzik's
suggestion that the kevent code should be integrated with syslets. Some
people like the idea, but others, including
Ingo, think that kevents do not provide any sort of demonstrable
improvement over the current epoll interface. Ulrich Drepper, the glibc
maintainer, disagreed with that assessment,
saying that the kevent interface was a step in the right direction even if it
does not perform any better.
The reasoning behind that point of view is worth a look. The use of the
epoll interface requires the creation of a file descriptor. That is fine
when applications use epoll directly, but it can be problematic if glibc is
trying to poll for events (I/O completions, say) that the application does
not see directly.
There is a single space for file descriptors, and applications often think
they know what should be done with every descriptor in that space. If
glibc starts creating its own private file descriptors, it will find itself
at the mercy of any application which closes random descriptors, uses
dup() without care, etc. So there is no way for glibc to use file
descriptors independently from the application.
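The hazard is easy to reproduce. The sketch below plays both roles in one program: a "library" that quietly creates an epoll descriptor, and an "application" that closes a range of descriptors it believes it owns. The function names and the lib_private_fd variable are hypothetical; only epoll_create1(), epoll_wait(), and close() are real interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical library state: a descriptor the application never sees. */
static int lib_private_fd = -1;

/* Library start-up: quietly create an epoll instance. */
static int lib_init(void)
{
    lib_private_fd = epoll_create1(0);
    return lib_private_fd < 0 ? -1 : 0;
}

/* Library polling for events; returns 0 on success or an errno value. */
static int lib_poll(void)
{
    struct epoll_event ev;

    if (epoll_wait(lib_private_fd, &ev, 1, 0) < 0)
        return errno;
    return 0;
}

/* Application "cleanup" which closes every descriptor in [lo, hi],
 * unaware that one of them might belong to the library. */
static void app_close_range(int lo, int hi)
{
    for (int fd = lo; fd <= hi; fd++)
        close(fd);
}
```

If lib_init() is followed by an app_close_range() call which covers lib_private_fd, the next lib_poll() fails with EBADF even though the library itself did nothing wrong.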
Possible solutions exist, such as giving glibc a set of private, hidden
descriptors. But Ulrich would rather just go with a memory-based interface
which avoids the problem altogether. And Linus would rather not create any new interfaces at
all. All told, it has the feel of an unfinished discussion; we'll be
seeing it again.
Slab defragmentation
Memory defragmentation is a subject which has appeared often on this page -
even if no solutions have yet found their way into the mainline kernel.
Most of the defragmentation approaches out there work at the page level
with the idea of being able to satisfy multi-page allocations reliably.
There is another type of fragmentation problem, however, which also has the
ability to complicate the kernel's memory management: fragmentation within
slab pages.
The slab allocator grabs full pages and divides them into allocations of
the same size. For example, kernel code which frequently allocates a
specific structure type will create a slab for that type, allowing those
allocations to be satisfied quickly and efficiently. The slab allocator
can release pages back to the kernel when all of the objects within those
pages have been freed. In real use, however, objects tend to get spread
across many pages, leaving the allocator with a pile of partially-used
pages and no way to return memory to the system. This sort of internal
fragmentation can lead to inefficient memory usage and the inability to
reclaim memory when it is needed.
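A toy model makes the arithmetic concrete. The following is a userspace sketch, not kernel code: the page count, the number of objects per page, and all of the names are invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define NR_PAGES      8   /* pages owned by the (hypothetical) slab */
#define OBJS_PER_PAGE 4   /* equal-sized objects carved from each page */

struct slab_model {
    unsigned char used[NR_PAGES][OBJS_PER_PAGE];  /* 1 = live object */
};

/* Allocate every object in every page. */
static void slab_fill(struct slab_model *m)
{
    memset(m->used, 1, sizeof(m->used));
}

/* Free everything except slot 0 of each page: the worst case. */
static void slab_fragment(struct slab_model *m)
{
    for (int p = 0; p < NR_PAGES; p++)
        for (int s = 1; s < OBJS_PER_PAGE; s++)
            m->used[p][s] = 0;
}

static int live_objects(const struct slab_model *m)
{
    int n = 0;

    for (int p = 0; p < NR_PAGES; p++)
        for (int s = 0; s < OBJS_PER_PAGE; s++)
            n += m->used[p][s];
    return n;
}

/* Only pages with no live objects can go back to the system. */
static int reclaimable_pages(const struct slab_model *m)
{
    int n = 0;

    for (int p = 0; p < NR_PAGES; p++) {
        int live = 0;

        for (int s = 0; s < OBJS_PER_PAGE; s++)
            live += m->used[p][s];
        if (live == 0)
            n++;
    }
    return n;
}

/* Pages needed if the live objects were packed together. */
static int pages_if_packed(const struct slab_model *m)
{
    return (live_objects(m) + OBJS_PER_PAGE - 1) / OBJS_PER_PAGE;
}
```

After slab_fill() and slab_fragment(), 24 of the 32 slots are free, yet reclaimable_pages() reports zero. Packing the eight survivors together would need only two pages, leaving six to return to the system.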
Christoph Lameter's slab
defragmentation patch aims to solve this problem by getting slab users
to cooperate in freeing specific slab pages. A defragmentation-aware slab
user will start by creating a structure of the new kmem_cache_ops
type:
    struct kmem_cache_ops {
        void *(*get)(struct kmem_cache *cache, int nr, void **objects);
        void (*kick)(struct kmem_cache *cache, int nr, void **objects,
                     void *private);
    };
In this structure are two methods which the slab user must define. When
the slab code picks a specific page to try to free (typically a page with a
relatively small number of allocated objects), it will make an array of
those objects and pass it to the get() method. That method has a
guarantee that all of the objects are allocated at the time of the call;
its job is to increase the reference count of each object to prevent it
from being freed while other things are happening. The return value is a
private pointer which will be used later.
Note that the get() method is called in something like interrupt
context with slab locks held. So it cannot do a whole lot, and, in
particular, it cannot call any slab operations.
After get() returns, the slab code will pass the same parameters
into kick(), along with whatever value get() returned.
Depending on the situation, the private value could be a pointer
to internal housekeeping or simply a flag saying that it will not be
possible to free all of the objects. Assuming it is possible,
kick() should attempt to free every object in the objects
array. Slab operations are permissible in kick(), and the
function is welcome to reallocate and move the objects. Reallocation will
have the effect of freeing the target page and coalescing objects into a
smaller number of fully-used pages.
There is no return value from kick(); the slab code simply checks
to see if there are any remaining objects on the page to decide whether the
operation succeeded or not. It is perfectly acceptable for the operation
to fail; that will happen, for example, if code in other parts of the
kernel holds references to the target objects.
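The get()/kick() protocol described above can be mocked up in user space. In this sketch the kmem_cache_ops layout is taken from the patch description, but everything else is an assumption made for illustration: struct demo_obj, its refcount convention (a count of users other than the cache itself), and the freed flag which stands in for actually handing the object back to the slab.

```c
#include <assert.h>
#include <stddef.h>

struct kmem_cache;   /* opaque here; the real type lives in the kernel */

/* Layout as given in the patch description. */
struct kmem_cache_ops {
    void *(*get)(struct kmem_cache *cache, int nr, void **objects);
    void (*kick)(struct kmem_cache *cache, int nr, void **objects,
                 void *private);
};

/* Hypothetical cached object. */
struct demo_obj {
    int refcount;    /* users other than the cache itself */
    int freed;       /* stands in for returning the object to the slab */
};

/* get(): pin every object so none can vanish before kick() runs.
 * Only cheap work belongs here; no slab calls. */
static void *demo_get(struct kmem_cache *cache, int nr, void **objects)
{
    (void)cache;
    for (int i = 0; i < nr; i++)
        ((struct demo_obj *)objects[i])->refcount++;
    return NULL;     /* this sketch needs no private state */
}

/* kick(): drop the pin and free each object nobody else is using. */
static void demo_kick(struct kmem_cache *cache, int nr, void **objects,
                      void *private)
{
    (void)cache;
    (void)private;
    for (int i = 0; i < nr; i++) {
        struct demo_obj *obj = objects[i];

        if (--obj->refcount == 0)
            obj->freed = 1;
    }
}

static const struct kmem_cache_ops demo_ops = {
    .get  = demo_get,
    .kick = demo_kick,
};

/* What the slab code checks afterward: did every object leave the page? */
static int page_emptied(int nr, void **objects)
{
    for (int i = 0; i < nr; i++)
        if (!((struct demo_obj *)objects[i])->freed)
            return 0;
    return 1;
}
```

An object which is still referenced elsewhere survives kick() with its original count restored, so page_emptied() returns zero and the attempt is simply counted as a failure.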
The slab creation function has had its API changed to allow the association
of a set of operations with a given cache:
    struct kmem_cache *kmem_cache_create(const char *name, size_t size,
                size_t align, unsigned long flags,
                void (*ctor)(void *, struct kmem_cache *, unsigned long),
                const struct kmem_cache_ops *ops);
The destructor is no longer used, so it has been removed from the list of
kmem_cache_create() parameters and replaced by the ops
structure.
The patch includes code to add defragmentation support for the inode and
dentry caches - often the two largest slab caches in a running system.
There is also a new function:
    int kmem_cache_vacate(struct page *page);
This function will attempt to move all slab objects out of page,
which really should be a page managed by the slab allocator; a non-zero
return value indicates success. Among other things, this function can be
used to clear specific pages which would help complete a higher-order
allocation.
There has been relatively little discussion of this patch set; the core
concept appears not to be overly controversial. It looks like a relatively
low-overhead way to improve how the kernel uses memory; even the most
critical reviewer can have a hard time getting upset about that.
Process containers
Back in September, LWN took a look at
Rohit
Seth's containers patch. Since that time, containers development has
moved on to Paul Menage who, like Rohit, posts from a google.com address.
The patch has evolved considerably, to the point that Rohit's name no
longer appears within it. As of the recently posted
containers V10 patch, this
mechanism is reaching a reasonably mature state.
This patch introduces a couple of new concepts into the kernel. The first
one has an old name: "subsystem". Fortunately, the driver core has just
removed its "subsystem" concept, leaving the term free. In the container
patch, a subsystem is some part of the kernel which might have an interest
in what groups of processes are doing. Chances are that most subsystems
will be involved with resource management; for example, the container patch
turns the Linux cpusets mechanism (which binds processes to specific groups
of processors) into a subsystem.
A "container" is a group of processes which shares a set of parameters used
by one or more subsystems. In the cpuset example, a container would have a
set of processors which it is entitled to use; all processes within the
container inherit that same set. Other (not yet existing) subsystems could
use containers to enforce limits on CPU time, I/O bandwidth usage, memory
usage, filesystem visibility, and so on. Containers are hierarchical, in
that one container can hold others.
As an example, consider a simple hierarchy. A server used
to host containerized guests could establish two top-level containers to
control the usage of CPU time. Guests, perhaps, could be allowed 90% of
the CPU, but the administrator may want to place system tasks in a separate
container which will always get at least 10% of the processor - that way,
the mail will continue to be delivered regardless of what the guests are
doing. Within the "Guests" container, each individual guest has its own
container with specific CPU usage policies.
The container mechanism is not limited to a single hierarchy; instead, the
administrator can create as many hierarchies as desired. So, for example,
the administrator of the system described above could create an entirely
different hierarchy for the control of network bandwidth usage. By
default, all processes would be in the same container, but it is possible
to set up policy which would shift processes to a different container when
they run a specific application. So a web browser might be moved into a
container which gets a relatively high portion of the available bandwidth
while Bittorrent clients find themselves relegated to an unhappy container
with almost no bandwidth available.
Different container hierarchies need not resemble each other in any way.
Each hierarchy has one or more subsystems associated with it; a subsystem
can only be attached to a single hierarchy. If there is more than one
hierarchy, each process in the system will be in more than one container -
one in each hierarchy.
The administration of containers is performed through a special virtual
filesystem. The documentation suggests that it could be mounted on
/dev/container, which is a bit strange; it has nothing to do with
devices. One container filesystem instance will be mounted for each
hierarchy to be created. The association of subsystems with hierarchies is
done at mount time, by way of mount options. By default, all known
subsystems are associated with a hierarchy, so a command like:
    mount -t container none /containers
would create a single container hierarchy with all known subsystems on
/containers. A setup like the one described above, instead, could
be created with something like:
    mount -t container -o cpu cpu /containers/cpu
    mount -t container -o net net /containers/net
The desired subsystems for each container hierarchy are simply provided as
options at mount time. Note that the "cpu" and "net" subsystems mentioned
above do not actually exist in the current container patch set.
Creating new containers is just a matter of making a directory in the
appropriate spot in the hierarchy. Containers have a file called
tasks; reading that file will yield a list of all processes
currently in the container. A process can be added to a container by
writing its ID to the tasks file. So a simple way to create a
container and move a shell into it would be:
    mkdir /containers/new_container
    echo $$ > /containers/new_container/tasks
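A program can perform the same two steps with ordinary file operations. The sketch below assumes only what is described above: a container is a directory, and that directory contains a tasks file which accepts a process ID. The helper names container_create() and container_attach() are invented for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a container: just a directory in the mounted hierarchy. */
static int container_create(const char *dir)
{
    return mkdir(dir, 0755);
}

/* Move a process into a container by writing its ID to "tasks".
 * The file is expected to exist already (the container filesystem
 * provides it), so it is opened without O_CREAT. */
static int container_attach(const char *dir, pid_t pid)
{
    char path[4096], buf[32];
    ssize_t written;
    int fd, len;
    int n = snprintf(path, sizeof(path), "%s/tasks", dir);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    written = write(fd, buf, (size_t)len);
    close(fd);
    return written == len ? 0 : -1;
}
```

The equivalent of the shell commands would then be container_create("/containers/new_container") followed by container_attach("/containers/new_container", getpid()).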
Subsystems can add files to containers for use in setting resource limits
or otherwise controlling how the subsystem works. For example, the cpuset
subsystem (which does exist) adds a file called cpus containing
the list of CPUs established for that container; there are several other
files added as well.
It's worth noting that the container patch does not add a single system
call; all of the management is performed through the virtual filesystem.
With a basic container mechanism in place, most of the action in the future
is likely to be in the creation of new subsystems. One can imagine, for
example, hooking the existing process ID virtualization code into
containers, as well as adding no end of resource controllers. The creation
of a subsystem is relatively straightforward; the subsystem code starts by
creating and registering a container_subsys structure. That
structure contains an integer subsys_id field which should be set
to the subsystem's specific ID number; these numbers are set statically in
<linux/container_subsys.h>. Implicit in this arrangement is
that subsystems must be built into the kernel; there is no provision for
adding subsystems as loadable modules.
Each subsystem defines a set of methods to be used by the container code,
beginning with:
    int (*create)(struct container_subsys *ss, struct container *cont);
    int (*populate)(struct container_subsys *ss, struct container *cont);
    void (*destroy)(struct container_subsys *ss, struct container *cont);
These three are called whenever a container is created or destroyed; this is
the chance for the subsystem to set up any bookkeeping it will need for the
new container (or clean up for a container which is going away). The
populate() method is called after the successful creation of a new
container; its purpose is to allow the subsystem to add management files to
that container.
Four methods are for the addition and removal of processes:
    int (*can_attach)(struct container_subsys *ss, struct container *cont,
                      struct task_struct *tsk);
    void (*attach)(struct container_subsys *ss, struct container *cont,
                   struct container *old_cont, struct task_struct *tsk);
    void (*fork)(struct container_subsys *ss, struct task_struct *task);
    void (*exit)(struct container_subsys *ss, struct task_struct *task);
If a process is explicitly added to a container after creation, the
container code will call can_attach() to determine whether the
addition should succeed. If the subsystem allows the action to happen, it
should have performed any needed allocations to ensure that the subsequent
attach() call succeeds. When a process forks, fork()
will be called to add the new child to the container. Exiting processes
call exit() to allow the subsystem to clean up.
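To give a feel for how these callbacks fit together, here is a userspace mock of a made-up subsystem which caps the number of tasks in a container. The method signatures follow the ones listed above; everything else is assumed for the sake of the example. In particular, the struct container and struct task_struct definitions are stand-ins, the per-container counters are stored directly in the mock container rather than wherever the real patch keeps subsystem state, the "zero allows, negative errno refuses" convention for can_attach() is an assumption, and move_task() only imitates the ordering the container core is described as following.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; the real definitions differ. */
struct task_struct { int pid; };
struct container   { int nr_tasks; int max_tasks; };

/* A subset of the methods, with the signatures listed in the text. */
struct container_subsys {
    int subsys_id;
    int  (*create)(struct container_subsys *ss, struct container *cont);
    void (*destroy)(struct container_subsys *ss, struct container *cont);
    int  (*can_attach)(struct container_subsys *ss, struct container *cont,
                       struct task_struct *tsk);
    void (*attach)(struct container_subsys *ss, struct container *cont,
                   struct container *old_cont, struct task_struct *tsk);
};

#define DEMO_MAX_TASKS 2   /* hypothetical policy */

/* Set up bookkeeping for a new container. */
static int limit_create(struct container_subsys *ss, struct container *cont)
{
    (void)ss;
    cont->nr_tasks = 0;
    cont->max_tasks = DEMO_MAX_TASKS;
    return 0;
}

/* Clean up for a container which is going away. */
static void limit_destroy(struct container_subsys *ss, struct container *cont)
{
    (void)ss;
    cont->nr_tasks = 0;
}

/* Veto the move if the target is full (assumed convention: 0 = allow). */
static int limit_can_attach(struct container_subsys *ss,
                            struct container *cont, struct task_struct *tsk)
{
    (void)ss;
    (void)tsk;
    return cont->nr_tasks < cont->max_tasks ? 0 : -ENOSPC;
}

/* The move has been approved; this step must not fail. */
static void limit_attach(struct container_subsys *ss, struct container *cont,
                         struct container *old_cont, struct task_struct *tsk)
{
    (void)ss;
    (void)tsk;
    if (old_cont)
        old_cont->nr_tasks--;
    cont->nr_tasks++;
}

static struct container_subsys limit_subsys = {
    .subsys_id  = 0,   /* placeholder ID */
    .create     = limit_create,
    .destroy    = limit_destroy,
    .can_attach = limit_can_attach,
    .attach     = limit_attach,
};

/* Imitates the core's ordering: ask first, then attach. */
static int move_task(struct container_subsys *ss, struct container *to,
                     struct container *from, struct task_struct *tsk)
{
    int ret = ss->can_attach(ss, to, tsk);

    if (ret)
        return ret;
    ss->attach(ss, to, from, tsk);
    return 0;
}
```

Splitting the decision (can_attach()) from the action (attach()) is what lets attach() return void: any check or allocation which might fail has already happened by the time it runs.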
Clearly, there's more to the interface than described here; see the thorough documentation file packaged with
the patch for much more detail. Your editor would not venture a guess as
to when this code might be merged, but it does seem that this is the
mechanism that the containers community has decided to push. So, sooner or
later, it will likely be contained within the mainline.
Page editor: Jonathan Corbet