
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.22-rc3, released on May 25. "The geeks with embedded hardware can consider themselves doubly special (and not just because your mothers told you you are), because we've got updates to ARM, SH and Blackfin. What more could you possibly want? Some ATA updates? USB suspend problem solving? Infiniband? DVB and MMC updates? Network drivers and some fixes for silly network problems? Yeah we got them!" The long-format changelog has the details.

The current stable 2.6 kernel received an update on May 24 with a single patch, being a security fix for the geode-aes driver. Its immediate predecessor came out on May 23 with a rather longer set of fixes.

For older kernels: an update was released on May 24 with the geode-aes fix, and another showed up on May 25 with a handful of fixes.


Kernel development news

Quote of the week

Over the years, we've done lots of nice "extended functionality" stuff. Nobody ever uses them. The only thing that gets used is the standard stuff that everybody else does too.
-- Linus Torvalds


The return of syslets

Things have been quiet on the syslet/threadlet/fibril front for some time. Part of the reason for that, it would seem, is that Ingo Molnar has been busy with the completely fair scheduler work and has not been able to get back to this other little project. This work is not dead, though; instead it has been picked up by Zach Brown (who came up with the original "fibril" concept). Zach has released an updated patch bringing this work back to the foreground. He has not made a whole lot of changes to the syslet code - yet - but that does not mean that the patch is uninteresting.

Zach's motivation for this work, remember, was to make it easier to implement and maintain proper asynchronous I/O (AIO) support in the kernel. His current work continues toward that goal:

For the time being I'm focusing on simplifying the mechanisms that support the sys_io_*() interface so I never ever have to debug fs/aio.c (also known as chewing glass to those of us with the scars) again.

In particular, one part of the new syslet patch is a replacement for the io_submit() system call, which is the core of the current AIO implementation. Rather than start the I/O and return, the new io_submit() uses the syslet mechanism, eliminating a lot of special-purpose AIO code in the process. Zach's stated goal is to get rid of the internal kiocb structure altogether. The current code is more of a proof of concept, with a lot of details yet to fill in. Some benchmarks have been posted, though, as Zach says, "They haven't wildly regressed, that's about as much as can be said with confidence so far." It is worth noting that, with this patch, the kernel is able to do asynchronous buffered I/O through io_submit(), something the mainline kernel has never supported.

The biggest area of discussion, though, has been over Jeff Garzik's suggestion that the kevent code should be integrated with syslets. Some people like the idea, but others, including Ingo, think that kevents do not provide any sort of demonstrable improvement over the current epoll interface. Ulrich Drepper, the glibc maintainer, disagreed with that assessment, saying that the kevent interface is a step in the right direction even if it does not perform any better.

The reasoning behind that point of view is worth a look. The use of the epoll interface requires the creation of a file descriptor. That is fine when applications use epoll directly, but it can be problematic if glibc is trying to poll for events (I/O completions, say) that the application does not see directly. There is a single space for file descriptors, and applications often think they know what should be done with every descriptor in that space. If glibc starts creating its own private file descriptors, it will find itself at the mercy of any application which closes random descriptors, uses dup() without care, etc. So there is no way for glibc to use file descriptors independently from the application.

Possible solutions exist, such as giving glibc a set of private, hidden descriptors. But Ulrich would rather just go with a memory-based interface which avoids the problem altogether. And Linus would rather not create any new interfaces at all. All told, it has the feel of an unfinished discussion; we'll be seeing it again.

Comments (12 posted)

Slab defragmentation

Memory defragmentation is a subject which has appeared often on this page - even if no solutions have yet found their way into the mainline kernel. Most of the defragmentation approaches out there work at the page level with the idea of being able to satisfy multi-page allocations reliably. There is another type of fragmentation problem, however, which also has the ability to complicate the kernel's memory management: fragmentation within slab pages.

The slab allocator grabs full pages and divides them into allocations of the same size. For example, kernel code which will often allocate a specific structure type will create a slab for that type, allowing those allocations to be satisfied quickly and efficiently. The slab allocator can release pages back to the kernel when all of the objects within those pages have been freed. In real use, however, objects tend to get spread across many pages, leaving the allocator with a pile of partially-used pages and no way to return memory to the system. This sort of internal fragmentation can lead to inefficient memory usage and the inability to reclaim memory when it is needed.

Christoph Lameter's slab defragmentation patch aims to solve this problem by getting slab users to cooperate in freeing specific slab pages. A defragmentation-aware slab user will start by creating a structure of the new kmem_cache_ops type:

    struct kmem_cache_ops {
	void *(*get)(struct kmem_cache *cache, int nr, void **objects);
	void (*kick)(struct kmem_cache *cache, int nr, void **objects, 
                     void *private);
    };
In this structure are two methods which the slab user must define. When the slab code picks a specific page to try to free (typically a page with a relatively small number of allocated objects), it will make an array of those objects and pass it to the get() method. That method has a guarantee that all of the objects are allocated at the time of the call; its job is to increase the reference count of each object to prevent it from being freed while other things are happening. The return value is a private pointer which will be used later.

Note that the get() method is called in something like interrupt context with slab locks held. So it cannot do a whole lot, and, in particular, it cannot call any slab operations.

After get() returns, the slab code will pass the same parameters into kick(), along with whatever value get() returned. Depending on the situation, the private value could be a pointer to internal housekeeping or simply a flag saying that it will not be possible to free all of the objects. Assuming it is possible, kick() should attempt to free every object in the objects array. Slab operations are permissible in kick(), and the function is welcome to reallocate and move the objects. Reallocation will have the effect of freeing the target page and coalescing objects into a smaller number of fully-used pages.

There is no return value from kick(); the slab code simply checks to see if there are any remaining objects on the page to decide whether the operation succeeded or not. It is perfectly acceptable for the operation to fail; that will happen, for example, if code in other parts of the kernel holds references to the target objects.

The slab creation function has had its API changed to allow the association of a set of operations with a given cache:

    struct kmem_cache *kmem_cache_create(const char *name, size_t size, 
			 size_t align, unsigned long flags,
			 void (*ctor)(void *, struct kmem_cache *, unsigned long),
			 const struct kmem_cache_ops *ops);

The destructor is no longer used, so it has been removed from the list of kmem_cache_create() parameters and replaced by the ops structure.

The patch includes code to add defragmentation support for the inode and dentry caches - often the two largest slab caches in a running system. There is also a new function:

    int kmem_cache_vacate(struct page *page);

This function will attempt to move all slab objects out of page, which really should be a page managed by the slab allocator; a non-zero return value indicates success. Among other things, this function can be used to clear specific pages which would help complete a higher-order allocation.

There has been relatively little discussion of this patch set; the core concept appears not to be overly controversial. It looks like a relatively low-overhead way to improve how the kernel uses memory; even the most critical reviewer can have a hard time getting upset about that.

Comments (1 posted)

Process containers

Back in September, LWN took a look at Rohit Seth's containers patch. Since that time, containers development has moved on to Paul Menage who, like Rohit, posts from a google.com address. The patch has evolved considerably, to the point that Rohit's name no longer appears within it. As of the recently posted containers V10 patch, this mechanism is reaching a reasonably mature state.

This patch introduces a couple of new concepts into the kernel. The first one has an old name: "subsystem". Fortunately, the driver core has just removed its "subsystem" concept, leaving the term free. In the container patch, a subsystem is some part of the kernel which might have an interest in what groups of processes are doing. Chances are that most subsystems will be involved with resource management; for example, the container patch turns the Linux cpusets mechanism (which binds processes to specific groups of processors) into a subsystem.

A "container" is a group of processes which shares a set of parameters used by one or more subsystems. In the cpuset example, a container would have a set of processors which it is entitled to use; all processes within the container inherit that same set. Other (not yet existing) subsystems could use containers to enforce limits on CPU time, I/O bandwidth usage, memory usage, filesystem visibility, and so on. Containers are hierarchical, in that one container can hold others.

[container hierarchy] As an example, consider the simple hierarchy to the right. A server used to host containerized guests could establish two top-level containers to control the usage of CPU time. Guests, perhaps, could be allowed 90% of the CPU, but the administrator may want to place system tasks in a separate container which will always get at least 10% of the processor - that way, the mail will continue to be delivered regardless of what the guests are doing. Within the "Guests" container, each individual guest has its own container with specific CPU usage policies.

The container mechanism is not limited to a single hierarchy; instead, the administrator can create as many hierarchies as desired. So, for example, the administrator of the system described above could create an entirely different hierarchy for the control of network bandwidth usage. By default, all processes would be in the same container, but it is possible to set up policy which would shift processes to a different container when they run a specific application. So a web browser might be moved into a container which gets a relatively high portion of the available bandwidth while Bittorrent clients find themselves relegated to an unhappy container with almost no bandwidth available.

Different container hierarchies need not resemble each other in any way. Each hierarchy has one or more subsystems associated with it; a subsystem can only be attached to a single hierarchy. If there is more than one hierarchy, each process in the system will be in more than one container - one in each hierarchy.

The administration of containers is performed through a special virtual filesystem. The documentation suggests that it could be mounted on /dev/container, which is a bit strange; it has nothing to do with devices. One container filesystem instance will be mounted for each hierarchy to be created. The association of subsystems with hierarchies is done at mount time, by way of mount options. By default, all known subsystems are associated with a hierarchy, so a command like:

    mount -t container none /containers

would create a single container hierarchy with all known subsystems on /containers. A setup like the one described above, instead, could be created with something like:

    mount -t container -o cpu cpu /containers/cpu
    mount -t container -o net net /containers/net

The desired subsystems for each container hierarchy are simply provided as options at mount time. Note that the "cpu" and "net" subsystems mentioned above do not actually exist in the current container patch set.

Creating new containers is just a matter of making a directory in the appropriate spot in the hierarchy. Containers have a file called tasks; reading that file will yield a list of all processes currently in the container. A process can be added to a container by writing its ID to the tasks file. So a simple way to create a container and move a shell into it would be:

    mkdir /containers/new_container
    echo $$ > /containers/new_container/tasks

Subsystems can add files to containers for use in setting resource limits or otherwise controlling how the subsystem works. For example, the cpuset subsystem (which does exist) adds a file called cpus containing the list of CPUs established for that container; there are several other files added as well.

It's worth noting that the container patch does not add a single system call; all of the management is performed through the virtual filesystem.

With a basic container mechanism in place, most of the action in the future is likely to be in the creation of new subsystems. One can imagine, for example, hooking the existing process ID virtualization code into containers, as well as adding no end of resource controllers. The creation of a subsystem is relatively straightforward; the subsystem code starts by creating and registering a container_subsys structure. That structure contains an integer subsys_id field which should be set to the subsystem's specific ID number; these numbers are set statically in <linux/container_subsys.h>. Implicit in this arrangement is that subsystems must be built into the kernel; there is no provision for adding subsystems as loadable modules.

Each subsystem defines a set of methods to be used by the container code, beginning with:

    int (*create)(struct container_subsys *ss, struct container *cont);
    int (*populate)(struct container_subsys *ss, struct container *cont);
    void (*destroy)(struct container_subsys *ss, struct container *cont);

These three are called whenever a container is created or destroyed; this is the chance for the subsystem to set up any bookkeeping it will need for the new container (or clean up for a container which is going away). The populate() method is called after the successful creation of a new container; its purpose is to allow the subsystem to add management files to that container.

Four methods are for the addition and removal of processes:

    int (*can_attach)(struct container_subsys *ss, struct container *cont, 
                      struct task_struct *tsk);
    void (*attach)(struct container_subsys *ss, struct container *cont,
		   struct container *old_cont, struct task_struct *tsk);
    void (*fork)(struct container_subsys *ss, struct task_struct *task);
    void (*exit)(struct container_subsys *ss, struct task_struct *task);

If a process is explicitly added to a container after creation, the container code will call can_attach() to determine whether the addition should succeed. If the subsystem allows the action to happen, it should have performed any needed allocations to ensure that the subsequent attach() call succeeds. When a process forks, fork() will be called to add the new child to the container. Exiting processes call exit() to allow the subsystem to clean up.

Clearly, there's more to the interface than described here; see the thorough documentation file packaged with the patch for much more detail. Your editor would not venture a guess as to when this code might be merged, but it does seem that this is the mechanism that the containers community has decided to push. So, sooner or later, it will likely be contained within the mainline.

Comments (12 posted)

Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds