Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.20-rc7, released on January 30. Says Linus: "Yes, I know I said I would only do -rc6 and then the final 2.6.20, but the thing is, the known regressions list didn't get whittled down as quickly as I hoped, and as a result we now have a -rc7." There's a fair number of fixes in this release, but not much else.
Previously, 2.6.20-rc6 was released on January 24. It includes quite a few fixes and a couple of new memory technology device (flash) drivers.
As of this writing, no patches have been added to the mainline git repository since the -rc7 release.
The current -mm tree is 2.6.20-rc6-mm3. Recent changes to -mm include a big ACPI update, a new set of dynamic tick and high-resolution timer patches, sysfs shadow directory support, a rework of page cache accounting, preemptible RCU, and a massive set of sysctl() cleanup patches.
For older kernels: 2.6.16.39 was released on January 31. It fixes a relatively small number of problems, none of which have immediately obvious security implications.
Kernel development news
Quotes of the week
Free Linux driver development offered
Greg Kroah-Hartman has sent out an offer to the hardware industry: the kernel development community will write its device drivers for free. "No longer do you have to suffer through all of the different examples in the Linux Device Driver Kit, or pick through the thousands of example drivers in the Linux kernel source tree trying to determine which one is the closest to what you need to do." There is nothing new here, of course, but it is a clear description of the benefits of providing hardware information.
A report from the Linux wireless developers meeting
"Overall, the summit was very productive despite (or because of) the lack of Internet access. The main new items coming out of it were: a commitment to make experimental wireless tarball (and driver) packages available; progress on the new cfg80211 API; and an understanding of the regulatory environment that vendors have to operate in."
A summary of 2.6.20 API changes
As of this writing, the final 2.6.20 kernel has not yet been released. It is close, however. Since any internal API changes meant for 2.6.20 should have happened at least a month ago, it should be safe to put together a summary of the most significant changes. There have been a few of them in this kernel cycle, some of which caused widespread churn through the code base.
- The workqueue API has seen a
major rework which requires changes in almost any code using
workqueues. In short: there are now two different types of work
items (struct work_struct and struct delayed_work), depending on
whether the delay feature is to be used or
not. The work function no longer gets an arbitrary data pointer; its
argument, instead, is a pointer to the work_struct structure
describing the job. If you have code which is broken by these
changes, this set of instructions by David Howells is likely to be
helpful; a brief sketch of the new conventions appears after this list.
- Some additional workqueue changes have been merged as well. There is
a new "freezable" workqueue type, indicating a workqueue which can be
safely frozen during the software suspend process. The new function
create_freezeable_workqueue() will create one. Another new
function, run_scheduled_work(), will cause a
previously-scheduled workqueue entry to be run synchronously. Note
that run_scheduled_work() cannot be used with delayed
workqueues.
- Much of the sysfs-related code has been changed to use struct
device in place of struct class_device. The latter
structure will eventually go away as the class and device mechanisms
are merged.
- There is a new function:
int device_move(struct device *dev, struct device *new_parent);
This function will reparent the given device to new_parent, making the requisite sysfs changes and generating a special KOBJ_MOVE event for user space.
- A number of kernel header files which included other headers no longer
do so. For example, <linux/fs.h> no longer includes
<linux/sched.h>. These changes should speed kernel
build times by getting rid of a large number of unneeded includes, but
might break some out-of-tree modules which do not explicitly include
all the headers they need.
- The internal __alloc_skb() function has a new parameter: the number
of the NUMA node on which the structure should be
allocated.
- The slab allocator API has been cleaned up somewhat. The old
kmem_cache_t typedef is gone;
struct kmem_cache should be used instead. The various
slab flags (SLAB_ATOMIC, SLAB_KERNEL, ...) were all
just aliases for the equivalent GFP_ flags, so they have been
removed.
- A new boot-time parameter (profile=sleep) causes the kernel to
profile the amount of time spent in uninterruptible sleeps.
- dma_cache_sync() has a new argument: the device
structure for the device doing DMA.
- The paravirt_ops code
has gone in, making it easier for the kernel to support multiple
hypervisors. Anybody wanting to port a hypervisor to this code should
note that it is somewhat volatile and likely to remain that way for
some time.
- The struct path
changes have been merged, with changes rippling through the
filesystem and device driver subsystems. In short, code accessing the
dentry pointer from a struct file pointer, which used to read
file->f_dentry, should now read
file->f_path.dentry. There are defines making the older
style of code work - for now.
- There is now a generic layer for human input devices; the USB HID code
has been switched over to this new layer.
- A new function, round_jiffies(), rounds a jiffies value up to
the next full second (plus a per-CPU offset). Its purpose is to
encourage timeouts to occur together, with the result that the CPU
wakes up less frequently.
- The block "activity function," a callback intended for the implementation of disk activity lights in software, has been removed; nobody was actually using it.
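As promised above, here is a minimal sketch of the new workqueue conventions; the structure and function names (my_device, my_work_handler) are invented for the example, but the INIT_WORK() and container_of() usage reflects the 2.6.20 API. Code needing the delay feature embeds a struct delayed_work and uses INIT_DELAYED_WORK() instead.

#include <linux/workqueue.h>

struct my_device {                     /* hypothetical driver structure */
    struct work_struct work;           /* embedded work item */
    int status;
};

/* 2.6.20-style work function: it receives a pointer to the work_struct
   itself; container_of() recovers the enclosing structure that used to
   arrive as an arbitrary data pointer. */
static void my_work_handler(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, work);

    dev->status = 1;
}

static void my_device_setup(struct my_device *dev)
{
    INIT_WORK(&dev->work, my_work_handler);   /* no data argument anymore */
    schedule_work(&dev->work);
}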
For those looking forward to what might happen in 2.6.21, a couple of significant changes can be predicted. The old SA_* flags used with request_irq() are likely to go away; the newer IRQF_* flags should be used instead. There is also a timer API change waiting for the next development cycle. Beyond that, a surprise or two is guaranteed; watch LWN for the details as the patches get merged.
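As an illustration of that interrupt-flag change (a sketch only; the handler and device names are made up), converting code is mostly a matter of substituting the IRQF_ names, which are already present in 2.6.20:

#include <linux/interrupt.h>

static irqreturn_t my_interrupt(int irq, void *dev_id)
{
    /* ... acknowledge and handle the device ... */
    return IRQ_HANDLED;
}

static int my_request_irq(unsigned int irq, void *dev)
{
    /* Formerly: request_irq(irq, my_interrupt, SA_SHIRQ, "mydev", dev); */
    return request_irq(irq, my_interrupt, IRQF_SHARED, "mydev", dev);
}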
Network namespaces
In recent times there has been quite a bit of attention paid to hypervisors and full virtualization (or paravirtualization) solutions. The proponents of the container approach - where all virtualized systems run in well-contained sandboxes on the host's kernel - have been relatively quiet. They have not been idle, however, as can be seen in the large amount of work going into network namespaces.
For the container approach to work, every global resource in the system must be wrapped in some sort of namespace. This wrapping has been done for some relatively simple resources, such as the utsname information or process IDs; some of the resulting code has already found its way into the mainline. There is not a whole lot of use, however, for containers which are completely isolated from the rest of the world; usually some sort of networking capability is needed. For example, containers can usefully contain a web browser (keeping it from exposing the rest of the system should it prove vulnerable) or a web server - but only if networking works. But containers should not be able to see each others' packet streams, and, ideally, should be able to bind to the same ports without interfering with each other.
Making that work requires network namespaces. These namespaces virtualize all access to network resources - interfaces, port numbers, etc., - allowing each container the network access it needs (but no more). As with all other problems in computer science, the network namespace issue can be addressed with another layer of indirection. There is a small problem with this approach, however: the networking code is a vast pile of complex, highly-tuned code overseen by developers who have little tolerance for changes which introduce performance overhead or potential bugs. Getting any sort of network namespace implementation merged is going to require quite a bit of very careful work.
One approach can be seen in the L2 network namespace patch set posted recently by Dmitry Mishin. These patches concentrate on the lower levels of the network stack, trying to get proper namespaces established for network devices and the IPv4 layer. In an attempt to minimize churn in the networking code, the L2 namespace patch introduces the idea of the "current network namespace," kept in a per-CPU variable. The current namespace is implemented as a stack, with push and pop operations; in theory, it allows all network operations to happen within the proper namespace. Your editor was unable to convince himself that this scheme would work properly in the face of any sort of kernel preemption, but that may just be a matter of not having looked hard enough.
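To make the idea concrete, here is a hypothetical sketch (the names net_ns, current_net_ns, net_ns_push(), and net_ns_pop() are invented; the actual patch may well differ) of how a stacked, per-CPU "current namespace" might be switched around an operation. The per-CPU access is also exactly where the preemption question mentioned above arises.

#include <linux/percpu.h>

struct net_ns;                          /* the namespace object */

static DEFINE_PER_CPU(struct net_ns *, current_net_ns);

/* Switch to a namespace, returning the previous one so the caller can
   restore it - a simple stack discipline.  Nothing here prevents
   preemption from migrating the task to another CPU between the push
   and the matching pop. */
static inline struct net_ns *net_ns_push(struct net_ns *ns)
{
    struct net_ns *old = __get_cpu_var(current_net_ns);

    __get_cpu_var(current_net_ns) = ns;
    return old;
}

static inline void net_ns_pop(struct net_ns *old)
{
    __get_cpu_var(current_net_ns) = old;
}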
The net_device structure gains a net_ns field, providing the namespace to which the device belongs. It is set to whatever namespace is current when the device is created. The device lookup functions have become namespace-aware; if a device does not belong to the current namespace, it becomes invisible. A different version of the loopback device is created for each namespace. Then, the IPv4 routing code has been extended so that each namespace gets its own set of routing tables. The code which matches incoming packets to sockets has also been made namespace-aware; there is still a single hash table, but the namespace has been made part of the match criteria.
Network interfaces made up of real hardware will normally remain in the root namespace. Communication with other namespaces is made possible by way of a "virtual Ethernet" device, included with the patch set. A virtual device can be thought of as a wire into a restricted namespace; it presents one device within that namespace and one in the parent (normally root) namespace. Packets written to one end show up at the other. With the addition of a few routing rules in the root namespace, packets meeting the right criteria can be directed into (and out of) specific namespaces.
The L2 namespace patch provides the plumbing for the creation of little virtualized Internets within a single system, but it does not yet provide complete isolation. A process within its namespace can reconfigure its interfaces, perhaps creating problems for the system as a whole. Tightening things down is left to the L3 namespace patch, posted by Daniel Lezcano. An L3 namespace is always the child of an L2 namespace; it is the end of the line, however, being unable to have child namespaces of its own. There are also no network admin capabilities in an L3 namespace; once an L3 namespace is created, it is stuck with whatever network configuration its parent gave it.
The end result is that a contained system can be put within an L3 namespace and it should be able to perform networking without interfering with (or even seeing) other systems in other namespaces.
A somewhat different approach can be seen in the network namespace patches posted by Eric W. Biederman. Eric, aware of the challenges involved in getting network namespaces merged, is far more concerned with the process than the specific namespace implementation. So his patches focus mostly on getting the internal APIs right.
The first step is to figure out how network namespaces are to be represented. Rather than use a structure, Eric has opted for a mechanism which marks all network-related global resources in a special way. These resources get linked into a special section of the kernel which can be cloned when a new namespace is created. Each global variable becomes an offset into the per-namespace section; it must be accessed by way of a special macro. This approach appears cumbersome, but it has a couple of advantages. If a module with per-namespace variables is loaded, those variables can be added to each existing namespace on the fly. And, if namespaces are not in use, the overhead of the whole mechanism drops to zero. This is an important feature: to have a hope of being merged, a network namespace implementation will have to have no impact on systems which are not using it.
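The macro names below are invented for illustration (the real patch set may look quite different); the sketch only shows the general shape of the approach, which closely mirrors the kernel's per-CPU variable mechanism: marked variables land in a dedicated section, and accesses become offsets into a namespace's private copy of that section.

/* Hypothetical sketch - all names invented for illustration. */
extern char __per_net_start[];          /* section start, from the linker script */

struct net_ns {
    char *pernet_area;                  /* this namespace's copy of the section */
};

#define DEFINE_PER_NET(type, name) \
    __attribute__((__section__(".data.pernet"))) __typeof__(type) per_net__##name

#define per_net(name, ns) \
    (*(__typeof__(per_net__##name) *) \
     ((ns)->pernet_area + ((char *)&per_net__##name - __per_net_start)))

/* A formerly-global counter, now one copy per namespace: */
static DEFINE_PER_NET(unsigned long, example_rx_packets);

static void example_count_rx(struct net_ns *ns)
{
    per_net(example_rx_packets, ns)++;
}

When namespace support is configured out, a macro like per_net() can simply expand to a direct reference to the variable, which is presumably how the zero-overhead property described above is obtained.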
The patch set (31 parts strong) then works through various parts of the networking API, adding a namespace parameter to functions which need it. There is no global "current namespace" concept in Eric's patches; it is, instead, an explicit parameter everywhere. Thus, for example, every function which creates a socket (they exist in every protocol implementation) gets a namespace parameter. The sk_buff structure (which represents a packet) has a namespace field assigned from either the process creating it (for outbound packets) or the device it was received from; the various protocol-specific functions are expected to take that namespace into account. Functions dealing with netlink sockets get namespace parameters, as do those which implement network device lookup, event generation, and Unix-domain sockets. Like the L2 patches, Eric's implementation includes a virtual network device (called "etun") which can be used to route packets between namespaces.
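Schematically (using dev_get_by_name() purely as an example of the sort of signature change involved, and the invented struct net_ns name from the sketch above), the conversion looks like this:

/* Before: the lookup implicitly searches the one global namespace */
struct net_device *dev_get_by_name(const char *name);

/* After (schematic): the caller must say which namespace to search */
struct net_device *dev_get_by_name(struct net_ns *ns, const char *name);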
Unlike the L2/L3 patches, Eric's work deals with the virtualization of the networking-related /proc, sysctl, and sysfs interfaces. Doing so requires adding shadow directory support to sysfs. Shadow directories loosen the connection between sysfs and the internal kobject hierarchy, allowing different namespaces to see different contents in the same locations.
A key aspect of Eric's patch is that it implements little in the way of actual namespace mechanism. Instead, much of the networking stack is made to test the namespace it is given and fail if the root namespace is not in use. The idea is to get the interfaces right first, then to start to fill in the mechanism in relatively small pieces. The tests ensure that the network stack will not surprise users by doing the wrong thing if it is not yet fully prepared to handle non-root namespaces.
Despite the posting of all these patches, the amount of discussion has been quite low. One gets the sense that the network developers have not yet started to take these patches seriously. This issue seems unlikely to go away, however; there remains a great deal of interest in getting container features into the mainline kernel. Sooner or later, this discussion is likely to take off.
Fibrils and asynchronous system calls
The kernel's support for asynchronous I/O is incomplete, and it always has been. While certain types of operations (direct filesystem I/O, for example) work well in an asynchronous mode, many others do not. Often implementing asynchronous operation is hard, and nobody has ever gotten around to making it work. In other cases, patches have been around for some time, but they have not made it into the mainline; AIO patches can be fairly intrusive and hard to merge. Regardless of the reason, things tend to move very slowly in the AIO area.
Zach Brown has decided to stir things up by asking a basic question: could it be that the way the kernel implements AIO is all wrong? The current approach adds a fair amount of complexity, requiring explicit AIO handling in every subsystem which supports it. IOCB structures have to be passed around, and kernel code must always check whether it is supposed to block on a given operation or return one of two "it's in the works" codes. It would be much nicer if most kernel operations could simply be invoked asynchronously without having to clutter them up with explicit support.
To that end, Zach has posted a preliminary patch set which simplifies asynchronous I/O support considerably, but doesn't stop there: it also makes any system call invokable in an asynchronous mode. The key is a new type of in-kernel lightweight thread known as a "fibril."
A fibril is an execution thread which only runs in kernel space. A process can have any number of fibrils active, but only one of them can actually execute in the processor(s) at any given time. Fibrils have their own stack, but otherwise they share all of the resources of their parent process. They are kept in a linked list attached to the task structure.
When a process makes an asynchronous system call, the kernel creates a new fibril and executes the call in that context. If the system call completes immediately, the fibril is destroyed and the result goes back to the calling process in the usual way. Should the fibril block, however, it gets queued and control returns to the submitting code, which can then return the "it's in progress" status code. The "main" process can then run in user space, submit more asynchronous operations, or do just about anything else.
Sooner or later, the operation upon which the fibril blocked will complete. The wait queue entry structure has been extended to include information on which fibril was blocked; the wakeup code will find that fibril and make it runnable by adding it to a special "run queue" linked list in the parent task structure. The kernel will then schedule the fibril for execution, perhaps displacing the "main" process. That fibril might make some progress and block again, or it may complete its work. In the latter case, the final exit code is saved and the fibril is destroyed.
By moving asynchronous operations into a separate thread, Zach's patch simplifies their implementation considerably - with few exceptions, kernel code need not be changed at all to support asynchronous calls. The creation of fibrils is intended to make it all happen quickly - fibrils are intended to be less costly than kernel threads or ordinary processes. Their one-at-a-time semantics help to minimize the concurrency issues which might otherwise come up.
The user-space interface starts with a structure like this:
struct asys_input {
    int syscall_nr;
    unsigned long cookie;
    unsigned long nr_args;
    unsigned long *args;
};
The application is expected to put the desired system call number in syscall_nr; the arguments to that system call are described by args and nr_args. The cookie value will be given back to the process when the operation completes. User space can create an array of these structures and pass them to:
long asys_submit(struct asys_input *requests, unsigned long nr_requests);
The kernel will then start each of the requests in a fibril and return to user space. When the process develops an interest in the outcome of its requests, it uses this interface:
struct asys_completion {
    long return_code;
    unsigned long cookie;
};

long asys_await_completion(struct asys_completion *comp);
A call to asys_await_completion() will block until at least one asynchronous operation has completed, then return the result in the structure pointed to by comp. The cookie value given at submission time is returned as well.
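For illustration, here is a hypothetical user-space sketch built on the structures above; it assumes that wrapper functions for the (experimental, unmerged) asys_submit() and asys_await_completion() calls exist somewhere, and it submits a single asynchronous read() and then reaps its result.

/* Hypothetical usage sketch; struct asys_input and struct asys_completion
   are as defined above, and the asys_*() wrappers are assumed to exist. */
#include <sys/syscall.h>

extern long asys_submit(struct asys_input *requests, unsigned long nr_requests);
extern long asys_await_completion(struct asys_completion *comp);

static char buffer[4096];

static int submit_async_read(int fd)
{
    static unsigned long args[3];
    static struct asys_input req;

    args[0] = fd;
    args[1] = (unsigned long) buffer;
    args[2] = sizeof(buffer);

    req.syscall_nr = __NR_read;     /* run read() asynchronously */
    req.cookie     = 1;             /* handed back at completion time */
    req.nr_args    = 3;
    req.args       = args;

    return asys_submit(&req, 1);
}

static long reap_one(void)
{
    struct asys_completion comp;

    asys_await_completion(&comp);   /* blocks until something finishes */
    return comp.return_code;        /* bytes read, or a negative error */
}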
Your editor notes that the current asys_await_completion() implementation does not check to see if any asynchronous operations are outstanding; if none are, the call is liable to wait for a long time. There are a number of other issues with the patch set, all acknowledged by their author. For example, little thought has been given to how fibrils should respond to signals. Zach's purpose was not to present a completed work; instead, he wants to get the idea out there and see what people think of it.
Linus likes the idea:
I heartily approve, although I only gave the actual patches a very cursory glance. I think the approach is the proper one, but the devil is in the details. It might be that the stack allocation overhead or some other subtle fundamental problem ends up making this impractical in the end, but I would _really_ like for this to basically go in.
There are a lot of details - Linus noted that there is no limit on how many fibrils a process can create, for example - but this seems to be the way that he would like to see AIO implemented. He suggests that fibrils might be useful in the kevent code as well.
On the other hand, Ingo Molnar is opposed to the fibril approach; his argument is long but worth reading. In Ingo's view, there are only two solutions to any operating system problem which are of interest: (1) the one which is easiest to program with, and (2) the one that performs the best. In the I/O space, he claims, the easiest approach is synchronous I/O calls and user-space processes. The fastest approach will be "a pure, minimal state machine" optimized for the specific task; his Tux web server is given as an example.
According to Ingo, the fibril approach serves neither goal.
Ingo makes the claim that Linux is sufficiently fast at switching between ordinary processes that the advantages offered by fibrils are minimal at best, and not worth their cost. Anybody wanting performance will still have to face the full kernel AIO state machine. So, he says, there are no real advantages to fibrils at this time that are worth the cost of complicating the scheduler and moving away from the 1:1 thread model.
These patches are in an early stage, and this story will clearly take some time to play out. Even if a consensus develops in favor of the fibril idea, the process of turning them into a proper, robust kernel feature could make them too expensive to be worthwhile. But it's an interesting idea which brings a much-needed fresh look at how the kernel does AIO; it's hard to complain too much about that.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Memory management
Networking
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet