Kernel development
Brief items
Kernel release status
The current 2.6 development kernel remains 2.6.28-rc8; no 2.6.28 prepatches have been released over the last week. The trickle of changes into the mainline git repository continues, with 46 changes (as of this writing) merged since -rc8.

The question of when the final 2.6.28 release will happen remains open. Linus seems to be leaning toward a pre-holiday release, mostly because he wants to get the merge window out of the way before the beginning of linux.conf.au in January. The regression list is quite short at this point, so it seems that a release at just about any time would be justified.
The current 2.6 stable kernel is 2.6.27.9, released with a long list of fixes on December 13. Meanwhile, the 2.6.27.10 stable release, containing another 22 patches, is in the review process as of this writing; it will likely be released on December 18.
Kernel development news
Quotes of the week
Once things have stabilised and it's usable and performs respectably, start thinking about features again.
Do NOT fall into the trap of adding more and more and more stuff to an out-of-tree project. It just makes it harder and harder to get it merged. There are many examples of this.
System calls and 64-bit architectures
Adding a system call to the kernel is never done lightly. It is important to get it right before it gets merged because, once that happens, it must be maintained as part of the kernel's binary interface forever. The proposal to add preadv() and pwritev() system calls provides an excellent example of the kinds of concerns that need to be addressed when adding to the kernel ABI.
The two system calls themselves are quite straightforward. Essentially, they combine the existing pread() and readv() calls (along with the write variants of course) into a way to do scatter/gather I/O at a particular offset in the file. Like pread(), the current file position is unaffected. The calls, which are available on various BSD systems, can be used to avoid races between an lseek() call and a read or write. Currently, applications must do some kind of locking to prevent multiple threads from stepping on each other when doing this kind of I/O.
The prototypes for the functions look much like those of readv() and writev(), simply adding the offset as the final parameter:
ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
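Assuming a libc that exports these calls with the glibc/BSD prototypes shown above (glibc eventually did), a minimal user-space sketch of the race-free pattern might look like the following; the file name, offsets, and helper name are all invented for illustration:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sketch: positioned scatter/gather I/O with pwritev()/preadv().
 * The file position is never consulted or modified, so there is no
 * lseek()+readv() race between threads sharing the descriptor.
 * Returns 0 on success, -1 on any failure. */
int preadv_demo(void)
{
    char path[] = "/tmp/preadv-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);    /* scratch file; remove the name immediately */

    /* Gather two buffers and write them at offset 100. */
    struct iovec out[2] = {
        { .iov_base = "hello ", .iov_len = 6 },
        { .iov_base = "world",  .iov_len = 5 },
    };
    if (pwritev(fd, out, 2, 100) != 11)
        return -1;

    /* Scatter the data back into two buffers from the same offset. */
    char a[7] = "", b[6] = "";
    struct iovec in[2] = {
        { .iov_base = a, .iov_len = 6 },
        { .iov_base = b, .iov_len = 5 },
    };
    if (preadv(fd, in, 2, 100) != 11)
        return -1;

    int ok = memcmp(a, "hello ", 6) == 0
          && memcmp(b, "world", 5) == 0
          && lseek(fd, 0, SEEK_CUR) == 0;   /* position still untouched */
    close(fd);
    return ok ? 0 : -1;
}
```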
But, because off_t is a 64-bit quantity, this causes problems on some architectures due to the way system call arguments are passed in registers. After Gerd Hoffmann posted version 2 of the patchset, Matthew Wilcox was quick to point out a problem: several architectures (ARM, PowerPC, s390, ...) require 64-bit arguments to be passed in particular register pairs. Because the offset is the fourth argument, it gets placed in the r3 and r4 32-bit registers, but some architectures need it in either r2/r3 or r4/r5. This led some to advocate reordering the parameters, putting the offset before iovcnt to avoid the problem. As long as that change doesn't bubble out to user space, Hoffmann is amenable to making it: "I'd *really* hate it to have the same system call with different argument ordering on different systems though."
Most seemed to agree that the user-space interface as presented by glibc should match what the BSDs provide. It causes too many headaches for folks trying to write standards or portable code otherwise. To fix the alignment problem, the system call itself has the reordered version of the arguments. That led to Hoffmann's third version of the patchset, which still didn't solve the whole problem.
There are multiple architectures that have both 32- and 64-bit versions, and the 64-bit kernel must support system calls from 32-bit user-space programs. Those programs will put a 64-bit argument into two registers, but the 64-bit kernel will expect that argument in a single register. Because of this, Arnd Bergmann recommended splitting the offset into two arguments, one for the high 32 bits and one for the low: "This is the only way I can see that lets us use a shared compat_sys_preadv/pwritev across all 64 bit architectures."
When a 32-bit user-space program makes a system call on a 64-bit system, the compat_sys_* version is used to handle differences in the data sizes. If a pointer to a structure is passed to a system call, and that structure has a different representation in 32-bits than it does in 64-bits, the compat layer makes the translation. Because different 64-bit architectures do things differently in terms of calling conventions and alignment requirements, the only way to share compat code is to remove the 64-bit quantity from the system call interface entirely.
That just leaves one final problem to overcome: endianness. As Ralf Baechle notes, MIPS can be either little- or big-endian, so compat_sys_preadv/pwritev() need to put the two 32-bit offset values together in the proper order. He recommended moving the MIPS-specific merge_64() macro into a common compat.h include file, which could then be used by the common compat routines. So far, version 4 of the patchset has not emerged, but one suspects that the offset-argument split and the use of merge_64() will be part of it.
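As a sketch of the technique under discussion: a compat handler receives the offset as two 32-bit halves and reassembles them, with endianness determining which half carries the high bits. The macro below is modeled on the MIPS merge_64() mentioned above, not copied from the kernel, and compat_offset() is a hypothetical entry point:

```c
#include <stdint.h>

/* On a big-endian MIPS system the first of the two registers carries
 * the high half of a 64-bit value; on little-endian systems it is the
 * second.  This mirrors what the kernel's merge_64() macro does. */
#ifdef __MIPSEB__
#define merge_64(r1, r2) (((uint64_t)(r1) << 32) | (uint32_t)(r2))
#else
#define merge_64(r1, r2) (((uint64_t)(r2) << 32) | (uint32_t)(r1))
#endif

/* Hypothetical compat entry point: user space passed the offset split
 * into an explicit low half and high half, so no register-pairing or
 * endianness knowledge is needed at all - just shift and OR. */
uint64_t compat_offset(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

The appeal of the explicit lo/hi split, as Bergmann argued, is that the same merging code works on every 64-bit architecture.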
The implementation of preadv() and pwritev() themselves is straightforward, certainly in comparison to the intricacies of passing their arguments. The VFS implementations of readv()/writev() already take an offset argument, so it was simply a matter of calling those. It is interesting to note that, as part of the review, Christoph Hellwig spotted a bug in the existing compat_sys_readv/writev() implementations which would lead to accounting information not being updated for those calls.
This is not the first time these system calls have been proposed; way back in 2005, we looked at some patches from Badari Pulavarty that added them. Other than a brief appearance in the -mm tree, they seem to have faded away. Even if this edition of preadv() and pwritev() does not make it into the mainline (so far there are no indications that it won't), the code review surrounding it was certainly useful. Getting a glimpse of the complexities around 64-bit quantities being passed to system calls was quite informative as well.
Followups: performance counters, ksplice, and fsnotify
There's been progress in a few areas which LWN has covered in the past. Here's a quick followup on where things stand now.
Performance monitors
In last week's episode, a new, out-of-the-blue performance monitoring patch had stirred up discussion and a certain amount of opposition. The simplicity of the new approach by Ingo Molnar and Thomas Gleixner had some appeal, but it is far from clear that this approach is sufficiently powerful to meet the needs of the wider performance monitoring community.
Since then, version 3 and version 4 of the patch have been posted. A look at the changelogs shows that work on this code is progressing quickly. A number of changes have been made, including:
- The addition of virtual performance counters for tracking clock time,
page faults, context switches, and CPU migrations.
- A new "performance counter group" functionality. This feature is
  meant to address criticism that the original interface would not allow
  multiple counters to be read simultaneously, making it hard to
  correlate different counter values. Counters can now be collected
  into groups which allow them to be manipulated as a unit.
  There's also a new mechanism allowing all counters to be turned on or
  off with a single system call.
- The system call interface has been reworked; see the version 3
  announcement for a description of the new API.
- The kerneltop utility has been enhanced to work with performance
counter groups.
- "Performance counter inheritance" is now supported; essentially, this
allows a performance monitoring utility to follow a process through a
fork() and monitor the child process(es) as well.
- The new "timec" utility runs a process under performance monitoring, outputting a whole set of statistics on how the process ran.
There are still concerns about this new approach to performance monitoring, naturally. Developers worry that users may not be able to get the information they need, and it still seems like it may be necessary to put a huge amount of hardware-specific programming information into the kernel. But, to your editor's eye, this patch set also seems to be gaining a bit of the sense of inevitability which usually attaches itself to patches from Ingo and company. It will probably be some time, though, before a decision is made here.
Ksplice
In November, we looked at a new version of the Ksplice code, which allows patches to be put into a running kernel. The Ksplice developers would like to see their work go into the mainline, so they recently poked Andrew Morton to see what the status was. His response was:
I'd have _thought_ that distros and their high-end customers would be interested in it, but I haven't noticed anything from them. Not that this means much - our processes for gathering this sort of information are rudimentary at best.
The response on the list, such as it was, indicated that the distributors are, in fact, not greatly interested in this feature. Dave Jones commented:
If distros can't get security updates out in a reasonable time, fix the process instead of adding mechanism that does an end-run around it. Which just leaves the "we can't afford downtime" argument, which leads me to question how well reviewed runtime patches are. Having seen some of the non-ksplice runtime patches that appear in the wake of a new security hole, I can't say I have a lot of faith.
The Ksplice developers agree that the writing of custom code to fit patches into a running kernel is a scary proposition; that is why, they say, they've gone out of their way to make such code unnecessary most of the time.
This discussion leaves Ksplice in a bit of a difficult position; in the absence of clear demand, the kernel developers are unlikely to be willing to merge a patch of this nature. If this is a feature that users really want, they should probably be communicating that fact to their distributors, who can then consider supporting it and working to get it into the mainline.
fsnotify
The file scanning mechanism known as TALPA got off to a rough start with the kernel development community. Many developers have a dim view of the malware scanning industry in general, and they did not like the implementation that was posted. It is clear, though, that the desire for this kind of functionality is not going away. So developer Eric Paris has been working toward an implementation which will pass review.
His latest attempt can be seen in the form of the fsnotify patch set. This code does not, itself, support the malware scanning functionality, but, says Eric, "you better know it's coming." What it does, instead, is to create a new, low-level notification mechanism for filesystem events.
At a first look, that may seem like an even more problematic approach than was taken before. Linux already has two separate file event notifiers: dnotify and inotify. Kernel developers tend to express their dissatisfaction with those interfaces, but there has not been a whole lot of outcry for somebody to add a third alternative. So why would fsnotify make sense?
Eric's idea seems to be to make something that so clearly improves the kernel that people will lose the will to complain about the malware scanning functionality. So fsnotify has been written - employing a lot of input from filesystem developers - to be a better-thought-out, more supportable notification subsystem. Then the existing dnotify and inotify code is ripped out and reimplemented on top of fsnotify. The end result is that the impact on the rest of the VFS code is actually reduced; there is now only one set of notifier calls where, previously, there were two. And, despite that, the notification mechanism has become more general, being able to support functionality which was not there in the past.
And, to top it off, Eric has managed to make the in-core inode structure smaller. Given that there can be thousands of those structures in a running system, even a small reduction in their size can make a big difference. So, claims Eric, "That's right, my code is smaller and faster. Eat that."
What this code needs now is detailed review from the core VFS developers. Those developers tend to be a highly-contended resource, so it's not clear when they will be able to take a close look at fsnotify. But, sooner or later, it seems likely that this feature will find its way into the mainline.
SLQB - and then there were four
The Linux kernel does not lack for low-level memory managers. The venerable slab allocator has been the engine behind functions like kmalloc() and kmem_cache_alloc() for many years. More recently, SLOB was added as a pared-down allocator suitable for systems which do not have a whole lot of memory to manage in the first place. Even more recently, SLUB went in as a proposed replacement for slab which, while being designed with very large systems in mind, was meant to be applicable to smaller systems as well. The consensus for the last year or so has been that at least one of these allocators is surplus to requirements and should go. Typically, slab is seen as the odd allocator out, but nagging doubts about SLUB (and some performance regressions in specific situations) have kept slab in the game.

Given this situation, one would not necessarily think that the kernel needs yet another allocator. But Nick Piggin thinks that, despite the surfeit of low-level memory managers, there is always room for one more. To that end, he has developed the SLQB allocator, which he hopes to eventually see merged into the mainline.
Like the other slab-like allocators, SLQB sits on top of the page allocator and provides for allocation of fixed-sized objects. It has been designed with an eye toward scalability on high-end systems; it also makes a real effort to avoid the allocation of compound pages whenever possible. Avoidance of higher-order (compound page) allocations can improve reliability significantly when memory gets tight.
While there is a fair amount of tricky code in SLQB, the core algorithms are not that hard to understand. Like the other slab-like allocators, it implements the abstraction of a "slab cache" - a lookaside cache from which memory objects of a fixed size can be allocated. Slab caches are used directly when memory is allocated with kmem_cache_alloc(), or indirectly through functions like kmalloc(). In SLQB, a slab cache is represented by a kmem_cache data structure, described (in simplified form, with a number of details glossed over) in what follows.
The main kmem_cache structure contains the expected global parameters - the size of the objects being allocated, the order of page allocations, the name of the cache, etc. But scalability means separating processors from each other, so the bulk of the kmem_cache data structure is stored in per-CPU form. In particular, there is one kmem_cache_cpu structure for each processor on the system.
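A heavily simplified, hypothetical rendering of that layout might look like this; all of the field names, types, and the NR_CPUS constant below are illustrative, not SLQB's actual declarations:

```c
#include <stddef.h>

#define NR_CPUS 4               /* illustrative; the kernel's is configurable */

/* A list of free objects; the real code tracks more state than this. */
struct kmem_list {
    void **head;
    size_t nr;
};

/* Per-processor state: the bulk of the cache lives here, so that
 * allocation decisions can be made without cross-CPU traffic. */
struct kmem_cache_cpu {
    struct kmem_list freelist;      /* locally-freed objects (a stack) */
    struct kmem_list rlist;         /* objects owned by other CPUs */
    struct kmem_list remote_free;   /* objects returned by other CPUs */
};

/* The cache itself: global parameters plus the per-CPU structures. */
struct kmem_cache {
    size_t object_size;             /* size of the objects handed out */
    unsigned int order;             /* page-allocation order */
    const char *name;               /* cache name, for reporting */
    struct kmem_cache_cpu cpu_slab[NR_CPUS];
};
```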
Within that per-CPU structure one will find a number of lists of objects. One of those (freelist) contains a list of available objects; when a request is made to allocate an object, the free list will be consulted first. When objects are freed, they are returned to this list. Since this list is part of a per-CPU data structure, objects normally remain on the same processor, minimizing cache line bouncing. More importantly, the allocation decisions are all done per-CPU, with no bad cache behavior and no locking required beyond the disabling of interrupts. The free list is managed as a stack, so allocation requests will return the most recently freed objects; again, this approach is taken in an attempt to optimize memory cache behavior.
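The LIFO behavior of the per-CPU free list can be modeled in a few lines of user-space C. Everything here, from the names to the watermark constant, is invented for illustration; the real allocator's fallback paths and interrupt disabling are not modeled:

```c
#include <stddef.h>

#define FREELIST_MAX 64         /* stand-in for the flush watermark */

/* Toy model of one CPU's free list, managed as a stack. */
struct cache_cpu {
    void *freelist[FREELIST_MAX];
    int nr_free;
};

/* Free: push onto the local stack.  Returns -1 when the watermark is
 * hit, where the real code would flush objects back to their pages. */
static int cache_free(struct cache_cpu *c, void *obj)
{
    if (c->nr_free >= FREELIST_MAX)
        return -1;
    c->freelist[c->nr_free++] = obj;
    return 0;
}

/* Allocate: pop the most recently freed object, which is the one most
 * likely to still be hot in the CPU cache.  Returns NULL when empty;
 * the real allocator would then take objects from a partial page. */
static void *cache_alloc(struct cache_cpu *c)
{
    return c->nr_free ? c->freelist[--c->nr_free] : NULL;
}
```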
SLQB gets its memory in the form of full pages from the page allocator. When an allocation request is made and the free list is empty, SLQB will allocate a new page and return an object from that page. The remaining space on the page is organized into a per-page free list (assuming the objects are small enough to pack more than one onto a page, of course), and the page is added to the partial list. The other objects on the page will be handed out in response to allocation requests, but only when the free list is empty. When the final object on a page is allocated, SLQB will forget about the page - temporarily, at least.
Objects are, when freed, added to freelist. It is easy to foresee that this list could grow quite large after a burst of system activity. Allowing freelist to grow without bound would risk tying up a lot of system memory doing nothing when it might be needed elsewhere. So, once the size of the free list passes a watermark (or when the page allocator starts asking for help freeing memory), objects in the free list will be flushed back to their containing pages. Any pages on the partial list which end up with all of their objects freed will then be returned to the page allocator for use elsewhere.
There is an interesting situation which arises here, though: remember that SLQB is fundamentally a per-CPU allocator. But there is nothing that requires objects to be freed on the same CPU which allocated them. Indeed, for suitably long-lived objects on a system with many processors, it becomes probable that objects will be freed on a different CPU. That processor does not know anything about the partial pages those objects were allocated from, and, thus, cannot free them. So a different approach has to be taken.
That approach involves the maintenance of two more object lists, called rlist and remote_free. When the allocator tries to flush a "remote" object (one allocated on a different CPU) from its local freelist, it will simply move that object over to rlist. Occasionally, the allocator will reach across CPUs to take the objects from its local rlist and put them on the remote_free list of the CPU which initially allocated those objects. That CPU can then choose to reuse the objects or free them back to their containing pages.
The cross-CPU list operation clearly requires locking, so a spinlock protects remote_free. Working with the remote_free lists too often would thus risk cache-line bouncing and lock contention, neither of which is helpful when scalability is a goal. That is why processors accumulate a group of objects in their local rlist before adding the entire list, in a single operation, to the appropriate remote_free list. On top of that, the allocator does not often check for objects in its local remote_free list. Instead, objects are allowed to accumulate there until a watermark is exceeded, at which point whichever processor added the final objects will set the remote_free_check flag. The processor owning the remote_free list will only check that list when this flag is set, with the result that the management of the remote_free list can be done with little in the way of lock or cache-line contention.
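That batching-and-flag dance can be sketched in user-space terms. The names, the batch size, and the watermark below are all invented for illustration, and the real code's spinlock is not modeled:

```c
#include <stdbool.h>
#include <stddef.h>

#define RLIST_BATCH 4           /* objects accumulated before a batch move */
#define REMOTE_WATERMARK 8      /* remote_free size that raises the flag */

/* Toy model of one CPU's cross-CPU freeing state. */
struct cpu_lists {
    void *rlist[RLIST_BATCH];   /* remote objects awaiting a batch move */
    int nr_rlist;
    void *remote_free[32];      /* objects handed back by other CPUs */
    int nr_remote;
    bool remote_free_check;     /* set when remote_free needs draining */
};

/* This CPU ("me") frees an object that "owner" originally allocated.
 * Only when a full batch has accumulated is the (conceptually locked)
 * remote_free list touched, in one operation. */
static void free_remote(struct cpu_lists *me, struct cpu_lists *owner,
                        void *obj)
{
    me->rlist[me->nr_rlist++] = obj;
    if (me->nr_rlist < RLIST_BATCH)
        return;                 /* batch not full: purely local, no lock */
    for (int i = 0; i < me->nr_rlist; i++)
        owner->remote_free[owner->nr_remote++] = me->rlist[i];
    me->nr_rlist = 0;
    if (owner->nr_remote > REMOTE_WATERMARK)
        owner->remote_free_check = true;    /* tell the owner to drain */
}

/* The owning CPU only looks at remote_free when the flag is set;
 * returns the number of objects reclaimed for reuse. */
static int drain_remote(struct cpu_lists *me)
{
    if (!me->remote_free_check)
        return 0;
    int n = me->nr_remote;
    me->nr_remote = 0;
    me->remote_free_check = false;
    return n;
}
```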
The SLQB code is relatively new, and is likely to need a considerable amount of work before it may find its way into the mainline. Nick claims benchmark results which are roughly comparable with those obtained using the other allocators. But "roughly comparable" will not, by itself, be enough to motivate the addition of yet another memory allocator. So pushing SLQB beyond comparable and toward "clearly better" is likely to be Nick's next task.
Page editor: Jonathan Corbet