
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.15-rc6, released by Linus on December 18. This one is intended to be the final -rc before 2.6.15 comes out - hopefully by the end of the year. Quite a few fixes have been merged, but no new features are being added at this late stage. "But do give it a try, because Santa Claus has his CIA spooks checking y'all out, and naughty people don't get any of the loot." See the long-format changelog for the details.

About 40 post-rc6 patches are currently sitting in the mainline git repository; they are all small fixes.

The current -mm tree is 2.6.15-rc5-mm3. Recent changes to -mm include a Sony laptop ACPI driver, support for an atomic_long_t type, the removal of the swap prefetching patches ("I wasn't able to notice much benefit from it in my testing, and the number of mm/ patches in getting crazy, so we don't have capacity for speculative things at present."), the unshare() system call (see below), a set of MD updates, and the dropping of support for gcc 3.1 and prior.

The current stable 2.6 kernel is, released on December 14. It contains a relatively large number of patches with a couple of security fixes and various other important repairs.

In an exception to normal policy, the stable team also released on December 15. It contains three patches, one of which is a security fix.


Kernel development news

Some new system calls

The addition of system calls to the kernel is a relatively rare event. Each new system call changes the interface presented to user space and creates an ABI which must be maintained forever. So new system calls are added only when there is a real need. That said, there is a fair variety of system call patches in circulation at the moment.

mknodat() and friends

Ulrich Drepper, the maintainer of glibc, isn't just trying to add a system call; his proposal creates eleven of them. They are all variants on current file operations:

    int mknodat(int dfd, const char *pathname, mode_t mode, dev_t dev);
    int mkdirat(int dfd, const char *pathname, mode_t mode);
    int unlinkat(int dfd, const char *pathname);
    int symlinkat(const char *oldname, int newdfd, const char *newname);
    int linkat(int olddfd, const char *oldname, 
               int newdfd, const char *newname);
    int renameat(int olddfd, const char *oldname,
                 int newdfd, const char *newname);
    int utimesat(int dfd, const char *filename, struct timeval *tvp);
    int chownat(int dfd, const char *path, uid_t owner, gid_t group);
    int openat(int dfd, const char *filename, int flags, int mode);
    int newfstatat(int dfd, char *filename, struct stat *buf, int flag);
    int readlinkat(int dfd, const char *pathname, char *buf, int size);

The pattern should be clear by now: each new system call extends an existing one by adding one or more "dfd" (directory file descriptor) arguments. In each case, the new argument indicates a directory which is used instead of the current working directory when relative path names are provided. These calls can help applications work their way through directory trees in a race-free manner, and are also useful for implementing a virtual per-thread working directory.
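As a rough illustration of the pattern, here is a minimal user-space sketch built on openat() as it is exposed by current C libraries; the prototype there may differ in detail from the one in the proposal above. The helper names write_file_at() and read_file_at() are invented for this example. The point is that the file is located relative to the directory descriptor, not the current working directory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) `name` relative to the
 * directory descriptor dfd and write `text` into it.
 * Returns 0 on success, -1 on error. */
static int write_file_at(int dfd, const char *name, const char *text)
{
    size_t len = strlen(text);
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Hypothetical helper: read up to size-1 bytes of `name`, again looked
 * up relative to dfd, and NUL-terminate the result.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_file_at(int dfd, const char *name, char *buf, size_t size)
{
    assert(size > 0);
    int fd = openat(dfd, name, O_RDONLY);

    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, size - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

A caller would open the directory once (with O_RDONLY | O_DIRECTORY) and pass the resulting descriptor to every subsequent call; a later chdir() or rename of a parent directory does not change which directory the relative names resolve against.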

There was a minor comment on the implementation - Ulrich had wanted to avoid changing an exported function, but such changes are always fair game. Beyond that, there seems to be little resistance to adding these system calls. Expect them in a future kernel.

pselect() and ppoll()

David Woodhouse, meanwhile, has been circulating a patch implementing the pselect() and ppoll() system calls. These calls each take a signal mask; that mask will be applied while the calling process waits for events, with the previous mask being restored on return. There is an emulated version of these calls in glibc now, but a truly robust implementation requires kernel support. As with most things involving signals, the new code gets somewhat complex in places. The end result, however, should be a pair of straightforward system calls which allow a process to apply a different signal mask while waiting for I/O.
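The intended usage can be sketched with the POSIX pselect() prototype as found in current C libraries. The helper wait_readable() is a made-up name, and the choice of SIGUSR1 is only an assumption for the example: the signal stays blocked during normal execution and is let through only while the process sits in pselect().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_ms expires.
 * SIGUSR1 is blocked in the caller (and stays blocked on return); it is
 * unblocked only for the duration of the wait.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error (errno is
 * EINTR if a signal handler ran during the wait). */
static int wait_readable(int fd, long timeout_ms)
{
    sigset_t usr1, wait_mask;
    fd_set rfds;
    struct timespec ts = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);

    /* Block SIGUSR1; the previous mask is saved into wait_mask. */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &usr1, &wait_mask) < 0)
        return -1;
    /* The mask applied while waiting lets SIGUSR1 through. */
    sigdelset(&wait_mask, SIGUSR1);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return pselect(fd + 1, &rfds, NULL, NULL, &ts, &wait_mask);
}
```

Because the mask swap and the wait happen in one call, there is no window in which the signal could arrive after the mask was changed but before the process started waiting; that window is what makes a pure user-space emulation fragile. A real program would also install a SIGUSR1 handler before relying on this.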


unshare()

The unshare() patch by Janak Desai was first covered here last May. It allows a process to disconnect from resources which are shared with others. The target application is per-user namespaces; implementing these requires the ability to detach from the global namespace normally shared by all processes on the system. The current version of this patch implements namespace unsharing, but it also allows a process to privatize its view of virtual memory and open files.

This patch has been through a fair amount of review, and has seen a number of improvements from that process. Andrew Morton's reaction to a request to include the patch in -mm suggests that there is some work yet to be done, though. Andrew wants to see a better justification for the patch; he is also concerned about the security implications of adding a relatively obscure bit of code. The end result is that Janak still has some homework to do before this patch will make it into the kernel.

preadv() and pwritev()

The kernel currently supports the pread() and pwrite() system calls; these behave like read() and write(), with the exception that they take an explicit offset in the file. They will perform the operation at the given offset regardless of whether the "current" offset in the file has been changed by another thread, and they do not change the current offset as seen by any thread. Also supported are readv() and writev(), which perform scatter/gather I/O from the current file offset. The kernel does not have, however, any system call which combines these two modes of operation.
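The difference between the two families can be seen in a small sketch. The helper names below (open_scratch(), pread_keeps_offset(), scatter_read()) are invented for this example: the first positional read leaves the descriptor's offset alone, while the vectored read starts at the current offset and advances it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: create `path` containing `text` and return a
 * descriptor positioned at offset 0, or -1 on error. */
static int open_scratch(const char *path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, text, len) != (ssize_t)len || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read len bytes at pos with pread() and check the descriptor's offset.
 * Returns 1 if the offset was left untouched, 0 if it moved, -1 on error. */
static int pread_keeps_offset(int fd, char *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
    off_t before = lseek(fd, 0, SEEK_CUR);

    if (before < 0 || pread(fd, buf, len, pos) != (ssize_t)len)
        return -1;
    return lseek(fd, 0, SEEK_CUR) == before;
}

/* Scatter one read across two buffers with readv(); the data comes from
 * the current offset, which advances by the number of bytes returned. */
static ssize_t scatter_read(int fd, char *a, size_t alen, char *b, size_t blen)
{
    struct iovec vec[2] = {
        { .iov_base = a, .iov_len = alen },
        { .iov_base = b, .iov_len = blen },
    };

    return readv(fd, vec, 2);
}
```

Neither call alone can fill several buffers from an explicit offset without touching the shared file position, which is the gap the proposed calls aim to close.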

It turns out that there are developers who wish they had system calls along the lines of:

    int preadv(unsigned int fd, struct iovec *vec, unsigned long vlen,
               loff_t pos);
    int pwritev(unsigned int fd, struct iovec *vec, unsigned long vlen,
                loff_t pos);

To satisfy this need, Badari Pulavarty has created a simple implementation which is currently part of the -mm tree. It seems that Ulrich Drepper suggested an alternative to adding two new system calls, however: change the iovec structure instead. Badari ran with that idea, posting a new patch creating a new iovec type:

    struct niovec {
        void __user *iov_base;
        __kernel_size_t iov_len;
        __kernel_loff_t iov_off; /* NEW */
    };

The new iov_off field is more flexible than plain preadv() in that it enables each segment in the I/O operation to have its own offset. The only downside is that the prototypes for the readv() and writev() methods in the file_operations structure must be changed. So every driver and filesystem which implements readv() and writev() breaks and must be changed. There are fewer of those than one might expect, but it is still a significant change.
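To make the per-segment semantics concrete, here is a user-space approximation which loops over pread() and pwrite(). It is not the proposed kernel interface: struct seg is a hypothetical mirror of struct niovec, all function names are invented for this example, and a kernel implementation could handle the whole vector in one operation where this loop cannot.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical user-space mirror of the proposed struct niovec. */
struct seg {
    void  *base;  /* buffer for this segment */
    size_t len;   /* length of the buffer */
    off_t  off;   /* file offset for this segment */
};

/* Read each segment from its own offset; stops at the first short read.
 * Returns the total number of bytes read, or -1 on error. */
static ssize_t read_segments(int fd, const struct seg *segs, int nsegs)
{
    ssize_t total = 0;

    assert(segs != NULL && nsegs >= 0);
    for (int i = 0; i < nsegs; i++) {
        ssize_t n = pread(fd, segs[i].base, segs[i].len, segs[i].off);
        if (n < 0)
            return -1;
        total += n;
        if ((size_t)n < segs[i].len)
            break;
    }
    return total;
}

/* Write each segment at its own offset; same return convention. */
static ssize_t write_segments(int fd, const struct seg *segs, int nsegs)
{
    ssize_t total = 0;

    assert(segs != NULL && nsegs >= 0);
    for (int i = 0; i < nsegs; i++) {
        ssize_t n = pwrite(fd, segs[i].base, segs[i].len, segs[i].off);
        if (n < 0)
            return -1;
        total += n;
        if ((size_t)n < segs[i].len)
            break;
    }
    return total;
}

/* Round trip: write two segments at offsets 0 and 100, then read them
 * back in the opposite order.  Returns 0 if the data matches. */
static int segments_demo(const char *path)
{
    char wa[4] = { 'A', 'A', 'A', 'A' }, wb[4] = { 'B', 'B', 'B', 'B' };
    char ra[4], rb[4];
    struct seg out[2] = { { wa, 4, 0 }, { wb, 4, 100 } };
    struct seg in[2]  = { { rb, 4, 100 }, { ra, 4, 0 } };
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    int ok = write_segments(fd, out, 2) == 8 &&
             read_segments(fd, in, 2) == 8 &&
             memcmp(ra, wa, 4) == 0 && memcmp(rb, wb, 4) == 0;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

The loop costs one system call per segment, which is the overhead a single vectored call with per-segment offsets would avoid.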

It was suggested that the asynchronous I/O operations could be used instead. The AIO interface already allows for the creation of vectored operations with per-segment offsets. The downside is that AIO is more complicated to use from user space and heavier in the kernel; in addition, AIO support in the kernel was never completed to the point where it would support these operations anyway. Still, that is an option which may need more consideration before changing one of the fundamental interfaces used by filesystems and drivers.


splice()

Finally, there has been talk over many years of creating a splice() system call. The core idea is that a process could open a file descriptor for a data source, and another for a data sink. Then, with a call to splice(), those two streams could be connected to each other, and the data could flow from the source to the sink entirely within the kernel, with no need for user-space involvement and with minimal (or no) copying.

Some of the infrastructure was put in place one year ago when Linus created a circular pipe buffer mechanism. Now Jens Axboe has put together a simple splice() implementation which uses that mechanism. The patch is not ready for prime time yet (Jens: "I'm just posting this in the spirit of posting early"), but it is a beginning. In particular, it allows a file to be spliced to a pipe, as either the source or the sink. With a pair of splices, it is possible to set up an in-kernel file copy operation with no internal memory copying.
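The "pair of splices" idea can be sketched as follows. Note that this uses the splice() prototype found in current Linux C libraries, which is an assumption relative to the early patch described here; the helper name splice_copy() is invented for this example, and whether any copying is truly avoided depends on the kernel and the filesystems involved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to len bytes from in_fd to out_fd by
 * splicing file -> pipe, then pipe -> file.  The data never passes
 * through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error. */
static ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) < 0)
        return -1;

    while (len > 0 && total >= 0) {
        /* First splice: the file is the source, the pipe is the sink. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n <= 0) {
            if (n < 0)
                total = -1;
            break;              /* n == 0 means end of file */
        }
        len -= (size_t)n;

        /* Second splice: drain what was just queued into the output. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                break;
            }
            n -= m;
            total += m;
        }
    }

    close(p[0]);
    close(p[1]);
    return total;
}
```

The pipe acts as the in-kernel buffer connecting the two descriptors, so each pass moves at most one pipe's worth of data.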

Work left for the future includes cleaning up the ("ugly," "nasty") internal interfaces, and generalizing the code so that any two file descriptors can be spliced together. The ability to splice to network sockets would be particularly useful. Some of this may take a while, so don't expect splice() to show up in the mainline in the immediate future.


Semaphores and mutexes

Last week's Kernel Page covered the mutex patch by David Howells. The discussion did not stop at that point, however, so here's this week's episode.

There was some fairly strong pushback against the mutex patch after last week's article was written. Linus expressed his thoughts this way:

A patch that
  • creates a non-counting mutex
  • .. that is SLOWER than the current counting one
  • .. and keeps the old "semaphore" and "up/down" naming

is simply INCREDIBLY BROKEN. It has absolutely _zero_ redeeming features. I can't understand how there are a hundred emails in my mailbox even discussing it.

Here is Andrew Morton's take:

I must say that my interest in this stuff is down in needs-an-electron-microscope-to-locate territory. down() and up() work just fine and they're small, efficient, well-debugged and well-understood. We need a damn good reason for taking on tree-wide churn or incompatible renames or addition of risk. What's the damn good reason here?

Please. Go fix some bugs. We're not short of them.

The objections should be coming into focus at this point. One problem had to do with performance; the mutex patch was supposed to be faster, but that was not the case in the posted version (which lacked architecture-specific implementations). There was a long discussion on why the semaphore code could not be improved on in this regard. It seems that, on the most popular architectures at least, the locked decrement-and-test code used by semaphores is hard to beat.
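For readers who have not looked at the semaphore fast path, the decrement-and-test idea can be approximated in portable C11 atomics. This is a toy, not the kernel's code: struct toy_sem and both functions are invented names, and where the real down() would sleep in a slow path, this sketch simply backs out and reports failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy counting semaphore: count > 0 means that many holders may still
 * enter; a result below zero after the decrement means contention. */
struct toy_sem {
    atomic_int count;
};

/* Fast path in the style of down(): one atomic decrement, then a test
 * of the sign of the result.  The real code would sleep on contention;
 * this toy undoes the decrement and returns false instead, so it
 * behaves like a trylock (and may fail spuriously under a race). */
static bool toy_down_fast(struct toy_sem *s)
{
    assert(s != NULL);
    int newval = atomic_fetch_sub(&s->count, 1) - 1;

    if (newval >= 0)
        return true;                    /* acquired, no contention */
    atomic_fetch_add(&s->count, 1);     /* back out of the slow path */
    return false;
}

/* Release: a single atomic increment. */
static void toy_up(struct toy_sem *s)
{
    assert(s != NULL);
    atomic_fetch_add(&s->count, 1);
}
```

The uncontended case costs one locked read-modify-write instruction and one conditional branch, which is why a new primitive has little room to be faster on that path.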

David's patch also introduced a sort of global flag day, changing the locking primitives used by vast amounts of code all at once. But it kept the old semaphore function names and applied them to the new mutex type, creating a confusing sort of interface. There was resistance to this choice of naming, but also a great deal of resistance to the idea of making major changes throughout the kernel without a very strong idea of what was being gained for it. All told, the mutex patch set looked like it had a rough road ahead of it.

Enter Ingo Molnar, who has posted a mutex patch of his own. Ingo's mutexes are derived from the code used in the realtime preemption patch, of course, but they have been heavily modified to avoid the objections which greeted David's patch. In this version, a mutex is a separate data type, with its own API:


     void mutex_lock(struct mutex *lock);
     int mutex_lock_interruptible(struct mutex *lock);
     int mutex_trylock(struct mutex *lock);
     void mutex_unlock(struct mutex *lock);
     int mutex_is_locked(struct mutex *lock);

The existing semaphore interface is not changed in any way - at least, not in any way visible to the rest of the kernel. There is an interesting feature, however: the semaphore functions (down(), up(), and friends) have been augmented to be able to handle mutex arguments as well as semaphores. This feature is a migration tool: a subsystem which is being considered for migration over to the mutex type can have its semaphores changed to mutexes, but no other code changes are required. The various checks built into the mutex type will quickly set off alarms if a mutex is being used as a counting semaphore. In that case, the locks can be changed back to semaphores and the whole episode forgotten. If, instead, all seems well, the semaphore calls can be turned into mutex calls. Eventually, when the migration work is complete, this helper code can be removed from the kernel.

The real point of all the above is that, unlike David's patch, this version of mutexes imposes no flag day on the kernel. It is a new primitive, with its own API, and bits of the kernel can be converted over one by one.
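The kind of misuse check described above can be illustrated with a user-space toy. Nothing here is taken from Ingo's patch: the types and functions are invented, and the mutex is reduced to a single flag. The contrast is that a counting semaphore accepts an extra up() without complaint, while a mutex can notice that it is being released when nobody holds it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy counting semaphore: any number of up() calls is legal. */
struct toy_sem {
    atomic_int count;
};

/* Toy mutex: only two states, locked or unlocked. */
struct toy_mutex {
    atomic_bool locked;
};

static void toy_sem_up(struct toy_sem *s)
{
    assert(s != NULL);
    atomic_fetch_add(&s->count, 1);     /* the count just keeps growing */
}

/* Returns true if the lock was taken, false if it was already held. */
static bool toy_mutex_trylock(struct toy_mutex *m)
{
    bool expected = false;

    assert(m != NULL);
    return atomic_compare_exchange_strong(&m->locked, &expected, true);
}

/* Returns 0 on success, or -1 if the mutex was not locked: the pattern
 * a counting semaphore would have accepted silently. */
static int toy_mutex_unlock(struct toy_mutex *m)
{
    assert(m != NULL);
    return atomic_exchange(&m->locked, false) ? 0 : -1;
}
```

Code which calls the release operation more often than the acquire operation is relying on counting behavior; with the mutex type that shows up as an error at the offending call instead of a silently inflated count.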

Ingo claims that his mutex code is significantly faster than semaphores used as mutexes. The code itself is a bit smaller and tighter, which helps. But he also gets some impressive performance improvements on some tests: a filesystem-based test more than doubled its speed on an eight-processor system. That is the sort of improvement which can help to motivate the quick merging of a patch.

In this case, developers started to wonder just why the semaphore code was so much slower. Some research turned up the fact that, on the x86 architecture, each cycle through a semaphore had the potential to wake up two separate waiting processes, each of which would then contend for the lock. Nobody knows why the code is this way - Linus is mystified by it. It quickly became clear, though, that taking out the redundant wakeup breaks the semaphores and causes lockups. For now, it is a bit of black magic which must remain for the whole thing to work.

Ingo quickly seized on this revelation to drive home one of his other points:

If this really is a bug that hid for years, it shows that the semaphore code is too complex to be properly reviewed and improved. Hence even assuming that the mutex code does not bring direct code advantages (which i'm disputing :-), the mutex code is far simpler and thus easier to improve.

Linus seems to have heard this argument:

And don't get me wrong: if it's easier to just ignore the performance bug, and introduce a new "struct mutex" that just doesn't have it, I'm all for it.

He doesn't like the under-the-hood semaphore changes, though, and would like that part of the patch taken out.

Ingo's initial posting contains no fewer than ten reasons why he thinks the mutex patch should go in; rather than try to rephrase all of those arguments, your editor suggests going straight to the source. It is worth noting that, among other things, merging this mutex patch would move another piece of the realtime preemption patch into the mainline - even though many of the realtime-specific features (priority inheritance, for example) are missing.


Patches and updates

Development tools

  • Junio C Hamano: GIT 1.0.0. (December 21, 2005)

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds