Brief items
The current 2.6 prepatch is 2.6.15-rc6,
released by Linus on
December 18. This one is intended to be the final -rc before 2.6.15
comes out - hopefully by the end of the year. Quite a few fixes have been
merged, but no new features are being added at this late stage. "But do
give it a try, because Santa Claus has his CIA spooks checking y'all out,
and naughty people don't get any of the loot." See
the long-format changelog for the details.
About 40 post-rc6 patches are currently sitting in the mainline git
repository; they are all small fixes.
The current -mm tree is 2.6.15-rc5-mm3. Recent changes
to -mm include a Sony laptop ACPI driver, support for an
atomic_long_t type, the removal of the swap prefetching patches
("I wasn't able to notice much benefit from it in my testing, and the
number of mm/ patches in getting crazy, so we don't have capacity for
speculative things at present."), the unshare() system call
(see below), a set of MD updates, and the dropping of support for
gcc 3.1 and prior.
The current stable 2.6 kernel is 2.6.14.4, released on
December 14. It contains a relatively large number of patches with a
couple of security fixes and various other important repairs.
In an exception to normal policy, the stable team also released 2.6.13.5 on December 15.
It contains three patches, one of which is a security fix.
Kernel development news
The addition of system calls to the kernel is a relatively rare event.
Each new system call changes the interface presented to user space and
creates an ABI which must be maintained forever. So new system calls are
added only when there is a real need. That said, there is a fair variety
of system call patches in circulation at the moment.
mknodat() and friends
Ulrich Drepper, the maintainer of glibc, isn't just trying to add a system
call; his proposal creates
eleven of them. They are all variants on current file operations:
int mknodat(int dfd, const char *pathname, mode_t mode, dev_t dev);
int mkdirat(int dfd, const char *pathname, mode_t mode);
int unlinkat(int dfd, const char *pathname);
int symlinkat(const char *oldname, int newdfd, const char *newname);
int linkat(int olddfd, const char *oldname,
int newdfd, const char *newname);
int renameat(int olddfd, const char *oldname,
int newdfd, const char *newname);
int utimesat(int dfd, const char *filename, struct timeval *tvp);
int chownat(int dfd, const char *path, uid_t owner, gid_t group);
int openat(int dfd, const char *filename, int flags, int mode);
int newfstatat(int dfd, char *filename, struct stat *buf, int flag);
int readlinkat(int dfd, const char *pathname, char *buf, int size);
The pattern should be clear by now: each new system call extends an
existing one by adding one or more "dfd" (directory file descriptor)
arguments. In each case, the new argument indicates a directory which is
used instead of the current working directory when relative path names are
provided. These calls can help applications work their way through
directory trees in a race-free manner, and are also useful for implementing
a virtual per-thread working directory.
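As a rough illustration of directory-relative operation, the sketch below uses the forms of these calls that C libraries later exposed. Those differ in places from the prototypes proposed above: unlinkat() takes an extra flags argument here, and newfstatat() appears as fstatat(). The function name at_calls_demo() and the scratch path under /tmp are invented for the example.

```c
#define _GNU_SOURCE            /* expose the *at() declarations */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create, inspect, and remove a file relative to a directory descriptor
 * instead of the current working directory.
 * Returns 0 on success, -1 on any failure.
 */
int at_calls_demo(void)
{
    char tmpl[] = "/tmp/atdemo-XXXXXX";   /* hypothetical scratch location */
    struct stat st;
    int dfd, fd, ret = -1;

    if (mkdtemp(tmpl) == NULL)
        return -1;

    dfd = open(tmpl, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        goto out_rmdir;

    /* "sub" and "sub/note" are resolved relative to dfd, not the cwd. */
    if (mkdirat(dfd, "sub", 0700) < 0)
        goto out_close;

    fd = openat(dfd, "sub/note", O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        goto out_rmsub;
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        goto out_unlink;
    }
    close(fd);

    if (fstatat(dfd, "sub/note", &st, 0) == 0 && st.st_size == 5)
        ret = 0;

out_unlink:
    unlinkat(dfd, "sub/note", 0);
out_rmsub:
    unlinkat(dfd, "sub", AT_REMOVEDIR);
out_close:
    close(dfd);
out_rmdir:
    rmdir(tmpl);
    return ret;
}
```

Because every path is interpreted relative to dfd, the sequence keeps working on the same directory even if another thread calls chdir(), or the directory itself is renamed, in the meantime.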
There was a minor comment on the implementation - Ulrich had wanted to avoid
changing an exported function, but such changes are always fair game.
Beyond that, there seems to be little resistance to adding these system
calls. Expect them in a future kernel.
pselect() and ppoll()
David Woodhouse, meanwhile, has been circulating a patch implementing the pselect()
and ppoll() system calls. These calls each take a signal mask;
that mask will be applied while the calling process waits for events, with
the previous mask being restored on return. There is an emulated version
of these calls in glibc now, but a truly robust implementation requires
kernel support. As with most things involving signals, the new code gets
somewhat complex in places. The end result, however, should be a pair of
straightforward system calls which allow a process to apply a different
signal mask while waiting for I/O.
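The usual pattern is to keep a signal blocked everywhere except inside the wait itself. Here is a minimal sketch, assuming the pselect() prototype specified by POSIX (the patch under discussion may differ in detail); pselect_demo() and on_usr1() are made-up names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

static void on_usr1(int sig)
{
    (void)sig;
    got_signal = 1;
}

/*
 * Block SIGUSR1, make one pending, then wait in pselect() with a mask
 * that lets it through. Returns 0 if the wait was interrupted by the
 * handler as expected, -1 otherwise.
 */
int pselect_demo(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
    int rc, saved_errno;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) < 0)
        return -1;

    got_signal = 0;
    raise(SIGUSR1);              /* stays pending while blocked */

    /* Mask used only for the duration of the wait: SIGUSR1 allowed. */
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGUSR1);

    rc = pselect(0, NULL, NULL, NULL, &timeout, &wait_mask);
    saved_errno = errno;

    /* Restore the caller's signal state. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);

    return (rc == -1 && saved_errno == EINTR && got_signal) ? 0 : -1;
}
```

With plain select(), the program would have to unblock the signal just before the call, leaving a window in which the signal could arrive and be missed by the wait; swapping the mask inside the call closes that window.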
unshare()
The unshare() patch by Janak Desai was first covered here last May. It allows a process
to disconnect from resources which are shared with others. The target
application is per-user namespaces; implementing these requires the ability
to detach from the global namespace normally shared by all processes on the
system. The current version of
this patch implements namespace unsharing, but it also allows a process
to privatize its view of virtual memory and open files.
This patch has been through a fair amount of review, and has seen a number
of improvements from that process. Andrew Morton's reaction to a request to include the patch in
-mm suggests that there is some work yet to be done, though. Andrew wants
to see a better justification for the patch; he is also concerned about the
security implications of adding a relatively obscure bit of code. The end
result is that Janak still has some homework to do before this patch will
make it into the kernel.
preadv() and pwritev()
The kernel currently supports the pread() and pwrite()
system calls; these behave like read() and write(), with
the exception that they take an explicit offset in the file. They will
perform the operation at the given offset regardless of whether the
"current" offset in the file has been changed by another thread, and they
do not change the current offset as seen by any thread. Also supported are
readv() and writev(), which perform scatter/gather I/O
from the current file offset. The kernel does not have, however, any
system call which combines these two modes of operation.
It turns out that there are developers who wish they had system calls along
the lines of:
int preadv(unsigned int fd, struct iovec *vec, unsigned long vlen,
loff_t pos);
int pwritev(unsigned int fd, struct iovec *vec, unsigned long vlen,
loff_t pos);
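For a sense of how such a call would be used, here is a sketch that assumes the preadv() wrapper found in current glibc, whose argument types are not identical to the prototypes above. The function name preadv_demo() and the temporary file are invented for illustration.

```c
#define _GNU_SOURCE            /* expose preadv() */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write "hello, world" to a scratch file, then read "world" back from
 * offset 7 into two separate buffers with one preadv() call.
 * Returns 0 on success, -1 on failure.
 */
int preadv_demo(void)
{
    char tmpl[] = "/tmp/preadv-XXXXXX";   /* hypothetical scratch file */
    char first[2], second[3];
    struct iovec vec[2] = {
        { .iov_base = first,  .iov_len = sizeof(first)  },
        { .iov_base = second, .iov_len = sizeof(second) },
    };
    off_t before, after;
    ssize_t n;
    int fd, ret = -1;

    fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;
    unlink(tmpl);

    if (write(fd, "hello, world", 12) != 12)
        goto out;

    before = lseek(fd, 0, SEEK_CUR);
    n = preadv(fd, vec, 2, 7);            /* gather 5 bytes from offset 7 */
    after = lseek(fd, 0, SEEK_CUR);

    /* The data is scattered across both buffers; the file offset is untouched. */
    if (n == 5 && before == after &&
        memcmp(first, "wo", 2) == 0 && memcmp(second, "rld", 3) == 0)
        ret = 0;
out:
    close(fd);
    return ret;
}
```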
To satisfy this need, Badari Pulavarty has created a simple implementation
which is currently part of the -mm tree. It seems that Ulrich Drepper
suggested an alternative to adding two new system calls, however: change
the iovec structure instead. Badari ran with that idea, posting
a new patch creating a new
iovec type:
struct niovec
{
    void __user *iov_base;
    __kernel_size_t iov_len;
    __kernel_loff_t iov_off; /* NEW */
};
The new iov_off field is more flexible than plain
preadv() in that it enables each segment in the I/O operation to
have its own offset. The only downside is that the prototypes for the
readv() and writev() methods in the
file_operations structure must be changed. So every driver and
filesystem which implements readv() and writev() breaks
and must be changed. There are fewer of those than one might expect, but
it is still a significant change.
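The per-segment semantics can be approximated in user space today with a loop of pread() calls. The sketch below is only an emulation under that assumption: struct seg, read_segments(), and segments_demo() are hypothetical names, not part of Badari's patch, and the loop lacks the single-call behavior a kernel implementation would provide.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical user-space mirror of a segment with its own file offset. */
struct seg {
    void   *base;
    size_t  len;
    off_t   off;
};

/* Read each segment from its own offset; returns total bytes read or -1. */
ssize_t read_segments(int fd, const struct seg *segs, int count)
{
    ssize_t total = 0;

    for (int i = 0; i < count; i++) {
        ssize_t n = pread(fd, segs[i].base, segs[i].len, segs[i].off);
        if (n < 0)
            return -1;
        total += n;
        if ((size_t)n < segs[i].len)
            break;                 /* short read: stop early */
    }
    return total;
}

/* Returns 0 if two out-of-order segments come back as expected. */
int segments_demo(void)
{
    char tmpl[] = "/tmp/segdemo-XXXXXX";  /* hypothetical scratch file */
    char a[3], b[2];
    struct seg segs[2] = {
        { .base = a, .len = sizeof(a), .off = 6 },   /* "678" */
        { .base = b, .len = sizeof(b), .off = 0 },   /* "01"  */
    };
    int ret = -1;
    int fd = mkstemp(tmpl);

    if (fd < 0)
        return -1;
    unlink(tmpl);

    if (write(fd, "0123456789", 10) == 10 &&
        read_segments(fd, segs, 2) == 5 &&
        memcmp(a, "678", 3) == 0 && memcmp(b, "01", 2) == 0)
        ret = 0;

    close(fd);
    return ret;
}
```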
It was suggested that the asynchronous I/O
operations could be used instead. The AIO interface already allows for the
creation of vectored operations with per-segment offsets. The downside is
that using AIO is more complicated in user space and heavier in the kernel;
incidentally, AIO support in the kernel was never completed to the
point where it will support these operations anyway. Still, that is an
option which may need more consideration before changing one of the
fundamental interfaces used by filesystems and drivers.
splice()
Finally, there has been talk over many years of creating a
splice() system call. The core idea is that a process could
open a file descriptor for a data source, and another for a data sink.
Then, with a call to splice(), those two streams could be
connected to each other, and the data could flow from the source to the
sink entirely within the kernel, with no need for user-space involvement
and with minimal (or no) copying.
Some of the infrastructure was put in place one year ago when Linus created
a circular pipe buffer mechanism. Now Jens Axboe has put together a simple splice()
implementation which uses that mechanism. The patch is not ready for
prime time yet (Jens: "I'm just posting this in the spirit
of posting early"), but it is a beginning. In particular, it allows
a file to be spliced to a pipe, as either the source or the sink. With a
pair of splices, it is possible to set up an in-kernel file copy operation
with no internal memory copying.
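The "pair of splices" arrangement might look roughly like the following. This sketch assumes the splice() interface as it exists in current glibc, which is not necessarily what Jens's early patch provides, and it makes no claim about how much copying the kernel avoids. The names splice_copy() and splice_demo() are invented.

```c
#define _GNU_SOURCE            /* expose splice() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy len bytes from in_fd to out_fd through a pipe; 0 on success. */
int splice_copy(int in_fd, int out_fd, size_t len)
{
    int pipefd[2];
    int ret = -1;

    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* First splice: file is the source, pipe is the sink. */
        ssize_t in = splice(in_fd, NULL, pipefd[1], NULL, len, 0);
        if (in <= 0)
            goto out;              /* error, or source shorter than len */

        /* Second splice: pipe is the source, file is the sink. */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, NULL,
                                 (size_t)left, 0);
            if (out <= 0)
                goto out;
            left -= out;
        }
        len -= (size_t)in;
    }
    ret = 0;
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return ret;
}

/* Returns 0 if a short message survives the trip between two files. */
int splice_demo(void)
{
    char src_name[] = "/tmp/splice-src-XXXXXX";   /* hypothetical paths */
    char dst_name[] = "/tmp/splice-dst-XXXXXX";
    static const char msg[] = "spliced data";
    char back[sizeof(msg)] = { 0 };
    size_t len = sizeof(msg) - 1;
    int src, dst, ret = -1;

    src = mkstemp(src_name);
    if (src < 0)
        return -1;
    unlink(src_name);
    dst = mkstemp(dst_name);
    if (dst < 0) {
        close(src);
        return -1;
    }
    unlink(dst_name);

    if (write(src, msg, len) == (ssize_t)len &&
        lseek(src, 0, SEEK_SET) == 0 &&
        splice_copy(src, dst, len) == 0 &&
        pread(dst, back, len, 0) == (ssize_t)len &&
        memcmp(back, msg, len) == 0)
        ret = 0;

    close(src);
    close(dst);
    return ret;
}
```

The data never passes through a user-space buffer on its way from one file to the other; the process only tells the kernel which descriptors to connect.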
Work left for the future includes cleaning up the ("ugly," "nasty")
internal interfaces, and generalizing the code so that any two file
descriptors can be spliced together. The ability to splice to network
sockets would be particularly useful. Some of this may take a while, so
don't expect splice() to show up in the mainline in the immediate
future.
Last week's Kernel Page
covered the mutex patch by David Howells. The discussion did not stop at
that point, however, so here's this week's episode.
There was some fairly strong pushback against the mutex patch after last
week's article was written. Linus expressed
his thoughts this way:
A patch that
- creates a non-counting mutex
- .. that is SLOWER than the current counting one
- .. and keeps the old "semaphore" and "up/down" naming
is simply INCREDIBLY BROKEN. It has absolutely _zero_ redeeming features.
I can't understand how there are a hundred emails in my mailbox even
discussing it.
Here is Andrew Morton's take:
I must say that my interest in this stuff is down in
needs-an-electron-microscope-to-locate territory. down() and up()
work just fine and they're small, efficient, well-debugged and
well-understood. We need a damn good reason for taking on
tree-wide churn or incompatible renames or addition of risk.
What's the damn good reason here?
Please. Go fix some bugs. We're not short of them.
The objections should be coming into focus at this point. One problem had
to do with performance; the mutex patch was supposed to be faster, but that
was not the case in the posted version (which lacked architecture-specific
implementations). There was a long discussion on why the semaphore code
could not be improved on in this regard. It seems that, on the most
popular architectures at least, the locked decrement-and-test code used by
semaphores is hard to beat.
David's patch also introduced a sort of global flag day, changing the
locking primitives used by vast amounts of code all at once. But it kept
the old semaphore function names and applied them to the new mutex type,
creating a confusing sort of interface. There was resistance to this
choice of naming, but also a great deal of resistance to the idea of making
major changes throughout the kernel without a very strong idea of what was
being gained for it. All told, the mutex patch set looked like it had a
rough road ahead of it.
Enter Ingo Molnar, who has posted a mutex patch of his own.
Ingo's mutexes are derived from the code used in the realtime preemption
patch, of course, but they have been heavily modified to avoid the
objections which greeted David's patch. In this version, a mutex is a
separate data type, with its own API:
DEFINE_MUTEX(name);
mutex_init(mutex);
void mutex_lock(struct mutex *lock);
int mutex_lock_interruptible(struct mutex *lock);
int mutex_trylock(struct mutex *lock);
void mutex_unlock(struct mutex *lock);
int mutex_is_locked(struct mutex *lock);
The existing semaphore interface is not changed in any way - at least, not
in any way visible to the rest of the kernel. There is an interesting
feature, however: the semaphore functions (down(), up(),
and friends) have been augmented to be able to handle mutex arguments as
well as semaphores. This feature is a migration tool: a subsystem which is
being considered for migration over to the mutex type can have its
semaphores changed to mutexes, but no other code changes are required. The
various checks built into the mutex type will quickly set off alarms if a
mutex is being used as a counting semaphore. In that case, the locks can
be changed back to semaphores and the whole episode forgotten. If,
instead, all seems well, the semaphore calls can be turned into mutex
calls. Eventually, when the migration work is complete, this helper code
can be removed from the kernel.
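The distinction those checks enforce can be seen in a user-space analogy. The sketch below is not kernel code and does not use Ingo's API; it uses POSIX semaphores and an error-checking pthread mutex to show that a counting semaphore admits several holders, while a mutex rejects a second lock attempt and a stray unlock. The function name is invented.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/*
 * Returns 0 if the semaphore behaves as a counter and the mutex
 * refuses to, -1 otherwise.
 */
int mutex_vs_semaphore_demo(void)
{
    sem_t sem;
    pthread_mutex_t mutex;
    pthread_mutexattr_t attr;
    int ok = 1;

    /* A semaphore initialised to 2 can be taken twice without blocking. */
    if (sem_init(&sem, 0, 2) != 0)
        return -1;
    ok &= (sem_trywait(&sem) == 0);
    ok &= (sem_trywait(&sem) == 0);
    ok &= (sem_trywait(&sem) == -1 && errno == EAGAIN);
    sem_destroy(&sem);

    /* An error-checking mutex has exactly one owner at a time. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    ok &= (pthread_mutex_lock(&mutex) == 0);
    ok &= (pthread_mutex_lock(&mutex) == EDEADLK);   /* second lock refused */
    ok &= (pthread_mutex_unlock(&mutex) == 0);
    ok &= (pthread_mutex_unlock(&mutex) == EPERM);   /* unlock without owning */
    pthread_mutex_destroy(&mutex);

    return ok ? 0 : -1;
}
```

Code that relies on the first behavior is using its lock as a counter and would trip the mutex checks; code that only ever exhibits the second pattern is a candidate for conversion.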
The real point of all the above is that, unlike David's patch, this version
of mutexes imposes no flag day on the kernel. It is a new primitive, with
its own API, and bits of the kernel can be converted over one by one.
Ingo claims that his mutex code is significantly faster than semaphores
used as mutexes. The code itself is a bit smaller and tighter, which
helps. But he also gets some impressive performance improvements on some
tests: a filesystem-based test more than doubled its speed on an
eight-processor system. That is the sort of improvement which can help to
motivate the quick merging of a patch.
In this case, developers started to wonder just why the semaphore code
was so much slower. Some research turned up the fact that, on the x86
architecture, each cycle through a semaphore had the potential to wake up
two separate waiting processes, each of which would then contend for the
lock. Nobody knows why the code is this way - Linus is mystified by it. It quickly became clear,
though, that taking out the redundant wakeup breaks the semaphores and
causes lockups. For now, it is a bit of black magic which must remain for
the whole thing to work.
Ingo quickly seized on this revelation to
drive home one of his other points:
If this really is a bug that hid for years, it shows that the
semaphore code is too complex to be properly reviewed and
improved. Hence even assuming that the mutex code does not bring
direct code advantages (which i'm disputing :-), the mutex code is
far simpler and thus easier to improve.
Linus seems to have heard this argument:
And don't get me wrong: if it's easier to just ignore the
performance bug, and introduce a new "struct mutex" that just
doesn't have it, I'm all for it.
He doesn't like the under-the-hood semaphore changes, though, and would
like that part of the patch taken out.
Ingo's initial posting
contains no fewer than ten reasons why he thinks the mutex patch should go
in; rather than try to rephrase all of those arguments, your editor
suggests going straight to the source. It is worth noting that, among
other things, merging this mutex patch would move another piece of the
realtime preemption patch into the mainline - even though many of the
realtime-specific features (priority inheritance, for example) are
missing.
Patches and updates
Development tools
- Junio C Hamano: GIT 1.0.0.
(December 21, 2005)
Page editor: Jonathan Corbet