The current 2.6 development kernel remains 2.6.28-rc8; no 2.6.28
prepatches have been released over the last week. The trickle of changes
into the mainline git repository continues, with 46 changes (as of this
writing) merged since -rc8.
The question of when the final 2.6.28 release will happen remains open.
Linus seems to be leaning toward a pre-holiday release, mostly
because he wants to get the merge window out of the way before the
beginning of linux.conf.au in January. The regression list is quite short
at this point, so it seems that a release at just about any time would be
The current 2.6 stable kernel is 184.108.40.206, released with a long list of
fixes on December 13. Meanwhile, the 220.127.116.11 stable release,
containing another 22 patches, is in the review process as of this writing;
it will likely be released on December 18.
Kernel development news
If it is your intention to submit this for a mainline merge then I
would encourage you to stop feature work at the earliest reasonable
stage and then move into the document, submit, review, merge,
fixfixfix phase. That might take as long as several months.
Once things have stabilised and it's usable and performs respectably,
start thinking about features again.
Do NOT fall into the trap of adding more and more and more stuff to
an out-of-tree project. It just makes it harder and harder to get
it merged. There are many examples of this.
-- Andrew Morton
What kind of action would you take
to a few gigabytes of "ipt_hook: happy cracking.\n"?
-- Dave Jones
Adding a system call to the kernel is never done lightly. It is important
to get it right before it gets merged because, once that happens, it
must be maintained as part of the kernel's binary interface forever. The
proposal to add preadv()
and pwritev() system calls provides an excellent example of
the kinds of concerns that need to be addressed when adding to the kernel
The two system calls themselves are quite straightforward. Essentially,
they combine the existing pread() and readv() calls (along with
the write variants, of course) into
a way to do scatter/gather I/O at a particular offset in the file. As with
pread(), the current file position is
unaffected. The calls, which are available on various BSD systems, can be
used to avoid races between an lseek() call and a read or
write. Currently, applications must do some kind of locking to prevent
multiple threads from stepping on each other when doing this kind of I/O.
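As a rough illustration of what the combined call buys an application, here is a minimal user-space sketch. It assumes a C library that exposes preadv() with the BSD-style prototype; read_record() and its header/body layout are hypothetical and exist only for this example.

```c
#define _DEFAULT_SOURCE          /* asks glibc to declare preadv() */
#include <assert.h>
#include <fcntl.h>               /* open() flags, for callers */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch set): read a fixed-size header
 * and body that start at `offset` into two separate buffers. The file
 * position of `fd` is neither used nor changed.
 */
static ssize_t read_record(int fd, off_t offset,
                           void *hdr, size_t hdr_len,
                           void *body, size_t body_len)
{
    assert(hdr != NULL && body != NULL);

    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };

    /* A single call replaces lseek() + readv(), so there is no window in
     * which another thread can move the file position. */
    return preadv(fd, iov, 2, offset);
}
```

Because the offset travels with the request, several threads can call such a helper on the same descriptor without taking a lock around the seek and the read.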
The prototypes for the functions look much like readv/writev, simply adding
the offset as the final parameter:
ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
But, because off_t is a 64-bit quantity, this causes problems on
some architectures due to the way system call arguments are
passed. After Gerd Hoffmann posted version 2 of the patchset,
Matthew Wilcox was quick to point out:
Are these prototypes required? MIPS and PARISC will need wrappers to
fix them if they are. These two architectures have an ABI which
requires 64-bit arguments to be passed in aligned pairs of registers,
but glibc doesn't know that (and given the existence of syscall(3),
can't do much about it even if it knew), so some of the arguments end up
in the wrong registers.
Several other architectures (ARM, PowerPC, s390, ...) have similar
constraints. Because the offset is the fourth argument, it gets placed in
the r3 and r4 32-bit registers, but some architectures need it in either
r2/r3 or r4/r5. This led some to advocate reordering the
parameters, putting the offset before iovcnt to avoid the
problem. As long as that change doesn't bubble out to user space, Hoffmann
is amenable to making the change:
"I'd *really* hate it to have the same system call with different
argument ordering on different systems though".
Most seemed to agree that the user-space interface as presented by glibc
should match what the BSDs provide. It causes too many headaches for folks
trying to write standards or portable code otherwise. To fix the
alignment problem, the system call itself has the reordered version of the
arguments. That led to Hoffmann's third version of the
patchset, which still didn't solve the whole problem.
There are multiple architectures that have both 32-bit and 64-bit versions, and
the 64-bit kernel must support system calls from 32-bit user-space
programs. Those programs will put 64-bit arguments into two registers,
but the 64-bit kernel will expect that argument in a single register.
Because of this, Arnd Bergmann recommended
splitting the offset into two arguments, one for the high 32 bits and
one for the low: "This is the only way I can see that lets us use a
shared compat_sys_preadv/pwritev across all 64 bit architectures".
When a 32-bit user-space program makes a system call on a 64-bit system,
the compat_sys_* version is used to handle differences in the data
sizes. If a pointer to a structure is passed to a system call, and that
structure has a different representation in 32-bits than it does in
64-bits, the compat layer makes the translation. Because
different 64-bit architectures do things differently in terms of calling
conventions and alignment requirements, the only way to share
compat code is to remove the 64-bit quantity from the system call
That just leaves one final problem to overcome: endian-ness. As Ralf
Baechle notes, MIPS can be either little or
big-endian, so the compat_sys_preadv/pwritev() needs
to put the two 32-bit offset values together in the proper way. He
recommended moving the MIPS-specific merge_64() macro into a common
compat.h include file, which could then be used by the common
compat routines. So far, version 4 of the patchset has not
emerged, but one suspects that the offset argument splitting and use of
merge_64() will be part of it.
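The splitting and merging described above can be sketched in a few lines of C. The function names here are hypothetical, and the register-pair rule in merge_register_pair() (first register holds the high half on big-endian systems) is an assumption made for illustration; the actual merge_64() macro and the eventual patch may well look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Caller side: break a 64-bit offset into two 32-bit arguments. */
static void split_offset(uint64_t offset, uint32_t *high, uint32_t *low)
{
    *high = (uint32_t)(offset >> 32);
    *low  = (uint32_t)(offset & 0xffffffffu);
}

/* Callee side: rebuild the offset from explicitly named halves. */
static uint64_t join_offset(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/*
 * Rebuild a value that arrived in a register pair. ASSUMPTION: on a
 * big-endian system the first register holds the high half, and on a
 * little-endian system it holds the low half.
 */
static uint64_t merge_register_pair(uint32_t first, uint32_t second,
                                    bool big_endian)
{
    return big_endian ? join_offset(first, second)
                      : join_offset(second, first);
}

/* Splitting and then joining must give back the original offset. */
static void check_round_trip(uint64_t offset)
{
    uint32_t high, low;

    split_offset(offset, &high, &low);
    assert(join_offset(high, low) == offset);
}
```

With the halves passed as two ordinary 32-bit arguments, no single argument needs an aligned register pair, which is what allows one compat routine to serve every 64-bit architecture.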
The actual implementation of preadv() and
pwritev() is straightforward, certainly in comparison to the
intricacies of passing their arguments. The VFS implementations of
readv()/writev() already take an offset argument, so it
was simply a matter of calling those. It is interesting to note that as
part of the review, Christoph Hellwig spotted a
bug in the existing compat_sys_readv/writev() implementations
which would lead to accounting information not being updated for those
This is not the first time these system calls have been proposed; way back
in 2005, we looked at some
patches from Badari Pulavarty that added them. Other than a brief
appearance in the -mm tree, they seem to have faded away.
Even if this edition of preadv() and pwritev() does not make
it into the
mainline (so far there are no indications that it
won't), the code review surrounding it was certainly useful. Getting a
glimpse of the complexities around 64-bit quantities being passed to system
calls was quite informative as well.
There's been progress in a few areas which LWN has covered in the past.
Here's a quick followup on where things stand now.
In last week's episode, a
new, out-of-the-blue performance monitoring patch had stirred up discussion
and a certain amount of opposition. The simplicity of the new approach by
Ingo Molnar and Thomas Gleixner had some appeal, but it is far from clear
that this approach is sufficiently powerful to meet the needs of the wider
performance monitoring community.
Since then, version 3 and version 4 of the patch have been
posted. A look at the changelogs shows that work on this code is
progressing quickly. A number of changes have been made, including:
- The addition of virtual performance counters for tracking clock time,
page faults, context switches, and CPU migrations.
- A new "performance counter group" functionality. This feature is
meant to address criticism that the original interface would not allow
multiple counters to be read simultaneously, making it hard to
correlate different counter values. Counters can now be associated
into multiple groups which allow them to be manipulated as a unit.
There's also a new mechanism allowing all counters to be turned on or
off with a single system call.
- The system call interface has been reworked; see the version 3
announcement for a description of the new API.
- The kerneltop utility has been enhanced to work with performance
- "Performance counter inheritance" is now supported; essentially, this
allows a performance monitoring utility to follow a process through a
fork() and monitor the child process(es) as well.
- The new "timec" utility runs a process under performance monitoring,
outputting a whole set of statistics on how the process ran.
There are still concerns about this new approach to performance monitoring,
naturally. Developers worry that users may not be able to get the
information they need, and it still seems like it may be necessary to put a
huge amount of hardware-specific programming information into the kernel.
But, to your editor's eye, this patch set also seems to be gaining a bit of
the sense of inevitability which usually attaches itself to patches from
Ingo and company. It will probably be some time, though, before a decision
is made here.
In November, we looked at a
new version of the Ksplice code, which allows patches to be put into a
running kernel. The Ksplice developers would like to see their work go
into the mainline, so they recently poked Andrew Morton to see what the
status was. His response was:
It's quite a lot of tricky code, and fairly high maintenance, I expect.
I'd have _thought_ that distros and their high-end customers would
be interested in it, but I haven't noticed anything from them. Not
that this means much - our processes for gathering this sort of
information are rudimentary at best.
The response on the list, such as it was, indicated that the distributors
are, in fact, not greatly interested in this feature. Dave Jones commented:
It's a neat hack, but the idea of it being used by even a small percentage
of our users gives me the creeps....
If distros can't get security updates out in a reasonable time, fix
the process instead of adding mechanism that does an end-run around it.
Which just leaves the "we can't afford downtime" argument, which leads
me to question how well reviewed runtime patches are.
Having seen some of the non-ksplice runtime patches that appear in the
wake of a new security hole, I can't say I have a lot of faith.
The Ksplice developers agree that the
writing of custom code to fit patches into a running kernel is a scary
proposition; that is why, they say, they've gone out of their way to make
such code unnecessary most of the time.
This discussion leaves Ksplice in a bit of a difficult position; in the
absence of clear demand, the kernel developers are unlikely to be willing
to merge a patch of this nature. If this is a feature that users really
want, they should probably be communicating that fact to their
distributors, who can then consider supporting it and working to get it
into the mainline.
The file scanning mechanism known as TALPA got off to a rough start
with the kernel development community. Many developers have a dim view of
the malware scanning industry in general, and they did not like the
implementation that was posted. It is clear, though, that the desire for
this kind of functionality is not going away. So developer Eric Paris has
been working toward an implementation which will pass review.
His latest attempt can be seen in the form of the fsnotify patch set. This code
does not, itself, support the malware scanning functionality, but, says
Eric, "you better know it's coming." What it does, instead,
is to create a new, low-level notification mechanism for filesystem events.
At a first look, that may seem like an even more problematic approach than
was taken before. Linux already has two separate file event notifiers:
dnotify and inotify. Kernel developers tend to express their
dissatisfaction with those interfaces, but there has not been a whole lot
of outcry for somebody to add a third alternative. So why would fsnotify
Eric's idea seems to be to make something that so clearly improves the
kernel that people will lose the will to complain about the malware
scanning functionality. So fsnotify has been written - employing a lot of
input from filesystem developers - to be a better-thought-out, more
supportable notification subsystem. Then the existing dnotify and inotify
code is ripped out and reimplemented on top of fsnotify. The end result is
that the impact on the rest of the VFS code is actually reduced; there is
now only one set of notifier calls where, previously, there were two. And,
despite that, the notification mechanism has become more general, being
able to support functionality which was not there in the past.
And, to top it off, Eric has managed to make the size of the in-core
inode structure smaller. Given that there can be thousands of
those structures in a running system, even a small reduction in their
size can make a big difference. So, claims Eric, "That's
right, my code is smaller and faster. Eat that."
What this code needs now is detailed review from the core VFS developers.
Those developers tend to be a highly-contended resource, so it's not clear
when they will be able to take a close look at fsnotify. But, sooner or
later, it seems likely that this feature will find its way into the
The Linux kernel does not lack for low-level memory managers. The
venerable slab allocator has been the engine behind functions like
for many years. More
recently, SLOB was added as a pared-down allocator suitable for systems
which do not have a whole lot of memory to manage in the first place. Even
more recently, SLUB was added
as a proposed replacement for slab which, while being designed with very large
systems in mind, was meant to be applicable to smaller systems as well. The consensus
for the last year or so has been that at least one of these allocators is surplus
to requirements and should go. Typically, slab is seen as the odd
allocator out, but nagging doubts about SLUB (and some performance
regressions in specific situations) have kept slab in the game.
Given this situation, one would not necessarily think that the kernel needs
yet another allocator. But
Nick Piggin thinks that, despite the surfeit of low-level memory managers,
there is always room for one more. To that end, he has developed the SLQB allocator which he hopes to
eventually see merged into the mainline. According to Nick:
I've kept working on SLQB slab allocator because I don't agree with
the design choices in SLUB, and I'm worried about the push to make
it the one true allocator.
Like the other slab-like allocators, SLQB sits on top of the page allocator
and provides for allocation of fixed-sized objects. It has been designed
with an eye toward scalability on high-end systems; it also makes a real
effort to avoid the allocation of compound pages whenever possible.
Avoidance of higher-order (compound page) allocations can improve
reliability significantly when memory gets tight.
While there is a fair amount of tricky code in SLQB, the core algorithms
are not that hard to understand. Like the other slab-like allocators, it
implements the abstraction of a "slab cache" - a lookaside cache from
which memory objects of a fixed size can be allocated. Slab caches are
used directly when memory is allocated with kmem_cache_alloc(), or
indirectly through functions like kmalloc(). In SLQB, a slab cache is
represented by a data structure which looks very approximately like the
following:
[Diagram: SLQB slab cache data structures]
(Note that, to simplify the diagram, a number of things have been glossed over.)
The main kmem_cache structure contains the expected global
parameters - the size of the objects being allocated, the order of page
allocations, the name of the cache, etc. But scalability means separating
processors from each other, so the bulk of the kmem_cache data
structure is stored in per-CPU form. In particular, there is one
kmem_cache_cpu structure for each processor on the system.
Within that per-CPU structure one will find a number of lists of objects.
One of those (freelist) contains a list of available objects; when
a request is made to allocate an object, the free list will be consulted
first. When objects are freed, they are returned to this list. Since this
list is part of a per-CPU data structure, objects normally remain on the
same processor, minimizing cache line bouncing. More importantly, the
allocation decisions are all done per-CPU, with no bad cache behavior and
no locking required beyond the disabling of interrupts. The free list is
managed as a stack, so allocation requests will return the most recently
freed objects; again, this approach is taken in an attempt to optimize
memory cache behavior.
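The stack-like behavior of the free list can be sketched as follows. This is a simplified user-space model, not SLQB's code: struct cpu_cache, cache_alloc(), and cache_free() are hypothetical names, there is a single "CPU", and the interrupt disabling that the kernel would need is left out.

```c
#include <assert.h>
#include <stddef.h>

/* A free object stores the link to the next free object in its own memory. */
struct free_object {
    struct free_object *next;
};

/* Hypothetical stand-in for the per-CPU state; field names are illustrative. */
struct cpu_cache {
    struct free_object *freelist;   /* top of the stack */
    size_t nr_free;                 /* number of objects on the stack */
};

/* Freeing pushes the object onto the top of the stack. */
static void cache_free(struct cpu_cache *c, void *obj)
{
    struct free_object *o = obj;

    assert(o != NULL);
    o->next = c->freelist;
    c->freelist = o;
    c->nr_free++;
}

/* Allocation pops the most recently freed (most likely cache-hot) object. */
static void *cache_alloc(struct cpu_cache *c)
{
    struct free_object *o = c->freelist;

    if (o == NULL)
        return NULL;    /* a real allocator would go to a page here */
    c->freelist = o->next;
    c->nr_free--;
    return o;
}
```

Both operations touch only the head pointer and the object itself, which is why the fast path needs no lock when each processor has its own list.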
SLQB gets its memory in the form of full pages from the page allocator.
When an allocation request is made and the free list is empty, SLQB will
allocate a new page and return an object from that page. The remaining
space on the page is organized into a per-page free list (assuming the
objects are small enough to pack more than one onto a page, of course), and
the page is added to the partial list. The other objects on the
page will be handed out in response to allocation requests, but only when
the free list is empty. When the final object on a page is allocated, SLQB
will forget about the page - temporarily, at least.
Objects are, when freed, added to freelist. It is easy to foresee
that this list could grow to be quite large after a burst of system
activity. Allowing freelist to grow without bound would risk tying up a lot of system
memory doing nothing while it is possibly needed elsewhere. So, once the size of the
free list passes a watermark (or when the page allocator starts asking for
help freeing memory), objects in the free list will be flushed back to
their containing pages. Any partial pages which are completely filled with
freed objects will then be returned back to the page allocator for use
There is an interesting situation which arises here, though: remember that
SLQB is fundamentally a per-CPU allocator. But there is nothing that
requires objects to be freed on the same CPU which allocated them. Indeed,
for suitably long-lived objects on a system with many processors, it
becomes probable that objects will be freed on a different CPU. That
processor does not know anything about the partial pages those objects were
allocated from, and, thus, cannot free them. So a different approach has
to be taken.
That approach involves the maintenance of two more object lists, called
rlist and remote_free. When the allocator tries to flush a
"remote" object (one allocated on a different CPU) from its local
freelist, it will simply move that object over to rlist.
Occasionally, the allocator will reach across CPUs to take the objects from
its local rlist and put them on the remote_free list of the
CPU which initially allocated those objects. That CPU can then choose to
reuse the objects or free them back to their containing pages.
The cross-CPU list operation clearly requires locking, so a spinlock
protects remote_free. Working with the remote_free lists
too often would thus risk cache line bouncing and lock contention, both of
which are not helpful when scalability is a goal. That is why processors
accumulate a group of objects in their local rlist before adding
the entire list, in a single operation, to the appropriate
remote_free list. On top of that, the allocator does not often check for
objects in its local remote_free list. Instead, objects are
allowed to accumulate there until a watermark is exceeded, at which point
whichever processor added the final objects will set the
remote_free_check flag. The processor owning the
remote_free list will only check that list when this flag is set,
with the result that the management of the
remote_free list can be done with little in the way of lock or
cache line contention.
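The batching scheme can be modeled in user space roughly as shown below. Everything here is an assumption made for illustration: the structure layout, the function names, the watermark value, and the use of a C11 atomic_flag in place of a kernel spinlock. The sketch also assumes that every object on rlist belongs to the same owning CPU, which a real allocator could not take for granted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct object { struct object *next; };

/* Head and tail are both kept so a whole list can be spliced in O(1). */
struct obj_list {
    struct object *head, *tail;
    size_t nr;
};

/* Illustrative per-CPU state; only the fields discussed above. */
struct cpu_cache {
    struct obj_list rlist;          /* local batch of objects owned elsewhere */
    atomic_flag remote_lock;        /* stands in for the kernel spinlock */
    struct obj_list remote_free;    /* filled by other CPUs under remote_lock */
    atomic_bool remote_free_check;  /* set when remote_free is worth draining */
};

#define REMOTE_WATERMARK 4          /* made-up value for the sketch */

static void cache_init(struct cpu_cache *c)
{
    c->rlist = (struct obj_list){ NULL, NULL, 0 };
    c->remote_free = (struct obj_list){ NULL, NULL, 0 };
    atomic_flag_clear(&c->remote_lock);
    atomic_init(&c->remote_free_check, false);
}

static void list_push(struct obj_list *l, struct object *o)
{
    assert(o != NULL);
    o->next = l->head;
    l->head = o;
    if (l->tail == NULL)
        l->tail = o;
    l->nr++;
}

/* Move everything on src to the front of dst, leaving src empty. */
static void list_splice(struct obj_list *dst, struct obj_list *src)
{
    if (src->head == NULL)
        return;
    src->tail->next = dst->head;
    if (dst->tail == NULL)
        dst->tail = src->tail;
    dst->head = src->head;
    dst->nr += src->nr;
    *src = (struct obj_list){ NULL, NULL, 0 };
}

static void spin_lock(atomic_flag *f)   { while (atomic_flag_test_and_set(f)) ; }
static void spin_unlock(atomic_flag *f) { atomic_flag_clear(f); }

/* Freeing CPU: hand the whole local batch to the owner in one locked step. */
static void flush_rlist(struct cpu_cache *self, struct cpu_cache *owner)
{
    spin_lock(&owner->remote_lock);
    list_splice(&owner->remote_free, &self->rlist);
    if (owner->remote_free.nr > REMOTE_WATERMARK)
        atomic_store(&owner->remote_free_check, true);
    spin_unlock(&owner->remote_lock);
}

/* Owning CPU: take the lock only when the flag says it is worthwhile. */
static size_t claim_remote_free(struct cpu_cache *self, struct obj_list *out)
{
    size_t n;

    if (!atomic_load(&self->remote_free_check))
        return 0;
    spin_lock(&self->remote_lock);
    n = self->remote_free.nr;
    list_splice(out, &self->remote_free);
    atomic_store(&self->remote_free_check, false);
    spin_unlock(&self->remote_lock);
    return n;
}
```

The owner's common case is a single read of the flag, so the shared lock and list are touched only after enough objects have piled up to make the trip worthwhile.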
The SLQB code is relatively new, and is likely to need a considerable
amount of work before it may find its way into the mainline. Nick claims
benchmark results which are roughly comparable with those obtained using
the other allocators. But "roughly comparable" will not, by itself, be
enough to motivate the addition of yet another memory allocator. So
pushing SLQB beyond comparable and toward "clearly better" is likely to be
Nick's next task.
Page editor: Jonathan Corbet