Brief items
The current stable 2.6 release is 2.6.15.3,
released on February 6. It
contains a single, one-line fix for a remotely-exploitable denial of
service vulnerability in the ICMP code.
The 2.6.15.4 release is under review as of
this writing. It is a rather larger patch with almost two dozen important
fixes.
The current 2.6 prepatch is 2.6.16-rc2, released by Linus on
February 2. In addition to the expected big pile of fixes, this
prepatch adds another set of semaphore-to-mutex conversions, a USB driver
for ET61X151 and ET61X251 camera controllers, a big Video4Linux update, the
direct migration patches,
some slab allocator tweaks for NUMA machines, several new system calls
(openat() and friends, pselect(), ppoll()), a
big ACPI update, and
the EDAC error detection/correction code. The long-format changelog has lots of details.
The mainline git repository contains almost 500 post-rc2 patches as of this
writing. They are dominated by fixes, but there is also a patch to export
the system's CPU topology in sysfs, parallel port support for SGI O2
systems, administrator-changeable permissions in configfs, an OCFS2 update,
the unshare() system
call, and various architecture updates.
The current -mm tree is 2.6.16-rc2-mm1. Recent changes
to -mm include a rework of the mempool code, a new version of the core
timekeeping and NTP rework patches, better scheduler support for multicore
systems, a feature for forcing kernel allocations to be spread across NUMA
nodes, and an LED driver subsystem.
Comments (none posted)
Kernel development news
We've got bin-only kernel modules, much of which are clearly
immoral, they are clearly hurting us and still we do things to keep
them going - e.g. the refusal to remove 8K stacks from the
.config. We are increasingly getting into a situation where
loopholes are found and utilized to give back as little as
possible, upsetting the balance.
so i believe _something_ should be done to tip the balance, because
the negative effects are already hurting us. I'd support the move
to the GPLv3 only as a tool to move the balance back into a fairer
situation, not as some new moral mechanism. The GPLv3 might be
overboard for that, but still the situation does exist undeniably.
-- Ingo Molnar
After seven years and hundreds of issues, I've decided to take a
break from writing Kernel Traffic for awhile. I'd like to thank all
the people who helped out, providing me with hosting space,
hardware to work on, suggestions and bug reports, and money. And
I'd especially like to thank Linus and the rest of the kernel
developers for so powerfully changing the world for the better.
-- Zack Brown
Comments (9 posted)
The file_operations structure contains pointers to the basic I/O
operations exported by filesystems and char device drivers. This structure
currently contains three different methods for performing a read operation:
    ssize_t (*read) (struct file *filp, char __user *buffer, size_t size,
                     loff_t *pos);
    ssize_t (*readv) (struct file *filp, const struct iovec *iov,
                      unsigned long niov, loff_t *pos);
    ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer,
                         size_t size, loff_t pos);
Normal read operations end up with a call to the read() method,
which reads a single segment
from the source into the supplied buffer. The readv() method
implements the system call by the same name; it will read one segment and
scatter it into several user buffers, each of which is described by an
iovec structure. Finally, aio_read() is invoked in
response to asynchronous I/O requests; it reads a single segment into the
supplied buffer, possibly returning before the operation is complete.
There is a similar set of three methods for write operations.
Back in November, Zach Brown posted a vectored AIO patch intended to
provide a combination of the vectored (readv()/writev()) operations and
asynchronous I/O. To that end, it defined a couple of new AIO operations
for user space, and added two more file_operations methods:
aio_readv() and aio_writev(). There was some resistance
to the idea of creating yet another pair of operations, and a feeling that
there was a better way. The result, after work by Christoph Hellwig and
Badari Pulavarty, is a new
vectored AIO patch with a much simpler interface - at the cost of a
significant API change.
The observation was made that a number of subsystems use vectored I/O
operations internally in all cases, even in the case of a "scalar"
read() or write() call. For example, the read()
function in the current mainline pipe driver is:
    static ssize_t
    pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = count };
        return pipe_readv(filp, &iov, 1, ppos);
    }
Here, the read() method is essentially superfluous; it is provided
simply because the API requires it. So, it was asked, rather than adding
more vectored I/O operations, why not just "vectorize" the standard API?
The resulting patch set brings about that change in a couple of steps.
The first of those is to change the prototypes for the asynchronous I/O
methods to:
    ssize_t (*aio_read) (struct kiocb *iocb, const struct iovec *iov,
                         unsigned long niov, loff_t pos);
    ssize_t (*aio_write) (struct kiocb *iocb, const struct iovec *iov,
                          unsigned long niov, loff_t pos);
Thus, the single buffer has been replaced with an array of iovec
structures, each describing one segment of the I/O operation. For the
current single-buffer AIO read and write commands, the new code creates a
single-entry iovec array and passes it to the new methods. (It's
worth noting that, as the code is currently written, that iovec
array is no longer valid after aio_read() or aio_write()
returns; that array will need to be copied for any operation which remains
outstanding when those functions finish.)
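The copying requirement noted above can be illustrated with a small user-space sketch. It is not code from the patch set: model_dup_iovec() is a hypothetical name, and malloc() stands in for whatever allocator a real driver would use. Only the array of descriptors is duplicated; the buffers they point to are left where they are.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of the patch set): make a private copy
 * of an iovec array so that it remains usable after the caller's array
 * has gone away. Only the descriptors are copied, not the data buffers.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
struct iovec *model_dup_iovec(const struct iovec *iov, unsigned long niov)
{
    struct iovec *copy;

    assert(iov != NULL && niov > 0);

    copy = malloc(niov * sizeof(*copy));
    if (copy != NULL)
        memcpy(copy, iov, niov * sizeof(*copy));
    return copy;
}
```

An operation which completes before the method returns has no need for such a copy; it only matters when the request is queued for later completion.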
The prototypes of a couple of VFS helper functions
(generic_file_aio_read() and generic_file_aio_write())
have been changed in a similar manner. These changes ripple through every
driver and filesystem providing AIO methods, making the patch reasonably
large. A second patch then adds two new AIO operations
(IOCB_CMD_PREADV and IOCB_CMD_PWRITEV) to the user-space
interface, making vectored asynchronous I/O available to applications.
The patch set then goes one step further by eliminating the
readv() and writev() methods altogether. With this patch
in place, any filesystem or driver which wishes to provide vectored I/O
operations must do so via aio_read() and aio_write()
instead. Note that this change does not imply that asynchronous operations
themselves must be supported - it is entirely permissible (if suboptimal)
for aio_read() and aio_write() to operate synchronously
at all times. But this patch does make it necessary for modules wishing to
provide vectored operations to, at a minimum, provide
the file_operations methods for asynchronous I/O. If the AIO
methods are not available for a given device or filesystem, a call to
readv() or writev() will be emulated through multiple
calls to read() or write(), as usual.
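The emulation described here can be modeled in user space. The sketch below is an illustration only, not the kernel's implementation: emulated_readv() and demo_emulated_readv() are hypothetical names, and the scalar read() system call stands in for a driver's read() method.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical model of the fallback path: satisfy a vectored read by
 * issuing one scalar read() per segment, stopping at the first short
 * read or error.
 */
ssize_t emulated_readv(int fd, const struct iovec *iov, unsigned long niov)
{
    ssize_t total = 0;
    unsigned long i;

    assert(iov != NULL || niov == 0);

    for (i = 0; i < niov; i++) {
        ssize_t n = read(fd, iov[i].iov_base, iov[i].iov_len);

        if (n < 0)
            return total ? total : n;  /* report progress made before the error */
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;                     /* short read: nothing more for now */
    }
    return total;
}

/* Small demonstration over a pipe; returns 0 if the data arrived intact. */
int demo_emulated_readv(void)
{
    int fds[2];
    char a[5], b[6];
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = sizeof(a) },
        { .iov_base = b, .iov_len = sizeof(b) },
    };
    ssize_t n = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "hello world", 11) == 11)
        n = emulated_readv(fds[0], iov, 2);
    close(fds[0]);
    close(fds[1]);

    if (n != 11 || memcmp(a, "hello", 5) != 0 || memcmp(b, " world", 6) != 0)
        return -1;
    return 0;
}
```

One visible cost of this approach is that each segment becomes a separate operation, so a driver which can handle the whole vector at once has reason to provide the vectored method itself.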
Finally, with this patch in place, it is possible for a driver or
filesystem to omit the read() and write() methods
altogether if the asynchronous versions are provided. If, for example,
only aio_read() is provided, all read() and
readv() system calls will be handled by the aio_read()
method. If, someday, all code implements the AIO methods, the regular
read() and write() methods could be removed altogether.
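The dispatch described in this paragraph follows the same pattern as the pipe_read() wrapper shown earlier. A user-space sketch, under the assumption of a much-simplified "file" backed by a memory buffer, might look like the following; struct model_file, mem_vec_read(), and model_do_read() are all hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a file which may lack a scalar read method. */
struct model_file {
    const char *data;
    size_t len;
    ssize_t (*read)(struct model_file *f, char *buf, size_t size, off_t *pos);
    ssize_t (*vec_read)(struct model_file *f, const struct iovec *iov,
                        unsigned long niov, off_t pos);
};

/* Vectored read from the memory buffer; pos is passed by value, as in aio_read(). */
static ssize_t mem_vec_read(struct model_file *f, const struct iovec *iov,
                            unsigned long niov, off_t pos)
{
    ssize_t total = 0;
    unsigned long i;

    for (i = 0; i < niov && (size_t)pos < f->len; i++) {
        size_t n = f->len - (size_t)pos;

        if (n > iov[i].iov_len)
            n = iov[i].iov_len;
        memcpy(iov[i].iov_base, f->data + pos, n);
        pos += n;
        total += n;
    }
    return total;
}

/* Scalar read entry point: fall back to the vectored method if need be. */
ssize_t model_do_read(struct model_file *f, char *buf, size_t size, off_t *pos)
{
    struct iovec iov = { .iov_base = buf, .iov_len = size };
    ssize_t n;

    if (f->read)
        return f->read(f, buf, size, pos);

    assert(f->vec_read != NULL);
    n = f->vec_read(f, &iov, 1, *pos);
    if (n > 0)
        *pos += n;  /* the caller, not the method, advances the position */
    return n;
}

/* Returns 0 if two scalar reads are served correctly by the vectored method. */
int demo_scalar_via_vector(void)
{
    struct model_file f = { .data = "abcdef", .len = 6,
                            .read = NULL, .vec_read = mem_vec_read };
    char buf[4];
    off_t pos = 0;

    if (model_do_read(&f, buf, 4, &pos) != 4 || memcmp(buf, "abcd", 4) != 0)
        return -1;
    if (model_do_read(&f, buf, 4, &pos) != 2 || memcmp(buf, "ef", 2) != 0)
        return -1;
    return pos == 6 ? 0 : -1;
}
```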
That would result in an interface which contained only one method for all
read operations (and one more for writes). This change would also realize
the vision expressed at the 2003
Kernel Summit that all I/O paths inside the kernel would, in the end,
be made asynchronous.
There has been little discussion of the current patch set, so it is hard to
predict what may ultimately become of it. Given that it simplifies a core
kernel API while simultaneously making it more powerful, however, chances
are that some version of this patch will find its way into the kernel
eventually.
(For more information on the AIO interface, see this Driver Porting Series
article or chapter 15 of LDD3.)
Comments (1 posted)
Last week's Kernel Page
looked at one small piece of the software suspend debate. Meanwhile, the wider
discussion has flared up yet again, and looks unlikely to slow down.
Developers of the in-kernel suspend-to-disk code are working on moving
parts of it to user space and generally tweaking the existing structure.
Nigel Cunningham and other supporters of the Suspend2 patches, instead,
still hope to see that work merged, eventually replacing much of the
existing implementation. The discussion does not appear to be nearing any
sort of resolution.
One thing has become clear, though: Pavel Machek has a firm grip on the current
in-tree swsusp code, and that puts Suspend2 at a significant disadvantage.
Pavel has taken a strong position against many aspects of the Suspend2
code, and seems determined that it will never be merged. One gets the
sense, sometimes, that he just wishes Nigel and his code would go away.
Nigel is somewhat more persistent than that, however.
At one point, the two suggested that Linus and Andrew should make a
decision between the two implementations and settle the debate. Andrew,
however, does not want to do that:
You're unlikely to hear anything dispositive from either of us on
this... What we hope and expect is that you'll come up with an
agreed path in accordance with general kernel coding and
development principles. Linus and I don't want to have to make
tiebreak decisions - if we have to do that, the system has failed.
So much for the easy solution. Since then, the relevant parties have been
talking, but without a whole lot of apparent progress.
Perhaps the more interesting part of Andrew's note, however, was this:
If you want my cheerfully uninformed opinion, we should toss both
of them out and implement suspend3, which is based on the
kexec/kdump infrastructure. There's so much duplication of intent
here that it's not funny.
kexec(), remember, is a relatively new system call used to boot
from one kernel directly into another without going through the whole BIOS
startup ritual. The kdump code uses kexec() to perform safe crash
dumps. When the kernel panics, it uses kexec() to boot into a
small, special-purpose kernel which has been lurking in a reserved part of
memory for just this occasion. The new kernel restricts itself to the
reserved memory, so the entire memory image of the old, crashed kernel
remains intact. That image can then be written to disk in a relatively
safe manner.
It is true that suspend-to-disk can be thought of as a sort of kernel dump;
the only difference is this little desire to be able to restart the kernel
from the dump image at a future time. Using kdump for suspend-to-disk has
some obvious appeal. A great deal of effort now goes into freezing most
processes on the system - but not the ones needed to complete the suspend
process. The suspend code also must be very careful about what kernel
state it changes as it goes about its work. Simply jumping into a
separate dump kernel has the potential to make many of those problems go
away. It might almost be like the Good Old Days, when BIOS-based suspend
code simply worked most of the time.
A kdump-based suspend would not be without its costs. In particular, some
people might balk at reserving a substantial chunk of memory for the
suspend kernel. And, of course, the entire idea remains vaporware for
now.
Andrew's suggestion generated little discussion on the mailing list. But,
just maybe, it will have ignited a gleam in some hacker's eye. A simpler,
more robust suspend mechanism based on kdump which appeared out of left
field might just solve this problem - and put the whole tiresome debate in
the past - for good.
Comments (22 posted)
A set of patches for the management of virtual process IDs within
containers was discussed here a few weeks ago. That patch set drew some interest, but a fair amount of
concern as well. It is a large set of changes reaching all over the
kernel; it seemed to many that there should be a better way.
Since then, two candidates for the "better way" have been posted, and the
situation seems less clear than ever. This sort of virtualization is
clearly of interest to a number of projects, but there is little consensus
on how it should be done.
One of the new entrants is the OpenVZ PID virtualization code,
posted by Kirill Korotaev but originally developed by Alexey Kuznetsov.
These patches introduce a container called a VPS (virtual private server),
each of which can virtualize a number of aspects of the host system,
including process IDs. Each process has a real and virtual PID; all PIDs
of the virtual variety are identified by having a specific bit set. In the
simple case, the virtual-PID bit is the only difference between the real
and virtual IDs, but more complex mappings are possible as well.
There is the usual set of functions to convert between real and virtual
PIDs (and group, process group, and thread IDs as well). All code which
deals with user space must work with virtual PIDs, but internal code uses
real PIDs, so a certain amount of awareness is called for. Since there is
a specific bit used to mark virtual PIDs, the code is at least able to
catch situations where the wrong type of PID is used. There is also a
change to the internal fork() implementation allowing a process to
be created with a specific virtual PID; this feature can be used to launch
a new container with its top-level process having PID 1.
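The marker-bit idea can be sketched in a few lines. This is a model of the simple case only, where the bit is the sole difference between the two IDs; the bit position and all of the names (MODEL_VPID_BIT and the helper functions) are assumptions made for illustration and are not taken from the OpenVZ patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker bit; the real patches may use a different bit. */
#define MODEL_VPID_BIT (1 << 30)

/* A PID is "virtual" if, and only if, the marker bit is set. */
static inline bool model_pid_is_virtual(int pid)
{
    return (pid & MODEL_VPID_BIT) != 0;
}

/* Simple-case mapping: the marker bit is the only difference. */
static inline int model_real_to_virtual(int pid)
{
    assert(!model_pid_is_virtual(pid));  /* catch a PID of the wrong type */
    return pid | MODEL_VPID_BIT;
}

static inline int model_virtual_to_real(int vpid)
{
    assert(model_pid_is_virtual(vpid));  /* catch a PID of the wrong type */
    return vpid & ~MODEL_VPID_BIT;
}
```

The assertions show how the marker bit lets code catch a PID of the wrong type at the point of conversion, which is the safety property described above.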
The other implementation is this
"process ID namespace" patch set from Eric Biederman. It does away
with the concept of virtual PIDs in favor of a different view of the
problem. For starters, every process gets a "wait ID" - the process ID
by which its parent knows it. In most cases, the "wait ID" will be the
same as the PID, but, in cases where a process is the leader of a
virtualized group, the two will be different.
Then Eric adds process ID spaces. A process ID space (pspace) is simply a
range of independent PIDs, associated with a tree of processes. By
default, the entire system shares one process space, but, by way of a
clone() flag, a new process can be created in its own space.
Process IDs are unique within any one pspace, but may be duplicated in
other spaces. So the kernel, when it must identify a process unambiguously
using a PID, must now use a (pspace, PID) tuple. Functions which deal
in PIDs - kill_pg() or find_task_by_pid(), for example -
get a new pspace parameter.
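A lookup keyed on a (pspace, PID) tuple can be modeled as below. The structures and function names are hypothetical and far simpler than anything in Eric's patches; a linear search over an array stands in for the kernel's PID hash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal models of a PID space and a task. */
struct model_pspace {
    int id;
};

struct model_task {
    const struct model_pspace *pspace;
    int pid;
    const char *comm;
};

/* Find a task by (pspace, PID); the PID alone is no longer unambiguous. */
const struct model_task *model_find_task(const struct model_task *tasks,
                                         size_t n,
                                         const struct model_pspace *ps,
                                         int pid)
{
    size_t i;

    assert(tasks != NULL || n == 0);

    for (i = 0; i < n; i++)
        if (tasks[i].pspace == ps && tasks[i].pid == pid)
            return &tasks[i];
    return NULL;
}

/* Two tasks share PID 1 in different spaces; returns 0 if both are found. */
int demo_pspace_lookup(void)
{
    struct model_pspace host = { .id = 0 }, guest = { .id = 1 };
    struct model_task tasks[] = {
        { .pspace = &host,  .pid = 1, .comm = "init" },
        { .pspace = &guest, .pid = 1, .comm = "guest-init" },
    };
    const struct model_task *a = model_find_task(tasks, 2, &host, 1);
    const struct model_task *b = model_find_task(tasks, 2, &guest, 1);

    if (a != &tasks[0] || b != &tasks[1])
        return -1;
    /* A PID which exists in neither space is not found. */
    return model_find_task(tasks, 2, &guest, 2) == NULL ? 0 : -1;
}
```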
This approach has the advantage that there is no distinction between real
and virtual PIDs - all PIDs are interpreted relative to a PID
space. There is no real possibility of confusing real and virtual PIDs, or
interpreting PIDs relative to the wrong pspace. So it should be a
relatively safe addition to the kernel. On the other hand, Eric's patches
don't even try to address the larger virtualization problem; anybody
wanting to implement complete containers will still have to do that work
separately. Of course, as has been seen, a few projects have already done
that work; it's just a matter of seeing which implementation, if any, gets
into the mainline.
On that question, it is far too early to say what might happen. Linus has
indicated that he likes the container
concept from the OpenVZ patches, but that does not necessarily extend to
the PID virtualization part of it. Eric has tried to focus the discussion
with a summary of the relevant issues and
questions which must be resolved going forward. But there is a certain
amount of disagreement, and a few projects which have each invested
significant time into their particular approaches. It may be a while
before the dust settles on this one.
Comments (3 posted)
Page editor: Jonathan Corbet