Brief items
The current 2.6 kernel is 2.6.12, which was
released on June 17. Quite a few fixes -
but no substantial changes - were merged after the last release candidate.
For those who might not remember back to last March: 2.6.12 contains, among
other things, a driver for the "trusted computing" (TPM) chip found in
Thinkpads (and elsewhere), SuperHyway bus support, a multilevel security
implementation for SELinux, device mapper multipath support, the address
space randomization patches, a restored Philips webcam driver (still lacking
full functionality), full I/O barrier support for serial ATA drives,
resource limits which can be used to allow unprivileged users to run tasks
with realtime priority, and a huge pile of fixes. See the long-format
changelog for the details back to 2.6.12-rc2. For details prior to that, see
the long-format changelogs for 2.6.12-rc2 and 2.6.12-rc1.
No 2.6.13 prepatches have yet been released. There are, however, a few
hundred patches in Linus's git repository, including a big SCSI subsystem
update (the venerable SCSI changer driver has finally been merged), version
18 of the wireless extensions (with WPA2
security support), a new SysKonnect ethernet driver, some audit subsystem
improvements, some networking updates, a set of device model updates (see
below), a number of virtual memory improvements, some Rock Ridge filesystem
improvements, a new set of framebuffer fonts, some RAID (MD) improvements,
and a number of fixes.
The current -mm tree is 2.6.12-mm1. Recent changes to
-mm include a new version of the completely fair queueing (CFQ) I/O
scheduler, some VFS scalability work, and lots of fixes.
Kernel development news
Andrew Morton, looking forward to 2.6.13, has
posted a list of major patches
which, in his opinion, will (or will not) be merged soon. Reviewing the
list, along with the subsequent discussion, gives a good sense for what the
next 2.6 kernel might look like. Of course, the final product is still
likely to contain a few surprises.
Some of the decisions are not particularly controversial. Andrew is
likely to merge the OCFS2
filesystem, some Xen precursor patches, execute in place support,
software suspend support for SMP systems, some kernel timer performance
improvements, various KProbes updates, the RapidIO subsystem, some
scheduler tweaks, and some memory management work. Nobody has really
complained about the inclusion of any of these patches (yet), so their path into
the kernel might be relatively smooth.
One patch which has gotten surprising support is kexec, which was first
covered here in November,
2002. The ability to quickly boot a new kernel without going through
the system firmware is nice, but the real payoff for kexec comes when it is
combined with kernel crash
dumps. Crash dumps can be a useful diagnostic tool, especially for
vendors who are trying to track down a bizarre crash which only occurs at a
customer's site. So various distributors have included some sort of crash
dump capability in their kernels for some time. These patches will
typically write kernel memory to a disk or network device, then reboot the
system.
The approaches taken to crash dumps so far share one significant problem:
they all rely on the kernel to create its own dump. But this is a kernel
which has just gone into panic mode; it is not in a stable state.
The chances of an oopsing kernel completing a satisfactory crash dump are
not all that high (Arjan van de Ven estimates that it works about 10% of the
time). The real problem, however, is the risk involved in allowing an
unstable kernel to continue performing I/O; there is a very real
possibility that a (corrupted) crash dump could end up being written on top
of something that the owner would have preferred to keep.
The kexec approach gets around this problem by rebooting the system before
performing the dump. The normal, production kernel is configured to set
aside a small range of memory, which it never uses. Instead, a different
kernel is loaded into that memory; this kernel will be small, and
configured to do little other than performing crash dumps. If the system
should panic, kexec is used to immediately boot into the crash dump
kernel. This kernel, which will be starting fresh and in a known state,
can then write the contents of memory to some sort of permanent store
before rebooting into a new production kernel. This approach is safer and
more reliable; the mailing list discussion has been favorable enough that
kexec/kdump appears likely to be merged.
The reiser4 filesystem has sat in the -mm tree for some time, and Andrew
indicated that he might merge it this time around. Reiser4 has run into trouble in the past,
mostly as a result of its "file as a directory" semantics which change how
Linux works, can confuse tools, and, crucially, can lead to system
deadlocks. This feature has been disabled for now, but there is still
opposition to merging reiser4 into the mainline.
The main issue this time around would appear to be the plugin architecture
used by reiser4. Plugins can be used to change the behavior of the
filesystem in many ways, from adding compression to completely changing how
the file is laid out on disk. The plugin mechanism is a key part of Hans
Reiser's longer-term vision of how filesystems should work; he hopes to
eventually move all kinds of functionality into the filesystem level. The
kernel developers, however, do not think that this sort of mechanism should
be built into a filesystem; instead, much of what plugins do belongs in the
VFS layer. So they would like to see reiser4 slimmed down into a much
smaller, dumber system, with the plugin capability added on top of it and
made available for other filesystems as well.
Hans is resisting making this (large) change; he asks that the review process take a different
tack:
    How about review by benchmark instead? It works, it runs faster
    than the competition, users like it, we addressed the core kernel
    patch complaints, it should go in and receive the exposure that
    will result in lots of useful improvements and suggestions. It
    seems like we are getting an unusual review process.
Things appear to be at a standoff which could block the inclusion of
reiser4 for some time.
Yet another change under consideration is configurable clock frequencies
for the i386 and ia-64 architectures. The current value (1000Hz) turns out
not to be optimal for all users; lower clock frequencies can improve
throughput on some systems at the cost of coarser timer resolution and
possibly increased latencies. There have been complaints about the new
default (250Hz) and the fact that the patch is going in at all when more
sweeping changes to the timer system (such as the dynamic tick patch) are waiting
in the wings. Your editor's guess is that the patch will be merged, but
the default may be changed to keep the current HZ value.
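The resolution side of this tradeoff comes down to simple arithmetic, which the following user-space sketch tries to illustrate. It is not kernel code; the helper names tick_period_us() and ms_to_ticks() are hypothetical and exist only for this example.

```c
#include <assert.h>

/* Length of one timer tick, in microseconds, for a given HZ value. */
unsigned long tick_period_us(unsigned long hz)
{
    assert(hz > 0);
    return 1000000UL / hz;
}

/*
 * Number of ticks needed to cover a timeout of `ms` milliseconds,
 * rounded up: in this simple model a tick-driven timer cannot
 * expire in the middle of a tick.
 */
unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
    assert(hz > 0);
    return (ms * hz + 999) / 1000;
}
```

Under this model a tick lasts 1ms at HZ=1000 and 4ms at HZ=250, so a 10ms timeout that takes exactly ten ticks at the higher rate rounds up to three ticks (12ms) at the lower one.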
FUSE (user-space filesystems) is being discussed again. FUSE has run into opposition due to the way it
overrides the file permissions checking done at the VFS level. There does
not appear to be any solution to this issue that pleases everybody, so it
is hard to say where this one might go. It is possible that FUSE will be
merged, but without its particular permissions behavior - a solution which
would leave a number of FUSE users still needing to apply a patch to get
the behavior they want.
It didn't appear on Andrew's list, but the removal of devfs has also been a
discussion item. Andrew didn't entirely like the full patch set which
completely removed devfs from the kernel; he wondered what would happen if
enough people complained and devfs had to be restored at some point in the
future. So the current approach is to simply remove the devfs
configuration option, making the functionality inaccessible. Eventually,
if no major problems turn up, the code can be removed for real.
Greg Kroah-Hartman has gotten 2.6.13 off to a good start with
a massive set of driver core
patches. There are a fair number of API changes that come with this
patch set, so the whole thing is worth a look. In-tree code has been fixed
to use the new API, but, as always, maintainers of external code are on
their own.
Two of the more significant changes were covered here last March. The interfaces have
not changed since then, so that coverage will not be duplicated. The first
of these changes is the complete rework of the "class" API. The interface
known as "class_simple" turned out to be the best way to work with classes,
so Greg reworked it as the class API, changing everything as he
went. The interface known as class_simple is no more, but the new class
API looks much like class_simple used to. The other change is the addition
of the "klist" type: an extension to the kernel linked list type which
includes its own, built-in reference counting and locking.
The next change is in the prototypes of the store() and
show() callbacks for device attributes. These callbacks now look
like:
    ssize_t (*show)(struct device *dev, struct device_attribute *attr,
                    char *buf);
    ssize_t (*store)(struct device *dev, struct device_attribute *attr,
                     const char *buf, size_t count);
In each case, the callbacks have picked up a pointer to the actual
attribute being accessed, allowing one callback to handle multiple
attributes.
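As a rough illustration of why the extra pointer helps, here is a user-space sketch in which one show() callback serves three attributes. The structures are minimal stand-ins for the kernel's, and struct channel_attribute, to_channel_attr(), channel_show() and the temps[] field are all hypothetical names invented for this example; only the show() prototype comes from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>   /* ssize_t */

/* Minimal user-space stand-ins for the kernel types (assumptions). */
struct device {
    int temps[3];                /* hypothetical per-channel readings */
};

struct device_attribute {
    const char *name;
    ssize_t (*show)(struct device *dev, struct device_attribute *attr,
                    char *buf);
};

/* Hypothetical wrapper carrying per-attribute data. */
struct channel_attribute {
    struct device_attribute dev_attr;
    int channel;
};

/* Recover the wrapper from the embedded attribute (container_of style). */
#define to_channel_attr(a) \
    ((struct channel_attribute *)((char *)(a) - \
        offsetof(struct channel_attribute, dev_attr)))

/* One callback for every channel; attr says which one is being read. */
ssize_t channel_show(struct device *dev, struct device_attribute *attr,
                     char *buf)
{
    struct channel_attribute *ca = to_channel_attr(attr);

    assert(ca->channel >= 0 && ca->channel < 3);
    return sprintf(buf, "%d\n", dev->temps[ca->channel]);
}

struct channel_attribute channel_attrs[3] = {
    { { "temp0", channel_show }, 0 },
    { { "temp1", channel_show }, 1 },
    { { "temp2", channel_show }, 2 },
};
```

With the old prototype, a driver in this position would have needed three nearly identical callbacks, one per channel.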
There have been a number of internal changes to device model data
structures which really shouldn't affect other code, but which might
anyway. Various internal lists have been removed; in some cases, they have
been replaced with klists. And a number of character pointers are now
explicitly const pointers.
Code wanting to look through the devices bound to a driver can use a new
function to iterate through the list:
    int driver_for_each_device(struct device_driver *driver,
                               struct device *start,
                               void *data,
                               int (*fn)(struct device *, void *));
This function will call fn() for each device bound to the given
driver, stopping at the end of the list or when fn()
returns a non-zero value.
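The calling pattern might look something like the user-space sketch below. The structures and the body of driver_for_each_device() are mocks written for this example, not the kernel's implementation; in particular, the handling of start (NULL means the head of the list, otherwise iteration begins after that device) and the return value are assumptions of the sketch. The callback count_until_suspended() and the suspended field are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for the kernel structures (assumptions). */
struct device {
    int id;
    int suspended;           /* hypothetical per-device state */
    struct device *next;     /* next device bound to the same driver */
};

struct device_driver {
    struct device *devices;  /* head of the bound-device list */
};

/*
 * Mock using the prototype quoted above: call fn() on each bound
 * device, stopping at the end of the list or at the first non-zero
 * return. Assumed here: NULL start begins at the head, a non-NULL
 * start begins after that device, and the result of the last fn()
 * call is returned.
 */
int driver_for_each_device(struct device_driver *driver,
                           struct device *start,
                           void *data,
                           int (*fn)(struct device *, void *))
{
    struct device *dev = start ? start->next : driver->devices;
    int ret = 0;

    assert(fn != NULL);
    for (; dev && ret == 0; dev = dev->next)
        ret = fn(dev, data);
    return ret;
}

/* Callback: count devices, but end the walk at a suspended one. */
int count_until_suspended(struct device *dev, void *data)
{
    int *count = data;

    if (dev->suspended)
        return 1;            /* non-zero return stops the iteration */
    (*count)++;
    return 0;
}
```

The data pointer lets the caller hand arbitrary state (here, a counter) to the callback without resorting to global variables.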
Inodes in sysfs now have an i_op->setattr() function, meaning that
their permissions can be changed and those changes will last for as long as
the system runs. Changing of sysfs permissions was never really supported
in the past; it would work for a bit, but the permissions could be reverted
at seemingly random times. This is not really an API change, but
creators of sysfs attributes should bear in mind that the permissions on
those attributes might be changed from their original values.
Filesystem authors try hard to avoid losing data. Many of them have
discovered, the hard way, that failure to return a user's bits in exactly
the same condition as when they were entrusted to the filesystem can lead
to serious disgruntlement down the road. There are limits to what a
filesystem can do, however, when the hardware starts to fail. If a disk
drive begins to go bad, or somebody yanks out a hotpluggable device,
problems are simply going to happen.
So what should a filesystem do in such a case? The behavior shown by most
Linux filesystems (and partially enforced by the VFS layer) is to return an
I/O error status (EIO) when things start to fail, then remount the
filesystem in a read-only mode in an attempt to avoid any further damage.
The end result is that a user-space application might see an
EIO error return once - or it might not, since not all in-kernel
error codes make it all the way back to user space. After that, the
returned error will be EROFS (read-only filesystem), which is not
entirely illuminating.
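From the application's side, about the best that can be done today is to guess from the error code. The sketch below shows a hypothetical helper, describe_write_error(), which maps the errno from a failed write() to a hint for the user; the wording of the messages is invented for this example, and the EROFS hint is only a guess, since that code says nothing about why the filesystem is read-only.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: turn the errno left by a failed write() into
 * a hint for the user. EROFS after an earlier EIO may mean the
 * filesystem was remounted read-only, but the error code alone
 * cannot confirm that.
 */
const char *describe_write_error(int err)
{
    assert(err > 0);
    switch (err) {
    case EIO:
        return "I/O error: the underlying device may be failing";
    case EROFS:
        return "read-only filesystem: possibly remounted after an I/O error";
    default:
        return strerror(err);
    }
}
```

An application would call this with the saved errno value after write() returns -1. The real cause still has to be dug out of the system log.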
Back in the good old days, we would just look in the system log file to see
what was really going on. The new crowd of Linux users would rather not
have to do that, however; they expect the system to tell them, politely,
that their hardware is on fire and that they are about to deeply regret not
having run any backups since sometime last winter. The problem is that the
POSIX API is simply not set up to return that sort of detailed error
information. Breaking compatibility with POSIX is not an option, so
something complicated would have to be done to return error information
within the bounds of the current API. Beyond that, however, is the simple
fact that the application which is currently beating its head against disk
errors might not be the right one to be having a pleasant conversation with
the user about those errors.
These issues have led Ted Ts'o to suggest
that a different mechanism should be used. Rather than try to shove
additional information through the existing API, the kernel should simply
report events like disk disasters via an out-of-band mechanism. For
example, errors could be reported with the user notification mechanism and
fed into DBus for
distribution. The user could then be informed of the trouble and given the
opportunity to panic in a desktop-specific manner.
There seems to be a high level of agreement that the out-of-band
notification is the right way of doing things. All that is needed is for
somebody to do the hacking to actually make it happen.
Patches and updates
Development tools
- Marco Costalba: qgit-0.6. (June 20, 2005)
Page editor: Jonathan Corbet