Kernel release status
The current development kernel is 2.5.72, which was
released by Linus on June 16. This
relatively small patch contains an x86-64 merge, a partial reversion of the
IDE taskfile switchover, a PA-RISC update, and various fixes and cleanups.
The long-format changelog has the details.
Linus had released the 2.5.71 ("sticky
turtle") kernel only two days before. This long-awaited patch included a
fair amount of driver model work, some extensive PCI bus cleanups (dealing
with potential race conditions there), the big IDE changeover to taskfile
I/O, a new /proc/kallsyms file, support for per-CPU variables in
modules, a change to the kmalloc_percpu() interface, an Atmel
at76c50x wireless driver, a long-sought fix for hanging TCP sessions, an
improved slab allocator which performs better in busy, multi-processor
situations, some kbuild tweaks, an ALSA update, a set of hash function
changes to deal with algorithmic complexity attacks, a FAT filesystem
rework (if you have been waiting to be able to create FAT partitions
greater than 128GB, this patch is for you), a v850 subarchitecture merge, a
RAID update, the removal of the long-deprecated callout TTY device
(/dev/cua) support, numerous architecture updates, and several
other fixes and updates. As always, the
long-format changelog has the gory details.
Linus's BitKeeper tree contains an extensive ext3 and JBD rework (see
below), an OProfile update, some NFS server fixes, and a few other fixes
and updates.
With the 2.5.72 announcement, Linus announced that he is taking a leave of
absence from Transmeta to go work at the Open Source Development Lab.
"Transmeta has always been very good at letting me spend even an
inordinate amount of time on Linux, but as a result I've been feeling a
little guilty at just how little 'real work' I got done lately. To fix
that, I'll instead be working at OSDL, finally actually doing Linux as my
main job."
The current stable kernel is 2.4.21, released, at last, on June 13. There were
no changes since -rc8.
No 2.4.22 prepatches have come out yet. Marcelo's plan, at this point, is to have 2.4.22 contain
an updated aic7xx driver and the current ACPI tree (both items that people
had wanted in 2.4.21), along with some interactivity and memory management
fixes.
Kernel development news
What's needed to fix user-space device enumeration?
Back in April, LWN
looked at udev, a simple
user-space daemon which handles the dynamic creation and removal of device
nodes. Udev is an answer to devfs which uses hotplug events and sysfs to
manage the device tree in user space. Things have been fairly quiet on the udev front -
at least, on the public lists. That changed, however, when Steven Dake
posted
a patch aimed at fixing some problems
he sees with how udev works. At that point, it became clear that an
off-list discussion had been going on for some time.
Mr. Dake has a list of four problems that he is trying to fix with his
patch, which creates an event queue within the kernel and a virtual device
for retrieving events from that queue. These problems are:
- The current implementation (which invokes /sbin/hotplug for
each device event) has performance problems when the number of devices
is large.
- There is no policy controlling how many /sbin/hotplug
processes can be created simultaneously, a shortcoming which can lead
to out-of-memory situations.
- /sbin/hotplug is not available during the early part of
the system initialization process, so early device enumeration is
not possible.
- Hotplug events can be processed out of order, leading to device
directory corruption.
The posting elicited some strongly-worded
responses. The general view is that the first three of the problems
listed above do not actually exist. The cost of /sbin/hotplug is
small relative to the cost of device probing and initialization, so, in the
real world, system load and performance are not problems. Early
initialization can be handled with initramfs or by reconstructing things in
user space from the sysfs tree. The hotplug developers thus feel no
pressure to "fix" any of those problems. Linus also chimed in with a condemnation of event daemon
schemes.
When the dust settled, however, the problem of event reordering remained.
Device events can come quickly, and the vagaries of scheduling, page
faults, etc. can cause them to be processed in an order different from that
in which they were generated. Some fairly complicated schemes were
presented for dealing with this problem, but they were set aside when
Andrew Morton suggested the (in retrospect)
obvious: add a sequence number to hotplug events. With a unique,
increasing sequence number, it is simple for a user-space process to detect
(and fix) misordered events. Problem solved.
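To see why a sequence number is enough, here is a minimal user-space sketch of a reordering buffer. It is not taken from any posted patch: the hp_event structure, the hp_* function names, and the fixed HP_WINDOW limit are all assumptions made for illustration, and sequence-number wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define HP_WINDOW 16    /* hypothetical limit on early arrivals we will hold */

/* Illustrative view of a hotplug event; not the kernel's real interface. */
struct hp_event {
    unsigned long seq;      /* unique, increasing sequence number */
    const char *action;     /* for example "add" or "remove" */
};

struct hp_reorder {
    unsigned long next_seq;             /* next number we may deliver */
    struct hp_event held[HP_WINDOW];    /* events that arrived early */
    int present[HP_WINDOW];             /* nonzero if the slot is occupied */
};

void hp_init(struct hp_reorder *r, unsigned long first_seq)
{
    size_t i;

    assert(r != NULL);
    r->next_seq = first_seq;
    for (i = 0; i < HP_WINDOW; i++)
        r->present[i] = 0;
}

/*
 * Accept one event in arrival order.  Events that have become deliverable
 * are copied to out[] in sequence order.  Returns the number delivered
 * (0 if the event had to be held back), or -1 if the event is stale, a
 * duplicate, or too far ahead of next_seq to be held.
 */
int hp_submit(struct hp_reorder *r, struct hp_event ev,
              struct hp_event *out, size_t out_len)
{
    size_t slot, n = 0;

    assert(r != NULL && out != NULL);
    if (ev.seq < r->next_seq || ev.seq - r->next_seq >= HP_WINDOW)
        return -1;
    slot = ev.seq % HP_WINDOW;
    if (r->present[slot])
        return -1;
    r->held[slot] = ev;
    r->present[slot] = 1;

    /* Deliver the contiguous run that starts at next_seq, if any. */
    while (n < out_len) {
        slot = r->next_seq % HP_WINDOW;
        if (!r->present[slot])
            break;
        out[n++] = r->held[slot];
        r->present[slot] = 0;
        r->next_seq++;
    }
    return (int)n;
}
```

If events 1, 3, and 2 arrive in that order, event 3 is held until event 2 shows up, at which point both are handed over in the right order. A real handler would also need a timeout so that a lost event does not stall delivery forever.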
Avoiding sysfs surprises
One of the nice (and increasingly important) features of the 2.5 device
model is sysfs. This virtual filesystem exports a view of the system's
structure to user space; it also provides a nice control interface - and
/proc replacement - by allowing attributes to be attached to sysfs
entries. Sysfs is not without its traps, however, and many kernel
developers are just now beginning to realize the sort of care that is
necessary to avoid making mistakes.
The hardware supported by Linux is increasingly dynamic; devices can appear
and disappear at any time. The sysfs filesystem adjusts itself in response
to hardware events by creating and removing directories associated with
devices, classes, and other objects. Kernel code typically implements this
functionality by allocating (and registering) device structures and other
objects when a device is plugged in, and deleting those structures when the
device is removed. It tends to work quite well.
But consider the following possible sequence of events:
- A user plugs in a shiny new hotplug PCI frobnicator.
- The driver creates a device structure and registers it; as a result,
the directory /sys/devices/pci0/00:11.0/ (or some such) gets
created and filled with attributes.
- A user process moves into that directory, opens one of the attribute
files, but doesn't get around to reading it yet.
- The user, having done enough frobnication for one day, unplugs the
device.
- The driver unregisters and frees the device structures.
All seems well, except for the small problem of that user process. By
sitting in the directory, it maintains a reference there. The open
attribute file is yet another reference. If the driver has truly cleaned
up and freed the devices, the user process will be holding structures with
pointers into freed memory. An attempt to read the (already open)
attribute file at this point is almost certain to crash the system.
The above scenario is not hypothetical; a fair number of such conditions
exist in the 2.5 kernel now. That is why this issue (titled "kobject
refcounting") appears in the 2.6 must-fix
list. It truly must be fixed.
The infrastructure exists to handle these problems, but it must be used
properly to be effective. The solution lies in the same place as the
problem - the kobject structure. The 2.5.72 version of this
structure looks like:
    struct kobject {
        char name[KOBJ_NAME_LEN];
        atomic_t refcount;
        struct list_head entry;
        struct kobject *parent;
        struct kset *kset;
        struct kobj_type *ktype;
        struct dentry *dentry;
    };
Entries in sysfs are closely tied to kobjects; there is a kobject
associated with each directory in the filesystem. When a process moves
into a sysfs directory or opens a sysfs file, the associated kobject has
its refcount field incremented. As long as the reference count is
above zero, the kobject cannot be deleted.
The same kobjects, of course, are embedded deeply within the structures
used to represent devices and other system objects. So a nonzero reference
count in a kobject means that the entire device structure (and, perhaps,
the module infrastructure supporting it) is still in use. Safely putting
things into sysfs is really just a matter of not deleting objects until
their reference counts hit zero.
Of course, that is easily said, but the current mechanism for implementing
such a policy is not entirely obvious. An example might help, so we'll
look at the block subsystem, which does things right. Disks, within the
kernel, are represented by the gendisk
structure. The function used to create a gendisk is
alloc_disk(), which, after allocating and initializing a
gendisk structure (which contains a kobject), executes this
mysterious line of code:
kobj_set_kset_s(disk,block_subsys);
This line tweaks the kobject within disk (the gendisk
structure) to make it a part of block_subsys. The block subsystem
structure, in turn, contains a pointer to a kobj_type structure,
which, in this case, looks like:
    static struct kobj_type ktype_block = {
        .release = disk_release,
        .sysfs_ops = &disk_sysfs_ops,
        .default_attrs = default_attrs,
    };
We'll come back to this structure in a moment. For now, suffice it to say
that it identifies the kobject (and the gendisk structure that
contains it) as something belonging to the block code, and provides some
methods implementing the object's operations.
The function which puts a new disk into the system is add_disk();
it creates the associated sysfs structure, and increments the disk's
reference count. The disk then goes through its lifecycle, with the
reference count going up and down as it is mounted and unmounted, and as
its sysfs files are accessed. Should the disk disappear, the driver will
do some cleanup and call del_gendisk() to return the
gendisk structure to the system.
del_gendisk() does not actually free the structures, however. It
removes the sysfs entries and generally shuts things down; it then finishes
by decrementing the reference count. That operation releases the reference
which was first obtained in add_disk(). The driver also must
release its own reference with put_disk(). These operations may
drop the reference count to zero - if nobody else is holding a reference to
the disk. But there is no way to know ahead of time.
Sooner or later, however, the last reference will go away. The function
which actually decrements the count (kobject_put()) tests that
count for zero. If no references remain, kobject_put() will go
back to the kobj_type structure associated with the kobject
(the ktype_block we saw above, in the case of a gendisk) and
call the release() method found there. That method, knowing that
nobody is referring to the object, can actually remove it from the system.
That is how sysfs objects must be managed. They must have a
destructor associated with them, by way of the kobj_type
structure, and that destructor must understand the higher-level objects
that it is dealing with. With this mechanism in place, objects will
continue to exist as long as references to them are held.
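The pattern can be modeled outside the kernel in a few lines. The following is a user-space sketch only: struct obj, struct obj_type, struct disk, and the obj_* and *_sketch functions are hypothetical stand-ins for the kobject, kobj_type, and gendisk code, and a plain int takes the place of the kernel's atomic_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct obj;

/* Stand-in for kobj_type: it carries the destructor. */
struct obj_type {
    void (*release)(struct obj *o);
};

/* Stand-in for kobject: a reference count plus a pointer to its type. */
struct obj {
    int refcount;                   /* the kernel uses atomic_t here */
    const struct obj_type *type;
};

void obj_init(struct obj *o, const struct obj_type *t)
{
    o->refcount = 1;                /* the creator holds the first reference */
    o->type = t;
}

struct obj *obj_get(struct obj *o)
{
    o->refcount++;
    return o;
}

/* Drop one reference; the last one out calls the type's release method. */
void obj_put(struct obj *o)
{
    assert(o->refcount > 0);
    if (--o->refcount == 0 && o->type && o->type->release)
        o->type->release(o);
}

/* A higher-level object with the counted object embedded inside it. */
struct disk {
    int minors;
    struct obj obj;
};

/* Recover the containing disk from a pointer to its embedded obj. */
#define to_disk(o) ((struct disk *)((char *)(o) - offsetof(struct disk, obj)))

int disks_released;                 /* counts destructor calls, for the demo */

void disk_release_sketch(struct obj *o)
{
    free(to_disk(o));               /* only here is the memory given back */
    disks_released++;
}

const struct obj_type disk_type = { .release = disk_release_sketch };

struct disk *disk_alloc_sketch(void)
{
    struct disk *d = malloc(sizeof(*d));

    if (d == NULL)
        return NULL;
    d->minors = 1;
    obj_init(&d->obj, &disk_type);
    return d;
}
```

If a second reference is taken with obj_get() (as when a process opens an attribute file) and the creator then calls obj_put(), the disk survives; it is freed only when the remaining holder calls obj_put() as well. The destructor is the only place that knows the object is really a struct disk, which is why it must live with the higher-level code.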
Of course, things can get more complicated than that. If, for example, a
module adds attributes to sysfs entries, that module cannot be removed
until it is certain that all of the relevant references have gone away.
It gets even worse if kernel code tries to attach attributes to objects
which it does not own; in that case it can be very hard to get everything
right. It may eventually prove necessary to rework some of the sysfs
interfaces to make it easier to avoid mistakes, but that seems unlikely for
2.5 at this point. In the meantime, connecting the pieces together
correctly can be an intimidating task the first time around, but the
alternative is to put denial of service vulnerabilities into the kernel.
Big changes to ext3 and journaling
The ext3 filesystem is, for many, the standard journaling filesystem for
the Linux kernel. So it has been somewhat embarrassing that ext3 still
uses a number of deprecated interfaces, including the big kernel lock and
sleep_on(). The big kernel lock (BKL) is a holdover from the
initial Linux symmetric multiprocessing implementation, when it was not
safe for more than one processor to run in the kernel at the same time.
Its presence in ext3 is not just considered archaic and inelegant; it is
also a serious performance constraint on larger SMP systems.
As of 2.5.73, the BKL has been abolished from ext3, thanks to a lengthy
series of patches by Andrew Morton and Alex Tomas. These patches never did
show up on linux-kernel, but they have been part of the -mm kernel tree for
some time. Says Andrew:
My gut feeling is that there should be one, maybe two bugs left in
it, but no problems have been discovered...
So, as with all development kernels, a bit of caution is called for.
Removing the BKL from ext3 was actually a simple thing to do. That
filesystem, itself, had no need for the BKL - it is the generic journaled
block device (JBD) layer that required that protection. So the first step
was to push the BKL
down a layer, and ext3 was BKL-free. Of course, that didn't solve the real
problem, but it was a start. While ext3 was being worked on, a few other
patches went in:
- Concurrent block and inode allocation, much like ext2 has had for
some time. This patch puts a separate spinlock on each cylinder group
in a filesystem, allowing allocation to happen in multiple groups
simultaneously.
- "Fuzzy counters," which implement approximate counters for free
blocks and inodes using per-CPU variables.
- The ext3 "data=journal" mode has been fixed. This mode,
which journals all data written to the disk (rather than just the
metadata) has been broken for a long time.
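The idea behind an approximate counter can be shown with a small single-threaded model. This is a sketch of the general technique, not the code that was merged: the fuzzy_* names, the NR_CPUS_SKETCH and FUZZ_BATCH constants, and the explicit cpu argument are all assumptions made for illustration. Each CPU accumulates changes in its own slot and folds them into the shared total only when they grow past a batch size, so the shared value is touched rarely.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4    /* hypothetical CPU count */
#define FUZZ_BATCH     8    /* fold a local delta into the total at this size */

struct fuzzy_counter {
    long global;                    /* shared, approximate value */
    long local[NR_CPUS_SKETCH];     /* per-CPU deltas not yet folded in */
};

void fuzzy_init(struct fuzzy_counter *c, long initial)
{
    int i;

    c->global = initial;
    for (i = 0; i < NR_CPUS_SKETCH; i++)
        c->local[i] = 0;
}

/* Apply a change on behalf of one CPU; usually touches only its own slot. */
void fuzzy_add(struct fuzzy_counter *c, int cpu, long delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    c->local[cpu] += delta;
    if (c->local[cpu] >= FUZZ_BATCH || c->local[cpu] <= -FUZZ_BATCH) {
        /* Real concurrent code would need a lock or atomic update here. */
        c->global += c->local[cpu];
        c->local[cpu] = 0;
    }
}

/* Cheap read; off by less than NR_CPUS_SKETCH * FUZZ_BATCH. */
long fuzzy_read(const struct fuzzy_counter *c)
{
    return c->global;
}

/* Exact read; must visit every CPU's slot. */
long fuzzy_read_exact(const struct fuzzy_counter *c)
{
    long sum = c->global;
    int i;

    for (i = 0; i < NR_CPUS_SKETCH; i++)
        sum += c->local[i];
    return sum;
}
```

The trade-off is that a quick read may lag the true value by a bounded amount, which is acceptable for a free-block estimate as long as the exact value is computed when it matters.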
With ext3 done, it was time to fix up the JBD layer. This job was not done
halfway - a lengthy series of patches adds several locks and a whole,
complicated, fine-grained locking scheme. Each transaction gets two separate locks
(t_handle_lock and t_jcb_lock) controlling access to
various data structures. There is another set for the journal:
j_state_lock for scalar state information, j_list_lock
for lists and buffers, and j_revoke_lock for the list of revoked
blocks. Two more locks protect aspects of the buffer head/journal
head combination. And, of course, there is a whole set of ordering rules
to control which locks must be taken before which others. Believe it or
not, there is even a certain amount of documentation in the code comments
describing which locks protect which data structures.
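Ordering rules of this kind are often checked by giving each lock a rank and refusing any acquisition that goes against the ranks. The sketch below shows that general technique in user space. The rank values are invented for illustration and do not reflect the actual JBD ordering; the rank_* functions and the lock_stack structure are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Invented ranks: a lower rank must be taken before a higher one. */
enum lock_rank {
    RANK_STATE  = 1,    /* stands in for something like j_state_lock */
    RANK_LIST   = 2,    /* stands in for something like j_list_lock */
    RANK_REVOKE = 3     /* stands in for something like j_revoke_lock */
};

#define MAX_HELD 8

/* The ranks currently held by one thread, in acquisition order. */
struct lock_stack {
    enum lock_rank held[MAX_HELD];
    int depth;
};

/* Returns 1 if taking this rank respects the ordering, 0 if it does not. */
int rank_acquire(struct lock_stack *s, enum lock_rank r)
{
    assert(s != NULL);
    if (s->depth == MAX_HELD)
        return 0;
    if (s->depth > 0 && s->held[s->depth - 1] >= r)
        return 0;       /* would take a lock out of order */
    s->held[s->depth++] = r;
    return 1;
}

/* Release the most recently acquired rank. */
void rank_release(struct lock_stack *s)
{
    assert(s != NULL);
    if (s->depth > 0)
        s->depth--;
}
```

With a consistent ranking, two threads can never each hold a lock the other is waiting for, which is the deadlock the ordering rules exist to prevent.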
The whole body of work clearly needs wider testing (and benchmarking), so
it's probably a good time for it to go into the mainline kernel. Hopefully
there won't be too many surprises lurking for the unwary (or unbacked-up).
As this work stabilizes, however, another big item can be scratched off the
"must-fix" list.
Page editor: Jonathan Corbet