Kernel development
Brief items
Kernel release status
The current development kernel is still 2.6.0-test5, which was released back on September 8.
The pile of patches in Linus's BitKeeper repository continues to grow. The most notable change is probably the dev_t expansion (see below); other patches which have been merged include a device mapper update, some NFS updates, a big I2C update, Con Kolivas's and Ingo Molnar's scheduler interactivity patches, a Coda filesystem update, some initramfs tweaks, improvements in random driver locking, the removal of some ext3 debugging hooks, direct I/O support for reiserfs, some CPU frequency work, an Intel SpeedStep-SMI driver, a substantial amount of janitorial work, and various fixes.
The current stable kernel is 2.4.22. Marcelo continues to work on 2.4.23; he released 2.4.23-pre5 on September 21. This prepatch adds some ACPI fixes, an omitted piece of the VM patch set that went into -pre4, and various other fixes.
It remains a relatively slow period in kernel development, so this is not the longest LWN Kernel Page we have ever produced. It was hard, but we have resisted the urge to fill it out with coverage of the latest BitKeeper flame war.
Kernel development news
dev_t expands at last
The expansion of the dev_t device number type has been on the list of goals for 2.6 since the beginning. The only problem is that it has stayed on that list through the entire 2.5 development process; for various reasons, work on that project stalled for a long time. As of September 24, however, the dev_t expansion can be checked off the list; Linus has merged the required changes into his BitKeeper tree. They will appear in the 2.6.0-test6 release.
For some time, it had appeared that dev_t would expand to 64 bits, with 32 bits each for the major and minor numbers. The actual change, however, is to 32 bits, with a 12-bit major number and 20 bits for the minor. That should be adequate for some time, especially given that the new registration mechanisms and sysfs make it much easier for the system to use device numbers more effectively.
Internally, the new kernel dev_t type uses the encoding one would expect: the major number sits in the top twelve bits of a 32-bit value, with the minor number in the bottom 20 bits. The encoding seen by user space is different, however, as shown in the diagram to the right. Here, the major number sits in bits 8-19, while the minor number is split across bits 20-31 and 0-7. This representation may seem strange, but it has one very nice property: old 16-bit device numbers are still valid in the new scheme. Encoding device numbers this way helps keep no end of applications from breaking with the new device number type. One might wonder why this workaround is necessary, given that the C library can convert device numbers as needed for the few system calls (mknod(), stat(), etc.) that actually need them. The problem is that device numbers pop up in a number of other contexts, such as in filesystems and ioctl() calls, where the C library is unable to help.
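The two layouts described above can be sketched in a few lines of C. This is only an illustration built from the bit positions given in the text; the function names are hypothetical and are not taken from the merged patch.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Internal form: 12-bit major in the top bits, 20-bit minor below. */
static inline uint32_t internal_encode(uint32_t major, uint32_t minor)
{
    return ((major & 0xfff) << 20) | (minor & 0xfffff);
}

static inline uint32_t internal_major(uint32_t dev) { return dev >> 20; }
static inline uint32_t internal_minor(uint32_t dev) { return dev & 0xfffff; }

/*
 * User-space form: major in bits 8-19, the low eight bits of the minor
 * in bits 0-7, and the upper twelve bits of the minor in bits 20-31.
 */
static inline uint32_t user_encode(uint32_t major, uint32_t minor)
{
    return (minor & 0xff)
         | ((major & 0xfff) << 8)
         | ((minor & 0xfff00) << 12);
}

static inline uint32_t user_major(uint32_t dev)
{
    return (dev >> 8) & 0xfff;
}

static inline uint32_t user_minor(uint32_t dev)
{
    return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}
```

With this layout, a device with major 8 and minor 1 encodes to 0x0801 in the user-space form, which is exactly what the old 16-bit scheme (major in the high byte, minor in the low byte) would have produced. Only when the minor number exceeds 255 do the upper bits come into play.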
There are places, however, where an explicitly 16-bit value is passed. There is no way to change that without breaking applications. In such cases, the kernel checks whether 16 bits is sufficient; if not, the system call has no choice but to fail with an EOVERFLOW error.
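The check described here might look something like the following. It is a sketch under the assumption that the old format holds an 8-bit major and an 8-bit minor; the helper name is hypothetical, and the real kernel code may be organized differently.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack a device number into the old 16-bit form
 * (major in the high byte, minor in the low byte). Returns 0 on
 * success, or -EOVERFLOW if either number needs more than eight bits.
 */
static inline int encode_old_dev(uint32_t major, uint32_t minor,
                                 uint16_t *out)
{
    if (major > 0xff || minor > 0xff)
        return -EOVERFLOW;
    *out = (uint16_t)((major << 8) | minor);
    return 0;
}
```

A caller stuck with a 16-bit interface would pass the failure back to user space rather than silently truncating the number.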
Beyond that, most of the groundwork for the new dev_t had already been laid over the last few months. There are, however, certain to be a few surprises left after such a fundamental change. The next couple of kernels could be interesting to use while the remaining issues get ironed out.
Selectable I/O schedulers for 2.6
The 2.5 development series saw the creation of a few different I/O schedulers ("elevators") for the block I/O subsystem. I/O schedulers attempt to perform requested block I/O operations in an order that maximizes performance. Given that different people (and applications) measure performance differently, it is not surprising that more than one I/O scheduler exists. So, for example, the "deadline" scheduler attempts to minimize seeks while ensuring that no request waits for more than a certain period of time. The anticipatory scheduler pauses after completing read operations on the assumption that another nearby read will show up quickly. The CFQ ("completely fair queueing") scheduler tries to divide up the available I/O bandwidth equally among processes. And there is a "noop" scheduler for devices (such as memory-based devices) which do not benefit from I/O scheduling logic at all.
What has been lacking is any sort of way for a system administrator to choose between these schedulers. A system I/O scheduler can be designated with the elevator= boot parameter, but that choice applies to all drives on the system, and it cannot be changed. This restriction makes experimenting with the various schedulers difficult; in the real world, it may also be appropriate to use different schedulers for different drives.
So Nick Piggin has released a patch which makes I/O schedulers selectable at run time. With the patch, a new io_scheduler sysfs attribute appears under /sys/block/<device>/queue; changing a scheduler is simply a matter of writing the name of the new scheduler into that attribute. So, for example, to go to CFQ on the first SCSI drive:
echo cfq >/sys/block/sda/queue/io_scheduler
Changing schedulers requires pausing and emptying the I/O queue, so it might not be advisable in the middle of writing a CD or controlling a nuclear power plant shutdown. But it certainly can be a useful thing to do at system initialization time, or while experimenting with scheduler performance under a certain kind of load.
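A program that wants to switch schedulers itself, rather than going through the shell, need only do what echo does: open the attribute and write the name. The following is a minimal sketch; the function name is hypothetical, and the attribute path is passed in by the caller since it only exists on a kernel carrying the patch.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a scheduler name (plus a newline, as echo
 * would) into the given attribute file. Returns 0 on success and -1 if
 * the file cannot be opened or the write fails.
 */
static inline int set_io_scheduler(const char *attr_path, const char *name)
{
    FILE *f = fopen(attr_path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%s\n", name) > 0;
    /* Buffered data is flushed at fclose(), so check it too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The equivalent of the command above would then be set_io_scheduler("/sys/block/sda/queue/io_scheduler", "cfq"), run with sufficient privileges.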
The beginning of the end for devfs
One of the patches that will appear in 2.6.0-test6 is one marking the devfs subsystem as being obsolete. The patch from Christoph Hellwig reads:
Devfs was the subject of countless heated linux-kernel battles in the years leading up to its inclusion in 2.3. It made rather less of a splash afterwards; none of the major distributors have enabled devfs in their kernels, with the (arguable) exception of Gentoo. When a subsystem does not get used, and especially when its maintainer stops working on it, that subsystem's future tends to be dim. Such is the case with devfs. Christoph has said he will continue to fix a few problems, but will do no more with it. 2.6 may be the last major kernel series that includes the devfs subsystem.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Networking
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
