Kernel development

Release status

Kernel release status

The current development kernel is 2.6.0-test4, which was released by Linus on August 22. This large patch includes several hundred changesets, including numerous networking fixes, a new free_netdev() method for networking drivers (see below), a new cpumask_t type for systems with more processors than bits in a long integer, a CONFIG_BROKEN option to control access to drivers known to be broken, a magic, fast new strncpy() implementation, the addition of wireless statistics to sysfs, Twofish and Serpent support for IPSec, a bunch of power management code, new sysfs attributes to control scanning of SCSI devices, a number of IDE patches, a new sysfs "attribute group" mechanism which enables the addition of attributes in a safer way and with less boilerplate code, an ALSA update, and a mind-numbing array of other fixes and updates. See the long-format changelog for the details.

As of this writing, Linus's BitKeeper tree contains only a handful of fixes. Linus is on vacation, so patches are not currently being merged.

The current stable kernel is 2.4.22, released by Marcelo on August 25. Marcelo is not resting, however; he has already put out 2.4.23-pre1, which includes a merge of the IP virtual server code, an LVM update, various driver updates, a possible first step toward the eventual inclusion of XFS, and a number of fixes.

Kernel development news

dev_t expansion status

The expansion of the dev_t type to 32 bits has been stalled for a few months now. Most of the work, it seems, has been done, but the patches have yet to find their way into the mainline kernel. Among other things, the dev_t expansion has been held up waiting for another set of patches from the elusive Alexander Viro. Mr. Viro still only surfaces rarely on the mailing lists, but it seems he has been busy; a set of large dev_t patches has turned up in 2.6.0-test4-mm2.

Many of the patches are essentially cleanups, such as removals of final uses of the kdev_t type which can be replaced with something else. After all, if a piece of code does not use device numbers at all, it should not run into trouble if the size of those numbers changes. Others begin to address more problematic code; for example, the JFFS filesystem incorporates device numbers directly into its on-media data structures; a change in the device number size would make older filesystems unreadable. In this case, for now, the (16-bit) size of this field has been made explicit.

Some of the patches take care of some (seemingly) unrelated block device layer cleanups. A few things, it seems, didn't work quite as well as expected once Al went back and took another serious look at the code.

Then, there is a simple addition to <linux/fs.h>:

	static inline unsigned iminor(struct inode *inode)
	{
		return minor(inode->i_rdev);
	}

This little function is the subject of the largest patch in the series: it replaces references to inode->i_rdev in a vast number of drivers and a few filesystems as well. The purpose, of course, is to allow access to the minor number of the device behind an inode without requiring any knowledge of how that number is actually stored within the inode. Not surprisingly, there is also an imajor() helper function.
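The value of such accessors can be seen in a small userspace sketch. This is not the kernel's code: the toy_ names are hypothetical, and the 8-bit/8-bit split is an assumption that simply mirrors the traditional 16-bit device number layout. The point is only that callers which go through the helpers never touch the stored representation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace toy, not kernel code. All toy_* names are hypothetical, and
 * the 8-bit major / 8-bit minor split is an assumption for illustration.
 */
typedef struct {
	uint16_t value;		/* packed device number */
} toy_kdev_t;

struct toy_inode {
	toy_kdev_t i_rdev;	/* device behind this inode */
};

/* Low-level decoding: the only code that knows the bit layout. */
static inline unsigned toy_major(toy_kdev_t dev)
{
	return (dev.value >> 8) & 0xff;
}

static inline unsigned toy_minor(toy_kdev_t dev)
{
	return dev.value & 0xff;
}

/* Accessors in the style of iminor() and imajor(). */
static inline unsigned toy_iminor(const struct toy_inode *inode)
{
	return toy_minor(inode->i_rdev);
}

static inline unsigned toy_imajor(const struct toy_inode *inode)
{
	return toy_major(inode->i_rdev);
}

/* A driver-style caller: it asks for the minor number and nothing more. */
static inline unsigned toy_unit_from_inode(const struct toy_inode *inode)
{
	return toy_iminor(inode);
}
```

If the type of i_rdev in this toy were changed later, only the two accessors would need to be touched; toy_unit_from_inode() and anything like it would compile unchanged.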

Al mentions another series of patches which have not yet made an appearance. They will include a change to the inode structure, turning the i_rdev field into a dev_t type (from kdev_t). At that point, the addition of all those iminor() and imajor() calls will make sense; code using those calls will be unaffected by the inode structure change. There will also be patches to ensure that the conversion of device numbers between the internal representation and that used on-disk by filesystems is done properly.

So the expanded dev_t project is moving forward once again. This is an important feature to have in 2.6, so this is a good thing. There is, however, a large set of fairly invasive patches coming which may bring a surprise or two when it hits the 2.6.0-test mainline. (The actual patches can be seen in the 2.6.0-test4-mm2 patch, or separately on kernel.org; a good place to start is Al's overview of the patch series).

The ongoing interactive scheduling effort

The interactive scheduling response of the 2.6.0-test kernels is a controversial topic. Some (including your editor) find the recent kernels to be noticeably more responsive than the 2.4 series; others complain loudly. It does seem that, despite the fact that some users are happy, the job is not yet entirely finished.

Con Kolivas has continued to produce his scheduler patches, which concentrate mostly on tweaking the interactivity estimation code. The basic idea remains that, if the system can get a good handle on which tasks are truly interactive, it can then be made to do the right thing. In many cases, that approach appears to work. Andrew Morton has, however, recently called for Con to take a step back and rethink things after being made aware of some significant performance regressions that appear to have been caused by the scheduler patches:

I suggest that what we need to do is to await some more complete testing of the CPU scheduler patch alone from Steve and co. If it is fully confirmed that the CPU scheduler changes are the culprit we need to either fix it or go back to square one and start again with more careful testing and a less ambitious set of changes.

Con did some quick testing and narrowed the problem down to Ingo Molnar's latest interactivity patch. There does not, as yet, appear to be a real understanding of what is going on, however.

Con has also recently posted a lengthy document on how the scheduler works and what changes his patches have made.

Nick Piggin is, perhaps, best known for scheduling disks - he is the author of the anticipatory I/O scheduler in 2.6.0-test. Nick recently decided to get into the CPU scheduler tuning game, and has started posting patches; his most recent is Nick's scheduler policy v7. These patches take a different approach, starting by hacking out almost all of the code that tries to calculate interactivity. They remove almost as much code as they add.

The key part of Nick's policy seems to be the manipulation of time slices. Processes at different priority levels get very different time slices - much more so than with the current scheduler. Time slices also depend on what else is running; if there aren't any high priority processes waiting to run, lower-priority processes will get larger slices. Process priorities also vary more quickly, allowing processes which sleep a lot to get back onto the CPU quickly. Finally, this patch restores the "priority transfer" idea: when one process wakes another, a portion of the waking process's priority (and time slice) is given over to the process being awakened. This feature helps to keep the X server responsive. With Nick's patch, the X server benefits from being given a higher priority; this is not the case with Con's scheduler patches.
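The time slice and priority transfer ideas can be illustrated with a toy model. This sketch is not taken from Nick's patch; every name, constant, and formula here is a made-up assumption (a 10ms base slice, a doubling when nothing more important is waiting, a gift of half the remaining slice and one priority step on wakeup), chosen only to make the behavior described above concrete.

```c
#include <assert.h>

/* Hypothetical constant; the real patch uses its own values. */
#define TOY_BASE_SLICE_MS 10

struct toy_task {
	int prio;	/* larger means more important in this toy */
	int slice_ms;	/* remaining time slice */
};

/*
 * Slices grow steeply with priority. If no waiting task has a higher
 * priority than this one, the slice is stretched further.
 */
int toy_timeslice(int prio, int highest_waiting_prio)
{
	int slice = TOY_BASE_SLICE_MS * (prio + 1);

	if (highest_waiting_prio <= prio)
		slice *= 2;
	return slice;
}

/*
 * Priority transfer: the waker gives half of its remaining slice to the
 * task it wakes, plus one priority step if it outranks that task.
 */
void toy_wake(struct toy_task *waker, struct toy_task *wakee)
{
	int gift = waker->slice_ms / 2;

	waker->slice_ms -= gift;
	wakee->slice_ms += gift;
	if (waker->prio > wakee->prio) {
		waker->prio--;
		wakee->prio++;
	}
}
```

In this model, a client that wakes a display server hands over part of its own allotment, so the server can run promptly without any attempt to guess whether it is "interactive."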

Getting scheduling right is hard, as can be seen by the amount of effort being put into the problem. By many accounts, 2.6 will be better than earlier kernels in this regard. But it would not be surprising if developers were still trying to improve it long after 2.6.0 is released.

Freeing network devices safely

Recent development kernels include a great deal of networking information under /sys/class. For the moment, it is mostly physical layer stuff, but one should expect more information to show up there over time, as it migrates out of /proc/sys. The current networking sysfs files draw their information from the interface's associated net_device structure. That scheme works nicely, in that network drivers need not concern themselves with providing the sysfs infrastructure; it just sort of happens. But consider what happens if a suitably privileged user executes something like:

	rmmod e100 < /sys/class/net/eth0/statistics/tx_bytes

This command will keep the indicated sysfs file open past the time when the module containing the net_device structure behind that file is removed from the system. Unless special care is taken, the open file will be left pointing to structures which no longer exist, leading to all kinds of potential trouble. Most drivers do not take that care.

Until 2.6.0-test4, that is. After a series of patches by Stephen Hemminger, drivers are expected to use kmalloc() to create net_device structures dynamically. Most drivers already worked that way; the difference now is that drivers can no longer just return those structures with kfree() when they are no longer needed. Instead, there is a new function which is used to get rid of a net_device structure:

	void free_netdev(struct net_device *dev);

This function, of course, helps the networking system maintain reference counts for net_device structures, and avoid freeing them until they are truly unused. This whole structure is relatively simple, but it demonstrates, again, the higher level of care required to avoid creating race conditions in the 2.6 kernel.
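The idea of deferring the free until the last user is gone can be sketched in userspace. This is a toy model, not the networking code: the toy_ names are hypothetical, the counter is a plain int with no locking, and the real kernel's bookkeeping is certainly more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace toy; all names are hypothetical and nothing here is thread-safe. */
struct toy_netdev {
	int refcount;		/* outstanding users, e.g. an open sysfs file */
	bool unregistered;	/* the driver has asked for the device to go away */
};

int toy_live_devices;		/* allocations not yet released */

struct toy_netdev *toy_alloc_netdev(void)
{
	struct toy_netdev *dev = calloc(1, sizeof(*dev));

	if (dev) {
		dev->refcount = 1;	/* the driver's own reference */
		toy_live_devices++;
	}
	return dev;
}

/* Taken by anybody who needs the structure to stay around. */
void toy_hold(struct toy_netdev *dev)
{
	dev->refcount++;
}

/* Drop one reference; the memory goes away only with the last one. */
void toy_put(struct toy_netdev *dev)
{
	if (--dev->refcount == 0) {
		toy_live_devices--;
		free(dev);
	}
}

/* What the driver calls in place of freeing the structure directly. */
void toy_free_netdev(struct toy_netdev *dev)
{
	dev->unregistered = true;
	toy_put(dev);
}
```

In the rmmod scenario above, the open sysfs file would correspond to a toy_hold() call; the driver's toy_free_netdev() then leaves the structure alive until the file is closed and the matching toy_put() happens.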

Patches and updates

Core kernel code

  • Con Kolivas: O18int. (August 22, 2003)
  • Con Kolivas: O18.1int. (August 24, 2003)

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds