Kernel development

Kernel release status

The current development kernel is 2.6.0-test5, which was released by Linus on September 8. Changes this time include a new, type-safe ioctl() command code checker (see below), a USB "gadget" framework which enables the creation of user-space drivers, a new CONFIG_64BIT configuration option, a number of futex improvements, a reworked de4x5 driver, "very basic" VIA 8237 serial ATA controller support, support for a software-implemented hard disk activity LED, Intel High Precision Event Timers support, Al Viro's first set of large dev_t support patches (covered here two weeks ago), and his second set (which fixes up filesystems and removes the kdev_t type) as well, some IDE work, a large USB update, lots of network driver fixes, a new set of iptables modules, and many other fixes. The long-format changelog has all the details.

Linus's BitKeeper tree contains a number of patches including some initramfs tweaks, improvements in random driver locking (which was "consuming 60% of CPU resources in Anton's monster power5 boxes"), the removal of some ext3 debugging hooks, direct I/O support for reiserfs, some CPU frequency work, an Intel SpeedStep-SMI driver, and various fixes.

The current stable kernel is 2.4.22; Marcelo has not released any 2.4.23 prepatches since 2.4.23-pre3 on September 3.

Kernel development news

kdev_t is no more

Al Viro's second set of patches aimed at enabling the support of a larger dev_t type has been merged into the 2.6.0-test5 kernel. The bulk of the work is fixing up code in filesystems which made assumptions about the size of dev_t. As part of this whole process, however, Al has been converting kernel code from the kdev_t type over to using dev_t directly.

kdev_t, of course, was introduced several major releases ago as a way of hiding the actual structure of device numbers. The comments in <linux/kdev_t.h> read:

As a preparation for the introduction of larger device numbers, we introduce a type kdev_t to hold them. No information about this type is known outside of this include file.

In practice it didn't work quite that way. When Linus changed the format of kdev_t early in the 2.5 development series, everything broke. And when the time came to really change the size of dev_t, it turned out to be easier and more clear to simply use dev_t directly. Kernel hackers tend to be skeptical of abstraction interfaces which are created without being immediately useful; kdev_t is an example of why that is so.

The seventh patch (of 15) in Al Viro's second dev_t series changes the type of the much-used i_rdev inode structure field; it is, of course, a dev_t now. Since Al had already converted users of that field over to the new iminor() and imajor() macros, the effect of this change was small. But, as it turns out, i_rdev was the last kdev_t object in the kernel. So patch eight removed the type altogether.

Out-of-tree drivers will, of course, be broken as a result of this change, but the fixes should not be that difficult. At this point, the bulk of the large dev_t preparation work should be done. About all that's left is to decide what the format of the new dev_t will really be and make the change. Once the dust settles, another one of the 2.6.0 "must fix" items will have been taken care of.

Straightening out ioctl() size confusion

The ioctl() system call includes a general "command" argument which specifies which operation the calling program wishes to perform. The Linux kernel has long had a mechanism for defining these command arguments, with the goal of keeping them all unique. If no two drivers implement the same command codes, there is no danger of strange things happening if the wrong code is passed to the wrong driver. A world where "rewind the tape" for one driver never translates to "initiate self destruct" for another is a safer place to be for all of us.

The Linux kernel takes things a little further by encoding some useful information in the command codes. Along with driver-specific "magic" and command numbers, the ioctl() command code includes the direction of data movement (if any) between kernel and user space and the size of the data to be moved. The kernel itself does not do anything with those values, but their presence does enable a driver to perform some checks. If, for example, the size of a structure used as an ioctl() argument changes, the driver can use the size field in the command code to determine whether the application is using the older version or not. Some kernel code actually does check the sizes to be sure that things match up.

The command codes are created using some macros in <asm/ioctl.h>. A driver defining codes would use one of these macros:

    _IOR(type, number, size)
    _IOW(type, number, size)
    _IOWR(type, number, size)

The macro used specifies whether the ioctl() operation reads or writes kernel-space data (or both); type is the driver's "magic" code, and number is the command-specific code. The confusion comes in with the argument called size; it is supposed to be the type of the data to be passed between kernel and user space. So, for example, the "get tape position" code is defined as:

    #define MTIOCPOS _IOR('m', 3, struct mtpos)

The problem is that a number of hackers saw the size argument and assumed that they were expected to pass the size of the expected data transfer. The result was a number of definitions like:

    #define CIOC_KERNEL_VERSION _IOWR('c', 10, sizeof (int))

As a result, the actual size value, as encoded within the command, was the size of the size value, or, on most architectures, four bytes. Since most code never looks at that size value, things worked, but the values defined were not as intended. Another problem that occasionally came up was that some code used very large size values, overflowing the space allotted in the command word, thus corrupting the rest of the command code. Once again, things worked, but not quite in the way people expected.

One of the themes of 2.6 development has been the addition of type checking anywhere that the compiler can be coerced into doing it. So the obvious thing to do was to add checking to the generation of command codes; Arnd Bergmann submitted a patch which does exactly that. It adds a bit of preprocessor magic in the form of this macro:

    #define _IOC_TYPECHECK(t) \
        ((sizeof(t) == sizeof(t[1]) && \
          sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
          sizeof(t) : __invalid_size_argument_for_IOC)

The first test ensures that an actual type (as opposed to a simple size) has been passed in; the second makes sure it is not too large.

All that remains is the inconvenient fact that the old, erroneous codes have found their way into a number of application programs. Changing those codes would break those applications, and that's something the kernel hackers try never to do. So, for these cases, a new set of macros (with names like _IOW_BAD()) has been introduced, and the erroneous uses have been moved over to the new macros. The command codes remain unchanged, but the mistake is noted so that it is not replicated when somebody copies the code in question.

A wealth of suspend options

Patrick Mochel has posted a new set of power management patches. Power management is, of course, one of the last unfinished projects in the 2.6.0-test kernel, so developments in that area are of interest.

Much energy has gone into the suspend-to-disk implementation. Patrick has been unable to come to an understanding with (2.6) swsusp maintainer Pavel Machek; rather than keep trying, he has chosen to create his own implementation (starting with swsusp) called "pmdisk." Should Linus accept the patches, the 2.6.0-test kernel will have two separate, competing implementations of the suspend-to-disk functionality. The swsusp version has been reverted to its previous state; the patch includes the comment "Note that I would never publically admit to putting such code into the kernel."

The new pmdisk implementation has since seen some fixes, though it still does not work on SMP systems, and apparently will not for some time. There is a /sys/power/state file used to control pmdisk; writing "disk" to that file will cause the system to suspend itself to disk. Beyond that, pmdisk is still mostly the swsusp implementation with a lot of cleanup work and the names of the functions and variables changed.

One remaining question with the suspend-to-disk functionality is what will happen to all of Nigel Cunningham's work. Nigel has put a great deal of effort into the 2.4 swsusp implementation, with the result that it has become a reliable option for many users; see our review of that work from August. Nigel would like to port his work forward to 2.6, but is uncertain about what to port to.

This whole situation could be resolved by Linus, who has not yet accepted the "fork swsusp" patch. Releasing a 2.6.0 kernel with two different suspend implementations seems like a suboptimal course which could reflect poorly on the Linux development process. Linus has made no public noises to this effect, but it would not be surprising if he imposed some sort of solution that led to a single suspend subsystem in 2.6.0.

Modules move into sysfs

Greg Kroah-Hartman has posted a patch with the rather uninspiring title of "add kobject to struct module." What the patch really does, however, is enable the creation of a /sys/module directory which will contain information about the modules currently loaded into the kernel. With this patch, the only available information (beyond the name of the module) is the reference count, but that will be expanded in the future. Eventually all of the information found in /proc/modules will also appear in the /sys/module tree, though in the standard sysfs "one value per file" format. The values of parameters passed to the module will also be made available for inspection and (permissions willing) change.

This patch continues the process of moving system information from /proc to /sys. It may take a couple more development series worth of work, but /proc might just end up being pared down to the process information it was originally created to hold.

Kernel debugging via the net

One nice feature that was quietly slipped into the 2.6.0-test4-mm6 release is the kgdb-over-ethernet patch, by Robert Walsh and San Mehat. As described in the included documentation, kgdbeth makes it frighteningly easy to hook into a running Linux kernel over the network and prowl around in it. It's really just a matter of setting four boot parameters:

  • gdbeth=number to set the device number of the ethernet interface to use for debugging; this is usually zero for eth0.

  • gdbeth_remoteip to set the IP address of the machine which is able to hook in with gdb.

  • gdbeth_remotemac to set the remote system's MAC address.

  • gdbeth_localmac to tell the kgdb stub what the local system's MAC address is.

As one would expect, the target system will only respond to debugger traffic coming from the system designated by the boot-time arguments. Once you've booted a kernel with the kgdbeth patch and the proper parameters, hooking in with gdb is simple. Here's a (slightly cleaned up) log from a quick session done here at LWN Labs:

    gdb ./vmlinux
        (gdb startup stuff...)
    (gdb) target remote udp:victim:6443
    warning: The remote protocol may be unreliable over UDP.
    warning: Some events may be lost, rendering further debugging impossible.
    Remote debugging using udp:victim:6443
    do_IRQ (regs=
          {ebx = -1069465600, ecx = -1054087008, edx = -216755, esi = 624384,
           edi = -1072664576, ebp = 581632, eax = 0, xds = 123, xes = 123,
           orig_eax = -251, eip = -1072652202, xcs = 96, eflags = 582,
           esp = -1072652057, xss = 0}) at arch/i386/kernel/irq.c:514
    warning: shared library handler failed to enable breakpoint
    (gdb) print ioport_resource
    $2 = {name = 0xc0362e75 "PCI IO", start = 0, end = 65535, flags = 256,
          parent = 0x0, sibling = 0x0, child = 0xc03a2a80}
    (gdb) print *ioport_resource->child
    $3 = {name = 0xc035d94f "dma1", start = 0, end = 31, flags = 2147483648,
          parent = 0xc03a40e0, sibling = 0xc03a2a9c, child = 0x0}
    (gdb) c
    Continuing.

For anybody who has wanted to be able to use gdb on a running kernel, but who has never gotten around to setting up the requisite serial lines and such, kgdbeth promises to make things easier than ever.

Matt Mackall has noticed that a number of patches - including Ingo Molnar's network console code and kgdbeth - each provide their own low-level ethernet functions. Code which hooks into the kernel at such a fundamental level needs to be able to send and receive packets without involving the entire networking subsystem. As a way of addressing this duplication of code and effort, Matt put together and posted a netpoll API. The patch came accompanied by new versions of netconsole and kgdbeth, both of which are somewhat cleaned up and significantly reduced in size. An added bonus is that netpoll supports almost all interfaces out there without the need for any driver changes. As of this writing, netpoll has not found its way into an -mm release, but that could change.

Of course, Linus's feelings on kernel debuggers are well known, so kgdbeth, while potentially useful for developers, is unlikely to find its way into the 2.6 mainline. So Andrew Morton will have to keep this one in -mm. At least, until Linus hands off the 2.6 kernel - to Andrew.

Patches and updates

Kernel trees

  • Andrew Morton: 2.6.0-test4-mm6. "Dropped out Nick's CPU scheduler changes, brought back Con's interactivity work." (September 5, 2003)

Core kernel code

  • Con Kolivas: O20.1int. (September 10, 2003)

Development tools

Device drivers

Filesystems and block I/O

  • Dave Kleikamp: JFS 1.1.2. (September 7, 2003)

Networking

Architecture-specific

Security-related

Benchmarks and bugs

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds