Kernel development

Kernel release status

The current 2.6 prepatch remains 2.6.10-rc3.

The trickle of patches into Linus's BitKeeper repository continues; currently merged patches include a CIFS update, an IDE update, some networking fixes (including a fix for the IGMP vulnerabilities), a DVB update and various other fixes.

The current patch from Andrew Morton is 2.6.10-rc3-mm1. Recent additions to -mm include a reworking of the VFS readahead code, parts of the page fault handler scalability patch set, hooks needed for the merging of the Xen architecture, a big set of user-mode Linux patches, in-inode extended attribute support for ext3, unlocked ioctl() support (see below), a set of SELinux patches, and lots of fixes.

The current 2.4 prepatch remains 2.4.29-pre1; Marcelo has released no prepatches since November 25.

Kernel development news

Quote of the week

Nothing like changing the byte order of structure fields to really drive the "out-of-tree" driver writers crazy. I like it :)

-- Greg Kroah-Hartman

Debugfs

Kernel hackers often need to be able to export debugging information to user space. This information is not needed for the regular operation of the system, but it can be highly useful for a developer who is trying to figure out why things are behaving strangely. Sometimes putting in a few printk() calls is sufficient, but, often, that is not the best way to go. The debugging information may only be useful occasionally, but the printed output clogs up the logs all the time. Using printk() also does not help if the developer wishes to be able to change values from user space.

A common way of making debugging information available only when needed (and possibly for write access) is to create one or more files in a virtual filesystem. There are a few ways in which that can be done:

  • Creating files in /proc. This approach works, but there is little enthusiasm for adding still more files to /proc at this point, and the /proc filesystem functions can be a bit of a pain to work with.

  • 2.6 kernels have the /sys (sysfs) filesystem. In many cases, debugging information can be put there, but sysfs is meant for information used in administering the system, and the rules for sysfs require that each file contain a single value. For that reason, it is not even possible to use the seq_file interface with sysfs. The result is that sysfs is relatively consistent, but it is unwieldy for a developer who wishes to dump out a complicated data structure.

  • Creating an entirely new filesystem with libfs. This approach is highly flexible; a developer who creates a new filesystem can write the rules that go with it. The libfs interface makes things relatively simple, but the task of creating a new filesystem is more than most people want to take on just to make some debugging information available - especially since that filesystem will require some debugging of its own.

As a way of making life easier for developers, Greg Kroah-Hartman has created debugfs, a virtual filesystem devoted to debugging information. Debugfs is intended to be a relatively easy and lightweight subsystem which gracefully disappears when configured out of the kernel.

A developer wishing to use debugfs starts by creating a directory within the filesystem:

    struct dentry *debugfs_create_dir(const char *name, 
                                      struct dentry *parent);

The parent argument will usually be NULL, causing the directory to be created in the debugfs root. If debugfs is not configured into the system, the return value is -ENODEV (encoded as an error pointer, since the function returns a struct dentry pointer); a NULL return, instead, indicates some other sort of error.

The general-purpose function for creating a file in debugfs is:

    struct dentry *debugfs_create_file(const char *name, mode_t mode,
                                       struct dentry *parent, void *data,
                                       struct file_operations *fops);

The structure pointed to by fops should, of course, contain pointers to the functions which implement the actual operations on the file. In many cases, most of those functions can be the helpers provided by seq_file, making the task of exporting a file easy.

Some additional helpers have been provided to make exporting a single value as easy as possible:

    struct dentry *debugfs_create_u8(const char *name, mode_t mode, 
                                     struct dentry *parent, u8 *value);
    struct dentry *debugfs_create_u16(const char *name, mode_t mode, 
                                      struct dentry *parent, u16 *value);
    struct dentry *debugfs_create_u32(const char *name, mode_t mode, 
                                      struct dentry *parent, u32 *value);
    struct dentry *debugfs_create_bool(const char *name, mode_t mode, 
                                       struct dentry *parent, u32 *value);

Debugfs does not automatically clean up files when a module shuts down, so, for every file or directory created with the above functions, there must be a call to:

    void debugfs_remove(struct dentry *dentry);
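The creation and cleanup calls above might be combined along the following lines. This is a user-space mock-up rather than real kernel code: struct dentry, the three debugfs_*() stand-ins, and the live_entries counter are simplified assumptions which exist only so that the sketch can be built outside the kernel, and the "mydrv" names are hypothetical. The stand-ins report failure with NULL only; real code would also have to recognize the -ENODEV return.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* --- User-space stand-ins (assumptions, not the kernel's definitions) --- */
struct dentry {
    const char *name;
    struct dentry *parent;
};

static int live_entries;    /* number of mock entries currently in existence */

static struct dentry *mock_create(const char *name, struct dentry *parent)
{
    struct dentry *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    d->name = name;
    d->parent = parent;
    live_entries++;
    return d;
}

static struct dentry *debugfs_create_dir(const char *name,
                                         struct dentry *parent)
{
    return mock_create(name, parent);
}

/* mode is a plain unsigned int here; the kernel prototype uses mode_t. */
static struct dentry *debugfs_create_u32(const char *name, unsigned int mode,
                                         struct dentry *parent, uint32_t *value)
{
    (void)mode;
    (void)value;
    return mock_create(name, parent);
}

static void debugfs_remove(struct dentry *dentry)
{
    if (!dentry)
        return;
    assert(live_entries > 0);
    free(dentry);
    live_entries--;
}

/* --- The pattern itself, for a hypothetical driver called "mydrv" --- */
static struct dentry *mydrv_dir, *mydrv_irq_file;
static uint32_t mydrv_irq_count;    /* the value being exported */

int mydrv_debug_init(void)
{
    mydrv_dir = debugfs_create_dir("mydrv", NULL);
    if (!mydrv_dir)
        return -1;

    mydrv_irq_file = debugfs_create_u32("irq_count", 0444, mydrv_dir,
                                        &mydrv_irq_count);
    if (!mydrv_irq_file) {
        /* Nothing is cleaned up automatically; undo the directory. */
        debugfs_remove(mydrv_dir);
        mydrv_dir = NULL;
        return -1;
    }
    return 0;
}

void mydrv_debug_exit(void)
{
    /* One debugfs_remove() per creation call, children first. */
    debugfs_remove(mydrv_irq_file);
    debugfs_remove(mydrv_dir);
    mydrv_irq_file = NULL;
    mydrv_dir = NULL;
}
```

In a real module, mydrv_debug_init() and mydrv_debug_exit() would be called from the module's initialization and exit functions, so that no debugfs entry outlives the code which backs it.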

The debugfs interface is quite new, and it may well see changes before finding its way into the mainline kernel. In particular, Greg has considered adding a kobject parameter to the creation calls; the kobject would then provide the name for the resulting files.

Boot-time clock frequency selection

The timer interrupt is the kernel's way of keeping track of the passage of time. Every so often, a programmable timer interrupts the kernel, which responds by updating its internal time value, performing various housekeeping tasks, and executing any delayed kernel work whose time has come. In the 2.6 kernel, on the x86 architecture, by default, the timer interrupt comes 1000 times per second; other architectures and configurations can vary.

Playing with the timer tick frequency is almost as old as the kernel itself. The frequency with which the hardware timer interrupts the processor is well parameterized into a single compile-time variable (HZ); running the system with a nonstandard clock frequency is simply a matter of changing the definition of HZ (within reasonable bounds) and building a new kernel.

There are legitimate reasons for playing with the timer frequency. A faster clock can allow the system to perform more precise delays, and to respond to events more quickly. Systems running at a higher clock frequency should have lower latencies in many situations. There is an overhead associated with the timer interrupt, however; a higher-frequency interrupt will take more CPU time. So, for server loads (where latency is less important), the overhead of a higher timer frequency is not worth it. On laptops, the default 1kHz timer can also defeat the CPU's power management features and significantly reduce battery life.

In other words, there is no single value for the timer frequency which works for all users. Changing the frequency is still relatively hard, however; some people are more comfortable with building new kernels than others. Wouldn't it be nice if the frequency could be made into a boot-time parameter, so that it could be changed from one boot to the next without a kernel rebuild?

As it turns out, Andrea Arcangeli has a patch which does exactly that. It's not even a new patch: SUSE has been shipping 2.4 kernels with boot-time timer frequency selection for some time. Andrea is now interested in merging this patch into the mainline, should the other developers be willing.

The patch is relatively intrusive - it touches 143 files around the tree. The core change is the transformation of HZ from a constant value into a variable. Much of the kernel does not notice the change at all; a call like:

    schedule_timeout(HZ/10);

will still set up a wakeup for 100ms in the future. There is some new overhead associated with fetching the value of HZ and performing the division at run time, but Andrea states that it is not really measurable.
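The arithmetic behind that claim can be checked with a small user-space sketch. The variable timer_hz and the three helper functions are hypothetical stand-ins (the patch itself keeps the name HZ); the point is only that a tenth of the tick rate always works out to the same wall-clock delay, whatever the rate happens to be.

```c
#include <assert.h>

/* Stand-in for HZ once it has become a boot-time variable. */
static unsigned int timer_hz = 1000;

/* Select a tick rate, much as a boot parameter might. */
void set_timer_hz(unsigned int hz)
{
    assert(hz > 0);
    timer_hz = hz;
}

/* Run-time equivalent of the expression HZ/10: ticks in 1/10 second. */
unsigned long tenth_second_ticks(void)
{
    return timer_hz / 10;
}

/* Convert a tick count back into milliseconds at the current rate. */
unsigned long ticks_to_ms(unsigned long ticks)
{
    return ticks * 1000UL / timer_hz;
}
```

For rates which are multiples of ten (100, 250, or 1000, for example), ticks_to_ms(tenth_second_ticks()) comes out to exactly 100; other rates lose a little to integer division.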

There are places in the kernel which require further changes, however. Compile-time initializations which depend on a constant HZ value will no longer work; those initializations must be moved to run time, or recast in terms of a known constant value. There are also places where values in timer-tick units are provided by user space. The kernel tries to hide its internal clock frequency from user space, but there are still places where it leaks through. A number of boot-time parameters are expressed in ticks, and some device drivers take parameters in ticks as well.
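The compile-time initialization problem is easy to see in miniature. In this sketch, timer_hz stands in for the variable HZ, and the watchdog names are invented for illustration; none of it is taken from the patch.

```c
#include <assert.h>

/* Stand-in for HZ once it has become a boot-time variable. */
static unsigned int timer_hz = 1000;

/*
 * With a constant HZ, a driver could write:
 *
 *     static unsigned long watchdog_timeout = 30 * HZ;
 *
 * Once HZ is a variable, C rejects that line, since the initializer of
 * a static object must be a constant expression.  The value has to be
 * computed at run time instead.
 */
static unsigned long watchdog_timeout;    /* in ticks; set by watchdog_init() */

void watchdog_init(void)
{
    assert(timer_hz > 0);
    watchdog_timeout = 30UL * timer_hz;
}

unsigned long watchdog_timeout_ticks(void)
{
    return watchdog_timeout;
}
```

The initialization function must, of course, run after the tick rate has been settled.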

To address these problems, Andrea's patch expands the use of a symbol called USER_HZ. It is a constant value whose actual definition is architecture dependent, varying from 32 to 1200, though most architectures set it to 100. All remaining compile-time initializations, and all values obtained from user space, are interpreted as being in USER_HZ and must be translated to internal values before being used. To that end, some new macros have been provided:

    jiffies_to_clock_t(internal_hz);
    user_to_kernel_hz(user_hz);

With these in place, it's just a matter of keeping track of which type of clock value is being used where. Andrea's patch renames variables containing user-space tick values (it prepends "__" to the name) as a way of indicating that a special value is contained there.
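The sort of translation involved can be sketched in user space as follows. The function names here are hypothetical and are not the macros from Andrea's patch, whose exact definitions are not shown above; timer_hz again stands in for the variable HZ, and USER_HZ is given the common value of 100.

```c
#include <assert.h>

#define USER_HZ 100                     /* constant; 100 on most architectures */

static unsigned int timer_hz = 1000;    /* stand-in for the variable HZ */

/* Tick count as seen by user space -> internal tick count. */
unsigned long user_ticks_to_internal(unsigned long user_ticks)
{
    assert(timer_hz > 0);
    /* Overflow for very large values is ignored in this sketch. */
    return user_ticks * timer_hz / USER_HZ;
}

/* Internal tick count -> tick count as seen by user space. */
unsigned long internal_ticks_to_user(unsigned long ticks)
{
    assert(timer_hz > 0);
    return ticks * USER_HZ / timer_hz;
}
```

With a 1000Hz internal clock, a user-supplied value of 50 ticks (half a second) becomes 500 internal ticks; at 250Hz it becomes 125. Either way, the same half second is meant.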

Andrew Morton has said that some form of this patch is likely to be merged:

So I guess we're going to have to do this sometime - I don't think there's any other solution apart from going fully tickless, which would be considerably more intrusive.

Before the patch can be merged, however, a few details must be dealt with - porting it from 2.4 to 2.6, for example. So it's unlikely to go in immediately. Given time, however, it seems likely to be merged in some form.

ioctl(), the big kernel lock, and 32-bit compatibility

Despite efforts to remove the big kernel lock (BKL) from the 2.6 kernel, it still covers large amounts of code. Much of that code consists of ioctl() method implementations in device drivers and filesystems throughout the kernel. A poorly-implemented ioctl() method can block other processors for some time, wasting CPU time and creating high latencies. Fixing ioctl()'s BKL use has been on the "to do" list for some time, but nobody has dived in to get the job done.

Mike Werner has recently taken a step in that direction, however, with a patch which aims to make it easy to wean driver ioctl() methods off the BKL one at a time. To that end, it creates a new method in the file_operations structure:

    int (*unlocked_ioctl) (struct inode *inode, struct file *file, 
                           unsigned int cmd, unsigned long arg);

This method behaves just like one would expect: if it is non-NULL, it will be called in preference to the regular ioctl() method, and the BKL will not be taken for that call. New drivers can be written to use this method, and the ioctl() methods of old drivers can be shifted over once they are known to be safe to call without the BKL.
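The dispatch rule described above might look something like the following. This is a user-space sketch under stated assumptions, not the code from the patch: the structures are cut-down stand-ins, the BKL is mocked with a simple counter, and ioctl_dispatch_sketch() and report_bkl_held() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode;   /* opaque stand-ins for the kernel structures */
struct file;

/* Only the two methods of interest; the real structure has many more. */
struct file_operations {
    int (*ioctl)(struct inode *inode, struct file *file,
                 unsigned int cmd, unsigned long arg);
    int (*unlocked_ioctl)(struct inode *inode, struct file *file,
                          unsigned int cmd, unsigned long arg);
};

/* Mock big kernel lock: just a nesting counter. */
static int bkl_depth;

static void lock_kernel(void)
{
    bkl_depth++;
}

static void unlock_kernel(void)
{
    assert(bkl_depth > 0);
    bkl_depth--;
}

/* Prefer unlocked_ioctl(); fall back to ioctl() under the BKL. */
int ioctl_dispatch_sketch(const struct file_operations *fops,
                          struct inode *inode, struct file *file,
                          unsigned int cmd, unsigned long arg)
{
    int ret = -ENOTTY;

    if (fops->unlocked_ioctl)
        return fops->unlocked_ioctl(inode, file, cmd, arg);

    if (fops->ioctl) {
        lock_kernel();
        ret = fops->ioctl(inode, file, cmd, arg);
        unlock_kernel();
    }
    return ret;
}

/* Sample method: reports whether the mock BKL was held when called. */
static int report_bkl_held(struct inode *inode, struct file *file,
                           unsigned int cmd, unsigned long arg)
{
    (void)inode;
    (void)file;
    (void)cmd;
    (void)arg;
    return bkl_depth > 0;
}
```

A driver which fills in only ioctl() sees the lock held, while one which provides unlocked_ioctl() is called without it; that is what allows drivers to be converted one at a time.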

This is a different approach than was taken to get the BKL out of lseek() methods. In that case, the interface was changed by decree, and lseek() was called without the BKL. First, however, every in-tree lseek() method was enhanced with an explicit lock_kernel() call of its own. As a result, those methods still executed with the BKL held, but the taking of the BKL was made explicit and put into a place where it could be removed when it was no longer needed. A typical ioctl() method can be more complicated than most lseek() methods, however, so the creation of a new method must seem like the easier approach this time around.

One commenter has suggested that the new method should not include the inode argument, since it is trivially obtained from the file structure anyway. The version of this patch which was merged into 2.6.10-rc3-mm1 retains that argument, however.

Meanwhile, Michael Tsirkin has posted a different ioctl() patch which, while it provides a non-BKL migration path for that method, also solves another problem. One of the biggest challenges in writing portable ioctl() methods is dealing with 32-bit compatibility on 64-bit systems. When user space is running in 32-bit mode, it will have a different view of any structures passed into ioctl(), and the kernel must translate the 32-bit versions into something it can work with.

The kernel provides some help with this translation in the form of a function called register_ioctl32_conversion():

    typedef int (*ioctl_trans_handler_t)(unsigned int, unsigned int,
                                         unsigned long, struct file *);
    int register_ioctl32_conversion(unsigned int cmd, 
                                    ioctl_trans_handler_t handler);

After this call, any 32-bit ioctl() call using the given cmd will be passed to the handler function, which, presumably, knows how to deal with it. This mechanism works, but it has a few shortcomings. It relies on a global space for ioctl() command codes, for example. Every command is supposed to be unique, but things do not always happen that way - especially with out-of-tree drivers. The use of a hash table to look up handler functions slows things down a bit. And, as Andi Kleen pointed out recently, the current mechanism suffers from race conditions which appear to be unfixable without changing the interface.
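The global-namespace problem can be illustrated with a toy registration table. Nothing here is the kernel's implementation: the toy_*() functions, the fixed-size array (the kernel uses a hash table), and the two driver handlers are all invented for the example, and the error codes are merely plausible choices.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct file;    /* opaque stand-in */

typedef int (*ioctl_trans_handler_t)(unsigned int, unsigned int,
                                     unsigned long, struct file *);

#define TOY_TABLE_SIZE 16

struct toy_entry {
    unsigned int cmd;
    ioctl_trans_handler_t handler;
};

/* One table shared by every driver in the system. */
static struct toy_entry toy_table[TOY_TABLE_SIZE];
static size_t toy_count;

int toy_register_conversion(unsigned int cmd, ioctl_trans_handler_t handler)
{
    size_t i;

    assert(handler != NULL);
    for (i = 0; i < toy_count; i++)
        if (toy_table[i].cmd == cmd)
            return -EBUSY;      /* somebody else got this number first */
    if (toy_count == TOY_TABLE_SIZE)
        return -ENOMEM;
    toy_table[toy_count].cmd = cmd;
    toy_table[toy_count].handler = handler;
    toy_count++;
    return 0;
}

ioctl_trans_handler_t toy_lookup(unsigned int cmd)
{
    size_t i;

    for (i = 0; i < toy_count; i++)
        if (toy_table[i].cmd == cmd)
            return toy_table[i].handler;
    return NULL;
}

/* Two unrelated (hypothetical) drivers which picked the same number. */
static int driver_a_handler(unsigned int fd, unsigned int cmd,
                            unsigned long arg, struct file *file)
{
    (void)fd; (void)cmd; (void)arg; (void)file;
    return 1;
}

static int driver_b_handler(unsigned int fd, unsigned int cmd,
                            unsigned long arg, struct file *file)
{
    (void)fd; (void)cmd; (void)arg; (void)file;
    return 2;
}
```

If both drivers try to register the same command code, only the first can succeed; the second is left without a working 32-bit path, even though each driver is perfectly consistent on its own.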

But, if you're going to change the interface, you might as well do it right. So Michael's patch adds two new ioctl() methods to the file_operations structure. The ioctl_native() method handles calls made from user-space processes which are using the same architecture model as the kernel, while ioctl_compat() is called in cases where the two differ. With this approach, the global table of commands can be eliminated, and its problems go away as well. Since the new ioctl_compat() method is invoked directly from the file_operations structure, it is easy to manage the module reference count to avoid unload races.

Oh, and the kernel does not acquire the big kernel lock before calling either of the new methods; they are expected to be implemented with proper locking from the beginning.
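A rough user-space sketch of how such a pair of methods might divide the work follows. The method signatures are assumptions (they are not given above), as are the mydev request structures and the toy_dispatch() function; a real driver would also copy the structure in from user space rather than dereferencing a pointer directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* The same request, as laid out by 64-bit and 32-bit user space. */
struct mydev_req {
    uint64_t buf;       /* user-space pointer */
    uint64_t len;
};

struct mydev_req32 {
    uint32_t buf;       /* 32-bit pointer */
    uint32_t len;
};

/* Assumed signatures; only the two methods of interest are shown. */
struct toy_file_operations {
    long (*ioctl_native)(unsigned int cmd, void *arg);
    long (*ioctl_compat)(unsigned int cmd, void *arg);
};

/* Native path: the structure is already in the kernel's own layout. */
static long mydev_ioctl_native(unsigned int cmd, void *arg)
{
    const struct mydev_req *req = arg;

    (void)cmd;
    return (long)req->len;      /* pretend that len bytes were handled */
}

/* Compat path: widen the 32-bit layout, then reuse the native code. */
static long mydev_ioctl_compat(unsigned int cmd, void *arg)
{
    const struct mydev_req32 *req32 = arg;
    struct mydev_req req = { .buf = req32->buf, .len = req32->len };

    return mydev_ioctl_native(cmd, &req);
}

/* Per-file dispatch: no global command table, and no BKL taken. */
long toy_dispatch(const struct toy_file_operations *fops,
                  int caller_is_32bit, unsigned int cmd, void *arg)
{
    assert(fops != NULL);
    if (caller_is_32bit)
        return fops->ioctl_compat ? fops->ioctl_compat(cmd, arg) : -ENOTTY;
    return fops->ioctl_native ? fops->ioctl_native(cmd, arg) : -ENOTTY;
}
```

Since the translation function is found through the file's own operations structure, two drivers may use the same command number without ever getting in each other's way.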

Michael's patch seems to solve all of the problems addressed by the unlocked_ioctl() approach, plus a few more. The debate has not yet begun, but it would not be surprising to see the two new methods win out in the end.

Patches and updates

Core kernel code

  • Andrea Arcangeli: dynamic-hz. (December 11, 2004)

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds