The current 2.6 prepatch remains 2.6.10-rc3.
The trickle of patches into Linus's BitKeeper repository continues;
currently merged patches include a CIFS update, an IDE update, some
networking fixes (including an IGMP fix),
a DVB update, and various other fixes.
The current patch from Andrew Morton is 2.6.10-rc3-mm1. Recent additions to -mm
include a reworking of the VFS readahead code, parts of the page fault handler scalability patch set,
hooks needed for the merging of the Xen architecture, a big set of
user-mode Linux patches, in-inode extended
attribute support for ext3, unlocked ioctl() support (see
below), a set of SELinux patches, and lots of fixes.
The current 2.4 prepatch remains 2.4.29-pre1; Marcelo has released
no prepatches since November 25.
Kernel development news
Nothing like changing the byte order of structure fields to really
drive the "out-of-tree" driver writers crazy. I like it :)
-- Greg Kroah-Hartman
Kernel hackers often need to be able to export debugging information to
user space. This information is not needed for the regular operation of
the system, but it can be highly useful for a developer who is trying to
figure out why things are behaving strangely. Sometimes putting in a few
printk() calls is sufficient, but, often, that is not the best way
to go. The debugging information may only be useful occasionally, but the
printed output clogs up the logs all the time. Using printk()
also does not help if the developer wishes to be able to change values from user space.
A common way of making debugging information available only when needed
(and possibly for write access) is
to create one or more files in a virtual filesystem. There are a few ways
in which that can be done:
- Creating files in /proc. This approach works, but there is
little enthusiasm for creating more files in /proc at
this point, and the /proc filesystem functions can be a bit
of a pain to work with.
- 2.6 kernels have the /sys (sysfs) filesystem. In many cases,
debugging information can be put there, but sysfs is meant for
information used in administering the system, and the rules for sysfs
require that each file contain a single value. For that reason, it is
not even possible to use the seq_file
interface with sysfs. The result is that sysfs is relatively
consistent, but it is unwieldy for a developer who wishes to dump out
a complicated data structure.
- Creating an entirely new filesystem with libfs. This approach is highly flexible;
a developer who creates a new filesystem can write the rules that go
with it. The libfs interface makes things relatively simple, but the
task of creating a new filesystem is more than most people want to
take on just to make some debugging information available - especially
since that filesystem will require some debugging of its own.
As a way of making life easier for developers, Greg Kroah-Hartman has
created debugfs, a virtual filesystem
devoted to debugging information. Debugfs is intended to be a relatively
easy and lightweight subsystem which gracefully disappears when configured
out of the kernel.
A developer wishing to use debugfs starts by creating a directory within the filesystem:
struct dentry *debugfs_create_dir(const char *name,
struct dentry *parent);
The parent argument will usually be NULL, causing the
directory to be created in the debugfs root. If debugfs is not configured
into the system, the return value is -ENODEV; a NULL
return, instead, indicates some other sort of error.
The general-purpose function for creating a file in debugfs is:
struct dentry *debugfs_create_file(const char *name, mode_t mode,
struct dentry *parent, void *data,
struct file_operations *fops);
The structure pointed to by fops should, of course, contain
pointers to the functions which implement the actual operations on the
file. In many cases, most of those functions can be the helpers provided
by seq_file, making the task of exporting a file easy.
Some additional helpers have been provided to make exporting a single value
as easy as possible:
struct dentry *debugfs_create_u8(const char *name, mode_t mode,
struct dentry *parent, u8 *value);
struct dentry *debugfs_create_u16(const char *name, mode_t mode,
struct dentry *parent, u16 *value);
struct dentry *debugfs_create_u32(const char *name, mode_t mode,
struct dentry *parent, u32 *value);
struct dentry *debugfs_create_bool(const char *name, mode_t mode,
struct dentry *parent, u32 *value);
Debugfs does not automatically clean up files when a module shuts down, so,
for every file or directory created with the above functions, there must be
a call to:
void debugfs_remove(struct dentry *dentry);
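The create/remove pairing described above can be sketched in user space. This is only an illustration: the debugfs functions below are small stand-ins with the same names and argument order as the prototypes above, not the kernel's implementation, and the driver (mydrv, its irq_count counter, and the init/exit helpers) is hypothetical. The stand-ins report failure with NULL only; the -ENODEV case mentioned above is not modeled.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>   /* mode_t */

typedef unsigned int u32;

/* ---- Userspace stand-ins (NOT the kernel implementation) ---- */
struct dentry {
    const char *name;
    struct dentry *parent;
    u32 *value;
};

static int live_entries;   /* entries created but not yet removed */

static struct dentry *stub_create(const char *name, struct dentry *parent,
                                  u32 *value)
{
    struct dentry *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    d->name = name;
    d->parent = parent;
    d->value = value;
    live_entries++;
    return d;
}

static struct dentry *debugfs_create_dir(const char *name,
                                         struct dentry *parent)
{
    return stub_create(name, parent, NULL);
}

static struct dentry *debugfs_create_u32(const char *name, mode_t mode,
                                         struct dentry *parent, u32 *value)
{
    (void)mode;   /* permissions are ignored by this stand-in */
    return stub_create(name, parent, value);
}

static void debugfs_remove(struct dentry *dentry)
{
    if (!dentry)
        return;
    assert(live_entries > 0);
    free(dentry);
    live_entries--;
}

/* ---- Hypothetical driver code using the interface ---- */
static u32 mydrv_irq_count;
static struct dentry *mydrv_dir;
static struct dentry *mydrv_irq_file;

static int mydrv_debug_init(void)
{
    mydrv_dir = debugfs_create_dir("mydrv", NULL);   /* NULL: debugfs root */
    if (!mydrv_dir)
        return -1;

    mydrv_irq_file = debugfs_create_u32("irq_count", 0444, mydrv_dir,
                                        &mydrv_irq_count);
    if (!mydrv_irq_file) {
        debugfs_remove(mydrv_dir);   /* undo the partial setup */
        mydrv_dir = NULL;
        return -1;
    }
    return 0;
}

static void mydrv_debug_exit(void)
{
    /* Nothing is cleaned up automatically: remove the file, then the directory. */
    debugfs_remove(mydrv_irq_file);
    debugfs_remove(mydrv_dir);
    mydrv_irq_file = NULL;
    mydrv_dir = NULL;
}
```

In a real module, the two helpers would presumably be called from the module's init and exit functions, so that every created entry is matched by a debugfs_remove() call.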
The debugfs interface is quite new, and it may well see changes before
finding its way into the mainline kernel. In particular, Greg has considered adding a kobject parameter to the
creation calls; the kobject would then provide the name for the resulting
entry.
The timer interrupt is the kernel's way of keeping track of the passage of
time. Every so often, a programmable timer interrupts the kernel, which
responds by updating its internal time value, performing various
housekeeping tasks, and executing any delayed kernel work whose time has
come. In the 2.6 kernel, on the x86 architecture, by default, the timer
interrupt comes 1000 times per second; other architectures and
configurations can vary.
Playing with the timer tick frequency is almost as old as the kernel
itself. The frequency with which the hardware timer interrupts the
processor is well parameterized into a single compile-time variable
(HZ); running the system with a nonstandard clock frequency is
simply a matter of changing the definition of HZ (within
reasonable bounds) and building a new kernel.
There are legitimate reasons for playing with the timer frequency. A
faster clock can allow the system to perform more precise delays, and to
respond to events more quickly. Systems running at a higher clock
frequency should have lower latencies in many situations. There is an
overhead associated with the timer interrupt, however; a higher-frequency
interrupt will take more CPU time. So, for server loads (where latency is
less important), the overhead of a higher timer frequency is not worth it.
On laptops, the default 1KHz timer can also defeat the CPU's power management
features and significantly reduce battery life.
In other words, there is no single value for the timer frequency which
works for all users. Changing the frequency is still relatively hard,
however; some people are more comfortable with building new kernels than
others. Wouldn't it be nice if the frequency could be made into a
boot-time parameter, so that it could be changed from one boot to the next
without a kernel rebuild?
As it turns out, Andrea Arcangeli has a
patch which does exactly that. It's not even a new patch: SUSE has
been shipping 2.4 kernels with boot-time timer frequency selection for some
time. Andrea is now interested in merging this patch into the mainline,
should the other developers be willing.
The patch is relatively intrusive - it touches 143 files around the tree.
The core change is the transformation of HZ from a constant value
into a variable. Much of the kernel does not notice the change at all; a
timeout expressed as HZ/10, for example, will still set up a wakeup for 100ms in the future. There is some new overhead
associated with fetching the value of HZ and performing the
division at run time, but Andrea states that it is not really measurable.
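A small user-space sketch may help show why most code is unaffected. Here hz is a hypothetical run-time variable standing in for HZ, and both helper names are invented for the illustration; none of this is taken from Andrea's patch.

```c
#include <assert.h>

/* Hypothetical run-time stand-in for HZ (ticks per second). */
static unsigned int hz = 1000;

/* A tenth of a second in ticks; the division now happens at run time. */
static unsigned int tenth_of_a_second(void)
{
    return hz / 10;
}

/* Convert a tick count back to milliseconds at the current frequency.
 * Integer division truncates, so frequencies that are not divisors of
 * 1000 will not round-trip exactly. */
static unsigned int ticks_to_ms(unsigned int ticks)
{
    assert(hz != 0);
    return ticks * 1000u / hz;
}
```

The tick count differs from one frequency to the next, but the delay it represents stays the same, which is why code written in terms of HZ does not need to change.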
There are places in the kernel which require further changes, however.
Compile-time initializations which depend on a constant HZ value
will no longer work; those initializations must be moved to run time, or
recast in terms of a known constant value. There are also places where
values in timer-tick units are provided by user space. The kernel tries to
hide its internal clock frequency from user space, but there are still
places where it leaks through. A number of boot-time parameters are
expressed in ticks, and some device drivers take parameters in ticks as well.
To address these problems, Andrea's patch expands the use of a symbol called
USER_HZ. It is a constant value, though its actual definition is
architecture dependent, varying from 32 to 1200 - though most architectures
set it to 100. All remaining compile-time initializations, and all values
obtained from user space, are interpreted as being in USER_HZ and
must be translated to internal values before being used. To that end, some
new macros have been provided:
With these in place, it's just a matter of keeping track of which type of
clock value is being used where. Andrea's patch renames variables
containing user-space tick values (it prepends "__" to the name)
as a way of indicating that a special value is contained there.
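The translation between the two kinds of tick values might look something like the following user-space sketch. The helper names are hypothetical and are not the macros from the patch; USER_HZ is assumed to be 100 (the value the text gives for most architectures), and hz stands in for the run-time HZ variable.

```c
#include <assert.h>

#define USER_HZ 100              /* assumed constant user-visible frequency */

static unsigned long hz = 1000;  /* hypothetical run-time internal frequency */

/* Convert a tick count supplied by user space (in USER_HZ units)
 * into internal ticks. A wider intermediate avoids overflow in the
 * multiplication. */
static unsigned long user_to_internal_ticks(unsigned long user_ticks)
{
    return (unsigned long)((unsigned long long)user_ticks * hz / USER_HZ);
}

/* Convert internal ticks back into USER_HZ units for reporting. */
static unsigned long internal_to_user_ticks(unsigned long ticks)
{
    assert(hz != 0);
    return (unsigned long)((unsigned long long)ticks * USER_HZ / hz);
}
```

With helpers of this kind, a value of 50 from user space always means half a second, whatever frequency the kernel was booted with.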
Andrew Morton has said that some form of
this patch is likely to be merged:
So I guess we're going to have to do this sometime - I don't think
there's any other solution apart from going fully tickless, which
would be considerably more intrusive.
Before the patch can be merged, however, a few details must be dealt with -
porting it from 2.4 to 2.6, for example. So it's unlikely to go in
immediately. Given time, however, it seems likely to be merged in some form.
Despite efforts to remove the big kernel lock (BKL) from the 2.6 kernel, it
still covers large amounts of code. Much of that code consists of implementations
of the ioctl() method in device drivers and filesystems throughout
the kernel. A poorly-implemented ioctl() method can block other
processors for some time, wasting CPU time and creating high latencies.
Removing the BKL from ioctl() has been on the "to do" list for some
time, but nobody has dived in to get the job done.
Mike Werner has recently taken a step in that direction, however, with this patch which aims to make it easy to wean
driver ioctl() methods off the BKL one at a time. To that end, it
creates a new method in the file_operations structure:
int (*unlocked_ioctl) (struct inode *inode, struct file *file,
unsigned int cmd, unsigned long arg);
This method behaves just like one would expect: if it is non-NULL,
it will be called in preference to the regular ioctl() method, and
the BKL will not be taken for that call. New drivers can be written to use this method,
and the ioctl() methods of old drivers can be shifted over once
they are known to be safe to call without the BKL.
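The dispatch rule described above can be modeled in user space roughly as follows. The structures are cut down to the fields needed here, the BKL is represented by a simple counter, and do_ioctl() and mydrv_ioctl() are hypothetical names; this is a sketch of the behavior, not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode;
struct file;

/* Reduced model of file_operations: only the two ioctl() methods. */
struct file_operations {
    int (*ioctl)(struct inode *inode, struct file *file,
                 unsigned int cmd, unsigned long arg);
    int (*unlocked_ioctl)(struct inode *inode, struct file *file,
                          unsigned int cmd, unsigned long arg);
};

struct file {
    const struct file_operations *f_op;
};

/* The BKL, modeled as a depth counter plus an acquisition count. */
static int bkl_depth;
static int bkl_acquisitions;

static void lock_kernel(void)   { bkl_depth++; bkl_acquisitions++; }
static void unlock_kernel(void) { assert(bkl_depth > 0); bkl_depth--; }

/* Prefer unlocked_ioctl(); fall back to ioctl() under the BKL. */
static int do_ioctl(struct inode *inode, struct file *filp,
                    unsigned int cmd, unsigned long arg)
{
    const struct file_operations *fops = filp->f_op;
    int ret;

    if (fops->unlocked_ioctl)
        return fops->unlocked_ioctl(inode, filp, cmd, arg);
    if (!fops->ioctl)
        return -ENOTTY;

    lock_kernel();
    ret = fops->ioctl(inode, filp, cmd, arg);
    unlock_kernel();
    return ret;
}

/* Hypothetical driver method: reports whether it ran under the BKL. */
static int mydrv_ioctl(struct inode *inode, struct file *file,
                       unsigned int cmd, unsigned long arg)
{
    (void)inode; (void)file; (void)cmd; (void)arg;
    return bkl_depth;
}

static const struct file_operations legacy_fops    = { .ioctl = mydrv_ioctl };
static const struct file_operations converted_fops = { .unlocked_ioctl = mydrv_ioctl };
```

The same driver function sees the lock held when installed as ioctl() and not held when installed as unlocked_ioctl(), which is what allows drivers to be converted one at a time.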
This is a different approach than was taken to get the BKL out of
lseek() methods. In that case, the interface was changed by
decree, and lseek() was called without the BKL. First, however,
every in-tree lseek() method was enhanced with an explicit
lock_kernel() call of its own. As a result, those methods still
executed with the BKL held, but the taking of the BKL was made explicit and
put into a place where it could be removed when it was no longer needed.
A typical ioctl() method can be more complicated than most
lseek() methods, however, so the creation of a new method must
seem like the easier approach this time around.
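For comparison, the lseek() conversion pattern can be sketched like this. It is a user-space model: struct file is reduced to a position field, the BKL is a counter, and mydrv_lseek() is a hypothetical driver method rather than code from any real driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>    /* SEEK_SET, SEEK_CUR */

/* The BKL, modeled as a depth counter. */
static int bkl_depth;
static int bkl_depth_seen;   /* depth observed inside the method */

static void lock_kernel(void)   { bkl_depth++; }
static void unlock_kernel(void) { assert(bkl_depth > 0); bkl_depth--; }

/* Reduced stand-in for struct file. */
struct file {
    long long f_pos;
};

/* Hypothetical lseek() method after the conversion: the caller no
 * longer takes the BKL, so the method takes it explicitly. Once the
 * method is known to be safe, the two lock calls can simply go away. */
static long long mydrv_lseek(struct file *filp, long long offset, int whence)
{
    long long newpos;

    lock_kernel();
    bkl_depth_seen = bkl_depth;
    switch (whence) {
    case SEEK_SET:
        newpos = offset;
        break;
    case SEEK_CUR:
        newpos = filp->f_pos + offset;
        break;
    default:
        newpos = -EINVAL;
        break;
    }
    if (newpos >= 0)
        filp->f_pos = newpos;
    unlock_kernel();
    return newpos;
}
```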
One commenter has suggested that the new method should not include the
inode argument, since it is trivially obtained from the
file structure anyway. The version of this patch which was merged
into 2.6.10-rc3-mm1 retains that argument, however.
Meanwhile, Michael Tsirkin has posted a
different ioctl() patch which, while it provides a non-BKL
migration path for that method, also solves another problem. One of the
biggest challenges in writing portable ioctl() methods is dealing
with 32-bit compatibility on 64-bit systems. When user space is running in
32-bit mode, it will have a different view of any structures passed into
ioctl(), and the kernel must translate the 32-bit versions into
something it can work with.
The kernel provides some help with this translation in the form of a function called
register_ioctl32_conversion():
typedef int (*ioctl_trans_handler_t)(unsigned int, unsigned int,
unsigned long, struct file *);
int register_ioctl32_conversion(unsigned int cmd,
ioctl_trans_handler_t handler);
After this call, any 32-bit ioctl() call using the given
cmd will be passed to the handler function, which,
presumably, knows how to deal with it. This mechanism works, but it has a
few shortcomings. It relies on a global space for ioctl() command
codes, for example. Every command is supposed to be unique, but
things do not always happen that way - especially with out-of-tree
drivers. The use of a hash table to look up handler functions slows things
down a bit. And, as Andi Kleen pointed
out recently, the current mechanism suffers from race conditions which appear to
be unfixable without changing the interface.
But, if you're going to change the interface, you might as well do it
right. So Michael's patch adds two new ioctl() methods to the
file_operations structure. The ioctl_native() method
handles calls made from user-space processes which are using the same
architecture model as the kernel, while ioctl_compat() is called
in cases where the two differ. With this approach, the global table of
commands can be eliminated, and its problems go away as well. Since the
new ioctl_compat() method is invoked directly from the
file_operations structure, it is easy to manage the module
reference count to avoid unload races.
Oh, and the kernel does not acquire the big kernel lock before calling
either of the new methods; they are expected to be implemented with proper
locking from the beginning.
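A rough user-space model of the two-method approach follows. The article does not give the prototypes of ioctl_native() and ioctl_compat(), so the signatures here are assumptions, as are the mydrv driver, its command number, and its argument structures. A real driver would also copy the argument from user space rather than dereference a pointer directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct file;

/* Assumed method signatures; the real patch may well differ. */
struct file_operations {
    long (*ioctl_native)(struct file *filp, unsigned int cmd, void *arg);
    long (*ioctl_compat)(struct file *filp, unsigned int cmd, void *arg);
};

struct file {
    const struct file_operations *f_op;
};

/* The same hypothetical argument as seen by 64-bit and 32-bit user space. */
struct mydrv_args   { int64_t offset; uint32_t flags; };
struct mydrv_args32 { int32_t offset; uint32_t flags; };

#define MYDRV_SET_OFFSET 1   /* hypothetical command number */

static int64_t mydrv_offset;

static long mydrv_ioctl_native(struct file *filp, unsigned int cmd, void *arg)
{
    const struct mydrv_args *a = arg;

    (void)filp;
    if (cmd != MYDRV_SET_OFFSET)
        return -ENOTTY;
    mydrv_offset = a->offset;
    return 0;
}

/* Translate the 32-bit layout into the native one, then reuse the
 * native handler. The assignment sign-extends the 32-bit offset. */
static long mydrv_ioctl_compat(struct file *filp, unsigned int cmd, void *arg)
{
    const struct mydrv_args32 *a32 = arg;
    struct mydrv_args a = { .offset = a32->offset, .flags = a32->flags };

    return mydrv_ioctl_native(filp, cmd, &a);
}

static const struct file_operations mydrv_fops = {
    .ioctl_native = mydrv_ioctl_native,
    .ioctl_compat = mydrv_ioctl_compat,
};

/* Pick the method from the file's own operations; no global command
 * table is consulted, and no BKL is taken. */
static long do_ioctl(struct file *filp, int caller_is_32bit,
                     unsigned int cmd, void *arg)
{
    const struct file_operations *fops = filp->f_op;
    long (*handler)(struct file *, unsigned int, void *) =
        caller_is_32bit ? fops->ioctl_compat : fops->ioctl_native;

    if (!handler)
        return -ENOTTY;
    return handler(filp, cmd, arg);
}
```

Because the lookup goes through the file's own operations structure, two drivers can use the same command number without interfering with each other.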
Michael's patch seems to solve all of the problems addressed by the
unlocked_ioctl() approach, plus a few more. The debate has not
yet begun, but it would not be surprising to see the two new methods win
out in the end.
Page editor: Jonathan Corbet