The current development kernel is 2.5.65, which was released by Linus on
March 17. It includes a bunch of scheduler work (see last week's LWN kernel
page), some IDE work, some devfs trimming, NUMA
updates, a PCI update, a number of kbuild updates (including the
long-awaited GTK front end for "make xconfig"), various architecture
updates, and a long list of other fixes.
The changelog has the details.
Linus's BitKeeper tree includes an interesting patch which makes the "magic
sysrq" functionality available to remote users (via
/proc/sysrq-trigger), a PA-RISC update, and a small number of
fixes and performance improvements.
The current prepatch from Alan Cox is 2.5.65-ac1, which adds a small set of new
The current stable kernel is 2.4.20; Marcelo has released no 2.4.21
prepatches since 2.4.21-pre5, which came out
on February 27.
Note that 2.4.20 contains a local root
vulnerability; if you are running systems with untrusted users, you
should apply an update from your vendor or the patch supplied with the
Alan Cox has released 2.2.25, which contains the ptrace vulnerability fix
(and nothing else).
Kernel development news
Andries Brouwer released a new set of patches this week which brings the
long-planned expansion of dev_t closer to reality. These patches
rework the character device infrastructure to make it safe for much larger
numbers of devices. For now, at least, it is not even necessary to change
any char drivers to work properly with the new code.
The first patch clears out the char device
code within the filesystem area. This code included a whole structure for
tracking devices, managing reference counts, etc. That structure was only
used in one place, however, and Andries decided that, rather than fix it up
to work with larger device numbers, he would just hack it out altogether.
The rest of the kernel will not really notice its absence, for now.
The core of the work is in the second
patch. Here, the longstanding static chrdevs array is
removed. A static array of devices works reasonably well when there is a
maximum of 255 of them; it's rather less convenient when there can be
thousands of device numbers. In its place is a simple hash table with
linked lists of registered char drivers.
There is a new way of registering a char driver:
int register_chrdev_region(unsigned int major,
                           unsigned int baseminor,
                           int minorct,
                           const char *name,
                           struct file_operations *fops);
The new baseminor and minorct arguments describe the
range of minor numbers that the driver is prepared to deal with. Char
drivers should eventually be converted to the new interface, but there is
no great hurry; the register_chrdev() interface is still supported, and is
now implemented as:
int register_chrdev(unsigned int major, const char *name,
                    struct file_operations *fops)
{
    return register_chrdev_region(major, 0, 256, name, fops);
}
So unchanged char drivers will still work, and will not be confronted with
minor numbers greater than 255.
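The hash table arrangement described above can be modeled in userspace along the following lines. This is only a sketch: every name in it (chr_region, chr_register_region(), and so on) is invented for illustration, the bucket count is arbitrary, and overlap checking between regions is simply left out. It is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHR_HASH_SIZE 64        /* arbitrary bucket count for this sketch */

struct chr_region {
    unsigned int major;
    unsigned int baseminor;     /* first minor handled by this driver */
    unsigned int minorct;       /* number of minors handled */
    const char *name;
    struct chr_region *next;    /* next entry in the same hash bucket */
};

static struct chr_region *chr_table[CHR_HASH_SIZE];

/* Record that "name" handles minors [baseminor, baseminor + minorct). */
static inline int chr_register_region(unsigned int major,
                                      unsigned int baseminor,
                                      unsigned int minorct,
                                      const char *name)
{
    struct chr_region *r;

    assert(minorct > 0);
    r = malloc(sizeof(*r));
    if (r == NULL)
        return -1;
    r->major = major;
    r->baseminor = baseminor;
    r->minorct = minorct;
    r->name = name;
    r->next = chr_table[major % CHR_HASH_SIZE];
    chr_table[major % CHR_HASH_SIZE] = r;
    return 0;
}

/* Old-style registration: the driver only ever sees minors 0..255. */
static inline int chr_register(unsigned int major, const char *name)
{
    return chr_register_region(major, 0, 256, name);
}

/* Find the driver name registered for (major, minor), or NULL. */
static inline const char *chr_lookup(unsigned int major, unsigned int minor)
{
    struct chr_region *r;

    for (r = chr_table[major % CHR_HASH_SIZE]; r != NULL; r = r->next)
        if (r->major == major && minor >= r->baseminor &&
            minor - r->baseminor < r->minorct)
            return r->name;
    return NULL;
}
```

A driver registered through chr_register() is never found for a minor number above 255, which mirrors the compatibility guarantee described above.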
For now, drivers requesting a dynamic major number may continue to use the
same mechanism: passing major as zero. The mechanism implemented
in the patch is not entirely robust, however, and is marked as being
The third patch just cleans things up a bit,
and removes the MAX_CHRDEV macro. For the truly adventurous,
there is a fourth patch which actually
changes dev_t to 32 bits, using a 16:16 split.
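For those who want to see what a 16:16 split looks like, here is a small userspace sketch of packing and unpacking a 32-bit device number. The names are made up for this example, and putting the major number in the upper 16 bits is an assumption; the macros in the actual patch may be laid out differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit device number; assumes major in the upper 16 bits. */
typedef uint32_t demo_dev_t;

#define DEMO_MINORBITS 16
#define DEMO_MINORMASK ((1U << DEMO_MINORBITS) - 1)

/* Combine a major and minor number into one device number. */
static inline demo_dev_t demo_mkdev(unsigned int major, unsigned int minor)
{
    assert(major <= DEMO_MINORMASK && minor <= DEMO_MINORMASK);
    return ((demo_dev_t)major << DEMO_MINORBITS) | minor;
}

/* Extract the major number (upper 16 bits). */
static inline unsigned int demo_major(demo_dev_t dev)
{
    return dev >> DEMO_MINORBITS;
}

/* Extract the minor number (lower 16 bits). */
static inline unsigned int demo_minor(demo_dev_t dev)
{
    return dev & DEMO_MINORMASK;
}
```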
These patches have found their way into the -mm kernel tree, and are now in
need of some serious testing. Should things work out, the 32-bit
dev_t expansion may finally get crossed off the 2.5 development
The 2.5 kernel development process has put a strong emphasis on scalability
and performance issues. So it is somewhat interesting that the core Linux
filesystems - ext2 and ext3 - have seen relatively little scalability work
in 2.5. That is beginning to change, at least for ext2, but this work is
raising some interesting questions about what the role of these two
filesystems really is.
Alex Tomas has recently been working on performance bottlenecks in ext2.
His first concurrent block allocation patch
attacks the problem of allocating blocks within a filesystem. The current
ext2 code takes out the superblock lock before performing block allocation;
this means that only one thread can be trying to allocate space in a given
filesystem at a time. The first patch created a separate "allocation lock"
which protects the small piece of code which actually makes allocation
decisions; a later revision creates a
separate lock for each block group within the filesystem, thus reducing
lock contention further.
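The idea of one lock per block group can be sketched in userspace as follows. Everything here is hypothetical: the group count, the group size, the byte-per-block "bitmap," and the C11 atomic spinlock standing in for the kernel's own locking. It shows only the shape of the technique, not the ext2 code.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_GROUPS 4             /* hypothetical number of block groups */
#define DEMO_BLOCKS_PER_GROUP 8   /* hypothetical blocks per group */

struct demo_group {
    atomic_int lock;              /* protects only this group's map */
    unsigned char used[DEMO_BLOCKS_PER_GROUP];
};

static struct demo_group demo_groups[DEMO_GROUPS];

static inline void demo_lock(atomic_int *l)
{
    /* Spin until we are the one to change the flag from 0 to 1. */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;
}

static inline void demo_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

/*
 * Allocate one block from the given group. Returns the filesystem-wide
 * block number, or -1 if the group is full. Only the group's own lock is
 * taken, so allocations in different groups do not contend.
 */
static inline int demo_alloc_block(unsigned int group)
{
    struct demo_group *g;
    int i, found = -1;

    assert(group < DEMO_GROUPS);
    g = &demo_groups[group];
    demo_lock(&g->lock);
    for (i = 0; i < DEMO_BLOCKS_PER_GROUP; i++) {
        if (!g->used[i]) {
            g->used[i] = 1;
            found = (int)group * DEMO_BLOCKS_PER_GROUP + i;
            break;
        }
    }
    demo_unlock(&g->lock);
    return found;
}
```

With a single filesystem-wide lock, every caller of demo_alloc_block() would queue behind the same flag; here two threads only wait on each other when they pick the same group.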
The patch was greeted with positive reviews. William Lee Irwin reported a throughput increase from
62 MB/s to 104 MB/s on a benchmark he ran, and exclaimed
"This patch is a godsend. Whoever's listening, please apply!"
Martin Bligh, instead, said "SDET on
my machine (16x NUMA-Q) has fallen in love with your patch, and has decided
to elope with it to a small desert island." Not bad for a patch
which is really a pretty straightforward exercise in finer-grained locking.
The block allocation patch was quickly joined by a concurrent inode allocation patch and a distributed counters patch. None of these have
found their way into the mainline kernel yet, but they offer enough
performance benefits that they will likely get there eventually. Assuming
the block allocation patch can be coaxed back from its desert island
experience, that is.
A question was raised, however: is ext2 the right place for this sort of
work? ext2 is generally thought of as the relatively simple Linux
filesystem; ext3 is the place for fancy new stuff. There are a couple of
reasons why this sort of work tends to find its way into ext2 first.
One of those reasons is the simple fact that ext3 still has bigger scaling
problems. The ext3 filesystem is one of the few places in the Linux kernel
that still makes heavy use of the big kernel lock (BKL). As a result, ext3
does not scale well to large systems, and tweaking things like block
allocation will not help the real problem. Until the BKL dependency is
removed from ext3, most other performance work will not make much sense.
Removing the BKL is apparently a somewhat tricky job; at this point, it may
well not happen before 2.6 is released.
The other reason is that, large-systems scaling issues notwithstanding,
ext3 is developing into the default Linux filesystem. For most users,
there is little or no incentive to prefer ext2 over ext3; all it takes is
one power failure to make the advantages of a journaling filesystem clear.
So, as Daniel Phillips put it:
Ext2 is growing into the role of experimental filesystem; Ext3 is
now the stable filesystem. Hopefully, the experiments will make
Ext2 smaller, cleaner and at the same time, more powerful, over
time. Sort of like the role that RAMFS plays: besides being
useful, Ext2 should be thought of as a showcase for best filesystem
The role reversal, it seems, is nearly complete. Soon, it will be the ext2
users who are living on the bleeding edge.
The driver porting series continues to look at block drivers this week.
Below you'll find an article on the gendisk interface, which has become
rather more important in 2.5. Also available is an article which looks, in
detail, at the simplest possible block driver - a naive ramdisk driver for
2.5. As always, the entire series (up to 19 articles now) can be found on
this page.
The 2.4 kernel gendisk structure is used almost as an afterthought; its
main purpose is to help in keeping track of disk partitions. In 2.6, the
gendisk is at the core of the block subsystem; if you need to work with or
find something out about a disk, struct gendisk probably has what you need.
This article will cover the details of the gendisk structure from a disk
driver's perspective. If you have not already read them, a quick look at
the LWN block driver overview and simple block driver articles is probably
worthwhile.
The best way of looking at the contents of a gendisk structure from a block
driver's point of view is to examine what that driver must do to set the
structure up in the first place. If your driver makes a disk (or disk-like)
device available to the system, it will have to provide an associated
gendisk structure. (Note, however, that it is not necessary - or correct -
to set up gendisk structures for disk partitions).
The first step is to create the gendisk structure itself; the
function you need is alloc_disk() (which is declared in <linux/genhd.h>):
struct gendisk *alloc_disk(int minors);
The argument minors is the maximum number of minor numbers that
this disk can have. Minor numbers correspond to partitions, of course
(except the first, which is the "whole disk" device), so the value passed
here controls the maximum number of partitions. If a single minor number is
requested, the device cannot be partitioned at all. The return value is a pointer to
the gendisk structure; the allocation can fail, so this value
should always be checked against NULL before proceeding.
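The arithmetic linking minor numbers to disks and partitions is simple enough to show directly. The sketch below assumes a hypothetical driver which gives every disk the same number of minors (DEMO_MINORS, an invented value) and lays the disks out back to back; a real driver is free to divide its minor number space differently.

```c
#include <assert.h>

#define DEMO_MINORS 16   /* hypothetical value passed to alloc_disk() */

/* First minor number belonging to the disk with the given index. */
static inline int demo_first_minor(int disk_index)
{
    return disk_index * DEMO_MINORS;
}

/* Index of the disk which owns the given minor number. */
static inline int demo_disk_index(int minor)
{
    return minor / DEMO_MINORS;
}

/* Partition number within the disk; 0 is the "whole disk" device. */
static inline int demo_partition(int minor)
{
    return minor % DEMO_MINORS;
}

/* One minor is used by the whole-disk device; the rest are partitions. */
static inline int demo_max_partitions(int minors)
{
    assert(minors >= 1);
    return minors - 1;
}
```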
There are several fields of the gendisk structure which must be
initialized by the block driver. They include:
- int major;
  The major number of this device; either a static major assigned to a
  specific driver, or one that was obtained dynamically from
  register_blkdev().
- int first_minor;
  The first minor device number corresponding to this disk. This number
  will be determined by how your driver divides up its minor number
  space.
- char disk_name[16];
  The name of this disk (e.g. hda). This name is used in places
  like /proc/partitions and in creating a sysfs directory for
  the device.
- struct block_device_operations *fops;
  The device operations (open, release, ioctl, media_changed, and
  revalidate_disk) for this device. Each disk has its own set of
  operations in 2.6.
- struct request_queue *queue;
  The request queue which will handle the list of pending operations for
  this disk. The queue must be created and initialized separately.
- int flags;
  A set of flags controlling the management of this device. They
  include GENHD_FL_REMOVABLE for removable devices,
  GENHD_FL_CD for CDROM devices, and
  GENHD_FL_DRIVERFS which certainly means something interesting,
  but which is not actually used anywhere.
- void *private_data;
  This field is reserved for the driver; the rest of the block subsystem
  will not touch it. Usually it holds a pointer to a driver-specific
  data structure describing this device.
The gendisk structure also holds the size of the disk, in
sectors. As part of the initialization process, the driver should set that
size with:
void set_capacity(struct gendisk *disk, sector_t size);
The size value should be in 512-byte sectors, even if the hardware
sector size used by your device is different.
For removable disks, setting its capacity to zero indicates to the block
subsystem that there is currently no media present in the device.
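Since the capacity is always expressed in 512-byte units, a driver for a device with larger hardware sectors has to scale its sector count. A minimal sketch of that conversion follows; the helper name is invented here, and it assumes a hardware sector size which is a multiple of 512 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_KERNEL_SECTOR_SIZE 512u

/*
 * Convert a count of hardware sectors into 512-byte sectors.
 * Assumes hw_sector_size is a nonzero multiple of 512.
 */
static inline uint64_t demo_capacity_in_512(uint64_t hw_sectors,
                                            unsigned int hw_sector_size)
{
    assert(hw_sector_size >= DEMO_KERNEL_SECTOR_SIZE);
    assert(hw_sector_size % DEMO_KERNEL_SECTOR_SIZE == 0);
    return hw_sectors * (hw_sector_size / DEMO_KERNEL_SECTOR_SIZE);
}
```

In a driver, the result of a calculation like this is what would be handed to set_capacity(); a device with 1000 sectors of 2048 bytes, for example, has a capacity of 4000.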
Once you have your gendisk structure set up, you have to add it to
the list of active disks; that is done with:
void add_disk(struct gendisk *disk);
After this call, your device is active. There are a few things worth
keeping in mind about add_disk():
- add_disk() can create I/O to the device (to read partition
tables and such). You should not call add_disk() until your
driver is sufficiently initialized to handle requests.
- If you are calling add_disk() in your driver initialization
routine, you should not fail the initialization process after that call.
- The call to add_disk() increments the disk's reference count;
if the disk structure is ever to be released, the driver is
responsible for decrementing that count (with put_disk()).
Should you need to remove a disk from the system, that is accomplished
with:
void del_gendisk(struct gendisk *disk);
This function cleans up all of the information associated with the given
disk, and generally removes it from the system. After a call to
del_gendisk(), no more operations will be sent to the given
device. Your driver's reference to the gendisk object remains,
though; you must explicitly release it with:
void put_disk(struct gendisk *disk);
That call will cause the gendisk structure to be freed, as long as
no other part of the kernel retains a reference to it.
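The "freed only when nobody else holds a reference" rule is ordinary reference counting. The following userspace model, with invented names and none of the kernel's locking, shows the lifecycle in miniature; it says nothing about how the block layer itself adjusts the count internally.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a reference-counted disk structure. */
struct demo_disk {
    int refcount;
};

/* Allocate a disk; the caller holds the first reference. */
static inline struct demo_disk *demo_disk_alloc(void)
{
    struct demo_disk *d = malloc(sizeof(*d));

    if (d != NULL)
        d->refcount = 1;
    return d;
}

/* Take an additional reference. */
static inline void demo_disk_get(struct demo_disk *d)
{
    assert(d->refcount > 0);
    d->refcount++;
}

/* Drop a reference; returns 1 if this was the last one and d was freed. */
static inline int demo_disk_put(struct demo_disk *d)
{
    assert(d->refcount > 0);
    if (--d->refcount > 0)
        return 0;
    free(d);
    return 1;
}
```

If some other holder has called demo_disk_get(), the driver's own demo_disk_put() returns 0 and the structure lives on until that holder lets go as well.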
Should you need to set a disk into a read-only mode, use:
void set_disk_ro(struct gendisk *disk, int flag);
If flag is nonzero, all partitions on the disk will be marked
read-only. The kernel can track read-only status individually for each
partition, but no utility function has been exported to manipulate that
status for single partitions.
Partition management is handled within the block subsystem in 2.6; drivers
need not worry about partitions at all. Should the need arise, the
functions add_partition() and delete_partition() can be
used to manipulate the (in-kernel) partition table directly. These
functions are used in the generic block ioctl() code; there should
be no need for a block driver to call them directly.
Registering block device number ranges
A call to add_disk() implicitly allocates a set of minor numbers (under the
given major number) from first_minor to first_minor + minors - 1. If your
driver must only respond to operations on disks that exist at
initialization time, there is no need to worry further about number
allocation. Even the traditional call to register_blkdev() is optional, and
may be removed soon. Some drivers, however, need to be able to claim
responsibility for a larger range of device numbers at initialization time.
If this is your case, the answer is to call blk_register_region(),
which has this rather involved prototype:
void blk_register_region(dev_t dev,
                         unsigned long range,
                         struct module *module,
                         struct kobject *(*probe)(dev_t, int *, void *),
                         int (*lock)(dev_t, void *),
                         void *data);
Here, dev is a device number (created with MKDEV())
containing the major and first minor number of the region of interest;
range is the number of minor numbers to allocate, module
is the loadable module (if any) containing the driver, probe is a
driver-supplied function to probe for a single disk, lock is a
driver-supplied locking function, and data is a driver-private
pointer which is passed to probe() and lock().
When blk_register_region() is called, it simply makes a note of
the desired region and returns. Note that there can be more than one
registration within a specific region! At lookup time, the most "specific"
registration (the one with the smallest range) wins.
At some point in the future, an attempt
may be made to access a device number within the allocated region. At that
point, there will be a call to the lock() function (if it was not
passed as NULL) with the device
number of interest. If lock() succeeds, probe() will be
called to find the specific disk of interest. The full prototype of the
probe function is:
struct kobject *(*probe)(dev_t dev, int *partition, void *data);
Here, dev is the device number of interest, partition is
a pointer to a partition number (sort of), and data is the
driver-private pointer passed to blk_register_region(). The
partition number is actually just the offset into the allocated range; it's
the minor number from dev with the beginning of the range subtracted.
The probe() function should attempt to identify a specific
gendisk structure which corresponds to the requested number. If
it is successful, it should return a pointer to the kobject
structure contained within the gendisk. Kobjects are covered in
a separate article; for now, all you really
need to know is that you should call get_disk() with the
gendisk structure as the argument, and return the value from
get_disk() to the caller.
The probe() function can also modify the
partition number so that it corresponds to the actual partition offset in
the returned device. If the function cannot handle the request at all, it
can return NULL.
Some probe() functions do not, themselves, locate and initialize
the device of interest. Instead, they call some other function to set in
motion that whole process. For example, a number of probe()
functions simply call request_module() in an attempt to load a
module which can handle the device. In this mode of operation, the
function should return NULL, which will cause the block layer to
look at the device number allocations one more time. If a "better"
allocation (with a smaller range) has happened in the mean time, the
probe() function for the new driver will be called. So, for
example, if a module is loaded which allocates a smaller device number
range corresponding to the devices it actually implements, its
probe() routine will be called on the next iteration.
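The lookup-and-retry behavior described above can be modeled in a few dozen lines of userspace C. This is a sketch under stated assumptions: all of the names are invented, plain integers stand in for dev_t, the lock() step and the partition argument are left out, and the "loader" probe function fakes a module load by registering a 16-number range directly. It is not the block layer's implementation.

```c
#include <assert.h>
#include <stddef.h>

struct demo_disk { const char *name; };
typedef struct demo_disk *(*demo_probe_fn)(unsigned int dev);

struct demo_range {
    unsigned int first;      /* first device number in the range */
    unsigned long count;     /* number of device numbers covered */
    demo_probe_fn probe;
};

#define DEMO_MAX_RANGES 8
static struct demo_range demo_ranges[DEMO_MAX_RANGES];
static int demo_nr_ranges;

static inline int demo_register_range(unsigned int first, unsigned long count,
                                      demo_probe_fn probe)
{
    assert(probe != NULL && count > 0);
    if (demo_nr_ranges == DEMO_MAX_RANGES)
        return -1;
    demo_ranges[demo_nr_ranges].first = first;
    demo_ranges[demo_nr_ranges].count = count;
    demo_ranges[demo_nr_ranges].probe = probe;
    demo_nr_ranges++;
    return 0;
}

/* The most specific (smallest) registration covering dev, or NULL. */
static inline struct demo_range *demo_best_range(unsigned int dev)
{
    struct demo_range *best = NULL;
    int i;

    for (i = 0; i < demo_nr_ranges; i++) {
        struct demo_range *r = &demo_ranges[i];

        if (dev < r->first || dev - r->first >= r->count)
            continue;
        if (best == NULL || r->count < best->count)
            best = r;
    }
    return best;
}

static struct demo_disk demo_real_disk = { "demo0" };

/* Probe function of the driver which actually implements the device. */
static inline struct demo_disk *demo_real_probe(unsigned int dev)
{
    (void)dev;
    return &demo_real_disk;
}

/* Stands in for a probe() which loads a module and returns NULL. */
static inline struct demo_disk *demo_loader_probe(unsigned int dev)
{
    demo_register_range(dev & ~15u, 16, demo_real_probe);
    return NULL;
}

/* Probe the best range; on NULL, retry only if a smaller range appeared. */
static inline struct demo_disk *demo_find_disk(unsigned int dev)
{
    unsigned long tried = ~0UL;   /* size of the last range probed */

    for (;;) {
        struct demo_range *r = demo_best_range(dev);
        struct demo_disk *d;

        if (r == NULL || r->count >= tried)
            return NULL;          /* no better registration appeared */
        tried = r->count;
        d = r->probe(dev);
        if (d != NULL)
            return d;
    }
}
```

The first lookup of a device number in the wide range goes through the loader, which returns NULL; the second pass finds the newly registered, narrower range and calls its probe function instead.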
Of course, there is the usual associated unregister function:
void blk_unregister_region(dev_t dev, unsigned long range);
The next step
Once you have a handle on how the gendisk structure works, the
next thing to do is to learn about BIO
Page editor: Jonathan Corbet