LWN Weekly Edition
Kernel development

Kernel release status

The current development kernel is 2.5.65, which was released by Linus on March 17. It includes a bunch of scheduler work (see last week's LWN kernel page), some IDE work, some devfs trimming, NUMA updates, a PCI update, a number of kbuild updates (including the long-awaited GTK front end for "make xconfig"), various architecture updates, and a long list of other fixes. The long-format changelog has the details.

Linus's BitKeeper tree includes an interesting patch which makes the "magic sysrq" functionality available to remote users (via /proc/sysrq-trigger), a PA-RISC update, and a small number of fixes and performance improvements.

The current prepatch from Alan Cox is 2.5.65-ac1, which adds a small set of new fixes.

The current stable kernel is 2.4.20; Marcelo has released no 2.4.21 prepatches since 2.4.21-pre5, which came out on February 27. Note that 2.4.20 contains a local root vulnerability; if you are running systems with untrusted users, you should apply an update from your vendor or the patch supplied with the vulnerability announcement.

Alan Cox has released 2.2.25, which contains the ptrace vulnerability fix (and nothing else).
Kernel development news

32-bit dev_t progress

Andries Brouwer released a new set of patches this week which brings the long-planned expansion of dev_t closer to reality. These patches rework the character device infrastructure to make it safe for much larger numbers of devices. For now, at least, it is not even necessary to change any char drivers to work properly with the new code.

The first patch clears out the char device code within the filesystem area. This code included a whole structure for tracking devices, managing reference counts, etc. That structure was only used in one place, however, and Andries decided that, rather than fix it up to work with larger device numbers, he would just hack it out altogether. The rest of the kernel will not really notice its absence, for now.

The core of the work is in the second patch. Here, the longstanding static chrdevs array is removed. A static array of devices works reasonably well when there is a maximum of 255 of them; it's rather less convenient when there can be thousands of device numbers. In its place is a simple hash table with linked lists of registered char drivers. There is a new way of registering a char driver:
    int register_chrdev_region(unsigned int major,
                               unsigned int baseminor,
                               int minorct,
                               const char *name,
                               struct file_operations *fops);
The new baseminor and minorct arguments describe the range of minor numbers that the driver is prepared to deal with. Char drivers should eventually be converted to the new interface, but there is no great hurry; the register_chrdev() interface is still supported as:
    int register_chrdev(unsigned int major, const char *name,
                        struct file_operations *fops)
    {
        return register_chrdev_region(major, 0, 256, name, fops);
    }
So unchanged char drivers will still work, and will not be confronted with minor numbers greater than 255.

For now, drivers requesting a dynamic major number may continue to use the same mechanism: passing major as zero. The mechanism implemented in the patch is not entirely robust, however, and is marked as being temporary.

The third patch just cleans things up a bit, and removes the MAX_CHRDEV macro. For the truly adventurous, there is a fourth patch which actually changes dev_t to 32 bits, using a 16:16 split.

These patches have found their way into the -mm kernel tree, and are now in need of some serious testing. Should things work out, the 32-bit dev_t expansion may finally get crossed off the 2.5 development list.
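To make the "hash table with linked lists" idea concrete, here is a minimal userspace sketch of such a registry. It is not the code from the patch: the bucket count, the chrdev_entry layout, the overlap check, the lookup_chrdev() helper, and the stub struct file_operations are all assumptions made for illustration, and dynamic major allocation (major passed as zero) is not modeled. Only the two registration signatures are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHRDEV_HASH_SIZE 64             /* hypothetical bucket count */

struct file_operations { int unused; }; /* stub standing in for the kernel type */

struct chrdev_entry {                   /* hypothetical layout, not from the patch */
    unsigned int major;
    unsigned int baseminor;             /* first minor handled by this driver */
    int minorct;                        /* number of minors handled */
    const char *name;
    struct file_operations *fops;
    struct chrdev_entry *next;          /* next entry in the same hash bucket */
};

static struct chrdev_entry *chrdev_hash[CHRDEV_HASH_SIZE];

/* Returns 0 on success, -1 on a bad range, an overlap, or out of memory. */
int register_chrdev_region(unsigned int major, unsigned int baseminor,
                           int minorct, const char *name,
                           struct file_operations *fops)
{
    struct chrdev_entry **bucket = &chrdev_hash[major % CHRDEV_HASH_SIZE];
    struct chrdev_entry *e;

    assert(name != NULL && fops != NULL);
    if (minorct <= 0)
        return -1;

    /* Refuse a range that overlaps an existing one under the same major. */
    for (e = *bucket; e != NULL; e = e->next) {
        if (e->major == major &&
            baseminor < e->baseminor + (unsigned int)e->minorct &&
            e->baseminor < baseminor + (unsigned int)minorct)
            return -1;
    }

    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->major = major;
    e->baseminor = baseminor;
    e->minorct = minorct;
    e->name = name;
    e->fops = fops;
    e->next = *bucket;                  /* push onto the bucket's list */
    *bucket = e;
    return 0;
}

/* Compatibility wrapper, as shown in the text: minors 0..255 only. */
int register_chrdev(unsigned int major, const char *name,
                    struct file_operations *fops)
{
    return register_chrdev_region(major, 0, 256, name, fops);
}

/* Hypothetical helper: find the fops responsible for major:minor. */
struct file_operations *lookup_chrdev(unsigned int major, unsigned int minor)
{
    struct chrdev_entry *e;

    for (e = chrdev_hash[major % CHRDEV_HASH_SIZE]; e != NULL; e = e->next) {
        if (e->major == major && minor >= e->baseminor &&
            minor - e->baseminor < (unsigned int)e->minorct)
            return e->fops;
    }
    return NULL;
}
```

In this model a legacy driver registered through register_chrdev() never sees a minor above 255, while a second driver could claim, say, minors 256 and up under the same major.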
Speeding up ext2

The 2.5 kernel development process has put a strong emphasis on scalability and performance issues. So it is somewhat interesting that the core Linux filesystems - ext2 and ext3 - have seen relatively little scalability work in 2.5. That is beginning to change, at least for ext2, but this work is raising some interesting questions about what the role of these two filesystems really is.

Alex Tomas has recently been working on performance bottlenecks in ext2. His first concurrent block allocation patch attacks the problem of allocating blocks within a filesystem. The current ext2 code takes out the superblock lock before performing block allocation; this means that only one thread can be trying to allocate space in a given filesystem at a time. The first patch created a separate "allocation lock" which protects the small piece of code which actually makes allocation decisions; a later revision creates a separate lock for each block group within the filesystem, thus reducing lock contention further.

The patch was greeted with positive reviews. William Lee Irwin reported a throughput increase from 62 MB/s to 104 MB/s on a benchmark he ran, and exclaimed "This patch is a godsend. Whoever's listening, please apply!" Martin Bligh, for his part, said "SDET on my machine (16x NUMA-Q) has fallen in love with your patch, and has decided to elope with it to a small desert island." Not bad for a patch which is really a pretty straightforward exercise in finer-grained locking.

The block allocation patch was quickly joined by a concurrent inode allocation patch and a distributed counters patch. None of these have found their way into the mainline kernel yet, but they offer enough performance benefits that they will likely get there eventually. Assuming the block allocation patch can be coaxed back from its desert island experience, that is.

A question was raised, however: is ext2 the right place for this sort of work?
ext2 is generally thought of as the relatively simple Linux filesystem; ext3 is the place for fancy new stuff. There are a couple of reasons why this sort of work tends to find its way into ext2 first, though.

One of those reasons is the simple fact that ext3 still has bigger scaling problems. The ext3 filesystem is one of the few places in the Linux kernel that still makes heavy use of the big kernel lock (BKL). As a result, ext3 does not scale well to large systems, and tweaking things like block allocation will not help the real problem. Until the BKL dependency is removed from ext3, most other performance work will not make much sense. Removing the BKL is apparently a somewhat tricky job; at this point, it may well not happen before 2.6 is released.

The other reason is that, large-systems scaling issues notwithstanding, ext3 is developing into the default Linux filesystem. For most users, there is little or no incentive to prefer ext2 over ext3; all it takes is one power failure to make the advantages of a journaling filesystem clear. So, as Daniel Phillips put it:
Ext2 is growing into the role of experimental filesystem; Ext3 is
now the stable filesystem. Hopefully, the experiments will make
Ext2 smaller, cleaner and at the same time, more powerful, over
time. Sort of like the role that RAMFS plays: besides being
useful, Ext2 should be thought of as a showcase for best filesystem
coding practices
The role reversal, it seems, is nearly complete. Soon, it will be the ext2 users who are living on the bleeding edge.
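The finer-grained locking behind the block allocation patch can be illustrated with a small userspace sketch. This is not the ext2 code: the group count, the byte-per-block "bitmap", and the function names fs_init() and alloc_block() are all hypothetical, and pthread mutexes stand in for kernel locks. The point is only that each block group carries its own lock, so two threads allocating in different groups never wait on each other, whereas a single superblock-wide lock would serialize them.

```c
#include <assert.h>
#include <pthread.h>

#define NGROUPS 4            /* hypothetical number of block groups */
#define BLOCKS_PER_GROUP 8   /* hypothetical; tiny to keep the model small */

struct block_group {
    pthread_mutex_t lock;                    /* protects this group only */
    unsigned char in_use[BLOCKS_PER_GROUP];  /* 1 = block is allocated */
    int free_blocks;
};

static struct block_group groups[NGROUPS];

/* Set every group to "all blocks free" and create its private lock. */
void fs_init(void)
{
    for (int g = 0; g < NGROUPS; g++) {
        pthread_mutex_init(&groups[g].lock, NULL);
        for (int b = 0; b < BLOCKS_PER_GROUP; b++)
            groups[g].in_use[b] = 0;
        groups[g].free_blocks = BLOCKS_PER_GROUP;
    }
}

/*
 * Allocate one block, searching from goal_group and wrapping around.
 * Only the lock of the group being examined is held at any time.
 * Returns a filesystem-wide block number, or -1 if every group is full.
 */
long alloc_block(int goal_group)
{
    assert(goal_group >= 0 && goal_group < NGROUPS);

    for (int i = 0; i < NGROUPS; i++) {
        int g = (goal_group + i) % NGROUPS;
        struct block_group *bg = &groups[g];
        long found = -1;

        pthread_mutex_lock(&bg->lock);
        if (bg->free_blocks > 0) {
            for (int b = 0; b < BLOCKS_PER_GROUP; b++) {
                if (!bg->in_use[b]) {
                    bg->in_use[b] = 1;
                    bg->free_blocks--;
                    found = (long)g * BLOCKS_PER_GROUP + b;
                    break;
                }
            }
        }
        pthread_mutex_unlock(&bg->lock);

        if (found >= 0)
            return found;
    }
    return -1;
}
```

A caller would run fs_init() once and then call alloc_block() from any number of threads; contention arises only when two of them land in the same group at the same moment.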
Driver porting

News from the driver porting series

The driver porting series continues to look at block drivers this week. Below you'll find an article on the gendisk interface, which has become rather more important in 2.5. Also available is this article which looks, in detail, at the simplest possible block driver - a naive ramdisk driver for 2.5. As always, the entire series (up to 19 articles now) can be found on this page.
Driver porting: the gendisk interface
Gendisk initialization

The best way of looking at the contents of a gendisk structure from a block driver's point of view is to examine what that driver must do to set the structure up in the first place. If your driver makes a disk (or disk-like) device available to the system, it will have to provide an associated gendisk structure. (Note, however, that it is not necessary - or correct - to set up gendisk structures for disk partitions.)

The first step is to create the gendisk structure itself; the function you need is alloc_disk() (which is declared in <linux/genhd.h>):
struct gendisk *alloc_disk(int minors);
The argument minors is the maximum number of minor numbers that this disk can have. Minor numbers correspond to partitions, of course (except the first, which is the "whole disk" device), so the value passed here controls the maximum number of partitions. If a single minor number is requested, the device cannot be partitioned at all. The return value is a pointer to the gendisk structure; the allocation can fail, so this value should always be checked against NULL before proceeding. There are several fields of the gendisk structure which must be initialized by the block driver. They include:
The gendisk structure also holds the size of the disk, in sectors. As part of the initialization process, the driver should set that size with:
void set_capacity(struct gendisk *disk, sector_t size);
The size value should be in 512-byte sectors, even if the hardware sector size used by your device is different. For a removable disk, setting the capacity to zero indicates to the block subsystem that there is currently no media present in the device.
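The unit conversion is easy to get wrong when the hardware uses larger sectors, so a small worked example may help. The helper below is a userspace sketch; its name, capacity_in_512_sectors(), and the KERNEL_SECTOR_SIZE constant are assumptions made for illustration and are not part of the gendisk interface. It also assumes the hardware sector size is a nonzero multiple of 512.

```c
#include <assert.h>

#define KERNEL_SECTOR_SIZE 512u  /* the unit set_capacity() expects, per the text */

/*
 * Convert a device size expressed in hardware sectors into the
 * 512-byte sectors that set_capacity() wants.
 * hardsect_size must be a nonzero multiple of 512 (assumption).
 */
unsigned long long capacity_in_512_sectors(unsigned long long hw_sectors,
                                           unsigned int hardsect_size)
{
    assert(hardsect_size != 0 && hardsect_size % KERNEL_SECTOR_SIZE == 0);
    return hw_sectors * (hardsect_size / KERNEL_SECTOR_SIZE);
}
```

A device with 1000 sectors of 2048 bytes each would thus report a capacity of 4000, and a driver would pass that result as the size argument of set_capacity(). A removable drive with no media would simply pass zero.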
Manipulating gendisks

Once you have your gendisk structure set up, you have to add it to the list of active disks; that is done with:
void add_disk(struct gendisk *disk);
After this call, your device is active. There are a few things worth keeping in mind about add_disk():
Should you need to remove a disk from the system, that is accomplished with:
void del_gendisk(struct gendisk *disk);
This function cleans up all of the information associated with the given disk, and generally removes it from the system. After a call to del_gendisk(), no more operations will be sent to the given device. Your driver's reference to the gendisk object remains, though; you must explicitly release it with:
void put_disk(struct gendisk *disk);
That call will cause the gendisk structure to be freed, as long as no other part of the kernel retains a reference to it.

Should you need to set a disk into a read-only mode, use:
void set_disk_ro(struct gendisk *disk, int flag);
If flag is nonzero, all partitions on the disk will be marked read-only. The kernel can track read-only status individually for each partition, but no utility function has been exported to manipulate that status for single partitions.

Partition management is handled within the block subsystem in 2.6; drivers need not worry about partitions at all. Should the need arise, the functions add_partition() and delete_partition() can be used to manipulate the (in-kernel) partition table directly. These functions are used in the generic block ioctl() code; there should be no need for a block driver to call them directly.
Registering block device number ranges

A call to add_disk() implicitly allocates a set of minor numbers (under the given major number) from first_minor to first_minor+minors-1. If your driver must only respond to operations on disks that exist at initialization time, there is no need to worry further about number allocation. Even the traditional call to register_blkdev() is optional, and may be removed soon. Some drivers, however, need to be able to claim responsibility for a larger range of device numbers at initialization time.

If this is your case, the answer is to call blk_register_region(), which has this rather involved prototype:
    void blk_register_region(dev_t dev,
                             unsigned long range,
                             struct module *module,
                             struct kobject *(*probe)(dev_t, int *, void *),
                             int (*lock)(dev_t, void *),
                             void *data);
Here, dev is a device number (created with MKDEV()) containing the major and first minor number of the region of interest; range is the number of minor numbers to allocate; module is the loadable module (if any) containing the driver; probe is a driver-supplied function to probe for a single disk; lock is a driver-supplied locking function; and data is a driver-private pointer which is passed to probe() and lock().

When blk_register_region() is called, it simply makes a note of the desired region and returns. Note that there can be more than one registration within a specific region! At lookup time, the most "specific" registration (the one with the smallest range) wins.

At some point in the future, an attempt may be made to access a device number within the allocated region. At that point, there will be a call to the lock() function (if it was not passed as NULL) with the device number of interest. If lock() succeeds, probe() will be called to find the specific disk of interest. The full prototype of the probe function is:
struct kobject *(*probe)(dev_t dev, int *partition, void *data);
Here, dev is the device number of interest, partition is a pointer to a partition number (sort of), and data is the driver-private pointer passed to blk_register_region(). The partition number is actually just the offset into the allocated range; it's the minor number from dev with the beginning of the range subtracted.

The probe() function should attempt to identify a specific gendisk structure which corresponds to the requested number. If it is successful, it should return a pointer to the kobject structure contained within the gendisk. Kobjects are covered in a separate article; for now, all you really need to know is that you should call get_disk() with the gendisk structure as the argument, and return the value from get_disk() to the caller. The probe() function can also modify the partition number so that it corresponds to the actual partition offset in the returned device. If the function cannot handle the request at all, it can return NULL.

Some probe() functions do not, themselves, locate and initialize the device of interest. Instead, they call some other function to set in motion that whole process. For example, a number of probe() functions simply call request_module() in an attempt to load a module which can handle the device. In this mode of operation, the function should return NULL, which will cause the block layer to look at the device number allocations one more time. If a "better" allocation (with a smaller range) has happened in the meantime, the probe() function for the new driver will be called. So, for example, if a module is loaded which allocates a smaller device number range corresponding to the devices it actually implements, its probe() routine will be called on the next iteration.

Of course, there is the usual associated unregister function:
void blk_unregister_region(dev_t dev, unsigned long range);
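The "most specific registration wins" rule and the way the partition offset is derived can be shown with a short userspace model. This is only a sketch of the lookup logic as described above, not the block layer's implementation: the fixed-size table, struct region, register_region(), and lookup_region() are hypothetical names, an owner string stands in for the probe(), lock(), and data arguments, and the retry that follows a NULL return from probe() is not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 16   /* hypothetical fixed-size table */

struct region {                 /* hypothetical; models one registration */
    unsigned int major;
    unsigned int first_minor;
    unsigned long range;        /* number of minor numbers covered */
    const char *owner;          /* stands in for probe(), lock() and data */
};

static struct region regions[MAX_REGIONS];
static int nregions;

/* Note a region and return; overlapping registrations are allowed. */
int register_region(unsigned int major, unsigned int first_minor,
                    unsigned long range, const char *owner)
{
    if (nregions == MAX_REGIONS || range == 0)
        return -1;
    regions[nregions].major = major;
    regions[nregions].first_minor = first_minor;
    regions[nregions].range = range;
    regions[nregions].owner = owner;
    nregions++;
    return 0;
}

/*
 * Find the registration with the smallest range that covers major:minor.
 * On success, *partition receives the offset of minor within that range.
 * Returns NULL (leaving *partition untouched) if nothing covers the number.
 */
const struct region *lookup_region(unsigned int major, unsigned int minor,
                                   int *partition)
{
    const struct region *best = NULL;

    assert(partition != NULL);
    for (int i = 0; i < nregions; i++) {
        const struct region *r = &regions[i];

        if (r->major != major || minor < r->first_minor ||
            minor - r->first_minor >= r->range)
            continue;                       /* does not cover this number */
        if (best == NULL || r->range < best->range)
            best = r;                       /* more specific than before */
    }
    if (best != NULL)
        *partition = (int)(minor - best->first_minor);
    return best;
}
```

With a 256-minor "catch-all" region and a 16-minor region starting at minor 16 under the same major, a lookup of minor 18 selects the smaller region with a partition offset of 2, while minor 40 falls through to the catch-all.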
The next step

Once you have a handle on how the gendisk structure works, the next thing to do is to learn about BIO structures.
Page editor: Jonathan Corbet
Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds