Brief items
The current development kernel is 2.6.0-test9, which was released
almost a full month ago now. Fixes continue to trickle slowly into Linus's
BitKeeper tree, however.
The current stable kernel is 2.4.22, but not for much longer.
Marcelo has turned loose the second 2.4.23
release candidate which, he hopes, will be the final one.
Kernel development news
Anybody who has been following Linus's BitKeeper tree knows that very few
patches have gone in recently. Linus is doing his best to restrict things
to only the most important fixes. As a result, one might get the
impression that 2.6 development has stalled. Development continues, of
course, and bug fixes are being produced, but most of that work is not
getting into the tree in the interests of getting a highly stable 2.6.0
release out.
Linus explains his policy this way:
I've been trying to be an absolute _bastard_ when it comes to
patches. Yeah, I just looked. Lately they've been averaging about
3-4kB per day. And the sick thing is, I'm still not satisfied. I
want it to become an absolute _trickle_ of one-liners that fix real
bugs.
This policy makes some sense; it should quiet the waters enough to help the
developers find most of the final serious problems in 2.6.0. The only
problem, though, is that there is an increasingly large pile of patches
which will have to go in after 2.6.0. As a way of thinking about what
happens then, consider what Linus said
almost three years ago, when 2.4.0 came out:
The linux kernel has had an interesting release pattern: usually
the .0 release was actually fairly good (there's almost always
_something_ stupid, but on the whole not really horrible). And
every single time so far, .1 has been worse. It usually takes
until something like .5 until it has caught up and surpassed the
stability of .0 again.
Why? Because there are a lot of pent-up patches waiting for
inclusion, that didn't get through the "we need to get a release
out, that patch can wait" filter. So early on in the stable tree,
some of those patches make it. And it turns out to be a bad idea.
To an extent, things have to be opened up a bit after the 2.6.0 release.
The wider testing that the "dot-zero" release gets is certain to turn up
new bugs that will need fixing. And a number of the fixes out there do
need to go in before 2.6 can be deployed in a lot of production
situations. So chances are good that the usual pattern will be followed;
things will destabilize a little before 2.6 is truly ready for wider use.
That, perhaps, is simply the way kernels have to be made.
The "must fix" and "should fix" lists which were frequently posted some
months ago have been keeping a low profile recently. They do still exist,
however, and some effort has gone into keeping them up to date. The latest
version is bundled with Andrew Morton's -mm patches. For the curious, here
are the must-fix and should-fix lists from 2.6.0-test9-mm4.
The must-fix list remains surprisingly long, given that 2.6.0 is considered to
be right around the corner. It includes (among many other items):
- A lot of locking problems in the tty, parport, PCMCIA, SCSI, and input
drivers.
- Expanding dev_t to 64 bits is there, though the list
acknowledges that the current 32-bit size will be enough for 2.6.0.
Reaching 64 bits will require additional work with certain filesystems
(such as older NFS protocols) which are not prepared for it.
- The char device rework remains incomplete, though it is in a
functioning state now. It would not be surprising to see some changes
in the char device API early in 2.6.x. Such things cause endless
annoyance to people trying to write driver books.
- There are still fixes from the 2.4 tree - including security fixes -
which must be ported to 2.6. Alan Cox surfaced from his studies long enough to
note that this work is currently being done.
- The "misc device" interface is marked for removal, since the new char
device interface does all the same stuff. That change seems unlikely
for 2.6.x, however.
- Asynchronous I/O remains a work in progress. It has a number of
potentially lethal race conditions, and fairly straightforward things
(regular file I/O, for example) are not fully implemented. The -mm
tree contains a lot of AIO patches which should move over at some
point, but they are clearly not the "one line fixes" that Linus is looking
for currently.
- Scheduler interactivity remains on the list, though the level of
complaining is lower than it used to be.
The "should-fix" list is even longer. It includes more IDE driver work,
various device mapper cleanups, the incorporation of a number of wireless
driver patches, the kexec patch (booting one kernel directly from another),
merging klibc (for initramfs images), MPLS support for IPSec, sorting out
the three-way software suspend disagreement, a kernel interface for
reporting errors to user space, improving the external module build
process, and numerous other things.
This list also still includes fixing module initialization races by not
enabling calls into the module until initialization is complete. With the
new module loading infrastructure, this change is an easy one to make. The
only problem is that it breaks certain things (like disk drivers, where the
kernel attempts to read the partition table when a disk is registered with
the system). These problems can be worked around, but there appears to be
little will to do so at this time.
No kernel will ever be perfect when it is released - making one perfect
would take so long that the kernel would no longer be relevant. Even so,
these lists are still long. Expect a bit of churn in the early 2.6.x
releases as the developers work at shortening them.
Driver porting
The updating of the Driver Porting Series is almost complete; as of this
writing, only the device model articles need to be done (they will take a
bit of work). The following article is another rerun, but it has seen
enough changes to be worth another pass. The "simple block driver" is even
simpler now; it is significantly shorter (less than 200 lines), but it
implements a fully functional, partitionable block device.
Given the large number of changes to the 2.6 block layer, it is hard to
know where to start describing them. We'll begin by examining the simplest
possible block driver. The sbd ("simple block device") driver simulates a
block device with a region of kernel memory; it is, essentially, a naive
ramdisk driver implemented in less than 200 lines of code. It will allow
the demonstration of some changes in how
block drivers work with the rest of the system without the need for all the
complexity required when one is dealing with real hardware. Code fragments
will be shown below; the full driver source can be found
on this page.
If you have not read the block layer
overview, you might want to head over there for a moment; this article
will still be here when you get back.
Initialization
In our simple driver, the module initialization function is called
sbd_init(). Its job, of course, is to get set up for block
operations and to make its disk available to the system. The first step is
to set up our internal data structure; within the driver a disk (the
disk, in this case) is represented by:
static struct sbd_device {
unsigned long size;
spinlock_t lock;
u8 *data;
struct gendisk *gd;
} Device;
Here size is the size of the device (in bytes), data is
the array where the "disk" stores its data, lock is a spinlock for
mutual exclusion, and gd is the kernel
representation of our device.
The device initialization is pretty straightforward; it is just a matter of
allocating the memory to actually store the data and initializing the
spinlock:
Device.size = nsectors*hardsect_size;
spin_lock_init(&Device.lock);
Device.data = vmalloc(Device.size);
if (Device.data == NULL)
return -ENOMEM;
(nsectors and hardsect_size are load-time parameters that
control how big the device should be).
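The size arithmetic and the allocation check can be tried out in user space. What follows is only a rough sketch under some assumptions: struct sbd_model and its two helpers are hypothetical names (they are not part of the driver), calloc() stands in for vmalloc(), and the lock and gendisk fields are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct sbd_device (no lock, no gendisk). */
struct sbd_model {
    unsigned long size;   /* device size in bytes */
    uint8_t *data;        /* backing store for the "disk" */
};

/*
 * Mirrors the setup step: size = nsectors * hardsect_size, then allocate.
 * calloc() is an assumption standing in for vmalloc(); like the driver,
 * the function returns -ENOMEM if the allocation fails.
 */
static int sbd_model_setup(struct sbd_model *dev,
                           unsigned long nsectors, unsigned long hardsect_size)
{
    assert(dev != NULL);
    dev->size = nsectors * hardsect_size;
    dev->data = calloc(1, dev->size);
    if (dev->data == NULL)
        return -ENOMEM;
    return 0;
}

/* Release the backing store and leave the structure in a known state. */
static void sbd_model_teardown(struct sbd_model *dev)
{
    free(dev->data);
    dev->data = NULL;
    dev->size = 0;
}
```

With, say, 1024 sectors of 512 bytes each (illustrative values, not necessarily the driver's defaults), the model ends up with a 512KB zero-filled buffer.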
About now is where block drivers traditionally register themselves with the
kernel, and sbd does that too:
major_num = register_blkdev(major_num, "sbd");
if (major_num <= 0) {
printk(KERN_WARNING "sbd: unable to get major number\n");
goto out;
}
Note that, in 2.6, no device operations structure is passed to
register_blkdev(). As it turns out,
a block driver can happily get by without calling
register_blkdev() at all. That function does little work, at this
point, and will likely be removed sooner or later. About
the only remaining tasks performed by register_blkdev() are the
assignment of a dynamic major number (if requested), and causing the block
driver to show up in /proc/devices.
Generic disks
If register_blkdev() no longer does much of anything, where does the real
work get done? The answer lies in the much improved 2.6 "generic disk" (or
"gendisk") code. The gendisk interface is covered in
a separate article, so we'll look only quickly at how
sbd does its gendisk setup.
The first step is to get a gendisk structure to represent the
sbd device:
Device.gd = alloc_disk(16);
if (! Device.gd)
goto out_unregister;
Note that a memory allocation is involved, so the return value should be
checked. The parameter to alloc_disk() indicates the number of
minor numbers that should be dedicated to this device. We have requested
16 minor numbers, meaning that the device will support 15 partitions.
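The relationship between minor numbers and partitions is simple arithmetic: one minor number goes to the whole-disk device, and each remaining one can name a partition. Here is a small user-space sketch of that bookkeeping; max_partitions() and partition_minor() are hypothetical helpers written for this illustration, not kernel functions.

```c
#include <assert.h>

/*
 * Number of partitions available when a disk is given `minors` minor
 * numbers: one minor is taken by the whole-disk device itself.
 */
static int max_partitions(int minors)
{
    assert(minors >= 1);
    return minors - 1;
}

/*
 * Minor number of partition `part` (0 means the whole disk) on a disk
 * whose first minor is `first_minor`. Returns -1 if `part` is out of range.
 */
static int partition_minor(int first_minor, int minors, int part)
{
    if (part < 0 || part >= minors)
        return -1;
    return first_minor + part;
}
```

For sbd, with first_minor set to 0 and 16 minors, the whole disk is minor 0 and partitions 1 through 15 get minors 1 through 15.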
The gendisk must be initialized; the sbd driver starts that task
as follows:
Device.gd->major = major_num;
Device.gd->first_minor = 0;
Device.gd->fops = &sbd_ops;
Device.gd->private_data = &Device;
strcpy (Device.gd->disk_name, "sbd0");
set_capacity(Device.gd, nsectors*(hardsect_size/KERNEL_SECTOR_SIZE));
Most of the above should be relatively self-explanatory. The fops
field is a pointer to the block_device_operations structure for
this device; we'll get to that shortly. The private_data field
can be used by the driver, so we stick a pointer to our sbd_device
structure there. The set_capacity() call tells the kernel how
large the device is. Note that the kernel can handle block devices which
have sectors greater than 512 bytes, but it always deals with 512-byte
sectors internally. So we need to normalize the sector count before
passing it to the kernel.
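The normalization is easy to check in isolation. The following user-space sketch repeats the expression passed to set_capacity(); capacity_in_kernel_sectors() is a hypothetical helper name, and KERNEL_SECTOR_SIZE is assumed here to be 512, matching the sector size the kernel uses internally.

```c
#include <assert.h>

#define KERNEL_SECTOR_SIZE 512  /* assumed: the kernel's internal sector size */

/*
 * Convert a count of hardware sectors into a count of 512-byte kernel
 * sectors, using the same expression the driver hands to set_capacity().
 * The hardware sector size must be a multiple of 512 for this to be exact.
 */
static unsigned long capacity_in_kernel_sectors(unsigned long nsectors,
                                                unsigned long hardsect_size)
{
    assert(hardsect_size >= KERNEL_SECTOR_SIZE);
    assert(hardsect_size % KERNEL_SECTOR_SIZE == 0);
    return nsectors * (hardsect_size / KERNEL_SECTOR_SIZE);
}
```

A device with 1024 sectors of 2048 bytes each, for example, is reported to the kernel as 4096 sectors.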
Another thing that (usually) goes into the gendisk is the request queue to
use. The BLK_DEFAULT_QUEUE macro from 2.4 is no more; a block
driver must explicitly create and set up the request queue(s) it will use.
Furthermore, request queues must be allocated dynamically, at run time. The
sbd driver sets up its request queue as follows:
static struct request_queue *Queue;
/* ... */
Queue = blk_init_queue(sbd_request, &Device.lock);
if (Queue == NULL)
goto out;
blk_queue_hardsect_size(Queue, hardsect_size);
Device.gd->queue = Queue;
Here, sbd_request is the request function, which we will get to
soon.
Note that a spinlock must be passed into blk_init_queue(). The
global io_request_lock is gone forevermore, and each block driver
must manage its own locking. Typically, the lock used by the driver to
serialize access to internal resources is the best choice for controlling
access to the request queue as well. For
that reason, the block layer expects the driver to provide a lock of its
own for the queue. If a nonstandard hard sector size (i.e. not 512 bytes)
is in use, the sector size should be stored into the request queue with
blk_queue_hardsect_size(). Finally, a pointer to the queue must
be stored in the gendisk structure.
At this point, the gendisk setup is complete. All that remains is to add
the disk to the system:
add_disk(Device.gd);
Note that add_disk() may well generate I/O to the device before it
returns - the
driver must be in a state where it can handle requests before adding
disks. The driver also should not fail initialization after it has
successfully added a disk.
What you don't have to do
That is the end of the initialization process for the sbd driver.
What you don't have to do is as notable as what does need to be done.
For example, there are no assignments to global arrays; the whole set of
global variables that used to describe block devices is gone. There is
also nothing here for dealing with partition setup. Partition handling is
now done in the generic block layer, and there is almost nothing that
individual drivers must do at this point. "Almost" because the driver must
handle one ioctl() call, as described below.
Open and release
The open and release methods (which are kept in the
block_device_operations structure) actually have not changed since
2.4. The sbd driver has nothing to do at open or release time, so
it doesn't even bother to define these methods. Drivers for real hardware
may need to lock and unlock doors, check for media, etc. in these methods.
The request method
The core of a block driver, of course, is its request method. The
sbd driver has the simplest possible request function; it
does not concern itself with things like request clustering, barriers,
etc. It does not understand the new bio structure used to
represent requests at all. But it works. Real drivers will almost
certainly require a more serious request method; see the other
Driver Porting Series articles for the gory details on how to do that.
Here is the whole thing:
static void sbd_request(request_queue_t *q)
{
struct request *req;
while ((req = elv_next_request(q)) != NULL) {
if (! blk_fs_request(req)) {
end_request(req, 0);
continue;
}
sbd_transfer(&Device, req->sector, req->current_nr_sectors,
req->buffer, rq_data_dir(req));
end_request(req, 1);
}
}
The first thing to notice is that all of the old
<linux/blk.h> cruft has been removed. Macros like
INIT_REQUEST (with its hidden return statement),
CURRENT, and QUEUE_EMPTY are gone. It is now necessary
to deal with the request queue functions directly, but, as can be seen,
that is not particularly hard.
Note that the Device.lock will be held on entry to the
request function, much like io_request_lock is in 2.4.
The function for getting the first request in the queue is now
elv_next_request(). A NULL return means that there are
no more requests on the queue that are ready to process.
A simple request loop like this one can simply run until the request queue
is empty; drivers for real hardware will also have to take into account how
many operations the device can handle, of course.
Note that this function does not
actually remove the request from the queue; it just returns a properly
adjusted view of the top request.
Note also that, in 2.6, there can be
multiple types of requests. Thus the test:
if (! blk_fs_request(req)) {
end_request(req, 0);
continue;
}
A nonzero return value from the blk_fs_request() macro says "this
is a normal filesystem request."
Other types of requests (e.g. packet-mode or device-specific diagnostic
operations) are not something that sbd supports, so it
simply fails any such requests.
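The shape of this loop (peek at the head of the queue, fail what is not understood, complete the rest) can be modeled in user space. The sketch below is a toy under stated assumptions: every name in it (model_request, model_queue, and the three functions) is hypothetical, and each request is treated as a single chunk which is retired as soon as it is ended, which glosses over how the real end_request() deals with partially completed requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request types: filesystem requests versus anything else. */
enum req_type { REQ_FS, REQ_OTHER };

struct model_request {
    enum req_type type;
    int done;        /* set once the request has been completed */
    int uptodate;    /* 1 = success, 0 = failure */
};

struct model_queue {
    struct model_request *reqs;
    size_t count;
    size_t head;     /* index of the first uncompleted request */
};

/* Like elv_next_request(): return the head request WITHOUT removing it. */
static struct model_request *model_next_request(struct model_queue *q)
{
    return q->head < q->count ? &q->reqs[q->head] : NULL;
}

/* Toy version of end_request(): record the result, retire the head request. */
static void model_end_request(struct model_queue *q,
                              struct model_request *req, int uptodate)
{
    assert(req == &q->reqs[q->head]);
    req->done = 1;
    req->uptodate = uptodate;
    q->head++;
}

/* Same shape as sbd_request(); returns how many requests were served. */
static int model_request_fn(struct model_queue *q)
{
    struct model_request *req;
    int served = 0;

    while ((req = model_next_request(q)) != NULL) {
        if (req->type != REQ_FS) {
            model_end_request(q, req, 0);   /* unsupported: fail it */
            continue;
        }
        /* a real driver would move the data here */
        model_end_request(q, req, 1);
        served++;
    }
    return served;
}
```

Calling model_next_request() twice in a row returns the same request; only ending the request moves the queue forward.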
The function sbd_transfer() is really just a memcpy()
with some checking; see the full source if you are interested. The key is
in the parameters: the various fields of the request structure
(sector, current_nr_sectors, and buffer) look
just like they did in 2.4. They also have the same meaning: they are a
window looking at the first part of a (possibly larger) request. If you
deal with block requests at this level, you need know nothing of the
bio structures underlying the request. This approach only works
for the simplest of drivers, however.
Note that the direction of the request is now found in the flags
field, and can be tested with rq_data_dir(). A nonzero value
(WRITE) indicates that this is a write request. Note also the
absence of any code adding partition offsets; all of that is handled in the
higher layers.
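For readers who would like to see the idea without digging up the full source, here is a guess at what "a memcpy() with some checking" might look like, recast as a user-space function. It is not the driver's actual code: struct sbd_model and sbd_model_transfer() are hypothetical, the int return value is an addition made for testability, and sector numbers are assumed to be in 512-byte units.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KERNEL_SECTOR_SIZE 512  /* assumed unit of the sector arguments */

/* Hypothetical user-space stand-in for the device structure. */
struct sbd_model {
    unsigned long size;  /* device size in bytes */
    uint8_t *data;       /* backing store */
};

/*
 * Copy nsect sectors starting at `sector` between the device and buffer.
 * A nonzero `write` copies buffer -> device, zero copies device -> buffer.
 * Returns 0 on success, -1 if the range runs past the end of the device.
 */
static int sbd_model_transfer(struct sbd_model *dev, unsigned long sector,
                              unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector * KERNEL_SECTOR_SIZE;
    unsigned long nbytes = nsect * KERNEL_SECTOR_SIZE;

    assert(dev != NULL && buffer != NULL);
    if (offset > dev->size || nbytes > dev->size - offset)
        return -1;   /* beyond the end of the device */
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
    return 0;
}
```

The bounds check is written as two comparisons so that it cannot be fooled by offset + nbytes wrapping around.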
Finally, end_request() is called to finish processing of this
request. This function has picked up a new parameter in 2.6: a
pointer to the request structure.
Other block operations
The two other block_device_operations methods from 2.4 -
check_media_change() and revalidate() - have seen prototype
changes in 2.5. They are now called media_changed() and
revalidate_disk(), and both take a gendisk structure as
their only argument. The basic task performed by these methods remains
unchanged, however.
In 2.4, a block driver's ioctl() method would handle any commands
it understood, and pass the rest on to blk_ioctl() for generic
processing. In 2.6, the generic code gets the first crack at any
ioctl() calls, and only invokes the driver for those it can't
implement itself. As a result, ioctl() methods in drivers can
often be pretty small. The sbd driver includes an ioctl
method which handles a single command:
int sbd_ioctl (struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
long size;
struct hd_geometry geo;
switch(cmd) {
/*
* The only command we need to interpret is HDIO_GETGEO, since
* we can't partition the drive otherwise. We have no real
* geometry, of course, so make something up.
*/
case HDIO_GETGEO:
size = Device.size/KERNEL_SECTOR_SIZE; /* bytes -> 512-byte sectors */
geo.cylinders = (size & ~0x3f) >> 6;
geo.heads = 4;
geo.sectors = 16;
geo.start = 4;
if (copy_to_user((void *) arg, &geo, sizeof(geo)))
return -EFAULT;
return 0;
}
return -ENOTTY; /* unknown command */
}
The notion of a regular geometry has been fiction for most devices for some
years now. Tools like fdisk still work with cylinders, however,
so a driver must make up some sort of convincing geometry story. The
sbd implementation claims four heads and 16 sectors per track (thus 64
sectors per cylinder), but anything else reasonable would have worked as well.
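The made-up geometry boils down to one division: four heads times 16 sectors gives 64 sectors per cylinder, so the cylinder count is the device's sector count divided by 64 (which is what the mask-and-shift in the ioctl method computes). A user-space sketch, assuming a hypothetical struct model_geometry which is simpler than the kernel's struct hd_geometry:

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct hd_geometry. */
struct model_geometry {
    unsigned long cylinders;
    unsigned char heads;
    unsigned char sectors;   /* sectors per track */
};

/*
 * Invent a geometry for a device holding `total_sectors` 512-byte sectors.
 * With 4 heads and 16 sectors per track, each cylinder holds 64 sectors;
 * any partial cylinder at the end is dropped.
 */
static struct model_geometry fake_geometry(unsigned long total_sectors)
{
    struct model_geometry geo;

    geo.heads = 4;
    geo.sectors = 16;
    geo.cylinders = total_sectors / (geo.heads * geo.sectors);
    return geo;
}
```

A 1024-sector device thus claims 16 cylinders, while a 1000-sector device claims only 15; the last 40 sectors do not fill a cylinder.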
Shutting down
The last thing to look at is what happens when the module is unloaded. We
must, of course, clean up our various data structures and free memory - the
usual stuff. The
sbd cleanup function looks like this:
static void __exit sbd_exit(void)
{
del_gendisk(Device.gd);
put_disk(Device.gd);
unregister_blkdev(major_num, "sbd");
blk_cleanup_queue(Queue);
vfree(Device.data);
}
del_gendisk() cleans up any partitioning information, and
generally makes the system forget about the gendisk passed to it. The call
to put_disk() then releases our reference to the gendisk
structure (obtained
when we first called alloc_disk()) so that it can be freed. Then, of
course, we must free the memory used for the device itself (only after the
gendisk has been cleaned up, so we know no more operations can be
requested), release the request queue, and unregister the block device.
Conclusion
That is about as simple as it gets; the above implements a true virtual
block device that can support a filesystem. Real drivers, of course, will
tend to be more complicated. For details on how to make them more
complicated, continue with the Driver Porting Series; the next block
driver article is The Gendisk Interface.
Page editor: Jonathan Corbet