Given the large number of changes to the 2.6 block layer, it is hard to
know where to start describing them. We'll begin by examining the simplest
possible block driver. The sbd ("simple block device") driver simulates a
block device with a region of kernel memory; it is, essentially, a naive
ramdisk driver. It will allow the demonstration of some changes in how
block drivers work with the rest of the system without the need for all the
complexity required when one is dealing with real hardware. Code fragments
will be shown below; the full driver source can be found
on this page.
If you have not read the block layer
overview, you might want to head over there for a moment; this article
will still be here when you get back.
Initialization
In our simple driver, the module initialization function is called
sbd_init(). Its job, of course, is to get set up for block
operations and to make its disk available to the system. The first step is
to set up our internal data structure; within the driver a disk (the
disk, in this case) is represented by:
static struct sbd_device {
    unsigned long size;
    u8 *data;
    spinlock_t lock;
    struct gendisk *gd;
    struct block_device *bdev;
} Device;
Here size is the size of the device (in bytes), data is
the array where the "disk" stores its data, lock is a spinlock for
mutual exclusion, and gd and bdev are kernel
representations of our device.
The device initialization is pretty straightforward:
Device.size = nsectors*hardsect_size;
spin_lock_init(&Device.lock);
Device.data = vmalloc(Device.size);
if (Device.data == NULL)
    return -ENOMEM;
Device.bdev = NULL; /* we'll get it at open time */
(nsectors and hardsect_size are load-time parameters that
control how big the device should be).
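The sizing and allocation step can be tried outside the kernel. What follows is a minimal userspace sketch, not the driver's code: struct sbd_sketch, sbd_sketch_setup() and sbd_sketch_teardown() are hypothetical names, malloc() stands in for vmalloc(), and the overflow check on the multiplication is an addition that the original fragment does not make.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the driver's device structure (hypothetical). */
struct sbd_sketch {
    size_t size;          /* device size in bytes */
    unsigned char *data;  /* backing store for the "disk" */
};

/*
 * Compute nsectors * hardsect_size and allocate the backing store.
 * Zero sizes and products that would overflow size_t are refused.
 * Returns 0 on success, -1 on failure.
 */
static int sbd_sketch_setup(struct sbd_sketch *dev,
                            size_t nsectors, size_t hardsect_size)
{
    assert(dev != NULL);
    dev->size = 0;
    dev->data = NULL;

    if (nsectors == 0 || hardsect_size == 0)
        return -1;
    if (nsectors > SIZE_MAX / hardsect_size)
        return -1;                       /* multiplication would overflow */

    dev->data = malloc(nsectors * hardsect_size); /* vmalloc() in the driver */
    if (dev->data == NULL)
        return -1;                       /* the driver returns -ENOMEM here */

    dev->size = nsectors * hardsect_size;
    return 0;
}

/* Release the backing store; the driver uses vfree() for this. */
static void sbd_sketch_teardown(struct sbd_sketch *dev)
{
    free(dev->data);
    dev->data = NULL;
    dev->size = 0;
}
```

With 1024 sectors of 512 bytes each, the sketch sets size to 524288 bytes.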
About now is where block drivers traditionally register themselves with the
kernel, and sbd does that too:
major_num = register_blkdev(major_num, "sbd", &sbd_ops);
if (major_num <= 0) {
    printk(KERN_WARNING "sbd: unable to get major number\n");
    goto out;
}
The above example works in 2.5.64; as of 2.5.65, the third parameter (the
block operations structure sbd_ops) should be removed. Note that,
in 2.5, a block driver can happily get by without calling
register_blkdev() at all. That function does little work, at this
point, and will likely be removed before the end of the 2.5 process. About
the only remaining tasks performed by register_blkdev() are the
assignment of a dynamic major number (if requested), and causing the block
driver to show up in /proc/devices.
Generic disks
If register_blkdev() no longer does anything, where does the real
work get done? The answer lies in the much improved 2.5 "generic disk" (or
"gendisk") code. The gendisk interface merits (and will get) an article of
its own, so we'll look only quickly at how sbd does its gendisk setup.
The first step is to get a gendisk structure to represent the
sbd device:
Device.gd = alloc_disk(1);
if (! Device.gd)
    goto out_unregister;
Note that a memory allocation is involved, so the return value should be
checked. The parameter to alloc_disk() indicates the number of
minor numbers that should be dedicated to this device. The sbd
gendisk is set up with a single minor number, meaning that it cannot be
partitioned. Real devices will probably want more minors.
The gendisk must be initialized; the sbd driver starts that task
as follows:
Device.gd->major = major_num;
Device.gd->first_minor = 0;
Device.gd->fops = &sbd_ops;
Device.gd->private_data = &Device;
strcpy (Device.gd->disk_name, "sbd0");
set_capacity(Device.gd, nsectors);
Most of the above should be relatively self-explanatory. The fops
field is a pointer to the block_device_operations structure for
this device; we'll get to that shortly. The private_data field
can be used by the driver, so we stick a pointer to our sbd_device
structure there.
Another thing that (usually) goes into the gendisk is the request queue to
use. The BLK_DEFAULT_QUEUE macro from 2.4 is no more; a block
driver must explicitly create and set up the request queue(s) it will use.
So sbd declares one:
static struct request_queue Queue;
The queue must then be initialized and stored into the gendisk:
blk_init_queue(&Queue, sbd_request, &Device.lock);
Device.gd->queue = &Queue;
Here, sbd_request is the request function, which we will get to.
Note that a spinlock must be passed into blk_init_queue(). The
global io_request_lock is gone forevermore, and each block driver
must manage its own locking. Typically, the lock used by the driver to
serialize access to internal resources is the best choice for controlling
access to the request queue as well. For
that reason, the block layer expects the driver to provide a lock of its
own for the queue.
At this point, the gendisk setup is complete. All that remains is to add
the disk to the system:
add_disk(Device.gd);
Note that add_disk() may well generate I/O to the device - the
driver must be in a state where it can handle requests before adding
disks. The driver also should not fail initialization after it has
successfully added a disk.
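The ordering rule above (every step that can fail comes before add_disk(), and failures unwind whatever was already set up) can be sketched in userspace. The out and out_unregister labels appear in the fragments above, but what they do here is an assumption, as are struct init_sketch, sketch_init() and the fail_at parameter used to inject failures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record of which setup steps have completed. */
struct init_sketch {
    void *data;          /* stands in for Device.data */
    int registered;      /* stands in for register_blkdev() */
    int disk_allocated;  /* stands in for alloc_disk() */
    int disk_added;      /* stands in for add_disk() */
};

/*
 * Run the setup steps in order. fail_at selects a step to fail:
 * 1 = memory, 2 = registration, 3 = disk allocation, 0 = none.
 * Returns 0 on success, -1 after undoing the completed steps.
 */
static int sketch_init(struct init_sketch *s, int fail_at)
{
    assert(s != NULL);
    s->data = NULL;
    s->registered = s->disk_allocated = s->disk_added = 0;

    s->data = (fail_at == 1) ? NULL : malloc(64);
    if (s->data == NULL)
        return -1;                 /* nothing to undo yet */

    if (fail_at == 2)
        goto out;                  /* registration failed */
    s->registered = 1;

    if (fail_at == 3)
        goto out_unregister;       /* disk allocation failed */
    s->disk_allocated = 1;

    /* Nothing below this point may fail: the disk goes live last. */
    s->disk_added = 1;
    return 0;

out_unregister:
    s->registered = 0;             /* undo the registration */
out:
    free(s->data);                 /* undo the allocation */
    s->data = NULL;
    return -1;
}
```

Keeping the labels in reverse order of the setup steps lets each failure point jump to exactly the cleanup it needs.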
What you don't have to do
That is the end of the initialization process for the sbd driver.
What you don't have to do is as notable as what does need to be done.
For example, there are no assignments to global arrays; the whole set of
global variables that used to describe block devices is gone. There is
also nothing here for dealing with partition setup. The sbd
device is, as mentioned above, not partitionable, but the only change (to
the entire driver) that is required to enable partitioning is to increase
the argument to alloc_disk(). The block layer now handles everything else.
Open and release
The open and release methods (which are kept in the
block_device_operations structure) actually have not changed since
2.4. The sbd open method looks like:
static int sbd_open(struct inode *inode, struct file *filp)
{
    /*
     * Remember the block_device pointer - but only once.
     */
    if (! Device.bdev) {
        Device.bdev = inode->i_bdev;
        atomic_inc(&Device.bdev->bd_count);
    }
    return 0;
}
Here, we take the opportunity to snarf a copy of the block_device
structure representing our device; this is about the only place where that
is easy to do. There's not actually much for a driver to do with
that structure; we won't touch it until module unload time.
The release method is a no-op:
static int sbd_release(struct inode *inode, struct file *filp)
{
    return 0;
}
The request method
The core of a block driver, of course, is its request method. The
sbd driver has the simplest possible request function; it
does not concern itself with things like request clustering, barriers,
etc. It does not understand the new bio structure used to
represent requests at all. But it works. Real drivers will almost
certainly require a more serious request method; we'll get into
how those can be written in future articles.
Here is the whole thing:
static void sbd_request(request_queue_t *q)
{
    struct request *req;
    while ((req = elv_next_request(q)) != NULL) {
        if (! blk_fs_request(req)) {
            end_request(req, 0);
            continue;
        }
        sbd_transfer(&Device, req->sector, req->current_nr_sectors,
                     req->buffer, rq_data_dir(req));
        end_request(req, 1);
    }
}
The first thing to notice is that all of the old
<linux/blk.h> cruft has been removed. Macros like
INIT_REQUEST (with its hidden return statement),
CURRENT, and QUEUE_EMPTY are gone. It is now necessary
to deal with the request queue functions directly, but, as can be seen,
that is not particularly hard.
Note that the Device.lock will be held on entry to the
request function, much like io_request_lock is in 2.4.
The function for getting the first request in the queue is now
elv_next_request(). A NULL return means that there are
no more requests on the queue that are ready to process.
A simple request loop like this one can simply run until the request queue
is empty; drivers for real hardware will also have to take into account how
many operations the device can handle, of course.
Note that this function does not
actually remove the request from the queue; it just returns a properly
adjusted view of the top request.
Note also that, in 2.5, there can be
multiple types of requests. Thus the test:
if (! blk_fs_request(req)) {
    end_request(req, 0);
    continue;
}
A nonzero return value from the blk_fs_request() macro says "this
is a normal filesystem request."
Other types of requests (e.g. packet-mode or device-specific diagnostic
operations) are not something that sbd supports, so it
simply fails any such requests.
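The shape of this loop (peek at the head of the queue, fail anything that is not a filesystem request, complete the rest) can be modeled in userspace. This is a simplified sketch, not the kernel's queue code: every sketch_ name is hypothetical, and it assumes that each request is finished in a single step, so that completing a request also takes it off the queue.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_rq_type { SKETCH_RQ_FS, SKETCH_RQ_OTHER };

struct sketch_request {
    enum sketch_rq_type type;
    int done;       /* set once the request has been completed */
    int uptodate;   /* 1 = completed successfully, 0 = failed */
};

/* A fixed array of requests with a head index; hypothetical model. */
struct sketch_queue {
    struct sketch_request *reqs;
    size_t count;
    size_t head;
};

/* Return the head request without removing it, or NULL if none remain. */
static struct sketch_request *sketch_next_request(struct sketch_queue *q)
{
    assert(q != NULL);
    return q->head < q->count ? &q->reqs[q->head] : NULL;
}

/* Record the outcome and advance past the head request. */
static void sketch_end_request(struct sketch_queue *q,
                               struct sketch_request *req, int uptodate)
{
    req->done = 1;
    req->uptodate = uptodate;
    q->head++;
}

/* Drain the queue; returns how many requests completed successfully. */
static size_t sketch_request_fn(struct sketch_queue *q)
{
    struct sketch_request *req;
    size_t handled = 0;

    while ((req = sketch_next_request(q)) != NULL) {
        if (req->type != SKETCH_RQ_FS) {
            sketch_end_request(q, req, 0);  /* unsupported: fail it */
            continue;
        }
        /* The data transfer would happen here. */
        sketch_end_request(q, req, 1);
        handled++;
    }
    return handled;
}
```

Calling sketch_next_request() twice in a row returns the same request, mirroring the "peek, don't remove" behavior described above; only completing the request moves the queue forward.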
The function sbd_transfer() is really just a memcpy()
with some checking; see the full source if you are interested. The key is
in the parameters: the various fields of the request structure
(sector, current_nr_sectors, and buffer) look
just like they did in 2.4. They also have the same meaning: they are a
window looking at the first part of a (possibly larger) request. If you
deal with block requests at this level, you need know nothing of the
bio structures underlying the request. This approach only works
for the simplest of drivers, however.
Note that the direction of the request is now found in the flags
field, and can be tested with rq_data_dir(). A nonzero value
(WRITE) indicates that this is a write request. Note also the
absence of any code adding partition offsets; all of that is handled in the
higher layers.
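For readers who do not want to dig through the full source, here is a rough userspace sketch of what a "memcpy() with some checking" transfer function could look like. It is not the driver's actual sbd_transfer(): struct ramdisk_sketch, sketch_transfer() and the fixed 512-byte SKETCH_SECTOR_SIZE are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SECTOR_SIZE 512UL   /* assumed sector size */

/* Userspace stand-in for the device: a size and a block of memory. */
struct ramdisk_sketch {
    unsigned long size;     /* device size in bytes */
    unsigned char *data;    /* backing store */
};

/*
 * Copy nsect sectors, starting at sector, between the device and buffer.
 * A nonzero write moves data from buffer into the device; zero reads
 * from the device into buffer. Returns 0 on success, or -1 if the
 * range runs past the end of the device.
 */
static int sketch_transfer(struct ramdisk_sketch *dev, unsigned long sector,
                           unsigned long nsect, char *buffer, int write)
{
    unsigned long total_sectors, offset, nbytes;

    assert(dev != NULL && buffer != NULL);
    total_sectors = dev->size / SKETCH_SECTOR_SIZE;

    /* Compare in sectors, ordered so that the arithmetic cannot overflow. */
    if (sector > total_sectors || nsect > total_sectors - sector)
        return -1;

    offset = sector * SKETCH_SECTOR_SIZE;
    nbytes = nsect * SKETCH_SECTOR_SIZE;
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
    return 0;
}
```

Note that the sector number is used directly as an offset into the device; as in the driver, no partition offset is added at this level.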
Finally, end_request() is called to finish processing of this
request. This function has picked up a new parameter in 2.5, being the
pointer to the request structure.
Other block operations
The two other block_device_operations methods from 2.4 -
check_media_change() and revalidate() - have seen prototype
changes in 2.5. They are now called media_changed() and
revalidate_disk(), and both take a gendisk structure as
their only argument. The basic task performed by these methods remains
unchanged, however.
Shutting down
The last thing to look at is what happens when the module is unloaded. We
must, of course, clean up our various data structures and free memory - the
usual stuff. The sbd cleanup function starts with this:
if (Device.bdev) {
    invalidate_bdev(Device.bdev, 1);
    bdput(Device.bdev);
}
In 2.5, invalidate_buffers() has become invalidate_bdev(),
and it expects a block_device structure as an argument.
fsync_dev() has become fsync_bdev() in a similar way. We
call invalidate_bdev() to clean up any buffers that may still
refer to our device - even though they should not really exist if the
module is being unloaded. Then we release the reference to the
block_device structure that we grabbed back in our open()
method.
Note that, in 2.5, it is distressingly easy to oops the kernel with
invalidate_bdev() and fsync_bdev(). The latter, in
particular, expects there to be an open inode associated with your
device, which should not be the case if the module is being unloaded.
fsync_bdev() is safe to call when you know the device is open (in
an ioctl() method, for example), but should be avoided otherwise.
To finish the cleanup:
del_gendisk(Device.gd);
put_disk(Device.gd);
vfree(Device.data);
unregister_blkdev(major_num, "sbd");
del_gendisk() cleans up any partitioning information, and
generally makes the system forget about the gendisk passed to it. The call
to put_disk() then releases our reference to the gendisk
structure (obtained
when we first called alloc_disk()) so that it can be freed. Then, of
course, we must free the memory used for the device itself (only after the
gendisk has been cleaned up, so we know no more operations can be
requested), and unregister the block device.
Conclusion
That is about as simple as it gets; the above implements a true virtual
block device that can support a filesystem. Everything we look at from
here will make things more complicated. Future articles will look in
detail at gendisks, request queue management, the bio structure,
and more. Stay tuned.
Update: one of the good and bad things about writing for LWN
is that the people who really know what's going on tend to be
reading. Accordingly, this article has been updated to address Jens's
comments, below.