Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.66; there have been no development kernel releases since March 24. Linus's BitKeeper repository is full of patches, however, including some loadable module tweaks, some softirq handling changes (fixing a bug where bottom halves could be run when interrupts are disabled), an improved scsi_debug module, a number of PCMCIA/Cardbus improvements, a new initcall_debug boot parameter for tracking down early boot crashes, an ACPI update, an NFS update, some IDE changes, an XFS update, a big x86-64 update, and numerous other cleanups and fixes.
The current stable kernel is 2.4.20; the last 2.4.21 prepatch was 2.4.21-pre6, released on March 26.
Kernel development news
The USB Gadget driver framework
David Brownell has sent out an announcement regarding the availability of the new USB "gadget" API. The Linux kernel has long had support for USB host controllers - the subsystem which lets the kernel drive attached USB devices. But what if Linux is running inside the device itself? Implementing the USB protocol is a very different job when you're approaching it from the other end of the bus, and the current in-kernel USB implementation will not be particularly helpful. Thus this announcement. The chosen terminology calls attached devices "gadgets," which need a gadget driver to make them work. (The USB standard, instead, calls gadgets "devices," but reusing the term "device driver" in this context would lead only to confusion). The new gadget implementation supports the NetChip 2280 controller, and comes with a couple of drivers: "gadget zero" (a skeleton example driver) and a network driver. There's also a dummy controller driver, allowing gadget development to be done in the absence of real hardware (and, perhaps, on a more friendly development platform).
The project has reached the point where it needs to get more people involved writing drivers. The substrate is there, so a lot of the hard decisions have been made, but the actual implementation for various hardware controllers and gadget classes is missing. This could be a fun area of development for people who would like to get into kernel programming.
A new set of IDE changes
Bartlomiej Zolnierkiewicz did a fair amount of IDE work during the Martin Dalecki period last year. He has been more quiet in recent times, however, leaving the current IDE cleanup effort to others. This week marked an end to that silence, however, as Bartlomiej surfaced with a significant set of IDE patches. The first part, based on work originally by Suparna Bhattacharya, changes the way the block layer BIO structure and request structure work together. The BIO structure contains the pointers needed to keep track of which I/O transfers have been completed, and which have not. What is lacking, however, is any way of tracking which operations have been commenced. That tracking has traditionally been the driver's job.
The new "BIO traversal" patches change things by adding a new set of pointers which mark where the next I/O operation should begin. A new support function (with the unwieldy name process_that_request_first()) gives drivers a way to indicate that processing has begun on part of the request. Overall, it could be a useful infrastructure for drivers with relatively complicated request processing.
The real change, however, is a new set of IDE programmed I/O (PIO) handlers. One wouldn't think that PIO matters much, given that DMA operations are available. But, in fact, quite a few IDE drives still don't handle DMA well, and the Linux kernel tends to be very conservative about enabling DMA. Unless you've taken steps to change things, chances are that your IDE-based Linux system is running with programmed I/O. So the quality of the PIO handlers matters.
Bartlomiej's new PIO code simplifies (and clarifies) things considerably. It also uses the ATA taskfile mode of operation. The taskfile code has been broken for some time, with the result that it is disabled in 2.5 and was going to stay that way through the 2.6 release. After seeing the new PIO handlers, however, Alan Cox has changed his strategy: "Looks like the revised plan is 'pure taskfile for 2.6 care of Bartlomiej'".
IDE patches (after last year's IDE regime change) are treated with great care; they don't even make it into development kernels until the confidence level is quite high. So Bartlomiej's changes probably won't appear in the 2.5 mainstream for a little while yet. They should eventually get there, however, and the result will be an improved IDE subsystem for 2.6. (The full set of patches can be found in the "Patches and Updates" section, below.)
Driver porting
The end of the block layer series
The driver porting series this week features two articles on the management of request queues and structures. Those of you who are uninterested in the Linux block layer will be glad to know that these articles finish out the series on writing block drivers. We'll move on to some other, as yet undetermined, subject next week.
Driver porting: Request Queues I
| This article is part of the LWN Porting Drivers to 2.6 series. |
Request queues
Request queues are represented by a pointer to struct request_queue or to the typedef request_queue_t, defined in <linux/blkdev.h>. One request queue can be shared across multiple physical drives, but the normal usage is to create a separate queue for each drive. Request queues must be allocated and initialized by the block subsystem; this allocation (and initialization) is done by:
request_queue_t *blk_init_queue(request_fn_proc *request_fn,
spinlock_t *lock);
Here request_fn is the driver's function which will process requests, and lock is a spinlock which controls access to the queue. The return value is a pointer to the newly-allocated request queue if the initialization succeeded, or NULL otherwise. Since setting up a request queue requires memory allocation, failure is possible. A couple of other changes from 2.4 should be noted: a spinlock must be provided to control access to the queue (io_request_lock is no more), and there is no per-major "default" queue provided in 2.6.
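As a concrete illustration, here is a minimal sketch of queue setup and teardown for a hypothetical "sbd" driver; the names sbd_request(), sbd_lock, and sbd_queue are assumptions for the example, not part of the block API:

```c
/* Sketch: request queue setup for a hypothetical "sbd" driver. */
static spinlock_t sbd_lock = SPIN_LOCK_UNLOCKED;
static request_queue_t *sbd_queue;

static int __init sbd_init(void)
{
	/* sbd_request() is this driver's (hypothetical) request function */
	sbd_queue = blk_init_queue(sbd_request, &sbd_lock);
	if (sbd_queue == NULL)
		return -ENOMEM;		/* queue allocation can fail */
	/* ... set up and register the gendisk, etc. ... */
	return 0;
}

static void __exit sbd_exit(void)
{
	blk_cleanup_queue(sbd_queue);	/* return the queue to the system */
}
```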
When a driver is done with a request queue, it should pass it back to the system with:
void blk_cleanup_queue(request_queue_t *q);
Note that neither of these functions is normally called if a "make request" function is being used (make request functions are covered in part II).
Basic request processing
The request function prototype has not changed from 2.4; it gets the request queue as its only parameter. The queue lock will be held when the request function is called. All request handlers, from the simplest to the most complicated, will find the next request to process with:
struct request *elv_next_request(request_queue_t *q);
The return value is the next request that should be processed, or NULL if the queue is empty. If you look through the kernel source, you will find references to blk_queue_empty() (or elv_queue_empty()), which tests the state of the queue. Use of this function in drivers is best avoided, however. In the future, it could be that a non-empty queue still has no requests that are ready to be processed.
In 2.4 and prior kernels, a block request contained one or more buffer heads with sectors to be transferred. In 2.6, a request contains a list of BIO structures instead. This list can be accessed via the bio member of the request structure, but the recommended way of iterating through a request's BIOs is instead:
struct bio *bio;
rq_for_each_bio(bio, req) {
/* Process this BIO */
}
Drivers which use this macro are less likely to break in the future. Do note, however, that many drivers will never need to iterate through the list of BIOs in this way; for DMA transfers, use blk_rq_map_sg() (described below) instead.
As your driver performs the transfers described by the BIO structures, it will need to update the kernel on its progress. Note that drivers should not call bio_endio() as transfers complete; the block layer will take care of that. Instead, the driver should call end_that_request_first(), which has a different prototype in 2.6:
int end_that_request_first(struct request *req, int uptodate,
int nsectors);
Here, req is the request being handled, uptodate is nonzero unless an error has occurred, and nsectors is the number of sectors which were transferred. This function will clean up as many BIO structures as are covered by the given number of sectors, and return nonzero if any BIOs remain to be transferred.
When the request is complete (end_that_request_first() returns zero), the driver should clean up the request. The cleanup task involves removing the request from the queue, then passing it to end_that_request_last(), which is unchanged from 2.4. Note that the queue lock must be held when calling both of these functions:
void blkdev_dequeue_request(struct request *req);
void end_that_request_last(struct request *req);
Note that the driver can dequeue the request at any time (as long as it keeps track of it, of course). Drivers which keep multiple requests in flight will need to dequeue each request as it is passed to the drive.
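Putting these pieces together, a simple synchronous request function might look like the following sketch; sbd_transfer() is a hypothetical helper that performs one request's worth of I/O and returns nonzero on success:

```c
/* Sketch: a simple request function. The queue lock is held on
 * entry, as required by blkdev_dequeue_request() and
 * end_that_request_last(). */
static void sbd_request(request_queue_t *q)
{
	struct request *req;
	int success;

	while ((req = elv_next_request(q)) != NULL) {
		if (!blk_fs_request(req)) {	/* not a filesystem request */
			end_request(req, 0);
			continue;
		}
		/* hypothetical: move req->nr_sectors starting at req->sector */
		success = sbd_transfer(req);
		if (!end_that_request_first(req, success, req->nr_sectors)) {
			/* all BIOs complete: dequeue and finish the request */
			blkdev_dequeue_request(req);
			end_that_request_last(req);
		}
	}
}
```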
If your device does not have predictable timing behavior, your driver should contribute its timing information to the system's entropy pool. That is done with:
void add_disk_randomness(struct gendisk *disk);
BIO walking
The "BIO walking" patch was added in 2.5.70. This patch adds some request queue fields and a new function to help complicated drivers keep track of where they are in a given request. Drivers using BIO walking will not use rq_for_each_bio(); instead, they rely upon the fact that the cbio field of the request structure refers to the current, unprocessed BIO, nr_cbio_segments tells how many segments remain to be processed in that BIO, and nr_cbio_sectors tells how many sectors are yet to be transferred. The macro:
int blk_rq_idx(struct request *req)
returns the index of the next segment to process. If you need to access the current segment buffer directly (for programmed I/O, say), you may use:
char *rq_map_buffer(struct request *req, unsigned long *flags);
void rq_unmap_buffer(char *buffer, unsigned long flags);
These functions potentially deal with atomic kmaps, so the usual constraints apply: no sleeping while the mapping is in effect, and buffers must be mapped and unmapped in the same function.
When beginning I/O on a set of blocks from the request, your driver can update the current pointers with:
int process_that_request_first(struct request *req,
unsigned int nr_sectors);
This function will update the various cbio values in the request, but does not signal completion (you still need end_that_request_first() for that). Use of process_that_request_first() is optional; your driver may call it if you would like the block subsystem to track your current position in the request for I/O submission independently from how much of the request has actually been completed.
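A programmed-I/O step using these helpers might look like this sketch; sbd_write_sector() is a hypothetical routine that copies one sector's worth of data to the device:

```c
/* Sketch: one PIO step with the BIO-walking helpers. */
static void sbd_pio_one_sector(struct request *req)
{
	unsigned long flags;
	char *buf;

	buf = rq_map_buffer(req, &flags);	/* may be an atomic kmap */
	sbd_write_sector(buf);			/* no sleeping while mapped */
	rq_unmap_buffer(buf, flags);		/* unmap in the same function */

	/* advance the request's current-position (cbio) pointers;
	 * completion is still signaled separately via
	 * end_that_request_first() */
	process_that_request_first(req, 1);
}
```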
Barrier requests
Requests will come off the request queue sorted into an order that should give good performance. Block drivers (and the devices they drive) are free to reorder those requests within reason, however. Drives which support features like tagged command queueing and write caching will often complete operations in an order different from that in which they received the requests. Most of the time, this reordering leads to improved performance and is a good thing. At times, however, it is necessary to inhibit this reordering. The classic example is that of journaling filesystems, which must be able to force journal entries to the disk before the operations they describe. Reordering of requests can undermine the filesystem integrity that a journaling filesystem is trying to provide.
To meet the needs of higher-level layers, the concept of a "barrier request" has been added to the 2.6 kernel. Barrier requests are marked by the REQ_HARDBARRIER flag in the request structure flags field. When your driver encounters a barrier request, it must complete that request (and all that preceded it) before beginning any requests after the barrier request. "Complete," in this case, means that the data has been physically written to the disk platter - not just transferred to the drive.
Tweaking request queue parameters
The block subsystem contains a long list of functions which control how I/O requests are created for your driver. Here are a few of them. Bounce buffer control: in 2.4, the block code assumed that devices could not perform DMA to or from high memory addresses. When I/O buffers were located in high memory, data would be copied to or from low-memory "bounce" buffers; the driver would then operate on the low-memory buffer. Most modern devices can handle (at a minimum) full 32-bit DMA addresses, or even 64-bit addresses. For now, 2.6 will still use bounce buffers for high-memory addresses. A driver can change that behavior with:
void blk_queue_bounce_limit(request_queue_t *q, u64 dma_addr);
After this call, any buffer whose physical address is at or above dma_addr will be copied through a bounce buffer. The driver can provide any reasonable address, or one of BLK_BOUNCE_HIGH (bounce high memory pages, the default), BLK_BOUNCE_ANY (do not use bounce buffers at all), or BLK_BOUNCE_ISA (bounce anything above the ISA DMA threshold).
Request clustering control. The block subsystem works hard to coalesce adjacent requests for better performance. Most devices have limits, however, on how large those requests can be. A few functions have been provided to instruct the block subsystem on how not to create requests which must be split apart again.
void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors);
Sets the maximum number of sectors which may be transferred in a single
request; default is 255. It is not possible to set the maximum below the
number of sectors contained in one page.
void blk_queue_max_phys_segments(request_queue_t *q,
unsigned short max_segments);
void blk_queue_max_hw_segments(request_queue_t *q,
unsigned short max_segments);
The maximum number of discontiguous physical segments in a single request; this is the maximum size of a scatter/gather list that could be presented to the device. The first function controls the number of distinct memory segments in the request; the second does the same, but it takes into account the remapping which can be performed by the system's I/O memory management unit (if any). The default for both is 128 segments.
void blk_queue_max_segment_size(request_queue_t *q,
unsigned int max_size);
The maximum size that any individual segment within a request can be. The default is 65536 bytes.
void blk_queue_segment_boundary(request_queue_t *q,
unsigned long mask);
Some devices cannot perform transfers which cross memory boundaries of a certain size. If your device is one of these, you should call blk_queue_segment_boundary() with a mask indicating where the boundary is. If, for example, your hardware has a hard time crossing 4MB boundaries, mask should be set to 0x3fffff. The default is 0xffffffff.
Finally, some devices have more esoteric restrictions on which requests may or may not be clustered together. For situations where the above parameters are insufficient, a block driver can specify a function which can examine (and pass judgement on) each proposed merge.
typedef int (merge_bvec_fn) (request_queue_t *q, struct bio *bio,
struct bio_vec *bvec);
void blk_queue_merge_bvec(request_queue_t *q, merge_bvec_fn *fn);
Once the given fn is associated with this queue, it will be called every time a bio_vec entry bvec is being considered for addition to the given bio. It should return the number of bytes from bvec which can be added; zero should be returned if the new segment cannot be added at all. By default, there is no merge_bvec_fn.
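As an illustration, here is a hedged sketch of a merge function for a device that (hypothetically) cannot handle transfers crossing a 64KB boundary in its own address space; the boundary and the sbd_ naming are assumptions for the example:

```c
/* Sketch: a merge_bvec_fn refusing merges that would cross a
 * (hypothetical) 64KB device-address boundary. */
static int sbd_merge_bvec(request_queue_t *q, struct bio *bio,
			  struct bio_vec *bvec)
{
	/* byte offset, on the device, where the new segment would end up */
	unsigned int offset = (bio->bi_sector << 9) + bio->bi_size;
	unsigned int remaining = 0x10000 - (offset & 0xffff);

	if (bvec->bv_len <= remaining)
		return bvec->bv_len;	/* the whole segment fits */
	/* allow a partial segment only if at least one sector fits */
	return remaining >= 512 ? remaining : 0;
}

/* at queue setup time: */
/* blk_queue_merge_bvec(sbd_queue, sbd_merge_bvec); */
```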
Setting the hardware sector size. The old hardsect_size global array is gone and nobody misses it. Block drivers now inform the system of the underlying hardware's sector size with:
void blk_queue_hardsect_size(request_queue_t *q, unsigned short size);
The default is the usual 512-byte sector. There is one other important change with regard to sector sizes: your driver will always see requests expressed in terms of 512-byte sectors, regardless of the hardware sector size. The block subsystem will not generate requests which go against the hardware sector size, but sector numbers and counts in requests are always in 512-byte units. This change was required as part of the new centralized partition remapping.
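Taken together, a driver's queue-parameter setup often looks like the following sketch; the specific limits (32-bit DMA, 64 segments, 2KB hardware sectors) are illustrative assumptions, not recommendations:

```c
/* Sketch: typical queue-limit setup for a hypothetical controller. */
static void sbd_setup_limits(request_queue_t *q)
{
	blk_queue_bounce_limit(q, 0xffffffffULL);  /* full 32-bit DMA */
	blk_queue_max_sectors(q, 128);		   /* 64KB per request */
	blk_queue_max_phys_segments(q, 64);	   /* scatterlist size */
	blk_queue_max_hw_segments(q, 64);
	blk_queue_segment_boundary(q, 0xffff);	   /* no 64KB crossings */
	blk_queue_hardsect_size(q, 2048);	   /* 2KB hardware sectors */
}
```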
DMA support
Most block I/O requests will come down to one or more DMA operations. The 2.6 block layer provides a couple of functions designed to make the task of setting up DMA operations easier.
void blk_queue_dma_alignment(request_queue_t *q, int mask);
This function sets a mask indicating what sort of memory alignment the hardware needs for DMA requests; the default is 511.
DMA operations to modern devices usually require the creation of a scatter/gather list of segments to be transferred. A block driver can create this "scatterlist" using the generic DMA support routines and the information found in the request. The block subsystem has made life a little easier, though. A simple call to:
int blk_rq_map_sg(request_queue_t *q, struct request *rq,
struct scatterlist *sg);
will construct a scatterlist for the given request; the return value is the number of entries in the resulting list. This scatterlist can then be passed to pci_map_sg() or dma_map_sg() in preparation for the DMA operation.
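A DMA setup path using this call might look like the sketch below; the PCI device pointer, the scatterlist size, and sbd_program_controller() are assumptions for the example:

```c
/* Sketch: building and mapping a scatterlist for one request.
 * The array should be sized to the queue's max_phys_segments. */
static struct scatterlist sbd_sg[64];

static void sbd_start_dma(request_queue_t *q, struct request *req)
{
	int dir = rq_data_dir(req) == WRITE ?
		  PCI_DMA_TODEVICE : PCI_DMA_FROMDEVICE;
	int nseg;

	nseg = blk_rq_map_sg(q, req, sbd_sg);	 /* fill the scatterlist */
	nseg = pci_map_sg(sbd_pci_dev, sbd_sg, nseg, dir);

	/* hypothetical: hand the nseg mapped entries to the hardware */
	sbd_program_controller(sbd_sg, nseg);
}
```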
Going on
The second part of the request queue article series looks at command preparation, tagged command queueing, and writing drivers which do without a request queue altogether.
Driver porting: Request Queues II
| This article is part of the LWN Porting Drivers to 2.6 series. |
Command pregeneration
Traditionally, block drivers have prepared low-level hardware commands at the time a request is processed. There can be advantages to preparing commands at an earlier point, however. In 2.6, drivers which wish to prepare commands (or perform some other sort of processing) for requests before they hit the request function should set up a prep_rq_fn with this prototype:
typedef int (prep_rq_fn) (request_queue_t *q, struct request *rq);
This function should perform preparatory work on the given request rq. The 2.6 request structure includes a 16-byte cmd field where a pregenerated command can be stored; rq->cmd_len should be set to the length of that command. The prep function should return BLKPREP_OK (process the request normally), BLKPREP_DEFER (which defers processing of the command for now), or BLKPREP_KILL (which terminates the request with a failure status).
To add your prep function to a request queue, call:
void blk_queue_prep_rq(request_queue_t *q, prep_rq_fn *pfn);
The prep function is currently called out of elv_next_request() - immediately before the request is passed back to your driver. There is a possibility that, at some future point, the call to the prep function could happen earlier in the process.
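A prep function might build a hardware command along these lines; the six-byte command layout and the SBD_CMD_* opcodes are hypothetical, invented for the sketch:

```c
/* Sketch: a prep_rq_fn pregenerating a (hypothetical) six-byte
 * command into the request's cmd field. */
static int sbd_prep_rq(request_queue_t *q, struct request *rq)
{
	if (!blk_fs_request(rq))
		return BLKPREP_KILL;	/* fail anything we don't handle */

	rq->cmd[0] = rq_data_dir(rq) == WRITE ? SBD_CMD_WRITE : SBD_CMD_READ;
	rq->cmd[1] = (rq->sector >> 16) & 0xff;	/* starting sector */
	rq->cmd[2] = (rq->sector >> 8) & 0xff;
	rq->cmd[3] = rq->sector & 0xff;
	rq->cmd[4] = rq->nr_sectors & 0xff;	/* transfer length */
	rq->cmd[5] = 0;
	rq->cmd_len = 6;
	return BLKPREP_OK;
}

/* at queue setup time: */
/* blk_queue_prep_rq(sbd_queue, sbd_prep_rq); */
```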
Tagged command queueing
Tagged command queueing (TCQ) allows a block device to have multiple outstanding I/O requests, each identified by an integer "tag." TCQ can yield performance benefits; the drive generally knows best when it comes to figuring out which request should be serviced next. SCSI drivers in Linux have long supported TCQ, but each driver has included its own infrastructure for tag management. In 2.6, a simple tag management facility has been added to the block layer. The generic tag management code can make life easier, but it's important to understand how these functions interact with the request queue. Drivers wishing to use tags should set things up with:
int blk_queue_init_tags(request_queue_t *q, int depth,
struct blk_queue_tag *tags);
This call should be made after the queue has been initialized. Here, depth is the maximum number of tagged commands which can be outstanding at any given time. The tags argument is a pointer to a blk_queue_tag structure which will be used to track the outstanding tags. Normally you can pass tags as NULL, and the block subsystem will allocate and initialize the structure for you. If you wish to share a structure (and, thus, the tags it represents) with another device, however, you can pass a pointer to the blk_queue_tag structure in the first queue when initializing the second. This call performs memory allocation, and will return a negative error code if that allocation failed.
A call to:
void blk_queue_free_tags(request_queue_t *q);
will clean up the TCQ infrastructure. This normally happens automatically when blk_cleanup_queue() is called, so drivers do not normally have to call blk_queue_free_tags() themselves.
To allocate a tag for a request, use:
int blk_queue_start_tag(request_queue_t *q, struct request *rq);
This function will associate a tag number with the given request rq, storing it in rq->tag. The return value will be zero on success, or a nonzero value if there are no more tags available. This function will remove the request from the queue, so your driver must take care not to lose track of it - and to not try to dequeue the request itself. It is also necessary to hold the queue lock when calling blk_queue_start_tag().
blk_queue_start_tag() has been designed to work as the command prep function. If your driver would like to have tags automatically assigned, it can perform a call like:
blk_queue_prep_rq(queue, blk_queue_start_tag);
And every request that comes from elv_next_request() will already have a tag associated with it.
If you need to know if a given request has a tag associated with it, use the macro blk_rq_tagged(rq). The return value will be nonzero if this request has been tagged.
When all transfers for a tagged request have been completed, the tag should be returned with:
void blk_queue_end_tag(request_queue_t *q, struct request *rq);
Timing is important here: blk_queue_end_tag() must be called before end_that_request_last(), or unpleasant things will happen. Be sure to have the queue lock held when calling this function.
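The whole tagged-command lifecycle can be sketched as follows; the 32-tag depth and sbd_issue() are illustrative assumptions, and the completion path assumes the request finishes in one piece:

```c
/* Sketch: the tagged-command lifecycle. */

/* At initialization time, after the queue exists: */
if (blk_queue_init_tags(sbd_queue, 32, NULL))
	goto fail;			/* tag allocation failed */

/* In the request function (queue lock held): */
while ((req = elv_next_request(q)) != NULL) {
	if (blk_queue_start_tag(q, req))
		break;			/* no tags free; try again later */
	sbd_issue(req);			/* req->tag identifies the transfer;
					   the request is already dequeued */
}

/* In the completion path, with the queue lock taken: */
spin_lock_irqsave(q->queue_lock, flags);
if (!end_that_request_first(req, 1, req->hard_nr_sectors)) {
	blk_queue_end_tag(q, req);	/* before end_that_request_last()! */
	end_that_request_last(req);
}
spin_unlock_irqrestore(q->queue_lock, flags);
```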
If you need to know which request is associated with a given tag, call:
struct request *blk_queue_find_tag(request_queue_t *q, int tag);
The return value will be the request structure, or NULL if the given tag is not currently in use.
In the real world, things occasionally go wrong. If a drive (or the bus it is attached to) goes into an error state and must be reset, all outstanding tagged requests will be lost. In such a situation, the driver should call:
void blk_queue_invalidate_tags(request_queue_t *q);
This call will return all outstanding tags to the pool, and the associated I/O requests will be returned to the request queue so that they can be restarted.
Doing without a request queue
Some devices have no real need for a request queue. In particular, truly random-access devices, such as memory technology devices or ramdisks, can process requests quickly and do not benefit from sorting and merging of requests. Drivers for such devices may achieve better performance by shorting out much of the request queue structure and handling requests directly as they are generated. As in 2.4, this sort of driver can set up a "make request" function. First, however, the request queue must still be created. The queue will not be used to handle the actual requests, but it contains other infrastructure needed by the block subsystem. If your driver will use a make request function, it should first create the queue with blk_alloc_queue():
request_queue_t *blk_alloc_queue(int gfp_mask);
The gfp_mask argument describes how the requisite memory should be allocated, as usual. Note that this call can fail.
Once you have a request queue, you can set up the make request function; the prototype for this function has changed a bit from 2.4, however:
typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);
If the make request function can arrange for the transfer(s) described in the given bio, it should do so and return zero. "Stacking" drivers can also redirect the bio by changing its bi_bdev field and returning nonzero; in this case the bio will then be dispatched to the new device's driver (this is as things were done in 2.4).
If the "make request" function performs the transfer itself, it is responsible for passing the BIO to bio_endio() when the transfer is complete. Note that the "make request" function is not called with the queue lock held.
To arrange for your driver's function to be called, use:
void blk_queue_make_request(request_queue_t *q,
make_request_fn *func);
If and when your driver shuts down, be sure to return the request queue to the system with:
void blk_put_queue(request_queue_t *queue);
As of 2.6.0-test3, this function is just another name for blk_cleanup_queue(), but such things could always change in the future.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
