Brief items
The current development kernel is 2.5.66; there have been no
development kernel releases since March 24.
Linus's BitKeeper repository is full of patches, however, including some
loadable module tweaks, some softirq handling changes (fixing a bug where
bottom halves could be run when interrupts are disabled), an improved
scsi_debug module, a number of PCMCIA/Cardbus improvements, a new
initcall_debug boot parameter for tracking down early boot
crashes, an ACPI update, an NFS update, some IDE changes, an XFS update, a
big x86-64 update, and numerous other cleanups and fixes.
The current stable kernel is 2.4.20; the last 2.4.21 prepatch was 2.4.21-pre6, released on March 26.
Kernel development news
David Brownell has sent out
an announcement
regarding the availability of the new USB "gadget" API. The Linux kernel
has long had support for USB host controllers - the subsystem which lets
the kernel drive attached USB devices. But what if Linux is running inside
the device itself? Implementing the USB protocol is a very different job
when you're approaching it from the other end of the bus, and the current
in-kernel USB implementation will not be particularly helpful.
Thus this announcement. The chosen terminology calls attached devices
"gadgets," which need a gadget driver to make them work. (The USB
standard, instead, calls gadgets "devices," but reusing the term "device
driver" in this context would lead only to confusion.) The new gadget
implementation supports the NetChip 2280 controller, and comes with a
couple of drivers: "gadget zero" (a skeleton example driver) and a network
driver. There's also a dummy controller driver, allowing gadget
development to be done in the absence of real hardware (and, perhaps, on a
more friendly development platform).
The project has reached the point where it needs to get more people
involved writing drivers. The substrate is there, so a lot of the hard
decisions have been made, but the actual implementation for various
hardware controllers and gadget classes is missing. This could be a fun
area of development for people who would like to get into kernel
programming.
Bartlomiej Zolnierkiewicz did a fair amount of IDE work during the Martin
Dalecki period last year. He has been quieter in recent times, leaving the
current IDE cleanup effort to others. This week marked an end to that
silence, however, as Bartlomiej surfaced with a significant set of
IDE patches.
The first part, based on work originally by Suparna Bhattacharya, changes
the way the block layer BIO structure and
request structure work together. The BIO structure contains the pointers
needed to keep track of which I/O transfers have been completed, and which
have not. What is lacking, however, is any way of tracking which
operations have been commenced. That tracking has traditionally been the
driver's job.
The new "BIO traversal" patches change things by adding a new set of
pointers which mark where the next I/O operation should begin. A new
support function (with the unwieldy name
process_that_request_first()) gives drivers a way to indicate that
processing has begun on part of the request. Overall, it could be a useful
infrastructure for drivers with relatively complicated request processing.
The real change, however, is a new set of IDE programmed I/O (PIO)
handlers. One wouldn't think that PIO matters much, given that DMA
operations are available. But, in fact, quite a few IDE drives still don't
handle DMA well, and the Linux kernel tends to be very conservative about
enabling DMA. Unless you've taken steps to change things, chances are that
your IDE-based Linux system is running with programmed I/O. So the quality
of the PIO handlers matters.
Bartlomiej's new PIO code simplifies (and clarifies) things considerably.
It also uses the ATA taskfile mode of operation. The taskfile code has
been broken for some time, with the result that it is disabled in 2.5 and
was going to stay that way through the 2.6 release. After seeing the new
PIO handlers, however, Alan Cox has changed
his strategy: "Looks like the revised plan is 'pure taskfile for
2.6 care of Bartlomiej'."
IDE patches (after last year's IDE regime change) are treated with great
care; they don't even make it into development kernels until the confidence
level is quite high. So Bartlomiej's changes probably won't appear in the
2.5 mainstream for a little while yet. They should eventually get there,
however, and the result will be an improved IDE subsystem for 2.6. (The
full set of patches can be found in the "Patches and Updates" section,
below.)
Driver porting
The
driver porting series this week
features two articles on the management of request queues and structures.
Those of you who are uninterested in the Linux block layer will be glad to
know that these articles finish out the series on writing block drivers.
We'll move on to some other, as yet undetermined, subject next week.
The
simple block driver
example earlier in this series showed how to write the simplest possible
request function. Most block drivers, however, will need greater control
over how requests are built and processed. This article will get into the
details of how request queues work, with an emphasis on what every driver
writer needs to know to process requests. A
second article looks at some of the more
advanced features of request queues in 2.6.
Request queues
Request queues are represented by a pointer to struct request_queue or to
the typedef request_queue_t, defined in <linux/blkdev.h>. One request queue
can be
shared across multiple physical drives, but the normal usage is to create a
separate queue for each drive. Request queues must be
allocated and initialized by the block subsystem; this allocation (and
initialization) is done by:
    request_queue_t *blk_init_queue(request_fn_proc *request_fn,
                                    spinlock_t *lock);
Here request_fn is the
driver's function which will process requests, and lock is a
spinlock which controls access to the queue. The return value is a pointer
to the newly-allocated request queue if the initialization succeeded, or
NULL otherwise.
Since setting up a request queue requires memory
allocation, failure is possible. A couple of other changes from 2.4 should be
noted: a spinlock must be provided to control access to the queue
(io_request_lock is no more), and there is no per-major "default"
queue provided in 2.6.
When a driver is done with a request queue, it should pass it back to the
system with:
void blk_cleanup_queue(request_queue_t *q);
Note that neither of these functions is normally called if a "make request"
function is being used (make request functions are covered in part II).
Basic request processing
The request function prototype has not changed from 2.4; it gets the
request queue as its only parameter. The queue lock will be held when the
request function is called.
All request handlers, from the simplest to the most complicated, will find
the next request to process with:
struct request *elv_next_request(request_queue_t *q);
The return value is the next request that should be processed, or
NULL if the queue is empty. If you look through the kernel
source, you will find references to blk_queue_empty() (or
elv_queue_empty()), which tests the state of the queue. Use of
this function in drivers is best avoided, however. In the future, it could
be that a non-empty queue still has no requests that are ready to be
processed.
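The shape of such a request function can be sketched in user space. This is only a toy model under stated assumptions: the toy_* structures and functions are hypothetical stand-ins invented for illustration, not the kernel's definitions, and locking is left out entirely. What it shows is the loop that keeps asking for the next request until NULL comes back, rather than testing for an empty queue.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; not the real definitions. */
struct toy_request {
    unsigned long sector;       /* starting sector, in 512-byte units */
    unsigned long nr_sectors;   /* number of sectors to transfer */
    struct toy_request *next;
};

struct toy_queue {
    struct toy_request *head;
    unsigned long sectors_done; /* bookkeeping for the example only */
};

/* Plays the role of elv_next_request(): peek at the head, NULL when idle. */
static struct toy_request *toy_next_request(struct toy_queue *q)
{
    return q->head;
}

/* Plays the role of dequeuing a request once the driver owns it. */
static void toy_dequeue_request(struct toy_queue *q, struct toy_request *rq)
{
    if (q->head == rq)
        q->head = rq->next;
    rq->next = NULL;
}

/* A request function: loop until the queue hands back NULL. */
static int toy_request_fn(struct toy_queue *q)
{
    struct toy_request *rq;
    int handled = 0;

    while ((rq = toy_next_request(q)) != NULL) {
        toy_dequeue_request(q, rq);
        q->sectors_done += rq->nr_sectors;  /* pretend to move the data */
        handled++;
    }
    return handled;
}
```

A real driver would start hardware operations inside the loop and complete the requests later; the model finishes each one immediately to stay short.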
In 2.4 and prior kernels, a block request contained one or more buffer
heads with sectors to be transferred. In 2.6, a request contains a list of
BIO structures instead. This list can be
accessed via the bio member of the request structure, but
the recommended way of iterating through a request's BIOs is instead:
    struct bio *bio;

    rq_for_each_bio(bio, req) {
        /* Process this BIO */
    }
Drivers which use this macro are less likely to break in the future. Do
note, however, that many drivers will never need to iterate through the
list of BIOs in this way; for DMA transfers, use blk_rq_map_sg()
(described below) instead.
As your driver performs the transfers described by the BIO structures, it
will need to update the kernel on its progress. Note that drivers should
not call bio_endio() as transfers complete; the block layer
will take care of that. Instead, the driver should call
end_that_request_first(), which has a different prototype in 2.6:
    int end_that_request_first(struct request *req, int uptodate,
                               int nsectors);
Here, req is the request being handled, uptodate is
nonzero unless an error has occurred, and nsectors is the number
of sectors which were transferred. This function will clean up as many BIO
structures as are covered by the given number of sectors, and return
nonzero if any BIOs remain to be transferred.
When the request is complete (end_that_request_first() returns
zero), the driver should clean up the request. The cleanup task involves
removing the request from the queue, then passing it to
end_that_request_last(), which is unchanged from 2.4. Note that
the queue lock must be held when calling both of these functions:
void blkdev_dequeue_request(struct request *req);
void end_that_request_last(struct request *req);
Note that the driver can dequeue the request at any time (as long as it
keeps track of it, of course). Drivers which keep multiple requests in
flight will need to dequeue each request as it is passed to the drive.
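The completion accounting described above can be modeled in a few lines. This is a sketch under assumptions: toy_request and toy_end_request_first() are hypothetical names, a request is reduced to an array of per-BIO sector counts, and error handling is reduced to a counter. It illustrates the return convention only: nonzero while BIOs remain, zero when the request is finished and ready to be dequeued and handed to the final completion call.

```c
#define TOY_MAX_BIOS 4

/* Toy request: just the number of sectors still owed by each BIO. */
struct toy_request {
    unsigned int bio_sectors[TOY_MAX_BIOS];
    unsigned int nr_bios;
    unsigned int cur;   /* index of the first BIO not yet fully completed */
    int errors;         /* count of completions reported with uptodate == 0 */
};

/*
 * Models end_that_request_first(): retire nsectors worth of BIOs and
 * return nonzero if part of the request is still outstanding.
 */
static int toy_end_request_first(struct toy_request *req, int uptodate,
                                 unsigned int nsectors)
{
    if (!uptodate)
        req->errors++;

    while (nsectors > 0 && req->cur < req->nr_bios) {
        unsigned int *left = &req->bio_sectors[req->cur];
        unsigned int n = nsectors < *left ? nsectors : *left;

        *left -= n;
        nsectors -= n;
        if (*left == 0)
            req->cur++;     /* this BIO is fully transferred */
    }
    return req->cur < req->nr_bios;
}
```

With BIOs of 8, 8 and 4 sectors, completing 10 sectors retires the first BIO and part of the second, so the call returns nonzero; completing the remaining 10 returns zero.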
If your device does not have predictable timing behavior, your driver
should contribute its timing information to the system's entropy pool.
That is done with:
void add_disk_randomness(struct gendisk *disk);
BIO walking
The "BIO walking" patch was added in 2.5.70. This patch adds some request
structure fields and a new function to help complicated drivers keep track of
where they are in a given request. Drivers using BIO walking will not use
rq_for_each_bio(); instead, they rely upon the fact that the
cbio field of the request structure refers to the current,
unprocessed BIO, nr_cbio_segments tells how many segments remain to be
processed in that BIO, and nr_cbio_sectors tells how many sectors are yet
to be transferred. The macro:
int blk_rq_idx(struct request *req)
returns the index of the next segment to process. If you need to access the
current segment buffer directly (for programmed I/O, say), you may use:
char *rq_map_buffer(struct request *req, unsigned long *flags);
void rq_unmap_buffer(char *buffer, unsigned long flags);
These functions potentially deal with atomic kmaps, so the usual
constraints apply: no sleeping while the mapping is in effect, and buffers
must be mapped and unmapped in the same function.
When beginning I/O on a set of blocks from the request, your driver can
update the current pointers with:
    int process_that_request_first(struct request *req,
                                   unsigned int nr_sectors);
This function will update the various cbio values in the request,
but does not signal completion (you still need
end_that_request_first() for that).
Use of process_that_request_first() is optional; your driver may
call it if you would like the block subsystem to track your current
position in the request for I/O submission independently from how much of
the request has actually been completed.
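The distinction between "submitted" and "completed" can be illustrated with two independent cursors. The following is a user-space sketch, not the kernel's implementation: the structure and function names are hypothetical, and the return convention of toy_process_first() (nonzero while sectors remain to be submitted) is an assumption modeled on end_that_request_first(), since the text above does not specify it.

```c
/* Toy request tracking submission and completion separately. */
struct toy_request {
    unsigned int nr_sectors;  /* total size of the request */
    unsigned int submitted;   /* sectors handed to the hardware so far */
    unsigned int completed;   /* sectors the hardware has finished */
};

/* Plays the role of process_that_request_first(): advance the submission
 * cursor without signalling any completion. */
static int toy_process_first(struct toy_request *req, unsigned int nr)
{
    unsigned int left = req->nr_sectors - req->submitted;

    req->submitted += nr < left ? nr : left;
    return req->submitted < req->nr_sectors;
}

/* Plays the role of end_that_request_first(): advance the completion
 * cursor, never past what has actually been submitted. */
static int toy_end_first(struct toy_request *req, unsigned int nr)
{
    unsigned int in_flight = req->submitted - req->completed;

    req->completed += nr < in_flight ? nr : in_flight;
    return req->completed < req->nr_sectors;
}
```

A driver that starts the second half of a request before the first half has completed would see the two cursors diverge, which is exactly the state this bookkeeping is meant to capture.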
Barrier requests
Requests will come off the request queue sorted into an order that should
give good performance. Block drivers (and the devices they drive) are free
to reorder those requests within reason, however. Drives which support
features like tagged command queueing and write caching will often complete
operations in an order different from that in which they received the
requests. Most of the time, this reordering leads to improved performance
and is a good thing.
At times, however, it is necessary to inhibit this reordering. The classic
example is that of journaling filesystems, which must be able to force
journal entries to the disk before the operations they describe.
Reordering of requests can undermine the filesystem integrity that a
journaling filesystem is trying to provide.
To meet the needs of higher-level layers, the concept of a "barrier
request" has been added to the 2.6 kernel. Barrier requests are marked by
the REQ_HARDBARRIER flag in the request structure flags
field. When your driver encounters a barrier request, it must complete
that request (and all that preceded it) before beginning any requests after
the barrier request. "Complete," in this case, means that the data has
been physically written to the disk platter - not just transferred to the
drive.
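One way to make the ordering rule concrete is a checker that decides whether a given completion order is legal. This is a sketch with hypothetical names: the barrier member stands in for REQ_HARDBARRIER being set in the flags field, and order[] lists queue positions in the order the requests reached the platter. Two requests may swap places only if neither is a barrier and no barrier was queued between them.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_request {
    unsigned long sector;
    bool barrier;   /* stands in for REQ_HARDBARRIER in the flags field */
};

/*
 * rq[] is in queue order; order[i] is the queue index of the i-th request
 * to be completed. Returns true if no request crossed a barrier.
 */
static bool toy_order_respects_barriers(const struct toy_request *rq,
                                        const size_t *order, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            size_t first = order[i], second = order[j];

            if (second >= first)
                continue;   /* this pair kept its queue order */
            /* The pair was swapped: illegal if a barrier lies in between
             * or if either of the two is itself a barrier. */
            for (size_t k = second; k <= first; k++)
                if (rq[k].barrier)
                    return false;
        }
    }
    return true;
}
```

The cubic loop is chosen for clarity, not speed; it is meant as an executable statement of the rule rather than something a driver would run.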
Tweaking request queue parameters
The block subsystem contains a long list of functions which control how I/O
requests are created for your driver. Here are a few of them.
Bounce buffer control: in 2.4, the block code assumed that devices
could not perform DMA to or from high memory addresses. When I/O buffers
were located in high memory, data would be copied to or from low-memory
"bounce" buffers; the driver would then operate on the low-memory buffer.
Most modern devices can handle (at a minimum) full 32-bit DMA addresses, or
even 64-bit addresses. For now, 2.6 will still use bounce buffers for
high-memory addresses. A driver can change that behavior with:
void blk_queue_bounce_limit(request_queue_t *q, u64 dma_addr);
After this call, any buffer whose physical address is at or above
dma_addr will be copied through a bounce buffer. The driver can
provide any reasonable address, or one of BLK_BOUNCE_HIGH (bounce
high memory pages, the default), BLK_BOUNCE_ANY (do not use bounce
buffers at all), or BLK_BOUNCE_ISA (bounce anything above the ISA
DMA threshold).
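The bounce decision, as described above, comes down to a single comparison. The sketch below is illustrative only: the TOY_LIMIT_* constants are hypothetical values chosen for the example (a 16MB ISA-style limit and a 4GB limit for a device with 32-bit addressing) and are not the kernel's BLK_BOUNCE_* definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits for the example; not the kernel's constants. */
#define TOY_LIMIT_ISA   (UINT64_C(1) << 24)  /* 16MB: 24-bit ISA DMA reach */
#define TOY_LIMIT_32BIT (UINT64_C(1) << 32)  /* device with 32-bit DMA */

/* A buffer at or above the limit must go through a bounce buffer. */
static bool toy_needs_bounce(uint64_t phys_addr, uint64_t limit)
{
    return phys_addr >= limit;
}
```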
Request clustering control. The block subsystem works hard to
coalesce adjacent requests for better performance. Most devices have
limits, however, on how large those requests can be. A few functions have
been provided to tell the block subsystem about those limits, so that it
does not create requests which must be split apart again.
void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors);
Sets the maximum number of sectors which may be transferred in a single
request; default is 255. It is not possible to set the maximum below the
number of sectors contained in one page.
    void blk_queue_max_phys_segments(request_queue_t *q,
                                     unsigned short max_segments);
    void blk_queue_max_hw_segments(request_queue_t *q,
                                   unsigned short max_segments);
The maximum number of discontiguous physical segments in a single request;
this is the maximum size of a scatter/gather list that could be presented
to the device. The first function controls the number of distinct memory
segments in the request; the second does the same, but it takes into
account the remapping
which can be performed by the system's I/O memory management unit (if
any). The default for both is 128 segments.
    void blk_queue_max_segment_size(request_queue_t *q,
                                    unsigned int max_size);
The maximum size that any individual segment within a request can be. The
default is 65536 bytes.
    void blk_queue_segment_boundary(request_queue_t *q,
                                    unsigned long mask);
Some devices cannot perform transfers which cross memory boundaries of a
certain size. If your device is one of these, you should call
blk_queue_segment_boundary() with a mask indicating where
the boundary is. If, for example, your hardware has a hard time crossing
4MB boundaries, mask should be set to 0x3fffff. The
default is 0xffffffff.
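To see how such a mask is interpreted, consider a small test for whether a segment crosses a boundary. This is a sketch of the arithmetic only (toy_crosses_boundary() is a hypothetical helper, not a block-layer function): two addresses lie within the same window exactly when they agree on all bits above the mask, which can be checked by forcing the masked bits on and comparing.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the segment [start, start + len) spans a boundary described by
 * mask (for example 0x3fffff for 4MB boundaries). len must be nonzero.
 */
static bool toy_crosses_boundary(uint64_t start, uint64_t len, uint64_t mask)
{
    uint64_t last = start + len - 1;    /* address of the final byte */

    return (start | mask) != (last | mask);
}
```

With a mask of 0x3fffff, a 4KB segment starting at 0x3ff000 ends at 0x3fffff and stays inside the first 4MB window, while an 8KB segment from the same address reaches 0x400fff and crosses into the next one.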
Finally, some devices have more esoteric restrictions on which requests may
or may not be clustered together. For situations where the above
parameters are insufficient, a block driver can specify a function which
can examine (and pass judgement on) each proposed merge.
    typedef int (merge_bvec_fn) (request_queue_t *q, struct bio *bio,
                                 struct bio_vec *bvec);
void blk_queue_merge_bvec(request_queue_t *q, merge_bvec_fn *fn);
Once the given fn is associated with this queue, it will be called
every time a bio_vec entry bvec is being considered for
addition to the given bio. It should return the number of bytes
from bvec which can be added; zero should be returned if the new
segment cannot be added at all. By default, there is no
merge_bvec_fn.
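As an illustration of the return convention, here is a user-space model of a merge function for an imaginary device which cannot handle a BIO that crosses a 64KB chunk boundary. Everything here is assumed for the example: the toy_* types carry only the fields the calculation needs, the queue argument is omitted, and the chunk size is arbitrary.

```c
/* Minimal stand-ins holding only what the calculation needs. */
struct toy_bio {
    unsigned long sector;   /* starting sector, 512-byte units */
    unsigned int size;      /* bytes already in the BIO */
};

struct toy_bvec {
    unsigned int len;       /* bytes in the candidate segment */
};

#define TOY_CHUNK_SECTORS 128ul     /* hypothetical 64KB chunk */

/* Return how many bytes of bvec fit without crossing a chunk boundary. */
static unsigned int toy_merge_bvec(const struct toy_bio *bio,
                                   const struct toy_bvec *bvec)
{
    unsigned long chunk_bytes = TOY_CHUNK_SECTORS * 512ul;
    /* Bytes of the current chunk consumed up to the end of the BIO. */
    unsigned long used = (bio->sector % TOY_CHUNK_SECTORS) * 512ul
                         + bio->size;
    unsigned long room = used < chunk_bytes ? chunk_bytes - used : 0;

    return bvec->len < room ? bvec->len : (unsigned int)room;
}
```

A BIO starting eight sectors before the end of a chunk can accept one 4096-byte segment in full; once it holds 2048 bytes only 2048 more fit, and once it reaches the boundary the function returns zero.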
Setting the hardware sector size. The old hardsect_size
global array is gone and nobody misses it. Block drivers now inform the
system of the underlying hardware's sector size with:
void blk_queue_hardsect_size(request_queue_t *q, unsigned short size);
The default is the usual 512-byte sector. There is one other important
change with regard to sector sizes: your driver will always see requests
expressed in terms of 512-byte sectors, regardless of the hardware sector
size. The block subsystem will not generate requests which go against the
hardware sector size, but sector numbers and counts in requests are always
in 512-byte units. This change was required as part of the new centralized
partition remapping.
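A driver for hardware with larger sectors therefore has to convert. A minimal sketch of that arithmetic follows; toy_to_hw_units() is a hypothetical helper, and it assumes the hardware sector size is a multiple of 512 and that the value being converted is aligned to it (which, per the text above, the block subsystem arranges).

```c
#include <stdint.h>

/*
 * Convert a sector number or sector count expressed in 512-byte units
 * into hardware-sector units. hardsect_size must be a multiple of 512.
 */
static uint64_t toy_to_hw_units(uint64_t n512, unsigned int hardsect_size)
{
    return n512 / (hardsect_size / 512u);
}
```

For a device with 2048-byte sectors, a request starting at sector 8 for 4 sectors (in 512-byte units) becomes hardware sector 2 for a count of 1.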
DMA support
Most block I/O requests will come down to one or more DMA operations.
The 2.6 block layer provides a couple of functions designed to make the
task of setting up DMA operations easier.
void blk_queue_dma_alignment(request_queue_t *q, int mask);
This function sets a mask indicating what sort of memory alignment the
hardware needs for DMA requests; the default is 511.
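An alignment mask of this kind is normally tested with a bitwise AND. The sketch below is an assumption about how such a mask would be applied, not a description of the block layer's internals: toy_dma_aligned() is a hypothetical helper which requires both the buffer address and its length to have no bits set under the mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if both address and length are multiples of (mask + 1). */
static bool toy_dma_aligned(uint64_t addr, uint32_t len, uint32_t mask)
{
    return ((addr | len) & mask) == 0;
}
```

With the default mask of 511, a buffer at 0x1000 passes while one at 0x1004 does not; a device needing only 4-byte alignment would use a mask of 3.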
DMA operations to modern devices usually require the creation of a
scatter/gather list of segments to be transferred. A block driver can
create this "scatterlist" using the generic DMA support routines and the
information found in the request. The block subsystem has made life a
little easier, though. A simple call to:
    int blk_rq_map_sg(request_queue_t *q, struct request *rq,
                      struct scatterlist *sg);
will construct a scatterlist for the given request; the return value is the
number of entries in the resulting list. This scatterlist can then be
passed to pci_map_sg() or dma_map_sg() in preparation for
the DMA operation.
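The general idea of building a scatterlist can be sketched in user space. This model is an assumption-laden simplification: toy_segment and toy_map_sg() are hypothetical, and the only rules applied are that physically adjacent segments are coalesced as long as the result stays within a maximum segment size. The real function's merging rules are more involved.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_segment {
    uint64_t addr;  /* physical address of the segment */
    uint32_t len;   /* length in bytes */
};

/*
 * Copy seg[] into sg[], coalescing physically contiguous neighbours up to
 * max_seg_size bytes. sg[] must have room for nseg entries. Returns the
 * number of entries written.
 */
static size_t toy_map_sg(const struct toy_segment *seg, size_t nseg,
                         struct toy_segment *sg, uint32_t max_seg_size)
{
    size_t n = 0;

    for (size_t i = 0; i < nseg; i++) {
        if (n > 0 &&
            sg[n - 1].addr + sg[n - 1].len == seg[i].addr &&
            (uint64_t)sg[n - 1].len + seg[i].len <= max_seg_size) {
            sg[n - 1].len += seg[i].len;    /* extend previous entry */
        } else {
            sg[n++] = seg[i];               /* start a new entry */
        }
    }
    return n;
}
```

Three 4KB segments at 0x1000, 0x2000 and 0x8000 collapse into two entries when the size limit is 64KB, and stay as three when the limit is 4KB.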
Going on
The second part of the request queue article
series looks at command preparation, tagged command queueing, and writing
drivers which do without a request queue altogether.
This article continues the look at request queues in 2.6; if you've not
read
the first part in the request queue
series, you may want to start there. Here we'll look at command
pregeneration, tagged command queueing, and doing without a request queue
altogether.
Command pregeneration
Traditionally, block drivers have prepared low-level hardware commands at
the time a request is processed. There can be advantages to preparing
commands at an earlier point, however. In 2.6, drivers which wish to
prepare commands (or perform some other sort of processing) for requests
before they hit the request function should set up a prep_rq_fn with this
prototype:
typedef int (prep_rq_fn) (request_queue_t *q, struct request *rq);
This function should perform preparatory work on the given request
rq. The 2.6 request structure includes a 16-byte
cmd field where a pregenerated command can be stored;
rq->cmd_len should be set to the length of that command.
The prep function should return BLKPREP_OK (process the
request normally), BLKPREP_DEFER (which defers processing of the
command for now), or BLKPREP_KILL (which terminates the request
with a failure status).
To add your prep function to a request queue, call:
void blk_queue_prep_rq(request_queue_t *q, prep_rq_fn *pfn);
The prep function is currently called out of elv_next_request() -
immediately before the request is passed back to your driver. There is a
possibility that, at some future point, the call to the prep function could
happen earlier in the process.
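What a prep function might look like can be shown with a user-space model. All of it is hypothetical: the TOY_PREP_* values merely mirror the three outcomes described above, the six-byte command layout (opcode, big-endian sector, count) belongs to an imaginary device, and the slots_free argument stands in for whatever resource check a real driver would make before deciding to defer.

```c
#include <string.h>

enum toy_prep { TOY_PREP_OK, TOY_PREP_DEFER, TOY_PREP_KILL };

struct toy_request {
    unsigned long sector;
    unsigned int nr_sectors;
    int write;
    unsigned char cmd[16];  /* room for a pregenerated command */
    unsigned int cmd_len;
};

#define TOY_DISK_SECTORS (1ul << 20)    /* size of the imaginary disk */

static enum toy_prep toy_prep_rq(struct toy_request *rq, int slots_free)
{
    /* Requests the device can never satisfy are failed outright. */
    if (rq->nr_sectors == 0 || rq->nr_sectors > 255 ||
        rq->sector + rq->nr_sectors > TOY_DISK_SECTORS)
        return TOY_PREP_KILL;

    /* No command slot right now: ask to be called again later. */
    if (slots_free <= 0)
        return TOY_PREP_DEFER;

    memset(rq->cmd, 0, sizeof(rq->cmd));
    rq->cmd[0] = rq->write ? 0x02 : 0x01;   /* made-up opcodes */
    rq->cmd[1] = (unsigned char)(rq->sector >> 24);
    rq->cmd[2] = (unsigned char)(rq->sector >> 16);
    rq->cmd[3] = (unsigned char)(rq->sector >> 8);
    rq->cmd[4] = (unsigned char)rq->sector;
    rq->cmd[5] = (unsigned char)rq->nr_sectors;
    rq->cmd_len = 6;
    return TOY_PREP_OK;
}
```

By the time the request function sees such a request, the command bytes are ready to be copied to the hardware.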
Tagged command queueing
Tagged command queueing (TCQ) allows a block device to have multiple
outstanding I/O requests, each identified by an integer "tag." TCQ can
yield performance benefits; the drive generally knows best when it comes to
figuring out which request should be serviced next. SCSI drivers in Linux
have long supported TCQ, but each driver has included its own
infrastructure for tag management. In 2.6, a simple tag management
facility has been added to the block layer. The generic tag management
code can make life easier, but it's important to understand how these
functions interact with the request queue.
Drivers wishing to use tags should set things up with:
    int blk_queue_init_tags(request_queue_t *q, int depth,
                            struct blk_queue_tag *tags);
This call should be made after the queue has been initialized. Here,
depth is the maximum number of tagged commands which can be
outstanding at any given time. The tags argument is a pointer to
a blk_queue_tag structure which will be used to track the
outstanding tags. Normally you can pass tags as NULL,
and the block subsystem will allocate and initialize the structure for
you. If you wish to share a structure (and, thus, the tags it represents)
with another device, however, you can pass a pointer to the
blk_queue_tag structure in the first queue when initializing the
second. This call performs memory allocation, and
will return a negative error code if that allocation failed.
A call to:
void blk_queue_free_tags(request_queue_t *q);
will clean up the TCQ infrastructure. This normally happens automatically
when blk_cleanup_queue() is called, so drivers do not normally
have to call blk_queue_free_tags() themselves.
To allocate a tag for a request, use:
int blk_queue_start_tag(request_queue_t *q, struct request *rq);
This function will associate a tag number with the given request
rq, storing it in rq->tag. The return value will be
zero on success, or a nonzero value if there are no more tags available.
This function will remove the request from the queue, so your driver must
take care not to lose track of it - and to not try to dequeue the request
itself. It is also necessary to hold the queue
lock when calling blk_queue_start_tag().
blk_queue_start_tag() has been designed to work as the command
prep function. If your driver would like to have tags automatically
assigned, it can perform a call like:
blk_queue_prep_rq(queue, blk_queue_start_tag);
And every request that comes from elv_next_request() will already
have a tag associated with it.
If you need to know if a given request has a tag associated with it, use the
macro blk_rq_tagged(rq). The return value will be nonzero if
this request has been tagged.
When all transfers for a tagged request have been completed, the tag should
be returned with:
void blk_queue_end_tag(request_queue_t *q, struct request *rq);
Timing is important here: blk_queue_end_tag() must be called
before end_that_request_last(), or unpleasant things will happen.
Be sure to have the queue lock held when calling this function.
If you need to know which request is associated with a given tag, call:
struct request *blk_queue_find_tag(request_queue_t *q, int tag);
The return value will be the request structure, or NULL
if the given tag is not currently in use.
In the real world, things occasionally go wrong. If a drive (or the bus it
is attached to) goes into an error state and must be reset, all outstanding
tagged requests will be lost. In such a situation, the driver should call:
void blk_queue_invalidate_tags(request_queue_t *q);
This call will return all outstanding tags to the pool, and the associated
I/O requests will be returned to the request queue so that they can be
restarted.
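The bookkeeping behind a tag facility of this kind is small enough to model directly. The sketch below is a user-space illustration with hypothetical names, not the block layer's code: a fixed table maps each tag to its request, starting a tag claims the first free slot, and ending a tag releases it. Queue manipulation and locking are left out.

```c
#include <stddef.h>

#define TOY_MAX_DEPTH 32

struct toy_request {
    int tag;    /* -1 when untagged */
};

struct toy_tags {
    int depth;
    struct toy_request *index[TOY_MAX_DEPTH];   /* tag -> request */
};

static int toy_init_tags(struct toy_tags *t, int depth)
{
    if (depth < 1 || depth > TOY_MAX_DEPTH)
        return -1;
    t->depth = depth;
    for (int i = 0; i < TOY_MAX_DEPTH; i++)
        t->index[i] = NULL;
    return 0;
}

/* Returns zero on success, nonzero if every tag is in use. */
static int toy_start_tag(struct toy_tags *t, struct toy_request *rq)
{
    for (int i = 0; i < t->depth; i++) {
        if (t->index[i] == NULL) {
            t->index[i] = rq;
            rq->tag = i;
            return 0;
        }
    }
    return 1;
}

static void toy_end_tag(struct toy_tags *t, struct toy_request *rq)
{
    if (rq->tag >= 0 && rq->tag < t->depth)
        t->index[rq->tag] = NULL;
    rq->tag = -1;
}

static struct toy_request *toy_find_tag(struct toy_tags *t, int tag)
{
    if (tag < 0 || tag >= t->depth)
        return NULL;
    return t->index[tag];
}
```

With a depth of two, a third request cannot be tagged until one of the first two has ended its tag, which is the point at which a driver would stop feeding the device.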
Doing without a request queue
Some devices have no real need for a request queue. In particular, truly
random-access devices, such as memory technology devices or ramdisks, can
process requests quickly and do not benefit from sorting and merging of
requests. Drivers for such devices may achieve better performance by
shorting out much of the request queue structure and handling requests
directly as they are generated.
As in 2.4, this sort of driver can set up a "make request" function.
First, however, the request queue must still be created. The queue will
not be used to handle the actual requests, but it contains other
infrastructure needed by the block subsystem. If your driver will use a
make request function, it should first create the queue with
blk_alloc_queue():
request_queue_t *blk_alloc_queue(int gfp_mask);
The gfp_mask argument describes how the requisite memory should be
allocated, as usual. Note that this call can fail.
Once you have a request queue, you can set up the make request function;
the prototype for this function has changed a bit from 2.4, however:
typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);
If the make request function can arrange for the transfer(s) described in the given
bio, it should do so and return zero. "Stacking" drivers can also
redirect the bio by changing its bi_bdev field and returning
nonzero; in this case the bio will then be dispatched to the new
device's driver (this is as things were done in 2.4).
If the "make request" function performs the transfer itself, it is
responsible for passing the BIO to bio_endio() when the transfer
is complete. Note that the "make request" function is not called
with the queue lock held.
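For a ramdisk-like device, the whole job of a make request function is a bounds check, a memory copy, and a completion call. The following user-space model shows that flow under stated assumptions: toy_bio, toy_bio_endio() and toy_make_request() are hypothetical simplifications (a BIO is reduced to one contiguous buffer, and the queue argument is dropped), and the "disk" is a static array.

```c
#include <stddef.h>
#include <string.h>

#define TOY_SECTOR       512u
#define TOY_DISK_SECTORS 16u

/* A BIO reduced to a single contiguous buffer. */
struct toy_bio {
    unsigned long sector;   /* start, in 512-byte units */
    unsigned int size;      /* bytes to transfer */
    int write;              /* nonzero for a write */
    unsigned char *data;
    int done;               /* set when the BIO has been completed */
    int error;              /* zero on success */
};

static unsigned char toy_disk[TOY_DISK_SECTORS * TOY_SECTOR];

/* Stands in for bio_endio(): record the outcome of the transfer. */
static void toy_bio_endio(struct toy_bio *bio, int error)
{
    bio->error = error;
    bio->done = 1;
}

/* Handle the BIO directly and return zero: nothing left to dispatch. */
static int toy_make_request(struct toy_bio *bio)
{
    size_t off = (size_t)bio->sector * TOY_SECTOR;

    if (off + bio->size > sizeof(toy_disk)) {
        toy_bio_endio(bio, -1);     /* beyond the end of the device */
        return 0;
    }
    if (bio->write)
        memcpy(toy_disk + off, bio->data, bio->size);
    else
        memcpy(bio->data, toy_disk + off, bio->size);
    toy_bio_endio(bio, 0);
    return 0;
}
```

Since the function completes every BIO itself, it always returns zero; only a stacking driver that redirects the BIO to another device would return nonzero.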
To arrange for your driver's function to be called, use:
    void blk_queue_make_request(request_queue_t *q,
                                make_request_fn *func);
If and when your driver shuts down, be sure to return the request queue to
the system with:
void blk_put_queue(request_queue_t *queue);
As of 2.6.0-test3, this function is just another name for
blk_cleanup_queue(), but such things could always change in the
future.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet