Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.66, which came out on March 24. This large patch contains a great many small fixes. It also has some more IDE changes, some ext3 performance improvements, the 32-bit dev_t preparatory patches (see last week's Kernel Page and below), more devfs chopping, the new sys_epoll() API (covered briefly here two weeks ago), a big framebuffer update, an ALSA update, and an XFS update. See Linus's announcement or the long-format changelog for the details.

Linus's BitKeeper repository contains an XFS update, a USB update, and a number of architecture updates (ARM, SPARC64, x86-64, PPC64).
The current prepatch from Alan Cox is 2.5.65-ac3, which adds another set of small fixes.
The current stable kernel is 2.4.20; Marcelo tried to catch us by releasing 2.4.21-pre6 late on Wednesday, but we're on to him. This release contains many fixes, of course (including a large set from the -ac tree and the ptrace() fix), and some architecture updates. The first 2.4.21 release candidate is apparently coming soon.
There has been some significant disagreement over whether 2.4.21 should be rushed out with the fix for the ptrace() vulnerability. Numerous people, it is said, run kernels obtained from kernel.org, but do not follow the mailing list closely enough to pick up needed security patches. Rather than leave those people vulnerable, a new release (containing, perhaps, only the security fix) should be made available as soon as possible. On the other side, it is argued that distributors have made patched kernels available, and anybody who is concerned can patch their kernels themselves.
The apparent resolution is that there will not be an expedited 2.4.21 release with the fix. Certainly no such kernel has been released; Marcelo has been completely silent on the matter.
Kernel development news
Toward a larger dev_t
The 2.5.66 kernel includes Andries Brouwer's patches clearing the path for an expansion of the dev_t device number type. A small number of problems have been found, but the changes are working for most people. Andrew Morton has gone a little further and actually changed dev_t to 32 bits in his -mm tree; predictably, the number of problems found there has been a little higher. As a whole, though, the transition appears to be going relatively smoothly.

Badari Pulavarty decided that it was time to play with the possibilities of a larger device number type; he posted this patch, which enables the SCSI disk driver to make full use of the expanded minor number range. Testing with 4000 virtual disks, with 50 real drives at the end of the range, worked - for the most part. Some scaling problems did turn up, however.
The most significant one appears to be in the request queue mechanism. When the kernel wants to issue a block I/O request, the block subsystem needs to be able to set it up quickly. In particular, memory allocations are best avoided at that point; it's possible that the system is out of memory and the kernel is doing I/O in an attempt to free up some space. Trying to allocate memory at that point can lead to deadlocks. So the block subsystem sets aside a number of pre-allocated request structures for every request queue (and there is typically one request queue for each physical drive in the system). That number varies depending on the amount of memory on the system; it can be as low as 32, and as high as 256. Request structures run about 144 bytes each. So, if one assumes that a system hosting 4000 disks really should be equipped with a fair amount of memory, the block subsystem will set aside about a million request structures, at a cost of about 150MB. And that is just the beginning; the deadline I/O scheduler augments each request structure with a separate deadline_rq structure. Other overheads exist as well.
The end result is that, when the number of disks gets large, a great deal of memory (which must all be in the low memory zone on 32-bit processors) gets tied up in request queues. As Andrew Morton pointed out, with 4000 disks, enough request structures have been allocated to represent 200GB of current I/O requests. That, perhaps, is a bit more than is really needed in most situations.
The solution, as hacked up by Jens Axboe, is to go to a more dynamic scheme for the allocation of request structures. The mempool mechanism is used to keep an absolute minimum number of request structures available for each queue; all the rest are allocated as needed and freed afterwards. This patch will probably go through a few more iterations, but the immediate scalability problem has been addressed.
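For the curious, here is a minimal sketch of how that kind of mempool-backed scheme works. It mirrors the technique, not Jens's actual patch; my_request_cachep, my_rq_pool, MY_MIN_REQUESTS, and the my_*_request() helpers are all hypothetical names.

#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/blkdev.h>

/* Assume my_request_cachep was set up with kmem_cache_create()
 * at initialization time. */
static kmem_cache_t *my_request_cachep;
static mempool_t *my_rq_pool;

static void *rq_alloc(int gfp_mask, void *data)
{
	return kmem_cache_alloc(my_request_cachep, gfp_mask);
}

static void rq_free(void *element, void *data)
{
	kmem_cache_free(my_request_cachep, element);
}

#define MY_MIN_REQUESTS 4	/* reserved requests per queue */

static int my_queue_init(void)
{
	my_rq_pool = mempool_create(MY_MIN_REQUESTS, rq_alloc,
				    rq_free, NULL);
	return my_rq_pool ? 0 : -ENOMEM;
}

/* GFP_NOIO keeps the allocation from recursing back into the I/O
 * path; if the slab allocation fails, mempool_alloc() falls back
 * to the pool's reserved elements rather than failing outright. */
static struct request *my_get_request(void)
{
	return mempool_alloc(my_rq_pool, GFP_NOIO);
}

static void my_put_request(struct request *rq)
{
	mempool_free(rq, my_rq_pool);
}

The key design point is that the pool's reserve is the deadlock-avoidance guarantee; everything beyond the minimum comes and goes with actual demand, rather than sitting idle in low memory.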
Meanwhile, not everybody is entirely happy with the direction of the dev_t changes for char devices. In particular, Roman Zippel, who has apparently given up on getting the module changes backed out, has now posted a series of patches backing out the char device changes and substituting his approach. That approach includes maintaining the (currently unused) char device hashing scheme and getting rid of the new register_chrdev_region() function. There is, he claims, no particular need to split char minor number ranges into regions, as there is with block devices. Roman's patches have created some discussion, but there does not appear to be a great deal of pressure for a change in direction at this time.
There has also been a bit of discussion on how big the new dev_t should be. The plan has been to expand it to 32 bits: 12 for the major number, and 20 for the minor number. That is the way Linus has wanted to do it, but he has recently made noises about being open to the idea of making dev_t even larger. If dev_t were to go to 64 bits, with 32 each for major and minor numbers, there would be little need to worry about running out of device numbers for some time into the future. This decision may not be made for a while; once the work to support the dev_t expansion has been done, setting it to one size or another is a relatively simple task.
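To make the arithmetic concrete, here is a sketch of how a 32-bit dev_t would pack under the 12/20 plan; the macros mirror the kernel's familiar MAJOR()/MINOR()/MKDEV() pattern, but are written out here purely for illustration.

/* Illustrative only: a 32-bit dev_t with a 12-bit major
 * and a 20-bit minor number. */
#define MINORBITS	20
#define MINORMASK	((1U << MINORBITS) - 1)

#define MAJOR(dev)	((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev)	((unsigned int) ((dev) & MINORMASK))
#define MKDEV(major,minor)	(((major) << MINORBITS) | (minor))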
Various short topics
Discussion on linux-kernel this week touched on a number of topics that, while worth a mention, don't necessarily justify a full article of their own. Here are a few of them:

Deprecating the .gz format. Peter Anvin would like to get rid of the .gz files on kernel.org. The bzip2 format has been around for quite some time and is far more space efficient; it would seem that eliminating the older format would be relatively uncontroversial. Such is not the case, however; users protested that the bzip2 format is slower, is not supported on Windows, and so on. The end result is that the gzip files will remain for some time yet.
kbugs.org is now showing over 1400 potential bugs found with the rapidly-evolving smatch system. A number of these are real, and fixes are beginning to find their way into the mainline kernel.
The Stanford Checker team has also been posting errors; the latest set points out places where kernel code is directly dereferencing user-supplied pointers. That kind of mistake can lead to all kinds of problems, of course, including security issues. The discussion led to a suggestion that the kernel use a different type for user-space pointers, so that this kind of error could be caught directly by the compiler. The idea makes some sense; kernel code currently does not formally distinguish between user-space, kernel-space, and physical address pointers. Clarifying the difference between them could catch a lot of mistakes. This sort of change seems unlikely at this point in 2.5, however.
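The idea, roughly, is to make user-space pointers a distinct type which cannot be dereferenced directly. A minimal sketch of how that might look (user_ptr and fetch_from_user are hypothetical names, not anything proposed for merging):

#include <asm/uaccess.h>

/* Wrap user-space addresses in a structure; code which tries to
 * dereference one directly will no longer compile. */
struct user_ptr {
	void *p;	/* opaque - never dereference directly */
};

/* All access must go through an explicit, checked copy: */
static inline unsigned long fetch_from_user(void *dst,
					    struct user_ptr src,
					    unsigned long n)
{
	return copy_from_user(dst, src.p, n);
}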
The object-based reverse mapping VM patch was covered here back in February. The object-based rmap code does not work with anonymous memory (memory which is not mapped to a file somewhere), however, meaning that this memory must still be managed with PTE chains. Hugh Dickins has posted a new set of patches which extend the object-based approach to anonymous memory as well. The patch was included in the -mm tree for a while, and seems to work without trouble. The only problem is: it doesn't actually help performance very much. Most anonymous memory only shows up in one page table, so its PTE chain overhead is essentially zero. So this patch has been dropped, though useful pieces of it may eventually find their way into the tree.
The IDE todo list has been posted by Alan Cox. This list is important in that it affects most Linux users; it also documents some of the remaining tasks to be done on the way to a 2.6 release. There are a few drivers needing thorough audits (and some that don't work at all yet), more hotplug work, documentation, and a number of other tasks yet to be done.
SMP overhead and rwlocks. Andrew Morton has noted that a simple write test takes twice as long on an SMP system as on a uniprocessor system. The culprit, of course, is the extra locking overhead. Reader-writer locks (rwlocks) have been singled out as a particular problem; it turns out that they are slower than regular spinlocks, and they tend to mask problems where locks are simply being held for too long. There is a chance that rwlocks will be removed before 2.6 comes out.
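Part of what makes removal thinkable is how small the difference is at the source level; a sketch of the same critical section written both ways (both APIs shown are the real 2.5 primitives):

#include <linux/spinlock.h>

static rwlock_t my_rwlock = RW_LOCK_UNLOCKED;
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

/* With an rwlock, multiple readers may enter at once... */
void reader_rwlock_version(void)
{
	read_lock(&my_rwlock);
	/* ... read shared data ... */
	read_unlock(&my_rwlock);
}

/* ...with a plain spinlock, readers exclude each other as well,
 * but the lock itself is cheaper to take and release. */
void reader_spinlock_version(void)
{
	spin_lock(&my_lock);
	/* ... read shared data ... */
	spin_unlock(&my_lock);
}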
Driver porting
Driver porting: the BIO structure
This article is part of the LWN Porting Drivers to 2.6 series.
![Crude BIO diagram](https://static.lwn.net/images/ns/bio2.png)
BIO basics
As with most real-world code, the BIO structure incorporates a fair number of tricky details. The core of the structure (as defined in <linux/bio.h>) is not that complicated, however; it is as appears in the diagram to the right. The BIO structure itself contains the usual collection of housekeeping information, along with a pointer (bi_io_vec) pointing to an array of bio_vec structures. This array represents the (possibly multiple) segments which make up this I/O request. There is also an index (bi_idx) giving an offset into the bi_io_vec array; we'll get into its use shortly.

The bio_vec structure itself has a simple definition:
struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};
As is increasingly the case with internal kernel data structures, the BIO now tracks data buffers using struct page pointers. There are some implications of this change for driver writers:
- Data buffers for block transfers can be anywhere - kernel or user space. The driver author need not be concerned about the ultimate source or destination of the data.
- These buffers could be in high memory, unless the driver author has explicitly requested that bounce buffers be used (Request Queues I covers how to do that). The driver author cannot count on the existence of a kernel-space mapping for the buffer unless one has been created explicitly.
- More than ever, block I/O operations are scatter/gather operations, with data coming from multiple, dispersed buffers.
At first glance, the BIO structure may seem more difficult to work with than the old buffer head, which provided a nice kernel virtual address for a single chunk of data. Working with BIOs is not hard, however.
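For reference, here is an abridged sketch of the structure's most relevant fields; the full definition in <linux/bio.h> contains several more housekeeping members, and field order there may differ.

struct bio {
	sector_t	 bi_sector;	/* first sector of the transfer */
	struct bio	*bi_next;	/* list of queued BIOs */
	unsigned long	 bi_rw;		/* READ or WRITE, plus flags */
	unsigned short	 bi_vcnt;	/* number of bio_vec entries */
	unsigned short	 bi_idx;	/* current index into bi_io_vec */
	unsigned int	 bi_size;	/* total size, in bytes */
	struct bio_vec	*bi_io_vec;	/* the segment array */
	void		*bi_private;	/* for the owner's use */
	/* ... */
};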
Getting request information from a BIO
A driver author could use the information above (along with the other BIO fields) to get the needed information out of the structure without too much trouble. As a general rule, however, direct access to the bio_vec array is discouraged. A set of accessor routines has been provided which hides the details of how the BIO structure works and eases access to that structure. Use of these routines will make the driver author's job easier, and, with luck, will enable a driver to keep working in the face of future block I/O changes.

So how does one get request information from the BIO structure? The beginning sector for the entire BIO is in the bi_sector field - there is no accessor function for that. The total size of the operation is in bi_size (in bytes). One can also get the total size in sectors with:
bio_sectors(struct bio *bio);
The function (macro, actually):
int bio_data_dir(struct bio *bio);
returns either READ or WRITE, depending on what type of operation is encapsulated by this BIO.
Almost everything else requires working through the bio_vec array. The encouraged way of doing that is to use the special bio_for_each_segment macro:
int segno;
struct bio_vec *bvec;

bio_for_each_segment(bvec, bio, segno) {
	/* Do something with this segment */
}
Within the loop, the integer variable segno will be the current index into the array, and bvec will point to the current bio_vec structure. Usually the driver programmer need not use either variable; instead, a new set of macros is available for use within this sort of loop:
- struct page *bio_page(struct bio *bio): Returns a pointer to the current page structure.
- int bio_offset(struct bio *bio): Returns the offset within the current page for this operation. Block I/O operations are often page-aligned, but that is not always the case.
- int bio_cur_sectors(struct bio *bio): Returns the number of sectors to transfer for this bio_vec.
- char *bio_data(struct bio *bio): Returns the kernel virtual address for the data buffer. Note that this address will only exist if the buffer is not in high memory.
- char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags): This function returns a kernel virtual address which can be used to access the data buffer pointed to by the given bio_vec entry; it also disables interrupts and returns an atomic kmap - so the driver should not sleep until bvec_kunmap_irq() has been called. Note that the flags argument is a pointer value, which is a departure from the usual convention for macros which disable interrupts.
- void bvec_kunmap_irq(char *buffer, unsigned long *flags): Undoes a mapping which was created with bvec_kmap_irq().
- char *bio_kmap_irq(struct bio *bio, unsigned long *flags): This function is a wrapper around bvec_kmap_irq(); it returns a mapping for the current bio_vec entry in the given bio. There is, of course, a corresponding bio_kunmap_irq().
- char *__bio_kmap_atomic(struct bio *bio, int i, enum km_type type): Uses kmap_atomic() to obtain a kernel virtual address for the ith buffer in the bio; the kmap slot designated by type will be used.
- void __bio_kunmap_atomic(char *addr, enum km_type type): Releases a kernel virtual address obtained with __bio_kmap_atomic().
A little detail which is worth noting: all of bio_data(), bvec_kmap_irq(), and bio_kmap_irq() add the segment offset (bio_offset(bio)) to the address before returning it. It is tempting to add the offset separately, but that is an error which leads to weird problems. Trust me.
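Putting those pieces together, a per-segment transfer loop might look like the sketch below; my_device_read() and my_device_write() are hypothetical stand-ins for the driver's actual hardware access.

#include <linux/bio.h>

/* Hypothetical hardware-access routines. */
extern void my_device_read(sector_t sector, char *buffer, unsigned int len);
extern void my_device_write(sector_t sector, char *buffer, unsigned int len);

static void my_xfer_bio(struct bio *bio)
{
	int segno;
	struct bio_vec *bvec;
	unsigned long flags;
	sector_t sector = bio->bi_sector;

	bio_for_each_segment(bvec, bio, segno) {
		/* bvec_kmap_irq() has already added the segment
		 * offset; do not add it again. */
		char *buffer = bvec_kmap_irq(bvec, &flags);

		if (bio_data_dir(bio) == WRITE)
			my_device_write(sector, buffer, bvec->bv_len);
		else
			my_device_read(sector, buffer, bvec->bv_len);
		sector += bvec->bv_len >> 9;	/* 512-byte sectors */
		bvec_kunmap_irq(buffer, &flags);
	}
}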
Completing I/O
Given the information from the BIO, each block driver should be able to arrange a transfer to or from its particular device. Note that a helper function (blk_rq_map_sg()) exists which makes it easy to set up DMA scatter/gather lists from a block request; we'll get into that when we look at request queue management.

When the operation is complete, the driver must inform the block subsystem of that fact. That is done with bio_endio():
void bio_endio(struct bio *bio, unsigned int nbytes, int error);
Here, bio is the BIO of interest, nbytes is the number of bytes actually transferred, and error indicates the status of the operation; it should be zero for a successful transfer, and a negative error code otherwise.
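So, for example, a driver which has moved the entire BIO without trouble might finish things off as in this sketch:

/* Success: every byte in the BIO was transferred. */
bio_endio(bio, bio->bi_size, 0);

/* Failure: nothing was transferred. */
bio_endio(bio, 0, -EIO);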
Other BIO details
The bi_private field in the BIO structure is not used by the block subsystem, and is available for the owner of the structure to use. Drivers do not own BIOs passed in to their request function and should not touch bi_private there. If your driver creates its own BIO structures (using the functions listed below, usually), then the bi_private field in those BIOs is available to it.

As mentioned above, the bi_idx BIO field is an index into the bi_io_vec array. This index is maintained for a couple of reasons. One is that it can be used to keep track of partially-complete operations. But this field (along with bi_vcnt, which says how many bio_vec entries are to be processed) can also be used to split a BIO into multiple chunks. Using this facility, a RAID or volume manager driver can "clone" a BIO into multiple structures all pointing at different parts of the bio_vec array. The operation is quick and efficient, and allows a large operation to be quickly dispatched across a number of physical drives.
To clone a BIO in this way, use:
struct bio *bio_clone(struct bio *bio, int gfp_mask);
bio_clone() creates a second BIO pointing to the same bio_vec array as the original. This function uses the given gfp_mask when allocating memory.
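As a schematic example, a volume manager might split a BIO at a segment boundary in roughly the following way; split_segs is a hypothetical segment index, and the size and sector arithmetic is deliberately left as comments.

/* Schematic only: direct the tail of the segment array to a
 * clone, while the original covers only the head. */
static void my_split_bio(struct bio *bio, int split_segs)
{
	struct bio *tail = bio_clone(bio, GFP_NOIO);

	tail->bi_idx = split_segs;	/* clone covers segments from here on */
	bio->bi_vcnt = split_segs;	/* original stops at the split point */
	/* Real code must also apportion bi_size between the two BIOs
	 * and advance tail->bi_sector past the original's portion. */
}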
BIO structures contain reference counts; the structure is released when the reference count hits zero. Drivers normally need not manipulate BIO reference counts, but, should the need arise, functions exist in the usual form:
void bio_get(struct bio *bio);
void bio_put(struct bio *bio);
Numerous other functions exist for working with BIO structures; most of the functions not covered here are involved with creating BIOs. More information can be found in <linux/bio.h> and block/biodoc.txt in the kernel documentation directory.