Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.64, unchanged from one week ago. Linus has been busy, however; his BitKeeper tree includes more driver model work, the continuing removal of unwanted stuff from devfs, a uClinux update, an x86-64 update, some block layer cleanups (see below), scheduler changes for improved interactive response (see below again), and a number of other fixes.

Alan Cox has released 2.5.64-ac3 which adds a new set of IDE updates. "Handle with care."

The current stable kernel is 2.4.20; Marcelo has not released any 2.4.21 prepatches over the last week.

Alan Cox's current 2.4.21 prepatch is 2.4.21-pre5-ac3. Here you'll find an even newer set of IDE changes, along with quite a few other fixes and updates.

Kernel development news

Improving interactivity on Linux systems

The 2.5 kernel features a massively reworked scheduler which, among other things, improves the interactive feel of a desktop system. It goes to great lengths to try to separate interactive tasks from "background" processes, and to give a priority boost to the former. One way that this distinction is made is to look at how much time each process spends sleeping. Processes that sleep a lot are generally waiting for humans to do something, so the kernel tries to ensure that, when they wake up, they get quick access to the processor.

This heuristic works well much of the time, but it also fails badly in some situations. Consider, for example, the case of a user dragging a window across the screen. That sort of operation can require a fair amount of computation on the part of the X server. If the system is busy anyway (with a kernel compilation, for example), the X server can end up using all of the processor time that is available to it. When the server stops sleeping, the kernel concludes that it is a compute-bound background task and drops its priority. At that point, the pointer stops keeping up with the mouse, and the desktop experience becomes generally unpleasant.

A classic solution (which predates Linux) for this problem is to raise the priority of the X server. A higher-priority server can make things work better for some users, but it ignores the fact that similar situations can arise with other interactive processes that require a fair amount of processor time. Streaming media applications tend to work this way, for example. Raising the priority of the X server can make things worse for this sort of application. Also, as Linus points out, tweaking priorities in this way is an indication that the system has failed somehow:

Something is wrong, and we couldn't fix it, so here's the band-aid to avoid that problem for that particular case. It's acceptable as a band-aid, but if you don't realize that it's indicative of a problem, then you're just kidding yourself.

A few patches have gone into the 2.5.65 kernel which, by most reports, make things a lot better. One of them, which originally came from Linus, is based on the recognition that, if an interactive process is waiting for another process to do something, that other process should be considered interactive as well. The X server may be using a fair amount of CPU time, but, since interactive processes (i.e. the clients that the user works with) are waiting for it, the X server should still be seen as an interactive process.

The ideal time to make this adjustment might be when an interactive process goes to sleep waiting for an event. Unfortunately, that is hard to do; the kernel has no way to know, in the general case, who will be waking up processes that sleep on a particular queue. On the other hand, when the wakeup actually occurs, the relationship is immediately obvious. So the new scheduler will, at wakeup time, look at the interactivity bonus for the process being awakened. If that process has maxed out its bonus (as processes that sleep a lot will), the "excess" interactivity bonus is given, instead, to the process which is performing the wakeup. Thus, a sleeping mail client gives some of its bonus to the X server, which wakes it up. This patch is said to improve the interactivity of X significantly.
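The wakeup-time handoff described above can be sketched in a few lines of user-space C. This is only an illustration of the idea, not the code that went into the kernel: the `struct task` layout, the `MAX_BONUS` cap, and the `add_bonus()` and `wake_up_task()` names are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical cap on the interactivity bonus; the real scheduler's
 * constants and field names differ. */
#define MAX_BONUS 10

struct task {
    const char *name;
    int bonus;              /* interactivity credit, 0..MAX_BONUS */
};

/* Add credit to a task, clamping at MAX_BONUS.  Returns the portion
 * of the credit that did not fit. */
int add_bonus(struct task *t, int credit)
{
    int room = MAX_BONUS - t->bonus;
    int used = credit < room ? credit : room;

    assert(credit >= 0 && room >= 0);
    t->bonus += used;
    return credit - used;
}

/* Called when `waker` wakes `sleeper`, which earned `sleep_credit`
 * while asleep.  Whatever the sleeper cannot absorb is handed to the
 * task performing the wakeup. */
void wake_up_task(struct task *waker, struct task *sleeper, int sleep_credit)
{
    int excess = add_bonus(sleeper, sleep_credit);

    if (excess > 0)
        add_bonus(waker, excess);
}
```

With a mail client sitting at a bonus of 9 and an X server at 2, a wakeup carrying 4 units of sleep credit leaves the client at the cap of 10 and moves the remaining 3 units to the server.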

Ingo Molnar has taken Linus's patch and merged it into a larger set of scheduler changes (which, in turn, has gone into 2.5.65). Some of the additional changes that have been made include:

  • Various scheduler parameter tweaks. The maximum timeslice given to any process has been reduced, for example (to 200ms).

  • One process can preempt another with the same priority, if the former has a longer remaining timeslice.

  • The first wakeup of a newly-forked child has been made smarter, resulting in less work being redone.
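Two of the changes in this list are simple enough to sketch. The code below is a minimal illustration, assuming the kernel's convention that a numerically lower priority value means a more important task; `struct sched_task`, `should_preempt()`, and `clamp_timeslice_ms()` are names invented for the example and do not appear in the patch.

```c
#include <stdbool.h>

#define MAX_TIMESLICE_MS 200    /* the reduced maximum mentioned above */

struct sched_task {
    int prio;           /* assumed: lower value = higher priority */
    int time_slice_ms;  /* time remaining in the current timeslice */
};

/* Should a newly woken task preempt the one that is running?  A better
 * priority always wins; with equal priorities, the task with the longer
 * remaining timeslice wins. */
bool should_preempt(const struct sched_task *woken,
                    const struct sched_task *running)
{
    if (woken->prio != running->prio)
        return woken->prio < running->prio;
    return woken->time_slice_ms > running->time_slice_ms;
}

/* Limit a computed timeslice to the maximum. */
int clamp_timeslice_ms(int slice_ms)
{
    return slice_ms > MAX_TIMESLICE_MS ? MAX_TIMESLICE_MS : slice_ms;
}
```

The tie-breaking rule only applies between tasks of equal priority; a task with a worse priority never preempts, however much timeslice it has left.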

The end result of these changes is a kernel which provides a much more satisfying interactive experience. Note, however, that some causes of X server stalls - in particular, those related to disk I/O scheduling - still have not been resolved; work on those continues.

(See also: Jim Houston's self-tuning scheduler patch, which takes a different approach to scheduler improvement).

Block device registration and 32-bit dev_t

Long-suffering block driver maintainers will have to cope with a new change in 2.5.65: this patch from Andries Brouwer changes the prototype of register_blkdev(), which is used by block drivers to tell the kernel of their existence. The previous version of this function took a struct block_device_operations pointer, which contains some of the operations provided by the driver. That parameter has not been used for some time (block operations are now directly associated with disks, and are kept in the generic disk structure), so Andries removed it.

Not everybody agreed with this change. With all of the work that has been done in the block layer, register_blkdev() does not actually do very much anymore. Its main remaining purpose is to associate a driver name with a major number, so that it shows up in /proc/devices. A block driver can now function nicely without calling register_blkdev() at all. The long-term plan is to remove register_blkdev() altogether. In the meantime, it was asked, why bother changing the prototype of a doomed function? Even so, the change was merged into 2.5.65.

The real purpose of Andries's patch, however, was to get rid of the static blkdevs array used to keep track of block devices in the kernel. blkdevs is about the only static array left in the block subsystem, and thus is one of the remaining impediments to Andries's real goal: the long-awaited expansion of dev_t to 32 bits.

The 32-bit dev_t is one of the final items on the 2.5 "todo" list. It is still considered important by many users: an Oracle engineer mentions 4000-disk systems that "want to go to Linux" but can't, and from IBM we hear about a 5000-drive system with waiting customers. There appears to be little opposition to the adoption of a larger dev_t, even at this late stage; everybody agrees, though, that it would be best to get this change done sooner rather than later.

The amount of work remaining is said to be relatively small. The block layer, for example, is almost ready for a larger dev_t now. The char device subsystem could take more work - many drivers "know" that device numbers (especially minor numbers) are only eight bits. So a detailed audit of many drivers could be required. This suggestion from Alan Cox could make life a little easier, though. The idea would be to replace the venerable register_chrdev() function with a new register_chr_device() which takes a parameter indicating the largest minor number that the driver can deal with. A change to all char drivers would still be required, but, by defaulting the maximum minor number to 255, these drivers could be made safe without the need for a larger "audit and fix" operation. The few drivers that actually need more minor numbers could be fixed individually.
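A rough user-space sketch may help show what is being proposed. The `OLD_MAJOR()` and `OLD_MINOR()` macros reflect the traditional 16-bit layout, with eight bits each for the major and minor numbers. Everything else is hypothetical: the signature given to `register_chr_device()`, the `register_chrdev_compat()` wrapper, the toy `chr_table`, and `chr_minor_allowed()` were made up for this example, since the suggestion specifies only that the new function takes a maximum minor number.

```c
#include <stddef.h>

/* Traditional 16-bit device number: 8 bits of major, 8 bits of minor. */
#define OLD_MAJOR(dev) (((dev) >> 8) & 0xff)
#define OLD_MINOR(dev) ((dev) & 0xff)

#define MAX_MAJORS 256          /* size of this toy table only */

struct chr_entry {
    const char *name;           /* NULL means the slot is free */
    unsigned int max_minor;     /* largest minor the driver accepts */
};

static struct chr_entry chr_table[MAX_MAJORS];

/* Hypothetical registration call carrying the largest minor number
 * that the driver can deal with.  Returns 0 on success, -1 on error. */
int register_chr_device(unsigned int major, const char *name,
                        unsigned int max_minor)
{
    if (major >= MAX_MAJORS || name == NULL || chr_table[major].name != NULL)
        return -1;
    chr_table[major].name = name;
    chr_table[major].max_minor = max_minor;
    return 0;
}

/* Unaudited drivers get the safe default of 255. */
int register_chrdev_compat(unsigned int major, const char *name)
{
    return register_chr_device(major, name, 255);
}

/* The core can then refuse minors a driver never promised to handle. */
int chr_minor_allowed(unsigned int major, unsigned int minor)
{
    if (major >= MAX_MAJORS || chr_table[major].name == NULL)
        return 0;
    return minor <= chr_table[major].max_minor;
}
```

The point of the default is that an unaudited driver never sees a minor number above 255, even once dev_t itself can hold larger values; only drivers that explicitly ask for a wider range have to be checked by hand.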

There are, of course, other issues to deal with before a larger dev_t will be truly stable. Some protocols (e.g. NFSv2) aren't prepared for large device numbers. The interface to user space may well hold a surprise or two. And so on. These are all problems that can be solved, but the process will take time.

(As an aside, Alexander Viro, who has been an active participant in the block layer and dev_t work, has been absent from kernel development for a few months. In a recent message, however, he proclaimed "I'm finally back - hopefully for good." Welcome back, Al).

Klibc and initramfs

Another incomplete 2.5 development item is initramfs - an initial filesystem attached to the kernel image. The plan is to move much of the early boot code into initramfs, so that it can be run in user mode. But there has not been a whole lot of progress in that direction.

One part of the process is klibc, a small C library to be used in initramfs applications. A patch exists which adds a working klibc to the 2.5.64 kernel, but Linus is not ready to merge it:

However, I also have to say that klibc is pretty late in the game, and as long as it doesn't add any direct value to the kernel build the whole thing ends up being pretty moot right now. It might be different if we actually had code that needed it (ie ACPI in user space or whatever).

In other words, unless some code which really needs klibc shows up soon, it may not get merged into 2.5 at all. That would have the effect of pushing the whole initramfs project back into the next development series. There are people working on creating this code, but, as Linus says, it's late in the game.

Smatch update

Smatch is Dan Carpenter's project to create a free version of the Stanford Checker. The project is making progress, and smatch is now capable of finding several classes of bugs in the Linux kernel. Some patches fixing bugs found by smatch have already begun to appear.

The database of problems found by smatch is now hosted at kbugs.org. As of 2.5.64, there are just over 1000 potential bugs in the database. Many of them are certainly false alarms, but others will be real. An interesting feature of the kbugs.org site is the ability to "moderate" bugs as being real problems or not. With this capability, interested volunteers can help to sift out the real bugs, even if they don't feel able to contribute patches to fix them.

The smatch project is still in an early stage, but it is already showing great promise as a tool which can help in the creation of a better kernel.

Edge-triggered interfaces are too difficult?

The new epoll interface was covered here back in October, 2002. The epoll system calls offer a significant performance improvement for applications which must frequently poll large numbers of file descriptors. They do so by performing the setup work only once, and then trapping new I/O events as they occur.

One aspect of the epoll interface is that it is edge-triggered; it will only return a file descriptor as being available for I/O after a change has happened on that file descriptor. In other words, if you tell epoll to watch a particular socket for readability, and a certain amount of data is already available for that socket, epoll will block anyway. It will only flag that socket as being readable when new data shows up.

Edge-triggered interfaces have their own advantages and disadvantages. One of their disadvantages, as epoll author Davide Libenzi has discovered, would appear to be that many programmers do not understand edge-triggered interfaces. Additionally, most existing applications are written for level-triggered interfaces (such as poll() and select()) instead. Rather than fight this tide, he has sent out a new patch which switches epoll over to level-triggered behavior. A subsequent patch makes the behavior configurable on a per-file-descriptor basis.
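The difference between the two modes is easiest to see in a small test program. The sketch below uses the epoll API as it is found in later stable kernels and glibc (epoll_create1(), epoll_ctl(), epoll_wait(), and the per-descriptor EPOLLET flag) rather than the patches under discussion, so it should be read as an illustration of the semantics and not of the interface as it stood at the time. The `count_ready_reports()` helper is invented for the example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Write one byte into a pipe, never read it back, and count how many
 * of `rounds` non-blocking epoll_wait() calls report the read end as
 * ready.  extra_flags is 0 for level-triggered behavior or EPOLLET for
 * edge-triggered.  Returns -1 on setup failure. */
int count_ready_reports(unsigned int extra_flags, int rounds)
{
    int fds[2], epfd, i, hits = 0;
    struct epoll_event ev = { 0 }, out;

    if (pipe(fds) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0 ||
        write(fds[1], "x", 1) != 1) {
        hits = -1;
    } else {
        /* A timeout of zero makes each call return immediately. */
        for (i = 0; i < rounds; i++)
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                hits++;
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return hits;
}
```

Calling count_ready_reports(0, 3) reports the unread byte on every round, as poll() would. With EPOLLET, only the first round sees it, because no new data arrives afterward. An application written in the level-triggered style that does not drain the descriptor on that first report will simply stall, which is the sort of confusion described above.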

The end result is a more flexible epoll interface that can be more easily used in existing applications. The patch has not been merged as of this writing, but there does not seem to be any reason why it shouldn't be. After all, epoll has not yet appeared in a stable kernel release; now is the best time to be making improvements to the interface.

The BitKeeper to CVS gateway goes live

Larry McVoy has announced the availability of the current BitKeeper kernel repository in CVS format. Things are still stabilizing, but the plan is to have the current 2.4 and 2.5 repositories available in CVS format in near real time. Almost all of the change and commit information will be available, making it easy for people who are unwilling or unable to run BitKeeper to peruse the kernel's revision history and track current developments. Says Larry:

Our goal is to provide the data in a way that you can get at it without being dependent on us or BK in any way. As soon as we have this debugged, I'd like to move the CVS repositories to kernel.org (if I can get HPA to agree) and then you'll have the revision history and can live without the fear of the "don't piss Larry off license". Quite frankly, we don't like the current situation any better than many of you, so if this addresses your concerns that will take some pressure off of us.

Of course, when dealing with this sort of topic, things are never that easy. People will certainly be happy to have the CVS repository available, but one other aspect of the announcement has made people nervous. It seems that the near-SCCS file format used by BitKeeper is increasingly difficult to work with; now that BitKeeper repositories can be accessed in CVS format, the BitKeeper developers would like to move to a new, proprietary format. And that idea does not fly with all developers; this complaint from Ben Collins has been echoed by a few hackers:

You've made quite a marketing move. It's obvious to me, maybe not to others. By providing this CVS gateway, you make it almost pointless to work on an alternative client. Also by providing it, you make it easier to get away with locking the revision history into a proprietary format.

It is clear that, as long as BitKeeper is in use by the kernel development community, some people are going to be unhappy. Nothing short of the complete freeing of the BitKeeper source will satisfy some users, and that does not appear to be in the cards. Fortunately this disagreement, while noisy, hasn't really gotten in the way of continued kernel development.

In fact, it hasn't even gotten in the way of BitKeeper as it improves the kernel development process. Regardless of what one thinks of BitKeeper or its license, the fact remains that kernel development has been working well over the last year; an incredible stream of patches has been merged, and the people involved have stayed sane. As sane as they were before, anyway.

(As an aside, Larry has suggested that the license clause that forbids (free) BitKeeper use by people working on other source management systems could be removed in the future "if we feel we have pulled far enough ahead that everyone else is just playing catchup").

Driver porting

Driver Porting: block layer overview

This article is part of the LWN Porting Drivers to 2.6 series.

The first big, disruptive changes to the 2.6 kernel came from the reworking of the block I/O layer. As one might guess, the result of all this work is a great many changes as seen by driver authors - or anybody else who works with block I/O. The transition may be painful for some, but it's worth it: the new block layer is easier to work with and offers much better performance than its predecessor.

Fully covering the changes that have been made will require a whole series of articles. So we'll start with an overview which highlights the major changes that have been made without getting into any sort of detail. Subsequent articles will fill in the rest.

Note that parts of the block layer remain volatile - this development is not yet complete. We'll keep up with further changes as they happen.

So, what has changed with the block layer?

  • A great deal of old cruft is gone. For example, it is no longer necessary to work with a whole set of global arrays within block drivers. These arrays (blk_size, blksize_size, hardsect_size, read_ahead, etc.) have simply vanished. The kernel still maintains much of the same information, of course, but the management of that information is much improved.

  • As part of the cruft removal, most of the <linux/blk.h> macros (DEVICE_NAME, DEVICE_NR, CURRENT, INIT_REQUEST, etc.) have been removed; <linux/blk.h> is now empty. Any block driver which used these macros to implement its request loop will have to be rewritten. It is still possible to implement a simple request loop for straightforward devices where performance is not a big issue, but the mechanisms have changed.

  • The io_request_lock is gone; locking is now done on a per-queue basis.

  • Request queues have, in general, gotten more sophisticated. Quite a bit of work has been done in the area of fancy request scheduling (though drivers don't generally need to know about that). There is simple support for tagged command queueing, along with features like request barriers and queue-time device command generation. Request queues must be allocated dynamically in 2.6.

  • Buffer heads are no longer used in the block layer; they have been replaced with the new "bio" structure. The new representation of block I/O operations is designed for flexibility and performance; it encourages keeping large operations intact. Simple drivers can pretend that the bio structure does not exist, but most performance-oriented drivers - i.e. those that want to implement clustering and DMA - will need to be changed to work with bios.

    One of the most significant features of the bio structure is that it represents I/O buffers directly with page structures and offsets, not in terms of kernel virtual addresses. By default, I/O buffers can be located in high memory, on the assumption that computers equipped with that much memory will also have reasonably modern I/O controllers. Support operations have been provided for tasks like bio splitting and the creation of DMA scatter/gather maps.

  • Sector numbers can now be 64 bits wide, making it possible to support very large block devices.

  • The rudimentary gendisk ("generic disk") structure from 2.4 has been greatly improved in 2.6; generic disks are now used extensively throughout the block layer. Among other things, each generic disk has its own block_device_operations structure; the operations are no longer directly associated with the driver. The most significant change for block driver authors, though, may be the fact that partition handling has been moved up into the block layer, and drivers no longer need know anything about partitions. That is, of course, the way things should always have been.
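To put a number on the 64-bit sector change listed above, here is a small worked calculation. It assumes 512-byte sectors and, for comparison, a 32-bit sector number (as an unsigned long would be on a 32-bit machine); `max_capacity_tib()` is a helper written for this example, not a kernel function.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9  /* assumption: 512-byte sectors (2^9 bytes) */

/* Largest device size, in TiB (2^40 bytes), that can be addressed with
 * sector numbers of the given width.  Intended for widths of 32..64. */
uint64_t max_capacity_tib(unsigned int sector_bits)
{
    /* 2^bits sectors * 2^9 bytes per sector, divided by 2^40 */
    unsigned int log2_bytes = sector_bits + SECTOR_SHIFT;

    if (log2_bytes < 40)
        return 0;
    return (uint64_t)1 << (log2_bytes - 40);
}
```

Under those assumptions a 32-bit sector number tops out at 2 TiB, while a 64-bit sector number raises the limit to 2^33 TiB, far beyond any current storage array.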

Subsequent articles will explore the above changes in depth; stay tuned.

Patches and updates

Kernel trees

Stephen Hemminger: 2.5.64-osdl1
Alan Cox: Linux 2.5.64-ac1
Alan Cox: Linux 2.5.64-ac2
Alan Cox: Linux 2.5.64-ac3

Architecture-specific

William Lee Irwin III: cpu-2.5.64-1

Documentation

Denis Vlasenko: lk maintainers

Janitorial

Christoph Hellwig: remove devfs_only()

Memory management

Andrew Morton: 2.5.64-mm1
Andrew Morton: 2.5.64-mm2
Andrew Morton: 2.5.64-mm4
Andrew Morton: 2.5.64-mm5
William Lee Irwin III: pgcl-2.5.64-[345]
Rik van Riel: rmap 15e

Page editor: Jonathan Corbet


Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds