Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.64, unchanged from one week ago. Linus has been busy, however; his BitKeeper tree includes more driver model work, the continuing removal of unwanted stuff from devfs, a uClinux update, an x86-64 update, some block layer cleanups (see below), scheduler changes for improved interactive response (see below again), and a number of other fixes.
Alan Cox has released 2.5.64-ac3, which adds a new set of IDE updates. "Handle with care."
The current stable kernel is 2.4.20; Marcelo has not released any 2.4.21 prepatches over the last week.
Alan Cox's current 2.4.21 prepatch is 2.4.21-pre5-ac3. Here you'll find an even newer set of IDE changes, along with quite a few other fixes and updates.
Kernel development news
Improving interactivity on Linux systems
The 2.5 kernel features a massively reworked scheduler which, among other things, improves the interactive feel of a desktop system. It goes to great lengths to try to separate interactive tasks from "background" processes, and to give a priority boost to the former. One way that this distinction is made is to look at how much time each process spends sleeping. Processes that sleep a lot are generally waiting for humans to do something, so the kernel tries to ensure that, when they wake up, they get quick access to the processor.
This heuristic works well much of the time, but it also fails badly in some situations. Consider, for example, the case of a user dragging a window across the screen. That sort of operation can require a fair amount of computation on the part of the X server. If the system is busy anyway (with a kernel compilation, for example), the X server can end up using all of the processor time that is available to it. When the server stops sleeping, the kernel concludes that it is a compute-bound background task and drops its priority. At that point, the pointer stops keeping up with the mouse, and the desktop experience becomes generally unpleasant.
A classic solution (which predates Linux) for this problem is to raise the priority of the X server. A higher-priority server can make things work better for some users, but it ignores the fact that similar situations can arise with other interactive processes that require a fair amount of processor time. Streaming media applications tend to work this way, for example. Raising the priority of the X server can make things worse for this sort of application. Also, as Linus points out, tweaking priorities in this way is an indication that the system has failed somehow:
A few patches have gone into the 2.5.65 kernel which, by most reports, make things a lot better. One of them, which originally came from Linus, is based on the recognition that, if an interactive process is waiting for another process to do something, that other process should be considered interactive as well. The X server may be using a fair amount of CPU time, but, since interactive processes (i.e. the clients that the user works with) are waiting for it, the X server should still be seen as an interactive process.
The ideal time to make this adjustment might be when an interactive process goes to sleep waiting for an event. Unfortunately, that is hard to do; the kernel has no way to know, in the general case, who will be waking up processes that sleep on a particular queue. On the other hand, when the wakeup actually occurs, the relationship is immediately obvious. So the new scheduler will, at wakeup time, look at the interactivity bonus for the process being awakened. If that process has maxed out its bonus (as processes that sleep a lot will), the "excess" interactivity bonus is given, instead, to the process which is performing the wakeup. Thus, a sleeping mail client gives some of its bonus to the X server, which wakes it up. This patch is said to improve the interactivity of X significantly.
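The wakeup-time hand-off described above can be sketched in a few lines of C. This is a simplified model and not the code that went into 2.5.65: the task structure, the MAX_SLEEP_AVG cap, and the wake_up_task() helper are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task model; not the kernel's task_struct. */
struct task {
    const char *name;
    int sleep_avg;          /* accumulated interactivity credit, in ticks */
};

#define MAX_SLEEP_AVG 100   /* assumed cap on the credit */

/*
 * Credit `slept` ticks of sleep to the task being awakened.  Whatever does
 * not fit under the cap is offered to the waker instead (itself capped).
 * Returns the excess that the wakee could not absorb.
 */
static int wake_up_task(struct task *waker, struct task *wakee, int slept)
{
    int total, excess = 0;

    assert(wakee != NULL && slept >= 0);

    total = wakee->sleep_avg + slept;
    if (total > MAX_SLEEP_AVG) {
        excess = total - MAX_SLEEP_AVG;
        total = MAX_SLEEP_AVG;
    }
    wakee->sleep_avg = total;

    /* The waker inherits the overflow: it is serving an interactive task. */
    if (waker != NULL && excess > 0) {
        waker->sleep_avg += excess;
        if (waker->sleep_avg > MAX_SLEEP_AVG)
            waker->sleep_avg = MAX_SLEEP_AVG;
    }
    return excess;
}
```

In this model, a mail client that is already near the cap passes most of its new credit to the X server that wakes it, while a task with little accumulated sleep keeps everything for itself.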
Ingo Molnar has taken Linus's patch and merged it into a larger set of scheduler changes (which, in turn, has gone into 2.5.65). Some of the additional changes that have been made include:
- Various scheduler parameter tweaks. The maximum timeslice given to
any process has been reduced, for example (to 200ms).
- One process can preempt another with the same priority, if the former
has a longer remaining timeslice.
- The first wakeup of a newly-forked child has been made smarter, resulting in less work being redone.
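The same-priority preemption rule from the list above might look something like the following. It is only a sketch: the rq_task structure and should_preempt() are hypothetical, and the convention that a lower numeric value means a more important task is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical run-queue entry used only for this illustration. */
struct rq_task {
    int prio;        /* assumed: lower value = more important */
    int time_slice;  /* remaining timeslice, in milliseconds */
};

/* Should a newly runnable task displace the one currently running? */
static bool should_preempt(const struct rq_task *woken,
                           const struct rq_task *curr)
{
    assert(woken != NULL && curr != NULL);

    if (woken->prio != curr->prio)
        return woken->prio < curr->prio;

    /* Equal priority: the task with more timeslice left wins. */
    return woken->time_slice > curr->time_slice;
}
```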
The end result of these changes is a kernel which provides a much more satisfying interactive experience. Note, however, that some causes of X server stalls - in particular, those related to disk I/O scheduling - still have not been resolved. Work is ongoing, however.
(See also: Jim Houston's self-tuning scheduler patch, which takes a different approach to scheduler improvement).
Block device registration and 32-bit dev_t
Long-suffering block driver maintainers will have to cope with a new change in 2.5.65: this patch from Andries Brouwer changes the prototype of register_blkdev(), which is used by block drivers to tell the kernel of their existence. The previous version of this function took a struct block_device_operations pointer, which contains some of the operations provided by the driver. That parameter has not been used for some time (block operations are now directly associated with disks, and are kept in the generic disk structure), so Andries removed it.
Not everybody agreed with this change. With all of the work that has been done in the block layer, register_blkdev() does not actually do very much anymore. Its main remaining purpose is to associate a driver name with a major number, so that it shows up in /proc/devices. A block driver can now function nicely without calling register_blkdev() at all. The long-term plan is to remove register_blkdev() altogether. In the meantime, it was asked, why bother changing the prototype of a doomed function? Even so, the change was merged into 2.5.65.
The real purpose of Andries's patch, however, was to get rid of the static blkdevs array used to keep track of block devices in the kernel. blkdevs is about the only static array left in the block subsystem, and thus is one of the remaining impediments to Andries's real goal: the long-awaited expansion of dev_t to 32 bits.
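To make the shape of the change concrete, here is a user-space mock of a two-argument registration function that records names in a list that grows on demand, with no fixed table. It is a sketch under stated assumptions, not the kernel's implementation: every name prefixed with mock_ is hypothetical, as is the blk_major_name structure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list node; one per registered major number. */
struct blk_major_name {
    struct blk_major_name *next;
    unsigned int major;
    char name[16];
};

static struct blk_major_name *major_list;   /* empty until a driver registers */

/*
 * Two-argument form: a major number and a driver name.  The older form
 * also took a struct block_device_operations pointer, which went unused.
 */
static int mock_register_blkdev(unsigned int major, const char *name)
{
    struct blk_major_name *p;

    assert(name != NULL);

    for (p = major_list; p != NULL; p = p->next)
        if (p->major == major)
            return -EBUSY;          /* major number already claimed */

    p = malloc(sizeof(*p));
    if (p == NULL)
        return -ENOMEM;

    p->major = major;
    strncpy(p->name, name, sizeof(p->name) - 1);
    p->name[sizeof(p->name) - 1] = '\0';
    p->next = major_list;
    major_list = p;
    return 0;
}

/* Look up the name that a /proc/devices-style listing would show. */
static const char *mock_blkdev_name(unsigned int major)
{
    const struct blk_major_name *p;

    for (p = major_list; p != NULL; p = p->next)
        if (p->major == major)
            return p->name;
    return NULL;
}
```

The mock leaves out details such as dynamic major allocation and unregistration; the point is only that nothing here depends on the number of possible majors being small.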
The 32-bit dev_t is one of the final items on the 2.5 "todo" list. It is still considered important by many users: an Oracle engineer mentions 4000-disk systems that "want to go to Linux" but can't, and from IBM we hear about a 5000-drive system with waiting customers. There appears to be little opposition to the adoption of a larger dev_t, even at this late stage. But everybody agrees that it would be best to get this change done sooner rather than later.
The amount of work remaining is said to be relatively small. The block layer, for example, is almost ready for a larger dev_t now. The char device subsystem could take more work - many drivers "know" that device numbers (especially minor numbers) are only eight bits. So a detailed audit of many drivers could be required. This suggestion from Alan Cox could make life a little easier, though. The idea would be to replace the venerable register_chrdev() function with a new register_chr_device() which takes a parameter indicating the largest minor number that the driver can deal with. A change to all char drivers would still be required, but, by defaulting the maximum minor number to 255, these drivers could be made safe without the need for a larger "audit and fix" operation. The few drivers that actually need more minor numbers could be fixed individually.
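A rough model of that suggestion follows. Since register_chr_device() was only a proposal at this point, the signature, the mock_chr_driver structure, and the -ENXIO error choice are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define LEGACY_MAX_MINOR 255u   /* what eight-bit-minor drivers can handle */

/* Hypothetical per-driver record. */
struct mock_chr_driver {
    const char *name;
    unsigned int max_minor;     /* largest minor the driver claims to handle */
};

/* Registration records the limit the driver declares for itself. */
static void mock_register_chr_device(struct mock_chr_driver *drv,
                                     const char *name, unsigned int max_minor)
{
    assert(drv != NULL && name != NULL);
    drv->name = name;
    drv->max_minor = max_minor;
}

/* The core refuses out-of-range minors before the driver ever sees them. */
static int mock_chr_open(const struct mock_chr_driver *drv, unsigned int minor)
{
    assert(drv != NULL);
    if (minor > drv->max_minor)
        return -ENXIO;
    return 0;
}
```

An unaudited driver would register with LEGACY_MAX_MINOR and never be handed a minor number it cannot represent; a driver that has been checked could pass a larger limit.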
There are, of course, other issues to deal with before a larger dev_t will be truly stable. Some protocols (e.g. NFSv2) aren't prepared for large device numbers. The interface to user space may well hold a surprise or two. And so on. These are all problems that can be solved, but the process will take time.
(As an aside, Alexander Viro, who has been an active participant in the block layer and dev_t work, has been absent from kernel development for a few months. In a recent message, however, he proclaimed "I'm finally back - hopefully for good." Welcome back, Al.)
Klibc and initramfs
Another incomplete 2.5 development item is initramfs - an initial filesystem attached to the kernel image. The plan is to move much of the early boot code into initramfs, so that it can be run in user mode. But there has not been a whole lot of progress in that direction.
One part of the process is klibc, a small C library to be used in initramfs applications. A patch exists which adds a working klibc to the 2.5.64 kernel, but Linus is not ready to merge it:
In other words, unless some code which really needs klibc shows up soon, it may not get merged into 2.5 at all. That would have the effect of pushing the whole initramfs project back into the next development series. There are people working on creating this code, but, as Linus says, it's late in the game.
Smatch update
Smatch is Dan Carpenter's project to create a free version of the Stanford Checker. The project is making progress, and smatch is now capable of finding several classes of bugs in the Linux kernel. Some patches fixing bugs found by smatch have already begun to appear.
The database of problems found by smatch is now hosted at kbugs.org. As of 2.5.64, there are just over 1000 potential bugs in the database. Many of them are certainly false alarms, but others will be real. An interesting feature of the kbugs.org site is the ability to "moderate" bugs as being real problems or not. With this capability, interested volunteers can help to sift out the real bugs, even if they don't feel able to contribute patches to fix them.
The smatch project is still in an early stage, but it is already showing great promise as a tool which can help in the creation of a better kernel.
Edge-triggered interfaces are too difficult?
The new epoll interface was covered here back in October, 2002. The epoll system calls offer a significant performance improvement for applications which must frequently poll large numbers of file descriptors. They do so by performing the setup work only once, and then trapping new I/O events as they occur.
One aspect of the epoll interface is that it is edge-triggered; it will only return a file descriptor as being available for I/O after a change has happened on that file descriptor. In other words, if you tell epoll to watch a particular socket for readability, and a certain amount of data is already available for that socket, epoll will block anyway. It will only flag that socket as being readable when new data shows up.
Edge-triggered interfaces have their own advantages and disadvantages. One of their disadvantages, as epoll author Davide Libenzi has discovered, would appear to be that many programmers do not understand edge-triggered interfaces. Additionally, most existing applications are written for level-triggered interfaces (such as poll() and select()) instead. Rather than fight this tide, he has sent out a new patch which switches epoll over to level-triggered behavior. A subsequent patch makes the behavior configurable on a per-file-descriptor basis.
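The difference between the two modes can be seen with a small test program. The sketch below uses the epoll API as it exists in current Linux C libraries (epoll_create1(), epoll_ctl(), epoll_wait(), and the per-descriptor EPOLLET flag), which may differ in detail from the patches under discussion; the second_wait_events() helper is a made-up name. It writes one byte into a pipe, then polls the read end twice without ever reading the data.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns the number of events reported by the SECOND epoll_wait() call
 * when the pending byte has not been consumed, or -1 if setup fails.
 */
static int second_wait_events(int edge_triggered)
{
    int fds[2], ep, n = -1;
    struct epoll_event ev = { 0 }, got;

    if (pipe(fds) < 0)
        return -1;
    ep = epoll_create1(0);
    if (ep < 0)
        goto out_pipe;

    ev.events = EPOLLIN;
    if (edge_triggered)
        ev.events |= EPOLLET;       /* ask for edge-triggered reporting */
    ev.data.fd = fds[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) < 0)
        goto out_ep;

    if (write(fds[1], "x", 1) != 1)
        goto out_ep;

    /* The arrival of new data is reported once in either mode. */
    if (epoll_wait(ep, &got, 1, 0) != 1)
        goto out_ep;
    assert(got.data.fd == fds[0]);

    /* Nothing has changed since; only level-triggered mode reports again. */
    n = epoll_wait(ep, &got, 1, 0);

out_ep:
    close(ep);
out_pipe:
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

With edge_triggered set, the second call should report nothing even though the byte is still sitting in the pipe; without it, the descriptor keeps showing up as readable until the data is consumed, which is what poll() and select() users expect.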
The end result is a more flexible epoll interface that can be more easily used in existing applications. The patch has not been merged as of this writing, but there does not seem to be any reason why it shouldn't be. After all, epoll has not yet appeared in a stable kernel release; now is the best time to be making improvements to the interface.
The BitKeeper to CVS gateway goes live
Larry McVoy has announced the availability of the current BitKeeper kernel repository in CVS format. Things are still stabilizing, but the plan is to have the current 2.4 and 2.5 repositories available in CVS format in near real time. Almost all of the change and commit information will be available, making it easy for people who are unwilling or unable to run BitKeeper to peruse the kernel's revision history and track current developments. Says Larry:
Of course, when dealing with this sort of topic, things are never that easy. People will certainly be happy to have the CVS repository available, but one other aspect of the announcement has made people nervous. It seems that the near-SCCS file format used by BitKeeper is increasingly difficult to work with; now that BitKeeper repositories can be accessed in CVS format, the BitKeeper developers would like to move to a new, proprietary format. And that idea does not fly with all developers; this complaint from Ben Collins has been echoed by a few hackers:
It is clear that, as long as BitKeeper is in use by the kernel development community, some people are going to be unhappy. Nothing short of the complete freeing of the BitKeeper source will satisfy some users, and that does not appear to be in the cards. Fortunately this disagreement, while noisy, hasn't really gotten in the way of continued kernel development.
In fact, it hasn't even gotten in the way of BitKeeper as it improves the kernel development process. Regardless of what one thinks of BitKeeper or its license, the fact remains that kernel development has been working well over the last year; an incredible stream of patches has been merged, and the people involved have stayed sane. As sane as they were before, anyway.
(As an aside, Larry has suggested that the license clause that forbids (free) BitKeeper use by people working on other source management systems could be removed in the future "if we feel we have pulled far enough ahead that everyone else is just playing catchup").
Driver porting
Driver Porting: block layer overview
This article is part of the LWN Porting Drivers to 2.6 series.
Fully covering the changes that have been made will require a whole series of articles. So we'll start with an overview which highlights the major changes that have been made without getting into any sort of detail. Subsequent articles will fill in the rest.
Note that parts of the block layer remain volatile - this development is not yet complete. We'll keep up with further changes as they happen.
So, what has changed with the block layer?
- A great deal of old cruft is gone. For example, it is no longer
necessary to work
with a whole set of global arrays within block drivers. These arrays
(blk_size, blksize_size, hardsect_size,
read_ahead, etc.) have simply vanished. The kernel still
maintains much of the same information, of course, but the management
of that information is much improved.
- As part of the cruft removal, most of the <linux/blk.h>
macros (DEVICE_NAME, DEVICE_NR, CURRENT,
INIT_REQUEST, etc.) have been removed;
<linux/blk.h> is now empty. Any block driver
which used these macros to implement its request loop will have to be
rewritten. It is still possible to implement a simple request loop
for straightforward devices where performance is not a big issue, but
the mechanisms have changed.
- The io_request_lock is gone; locking is now done on a
per-queue basis.
- Request queues have, in general, gotten more sophisticated. Quite a
bit of work has been done in the area of fancy request scheduling
(though drivers don't generally need to know about that). There is
simple support for tagged command queueing, along with features like
request barriers and queue-time device command generation. Request
queues must be allocated dynamically in 2.6.
- Buffer heads are no longer used in the block layer; they have been
replaced with the new "bio" structure. The new
representation of block I/O operations is designed for flexibility and
performance; it encourages keeping large operations intact. Simple
drivers can pretend that the bio structure does not exist,
but most performance-oriented drivers - i.e. those that want to
implement clustering and DMA - will need to be changed to work with
bios.
One of the most significant features of the bio structure is that it represents I/O buffers directly with page structures and offsets, not in terms of kernel virtual addresses. By default, I/O buffers can be located in high memory, on the assumption that computers equipped with that much memory will also have reasonably modern I/O controllers. Support operations have been provided for tasks like bio splitting and the creation of DMA scatter/gather maps.
- Sector numbers can now be 64 bits wide, making it possible to support
very large block devices.
- The rudimentary gendisk ("generic disk") structure from 2.4 has been greatly improved in 2.6; generic disks are now used extensively throughout the block layer. Among other things, each generic disk has its own block_device_operations structure; the operations are no longer directly associated with the driver. The most significant change for block driver authors, though, may be the fact that partition handling has been moved up into the block layer, and drivers no longer need know anything about partitions. That is, of course, the way things should always have been.
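The page-and-offset representation mentioned in the list above can be modeled in user space as follows. The structures are loosely patterned on the kernel's bio and its segment vector, but the mock_ names, the field names, and the fixed 4096-byte page are simplifications assumed for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u
#define SECTOR_SIZE    512u

/* Stand-in for a page structure; the real one does not embed the data. */
struct mock_page { unsigned char data[MOCK_PAGE_SIZE]; };

/* One segment: a page plus an offset and length, not a virtual address. */
struct mock_bio_vec {
    struct mock_page *page;
    unsigned int len;
    unsigned int offset;
};

/* One I/O operation: a 64-bit starting sector and a vector of segments. */
struct mock_bio {
    unsigned long long sector;
    unsigned short vcnt;
    struct mock_bio_vec *vecs;
};

/* Total bytes covered by all segments of the operation. */
static unsigned long long mock_bio_bytes(const struct mock_bio *bio)
{
    unsigned long long total = 0;
    unsigned short i;

    assert(bio != NULL);
    for (i = 0; i < bio->vcnt; i++) {
        /* A segment must lie entirely within its page. */
        assert(bio->vecs[i].offset + bio->vecs[i].len <= MOCK_PAGE_SIZE);
        total += bio->vecs[i].len;
    }
    return total;
}

/* First sector past the end of the operation. */
static unsigned long long mock_bio_end_sector(const struct mock_bio *bio)
{
    return bio->sector + mock_bio_bytes(bio) / SECTOR_SIZE;
}
```

A driver walking such a structure never needs a kernel virtual address for the buffer, which is what allows the pages to live in high memory, and a large operation stays in one piece instead of being chopped into per-block buffer heads.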
Subsequent articles will explore the above changes in depth; stay tuned.
Page editor: Jonathan Corbet
