
Kernel development

Kernel release status

The current development kernel is 2.5.20, which was released by Linus on June 2. Big changes this time include a large ACPI merge, a bunch more buffer/VM work, a PowerPC64 merge, the usual set of IDE patches, various merges from the -dj series, some device model work, and numerous other fixes and updates. The long format changelog is also available.

Other releases from Linus since the last LWN Kernel Page include:

  • 2.5.19 (short, long). Changes include more block, buffer, and IDE layer work, some enhancements to the driver model code, more kbuild tweaks, and many other fixes and updates.

  • 2.5.18 (short, long). This one included the software suspend patch (as covered in the May 23 LWN Kernel Page), a bunch of kbuild fixes (which are not Keith Owens's new kbuild system - see below), more IDE reworking, more VFS changes, and a bunch of other fixes and improvements.

The current prepatch from Dave Jones is 2.5.20-dj3. The most significant feature of this patch, perhaps, is the merging of some small pieces of the kbuild 2.5 code.

The latest 2.5 status summary from Guillaume Boissiere came out on June 5.

The current stable kernel release is 2.4.18. Marcelo's plan had been to create a 2.4.19 release candidate, but some problems turned up. So he released 2.4.19-pre10 instead. A very long list of fixes got into this release. With luck, the next prepatch from Marcelo will be the first 2.4.19 release candidate.

Alan Cox has released 2.4.19-pre10-ac2; arguably the most interesting change in this prepatch is the inclusion of the "speakup" console module for blind users.


A new way of block queue plugging

Jens Axboe has posted a patch which, once again, changes some of the main assumptions underlying the block I/O subsystem. It is worth a look at what is going on.

A longstanding feature of the block layer has been "queue plugging." If the request queue for a particular block device has been plugged, that device's driver will not be invoked to execute the operations in the queue. The main reason for plugging has been to allow the block layer to build up a backlog of requests, so that adjacent operations can be merged. By sometimes waiting a little longer to start an operation, the block layer can often achieve better performance overall.
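
The merging that plugging makes possible can be illustrated with a small user-space sketch. This is a toy model, not kernel code: the names (toy_req, toy_reqlist, toy_add_request) and the fixed-size list are assumptions made for the example. While requests sit in the list, a new range that touches an existing one is folded into it rather than queued separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not kernel interfaces. */
#define TOY_MAX_REQS 8

struct toy_req {
    unsigned long sector;      /* first sector of the request */
    unsigned long nr_sectors;  /* length in sectors */
};

struct toy_reqlist {
    struct toy_req reqs[TOY_MAX_REQS];
    size_t count;
};

/*
 * Add a range to the list of waiting requests. If the range is directly
 * adjacent to a request already waiting, extend that request instead of
 * creating a new one. Returns false only when the list is full.
 */
static bool toy_add_request(struct toy_reqlist *q,
                            unsigned long sector, unsigned long nr)
{
    for (size_t i = 0; i < q->count; i++) {
        struct toy_req *r = &q->reqs[i];

        if (r->sector + r->nr_sectors == sector) {   /* back merge */
            r->nr_sectors += nr;
            return true;
        }
        if (sector + nr == r->sector) {              /* front merge */
            r->sector = sector;
            r->nr_sectors += nr;
            return true;
        }
    }
    if (q->count == TOY_MAX_REQS)
        return false;
    q->reqs[q->count].sector = sector;
    q->reqs[q->count].nr_sectors = nr;
    q->count++;
    return true;
}
```

Submitting sectors 100-107 and then 108-115 leaves a single 16-sector request in the list. The longer the list is held back, the more chances there are for this kind of merge, which is the tradeoff the old plugging code was making.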

With the 2.5 block layer, however, there is less need for this sort of plugging. The code works harder at not splitting large requests in the first place, so it is not necessary to merge them again. The new plugging code actually serves a different purpose: it is a mechanism by which a block driver can indicate that it is busy and cannot handle any more requests at the moment.

As Jens points out in his patch, the block code is starting to look a little (a little!) bit more like the networking subsystem. Like network interfaces, block devices can have multiple requests outstanding. When the device has been given all the simultaneous requests that it can handle, there is no point in further troubling the driver until some of those requests complete. Thus the new plugging code: block devices, too, can ask to be allowed to work in peace for a while.
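
The "busy driver" idea can be sketched in user space as follows. This is only a model of the behavior described above, not the code from Jens's patch; every name here (toy_queue, toy_submit, toy_complete) and both limits are assumptions chosen for illustration. The queue is plugged when the pretend device already holds as many requests as it can take, and unplugged when one of them completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names come from the kernel or the patch. */
#define TOY_MAX_INFLIGHT 2   /* assumed device limit on outstanding requests */
#define TOY_QUEUE_SIZE   8   /* assumed capacity of the pending queue */

struct toy_queue {
    int    pending[TOY_QUEUE_SIZE]; /* request ids waiting for the driver */
    size_t head, count;             /* ring buffer state */
    int    inflight;                /* requests currently on the "device" */
    int    last_dispatched;         /* id of the most recent dispatch */
    bool   plugged;                 /* driver asked to be left alone */
};

static void toy_init(struct toy_queue *q)
{
    q->head = 0;
    q->count = 0;
    q->inflight = 0;
    q->last_dispatched = -1;
    q->plugged = false;
}

/* Hand pending requests to the device until it is full, then plug. */
static void toy_run_queue(struct toy_queue *q)
{
    while (q->count > 0 && !q->plugged) {
        if (q->inflight == TOY_MAX_INFLIGHT) {
            q->plugged = true;      /* device busy: stop calling the driver */
            break;
        }
        q->last_dispatched = q->pending[q->head];
        q->head = (q->head + 1) % TOY_QUEUE_SIZE;
        q->count--;
        q->inflight++;
    }
}

/* Queue a request; returns false if the toy queue is full. */
static bool toy_submit(struct toy_queue *q, int id)
{
    if (q->count == TOY_QUEUE_SIZE)
        return false;
    q->pending[(q->head + q->count) % TOY_QUEUE_SIZE] = id;
    q->count++;
    toy_run_queue(q);
    return true;
}

/* The device finished one request: unplug and run the queue again. */
static void toy_complete(struct toy_queue *q)
{
    assert(q->inflight > 0);
    q->inflight--;
    q->plugged = false;
    toy_run_queue(q);
}
```

With a limit of two, the third submission finds the device full and plugs the queue; later submissions simply accumulate without touching the driver until toy_complete() is called.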

There are a couple of other, incidental changes in this patch. One is that the venerable tq_disk task queue has been removed. Slowly, the set of standard task queues is shrinking. A block driver's request ("strategy") function is also now called out of a tasklet. The block layer that shows up in 2.6 will be vastly different from what has been seen in previous stable kernels.


Splitting the kernel stack

The Linux kernel has, for years, run with an 8KB (two page) stack in each process's address space (at least, on i386 systems). That stack holds the "task structure" (the kernel's information about the process) and provides space for automatic variables and call frames when the system is running in kernel mode. The 8KB stack works, of course, but it is not optimal. The biggest problem, perhaps, is the need to find two adjacent pages for a new stack every time a new process is created. On a busy system, memory can get badly fragmented, and allocating two pages together can be a challenge.

So Ben LaHaise has posted a patch which splits the kernel stack into two 4KB stacks. One of them holds the task structure and is used by normal kernel code (i.e. handling system calls). The other stack is set aside and is used only when the kernel is handling interrupts.

A separate interrupt stack is not a particularly new idea - many operating systems have had interrupt stacks for decades. There are numerous advantages to doing things this way. Only one interrupt stack (per CPU) is needed, so one page of memory per process is freed up. The interrupt stack is also more likely to stay in the processor cache, improving performance. Interrupt handlers need not worry about other kernel code having consumed most of the stack when they get invoked. And, of course, it is no longer necessary to perform a two-page allocation to set up the regular kernel stack.

The biggest downside, perhaps, is that non-interrupt kernel code must now fit into much less stack space. Some kernel code is not particularly careful about the size of its automatic variables, and risks overflowing the new, smaller stack. As a way of tracking down such code, Ben has also posted a stack checker (followed by a brown paper bag fix) which monitors stack usage and raises the alarm when available space on the stack gets too low. The two patches are probably best used together.
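
One common way to measure stack usage is to "paint" the stack with a known pattern and later see how much of the pattern survives. The sketch below shows that general technique in user space. It is not a description of how Ben's checker works, which this article does not detail; the names, the fill byte, and the 512-byte alarm threshold are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: names, pattern, and threshold are hypothetical. */
#define TOY_STACK_SIZE 4096   /* one page, the size of each split stack */
#define TOY_PAINT      0xA5   /* arbitrary fill pattern */
#define TOY_LOW_WATER  512    /* assumed alarm threshold, in bytes */

/* Fill the whole region with the pattern before it is used as a stack. */
static void toy_paint(unsigned char *stack)
{
    memset(stack, TOY_PAINT, TOY_STACK_SIZE);
}

/*
 * The stack grows downward from the top of the region, so bytes that
 * were never touched sit at the bottom. Count how many still hold the
 * pattern. (If the deepest bytes written happen to equal the pattern,
 * this overestimates the free space.)
 */
static size_t toy_unused_bytes(const unsigned char *stack)
{
    size_t n = 0;

    while (n < TOY_STACK_SIZE && stack[n] == TOY_PAINT)
        n++;
    return n;
}

/* Raise the alarm when the never-used space falls below the threshold. */
static int toy_stack_low(const unsigned char *stack)
{
    return toy_unused_bytes(stack) < TOY_LOW_WATER;
}
```

A check like this reports the high-water mark, that is, the deepest the stack has ever been, rather than its current depth, which makes it useful for finding code paths that only occasionally come close to overflowing.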


The continuing saga of kbuild 2.5

The discussion over whether to merge kbuild 2.5 has been covered in this space before. It is one of those conversations that persists, however. A few things have happened over the last few weeks.

Keith Owens, the author of kbuild 2.5, has posted a new set of timing comparisons meant to show the advantages of the new code. The full build process Keith performed took a bit less than 14 minutes with kbuild 2.5, and a little over 20 minutes with the existing kbuild. He also points out that the result is sometimes incorrect with the existing code.

Daniel Phillips also tried it out and obtained similar results. For good measure, Daniel took a look at the code itself: "There is no Python anywhere to be seen in kbuild 2.5, for those who worry about that. It is coded in C, about 10,000 lines it seems. It has a simple built in database which I suppose accounts for some of that. For what it does, it seems quite reasonable."

In general, most (but not all) developers who express an opinion on the matter seem to feel that kbuild 2.5 is worthwhile and should be merged. So it has surprised a number of people to see numerous patches to the existing kbuild system, written by Kai Germaschewski, being merged by Linus. These patches do worthwhile things, but they are not kbuild 2.5. Why bother, one might ask, if the whole thing is going to be replaced?

The answer seems to be that Linus, for now, wants Kai to be the kbuild maintainer. Kai is willing to do things in small pieces, which has always been Linus's preferred method; Keith has, so far, refused to break his kbuild work up in this way. Also, says Linus:

Kai isn't an enthusiastic kbuild-2.5 supporter. In fact, he tends to be a bit down on some of it. Which is a plus in my book: it means that whatever Kai tries to push my way I'll feel just that much more comfortable with as having had critical review.

Meanwhile, a couple of different developers (Sam Ravnborg and "Lightweight patch manager") have started submitting broken up versions of kbuild 2.5. Kai has stated that he will look them over and integrate those which make sense. Some of these patches also found their way into 2.5.20-dj3. It seems like at least a partial victory for the new kbuild.

So one has to wonder why, after all this, Keith felt the need to post his call for an email campaign entitled "If you want kbuild 2.5, tell Linus." It's a full-scale polemic that takes one back to the old devfs wars. It is also, seemingly, counterproductive. One would think that it would be better to work with the people who are trying to make kbuild acceptable to Linus than to call for a pressure campaign.


The value of negative dentries

A "directory entry" (dentry) is an internal data structure used to hold the results of looking up a file in the filesystem. The Linux "dentry cache" keeps a number of recently used dentries around; they tend to be useful, since files are often accessed more than once over a short period of time. Finding a file in the dentry cache can save a lot of time by avoiding a full filesystem lookup.

The kernel also hangs on to "negative dentries," which indicate that the given file does not exist. Andrea Arcangeli recently noted that these negative dentries can take up quite a bit of memory, and wondered what possible use they could be. His message included a patch to force negative dentries out of memory quickly.

It turns out, though, that "this file does not exist" can be useful information. A quick strace run on a GNOME application, for example, turns up dozens of lookups on nonexistent files as the application gropes around looking for the unbelievable number of libraries it needs. Similarly, Apache is continually looking for .htaccess files, shells look for executables, etc. It is more than worthwhile to be able to determine that a file doesn't exist without an expensive filesystem call - especially for file names that are often looked up. So negative dentries will stay.

There is one optimization that can be made, though. In Andrea's case, the negative dentries were created by deleting a large directory full of files. When a file is deleted, it is relatively unlikely that it will be looked up again soon, and keeping a negative dentry around is less useful. In this case, perhaps, it is better to just forget about the file name altogether.
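
The behavior described here can be modeled with a small user-space cache. This is a sketch under stated assumptions, not the kernel's dcache: the structures and functions (toy_dcache, toy_lookup, toy_unlink) are invented for the example, the "filesystem" is just an array of names, and file creation (which would have to invalidate a negative entry) is not modeled. Lookups of missing names leave a negative entry behind, while deleting a file drops its entry outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: every name here is hypothetical. */
#define TOY_SLOTS    16
#define TOY_NAME_MAX 32

enum toy_state { TOY_EMPTY, TOY_POSITIVE, TOY_NEGATIVE };

struct toy_dentry {
    char name[TOY_NAME_MAX];
    enum toy_state state;
};

struct toy_dcache {
    struct toy_dentry slot[TOY_SLOTS];
    int slow_lookups;               /* how often the "filesystem" was asked */
};

struct toy_fs {
    const char *files[TOY_SLOTS];   /* NULL marks an unused slot */
};

/* Stand-in for an expensive on-disk directory search. */
static bool toy_fs_has(const struct toy_fs *fs, const char *name)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (fs->files[i] && strcmp(fs->files[i], name) == 0)
            return true;
    return false;
}

static struct toy_dentry *toy_find(struct toy_dcache *dc, const char *name)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (dc->slot[i].state != TOY_EMPTY &&
            strcmp(dc->slot[i].name, name) == 0)
            return &dc->slot[i];
    return NULL;
}

static void toy_insert(struct toy_dcache *dc, const char *name,
                       enum toy_state state)
{
    struct toy_dentry *d = &dc->slot[0];   /* crude eviction fallback */

    assert(strlen(name) < TOY_NAME_MAX);
    for (int i = 0; i < TOY_SLOTS; i++)
        if (dc->slot[i].state == TOY_EMPTY) {
            d = &dc->slot[i];
            break;
        }
    strcpy(d->name, name);
    d->state = state;
}

/* Cached answer if there is one, positive or negative; else ask the fs. */
static bool toy_lookup(struct toy_dcache *dc, const struct toy_fs *fs,
                       const char *name)
{
    struct toy_dentry *d = toy_find(dc, name);
    bool exists;

    if (d)
        return d->state == TOY_POSITIVE;
    dc->slow_lookups++;
    exists = toy_fs_has(fs, name);
    toy_insert(dc, name, exists ? TOY_POSITIVE : TOY_NEGATIVE);
    return exists;
}

/* Delete a file and forget its name instead of keeping a negative entry. */
static void toy_unlink(struct toy_dcache *dc, struct toy_fs *fs,
                       const char *name)
{
    struct toy_dentry *d;

    for (int i = 0; i < TOY_SLOTS; i++)
        if (fs->files[i] && strcmp(fs->files[i], name) == 0)
            fs->files[i] = NULL;
    d = toy_find(dc, name);
    if (d)
        d->state = TOY_EMPTY;
}
```

Looking up a missing .htaccess twice costs only one slow lookup, since the second query is answered by the negative entry. After toy_unlink(), by contrast, the name is gone from the cache, so the memory is available for entries more likely to be asked about again.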


The return of /dev/port

A few weeks ago, LWN reported on the removal of support for /dev/port from the 2.5 kernel. Since then, a few users have reported real uses for /dev/port and a desire that it stay in the kernel. Martin Dalecki, who created the patch removing /dev/port, suggested that users who really need it can patch it back in themselves. Linus disagreed, saying:

So when simplifying, it's not just important to say "we could do without this". You have to also say "and nobody can reasonably expect to need it".

Which doesn't seem to be the case with /dev/ports. So it stays.

That is, of course, the definitive end to the discussion.


Resources

A few other worthwhile notes:
  • Kernel Traffic issues 168 and 169 are available.

  • The Linux Security Module web site has been overhauled in a big way. "It's no longer an endless dribble of old patches. It contains some information about the project, more navigable patch listing, links to the BK repositories, and links to all the documentation that I am aware of."

  • Late last April, we mentioned that Pacific Northwest National Laboratory was seeking an experienced kernel programmer to work on its new, 1400-node Linux cluster. The position is still open, so go check it out if you think you might be interested.


Patches and updates

Kernel trees

  • Lightweight patch manager: linux-2.5.20-ct1. Adds a number of "trivial patches" to 2.5.20. (June 4, 2002)
  • Andrea Arcangeli: 2.4.19pre9aa1. Includes the integration of the O(1) scheduler - "highly experimental." (June 4, 2002)
  • Paul P Komkoff Jr: 2.4.19-pre9-ac1-s1. kbuild 2.5, EVMS, and a number of fixes. (June 4, 2002)
  • Marc-Christian Petersen: 2.2.21-3-secure. Many goodies for 2.2: OpenWall, ext3, ReiserFS, CryptoAPI, ACLs, USAGI, FreeS/Wan, 2.4 IDE, etc. "The intended purpose is for production/servers." (June 4, 2002)

Core kernel code

  • Robert Love: scheduler hints. Allow applications to give hints to the scheduler on how they will behave. (June 4, 2002)
  • Russell King: cpufreq core for 2.5. A common (across architectures) interface to CPU clock speed. (June 4, 2002)
  • William Lee Irwin III: lazy_buddy-2.5.19-3. A "bugfix and cleanup release" of the new, deferred coalescing memory allocator. (June 4, 2002)
  • Andrew Morton: direct-to-BIO writeback. Perform filesystem writeouts direct to the block layer via BIO requests - no more buffer heads. At least in simple cases. (June 5, 2002)
  • Andrew Morton: direct-to-BIO readahead. Make the readahead code work without buffer heads. "CPU load for `cat large_file > /dev/null' is reduced by approximately 15%." (June 5, 2002)

Development tools

  • Randy.Dunlap: kerneltop. A "top"-like display generated from kernel profiling data. (June 4, 2002)

Device drivers

  • Martin Dalecki: 2.5.18 IDE 71. "Scary big patch this time." (June 4, 2002)

Documentation

  • Patrick Mochel: device model documentation 1/3. Documentation of the device model code - part 1 covers the bus_type structure. (June 4, 2002)

Filesystems and block I/O

  • Andreas Gruenbacher: Status of 2.5.x port. An initial port of the extended attribute/access control list code to 2.5. (June 5, 2002)

Janitorial

  • Robert Love: remove suser(). The venerable suser() call is gone at last. (June 5, 2002)

Kernel building

Networking

Architecture-specific

  • James Bottomley: i386 arch subdivision into machine types for 2.5.18. "This code rearranges the arch/i386 directory structure to allow for sliding additional non-pc hardware in here in an easily separable (and thus easily maintainable) fashion." (June 4, 2002)
  • Thomas Capricelli: linux zeta-0.2 released. Zeta is a virtual platform to which the group is porting Linux. (June 4, 2002)

Security-related

  • Chris Wright: 2.5.20-lsm1. New version of the Linux Security Module patch. (June 5, 2002)
  • Chris Wright: 2.4.18-lsm3. Linux Security Module patch for 2.4.18. (June 5, 2002)
  • Amon Ott: RSBAC v1.2.0. Rule Set Based Access Control. (June 4, 2002)

Miscellaneous

  • Andrew Morton: "laptop mode". Optimizations for laptop use - mostly minimizing disk spinups. (June 5, 2002)
  • Bartlomiej Zolnierkiewicz: atapci 0.50. Reads information from ATA PCI chipsets. (June 4, 2002)
  • Bartlomiej Zolnierkiewicz: atapci 0.51. Fixes a problem with 0.50. (June 5, 2002)

Page editor: Jonathan Corbet

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds