Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.68, which was released by Linus on April 19. This is a large patch which has been a while in coming; it includes the usual big pile of fixes along with a bunch of devfs tweaking (read Linus's note if you use devfs), a new h8300 architecture, some NFS performance tuning, some changes to the workqueue interface, the merging of s390 and s390x into a single architecture (along with a bunch of other s390 work), the generation of hotplug events from kobject registration, a new __user attribute to mark user-space pointers (to help static analysis tools find bugs), a small change to the semantics of msync(MS_ASYNC) (it no longer actually starts any I/O), some reverse-mapping VM speedups, a new requirement that gcc version 2.95 (or later) be used to compile the kernel, a big pile of small fixes from Alan Cox, an NFSv4 update, and a big IA-64 update. The details can be found in the long-format changelog.

Linus's BitKeeper repository contains a change to the interrupt handler prototype (see below), a patch for runtime barrier instruction patching (which allows optimal performance on different processors without the need to ship multiple kernels), more devfs cleanups, more preparation for an expanded dev_t type, some swapoff improvements, a new set of memory allocation flags (described below), and numerous other fixes and updates.
The current stable kernel is 2.4.20. The 2.4.21 release got a little closer with the announcement of the first release candidate. 2.4.21-rc1 adds a relatively small number of fixes to -pre7, and includes a plea for extensive testing.
Kernel development news
Some new memory allocation flags
Dealing with memory allocation failures is a requirement for all kernel code (and user-space code as well). But there are some places in the kernel where failures cannot be allowed to happen. So it is not uncommon to see kernel code which doesn't take "no" for an answer. As Andrew Morton put it:
As a way of helping kernel code get it right, Andrew has created a patch - since merged for 2.5.69 - which adds a new set of __GFP flags for get_free_page() and the other memory allocation functions. These flags are:
- __GFP_REPEAT
- This flag tells the page allocator to "try harder," repeating failed allocation attempts if need be. Allocations can still fail, but failure should be less likely.
- __GFP_NOFAIL
- Try even harder; allocations with this flag must not fail. Needless to say, such an allocation could take a long time to satisfy.
- __GFP_NORETRY
- Failed allocations should not be retried; instead, a failure status will be returned to the caller immediately.
These flags should make memory allocation operations a little more predictable. There is a moral hazard here, however, that programmers will start simply supplying __GFP_NOFAIL instead of making the extra effort to deal with failed allocations. __GFP_NOFAIL has its place, but, in most cases, it is probably better to be able to deal with low-memory situations directly.
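The difference between the three flags can be pictured as a retry policy. What follows is a small user-space model, not kernel code: only the flag names come from the patch, while the flag values, the `try_alloc()` stub, the retry budget, and the behavior when no flag is set are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative flag values only; the real kernel values differ. */
#define MODEL_GFP_REPEAT   0x1u  /* try harder, but may still fail   */
#define MODEL_GFP_NOFAIL   0x2u  /* keep trying until it succeeds    */
#define MODEL_GFP_NORETRY  0x4u  /* give up after the first failure  */

#define MODEL_REPEAT_LIMIT 8     /* hypothetical retry budget */

/* Hypothetical low-level allocator: fails while *pressure > 0. */
static void *try_alloc(size_t size, int *pressure)
{
    if (*pressure > 0) {
        (*pressure)--;           /* pretend some memory got reclaimed */
        return NULL;
    }
    return malloc(size);
}

/* Model of the retry policy selected by the flags. With no flag set,
 * this model simply gives up after one attempt (an assumption). */
static void *model_alloc(size_t size, unsigned int flags, int *pressure)
{
    int attempts = 0;

    for (;;) {
        void *p = try_alloc(size, pressure);

        if (p)
            return p;
        attempts++;
        if (flags & MODEL_GFP_NORETRY)
            return NULL;         /* fail immediately */
        if (flags & MODEL_GFP_NOFAIL)
            continue;            /* never report failure */
        if ((flags & MODEL_GFP_REPEAT) && attempts < MODEL_REPEAT_LIMIT)
            continue;            /* try harder, within a budget */
        return NULL;
    }
}

/* Usage example. */
static void model_alloc_example(void)
{
    int pressure = 3;            /* the next three attempts will fail */
    void *p = model_alloc(64, MODEL_GFP_NORETRY, &pressure);

    assert(p == NULL);           /* gave up after a single failure */

    p = model_alloc(64, MODEL_GFP_NOFAIL, &pressure);
    assert(p != NULL);           /* looped until it succeeded */
    free(p);
}
```

The unbounded loop taken for the no-fail case is the reason such an allocation "could take a long time to satisfy."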
An interrupt handling change
One problem that can confront an operating system kernel is that of "screaming" devices - hardware which continually raises interrupts, but for which there is no driver to tell it to shut up. If the offending hardware is yanking on an interrupt line which is not otherwise in use, the kernel can quickly disable that line and be done with the problem. If, however, the interrupt line is in use in a shared mode, there is (in kernels through 2.5.68) no way for the kernel to know that nobody is dealing with the loud device. All it can do is pass an interrupt request to the registered handlers and hope for the best.

Of course, there is no need for things to be that way; each device driver knows whether it handled a specific interrupt or not. So all that's needed is for the drivers to communicate that information back to the kernel. The 2.5.69 kernel does exactly that - thanks to a patch by Linus - at the cost of breaking every driver which registers an interrupt handler.
Interrupt handlers no longer return void; instead, they must return an irqreturn_t value (adding typedefs to the kernel is OK when Linus does it). The values are IRQ_HANDLED if the driver recognized the interrupt or IRQ_NONE if the interrupt was not for one of the driver's devices. The IRQ_RETVAL(handled) macro can also be used; the handled parameter should be nonzero if the interrupt was handled in the driver.
With this change, the kernel can tell whether a particular device is being handled or not. As of this writing, the "fix the drivers" effort is in full swing; by the time 2.5.69 is released, most of the (in-tree) drivers should be working again. At least, with regard to the interrupt change.
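A handler written to the new convention might look roughly like the sketch below. It is a user-space mock so that it can be compiled on its own: the `irqreturn_t`, `IRQ_NONE`, `IRQ_HANDLED`, and `IRQ_RETVAL()` definitions are stand-ins for what the kernel headers provide, and `struct demo_dev`, `demo_interrupt()`, and `demo_dispatch()` are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel definitions, so the sketch builds in user space. */
typedef int irqreturn_t;
#define IRQ_NONE       0
#define IRQ_HANDLED    1
#define IRQ_RETVAL(x)  ((x) != 0)

struct pt_regs;                    /* opaque in this mock */

/* Hypothetical device: "pending" plays the role of a status register. */
struct demo_dev {
    int pending;
    int serviced;
};

/* New-style handler: tell the caller whether the interrupt was ours. */
static irqreturn_t demo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct demo_dev *dev = dev_id;
    int handled = 0;

    (void)irq;
    (void)regs;
    if (dev->pending) {            /* our device did raise the interrupt */
        dev->pending = 0;
        dev->serviced++;
        handled = 1;
    }
    return IRQ_RETVAL(handled);
}

typedef irqreturn_t (*demo_handler_t)(int, void *, struct pt_regs *);

/* Model of a shared line: call every handler and report whether any
 * of them claimed the interrupt. */
static int demo_dispatch(int irq, demo_handler_t *handlers,
                         void **dev_ids, size_t n)
{
    int claimed = 0;

    for (size_t i = 0; i < n; i++)
        if (handlers[i](irq, dev_ids[i], NULL) == IRQ_HANDLED)
            claimed = 1;
    return claimed;
}

/* Usage example. */
static void demo_irq_example(void)
{
    struct demo_dev idle = { 0, 0 }, busy = { 1, 0 };
    demo_handler_t handlers[] = { demo_interrupt, demo_interrupt };
    void *ids[] = { &idle, &busy };

    assert(demo_dispatch(9, handlers, ids, 2) == 1);  /* busy claimed it */
    assert(demo_dispatch(9, handlers, ids, 2) == 0);  /* nobody did */
}
```

The second dispatch models the "screaming" case: every handler on the shared line returns IRQ_NONE, which is exactly the information the kernel previously had no way to obtain.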
The internal representation of device numbers
The expanded device number type - one of the big remaining items for the 2.5 development cycle - is getting closer to reality. Much of the preparation work has been done. There are still a few issues to be resolved, however; this week's discussion mostly centers around how device numbers should be represented in the kernel.

One seeming outcome is that the kdev_t type will go away. Alexander Viro, who has recently resurfaced behind a UK email address, is pushing strongly for this change. Among other things, he has posted a set of "kdev_t-ectomy" patches which remove the kdev_t type from the TTY layer and a few other spots. kdev_t variables are replaced with direct pointers to driver data structures or integer indexes, depending on the context. Every instance of kdev_t, according to Al, is a sign of a problem; he'll be submitting more cleanup patches in the future.
As this work progresses, device numbers will become less visible throughout much of the kernel. But there will still be a need to work with device numbers; they are, after all, a token which is passed between kernel and user space. A 64-bit device number seems like a done deal, but it's still not entirely clear how it will be represented. A few schools of thought exist:
- Many developers have been proceeding on the assumption that a simple,
64-bit integer would be used to hold device numbers in the future.
This approach, of course, is just an extension of the current 16-bit
number scheme.
- While most developers, perhaps, see that 64-bit quantity as being
split into 32-bit major and minor numbers, there are still people who
would like to get rid of the major/minor distinction altogether. The
management of the device number space will make that distinction
increasingly unimportant. Still, retention of the distinction between
major and minor numbers seems likely for now.
- Linus has been advocating a tuple representation, where major and minor numbers would be carried around independently of each other. Few others have argued for this representation, however, and Linus does not appear to feel strongly enough to force the issue.
The end result will matter little for most developers, since the MAJOR() and MINOR() macros will work as always. The real concern has to do with how backward compatibility will be supported. We all have filesystems and applications with 16-bit numbers wired deeply into them; we all expect those filesystems and applications to work with the 2.6 kernel. That means that a 16-bit device number, with eight-bit major and minor numbers:
| major (8 bits) | minor (8 bits) |
|---|---|
will look to the kernel like a device number with a major number of zero and a large minor number:
| major (zero) | minor (carries the old 16-bit value) |
|---|---|
This case is easy to detect, of course, and it is not that big a deal to map it into the proper large representation:
| major | minor |
|---|---|
The important thing is that this remapping must happen consistently everywhere in the kernel. So, in every place where device numbers enter the kernel, they must be turned into a standard form, be it a combined device number or some sort of tuple representation. In practice, this remapping need not happen in many places; the mknod(), open() and stat() system calls are the big ones.
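The remapping step can be sketched in a few lines of C. The sketch assumes the 32-bit/32-bit split mentioned above, which had not been settled, and it assumes that major number zero is never used by a genuine new-style device; the macro and function names (`OLD_MAJOR()`, `NEW_MKDEV()`, `normalize_dev()`, and so on) are made up for the example and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Old-style 16-bit device number: 8-bit major, 8-bit minor. */
#define OLD_MAJOR(d)  ((unsigned int)(((d) >> 8) & 0xff))
#define OLD_MINOR(d)  ((unsigned int)((d) & 0xff))

/* Assumed new layout: major in the upper 32 bits, minor in the lower 32. */
#define NEW_MKDEV(ma, mi)  (((uint64_t)(ma) << 32) | (uint32_t)(mi))
#define NEW_MAJOR(d)       ((uint32_t)((uint64_t)(d) >> 32))
#define NEW_MINOR(d)       ((uint32_t)((uint64_t)(d) & 0xffffffffu))

/* Turn a device number arriving from user space or from disk into the
 * standard form. An old 16-bit number shows up as "major 0, minor below
 * 65536"; anything else is taken to be in the new form already. */
static uint64_t normalize_dev(uint64_t dev)
{
    if (NEW_MAJOR(dev) == 0 && NEW_MINOR(dev) <= 0xffff)
        return NEW_MKDEV(OLD_MAJOR(dev), OLD_MINOR(dev));
    return dev;
}

/* Usage example. */
static void normalize_example(void)
{
    uint64_t old = (8u << 8) | 1u;      /* old-style major 8, minor 1 */
    uint64_t dev = normalize_dev(old);

    assert(NEW_MAJOR(dev) == 8 && NEW_MINOR(dev) == 1);
    assert(normalize_dev(dev) == dev);  /* already in standard form */
}
```

In this model a function like `normalize_dev()` would have to be called at every entry point, which is why consistency matters more than the cost of the conversion itself.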
Peter Anvin proposed a different way of representing device numbers in a 64-bit word:
| major | minor | major | minor |
|---|---|---|---|
This representation appears to be more complicated, since obtaining the major and minor numbers would require extracting and splicing bit fields. It's worth noting again, however, that this work would be hidden within the MAJOR() and MINOR() macros, and invisible to kernel code. And, with this representation, no remapping of device numbers would be required.
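The extraction and splicing might look something like the following. The field widths are an assumption made purely for illustration: the low 16 bits keep the old 8-bit major and minor, and the remaining 24 bits of each number sit above them. All of the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, from the high bits down (widths are illustrative):
 *   bits 63..40  upper 24 bits of the major number
 *   bits 39..16  upper 24 bits of the minor number
 *   bits 15..8   lower 8 bits of the major number
 *   bits  7..0   lower 8 bits of the minor number
 */
static uint64_t split_mkdev(uint32_t major, uint32_t minor)
{
    return ((uint64_t)(major >> 8) << 40) |
           ((uint64_t)(minor >> 8) << 16) |
           ((uint64_t)(major & 0xff) << 8) |
           (uint64_t)(minor & 0xff);
}

/* Splice the two major fields back into one 32-bit number. */
static uint32_t split_major(uint64_t dev)
{
    return (uint32_t)(((dev >> 40) << 8) | ((dev >> 8) & 0xff));
}

/* Splice the two minor fields back into one 32-bit number. */
static uint32_t split_minor(uint64_t dev)
{
    return (uint32_t)((((dev >> 16) & 0xffffff) << 8) | (dev & 0xff));
}

/* Usage example. */
static void split_example(void)
{
    uint64_t old = (8u << 8) | 1u;  /* old-style major 8, minor 1 */

    /* An old 16-bit number decodes correctly with no remapping step. */
    assert(split_major(old) == 8 && split_minor(old) == 1);
    assert(split_mkdev(8, 1) == old);
}
```

Under these assumptions the old encoding is simply a subset of the new one, which is where the "no remapping" property comes from.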
The discussion seemed to wind down in an inconclusive manner. The real decisions will be made, of course, when the patches appear and are merged.
Driver porting
Driver porting: Atomic kmaps
This article is part of the LWN Porting Drivers to 2.6 series.
When the kernel needs to access a high memory page directly, an ad hoc memory mapping must be set up. This is the purpose of the functions kmap() and kunmap(), which have existed since high memory support was first implemented. kmap() is relatively expensive to use, however; it requires global page table changes, and it can put the calling function to sleep. It is thus a poor fit to many parts of the kernel where performance is important.
To address these performance issues, a new type of kernel mapping (the "atomic kmap") has been created (they actually existed, in a slightly different form, in 2.4.1). Atomic kmaps are intended for short-term use in small, atomic sections of kernel code; it is illegal to sleep while holding an atomic kmap. Atomic kmaps are a per-CPU structure; given the constraints on their use, there is no point in sharing them across processors. They are also available in very limited numbers.
In fact, there are only about a dozen atomic kmap slots available on each processor (the actual number is architecture-dependent), and users of atomic kmaps must specify which slot to use. A new enumerated type (km_type) has been defined to give names to the atomic kmap slots. The slots that will be of most interest to driver writers are:
- KM_USER0, KM_USER1. These slots are to be used
by code called from user space (i.e. system calls).
- KM_IRQ0, KM_IRQ1. Slots for interrupt handlers
to use.
- KM_SOFTIRQ0, KM_SOFTIRQ1; for code running out of a software interrupt, such as a tasklet.
Several other slots exist, but they have been set aside for specific purposes and should not be used.
The actual interface for obtaining an atomic kmap is:
void *kmap_atomic(struct page *page, enum km_type type);
The return value is a kernel virtual address which may be used to address the given page. kmap_atomic() will always succeed, since the slot to use has been given to it. It will also disable preemption while the atomic kmap is held.
When you have finished with the atomic kmap, you should undo it with:
void kunmap_atomic(void *address, enum km_type type);
Users of atomic kmaps should be very aware of the fact that nothing in the kernel prevents one function from stepping on another function's mappings. Code which holds atomic kmaps thus needs to be short and simple. If you are using one of the KM_IRQ slots, you should have locally disabled interrupts first. As long as everybody is careful, conflicts over atomic kmap slots do not arise.
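The usual pattern is map, touch the page briefly, unmap. The sketch below shows that pattern against a user-space mock, so it can be compiled outside the kernel: `struct page`, the shortened `km_type` enumeration, and the bodies of `kmap_atomic()` and `kunmap_atomic()` are stand-ins, and `copy_page_byte()` is a hypothetical caller. Unlike the real functions, the mock checks for slot conflicts in order to make them visible.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins; the real definitions live in the kernel headers,
 * and the real enumeration has more slots than are listed here. */
enum km_type { KM_USER0, KM_USER1, KM_IRQ0, KM_IRQ1, KM_TYPE_NR };
struct page { char data[64]; };    /* mock: a "page" is just a buffer */

/* Models the atomic kmap slots of a single CPU. */
static struct page *slot_owner[KM_TYPE_NR];

static void *kmap_atomic(struct page *page, enum km_type type)
{
    /* The real function performs no such check; a second user of the
     * same slot would silently replace the first mapping. */
    assert(slot_owner[type] == NULL);
    slot_owner[type] = page;
    return page->data;
}

static void kunmap_atomic(void *address, enum km_type type)
{
    assert(slot_owner[type] != NULL);
    assert((void *)slot_owner[type]->data == address);
    slot_owner[type] = NULL;
}

/* Hypothetical caller: copy one byte between two pages using two
 * different slots, with nothing that could sleep in between. */
static void copy_page_byte(struct page *dst, struct page *src, size_t off)
{
    char *d = kmap_atomic(dst, KM_USER0);
    char *s = kmap_atomic(src, KM_USER1);

    d[off] = s[off];

    kunmap_atomic(s, KM_USER1);
    kunmap_atomic(d, KM_USER0);
}
```

Two pages mapped at the same time need two different slots; mapping both with KM_USER0 would trip the assertion in the mock and, in a real kernel, would quietly invalidate the first mapping.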
Should you need to obtain a struct page pointer for an address obtained from kmap_atomic(), you can use:
struct page *kmap_atomic_to_page(void *address);
If you want to map buffers obtained from the block layer in a BIO structure, you should use the BIO-specific kmap functions (described in the BIO article) instead.
Atomic kmaps are a useful resource for performance-critical code. They should not be overused, however. For any code which might sleep, or which can afford to wait for a mapping, the old standard kmap() should be used instead.
Page editor: Jonathan Corbet
