Brief items
The current development kernel is 2.5.68, which was
released by Linus on April 19. This is a
large patch which has been a while in coming; it includes the usual big
pile of fixes along with a bunch of devfs tweaking (read Linus's note if
you use devfs), a new h8300 architecture, some NFS performance tuning, some
changes to the workqueue interface, the merging of s390 and s390x into a
single architecture (along with a bunch of other s390 work), the generation
of hotplug events from kobject registration, a new
__user
attribute to mark user-space pointers (to help static analysis tools find
bugs), a small change to the semantics of
msync(MS_ASYNC) (it no
longer actually starts any I/O), some reverse-mapping VM speedups, a new
requirement that gcc version 2.95 (or later) be used to compile the kernel,
a big pile of small fixes from Alan Cox, an NFSv4 update, and a big IA-64
update. The details can be found in the long-format changelog.
Linus's BitKeeper repository contains a change to the interrupt handler
prototype (see below), a patch for runtime barrier instruction patching
(which allows optimal performance on different processors without the need
to ship multiple kernels), more devfs cleanups, more preparation for an
expanded dev_t type, some swapoff improvements, a new set of
memory allocation flags (described below), and numerous other fixes and
updates.
The current stable kernel is 2.4.20. The 2.4.21 release got a
little closer with the announcement of the
first release candidate. 2.4.21-rc1 adds a relatively small number of
fixes to -pre7, and includes a plea for extensive testing.
Kernel development news
Dealing with memory allocation failures is a requirement for all kernel
code (and user-space code as well). But there are some places in the
kernel where failures cannot be allowed to happen. So it is not uncommon
to see kernel code which doesn't take "no" for an answer. As Andrew Morton
put it:
There are quite a lot of places in the kernel which will infinitely retry a
memory allocation. Generally, they get it wrong.
As a way of helping kernel code get it right, Andrew has created a patch - since merged for 2.5.69 - which adds
a new set of __GFP flags for get_free_page() and the
other memory allocation functions. These flags are:
- __GFP_REPEAT
- This flag tells the page allocator to "try harder," repeating failed
allocation attempts if need be. Allocations can still fail, but
failure should be less likely.
- __GFP_NOFAIL
- Try even harder; allocations with this flag must not fail. Needless
to say, such an allocation could take a long time to satisfy.
- __GFP_NORETRY
- Failed allocations should not be retried; instead, a failure status
will be returned to the caller immediately.
These flags should make memory allocation operations a little more
predictable. There is a moral hazard here, however, that programmers will
start simply supplying __GFP_NOFAIL instead of making the extra
effort to deal with failed allocations. __GFP_NOFAIL has its
place, but, in most cases, it is probably better to be able to deal with
low-memory situations directly.
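The difference between the three policies can be sketched with a small user-space model. This is not the kernel's page allocator: the MODEL_ names, the flag values, the retry budgets, and the simulated failure counter are all assumptions made up for illustration. It only shows the behavior described above, namely that __GFP_NORETRY gives up at once, __GFP_REPEAT tries longer but can still fail, and __GFP_NOFAIL never returns failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the new flags; the values are arbitrary. */
#define MODEL_GFP_REPEAT  0x1u  /* try harder, but failure is still possible */
#define MODEL_GFP_NOFAIL  0x2u  /* keep trying until the allocation succeeds */
#define MODEL_GFP_NORETRY 0x4u  /* give up after the first failed attempt */

#define MODEL_DEFAULT_TRIES 2   /* assumed retry budget with no flag set */
#define MODEL_REPEAT_TRIES  8   /* assumed retry budget for "try harder" */

/* Simulated memory pressure: the next N attempts will fail. */
static int model_failures_left;

static void *model_try_once(size_t size)
{
	if (model_failures_left > 0) {
		model_failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* Allocate according to the retry policy selected by flags. */
static void *model_alloc(size_t size, unsigned int flags)
{
	int tries = MODEL_DEFAULT_TRIES;
	void *p;

	/* "Never fail" and "never retry" contradict each other. */
	assert(!((flags & MODEL_GFP_NOFAIL) && (flags & MODEL_GFP_NORETRY)));

	if (flags & MODEL_GFP_NORETRY)
		tries = 1;
	else if (flags & MODEL_GFP_REPEAT)
		tries = MODEL_REPEAT_TRIES;

	for (;;) {
		p = model_try_once(size);
		if (p)
			return p;
		if (flags & MODEL_GFP_NOFAIL)
			continue;       /* never report failure to the caller */
		if (--tries <= 0)
			return NULL;    /* the caller must cope with this */
	}
}
```

In this model a caller passing MODEL_GFP_NOFAIL can loop for an arbitrarily long time, which is exactly why the flag is best kept for the few places where failure truly cannot be handled.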
One problem that can confront an operating system kernel is that of
"screaming" devices - hardware which continually raises interrupts, but for
which there is no driver to tell it to shut up. If the offending hardware
is yanking on an interrupt line which is not otherwise in use, the kernel
can quickly disable that line and be done with the problem. If, however,
the interrupt line is in use in a shared mode, there is (in kernels through
2.5.68) no way for the kernel to know that nobody is dealing with the loud
device. All it can do is pass an interrupt request to the registered
handlers and hope for the best.
Of course, there is no need for things to be that way; each device driver
knows whether it handled a specific interrupt or not. So all that's needed
is for the drivers to communicate that information back to the kernel. The
2.5.69 kernel does exactly that - thanks to a
patch by Linus - at the cost of breaking every driver which registers
an interrupt handler.
Interrupt handlers no longer return void; instead, they must
return an irqreturn_t value (adding typedefs to the kernel is OK
when Linus does it). The values are IRQ_HANDLED if the driver
recognized the interrupt or IRQ_NONE if the interrupt was not for
one of the driver's devices. The IRQ_RETVAL(handled) macro can
also be used; the handled parameter should be nonzero if the
interrupt was handled in the driver.
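A handler written to the new convention might look something like the following user-space sketch. The irqreturn_t, IRQ_NONE, IRQ_HANDLED, and IRQ_RETVAL() definitions here are local stand-ins modeled on the description above, the handler signature is simplified (the real one takes more arguments), and struct model_dev, model_dev_interrupt(), and model_dispatch() are hypothetical names invented for the example. The dispatcher only illustrates how return values from several handlers on a shared line let the caller see whether anybody claimed the interrupt.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in definitions modeled on the text; not copied from the kernel. */
typedef int irqreturn_t;
#define IRQ_NONE       0
#define IRQ_HANDLED    1
#define IRQ_RETVAL(x)  ((x) != 0)

/* Hypothetical device: "pending" models its interrupt status bit. */
struct model_dev {
	int pending;
	int serviced;
};

/* Simplified handler signature, assumed for this sketch. */
typedef irqreturn_t (*model_handler_t)(int irq, void *dev_id);

static irqreturn_t model_dev_interrupt(int irq, void *dev_id)
{
	struct model_dev *dev = dev_id;
	int handled = 0;

	(void)irq;
	if (dev->pending) {
		/* Our device raised the interrupt: acknowledge and service it. */
		dev->pending = 0;
		dev->serviced++;
		handled = 1;
	}
	return IRQ_RETVAL(handled);
}

struct model_action {
	model_handler_t handler;
	void *dev_id;
};

/* Call every handler on a shared line; report whether anyone claimed it. */
static irqreturn_t model_dispatch(int irq, struct model_action *actions,
				  size_t n)
{
	irqreturn_t ret = IRQ_NONE;
	size_t i;

	assert(actions != NULL || n == 0);
	for (i = 0; i < n; i++)
		ret |= actions[i].handler(irq, actions[i].dev_id);
	return ret;
}
```

If model_dispatch() keeps returning IRQ_NONE while the line keeps firing, the caller has the evidence it needs to suspect a screaming device.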
With this change, the kernel can tell whether a particular device is being
handled or not. As of this writing, the "fix the drivers" effort is in
full swing; by the time 2.5.69 is released, most of the (in-tree) drivers
should be working again. At least, with regard to the interrupt change.
The expanded device number type - one of the big remaining items for the
2.5 development cycle - is getting closer to reality. Much of the
preparation work has been done. There are still a few issues to be
resolved, however; this week's discussion mostly centers around how device
numbers should be represented in the kernel.
One seeming outcome is that the kdev_t type will go away.
Alexander Viro, who has recently resurfaced behind a UK email address, is
pushing
strongly for this change. Among other things, he has posted a set of "kdev_t-ectomy" patches which remove
the kdev_t type from the TTY layer and a few other spots.
kdev_t variables are replaced with direct pointers to driver data
structures or integer indexes, depending on the context. Every instance of
kdev_t, according to Al, is a sign of a problem; he'll be
submitting more cleanup patches in the future.
As this work progresses, device numbers will become less visible throughout
much of the kernel. But there will still be a need to work with device
numbers; they are, after all, tokens which are passed between kernel and user
space. A 64-bit device number seems like a done deal, but it's still not
entirely clear how such numbers will be represented. A few schools of thought
exist:
- Many developers have been proceeding on the assumption that a simple,
64-bit integer would be used to hold device numbers in the future.
This approach, of course, is just an extension of the current 16-bit
number scheme.
- While most developers, perhaps, see that 64-bit quantity as being
split into 32-bit major and minor numbers, there are still people who
would like to get rid of the major/minor distinction altogether. The
management of the device number space will make that distinction
increasingly unimportant. Still, retention of the distinction between
major and minor numbers seems likely for now.
- Linus has been advocating a tuple representation, where major and
minor numbers would be carried around independently of each other.
Few others have argued for this representation, however, and Linus
does not appear to feel strongly enough to force the issue.
The end result will matter little for most developers, since the
MAJOR() and MINOR() macros will work as always. The real
concern has to do with how backward compatibility will be supported. We
all have filesystems and applications with 16-bit numbers wired deeply into
them; we all expect those filesystems and applications to work with the 2.6
kernel. That means that a 16-bit device number, with eight-bit
major and minor numbers, will look to the kernel like a device number with
a major number of zero and a large minor number. This case is easy to
detect, of course, and it is not that big a deal to map it into the proper
large representation.
The important thing is that this remapping must happen consistently
everywhere in the kernel. So, in every place where device numbers enter
the kernel, they must be turned into a standard form, be it a combined
device number or some sort of tuple representation. In practice, this
remapping need not happen in many places; the mknod(),
open() and stat() system calls are the big ones.
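The remapping of old device numbers can be sketched in a few lines of user-space C. The layout assumed here (a 64-bit integer with a 32-bit major in the upper half and a 32-bit minor in the lower half) is only one of the candidates discussed above, and model_dev_t, the MODEL_ macros, model_old_dev(), and model_normalize_dev() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration: 32-bit major above a 32-bit minor. */
typedef uint64_t model_dev_t;

#define MODEL_MAJOR(dev)    ((uint32_t)((dev) >> 32))
#define MODEL_MINOR(dev)    ((uint32_t)((dev) & 0xffffffffu))
#define MODEL_MKDEV(ma, mi) (((model_dev_t)(ma) << 32) | (uint32_t)(mi))

/* Build an old-style 16-bit number: eight-bit major, eight-bit minor. */
static model_dev_t model_old_dev(unsigned int major, unsigned int minor)
{
	assert(major < 256 && minor < 256);
	return (model_dev_t)((major << 8) | minor);
}

/*
 * Turn a number arriving from user space into the standard form.
 * An old 16-bit number shows up with a major of zero and a "minor"
 * that fits in 16 bits; split that value back into its 8:8 halves.
 */
static model_dev_t model_normalize_dev(model_dev_t dev)
{
	uint32_t minor = MODEL_MINOR(dev);

	if (MODEL_MAJOR(dev) == 0 && minor <= 0xffff)
		return MODEL_MKDEV(minor >> 8, minor & 0xff);
	return dev;
}
```

One consequence of this sketch is that a new-style number with major zero and a minor between 256 and 65535 cannot be told apart from an old number, so that corner of the number space would have to go unused. Calling the normalizing function at every entry point is what keeps the interpretation consistent.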
Peter Anvin proposed a different way of
representing device numbers in a 64-bit word.
This representation appears to be more complicated, since obtaining the
major and minor numbers would require extracting and splicing bit fields.
It's worth noting again, however, that this work would be hidden within the
MAJOR() and MINOR() macros, and invisible to kernel
code. And, with this representation, no remapping of device numbers would
be required.
The discussion seemed to wind down in an inconclusive manner. The real
decisions will be made, of course, when the patches appear and are merged.
Driver porting
High memory can be a pain to work with. The addressing limitations of
32-bit processors make it impossible to map all of high memory into the
kernel's address space. So various workarounds must be employed to manage
high memory portably; this need is one of the reasons for the increasing
use of
struct page pointers in the kernel.
When the kernel needs to access a high memory page directly, an ad hoc
memory mapping must be set up. This is the purpose of the functions
kmap() and kunmap(), which have existed since high memory
support was first implemented. kmap() is relatively expensive to
use, however; it requires global page table changes, and it can put the
calling function to sleep. It is thus a poor fit to many parts of the
kernel where performance is important.
To address these performance issues, a new type of kernel mapping (the
"atomic kmap") has been created (atomic kmaps actually existed, in a slightly
different form, in 2.4.1). Atomic kmaps are intended for short-term
use in small, atomic sections of kernel code; it is illegal to sleep while
holding an atomic kmap. Atomic kmaps are a per-CPU structure; given the
constraints on their use, there is no point in sharing them across
processors. They are also available in very limited numbers.
In fact, there are only about a dozen atomic kmap slots available on each
processor (the actual number is architecture-dependent), and users of
atomic kmaps must specify which slot to use. A new enumerated type
(km_type) has been defined to give names to the atomic kmap
slots. The slots that will be of most interest to driver writers are:
- KM_USER0, KM_USER1. These slots are to be used
by code called from user space (i.e. system calls).
- KM_IRQ0, KM_IRQ1. Slots for interrupt handlers
to use.
- KM_SOFTIRQ0, KM_SOFTIRQ1. Slots for code running out of
a software interrupt, such as a tasklet.
Several other slots exist, but they have been set aside for specific
purposes and should not be used.
The actual interface for obtaining an atomic kmap is:
void *kmap_atomic(struct page *page, enum km_type type);
The return value is a kernel virtual address which may be used to address
the given page. kmap_atomic() will always succeed, since the slot
to use has been given to it. It will also disable preemption while the
atomic kmap is held.
When you have finished with the atomic kmap, you should undo it with:
void kunmap_atomic(void *address, enum km_type type);
Users of atomic kmaps should be very aware of the fact that nothing in the
kernel prevents one function from stepping on another function's mappings.
Code which holds atomic kmaps thus needs to be short and simple. If you
are using one of the KM_IRQ slots, you should have locally
disabled interrupts first. As long
as everybody is careful, conflicts over atomic kmap slots do not arise.
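As a rough illustration, the usual pattern is map, touch the page briefly, unmap. The sketch below is written so that it compiles in user space: struct page, enum km_type, kmap_atomic(), and kunmap_atomic() are simplified stand-ins defined locally (only the two prototypes match the interface shown above), and MODEL_PAGE_SIZE, model_slot, and copy_from_page() are hypothetical names. The stand-in kmap_atomic() simply overwrites whatever was in the slot, mirroring the point that nothing stops one user from stepping on another.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ---- User-space stand-ins so the sketch builds outside the kernel ---- */
#define MODEL_PAGE_SIZE 4096
struct page { unsigned char data[MODEL_PAGE_SIZE]; };
enum km_type { KM_USER0, KM_USER1, KM_TYPE_NR };

static struct page *model_slot[KM_TYPE_NR];	/* one CPU's slots */

static void *kmap_atomic(struct page *page, enum km_type type)
{
	model_slot[type] = page;  /* silently replaces any earlier mapping */
	return page->data;
}

static void kunmap_atomic(void *address, enum km_type type)
{
	/* Catch an unmap that does not match the slot's current mapping. */
	assert(model_slot[type] &&
	       (void *)model_slot[type]->data == address);
	model_slot[type] = NULL;
}
/* ---- End of stand-ins ---- */

/*
 * Copy len bytes out of a (possibly high-memory) page.  Nothing between
 * the map and the unmap may sleep, so the mapped section is kept tiny.
 */
static void copy_from_page(void *dst, struct page *page,
			   size_t offset, size_t len)
{
	char *vaddr;

	assert(offset <= MODEL_PAGE_SIZE && len <= MODEL_PAGE_SIZE - offset);
	vaddr = kmap_atomic(page, KM_USER0);
	memcpy(dst, vaddr + offset, len);
	kunmap_atomic(vaddr, KM_USER0);
}
```

KM_USER0 is used here on the assumption that copy_from_page() is called from system-call context; code running in an interrupt handler would pick one of the KM_IRQ slots instead.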
Should you need to obtain a struct page pointer for an address
obtained from kmap_atomic(), you can use:
struct page *kmap_atomic_to_page(void *address);
If you want to map buffers obtained from the block layer in a BIO
structure, you should use the BIO-specific kmap functions (described in the BIO article) instead.
Atomic kmaps are a useful resource for performance-critical code. They
should not be overused, however. For any code which might sleep, or which
can afford to wait for a mapping, the old standard kmap() should
be used instead.
Page editor: Jonathan Corbet