Brief items
The current development kernel is 2.5.62, which was
released by Linus on February 17. It
included a new version of the dentry cache which uses read-copy-update for
lockless file lookups, a number of architecture updates, some kbuild fixes
(including module alias and device table support), more signal cleanup
work, and (a classic sign that the freeze is progressing) lots of spelling
fixes. The
long-format changelog has
the details.
2.5.61 was released on February 14.
Changes in this release include a number of
SCSI driver fixes, an x86-64 merge, a new set of AGP changes, some ACPI
work, an SCTP update, and, of course, numerous other fixes. Once again,
see the
long-format changelog for the details.
Linus's (pre-2.5.63) BitKeeper tree contains, as of this writing, the
longstanding POSIX timers patch (but without the high-resolution
timers), a new set of IDE changes (see below), updates for obscure
architectures (Visual Workstation, v850, m68k-nommu), an ACPI update
(including the license change to dual BSD/GPL), and another big set of
spelling fixes.
The current stable kernel is 2.4.20; there have been no 2.4.21
prepatches issued over the last week.
Alan Cox has released the fourth 2.2.24 release
candidate.
Kernel development news
While fixing various problems in the signal handling code of recent
kernels, Linus evidently decided to take a stab at the issue of signal
handling races. The result was
this patch
implementing a prototype of a new signal handling mechanism. The idea
needs some fleshing out before it might be merged into the kernel, but it
has attracted a certain amount of interest among the developers.
The patch adds a new sigfd() system call:
int sigfd(sigset_t *mask, unsigned long flags);
The system call returns a file descriptor which will report on the set of
signals found in the given mask (the flags argument is
not used for now). A process reading from the file descriptor will receive
a structure describing one signal which was delivered to the process; it
will block if there are no outstanding signals.
This approach offers some advantages. Since signals are queued up and read
one at a time, they can be dealt with in an orderly manner. The user-space
application need not worry about races between signal handlers and other
code. The signal file descriptor can also be used with the
select() and poll() system calls, allowing signal
handling to be folded into application event processing loops. An
application can even pass the file descriptor to another process, should
there be, for some reason, a desire to let that other process listen in on
the first process's signals.
There was some immediate discussion of expanding this interface into a more
generic event-handling mechanism. For example, timer events, asynchronous
I/O events, etc. could also be delivered via the same file descriptor.
Linus stated that, to an extent, expanding the interface is what the
flags argument was intended for. He doesn't want to put too much into this
interface, however:
I'm not in the least interested in some "generic event" mechanism,
and it's not where I think this should even go. This was very much
about signals, and while I can see the potential to extend the
notion of signals to things like timers, I don't think it's
necessarily a good idea to extend it too far
Looking at the patch, a few developers commented on how much of it is
really just boilerplate filesystem and inode code. It has to be there to
make the file descriptor work, but really has little to do with the task at
hand. Much of that code is duplicated with other subsystems which have to
make "virtual" file descriptors. Davide Libenzi responded to this
observation with a patch implementing a new,
shared, "virtual filesystem" capability. If some variant of that patch
goes in, it has the potential of ridding the kernel of a fair amount of
tedious and error-prone code duplication.
After a long pause, a new set of IDE patches has found its way into Linus's
pre-2.5.63 BitKeeper tree. Most of these patches have been around for a
while (in the 2.4-ac tree), but Alan Cox has not felt that 2.5.x was stable
enough to attempt new IDE work. Now that things are working a little
better, the patches are flowing again.
The new generation of IDE changes is rather more restrained than last
year's "cleanup" effort. Changes that have gone in this time around
include cleaning out some old data structures that were either unused or
did not suit the purpose to which they were being put. Some improved
locking has been put in place, and the handling of missing drives
(e.g. PCMCIA drives which are removed by the user) has been improved -
though work remains to be done in that area. There is also a new
ide_execute_command() function which is meant to be the way
commands are passed down to drives in the future. For now, though, it is
only used for CD drives ("As with 2.4 I want it to
run for a bit on read only media first.")
The IDE work is one of the more prominent entries remaining on the "todo"
list for 2.5. Given the need to proceed slowly (it really is no fun
to ship a kernel with broken IDE), this work may take some time yet. So
it's good to see the patches finding their way into Linus's tree again.
The low-level kernel memory allocation functions take a set of flags
describing how that allocation is to be performed. Among other things,
these
GFP_ ("get free page") flags control whether the allocation
process can sleep and wait for memory, whether high memory can be used, and
so on. See
this article for the full set.
The kernel slab allocator is an additional layer built on top of the
low-level code; it handles situations where numerous objects of the same
size are frequently allocated and freed. The slab code, too, has a set of
flags describing how memory allocation is to happen. They look
suspiciously like the low-level flags, but they have different names;
instead of GFP_KERNEL, for example, users of the slab code are
expected to say SLAB_KERNEL.
Underneath it all, however, the two sets of flags are the same. As a
result, many calls to the slab code just use the GFP_ flags,
rather than the SLAB_ flags. William Lee Irwin decided it was
time to fix that; he posted a patch
converting several slab users over to the SLAB_ flags. It looked
like a fairly standard, freeze-stage kernel cleanup.
The question came up, however: why bother? Not everybody, it seems, thinks
that the separate SLAB_ flags are worth the trouble. William
responded with another patch which gets rid
of the SLAB_ flags altogether. So far, neither patch has been
merged. But they do raise a worthwhile question: why do we need a separate
set of flags if the callers have nothing different to say?
Driver porting
The LWN.net series on porting drivers (and other kernel code) to the 2.5
kernel continues this week with three new articles. Two of them (on
low-level memory allocation and per-CPU variables) appear below; the third
(an
updated description of the seqlock
mechanism) is available but won't be included inline here. As always,
the full series can be found at
http://lwn.net/Articles/driver-porting/.
The 2.5 development series has brought relatively few changes to the way
device drivers will allocate and manage memory. In fact, most drivers
should work with no changes in this regard. There are a few improvements
that have been made, however, that are worth a mention. These include some
changes to page allocation, and the new "mempool" interface. Note that the
allocation and management of per-CPU data is described in
a separate article.
Allocation flags
The old
<linux/malloc.h> include file is gone; it is now
necessary to include
<linux/slab.h> instead.
The GFP_BUFFER allocation flag is gone (it was actually removed in
2.4.6). That will bother few people, since almost nobody used it. There
are two new flags which have replaced it: GFP_NOIO and
GFP_NOFS. The GFP_NOIO flag allows sleeping, but no I/O
operations will be started to help satisfy the request. GFP_NOFS
is a bit less restrictive; some I/O operations can be started (writing to a
swap area, for example), but no filesystem operations will be performed.
For reference, here is the full set of allocation flags, from the most
restrictive to the least:
- GFP_ATOMIC: a high-priority allocation which will not sleep;
this is the flag to use in interrupt handlers and other non-blocking
situations.
- GFP_NOIO: blocking is possible, but no I/O will be
performed.
- GFP_NOFS: no filesystem operations will be performed.
- GFP_KERNEL: a regular, blocking allocation.
- GFP_USER: a blocking allocation for user-space pages.
- GFP_HIGHUSER: for allocating user-space pages where high
memory may be used.
The __GFP_DMA and __GFP_HIGHMEM flags still exist and may
be added to the above to direct an allocation to a particular memory zone.
In addition, 2.5.69 added some new modifiers:
- __GFP_REPEAT
This flag tells the page allocator to "try harder," repeating failed
allocation attempts if need be. Allocations can still fail, but
failure should be less likely.
- __GFP_NOFAIL
Try even harder; allocations with this flag must not fail. Needless
to say, such an allocation could take a long time to satisfy.
- __GFP_NORETRY
Failed allocations should not be retried; instead, a failure status
will be returned to the caller immediately.
The __GFP_NOFAIL flag is sure to be tempting to programmers who
would rather not code failure paths, but that temptation should be resisted
most of the time. Only allocations which truly cannot be allowed to fail
should use this flag.
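One way to picture the ordering of the allocation flags is as a set of permission bits, with each less restrictive flag adding a permission. The following user-space sketch models that idea only; the TOY_ names and bit values are invented for illustration, and the composition shown is an assumption rather than a copy of <linux/gfp.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative permission bits; NOT the kernel's actual values. */
#define TOY_GFP_WAIT    0x01u   /* caller may sleep */
#define TOY_GFP_HIGH    0x02u   /* high-priority request */
#define TOY_GFP_IO      0x04u   /* may start low-level I/O (e.g. swap) */
#define TOY_GFP_FS      0x08u   /* may call into filesystem code */
#define TOY_GFP_HIGHMEM 0x10u   /* high memory is acceptable */

/* Assumed composition, following the descriptions in the list above. */
#define TOY_ATOMIC   (TOY_GFP_HIGH)
#define TOY_NOIO     (TOY_GFP_WAIT)
#define TOY_NOFS     (TOY_GFP_WAIT | TOY_GFP_IO)
#define TOY_KERNEL   (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)
#define TOY_HIGHUSER (TOY_KERNEL | TOY_GFP_HIGHMEM)

bool toy_may_sleep(unsigned int mask)  { return (mask & TOY_GFP_WAIT) != 0; }
bool toy_may_do_io(unsigned int mask)  { return (mask & TOY_GFP_IO) != 0; }
bool toy_may_use_fs(unsigned int mask) { return (mask & TOY_GFP_FS) != 0; }

/* Walk the list from most to least restrictive and check each step. */
void toy_check_flag_model(void)
{
    assert(!toy_may_sleep(TOY_ATOMIC));
    assert(toy_may_sleep(TOY_NOIO) && !toy_may_do_io(TOY_NOIO));
    assert(toy_may_do_io(TOY_NOFS) && !toy_may_use_fs(TOY_NOFS));
    assert(toy_may_use_fs(TOY_KERNEL));
    assert((TOY_HIGHUSER & TOY_GFP_HIGHMEM) != 0);
}
```

Code that receives a mask from its caller (an allocation callback, say) can use a test like toy_may_sleep() to decide whether blocking is allowed.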
Page-level allocation
For page-level allocations, the
alloc_pages() and
get_free_page() functions (and variants) exist as always. They
are now defined in
<linux/gfp.h>, however, and there
are a few new ones as well. On NUMA systems, the allocator will do
its best to allocate pages on the same node as the caller. To explicitly
allocate pages on a different NUMA node, use:
struct page *alloc_pages_node(int node_id,
unsigned int gfp_mask,
unsigned int order);
The memory allocator now distinguishes between "hot" and "cold" pages. A
hot page is one that is likely to be represented in the processor's cache;
cold pages, instead, must be fetched from RAM. In general, it is
preferable to use hot pages whenever possible, since they are already
cached. Even if the page is to be overwritten immediately (usually the
case with memory allocations, after all), hot pages are better -
overwriting them will not push some other, perhaps useful, data from the
cache. So alloc_pages() and friends will return hot pages when
they are available.
On occasion, however, a cold page is preferable. In particular, pages
which will be overwritten via a DMA read from a device might as well be
cold, since their cache data will be invalidated anyway. In this sort of
situation, the __GFP_COLD flag should be passed into the
allocation.
Of course, this whole scheme depends on the memory allocator knowing which
pages are likely to be hot. Normally, order-zero allocations (i.e. single
pages) are assumed to be hot. If you know the state of a page you are
freeing, you can tell the allocator explicitly with one of the following:
void free_hot_page(struct page *page);
void free_cold_page(struct page *page);
These functions only work with order-zero allocations; the hot/cold status
of larger blocks is not tracked.
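The hot/cold distinction can be pictured as a free list with two ends. The toy model below runs in user space and is not the kernel's data structure; the toy_ names, the integer "page" identifiers, and the fixed-size array are all assumptions made for the sake of a small example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_PAGES 16

/* pages[0] is the coldest entry, pages[count - 1] the hottest. */
struct toy_pagelist {
    int pages[TOY_MAX_PAGES];
    int count;
};

/* A page freed while probably still cached goes on the hot end. */
bool toy_free_hot_page(struct toy_pagelist *list, int page)
{
    assert(list != NULL);
    if (list->count == TOY_MAX_PAGES)
        return false;
    list->pages[list->count++] = page;
    return true;
}

/* A page known to be out of cache goes on the cold end. */
bool toy_free_cold_page(struct toy_pagelist *list, int page)
{
    assert(list != NULL);
    if (list->count == TOY_MAX_PAGES)
        return false;
    memmove(&list->pages[1], &list->pages[0],
            (size_t)list->count * sizeof(int));
    list->pages[0] = page;
    list->count++;
    return true;
}

/*
 * Hand out the hottest page by default; with want_cold set (the model's
 * stand-in for __GFP_COLD), hand out the coldest instead.
 * Returns -1 when the list is empty.
 */
int toy_alloc_page(struct toy_pagelist *list, bool want_cold)
{
    int page;

    assert(list != NULL);
    if (list->count == 0)
        return -1;
    if (want_cold) {
        page = list->pages[0];
        list->count--;
        memmove(&list->pages[0], &list->pages[1],
                (size_t)list->count * sizeof(int));
    } else {
        page = list->pages[--list->count];
    }
    return page;
}
```

In this model a buffer destined for a DMA read would be requested with want_cold set, leaving the recently freed pages for callers that will touch the memory from the CPU.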
Memory pools
Memory pools were one of the very first changes in the 2.5 series - they
were added to 2.5.1 to support the new block I/O layer. The purpose of
mempools is to help out in situations where a memory allocation must
succeed, but sleeping is not an option. To that end, mempools pre-allocate
a pool of memory and reserve it until it is needed. Mempools make life
easier in some situations, but they should be used with restraint; each
mempool takes a chunk of kernel memory out of circulation and raises the
minimum amount of memory the kernel needs to run effectively.
To work with mempools, your code should include
<linux/mempool.h>. A mempool is created with
mempool_create():
mempool_t *mempool_create(int min_nr,
mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn,
void *pool_data);
Here,
min_nr is the minimum number of pre-allocated objects that
the mempool tries to keep around. The mempool defers the actual allocation
and deallocation of objects to user-supplied routines, which have the
following prototypes:
typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);
The allocation function should take care not to sleep unless
__GFP_WAIT is set in the given gfp_mask. In all of the
above cases, pool_data is a private pointer that may be used by
the allocation and deallocation functions.
Creators of mempools will often want to use the slab allocator to
do the actual object allocation and deallocation. To do that, create the
slab, pass it in to mempool_create() as the pool_data
value, and give mempool_alloc_slab and mempool_free_slab
as the allocation and deallocation functions.
A mempool may be returned to the system by passing it to
mempool_destroy(). You must have returned all items to the pool
before destroying it, or the mempool code will get upset and oops the
system.
Allocating and freeing objects from the mempool is done with:
void *mempool_alloc(mempool_t *pool, int gfp_mask);
void mempool_free(void *element, mempool_t *pool);
mempool_alloc() will first call the pool's allocation function to
satisfy the request; the pre-allocated pool will only be used if the
allocation function fails. The allocation may sleep if the given
gfp_mask allows it; it can also fail if memory is tight and the
preallocated pool has been exhausted.
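The allocation order described above (allocation function first, reserve second) is easy to get backwards, so a small user-space model may help. Everything here is hypothetical: the toy_ names, the malloc-backed allocator, and the "fail" switch exist only for this sketch, and the model leaves out gfp_mask handling, locking, and sleeping.

```c
#include <assert.h>
#include <stdlib.h>

typedef void *(toy_alloc_fn)(void *pool_data);
typedef void (toy_free_fn)(void *element, void *pool_data);

struct toy_mempool {
    int min_nr;             /* size of the reserve */
    int curr_nr;            /* elements currently held in reserve */
    void **elements;
    toy_alloc_fn *alloc;
    toy_free_fn *free;
    void *pool_data;
};

/* Backing allocator with a switch to simulate memory pressure. */
struct toy_backing { size_t size; int fail; };

void *toy_backing_alloc(void *pool_data)
{
    struct toy_backing *b = pool_data;
    return b->fail ? NULL : malloc(b->size);
}

void toy_backing_free(void *element, void *pool_data)
{
    (void)pool_data;
    free(element);
}

void toy_mempool_destroy(struct toy_mempool *pool)
{
    while (pool->curr_nr > 0)
        pool->free(pool->elements[--pool->curr_nr], pool->pool_data);
    free(pool->elements);
    free(pool);
}

struct toy_mempool *toy_mempool_create(int min_nr, toy_alloc_fn *alloc_fn,
                                       toy_free_fn *free_fn, void *pool_data)
{
    struct toy_mempool *pool;

    assert(min_nr > 0);
    pool = calloc(1, sizeof(*pool));
    if (!pool)
        return NULL;
    pool->elements = calloc((size_t)min_nr, sizeof(void *));
    if (!pool->elements) {
        free(pool);
        return NULL;
    }
    pool->min_nr = min_nr;
    pool->alloc = alloc_fn;
    pool->free = free_fn;
    pool->pool_data = pool_data;

    /* Fill the reserve up front. */
    while (pool->curr_nr < min_nr) {
        void *e = alloc_fn(pool_data);
        if (!e) {
            toy_mempool_destroy(pool);
            return NULL;
        }
        pool->elements[pool->curr_nr++] = e;
    }
    return pool;
}

/* Try the allocation function first; dip into the reserve only on failure. */
void *toy_mempool_alloc(struct toy_mempool *pool)
{
    void *e = pool->alloc(pool->pool_data);

    if (e)
        return e;
    if (pool->curr_nr > 0)
        return pool->elements[--pool->curr_nr];
    return NULL;            /* reserve exhausted as well */
}

/* Refill the reserve before giving memory back to the allocator. */
void toy_mempool_free(void *element, struct toy_mempool *pool)
{
    if (pool->curr_nr < pool->min_nr)
        pool->elements[pool->curr_nr++] = element;
    else
        pool->free(element, pool->pool_data);
}
```

With the backing allocator's fail switch set, a pool created with a min_nr of two satisfies exactly two requests before returning NULL, and freeing an element makes one more request possible.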
Finally, a pool can be resized, if necessary, with:
int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);
This function will change the size of the pre-allocated pool, using the
given gfp_mask to allocate more memory if need be. Note that, as
of 2.5.60, mempool_resize() is disabled in the source, since
nobody is actually using it.
The 2.6 kernel makes extensive use of per-CPU data - arrays containing one
object for each processor on the system. Per-CPU variables are not suitable for
every task, but, in situations where they can be used, they do offer a
couple of advantages:
- Per-CPU variables have fewer locking requirements since they are
(normally) only accessed by a single processor. There is nothing
other than convention that keeps processors from digging around in
other processors' per-CPU data, however, so the programmer must remain
aware of what is going on.
- Nothing destroys cache performance as quickly as accessing the same
data from multiple processors. Restricting each processor to its own
area eliminates cache line bouncing and improves performance.
Examples of per-CPU data in the 2.6 kernel include lists of buffer heads,
lists of hot and cold pages, various kernel and networking statistics
(which are occasionally summed together into the full system values), timer
queues, and so on. There are currently no drivers using per-CPU values,
but some applications (e.g. networking statistics for high-bandwidth
adapters) might benefit from their use.
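The statistics case mentioned above can be modeled in user space with one padded counter per processor. This is only a sketch of the idea: the toy_ names, the four-CPU limit, and the 64-byte cache line are assumptions, and the "cpu" argument stands in for whatever processor the code happens to be running on.

```c
#include <assert.h>

#define TOY_NR_CPUS    4
#define TOY_CACHE_LINE 64   /* assumed cache line size */

/* Pad each counter so that no two CPUs share a cache line. */
struct toy_counter {
    unsigned long value;
    char pad[TOY_CACHE_LINE - sizeof(unsigned long)];
};

static struct toy_counter toy_packets[TOY_NR_CPUS];

/* Fast path: each CPU touches only its own slot, so no lock is taken. */
void toy_count_packet(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_packets[cpu].value++;
}

/* Slow path: sum the slots when a system-wide value is wanted. */
unsigned long toy_total_packets(void)
{
    unsigned long total = 0;
    int cpu;

    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        total += toy_packets[cpu].value;
    return total;
}
```

The sum is only approximate while other processors are still counting, which is usually acceptable for statistics.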
The normal way of creating per-CPU variables at compile time is with this
macro (defined in <linux/percpu.h>):
DEFINE_PER_CPU(type, name);
This sort of definition will create name, which will hold one
object of the given type for each processor on the system. If the
variables are to be exported to modules, use:
EXPORT_PER_CPU_SYMBOL(name);
EXPORT_PER_CPU_SYMBOL_GPL(name);
If you need to link to a per-CPU variable defined elsewhere, a similar
macro may be used:
DECLARE_PER_CPU(type, name);
Variables defined in this way are actually an array of values. To get at a
particular processor's value, the per_cpu() macro may be used; it
works as an lvalue, so code like the following works:
DEFINE_PER_CPU(int, mypcint);
per_cpu(mypcint, smp_processor_id()) = 0;
The above code can be dangerous, however. Accessing per-CPU variables can
often be done without locking, since each processor has its own private
area to work in. The 2.6 kernel is preemptible, however, and that adds a
couple of challenges. Since kernel code can be preempted, it is possible
to encounter race conditions with other kernel threads running on the same
processor. Also, accessing a per-CPU variable requires knowing which
processor you are running on; it would not do to be preempted and moved to
a different CPU between looking up the processor ID and accessing a per-CPU
variable.
For both of the above reasons, kernel preemption usually must be disabled when
working with per-CPU data. The usual way of doing this is with the
get_cpu_var and put_cpu_var macros. get_cpu_var
works as an lvalue, so it can be assigned to, have its address taken, etc.
Perhaps the simplest example of the use of these macros can be found in
net/socket.c:
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
Of course, since preemption is disabled between the calls, the code should
take care not to sleep. Note that there is no version of these macros
for access to another CPU's data; cross-processor access to per-CPU data
requires explicit locking arrangements.
It is also possible to allocate per-CPU variables dynamically. Simply use
these functions:
void *alloc_percpu(type);
void free_percpu(const void *);
alloc_percpu() will allocate one object (of the given type) for
each CPU on the system; the allocated storage will be zeroed before being
returned to the caller.
There is another set of macros which may be used to access per-CPU data
obtained with alloc_percpu(). At the lowest level, you may use:
per_cpu_ptr(void *ptr, int cpu)
which returns (without any concurrency control) a pointer to the per-CPU
data for the given cpu. For access to a local processor's data,
with preemption disabled, use:
get_cpu_ptr(ptr)
put_cpu_ptr(ptr)
with the usual proviso that the code must not sleep between the two calls.
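The relationship between the dynamic allocation call, the raw pointer accessor, and the get/put pair can be sketched in user space as follows. None of this is kernel code: the toy_ functions, the contiguous array layout, and the toy_preempt_count and toy_this_cpu variables are stand-ins invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_percpu {
    size_t obj_size;
    int nr_cpus;
    unsigned char *data;    /* nr_cpus objects, zero-filled */
};

static int toy_preempt_count;   /* > 0 models "preemption disabled" */
static int toy_this_cpu;        /* models the current processor ID */

/* One zeroed object per CPU (a simple contiguous array in this model). */
struct toy_percpu *toy_alloc_percpu(size_t obj_size, int nr_cpus)
{
    struct toy_percpu *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    p->data = calloc((size_t)nr_cpus, obj_size);
    if (!p->data) {
        free(p);
        return NULL;
    }
    p->obj_size = obj_size;
    p->nr_cpus = nr_cpus;
    return p;
}

void toy_free_percpu(struct toy_percpu *p)
{
    free(p->data);
    free(p);
}

/* Raw accessor: no concurrency control of any kind. */
void *toy_per_cpu_ptr(struct toy_percpu *p, int cpu)
{
    assert(cpu >= 0 && cpu < p->nr_cpus);
    return p->data + (size_t)cpu * p->obj_size;
}

/* Pin the caller to its CPU, then return that CPU's object. */
void *toy_get_cpu_ptr(struct toy_percpu *p)
{
    toy_preempt_count++;
    return toy_per_cpu_ptr(p, toy_this_cpu);
}

/* Every get must be balanced by a put. */
void toy_put_cpu_ptr(struct toy_percpu *p)
{
    (void)p;
    assert(toy_preempt_count > 0);
    toy_preempt_count--;
}
```

The point of the get/put pairing is that the processor ID lookup and the data access happen inside the same no-preemption window.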
Page editor: Jonathan Corbet