Kernel release status
The current 2.6 prepatch remains 2.6.9-rc1; no new prepatches have
been released since August 24.
The flow of patches into Linus's BitKeeper repository continues, however,
and a new prepatch could come out at any time. That repository now
contains the removal of the ancient, unused "busmouse" driver,
infrastructure for cluster-wide file locking, a number of DRM subsystem
cleanups, the out-of-line spinlock patch,
AMD dual-core support, more filesystem conversions to the new
symbolic link resolution code (which will eventually allow an increase in
the maximum link depth), a new waitid() system call implementing
the POSIX call by the same name, a "fake NUMA" mode for x86-64 testing, a
small-footprint tmpfs implementation, the base KProbes patch, a
set of IDE updates, support for scheduler profiling (seeing where context
switches come from), automatic TCP window scaling calculation, a kobject
change (it uses kref now), a USB gadget interface update with "On The Go"
support, a big ALSA update, the removal of the Philips webcam driver,
numerous network driver updates, some random number generator fixes, a fix
for the audio CD writing memory leak, some VFS interface improvements,
executable support in hugetlb mappings, the Whirlpool digest algorithm,
some virtual memory tweaks, a number of asynchronous I/O fixes and
improvements, a User-mode Linux update, the "flex mmap" user-space memory
layout (covered here last
June), a number of scheduler tweaks, the removal of the very last
suser() call, and lots of fixes.
The current tree from Andrew Morton is 2.6.9-rc1-mm4. Recent changes to -mm
include CacheFS (covered here last week),
the removal of lockmeter (it got broken by the out-of-line spinlock patch),
special code for handling misrouted interrupts on x86 systems, the new
sysfs event layer patch (see below), and M32R architecture support.
The current 2.4 prepatch remains 2.4.28-pre2; no prepatches have
been released since August 25.
Kernel development news
Figuring out kernel event reporting
Robert Love's kernel event notification patch was covered here
last July. This patch enables
the reporting of events to interested user-space software, which can then
communicate with the user and generally respond to the events. As the
Linux desktop projects become more capable and all-encompassing, they need
to know more about what is going on with the system; the events layer is
meant to be the mechanism which makes that information available.
Robert has recently posted a new
version of the patch which changes the proposed interface
significantly. It looks, however, like the patch will change yet again.
As it turns out, there is still a fair amount of uncertainty about how best
to represent and report kernel events.
The initial version of the patch required four pieces of information for
each event: the type (a general class, like "hotplug"), the object
generating the event, the signal (saying what is happening), and an
explanatory string. The new version eliminates the descriptive string, and
turns the object into a proper kobject, which will be communicated to user
space as its location in sysfs. This interface is simpler, and it solves
the problem of how to generate predictable and consistent object names, but there are still
questions on how events should really be represented.
The easier part of the discussion has to do with the "type" parameter,
which allows user-space applications to filter out events which will not be
of interest. Kernel-generated events are expected to be relatively rare,
however, so there will be little cost in simply receiving all of them and
ignoring the uninteresting ones. So the type value associated with events
may go away.
The more interesting question has to do with the representation of the
"signal" parameter. That signal is currently a verb, describing something
which has happened with the object of interest. If the object is a CPU,
the signal might be "overheating". An alternative implementation
would be to replace the signal with an attribute of the object; for a
processor event, the temperature attribute would be passed. User
space would then read the value of that attribute in sysfs to figure out
what is really going on. This approach would force a structure onto the
signal names, and would point user space to where it needs to go to learn
more about what is going on. On the other hand, there may not always be
attributes available to describe a given event, and the approach could be
seen as overly restrictive.
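To make the attribute-based alternative concrete, here is a minimal sketch of what a user-space listener might do on receiving such an event: combine the object's sysfs location with the attribute name to find the file to read. The structure, the function name, and the example paths are all hypothetical; nothing here comes from the actual patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical event record, for illustration only: the kobject's
 * location in sysfs plus the name of the attribute that changed. */
struct toy_event {
    const char *object;     /* sysfs-relative path of the kobject */
    const char *attribute;  /* e.g. "temperature" */
};

/* Build the path user space would read to learn what happened.
 * Returns 0 on success, -1 if the name is empty or buf is too small. */
int event_attr_path(const struct toy_event *ev, const char *sysfs_root,
                    char *buf, size_t len)
{
    int n;

    assert(ev && sysfs_root && buf);
    if (strlen(ev->attribute) == 0)
        return -1;

    n = snprintf(buf, len, "%s%s/%s", sysfs_root, ev->object, ev->attribute);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a verb-style signal such as "overheating", no such path exists; the listener has to know in advance what each verb means. That difference is the trade-off being debated.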
Meanwhile, Greg Kroah-Hartman pointed out
that the simplified send_kevent() interface strongly resembles
another, existing kernel interface:
    int send_kevent(struct kobject *kobj, const char *signal);
    void kobject_hotplug(const char *action, struct kobject *kobj);
Given that kobject_hotplug() is also an event reporting mechanism,
why not unify the two? The big difference, at this point, would seem to be
that send_kevent() uses the netlink interface to communicate with
user space, while the hotplug code runs /sbin/hotplug and passes
the relevant information via the environment. Perhaps the best thing to
do, says Greg, is to have the hotplug code also send a copy of its events
via netlink, and to use netlink for everything.
The idea of sending the same events out by way of two different transports
does not appeal to many developers, however; it seems better to decide
which is best and go with it. The netlink transport is strongly favored by
the desktop crowd, which dislikes the unpredictable delays and ordering
associated with event handling via /sbin/hotplug. On the other
hand, netlink is not available early in the boot process, but it is
important to be able to handle hotplug events then.
In the end, the hybrid approach may persist for some time. A future system
might use /sbin/hotplug at boot time, then turn it off once
everything is up and running. The one sure conclusion is that this is an
area in need of further thought and experimentation.
NETIF_F_LLTX
One of the key network driver methods is called hard_start_xmit();
its job is to put a network packet onto the wire (or, at least, queue it
for transmission). The networking subsystem protects calls to this method
with a lock (xmit_lock) in the net_device structure so
that only one call will be happening at any given time. This lock also
protects a few configuration operations.
As it turns out, quite a few network drivers implement their own locking
internally as well. There are contexts (such as in interrupt handlers)
where the xmit_lock will not be held, so some other provision must
be made for mutual exclusion. So the hard_start_xmit() method, in
those drivers, is called with a redundant lock held. It all works, but it
adds overhead to a performance-critical path.
Andi Kleen has put together a patch which
addresses this duplicate locking. With this patch (which appears likely to
be merged), drivers which do their own transmit locking can set the
NETIF_F_LLTX "feature" flag. When a packet is to be handed to an
interface with that flag set, no additional locking is performed by the
networking code. As an added feature, the driver can attempt to take its
internal lock with spin_trylock(), and immediately return
-1 if that attempt fails; the networking subsystem will then retry
the transmission later. In this way, the driver can avoid stalling the CPU
while waiting for the lock; there should be, after all, no slowdown if the
packet is added to the transmission ring a little bit later.
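The trylock-and-retry pattern can be sketched in user space. What follows is a toy model, not driver code: the structure, the function names, and the return-code macros are invented for illustration, and a C11 atomic_flag stands in for the driver's internal spinlock. Only the idea of returning -1 when the lock cannot be taken comes from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE    4
#define XMIT_OK      0      /* packet queued */
#define XMIT_BUSY    1      /* ring full (hypothetical code) */
#define XMIT_LOCKED (-1)    /* lock contended; caller should retry later */

/* Toy device: a tiny transmit ring guarded by the driver's own lock. */
struct toy_dev {
    atomic_flag tx_lock;
    int ring[RING_SIZE];
    unsigned int ring_used;
};

void toy_dev_init(struct toy_dev *dev)
{
    atomic_flag_clear(&dev->tx_lock);
    dev->ring_used = 0;
}

/* Returns true if the lock was obtained; never spins. */
bool toy_trylock(struct toy_dev *dev)
{
    return !atomic_flag_test_and_set_explicit(&dev->tx_lock,
                                              memory_order_acquire);
}

void toy_unlock(struct toy_dev *dev)
{
    atomic_flag_clear_explicit(&dev->tx_lock, memory_order_release);
}

/* Stand-in for a lockless-transmit hard_start_xmit(): no outer lock is
 * assumed to be held, so the function takes its own, without waiting. */
int toy_hard_start_xmit(struct toy_dev *dev, int packet)
{
    assert(dev);
    if (!toy_trylock(dev))
        return XMIT_LOCKED;

    if (dev->ring_used == RING_SIZE) {
        toy_unlock(dev);
        return XMIT_BUSY;
    }
    dev->ring[dev->ring_used++] = packet;
    toy_unlock(dev);
    return XMIT_OK;
}
```

A caller that sees XMIT_LOCKED simply holds onto the packet and tries again later, which is the role the networking subsystem plays for drivers that set the flag.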
Kswapd and high-order allocations
The core memory allocation mechanism inside the kernel is page-based; it
will attempt to find a certain number of contiguous pages in response to a
request (where "a certain number" is always a power of two). After the
system has been running for a while, however, "higher-order" allocations
requiring multiple contiguous pages become hard to satisfy. The virtual
memory subsystem fragments physical memory to the point that the free pages
tend to be separated from each other.
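The power-of-two rule means that a request is rounded up to the next "order". As a rough illustration, the helper below computes the smallest order whose block covers a given size. The function name is made up for this sketch (it is not the kernel's own helper), and the page size is passed in rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size.
 * Illustrative only; very large sizes could overflow the shift. */
unsigned int order_for_size(size_t size, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    assert(page_size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

With 4096-byte pages, an 8192-byte request is order 1 (two pages), while a 9000-byte request needs three pages and is therefore rounded up to order 2 (four pages).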
Curious readers can query /proc/buddyinfo to see how fragmented
the currently free pages are. On a 1GB system, your editor currently sees the
following:
Node 0, zone Normal 258 9 5 0 1 2 0 1 1 0 0
On this system, 258 single pages could be allocated immediately, but only
nine contiguous pairs exist, and only five groups of four pages can be found.
If something comes along which needs a lot of higher-order allocations, the
available memory will be exhausted quickly, and those allocations may start
to fail.
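For those who would rather not do the arithmetic by hand, here is a small sketch which parses one line in the format shown above and totals the free pages it represents. Each column counts free blocks of 2^order pages, starting at order zero. The function names and the MAX_ORDERS limit are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDERS 16   /* arbitrary upper bound for this sketch */

/* Parse one /proc/buddyinfo line into per-order free-block counts.
 * Returns the number of counts read, or -1 if the prefix is malformed. */
int parse_buddyinfo_line(const char *line, unsigned long counts[MAX_ORDERS])
{
    int node, consumed = 0, n = 0;
    char zone[32];
    const char *p;

    assert(line && counts);
    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return -1;

    p = line + consumed;
    while (n < MAX_ORDERS) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)       /* no more numbers on the line */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* Total free pages: a block of order i holds 2^i pages. */
unsigned long total_free_pages(const unsigned long *counts, int n)
{
    unsigned long total = 0;

    for (int i = 0; i < n; i++)
        total += counts[i] << i;
    return total;
}
```

Run against the line above, this yields eleven columns and 760 free pages in total, of which only 258 sit on the single-page list.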
Nick Piggin has recently looked at this
issue and found one area where improvements can be made. The problem
is with the kswapd process, which is charged with running in the
background and making free pages available to the memory allocator (by
evicting user pages). The current kswapd code only looks at the
number of free pages available; if that number is high enough,
kswapd takes a rest regardless of whether any of those pages are
contiguous with others or not. That can lead to a situation where
high-order allocations fail, but the system is not making any particular
effort to free more contiguous pages.
Nick's patch is fairly straightforward; it simply keeps kswapd
from resting until a sufficient number of higher-order allocations are
possible.
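The change in kswapd's stopping condition might be modeled as follows. This is a sketch of the idea only, not Nick's patch: the function names, the thresholds, and the choice to count larger free blocks as splittable into blocks of the requested order are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* counts[i] is the number of free blocks of 2^i pages. */
unsigned long free_pages(const unsigned long *counts, int n)
{
    unsigned long total = 0;

    for (int i = 0; i < n; i++)
        total += counts[i] << i;
    return total;
}

/* Blocks of the given order that could be handed out, assuming any
 * larger free block can be split down to that order. */
unsigned long blocks_possible(const unsigned long *counts, int n, int order)
{
    unsigned long total = 0;

    assert(order >= 0);
    for (int i = order; i < n; i++)
        total += counts[i] << (i - order);
    return total;
}

/* Old behavior: only the raw number of free pages matters. */
bool may_rest_old(const unsigned long *counts, int n, unsigned long min_free)
{
    return free_pages(counts, n) >= min_free;
}

/* New behavior (as modeled here): also require enough blocks of the
 * order that somebody is actually asking for. */
bool may_rest_new(const unsigned long *counts, int n, unsigned long min_free,
                  int order, unsigned long min_blocks)
{
    return may_rest_old(counts, n, min_free) &&
           blocks_possible(counts, n, order) >= min_blocks;
}
```

Fed the /proc/buddyinfo numbers shown earlier with a (made-up) target of 512 free pages, the old test is satisfied by the 760 free pages, while the new test with a requirement of 64 order-3 blocks is not, since only 58 can be formed.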
It has been pointed out, however, that the approach used by kswapd
has not really changed: it chooses pages to free without
regard to whether those pages can be coalesced into larger groups or not.
As a result, it may have to free a great many pages before it, by chance,
creates some higher-order groupings of pages. In prior kernels, no better
approach was possible, but 2.6 includes the reverse-mapping code. With
reverse mapping, it should be possible to target contiguous pages for
freeing and vastly improve the system's performance in that area.
Linus's objection to this idea is that it
overrides the current page replacement policy, which does its best to evict
pages which, with luck, will not be needed in the near future. Changing
the policy to target contiguous blocks would make higher-order allocations
easier, but it could also penalize system performance as a whole by
throwing out useful pages. So, says Linus, if a "defragmentation" mode is
to be implemented at all, it should be run rarely and as a separate
process.
The other approach to this problem is to simply avoid higher-order
allocations in the first place. The switch to 4K kernel stacks was a step
in this direction; it eliminated a two-page allocation for every process
created. In current kernels, one of the biggest users of high-order
allocations would appear to be high-performance network adapter drivers.
These adapters can handle large packets which do not fit in a single page,
so the kernel must perform multi-page allocations to hold those packets.
Actually, those allocations are only required when the driver (and its
hardware) cannot handle "nonlinear" packets which are spread out in
memory. Most modern hardware can do scatter/gather DMA operations, and
thus does not care whether the packet is stored in a single, contiguous
area of memory. Using the hardware's scatter/gather capabilities requires
additional work when writing the driver, however, and, for a number of
drivers, that work has not yet been done. Even so, addressing the high-order
allocation problem from the demand side may prove to be far more effective
than adding another objective to the page reclaim code.
Page editor: Jonathan Corbet