Brief items
The current 2.6 prepatch remains 2.6.11-rc2.
Linus's BitKeeper repository, which looks like it is heading for a
2.6.11-rc3 release before too long, contains an XFS update, a set of
out-of-memory killer fixes, a generic transport class mechanism (which
replaces the SCSI transport code), some architecture updates, the removal
of bcopy(), a fix for writable module parameters in sysfs (it never
actually worked before), and various fixes.
The current -mm release is 2.6.11-rc2-mm2.
Recent changes to -mm include the unexporting of register_cpu()
and unregister_cpu(), an InfiniBand update, a tool for tracking
page-level memory leaks (see below), the addition of the unprivileged
realtime scheduling rlimit code (covered here last week; this code replaces the
SCHED_ISO patch), and a fair number of fixes.
The current 2.4 kernel remains 2.4.29; the 2.4.30 process has not
yet begun.
Kernel development news
We argued that the owner of a Digital Audio Workstation should be
free to lock up his CPU any time he wants. But, no one would
listen. We were told that we didn't really know what we needed,
and were asking the wrong question. That was very discouraging.
It looked like LKML was going to ignore our needs for yet another
year.
-- Jack O'Quin, finding the process long and frustrating.
The Linux acceptance process is not about "whose patch sucks
least", but whether it hits a subsystem-specific bar of
architectural requirements or not.... We'll rather live on with
one less feature for another year than with a crappy feature that
is twice as hard to get rid of!
-- Ingo Molnar explains that process.
Whoever's responsible, prepare to be flamed to a crisp the likes of
which has never been witnessed before by observers of solar probes, nor
conceived of by the most visionary and imaginative of eschatologists.
-- William Lee Irwin. I'd stand back if I were you.
Network drivers must provide a function (hard_start_xmit()) for the
networking layer to call whenever it decides the time has come to send
out a packet. Normally, calls to hard_start_xmit() are serialized with a
spinlock (xmit_lock) in the net_device structure.
In this way, the networking subsystem guarantees that it will not attempt
to send multiple packets simultaneously on the same interface.
This method works, but it is not quite ideal, especially for
high-performance network adaptors. Most drivers already implement
their own internal locking, rendering xmit_lock redundant. The
xmit_lock can also cause a certain amount of cache line bouncing
on SMP systems with a lot of networking traffic. To work around these
problems, the NETIF_F_LLTX "feature" flag was added in 2.6.9. If
a driver sets NETIF_F_LLTX on its interface, it is declaring that
it performs its own locking, and its hard_start_xmit() function
will be called without the xmit_lock held.
All seemed well for a while, but, back in December, Roland Dreier noticed a problem. When a network driver
notices that an interface's transmit buffers are too full to accept any
more packets, it calls netif_stop_queue() to inform the networking
layer. Its hard_start_xmit() method should then not be called
until the driver (with a call to netif_wake_queue()) indicates
that new packets can once again be accepted. Network drivers thus can
count on not being asked to transmit packets when they have stopped the
queue.
Unless, as it turns out, they have set NETIF_F_LLTX. The lack of
transmit locking in the networking layer itself leads to a situation where
hard_start_xmit() can be called simultaneously on multiple
processors; hard_start_xmit() is supposed to handle that situation
with its own locking. But, if one hard_start_xmit() call fills
the transmit buffer and stops the queue, the second call will proceed in a
state it had not expected: it has a packet to transmit but no place to put
it. In most cases, this race leads to a strange error message in the
system logs. In a poorly-written driver, worse things could happen.
Roland's initial problem report included a patch which silenced the log
message. The networking hackers did not like
that solution, however; they feared that it could hide serious
(unrelated) bugs. So they set out to come up with a better solution. The
result was a lengthy patch which made some significant changes to how
network driver locking works. Uses of xmit_lock were changed to
disable interrupts, so that lock could be used in interrupt handlers as
well. Drivers could then use xmit_lock (rather than their own
lock) for internal locking. The NETIF_F_LLTX flag was redefined
to indicate that the transmit routine was completely lockless, a condition
which only applies to certain types of software device. The end result was
most of the advantages of NETIF_F_LLTX but with the race condition
solved. A version of this patch was merged as part of 2.6.11-rc2.
Unfortunately, there were some difficulties. The locking changes led to
deadlocks in certain situations where the driver would try to grab a lock
already held by the networking code which called it. Network drivers had
to be careful not to do anything (such as spin_unlock_irq()) which
would enable interrupts while xmit_lock was held.
dev_kfree_skb() could no longer be called in any place where
xmit_lock was held, since its use is not legal when interrupts are
disabled. Overall, there were enough problems with this approach that the
patch was backed out after the -rc2 release, and the developers started
over.
The current approach, as proposed by David
Miller, is to leave things as they are and silence the log message. The
patch has been tweaked a bit since first proposed by Roland in December; it
now tries to distinguish the NETIF_F_LLTX race from other (more
serious) calls to hard_start_xmit() with the transmit buffer
full. This is done by checking to see if the queue has been stopped; if
so, it is a harmless race and transmission of the packet is silently
deferred. If the queue is still running, however, then something has gone
wrong somewhere. This change must be made in all drivers which use
NETIF_F_LLTX - a relatively small set. It's a small change, but
it is a change in the rules for network drivers and worth being aware of.
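The check described above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not code from the actual patch: struct model_dev, model_hard_start_xmit(), and the XMIT_* values are hypothetical names, the boolean stands in for the state managed by netif_stop_queue() and netif_wake_queue(), and the driver's private lock is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for driver state; not real kernel API. */
enum xmit_result { XMIT_OK, XMIT_BUSY };

struct model_dev {
    int  tx_free;        /* free slots in the transmit ring */
    bool queue_stopped;  /* models netif_stop_queue()/netif_wake_queue() */
    int  bug_reports;    /* times the "ring full" error was logged */
};

/* Model of an LLTX transmit routine; a real driver would hold its own lock. */
static enum xmit_result model_hard_start_xmit(struct model_dev *dev)
{
    assert(dev->tx_free >= 0);

    if (dev->tx_free == 0) {
        if (!dev->queue_stopped) {
            /* Ring full while the queue is running: something is wrong. */
            dev->queue_stopped = true;
            dev->bug_reports++;
            fprintf(stderr, "model: tx ring full while queue awake\n");
        }
        /* Queue already stopped: harmless race, defer the packet quietly. */
        return XMIT_BUSY;
    }

    dev->tx_free--;
    if (dev->tx_free == 0)
        dev->queue_stopped = true;   /* no room left, stop the queue */
    return XMIT_OK;
}
```

A second call arriving after the ring has filled and the queue has stopped returns XMIT_BUSY without logging anything; only a full ring with a running queue is reported.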
A number of developers have taken a stab at the problem of memory
fragmentation and the allocation of large, contiguous blocks of memory in
the kernel. Approaches covered on this page recently include Marcelo
Tosatti's
active defragmentation patch and
Nick Piggin's
kswapd improvements. Now Mel
Gorman has jumped into the fray with a different take on the problem.
At a very high level, the kernel organizes free pages as shown in the
diagram below.
The system's physical memory is split into zones; on
x86 systems, the zones include the small space reachable by ISA devices
(ZONE_DMA), the regular memory zone (ZONE_NORMAL), and
memory not directly accessible by the kernel (ZONE_HIGHMEM). NUMA
systems divide things further by creating zones for each node. Within each
zone, memory is split into chunks and sorted depending on its "order" - the base-2
logarithm of the size of each block. For each order, there is a linked list
of available blocks of that size. So, at the bottom of the array, the
order-0 list contains individual pages; the order-1 list has pairs of
pages, etc., up to the maximum order handled by the system. When a request
for an allocation of a given order arrives, a block is taken off the
appropriate list. If no blocks of that size are available, a larger block
is split. When blocks are freed, the buddy allocator tries to coalesce
them with neighboring blocks to recreate higher-order chunks.
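The split-on-allocation behavior can be illustrated with a toy model that only counts free blocks per order. It is a sketch, not the kernel's implementation: the names (model_free_area, model_order_for(), model_alloc()) and the small MODEL_MAX_ORDER are assumptions made for the example, and since block addresses are not tracked, coalescing on free is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 4   /* orders 0..3 here; an arbitrary small value */

/* nr_free[o] counts free blocks of 2^o pages; addresses are not modeled. */
struct model_free_area {
    unsigned int nr_free[MODEL_MAX_ORDER];
};

/* Smallest order whose block (2^order pages) holds at least `pages` pages. */
static unsigned int model_order_for(unsigned int pages)
{
    unsigned int order = 0;

    while ((1u << order) < pages)
        order++;
    return order;
}

/* Take one block of `order`, splitting a larger block when necessary. */
static bool model_alloc(struct model_free_area *fa, unsigned int order)
{
    unsigned int o;

    assert(order < MODEL_MAX_ORDER);

    /* Find the smallest non-empty list at or above the requested order. */
    for (o = order; o < MODEL_MAX_ORDER; o++)
        if (fa->nr_free[o] > 0)
            break;
    if (o == MODEL_MAX_ORDER)
        return false;               /* nothing large enough is free */

    fa->nr_free[o]--;
    while (o > order) {
        o--;
        fa->nr_free[o]++;           /* unused half goes on the lower list */
    }
    return true;
}
```

Starting from a single order-3 block, one order-0 allocation leaves one free block on each of the order-0, order-1, and order-2 lists, which is how larger blocks get used up over time.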
In real-life Linux systems, over time, the larger blocks tend to get split
up, to the point that larger allocations can become difficult. A look at
/proc/buddyinfo on a running system will tend to show quite a few
zero-order pages available (one hopes), but relatively few larger blocks.
For this reason, high-order allocations have a high probability of failure
on a system which has been up for a while.
Mel's approach is to split memory allocations into three types, as
indicated by a new set of GFP_ flags which can be provided when
memory is requested. Memory allocations marked by __GFP_USERRCLM
are understood to be for user space, and to be easily reclaimable. In
general, all that's required to reclaim a user-space page is to write it to
backing store (if it has been modified). The __GFP_KERNRCLM flag
marks reclaimable kernel memory, such as that obtained from slabs and used
in caches which can, when needed, be dropped. Finally, allocations not
otherwise marked are considered to not be reclaimable in any easy way.
Then, the buddy allocator's data structures are expanded to look something
like this:
When the allocator is initialized, and all that nice, virgin memory is
still unfragmented, the free_area_global field points to a long
list of maximally-sized blocks of memory. The three free_area
arrays - one for each type of allocation - are initially empty. Each
allocation request, when it arrives, will be satisfied from the associated
free_area array if possible; otherwise, one of the
MAX_ORDER blocks from free_area_global will be split up.
The portion of that block which is not allocated will be placed in the
array associated with the current memory allocation type.
When memory is freed and blocks are coalesced, they remain within the
type-specific array until they reach the largest size, at which point they
go back onto the global array.
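A rough model of the allocation side of this scheme follows. It is a sketch under several assumptions, not Mel's code: the flag bit values, the model_* names, and the tiny top order are invented for illustration; blocks are only counted, so freeing and coalescing are omitted; and any fallback the real patch may apply when the global list is empty is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_TOP_ORDER 3   /* order of the blocks on the global list */

/* Hypothetical bit values; the real flag values are not given here. */
#define MODEL_GFP_USERRCLM 0x1u
#define MODEL_GFP_KERNRCLM 0x2u

enum model_type { MODEL_USERRCLM, MODEL_KERNRCLM, MODEL_NORCLM, MODEL_NR_TYPES };

struct model_zone {
    unsigned int global_max_blocks;  /* models free_area_global */
    /* One free_area array per allocation type, orders 0..TOP-1. */
    unsigned int nr_free[MODEL_NR_TYPES][MODEL_TOP_ORDER];
};

/* Unmarked allocations are treated as not easily reclaimable. */
static enum model_type model_type_of(unsigned int flags)
{
    if (flags & MODEL_GFP_USERRCLM)
        return MODEL_USERRCLM;
    if (flags & MODEL_GFP_KERNRCLM)
        return MODEL_KERNRCLM;
    return MODEL_NORCLM;
}

static bool model_alloc(struct model_zone *z, unsigned int flags,
                        unsigned int order)
{
    enum model_type t = model_type_of(flags);
    unsigned int o;

    assert(order <= MODEL_TOP_ORDER);

    /* First try the array belonging to this allocation type. */
    for (o = order; o < MODEL_TOP_ORDER; o++)
        if (z->nr_free[t][o] > 0)
            break;

    if (o == MODEL_TOP_ORDER) {
        /* Nothing suitable: split a maximally-sized global block. */
        if (z->global_max_blocks == 0)
            return false;
        z->global_max_blocks--;
    } else {
        z->nr_free[t][o]--;
    }

    /* Leftover pieces stay in the array for the same type. */
    while (o > order) {
        o--;
        z->nr_free[t][o]++;
    }
    return true;
}
```

In this model the remainder of a split global block is never shared between types, so a non-reclaimable allocation cannot land in the middle of a block that is otherwise full of user pages.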
One immediate benefit from this organization is that the pages which are
hardest to get back - those in the "kernel non-reclaimable" category - are
grouped together into their own blocks. A single pinned page can prevent
the coalescing of a large block, so segregating the difficult kernel pages
makes the management of the rest of memory easier. Beyond that, this
organization makes it possible to perform active page freeing. If a
high-order request cannot be satisfied, simply start with a smaller block
and free up the neighboring pages. Active freeing is not yet implemented in
Mel's current patch, however.
Even without the active component, this patch helps the kernel to satisfy
large allocations. Mel gives results from a memory-thrashing test he ran;
with a vanilla kernel, only three out of 160 attempted order-10 allocations
were successful. With a patched kernel, instead, 81 attempts succeeded.
So the new allocation technique and data structures do help the situation.
What happens next remains to be seen, however; there seems to be a big
hurdle to overcome when trying to get high-order allocation patches
merged.
If you look far enough into the
2.6.11-rc2-mm2
announcement, you'll find a mention of a "page owner tracking leak
detector" patch. The addition of this patch was almost certainly motivated
by the series of memory leak problems which have afflicted the 2.6.11
prepatches. It is a heavy-handed tool, but, for some situations, it might
make the problem of finding memory leaks far easier.
Essentially, this patch causes the kernel to keep track of the call chain that
leads to the allocation of every page. This information is made available
via /proc/page_owner; it looks something like this:
Page allocated via order 0
[0xc0146f01] kmem_getpages+49
[0xc014846d] cache_grow+173
[0xc0148aac] cache_alloc_refill+460
[0xc0118a8f] copy_files+431
[0xc0148ff5] kmem_cache_alloc+149
[0xc011986b] copy_process+3051
[0xc01199d1] fork_idle+65
[0xc041824a] do_boot_cpu+42
Your editor's 256MB sacrificial kernel box has, after a short period of run
time, over 13,000 such entries. So plowing through the raw data is
probably not what most people want to do. To help out, a small program (page_owner.c) has been put into the
Documentation directory (though one might argue that it should be
in scripts instead). This program boils down the contents of
/proc/page_owner to something which looks like this:
856 times:
Page allocated via order 0
[0xc0146572] __do_page_cache_readahead+290
[0xc0146a70] max_sane_readahead+48
[0xc0140166] filemap_nopage+790
[0xc013fe50] filemap_nopage+0
[0xc0150861] do_no_page+193
[0xc0150cc6] handle_mm_fault+246
[0xc01126cc] do_page_fault+492
[0xc0151b3c] remove_vm_struct+140
839 times:
Page allocated via order 0
[0xc0146572] __do_page_cache_readahead+290
[0xc0146a70] max_sane_readahead+48
[0xc0140166] filemap_nopage+790
[0xc013fe50] filemap_nopage+0
[0xc0150861] do_no_page+193
[0xc0150cc6] handle_mm_fault+246
[0xc01126cc] do_page_fault+492
[0xc013c207] ltt_log_event+71
With this output, finding the source of a major memory leak should be
relatively straightforward. It's worth noting that this program fails if
told to read directly from /proc/page_owner (it does a
stat() to determine the size of its input), so you must copy the
data to a regular file first. This patch is also a major memory consumer
in its own right, since it must store the call chain information for every
allocated page. It's thus not something most people would put onto a
production system - or even on most development systems. But it can be a
useful thing to have around when a page-level memory leak bites.
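The boiling-down step amounts to grouping identical call chains and counting them. The following sketch shows one way that could be done once the records have been split into strings; it is not taken from page_owner.c, which may work differently, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparison callback for an array of C strings. */
static int cmp_records(const void *a, const void *b)
{
    const char *const *ra = a;
    const char *const *rb = b;

    return strcmp(*ra, *rb);
}

/*
 * Sort the records in place so identical call chains become adjacent,
 * then scan for the longest run. Returns the most frequent record (or
 * NULL if n is 0) and stores its number of occurrences in *count.
 */
static const char *most_common_record(const char **recs, size_t n,
                                      size_t *count)
{
    const char *best = NULL;
    size_t best_n = 0;
    size_t i = 0;

    assert(count != NULL);

    if (n > 0)
        qsort(recs, n, sizeof(*recs), cmp_records);

    while (i < n) {
        size_t j = i + 1;

        /* Advance j past every record identical to recs[i]. */
        while (j < n && strcmp(recs[i], recs[j]) == 0)
            j++;
        if (j - i > best_n) {
            best_n = j - i;
            best = recs[i];
        }
        i = j;
    }

    *count = best_n;
    return best;
}
```

A tool built this way would want to read its input until end-of-file rather than trusting a size obtained from stat(), which is what trips up reading /proc/page_owner directly.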
Patches and updates
Kernel trees
Core kernel code
- Shailabh Nagar: ckrm-e17.
(January 28, 2005)
Development tools
Device drivers
- Dave Airlie: drm tree.
(February 1, 2005)
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Miscellaneous
Page editor: Jonathan Corbet