Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.11-rc2.

Linus's BitKeeper repository, which looks like it is heading for a 2.6.11-rc3 release before too long, contains an XFS update, a set of out-of-memory killer fixes, a generic transport class mechanism (which replaces the SCSI transport code), some architecture updates, the removal of bcopy(), a fix for writable module parameters in sysfs (it never actually worked before), and various fixes.

The current -mm release is 2.6.11-rc2-mm2. Recent changes to -mm include the unexporting of register_cpu() and unregister_cpu(), an InfiniBand update, a tool for tracking page-level memory leaks (see below), the addition of the unprivileged realtime scheduling rlimit code (covered here last week; this code replaces the SCHED_ISO patch), and a fair number of fixes.

The current 2.4 kernel remains 2.4.29; the 2.4.30 process has not yet begun.

Kernel development news

Quotes of the week

We argued that the owner of a Digital Audio Workstation should be free to lock up his CPU any time he wants. But, no one would listen. We were told that we didn't really know what we needed, and were asking the wrong question. That was very discouraging. It looked like LKML was going to ignore our needs for yet another year.
-- Jack O'Quin, finding the process long and frustrating.

The Linux acceptance process is not about "whose patch sucks least", but whether it hits a subsystem-specific bar of architectural requirements or not.... We'll rather live on with one less feature for another year than with a crappy feature that is twice as hard to get rid of!
-- Ingo Molnar explains that process.

Whoever's responsible, prepare to be flamed to a crisp the likes of which has never been witnessed before by observers of solar probes, nor conceived of by the most visionary and imaginative of eschatologists.
-- William Lee Irwin. I'd stand back if I were you.

NETIF_F_LLTX and race conditions

Network drivers must provide a function (hard_start_xmit()) for the networking layer to call whenever it decides the time has come to send out a packet. Normally, calls to hard_start_xmit() are serialized with a spinlock (xmit_lock) in the net_device structure. In this way, the networking subsystem guarantees that it will not attempt to send multiple packets simultaneously on the same interface.

This method works, but it is not quite ideal, especially for high-performance network adaptors. Most drivers already implement their own internal locking, rendering xmit_lock redundant. The xmit_lock can also cause a certain amount of cache line bouncing on SMP systems with a lot of networking traffic. To work around these problems, the NETIF_F_LLTX "feature" flag was added in 2.6.9. If a driver sets NETIF_F_LLTX on its interface, it is declaring that it performs its own locking, and its hard_start_xmit() function will be called without the xmit_lock held.

All seemed well for a while, but, back in December, Roland Dreier noticed a problem. When a network driver notices that an interface's transmit buffers are too full to accept any more packets, it calls netif_stop_queue() to inform the networking layer. Its hard_start_xmit() method should then not be called until the driver (with a call to netif_wake_queue()) indicates that new packets can, once again, be accepted. Network drivers thus can count on not being asked to transmit packets when they have stopped the queue.

Unless, as it turns out, they have set NETIF_F_LLTX. The lack of transmit locking in the networking layer itself leads to a situation where hard_start_xmit() can be called simultaneously on multiple processors; hard_start_xmit() is supposed to handle that situation with its own locking. But, if one hard_start_xmit() call fills the transmit buffer and stops the queue, the second call will proceed in a state it had not expected: it has a packet to transmit but no place to put it. In most cases, this race leads to a strange error message in the system logs. In a poorly-written driver, worse things could happen.

Roland's initial problem report included a patch which silenced the log message. The networking hackers did not like that solution, however; they feared that it could hide serious (unrelated) bugs. So they set out to come up with a better solution. The result was a lengthy patch which made some significant changes to how network driver locking works. Uses of xmit_lock were changed to disable interrupts, so that lock could be used in interrupt handlers as well. Drivers could then use xmit_lock (rather than their own lock) for internal locking. The NETIF_F_LLTX flag was redefined to indicate that the transmit routine was completely lockless, a condition which only applies to certain types of software device. The end result was most of the advantages of NETIF_F_LLTX but with the race condition solved. A version of this patch was merged as part of 2.6.11-rc2.

Unfortunately, there were some difficulties. The locking changes led to deadlocks in certain situations where the driver would try to grab a lock already held by the networking code which called it. Network drivers had to be careful not to do anything (such as spin_unlock_irq()) which would enable interrupts while xmit_lock was held. dev_kfree_skb() could no longer be called in any place where xmit_lock was held, since its use is not legal when interrupts are disabled. Overall, there were enough problems with this approach that the patch was backed out after the -rc2 release, and the developers started over.

The current approach, as proposed by David Miller, is to leave things as they are and silence the log message. The patch has been tweaked a bit since first proposed by Roland in December; it now tries to distinguish the NETIF_F_LLTX race from other (more serious) calls to hard_start_xmit() with the transmit buffer full. This is done by checking to see if the queue has been stopped; if so, it is a harmless race and transmission of the packet is silently deferred. If the queue is still running, however, then something has gone wrong somewhere. This change must be made in all drivers which use NETIF_F_LLTX - a relatively small set. It's a small change, but it is a change in the rules for network drivers and worth being aware of.
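The check described above can be modeled in a few lines of user-space C. This is only a sketch: struct toy_dev, toy_start_xmit(), and the XMIT_* return values are invented for illustration and are not the kernel's types or return codes. The queue_stopped field stands in for the state that netif_stop_queue() sets, and the function is assumed to run with the driver's own transmit lock held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for a driver's private state; not kernel types. */
enum xmit_result { XMIT_OK, XMIT_DEFERRED, XMIT_ERROR };

struct toy_dev {
    bool queue_stopped;   /* models the state set by netif_stop_queue() */
    int  tx_free;         /* free slots in the transmit ring */
};

/* Body of a hard_start_xmit()-style routine; the caller is assumed to
 * hold the driver's own transmit lock. */
enum xmit_result toy_start_xmit(struct toy_dev *dev)
{
    assert(dev->tx_free >= 0);

    if (dev->tx_free == 0) {
        /* Ring full and queue stopped: another CPU got here first.
         * This is the harmless LLTX race, so defer quietly. */
        if (dev->queue_stopped)
            return XMIT_DEFERRED;

        /* Ring full but queue still running: a real bug somewhere. */
        fprintf(stderr, "toy: transmit ring full while queue is awake\n");
        return XMIT_ERROR;
    }

    dev->tx_free--;
    if (dev->tx_free == 0)
        dev->queue_stopped = true;   /* ring just filled: stop the queue */
    return XMIT_OK;
}
```

With one free slot, the first call succeeds and stops the queue; a second call that raced past the queue check is deferred without any log message. Only a full ring with a running queue produces a complaint.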

Yet another approach to memory fragmentation

A number of developers have taken a stab at the problem of memory fragmentation and the allocation of large, contiguous blocks of memory in the kernel. Approaches covered on this page recently include Marcelo Tosatti's active defragmentation patch and Nick Piggin's kswapd improvements. Now Mel Gorman has jumped into the fray with a different take on the problem.

At a very high level, the kernel organizes free pages as shown in the diagram below.

[cheesy memory diagram]

The system's physical memory is split into zones; on an x86 system, the zones include the small space reachable by ISA devices (ZONE_DMA), the regular memory zone (ZONE_NORMAL), and memory not directly accessible by the kernel (ZONE_HIGHMEM). NUMA systems divide things further by creating zones for each node. Within each zone, memory is split into chunks and sorted depending on its "order" - the base-2 logarithm of the size of each block, in pages. For each order, there is a linked list of available blocks of that size. So, at the bottom of the array, the order-0 list contains individual pages; the order-1 list has pairs of pages, etc., up to the maximum order handled by the system. When a request for an allocation of a given order arrives, a block is taken off the appropriate list. If no blocks of that size are available, a larger block is split. When blocks are freed, the buddy allocator tries to coalesce them with neighboring blocks to recreate higher-order chunks.
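For readers who have not looked at a buddy allocator before, the splitting and coalescing described above can be sketched as a toy user-space model. Everything here is invented for illustration (the names, the 16-page "zone," the boolean map in place of linked lists); the kernel's allocator is considerably more involved. The one piece of real technique is the buddy calculation: the buddy of a block of a given order starts at the block's page number with bit "order" flipped.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 4     /* orders 0..3; the largest block is 8 pages */
#define TOY_PAGES     16

/* toy_free_map[o][i] is true when a free order-o block starts at page i. */
static bool toy_free_map[TOY_MAX_ORDER][TOY_PAGES];

void toy_init(void)
{
    for (int o = 0; o < TOY_MAX_ORDER; o++)
        for (int i = 0; i < TOY_PAGES; i++)
            toy_free_map[o][i] = false;
    /* Memory starts out unfragmented: only maximum-order blocks. */
    for (int i = 0; i < TOY_PAGES; i += 1 << (TOY_MAX_ORDER - 1))
        toy_free_map[TOY_MAX_ORDER - 1][i] = true;
}

/* Returns the first page of a 2^order block, or -1 if none is available. */
int toy_alloc(int order)
{
    assert(order >= 0 && order < TOY_MAX_ORDER);

    for (int o = order; o < TOY_MAX_ORDER; o++) {
        for (int i = 0; i < TOY_PAGES; i += 1 << o) {
            if (!toy_free_map[o][i])
                continue;
            toy_free_map[o][i] = false;
            /* Split down, handing each unused upper half to its list. */
            while (o > order) {
                o--;
                toy_free_map[o][i + (1 << o)] = true;
            }
            return i;
        }
    }
    return -1;
}

void toy_free_block(int page, int order)
{
    assert(order >= 0 && order < TOY_MAX_ORDER);

    /* Coalesce for as long as the same-order buddy is also free. */
    while (order < TOY_MAX_ORDER - 1) {
        int buddy = page ^ (1 << order);
        if (!toy_free_map[order][buddy])
            break;
        toy_free_map[order][buddy] = false;
        page &= ~(1 << order);   /* merged block starts at the lower half */
        order++;
    }
    toy_free_map[order][page] = true;
}
```

In this model, a single order-0 allocation from fresh memory splits an 8-page block into free pieces of 4, 2, and 1 pages; the 8-page block cannot be handed out again until both single pages at its bottom have been freed and everything has merged back together.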

In real-life Linux systems, over time, the larger blocks tend to get split up, to the point that larger allocations can become difficult. A look at /proc/buddyinfo on a running system will tend to show quite a few zero-order pages available (one hopes), but relatively few larger blocks. For this reason, high-order allocations have a high probability of failure on a system which has been up for a while.
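Each line of /proc/buddyinfo names a node and zone, followed by the number of free blocks at each order, starting with order 0. A small helper along these lines can turn one such line into a total free-page count; the function name is made up for this example, and the parsing assumes the usual "Node N, zone NAME count count ..." layout.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses one /proc/buddyinfo line. Returns the number of free pages summed
 * over all orders (blocks at order n count as 2^n pages each), or -1 if the
 * line does not look like "Node N, zone NAME ...". If orders_seen is not
 * NULL, it receives the number of per-order columns found. */
long buddyinfo_free_pages(const char *line, int *orders_seen)
{
    int node, pos = 0, order = 0;
    char zone[32];
    long pages = 0;

    assert(line != NULL);

    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &pos) != 2)
        return -1;

    const char *p = line + pos;
    for (;;) {
        char *end;
        unsigned long blocks = strtoul(p, &end, 10);
        if (end == p)       /* no more numbers on this line */
            break;
        pages += (long)(blocks << order);
        order++;
        p = end;
    }

    if (orders_seen)
        *orders_seen = order;
    return pages;
}
```

A line with plenty of pages overall but zeros in the rightmost columns is the signature of the fragmentation problem: the memory is there, but not in large contiguous pieces.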

Mel's approach is to split memory allocations into three types, as indicated by a new set of GFP_ flags which can be provided when memory is requested. Memory allocations marked by __GFP_USERRCLM are understood to be for user space, and to be easily reclaimable. In general, all that's required to reclaim a user-space page is to write it to backing store (if it has been modified). The __GFP_KERNRCLM flag marks reclaimable kernel memory, such as that obtained from slabs and used in caches which can, when needed, be dropped. Finally, allocations not otherwise marked are considered to not be reclaimable in any easy way.

Then, the buddy allocator's data structures are expanded to look something like this:

[The Gorman approach to buddy allocators]

When the allocator is initialized, and all that nice, virgin memory is still unfragmented, the free_area_global field points to a long list of maximally-sized blocks of memory. The three free_area arrays - one for each type of allocation - are initially empty. Each allocation request, when it arrives, will be satisfied from the associated free_area array if possible; otherwise, one of the MAX_ORDER blocks from free_area_global will be split up. The portion of that block which is not allocated will be placed in the array associated with the current memory allocation type.

When memory is freed and blocks are coalesced, they remain within the type-specific array until they reach the largest size, at which point they go back onto the global array.
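The allocation side of this scheme can be sketched with a simple count-based model. To be clear, this is not Mel's code: the structure, the names, and the tiny maximum order are all invented here, block addresses are not tracked (so freeing and coalescing are left out), and any fallback between allocation types the real patch may perform is not modeled.

```c
#include <assert.h>

/* One entry per allocation type described in the text. */
enum toy_type { TOY_USERRCLM, TOY_KERNRCLM, TOY_KERNNORCLM, TOY_NTYPES };

#define TOY_TOP 3   /* order of the largest block in this toy model */

struct toy_zone {
    int global_blocks;                      /* untouched top-order blocks */
    int nr_free[TOY_NTYPES][TOY_TOP + 1];   /* free blocks per type and order */
};

/* Satisfies a request of the given type and order. Returns the order of the
 * block that was consumed (equal to 'order' when no split was needed), or
 * -1 when neither the type's own lists nor the global pool can help. */
int toy_typed_alloc(struct toy_zone *z, enum toy_type t, int order)
{
    int o;

    assert(order >= 0 && order <= TOY_TOP);

    /* First choice: the smallest suitable block already set aside
     * for this allocation type. */
    for (o = order; o <= TOY_TOP; o++)
        if (z->nr_free[t][o] > 0)
            break;

    if (o <= TOY_TOP) {
        z->nr_free[t][o]--;
    } else if (z->global_blocks > 0) {
        /* Otherwise claim a whole top-order block from the global pool. */
        z->global_blocks--;
        o = TOY_TOP;
    } else {
        return -1;
    }

    /* Split down; the leftovers stay with this type, never with others. */
    int source = o;
    while (o > order) {
        o--;
        z->nr_free[t][o]++;
    }
    return source;
}
```

The property to notice is in the final loop: once a large block has been broken up on behalf of one allocation type, all of its pieces remain in that type's lists, which is what keeps hard-to-reclaim kernel pages from being scattered through blocks used for user-space memory.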

One immediate benefit from this organization is that the pages which are hardest to get back - those in the "kernel non-reclaimable" category - are grouped together into their own blocks. A single pinned page can prevent the coalescing of a large block, so segregating the difficult kernel pages makes the management of the rest of memory easier. Beyond that, this organization makes it possible to perform active page freeing. If a high-order request cannot be satisfied, simply start with a smaller block and free up the neighboring pages. Active freeing is not yet implemented in Mel's current patch, however.

Even without the active component, this patch helps the kernel to satisfy large allocations. Mel gives results from a memory-thrashing test he ran; with a vanilla kernel, only three out of 160 attempted order-10 allocations were successful. With a patched kernel, instead, 81 attempts succeeded. So the new allocation technique and data structures do help the situation. What happens next remains to be seen, however; there seems to be a big hurdle to overcome when trying to get high-order allocation patches merged.

Useful gadget: /proc/page_owner

If you look far enough into the 2.6.11-rc2-mm2 announcement, you'll find a mention of a "page owner tracking leak detector" patch. The addition of this patch was almost certainly motivated by the series of memory leak problems which have afflicted the 2.6.11 prepatches. It is a heavy-handed tool, but, for some situations, it might make the problem of finding memory leaks far easier.

Essentially, this patch causes the kernel to keep track of the call chain that leads to the allocation of every page. This information is made available via /proc/page_owner; it looks something like this:

Page allocated via order 0
[0xc0146f01] kmem_getpages+49
[0xc014846d] cache_grow+173
[0xc0148aac] cache_alloc_refill+460
[0xc0118a8f] copy_files+431
[0xc0148ff5] kmem_cache_alloc+149
[0xc011986b] copy_process+3051
[0xc01199d1] fork_idle+65
[0xc041824a] do_boot_cpu+42

Your editor's 256MB sacrificial kernel box has, after a short period of run time, over 13,000 such entries. So plowing through the raw data is probably not what most people want to do. To help out, a small program (page_owner.c) has been put into the Documentation directory (though one might argue that it should be in scripts instead). This program boils down the contents of /proc/page_owner to something which looks like this:

856 times:
Page allocated via order 0
[0xc0146572] __do_page_cache_readahead+290
[0xc0146a70] max_sane_readahead+48
[0xc0140166] filemap_nopage+790
[0xc013fe50] filemap_nopage+0
[0xc0150861] do_no_page+193
[0xc0150cc6] handle_mm_fault+246
[0xc01126cc] do_page_fault+492
[0xc0151b3c] remove_vm_struct+140

839 times:
Page allocated via order 0
[0xc0146572] __do_page_cache_readahead+290
[0xc0146a70] max_sane_readahead+48
[0xc0140166] filemap_nopage+790
[0xc013fe50] filemap_nopage+0
[0xc0150861] do_no_page+193
[0xc0150cc6] handle_mm_fault+246
[0xc01126cc] do_page_fault+492
[0xc013c207] ltt_log_event+71

With this output, finding the source of a major memory leak should be relatively straightforward. It's worth noting that this program fails if told to read directly from /proc/page_owner (it does a stat() to determine the size of its input), so you must copy the data to a regular file first. This patch is also a major memory consumer in its own right, since it must store the call chain information for every allocated page. It's thus not something most people would put onto a production system - or even on most development systems. But it can be a useful thing to have around when a page-level memory leak bites.
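The stat() problem comes from the fact that /proc files generally report a size of zero; their contents are generated as they are read. A reader which wants to work directly on such a file can simply read until end-of-file, growing its buffer as needed. What follows is a sketch of that technique, not a patch to page_owner.c; the function name is invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads everything from fp into a malloc()ed, NUL-terminated buffer without
 * needing to know the size in advance. The caller frees the result. Returns
 * NULL on allocation failure or read error. */
char *slurp(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf;

    assert(fp != NULL);

    buf = malloc(cap);
    if (!buf)
        return NULL;

    for (;;) {
        /* Always leave one byte of room for the terminating NUL. */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (n == 0)
            break;                      /* end of file or error */
        if (cap - len - 1 == 0) {       /* buffer full: double it */
            char *bigger = realloc(buf, cap * 2);
            if (!bigger) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (len_out)
        *len_out = len;
    return buf;
}
```

Given the memory this patch already consumes, holding the whole file in a buffer is a real cost on a small machine; copying to a regular file first, as described above, remains the simple answer.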

Patches and updates

Core kernel code

  • Shailabh Nagar: ckrm-e17. (January 28, 2005)

Device drivers

  • Dave Airlie: drm tree. (February 1, 2005)

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds