
Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.10, which was released by Linus on December 24. There are CIFS and DVB updates since -rc3, along with the usual pile of fixes. For those of you just tuning in, 2.6.10 as a whole includes a new kernel events notification mechanism, switchable I/O schedulers (and a new CFQ scheduler as well), in-kernel cryptographic key management, DVD+RW and CD-RW packet writing support, ext3 block reservation and online resizing support, big updates for many kernel subsystems, and a handful of security fixes. The long-format changelog (1.5MB) has all the details.

Linus's BitKeeper repository, as of this writing, contains the four-level page table patch (see below), a VIA PadLock crypto engine driver, a new SKB allocation function (see below), ACPI hotplug support, the full InfiniBand patch set (covered here last November), a big direct rendering manager (DRM) rework, a new and simplified file readahead mechanism, a set of user-mode Linux patches, a big set of input patches, a new set of "sparse" annotations, an NFS update, an iptables update, support for the Fujitsu FR-V architecture, in-inode extended attribute support for ext3, some SELinux scalability improvements, and lots of fixes.

The current prepatch from Andrew Morton is 2.6.10-mm1. Recent additions to -mm include some software suspend improvements, a PCMCIA update, a number of NUMA-related cleanups, and a reiser4 update.

The current 2.4 prepatch remains 2.4.29-pre3, dating back to December 22.


Kernel development news

Quotes of the week

After 2.6.9-ac it's clear that the long 2.6.9 process worked very badly. While 2.6.10 is looking much better, its long period meant the allegedly "official" base kernel was a complete pile of insecure donkey turd for months. That doesn't hurt most vendor users but it does hurt those trying to do stuff on the base kernels very badly.

-- Alan Cox

Not all 2.6.x kernels will be good; but if we do releases every 1 or 2 weeks, some of them *will* be good. The problem with the -rc releases is that we try to predict in advance which releases will be stable, and we don't seem to be able to do a good job of that. If we do a release every week, my guess is that at least 1 in 3 releases will turn out to be stable enough for most purposes. But we won't know until after 2 or 3 days which releases will be the good ones.

-- Ted Ts'o


Four-level page tables merged

As expected, one of the first things to be merged into Linus's BitKeeper repository after the 2.6.10 release was the four-level page table patch. Two weeks ago, we noted that Nick Piggin had posted an alternative patch which changed the organization initially created by Andi Kleen. It was not clear, then, which version of the patch would go in. In the end, Nick's changes to the four-level patch were accepted.

Thus, in 2.6.11, the page table structure will include a new level, called "PUD," placed immediately below the top-level PGD directory. The new page table structure looks like this:

[Four-level page tables]

The PGD remains the top-level directory, accessed via the mm_struct structure associated with each process. The PUD exists only on architectures using four-level tables; as of this writing that means only x86-64, but other 64-bit architectures will probably adopt the fourth level as well. The PMD and PTE function as they did in previous kernels; the PMD is absent on architectures which support only two-level tables.
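
For generic code, the change mostly means that the page table walk gains one step. As a rough sketch (not taken from the patch itself), a lookup using the 2.6.11-era accessors looks something like this; on three-level architectures the PUD step is folded away at compile time, so the same code builds everywhere:

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    /* Find the PTE for a virtual address; error handling is reduced
     * to simple "not present" checks for brevity. */
    static pte_t *walk_page_tables(struct mm_struct *mm, unsigned long addr)
    {
        pgd_t *pgd = pgd_offset(mm, addr);   /* top-level directory */
        pud_t *pud;
        pmd_t *pmd;

        if (pgd_none(*pgd))
            return NULL;
        pud = pud_offset(pgd, addr);         /* the new level */
        if (pud_none(*pud))
            return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
            return NULL;
        return pte_offset_map(pmd, addr);    /* caller must pte_unmap() */
    }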

                                 Bits used
    Architecture        PGD      PUD      PMD      PTE
    i386                22-31    -        -        12-21
    i386 (PAE mode)     30-31    -        21-29    12-20
    x86-64              39-46    30-38    21-29    12-20

Each level in the page table hierarchy is indexed with a subset of the bits in the virtual address of interest. Those bits are shown in the table above (for a few architectures). In the classic i386 architecture, only the PGD and PTE levels are actually used; the combined twenty bits allow up to 1 million pages (4GB) to be addressed. The i386 PAE mode adds the PMD level, but does not increase the virtual address space (it does expand the amount of physical memory which may be addressed, however). On the x86-64 architecture, four levels are used with a total of 35 bits for the page frame number. Before the patch was merged, the x86-64 architecture could not effectively use the fourth level and was limited to a 512GB virtual address space. Now x86-64 users can have a virtual address space covering 128TB of memory, which really should last them for a little while.
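
The arithmetic is easy to check; this little user-space fragment (purely illustrative, not kernel code) adds up the x86-64 index bits from the table above:

    #include <stdio.h>

    int main(void)
    {
        /* 12-bit page offset, plus 9 bits each for the PTE, PMD, and
         * PUD indexes, plus 8 bits for the PGD = 47 bits in all */
        unsigned bits = 12 + 9 + 9 + 9 + 8;

        /* 2^47 bytes = 128TB (2^40 bytes per terabyte) */
        printf("2^%u bytes = %lluTB\n", bits, 1ULL << (bits - 40));
        return 0;
    }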

Those who are curious about how x86-64 uses its expanded address space may want to take a look at this explanation from Andi Kleen.

The merging of this patch demonstrates a few things about the current kernel development model. Prior to 2.6, such a fundamental change could never be applied during a "stable" kernel series; anybody needing the four-level feature would have had to wait a couple more years for 2.8. The new way of kernel development, for better or for worse, does bring new features to users far more quickly than the old method did - and without the need for distributor backports. This patch is also a clear product of the peer review process. Andi's initial version worked fine, and could certainly have been merged into the mainline. The uninvited participation of another developer, however, helped to rework the patch into a less intrusive form which brought minimal changes to code outside the VM core. The end result is an improved kernel which can take full advantage of the hardware on which it runs.


alloc_skb_from_cache()

The post-2.6.10 mainline kernel contains a set of patches designed to help with the merging of the Xen virtual architecture. One of them is an enhancement to the networking API which could have uses beyond Xen.

The "socket buffer" (SKB) is the core kernel data structure used to represent packets as they pass through the system. The SKB API has been described for 2.4 in LDD2; this interface has changed little since then. SKB structures are allocated in various ways by the networking layer; the Xen patches add a new way:

    struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cache,
                                         unsigned int size, int gfp_mask);

This function will allocate an SKB of the given size from the slab cache provided. It assumes that the cache will provide a chunk of memory of sufficient size for the buffer - and various bits of overhead imposed by the SKB structure itself.
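
A plausible (and purely hypothetical) driver-side use might look like the following; the cache and function names here are illustrative, not taken from the Xen patches:

    #include <linux/skbuff.h>
    #include <linux/slab.h>
    #include <linux/errno.h>
    #include <asm/page.h>

    static kmem_cache_t *rx_cache;

    static int rx_cache_init(void)
    {
        /* one page per object, as Xen does for its page-flipping scheme */
        rx_cache = kmem_cache_create("mydev_rx", PAGE_SIZE, 0, 0,
                                     NULL, NULL);
        return rx_cache ? 0 : -ENOMEM;
    }

    static struct sk_buff *rx_alloc_skb(void)
    {
        /* the requested size must be comfortably smaller than PAGE_SIZE,
         * so that alignment padding and the SKB overhead mentioned above
         * still fit within the one-page slab object */
        return alloc_skb_from_cache(rx_cache, 1500, GFP_ATOMIC);
    }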

The new allocation function might speed things slightly for network drivers which allocate large numbers of buffers of the same size - though the existing allocation interfaces are already pretty fast. Xen has an interesting use for this capability, however: fast networking between virtual machines. By using the slab cache, Xen can ensure that every packet is allocated a one-page buffer. When that packet is sent to another virtual machine, the associated page can be unmapped from the source system and mapped into the address space of the destination. It is, in other words, a fairly straightforward zero-copy networking scheme. As a side benefit, the Xen monitor benefits from the knowledge that the pages in question have been used for network packets - since the contents of the packet could be read by third parties while it is in transit, there is no real point in worrying about zeroing out the data afterward.


Faster page faulting through prezeroing

In early December, this page covered Christoph Lameter's efforts to speed up the page fault mechanism by reducing lock contention. That work speeds things up significantly on multiprocessor systems, but is of little help to uniprocessor users. Christoph's other page fault work, however, can benefit users on all systems.

Christoph notes that, once the locking issues are taken care of, the most expensive part of the page fault handler is the code which zeroes anonymous pages before handing them to the faulting process. He has concluded that, in some situations, performance can be significantly improved by clearing those pages ahead of time and having them ready when the page fault happens. Just zeroing pages ahead of time is not particularly helpful - it is mostly an exercise in moving work around to different places in the system. But, if (1) the zeroing of pages can be made more efficient, and (2) the workload is of the right type, things can be made quite a bit faster.

What is the right kind of workload? For the purposes of this patch set, the best workload is one which allocates whole pages, but then only touches parts of them. If those pages are already cleared, there is no need to load an entire page into the processor cache when it is faulted in. The improved cache behavior, along with the speedup in fault handling itself, can yield big improvements. Some figures posted by Christoph show an almost 4x improvement in the page fault rate in the right conditions. As it turns out, many applications fit this profile, so "the right conditions" should not be all that rare.

There are four parts to the prezeroing patch set. The first patch extends the page allocation mechanism to make it explicitly handle requests for zeroed memory. There is a new __GFP_ZERO allocation flag which tells alloc_pages() (and thus functions like __get_free_page() and kmalloc()) to return zeroed memory. Many places in the kernel which clear their own pages have been fixed to request zeroed memory instead. With only this patch applied, the kernel's code is cleaned up a bit, but no performance improvements result - the __GFP_ZERO flag just causes a call to clear_page() in the page allocator.
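
In its simplest form (a minimal sketch, assuming only what is described above), the new flag is used this way:

    #include <linux/gfp.h>
    #include <linux/slab.h>

    static void *get_cleared_buffer(void)
    {
        /* 256 bytes of already-zeroed memory; no explicit memset() */
        return kmalloc(256, GFP_KERNEL | __GFP_ZERO);
    }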

The second patch changes the prototype of the clear_page() function to:

    void clear_page(void *page, int order);

With the new interface, clear_page() can zero higher-order pages. This change is an important part of the patch set: pages are zeroed most efficiently when they can be cleared in larger groups. The setup cost is often a big part of the total; the value of prezeroing pages is much reduced if it can only be done one page at a time.
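
So, under the new prototype, a single call can clear a multi-page block; a brief sketch (assuming an order-2 allocation):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    static struct page *alloc_cleared_block(void)
    {
        struct page *pages = alloc_pages(GFP_KERNEL, 2); /* four pages */

        if (pages)
            clear_page(page_address(pages), 2);  /* new "order" argument */
        return pages;
    }

(With the rest of the patch set applied, of course, much the same result could be had by simply passing __GFP_ZERO to alloc_pages().)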

The kscrubd patch is where things start to get interesting. This patch expands the zone structure so that it can keep track of pages which are known to be clear. Requests for zeroed pages are satisfied from this list when possible. A new kernel thread (actually, a set of per-node threads) wakes up occasionally and clears pages for future allocation. This thread does not normally scrub zero-order (single) pages, but can be configured to do so (via /proc) if desired.

The kscrubd patch also implements a linked list of "zero drivers," being functions which can be called upon to zero pages efficiently. There are no such drivers in this patch, so all pages are zeroed with a call to clear_page(), which, as a comment in the code notes, can be hard on the processor's cache. It would be nicer if pages could be zeroed without the cache impacts. The fourth patch shows how this can be done - at least, on Altix systems. It adds a driver for the Altix block transfer engine which can zero memory directly without the processor's involvement - at least, when relatively large chunks of memory are involved. Drivers for other hardware have not yet been posted, but it would not be surprising to see them begin to appear after the prezeroing code has been merged.
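
The description above suggests an interface along these lines; the names below (struct zero_driver, register_zero_driver()) are pure guesswork, since the actual declarations do not appear here:

    #include <linux/list.h>

    /* Hypothetical sketch only -- the real patch may differ. */
    struct zero_driver {
        struct list_head list;
        /* clear "size" bytes at "addr" without polluting the CPU cache;
         * return nonzero so the caller falls back to clear_page() when
         * the hardware cannot help */
        int (*zero)(void *addr, unsigned long size);
    };

    int register_zero_driver(struct zero_driver *drv);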

And that could happen soon: Linus, having been convinced by Christoph's results, has requested that this set of patches be merged soon. So prezeroing could even find its way into the kernel prior to the 2.6.11 release. (Update: the __GFP_ZERO patch was merged just as LWN was being published).


Patches and updates

Kernel trees

  • Domen Puncer: 2.6.10-kj. (December 27, 2004)

Page editor: Jonathan Corbet
