Brief items
The current 2.6 kernel is 2.6.10, which was
released by Linus on
December 24. There are CIFS and DVB updates
since -rc3, along with the usual pile of fixes. For those of you just
tuning in, 2.6.10 as a whole includes a new kernel events notification
mechanism, switchable I/O schedulers (and a new CFQ scheduler as well),
in-kernel cryptographic key management, DVD+RW and CD-RW packet writing
support, ext3 block reservation and online resizing support, big updates
for many kernel subsystems, and a handful of security fixes. The
long-format
changelog (1.5MB) has all the details.
Linus's BitKeeper repository, as of this writing, contains the four-level
page table patch (see below), a VIA PadLock crypto engine driver, a
new SKB allocation function (see below), ACPI hotplug support, the full
InfiniBand patch set (covered here last
November), a big direct rendering manager (DRM) rework, a new and
simplified file readahead mechanism, a set of user-mode Linux patches, a
big set of input patches, a new set of "sparse" annotations, an NFS update,
an iptables update, support for the Fujitsu FR-V architecture, in-inode
extended attribute support for ext3, some SELinux scalability improvements,
and lots of fixes.
The current prepatch from Andrew Morton is 2.6.10-mm1. Recent additions to -mm include
some software suspend improvements, a PCMCIA update, a number of
NUMA-related cleanups, and a reiser4 update.
The current 2.4 prepatch remains 2.4.29-pre3, dating back to
December 22.
Kernel development news
After 2.6.9-ac it's clear that the long 2.6.9 process worked very
badly. While 2.6.10 is looking much better, its long period meant
the allegedly "official" base kernel was a complete pile of
insecure donkey turd for months. That doesn't hurt most vendor
users but it does hurt those trying to do stuff on the base kernels
very badly.
-- Alan Cox
Not all 2.6.x kernels will be good; but if we do releases every 1
or 2 weeks, some of them *will* be good. The problem with the -rc
releases is that we try to predict in advance which releases will be
stable, and we don't seem to be able to do a good
job of that. If we do a release every week, my guess is that at
least 1 in 3 releases will turn out to be stable enough for most
purposes. But we won't know until after 2 or 3 days which releases
will be the good ones.
-- Ted Ts'o
As expected, one of the first things to be merged into Linus's BitKeeper
repository after the 2.6.10 release was the four-level page table patch.
Two weeks ago, we
noted that
Nick Piggin had posted an alternative patch which changed the organization
initially created by Andi Kleen. It was not clear, then, which version of
the patch would go in. In the end, Nick's changes to the four-level patch
were accepted.
Thus, in 2.6.11, the page table structure will include a new level, called
the "PUD," placed immediately below the top-level PGD directory; the
hierarchy now runs PGD, PUD, PMD, PTE from top to bottom.
The PGD remains the top-level directory, accessed via the
mm_struct structure associated with each process. The PUD only
exists on architectures which use four-level tables; as of this writing
that means only x86-64, but other 64-bit architectures will probably make
use of the fourth level in the future as well. The PMD and PTE levels
function as they did in previous kernels; the PMD is absent if the
architecture only supports two-level tables.
| Architecture    | PGD   | PUD   | PMD   | PTE   |
|-----------------|-------|-------|-------|-------|
| i386            | 22-31 |       |       | 12-21 |
| i386 (PAE mode) | 30-31 |       | 21-29 | 12-20 |
| x86-64          | 39-46 | 30-38 | 21-29 | 12-20 |
Each level in the page table hierarchy is indexed with a subset of the bits
in the virtual address of interest. Those bits are shown in the table
above (for a few architectures). In the classic i386 architecture,
only the PGD and PTE levels are actually used; the combined twenty bits
allow up to 1 million pages (4GB) to be addressed. The i386 PAE mode
adds the PMD level, but does not increase the virtual address space (it
does expand the amount of physical memory which may be addressed, however).
On the x86-64 architecture, four levels are used, with a total of 35 bits
of the virtual address selecting the page. Before the patch was merged, the x86-64
architecture could not effectively use the fourth level and was limited to
a 512GB virtual address space. Now x86-64 users can have a virtual address
space covering 128TB of memory, which really should last them for a little
while.
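For code which must walk the page tables by hand, the change is mostly
mechanical: one more lookup between the PGD and the PMD. As a minimal
sketch (the function name is made up for illustration; locking and the
matching pte_unmap() call are left to the caller), a four-level lookup
with the new accessors might look like:

    static pte_t *lookup_pte(struct mm_struct *mm, unsigned long addr)
    {
        pgd_t *pgd = pgd_offset(mm, addr);    /* top-level directory */
        pud_t *pud;
        pmd_t *pmd;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return NULL;
        pud = pud_offset(pgd, addr);          /* the new fourth level */
        if (pud_none(*pud) || pud_bad(*pud))
            return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            return NULL;
        return pte_offset_map(pmd, addr);     /* caller must pte_unmap() */
    }

On architectures which do not use the fourth level, pud_offset() simply
folds back into the PGD entry, so code written this way works everywhere.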
Those who are curious about how x86-64 uses its expanded address space may
want to take a look at this explanation
from Andi Kleen.
The merging of this patch demonstrates a few things about the current
kernel development model. Prior to 2.6, such a fundamental change could
never be applied during a "stable" kernel series; anybody needing the
four-level feature would have had to wait a couple more years for 2.8. The
new way of kernel development, for better or for worse, does bring new
features to users far more quickly than the old method did - and without
the need for distributor backports. This patch is also a clear product of
the peer review process. Andi's initial version worked fine, and could
certainly have been merged into the mainline. The uninvited participation
of another developer, however, helped to rework the patch into a less
intrusive form which brought minimal changes to code outside the VM core.
The end result is an improved kernel which can take full advantage of the
hardware on which it runs.
The post-2.6.10 mainline kernel contains a set of patches designed to help
with the merging of the Xen virtual architecture. One of them is an
enhancement to the networking API which could have uses beyond Xen.
The "socket buffer" (SKB) is the core kernel data structure used to
represent packets as they pass through the system. The SKB API has been
described for 2.4 in LDD2; this interface has
changed little since then. SKB structures are allocated in various ways by
the networking layer; the Xen patches add a new way:
    struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cache,
                                         unsigned int size,
                                         int gfp_mask);
This function will allocate an SKB of the given size from the slab
cache provided. It assumes that the cache yields chunks of memory large
enough to hold the requested buffer plus the various bits of overhead
imposed by the SKB structure itself.
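How a driver might use this is easy to imagine; the following sketch is
purely illustrative (the cache name and the pkt_alloc() helper are
inventions, not code from the Xen patches), and it assumes that the
trailing overhead is the usual skb_shared_info structure placed at the
end of the buffer:

    #include <linux/skbuff.h>
    #include <linux/slab.h>

    static kmem_cache_t *pkt_cache;   /* one full page per buffer */

    static int pkt_cache_init(void)
    {
        pkt_cache = kmem_cache_create("pkt-cache", PAGE_SIZE, 0,
                                      SLAB_HWCACHE_ALIGN, NULL, NULL);
        return pkt_cache ? 0 : -ENOMEM;
    }

    static struct sk_buff *pkt_alloc(unsigned int len, int gfp_mask)
    {
        /* The data and the SKB's trailing overhead must both fit into
           the page-sized chunks handed back by the cache. */
        if (len > PAGE_SIZE - sizeof(struct skb_shared_info))
            return NULL;
        return alloc_skb_from_cache(pkt_cache, len, gfp_mask);
    }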
The new allocation function might speed things slightly for network drivers
which allocate large numbers of buffers of the same size - though the
existing allocation interfaces are already pretty fast. Xen has an
interesting use for this capability, however: fast networking between
virtual machines. By using the slab cache, Xen can ensure that every
packet is allocated a one-page buffer. When that packet is sent to another
virtual machine, the associated page can be unmapped from the source system
and mapped into the address space of the destination. It is, in other
words, a fairly straightforward zero-copy networking scheme. As a side
benefit, the Xen monitor knows that the pages in question have been used
for network packets; since a packet's contents can be read by third
parties while it is in transit, there is no real point in worrying about
zeroing out the data afterward.
In early December, this page
covered Christoph Lameter's
efforts to speed up the page fault mechanism by reducing lock contention.
That work speeds things up significantly on multiprocessor systems, but is of
little help to uniprocessor users. That is not true of Christoph's other
page fault work, which can benefit users on all systems.
Christoph notes that, once
the locking issues are taken care of, the most expensive part of the page
fault handler is the code which zeroes anonymous pages before handing them
to the faulting process. He has concluded that, in some situations,
performance can be significantly improved by clearing those pages ahead of
time and having them ready when the page fault happens. Just zeroing pages
ahead of time is not particularly helpful - it is mostly an exercise in
moving work around to different places in the system. But, if (1) the
zeroing of pages can be made more efficient, and (2) the workload is
of the right type, things can be made quite a bit faster.
What is the right kind of workload? For the purposes of this patch set,
the best workload is one which allocates whole pages, but then only touches
parts of them. If those pages are already cleared, there is no need to
load an entire page into the processor cache when it is faulted in. The
improved cache behavior, along with the speedup in fault handling itself,
can yield big improvements. Some figures posted by Christoph show an
almost 4x improvement in the page fault rate in the right conditions. As
it turns out, many applications fit this profile, so "the right conditions"
should not be all that rare.
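To make the pattern concrete: a user-space program like the following
(purely illustrative) sketch faults in a large number of anonymous pages
while touching only one byte of each - exactly the case where prezeroed
pages avoid pulling whole pages through the processor cache:

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        size_t len = (size_t) 65536 * pagesize;   /* 256MB with 4KB pages */
        size_t i;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
            return 1;
        /* One write per page: each access faults in a fresh anonymous
           page, but only a single cache line of it is ever used. */
        for (i = 0; i < len; i += pagesize)
            buf[i] = 1;
        return 0;
    }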
There are four parts to the prezeroing patch set. The first patch extends the page
allocation mechanism to make it explicitly handle requests for zeroed
memory. There is a new __GFP_ZERO allocation flag which tells
alloc_pages() (and thus functions like __get_free_page() and
kmalloc()) to return zeroed memory. Many places in the
kernel which clear their own pages have been fixed to request zeroed memory
instead. With only this patch applied, the kernel's code is cleaned up a
bit, but no performance improvements result - the __GFP_ZERO flag
just causes a call to clear_page() in the page allocator.
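The caller-side cleanup looks something like this sketch:

    /* Before the patch, code needing cleared memory did the work itself: */
    unsigned long page = __get_free_page(GFP_KERNEL);
    if (page)
        memset((void *) page, 0, PAGE_SIZE);

    /* With __GFP_ZERO, the allocator handles it - and, with the rest of
       the patch set, may satisfy the request from already-zeroed pages: */
    unsigned long zpage = __get_free_page(GFP_KERNEL | __GFP_ZERO);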
The second patch changes the
prototype of the clear_page() function to:
    void clear_page(void *page, int order);
With the new interface, clear_page() can zero higher-order pages.
This change is an important part of the patch set: pages are zeroed most
efficiently when they can be cleared in larger groups. Often, the setup
cost is a big part of the total; the value of prezeroing pages is much
reduced if it can only be done one page at a time.
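A caller using the new prototype can thus clear a multi-page block in a
single operation; a brief sketch:

    /* Allocate an order-2 block (four pages, on 4KB-page systems) and
       clear it with one clear_page() call rather than a per-page loop: */
    struct page *block = alloc_pages(GFP_KERNEL, 2);
    if (block)
        clear_page(page_address(block), 2);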
The kscrubd patch is where
things start to get interesting. This patch expands the zone
structure so that it can keep track of pages which are known to be clear.
Requests for zeroed pages are satisfied from this list when possible. A
new kernel thread (actually, a set of per-node threads) wakes up
occasionally and clears pages for future allocation. This thread does not
normally scrub zero-order (single) pages, but can be configured to do so
(via /proc) if desired.
The kscrubd patch also implements a linked list of "zero drivers":
functions which can be called upon to zero pages efficiently. There are no
such drivers in this patch, so all pages are zeroed with a call to
clear_page(), which, as a comment in the code notes, can be hard
on the processor's cache. It would be nicer if pages could be zeroed
without the cache impacts. The
fourth patch shows how this can be done - at least, on Altix systems.
It adds a driver for the Altix block transfer engine which can zero memory
directly without the processor's involvement - at least, when relatively
large chunks of memory are involved. Drivers for other hardware have not
yet been posted, but it would not be surprising to see them begin to appear
after the prezeroing code has been merged.
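The registration interface for zero drivers is defined by the patch
itself; purely as an illustration of the concept (the structure and
function names below are guesses, not taken from the patch), such a
driver could be little more than a structure with a callback:

    /* Hypothetical sketch - struct zero_driver and the registration
       call are assumed names, not confirmed from the patch. */
    struct zero_driver {
        struct list_head list;
        /* Clear 'size' bytes at 'dest' without involving the CPU
           cache; return 0 on success. */
        int (*start)(void *dest, unsigned long size);
    };

    static int bte_zero(void *dest, unsigned long size)
    {
        /* Hand the range to a DMA or block transfer engine here and
           wait for completion. */
        return 0;
    }

    static struct zero_driver bte_driver = {
        .start = bte_zero,
    };

    /* register_zero_driver(&bte_driver);  -- assumed registration call */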
And that could happen quickly: Linus, convinced by Christoph's results,
has requested that this set of patches be merged soon. Prezeroing could
thus find its way into the kernel before the 2.6.11 release.
(Update: the __GFP_ZERO patch was merged just as LWN was being
published.)