Kernel release status
The current development kernel is 2.5.53, which was
released by Linus on December 23. It
contains a bunch of device mapper fixes, an SCTP update, some memory
management fixes, an ia-64 merge, some USB updates, a new aic7xxx driver,
the new x86 "sysenter" system call mechanism (discussed in
the December 19 LWN Kernel Page), and many
other fixes and updates.
The long-format
changelog has the details.
Linus's pre-2.5.54 BitKeeper repository contains a large number of patches,
most of which are the sorts of fixes that one would expect during a feature
freeze. There is also a new bit of compiler trickery to issue warnings
when deprecated functions are called, a number of kbuild fixes, a new
dev_printk() function for standardized device error reporting, the
removal of the much disliked hugetlb system calls (in favor of hugetlbfs),
a new "kmalloc for each CPU" API, and more loadable module fixes.
The current stable kernel is 2.4.20. Marcelo has not released any
2.4.21 prepatches since December 18.
Kernel development news
Fixing up the shared page table patch
One patch that is still apparently being considered for 2.5 is the shared
page table code. Since this patch makes significant changes to the VM
subsystem, it is worth looking at why it is interesting, and what its
prospects are.
Shared page tables do exactly what one would expect: they allow processes to
share their page tables. The primary application of this technique is at
fork() time; when a process creates a new child, the two processes
share the same low-level page tables. These tables are shared in a "copy
on write" mode; when either process changes memory, both the page being
changed and the page table page that points to it are copied. The idea is that
if the new process calls exec() before changing much memory, much
of the page table copying overhead can be avoided entirely.
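The copy-on-write behavior described above can be sketched with a small toy model. This is not kernel code: the class names, the reference count field, and the 1024-entry page table page (typical of 32-bit x86 without PAE) are all assumptions made for illustration.

```python
# Toy model (not kernel code) of page table pages shared copy-on-write at fork.
ENTRIES_PER_PT_PAGE = 1024   # assumption: 32-bit x86 without PAE, 4KB pages


class PageTablePage:
    """One low-level page table page, plus a count of address spaces sharing it."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})   # index within page -> frame number
        self.sharers = 1


class AddressSpace:
    def __init__(self):
        self.pt_pages = {}   # page table page index -> PageTablePage
        self.pt_copies = 0   # page table pages this address space had to copy

    def map_page(self, vpn, frame):
        """Point virtual page number vpn at a physical frame."""
        pt = self._writable_pt(vpn // ENTRIES_PER_PT_PAGE)
        pt.entries[vpn % ENTRIES_PER_PT_PAGE] = frame

    def fork(self):
        """Create a child that shares every page table page with the parent."""
        child = AddressSpace()
        for index, pt in self.pt_pages.items():
            pt.sharers += 1
            child.pt_pages[index] = pt
        return child

    def write(self, vpn, new_frame):
        """Model a write fault: unshare the page table page, then remap the
        page to new_frame (standing in for the freshly copied data page)."""
        self.map_page(vpn, new_frame)

    def _writable_pt(self, index):
        pt = self.pt_pages.get(index)
        if pt is None:
            pt = self.pt_pages[index] = PageTablePage()
        elif pt.sharers > 1:
            # Still shared: give this address space a private copy.
            pt.sharers -= 1
            pt = self.pt_pages[index] = PageTablePage(pt.entries)
            self.pt_copies += 1
        return pt
```

In this model, a child that forks and then writes to a single page ends up with one private page table page, while every other page table page stays shared with the parent.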
Shared page tables can also save significant amounts of memory when large
processes (or large shared memory segments) are involved, but the
fork() overhead is the real driving force behind this patch. The
2.5 kernel has a significantly slower fork() than 2.4, as a result
of the reverse mapping VM code. Copying page tables requires copying the
reverse map entries, which slows fork() down. Shared page tables,
it is hoped, can eliminate that copy and get fork() back to
something close to its 2.4 performance.
So it was a little disappointing when Andrew Morton ran some benchmarks and discovered that shared
page tables made fork() even slower than it was before. The
optimization, it seems, is really a pessimization - at least when
relatively small processes are involved, which is the case that matters to
most users.
Dave McCracken figured out what is going
on. Most smaller processes, it seems, have three distinct areas of
writable memory: the data area, the stack, and the C library's data
area. On most systems, a single page table page holds enough page table
entries to map 4MB of actual memory. Unless the process is fairly large,
then, there will be exactly one page table page for each of the three
writable areas, or three in all.
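The arithmetic behind the 4MB figure can be checked with a short sketch. The page size and entry count below assume 32-bit x86 without PAE, and the three address ranges in the usage note are made-up examples, not values taken from a real process.

```python
PAGE_SIZE = 4096        # assumption: 4KB pages
PTES_PER_PAGE = 1024    # assumption: 4-byte entries in a 4KB page table page
PT_SPAN = PAGE_SIZE * PTES_PER_PAGE   # memory mapped by one page table page


def pt_pages_touched(areas):
    """Return the indexes of the page table pages needed to map the given
    (start, end) address ranges; end is exclusive."""
    touched = set()
    for start, end in areas:
        first = start // PT_SPAN
        last = (end - 1) // PT_SPAN
        touched.update(range(first, last + 1))
    return touched
```

Calling `pt_pages_touched()` on three small, widely separated ranges, such as a data area near 0x08100000, library data near 0x40150000, and a stack ending at 0xc0000000, yields three distinct page table pages, which matches the count given above.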
The shared page table patch thus allows the deferral of the copying of
three pages' worth of page table entries. As soon as either process changes
the memory mapped by one of those page table pages, that page can no longer
be shared and all page table entries within that page must be copied.
Unfortunately, even a process which does nothing but call exec()
will almost certainly write memory in all three areas, requiring the
unsharing of all three page table pages.
In other words, the shared page table patch is introducing the extra
overhead required to share and unshare page table pages, but, in most
cases, all of those pages will have to be unshared and copied anyway. So
the extra overhead just makes things even slower than they were before.
There are a couple of things that can be done to address this problem.
Dave posted a relatively simple fix: do not share page tables unless the
forking process has at least four page table pages. It turns out that, if
even one page table page need not be copied,
the sharing overhead is worthwhile. So, if you turn off sharing in the case
where it doesn't help, you get back to where you were before, and can enjoy
the benefits of page table sharing for very large processes.
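The threshold check amounts to a one-line decision at fork time. The sketch below only illustrates the rule as described here; the function and constant names are hypothetical and do not come from the actual patch.

```python
MIN_PT_PAGES_TO_SHARE = 4   # threshold as described in the text above


def should_share_page_tables(pt_page_count):
    """Decide at fork time whether to share page table pages (True) or to
    copy them the old way (False), based on how many the parent has."""
    return pt_page_count >= MIN_PT_PAGES_TO_SHARE
```

A typical small process with three page table pages falls below the threshold and takes the plain copying path, so it pays none of the share-and-unshare overhead.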
A more involved approach would be to spread out a process's writable memory
so that it is mapped by more than one page table page. Writable process
memory comes in numerous distinct chunks; a look at the
/proc/.../maps entry for the emacs process being used to write
this article shows 33 separate, writable virtual memory areas (VMAs). If
each VMA is mapped on its own 4MB boundary, and thus has its own page table
page, then writing in one VMA does not require copying the page table
entries for all the other VMAs.
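Both halves of this idea can be sketched in a few lines: counting the writable VMAs listed in a maps file, and laying areas out so that each starts on its own 4MB boundary. The function names are hypothetical, the 4MB span assumes 32-bit x86 without PAE, and the layout routine illustrates the idea only; it is not the way the kernel chooses addresses.

```python
PT_SPAN = 4 * 1024 * 1024   # assumption: one page table page maps 4MB


def writable_vmas(maps_text):
    """Parse the text of a /proc/<pid>/maps file and return the (start, end)
    address pairs of every area whose permission field includes 'w'."""
    vmas = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or "w" not in fields[1]:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        vmas.append((start, end))
    return vmas


def spread_on_pt_boundaries(sizes, base=0):
    """Place areas of the given sizes so that each one starts on a PT_SPAN
    boundary and therefore shares no page table page with the others."""
    layout = []
    addr = base
    for size in sizes:
        addr = -(-addr // PT_SPAN) * PT_SPAN   # round up to the next boundary
        layout.append((addr, addr + size))
        addr += size
    return layout
```

On a Linux system, `writable_vmas(open("/proc/self/maps").read())` lists the writable areas of the running interpreter. The obvious cost of the spread-out layout is virtual address space: each small area now reserves a full 4MB slot.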
Andrew Morton gave this approach a try, and
saw a 5-10% speedup. Performance is improved, in other words, but is still
far short of what a 2.4 kernel can do.
The bottom line appears to be this: the shared page table patch, while
providing some benefits, is failing in its goal of mitigating the extra
fork() overhead brought by the reverse mapping VM. Unless
somebody finds a way to address this problem, shared page tables seem
unlikely to find their way into the 2.5 kernel.
Manipulating multiple address spaces
Back in November, LWN covered
a patch by Jeff Dike which made some User-mode Linux improvements
possible. Jeff needed a mechanism which would allow him to create multiple
address spaces for a single Linux process, manipulate those address spaces,
and switch the process between them. The interface he came up with was:
- Opening /proc/mm would return a file descriptor representing
a newly-created address space.
- Writing to that file descriptor would execute commands on the address
space, as described by the data "written." Mapping of segments,
changing permissions, etc. would be handled via this mechanism; in
this way, UML could set up an address space as needed for one of its
processes.
- An extension to the ptrace() system call allows UML to switch
a child process's address space.
This interface gets the job done, but it's not too surprising that Linus
did not like it. Performing virtual memory management operations via a
magic /proc file is just not the most elegant way of doing
things.
Cleaning up the first step - creating new address spaces - is relatively
easy. It's just a matter of adding a new create_mm() system
call. But then how does one manipulate that new space - mapping in a file,
or changing protections, for example? The system calls which normally
perform these functions (mmap(), mprotect(), ...) are not
set up to have a separate address space passed in as a parameter. One
could create a whole new set of system calls that take that extra
parameter, but that is a task that gets messy in a hurry.
So Linus has come up with another idea. Why
not add one more system call (mm_indirect()), which would invoke
any other system call in the context of a different address space?
mm_indirect() would simply switch the calling process over to the
new address space, invoke the real system call of interest, then switch
back. In this way, all system calls could be made to manipulate a
different address space without the need to modify any of them.
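The switch, call, and switch-back pattern can be modeled in user space. Everything below is a toy: address spaces are plain dictionaries, the `sys_*` functions are stand-ins for real system calls, and the `mm_indirect()` signature is invented for illustration, since the actual interface is not specified here.

```python
class Process:
    """Toy process whose current address space is just a dict of mappings."""

    def __init__(self, mm):
        self.mm = mm


def sys_mmap(proc, addr, length, prot):
    """Stand-in for mmap(): acts on whatever address space is current."""
    proc.mm[addr] = (length, prot)
    return addr


def sys_mprotect(proc, addr, prot):
    """Stand-in for mprotect(): changes protection on an existing mapping."""
    length, _ = proc.mm[addr]
    proc.mm[addr] = (length, prot)
    return 0


def mm_indirect(proc, target_mm, syscall, *args):
    """Run an unmodified 'system call' against another address space:
    switch over, make the call, and always switch back."""
    saved = proc.mm
    proc.mm = target_mm
    try:
        return syscall(proc, *args)
    finally:
        proc.mm = saved
```

Neither `sys_mmap()` nor `sys_mprotect()` knows anything about foreign address spaces, which is the point of the indirection: `mm_indirect(proc, other_mm, sys_mmap, 0x1000, 4096, "rw")` changes `other_mm` and leaves the caller's own mappings untouched.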
This solution will work for UML, and is thus likely to be implemented. It
may eventually lead to a number of currently unimagined "coprocess"
applications as well. One question remains unanswered, however: is this
sort of change really 2.5 material, or does it get to wait for the next
development series?
(As an aside, we look forward to seeing the results of Jeff's work running UML with the valgrind
memory debugger. Chances are it will turn up a lot of previously unnoticed
memory bugs in the Linux kernel.)
The end of the hugetlb system calls
The hugetlb (or "large page") patch was covered here
last August. This patch added a
couple of new system calls allowing a suitably privileged process to create
anonymous memory using the large page capability of most modern
processors. Using large pages cuts down on page table overhead, and,
crucially, optimizes the use of the processor's address translation cache.
The result is that applications using large memory arrays (Oracle, in
particular) run faster.
The large page capability is seen as useful by most developers, but there
has been a long series of complaints about the system call interface. The
system calls do pretty much what one would expect: allocate a large page
region, free it, share it with others. But not everybody sees the need for
a new set of system calls for performing what are (mostly) standard memory
operations. Then, there is the issue of permissions. The ability to
allocate huge pages cannot be handed out to just anybody, since it is a
good vehicle for the creation of denial of service attacks. That means
that root access is required to make use of the large page capability.
Call them superstitious, but many users are reluctant to run Oracle with
root access.
Meanwhile, William Lee Irwin added hugetlbfs - a RAM-based filesystem which
uses large pages. An application wishing to create a memory region with
large pages can create a file in a hugetlbfs directory, then use
mmap() to map it into its address space. Sharing is nicely
handled by the filesystem itself, and need no longer be done with a
separate system call. And the permissions problem is solved by allowing a
system administrator to set protections on the hugetlbfs filesystem which
fit the site's needs. The filesystem thus provides a more flexible
interface to the large page facility. So, as of 2.5.54, the system call
interface will be removed.
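From an application's point of view, the hugetlbfs approach is ordinary file handling. The sketch below assumes a hugetlbfs filesystem is already mounted somewhere the caller can write to (the /mnt/huge path in the usage note is hypothetical) and that the length is a multiple of the system's huge page size. The ftruncate() step is not part of the description above; it is there because Python's mmap module refuses to map more than the file's current size, and it assumes the running kernel accepts ftruncate() on a hugetlbfs file.

```python
import mmap
import os


def map_huge_region(path, length):
    """Create (or open) a file and map it shared and writable.

    When path is on a hugetlbfs mount, the mapping is backed by large pages.
    Returns an mmap object, or None if the file cannot be created or mapped.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    except OSError:
        return None
    try:
        os.ftruncate(fd, length)   # size the file so Python's mmap accepts it
        return mmap.mmap(fd, length, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    except (OSError, ValueError):
        return None
    finally:
        os.close(fd)   # the mapping stays valid after the descriptor is closed
```

A call such as `map_huge_region("/mnt/huge/region", 4 * 1024 * 1024)` would return a writable mapping on a suitably configured system. A second process mapping the same file sees the same memory, and access is governed by the ordinary permissions on the file and the mount point.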
All this could lead one to wonder why the hugetlb patch wasn't done this
way in the first place. The whole point of the kernel peer review process,
after all, is to keep poor interfaces out of the kernel. Linus's answer to
this is simple: the patch was not much discussed prior to merging because
the companies behind it are still unused to open code development. In fact,
some companies have rules
which forbid the sorts of conversations needed to develop in an open source
environment.
    So not only did you have a feature that is mostly useful only to a
    smallish group of people - you had that group of people not used to
    open communication in the first place, AND you had rules that made
    some of the important part of the communication illegal in the
    first place.
    Still wonder why it wasn't widely discussed during development?
    Intel engineers would basically take people aside in private at
    conferences talking about what kinds of improvements Oracle was
    seeing.
Developing code in the open seems like the only way to work for many
developers. This episode is a good reminder that not everybody, yet, has
really come to understand how the free software development process works.
Page editor: Jonathan Corbet