Kernel development

Release status

Kernel release status

The current development kernel is 2.5.53, which was released by Linus on December 23. It contains a bunch of device mapper fixes, an SCTP update, some memory management fixes, an ia-64 merge, some USB updates, a new aic7xxx driver, the new x86 "sysenter" system call mechanism (discussed in the December 19 LWN Kernel Page), and many other fixes and updates. The long-format changelog has the details.

Linus's pre-2.5.54 BitKeeper repository contains a large number of patches, most of which are the sorts of fixes that one would expect during a feature freeze. There is also a new bit of compiler trickery to issue warnings when deprecated functions are called, a number of kbuild fixes, a new dev_printk() function for standardized device error reporting, the removal of the much disliked hugetlb system calls (in favor of hugetlbfs), a new "kmalloc for each CPU" API, and more loadable module fixes.

The current stable kernel is 2.4.20. Marcelo has not released any 2.4.21 prepatches since December 18.

Kernel development news

Fixing up the shared page table patch

One patch that is still apparently being considered for 2.5 is the shared page table code. Since this patch makes significant changes to the VM subsystem, it is worth looking at why it is interesting, and what its prospects are.

Shared page tables do exactly what one would expect: they allow processes to share their page tables. The primary application of this technique is at fork() time; when a process creates a new child, the two processes share the same low-level page tables. These tables are shared in a "copy on write" mode; when either process changes memory, both the page being changed and the page table that points to it are copied. The idea is that if the new process calls exec() before changing much memory, much of the page table copying overhead can be avoided entirely.
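The copy-on-write behavior described above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the names PteTable, Process, and sharers are invented for illustration, and the model leaves out details (such as marking entries read-only) that a real implementation would need.

```python
class PteTable:
    """One page table page: a list of frame numbers plus a share count."""

    def __init__(self, frames):
        self.frames = list(frames)
        self.sharers = 1  # how many processes currently point at this table


class Process:
    """A process, reduced to its list of low-level page table pages."""

    def __init__(self, tables):
        self.tables = tables

    def fork(self):
        # Share every page table page instead of copying its entries.
        for table in self.tables:
            table.sharers += 1
        return Process(list(self.tables))

    def write(self, table_idx, entry_idx, new_frame):
        table = self.tables[table_idx]
        if table.sharers > 1:
            # First write through a shared table: give this process a
            # private copy of the whole page table page.
            table.sharers -= 1
            table = PteTable(table.frames)
            self.tables[table_idx] = table
        # new_frame stands for the freshly copied data page.
        table.frames[entry_idx] = new_frame


parent = Process([PteTable([1, 2]), PteTable([3, 4])])
child = parent.fork()
child.write(0, 1, 99)
print(child.tables[0].frames, parent.tables[0].frames)
print(child.tables[1] is parent.tables[1])
```

After the write, the child has a private copy of the first page table page, while the second is still shared with the parent.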

Shared page tables can also save significant amounts of memory when large processes (or large shared memory segments) are involved, but the fork() overhead is the real driving force behind this patch. The 2.5 kernel has a significantly slower fork() than 2.4, as a result of the reverse mapping VM code. Copying page tables requires copying the reverse map entries, which slows fork() down. Shared page tables, it is hoped, can eliminate that copy and get fork() back to something close to its 2.4 performance.

So it was a little disappointing when Andrew Morton ran some benchmarks and discovered that shared page tables made fork() even slower than it was before. The optimization, it seems, is really a pessimization - at least when relatively small processes are involved, which is the case that matters to most users.

Dave McCracken figured out what is going on. Most smaller processes, it seems, have three distinct areas of writable memory: the data area, the stack, and the C library's data area. On most systems, a single page table page holds enough page table entries to map 4MB of actual memory. Unless the process is fairly large, then, there will be exactly one page table page for each of the three writable areas, or three in all.
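The 4MB figure follows from simple arithmetic if one assumes the classic 32-bit x86 layout of 4KB pages and 4-byte page table entries (an assumption; the text says only "most systems"). The sketch below works it out and counts the page table pages needed for three writable areas; the addresses and sizes are made up for illustration.

```python
PAGE_SIZE = 4096  # bytes per page (assumed: 32-bit x86)
PTE_SIZE = 4      # bytes per page table entry (assumed: no PAE)


def bytes_per_pte_page(page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Memory mapped by one full page of page table entries."""
    return (page_size // pte_size) * page_size


def pte_pages_for(areas, span=None):
    """Indices of the page table pages covering (start, length) areas."""
    span = span or bytes_per_pte_page()
    used = set()
    for start, length in areas:
        first = start // span
        last = (start + length - 1) // span
        used.update(range(first, last + 1))
    return used


# Hypothetical layout of a small process's writable memory.
areas = [
    (0x0804A000, 16 * 1024),  # program data
    (0x40150000, 32 * 1024),  # C library data
    (0xBFFFC000, 16 * 1024),  # stack
]

print(bytes_per_pte_page() // (1024 * 1024), "MB per page table page")
print(len(pte_pages_for(areas)), "page table pages hold writable mappings")
```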

The shared page table patch thus allows the deferral of the copying of three pages' worth of page table entries. As soon as either process changes the memory mapped by one of those page table pages, that page can no longer be shared and all page table entries within that page must be copied. Unfortunately, even a process which does nothing but call exec() will almost certainly write memory in all three areas, requiring the unsharing of all three page table pages.

In other words, the shared page table patch is introducing the extra overhead required to share and unshare page table pages, but, in most cases, all of those pages will have to be unshared and copied anyway. So the extra overhead just makes things even slower than they were before.
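A back-of-the-envelope model makes the tradeoff concrete. The cost units below are invented purely for illustration (they are not measurements from Andrew's benchmarks), and the function names are hypothetical.

```python
COPY_COST = 10    # made-up cost of copying one page table page
SHARE_COST = 1    # made-up cost of setting up sharing for one page at fork()
UNSHARE_COST = 1  # made-up extra bookkeeping when a shared page is unshared


def fork_cost_copy(n_pages):
    """Plain fork(): every page table page is copied up front."""
    return n_pages * COPY_COST


def fork_cost_shared(n_pages, n_touched):
    """Shared fork(): pay to share everything, then copy what gets written."""
    return n_pages * SHARE_COST + n_touched * (COPY_COST + UNSHARE_COST)


# Small process: three page table pages, all written before exec().
print(fork_cost_copy(3), fork_cost_shared(3, 3))

# Large process: 100 page table pages, only three written.
print(fork_cost_copy(100), fork_cost_shared(100, 3))
```

With these numbers the small process pays more with sharing than without, while the large process comes out far ahead; the real ratios depend on costs this toy model does not attempt to capture.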

There are a couple of things that can be done to address this problem. Dave posted a relatively simple fix: do not share page tables unless the forking process has at least four pages' worth of them. It turns out that, if even one page table page need not be copied, the sharing overhead is worthwhile. So, if you turn off sharing in the case where it doesn't help, you get back to where you were before, and can enjoy the benefits of page table sharing for very large processes.
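Dave's fix amounts to a threshold test at fork() time. A minimal sketch follows; only the four-page cutoff comes from the text, while the function name and its return values are invented here and do not correspond to anything in the actual patch.

```python
SHARE_THRESHOLD = 4  # page table pages; the cutoff mentioned in the text


def fork_strategy(n_pte_pages, threshold=SHARE_THRESHOLD):
    """Decide how fork() should treat the parent's page tables."""
    return "share" if n_pte_pages >= threshold else "copy"


for n in (3, 4, 100):
    print(n, "page table pages:", fork_strategy(n))
```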

A more involved approach would be to spread out a process's writable memory so that it is mapped by more than one page table page. Writable process memory comes in numerous distinct chunks; a look at the /proc/.../maps entry for the emacs process being used to write this article shows 33 separate, writable virtual memory areas (VMAs). If each VMA is mapped on its own 4MB boundary, and thus has its own page table page, then writing in one VMA does not require copying the page table entries for all the other VMAs.
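Counting writable VMAs, and the page table pages behind them, is easy to do from the maps file. The sketch below parses text in the /proc/.../maps format; the sample listing is fabricated for illustration, and the 4MB span per page table page is the same assumption used in the text.

```python
PTE_SPAN = 4 * 1024 * 1024  # memory mapped by one page table page (assumed)


def writable_vmas(maps_text):
    """Return (start, end) for every VMA whose permissions include 'w'."""
    vmas = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        addr_range, perms = fields[0], fields[1]
        if "w" in perms:
            start, end = (int(x, 16) for x in addr_range.split("-"))
            vmas.append((start, end))
    return vmas


def pte_pages_touched(vmas, span=PTE_SPAN):
    """Indices of the page table pages that map the given VMAs."""
    used = set()
    for start, end in vmas:
        used.update(range(start // span, (end - 1) // span + 1))
    return used


# Made-up sample in the maps format; not taken from a real process.
SAMPLE = """\
08048000-0804a000 r-xp 00000000 03:01 1234 /bin/example
0804a000-0804b000 rw-p 00002000 03:01 1234 /bin/example
0804b000-0804f000 rw-p 00000000 00:00 0
40000000-40015000 r-xp 00000000 03:01 5678 /lib/ld.so
40015000-40016000 rw-p 00014000 03:01 5678 /lib/ld.so
bfffe000-c0000000 rw-p 00000000 00:00 0
"""

vmas = writable_vmas(SAMPLE)
print(len(vmas), "writable VMAs")
print(len(pte_pages_touched(vmas)), "page table pages map them")
```

On a Linux system, the same functions can be pointed at a live process by passing in the contents of /proc/self/maps instead of the sample text.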

Andrew Morton gave this approach a try, and saw a 5-10% speedup. Performance is improved, in other words, but is still far short of what a 2.4 kernel can do.

The bottom line appears to be this: the shared page table patch, while providing some benefits, is failing in its goal of mitigating the extra fork() overhead brought by the reverse mapping VM. Unless somebody finds a way to address this problem, shared page tables seem unlikely to find their way into the 2.5 kernel.

Manipulating multiple address spaces

Back in November, LWN covered a patch by Jeff Dike which made some User-mode Linux improvements possible. Jeff needed a mechanism which would allow him to create multiple address spaces for a single Linux process, manipulate those address spaces, and switch the process between them. The interface he came up with was:

  • Opening /proc/mm would return a file descriptor representing a newly-created address space.

  • Writing to that file descriptor would execute commands on the address space, as described by the data "written." Mapping of segments, changing permissions, etc. would be handled via this mechanism; in this way, UML could set up an address space as needed for one of its processes.

  • An extension to the ptrace() system call allows UML to switch a child process's address space.

This interface gets the job done, but it's not too surprising that Linus did not like it. Performing virtual memory management operations via a magic /proc file is just not the most elegant way of doing things.

Cleaning up the first step - creating new address spaces - is relatively easy. It's just a matter of adding a new create_mm() system call. But then how does one manipulate that new space - mapping in a file, or changing protections, for example? The system calls which normally perform these functions (mmap(), mprotect(), ...) are not set up to have a separate address space passed in as a parameter. One could create a whole new set of system calls that take that extra parameter, but that is a task that gets messy in a hurry.

So Linus has come up with another idea. Why not add one more system call (mm_indirect()), which would invoke any other system call in the context of a different address space? mm_indirect() would simply switch the calling process over to the new address space, invoke the real system call of interest, then switch back. In this way, all system calls could be made to manipulate a different address space without the need to modify any of them.
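The "switch, call, switch back" idea can be modeled in a few lines. This is a toy in Python, not the proposed interface: the text gives no signature for mm_indirect(), so the arguments used here, the Task class, and the stand-in sys_mmap() are all assumptions made for illustration.

```python
class Task:
    """A process with a current address space (here just a dict of mappings)."""

    def __init__(self, mm):
        self.mm = mm


def sys_mmap(task, addr, length, prot):
    """Stand-in for mmap(): record a mapping in the task's current space."""
    task.mm[addr] = (length, prot)
    return addr


def mm_indirect(task, target_mm, syscall, *args):
    """Run an unmodified 'system call' against a different address space."""
    saved = task.mm
    task.mm = target_mm  # switch to the other address space
    try:
        return syscall(task, *args)
    finally:
        task.mm = saved  # always switch back, even on error


own_mm, other_mm = {}, {}
task = Task(own_mm)
mm_indirect(task, other_mm, sys_mmap, 0x10000, 4096, "rw")
print(other_mm)
print(own_mm)
```

Note that sys_mmap() knows nothing about the second address space; the wrapper does all of the work, which is the appeal of the design.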

This solution will work for UML, and is thus likely to be implemented. It may eventually lead to a number of currently unimagined "coprocess" applications as well. One question remains unanswered, however: is this sort of change really 2.5 material, or does it get to wait for the next development series?

(As an aside, we look forward to seeing the results of Jeff's work running UML with the valgrind memory debugger. Chances are it will turn up a lot of previously unnoticed memory bugs in the Linux kernel.)

The end of the hugetlb system calls

The hugetlb (or "large page") patch was covered here last August. This patch added a couple of new system calls allowing a suitably privileged process to create anonymous memory using the large page capability of most modern processors. Using large pages cuts down on page table overhead, and, crucially, optimizes the use of the processor's address translation cache. The result is that applications using large memory arrays (Oracle, in particular) run faster.

The large page capability is seen as useful by most developers, but there has been a long series of complaints about the system call interface. The system calls do pretty much what one would expect: allocate a large page region, free it, share it with others. But not everybody sees the need for a new set of system calls for performing what are (mostly) standard memory operations. Then, there is the issue of permissions. The ability to allocate huge pages cannot be handed out to just anybody, since it is a good vehicle for the creation of denial of service attacks. That means that root access is required to make use of the large page capability. Call them superstitious, but many users are reluctant to run Oracle with root access.

Meanwhile, William Lee Irwin added hugetlbfs - a RAM-based filesystem which uses large pages. An application wishing to create a memory region with large pages can create a file in a hugetlbfs directory, then use mmap() to map it into its address space. Sharing is nicely handled by the filesystem itself, and need no longer be done with a separate system call. And the permissions problem is solved by allowing a system administrator to set protections on the hugetlbfs filesystem which fit the site's needs. The filesystem thus provides a more flexible interface to the large page facility than the system calls did. So, as of 2.5.54, the system call interface will be removed.
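The create-a-file-then-mmap() pattern looks roughly like the sketch below. The mount point /mnt/huge and the function name are assumptions for illustration, and the code has not been exercised here against a real hugetlbfs mount; on such a mount the length would presumably need to be a multiple of the large page size. Pointed at an ordinary directory, the same code simply yields a normal shared file mapping.

```python
import mmap
import os

HUGETLBFS_DIR = "/mnt/huge"  # hypothetical mount point chosen by the admin


def map_shared_file(directory, name, length):
    """Create (or open) a file in the directory and map it read/write."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, length)      # give the file the desired size
        return mmap.mmap(fd, length)  # shared mapping; mmap keeps its own fd
    finally:
        os.close(fd)
```

A cooperating process can share the region by opening and mapping the same file, with access controlled by the ordinary ownership and mode bits on the file and its directory. A call might look like map_shared_file(HUGETLBFS_DIR, "segment", 4 * 1024 * 1024).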

All this could lead one to wonder why the hugetlb patch wasn't done this way in the first place. The whole point of the kernel peer review process, after all, is to keep poor interfaces out of the kernel. Linus's answer to this is simple: the patch was not much discussed prior to merging because the companies behind it are still unused to open code development. In fact, some companies have rules which forbid the sorts of conversations needed to develop in an open source environment.

So not only did you have a feature that is mostly useful only to a smallish group of people - you had that group of people not used to open communication in the first place, AND you had rules that made some of the important part of the communication illegal in the first place.

Still wonder why it wasn't widely discussed during development? Intel engineers would basically take people aside in private at conferences talking about what kinds of improvments Oracle was seeing.

Developing code in the open seems like the only way to work for many developers. This episode is a good reminder that not everybody has yet come to understand how the free software development process works.

Page editor: Jonathan Corbet

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds