Brief items
The current 2.6 prepatch is 2.6.5-rc2, which was
announced by Linus on March 19.
This prepatch includes a USB update, some new SELinux features,
a reiserfs update, an ALSA update, a set of hotplug CPU patches, and lots
of fixes; see
the long-format changelog for
the details.
Linus's BitKeeper tree contains, as of this writing, some architecture
updates, a watchdog driver update, and various fixes.
The current tree from Andrew Morton is 2.6.5-rc2-mm2. Recent additions to the -mm
tree include journaling of ext3 quota files, a new fcntl()
file_operations method (see below), a new non-executable stack
patch, and lots
of fixes.
The current 2.4 prepatch is 2.4.26-pre5, which was released on March 20. This one includes
some SCSI fixes, a USB update, some ACPI work, and a small set of
fixes. This is, says Marcelo, probably the last prepatch before the 2.4.26
release candidates start.
The current neolithic kernel prepatch is 2.2.27-pre1; Marc-Christian
Petersen started the 2.2.27 process on
March 24. This prepatch contains about a dozen important fixes.
Kernel development news
Two weeks ago, this page
described Andrea
Arcangeli's "anon_vma" work in some detail. This work, remember, is an
attempt to improve memory scalability in the kernel by eliminating the
reverse mapping ("rmap") chains used to find page table entries which
reference a given page. The rmap chains can use significant amounts of low
memory and can slow down
fork() calls, so this work is of
interest.
Andrea has continued pushing the anon_vma effort through a series of kernel
tree releases. The latest, 2.6.5-rc2-aa2,
solves some of the remaining problems and comes with this statement:
"The next target is the merging of the prio_tree, but that will be a
separated patch. After that this whole thing should be mergeable
into mainline."
(The prio_tree reference is to Rajesh Venkatasubramanian's priority tree patch, which speeds the search for
interesting virtual memory areas when a page is mapped a large number of
times.)
Andrea's work is proceeding nicely, but it's worth
noting that anon_vma is not the only approach to the implementation of an
object-based reverse mapping scheme for anonymous memory. There is
competition in the form of "anonmm" by Hugh Dickins. Hugh has recently
reworked the patch and posted it for comments; interested parties can find
this (multi-part) posting in the "patches" section below.
The anon_vma patch works by creating a linked list of virtual memory areas
(VMAs) which reference a given page. The anonmm patch, instead, creates a
connection between an anonymous page and the mm_struct structures
which reference it. The mm_struct is the top-level structure used
to manage a process's virtual address space; it contains pointers to all of
the process's VMAs and page tables, along with various bits of locking and
housekeeping information. If you have a pointer to a process's
mm_struct and a virtual address, you can quickly walk the page
tables and determine whether the given address is a reference to a specific
page.
Most of the object-based reverse mapping work so far has dealt with the VMA
structure. When performing reverse mapping of file-backed pages, use of
the VMA structure is unavoidable; if multiple processes have mapped the
file into their address spaces, each process likely has a different virtual
address for the same page. The VMA structure contains the necessary
information to determine which virtual address each process will have for a
specific offset within a file. Once that address is found, the page of
interest can be unmapped from that process's address space.
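The address calculation described above can be sketched in a few lines of C. This is a minimal illustration, not the kernel's code: the structure is a cut-down stand-in for the real VMA (only the vm_start, vm_end, and vm_pgoff fields are kept), the function name is hypothetical, and 4 KiB pages are assumed.

```c
#include <stddef.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Simplified stand-in for the kernel's VMA; only the fields needed here. */
struct vma_sketch {
    unsigned long vm_start;  /* first virtual address of the mapping */
    unsigned long vm_end;    /* first address past the end of the mapping */
    unsigned long vm_pgoff;  /* file offset of vm_start, in pages */
};

/*
 * Return the virtual address at which file page `pgoff` appears in this
 * mapping, or 0 if the mapping does not cover that page.
 * (Hypothetical helper name, for illustration only.)
 */
unsigned long vma_address_sketch(const struct vma_sketch *vma,
                                 unsigned long pgoff)
{
    unsigned long npages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

    if (pgoff < vma->vm_pgoff || pgoff - vma->vm_pgoff >= npages)
        return 0;
    return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}
```

Two processes mapping the same file at different places simply have different vm_start and vm_pgoff values, so the same file page resolves to a different address in each of them.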
Anonymous pages are different from file-backed pages, however; they are
only shared between processes when a process forks (and, even then, it's a
copy-on-write sharing). That means that, with one exception that we'll get
to, shared anonymous pages have the same virtual address in every process.
Thus, if you can track an anonymous page's virtual address and the
processes which share that page, you can quickly find all of the page table
entries referencing the page.
The anonmm patch takes advantage of this fact. An anonymous page's virtual
address is stored in the index field of the page
structure. This field is normally used to give a page's offset within the
file that backs it, but, since anonymous pages have no backing file, the
field is available for this use. Hugh's patch then creates a new
anonmm structure which is used to create a linked list of
mm_struct structures; a pointer to this list is
also stored in the page structure. The resulting data structure
looks roughly like this:
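As a rough illustration of that layout, here is a minimal C sketch. All of the names are hypothetical, the "address space" is reduced to a single toy page table entry rather than real page tables, and none of this is taken from Hugh's patch; it only shows how a page, its stored virtual address, and a list of address spaces fit together.

```c
#include <stddef.h>

struct page_sketch;

/* Toy address space: one "page table entry" instead of real page tables. */
struct mm_sketch {
    unsigned long pte_address;           /* virtual address of the mapping */
    const struct page_sketch *pte_page;  /* page mapped there, or NULL */
};

/* One list entry per address space that may share the anonymous page. */
struct anonmm_sketch {
    struct mm_sketch *mm;
    struct anonmm_sketch *next;
};

/* Cut-down page structure: just the two fields discussed in the text. */
struct page_sketch {
    struct anonmm_sketch *anon_list;  /* hypothetical name for the list pointer */
    unsigned long index;              /* virtual address of the anonymous page */
};

/*
 * Walk the list of address spaces and count those which map `page`
 * at the virtual address recorded in page->index.
 */
size_t count_mappings(const struct page_sketch *page)
{
    const struct anonmm_sketch *a;
    size_t n = 0;

    for (a = page->anon_list; a != NULL; a = a->next) {
        const struct mm_sketch *mm = a->mm;

        if (mm->pte_page == page && mm->pte_address == page->index)
            n++;
    }
    return n;
}
```

Note that the lookup only ever checks one address per address space; a mapping of the same page at any other address would not be found by this walk.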
With this structure in place, the kernel can follow the pointers to quickly
find the page tables referencing a given anonymous page. This approach, in
theory, should be a little simpler and faster than the anon_vma technique;
a process may have several VMAs for anonymous memory areas, but it will
never have more than one mm_struct.
There is one little problem with this whole scheme. It depends on the fact
that every process has the same virtual address for a given, shared
anonymous page. What happens when some wiseass process comes along and
moves a chunk of anonymous memory with mremap()? At that point,
the memory has a new address, and the anonmm algorithm will be unable to
find it. Hugh's solution for this problem is to simply copy the pages
being remapped. They are copy-on-write pages, so making copies will not
create any correctness issues. The copying could be expensive - it may
involve swapping in a number of pages so that they can be copied - but
remapping of anonymous memory should be a sufficiently rare operation that
a performance hit should not be a problem.
Which scheme is truly faster? Martin Bligh has posted a set of benchmarks showing that, while both
reverse mapping approaches are significantly faster than the mainline
kernel, neither is obviously faster than the other. Andrea's work is
marginally ahead in more tests than Hugh's, but, overall, the two produce
roughly equivalent results. So, if one of these implementations does find
its way into the 2.6 kernel, it will have to be chosen for reasons other
than performance. Either that, or it will be some combination of the two;
Andrea and Hugh are actively discussing ideas, so that sort of combination
could happen.
The
file_operations structure contains pointers to functions which
implement I/O operations on files and char devices. These operations
include the usual suspects, such as "open", "read", "write", "llseek",
etc., along with some more esoteric ones ("sendfile",
"get_unmapped_area"). The
file_operations structure tends not to
change very often; changes here can force updating a great many filesystems
and drivers.
The NFS maintainers recently ran into a problem: it is not possible to
simultaneously implement the O_DIRECT and O_APPEND modes
over NFS. Rather than silently fail to implement a request to do so, the
NFS developers have submitted a patch which adds an fcntl() method
to the file_operations structure. Its prototype is:
    long (*fcntl)(unsigned int fd, unsigned int cmd,
                  unsigned long arg, struct file *filp);
The fd, cmd, and arg parameters come straight
from user space. A file descriptor is an unusual argument for a
file_operations method, but the generic fcntl() code
needs it. filp is, as usual, a pointer to the file
structure for the open file.
If a module does not provide an fcntl() method, the call is handled
in the usual way. Otherwise, the new fcntl() function should
provide a complete implementation of that system call. Typically, the
method will perform whatever device- or filesystem-specific work is needed
(NFS simply checks for the O_DIRECT|O_APPEND combination and
returns a failure code if it's there),
then pass all four arguments to generic_file_fcntl(), which is
exported to modules.
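The check-then-delegate pattern described above might look something like the following user-space mock. It is only a sketch under stated assumptions: the structure, the flag and command constants, and both function names are stand-ins invented for illustration (real code would use the kernel's own definitions and the generic helper named in the text), and the choice of -EINVAL as the failure code is an assumption, since the text says only that a failure code is returned.

```c
#include <errno.h>

/* Illustrative values only; real code would use the kernel's definitions. */
#define SK_F_SETFL   4
#define SK_O_APPEND  02000
#define SK_O_DIRECT  040000

/* Cut-down stand-in for the kernel's file structure. */
struct file_sketch {
    unsigned int f_flags;
};

/* Stand-in for the generic helper: here it just records the new flags. */
long generic_fcntl_sketch(unsigned int fd, unsigned int cmd,
                          unsigned long arg, struct file_sketch *filp)
{
    (void)fd;  /* unused in this mock */
    if (cmd == SK_F_SETFL)
        filp->f_flags = (unsigned int)arg;
    return 0;
}

/*
 * Hypothetical filesystem-specific method: refuse the one combination
 * that cannot be supported, then hand everything else to the generic code.
 */
long example_fcntl(unsigned int fd, unsigned int cmd,
                   unsigned long arg, struct file_sketch *filp)
{
    const unsigned long both = SK_O_APPEND | SK_O_DIRECT;

    if (cmd == SK_F_SETFL && (arg & both) == both)
        return -EINVAL;  /* assumed error code */
    return generic_fcntl_sketch(fd, cmd, arg, filp);
}
```

The point of the design is that the method only has to recognize the cases it cares about; every other command passes through unchanged.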
This patch is currently in the -mm tree; it will likely find its way into
the mainline sometime after 2.6.5 comes out.
One of the tasks on the 2.5 "to do" list was the implementation of proper
write barriers in the block I/O subsystem. Any code which attempts to
implement true transactional behavior on disk-based files needs this
capability. Without it, systems like journaling filesystems and database
managers lack the control they need over the order in which data is written
to disk. Mis-ordered writes can lead to data corruption and other
unfortunate things.
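To make the ordering constraint concrete, here is a toy C sketch of a request queue. It is not how the block layer is implemented; the structure and function names are hypothetical. Ordinary requests may be sorted by sector for efficiency, but no request is ever moved across one marked as a barrier.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy write request; hypothetical structure for illustration. */
struct req_sketch {
    unsigned long sector;
    int barrier;  /* nonzero: nothing may be reordered across this request */
};

/* qsort() comparison: ascending sector order. */
static int by_sector(const void *a, const void *b)
{
    const struct req_sketch *x = a, *y = b;

    return (x->sector > y->sector) - (x->sector < y->sector);
}

/*
 * Sort each run of ordinary requests by sector, leaving every barrier
 * request in its original position in the queue.
 */
void schedule_sketch(struct req_sketch *q, size_t n)
{
    size_t start = 0, i;

    for (i = 0; i <= n; i++) {
        if (i == n || q[i].barrier) {
            if (i > start)
                qsort(q + start, i - start, sizeof(*q), by_sector);
            start = i + 1;
        }
    }
}
```

A journaling filesystem could, for example, queue its journal blocks, then a commit block marked as a barrier, and rely on the commit block not reaching the disk before the blocks it describes.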
The 2.6 block I/O subsystem was designed with barrier support as a core
feature. But, at this point, most low-level block drivers do not actually
implement barriers, and the filesystems do not use them. Patches to fill
in some of the gaps have been around for a while (LWN looked at barriers last October), but have not yet
been merged.
There has been a new surge of interest in proper barrier support, perhaps
as a result of applications vendors starting to take a hard look at the 2.6
kernel. Now Jens Axboe and Chris Mason have put together a new barrier support patch which
gets Linux closer to being able to provide real disk I/O guarantees. With
this patch, write barriers work, but only on IDE drives (not SCSI or serial
ATA), and only with the reiserfs and ext3 filesystems. Even then, things
are qualified: "ext3 works but only if things don't go wrong."
In other words, barrier support will be staying on the "to do" list for a
little while longer yet. But the work is being done, and 2.6 should be
able to implement real barriers before it is all over.
Patches and updates
Kernel trees
- Andrea Arcangeli: 2.6.5-rc1-aa1. "This implements anon_vma for the anonymous memory unmapping and objrmap
for the file mappings, effectively removing rmap completely and
replacing it with more efficient algorithms..."
(March 18, 2004)
Page editor: Jonathan Corbet