Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch is 2.6.5-rc2, which was announced by Linus on March 19. This prepatch includes a USB update, some new SELinux features, a reiserfs update, an ALSA update, a set of hotplug CPU patches, and lots of fixes; see the long-format changelog for the details.

Linus's BitKeeper tree contains, as of this writing, some architecture updates, a watchdog driver update, and various fixes.

The current tree from Andrew Morton is 2.6.5-rc2-mm2. Recent additions to the -mm tree include journaling of ext3 quota files, a new fcntl() file_operations method (see below), a new non-executable stack patch, and lots of fixes.

The current 2.4 prepatch is 2.4.26-pre5, which was released on March 20. This one includes some SCSI fixes, a USB update, some ACPI work, and a small set of fixes. This is, says Marcelo, probably the last prepatch before the 2.4.26 release candidates start.

The current neolithic kernel prepatch is 2.2.27-pre1; Marc-Christian Petersen started the 2.2.27 process on March 24. This prepatch contains about a dozen important fixes.

Reverse mapping anonymous pages - again

Two weeks ago, this page described Andrea Arcangeli's "anon_vma" work in some detail. This work, remember, is an attempt to improve memory scalability in the kernel by eliminating the reverse mapping ("rmap") chains used to find page table entries which reference a given page. The rmap chains can use significant amounts of low memory and can slow down fork() calls, so this work is of interest.

Andrea has continued pushing the anon_vma effort through a series of kernel tree releases. The latest, 2.6.5-rc2-aa2, solves some of the remaining problems and comes with this statement:

The next target is the merging of the prio_tree, but that will be a separated patch. After that this whole thing should be mergeable into mainline.

(The prio_tree reference is to Rajesh Venkatasubramanian's priority tree patch, which speeds the search for interesting virtual memory areas when a page is mapped a large number of times.)

Andrea's work is proceeding nicely, but it's worth noting that anon_vma is not the only approach to the implementation of an object-based reverse mapping scheme for anonymous memory. There is competition in the form of "anonmm" by Hugh Dickins. Hugh has recently reworked the patch and posted it for comments; interested parties can find this (multi-part) posting in the "patches" section below.

The anon_vma patch works by creating a linked list of virtual memory areas (VMAs) which reference a given page. The anonmm patch, instead, creates a connection between an anonymous page and the mm_struct structures which reference it. The mm_struct is the top-level structure used to manage a process's virtual address space; it contains pointers to all of the process's VMAs and page tables, along with various bits of locking and housekeeping information. If you have a pointer to a process's mm_struct and a virtual address, you can quickly walk the page tables and determine whether the given address is a reference to a specific page.

Most of the object-based reverse mapping work so far has been built around the VMA structure. When performing reverse mapping of file-backed pages, use of the VMA structure is unavoidable; if multiple processes have mapped the file into their address spaces, each process likely has a different virtual address for the same page. The VMA structure contains the necessary information to determine which virtual address each process will have for a specific offset within a file. Once that address is found, the page of interest can be unmapped from that process's address space.
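As a rough illustration of that address calculation, the sketch below models a VMA in user space. The three fields are named after the kernel's vm_start, vm_end, and vm_pgoff, but the model_vma type, the vma_address_of() helper, and the 4KB page size are assumptions made for this example; none of it is taken from the patches discussed here.

```c
#define MODEL_PAGE_SHIFT 12                     /* assumption: 4KB pages */
#define MODEL_BAD_ADDR   ((unsigned long)-1)    /* "not mapped by this VMA" */

/* Simplified, hypothetical stand-in for a VMA that maps part of a file. */
struct model_vma {
    unsigned long vm_start;   /* first virtual address of the mapping */
    unsigned long vm_end;     /* first address past the end of the mapping */
    unsigned long vm_pgoff;   /* file offset of vm_start, in pages */
};

/*
 * Return the virtual address at which this VMA maps file page "pgoff",
 * or MODEL_BAD_ADDR if the VMA does not cover that page.
 */
unsigned long vma_address_of(const struct model_vma *vma, unsigned long pgoff)
{
    unsigned long addr;

    if (pgoff < vma->vm_pgoff)
        return MODEL_BAD_ADDR;
    addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << MODEL_PAGE_SHIFT);
    if (addr >= vma->vm_end)
        return MODEL_BAD_ADDR;
    return addr;
}
```

Two processes mapping different windows of the same file get different answers for the same file page, which is why the per-process VMA has to be consulted.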

Anonymous pages are different from file-backed pages, however; they are only shared between processes when a process forks (and, even then, it's a copy-on-write sharing). That means that, with one exception that we'll get to, shared anonymous pages have the same virtual address in every process. Thus, if you can track an anonymous page's virtual address and the processes which share that page, you can quickly find all of the page table entries referencing the page.

The anonmm patch takes advantage of this fact. An anonymous page's virtual address is stored in the index field of the page structure. This field is normally used to give a page's offset within the file that backs it, but, since anonymous pages have no backing file, the field is available for this use. Hugh's patch then creates a new anonmm structure which is used to create a linked list of mm_struct structures; a pointer to this list is also stored in the page structure. The resulting data structure looks roughly like this: the page structure records the virtual address and points to the anonmm list, and each entry on that list leads to one of the mm_struct structures which may reference the page.

With this structure in place, the kernel can follow the pointers to quickly find the page tables referencing a given anonymous page. This approach, in theory, should be a little simpler and faster than the anon_vma technique; a process may have several VMAs for anonymous memory areas, but it will never have more than one mm_struct.
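The relationships described above can be sketched as a small user-space model. Everything here is hypothetical: the model_* types, the flat array standing in for real page tables, and the field name anonmm are inventions for illustration, and the actual patch may lay things out differently.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12      /* assumption: 4KB pages */
#define MODEL_NR_PTES    16      /* tiny address space for the example */

struct model_page;

/* Stand-in for mm_struct: one flat "page table" per address space. */
struct model_mm {
    struct model_page *pte[MODEL_NR_PTES];
};

/* One list node per address space which may share the page. */
struct model_anonmm {
    struct model_mm *mm;
    struct model_anonmm *next;
};

/* Stand-in for struct page, with the two fields the text describes. */
struct model_page {
    unsigned long index;           /* virtual address of the anonymous page */
    struct model_anonmm *anonmm;   /* head of the list of sharing mms */
};

/* Does this address space map "page" at the given virtual address? */
static int mm_maps_page(const struct model_mm *mm, unsigned long addr,
                        const struct model_page *page)
{
    unsigned long slot = addr >> MODEL_PAGE_SHIFT;

    return slot < MODEL_NR_PTES && mm->pte[slot] == page;
}

/* Walk the anonmm list and count the page table entries found. */
int count_mappings(const struct model_page *page)
{
    const struct model_anonmm *a;
    int found = 0;

    for (a = page->anonmm; a != NULL; a = a->next)
        if (mm_maps_page(a->mm, page->index, page))
            found++;
    return found;
}
```

The walk needs only one check per address space on the list, since the single address stored in index is assumed to be valid in every one of them.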

There is one little problem with this whole scheme. It depends on the fact that every process has the same virtual address for a given, shared anonymous page. What happens when some wiseass process comes along and moves a chunk of anonymous memory with mremap()? At that point, the memory has a new address, and the anonmm algorithm will be unable to find it. Hugh's solution for this problem is to simply copy the pages being remapped. They are copy-on-write pages, so making copies will not create any correctness issues. The copying could be expensive - it may involve swapping in a number of pages so that they can be copied - but remapping of anonymous memory should be a sufficiently rare operation that a performance hit should not be a problem.
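A minimal sketch of the copy-on-move idea follows, again as a user-space model rather than anything from Hugh's patch. The model_* types, the shared flag, the caller-supplied spare page, and the shortcut of simply updating the index of an unshared page are all assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SHIFT 12                       /* assumption: 4KB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_NR_PTES    16

struct model_page {
    unsigned long index;                  /* address recorded for reverse mapping */
    int shared;                           /* nonzero if other mms reference it */
    unsigned char data[MODEL_PAGE_SIZE];
};

struct model_mm {
    struct model_page *pte[MODEL_NR_PTES];
};

/*
 * Move one anonymous page within "mm" from old_addr to new_addr.
 * A shared page is copied into "spare", so the original keeps its
 * recorded address for every other address space. An unshared page
 * just has its index updated (an assumption of this sketch).
 * Returns the page now mapped at new_addr, or NULL on bad input.
 */
struct model_page *move_anon_page(struct model_mm *mm, unsigned long old_addr,
                                  unsigned long new_addr,
                                  struct model_page *spare)
{
    unsigned long from = old_addr >> MODEL_PAGE_SHIFT;
    unsigned long to = new_addr >> MODEL_PAGE_SHIFT;
    struct model_page *page;

    if (from >= MODEL_NR_PTES || to >= MODEL_NR_PTES)
        return NULL;
    page = mm->pte[from];
    if (page == NULL)
        return NULL;

    if (page->shared) {
        memcpy(spare->data, page->data, MODEL_PAGE_SIZE);
        spare->shared = 0;
        page = spare;
    }
    page->index = new_addr;
    mm->pte[from] = NULL;
    mm->pte[to] = page;
    return page;
}
```

After the move, the original page still carries the address that the other sharers use, so a lookup by index continues to work for them.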

Which scheme is truly faster? Martin Bligh has posted a set of benchmarks showing that, while both reverse mapping approaches are significantly faster than the mainline kernel, neither is obviously faster than the other. Andrea's work is marginally ahead in more tests than Hugh's, but, overall, the two produce roughly equivalent results. So, if one of these implementations does find its way into the 2.6 kernel, it will have to be chosen for reasons other than performance. Either that, or it will be some combination of the two; Andrea and Hugh are actively discussing ideas, so that sort of combination could happen.

A new file_operations method

The file_operations structure contains pointers to functions which implement I/O operations on files and char devices. These operations include the usual suspects, such as "open", "read", "write", "llseek", etc., along with some more esoteric ones ("sendfile", "get_unmapped_area"). The file_operations structure tends not to change very often; changes here can force updating a great many filesystems and drivers.

The NFS maintainers recently ran into a problem: it is not possible to simultaneously implement the O_DIRECT and O_APPEND modes over NFS. Rather than silently fail to implement a request to do so, the NFS developers have submitted a patch which adds an fcntl() method to the file_operations structure. Its prototype is:

    long (*fcntl)(unsigned int fd, unsigned int cmd, 
                  unsigned long arg, struct file *filp);

The fd, cmd, and arg parameters come straight from user space. A file descriptor is an unusual argument for a file_operations method, but the generic fcntl() code needs it. filp is, as usual, a pointer to the file structure for the open file.

If a module does not provide an fcntl() method, the call is handled in the usual way. Otherwise, the new fcntl() function should provide a complete implementation of that system call. Typically, the method will perform whatever device- or filesystem-specific work is needed (NFS simply checks for the O_DIRECT|O_APPEND combination and returns a failure code if it's there), then pass all four arguments to generic_file_fcntl(), which is exported to modules.
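The pattern can be sketched in user space as follows. This is not the NFS patch: the model_* structures are cut-down stand-ins for struct file and file_operations, model_generic_fcntl() is a hypothetical placeholder for the generic handler that only understands F_SETFL, and the choice of -EINVAL as the failure code is an assumption.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes O_DIRECT on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#ifndef O_DIRECT
#define O_DIRECT 040000          /* fallback value, for this example only */
#endif

struct model_file;

/* Cut-down stand-in for file_operations holding only the new method. */
struct model_file_operations {
    long (*fcntl)(unsigned int fd, unsigned int cmd,
                  unsigned long arg, struct model_file *filp);
};

/* Cut-down stand-in for struct file. */
struct model_file {
    unsigned int f_flags;
    const struct model_file_operations *f_op;
};

/* Placeholder for the generic handler; only F_SETFL is modeled. */
long model_generic_fcntl(unsigned int fd, unsigned int cmd,
                         unsigned long arg, struct model_file *filp)
{
    (void)fd;
    if (cmd != F_SETFL)
        return -EINVAL;
    filp->f_flags = (unsigned int)arg;   /* simplification: replace all flags */
    return 0;
}

/* Filesystem method: refuse O_DIRECT|O_APPEND, otherwise defer. */
long model_nfs_fcntl(unsigned int fd, unsigned int cmd,
                     unsigned long arg, struct model_file *filp)
{
    if (cmd == F_SETFL &&
        (arg & (O_DIRECT | O_APPEND)) == (O_DIRECT | O_APPEND))
        return -EINVAL;
    return model_generic_fcntl(fd, cmd, arg, filp);
}

/* Caller side: use the method if one is provided, else the usual path. */
long model_do_fcntl(unsigned int fd, unsigned int cmd,
                    unsigned long arg, struct model_file *filp)
{
    if (filp->f_op && filp->f_op->fcntl)
        return filp->f_op->fcntl(fd, cmd, arg, filp);
    return model_generic_fcntl(fd, cmd, arg, filp);
}
```

A file whose operations leave the method unset takes the generic path unchanged, so existing filesystems and drivers would not need to be touched.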

This patch is currently in the -mm tree; it will likely find its way into the mainline sometime after 2.6.5 comes out.

The return of write barriers

One of the tasks on the 2.5 "to do" list was the implementation of proper write barriers in the block I/O subsystem. Any code which attempts to implement true transactional behavior on disk-based files needs this capability. Without it, systems like journaling filesystems and database managers lack the control they need over the order in which data is written to disk. Mis-ordered writes can lead to data corruption and other unfortunate things.
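To make the ordering requirement concrete, here is a toy model of a request queue in which a scheduler is free to sort writes by sector, but never across a barrier. The write_req structure and reorder_with_barriers() are hypothetical names for this sketch and do not correspond to the 2.6 block layer's interfaces.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy write request: a target sector and a barrier flag. */
struct write_req {
    unsigned long sector;
    int is_barrier;
};

/* qsort() comparison: ascending sector order. */
static int by_sector(const void *a, const void *b)
{
    const struct write_req *x = a, *y = b;

    return (x->sector > y->sector) - (x->sector < y->sector);
}

/*
 * Sort requests by sector within each run delimited by barriers.
 * A barrier request keeps its position, so everything queued before
 * it is still issued before it, and everything after it comes after.
 */
void reorder_with_barriers(struct write_req *q, size_t n)
{
    size_t start = 0;
    size_t i;

    for (i = 0; i <= n; i++) {
        if (i == n || q[i].is_barrier) {
            if (i > start)
                qsort(q + start, i - start, sizeof(*q), by_sector);
            start = i + 1;
        }
    }
}
```

In journaling terms, the requests ahead of the barrier could be journal blocks and the barrier itself the commit record: the journal blocks may be written in any convenient order, but the commit record cannot be moved ahead of them.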

The 2.6 block I/O subsystem was designed with barrier support as a core feature. But, at this point, most low-level block drivers do not actually implement barriers, and the filesystems do not use them. Patches to fill in some of the gaps have been around for a while (LWN looked at barriers last October), but have not yet been merged.

There has been a new surge of interest in proper barrier support, perhaps as a result of application vendors starting to take a hard look at the 2.6 kernel. Now Jens Axboe and Chris Mason have put together a new barrier support patch which gets Linux closer to being able to provide real disk I/O guarantees. With this patch, write barriers work, but only on IDE drives (not SCSI or serial ATA), and only with the reiserfs and ext3 filesystems. Even then, things are qualified: "ext3 works but only if things don't go wrong."

In other words, barrier support will be staying on the "to do" list for a little while longer yet. But the work is being done, and 2.6 should be able to implement real barriers before it is all over.

Patches and updates

Linus Torvalds: Linux 2.6.5-rc2
Andrew Morton: 2.6.5-rc2-mm1
Andrew Morton: 2.6.5-rc2-mm2
Andrea Arcangeli: 2.6.5-rc2-aa1
Andrea Arcangeli: 2.6.5-rc2-aa2
Andrew Morton: 2.6.5-rc1-mm2
Andrea Arcangeli: 2.6.5-rc1-aa1 "This implements anon_vma for the anonymous memory unmapping and objrmap for the file mappings, effectively removing rmap completely and replacing it with more efficient algorithms..."
Andrea Arcangeli: 2.6.5-rc1-aa2
Andrea Arcangeli: 2.6.5-rc1-aa3
Randy.Dunlap: 2.6.5-rc1-kj1 patchset
Marcelo Tosatti: Linux 2.4.26-pre5
Rusty Russell: Hotplug CPU toy for i386
John Lee: O(1) Entitlement Based Scheduler v1.1
Dipankar Sarma: RCU for low latency (experimental)
Stephen Rothwell: PPC64 iSeries virtual tape driver
Jaroslav Kysela: ALSA 2.6 update
Jeff Garzik: latest libata (includes Silicon Image work)
Jeff Garzik: libata update
Hanna Linder: add class support to floppy tape driver zftape-init.c
Wim Van Sebroeck: v2.6.5-rc2 watchdog patches
Bagalkote, Sreenivas: megaraid 2.10.2 Driver
Len Brown: ACPI for 2.6
Moore, Eric Dean: MPT Fusion driver 3.01.03 update
uaca@alumni.uv.es: RFC/Doc/BUGs: CONFIG_PACKET_MMAP
Jens Axboe: barrier patch set
Jorn Engel: cowlinks v2
Rüdiger Klaehn: File change notification (enhanced dnotify)
Nigel Kukard: DVD+-RW for 2.6.4
Jan Kara: Journalled quota patch
Hugh Dickins: anobjrmap 1/6 objrmap
Hugh Dickins: anobjrmap 2/6 linux/rmap.h
Hugh Dickins: anobjrmap 3/6 page->mapping
Hugh Dickins: anobjrmap 4/6 no pte_chains
Hugh Dickins: anobjrmap 5/6 anonmm
Hugh Dickins: anobjrmap 6/6 cleanup
Hugh Dickins: anobjrmap 7/6 mremap moves
Rajesh Venkatasubramanian: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix
Kurt Garloff: Non-Exec stack patches
Serge Hallyn: New BSD Jail patch
