
Kernel development

Brief items

Kernel release status

The 2.6.14 kernel is still not out as of this writing, though chances are good that it will have appeared on the usual "right after LWN publishes" schedule. Linus did release 2.6.14-rc5 on October 19; it contained fixes for the show-stopper problems discussed here last week, along with a number of other fixes.

The current -mm tree is 2.6.14-rc5-mm1. Recent changes to -mm include some USB power management improvements, a tracing mechanism for the block layer, some page table scalability work (see below), demand paging for hugetlb pages, the ktimer patch, and a read-copy-update torture testing module.

Comments (1 posted)

Kernel development news

Quote of the week

Oh, and at least one major distro has been served with legal papers due to them shipping closed source kernel drivers, and more are on the way. That's the direction some developers are taking. Others, myself included, [are] taking the technical way and just making it so damn hard to write and ship a closed kernel module, that they will just give up eventually. Combine that with the EXPORT_SYMBOL_GPL() stuff in the kernel, and I give it about 1-2 more years before it's just technically impossible to write such a module.

-- Greg Kroah-Hartman

Comments (26 posted)

Page migration

NUMA systems have, by design, memory which is local to specific nodes (groups of processors). While all memory is accessible, local memory is faster to work with than remote memory. The kernel takes NUMA behavior into account by attempting to allocate local memory for processes, and by avoiding moving processes between nodes whenever possible. Sometimes processes must be moved, however, with the result that the local-allocation optimization can quickly become a pessimization instead. What would be nice, in such situations, would be the ability to move a process's memory when the process itself is shifted to a new node.

Memory migration patches have been circulating for some time now. The latest version is this patch set posted by Christoph Lameter. This patch deliberately does not solve the entire problem, but it does try to establish enough infrastructure that a full migration solution can be evolved eventually.

This patch does not automatically migrate memory for processes which have been moved; instead, it leaves the migration decision to user space. There is a new system call:

    long migrate_pages(pid_t pid, unsigned long maxnode,
                       unsigned long *old_nodes,
                       unsigned long *new_nodes);

This call will attempt to move any pages belonging to the given process from old_nodes to new_nodes. There is also a new MPOL_MF_MOVE option to the set_mempolicy() system call which can be used to the same effect. Either way, user space can request that a given process vacate a set of nodes. This operation can be performed in response to an explicit move of the process itself (which might be done by a system scheduling daemon, for example), or in response to other events, such as the impending shutdown and removal of a node.
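
As a rough illustration of how this might look from user space, the sketch below asks the kernel to move a process's pages from node 0 to node 1. The syscall number, the single-word node masks, and the error handling are assumptions made for the example; they are not part of the posted patch.

    /* Hypothetical user-space sketch: migrate pid's pages from node 0 to
     * node 1.  __NR_migrate_pages is a placeholder here; a real build
     * would take it from the patched kernel headers. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_migrate_pages
    #define __NR_migrate_pages 294              /* placeholder value */
    #endif

    int main(int argc, char **argv)
    {
        pid_t pid;
        unsigned long maxnode = 2;              /* masks cover nodes 0 and 1 */
        unsigned long old_nodes = 1UL << 0;     /* vacate node 0... */
        unsigned long new_nodes = 1UL << 1;     /* ...in favor of node 1 */

        if (argc < 2)
            return 1;
        pid = (pid_t)atoi(argv[1]);

        if (syscall(__NR_migrate_pages, pid, maxnode,
                    &old_nodes, &new_nodes) < 0)
            perror("migrate_pages");
        return 0;
    }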

The implementation is simple for now: the code iterates over the process's memory and attempts to force each page needing migration to be swapped out. When the process faults the page back in, it should then be allocated on the process's current node. The force-out process actually takes a few passes over the list; the early passes skip locked pages and concern themselves only with pages which are easy to evict. In later passes, it will wait for locked pages and do the hard work of getting the final pages out of memory.
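
In schematic form, that loop looks something like the sketch below; the structure is illustrative only, and force_page_to_swap() is an invented placeholder rather than a function from the patch.

    /*
     * Schematic sketch of the multi-pass force-out described above; not
     * code from the patch itself.
     */
    static void force_out_pages(struct list_head *pages)
    {
        struct page *page, *next;
        int pass;

        for (pass = 0; pass < 10 && !list_empty(pages); pass++) {
            list_for_each_entry_safe(page, next, pages, lru) {
                /* Early passes: skip locked pages, take the easy wins. */
                if (pass < 2 && PageLocked(page))
                    continue;
                /* Later passes: wait for the page to become unlocked. */
                if (pass >= 2)
                    wait_on_page_locked(page);
                if (force_page_to_swap(page))   /* invented placeholder */
                    list_del(&page->lru);
            }
        }
    }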

Migrating pages by way of the swap device is not the most efficient way of moving them across a NUMA system. Later work on the patch will be aimed at adding direct node-to-node migration, among other features. In the meantime, the developers would like to see the current implementation merged in time for 2.6.15. Andrew Morton has expressed some reservations, however: he would like to see an explanation of how this code can be made to work with near-complete reliability. There are a number of things which can prevent the migration of pages; these include pages locked in place by user space, pages undergoing direct I/O, and more. Christoph responded that the patch will get there, eventually. Whether this claim is sufficiently convincing to get the migration patches into 2.6.15 remains to be seen.

Comments (3 posted)

Another approach to page table scalability

Scalability - making Linux perform on ever-larger systems - is a constant theme in kernel development. Some may feel that this work only benefits the very small percentage of users who have big-iron systems, but the fact remains that today's big iron is tomorrow's laptop. Remember that supporting 1GB of memory (and beyond) was once a big-iron issue.

One scalability issue which has been receiving attention for a while is the single page table lock used to protect all operations on an address space's tables. Christoph Lameter's page fault scalability patches were covered here last year; that patch minimized the use of this lock, and introduced a number of atomic page table operations which could eliminate locking altogether in some situations. Those patches have never made it into the mainline, due to concerns over architecture support and general usefulness. The issue has not gone away, however.

Hugh Dickins, who has been filling the -mm tree with memory management patches for the last few weeks, has now posted a new approach to page table scalability. Rather than play tricks to minimize page table lock hold times, Hugh has taken the classic approach of moving to finer-grained locking. So, with his patch, the address-space-wide page table lock no longer controls access to individual pages within the tables. Instead, each page table page gets its own lock.

Pushing the lock down to individual page-table pages will eliminate much of the contention for the lock on large, multi-processor systems. It should work especially well for multi-threaded processes (which share the same address space) on those systems. Splitting the lock also enables the kernel to work at reclaiming pages in one part of an address space while pages are being faulted into another part. So, in some situations, this split should be a big performance win.

There is, however, the little problem of where to store the lock. Putting it into the page tables themselves is not an option; the format of page tables tends to be driven by the underlying hardware architecture, and CPU designers do not usually make provisions for in-table locks. One could create an array of locks elsewhere in the system, but a large system can contain a great many page table pages. The space overhead of a large lock array could thus get painful. Using a smaller, hashed array, as is done in other parts of the kernel, is an option, but Hugh didn't go that way. Instead, he put the lock into the page structures representing the page table pages in the system memory map. Expanding that structure is not an option, but it seems that the private field of struct page is not currently used on page table pages. So, with a bit of preprocessor trickery, that field becomes a spinlock for page table pages.
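
In rough outline, the idea looks something like the following; the macro names and the CPU threshold shown here are simplified stand-ins for illustration, not the patch's actual code.

    /*
     * Simplified illustration of the split page table lock idea: reuse the
     * otherwise-unused ->private word of a page-table page as a spinlock.
     */
    #if NR_CPUS >= 4
    #define pte_lock_init(page)   spin_lock_init((spinlock_t *)&(page)->private)
    #define pte_lockptr(mm, pmd)  ((spinlock_t *)&pmd_page(*(pmd))->private)
    #else
    /* Small systems keep the single per-mm page_table_lock. */
    #define pte_lock_init(page)   do {} while (0)
    #define pte_lockptr(mm, pmd)  (&(mm)->page_table_lock)
    #endif

The posted code wraps this choice in helper functions, so that most page fault paths need not know which lock they are actually taking.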

This finer-grained locking should be helpful on larger systems, but it is likely to just be more overhead on uniprocessor or small SMP systems. So it is only enabled on kernels configured for four CPUs or more. Depending on the results from wider testing, that threshold may be raised before the patch is proposed for merging into the mainline.

Comments (none posted)

Coming soon: eCryptfs

eCryptfs developer Michael Halcrow recently announced that he will shortly be putting eCryptfs up for inclusion into the -mm tree. This filesystem aims to make "enterprise level" (it comes from IBM, after all) file encryption capabilities available in a secure and easy to use manner. Those who are interested in trying it out early can download it from SourceForge.

The eCryptfs developers took the stacking approach, meaning that, rather than implementing its own platter-level format, eCryptfs sits on top of another filesystem. It is, essentially, a sort of translation layer which makes encrypted file capabilities available. The system administrator can thus create encrypted filesystems on top of whatever filesystem is in use locally, or even over a network-mounted filesystem.

The design of eCryptfs envisions providing a great deal of flexibility in the use of the filesystem. Rather than encrypt the filesystem as a whole, eCryptfs deals with each file individually. Different files can be encrypted in different ways. The use of this sort of mechanism implies that eCryptfs must maintain metadata on how each file is to be handled. This metadata is placed in the first block of the file itself, meaning that the file can be backed up, copied, and even moved to another system without losing the metadata needed to decrypt it in the future.

Plans for eCryptfs include a wide range of features. There will be dynamic, public-key encryption with each user's GPG keyring. On systems equipped with "trusted platform" (TPM) modules, the TPM will be used for its encryption capabilities and the ability to lock files to a specific system. Key escrow systems can be worked in for companies which need that feature. For the upcoming 0.1 release, however, eCryptfs will only support a single passphrase mode. The rest can be added once the initial problems have been shaken out and some policy support work has been done.

Many of the advanced features have been implemented, however, and can be tried out by sufficiently motivated testers. The developers are interested in feedback from people who can give eCryptfs a try or look over the source. Having seen the difficulties experienced by some filesystem implementers as they tried to get their work merged, the eCryptfs hackers would, doubtless, like to get any potential issues resolved sooner rather than later.

Comments (7 posted)

Some block layer patches

Lest LWN readers think that all of the development activity is currently centered around memory management issues, it is worth pointing out that some significant patches to the block subsystem are circulating as well. Here is a quick summary.

Linux I/O schedulers are charged with presenting I/O requests to block devices in an optimal order. There are currently four schedulers in the kernel, each with a different notion of "optimal." All of them, however, maintain a "dispatch queue": the list of requests which have been selected for submission to the device. Each scheduler currently maintains its own dispatch queue.

Tejun Heo has decided that the proliferation of dispatch queues is a wasteful duplication of code, so he has implemented a generic dispatch queue to bring things back together. The unification of the dispatch queues helps to ensure that all I/O schedulers implement queues with the same semantics. It also simplifies the schedulers by freeing them of the need to deal with non-filesystem requests. The developers have been heard to say recently that the block subsystem is not really about block devices; it is, instead, a generic message queueing mechanism. The generic dispatch queue code helps to take things in that direction.

Tejun Heo has also reimplemented the I/O barrier code. The result should be much improved barrier handling, but it also involves some API changes visible to block drivers. The new code recognizes that different devices will support barriers in different ways. There are three variables which are taken into account:

  • Whether the device supports ordered tags or not. Ordered tags allow multiple requests to be outstanding at once, with the device expected to handle them in the indicated order. In the absence of ordered tags, barriers can only be implemented by stopping the request queue and ensuring that all requests issued before the barrier complete before any subsequent requests are issued.

  • Whether an explicit flush operation is required prior to issuing the barrier operation. Devices which perform write caching usually will need to be flushed for the barrier semantics to be met.

  • Whether the device supports the "forced unit access" (FUA) mode. If FUA is supported, the actual barrier request can be issued in FUA mode, and there is no need to force a flush afterward. In the absence of FUA, flushes are usually required before and after the barrier operation.

A block driver will tell the system about how its device operates with blk_queue_ordered(), which has a new prototype:

    typedef void (prepare_flush_fn)(request_queue_t *q,
                                    struct request *rq);
    int blk_queue_ordered(request_queue_t *q, unsigned ordered,
                          prepare_flush_fn *prepare_flush_fn,
                          unsigned gfp_mask);

The ordered parameter describes how barriers are to be implemented; it has values like QUEUE_ORDERED_DRAIN_FLUSH, which indicates that barriers are implemented by stopping the queue and that flushes are required both before and after the barrier, or QUEUE_ORDERED_TAG, which says that ordered tags handle everything. The prepare_flush_fn() will be called to do whatever is required to make a specific operation force a flush to physical media. See Tejun's documentation patch for more details.
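
As a rough sketch of how a driver might use this interface for a write-caching device without tagged queuing (the flush opcode, timeout, and function names here are hypothetical, invented for the example):

    /* Hypothetical driver setup: drain the queue around barriers, with a
     * cache flush both before and after the barrier request. */
    static void my_prepare_flush(request_queue_t *q, struct request *rq)
    {
        /* Turn rq into whatever "flush the write cache" means for this
         * hardware; for a SCSI-like device, a SYNCHRONIZE CACHE command. */
        rq->cmd[0] = MY_FLUSH_CACHE_OPCODE;     /* hypothetical opcode */
        rq->timeout = MY_FLUSH_TIMEOUT;         /* hypothetical timeout */
    }

    static int my_setup_queue(request_queue_t *q)
    {
        return blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
                                 my_prepare_flush, GFP_KERNEL);
    }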

With the above information in hand, the block layer can handle the implementation of barrier requests. As long as the driver implements flushes when requested and recognizes I/O requests requiring the FUA mode (a helper function blk_fua_rq() is provided for this purpose), the rest is taken care of at the higher levels.

The barrier patch also adds an uptodate parameter to end_that_request_last(). This API change, which will affect most block drivers, is necessary to enable drivers to signal errors for non-filesystem requests.
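
A completion path under the new convention might look like the following sketch; the driver function is invented, but it shows where the new argument goes.

    /* Sketch of a driver completion handler; a non-zero uptodate value
     * means success, and errors can now be signaled for non-filesystem
     * requests as well. */
    static void my_complete_request(struct request *rq, int error)
    {
        int uptodate = error ? 0 : 1;

        /* Complete the transferred sectors; once nothing remains, finish
         * the request as a whole. */
        if (!end_that_request_first(rq, uptodate, rq->hard_nr_sectors))
            end_that_request_last(rq, uptodate);
    }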

The conversation on the lists suggests that both of the above patches are headed for the mainline sooner or later. Mike Christie's block layer multipath patch may take a little longer, however. The question of where multipath support should be implemented has often been discussed; more recently, the seeming consensus was that the device mapper layer was the right place. The result was that the device mapper multipath patches were merged early this year. So it is a bit surprising to see the issue come back now.

Mike has a few reasons for wanting to implement multipath at the lower level. These include:

  • Dealing with multipath hardware involves a number of strange SCSI commands, and, especially, error codes. With the current implementation, it is hard to get detailed error information up to the device mapper layers in any sort of generic way.

  • Lower-level multipath makes it easier to merge device commands (such as failover requests) with the regular I/O stream.

  • The request queue mechanism is a better place for handling retries and other related tasks.

  • Placing the I/O scheduler above the multipath mechanism allows scheduling decisions to be made at the right time.

  • In theory, a wider range of devices could benefit from the multipath implementation - should anybody have a need for a multipath tape drive.

A number of code simplifications are also said to result from the new organization. The new multipath code is essentially a repackaging of the device mapper code, reworked to deal with the block layer from underneath. It is not being proposed for merging at this time, or even for serious review. So far, there has been little discussion of this patch.

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux v2.6.14-rc5
Andrew Morton: 2.6.14-rc5-mm1
Nick Piggin: 2.6.14-rc5-np1
Ingo Molnar: 2.6.14-rc5-rt1

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Security-related

Page editor: Jonathan Corbet