Brief items
The 2.6.14 kernel is still not out as of this writing, though
chances are good that the release will happen on the usual "right after LWN
publishes" schedule. Linus did release
2.6.14-rc5 on October 19;
it contained fixes for the show-stopper problems discussed here
last week and a number of other
fixes as well.
The current -mm tree is 2.6.14-rc5-mm1. Recent changes
to -mm include some USB power management improvements, a tracing mechanism
for the block layer, some page table scalability work (see below), demand
paging for hugetlb pages, the ktimer patch, and a
read-copy-update torture testing module.
Kernel development news
Oh, and at least one major distro has been served with legal papers
due to them shipping closed source kernel drivers, and more are on
the way. That's the direction some developers are taking. Others,
myself included, [are] taking the technical way and just making it so
damn hard to write and ship a closed kernel module, that they will
just give up eventually. Combine that with the EXPORT_SYMBOL_GPL()
stuff in the kernel, and I give it about 1-2 more years before it's
just technically impossible to write such a module.
-- Greg Kroah-Hartman
NUMA systems have, by design, memory which is local to specific nodes
(groups of processors). While all memory is accessible, local memory is
faster to work with than remote memory. The kernel takes NUMA behavior
into account by attempting to allocate local memory for processes, and by
avoiding moving processes between nodes whenever possible. Sometimes
processes must be moved, however, with the result that the local-allocation
optimization can quickly become a pessimization instead. What would be
nice, in such situations, would be the ability to move a process's memory
when the process itself is shifted to a new node.
Memory migration patches have been circulating for some time now. The
latest version is this patch
set posted by Christoph Lameter. This patch deliberately does not
solve the entire problem, but it does try to establish enough
infrastructure that a full migration solution can be evolved eventually.
This patch does not automatically migrate memory for processes which have
been moved; instead, it leaves the migration decision to user space. There
is a new system call:
    long migrate_pages(pid_t pid, unsigned long maxnode,
                       unsigned long *old_nodes,
                       unsigned long *new_nodes);
This call will attempt to move any pages belonging to the given process
from old_nodes to new_nodes. There is also a new
MPOL_MF_MOVE option to the mbind()
system call which can be used to the same effect. Either way, user space
can request that a given process vacate a set of nodes. This operation can
be performed in response to an explicit move of the process itself (which
might be done by a system scheduling daemon, for example), or in response
to other events, such as the impending shutdown and removal of a node.
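As a rough illustration of how a scheduling daemon might use the new call, here is a minimal user-space sketch. It assumes that the call is reachable through syscall(2) under the name SYS_migrate_pages, that node sets are passed as bitmasks of unsigned long words, and that maxnode gives the size of those masks in bits; none of that is confirmed by the patch posting. The nodemask helpers, the 128-node limit, and move_process_memory() are invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption for this sketch: 128-bit node masks are large enough. */
#define SKETCH_MAX_NODES 128u
#define BITS_PER_ULONG   (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS       ((SKETCH_MAX_NODES + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/* Hypothetical helper type: one bit per NUMA node. */
struct nodemask {
    unsigned long bits[MASK_WORDS];
};

void nodemask_clear(struct nodemask *m)
{
    memset(m, 0, sizeof(*m));
}

void nodemask_set(struct nodemask *m, unsigned int node)
{
    assert(node < SKETCH_MAX_NODES);
    m->bits[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

int nodemask_test(const struct nodemask *m, unsigned int node)
{
    assert(node < SKETCH_MAX_NODES);
    return (m->bits[node / BITS_PER_ULONG] >> (node % BITS_PER_ULONG)) & 1UL;
}

/*
 * Ask the kernel to move the pages of process "pid" off "from_node" and
 * onto "to_node". Returns the raw system call result; on failure that is
 * -1 with errno set. Systems without the call report ENOSYS.
 */
long move_process_memory(pid_t pid, unsigned int from_node, unsigned int to_node)
{
    struct nodemask old_nodes, new_nodes;

    nodemask_clear(&old_nodes);
    nodemask_clear(&new_nodes);
    nodemask_set(&old_nodes, from_node);
    nodemask_set(&new_nodes, to_node);

#ifdef SYS_migrate_pages
    return syscall(SYS_migrate_pages, pid, (unsigned long)SKETCH_MAX_NODES,
                   old_nodes.bits, new_nodes.bits);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

A daemon which has just moved process 1234 from node 0 to node 1 would then call move_process_memory(1234, 0, 1) and check the result; on a machine with a single node, or a kernel built without migration support, the call can be expected to fail.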
The implementation is simple for now: the code iterates over the process's
memory and attempts to force each page needing migration to be swapped out.
When the process faults the page back in, it should then be allocated on
the process's current node. The force-out process actually takes a few
passes over the list; initially it passes over locked pages and just
concerns itself with pages which are easy to evict. In later passes, it
will wait for locked pages and do the hard work of getting the final pages
out of memory.
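The multi-pass strategy can be sketched in a few lines of C. This is a toy model, not the patch's code: struct toy_page, the three-pass limit, and the "waiting" stub are all invented here, and eviction is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; these fields do not mirror struct page. */
struct toy_page {
    bool locked;    /* somebody else holds the page lock right now */
    bool resident;  /* the page is still in memory on the old node */
};

#define TOY_MAX_PASSES 3  /* hypothetical limit chosen for the sketch */

/* Pretend to sleep until the current lock holder is done. */
static void wait_for_unlock(struct toy_page *page)
{
    page->locked = false;
}

/*
 * Make one pass over the pages. Locked pages are skipped unless
 * may_wait is set. Returns the number of pages still resident.
 */
size_t evict_pass(struct toy_page *pages, size_t n, bool may_wait)
{
    size_t remaining = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_page *page = &pages[i];

        if (!page->resident)
            continue;
        if (page->locked) {
            if (!may_wait) {
                remaining++;        /* leave it for a later pass */
                continue;
            }
            wait_for_unlock(page);  /* the hard work happens here */
        }
        page->resident = false;     /* "swapped out" */
    }
    return remaining;
}

/* First pass takes only the easy pages; later passes wait for locks. */
size_t force_out(struct toy_page *pages, size_t n)
{
    size_t remaining = n;

    for (int pass = 0; pass < TOY_MAX_PASSES && remaining > 0; pass++)
        remaining = evict_pass(pages, n, pass > 0);
    return remaining;
}
```

The point of the structure is that the cheap pass clears out most pages quickly, so the expensive waiting is only done for the few pages that really need it.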
Migrating pages by way of the swap device is not the most efficient way of
moving them across a NUMA system. Later work on the patch will be aimed at
adding direct node-to-node migration, and other features as well. In the
meantime, however, the developers would like to see the current
implementation merged in time for 2.6.15. Andrew Morton has expressed some reservations, though: he would
like to see an explanation of how this code can be made to work with
near-complete reliability. There are a number of things which can prevent the
migration of pages; these include pages locked in place by user space, pages
undergoing direct I/O, and more. Christoph responded that the patch will get there,
eventually. Whether this claim is sufficiently convincing to get the
migration patches into 2.6.15 remains to be seen.
Scalability - making Linux perform on ever-larger systems - is a constant
theme in kernel development. Some may feel that this work only benefits
the very small percentage of users who have big-iron systems, but the fact
remains that today's big iron is tomorrow's laptop. Remember that
supporting 1GB of memory (and beyond) was once a big-iron issue.
One scalability issue which has been receiving attention for a while is the
single page table lock used to protect all operations on an address space's
tables. Christoph Lameter's page
fault scalability patches were covered here last year; those patches
minimized the use of this lock, and introduced a number of atomic page
table operations which could eliminate locking altogether in some
situations. Those patches have never made it into the mainline,
due to concerns over architecture support and general usefulness. The
issue has not gone away, however.
Hugh Dickins, who has been thrashing up the -mm tree with memory management
patches for the last few weeks, has now posted a new approach to paging scalability. Rather
than play tricks to minimize page table lock hold times, Hugh has taken the
classic approach of going to finer-grained locking. So, with his patch,
the address space page table lock no longer controls access to individual
pages within the tables. Instead, each page-table page gets its own lock.
Pushing the lock down to individual page-table pages will eliminate much of
the contention for the lock on large, multi-processor systems. It should
work especially well for multi-threaded processes (which share the same
address space) on those systems. Splitting the lock also enables the
kernel to work at reclaiming pages in one part of an address space while
pages are being faulted into another part. So, in some situations, this
split should be a big performance win.
There is, however, the little problem of where to store the lock. Putting
it into the page tables themselves is not an option; the format of page
tables tends to be driven by the underlying hardware architecture, and CPU
designers do not usually make provisions for in-table locks. One could
create an array of locks elsewhere in the system, but a large system can
contain a great many page table pages. The space overhead of a large lock
array could thus get painful. Using a smaller, hashed array, as is done in
other parts of the kernel, is an option, but Hugh didn't go that way.
Instead, he put the lock into the page structures representing the
page table pages in the system memory map. Expanding that structure is not
an option, but it seems that the private field of struct
page is not currently used on page table pages. So, with a bit of
preprocessor trickery, that field becomes a spinlock for page table pages.
This finer-grained locking should be helpful on larger systems, but it is
likely to just be more overhead on uniprocessor or small SMP systems. So
it is only enabled on kernels configured for four CPUs or more. Depending
on the results from wider testing, that threshold may be raised before the
patch is proposed for merging into the mainline.
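The trick of reusing an existing field, and of only doing so on larger configurations, might look something like the following user-space sketch. Everything here is a stand-in: the toy_ types, the configuration macros, and toy_pte_lockptr() are invented names, and a C11 atomic_flag plays the role of the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical build-time settings; the threshold of four is from the text. */
#define SKETCH_NR_CPUS         4
#define SKETCH_SPLIT_THRESHOLD 4
#define USE_SPLIT_PTLOCKS      (SKETCH_NR_CPUS >= SKETCH_SPLIT_THRESHOLD)

/* Minimal spinlock stand-in built on a C11 atomic flag. */
typedef struct { atomic_flag held; } toy_spinlock_t;

void toy_spin_init(toy_spinlock_t *l)
{
    atomic_flag_clear(&l->held);
}

void toy_spin_lock(toy_spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin until the holder releases the lock */
}

/* Returns 1 if the lock was taken, 0 if somebody already holds it. */
int toy_spin_trylock(toy_spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void toy_spin_unlock(toy_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Toy page structure: the lock shares storage with the private field. */
struct toy_page {
    unsigned long flags;
    union {
        unsigned long private;  /* ordinary use of the field */
#if USE_SPLIT_PTLOCKS
        toy_spinlock_t ptl;     /* used only when this is a page-table page */
#endif
    } u;
};

/* The overlay only works if the lock fits in the space already there. */
_Static_assert(sizeof(toy_spinlock_t) <= sizeof(unsigned long),
               "lock must not grow the page structure");

struct toy_mm {
    toy_spinlock_t page_table_lock;  /* the old, single per-mm lock */
};

/* Pick the lock protecting one page-table page. */
#if USE_SPLIT_PTLOCKS
#define toy_pte_lockptr(mm, page) ((void)(mm), &(page)->u.ptl)
#else
#define toy_pte_lockptr(mm, page) ((void)(page), &(mm)->page_table_lock)
#endif
```

Callers always go through toy_pte_lockptr(), so a small-system build (SKETCH_NR_CPUS below the threshold) falls back to the single lock without any change to the calling code.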
eCryptfs developer Michael Halcrow recently
announced that he will shortly be putting
eCryptfs up for inclusion into the -mm tree. This filesystem aims to make
"enterprise level" (it comes from IBM, after all) file encryption
capabilities available in a secure and easy-to-use manner. Those who are
interested in trying it out early can download it from
SourceForge.
The eCryptfs developers took the stacking approach, meaning that, rather
than implement its own platter-level format, eCryptfs sits on top of
another filesystem. It is, essentially, a sort of translation layer which
makes encrypted file capabilities available. The system administrator can
thus create encrypted filesystems on top of whatever filesystem is in use
locally, or even over a network-mounted filesystem.
The design of eCryptfs envisions providing a great deal of flexibility in
the use of the filesystem. Rather than encrypt the filesystem as a whole,
eCryptfs deals with each file individually. Different files can be
encrypted in different ways. The use of this sort of mechanism implies
that eCryptfs must maintain metadata on how each file is to be handled.
This metadata is placed in the first block of the file itself, meaning that
the file can be backed up, copied, and even moved to another system without
losing the metadata needed to decrypt it in the future.
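To make the idea of in-file metadata concrete, here is a sketch of a header stored in the first block of a file. The layout is invented purely for illustration and is not the eCryptfs on-disk format; the block size, magic number, and field names are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 4096u
#define SKETCH_MAGIC      0x454D4554u  /* arbitrary marker, invented here */

/* Hypothetical per-file metadata; not the real eCryptfs header. */
struct sketch_meta {
    uint32_t cipher_id;   /* which cipher this file uses */
    uint32_t key_bits;    /* key length for this file */
    uint64_t plain_size;  /* size of the unencrypted contents */
};

/* Store the low nbytes of v at dst, least significant byte first. */
static void put_le(uint8_t *dst, uint64_t v, unsigned int nbytes)
{
    for (unsigned int i = 0; i < nbytes; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Read back an nbytes-wide little-endian value. */
static uint64_t get_le(const uint8_t *src, unsigned int nbytes)
{
    uint64_t v = 0;

    for (unsigned int i = 0; i < nbytes; i++)
        v |= (uint64_t)src[i] << (8 * i);
    return v;
}

/* Fill the file's first block with the metadata header. */
void sketch_meta_write(uint8_t *block, const struct sketch_meta *m)
{
    assert(block && m);
    memset(block, 0, SKETCH_BLOCK_SIZE);
    put_le(block + 0, SKETCH_MAGIC, 4);
    put_le(block + 4, m->cipher_id, 4);
    put_le(block + 8, m->key_bits, 4);
    put_le(block + 12, m->plain_size, 8);
}

/* Parse the first block; returns 0 on success, -1 if the marker is absent. */
int sketch_meta_read(const uint8_t *block, struct sketch_meta *m)
{
    assert(block && m);
    if (get_le(block, 4) != SKETCH_MAGIC)
        return -1;
    m->cipher_id = (uint32_t)get_le(block + 4, 4);
    m->key_bits = (uint32_t)get_le(block + 8, 4);
    m->plain_size = get_le(block + 12, 8);
    return 0;
}
```

Because the header travels inside the file, a plain copy of the lower-level file to another machine carries along everything needed (apart from the key itself) to interpret it there.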
Plans for eCryptfs include a wide range of features. There will be
dynamic, public-key encryption with each user's GPG keyring. On systems
equipped with "trusted platform" (TPM) modules, the TPM will be used for
its encryption capabilities and the ability to lock files to a specific
system. Key escrow systems can be worked in for companies which need that
feature. For the upcoming 0.1 release, however, eCryptfs will only support
a single passphrase mode. The rest can be added once the initial problems
have been shaken out and some policy support work has been done.
Many of the advanced features have been implemented, however, and can be
tried out by sufficiently motivated testers. The developers are interested
in feedback from people who can give eCryptfs a try or look over the
source. Having seen the difficulties experienced by some filesystem
implementers as they tried to get their work merged, the eCryptfs hackers
would, doubtless, like to get any potential issues resolved sooner rather
than later.
Lest LWN readers think that all of the development activity is currently
centered around memory management issues, it is worth pointing out that
some significant patches to the block subsystem are circulating as well.
Here is a quick summary.
Linux I/O schedulers are charged with presenting I/O requests to block
devices in an optimal order. There are currently four schedulers in the
kernel, each with a different notion of "optimal." All of them, however,
maintain a "dispatch queue": the list of requests which have been
selected for submission to the device. Each scheduler currently maintains
its own dispatch queue.
Tejun Heo has decided that the proliferation of dispatch queues is a
wasteful duplication of code, so he has implemented a generic dispatch queue to bring
things back together. The unification of the dispatch queues helps to
ensure that all I/O schedulers implement queues with the same semantics.
It also simplifies the schedulers by freeing them of the need to deal with
non-filesystem requests. The developers have recently been heard to say
that the block subsystem is not really about block devices;
it is, instead, a generic message queueing mechanism. The generic dispatch
queue code helps to take things in that direction.
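The division of labor might be pictured as follows. This is a simplified user-space model with invented names throughout (toy_request, toy_sched, toy_submit() and friends); the "scheduler" simply picks the lowest pending sector, which is not how any of the real schedulers work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy request; is_fs distinguishes filesystem I/O from special commands. */
struct toy_request {
    long sector;
    bool is_fs;
    struct toy_request *next;
};

/* The shared dispatch queue: a plain FIFO that every scheduler feeds. */
struct toy_dispatch_queue {
    struct toy_request *head, *tail;
};

void dispatch_add_tail(struct toy_dispatch_queue *q, struct toy_request *rq)
{
    rq->next = NULL;
    if (q->tail)
        q->tail->next = rq;
    else
        q->head = rq;
    q->tail = rq;
}

/* Take the next request for the device, or NULL if the queue is empty. */
struct toy_request *dispatch_pop(struct toy_dispatch_queue *q)
{
    struct toy_request *rq = q->head;

    if (rq) {
        q->head = rq->next;
        if (!q->head)
            q->tail = NULL;
        rq->next = NULL;
    }
    return rq;
}

/* A deliberately trivial scheduler holding a few pending requests. */
#define TOY_SCHED_MAX 16
struct toy_sched {
    struct toy_request *pending[TOY_SCHED_MAX];
    size_t count;
};

bool sched_add(struct toy_sched *s, struct toy_request *rq)
{
    if (s->count == TOY_SCHED_MAX)
        return false;
    s->pending[s->count++] = rq;
    return true;
}

/* Move the scheduler's preferred request onto the dispatch queue. */
bool sched_dispatch(struct toy_sched *s, struct toy_dispatch_queue *q)
{
    size_t best = 0;

    if (s->count == 0)
        return false;
    for (size_t i = 1; i < s->count; i++)
        if (s->pending[i]->sector < s->pending[best]->sector)
            best = i;
    dispatch_add_tail(q, s->pending[best]);
    s->pending[best] = s->pending[--s->count];
    return true;
}

/* Non-filesystem requests never reach the scheduler at all. */
bool toy_submit(struct toy_dispatch_queue *q, struct toy_sched *s,
                struct toy_request *rq)
{
    if (!rq->is_fs) {
        dispatch_add_tail(q, rq);
        return true;
    }
    return sched_add(s, rq);
}
```

In this model the scheduler's only job is to choose an order for filesystem requests; queue handling, and anything that is not ordinary filesystem I/O, lives in the common code.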
Tejun Heo has also reimplemented
the I/O barrier code. The result should be much improved barrier
handling, but it also involves some API changes visible to block drivers.
The new code recognizes that different devices will support barriers in
different ways. There are three variables which are taken into account:
- Whether the device supports ordered tags or not. Ordered tags allow
there to be multiple outstanding requests, with the device expected to
handle them in the indicated order. In the absence of ordered tags,
barriers can only be implemented by stopping the request queue and
being sure that requests before the barrier complete before any
subsequent requests are issued.
- Whether an explicit flush operation is required prior to issuing the
barrier operation. Devices which perform write caching usually will
need to be flushed for the barrier semantics to be met.
- Whether the device supports the "forced unit access" (FUA) mode. If
FUA is supported, the actual barrier request can be issued in FUA
mode, and there is no need to force a flush afterward. In the absence
of FUA, flushes are usually required before and after the barrier
operation.
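One way to read the three variables above is as inputs to a small decision function. The sketch below is an interpretation of the list, not code from the patch: the structure and function names are invented, and it makes the simplifying assumption that flushes are needed exactly when the device does write caching.

```c
#include <assert.h>
#include <stdbool.h>

/* What the device can do (hypothetical names). */
struct toy_barrier_caps {
    bool ordered_tags;  /* device honors ordered tags */
    bool write_cache;   /* device caches writes, so flushes matter */
    bool fua;           /* device supports forced unit access */
};

/* The steps needed to implement one barrier request. */
struct toy_barrier_plan {
    bool drain_queue;     /* stop the queue and wait for earlier requests */
    bool pre_flush;       /* flush the cache before the barrier */
    bool barrier_as_fua;  /* issue the barrier request itself in FUA mode */
    bool post_flush;      /* flush the cache again after the barrier */
};

struct toy_barrier_plan plan_barrier(struct toy_barrier_caps caps)
{
    struct toy_barrier_plan plan;

    /* Without ordered tags, ordering comes from draining the queue. */
    plan.drain_queue = !caps.ordered_tags;
    /* Cached writes must reach the media before the barrier is issued. */
    plan.pre_flush = caps.write_cache;
    /* FUA sends the barrier straight to the media... */
    plan.barrier_as_fua = caps.write_cache && caps.fua;
    /* ...which makes the trailing flush unnecessary. */
    plan.post_flush = caps.write_cache && !caps.fua;
    return plan;
}
```

A write-caching device with neither ordered tags nor FUA thus ends up with the most expensive sequence: drain, flush, barrier, flush.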
A block driver will tell the system about how its device operates with
blk_queue_ordered(), which has a new prototype:
    typedef void (prepare_flush_fn)(request_queue_t *q,
                                    struct request *rq);
    int blk_queue_ordered(request_queue_t *q, unsigned ordered,
                          prepare_flush_fn *prepare_flush_fn,
                          unsigned gfp_mask);
The ordered parameter describes how barriers are to be implemented; it
has values like QUEUE_ORDERED_DRAIN_FLUSH to indicate that
barriers are implemented by stopping the queue, and that flushes are
required both before and after the barrier; or QUEUE_ORDERED_TAG,
which says that ordered tags handle everything. The
prepare_flush_fn() will be called to do whatever is required to
make a specific operation force a flush to physical media. See Tejun's documentation patch for more details.
With the above information in hand, the block layer can handle the
implementation of barrier requests. As long as the driver implements
flushes when requested and recognizes I/O requests requiring the FUA mode
(a helper function blk_fua_rq() is provided for this purpose), the
rest is taken care of at the higher levels.
The barrier patch also adds an uptodate parameter to
end_that_request_last(). This API change, which will affect most
block drivers, is necessary to enable drivers to signal errors for
non-filesystem requests.
The conversation on the lists suggests that both of the above patches are
headed for the mainline sooner or later. Mike Christie's block layer multipath patch
may take a little longer, however. The question of where multipath support should
be implemented has often been discussed; more recently, the seeming
consensus was that the device mapper layer was the right place. The result
was that the device mapper
multipath patches were merged early this year. So it is a bit
surprising to see the issue come back now.
Mike has a few reasons for wanting to implement multipath at the lower
level. These include:
- Dealing with multipath hardware involves a number of strange SCSI
commands, and, especially, error codes. With the current
implementation, it is hard to get detailed error information up to the
device mapper layers in any sort of generic way.
- Lower-level multipath makes it easier to merge device commands (such
as failover requests) with the regular I/O stream.
- The request queue mechanism is a better place for handling retries and
other related tasks.
- Placing the I/O scheduler above the multipath mechanism allows
scheduling decisions to be made at the right time.
- In theory, a wider range of devices could benefit from the multipath
implementation - should anybody have a need for a multipath tape drive.
A number of code simplifications are also said to result from the new
organization.
The new multipath code is essentially a repackaging of the device mapper
code, reworked to deal with the block layer from underneath. It is not being
proposed for merging at this time, or even for serious review. So far,
there has been little discussion of this patch.
Page editor: Jonathan Corbet