Brief items
The current stable 2.6 kernel is 2.6.19,
released by Linus on November 29.
Says Linus:
It's one of those rare "perfect" kernels. So if it doesn't happen to compile
with your config (or it does compile, but then does unspeakable acts of
perversion with your pet dachshund), you can rest easy knowing that it's
all your own d*mn fault, and you should just fix your evil ways.
For those just tuning in, major user-visible changes in 2.6.19 include the
parallel ATA driver
subsystem, the GFS2
and ext4 filesystems, a long
list of new drivers, eCryptfs, and more. See the LWN kernel API page for
a list of internal API changes, and the KernelNewbies 2.6.19 page
for vast amounts of detail.
The current -mm tree is 2.6.19-rc6-mm2. Recent changes
to -mm include some driver core tweaks, suspend/resume support for a number
of parallel ATA drivers, the file capabilities patch (see below), and a
per-task I/O accounting feature.
For older 2.6 kernels: the current 2.6.18 kernel is 2.6.18.4, released on November 29.
It contains a single fix for a buffer overflow in the network bridging code.
For 2.6.16 users, Adrian Bunk has released 2.6.16.33 and 2.6.16.34 with a number of fixes
and (in .34) a few new drivers.
Kernel development news
I believe that the reason we see such stunning progress on things
like the Linux kernel is that, among other things, the governing
process is transparent and damn simple.
-- Michael Tiemann
The workqueue mechanism allows
kernel code to defer processing to a later time. Workqueues are
characterized by the existence of one or more dedicated processes which
execute queued jobs; since work is done in process context, it can sleep if
need be. Workqueues can also delay the execution of specific jobs for a
caller-specified period. They are used in many places throughout the
kernel.
David Howells recently took a look at workqueues and noticed that the
work_struct structure, which describes a task to be executed, is
rather large. It can be 96 bytes on 64-bit machines. That is fairly heavy
for structures which can be used in reasonably large quantities. So he set
out to find ways to make it smaller. He succeeded, but at the cost of some
changes to the workqueue API.
The causes of bloat in struct work_struct are:
- The timer structure embedded in each one. Many users of workqueues
never need the delay feature, but every queued bit of work carries
along a timer_list structure, just in case.
- The private data pointer, which is passed to the actual work
function. Many work functions use that pointer, but it can often be
calculated from the work_struct pointer using
container_of().
- An entire word is used to store a single bit: the "pending" flag which
indicates that a work_struct is currently in a queue waiting
to be executed.
David addressed each of these issues. As a result, there are now two types
of work structure (struct work_struct and struct
delayed_work); the timer information has been removed from the
former. The private data pointer is gone; work functions instead get a
pointer to the associated work_struct (or delayed_work)
structure. And some internal trickery was used to get rid of the word
holding the "pending" bit.
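The nature of that trickery is not spelled out here. One common way to save such a word, shown below as a minimal userspace sketch and not as David's actual code, is to store the flag in the low bit of a pointer-sized field: any pointer to a word-aligned structure has its low bits clear, so bit 0 is free to carry the "pending" state. All of the names (struct packed_work, struct queue_data, and the helpers) are hypothetical, and a real kernel implementation would need atomic bit operations where this sketch uses plain ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for whatever per-queue data a work item refers to. */
struct queue_data {
    long placeholder;           /* forces at least word alignment */
};

#define WORK_PENDING_BIT ((uintptr_t)1)

/* One word carries both the queue pointer and the pending flag. */
struct packed_work {
    uintptr_t management;
};

/* Store the queue pointer while preserving the current pending flag. */
static void packed_work_set_queue(struct packed_work *w, struct queue_data *q)
{
    uintptr_t p = (uintptr_t)q;

    assert((p & WORK_PENDING_BIT) == 0);    /* alignment keeps bit 0 free */
    w->management = p | (w->management & WORK_PENDING_BIT);
}

/* Recover the queue pointer by masking the flag off. */
static struct queue_data *packed_work_queue(const struct packed_work *w)
{
    return (struct queue_data *)(w->management & ~WORK_PENDING_BIT);
}

/* Set the flag and report whether it was already set (not atomic here). */
static bool packed_work_test_and_set_pending(struct packed_work *w)
{
    bool was_pending = (w->management & WORK_PENDING_BIT) != 0;

    w->management |= WORK_PENDING_BIT;
    return was_pending;
}

static void packed_work_clear_pending(struct packed_work *w)
{
    w->management &= ~WORK_PENDING_BIT;
}
```

The cost of this approach is that every access to the pointer must mask off the flag bit, which is presumably why such tricks stay hidden behind helper functions.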
The result of these changes is that almost every part of the workqueue API
has changed. There are now two ways of declaring a workqueue entry:
typedef void (*work_func_t)(struct work_struct *work);
DECLARE_WORK(name, func);
DECLARE_DELAYED_WORK(name, func);
The prototype for the work function has changed; its sole argument is now a pointer to the
relevant work queue entry. Note that a work_struct pointer is
always passed, even in the case of delayed work. It would appear that the
programmer is expected to count on the fact that struct
work_struct is the first field of struct delayed_work, so
container_of() should work as expected. As long as nobody
rearranges struct delayed_work, anyway.
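As an illustration of how a work function might recover its context without the private data pointer, here is a userspace sketch with simplified stand-ins for the kernel structures. The struct my_device type, its fields, and the handler names are hypothetical; the expires field is only a placeholder for the timer information. Note that applying container_of() twice, first to reach the delayed_work and then the enclosing structure, does not depend on the position of the work_struct field at all.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel structures. */
struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);

struct work_struct {
    work_func_t func;
};

struct delayed_work {
    struct work_struct work;
    unsigned long expires;      /* placeholder for the timer information */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver state embedding both kinds of work entry. */
struct my_device {
    int event_count;
    struct work_struct reset_work;
    struct delayed_work poll_work;
};

/* Plain work: one container_of() step reaches the device. */
static void my_reset_handler(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, reset_work);

    dev->event_count = 0;
}

/* Delayed work: step out to the delayed_work first, then to the device. */
static void my_poll_handler(struct work_struct *work)
{
    struct delayed_work *dwork = container_of(work, struct delayed_work, work);
    struct my_device *dev = container_of(dwork, struct my_device, poll_work);

    dev->event_count++;
}

/* Stand-in for the worker thread invoking a dequeued entry. */
static void run_work_entry(struct work_struct *work)
{
    assert(work->func != NULL);
    work->func(work);
}
```

In real kernel code the entries would be set up with INIT_WORK() and INIT_DELAYED_WORK() and handed to the queueing functions; here run_work_entry() merely plays the part of the worker thread.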
For work structures which must be set up at run time, the initialization
macros now look like this:
INIT_WORK(struct work_struct *work, work_func_t func);
PREPARE_WORK(struct work_struct *work, work_func_t func);
INIT_DELAYED_WORK(struct delayed_work *work, work_func_t func);
PREPARE_DELAYED_WORK(struct delayed_work *work, work_func_t func);
The INIT_* versions initialize the entire structure; they must be
used the first time a structure is initialized. Thereafter, the
PREPARE_* versions, which are slightly faster, can be used.
The functions for adding entries to workqueues (and canceling them) now
look like this:
int queue_work(struct workqueue_struct *queue,
    struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue,
    struct delayed_work *work,
    unsigned long delay);
int queue_delayed_work_on(int cpu,
    struct workqueue_struct *queue,
    struct delayed_work *work,
    unsigned long delay);
int cancel_delayed_work(struct delayed_work *work);
void cancel_rearming_delayed_work(struct delayed_work *work);
Interestingly, David has added a variant on the workqueue declaration and
initialization macros:
DECLARE_WORK_NAR(name, func);
DECLARE_DELAYED_WORK_NAR(name, func);
INIT_WORK_NAR(name, func);
INIT_DELAYED_WORK_NAR(name, func);
PREPARE_WORK_NAR(name, func);
PREPARE_DELAYED_WORK_NAR(name, func);
The "NAR" stands for "non-auto-release." Normally, the workqueue subsystem
resets a work entry's pending flag prior to calling the work function; that
action, among other things, allows the function to resubmit itself if need
be. If the entry is initialized with one of the above macros, however,
this reset will not happen, and the work function is expected to reset the
flag itself (with a call to work_release()). The stated purpose
is to prevent the workqueue entry from being released before the work
function is done with it - but there is nothing in the clearing of the
pending bit which would cause that release to happen. Perhaps that is why
there are no users of the _NAR variants in David's patch. It may
be that somebody is thinking about implementing reference-counted workqueue
structures in the future.
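The difference between the two behaviors can be seen in a toy userspace model. This is only a sketch of the semantics described above, not the kernel's implementation: struct toy_work and every function name are invented for illustration, and the auto_release field stands in for the choice between the normal and _NAR macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_work;
typedef void (*toy_func_t)(struct toy_work *work);

struct toy_work {
    toy_func_t func;
    bool pending;        /* set while the entry sits in a queue */
    bool auto_release;   /* false models the _NAR variants */
    bool requeued;       /* did the function manage to resubmit itself? */
};

/* Queue an entry; refuses (returns false) if it is already pending. */
static bool toy_queue_work(struct toy_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/* Models work_release(): clear the pending flag. */
static void toy_work_release(struct toy_work *w)
{
    w->pending = false;
}

/* What a worker thread would do with one dequeued entry. */
static void toy_run_work(struct toy_work *w)
{
    assert(w->func != NULL);
    if (w->auto_release)
        toy_work_release(w);    /* normal case: reset before the call */
    w->func(w);
}

/* A work function which tries to resubmit itself without releasing. */
static void toy_requeue_func(struct toy_work *w)
{
    w->requeued = toy_queue_work(w);
}
```

With auto_release set, the resubmission inside toy_requeue_func() succeeds because the flag was cleared before the call. Without it, the entry still looks pending and cannot be queued again until something calls toy_work_release().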
Meanwhile, these changes require a lot of fixes throughout the kernel tree;
that drew a complaint from Andrew Morton,
who was unable to make those changes mesh with all of the other patches
queued up for the opening of the 2.6.20 merge window. Andrew suggested
that the workqueue patches could be merged after 2.6.20-rc1 comes out, as
was done with the interrupt handler function prototype in 2.6.19. But
Linus, who likes the workqueue patches, would
rather get them in sooner:
I'd actually prefer to take it before -rc1, because I think the
previous time we did something after -rc1 was a failure (the whole
irq argument handling thing). It just exposed too many problems too
late in the dev cycle. I'd rather have the problems be exposed by
the time -rc1 rolls out, and keep the whole "we've done all major
nasty ops by -rc1" thing.
So it seems that, somehow, all of the pieces will be made to fit and the
workqueue API will change in 2.6.20.
Memory fragmentation is a kernel programming issue with a long history. As
a system runs, pages are allocated for a variety of tasks with the result
that memory fragments over time. A busy system with a long uptime may have
very few blocks of pages which are physically-contiguous. Since Linux is a
virtual memory system, fragmentation normally is not a problem; physically
scattered memory can be made virtually contiguous by way of the page
tables.
But there are a few situations where physically-contiguous memory
is absolutely required. These include large kernel data structures (except
those created with vmalloc()) and any memory which must appear
contiguous to peripheral devices. DMA buffers for low-end devices (those
which cannot do scatter/gather I/O) are a classic example. If a large
("high order") block of memory is not available when needed, something will
fail and yet another user will start to consider switching to BSD.
Over the years, a number of approaches to the memory fragmentation problem
have been considered, but none have been merged. Adding any sort of
overhead to the core memory management code tends to be a hard sell. But
this resistance does not mean that people stop trying. One of the most
persistent in this area has been Mel Gorman, who has been working on an
anti-fragmentation patch set for some years. Mel is back with version 27 of his
patch, now rebranded "page clustering." This version appears to have
attracted some interest, and may yet get into the mainline.
The core observation in Mel's patch set remains that some types of memory
are more easily reclaimed than others. A page which is backed up on a
filesystem somewhere can be readily discarded and reused, for example,
while a page holding a process's task structure is pretty well nailed
down. One stubborn page is all it takes to keep an entire large block of
memory from being consolidated and reused as a physically-contiguous
whole. But if all of the easily-reclaimable pages could be kept together,
with the non-reclaimable pages grouped into a separate region of memory, it
should be much easier to create larger blocks of free memory.
So Mel's patch divides each memory zone into three types of blocks:
non-reclaimable, easily reclaimable, and movable. The "movable" type is a
new feature in this patch set; it is used for pages which can be easily
shifted elsewhere using the kernel's page migration mechanism. In
many cases, moving a page might be easier than reclaiming it, since there
is no need to involve a backing store device. Grouping pages in this way
should also make the creation of larger blocks "just happen" when a process
is migrated from one NUMA node to another.
So, in this patch, movable pages (those marked with __GFP_MOVABLE)
are generally those belonging to user-space processes. Moving a user-space
page is just a matter of copying the data and changing the page table
entry, so it is a relatively easy thing to do. Reclaimable pages
(__GFP_RECLAIMABLE), instead, usually belong to the kernel. They
are either allocations which are expected to be short-lived (some kinds of
DMA buffers, for example, which only exist for the duration of an I/O
operation) or can be discarded if needed (various types of caches).
Everything else is expected to be hard to reclaim.
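The grouping decision itself is simple enough to sketch in a few lines. What follows is a toy userspace model under loud assumptions: TOY_GFP_MOVABLE and TOY_GFP_RECLAIMABLE stand in for the __GFP_MOVABLE and __GFP_RECLAIMABLE flags with made-up bit values, struct toy_zone is invented, and the precedence given to the movable flag when both are set is arbitrary. The patch set's real data structures are not shown in this article.

```c
#include <assert.h>

/* Stand-ins for __GFP_MOVABLE and __GFP_RECLAIMABLE; bit values are made up. */
#define TOY_GFP_MOVABLE      0x1u
#define TOY_GFP_RECLAIMABLE  0x2u

enum block_type {
    BLOCK_UNMOVABLE,     /* "everything else": hard to reclaim */
    BLOCK_RECLAIMABLE,   /* short-lived or discardable kernel allocations */
    BLOCK_MOVABLE,       /* mostly user-space pages */
    BLOCK_TYPES
};

/* Pick the group an allocation belongs to from its flags. */
static enum block_type block_type_for(unsigned int gfp_flags)
{
    if (gfp_flags & TOY_GFP_MOVABLE)
        return BLOCK_MOVABLE;
    if (gfp_flags & TOY_GFP_RECLAIMABLE)
        return BLOCK_RECLAIMABLE;
    return BLOCK_UNMOVABLE;     /* non-movable unless the caller says otherwise */
}

/* Hypothetical zone keeping a free-page count per group. */
struct toy_zone {
    unsigned long nr_free[BLOCK_TYPES];
};

/*
 * Take one page from the matching group.  Returns 0 on success or -1 if
 * that group is empty; a real allocator would presumably fall back to
 * another group at that point instead of failing.
 */
static int toy_alloc_page(struct toy_zone *zone, unsigned int gfp_flags)
{
    enum block_type type = block_type_for(gfp_flags);

    assert(type < BLOCK_TYPES);
    if (zone->nr_free[type] == 0)
        return -1;
    zone->nr_free[type]--;
    return 0;
}
```

Keeping the three pools separate is what prevents a single long-lived kernel allocation from landing in the middle of a region otherwise full of easily moved pages.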
By simply grouping different types of allocation in this way, Mel was able
to get some pretty good results:
In benchmarks and stress tests, we are finding that 80% of memory
is available as contiguous blocks at the end of the test. To
compare, a standard kernel was getting < 1% of memory as large
pages on a desktop and about 8-12% of memory as large pages at the
end of stress tests.
Linus has, in the past, been generally opposed to efforts to reduce memory
fragmentation. His comments this time
around have been much more detail-oriented, however: should allocations be
considered movable or non-movable by default? The answer would appear to
be "non-movable," since somebody always has to make some effort to ensure
that a specific allocation can be moved. Since the discussion is now
happening at this level, some sort of fragmentation avoidance might just
find its way into the kernel.
A related approach to fragmentation is the lumpy reclaim mechanism posted
by Andy Whitcroft but originally written by Peter Zijlstra. Memory reclaim in
Linux is normally done by way of a least-recently-used (LRU) list; the hope
is that, if a page must be discarded, going after the least recently used
page will minimize the chances of throwing out a page which will be needed
soon. This mechanism will tend to free pages which are scattered randomly
in the physical address space, however, making it hard to create larger
blocks of free memory.
The lumpy reclaim patch tries to address this problem by modifying the LRU
algorithm slightly. When memory is needed, the next victim is chosen from
the LRU list as before. The reclaim code then looks at the surrounding
pages (enough of them to form a higher-order block) and tries to free them
as well. If it succeeds, lumpy reclaim will quickly create a larger free
block while reclaiming a minimal number of pages.
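A toy model may make the idea more concrete. The sketch below represents memory as an array of page states and, given a victim page and a desired order, tries to free every page in the naturally aligned block containing that victim. The enum, the function name, and the three-state view of a page are all assumptions made for illustration; the actual patch works on the kernel's LRU lists and page structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_page_state { PAGE_FREE, PAGE_RECLAIMABLE, PAGE_PINNED };

/*
 * Free what can be freed in the aligned block of (1 << order) pages
 * which contains the victim.  Returns true only if the whole block is
 * free afterward, meaning a higher-order allocation could be satisfied.
 */
static bool toy_lumpy_reclaim(enum toy_page_state *pages, size_t nr_pages,
                              size_t victim, unsigned int order)
{
    size_t block = (size_t)1 << order;
    size_t start = victim & ~(block - 1);   /* round down to block boundary */
    size_t end = start + block;
    bool whole_block = true;

    assert(victim < nr_pages);
    if (end > nr_pages) {                   /* block runs off the end of memory */
        end = nr_pages;
        whole_block = false;
    }

    for (size_t i = start; i < end; i++) {
        if (pages[i] == PAGE_RECLAIMABLE)
            pages[i] = PAGE_FREE;
        else if (pages[i] == PAGE_PINNED)
            whole_block = false;            /* one stubborn page spoils the block */
    }
    return whole_block;
}
```

Even in this simplified form, a single pinned page is enough to defeat the attempt, which is why the approach benefits from keeping such pages grouped elsewhere.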
Clearly, this approach will work better if the surrounding pages can be
freed. As a result, it combines well with a clustering mechanism like Mel
Gorman's. The distortion of the LRU approach could have performance
implications, since the neighboring pages may be under heavy use when the
lumpy reclaim code goes after them. In an attempt to minimize this effect,
lumpy reclaim only happens when the kernel is having trouble satisfying a
request for a larger block of memory.
If - and when - these patches may be merged is yet to be seen. Core memory
management patches tend to inspire a high level of caution; they can easily
create chaos when exposed to real-world workloads. The problem doesn't go
away by itself, however, so something is likely to happen, sooner or later.
The capability model has some real appeal. It replaces the "all or
nothing" security model inherent in the root account with a set of
fine-grained permissions describing exactly what a given process can do.
Linux has supported capabilities for years, but this feature has seen
little use for a number of reasons; see
this article from last September
for more general discussion of capabilities.
The fact that capabilities have not been used much has not stopped
developers from trying to improve the feature. The latest attempt is the
file capabilities patch by
Serge Hallyn. This patch allows a system administrator to add specific
capabilities to an executable file; when that file is executed, the
process's capability masks will be set to the capabilities associated with
the file. This feature thus functions somewhat like the file setuid bit,
but with finer control.
On the kernel side, file-based capabilities work through the extended attribute
mechanism. Capabilities are added to a file by setting an attribute named
security.capability; the value of the attribute will be this
structure:
struct vfs_cap_data_disk {
__le32 version;
__le32 effective;
__le32 permitted;
__le32 inheritable;
};
The version field holds the current capability version; the other
three hold the expected capability masks.
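Since the fields are declared as __le32, the attribute value is four little-endian 32-bit words regardless of the host's byte order. A user-space tool reading the raw attribute might decode it along the following lines; this is a sketch only, with struct toy_cap_data, TOY_CAP_XATTR_SIZE, and the function names invented for the example, and with the assumption that the value is exactly 16 bytes long as the structure above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian view of the four fields described above. */
struct toy_cap_data {
    uint32_t version;
    uint32_t effective;
    uint32_t permitted;
    uint32_t inheritable;
};

#define TOY_CAP_XATTR_SIZE 16u   /* four 32-bit fields */

/* Assemble a 32-bit value from four little-endian bytes. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Decode a raw attribute value; returns 0 on success, -1 on a bad length. */
static int toy_decode_caps(const unsigned char *buf, size_t len,
                           struct toy_cap_data *out)
{
    assert(buf != NULL && out != NULL);
    if (len != TOY_CAP_XATTR_SIZE)
        return -1;
    out->version     = load_le32(buf);
    out->effective   = load_le32(buf + 4);
    out->permitted   = load_le32(buf + 8);
    out->inheritable = load_le32(buf + 12);
    return 0;
}
```

Decoding byte by byte, instead of casting the buffer to a structure, keeps the code correct on big-endian machines and avoids alignment problems.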
There are a few interesting features of this implementation:
- One might wonder what keeps the user from just setting an extended
attribute and obtaining whatever capabilities might be desired. While
setting extended attributes is not a privileged operation, setting
attributes whose name starts with "security." is. So, unless
the user has root privileges, he or she will not be able to set
capability attributes. (For the curious, the other restricted
attributes are trusted.*, which only root can query or
change, and user.*, which, in some situations, can only be
changed by the owner of the file).
- The capability masks stored with the file completely overwrite the
process's current capabilities. So, if the root user executes a file
with capabilities set, it may run with fewer capabilities than
it would have otherwise had.
- The setting of capabilities is done outside of the check for
filesystems mounted with the nosuid option. This behaviour
would appear to open the system up to attacks via a removable
filesystem created on a different system.
A set of user-space tools exists for working with file-based capability
masks; see the filesystem
capabilities page for downloads, documentation, and examples.
Before celebrating the arrival of file capabilities, it is worth asking
whether system administrators really need another 31 (at last count)
permission bits - multiplied by three separate capability masks - to manage
on every executable file on the system. It can be hard to keep file
permission bits in proper order even without capabilities. A full
capability-based system would approach SELinux in complexity, and may thus
be beyond the ability of most people to manage. But one could use this
feature to assign restricted capabilities to programs which currently run
setuid root. In many cases, root privilege is only needed to bind to a
low-numbered port, adjust the system time, or perform raw I/O.
Restricting a program to its needed capabilities should reduce the chances
of that program being used to do something unexpected.
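To make that concrete, here is a small userspace model of a process's masks being overwritten at exec time, as described above. The capability numbers mirror CAP_NET_BIND_SERVICE, CAP_SYS_RAWIO, and CAP_SYS_TIME from <linux/capability.h>; everything else (struct toy_cap_sets and the function names) is hypothetical and greatly simplified compared to what the kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Numbers mirror the definitions in <linux/capability.h>. */
#define TOY_CAP_NET_BIND_SERVICE 10   /* bind to ports below 1024 */
#define TOY_CAP_SYS_RAWIO        17   /* perform raw I/O */
#define TOY_CAP_SYS_TIME         25   /* set the system clock */

#define TOY_CAP_MASK(cap) ((uint32_t)1 << (cap))

struct toy_cap_sets {
    uint32_t effective;
    uint32_t permitted;
    uint32_t inheritable;
};

/* Model of exec with file capabilities: the file's masks replace the process's. */
static void toy_exec_file_caps(struct toy_cap_sets *proc,
                               const struct toy_cap_sets *file)
{
    assert(proc != NULL && file != NULL);
    *proc = *file;
}

/* Is the given capability currently effective? */
static int toy_has_cap(const struct toy_cap_sets *proc, int cap)
{
    return (proc->effective & TOY_CAP_MASK(cap)) != 0;
}
```

A daemon which only needs a privileged port would carry a file mask of TOY_CAP_MASK(TOY_CAP_NET_BIND_SERVICE), or 0x400; after the exec, even a process started by root can no longer set the clock or perform raw I/O.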
Page editor: Jonathan Corbet