Brief items
The current 2.6 prepatch is 2.6.10-rc2,
announced by Linus on November 14.
Patches merged since -rc1 include fixes for the
ELF loader security problems, Anubis block
cipher support, an ALSA update, a number of networking updates, kprobes
support for the x86-64 architecture, a frame buffer device update, a set of
user-mode Linux patches, an NTFS update, version 2.0 of the USB gadget
serial driver, some kernel build tweaks (the preferred name for kernel
makefiles is now
Kbuild), the ext3 block reservation and online
resizing patches,
sysfs backing store,
locking behavior annotations for the "sparse" utility, a reworking of spin
lock initialization, the un-exporting of
add_timer_on(),
sys_lseek(), and a number of other kernel functions, an x86 signal
delivery optimization, an IDE update,
I/O space
write barrier support, a frame buffer driver update, more scheduler
tweaks, some big kernel lock preemption patches, a large number of
architecture updates, and lots of fixes. See
the long-format changelog (600KB) for the
details.
Linus has noted that now would be a good time to calm down and stick to bug
fixes until 2.6.10 comes out. His BitKeeper repository shows that he is
sticking to that; it contains mostly fixes. There is also a memory
technology device (and JFFS2) update, a frame buffer device update, some
user-mode Linux patches, some page allocator tuning, and a few architecture
updates.
The current prepatch from Andrew Morton is 2.6.10-rc2-mm1. Recent changes to -mm include
some kmap_atomic() changes (see below), the ability to disable a
subset of "magic sysrq" features, some SELinux scalability work, enhanced
I/O and memory usage accounting data collection, and an updated reiser4
filesystem.
The 2.4.28 kernel has been released; Marcelo announced its availability on
November 17. The biggest changes since 2.4.27, for many people, will
be the serial ATA and networking improvements, but many other fixes have
gone in as well.
Kernel development news
Well, yes, your base appetites have led you to the name "pud",
where my refined intellect led me to "phd", with h for higher ;)
-- Hugh Dickins on the philosophy of page
table naming.
"High memory" on a Linux system is, by definition, memory which is not
normally mapped into the kernel's virtual address space. It is a mechanism
which enables 32-bit architectures to make use of more physical memory than
would otherwise be possible. When the kernel needs to directly manipulate
the contents of a high-memory page, it must explicitly create a virtual
address for it. The traditional functions for creating and removing those
addresses are:
void *kmap(struct page *page);
void kunmap(struct page *page);
These functions work as intended, but they can be expensive to use. The
virtual address space they use is limited, and shared across all
processors. As a result, kmap() and kunmap() operations can require
global TLB flushes. Often, however, high memory does not need to be mapped for
long periods of time, and does not need to be shared across processors. To
improve performance in such situations, the notion of an "atomic kmap" was
added:
void *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(void *address, enum km_type type);
Atomic kmaps use a very small set of predefined virtual "slots," which are
not shared across processors. The type argument specifies which
slot is to be used, with the callers taking responsibility for not stepping
on each other's toes. Slots are dedicated to specific purposes - two for code called
in user context, two for interrupt handlers, two for page table management,
etc. In practice, it all works out; conflicts over atomic kmap slots don't
happen.
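The slot scheme can be modeled in user space. The sketch below is not kernel code: every model_ and MODEL_ name, the CPU count, and the set of slots are assumptions made for illustration. It only tries to show why one fixed slot per purpose, per processor, needs neither allocation nor cross-processor coordination, and what a slot conflict would look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define MODEL_NR_CPUS 2

enum model_km_type {            /* one slot per purpose */
    MODEL_KM_USER0,
    MODEL_KM_USER1,
    MODEL_KM_IRQ0,
    MODEL_KM_IRQ1,
    MODEL_KM_TYPE_NR
};

struct model_page { char data[64]; };

/* Each CPU owns a private row of slots, so no locking is modeled. */
static struct model_page *model_slots[MODEL_NR_CPUS][MODEL_KM_TYPE_NR];

/* Map a page into the caller-chosen slot; fail if the slot is busy. */
char *model_kmap_atomic(int cpu, struct model_page *page,
                        enum model_km_type type)
{
    if (model_slots[cpu][type] != NULL)
        return NULL;            /* two users collided on one slot */
    model_slots[cpu][type] = page;
    return page->data;
}

/* Release the slot; the address must match what the slot holds. */
int model_kunmap_atomic(int cpu, char *address, enum model_km_type type)
{
    struct model_page *page = model_slots[cpu][type];

    if (page == NULL || page->data != address)
        return -1;
    model_slots[cpu][type] = NULL;
    return 0;
}
```

In this model, a second mapping into a busy slot on the same processor fails, while the same slot on another processor is unaffected. The real functions do no such checking by default, which is why the convention about slot ownership matters.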
Another problem has come up, however, and that has led to a small
change in the prototypes of the atomic kmap functions in the -mm kernel. The
regular kmap functions have a symmetrical interface in that both take a
struct page * argument. kunmap_atomic(), instead,
takes a void * argument - the kernel virtual address to be
unmapped. It is a common mistake, however, to pass in the associated
struct page pointer instead. Since the argument type is
void *, the compiler does not complain, and the discovery of
the problem does not come until (possibly much) later.
The solution is straightforward: redefine the functions as follows:
char *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(char *address, enum km_type type);
With this change, the compiler will issue a warning whenever somebody tries
to pass a struct page pointer to kunmap_atomic().
The patch has generated a surprising number of follow-on fixes, mostly to
suppress warnings caused by the change. Many kunmap_atomic()
calls now explicitly cast the address argument to the char *
type. In the end, though, the result should be one more potential mistake
which can be caught before it burns somebody - as long as programmers
don't "fix" warnings by casting struct page pointers.
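The effect of the type change can be reproduced outside the kernel. In the sketch below, struct demo_page and the demo_ functions are stand-ins invented for illustration; they are not the kernel's definitions. The point is only that a void * parameter accepts a structure pointer silently, while a char * parameter makes the compiler complain about the same call.

```c
#include <assert.h>

/* Stand-ins for illustration only; not the kernel's definitions. */
struct demo_page {
    char *mapping;      /* pretend this is where the page got mapped */
};

/* Old style: a void * parameter accepts any object pointer silently. */
void *demo_unmap_old(void *address)
{
    return address;
}

/* New style: only a char * is accepted without a diagnostic. */
char *demo_unmap_new(char *address)
{
    return address;
}

/* Returns 1 if the old-style call was handed the wrong pointer. */
int demo_old_style_bug(struct demo_page *page)
{
    char *vaddr = page->mapping;

    /* Bug: passes the page, not the mapped address; compiles cleanly. */
    void *passed = demo_unmap_old(page);

    /*
     * Writing demo_unmap_new(page) instead would draw an
     * incompatible-pointer-type diagnostic at build time.
     * Writing demo_unmap_new((char *)page) would hide the bug again.
     */
    return passed != (void *)vaddr;
}
```

The last comment is the hazard noted above: a cast added to quiet the warning defeats the check just as thoroughly as the old void * prototype did.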
Linux currently offers a wealth of projects which are working to extend the
classic Unix permissions mechanism with more flexible schemes. One recent
entry is
an LSM port of Trustees Linux,
which has been done by Andrew Ruder. Trustees Linux starts with the idea
that access control lists are overly complicated and inefficient; achieving
the desired goals can require hanging ACLs on thousands of files, and
keeping all of those ACLs in sync can be a challenge.
The Trustees approach, instead, is to create a separate, central database
which contains filesystem permissions. This database can assign a
"trustee" to a directory; this trustee provides access permissions which
apply to the directory and, by default, everything below that directory. A
single rule can, thus, cover a large part of the filesystem hierarchy.
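How one rule comes to cover a whole subtree can be sketched with a simple path-prefix test. This is an assumption about the general idea, not a description of how the Trustees module actually matches paths; the function name is hypothetical, and directory names are assumed to carry no trailing slash (other than the root itself).

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `path` is `dir` itself or lies anywhere below it. */
int model_trustee_covers(const char *dir, const char *path)
{
    size_t n = strlen(dir);

    if (strncmp(dir, path, n) != 0)
        return 0;
    /* A directory ending in '/' (the root) covers everything under it. */
    if (n > 0 && dir[n - 1] == '/')
        return 1;
    /* Match only on a component boundary: /var/log must not cover /var/logs. */
    return path[n] == '\0' || path[n] == '/';
}
```

With a test like this, a single entry for /var/log answers for /var/log/messages and everything else beneath it, where an ACL-based scheme would need an entry on each file.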
The trustee rules cover the usual sorts of permissions: who can search for,
read, and write files in a given subtree. The format is somewhat terse;
one of the rules provided in the examples is:
[/dev/hda1]/var/log:zavadsky:REB
This rule enables user "zavadsky" to wander around in (and under)
/var/log and read files there.
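The rule above breaks down into a device, a directory, a user, and a string of permission letters. A minimal parsing sketch follows; the structure, the function name, and the buffer sizes are all hypothetical, only one user:mask pair per rule is handled, and the permission letters are stored without being interpreted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed rule. */
struct model_rule {
    char device[64];
    char path[128];
    char user[32];
    char mask[16];      /* permission letters, kept uninterpreted */
};

/* Parse "[device]path:user:mask"; returns 0 on success, -1 on bad input. */
int model_parse_rule(const char *text, struct model_rule *out)
{
    /* Each conversion is width-limited to fit its destination buffer. */
    int n = sscanf(text, "[%63[^]]]%127[^:]:%31[^:]:%15s",
                   out->device, out->path, out->user, out->mask);

    return n == 4 ? 0 : -1;
}
```

Fed the example rule, this yields the device /dev/hda1, the directory /var/log, the user zavadsky, and the mask REB.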
Mr. Ruder's port is centered around the Linux security module
inode_permission() hook; that code examines the trustees which
apply to a given inode and decides whether the requested access is to be
allowed or not.
It's all pretty straightforward, but there is an interesting
twist to how Trustees works with file permissions: the module gives the
CAP_DAC_OVERRIDE capability to every process, allowing them to
override the existing Unix file permissions. The Trustees module will, in
turn, apply those permissions itself much of the time, but it is possible
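to write rules which override them. In this sense, Trustees functions as
an authoritative module, which is not how security modules are supposed to
work: LSM hooks are meant to restrict access further, not to grant access
that the normal checks would deny. If Trustees Linux is ever proposed for merging into the mainline,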
that little feature could come back to haunt it.
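The difference between a restrictive and an authoritative module can be put in a few lines of logic. The functions below are a simplified model with invented names, not kernel code: they treat the classic permission check and the module's check as two yes/no answers and show what changes once the first answer is always "yes".

```c
#include <assert.h>

/* Restrictive stacking: the module can only veto what DAC already allows. */
int model_restrictive(int dac_allows, int module_allows)
{
    return dac_allows && module_allows;
}

/*
 * With an override capability granted to every process, the DAC check
 * always passes, so the module's answer alone decides the outcome.
 */
int model_with_override(int dac_allows, int module_allows)
{
    (void)dac_allows;           /* bypassed by the capability */
    return model_restrictive(1, module_allows);
}
```

In the second function, an access that the file's mode bits would refuse goes through whenever the module says so, which is the authoritative behavior described above.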
There has, in recent times, been a small increase in the number of
complaints from users who have seen processes killed by the kernel in
response to an out-of-memory (OOM) situation. The only problem is that the
system should not have been quite that hard up for memory at the time.
Even if the user is doing something which requires completely irrational
amounts of memory ("yum update", say), it seems like the system
should have been able to muddle along without killing important
processes, like the ssh server. These unwanted OOM killer experiences have
driven a few developers to take a closer look at what was going on.
Marcelo Tosatti has been working on the problem for a bit; he put together
a patch which tries to avoid invocations of
the OOM killer if things might get better soon. The idea is that, while a
full scan of a memory zone may have failed to turn up any free pages, it
may have kicked I/O into motion that will, very soon, make some pages
free. So the OOM killer is kept in its cage until the no-memory situation
has persisted for a few seconds. Marcelo reported that this patch improved
things significantly for his test cases.
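The "wait and see" idea can be expressed as a small state machine. This is a user-space sketch of the concept only, not Marcelo's patch: the structure, the function name, and the five-second grace period are assumptions (the text says only "a few seconds").

```c
#include <assert.h>

/* Hypothetical model of "wait before killing"; not the actual patch. */
struct model_oom_state {
    int failing;            /* 1 while reclaim keeps coming up empty */
    long first_failure;     /* time, in seconds, when the failures began */
};

#define MODEL_OOM_GRACE 5   /* assumed grace period, in seconds */

/*
 * Called after each reclaim attempt. Returns 1 only when reclaim has
 * failed continuously for at least MODEL_OOM_GRACE seconds.
 */
int model_should_kill(struct model_oom_state *s, int reclaim_failed, long now)
{
    if (!reclaim_failed) {
        s->failing = 0;     /* pages turned up; reset the clock */
        return 0;
    }
    if (!s->failing) {
        s->failing = 1;
        s->first_failure = now;
    }
    return now - s->first_failure >= MODEL_OOM_GRACE;
}
```

Any successful reclaim in the interim resets the clock, so a brief shortage that pending I/O resolves never reaches the OOM killer.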
It turns out, though, that the root of the problem was elsewhere; the token-based thrashing control patch appears to
be the real culprit. This patch, remember, tries to reduce system
thrashing in memory-constrained situations by exempting one process at a
time from the page reclaim mechanism. That process will, in theory, make
use of its sheltered time to make some real progress before the token moves
on and its pages are, once again, subject to eviction. The token-based
mechanism has been shown to truly improve the situation when memory is
tight.
Until it gets too tight, as it turns out. A process which needs a page,
but which does not hold the token, may find that all of the (otherwise)
reclaimable pages belong to the process currently holding the token. The
unlucky process thus finds no pages to grab, and pushes the big red OOM
button. The system is not truly out of memory, however; it has simply been
told that all the good pages are temporarily off limits.
Rik van Riel put his finger on the problem, and Andrew Morton put together
a simple patch to fix it. Essentially, the
VM subsystem will now ignore the swap token when finding reclaimable pages
gets too hard. During normal operation, the token-based mechanism holds
sway, but it can be set aside as a preferable alternative to killing random
processes in the system. The patch appears to have solved the problems
without taking away the benefits of the token-based approach.
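Both the failure and the fix can be seen in a toy model of page reclaim. Everything here is an illustrative assumption (the page structure, the function names, and the simple retry standing in for the kernel's notion of reclaim getting "too hard"); it is not the patch itself.

```c
#include <assert.h>

/* Hypothetical stand-in for a page as seen by the reclaim scan. */
struct model_vm_page {
    int owner;          /* id of the process using the page */
    int reclaimable;    /* 1 if the page could be evicted */
};

/*
 * Index of the first evictable page, or -1 if none is found.
 * Token-holder pages are skipped only while respect_token is nonzero.
 */
int model_scan(const struct model_vm_page *pages, int n,
               int token_holder, int respect_token)
{
    for (int i = 0; i < n; i++) {
        if (!pages[i].reclaimable)
            continue;
        if (respect_token && pages[i].owner == token_holder)
            continue;
        return i;
    }
    return -1;
}

/* First honor the token; if that finds nothing, retry ignoring it. */
int model_reclaim(const struct model_vm_page *pages, int n, int token_holder)
{
    int i = model_scan(pages, n, token_holder, 1);

    if (i < 0)
        i = model_scan(pages, n, token_holder, 0);
    return i;   /* -1 here would mean nothing at all is reclaimable */
}
```

When every reclaimable page belongs to the token holder, the token-respecting scan comes back empty, which is the false OOM condition. The fallback scan then finds a page, and only a failure of both scans would suggest that memory is truly exhausted.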
Marcelo acknowledged that this was the right fix, grumbled that he had
wasted a bunch of time, and promised:
"Next time I should be looking into the easy stuff before trying
miraculous solutions." It was his work, however, which shone a
light on the problem in the first place, and led to its eventual solution.
Page editor: Jonathan Corbet