The 2.6.26 merge window remains open, so there is no released 2.6
development kernel. See the article below for a summary of patches merged
over the last week.
No stable kernel releases have been made over the last week. As of
this writing, the 184.108.40.206 and 220.127.116.11 stable updates are in the review
process; if all goes well, these updates should be released on May 1.
Kernel development news
Those who have been watching the linux-kernel list know that the 2.6.26
merge window has been a little rougher than some of those which came
before. That has led to some fairly strong discussion over how changes
find their way into the mainline. Here are a few selections.
I'm not saying the patch is wrong ... or that just because it broke
voyager it shouldn't be done. What I'm saying is that it shouldn't have
been put into the x86 tree without mailing list review.
Running a git tree isn't a private fiefdom, it's a public trust; to keep
the trust of other developers, you have to run the tree in a transparent
fashion ... and making the mailing list the only input to it is one way
of ensuring this. It also helps with review that we're all so worried
about so little being done ...
-- James Bottomley
But, we'd not mind at all posting 1000 x86.git patches to lkml (or
another list) every 3 months (or more frequently), if people request
-- Ingo Molnar
You can post whatever patches you like a million times to lkml.
That's not the problem.
It's that the patches don't get reviewed, posting them more or to a
different place doesn't help that.
-- David Miller
Sorting x86 arch code is inevitably going to break a few eggs, but I
suspect the time cost has been more in Dave v Ingo (12 rounds, two falls,
two submissions or a knockout) than actually sorting out the fallout of a
couple of problem cases.
-- Alan Cox
So here's how we're going to fix David's problem:
- Everyone gets their stuff into linux-next.
- Lots of people _test_ linux-next. Just once a week.
Those two steps will improve the merge-window chaos a lot. Things will get
-- Andrew Morton
IMO, the merge window is way too short for actually testing anything. I rebuild
the kernel once or even twice a day and there's no way I can really test it.
I can only check if it breaks right away. And if it does, there's no time to
find out what broke it before the next few hundreds of commits land on top of
-- Rafael Wysocki
And yes, there is a solution: don't develop so much. Don't allow thousands
of developers to be involved. Do a small core group, and make development
so hard or inconvenient that you only have a few tens of people who write
code, and vet them and force them to jump through hoops when adding new
features (or fixing old ones, for that matter).
-- Linus Torvalds
Since last week's summary was
written, another 3700 changesets have found their way into the
mainline git repository. The most significant user-visible changes
include:
- New drivers have been merged for Wolfson WM9713 codecs,
TI DAVINCI AC97 sound chips,
Emagic Audiowerk 2 soundcards,
x86 PC speakers (new driver which makes them look like sound cards),
Asus AV100 (Xonar DX) sound cards,
Micron MT9M001 and MT9V022 cameras,
PXA27x Quick Capture cameras,
Kworld ATSC 120 tuners,
cx23417 MPEG encoders,
Integrant ITD1000 tuners,
Philips TDA10048HN-based demodulators,
Philips SAA7171/3/4 audio/video decoders (the last out-of-tree IVTV
Auvitek AU8522 demodulators,
Samsung S5H1411-based tuners,
framebuffer, keyboard, and mouse virtual devices (for Xen),
several Wolfson Microelectronics touchscreens,
wireless Xbox 360 controllers,
Zhen Hua PPM-4CH transmitters,
SPCP8x5 USB to serial adaptors,
NCR 53c9x SCSI controllers (replacement driver),
Freescale 8610 and 5121 display interface units,
Intel 965G/965GM integrated graphics controllers,
TI OMAP sound controllers (including the one on the Nokia 810),
Eee PC function keys, and
Intel IXP4xx Ethernet devices.
- There is now "basic" support for braille screen readers.
- Support for the One Laptop Per Child XO architecture has been merged
into the mainline.
- The new virtual files found in /proc/pid/mountinfo
provide information on all filesystem mounts visible to the relevant
process.
- The new virtual file /proc/vmallocinfo displays information
on use of vmalloc space within the kernel.
- The SPARC Niagara architecture now has NUMA support.
- The Xen balloon driver (allowing memory to be added to or removed from
virtual guests) has been merged.
- By default, /dev/mem can no longer be used to access RAM;
Fedora and Red Hat have applied this patch for years, but now it has
found its way into the mainline.
- The KVM paravirtualization subsystem now supports the S/390, PowerPC
440, and ia64 architectures.
- Per-process "securebits" are supported. These bits control how a
process's capability bits are managed; the patch is intended to help
those who would transition over to a fully capability-based system.
See this article for a
more detailed description of this feature.
- The getrusage() system call has a new RUSAGE_THREAD
option which causes it to return information about the current thread
- The device whitelist control group patch (described briefly in this article) has been
merged.
- It is now possible to create and use partitions with network block
device (NBD) devices.
- The audit subsystem can now test events against the type of the file
being operated upon.
- The VFS now makes backing device information available under
/sys/class/bdi. Interested people can look at per-device
readahead and writeback variables there.
- The FUSE filesystem now supports the creation of shared writable
mappings.
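The RUSAGE_THREAD option in the list above can be tried from user space. What follows is a minimal sketch, assuming a glibc system (which only exposes the constant when _GNU_SOURCE is defined) and a kernel that includes the new option. The helper name thread_cpu_usec() is hypothetical, and the fallback definition uses the Linux value of RUSAGE_THREAD.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc hides RUSAGE_THREAD without this */
#endif
#include <sys/time.h>
#include <sys/resource.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1         /* Linux value, used if the headers hide it */
#endif

/*
 * Hypothetical helper: user plus system CPU time consumed by the calling
 * thread, in microseconds. Returns -1 if getrusage() fails, for example
 * on a kernel that predates RUSAGE_THREAD.
 */
long long thread_cpu_usec(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1;

    return (long long)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
           + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```

Calling the same helper with RUSAGE_SELF in place of RUSAGE_THREAD would report the total for every thread in the process, which is the only per-process view that was available before this option.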
Changes visible to kernel developers include:
- ioremap() on the x86 architecture will now always return an
uncached mapping. Previously, it had taken a more relaxed approach,
leaving the caching as the BIOS had set it up. The practical result
was to almost always create uncached mappings, but with
occasional exceptions. Drivers which depend on a cached mapping will
now break; they will need to use ioremap_cache() instead.
- The Video4Linux2 API now defines a set of controls for camera devices;
they allow user space to work with parameters like exposure type, tilt
and pan, focus, and more.
- On the x86 architecture, there is a new configuration parameter which
allows gcc to make its own decisions about the inlining of functions,
even when functions are declared inline. In some cases, this
option can reduce the size of the kernel's text segment by over 2%.
- The legacy IDE layer has gone through a lot of internal changes which
will break any remaining IDE drivers.
- The nopage() virtual memory area operation has been removed;
all in-tree code is now using fault() instead.
- The SLUB allocator supports a new sysfs file
(/sys/kernel/slab/name/order) which allows system
administrators to change the size of page allocations used by the
- A condition which triggers a warning from WARN_ON will now
also taint the kernel.
- The get_info() interface for /proc files has been
removed. There is also a new function for creating /proc
files:
struct proc_dir_entry *proc_create_data(const char *name, mode_t mode,
struct proc_dir_entry *parent,
const struct file_operations *proc_fops,
void *data);
This version adds the data pointer, ensuring that it will be
set in the resulting proc_dir_entry structure before user
space can try to access it.
- The object debugging
infrastructure has been merged.
The merge window remains open; tune in next week for (what should be) the
final set of changes merged for 2.6.26.
Linux capabilities have had a long and somewhat tortuous journey as part of
the Linux kernel. Slowly—and very carefully—functionality is
being added to this security feature to get it to a point where it is a
viable alternative to the all-or-nothing setuid(0) model. A
recently merged patch
adds a per-process securebits feature that will allow capabilities-based
daemons or subsystems to coexist with existing setuid utilities.
Linux capabilities break up the privileged tasks
normally associated with root (i.e. uid 0) into finer-grained abilities
which can be individually granted or revoked for specific processes. The
idea is to change the standard Unix model that root has all special
privileges while all other users have none.
The terminology is always a bit contentious, though, as Linux capabilities are
derived from a POSIX proposal that was never adopted, but share the name
"capabilities" with an entirely
different approach; this article is only concerned with capabilities of
the Linux variety.
There has long been interest in creating a Linux system that did not rely upon
a single root account. Capabilities are seen as the way to
get there, but they have suffered from a bit of a chicken-and-egg problem.
With the recent work to add file-based
capabilities and restore
CAP_SETPCAP to its original meaning, a true
capabilities-based system is becoming possible. In the patch, which has
been merged for 2.6.26, Andrew Morgan describes the new functionality:
The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0. This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()). This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.
The patch removes the global securebits variable, replacing it with an
entry in struct task_struct, that can be manipulated by a process,
but only for itself—and any children. Morgan envisions hybrid
systems that have
some utilities using capabilities to get their privileges along with some
setuid(0) utilities. In that scenario, a capabilities-based
utility or daemon may wish to limit what its children can do, even if they execute a
setuid(0) binary. As part of the evolution, process trees can be
created that cannot get root privileges.
Processes which have the CAP_SETPCAP capability can change their securebits setting
via the prctl() system call. There are three separate bits that
govern the interaction of capabilities and setuid:
- SECURE_NOROOT – enabling this gives no special privileges to uid
0.
- SECURE_NO_SETUID_FIXUP – setting this bit disables capability
fixes when transitioning from or to uid 0 via setuid. This might be
done for compatibility with older programs that use setuid to
reduce their privileges.
- SECURE_KEEP_CAPS – when set, a process can retain its
capabilities even when transitioning to a normal (not uid 0) user. This
bit is cleared by exec().
Each of these bits also has a companion *_LOCKED
bit that, if set,
prevents any user program from altering the corresponding setting.
As Morgan notes in the patch, a program that can set its capabilities (has
CAP_SETPCAP) can drop all privileges for itself and any child
process by doing:
This is the equivalent of setting SECURE_NOROOT
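The call itself did not survive in the text above. As an illustration only, here is a sketch of what such a request can look like when written against the SECBIT_* masks found in current kernel headers (<linux/securebits.h>). The helper names are hypothetical, and the particular combination of bits is an assumption based on the three settings and their lock bits described above.

```c
#include <sys/prctl.h>
#include <linux/securebits.h>

/*
 * Hypothetical helper: build a mask that turns off the uid-0 special
 * cases and locks every setting so that it cannot be changed later.
 * SECBIT_KEEP_CAPS itself is left clear; only its lock bit is set.
 */
unsigned long no_root_securebits(void)
{
    return SECBIT_NOROOT | SECBIT_NOROOT_LOCKED |
           SECBIT_NO_SETUID_FIXUP | SECBIT_NO_SETUID_FIXUP_LOCKED |
           SECBIT_KEEP_CAPS_LOCKED;
}

/*
 * Apply the mask to the calling process. Returns 0 on success or -1
 * with errno set; EPERM is expected when the caller lacks CAP_SETPCAP.
 */
int suppress_root_privilege(void)
{
    return prctl(PR_SET_SECUREBITS, no_root_securebits(), 0, 0, 0);
}
```

With the header's bit assignments the mask works out to 0x2f. Because the setting is inherited through fork() and exec(), a daemon that makes this call before starting its children confines the whole process tree.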
The memory of the sendmail-capabilities bug from 2000 makes some
a bit queasy—or worse—about any patches that involve
capabilities and setuid. Andrew
Morton asks: "what was the bug which
caused us to cripple capability inheritance back in the days of yore? (Some
That bug was caused because unprivileged users could take away the
CAP_SETUID capability from setuid binaries like
sendmail. When sendmail then used setuid to drop its privileges,
it failed, but sendmail did not check, so it was still running with full
privilege. This could be leveraged by a user to gain root privileges. It
was a disconnect between capabilities and
the longstanding behavior of Unix-like systems when dropping privileges.
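The lesson of that bug is that a privilege drop must be checked rather than assumed. A minimal sketch of the defensive pattern follows; the helper name drop_privileges() is hypothetical, and real programs would normally also shed supplementary groups and the group id before this step.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: permanently switch to `uid`.
 * Returns 0 only if the switch demonstrably took effect, -1 otherwise.
 */
int drop_privileges(uid_t uid)
{
    /* The check that sendmail did not make. */
    if (setuid(uid) != 0)
        return -1;

    /* Confirm that both the real and effective ids actually changed. */
    if (getuid() != uid || geteuid() != uid)
        return -1;

    /* A process that has dropped to a normal user must not get uid 0 back. */
    if (uid != 0 && setuid(0) == 0)
        return -1;

    return 0;
}
```

A caller should treat a -1 return as fatal and exit instead of continuing to run with whatever privileges it still holds.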
Morgan has written a
description of the sendmail-capabilities bug in response to Morton's
questions. He makes it clear that he wants to move toward full capability
support without breaking existing code:
I'm basically interested in evolving the capability implementation
back to the POSIX.1e model and making it whole - but most certainly
*without crippling legacy superuser support in the process* .
As folk get more comfortable with this full capability model. I
believe we can delete more cruft from the main kernel, but even that
clean up will leave a fully functional legacy model in place. I feel
it should be for something like init, or one of its children to be
able to run subsystems in capability-only or legacy modes.
Morton seemed satisfied that his concerns had been addressed, but still
wonders about the future for capabilities: "So how do we ever get to the stage where we can recommend that distributors
turn these things on, and have them agree with us?" This was echoed by Ismail Dönmez, who was looking
for concrete examples of how to use the per-process securebits feature.
Morgan provides a pointer to some examples along with his belief that
sometime soon the capabilities developers will become confident enough to
recommend turning off the "experimental" flag for the
SECURITY_FILE_CAPABILITIES kernel configuration. That flag
governs both the file-based capabilities and the per-process
securebits. In addition, Morgan says:
More importantly I'm hopeful that in that time we'll have accumulated
enough documentation and user-space experience and examples to convince
others that this is, indeed, a viable feature to support in mainstream
article on file-based capabilities by Serge Hallyn and a web page on POSIX
capabilities by Chris Friedhoff were both mentioned in the thread as
good references for the work being done to actually use capabilities
in systems. Those pre-date the securebits work, so Dönmez was looking
for use-cases for the new feature. Morgan replied that containers were
one, deferring to Hallyn who has some ideas on
We tend to talk about 'system containers' versus 'application
containers'. A system container would be like a vserver or openvz
instance, something which looks like a separate machine. I was
going to say I don't imagine per-process securebits being useful
there, but actually since a system container doesn't need to do any
hardware setup it actually might be a much easier start for a full
SECURE_NOROOT distro than a real machine. Heck, on a real machine init
and a few legacy [daemons] could run in the init namespace, while users
log in and apache etc run in a SECURE_NOROOT container.
But I especially like the thought of for instance postfix running in a
carefully crafted application container (with its own virtual network
card and limited file tree and no visibility of other processes) with
Capabilities are an interesting, but complicated, security feature. For
most of the ten years they have been part of the Linux kernel, they have
either been broken, ignored, or both. With the latest work being done by
Hallyn, Morgan, and others, capabilities are finally becoming a fully-working
alternative to things like SELinux. It will be interesting to see if
more user utilities will become capability-aware and whether distributions
start using capabilities. Some day, root may just fade away.
The kernel developers are generally quite good about responding to security
problems. Once a vulnerability in the kernel has been found, a patch comes
out in short order; system administrators can then apply the patch (or get
a patched kernel from their distributor), reboot the system, and get on
with life knowing that the vulnerability has been fixed. It is a system
which works pretty well.
One little problem remains, though: rebooting the system is a pain. At a
minimum, it requires a few minutes of down time. In many situations, that
down time cannot be tolerated. Reboots also disrupt any ongoing work,
break existing network connections, and can cause the loss of results from
long-running processes. And, most importantly of all, reboots prove
traumatic for a certain subset of Linux administrators who prize a long
uptime above almost all other things. Administrators currently have to
choose between multi-year uptimes and security fixes; anything which frees
them from a dilemma of this magnitude can only be welcome.
That "anything" might just be a recently-announced project called ksplice. With ksplice, system
administrators can have the best of both worlds: security fixes without
An in-depth explanation of how ksplice works can be found in this document [PDF].
In short, ksplice requires as input the source tree for the running kernel
and the security patch. It will then build two kernels, one with the patch
and one without; the kernels are built with a special set of options which
makes it easy to figure out which functions change as a result of the
patch. The two kernels will be compared, with the purpose of finding those
functions. Changes can propagate further than one might expect, especially
if, for example, an inline function is modified.
Once a list of changed functions has been made, the updated code for those
functions is packaged into a kernel module and loaded
into the system. Then comes the tricky part: getting the
running kernel to start using the new code. That requires patching the
running code, which is a risky thing to do. Ksplice starts with a call to
stop_machine_run(), which dumps a high-priority thread onto each
processor, thus taking control of all processors in the system. It then
examines all threads in the system to ensure that none of them are running
in the functions to be replaced; if none are, trampoline jumps are patched into
the beginning of each replaced function (they "bounce" the call to the old
code into the replacement code) and life continues. Otherwise
ksplice will back off and try again later.
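The two steps described above (check that no thread is inside a function to be replaced, then redirect the old entry point) can be modeled in a few lines of user-space C. This is a simplified sketch and not ksplice code: the names func_range, safe_to_patch() and encode_jmp_rel32() are hypothetical, and the use of the 5-byte x86 "jmp rel32" instruction as the trampoline is an assumption about the encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Address range [start, end) of one function slated for replacement. */
struct func_range {
    uintptr_t start;
    uintptr_t end;
};

/*
 * Returns 1 if none of the code addresses collected from the threads'
 * stacks lies inside a function to be replaced, 0 otherwise. A caller
 * that gets 0 would back off and try again later.
 */
int safe_to_patch(const struct func_range *funcs, size_t nfuncs,
                  const uintptr_t *addrs, size_t naddrs)
{
    for (size_t i = 0; i < naddrs; i++)
        for (size_t j = 0; j < nfuncs; j++)
            if (addrs[i] >= funcs[j].start && addrs[i] < funcs[j].end)
                return 0;
    return 1;
}

/*
 * Encode "jmp rel32" (opcode 0xE9) placed at address `from` and landing
 * at `to`. The displacement is relative to the end of the 5-byte
 * instruction. Returns 0 on success, -1 if the target is out of range.
 */
int encode_jmp_rel32(uint8_t out[5], uintptr_t from, uintptr_t to)
{
    int64_t disp = (int64_t)to - ((int64_t)from + 5);
    uint32_t d;

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    d = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    out[1] = (uint8_t)(d & 0xff);          /* little-endian displacement */
    out[2] = (uint8_t)((d >> 8) & 0xff);
    out[3] = (uint8_t)((d >> 16) & 0xff);
    out[4] = (uint8_t)((d >> 24) & 0xff);
    return 0;
}
```

In the real system both steps run while every processor is held by stop_machine_run(), so no thread can enter an old function between the check and the write.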
This method imposes a number of limitations. One is that only code changes
can be patched in with ksplice; patches which make changes to data
structures cannot be accommodated. Another comes from the retry-based
approach to ensuring that no threads are running in the patched functions;
what happens if one of those functions is never free? Kernel functions
like schedule(), sys_poll(), or sys_waitid() are
likely to always have processes running within them. In cases like this,
ksplice will eventually give up and inform the user that the patch cannot
be done; it is simply not possible to make changes to those particular
These limitations mean that, out of 50 security patches examined by the
ksplice developers, eight could not be applied with ksplice. So multi-year
uptimes are probably still incompatible with the application of all
security patches. Even so, ksplice certainly has the potential to reduce
patch-related downtime considerably. Chances are good that there will be a
fair amount of interest in ksplice in sites running high-uptime,
There are a few things in the way of an immediate merge of this code into the
mainline. One is a matter of coding quality and can be fixed. Then, there
is the matter of the lead developer being
unconvinced that merging this code makes sense since it is,
essentially, a standalone feature. Andi Kleen's response made the (usual) reasons for merging
the code clear:
To be honest you weren't the first to come up with something like
this (although you're the first to post to l-k as far as I
know). But the usual problem of something that is kept out of tree
is that it eventually bitrots and gets forgotten. The only sane way
to make such extensions a generically usable linux feature is to
merge them to mainline.
So, presumably, the code will eventually be proposed for a mainline merge.
But there is one other little difficulty pointed out by Tomasz Chmielewski:
Microsoft holds a
patent described this way:
A system and method for automatically updating software components
on a running computer system without requiring any interruption of
service. A software module is hotpatched by loading a patch into
memory and modifying an instruction in the original module to jump
to the patch.
Microsoft came up with this novel new technique in the distant past: 2002.
The posting immediately brought out a crowd of surprised graybeards who
distinctly remember using such techniques on their PDP-11 systems some
decades before Microsoft "invented" hot-patching. The basic claim of the
patent would thus appear to be invalidated by some decades' worth of prior
art, but some of the dependent claims include features (such as capturing
all other processors on the system) which were unlikely to be useful on
Given that the kernel developers are now well aware of this
patent, they must take it into account when deciding whether to accept this
code into the mainline. It would not be surprising if they chose to avoid
baiting the Microsoft FUD machine in this way, even if they all agreed that
the patent lacked validity. So a promising technology risks being left out
of the kernel as the result of a software patent which was filed at least
30 years too late.
Page editor: Jonathan Corbet