The current 2.6 development kernel is 2.6.27-rc3, released on August 12.
Along with the expected pile of fixes, this release includes a bunch of big
kernel lock pushdown work in the watchdog subsystem, an SMSC SCH5027 i2c
driver, an Analog Devices AD7414 temperature monitoring chip driver, and
the new ath9k driver (for Atheros 802.11n devices) contributed by Atheros.
See the short-form changelog
for lots of details.
As of this writing, no changes have been committed to the mainline
repository since the 2.6.27-rc3 release.
No stable kernel updates have been made over the last week.
Comments (none posted)
Kernel development news
Now computer security is a bit different because it has some night
of the living dead type properties where the zombies don't just
sneak in through the toilet window but they go around turning
security guards into zombies too, but the basic premise is very much
the same.
-- Alan Cox
So after about a week of trying to squeeze information out of
anti-malware companies I'm starting to feel like I can better speak
for their needs (although they probably don't like what I have to
say). I would like to point out that many enterprises are going to
run this stuff on their machines. Period. End of story.
Personally I'd rather support a clean interface than have to try to
support support problems my customers have when their hacked
fragile systems have trouble.
-- Eric Paris gives TALPA a threat model
Comments (6 posted)
The Linux Foundation has sent out an announcement of the availability of
How to participate in the Linux community, an extended guide written by LWN
editor Jonathan Corbet. "'The Linux Foundation hears from developers
all over the world who want to participate in the kernel community but
sometimes struggle with exactly how,' said Amanda McPherson, vice
president, marketing and developer programs. 'This new guide will make that
process easier and bring new companies and developers into the Linux
community.'"
Comments (none posted)
The Association for Computing Machinery (ACM) has released a special topics issue of Operating Systems Review
that covers the Linux kernel. The issue
has papers on various topics of interest to kernel hackers and watchers.
"Included are 12 papers about the advances that have been merged or are
candidates to be merged into the Linux kernel, as well as new idea
papers discussing promising experimental work.
" Click below for more information including a table of contents.
Full Story (comments: 1)
Your editor, who has carefully hidden several years of experience in
Fortran-based scientific programming from this readership, encountered
checkpoint and restart facilities a long time ago. In those days, programs
which would run for days of hard-won CPU time on an unimaginably fast CDC
or Cray mainframe would occasionally checkpoint themselves, minimizing the
amount of compute time lost when (not if) the system went down at an
inopportune time. It was a sort of insurance policy, with the premiums
being paid in the form of regular checkpoint calls.
Central processor time is no longer in such short supply, but there is
still interest in the ability to checkpoint a running application and
restore its state at some future time. One obvious application of this
capability is to restore the application on a different machine; in this
way, running applications can be moved from one host to another. If the
"application" is an entire container full of tasks, you now have the
ability to shift those containers around without the contained tasks even
being aware of what is going on. That, in turn, can provide for load
balancing, or just the ability to move containers off a machine which is
being taken down.
Linux does not have this capability now. Anybody who thinks about adding
it must certainly find the prospect daunting; applications have a
lot of state hidden throughout the system. This state includes open
files (and positions within the files), network sockets and pipes connected
to remote peers, signal states, outstanding timers, special-purpose file
descriptors (for epoll_wait(), for example), ptrace()
status, CPU affinities, SYSV semaphores, futexes, SELinux state, and much
more. Failure to save and properly restore all of that state will result in a
broken process. It is no wonder that Linux does not do checkpoint and
restart; most rational developers would be driven away by the complexities
involved in making it work in an even remotely robust manner.
But, then, there was a time when rational programmers would not have
attempted the creation of Linux in the first place. So it should not be
surprising to see that developers are working on the checkpoint and restart
problem. The latest attempt can be seen in this patch set posted by Dave
Hansen (but originally written by Oren Laadan). It is far from being ready
for prime-time use, but it does show the sort of approach which is being
taken.
For some time, the prevailing wisdom was that checkpoint and restart should
be pushed as much into user space as possible. A user-space process could
handle the marshaling of process state and writing it to a file; the
kernel would only get involved when it was strictly necessary. It turns
out, though, that this involvement is required fairly often, requiring the
addition of "lots of new, little kernel interfaces" to make everything
work. So, at a meeting at OLS, the checkpoint/restart developers decided
to take a different approach and move the work into the kernel. The result
is the creation of just two new system calls:
    int checkpoint(pid_t pid, int fd, unsigned long flags);
    int restart(int crid, int fd, unsigned long flags);
A call to checkpoint() will write an image of the current process
to the given fd. The pid argument identifies the init
process for the current process's container; it is saved to the image but
not otherwise used in the current patch. If the operation succeeds, the
return value will be a unique (until the system reboots) "checkpoint image
identifier."
restart() reverses the process; crid is the image
identifier, which is not currently used. The flags argument is
currently unused in both system calls.
These interfaces seem likely to change; future enhancements
may well add capabilities like checkpointing other
processes and groups of processes.
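The kernel has to marshal state which only it can see, but the shape of the
interface - serialize everything to a file descriptor, hand back an
identifier, rebuild from the descriptor later - is easy to model. Here is a
toy Python sketch of that model; the function names echo the proposed system
calls, but everything else about it (pickle as the image format, a simple
counter as the image identifier) is purely illustrative:

```python
import pickle

_next_crid = 0   # image identifiers; unique until "reboot" (process exit)

def checkpoint(state, fd):
    """Write an image of `state` to the open file `fd` and return a
    unique checkpoint image identifier, mimicking checkpoint(2)."""
    global _next_crid
    pickle.dump(state, fd)
    _next_crid += 1
    return _next_crid

def restart(fd):
    """Read an image back from `fd` and reconstruct the saved state,
    mimicking restart(2)."""
    return pickle.load(fd)
```

A round trip then looks like checkpoint(state, fd) followed, possibly much
later and on another machine, by restart(fd) against the same image.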
The CAP_SYS_ADMIN capability is currently required for both
checkpoint() and restart(). That is somewhat
unfortunate, in that it would be nice if ordinary, unprivileged processes
were able to checkpoint and restart themselves. There are some real
security implications which must be kept in mind, though, especially when
one considers the sort of damage that could result from an attempt to
restart a carefully-manipulated checkpoint image. Making
restart() secure for unprivileged use will not be a job for the
faint of heart.
At this stage of development, the patch does not even attempt to solve the
entire problem. It is able to save the current state of virtual memory
(but only in the absence of non-private, shared mappings), current
processor state, and the contents of the task structure. That is enough to
checkpoint and restart a "hello, world" program, but not a whole lot more.
But that is a reasonable place to start. Given the complexity of the
problem, proceeding in careful baby steps seems like the right way to go.
So we're probably not going to have a working checkpoint facility in the
kernel in the near future, but, with luck and patience, we'll eventually
have something that works.
Comments (16 posted)
Solid-state, flash-based storage devices are getting larger and cheaper, to
the point that they are starting to displace rotating disks in an
increasing number of systems. While flash requires less power, makes less
noise, and is faster (for random reads, at least), it has some peculiar
quirks of its own. One of those is the need for wear leveling - trying to
keep the number of erase/write cycles on each block about the same to avoid
wearing out the device prematurely.
Wear leveling forces the creation of an indirection layer mapping logical
block numbers (as seen by the computer) to physical blocks on the media.
Sometimes this mapping is done in a translation layer within the flash
device itself; it can also be done within the kernel (in the UBI layer, for example) if the
kernel has direct access to the flash array. Either way, this remapping
comes into play anytime a block is written to the device; when that
happens, a new block is chosen from a list of free blocks and the data is
written there. The block which previously contained the data is then added
to the free list.
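That write-remapping cycle can be captured in a few lines; this Python model
is purely illustrative (a real translation layer works in erase blocks and
keeps its mapping on the media), but it shows how every rewrite of a logical
block consumes a block from the free list and recycles the old one:

```python
class ToyTranslationLayer:
    """Illustrative logical-to-physical block remapping with wear leveling."""

    def __init__(self, nblocks):
        self.free = list(range(nblocks))  # physical blocks available for use
        self.map = {}                     # logical block -> physical block
        self.media = {}                   # physical block -> stored data
        self.writes = [0] * nblocks       # per-block write count (wear proxy)

    def write(self, logical, data):
        phys = self.free.pop(0)           # always take a fresh physical block
        self.media[phys] = data
        self.writes[phys] += 1
        old = self.map.get(logical)
        self.map[logical] = phys
        if old is not None:
            self.free.append(old)         # the stale copy rejoins the free list

    def read(self, logical):
        return self.media[self.map[logical]]
```

Rewriting one logical block repeatedly cycles through different physical
blocks - exactly the behavior that makes a short free list (and stale data
the device cannot know is garbage) a problem.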
If the device fills up with data, that list of free blocks can get quite
short, making it difficult to deal with writes and compromising the wear
leveling algorithm. This problem is compounded by the fact that the
low-level device does not really know which blocks contain useful data.
You may have deleted the several hundred pieces of spam backscatter from
your mailbox this morning, but the flash mapping layer has no way of
knowing that, so it carefully preserves that data while scrambling for free
blocks to accommodate today's backscatter. It would be nice if the
filesystem layer, which knows when the contents of files are no longer
wanted, could communicate this information to the storage layer.
At the lower levels, groups like the T13
committee (which manages the ATA standards) have created protocol
extensions to allow the host computer to indicate that certain sectors are
no longer in use; T13 calls its new command "trim." Upon receipt of a trim
command, an ATA device can immediately add the indicated sectors to its
free list, discarding any data stored there. Filesystems, in turn, can
cause these commands to be issued whenever a file is deleted (or
truncated). That will allow the storage device to make full use of the
space which is truly free, making the whole thing work better.
What Linux lacks now, though, is the ability for filesystems to tell
low-level block drivers about unneeded sectors. David Woodhouse has posted
a proposal to fill that gap in the form of the discard requests patch set. As
one might expect, the patches are relatively simple - there's not much to
communicate - though some subtleties remain.
At the block layer, there is a new request function which can be called by
filesystems:

    int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                             unsigned nr_sects, bio_end_io_t end_io);
This call will enqueue a request to bdev, saying that
nr_sects sectors starting at the given sector are no
longer needed and can be discarded. If the low-level block driver is
unable to handle discard requests, -EOPNOTSUPP will be returned.
Otherwise, the request goes onto the queue, and the end_io()
function will be called when the discard request completes. Most of the
time, though, the filesystem will not really care about completion - it's
just passing advice to the driver, after all - so end_io() can be
NULL and the right thing will happen.
At the driver level, a new function to set up discard requests must be
provided:

    typedef int (prepare_discard_fn) (struct request_queue *queue,
                                      struct request *req);

    void blk_queue_set_discard(struct request_queue *queue,
                               prepare_discard_fn *dfn);
To support discard requests, the driver should use
blk_queue_set_discard() to register its
prepare_discard_fn(). That function, in turn, will be called
whenever a discard request is enqueued; it should do whatever setup work is
needed to execute this request when it gets to the head of the queue.
Since discard requests go through the queue with all other block requests,
they can be manipulated by the I/O scheduler code. In particular, they can
be merged, reducing the total number of requests and, perhaps, pulling
together enough sectors to free a full erase block. There is a danger
here, though: the filesystem may well discard a set of sectors, then write
new data to them once they are allocated to a new file. It would be a
serious mistake to reorder the new writes ahead of the discard operation,
causing the newly-written data to be lost. So discard operations will need
to function as a sort of I/O barrier, preventing the reordering of writes
before and after the discard. There may be an option to drop the barrier
behavior, though, for filesystems which are able to perform their own
ordering.
Outside of filesystems, there may occasionally be a need for other programs
to be able to issue discard requests; David's example is mkfs,
which could discard the entire contents of the device before making a new
filesystem. For these applications, there is a new ioctl() call
(BLKDISCARD) which creates a discard request. Needless to say,
applications using this feature should be rare and very carefully written.
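Assuming the ioctl() takes a range described as a pair of 64-bit values (the
starting byte offset and the length), a caller might look like this Python
sketch. The BLKDISCARD request number (_IO(0x12, 119)) is hardcoded here as
an assumption, since Python cannot read it from the kernel headers; on
anything but a discard-capable block device the call simply fails:

```python
import fcntl
import struct

# Assumed value of BLKDISCARD, i.e. _IO(0x12, 119) from the patched
# <linux/fs.h>; treat this as illustrative rather than authoritative.
BLKDISCARD = 0x1277

def discard_range(fd, start, length):
    """Ask the device behind fd to discard `length` bytes at byte offset
    `start`.  Returns True on success, False if the device (or kernel)
    does not support discard requests."""
    arg = struct.pack("QQ", start, length)  # two 64-bit values
    try:
        fcntl.ioctl(fd, BLKDISCARD, arg)
        return True
    except OSError:
        return False
```

A mkfs-style tool would call discard_range(fd, 0, device_size) once, right
before laying down the new filesystem.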
David's patch includes tweaks for a number of filesystems, enabling them to
issue discard requests when appropriate. Some of the low-level flash
drivers have been updated as well. What's missing at this point is a fix
to the generic ATA driver; this will be needed to make discard requests
work with flash devices using built-in translation layers - which is most
of the devices on the market, currently. That should be a relatively small
piece of the puzzle, though; chances are good that this patch set will be
in shape for inclusion into 2.6.28.
Comments (25 posted)
Once upon a time, a Linux distribution would be installed with a
/dev directory fully populated with device files. Most of them
represented hardware which would never be present on the installed system,
but they needed to be there just in case. Toward the end of this era, it
was not uncommon to find systems with around 20,000 special files in
/dev, and the number continued to grow. This scheme was unwieldy
at best, and the growing number of hotpluggable devices (and devices in
general) threatened to make the whole structure collapse under its own
weight. Something, clearly, needed to be done.
For a little while, it seemed like that something might be devfs, but that
story did not end well. The
real solution to the /dev mess turned
out to be a tool called "udev," originally written by Greg Kroah-Hartman.
Udev would respond to device addition and removal events from the kernel,
creating and removing special files in /dev. Over time, udev
gained more powerful features, such as the ability to run external programs
which would help to create persistent names for transient devices. Udev is
now a key component in almost all Linux systems. It's like the plumbing in
a house; most people never notice it until it breaks. Then they realize
how important a component it really is.
Udev is configured via a set of rules, found under
/etc/udev/rules.d on most systems. These rules specify how
devices should be named, what their ownership and permissions should be,
which kernel modules should be loaded, which programs should be run, and so
on. The udev rule set also allows distributors and system administrators
to tweak the system's device-related behavior to match local needs and
preferences.
Or maybe not. Udev maintainer Kay Sievers has recently let it be known that he would like all
distributors to be using the set of udev rules shipped with the program
itself. Says Kay:
We should all unify as far as possible. Red Hat, SUSE and Gentoo
are already using the same rules files, with a minimal rules set
on top, in a distro specific file. We ask the rest of the universe
to join us, and do the same.
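For readers who have not looked inside rules.d recently, the rules in
question look like the following; this is a sketch of the sort of minimal,
distribution-specific file Kay describes layering on top of the common set,
and the match values (vendor ID, serial number) are invented for
illustration:

```
# 99-local.rules: hypothetical local additions on top of the common rules

# Hand a particular USB device to the "video" group.
SUBSYSTEM=="usb", ATTR{idVendor}=="1234", ATTR{idProduct}=="5678", \
    GROUP="video", MODE="0660"

# Give a known disk a persistent name, wherever it shows up as sd*.
KERNEL=="sd?", ENV{ID_SERIAL}=="MyDisk_ABC123", SYMLINK+="backupdisk"
```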
This request was surprising to some. A Linux system is full of utilities
with configuration files under /etc; there is not normally a push
for all distributions to use the same ones. So why should all distributors
use the same udev rules? The reasoning here would
appear to come down to these points:
- The udev rules files are not really configuration files - they are,
instead, code written in a domain-specific language. For a
distributor to change those files is akin to patching the underlying C
code; far from unheard of, but generally seen as being undesirable.
As a way of underscoring this point, the udev developers are moving
the udev rules out of /etc and into /lib.
- There is little reason for distributors to differentiate themselves
based on their device naming schemes, and every reason to have all
Linux systems use the same device names. For the situations where
reasonable distributions may still differ - which group should own a
device, for example - there is a mechanism to add distributor-specific
rules.
- Increasingly, other packages will depend on a specific udev setup for
the underlying system. Distributors which use their own rules will
have a harder time making these new tools work right.
That last point refers, in particular, to DeviceKit, a
set of tools designed to make the management of devices easier. Between
them, udev and DeviceKit are being positioned to replace most of the
functionality in the much-maligned hal utility. See this
posting from David Zeuthen for lots more information on DeviceKit and
the migration away from hal in general.
The only problem is that some distributors aren't playing along. Marco
d'Itri, the Debian udev maintainer, responded that a common set of udev rules is
"not going to happen." The default rules, he says, do not meet Debian's
need to support older kernels, and, besides, "I consider my rules
much more readable and elegant than yours". Ubuntu maintainer Scott
James Remnant is also reluctant to use the default rules.
Scott appears to be willing to consider a change to the default rules if it
can be made to work right; Marco, instead, seems determined to hold out.
When encouraged to send patches to improve the default rules (and make them
more elegant), he responded:
Tell me what's missing from my rules instead, I will fix it and
then you will be able to use them. If nothing is missing, then you
can replace the files right now.
It appears likely that most of the distributors will come to see the udev
rules as code which is to be maintained upstream; even Debian may come
along eventually. As this happens, the layer of "plumbing" which sits just
on top of the kernel should be worked into better shape. Kernel developers
may find themselves involved in this process; David has posted a proposal that all new kernel subsystems,
before being merged, must be provided with a set of udev rules. That would
help the udev developers get a set of default rules into shape before the
distributors feel the need to step in to make things work.
Increasingly, the operation of the kernel is being tied to a set of
low-level user-space applications; there is not much which can be done with
a bare kernel. How all of this low-level plumbing should work, and how it
should interoperate with the kernel, is still being worked out. The
management of udev
policies is just one of the outstanding issues. So the
upcoming Linux Plumbers
Conference would seem to be well timed; there's a lot to talk about.
Comments (72 posted)
Patches and updates
Core kernel code
- Eduard-Gabriel Munteanu: kmemtrace.
(August 11, 2008)
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet