Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.27-rc3, released on August 12. Along with the expected pile of fixes, this release includes a bunch of big kernel lock pushdown work in the watchdog subsystem, an SMSC SCH5027 i2c driver, an Analog Devices AD7414 temperature monitoring chip driver, and the new ath9k driver (for Atheros 802.11n devices) contributed by Atheros. See the short-form changelog for details, or the full changelog for lots of details.

As of this writing, no changes have been committed to the mainline repository since the 2.6.27-rc3 release.
No stable kernel updates have been made over the last week.
Kernel development news
Quotes of the week
Linux kernel participation guide published by the Linux Foundation
The Linux Foundation has sent out a press release announcing the availability of How to participate in the Linux community, an extended guide written by LWN editor Jonathan Corbet. "'The Linux Foundation hears from developers all over the world who want to participate in the kernel community but sometimes struggle with exactly how,' said Amanda McPherson, vice president, marketing and developer programs. 'This new guide will make that process easier and bring new companies and developers into the Linux fold.'"
ACM Operating Systems Review issue on the Linux Kernel available
The Association for Computing Machinery (ACM) has released a special topics issue of Operating Systems Review that covers the Linux kernel. The issue has papers on various topics of interest to kernel hackers and watchers. "Included are 12 papers about the advances that have been merged or are candidates to be merged into the Linux kernel, as well as new idea papers discussing promising experimental work." Click below for more information including a table of contents.
Kernel-based checkpoint and restart
Your editor, who has carefully hidden several years of experience in Fortran-based scientific programming from this readership, encountered checkpoint and restart facilities a long time ago. In those days, programs which would run for days of hard-won CPU time on an unimaginably fast CDC or Cray mainframe would occasionally checkpoint themselves, minimizing the amount of compute time lost when (not if) the system went down at an inopportune time. It was a sort of insurance policy, with the premiums being paid in the form of regular checkpoint calls.

Central processor time is no longer in such short supply, but there is still interest in the ability to checkpoint a running application and restore its state at some future time. One obvious application of this capability is to restore the application on a different machine; in this way, running applications can be moved from one host to another. If the "application" is an entire container full of tasks, you now have the ability to shift those containers around without the contained tasks even being aware of what is going on. That, in turn, can provide for load balancing, or just the ability to move containers off a machine which is being taken down.
Linux does not have this capability now. Anybody who thinks about adding it must certainly find the prospect daunting; applications have a lot of state hidden throughout the system. This state includes open files (and positions within the files), network sockets and pipes connected to remote peers, signal states, outstanding timers, special-purpose file descriptors (for epoll_wait(), for example), ptrace() status, CPU affinities, SYSV semaphores, futexes, SELinux state, and much more. Any failure to save and properly restore all of that state will result in a broken process. It is no wonder that Linux does not do checkpoint and restart; most rational developers would be driven away by the complexities involved in making it work in an even remotely robust manner.
But, then, there was a time when rational programmers would not have attempted the creation of Linux in the first place. So it should not be surprising to see that developers are working on the checkpoint and restart problem. The latest attempt can be seen in this patch set posted by Dave Hansen (but originally written by Oren Laadan). It is far from being ready for prime-time use, but it does show the sort of approach which is being taken.
For some time, the prevailing wisdom was that checkpoint and restart should be pushed as much into user space as possible. A user-space process could handle the marshaling of process state and writing it to a file; the kernel would only get involved when it was strictly necessary. It turns out, though, that this involvement is required fairly often, requiring the addition of "lots of new, little kernel interfaces" to make everything work. So, at a meeting at OLS, the checkpoint/restart developers decided to take a different approach and move the work into the kernel. The result is the creation of just two new system calls:
int checkpoint(pid_t pid, int fd, unsigned long flags);
int restart(int crid, int fd, unsigned long flags);
A call to checkpoint() will write an image of the current process to the given fd. The pid argument identifies the init process for the current process's container; it is saved to the image but not otherwise used in the current patch. If the operation succeeds, the return value will be a unique (until the system reboots) "checkpoint image identifier". restart() reverses the process; crid is the image identifier, which is not currently used. The flags argument is currently unused in both system calls. These interfaces seem certain to evolve; future enhancements are likely to include the ability to checkpoint other processes and whole groups of processes.
The CAP_SYS_ADMIN capability is currently required for both checkpoint() and restart(). That is somewhat unfortunate, in that it would be nice if ordinary, unprivileged processes were able to checkpoint and restart themselves. There are some real security implications which must be kept in mind, though, especially when one considers the sort of damage that could result from an attempt to restart a carefully-manipulated checkpoint image. Making restart() secure for unprivileged use will not be a job for the faint of heart.
At this stage of development, the patch does not even attempt to solve the entire problem. It is able to save the current state of virtual memory (but only in the absence of non-private, shared mappings), current processor state, and the contents of the task structure. That is enough to checkpoint and restart a "hello, world" program, but not a whole lot more. But that is a reasonable place to start. Given the complexity of the problem, proceeding in careful baby steps seems like the right way to go. So we're probably not going to have a working checkpoint facility in the kernel in the near future, but, with luck and patience, we'll eventually have something that works.
Block layer discard requests
Solid-state, flash-based storage devices are getting larger and cheaper, to the point that they are starting to displace rotating disks in an increasing number of systems. While flash requires less power, makes less noise, and is faster (for random reads, at least), it has some peculiar quirks of its own. One of those is the need for wear leveling - trying to keep the number of erase/write cycles on each block about the same to avoid wearing out the device prematurely.

Wear leveling forces the creation of an indirection layer mapping logical block numbers (as seen by the computer) to physical blocks on the media. Sometimes this mapping is done in a translation layer within the flash device itself; it can also be done within the kernel (in the UBI layer, for example) if the kernel has direct access to the flash array. Either way, this remapping comes into play anytime a block is written to the device; when that happens, a new block is chosen from a list of free blocks and the data is written there. The block which previously contained the data is then added to the free list.
If the device fills up with data, that list of free blocks can get quite short, making it difficult to deal with writes and compromising the wear leveling algorithm. This problem is compounded by the fact that the low-level device does not really know which blocks contain useful data. You may have deleted the several hundred pieces of spam backscatter from your mailbox this morning, but the flash mapping layer has no way of knowing that, so it carefully preserves that data while scrambling for free blocks to accommodate today's backscatter. It would be nice if the filesystem layer, which knows when the contents of files are no longer wanted, could communicate this information to the storage layer.
At the lower levels, groups like the T13 committee (which manages the ATA standards) have created protocol extensions to allow the host computer to indicate that certain sectors are no longer in use; T13 calls its new command "trim." Upon receipt of a trim command, an ATA device can immediately add the indicated sectors to its free list, discarding any data stored there. Filesystems, in turn, can cause these commands to be issued whenever a file is deleted (or truncated). That will allow the storage device to make full use of the space which is truly free, making the whole thing work better.
What Linux lacks now, though, is the ability for filesystems to tell low-level block drivers about unneeded sectors. David Woodhouse has posted a proposal to fill that gap in the form of the discard requests patch set. As one might expect, the patches are relatively simple - there's not much to communicate - though some subtleties remain.
At the block layer, there is a new request function which can be called by filesystems:
    int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                             unsigned nr_sects, bio_end_io_t end_io);
This call will enqueue a request to bdev, saying that nr_sects sectors starting at the given sector are no longer needed and can be discarded. If the low-level block driver is unable to handle discard requests, -EOPNOTSUPP will be returned. Otherwise, the request goes onto the queue, and the end_io() function will be called when the discard request completes. Most of the time, though, the filesystem will not really care about completion - it's just passing advice to the driver, after all - so end_io() can be NULL and the right thing will happen.
At the driver level, a new function to set up discard requests must be provided:
    typedef int (prepare_discard_fn) (struct request_queue *queue,
                                      struct request *req);

    void blk_queue_set_discard(struct request_queue *queue,
                               prepare_discard_fn *dfn);
To support discard requests, the driver should use blk_queue_set_discard() to register its prepare_discard_fn(). That function, in turn, will be called whenever a discard request is enqueued; it should do whatever setup work is needed to execute this request when it gets to the head of the queue.
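Put together, the driver's side of the interface might look something like the following sketch (the driver names are hypothetical, and this is kernel code which is not meant to build outside a driver tree):

```c
/* Hypothetical driver sketch - not buildable outside a kernel tree. */

static int myflash_prepare_discard(struct request_queue *queue,
                                   struct request *req)
{
        /*
         * Do whatever setup the hardware needs so that, when this
         * request reaches the head of the queue, the driver issues a
         * "trim" for the affected sectors instead of a normal write.
         */
        return 0;
}

static void myflash_setup_queue(struct request_queue *queue)
{
        /* Advertise discard support to the block layer. */
        blk_queue_set_discard(queue, myflash_prepare_discard);
}
```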
Since discard requests go through the queue with all other block requests, they can be manipulated by the I/O scheduler code. In particular, they can be merged, reducing the total number of requests and, perhaps, pulling together enough sectors to free a full erase block. There is a danger here, though: the filesystem may well discard a set of sectors, then write new data to them once they are allocated to a new file. It would be a serious mistake to reorder the new writes ahead of the discard operation, causing the newly-written data to be lost. So discard operations will need to function as a sort of I/O barrier, preventing the reordering of writes before and after the discard. There may be an option to drop the barrier behavior, though, for filesystems which are able to perform their own request ordering.
Outside of filesystems, there may occasionally be a need for other programs to be able to issue discard requests; David's example is mkfs, which could discard the entire contents of the device before making a new filesystem. For these applications, there is a new ioctl() call (BLKDISCARD) which creates a discard request. Needless to say, applications using this feature should be rare and very carefully written.
David's patch includes tweaks for a number of filesystems, enabling them to issue discard requests when appropriate. Some of the low-level flash drivers have been updated as well. What's missing at this point is a fix to the generic ATA driver; this will be needed to make discard requests work with flash devices using built-in translation layers - which is most of the devices on the market, currently. That should be a relatively small piece of the puzzle, though; chances are good that this patch set will be in shape for inclusion into 2.6.28.
Udev rules and the management of the plumbing layer
Once upon a time, a Linux distribution would be installed with a /dev directory fully populated with device files. Most of them represented hardware which would never be present on the installed system, but they needed to be there just in case. Toward the end of this era, it was not uncommon to find systems with around 20,000 special files in /dev, and the number continued to grow. This scheme was unwieldy at best, and the growing number of hotpluggable devices (and devices in general) threatened to make the whole structure collapse under its own weight. Something, clearly, needed to be done.

For a little while, it seemed like that something might be devfs, but that story did not end well. The real solution to the /dev mess turned out to be a tool called "udev," originally written by Greg Kroah-Hartman. Udev would respond to device addition and removal events from the kernel, creating and removing special files in /dev. Over time, udev gained more powerful features, such as the ability to run external programs which would help to create persistent names for transient devices. Udev is now a key component in almost all Linux systems. It's like the plumbing in a house; most people never notice it until it breaks. Then they realize how important a component it really is.
Udev is configured via a set of rules, found under /etc/udev/rules.d on most systems. These rules specify how devices should be named, what their ownership and permissions should be, which kernel modules should be loaded, which programs should be run, and so on. The udev rule set also allows distributors and system administrators to tweak the system's device-related behavior to match local needs and taste.
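A rule of the sort found in those files might look like this (the device match and group choice here are illustrative, not taken from any particular distribution's rule set):

```
# Make USB serial adapters accessible to the "dialout" group
KERNEL=="ttyUSB*", SUBSYSTEM=="tty", GROUP="dialout", MODE="0660"
```

Match keys (with ==) select the devices a rule applies to; assignments (with =) set the name, ownership, permissions, and other attributes of the resulting /dev entry.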
Or maybe not. Udev maintainer Kay Sievers has recently let it be known that he would like all distributors to use the set of udev rules shipped with the program itself.
This request was surprising to some. A Linux system is full of utilities with configuration files under /etc; there is not normally a push for all distributions to use the same ones. So why should all distributors use the same udev rules? The reasoning here would appear to come down to these points:
- The udev rules files are not really configuration files; they are, instead, code written in a domain-specific language. For a distributor to change those files is akin to patching the underlying C code: far from unheard of, but generally seen as being undesirable. As a way of underscoring this point, the udev developers are moving the udev rules out of /etc and into /lib.

- There is little reason for distributors to differentiate themselves based on their device naming schemes, and every reason to have all Linux systems use the same device names. For the situations where reasonable distributions may still differ - which group should own a device, for example - there is a mechanism to add distributor-specific rules.
- Increasingly, other packages will depend on a specific udev setup for the underlying system. Distributors which use their own rules will have a harder time making these new tools work right.
That last point refers, in particular, to DeviceKit, a set of tools designed to make the management of devices easier. Between them, udev and DeviceKit are being positioned to replace most of the functionality in the much-maligned hal utility. See this posting from David Zeuthen for lots more information on DeviceKit and the migration away from hal in general.
The only problem is that some distributors aren't playing along. Marco d'Itri, the Debian udev maintainer, responded that a common set of udev rules is "not going to happen." The default rules, he says, do not meet Debian's need to support older kernels, and, besides, "I consider my rules much more readable and elegant than yours." Ubuntu maintainer Scott James Remnant is also reluctant to use the default rules.
Scott appears to be willing to consider a change to the default rules if it can be made to work right; Marco, instead, seems determined to hold out. When encouraged to send patches to improve the default rules (and make them more elegant), he declined.
It appears likely that most of the distributors will come to see the udev rules as code which is to be maintained upstream; even Debian may come along eventually. As this happens, the layer of "plumbing" which sits just on top of the kernel should be worked into better shape. Kernel developers may find themselves involved in this process; David has posted a proposal that all new kernel subsystems, before being merged, must be provided with a set of udev rules. That would help the udev developers get a set of default rules into shape before the distributors feel the need to step in to make things work.
Increasingly, the operation of the kernel is being tied to a set of low-level user-space applications; there is not much which can be done with a bare kernel. How all of this low-level plumbing should work, and how it should interoperate with the kernel, is still being worked out. The management of udev policies is just one of the outstanding issues. So the upcoming Linux Plumbers Conference would seem to be well timed; there's a lot to talk about.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet