Release status
Kernel release status
The current 2.6 prepatch is 2.6.22-rc1,
released by Linus on
May 12. Features added to 2.6.22 since last week's Kernel Page
include the
eventfd system
calls, a new IEEE 1394 ("Firewire") stack "designed for robustness and
simplicity," drivers for KingSun DS-620 USB-IrDA dongles, Native
Instruments USB audio devices, and WM8753 audio codecs (as found in the
OpenMoko phone), a large set of fixes to the "libertas" wireless driver,
and support for a number of new ARM processors. See
the short-form changelog for the details, or
the
log-format log for vast amounts of detail.
As of this writing, about 100 changesets (almost all fixes) have been added
to the mainline repository since 2.6.22-rc1.
The current -mm tree is 2.6.22-rc1-mm1. Recent changes
to -mm include a number of filesystem writeback fixes, a CRC7
implementation, an improved version of the swap prefetch code, and an early
startup development tree for the i386 architecture.
Comments (6 posted)
Kernel development news
Quote of the week
It's quite amazing really at the same time as people are posting
crypto keys everywhere in defiance of USSA law, we've got free
software people trying to remove references to a piece of out of
tree software, and one that is free software.
--
Alan Cox
Comments (8 posted)
2.6 and the user-space ABI
The 2.6.22-rc1 kernel is out, and the reports of regressions are beginning
to trickle in. A couple of those involve user-space binary interface
changes: one in the video4linux2 interface and one in the i2c code (which
affects hardware monitoring utilities). The V4L2 regression involves a
change made to a structure passed to and from user space; chances are good
that it will be reverted before the final 2.6.22 kernel comes out. For the
time being, the i2c problem is "fixed" by upgrading to version 2.10.3 of
the lm_sensors package.
Linus isn't happy about the forced
lm_sensors upgrade; he has asked for a way to avoid that requirement. In
response, i2c maintainer Jean Delvare raised
some misgivings about the stable ABI policy:
While I'm all for keeping things relatively stable and not asking
the user to constantly upgrade user-space, I believe that we just
can't promise to never break user-level interfaces while keeping
the development pace we have right now. We can promise to grant
people significant delay before we drop compatibility options, but
"forever" doesn't scale.
Those comments notwithstanding, Linux has managed to maintain user-space
ABI compatibility quite nicely for many years. There are certainly
exceptions, but they are few enough and far enough between that each one
stands out. But, as Christoph Hellwig points
out, the situation is not perfect:
Except for very rare case (modules support comes to mine) syscall
compatibility works perfectly. But that's because syscalls are a
very visible ABI and people don't break them by accident. They
also don't decide they have a cool new scheme all syscalls need to
follow now.
Now compare that to sysfs..
The user-space ABI now goes well beyond system calls. The huge sysfs
interface (4800 files on your editor's desktop) is a big piece of user
space's view of the system, and it is a piece which is difficult to avoid
breaking. Directories in sysfs correspond directly to data structures
within the kernel; changes to those structures will often have consequences
in sysfs. So kernel developers may think that they are operating far away
from the user-space interface, but end up breaking it anyway. Netlink,
/proc, and ioctl() also make up part of the ABI, and
they, too, can be easy to break. The V4L2 regression is the result of an
attempt to extend one ioctl() call breaking others.
The new development model can also make it harder to maintain
compatibility. Four or five major releases per year, each with a full load
of new major features, adds up to a lot of code changes. There is also no
clear point where whatever changes do prove to be unavoidable can be
made without surprising users. If the kernel developers were to disappear
for a year or two and return with a 3.0 release, nobody would be surprised
if it required a small amount of adaptation on the user-space side. But a
2.6.22 release - which contains needed fixes and new drivers along with new
features - is not expected to break working systems.
Arguments for returning to the older development model are hard to find,
though. Despite occasional glitches, things are generally working far
better than they did before 2.6 came out. The pace of development is
unlikely to slow. So the problem of occasional ABI regressions is likely
to remain with us. As is often the case, the best way to avoid such
problems - after a high degree of attention by the developers - is
extensive testing. User-space ABI changes caught during the development
cycle will almost certainly not survive into the final release, but it is
hard to fix problems that nobody knows about. As is also often the case,
automating this testing is hard; nobody can put together all of the
hardware and software combinations that the kernel will face. So the
worthy cause of maintaining a stable user-space interface is likely to
require a fair amount of human attention for the foreseeable future.
Comments (7 posted)
LogFS
Rotating magnetic storage technology has served us well for a long time.
It offers high capacities (for an ever-increasing value of "high"),
relatively fast and relatively uniform access times, and relatively good
reliability. It is generally accepted that rotating disks will be part of
our systems for some time yet. For smaller sizes, however, disks are
increasingly being pushed aside by solid state flash memory - and "smaller"
is an ever-increasing value as well. Flash is more compact, requires less
power, and offers truly random access, so it's not surprising to see it
being deployed in more situations.
Flash is not without its drawbacks. Its relatively high cost limits its
applications and it brings its own set of quirks which must be understood
and addressed by filesystem developers. Even so, some special-purpose
laptops rely on flash for their persistent storage needs now, and there are
rumors of more flash-based systems in the near future.
The most significant of the "quirks" mentioned above are:
- Flash storage cannot be simply overwritten like magnetic storage;
instead, a flash block must be explicitly erased and rewritten in two
separate steps. The size of the "erase blocks" may not match the
block size as understood by the operating system; often, the erase
blocks are relatively large.
- There are limits to the number of times a block of flash memory can be
erased and rewritten before it loses the ability to reliably store
data. That limit is generally around 100,000 cycles.
These hardware features have some interesting implications. What, for
example, happens when the operating system decides to rewrite a single
block within a larger flash erase block? A naive implementation would read
the entire erase block, perform the erase operation, then write the data
back with the new portion included. Should the system go down in the
middle of this operation, however, all of the data within the erase block
may be lost forever. If the operating system ignores the block lifetime
issues, it is likely to cycle some erase blocks much more frequently than
others, significantly shortening the overall life of the device. When one
is dealing with a low-duty-cycle device, such as a USB thumb drive, it's
possible to get away with ignoring the limitations that flash has. When a
flash drive is the primary storage device, though, a smarter approach is
called for.
Being smarter is usually a matter of using a filesystem which was
explicitly designed to work well with flash hardware. These filesystems
can dispense with the great care that other filesystems must take in how
blocks are laid out - there are no seek time or rotational latency issues
with flash drives. On the other hand, flash-aware filesystems must be
written with erase cycles in mind; they must not risk losing data during
these cycles and they should endeavor to spread these cycles across the
drive to maximize its lifetime.
The end result is that filesystems designed for flash devices take the
log-structured approach. The device is treated like a sort of circular
buffer, with new data always being written to the end. This approach makes
for fast write operations, but the read side can be a more complex story.
One approach taken is to attach some metadata to each erase block describing
which file that block belongs to and its version number. When an erase
block is to be rewritten, a new copy is made at the end with a higher
version number; reading the file is simply a matter of finding the erase
block with the highest version number.
Finding that block requires scanning the disk - something which, most
likely, one does not want to do for every read operation. The in-kernel
JFFS2 filesystem solves this problem by performing a scan when the
filesystem is mounted. It builds an in-memory data structure which speeds
subsequent accesses considerably. There is a cost, though: the initial
scan can make mounting slow, and the in-memory tree can take a considerable
amount of space. Given that flash filesystems are often used on small,
embedded systems - where both boot time and memory are at a premium - these
costs are significant.
Jörn Engel thinks he has a better way in the form of the LogFS filesystem, currently
proposed for inclusion into the mainline. The core idea behind LogFS is
that, rather than building the filesystem tree at mount time, the
filesystem code should store the tree
on the device itself, much like traditional filesystems do. Putting
the tree on the flash device reduces mount times (Jörn says that an
OLPC system goes from 3.3 seconds under JFFS2 to 60ms under LogFS) and
should reduce the runtime memory requirements considerably.
The on-flash tree looks much like the structure used by ext2. There are
some differences in how it is managed, however. The log structure of the
filesystem implies that blocks cannot be rewritten in place; any time a
block is changed it must be moved and written to a new location. If there
are pointers to the moved block (think about the usual indirect blocks used
to store the layout of larger files), the blocks containing the pointers
must also be changed, and thus moved. That, in turn, will require changes
at the next level up in the tree. Thus changes at the bottom of the tree
will propagate upward all the way to the root. This is the "wandering
tree" algorithm. One of the advantages is that the old filesystem
structure remains valid until the root is rewritten - a crash could cause
the loss of the last operation, but it will leave previous data and the
structure of the filesystem intact.
Actually managing the entire directory tree as a wandering tree would be
expensive; beyond that, files with multiple hard links break the tree
structure and make wandering trees much harder to implement. So the actual
tree implemented by LogFS just has two levels. There is an "inode file"
containing the inode structures for every file and directory existing
within the filesystem; each inode then points to the associated blocks
holding the file's data. Directory entries contain a simple integer index
giving the inode offset within the inode file. So changes to an inode only
require writing the inode itself and the inode file; the rest of the
directory structure need not be touched.
To tie it all together, LogFS sets aside a group of blocks as the "anchor
area," where versioned pointers to the root inode can be found. Mounting
the filesystem requires scanning this anchor area to find the current
version of the root inode, at which point the rest of the filesystem
becomes accessible. This mechanism allows the root to be found in constant
time without the need to scan the entire device.
LogFS has been through a couple rounds of review, with significant changes
each time. Barring significant problems, it should be getting close to
ready, perhaps it will be merged in time for 2.6.23.
(See also: Jörn's
LogFS paper from which much of the above was cribbed).
Comments (3 posted)
Scatterlist chaining
High-performance I/O generally involves the use of direct memory access
(DMA) operations. With DMA, the I/O device transfers data directly to and
from main memory without the intervention of the CPU. In the simplest form
of DMA, the controller is handed a pointer to a region of memory, given the
length, and told to do its thing. The processor can then forget about the
operation until the device signals that the work is done.
This simple view has a drawback, however, in that it assumes that the data
to be transferred is stored contiguously in memory. When the I/O buffer is
in kernel space, the kernel can often arrange for it to be physically
contiguous - though that gets harder as the size of the buffers gets
larger. If the buffer is in user space, it is guaranteed to be scattered
around physical memory. So it would be nice if DMA operations could work
with buffers which are split into a number of distinct pieces.
In fact, with any reasonably capable peripheral device, buffers can be
split this way. The term for operations on such buffers is "scatter/gather
I/O"; scatter/gather has been well supported under Linux for some time.
The DMA chapter of Linux Device Drivers
covers scatter/gather in a fair amount of detail. In short, a driver
starts by filling in an array of scatterlist structures, which (on
the i386 architecture) look like:
struct scatterlist {
struct page *page;
unsigned int offset;
dma_addr_t dma_address;
unsigned int length;
};
For each segment, the page pointer tells where the segment is to be
found in memory, offset tells where the data begins within the
page, and length is the length of the segment. Once the list has
been filled in, the driver calls:
int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction direction);
This operation, at a minimum, fills in the dma_address field of
each scatterlist entry with an address which can be given to the peripheral. It
might do more, though: physically contiguous pages may be coalesced into a
single scatterlist entry, or the system's I/O memory management
unit might be programmed to make parts (or all) of the list virtually
contiguous from the device's point of view. All of this - including the
exact form of struct scatterlist - is architecture dependent, but
the scatter/gather interface is set up so that drivers need not worry about
architecture details.
Recently, a particular constraint in the scatter/gather interface has
turned up. For various reasons, scatterlists must fit within a single
page; that restriction puts an upper limit on the number of segments which
may be represented. On the i386 architecture, with high memory enabled,
struct scatterlist requires 20 bytes, which limits a scatterlist
to 204 entries. If each scatterlist entry points to a full page, the
maximum size for a DMA transfer will be about 800KB. On an x86-64 system,
the situation is worse: the structure uses 40 bytes, cutting the maximum
length in half.
There are situations where larger I/O operations are desirable. The block
I/O subsystem is one of them, but there are certainly others:
high-resolution video capture devices are an example. The limitation on
scatterlist length is one of the factors motivating developers who are
working on large block size
support. By increasing the effective page size, they are able to
increase the maximum I/O operation size.
Increasing the page size is not the only feasible approach, though; another
is simply to make scatterlists longer. Multi-page contiguous scatterlists
are not really in the cards, but chaining single-page scatterlists can be
done. Jens Axboe has been working on that approach; his scatterlist chaining patch is
on its sixth revision as of this writing.
Chaining is done by overloading the page pointer in the last
scatterlist entry in a page. The least significant bit is set to indicate
that the entry is, in fact, a chain link rather than another segment to
transfer. The change is almost transparent to drivers. In current
kernels, the code which iterates through a scatterlist usually looks
something like this:
struct scatterlist *sg = &the_scatterlist[0];
for (i = 0; i < nentries; i++) {
program_io_operation(sg);
sg++;
}
When chaining is being used, simply incrementing through the array no
longer works. So Jens has added a simple sg_next() macro to
follow the the chain links when necessary. So the sg++ line above
turns into something like:
sg = sg_next(sg);
Since a driver change is required, chained scatterlists should not be used
unless one knows for sure that the driver is prepared for them. The patch
from Jens fixes up a number of drivers, especially in the block subsystem.
Even so, the maximum I/O size must be raised explicitly by the
administrator (via a sysfs file) before chaining will be turned on. Once
it's enabled, however, multi-megabyte I/O operations become possible. No
intrusive memory management changes required.
Comments (6 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
- Bartlomiej Zolnierkiewicz: IDE update.
(May 11, 2007)
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>