The current 2.6 kernel is 2.6.2, which was announced by Linus on February 3. Very
few changes have been made since the last release candidate. For those of
you just tuning in, the major changes since 2.6.1 include a bunch of block
device hotplug work, many big driver updates, sysfs support for many new
types of devices, a big XFS update, some sleep_on() removal work,
and lots of fixes; see the long-format changelog for the details.
Linus's BitKeeper tree contains, as of this writing, a fair number of
patches. One of them is a VFS fix by Stephen Tweedie which
addresses a problem (triggered, but not caused, by SELinux) that delayed
the first Fedora Core 2 test release. Other patches which have been
merged include some architecture updates, some dead code removal, a RAID
update, the deprecation of the raw block device driver, the
readX_relaxed() functions for reading from PCI space without
ordering requirements, a large set of gcc-3.5 fixes, some network driver
updates, and various other fixes.
The current patch set from Andrew Morton is 2.6.2-rc3-mm1. Recent additions to the -mm
tree include the CPU hotplug patch, the "large number of groups" patch, a
new variant on snprintf() (see below), and lots of fixes. Note
that the large groups patch breaks the InterMezzo filesystem, which appears
to be unmaintained under 2.6 for now.
The current 2.4 kernel is 2.4.24. Marcelo released 2.4.25-pre8 on January 29; it contains a fair
amount of new stuff: a big USB update (including the new gadget code), CIFS
work from 2.6, some SCSI driver updates, various architecture updates, and
more. This is, says Marcelo, probably the last prepatch before the 2.4.25
release.
Kernel development news
The better part of a year ago, your editor replaced his ancient Sony Vaio
laptop with a new Vaio laptop. The new machine is quite nice in many ways,
but it came with an interesting surprise: the old BIOS-based
suspend-to-disk functionality was no more. In the modern world, suspending
the system is supposed to be done by the operating system, not by the
hardware; that's what we call "progress."
Ever since getting the new laptop, your editor has been interested in the
software suspend patch, which promises to restore that missing
functionality. Versions of that patch have been working reasonably well
for a while, but software suspend work has not stood still. The announcement of the software suspend 2.0
patch was thus of interest.
The new patch brings with it a number of improvements. Software
suspend now works on systems with high memory (up to 4GB, which will be
sufficient for most laptops for a little while yet), SMP systems (2.4
only), and preemptive kernels. Suspend-to-disk will now work with swap
files, not just dedicated partitions. Compression of the saved image is
supported, which can lead to faster suspends and resumes on some systems.
And, of course, there is a nicer, splash-screen enabled user interface.
The fact remains, however, that software suspend is a hard problem, and the
Linux version still has some ground to cover before it is truly ready for
general use. Your editor had no end of trouble getting the 2.0 patch to
work until the software suspend hackers pointed out the USB code which had
been built into the kernel. USB and power management do not yet play very
well together, it seems. The only way to make the 2.0 patch work reliably
on systems with USB is to compile all of the USB code in modular form so
that it may be removed from the kernel prior to suspending. There are also
issues with AGP video, SMP under 2.6, and various other parts of the
system. Software suspend can be made to work well, but you have to be
prepared to dig into the kernel a bit to get there.
It is encouraging to see how quickly this work is proceeding, however. A
stable, safe, reliable software suspend functionality later in the 2.6
series could well come about. (If you are interested in how software
suspend works, see the May 1, 2003 LWN Kernel Page.)
Device driver authors sometimes find that they have to perform DMA operations
on very small pieces of memory. It is tempting to just perform this sort
of DMA (often just a few bytes) directly into or out of a kernel data
structure. The problem with this approach is that caching issues can
arise; memory adjacent to the region being read or written by the device
can end up with the wrong values. Needless to say, this sort of memory
corruption is not good for long-term system stability.
This problem can be avoided through the use of "PCI pools." A PCI pool is
simply a source of small pieces of memory which are suitable for DMA
operations. A driver which makes use of a PCI pool for its small DMA
needs will not have memory corruption issues.
There is only one problem with PCI pools: not all devices are attached to a
PCI bus. With the intent of making the PCI pool functionality available to
a wider class of devices, Deepak Saxena has posted a set of patches
implementing a new "DMA pool" interface. The
new interface is strikingly similar to the old one - to the point that the
old pci_pool_ functions can be emulated with simple macros. As a
result, drivers using the old PCI functions will continue to work without
changes. Under the new scheme, DMA
pools are allocated and destroyed with:
struct dma_pool *dma_pool_create(const char *name, struct device *dev,
                                 size_t size, size_t align,
                                 size_t allocation);
void dma_pool_destroy(struct dma_pool *pool);
Parameters for the creation of the pool include its name, the device which
will be using the pool, the size of blocks to be allocated from the pool,
and the required alignment. Optionally, the allocation parameter
can be used to keep pool memory from crossing a specific memory size
boundary; if allocation is 4096, for example, no pool allocation will cross a 4K
boundary.
The main difference
from the old pci_pool_create() function is the use of a
device structure rather than a pci_dev structure.
The allocation and deallocation functions are:
void *dma_pool_alloc(struct dma_pool *pool, int mem_flags,
                     dma_addr_t *handle);
void dma_pool_free(struct dma_pool *pool, void *vaddr,
                   dma_addr_t addr);
Internally, the new pool functions bear a strong resemblance to the old
ones - with the obvious exception that the memory for the pools is now
allocated using the generic DMA functions.
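As a rough usage sketch, a driver might use the interface along these lines; the device setup, the mydev_desc descriptor structure, and the function names here are hypothetical illustrations, not code from the patch:

```c
#include <linux/device.h>
#include <linux/dmapool.h>

struct mydev_desc {             /* hypothetical small DMA descriptor */
	u32 addr;
	u32 len;
};

static struct dma_pool *desc_pool;

static int mydev_setup(struct device *dev)
{
	/* Descriptor-sized blocks, 16-byte aligned; an allocation
	   parameter of 0 imposes no boundary restriction. */
	desc_pool = dma_pool_create("mydev-desc", dev,
				    sizeof(struct mydev_desc), 16, 0);
	if (!desc_pool)
		return -ENOMEM;
	return 0;
}

static void mydev_do_dma(void)
{
	dma_addr_t handle;
	struct mydev_desc *desc;

	desc = dma_pool_alloc(desc_pool, GFP_KERNEL, &handle);
	if (!desc)
		return;
	/* ... hand "handle" to the hardware, wait for completion ... */
	dma_pool_free(desc_pool, desc, handle);
}

static void mydev_teardown(void)
{
	dma_pool_destroy(desc_pool);
}
```

Since the allocated memory comes from a pool set aside for DMA, the driver no longer risks corrupting data adjacent to its own structures.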
This patch has been received well; chances are it will appear in a kernel
sometime after 2.6.2 comes out.
Any C coder worth his or her salt knows that encoding text into a
fixed-size string with sprintf() invites buffer overflows, and is thus dangerous.
The proper way of doing things is with snprintf(), which takes the
length of the destination buffer as a parameter and will not overrun it.
Callers to snprintf()
generally assume that the return value is
the length of what was actually encoded into the destination array. That
turns out, however, not to be the case. As per the C99 standard,
snprintf() returns the length the resulting string would
be, assuming it all fit into the destination array. As a result of this
misunderstanding, the kernel is full of snprintf()
calls which use
the return value incorrectly.
This mistake is rarely a problem; snprintf() almost never has to
truncate its output, so the return value is what the programmer is
expecting. Every miscoded use is an invitation for trouble, however, and
really should be fixed. To that end, the 2.6.2-rc3-mm1 tree contains a patch by Juergen
Quade which adds a couple of new functions:
int scnprintf(char *buf, size_t size, const char *format, ...);
int vscnprintf(char *buf, size_t size, const char *format, va_list args);
The new functions work the way many programmers expected the old ones to:
they return the length of the string actually created in buf. The
plan is to migrate the kernel over to the new functions; the patch fixes
well over 200 snprintf() and vsnprintf() calls. Unless
the old functions are eventually removed, however, they are likely to be a
source of programming errors well into the future.
The sysfs virtual filesystem is one of the many additions to the 2.6
kernel. sysfs is the user-space presentation of the kernel's device model;
it is used by the udev
utility to create device nodes for hardware
and, eventually, for numerous other purposes. There is a lot of information
about the system available under sysfs; it may, eventually, replace many of
the files currently found under /proc.
There is one little problem with sysfs, however. It is built as a simple
kernel filesystem using the VFS cache as its backing store. This is an
easy way to build a kernel filesystem, since the generic VFS code does most
of the hard work for you. It does, however, require the kernel to maintain
a directory entry ("dentry") cache entry and an inode in memory for every
file and directory in
the filesystem. As sysfs has grown, the amount of memory it dedicates to
dentries and inodes has grown as well. Even a small system can have
several hundred files in /sys; that number can grow impressively
for larger systems. The memory that all those sysfs nodes occupy can be
painful for very small systems (which do not have much memory to spare) and
for very large systems (because sysfs lives in low memory, which is at a
premium on such machines).
In order to deal with this problem, Maneesh Soni has been working on a set
of patches which provides a true backing store for sysfs. These patches
(the full set can be found in the "patches and updates" section, below)
retain the current VFS-level cache for directories; doing otherwise turns
out to open a fairly large can of worms in how the device model and the VFS
interact. All of the attribute files (which make up 70% or so of sysfs
entries), however, can be more compactly represented by the sysfs code
itself. All that is really needed for an attribute, after all, is its name
and pointers to the "show" and "store" functions.
To this end, the patches create a new sysfs_dirent structure which
describes a node in the sysfs hierarchy. These structures implement an
in-core representation of the sysfs tree that takes up far less space than
the full VFS-cached version. When user space accesses a specific attribute
node, it is a fairly straightforward matter to create the inode and dentry
structures on the spot. Neither structure need be pinned into memory, so
they can be aged out with the rest of the VFS cache.
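The idea can be sketched roughly as follows; the field names below are invented for illustration and do not necessarily match those in Maneesh's patch:

```c
/* Rough sketch only; the actual layout in the patch may differ. */
struct sysfs_dirent {
	struct list_head	s_sibling;	/* siblings under the parent */
	struct list_head	s_children;	/* children, for directories */
	void			*s_element;	/* the attribute (name, show(),
						   store()) this node describes */
	int			s_type;		/* directory, attribute, link... */
	struct dentry		*s_dentry;	/* VFS dentry, if currently cached */
};
```

A dentry and inode are built from this information only when a file is actually opened, and can be reclaimed afterward like any other entry in the VFS cache.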
The result of all this work, Maneesh claims,
is a savings of 145MB of low memory on his (massive) test system. The
number of active dentries in this system drops from over 60,000 to under 9,000.
Unlike early versions of this patch, the current effort also avoids making
changes to the kobject structure, so no penalty is paid for
structures using kobjects which do not appear in sysfs. As the patch has
evolved, the number of criticisms has gone down; sysfs backing store
appears to be getting closer to ready for inclusion.
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet