
Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.2, which was announced by Linus on February 3. Very few changes have been made since the last release candidate. For those of you just tuning in, the major changes since 2.6.1 include a bunch of block device hotplug work, many big driver updates, sysfs support for many new types of devices, a big XFS update, some sleep_on() removal work, and lots of fixes; see the long-format changelog for the details.

Linus's BitKeeper tree contains, as of this writing, a fair number of patches. One of them is a VFS fix by Stephen Tweedie which addresses a problem (triggered, but not caused, by SELinux) that delayed the first Fedora Core 2 test release. Other patches which have been merged include some architecture updates, some dead code removal, a RAID update, the deprecation of the raw block device driver, the readX_relaxed() functions for reading from PCI space without ordering requirements, a large set of gcc-3.5 fixes, some network driver updates, and various other fixes.

The current patch set from Andrew Morton is 2.6.2-rc3-mm1. Recent additions to the -mm tree include the CPU hotplug patch, the "large number of groups" patch, a new variant on snprintf() (see below), and lots of fixes. Note that the large groups patch breaks the InterMezzo filesystem, which appears to be unmaintained under 2.6 for now.

The current 2.4 kernel is 2.4.24. Marcelo released 2.4.25-pre8 on January 29; it contains a fair amount of new stuff: a big USB update (including the new gadget code), CIFS work from 2.6, some SCSI driver updates, various architecture updates, and more. This is, says Marcelo, probably the last prepatch (before the release candidates start).

Comments (1 posted)

Kernel development news

Software Suspend 2.0

The better part of a year ago, your editor replaced his ancient Sony Vaio laptop with a new Vaio laptop. The new machine is quite nice in many ways, but it came with an interesting surprise: the old BIOS-based suspend-to-disk functionality was no more. In the modern world, suspending the system is supposed to be done by the operating system, not by the hardware; that's what we call "progress."

Ever since getting the new laptop, your editor has been interested in the software suspend patch, which promises to restore that missing functionality. Versions of that patch have been working reasonably well for a while, but software suspend work has not stood still. The announcement of the software suspend 2.0 patch was thus of interest.

The new patch brings a number of improvements. Software suspend now works on systems with high memory (up to 4GB, which will be sufficient for most laptops for a little while yet), on SMP systems (2.4 only), and with preemptive kernels. Suspend-to-disk will now work with swap files, not just dedicated partitions. Compression of the saved image is supported, which can lead to faster suspends and resumes on some systems. And, of course, there is a nicer, splash-screen-enabled user interface.

The fact remains, however, that software suspend is a hard problem, and the Linux version still has some ground to cover before it is truly ready for general use. Your editor had no end of trouble getting the 2.0 patch to work until the software suspend hackers pointed out the USB code which had been built into the kernel. USB and power management do not yet play very well together, it seems. The only way to make the 2.0 patch work reliably on systems with USB is to compile all of the USB code in modular form so that it may be removed from the kernel prior to suspending. There are also issues with AGP video, SMP under 2.6, and various other parts of the system. Software suspend can be made to work well, but you have to be prepared to dig into the kernel a bit to get there.

It is encouraging to see how quickly this work is proceeding, however. A stable, safe, and reliable software suspend could well come about later in the 2.6 series. (If you are interested in how software suspend works, see the May 1, 2003 LWN Kernel Page.)

Comments (1 posted)

Generic DMA pools

Device driver authors sometimes find that they have to perform DMA operations on very small pieces of memory. It is tempting to just perform this sort of DMA (often just a few bytes) directly into or out of a kernel data structure. The problem with this approach is that caching issues can arise; memory adjacent to the region being read or written by the device can end up with the wrong values. Needless to say, this sort of memory corruption is not good for long-term system stability.

This problem can be avoided through the use of "PCI pools." A PCI pool is simply a source of small pieces of memory which are suitable for DMA operations. A driver which makes use of a PCI pool for its small DMA needs will not have memory corruption issues.

There is only one problem with PCI pools: not all devices are attached to a PCI bus. With the intent of making the PCI pool functionality available to a wider class of devices, Deepak Saxena has posted a set of patches implementing a new "DMA pool" abstraction. The new interface is strikingly similar to the old one - to the point that the old pci_pool_ functions can be emulated with simple macros. As a result, drivers using the old PCI functions will continue to work without changes.
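As a rough illustration of how such compatibility macros might look (the actual names and layout in Deepak's patch may differ; the key observation is that a pci_dev embeds a generic struct device, so the old calls can forward to the new API):

```c
/* Illustrative sketch only -- the patch's actual macros may differ.
 * Each old pci_pool_ call simply forwards to its dma_pool_ equivalent,
 * extracting the generic device from the pci_dev where needed. */
#define pci_pool	dma_pool

#define pci_pool_create(name, pdev, size, align, allocation) \
	dma_pool_create(name, &(pdev)->dev, size, align, allocation)

#define pci_pool_destroy(pool)	dma_pool_destroy(pool)

#define pci_pool_alloc(pool, flags, handle) \
	dma_pool_alloc(pool, flags, handle)

#define pci_pool_free(pool, vaddr, handle) \
	dma_pool_free(pool, vaddr, handle)
```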

In the new scheme, DMA pools are allocated and destroyed with:

    struct dma_pool *dma_pool_create(const char *name, struct device *dev,
                                     size_t size, size_t align,
                                     size_t allocation);
    void dma_pool_destroy(struct dma_pool *pool);

Parameters for the creation of the pool include its name, the device which will be using the pool, the size of blocks to be allocated from the pool, and the required alignment. Optionally, the allocation parameter can be used to keep pool memory from crossing a specific size boundary; if allocation is 4096, for example, no allocation from the pool will cross a 4K page boundary. The main difference from the old pci_pool_create() function is the use of a device structure rather than a pci_dev structure.

The allocation and deallocation functions are:

    void *dma_pool_alloc(struct dma_pool *pool, int mem_flags,
                         dma_addr_t *handle);
    void dma_pool_free(struct dma_pool *pool, void *vaddr, 
                       dma_addr_t handle);

Internally, the new pool functions bear a strong resemblance to the old ones - with the obvious exception that the memory for the pools is now allocated using the generic DMA functions.
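A sketch of how a driver might put this interface together (the device pointer "mydev", the pool name, and the sizes here are invented for illustration):

```c
/* Hypothetical driver fragment; "mydev" and the block sizes are
 * assumptions made for this example. */
struct dma_pool *pool;
dma_addr_t dma_handle;
void *buf;

/* 32-byte blocks, 8-byte aligned, never crossing a 4K boundary. */
pool = dma_pool_create("mydev_small", &mydev->dev, 32, 8, 4096);
if (!pool)
	return -ENOMEM;

buf = dma_pool_alloc(pool, GFP_KERNEL, &dma_handle);
if (!buf) {
	dma_pool_destroy(pool);
	return -ENOMEM;
}

/* ... hand dma_handle to the device, use buf from the CPU side ... */

dma_pool_free(pool, buf, dma_handle);
dma_pool_destroy(pool);
```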

This patch has been received well; chances are it will appear in a kernel sometime after 2.6.2 comes out.

Comments (none posted)

snprintf() confusion

Any C coder worth his or her salt knows that encoding text into a string with sprintf() invites buffer overflows, and is thus dangerous. The proper way of doing things is with snprintf(), which takes the length of the destination string as a parameter and will not overrun it. Callers of snprintf() generally assume that the return value is the length of what was actually encoded into the destination array. That turns out not to be the case, however. Per the C99 standard, snprintf() returns the length the resulting string would have had, assuming it all fit into the destination array. As a result of this misunderstanding, the kernel is full of snprintf() calls which use the return value incorrectly.

This mistake is rarely a problem; snprintf() almost never has to truncate its output, so the return value is what the programmer is expecting. Every miscoded use is an invitation for trouble, however, and really should be fixed. To that end, the 2.6.2-rc3-mm1 tree contains a patch by Juergen Quade which adds a couple of new functions:

    int scnprintf(char *buf, size_t size, const char *format, ...);
    int vscnprintf(char *buf, size_t size, const char *format, va_list args);

The new functions work the way many programmers expected the old ones to: they return the length of the string actually created in buf. The plan is to migrate the kernel over to the new functions; the patch fixes well over 200 snprintf() and vsnprintf() calls. Unless the old functions are eventually removed, however, they are likely to be a source of programming errors well into the future.

Comments (13 posted)

Trimming down sysfs

The sysfs virtual filesystem is one of the many additions to the 2.6 kernel. sysfs is the user-space presentation of the kernel's device model; it is used by the udev utility to create device nodes for hardware and will, eventually, serve numerous other purposes. There is a lot of information about the system available under sysfs; it may eventually replace many of the files currently found under /proc.

There is one little problem with sysfs, however. It is built as a simple kernel filesystem using the VFS cache as its backing store. This is an easy way to build a kernel filesystem, since the generic VFS code does most of the hard work for you. It does, however, require the kernel to maintain a directory entry ("dentry") cache entry and an inode in memory for every file and directory in the filesystem. As sysfs has grown, the amount of memory it dedicates to dentries and inodes has grown as well. Even a small system can have several hundred files in /sys; that number can grow impressively for larger systems. The memory that all those sysfs nodes occupy can be painful for very small systems (which do not have much memory to spare) and for very large systems (because sysfs lives in low memory, which is at a premium).

In order to deal with this problem, Maneesh Soni has been working on a set of patches which provides a true backing store for sysfs. These patches (the full set can be found in the "patches and updates" section, below) retain the current VFS-level cache for directories; doing otherwise turns out to open a fairly large can of worms in how the device model and the VFS interact. All of the attribute files (which make up 70% or so of sysfs entries), however, can be more compactly represented by the sysfs code itself. All that is really needed for an attribute, after all, is its name and pointers to the "show" and "store" functions.

To this end, the patches create a new sysfs_dirent structure which describes a node in the sysfs hierarchy. These structures implement an in-core representation of the sysfs tree that takes up far less space than the full VFS-cached version. When user space accesses a specific attribute node, it is a fairly straightforward matter to create the inode and dentry structures on the spot. Neither structure need be pinned into memory, so they can be aged out with the rest of the VFS cache.
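A minimal sketch of what such an in-core node might look like follows; the field names and layout are guesses based on the description above, not the actual contents of Maneesh's patch:

```c
/* User-space stand-ins for kernel types, so the sketch is self-contained. */
struct list_head { struct list_head *next, *prev; };
struct dentry;

/* Sketch of a compact in-core sysfs node.  The real sysfs_dirent in the
 * patch may carry additional fields (reference counts, mode bits, ...). */
struct sysfs_dirent {
	struct list_head	s_sibling;	/* link among siblings        */
	struct list_head	s_children;	/* child nodes, for dirs      */
	void			*s_element;	/* attribute, kobject, ...    */
	int			s_type;		/* directory, attribute, link */
	struct dentry		*s_dentry;	/* set only when instantiated */
};
```

The point of the design is visible even in this sketch: a handful of pointers and an integer per node, rather than a full dentry plus inode pinned in low memory for every attribute file.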

The result of all this work, Maneesh claims, is a savings of 145MB of low memory on his (massive) test system. The number of active dentries in this system drops from over 60,000 to under 9,000. Unlike early versions of this patch, the current effort also avoids making changes to the kobject structure, so no penalty is paid for structures using kobjects which do not appear in sysfs. As the patch has evolved, the number of criticisms has gone down; sysfs backing store appears to be getting closer to ready for inclusion.

Comments (none posted)

Patches and updates

Kernel trees

Build system

Core kernel code

Device drivers


Filesystems and block I/O


Memory management




Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds