
Kernel development

Brief items

Kernel release status

The 2.6.25 merge window is still open, so there have not yet been any prepatches for this development cycle. Patches continue to flow into the mainline repository, with some 7500 changesets merged (as of this writing) for 2.6.25.

The current -mm tree is 2.6.24-mm1. Recent changes to -mm include the dropping of a number of subsystem trees due to patch conflicts and the movement of vast numbers of patches into the mainline.

For older kernels: a new 2.6.22.x stable update, released on February 6, contains a significant number of fixes; it is likely to be the last release in the 2.6.22.x series.

Kernel development news

Quotes of the week

I don't think that "developer-centric" debugging is really even remotely our problem, and that I'm personally a lot more interested in infrastructure that helps normal users give better bug-reports. And kgdb isn't even _remotely_ it.
-- Linus Torvalds, still not sold on kernel debuggers.

I used kgdb continuously for 4-5 years until it broke. I don't think I ever used it much for "debugging" as such. I used it more for general observation of what's going on in the kernel.
-- Andrew Morton

More stuff for 2.6.25

By Jonathan Corbet
February 6, 2008
Since last week's installment, some 3800 changesets have been merged into the mainline git repository. Some of the more interesting user-visible changes found in that patch stream include:

  • Support for new hardware, including RDC R-321x system-on-chip processors, Onkyo SE-90PCI and SE-200PCI sound devices, Xilinx ML403 AC97 controllers, TI TLV320AIC3X audio codecs, Realtek ALC889/ALC267/ALC269 codecs, VIA VT1708B HD audio codecs, SiS 7019 Audio Accelerator devices, C-Media 8788 (Oxygen) audio chipsets, Asus AV200-based sound cards, Freescale MPC8610 audio devices, Audiotrak Prodigy 7.1 HiFi audio devices, Conexant 5051 audio codecs, MediaTek/TempoTec HiFier Fantasia sound cards, wireless RNDIS devices (and Broadcom 4320-based devices in particular), USB printer gadgets (intended for use in printer firmware), and NetEffect 1/10Gb ethernet adapters.

  • The nearly-unused ALSA sequencer instrument layer has been removed.

  • SELinux has a new set of checks which allow the creation of policies which control the flow of packets into and out of the system.

  • Netfilter has a more flexible "hashlimit" mechanism for limiting the number of packets to/from a given source over time.

  • There is a new "flow" classifier for the network fair queueing code which allows the more flexible creation of traffic policies.

  • The futex mechanism has a new "bitset wait" mechanism which allows for more targeted wakeups. This feature will be used by glibc to implement optimized reader-writer locks.

  • PCI hotplug is no longer an experimental feature.

  • Support for PCI Express ASPM, a power management protocol, has been added.

  • The virtio "balloon" driver (which can be used to change the amount of memory used by a KVM guest) and PCI driver have been added.

  • The CLONE_STOPPED bit (for the clone() system call) is said to be unused and is planned for removal. For 2.6.25, a warning will be printed.

  • The timerfd() system call is back, with a reworked, more capable API.

  • The page map patches, which enable much better accounting of memory use by processes, have been merged.

  • The "PM QOS" infrastructure allows both kernel and user-space code to register quality-of-service requirements (in the form of CPU DMA latency, network latency, and network throughput). These requirements will be taken into account when the kernel considers putting the system into a lower-power state.

  • Per-process capability bounding sets (which permanently remove potential capabilities from a process) are now supported. 64-bit capability mask support has also been merged.

  • The simplified mandatory access control kernel (SMACK) security module has been merged.

  • The smbfs filesystem has (finally) been deprecated in favor of CIFS. It is now scheduled for removal in 2.6.27.

  • There is a new RPC transport module allowing (client) NFS mounts using RDMA.

Changes visible to kernel developers include:

  • A large number of SUNRPC symbols (rpc_* and rpcauth_*) have been changed to GPL-only exports.

  • The x86 architecture merger continues, with quite a few files being coalesced.

  • The "flatmem" and "discontigmem" memory models have been removed on the 64-bit x86 architecture; "sparsemem" is now used for all builds.

  • The x86 spinlock implementation has been replaced with a "ticket spinlock" mechanism which provides fair FIFO behavior.

  • The fastcall function attribute didn't do anything on the x86 architecture, so it has been removed.

  • x86 has a new set of functions for easily manipulating page attributes. They are:

        int set_memory_uc(unsigned long addr, int numpages); /* Uncached */
        int set_memory_wb(unsigned long addr, int numpages); /* Cached */
        int set_memory_x(unsigned long addr, int numpages);  /* Executable */
        int set_memory_nx(unsigned long addr, int numpages); /* Non-executable */
        int set_memory_ro(unsigned long addr, int numpages); /* Read-only */
        int set_memory_rw(unsigned long addr, int numpages); /* Read-write */

    There is also a set of set_pages_* functions which take a struct page pointer rather than a beginning address.

  • Early-boot debugging of x86 systems via the FireWire port is now supported.

  • Bidirectional command support has been added to the SCSI layer.

  • There is a new process state called TASK_KILLABLE. It is a blocked state similar to TASK_UNINTERRUPTIBLE, with the difference that a wakeup will happen upon delivery of a fatal signal. The idea is to allow (almost) uninterruptible sleeps, but to still allow the process to be killed outright - thus ending the problem of unkillable processes stuck in the "D" state. There is a new set of functions for using this state: wait_event_killable(), schedule_timeout_killable(), mutex_lock_killable(), etc.

  • add_disk_randomness() has been unexported as there are no more in-tree users.

  • pci_enable_device_bars() has been replaced by two more-specific functions: pci_enable_device_io() and pci_enable_device_mem().

  • The high-resolution timer API has been augmented with:

        unsigned long hrtimer_forward_now(struct hrtimer *timer,
                                          ktime_t interval);

    It will move the given timer's expiration forward past the current time as determined by the associated clock.

  • The device structure now holds a pointer to a device_dma_parameters structure:

        struct device_dma_parameters {
                unsigned int max_segment_size;
                unsigned long segment_boundary_mask;
        };

    These parameters are used by the DMA mapping layer (and the IOMMU mapping code in particular) to ensure that I/O operations are set up within the device's constraints. The PCI layer supports this feature with two new functions:

        int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size);
        int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask);

    Drivers for devices with unusually strict DMA limitations should probably use these functions to ensure that those restrictions are respected.

One thing which has not made it into 2.6.25 is the KGDB debugger for the x86 architecture. Amusingly, a kernel mini-conf discussion of "sneaking" KGDB past Linus proceeded for some time before the participants noticed him standing in the back of the room listening to the whole thing. His current position is that he won't pull it as part of the x86 tree, and he's still not much interested in the idea in general.

As of this writing, the merge window is still open and could stay that way for as much as a week. So more interesting code could still find its way in through this merge window; stay tuned.


By Jake Edge
February 6, 2008

Performance, or lack thereof, has often been a knock against the venerable Network File System (NFS), but no real competition has emerged. NFS also has some serious flaws for programmers and users, with behavior that is markedly different from that of local filesystems. Both of these problems are spurring the creation of new network filesystems, two of which were announced in the last week.

The Coherent Remote File System (CRFS) was introduced last week by Zach Brown of Oracle. It uses BTRFS (pronounced "butter-f-s") as its storage on the server, rather than layering atop any POSIX filesystem as NFS does. According to Brown, BTRFS has a number of important features that outweigh the inconvenience, for users, of getting their data into a BTRFS volume. The biggest is the ability to perform compound operations (creating or unlinking a file, for example) in an atomic and idempotent manner.

CRFS has a userspace daemon (crfsd) that talks to the BTRFS volume as well as to multiple clients. The clients make extensive use of the kernel's VFS caching infrastructure and are thus implemented as kernel modules. A user wishing to access the underlying BTRFS volume on the server must mount it as a CRFS volume; crfsd must have exclusive access to the BTRFS filesystem. This differs from NFS, which will cooperate with local mounts of the underlying filesystem.

The basic idea behind CRFS is to have clients cache as much of the filesystem data as they can while using cache coherency protocols to reduce the amount of network traffic that gets generated. Clients keep track of the cache state for each object they have stored, while the server tracks the cache state of all objects that any client has. The messages between server and client consist of cache state transitions and the data being transferred.

Data transfer in both directions is done using CRFS "item ranges". CRFS objects use the BTRFS key scheme to represent objects (file data, directories, directory entries, inodes, etc.) in the filesystem. An item range is a contiguous section of the key space, specified by a minimum and maximum key value as part of the message. When the client is filling its cache, it can request a particular key but also offer to take other surrounding keys as part of the response; if the server sees those keys in the BTRFS leaf node, it can send them along as well.
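
CRFS's wire protocol is not public (the code is unreleased), so the following is a purely speculative illustration of the item-range idea. Every structure and field name here is invented; only the (objectid, type, offset) key triple mirrors how BTRFS actually orders its items. A range is then just an inclusive [min, max] interval in that key space:

```c
#include <stdint.h>

/* All structure and field names below are invented for illustration;
 * only the (objectid, type, offset) triple mirrors real BTRFS keys. */
struct crfs_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Keys are ordered lexicographically: objectid, then type, then offset. */
static int crfs_key_cmp(const struct crfs_key *a, const struct crfs_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* A hypothetical item-range request: the client asks for "wanted"
 * but offers to accept any surrounding items in [min, max]. */
struct crfs_item_range_req {
    struct crfs_key wanted;
    struct crfs_key min;
    struct crfs_key max;
};

/* The server may include any item whose key falls inside the range -
 * for instance, extra keys it finds in the same BTRFS leaf node. */
static int crfs_in_item_range(const struct crfs_item_range_req *req,
                              const struct crfs_key *key)
{
    return crfs_key_cmp(&req->min, key) <= 0 &&
           crfs_key_cmp(key, &req->max) <= 0;
}
```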

For a simple untar, CRFS currently shows something on the order of a 3x speedup over asynchronous NFS mounts. Comparing against synchronous NFS mounts (where each write must actually hit the remote disk) would not be sensible; there is a roughly 10x speed difference between the two types of NFS mounts. Brown has been working on CRFS for "about a year" and plans to release the code eventually. Until that happens, the slides [PDF] and video [Theora] from his talk, along with a few postings to his weblog, are the only sources of information about CRFS.

Another filesystem that aims for a broader reach than CRFS is the Parallel Optimized Host Message Exchange Layered File System (POHMELFS), announced in a linux-kernel posting by Evgeniy Polyakov. POHMELFS is meant to be a building block for a distributed filesystem that would offer a multi-server architecture and allow for disconnected filesystem operation. Polyakov has only been working on it for a month, so it is, at best, the start of a proof of concept.

The POHMELFS vision is in some ways similar to CRFS in that the clients will handle as much as possible locally, with minimal server interaction. Like CRFS, client kernel modules talk to a userspace daemon on the server, using cache coherency protocols to keep data and metadata in sync. CRFS's coherency protocol has been fleshed out to some extent but is not yet implemented, while POHMELFS still has quite a bit of fleshing out to do. Unlike CRFS, POHMELFS works atop POSIX filesystems on the server side, and its code is available now.

There are some rather large hurdles to overcome in the POHMELFS vision, not least of which is handling file IDs in separate client-side filesystems such that they can be synchronized with the server. The current code implements a write-through cache, creating objects on the server before they are used in the client-side cache. There is also an additional patch implementing a hack that disables the write-through behavior and uses only the client-side caching. The latter is, not surprisingly, very fast, but not terribly usable for multiple mounts of the filesystem. Essentially, Polyakov is showing the benefits of client-side caching, though in the context of a broader scheme.

It will be a long time, if ever, before we see some descendant of either of these filesystems in the kernel. There is much work to be done, but both are worth watching to see where networked and distributed filesystems may be headed. For them to be useful outside of the Linux world, with something like the ubiquity of NFS, there would have to be some kind of standardization, followed by adoption by the major players. That will take a very long time.

Ticket spinlocks

By Jonathan Corbet
February 6, 2008
Spinlocks are the lowest-level mutual exclusion mechanism in the Linux kernel. As such, they have a great deal of influence over the safety and performance of the kernel, so it is not surprising that a great deal of optimization effort has gone into the various (architecture-specific) spinlock implementations. That does not mean that all of the work has been done, though; a patch merged for 2.6.25 shows that there is always more which can be done.

On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by an integer value. A value of one indicates that the lock is available. The spin_lock() code works by decrementing the value (in a system-wide atomic manner), then looking to see whether the result is zero; if so, the lock has been successfully obtained. Should, instead, the result of the decrement operation be negative, the spin_lock() code knows that the lock is owned by somebody else. So it busy-waits ("spins") in a tight loop until the value of the lock becomes positive; then it goes back to the beginning and tries again.

Once the critical section has been executed, the owner of the lock releases it by setting it to 1.
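
In user space, that algorithm can be sketched with C11 atomics (the kernel's version is hand-written assembly; the names here are invented for illustration):

```c
#include <stdatomic.h>

/* A user-space sketch of the pre-2.6.25 x86 spinlock algorithm;
 * names are illustrative, not the kernel's. */
typedef struct { atomic_int slock; } old_spinlock_t;

static void old_spin_lock_init(old_spinlock_t *lock)
{
    atomic_init(&lock->slock, 1);          /* 1 means "available" */
}

static void old_spin_lock(old_spinlock_t *lock)
{
    for (;;) {
        /* Atomic decrement; an old value of 1 (result zero) means
         * we now own the lock. */
        if (atomic_fetch_sub(&lock->slock, 1) == 1)
            return;
        /* Otherwise busy-wait until the value looks positive,
         * then go back and try the decrement again. */
        while (atomic_load(&lock->slock) <= 0)
            ;                               /* spin */
    }
}

static void old_spin_unlock(old_spinlock_t *lock)
{
    atomic_store(&lock->slock, 1);          /* release: set back to 1 */
}
```

Note where the unfairness comes from: when the value is set back to 1, whichever spinning processor's decrement lands first wins, regardless of how long the others have been waiting.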

This implementation is very fast, especially in the uncontended case (which is how things should be most of the time). It also makes it easy to see how bad the contention for a lock is - the more negative the value of the lock gets, the more processors are trying to acquire it. But there is one shortcoming with this approach: it is unfair. Once the lock is released, the first processor which is able to decrement it will be the new owner. There is no way to ensure that the processor which has been waiting the longest gets the lock first; in fact, the processor which just released the lock may, by virtue of owning that cache line, have an advantage should it decide to reacquire the lock quickly.

One would hope that spinlock unfairness would not be a problem; usually, if there is serious contention for locks, that contention is a performance issue even before fairness is taken into account. Nick Piggin recently revisited this issue, though, after noticing:

On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or "unfairly" granted the lock up to 1 000 000 (!) times.

This sort of runtime difference is certainly undesirable. But lock unfairness can also create latency issues; it is hard to give latency guarantees when the wait time for a spinlock can be arbitrarily long.

Nick's response was a new spinlock implementation which he calls "ticket spinlocks." Under the initial version of this patch, a spinlock became a 16-bit quantity, split into two bytes:

[Diagram: the 16-bit spinlock word split into "next" and "owner" bytes]

Each byte can be thought of as a ticket number. If you have ever been to a store where customers take paper tickets to ensure that they are served in the order of arrival, you can think of the "next" field as being the number on the next ticket in the dispenser, while "owner" is the number appearing in the "now serving" display over the counter.

So, in the new scheme, the value of a lock is initialized (both fields) to zero. spin_lock() starts by noting the value of the lock, then incrementing the "next" field - all in a single, atomic operation. If the value of "next" (before the increment) is equal to "owner," the lock has been obtained and work can continue. Otherwise the processor will spin, waiting until "owner" is incremented to the right value. In this scheme, releasing a lock is a simple matter of incrementing "owner."
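
A user-space sketch of that scheme with C11 atomics follows. The kernel updates both bytes with a single 16-bit xadd instruction; this illustration (with invented names) models the two counters as separate atomic bytes for clarity:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative ticket spinlock; the kernel's x86 version packs both
 * bytes into one 16-bit word and uses a single atomic xadd. */
typedef struct {
    _Atomic uint8_t next;   /* next ticket in the dispenser */
    _Atomic uint8_t owner;  /* the "now serving" display */
} ticket_spinlock_t;

static void ticket_spin_lock(ticket_spinlock_t *lock)
{
    /* Atomically take a ticket: read the old "next" and increment it. */
    uint8_t my_ticket = atomic_fetch_add(&lock->next, 1);

    /* Spin until "now serving" reaches our ticket number. */
    while (atomic_load(&lock->owner) != my_ticket)
        ;                                   /* spin */
}

static void ticket_spin_unlock(ticket_spinlock_t *lock)
{
    /* Release: advance the "now serving" display. */
    atomic_fetch_add(&lock->owner, 1);
}
```

Fairness falls out directly: tickets are handed out in arrival order, and each waiter can only be released by the exact increment that reaches its number.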

The implementation described above does have one small disadvantage in that it limits the number of processors to 256 - any more than that, and a heavily-contended lock could lead to multiple processors thinking they had the same ticket number. Needless to say, the resulting potential for mayhem is not something which can be tolerated. But the 256-processor limit is an unwelcome constraint for those working on large systems, which already have rather more processors than that. So the add-on "big ticket" patch - also merged for 2.6.25 - uses 16-bit values when the configured maximum number of processors exceeds 256. That raises the maximum system size to 65536 processors - who could ever want more than that?

With the older spinlock implementation, all processors contending for a lock fought to see who could grab it first. Now they wait nicely in line and grab the lock in the order of arrival. Multi-thread run times even out, and maximum latencies are reduced (and, more to the point, made deterministic). There is a slight cost to the new implementation, says Nick, but that gets very small on contemporary processors and is essentially zero relative to the cost of a cache miss - which is a common event when dealing with contended locks. The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.

Patches and updates

Virtualization and containers

  • sukadev: Devpts namespace. (February 6, 2008)


  • Eric Leblond: ulogd. (February 5, 2008)

Page editor: Jake Edge

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds