
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.31-rc6, released on August 14. "There's nothing hugely interesting there, although I'm personally gratified by the fact that I have a few more commits than usual. Even if they are all really small, it makes me feel useful." See the announcement for the short-form changelog and some comments on the latest NULL pointer vulnerability fix.

Two stable kernel updates were released on August 16. Both contain a long list of fixes for a number of significant problems. Note that one of them had a compilation problem; a followup release came out shortly thereafter with a fix for that one issue.

The current ancient kernel is a 2.4 update, released on August 13 with a fix for the latest null pointer vulnerability (which affects 2.4 kernels as well as 2.6).

Comments (none posted)

Kernel development news

Quotes of the week

And I'm tired of being stupid. So this patch makes me smart and forward-thinking instead.
-- Linus Torvalds

There is also a feeling from some people in this community (they've told me themselves in private) that certain mainline folk have been actively trying to undermine the ARM community - given the recent "discussions" (more like flames) over things like the device tree, the HTC stuff, etc.
-- Russell King

So is part of the kernel API too? Sigh.
-- Andrew Morton

Comments (none posted)

In brief

By Jonathan Corbet
August 19, 2009
Low-memory mapping. The recent set of null-pointer vulnerabilities has not been helped by the confusion around how the mmap_min_addr knob and security modules interact. The 2.6.31-rc7 kernel will see a couple of changes intended to clarify and rationalize this interaction. With these patches, SELinux will no longer bypass the mmap_min_addr check; any process wanting to map memory below that address will require the CAP_SYS_RAWIO capability. However, SELinux will also implement its own low-memory limit, controlled by a separate knob in the SELinux policy. As a result, it will be possible to turn off the mmap_min_addr protection but still only allow specific programs to do low-memory mapping.

Module loading security. Another SELinux-related change - not yet merged into the mainline - adds a new hook to request_module(). The idea here is to try to limit the ability of user-space programs to load arbitrary modules into a running kernel. In future versions of the SELinux policy, the ability to trigger module loads is likely to be reduced to a much smaller set of roles.

HugeTLB mappings. The "hugetlb" feature allows processes to create pseudo-anonymous memory mappings backed by pages which are larger (perhaps much larger) than the normal system page size. For certain kinds of applications, these mappings can improve performance by reducing pressure on the CPU's translation lookaside buffer (TLB). The kernel code resides within such a mapping for the same reason. Using hugetlb pages in user space is a bit awkward, though; it requires mounting the special hugetlbfs filesystem and mapping files from there.

Eric Munson has put together a patch implementing an easier way. With this patch, the mmap() system call understands a new MAP_HUGETLB flag; when that flag is present, the kernel will attempt to create a mapping backed by huge pages. Underneath it all, the mapping is still implemented as a hugetlbfs file, but user space need no longer be aware of that fact.

spin_is_locked(). Kernel code can test the current state of a spinlock with spin_is_locked(). But what should this function return on a single-processor system, where spinlocks do not exist at all? Kumar Gala ran into trouble because one uniprocessor spin_is_locked() implementation returned zero. The problem was code in this form:

    /* Ensure we have the requisite lock */
    BUG_ON(!spin_is_locked(&lock));

So Kumar thinks that the return value should always be true. But there are other situations where that is just the wrong thing to do; Linus gave an example where code is waiting for a lock to become free.

The real problem is that a predicate like spin_is_locked() simply lacks a well-defined meaning when the spinlock does not exist. So there is no way to always give the "right" answer in such situations. What may happen instead is that, in a future kernel, spin_is_locked() will be deprecated. Instead, there will be new expect_spin_locked() and expect_spin_unlocked() primitives for testing the state of a spinlock. When the code is this explicit about what it is looking for, the default answer can make sense; both would return true on uniprocessor systems.

localmodconfig. Many kernel testers want to build a kernel which looks like the kernel shipped with their distribution. But distributor kernels come with a configuration which builds almost everything. So our poor tester ends up waiting for a very long time as the system builds a bunch of modules which will never be used. One could avoid this problem by creating a new configuration from scratch, but that process can be a little daunting as well. There are a lot of configuration options in a contemporary kernel.

Steven Rostedt has posted a build system change intended to help with this problem. A user who types:

    make localmodconfig

will get a configuration which builds all modules currently loaded into the running system, but no others. That should be a configuration which supports the system's hardware, but which lacks hundreds of useless modules. There is also a localyesconfig option which builds the required drivers directly into the kernel.

Comments (21 posted)

Runtime power management

By Jonathan Corbet
August 19, 2009
While a great deal of power management work has been done on Linux systems in recent years, much of that work has been directed toward the creation of working suspend and hibernation capabilities. But there is more to power management than that; there is real value in being able to minimize the power consumption of a running system. That is as true for large data center machines as it is for laptops; reduced power usage and lower air conditioning requirements have both economic and environmental benefits. Now that the suspend problem is mostly solved, increasing amounts of attention are being paid to the other aspects of power management; some recent patches show how the infrastructure for runtime power management is coming together.

The core of the future power management structure appears certain to be Rafael Wysocki's runtime power management patch. This patch set adds structure to the power management code to facilitate the suspending and resuming of individual system components at run time. The dev_pm_ops structure is augmented with three new functions:

    int (*runtime_suspend)(struct device *dev);
    int (*runtime_resume)(struct device *dev);
    int (*runtime_idle)(struct device *dev);

These functions are to be implemented by the core device code for each bus type; they may then be turned into bus-specific driver callbacks. The power management code will call runtime_suspend() to prepare a specific device for a lower-power state. This call does not imply that the device itself must suspend, but the device does need to prepare for a condition where it is no longer able to communicate with the CPU or memory. In other words, even if the device does not suspend, hardware between that device and the rest of the system might suspend. A return value of -EBUSY or -EAGAIN will abort the suspend operation.

A call to runtime_resume() should prepare the device to operate again with the rest of the system; the driver should power up the device, restore registers, and do anything else needed to get the device functioning again. The runtime_idle() callback, instead, is called when the core thinks that the device is idle and might be a good candidate for suspending. The callback should decide whether the device can really be suspended (this could include checking the state of any other devices connected to it) and, if all seems well, initiate a suspend operation.

Along with these callbacks is, of course, a set of core functions designed to manage suspend and resume activities, deal with mid-course cancellations, allow outside code to make power management changes, and so on. See the associated document file for more information on how this subsystem works.

The code described above has been through several review iterations and would appear to be on track for merging in 2.6.32. Rafael's asynchronous suspend and resume patch, instead, is rather newer and may take a little longer. The idea behind this patch is to extend the runtime power management code to allow suspend and resume callbacks to be invoked asynchronously; that, in turn, would allow them to be run in parallel. As long as there are no dependencies between a pair of devices, suspending or resuming them in parallel should make full-system transitions faster.

The problem is in the dependencies; running a bunch of power management operations in parallel increases the risk of getting the order wrong. To avoid this outcome, the patch adds a new completion object to each device; when a device is to be suspended, the completions will be used to ensure that any dependent devices are suspended first. At resume time the completions are used in the reverse direction: devices wait for their parent devices to be resumed before resuming themselves. As long as the dependency information is correct, this mechanism should ensure that a set of power management threads can run in parallel without breaking the system.

Ensuring that the dependencies are correct was one of the reasons for the creation of the Linux device model years ago. With a properly-constructed tree, the system can know, for example, that it cannot suspend a USB controller until all USB devices plugged into it have been suspended. In turn, the PCI controller to which the USB controller is attached must remain functional until the USB controller is suspended, and so on. The problem is that system dependencies are not always that simple. A PCI device may also require the services of an I2C controller, for example, or devices can be combined on multi-function chips in surprising ways. So the device tree has proved unable to represent all of the power management dependencies in the system.

Rafael has addressed this problem in a later version of the patch, which adds a new framework for representing power management dependencies. At the core of it is this structure:

    struct pm_link {
    	struct device *master;
    	struct list_head master_hook;
    	struct device *slave;
    	struct list_head slave_hook;
    };

One of these structures exists for each dependency known to the system. It indicates that the "master" device must always be functional whenever the "slave" device is; the master must be suspended after the slave and resumed before it. Many of these links can be created by the power management core itself; others will have to be generated by the relevant drivers. There have been some concerns raised about the memory use of this structure, but a better solution has not yet been proposed.

Meanwhile, Matthew Garrett has taken the core power management code one step further with a set of runtime power management patches of his own. He has pushed the new power management calls down into the PCI and USB bus layers and used them to suspend devices opportunistically as the system runs; he reports a power savings of around 0.2 watts as a result. Review comments resulted in these patches being withdrawn for now for repairs, but they show the direction things are heading. With sufficient software support and cooperative hardware, Linux should be able to further reduce the operating power needed for whole classes of systems. That cannot fail to be a good thing.

Comments (4 posted)

A new API for kfifo?

By Jake Edge
August 19, 2009

The kernel FIFO implementation, kfifo, is not that widely used and Stefani Seibold would like to see that change. To that end, she has proposed a new kfifo API that embodies many of the design patterns for data structures that were described by Neil Brown. The new interface is simpler in many ways, so it should be easier to use, which, in turn, could lead to more use throughout the kernel.

The basic problems with the current kfifo interface stem from the constraints it places on its users. A spinlock is required, though it is not needed by the majority of current kfifo users. Also, the kfifo structure can only be allocated dynamically, so users cannot put FIFOs inside of other structures—only pointers to FIFOs. Moreover, according to Seibold, the current API is too simple and doesn't provide a number of important features.

The new API has 23 separate functions to the current API's nine, but much of that increase comes from variants: copying from or to user space, support for fixed-length records, and the option of spinlock-based synchronization when adding or removing data from the FIFO. It supports many different use cases in the style of Brown's "Broad Interfaces" pattern.

A kfifo is declared using the DECLARE_KFIFO() macro, which can be used inside of a struct or union declaration. FIFOs declared with DECLARE_KFIFO() must be initialized using INIT_KFIFO(). Alternatively, DEFINE_KFIFO() handles both the declaration and initialization for FIFOs outside of struct/union declarations. The macros take name (to name the variable or struct/union element) and size (in bytes) parameters:

    DECLARE_KFIFO(name, size)

    DEFINE_KFIFO(name, size)

There are two functions to support dynamic buffer allocation:

    int kfifo_alloc(struct kfifo *fifo, unsigned int size, gfp_t gfp_mask)
which allocates a buffer of size bytes using the gfp_mask as flags to pass to kmalloc(). Or, a pre-allocated buffer can be attached to a kfifo using:
    void kfifo_init(struct kfifo *fifo, unsigned char *buffer, unsigned int size)
Any buffer allocated by kfifo_alloc() should later be freed by calling kfifo_free().

To put data into the FIFO, kfifo_in() is used:

    unsigned int kfifo_in(struct kfifo *fifo, 
        unsigned char *from, unsigned int n)
which returns the number of bytes stored. As mentioned above there are variants for getting the data from user space, as well as for locking:
    unsigned int kfifo_from_user(struct kfifo *fifo, 
        const void __user *from, unsigned int n)

    unsigned int kfifo_in_locked(struct kfifo *fifo,
	const unsigned char *from, unsigned int n, spinlock_t *lock)

As one might guess, the calls to remove data from the FIFO look similar—if reversed:

    unsigned int kfifo_out(struct kfifo *fifo, 
        unsigned char *to, unsigned int n)

    unsigned int kfifo_to_user(struct kfifo *fifo, 
        void __user *to, unsigned int n)

    unsigned int kfifo_out_locked(struct kfifo *fifo,
	unsigned char *to, unsigned int n, spinlock_t *lock)
In the common case, with only one reader and one writer, extra locks are not required to add or remove data from the FIFO, but for more complicated scenarios, the *_locked() variants allow the caller to use the appropriate lock.

The expected tests for FIFO full and empty (kfifo_is_full() and kfifo_is_empty()) are present, as are ways to get FIFO size and available space (kfifo_size(), kfifo_len(), and kfifo_avail()). One can also consume some FIFO bytes without returning them using kfifo_skip() or clear the entire FIFO with kfifo_reset().

There is also support for fixed-length records, with either 1- or 2-byte lengths stored with each entry. Each call must pass a recsize parameter that specifies the size of the length field (i.e. 1 or 2, though 0 is supported for no length) stored with each entry:

    unsigned int kfifo_in_rec(struct kfifo *fifo,
	void *from, unsigned int n, unsigned int recsize)

    unsigned int kfifo_from_user_rec(struct kfifo *fifo,
	const void __user *from, unsigned int n, unsigned int recsize)

    unsigned int kfifo_out_rec(struct kfifo *fifo,
	void *to, unsigned int n, unsigned int recsize,
	unsigned int *total)

    unsigned int kfifo_to_user_rec(struct kfifo *fifo,
	void __user *to, unsigned int n, unsigned int recsize,
	unsigned int *total)

These functions return the number of bytes not copied, which is a bit strange. For the functions that remove data from the FIFO, *total stores the number of bytes actually copied. This part of the interface seems a little dubious, and may not survive in its present form, though no comments along those lines have been seen.

Overall, the interface has been fairly well-received. There were a few comments on an earlier version of the API, which Seibold has mostly addressed. The only comments on the most recent version (v0.4) were a disagreement between Alan Cox and Andrew Morton over naming conventions.

Cox would rather not see the current kfifo_put() and kfifo_get() calls get removed quite yet, noting "We are about to set fifo loose through all the USB and some other char/serial drivers all of which will use the spinlock facility." The current calls use the spinlock, so the kind of change Seibold is proposing would need to be reflected in the USB and char/serial driver code. But Morton thinks that this is the right time to make those changes, because "kfifo has no business assuming that the caller wants to use spin_lock() locking".

The basic problem is that Morton would like to see the convention that functions like kfifo_put() (or kfifo_in(), as Seibold's API names it) do not assume locking. He agrees with Seibold's decision to add separate kfifo_*_locked() variants. Cox points out that the convention is followed quite inconsistently, but Morton is adamant:

Plus, as I've said umpteen times and keep getting ignored: the current naming is wrong. The generic kfifo_get() should not be assuming that the caller wants spinlock-based locking.

After initially NAK-ing part of the patch, Cox seems to have relented, so that particular barrier is gone. It would seem far too late in the 2.6.31 process for this kind of change to go in, but unless some other major issues crop up, it is quite possible that a new kfifo API could show up in 2.6.32.

Comments (2 posted)

The trouble with discard

By Jonathan Corbet
August 18, 2009
Traditionally, storage devices have managed the blocks of data given to them without being concerned about how the system used those blocks. Increasingly, though, there is value in making more information available to storage devices; in particular, there can be advantages to telling the device when specific blocks no longer contain data of interest to the host system. The "discard" concept was added to the kernel one year ago to communicate this information to storage devices. One year later, it seems that the original discard idea will not survive contact with real hardware - especially solid-state storage devices.

There are a number of use cases for the discard functionality. Large, "enterprise-class" storage arrays can implement virtual devices with a much larger storage capacity than is actually installed in the cabinet; these arrays can use information about unneeded blocks to reuse the physical storage for other data. The compcache compressed in-memory swapping mechanism needs to know when specific swap slots are no longer needed to be able to free the memory used for those slots. Arguably, the strongest pressure driving the discard concept comes from solid-state storage devices (SSDs). These devices must move data around on the underlying flash storage to implement their wear-leveling algorithms. In the absence of discard-like functionality, an SSD will end up shuffling around data that the host system has long since stopped caring about; telling the device about unneeded blocks should result in better performance.

The sad truth of the matter, though, is that this improved performance does not actually happen on SSDs. There are two reasons for this:

  • At the ATA protocol level, a discard request is implemented by a "TRIM" command sent to the device. For reasons unknown to your editor, the protocol committee designed TRIM as a non-queued command. That means that, before sending a TRIM command to the device, the block layer must first wait for all outstanding I/O operations on that device to complete; no further operations can be started until the TRIM command completes. So every TRIM operation stalls the request queue. Even if TRIM were completely free, its non-queued nature would impose a significant I/O performance cost. (It's worth noting that the SCSI equivalent to TRIM is a tagged command which doesn't suffer from this problem).

  • With current SSDs, TRIM appears to be anything but free. Mark Lord has measured regular delays of hundreds of milliseconds. Delays on that scale would be most unwelcome on a rotating storage device. On an SSD, hundred-millisecond latencies are simply intolerable.

One would assume that the second problem will eventually go away as the firmware running in SSDs gets smarter. But the first problem can only be fixed by changing the protocol specification, so any possible fix would be years in the future. It's a fact of life that we will simply have to live with.

There are a few proposals out there for how we might live with the performance problems associated with discard operations. Matthew Wilcox has a plan to reimplement the whole discard concept using a cache in the block layer. Rather than sending discard operations directly to the device, the block layer will remember them in its own cache. Any new write operations will then be compared against the discard cache; whenever an operation overwrites a sector marked for discard, the block layer will know that the discard operation is no longer necessary and can, itself, be discarded. That, by itself, would reduce the number of TRIM operations which must be sent to the device. But if the kernel can work to increase locality on block devices, performance should improve even more. One relatively easy-to-implement example would be actively reusing recently-emptied swap slots instead of scattering swapped pages across the swap device. As Matthew puts it: "there's a better way for the drive to find out that the contents of a block no longer matter -- write some new data to it."

In Matthew's scheme, the block layer would occasionally flush the discard cache, sending the actual operations to the device. The caching should allow the coalescing of many operations, further improving performance. Greg Freemyer, instead, suggests that flushing the discard cache could be done by a user-space process. Greg says:

Assuming we have a persistent bitmap in place, have a background scanner that kicks in when the cpu / disk is idle. It just continuously scans the bitmap looking for contiguous blocks of unused sectors. Each time it finds one, it sends the largest possible unmap down the block stack and eventually to the device.

When normal cpu / disk activity kicks in, this process goes to sleep.

A variant of this approach was posted by Christoph Hellwig, who has implemented batched discard support in XFS. Christoph's patch adds a new ioctl() which wanders through the filesystem's free-space map and issues large discard operations on each of the free extents. The advantage of doing things at the filesystem level is that the filesystem already knows which blocks are uninteresting; there is no additional accounting required to obtain that information. This approach will also naturally generate large operations; larger discards tend to suit the needs of the hardware better. On the other hand, regularly discarding all of the free space in a filesystem makes it likely that some time will be spent telling the device to discard sectors which it already knows to be free.

It is far too soon to hazard a guess as to which of these approaches - if any - will be merged into the mainline. There is a fair amount of coding and benchmarking work to be done still. But it is clear that the code which is in current mainline kernels is not up to the task of getting the best performance out of near-future hardware.

Your editor feels the need to point out one possibly-overlooked aspect of this problem. An SSD is not just a dumb storage device; it is, instead, a reasonably powerful computer in its own right, running complex software, and connected via what is, essentially, a high-speed, point-to-point network. Some of the more enterprise-oriented devices are more explicitly organized this way; they are separate boxes which hook into an IP-based local net. Increasingly, the value in these devices is not in the relatively mundane pile of flash storage found inside; it's in the clever firmware which causes the device to look like a traditional disk and, one hopes, causes it to perform well. Competition in this area has brought about some improvements in this firmware, but we should see a modern SSD for what it is: a computer running proprietary software that we put at the core of our systems.

It does not have to be that way; Linux does not need to talk to flash storage through a fancy translation layer. We have our own translation layer code (UBI), and a few filesystems which can work with bare flash. It would be most interesting to see what would happen if some manufacturer were to make competitive, bare-flash devices available as separate components. The kernel could then take over the flash management task, and our developers could turn their attention toward solving the problem correctly instead of working around problems in vendor solutions. Kernel developers made an explicit choice to avoid offloading much of the network stack onto interface hardware; it would be nice to have a similar choice regarding the offloading of low-level storage management.

In the absence of that choice, we'll have no option but to deal with the translation layers offered by hardware vendors. The results look unlikely to be optimal in the near future, but they should still end up being better than what we have today.

Comments (36 posted)

Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds