Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.11-rc1.

Linus's BitKeeper repository contains, as of this writing, some networking updates, an ALSA update (to version 1.0.8), some enhancements to the "circular pipe buffers" code introduced in -rc1 (see below), the ioctl() method rework (see below), in-inode extended attributes for ext3, and various fixes.

The current prepatch from Andrew Morton is 2.6.11-rc1-mm1. Recent additions to -mm include the Linux Trace Toolkit (LTT), relayfs, ext3 in-inode extended attributes (subsequently merged), the filesystems in user space (FUSE) patch set, an update to the random driver, and a copy of Dave Jones's "post-halloween" document (in the hope that somebody will be motivated to update it).

Andrew added LTT and relayfs with the explanation: "This is a discussion which needs to be had." The discussion has indeed been lively. Many developers see the value in this code, but object to the implementation. As a result, LTT and relayfs are likely to be slimmed down significantly, with more of the work shifted to user space or a separate loadable module. We may also see the Linux Kernel State Tracer patch submitted to -mm for comparison before the discussion is over.

The current 2.4 kernel is 2.4.29, released by Marcelo on January 19. One change was made since -rc3: the removal of one patch which was causing trouble. The changes since 2.4.28 are mostly bug fixes and driver updates; 2.4 is past the point of getting much in the way of new features.

Kernel development news

Quote of the week

Given that base 2.6 kernels are shipped by Linus with known unfixed security holes anyone trying to use them really should be doing some careful thinking. In truth no 2.6 released kernel is suitable for anything but beta testing until you add a few patches anyway....

I still think the 2.6 model works well because its making very good progress and then others are doing testing and quality management on it. Linus is doing the stuff he is good at and other people are doing the stuff he doesn't.

-- Alan Cox

The new way of ioctl()

The ioctl() system call has long been out of favor among the kernel developers, who see it as a completely uncontrolled entry point into the kernel. Given the vast number of applications which expect ioctl() to be present, however, it will not go away anytime soon. So it is worth the trouble to ensure that ioctl() calls are performed quickly and correctly - and that they do not unnecessarily impact the rest of the system.

ioctl() is one of the remaining parts of the kernel which run under the Big Kernel Lock (BKL). In the past, the usage of the BKL has made it possible for long-running ioctl() methods to create long latencies for unrelated processes. Recent changes, which have made BKL-covered code preemptible, have mitigated that problem somewhat. Even so, the desire to eventually get rid of the BKL altogether suggests that ioctl() should move out from under its protection.

Simply removing the lock_kernel() call before calling ioctl() methods is not an option, however. Each one of those methods must first be audited to see what other locking may be necessary for it to run safely outside of the BKL. That is a huge job, one which would be hard to do in a single "flag day" operation. So a migration path must be provided. As of 2.6.11, that path will exist.

The patch (by Michael S. Tsirkin) adds a new member to the file_operations structure:

    long (*unlocked_ioctl) (struct file *filp, unsigned int cmd, 
                            unsigned long arg);

If a driver or filesystem provides an unlocked_ioctl() method, it will be called in preference to the older ioctl(). The differences are that the inode argument is not provided (it's available as filp->f_dentry->d_inode) and the BKL is not taken prior to the call. All new code should be written with its own locking, and should use unlocked_ioctl(). Old code should be converted as time allows. For code which must run on multiple kernels, there is a new HAVE_UNLOCKED_IOCTL macro which can be tested to see if the newer method is available or not.
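The calling rule described above can be pictured with a small user-space model. This is only a sketch: the structures are cut-down stand-ins rather than the kernel's definitions, and model_do_ioctl(), the sample handlers, and the bkl_depth counter are hypothetical names invented for illustration.

```c
#include <stddef.h>

/* User-space stand-ins; NOT the kernel's real structures. */
struct inode { int unused; };
struct file  { struct inode *inode; };

struct file_operations {
    int  (*ioctl)(struct inode *, struct file *, unsigned int, unsigned long);
    long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
};

/* A counter standing in for the Big Kernel Lock. */
static int bkl_depth;
static void lock_kernel(void)   { bkl_depth++; }
static void unlock_kernel(void) { bkl_depth--; }

#define MODEL_ENOTTY 25

/* Hypothetical dispatcher: prefer unlocked_ioctl(), which runs without
 * the BKL; otherwise fall back to the old method under the lock. */
static long model_do_ioctl(const struct file_operations *fops,
                           struct file *filp, unsigned int cmd,
                           unsigned long arg)
{
    long ret = -MODEL_ENOTTY;

    if (fops->unlocked_ioctl) {
        ret = fops->unlocked_ioctl(filp, cmd, arg);
    } else if (fops->ioctl) {
        lock_kernel();
        ret = fops->ioctl(filp->inode, filp, cmd, arg);
        unlock_kernel();
    }
    return ret;
}

/* Sample handlers which record whether the "BKL" was held on entry. */
static int bkl_seen;

static int sample_ioctl(struct inode *inode, struct file *filp,
                        unsigned int cmd, unsigned long arg)
{
    (void)inode; (void)filp; (void)cmd; (void)arg;
    bkl_seen = bkl_depth;
    return 1;
}

static long sample_unlocked_ioctl(struct file *filp, unsigned int cmd,
                                  unsigned long arg)
{
    (void)filp; (void)cmd; (void)arg;
    bkl_seen = bkl_depth;
    return 2;
}
```

In this model, a file_operations structure providing both methods only ever sees sample_unlocked_ioctl() called, and that call runs with the lock counter at zero.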

Michael's patch adds one other operation:

    long (*compat_ioctl) (struct file *filp, unsigned int cmd, 
                          unsigned long arg);

If this method exists, it will be called (without the BKL) whenever a 32-bit process calls ioctl() on a 64-bit system. It should then do whatever is required to convert the argument to native data types and carry out the request. If compat_ioctl() is not provided, the older conversion mechanism will be used, as before. The HAVE_COMPAT_IOCTL macro can be tested to see if this mechanism is available on any given kernel.
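As an illustration of the sort of conversion a compat_ioctl() method might perform, here is a user-space sketch. The demo_req and demo_req32 structures and the helper function are hypothetical; fixed-width integers stand in for the pointer and unsigned long fields so that the example behaves the same on any host.

```c
#include <stdint.h>

/* Argument layout as a 32-bit process sees it (hypothetical). */
struct demo_req32 {
    uint32_t buf;   /* user-space pointer, 32 bits wide */
    uint32_t len;   /* unsigned long in a 32-bit process */
};

/* The same request in the native 64-bit layout (hypothetical). */
struct demo_req {
    uint64_t buf;   /* models a 64-bit pointer */
    uint64_t len;   /* models a 64-bit unsigned long */
};

/* Widen each field; the 32-bit pointer is zero-extended, never
 * sign-extended, so addresses above 2GB stay intact. */
static void demo_req_from_compat(struct demo_req *dst,
                                 const struct demo_req32 *src)
{
    dst->buf = (uint64_t)src->buf;
    dst->len = (uint64_t)src->len;
}
```

A real method would fetch the 32-bit structure from user space, convert it along these lines, and hand the result to the same code which serves native callers.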

The compat_ioctl() method will probably filter down into a few subsystems. Andi Kleen has posted patches adding new compat_ioctl() methods to the block_device_operations and scsi_host_template structures, for example, though those patches have not been merged as of this writing.

The evolution of pipe buffers

Last week, this page looked at the new circular buffer structure used to implement Unix pipes in 2.6.11-rc1, and noted that the plan was to evolve that structure into something more general. Since then, Linus has taken a couple more steps; it must be time to catch up.

One change which has already been merged is the addition of a set of operations for pipe buffers:

    struct pipe_buf_operations {
        int can_merge;
        void *(*map)(struct file *, struct pipe_inode_info *,
                     struct pipe_buffer *);
        void (*unmap)(struct pipe_inode_info *, struct pipe_buffer *);
        void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
    };

The can_merge flag addresses one of the issues raised last week: coalescing of writes into existing pages in the buffer. If can_merge is non-zero, coalescing will be performed. Otherwise, each write to a pipe buffer will result in the creation of a new circular buffer entry, and, by default, the allocation of a new page.
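The effect of can_merge can be sketched with a simplified user-space model. Everything here is an assumption made for illustration: the model_* names are invented, a 4096-byte page is assumed, only lengths are tracked (no data is stored), and writes larger than a page are simply refused.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096u
#define MODEL_PIPE_BUFFERS 16u

struct model_buf { size_t len; };       /* bytes used in this entry's page */

struct model_pipe {
    struct model_buf bufs[MODEL_PIPE_BUFFERS];
    unsigned int curbuf;                /* index of the oldest entry */
    unsigned int nrbufs;                /* entries currently in use */
    int can_merge;                      /* non-zero: coalesce small writes */
};

static void model_pipe_init(struct model_pipe *p, int can_merge)
{
    unsigned int i;

    for (i = 0; i < MODEL_PIPE_BUFFERS; i++)
        p->bufs[i].len = 0;
    p->curbuf = 0;
    p->nrbufs = 0;
    p->can_merge = can_merge;
}

/* Returns the number of bytes accepted: len, or 0 if the write is
 * refused (ring full, zero length, or more than one page). */
static size_t model_pipe_write(struct model_pipe *p, size_t len)
{
    unsigned int slot;

    if (len == 0 || len > MODEL_PAGE_SIZE)
        return 0;

    /* Coalesce into the newest entry if allowed and there is room. */
    if (p->can_merge && p->nrbufs > 0) {
        unsigned int last = (p->curbuf + p->nrbufs - 1) % MODEL_PIPE_BUFFERS;

        if (p->bufs[last].len + len <= MODEL_PAGE_SIZE) {
            p->bufs[last].len += len;
            return len;
        }
    }

    /* Otherwise the write takes a new ring entry (and a new page). */
    if (p->nrbufs == MODEL_PIPE_BUFFERS)
        return 0;
    slot = (p->curbuf + p->nrbufs) % MODEL_PIPE_BUFFERS;
    p->bufs[slot].len = len;
    p->nrbufs++;
    return len;
}
```

With can_merge set, three 100-byte writes share a single entry; without it, the same writes consume three entries, and sixteen small writes are enough to fill the ring.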

The map() and unmap() methods are charged with controlling the visibility of pipe buffer pages in the kernel's virtual address space. The default map() operation for buffers implementing Unix pipes is quite simple:

    static void *anon_pipe_buf_map(struct file *file,
                                   struct pipe_inode_info *info,
                                   struct pipe_buffer *buf)
    {
            return kmap(buf->page);
    }

Since the mapping operation has been abstracted out, there are now fewer assumptions regarding how data is really stored within a pipe buffer. This opens the door to different pipe implementations, such as pipes which implement a direct window into device memory.

The release() method should clean things up when the pipe buffer is no longer needed.

Linus has also created an initial implementation of a splice() system call, though this work is clearly not ready for merging at this point. This system call looks like:

    long sys_splice(int fdin, int fdout, size_t len, unsigned long flags);

fdin and fdout are two file descriptors, one of which is expected to be a pipe; a call to sys_splice() will result in up to len bytes being copied from fdin to fdout. The flags argument is not currently used by the sample implementation.

To make sys_splice() work, Linus added two new methods to the ever-expanding file_operations structure:

    ssize_t (*splice_write)(struct inode *in_pipe, struct file *out, 
                            size_t len, unsigned long flags);
    ssize_t (*splice_read)(struct file *in, struct inode *out_pipe, 
                           size_t len, unsigned long flags);

The patch includes a generic splice_read() implementation suitable for filesystem-backed file descriptors. It simply populates the page cache with some pages from the file, then loads those pages into the pipe buffer represented by out_pipe. Like ordinary read() and write() methods, the splice variants can transfer fewer bytes than requested. Linus's version will stop at the maximum capacity of a pipe buffer - 16 pages, currently.
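Since a single call can move less than was asked for, a caller wanting the whole transfer has to loop, just as with read() and write(). The following user-space sketch shows that pattern against a mock: model_splice_once() is a hypothetical stand-in which merely applies the 16-page limit (assuming 4096-byte pages) and moves no data, and it is not the interface from Linus's patch.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096u
#define MODEL_PIPE_BUFFERS 16u

/* Mock transfer: moves at most one pipe's worth of bytes per call. */
static long model_splice_once(size_t len)
{
    size_t cap = (size_t)MODEL_PAGE_SIZE * MODEL_PIPE_BUFFERS;

    return (long)(len < cap ? len : cap);
}

/* Keep calling until everything has moved, an error is returned, or
 * no progress is made. *calls reports how many calls were needed. */
static long model_splice_all(size_t len, unsigned int *calls)
{
    size_t done = 0;

    *calls = 0;
    while (done < len) {
        long n = model_splice_once(len - done);

        (*calls)++;
        if (n < 0)
            return n;       /* pass the error back to the caller */
        if (n == 0)
            break;          /* no progress; give up */
        done += (size_t)n;
    }
    return (long)done;
}
```

In this model, a 100,000-byte request needs two calls: the first moves 65,536 bytes (16 pages) and the second moves the remaining 34,464.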

As Linus acknowledges, there are a number of shortcomings to the current implementation - it is incomplete, the interfaces are ugly, and it will oops the system if anything goes wrong. It is, however, an indication of where he expects this work will lead. Stay tuned.

API changes in the 2.6 kernel series

The 2.6 kernel development series differs from its predecessors in that much larger and potentially destabilizing changes are being incorporated into each release. Among these changes are modifications to the internal programming interfaces for the kernel, with the result that kernel developers must work harder to stay on top of a continually-shifting API. There has never been a guarantee of internal API stability within the kernel - even in a stable development series - but the rate of change is higher now.

This article will be updated to keep track of the internal changes for each 2.6 kernel release. Its permanent location is:

This page will, doubtless, remain incomplete for a while. If you see an omission, please let us know by sending a note to rather than by posting comments here. The chances of a prompt update are higher, the article will not become cluttered with redundant comments, and we'll be more than happy to credit you here.

If you are a Linux Device Drivers, Third Edition reader looking for information on changes since the book was published: LDD3 covers version 2.6.10 of the kernel, so only the changes starting with 2.6.11 are relevant.

Last update: January 5, 2006

2.6.15 (January 2, 2006)

  • The nested class device patch was merged, allowing class_device structures to have other class_devices as parents. This patch is a hack to make the input subsystem work with sysfs. This code will change again in the future; see Greg Kroah-Hartman's article for more information on what is planned.

  • The prototypes for the driver model class "interface" methods add() and remove() have changed; there is now a new parameter pointing to the relevant interface structure.

  • A new platform_driver structure has been added to describe drivers for devices built into the core "platform."

  • The prototypes for the suspend() and resume() methods in struct device_driver have changed. They are also only called once per event, rather than three times as in previous kernels.

  • Two new fields have been added to the device_pm_info which control how drivers should act on hardware-created wakeup events; see this article for details.

  • There is a notification mechanism which lets interested modules know when a USB device is added to (or removed from) the system. This system is used by some core code; drivers do not normally need to hook in to it.

  • The gfp_t type is now used throughout the kernel. If you have a function which takes memory allocation flags, it should probably be using this type.

  • Code using reader/writer semaphores can now use rwsem_is_locked() to test the (read) state of the semaphore without blocking.

  • The new vmalloc_node() function allocates memory on a specific NUMA node.

  • The "reserved" bit for memory pages has, for all practical purposes, been removed.

  • vm_insert_page() has been added to make it easier for drivers to remap RAM into user space VMAs.

  • There is a new kthread_stop_sem() function which can be used to stop a kernel thread which might be currently blocked on a specific semaphore.

  • RapidIO bus support has been merged into the mainline.

  • The netlink connector mechanism makes netlink code easier to write. Independently, a type-safe netlink interface has been added and is used in parts of the networking subsystem.

  • These kernel symbols have been unexported and are no longer available to modules: clear_page_dirty_for_io, console_unblank, cpu_core_id, hugetlb_total_pages, idle_cpu, nr_swap_pages, phys_proc_id, reprogram_timer, swapper_space, sysctl_overcommit_memory, sysctl_overcommit_ratio, sysctl_max_map_count, total_swap_pages, user_get_super, uts_sem, vm_acct_memory, and vm_committed_space.

  • Version 1 of the Video4Linux API is now officially scheduled for removal in July, 2006.

  • The owner field has been removed from the pci_driver structure.

  • A number of SCSI subsystem typedefs (Scsi_Device, Scsi_Pointer, and Scsi_Host_Template) have been removed.

  • The DMA32 memory zone has been added to the x86-64 architecture; its purpose is to make it easy to allocate memory below the 4GB barrier (with the new GFP_DMA32 flag).

  • A call to rcu_barrier() will block the calling process until all current RCU callbacks have completed.

2.6.14 (October 27, 2005)

  • A new PHY abstraction layer has been added for network drivers.

  • The sk_buff structure has changed again; the changes will force a recompile but shouldn't otherwise be a problem.

  • Version 19 of the wireless extensions has been merged. Among other things, this version deprecates the get_wireless_stats() method in the net_device structure.

  • The klist API has changed. The order of the parameters has been reversed for klist_add_head() and klist_add_tail(). It is now necessary to provide a pair of reference counting functions when setting up a list with klist_init().

  • The relayfs virtual filesystem, which enables high-rate data transfers between the kernel and user space, has been merged.

  • kzalloc() has been added as a way of obtaining pre-zeroed memory.

  • Two new versions of schedule_timeout() have been added.

  • The new TASK_NONINTERACTIVE state flag tells the scheduler not to perform the usual accounting on sleeping processes.

  • SKB's which are expected to be cloned can be efficiently allocated with alloc_skb_fclone().

  • A few new helper functions for mapping block I/O requests have been added; see this article for details.

  • Securityfs, a virtual filesystem intended for use with security modules, has been merged.
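For the kzalloc() entry in the list above, the idea can be sketched in user space. model_kzalloc() is a hypothetical stand-in built on malloc(); the real function also takes allocation flags, which this model leaves out.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Kernel code of this shape:
 *
 *     p = kmalloc(sizeof(*p), GFP_KERNEL);
 *     if (!p)
 *             return -ENOMEM;
 *     memset(p, 0, sizeof(*p));
 *
 * can become a single call:
 *
 *     p = kzalloc(sizeof(*p), GFP_KERNEL);
 *
 * The function below is a user-space model of that behavior only.
 */
static void *model_kzalloc(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0, size);     /* hand back pre-zeroed memory */
    return p;
}
```

The caller still has to check for a NULL return; only the separate memset() call goes away.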

2.6.13 (August 28, 2005)

  • The HZ constant is now configurable at kernel build time.

  • The timer API now includes try_to_del_timer_sync(), which makes a best effort to delete the timer; it is safe to call in atomic context.

  • The block_device_operations structure now has an unlocked_ioctl() member.

  • The return value from netif_rx() has changed; it now will return one of only two values: NET_RX_SUCCESS or NET_RX_DROP.

  • pci_dma_burst_advice can be used by PCI drivers to learn the optimal way of bursting DMA transfers.

  • The text searching API has been added.

2.6.12 (June 17, 2005)

  • cancel_rearming_delayed_work() was added to the workqueue API.

  • The timeout value passed to usb_bulk_msg() and usb_control_msg() is now expressed in milliseconds instead of jiffies.

  • An interrupt-disabling spinlock is used in the rwsem implementation. It was never correct to call one of the variants of down_read() or down_write() with interrupts disabled, but it is even less correct now.

  • The fields in the net_device structure have been rearranged, which will break binary-only drivers.

  • kref_put() now returns an int value: nonzero if the kref was actually released.

  • kobject_add() and kobject_del() no longer generate hotplug events. If you need these events, you must call kobject_hotplug() explicitly. The wrapper functions kobject_register() and kobject_unregister() do still generate hotplug events.

  • kobj_map() no longer takes a subsystem argument; instead, it needs a pointer to a semaphore which it can use for mutual exclusion.

  • A new function, sysfs_chmod_file(), allows permissions to be changed on existing sysfs attributes.

  • There is a new generic sort() function which should be used in preference to creating yet another implementation.

  • A new attribute (__nocast) is being used with sparse to disable a number of implicit casts and find probable bugs.

  • io_remap_page_range() is now deprecated; use io_remap_pfn_range() instead.

  • A set of functions has been added to work with big-endian I/O memory.

  • synchronize_kernel() is deprecated. Callers should instead use either synchronize_sched() (to verify that all processors have quiesced) or synchronize_rcu() (to verify that all processors have exited RCU critical sections).

  • The flag argument to blk_queue_ordered() has changed to indicate how ordered writes are handled by the device. Possible values are QUEUE_ORDERED_NONE (ordering is not possible), QUEUE_ORDERED_TAG (ordering is forced with request tags), and QUEUE_ORDERED_FLUSH (ordering is done with explicit flush commands). For the last case, the request queue has two new methods, prepare_flush_fn() and end_flush_fn(), which are called before and after a barrier request.

  • A new function, valid_signal(), can (and should) be used to test whether signal numbers from user space are valid.

  • The Developers Certificate of Origin, the document acknowledged by all those "Signed-off-by:" headers, has changed. The new version adds a clause noting that contributions - and the information that goes with them - are public information which can be redistributed.
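The kref_put() return value mentioned in the list above lets a caller find out whether its put was the one which freed the object. Here is a minimal user-space sketch of those semantics; the model_* names are hypothetical, and a plain int is used where the kernel uses an atomic counter, so this model is not safe for concurrent use.

```c
/* User-space model of a reference count; not the kernel's struct kref. */
struct model_kref { int refcount; };

static void model_kref_init(struct model_kref *k) { k->refcount = 1; }
static void model_kref_get(struct model_kref *k)  { k->refcount++; }

/* Drop one reference. Returns nonzero only if this call dropped the
 * last reference and release() was therefore invoked. */
static int model_kref_put(struct model_kref *k,
                          void (*release)(struct model_kref *))
{
    if (--k->refcount == 0) {
        release(k);
        return 1;
    }
    return 0;
}

/* Sample release function which just counts its invocations. */
static int released_count;

static void model_release(struct model_kref *k)
{
    (void)k;
    released_count++;
}
```

After an init and one get, the first put returns zero and the second returns nonzero, at which point the release function has run exactly once.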

2.6.11 (March 2, 2005)

  • The kernel now performs access checking for read() and write() calls before invoking the driver- or filesystem-specific file_operations method.

  • The bcopy() function, unused in the mainline kernel, has been removed.

  • The prototype of the suspend() method in struct pci_driver has changed; the state parameter is now of type pm_message_t.

  • The rwlock_is_locked() macro has been removed; instead, use either read_can_lock() or write_can_lock(). There is also a new spin_can_lock() for regular spinlocks.

  • Three new ways of waiting for completions have been added: wait_for_completion_interruptible(), wait_for_completion_timeout(), and wait_for_completion_interruptible_timeout().

  • For USB drivers: the usb_device_descriptor and usb_config_descriptor structures now keep all fields in the wire (little-endian) form. [GKH]

  • pci_set_power_state() and pci_enable_wake() have new prototypes: power states are represented with the pci_power_t type rather than an int. [GKH]

  • The Big Kernel Semaphore patch was merged. As a result, code which is protected by lock_kernel() is now preemptible. This change should not affect most code developed in this century, but there are always exceptions.

  • The file_operations structure now contains an unlocked_ioctl() member. If that member is non-NULL, it will be called in preference to the regular ioctl() method - and the big kernel lock will not be held. New code should use unlocked_ioctl() and the programmer should ensure that the proper locking has been performed.

    There is also a new compat_ioctl() method which is called, if present, when a 32-bit process calls ioctl() on a 64-bit system.

  • Run-time initialization of spinlocks is being converted away from the assignment form (using SPIN_LOCK_UNLOCKED) to explicit spin_lock_init() calls. No noises have yet been made about removing SPIN_LOCK_UNLOCKED, but the writing should be considered to be on the wall. If and when the real-time preemption patches are merged, the assignment form may no longer be possible.

  • debugfs has been merged; it is a virtual filesystem intended for use by kernel hackers who want to export debugging information from their code.

  • Binary attributes in sysfs can now offer mmap() support; see this patch for the details.

  • Four-level page tables have been merged. This change affects surprisingly little code, but, if you are manually walking through the page table tree, you will have to take the new level into account.

  • Socket buffers can be obtained from alloc_skb_from_cache(), which uses a slab cache.

  • A new memory allocation flag (__GFP_ZERO) was added; it allows kernel code to request that the allocated memory be zeroed. It is part of the larger prezeroing patch which has not, yet, been merged.

  • Linus has reimplemented pipes with a circular buffer construct which will, eventually, be mutated into a more generic form.

  • Work is being done toward the goal of removing the semaphore from struct subsystem. If your code depends on this semaphore, which it shouldn't, expect to have to change it soon.

2.6.10 (December 24, 2004)

  • Calling pci_enable_device() is required to get interrupt routing to work. [GKH]

  • A new function, pci_dev_present(), can be used to determine whether a specific device is present or not. [GKH]

  • The prototypes to pci_save_state() and pci_restore_state() have changed: the buffer argument is no longer needed (the space has been allocated in struct pci_dev instead). [GKH]

  • The kernel build system was tweaked; the preferred name for kernel makefiles is now Kbuild. The change is meant to highlight the fact that kernel makefiles are rather different than the user-space variety, but very few, if any, makefiles have been renamed.

  • add_timer_on(), sys_lseek(), and a number of other kernel functions are no longer exported to modules. Most of the driver core functions have been changed to GPL-only exports.

  • I/O space write barriers are now supported.

  • The prototype of kunmap_atomic() has changed. This change should not affect properly-written code, but should generate warnings when a struct page pointer is (erroneously) passed to that function.

  • atomic_inc_return() was added as a way to increment the value of an atomic_t variable and get the new value.

  • The little-used "BIO walking" helper functions (process_that_request_first()) have been removed.

  • The venerable remap_page_range() function has been changed to remap_pfn_range(); the new function uses a page frame number for the physical address, rather than the actual address. remap_page_range() is still supported - for now.

  • wake_up_all_sync(), unused in the mainline tree, was removed.

  • A simple, stream-oriented circular buffer implementation was added.

  • The kernel event mechanism was merged, making it possible to notify user space of relevant kernel events.

  • vfs_permission() was replaced by generic_permission(), which has an optional callback for ACL checking. [MS]
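For the remap_pfn_range() entry in the list above, the change for callers comes down to passing a page frame number where a physical address used to go. The arithmetic can be sketched in user space; model_phys_to_pfn() is a hypothetical helper, and a page shift of 12 (4096-byte pages) is assumed.

```c
#define MODEL_PAGE_SHIFT 12     /* assumes 4096-byte pages */

/*
 * A driver mmap() method which used to call:
 *
 *     remap_page_range(vma, vma->vm_start, phys, size, vma->vm_page_prot);
 *
 * would now pass a page frame number instead:
 *
 *     remap_pfn_range(vma, vma->vm_start, phys >> PAGE_SHIFT, size,
 *                     vma->vm_page_prot);
 *
 * The helper below models only the address-to-PFN conversion.
 */
static unsigned long model_phys_to_pfn(unsigned long long phys)
{
    return (unsigned long)(phys >> MODEL_PAGE_SHIFT);
}
```

Because the page offset bits are shifted away, a 32-bit page frame number can describe physical addresses well beyond 4GB.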

2.6.9 (October 18, 2004)

  • Kprobes was merged, making another debugging technique available.

  • Spinlocks are implemented completely out of line now. This change should not affect any code.

  • wait_event_timeout() was added.

  • Kobjects now use the kref type to handle reference counting. Most code should be unaffected by this change.

  • A new set of functions for accessing I/O memory was introduced. The new functions are cleaner and type-safe, and should be used in preference to readb() and friends. The new ioport_map() function makes it possible to treat I/O ports as if they were I/O memory.

  • The NETIF_F_LLTX feature for net_devices tells the networking subsystem that the driver code performs its own locking and does not require that the xmit_lock be taken before hard_start_xmit() can be called.

  • dma_declare_coherent_memory() was added to allow the DMA functions to hand out memory located on a specific device.

  • msleep_interruptible() was added.

  • The prototype of kref_put() changed; a pointer to the release() function is now required.

2.6.8 (August 13, 2004)

  • The fcntl() method in the file_operations structure, just added in 2.6.6, was removed. It has been replaced by two new methods: check_flags() and dir_notify().

  • nonseekable_open() was added as a way of indicating that a given file is not seekable.

  • wait_event_interruptible_exclusive() was added.

  • dma_get_required_mask() was added as a way for drivers to determine the optimal DMA mask.

  • Module section information was added under /sys/module, making it easier to use symbolic debuggers with modules.

  • The VFS follow_link() method saw some (compatible) changes. Filesystems should use the new symlink lookup method so that the kernel can, eventually, support a greater link depth. [MS]

(We are still in the process of filling in the earlier API changes - stay tuned).


Thanks to the following people who have helped keep this page current:

[GKH] Greg Kroah-Hartman
Michael Hayes
[MS] Miklos Szeredi

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds