Kernel development

Kernel release status

The current 2.6 prepatch remains 2.6.11-rc1. Linus's BitKeeper repository contains, as of this writing, some networking updates, an ALSA update (to version 1.0.8), some enhancements to the "circular pipe buffers" code introduced in -rc1 (see below), the ioctl() method rework (see below), in-inode extended attributes for ext3, and various fixes.

The current prepatch from Andrew Morton is 2.6.11-rc1-mm1. Recent additions to -mm include the Linux Trace Toolkit (LTT), relayfs, ext3 in-inode extended attributes (subsequently merged), the filesystems in user space (FUSE) patch set, an update to the random driver, and a copy of Dave Jones's "post-halloween" document (in the hope that somebody will be motivated to update it).

Andrew added LTT and relayfs with the explanation: "This is a discussion which needs to be had." The discussion has indeed been lively. Many developers see the value in this code, but object to the implementation. As a result, LTT and relayfs are likely to be slimmed down significantly, with more of the work shifted to user space or a separate loadable module. We may also see the Linux Kernel State Tracer patch submitted to -mm for comparison before the discussion is over.

The current 2.4 kernel is 2.4.29, released by Marcelo on January 19. One change was made since -rc3: the removal of one patch which was causing trouble. The changes since 2.4.28 are mostly bug fixes and driver updates; 2.4 is past the point of getting much in the way of new features.
Kernel development news

Quote of the week
Given that base 2.6 kernels are shipped by Linus with known unfixed
security holes anyone trying to use them really should be doing some
careful thinking. In truth no 2.6 released kernel is suitable for
anything but beta testing until you add a few patches anyway....
-- Alan Cox
I still think the 2.6 model works well because its making very good progress and then others are doing testing and quality management on it. Linus is doing the stuff he is good at and other people are doing the stuff he doesn't.
The new way of ioctl()

The ioctl() system call has long been out of favor among the kernel developers, who see it as a completely uncontrolled entry point into the kernel. Given the vast number of applications which expect ioctl() to be present, however, it will not go away anytime soon. So it is worth the trouble to ensure that ioctl() calls are performed quickly and correctly - and that they do not unnecessarily impact the rest of the system.

ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL). In the past, the usage of the BKL has made it possible for long-running ioctl() methods to create long latencies for unrelated processes. Recent changes, which have made BKL-covered code preemptible, have mitigated that problem somewhat. Even so, the desire to eventually get rid of the BKL altogether suggests that ioctl() should move out from under its protection.

Simply removing the lock_kernel() call before calling ioctl() methods is not an option, however. Each one of those methods must first be audited to see what other locking may be necessary for it to run safely outside of the BKL. That is a huge job, one which would be hard to do in a single "flag day" operation. So a migration path must be provided. As of 2.6.11, that path will exist.

The patch (by Michael S. Tsirkin) adds a new member to the file_operations structure:
long (*unlocked_ioctl) (struct file *filp, unsigned int cmd,
unsigned long arg);
If a driver or filesystem provides an unlocked_ioctl() method, it will be called in preference to the older ioctl(). The differences are that the inode argument is not provided (it's available as filp->f_dentry->d_inode) and the BKL is not taken prior to the call. All new code should be written with its own locking, and should use unlocked_ioctl(). Old code should be converted as time allows. For code which must run on multiple kernels, there is a new HAVE_UNLOCKED_IOCTL macro which can be tested to see if the newer method is available or not.

Michael's patch adds one other operation:
long (*compat_ioctl) (struct file *filp, unsigned int cmd,
unsigned long arg);
If this method exists, it will be called (without the BKL) whenever a 32-bit process calls ioctl() on a 64-bit system. It should then do whatever is required to convert the argument to native data types and carry out the request. If compat_ioctl() is not provided, the older conversion mechanism will be used, as before. The HAVE_COMPAT_IOCTL macro can be tested to see if this mechanism is available on any given kernel. The compat_ioctl() method will probably filter down into a few subsystems. Andi Kleen has posted patches adding new compat_ioctl() methods to the block_device_operations and scsi_host_template structures, for example, though those patches have not been merged as of this writing.
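The preference for unlocked_ioctl() over ioctl() can be illustrated with a small user-space sketch. Everything here is a stand-in: the structures are cut down to the fields discussed above, mock_lock_kernel() is a counter rather than the real BKL, dispatch_ioctl() is a hypothetical name for the calling logic, and the HAVE_UNLOCKED_IOCTL definition merely imitates what the kernel headers would provide. It is not the code from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for kernel structures (assumptions, not the real ones). */
struct inode { int i_dummy; };
struct file;

struct file_operations {
    /* Older method: the caller holds the BKL around it. */
    int (*ioctl)(struct inode *inode, struct file *filp,
                 unsigned int cmd, unsigned long arg);
    /* Newer method: called without the BKL, no inode argument. */
    long (*unlocked_ioctl)(struct file *filp, unsigned int cmd,
                           unsigned long arg);
};

struct file {
    struct inode *f_inode;  /* stands in for filp->f_dentry->d_inode */
    const struct file_operations *f_op;
};

/* Counters standing in for the Big Kernel Lock so its use can be observed. */
static int bkl_depth;
static int bkl_acquisitions;

static void mock_lock_kernel(void)   { bkl_depth++; bkl_acquisitions++; }
static void mock_unlock_kernel(void) { bkl_depth--; }

/* Hypothetical dispatcher: prefer unlocked_ioctl(), else ioctl() under the lock. */
static long dispatch_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    long ret = -ENOTTY;

    if (filp->f_op->unlocked_ioctl)
        return filp->f_op->unlocked_ioctl(filp, cmd, arg);

    if (filp->f_op->ioctl) {
        mock_lock_kernel();
        ret = filp->f_op->ioctl(filp->f_inode, filp, cmd, arg);
        mock_unlock_kernel();
    }
    return ret;
}

/* Hypothetical driver methods; each reports whether the mock lock was held. */
static int old_ioctl(struct inode *inode, struct file *filp,
                     unsigned int cmd, unsigned long arg)
{
    (void)inode; (void)filp; (void)cmd; (void)arg;
    return bkl_depth;   /* 1 when called through dispatch_ioctl() */
}

static long new_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    (void)filp; (void)cmd; (void)arg;
    return bkl_depth;   /* 0: the lock is not taken on this path */
}

/* A driver that only knows the old interface. */
static const struct file_operations old_fops = { .ioctl = old_ioctl };

/* Stand-in for the macro a new enough kernel's headers would define. */
#define HAVE_UNLOCKED_IOCTL 1

/* A driver meant to build on multiple kernels picks a method at compile time. */
static const struct file_operations new_fops = {
#ifdef HAVE_UNLOCKED_IOCTL
    .unlocked_ioctl = new_ioctl,
#else
    .ioctl = old_ioctl,
#endif
};
```

Calling dispatch_ioctl() on a file using old_fops takes the mock lock once; calling it on a file using new_fops does not touch the lock at all. A real unlocked_ioctl() method would need to supply whatever locking its own data structures require.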
The evolution of pipe buffers

Last week, this page looked at the new circular buffer structure used to implement Unix pipes in 2.6.11-rc1, and noted that the plan was to evolve that structure into something more general. Since then, Linus has taken a couple more steps; it must be time to catch up.

One change which has already been merged is the addition of a set of operations for pipe buffers:
struct pipe_buf_operations {
int can_merge;
void *(*map)(struct file *, struct pipe_inode_info *,
struct pipe_buffer *);
void (*unmap)(struct pipe_inode_info *, struct pipe_buffer *);
void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
};
The can_merge flag addresses one of the issues raised last week: coalescing of writes into existing pages in the buffer. If can_merge is non-zero, coalescing will be performed. Otherwise, each write to a pipe buffer will result in the creation of a new circular buffer entry, and, by default, the allocation of a new page.

The map() and unmap() methods are charged with controlling the visibility of pipe buffer pages in the kernel's virtual address space. The default map() operation for buffers implementing Unix pipes is quite simple:
static void *anon_pipe_buf_map(struct file *file,
struct pipe_inode_info *info,
struct pipe_buffer *buf)
{
return kmap(buf->page);
}
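Returning to the can_merge flag described above, its effect on writes can be sketched in user space. This is only an illustration under stated assumptions: the mock_* names are hypothetical, the page size is assumed to be 4096 bytes, the buffer is a plain array with no reader (so it never wraps around), and writes larger than a page are simply truncated. The kernel's actual pipe code is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

#define MOCK_PAGE_SIZE    4096  /* assumed page size */
#define MOCK_PIPE_BUFFERS 16    /* entries in the buffer */

/* Hypothetical cut-down operations structure: only the flag matters here. */
struct mock_buf_ops { int can_merge; };

struct mock_pipe_buffer {
    char page[MOCK_PAGE_SIZE];          /* stands in for an allocated page */
    size_t len;                         /* bytes of data in the page */
    const struct mock_buf_ops *ops;
};

struct mock_pipe {
    struct mock_pipe_buffer bufs[MOCK_PIPE_BUFFERS];
    unsigned int nrbufs;                /* entries currently in use */
};

/*
 * Write up to one page of data. If the most recent entry allows merging
 * and has room, the data is appended to its page; otherwise a new entry
 * (and thus a new page) is used. Returns the number of bytes written,
 * or 0 if no entry is free.
 */
static size_t mock_pipe_write(struct mock_pipe *pipe,
                              const struct mock_buf_ops *ops,
                              const char *data, size_t len)
{
    struct mock_pipe_buffer *buf;

    if (len > MOCK_PAGE_SIZE)
        len = MOCK_PAGE_SIZE;

    if (pipe->nrbufs > 0) {
        struct mock_pipe_buffer *last = &pipe->bufs[pipe->nrbufs - 1];

        if (last->ops->can_merge && last->len + len <= MOCK_PAGE_SIZE) {
            memcpy(last->page + last->len, data, len);
            last->len += len;
            return len;
        }
    }

    if (pipe->nrbufs == MOCK_PIPE_BUFFERS)
        return 0;

    buf = &pipe->bufs[pipe->nrbufs++];
    memcpy(buf->page, data, len);
    buf->len = len;
    buf->ops = ops;
    return len;
}
```

With can_merge set, two three-byte writes end up sharing one entry; with it clear, the same two writes consume two entries, each with its own page.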
Since the mapping operation has been abstracted out, there are now fewer assumptions regarding how data is really stored within a pipe buffer. This opens the door to different pipe implementations, such as pipes which implement a direct window into device memory. The release() method should clean things up when the pipe buffer is no longer needed.

Linus has also created an initial implementation of a splice() system call, though this work is clearly not ready for merging at this point. This system call looks like:
long sys_splice(int fdin, int fdout, size_t len, unsigned long flags);
fdin and fdout are two file descriptors; a call to sys_splice() will result in len bytes being copied from fdin to fdout, one of which is expected to be a pipe. The flags argument is not currently used by the sample implementation. To make sys_splice() work, Linus added two new methods to the ever-expanding file_operations structure:
ssize_t (*splice_write)(struct inode *in_pipe, struct file *out,
size_t len, unsigned long flags);
ssize_t (*splice_read)(struct file *in, struct inode *out_pipe,
size_t len, unsigned long flags);
The patch includes a generic splice_read() implementation suitable for filesystem-backed file descriptors. It simply populates the page cache with some pages from the file, then loads those pages into the pipe buffer represented by out_pipe. Like ordinary read() and write() methods, the splice variants can transfer fewer bytes than requested. Linus's version will stop at the maximum capacity of a pipe buffer - 16 pages, currently. As Linus acknowledges, there are a number of shortcomings to the current implementation - it is incomplete, the interfaces are ugly, and it will oops the system if anything goes wrong. It is, however, an indication of where he expects this work will lead. Stay tuned.
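Because a splice operation may move fewer bytes than requested, a caller wanting a full transfer has to loop, just as with read() and write(). The sketch below shows that pattern in user space. The names mock_splice() and splice_all() are hypothetical, the 4096-byte page size is an assumption, and the mock simply caps each call at 16 pages; it does not call the experimental sys_splice() described above.

```c
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

#define SKETCH_PAGE_SIZE  4096  /* assumed page size */
#define SKETCH_PIPE_PAGES 16    /* pipe buffer capacity noted in the text */

/* Signature for anything that transfers "up to len" bytes per call. */
typedef ssize_t (*splice_fn)(size_t len, void *ctx);

/*
 * Hypothetical stand-in for a splice-like call: it moves at most one
 * pipe buffer's worth of data and counts how often it was invoked.
 */
static ssize_t mock_splice(size_t len, void *ctx)
{
    size_t cap = (size_t)SKETCH_PIPE_PAGES * SKETCH_PAGE_SIZE;
    size_t *calls = ctx;

    (*calls)++;
    return (ssize_t)(len < cap ? len : cap);
}

/*
 * Keep calling fn until len bytes have moved, fn reports an error
 * (negative return), or fn returns 0 (nothing more to transfer).
 * Returns the bytes moved, or the error if nothing was moved at all.
 */
static ssize_t splice_all(splice_fn fn, void *ctx, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = fn(len - done, ctx);

        if (n < 0)
            return done ? (ssize_t)done : n;
        if (n == 0)
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With these assumptions, a 100000-byte request takes two calls to mock_splice(): the first moves 65536 bytes (16 pages) and the second the remaining 34464. With a real pipe, a reader would also have to drain the buffer between calls.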
API changes in the 2.6 kernel series

The 2.6 kernel development series differs from its predecessors in that much larger and potentially destabilizing changes are being incorporated into each release. Among these changes are modifications to the internal programming interfaces for the kernel, with the result that kernel developers must work harder to stay on top of a continually-shifting API. There has never been a guarantee of internal API stability within the kernel - even in a stable development series - but the rate of change is higher now.

This article will be updated to keep track of the internal changes for each 2.6 kernel release. Its permanent location is:
http://lwn.net/Articles/2.6-kernel-api/

This page will, doubtless, remain incomplete for a while. If you see an omission, please let us know by sending a note to kernel@lwn.net rather than by posting comments here. The chances of a prompt update are higher, the article will not become cluttered with redundant comments, and we'll be more than happy to credit you here.

If you are a Linux Device Drivers, Third Edition reader looking for information on changes since the book was published: LDD3 covers version 2.6.10 of the kernel, so only the changes starting with 2.6.11 are relevant.

Last update: January 5, 2006
2.6.15 (January 2, 2006)
2.6.14 (October 27, 2005)
2.6.13 (August 28, 2005)
2.6.12 (June 17, 2005)
2.6.11 (March 2, 2005)
2.6.10 (December 24, 2004)
2.6.9 (October 18, 2004)
2.6.8 (August 13, 2004)
(We are still in the process of filling in the earlier API changes - stay tuned).
Acknowledgements

Thanks to the following people who have helped keep this page current:
Patches and updates

Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds