Kernel development
Brief items
Kernel release status
The current extra-stable 2.6 release is 2.6.11.2, which was announced by Greg Kroah-Hartman on March 9.
The current 2.6 release remains 2.6.11; Linus has not yet released any 2.6.12 prepatch. About 1000 patches have been merged into his BitKeeper repository, however; they include numerous driver updates, the address space randomization patches, a new packet classifier mechanism for the networking layer, a new workqueue API function (see below), a new function (set_pte_at()) which is intended to replace set_pte() in the memory management code, a Tiger digest algorithm implementation, the restoration of the Philips webcam driver, some software suspend improvements, some readahead improvements, a big block I/O barrier rewrite (which enables full barrier support on serial ATA drives), a set of patches to shrink the kernel for embedded use, a generic sort() function, high-resolution POSIX CPU clock support (not the full high-resolution timers patch), a USB API change (usb_control_msg() and usb_bulk_msg() now take a timeout in milliseconds rather than in jiffies), and lots of fixes.
The current -mm kernel is 2.6.11-mm2. Recent changes to -mm include a reiser4 update, the Open-iSCSI driver, a new SELinux multi-level security implementation, the return of the real-time rlimit patch (yes, that discussion is going again), and a big set of NFS and FAT filesystem updates.
The current 2.4 prepatch is 2.4.30-pre3, released by Marcelo on March 9. It consists of some driver updates and a few fixes.
Kernel development news
Quotes of the week
The kernel gets a formal security contact
The Linux kernel has been nearly unique in that it has operated without any sort of formal security organization. Security-related patches would be sent to a (hopefully) relevant maintainer, who would (hopefully) get them merged into the mainline. With luck, distributors would notice the merging of security-related patches and issue the appropriate updates.
The whole system was somewhat unwieldy (though it worked most of the time), but, with this message from Chris Wright, things are beginning to change. There is now an official security contact address - security@kernel.org - which is distributed to a set of "security officers" who will take the appropriate action in response to security-related bugs. The people behind that alias, as of this writing, are Linus Torvalds, Andrew Morton, Alan Cox, Marcelo Tosatti, H. Peter Anvin, and Chris Wright.
The posting also includes a disclosure policy, which reads as:
So the mechanism is now in place. What remains to be seen is how well it works when the next security hole turns up.
A unified device number allocator
Traditionally, device drivers have added their devices to the system with calls to register_chrdev() or register_blkdev(). These functions served two purposes: allocating a portion of the device number space, and making specific devices available to user space. In 2.6, things changed a bit. For character devices, register_chrdev() was replaced by the combination of alloc_chrdev_region(), which allocates device numbers, and cdev_add(), which attaches a device to a specific number. On the block side, register_blkdev() has become optional, but it can still be used to allocate a block major number. The association of block devices with numbers is done with add_disk().
In other words, the allocation of device number space and the association of specific numbers with devices have been split in the 2.6 kernel. Matt Mackall was looking at the allocation side recently, where he noticed a fair amount of duplicated code between the char and block implementations. The current code is also unable to perform dynamic allocation of major numbers outside of the traditional 0..255 range. So Matt put together a patch which cleans things up a bit.
The new allocation scheme relies on simple linked lists. When a new device number request comes in, the code searches the (sorted) list to see if the request can be satisfied. If so, a new entry is added to the list, and the starting device number is returned. This work is done by the new function register_dev():
int register_dev(dev_t base, dev_t top, int size, const char *name, struct list_head *list, dev_t *ret);
This function requests that a range of size numbers be allocated from the given list. The first number should fall between base and top; if a suitable range is found, that first number will be returned in ret. The list is a simple list_head structure which is initially empty; the caller must provide locking to prevent concurrent calls to register_dev() using the same list.
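For readers who want to see the idea in miniature, here is a user-space sketch of a first-fit search over a sorted list of allocated ranges. It is not the code from Matt's patch: the names model_register_dev(), struct dev_range, and devnum_t are invented for this illustration, a plain singly-linked list stands in for list_head, and the lowest-fitting-range policy is an assumption.

```c
#include <stdlib.h>

/* Hypothetical user-space model; NOT the code from the actual patch. */
typedef unsigned int devnum_t;      /* stand-in for the kernel's dev_t */

struct dev_range {
    devnum_t first;                 /* first number in the range */
    unsigned int size;              /* how many numbers are allocated */
    const char *name;
    struct dev_range *next;         /* list is kept sorted by 'first' */
};

/*
 * Look for 'size' free consecutive numbers whose first value lies in
 * [base, top].  On success, store the first number in *ret, link a new
 * entry into the sorted list, and return 0; return -1 on failure.
 * As with the real interface, the caller is assumed to hold whatever
 * lock protects *list.  Arithmetic overflow is ignored in this sketch.
 */
int model_register_dev(devnum_t base, devnum_t top, unsigned int size,
                       const char *name, struct dev_range **list,
                       devnum_t *ret)
{
    struct dev_range **link = list;
    struct dev_range *entry;
    devnum_t candidate = base;

    if (size == 0 || base > top)
        return -1;

    /* Walk the ranges in ascending order until a large-enough gap appears. */
    while (*link) {
        struct dev_range *cur = *link;

        if (cur->first >= candidate && cur->first - candidate >= size)
            break;                  /* the gap before 'cur' fits */
        if (cur->first + cur->size > candidate)
            candidate = cur->first + cur->size;  /* skip past 'cur' */
        link = &cur->next;
    }
    if (candidate > top)
        return -1;                  /* first number would be out of bounds */

    entry = malloc(sizeof(*entry));
    if (!entry)
        return -1;
    entry->first = candidate;
    entry->size = size;
    entry->name = name;
    entry->next = *link;            /* insert here to keep the list sorted */
    *link = entry;
    *ret = candidate;
    return 0;
}
```

Starting from an empty list, a request for ten numbers with base 0 would get 0 back, and a following request for five would get 10. The search is linear in the number of existing allocations, which is the performance quibble discussed in the article.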
The new interface works; it also replaces a fair amount of common code in the char and block code. Other than some quibbles about potential performance problems resulting from the linear list search algorithm (which should not really matter, since device number allocation is a rare operation), there seem to be no real objections to the new scheme. So it may find its way into a -mm kernel before too long.
A future change would allow the dynamic allocation of device numbers in the expanded range; for now, dynamic major numbers are allocated from 254 in descending order, as has been done for many years. The patch also retains the register_chrdev() and register_blkdev() interfaces in a compatibility mode - even though both were essentially obsolete even before the change. At some point in the future, there may be an attempt to deprecate those interfaces; that move would force changes in a great many drivers.
Some 2.6.12 API changes
The workqueue interface allows kernel code to request that a function be called at a later time, in process context. It can thus be used to arrange for work which cannot be performed immediately, perhaps because the current thread is running in an atomic mode. It is also possible to queue delayed work requests which are guaranteed not to run for a caller-requested delay period.
Sometimes the need arises to cancel tasks which have been queued to a workqueue in a delayed mode. The function which performs this task is:
int cancel_delayed_work(struct work_struct *work);
This function attempts to intercept the given work before it runs and remove it from the queue. If it is successful, it returns a nonzero value. If, instead, cancel_delayed_work() returns zero, it means that the delayed work request was fired off before the call; it might, in fact, be running on another CPU when the cancel attempt is made. The caller usually needs to know that the work function is not running, so the standard procedure is to call flush_workqueue(), which waits until all tasks currently in the queue are run. After flush_workqueue() returns, the work function is guaranteed not to be running anywhere in the system.
There is one remaining obnoxious detail, however: what if the work function resubmits itself to the workqueue while it is running? In this case, that function could run again when the rest of the kernel least expects it - possibly after the module which contains that function has been removed from the kernel. That is the sort of race condition which gives kernel developers cold sweats. In general, this problem can be avoided by creating a "do not resubmit yourself" flag which is set before calling cancel_delayed_work(), but not all programmers make that effort.
In an attempt to make safe cancellation easier, Arjan van de Ven has added a new function to the workqueue API:
void cancel_rearming_delayed_work(struct work_struct *work);
The implementation is straightforward; at its core, this function does the following:
while (!cancel_delayed_work(work)) flush_workqueue(wq);
In other words, it simply keeps trying until it is able to catch the work request when it is not executing, and, thus, cannot resubmit itself. This approach works because it applies to delayed work - there has to be some time when the work request is sitting in the timer queue waiting to run. Sooner or later, the kernel is sure to catch it during that time and keep it from running again.
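The retry logic can be modeled outside the kernel. The following single-threaded toy is only an illustration under stated assumptions: every model_* name is hypothetical, the three-state work item is a simplification of what the kernel really tracks, and the early_fires counter is a made-up knob which simulates the timer expiring before the cancel attempt can catch it.

```c
/* Toy user-space model of the retry loop; all names are hypothetical. */
enum work_state { WORK_IDLE, WORK_TIMER_PENDING, WORK_QUEUED };

struct model_work {
    enum work_state state;
    int runs;           /* how many times the handler has executed */
    int early_fires;    /* times the timer beats the cancel attempt */
};

/* A handler which always re-arms itself, as a polling routine might. */
static void model_handler(struct model_work *w)
{
    w->runs++;
    w->state = WORK_TIMER_PENDING;
}

/* Succeeds (returns 1) only while the request sits in the timer queue. */
static int model_cancel_delayed_work(struct model_work *w)
{
    if (w->state != WORK_TIMER_PENDING)
        return 0;
    w->state = WORK_IDLE;
    return 1;
}

/* Run anything already queued, then let the timer "win" if so configured. */
static void model_flush_workqueue(struct model_work *w)
{
    if (w->state == WORK_QUEUED)
        model_handler(w);
    if (w->state == WORK_TIMER_PENDING && w->early_fires > 0) {
        w->early_fires--;
        w->state = WORK_QUEUED;
    }
}

/*
 * Keep trying until the request is caught in the timer queue.  Like the
 * loop it imitates, this only terminates for work which is pending or
 * which re-arms itself; an idle, non-rearming item would spin forever.
 */
void model_cancel_rearming_delayed_work(struct model_work *w)
{
    while (!model_cancel_delayed_work(w))
        model_flush_workqueue(w);
}
```

In this model, a work item which starts out queued and whose timer wins the race twice runs three more times before the cancel finally succeeds; one which is caught in the timer queue on the first attempt does not run at all.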
The new function has been merged for 2.6.12.
Meanwhile, there are two functions which are used by drivers to send messages to USB peripherals:
int usb_bulk_msg(struct usb_device *usb_dev, unsigned int pipe, void *data, int len, int *actual_length, int timeout);
int usb_control_msg(struct usb_device *dev, unsigned int pipe, __u8 request, __u8 requesttype, __u16 value, __u16 index, void *data, __u16 size, int timeout);
In 2.6.11 and prior kernels, the timeout value is expressed in jiffies; for 2.6.12, the units of that parameter have been changed to milliseconds. Dozens of patches were merged to bring in-tree drivers up to the new version of the interface, but out-of-tree drivers will need to be changed explicitly. The situation is complicated a bit by the fact that the prototypes of the functions did not change, so the compiler will not flag callers which have not been updated.
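The arithmetic involved in updating a caller is simple; it can be sketched in user space as follows. The helper name ticks_to_msecs() is invented for this example, the tick rate is passed in explicitly rather than taken from the kernel's HZ constant, and rounding up (so that a nonzero tick count never silently becomes zero) is a choice made here, not something dictated by the USB API.

```c
/*
 * Hypothetical helper: convert a timeout expressed in clock ticks
 * (jiffies) into milliseconds, given the tick rate 'hz'.  Rounds up;
 * returns 0 if 'hz' is 0 to avoid dividing by zero.
 */
unsigned int ticks_to_msecs(unsigned long ticks, unsigned int hz)
{
    unsigned long long ms;

    if (hz == 0)
        return 0;
    ms = ((unsigned long long)ticks * 1000ULL + hz - 1) / hz;
    return (unsigned int)ms;
}
```

A driver which used to pass a five-second timeout as 5*HZ would, under the new interface, pass 5000 regardless of the tick rate; with a tick rate of 250, ticks_to_msecs(5 * 250, 250) yields that same 5000.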
Finally, David Howells has changed the rwsem implementation to use interrupt-disabling spinlocks. This change should be transparent to most callers. Anybody who calls down_read() or down_write() with interrupts already disabled will be in for a surprise, however. There should be no such callers, since those functions can sleep, but one never knows...
Linux Kernel Development, Second Edition
The second edition of Robert Love's Linux Kernel Development is out. Actually, it has been out for a month or two, but your editor's copy has only just arrived. It should be noted that your editor is the author of a book which could be seen, by some, as a competitor to Mr. Love's work, and [...]
Seriously, though, the first edition of Linux Kernel Development was reviewed here in November, 2003. It was, at that time, the only book covering version 2.6 of the kernel, and it did a good job of it. The coverage was not always as deep as one might like, but it was broad, touching on most parts of the kernel. It was, beyond doubt, a book that belonged on every kernel hacker's bookshelf.
The second edition has not messed with that format very much. The book now appears under the Novell Press imprint, but Novell does not appear to have called for any changes. So the basic structure of the book remains the same. The introductory chapter has been split into two, with some additional information on obtaining and building the kernel. There are two completely new chapters; the first looks at working with modules, and the other is a low-level introduction to kobjects and sysfs. The new chapters, like the existing material, are clearly and accurately written. Beyond that, the table of contents reads much like it did in the first edition.
Arguably, the most significant change is that the entire book has been updated to the 2.6.10 kernel. As readers of the LWN Kernel Page are aware, much has changed inside the kernel since the 2.6.0-test release which was the base for the first edition. It was time for an update, and Robert has done it with style. Your editor feels confident in saying that the second edition, once again, belongs on every kernel hacker's bookshelf. Then the first edition can be demoted to paperweight duty.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet