Kernel development [LWN.net]

Kernel release status

The current extra-stable 2.6 release is 2.6.11.2, which was announced by Greg Kroah-Hartman on March 9.

The current 2.6 release remains 2.6.11; Linus has not yet released any 2.6.12 prepatch. About 1000 patches have been merged into his BitKeeper repository, however; they include numerous driver updates, the address space randomization patches, a new packet classifier mechanism for the networking layer, a new workqueue API function (see below), a new function (set_pte_at()) which is intended to replace set_pte() in the memory management code, a Tiger digest algorithm implementation, the restoration of the Philips webcam driver, some software suspend improvements, some readahead improvements, a big block I/O barrier rewrite (which enables full barrier support on serial ATA drives), a set of patches to shrink the kernel for embedded use, a generic sort() function, high-resolution POSIX CPU clock support (not the full high-resolution timers patch), a USB API change (usb_control_msg() and usb_bulk_msg() now take a timeout in milliseconds rather than in jiffies), and lots of fixes.

The current -mm kernel is 2.6.11-mm2. Recent changes to -mm include a reiser4 update, the Open-iSCSI driver, a new SELinux multi-level security implementation, the return of the real-time rlimit patch (yes, that discussion is going again), and a big set of NFS and FAT filesystem updates.

The current 2.4 prepatch is 2.4.30-pre3, released by Marcelo on March 9. It consists of some driver updates and a few fixes.

Quotes of the week

I want to have people test things out, but it doesn't matter how many -rc kernels I'd do, it just won't happen. It's not a "real release".

-- Linus Torvalds

It's nice that patches are called "fix the frobnozzle gadget", but this analysis would be a lot easier if people would also label their patches "break the frobnozzle gadget" when that's what they do. Oh well

-- Andrew Morton

I don't think 2.2 and 2.4 models are applicable any more. There are more of us, we're better (and older) than we used to be, we're better paid (and hence able to work more), our human processes are better and the tools are better. This all adds up to a qualitative shift in the rate and accuracy of development. We need to take this into account when thinking about processes.

-- Andrew Morton

I think we should call the tree the "sucker tree", and if somebody wants to make a logo for it, make it be a penguin with a jokers' hat: exactly to remind people that it's not about the glory.

-- Linus Torvalds

The kernel gets a formal security contact

The Linux kernel has been nearly unique in that it has operated without any sort of formal security organization. Security-related patches would be sent to a (hopefully) relevant maintainer, who would (hopefully) get them merged into the mainline. With luck, distributors would notice the merging of security-related patches and issue the appropriate updates.

The whole system was somewhat unwieldy (though it worked most of the time), but, with this message from Chris Wright, things are beginning to change. There is now an official security contact address - security@kernel.org - which is distributed to a set of "security officers" who will take the appropriate action in response to security-related bugs. The people behind that alias, as of this writing, are Linus Torvalds, Andrew Morton, Alan Cox, Marcelo Tosatti, H. Peter Anvin, and Chris Wright.

The posting also includes a disclosure policy, which reads as:

The goal of the Linux kernel security team is to work with the bug submitter to bug resolution as well as disclosure. We prefer to fully disclose the bug as soon as possible. It is reasonable to delay disclosure when the bug or the fix is not yet fully understood, the solution is not well-tested or for vendor coordination. However, we expect these delays to be short, measurable in days, not weeks or months. A disclosure date is negotiated by the security team working with the bug submitter as well as vendors. However, the kernel security team holds the final say when setting a disclosure date. The timeframe for disclosure is from immediate (esp. if it's already publically known) to a few weeks. As a basic default policy, we expect report date to disclosure date to be on the order of 7 days.

So the mechanism is now in place. What remains to be seen is how well it works when the next security hole turns up.

A unified device number allocator

Traditionally, device drivers have added their devices to the system with calls to register_chrdev() or register_blkdev(). These functions served two purposes: allocating a portion of the device number space, and making specific devices available to user space. In 2.6, things changed a bit. For character devices, register_chrdev() was replaced by the combination of alloc_chrdev_region(), which allocates device numbers, and cdev_add(), which attaches a device to a specific number. On the block side, register_blkdev() has become optional, but it can still be used to allocate a block major number. The association of block devices with numbers is done with add_disk().

In other words, the allocation of device number space and the association of specific numbers with devices have been split in the 2.6 kernel. Matt Mackall was looking at the allocation side recently, where he noticed a fair amount of duplicated code between the char and block implementations. The current code is also unable to perform dynamic allocation of major numbers outside of the traditional 0..255 range. So Matt put together a patch which cleans things up a bit.

The new allocation scheme relies on simple linked lists. When a new device number request comes in, the code searches the (sorted) list to see if the request can be satisfied. If so, a new entry is added to the list, and the starting device number is returned. This work is done by the new function register_dev():

    int register_dev(dev_t base, dev_t top, int size, const char *name,
                     struct list_head *list, dev_t *ret);

This function requests that a range of size numbers be allocated from the given list. The first number should fall between base and top; if a suitable range is found, that first number will be returned in ret. The list is a simple list_head structure which is initially empty; the caller must provide locking to prevent concurrent calls to register_dev() using the same list.
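The search-and-insert step described above can be illustrated with a small userspace sketch. This is not Matt's patch: the names range_alloc() and struct dev_range are hypothetical, plain unsigned integers stand in for dev_t, a hand-rolled singly linked list stands in for list_head, and the sketch always picks the lowest free range rather than following any particular allocation order.

```c
#include <assert.h>
#include <stdlib.h>

/* One allocated range [start, start + size). A hypothetical userspace
 * stand-in for an entry on the kernel's list_head-based list. */
struct dev_range {
    unsigned int start;
    unsigned int size;
    const char *name;
    struct dev_range *next;     /* list is kept sorted by start */
};

/*
 * Find the lowest free run of `size` numbers whose first number lies in
 * [base, top], link it into the sorted list, and store that first number
 * in *ret. Returns 0 on success, -1 if nothing fits or memory runs out.
 * As with register_dev(), the caller must serialize access to *list.
 */
int range_alloc(unsigned int base, unsigned int top, unsigned int size,
                const char *name, struct dev_range **list,
                unsigned int *ret)
{
    struct dev_range **link = list;
    unsigned int candidate = base;
    struct dev_range *entry;

    assert(size > 0);

    /* Linear walk: push the candidate past every range it would overlap. */
    while (*link) {
        struct dev_range *cur = *link;

        if (candidate + size <= cur->start)
            break;              /* the gap in front of cur is big enough */
        if (cur->start + cur->size > candidate)
            candidate = cur->start + cur->size;
        link = &cur->next;
    }
    if (candidate > top)
        return -1;

    entry = malloc(sizeof(*entry));
    if (!entry)
        return -1;
    entry->start = candidate;
    entry->size = size;
    entry->name = name;
    entry->next = *link;        /* insert in sorted position */
    *link = entry;
    *ret = candidate;
    return 0;
}
```

Starting from an empty list, two requests for ten numbers anywhere in 0..255 would return 0 and 10, while a later request pinned to a single starting number (base and top both 5) would fail because that number is already taken. The walk is linear in the number of existing allocations, which is the performance quibble mentioned below.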

The new interface works; it also replaces a fair amount of duplicated code in the char and block subsystems. Other than some quibbles about potential performance problems resulting from the linear list search algorithm (which should not really matter, since device number allocation is a rare operation), there seem to be no real objections to the new scheme. So it may find its way into a -mm kernel before too long.

A future change would allow the dynamic allocation of device numbers in the expanded range; for now, dynamic major numbers are allocated from 254 in descending order, as has been done for many years. The patch also retains the register_chrdev() and register_blkdev() interfaces in a compatibility mode - even though both were essentially obsolete even before the change. At some point in the future, there may be an attempt to deprecate those interfaces; that move would force changes in a great many drivers.

Some 2.6.12 API changes

The workqueue interface allows kernel code to request that a function be called at a later time, in process context. It can thus be used to arrange for work which cannot be performed immediately, perhaps because the current thread is running in an atomic mode. It is also possible to queue delayed work requests which are guaranteed not to run for a caller-requested delay period.

Sometimes the need arises to cancel tasks which have been queued to a workqueue in a delayed mode. The function which performs this task is:

    int cancel_delayed_work(struct work_struct *work);

This function attempts to intercept the given work before it runs and remove it from the queue. If it is successful, it returns a nonzero value. If, instead, cancel_delayed_work() returns zero, it means that the delayed work request was fired off before the call; it might, in fact, be running on another CPU when the cancel attempt is made. The caller usually needs to know that the work function is not running, so the standard procedure is to call flush_workqueue(), which waits until all tasks currently in the queue are run. After flush_workqueue() returns, the work function is guaranteed not to be running anywhere in the system.

There is one remaining obnoxious detail, however: what if the work function resubmits itself to the workqueue while it is running? In this case, that function could run again when the rest of the kernel least expects it - possibly after the module which contains that function has been removed from the kernel. That is the sort of race condition which gives kernel developers cold sweats. In general, this problem can be avoided by creating a "do not resubmit yourself" flag which is set before calling cancel_delayed_work(), but not all programmers make that effort.

In an attempt to make safe cancellation easier, Arjan van de Ven has added a new function to the workqueue API:

    void cancel_rearming_delayed_work(struct work_struct *work);

The implementation is straightforward; at its core, this function does the following:

	while (!cancel_delayed_work(work))
		flush_workqueue(wq);

In other words, it simply keeps trying until it is able to catch the work request when it is not executing, and, thus, cannot resubmit itself. This approach works because it applies to delayed work - there has to be some time when the work request is sitting in the timer queue waiting to run. Sooner or later, the kernel is sure to catch it during that time and keep it from running again.
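To see why the loop terminates, it can help to step through a toy, single-threaded model of it. Everything below is hypothetical userspace code with made-up names (struct toy_work and the toy_* functions); it only mimics the states a self-rearming delayed work item passes through and makes no claim about how the kernel implements any of this.

```c
#include <assert.h>
#include <stdbool.h>

/* Where a delayed work item can be in this toy model. */
enum work_state {
    WORK_IDLE,              /* cancelled, will not run again */
    WORK_TIMER_PENDING,     /* waiting out its delay in the timer queue */
    WORK_QUEUED             /* timer fired; waiting to run (or running) */
};

struct toy_work {
    enum work_state state;
    int runs;               /* how many times the work function ran */
};

/* The work function: does its job, then resubmits itself with a delay. */
static void toy_work_fn(struct toy_work *w)
{
    w->runs++;
    w->state = WORK_TIMER_PENDING;
}

/* Models cancel_delayed_work(): succeeds only while the timer is pending. */
bool toy_cancel_delayed_work(struct toy_work *w)
{
    if (w->state == WORK_TIMER_PENDING) {
        w->state = WORK_IDLE;
        return true;
    }
    return false;
}

/* Models flush_workqueue(): runs whatever is currently queued. */
void toy_flush_workqueue(struct toy_work *w)
{
    if (w->state == WORK_QUEUED)
        toy_work_fn(w);
}

/* The retry loop from the text; returns how many flushes were needed. */
int toy_cancel_rearming(struct toy_work *w)
{
    int flushes = 0;

    /* Only meaningful for work that is pending or queued and re-arms. */
    assert(w->state != WORK_IDLE);

    while (!toy_cancel_delayed_work(w)) {
        toy_flush_workqueue(w);
        flushes++;
    }
    return flushes;
}
```

In this model, work that is caught in the timer queue is cancelled immediately, while work whose timer has already fired needs one flush: the function runs once more, re-arms itself, and is then caught on the next pass. On a real system the timer can fire again between the flush and the next cancel attempt, so more passes may be needed.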

The new function has been merged for 2.6.12.

Meanwhile, there are two functions which are used by drivers to send messages to USB peripherals:

    int usb_bulk_msg(struct usb_device *usb_dev, unsigned int pipe,
                     void *data, int len, int *actual_length,
                     int timeout);

    int usb_control_msg(struct usb_device *dev, unsigned int pipe,
                        __u8 request, __u8 requesttype,
                        __u16 value, __u16 index,
                        void *data, __u16 size, int timeout);

In 2.6.11 and prior kernels, the timeout value is expressed in jiffies; for 2.6.12, the units of that parameter have been changed to milliseconds. Dozens of patches were merged to bring in-tree drivers up to the new version of the interface, but out-of-tree drivers will need to be changed explicitly. The situation is complicated a bit by the fact that the prototype of the function did not change, so the compiler will not flag callers which have not been updated.
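The practical effect on a driver which has not been updated depends on the tick rate. The following userspace sketch shows the arithmetic only; the helper names are hypothetical, and the hz parameter stands in for the kernel's HZ constant.

```c
#include <assert.h>

/* Convert a timeout in jiffies to milliseconds for a given tick rate.
 * `hz` stands in for the kernel's compile-time HZ value. */
unsigned int timeout_jiffies_to_ms(unsigned long jiffies, unsigned int hz)
{
    assert(hz > 0);
    return (unsigned int)((unsigned long long)jiffies * 1000ULL / hz);
}

/* What an un-updated caller gets from the new interface: the number it
 * computed in jiffies is simply interpreted as milliseconds. */
unsigned int stale_caller_wait_ms(unsigned long old_timeout_jiffies)
{
    return (unsigned int)old_timeout_jiffies;
}
```

A driver that used to pass 5*HZ on a system ticking 100 times per second was asking for 500 jiffies, or 5000 milliseconds; under the new interface the same value of 500 yields a half-second timeout. With a tick rate of 1000 the two interpretations happen to coincide, which could hide the problem during testing.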

Finally, David Howells has changed the rwsem implementation to use interrupt-disabling spinlocks. This change should be transparent to most callers. Anybody who calls down_read() or down_write() with interrupts already disabled will be in for a surprise, however. There should be no such callers, since those functions can sleep, but one never knows...

Linux Kernel Development, Second Edition

The second edition of Robert Love's Linux Kernel Development is out. Actually, it has been out for a month or two, but your editor's copy has only just arrived. It should be noted that your editor is the author of a book which could be seen, by some, as a competitor to Mr. Love's work, and thus might be biased in what he writes. Let it be known, however, that your editor would never let such concerns get in the way of a fair review. Linux Kernel Development really is only suitable for paperweight duty, and, even then, only until the cheesy binding gives out.

Seriously, though, the first edition of Linux Kernel Development was reviewed here in November, 2003. It was, at that time, the only book covering version 2.6 of the kernel, and it did a good job of it. The coverage was not always as deep as one might like, but it was broad, touching on most parts of the kernel. It was, beyond doubt, a book that belonged on every kernel hacker's bookshelf.

The second edition has not messed with that format very much. The book now appears under the Novell Press imprint, but Novell does not appear to have called for any changes. So the basic structure of the book remains the same. The introductory chapter has been split into two, with some additional information on obtaining and building the kernel. There are two completely new chapters; the first looks at working with modules, and the other is a low-level introduction to kobjects and sysfs. The new chapters, like the existing material, are clearly and accurately written. Beyond that, the table of contents reads much like it did in the first edition.

Arguably, the most significant change is that the entire book has been updated to the 2.6.10 kernel. As readers of the LWN Kernel Page are aware, much has changed inside the kernel since the 2.6.0-test release which was the base for the first edition. It was time for an update, and Robert has done it with style. Your editor feels confident in saying that the second edition, once again, belongs on every kernel hacker's bookshelf. Then the first edition can be demoted to paperweight duty.

Patches and updates

Andrew Morton: 2.6.11-mm1

Andrew Morton: 2.6.11-mm2

Greg KH: Linux 2.6.11.1

Greg KH: Linux 2.6.11.2

Alan Cox: Linux 2.6.11-ac1

Alan Cox: PATCH: 2.6.11-ac2

Con Kolivas: 2.6.11-ck2

Marcelo Tosatti: Linux 2.4.30-pre3

Willy Tarreau: linux-2.4.29-hf4

Jake Moilanen: No-exec support for ppc64

Peter Williams: PlugSched-3.0.2 for 2.6.11

Pavel Machek: swsusp: allow resume from initramfs

Christoph Lameter: del_timer_sync scalability patch

David Howells: rwsem: Make rwsems use interrupt disabling spinlocks

Keith Owens: Announce: kdb v4.4 is available for kernel 2.6.11

Marty Ridgeway: March Release of LTP now available

Tom Zanussi: relayfs for linux-2.6.11-mm2

Jeff Garzik: netdev-2.6 queue updated

Jeff Garzik: 2.6.x net driver updates

Jeff Garzik: libata-dev-2.6 queue updated

Jeff Garzik: libata-dev queue updated

Jeff Garzik: 2.6.x libata updates

Jeff Garzik: starfire net driver update

Pierre Ossman: Secure Digital (SD) support

Greg KH: I2C patches for 2.6.11

Greg KH: PCI update for 2.6.11

Jean Tourrilhes: IrDA patches for 2.6.12-rc1

Wim Van Sebroeck: Watchdog v2.6.11 patches

Alex Aizman: Open-iSCSI High-Performance Initiator for Linux

Matthew Wilcox: ncr53c8xx updates

Bagalkote, Sreenivas: [ANNOUNCE][PATCH 2.6.11 2/3] megaraid_sas: Announcing new module for LSI Logic's SAS based MegaRAID controllers

Adam Belay: PCI bridge driver rewrite (rev 02)

Matt Mackall: unified device list allocator

Bartlomiej Zolnierkiewicz: ide-dev-2.6 update

Alan Cox: PATCH: Ressurrect the esp serial driver

Greg KH: USB update for 2.6.11

Russell King: Hotplug parallel ports

Andy Fleming: RFC: PHY Abstraction Layer II

Nick Sillik: [PATCH] To add usabillity for the Maxtor Onetouch button on External Hard-drives

Corey Minyard: kref docs, take 2

Chris Wright: Security contact info

Badari Pulavarty: 2.6.11-mm1 "nobh" support for ext3 writeback mode

Badari Pulavarty: 2.6.11-mm1 ext3 writepages support for writeback mode

Robert Love: inotify for 2.6.11

Evgeniy Polyakov: bd: Asynchronous block device

Stephen Hemminger: rearrange netdevice structure to save space

David Howells: BDI: Provide backing device capability information

David Howells: BDI: Improve nommu mmap support

Mel Gorman: 0/2 Buddy allocator with placement policy (Version 9) + prezeroing (Version 4)

Mel Gorman: 1/2 Avoiding external fragmentation with a placement policy Version 9

Mel Gorman: 2/2 Prezeroing large blocks of pages during allocation Version 4

Stephen Smalley: Enhanced MLS support

Stephen Smalley: [PATCH][LSM/SELINUX] Pass requested protection to security_file_mmap/mprotect hooks

Evgeniy Polyakov: Acrypto - asynchronous crypto layer for linux kernel 2.6

David Howells: keys: Discard key spinlock and use RCU for key payload

Netfilter Core Team: Release of iptables-1.3.1
