Kernel development

Brief items

Kernel release status

The 2.6.29 merge window is open, so there is no development kernel release as of this writing. Quite a bit of work has been merged for 2.6.29; see the separate article below for details.

The current stable 2.6 kernel is 2.6.28, released by Linus on December 24. Highlights of this kernel include the addition of the GEM GPU memory manager, the removal of the "experimental" label from the ext4 filesystem, scalability improvements in memory management via the reworked vmap() and pageout scalability patches, the move of the -staging drivers into the mainline, and much more. See the excellent KernelNewbies summary for lots more details about 2.6.28. Says Linus: "In fact, even _if_ you have friends or family, leave them to their endless toil over that christmas ham or turkey, and during the night, when they're asleep, you can give them that magical present of a newly updated computer. When they wake up tomorrow morning, tell them how you saw Santa crawl down the chimney with his USB stick in hand, updating the OS of all good boys and girls."

Kernel development news

Quotes of the week

The software design moral: Everything is shit and will attempt to kill you when you're not looking.
-- Matthew Garrett

I don't believe "auto-destroy my music collection" is a sane default.
-- Alan Cox

BTW, the current influx of higher-complexity filesystems certainly worries me a little.
-- Christoph Hellwig

Can you post the patch, so that we can see if we can find some silly error that we can ridicule you over?
-- Linus Torvalds (Thanks to Jeff Schroeder)

There's a lot of stuff here, as can be seen by the final diffstat number:
779 files changed, 472695 insertions(+), 26479 deletions(-)
and yes, it's all crap :)
-- Greg Kroah-Hartman

I will just note wryly that it used to be that I could compile 0.9x kernels on a 40 MHz 386 machine in 10 minutes. Some 15 years later, it still takes roughly the same amount of time to compile a kernel, even though computers have gotten vastly faster since then. Something seems wrong with that....
-- Ted Ts'o

2.6.29 merge window, part 1

By Jonathan Corbet
January 7, 2009

As of this writing, some 6500 non-merge changesets have been accepted for the 2.6.29 development cycle. There is the usual set of new device drivers, combined with a number of important core kernel changes.

User-visible changes include:

  • New drivers for SH-2A FPU based SH7201 processors, Palm T|X, T5 and LifeDrive audio devices, Gumstix Overo audio devices, Marvell Zylonite audio devices, Wolfson Micro TWL4030, UDA134x, WM8350 AudioPlus, and WM8728 codecs, Texas Instruments SDP3430 audio devices, OMAP3 Pandora audio devices, Intel G45 integrated HDMI audio codecs, Broadcom BCM50610 network PHYs, LSI ET1011C PHYs, KS8695 Ethernet devices, SMSC LAN9420 PCI Ethernet adapters, SMSC LAN911x and LAN921x embedded Ethernet controllers, Solarflare 10Xpress SFT9001 network controllers, Atheros AR9285 chipsets, Solos ADSL2+ PCI Multiport cards, Nuvoton W90X900 CPUs, LG ATSC lgdt3304 video capture devices, Sharp s921 ISDB-T devices, ST Microelectronics STB6100 silicon tuners and STB0899 multistandard frontend devices, ST STV06XX-based cameras, TDA8261 8PSK/QPSK tuners, OmniVision ov772x cameras, Conexant CX24113/CX24128 tuners, Texas Instruments TVP514x video decoders, OMAP2 camera devices (as seen in Nokia Internet tablets), NXP TEA5764 I2C FM radio devices, Chelsio T3 ASIC based iSCSI adapters, Wolfson Microelectronics WM8350 power management units, Dialog DA9030 battery chargers, DaVinci DM355 EVM microcontrollers, Intel 5400 (Seaburg) memory controller chipsets, Walkera WK-0701 RC transmitters, Wacom W8001 penabled serial touchscreens, Dialog Semiconductor DA9034 touchscreens, TSC2007 based touchscreens, PXA930 trackball mice, and PXA930/PXA935 enhanced rotary controllers.

  • A number of new drivers have also entered the kernel via the staging tree; these include drivers for Sensoray 2250/2251 video capture devices, Airgo AGNX00 wireless chips, a wide variety of data acquisition devices via the Comedi framework, ASUS laptop OLED displays, Ralink 2860 and 2870 wireless interfaces ("This is the Ralink RT2860 driver from the company that does horrible things like reading a config file from /etc."), RealTek RTL8187SE Wireless LAN NICs, HD44780 or KS-0074 parallel port LCD panels, ServerEngines BladeEngine (EC 3210) network interfaces, Princeton Instruments USB cameras, Mimio Xi interactive whiteboards, the openPOWERLINK network stack, Frontier Tranzport and Alphatrack devices, and several families of Meilhaus data acquisition boards. Also added, seemingly without help from Google, is a set of drivers for the Android platform, including support for the /dev/binder IPC mechanism, timed GPIO operations, the RAM buffer console, a special "low memory killer" device, and the logger device.

    Remember that "staging" drivers are not considered to be up to normal kernel code quality standards; they are merged in the hope that developers will help to make them better. Quite a few improvements to these drivers were merged via the staging tree this time around, so this tree may be working as intended.

  • The long-deprecated eepro100 driver has finally been removed; the e100 driver should be used instead.

  • The SCSI layer has acquired support for Fibre Channel over Ethernet (FCoE) devices.

  • The GEM layer used for memory management in graphics processing unit (GPU) driver code has seen a number of improvements. The big news in this area, though, is that the kernel mode setting code has finally been merged. This change paves the way for the removal of a great deal of scary user-space code, better support for features like fast user switching, and the ability to run the X server without root privilege. Kernel mode setting is still in an early state, though, and most people will not want to enable it unless they are sure they have a properly-prepared user space.

  • Support for HP iPAQ h5000 systems, Samsung S3C64XX series based systems, and Pandora game consoles has been added to the ARM architecture code.

  • The SuperH architecture has gained support for the ftrace tracing framework.

  • There is a new no_file_caps= boot option which can be used to disable file capabilities on kernels which have that feature enabled. From the changelog: "This allows distributions to ship a kernel with file capabilities compiled in, without forcing users to use (and understand and trust) them."

  • The CIFS filesystem supports a new forcemand mount option; when present, it causes CIFS to use mandatory locks rather than POSIX-style advisory locks.

  • The CUBIC 2.3 TCP congestion control algorithm and the "backward congestion notification" feature are now supported in the networking layer.

  • The network code has support for the "deficit round robin" packet scheduling algorithm, said to produce highly fair scheduling with minimal cost.

  • A vast set of network namespace patches has been merged. The namespace hackers have, so far, refrained from saying that this feature is ready for general use, but it must be getting closer.

  • The devpts filesystem now supports the creation of multiple instances in different namespaces.

  • The wireless regulatory domain code has been extended to provide 802.11d support.

  • The Tree RCU patch set, which should provide improved scalability on systems with "more than a few hundred CPUs," has been merged.

  • Users of huge pages can now look in /proc/pid/smaps for a new KernelPageSize value giving the actual size of the pages in use. Among other things, this information can be used to verify that a process is actually using large pages where expected.

  • The eCryptfs filesystem now supports the encrypting of file names as well as their contents.

  • The FUSE user-space filesystem mechanism can now support ioctl() and poll() calls.

  • Support for unlabeled networks and hosts has been added to the SMACK security module.
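
The KernelPageSize value mentioned in the list above is easy to check from user space. What follows is a minimal sketch, assuming the usual "Name:   value kB" layout of smaps lines; the function names are invented for this illustration and are not part of any kernel or library interface.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line of /proc/<pid>/smaps. Returns the page size in kB if
 * the line is a KernelPageSize entry, or -1 for any other line.
 */
long kernel_page_size_kb(const char *line)
{
    long kb;

    assert(line != NULL);
    if (sscanf(line, "KernelPageSize: %ld kB", &kb) == 1)
        return kb;
    return -1;
}

/*
 * Scan an smaps file and return the largest KernelPageSize found, in kB.
 * Returns -1 if the file cannot be opened or has no such entries.
 */
long max_kernel_page_size_kb(const char *smaps_path)
{
    char line[512];
    long best = -1;
    FILE *f = fopen(smaps_path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        long kb = kernel_page_size_kb(line);

        if (kb > best)
            best = kb;
    }
    fclose(f);
    return best;
}
```

Calling max_kernel_page_size_kb("/proc/self/smaps") on a 2.6.29 or later kernel should report a value larger than the base page size only if the process has at least one huge-page mapping.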

Changes visible to kernel developers include:

  • There is a new synchronous hash interface called "shash." It simplifies the use of synchronous hash operations while allowing the same tfm to be used simultaneously in different threads. All in-tree users have been switched to the new API.

  • The massive task credentials patch set has been merged. This code reorganizes the handling of process credentials (user ID, capabilities, etc.). One of the immediate implications of this change is that direct references to credential-oriented fields in the task structure need to be changed; for example, current->user->uid becomes current_uid(). See Documentation/credentials.txt for a description of the new API.

  • The ftrace code has seen a lot of internal changes. The function tracing feature has seen a number of improvements, and the developers have added mechanisms to profile the behavior of if statements, provide function call graphs, obtain user-space stack traces, and follow CPU power-state transitions.

  • Most of the callback functions/methods associated with the net_device structure have been moved out of that structure and into the new struct net_device_ops. In-tree drivers have been converted to the new API.

  • The priv field has been removed from struct net_device; drivers should use netdev_priv() instead.

  • The generic PHY layer now has power management support. To that end, two new methods - suspend() and resume() - have been added to struct phy_driver.

  • The networking layer now supports large receive offload (or "generic receive offload") operation.

  • The NAPI API has been cleaned up somewhat; in particular, functions like netif_rx_schedule(), netif_rx_schedule_prep(), and netif_rx_complete() have lost the unneeded struct net_device parameter.

  • The hrtimer code has been simplified with the removal of variable modes for callback functions. All processing is now done in hardirq context.

  • A new set of LSM hooks has been added; these support pathname-based security operations. With the merging of these hooks, one major obstacle to the inclusion of security modules like AppArmor and TOMOYO has been removed.

  • The kernel will now refuse to build with GCC 4.1.0 or 4.1.1; those versions have unfortunate bugs which prevent the building of a working kernel. Versions 3.0 and 3.1 have also been deemed to be too old and will not be supported in 2.6.29.

  • Video4Linux drivers now use a separate v4l2_file_operations structure to hold their VFS-like callbacks. The prototypes of a number of these functions have been changed to remove the inode argument.

  • Video4Linux2 has also acquired a new "subdevice" concept, meant to reflect the fact that video "devices" tend to be, in reality, a set of cooperating devices. See the new document for a description of how this mechanism works.

  • Two new functions - stop_machine_create() and stop_machine_destroy() - allow the independent creation of the threads used by stop_machine(). That, in turn, lets those threads be created before trying to actually stop the machine, making that operation more resistant to failure.

  • The poll() file operation is now allowed to sleep; see this article for more information on this change.

  • The CPU mask mechanism, used to represent sets of processors in the system, is in the middle of being massively reworked. The problem is that CPU masks were often put on the stack, but, as the number of processors grows, the stack lacks room for the mask. The new API is designed to get these masks off the stack, and to guard against anybody ever trying to put one back. See this posting by Rusty Russell for details on this work.
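
The idea behind the CPU mask rework in the last item can be illustrated outside the kernel. The sketch below is a user-space stand-in, not the kernel's API: the structure and function names are hypothetical. It shows only the general approach of sizing the bitmap at run time and keeping it on the heap, so that its size never weighs on the stack.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a CPU mask; not the kernel's own type. */
struct cpu_set_sketch {
    size_t nr_cpus;       /* number of CPUs the mask can describe */
    unsigned long *bits;  /* heap-allocated bitmap, one bit per CPU */
};

/* Allocate a zeroed mask for nr_cpus processors; returns 0 or -1. */
int cpu_set_sketch_alloc(struct cpu_set_sketch *set, size_t nr_cpus)
{
    size_t words = (nr_cpus + BITS_PER_WORD - 1) / BITS_PER_WORD;

    set->bits = calloc(words ? words : 1, sizeof(unsigned long));
    if (!set->bits)
        return -1;
    set->nr_cpus = nr_cpus;
    return 0;
}

void cpu_set_sketch_free(struct cpu_set_sketch *set)
{
    free(set->bits);
    set->bits = NULL;
    set->nr_cpus = 0;
}

/* Mark one CPU as present in the mask. */
void cpu_set_sketch_set(struct cpu_set_sketch *set, size_t cpu)
{
    assert(cpu < set->nr_cpus);
    set->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

/* Returns 1 if the CPU is in the mask, 0 otherwise (or if out of range). */
int cpu_set_sketch_test(const struct cpu_set_sketch *set, size_t cpu)
{
    if (cpu >= set->nr_cpus)
        return 0;
    return (int)((set->bits[cpu / BITS_PER_WORD] >> (cpu % BITS_PER_WORD)) & 1UL);
}
```

A mask for 4096 processors needs 512 bytes here, which is a trivial heap allocation but a significant share of a small, fixed-size kernel stack.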

The merge window opened on December 28; if the usual two-week pattern holds, changes should be accepted through January 11. Tune in next week for an update on the final patches merged for the 2.6.29 kernel.

The future for grsecurity

By Jake Edge
January 7, 2009

Using an out-of-tree kernel patch has several downsides but, as long as the patch is maintained and updated with the kernel, it is workable. If the developers lose interest—or funding—it suddenly becomes a much bigger problem for users. That scenario may be about to play out for users of the grsecurity tool as a recent release comes with a warning that it could be the last.

Users of grsecurity are, unsurprisingly, worried about the future of the security tool, but calls for its inclusion in the mainline are not likely to be successful. Over time, largely because of the efforts of others outside of the grsecurity project, various pieces of grsecurity (and the associated PaX project) have been added to the kernel. But there are a number of reasons that the full grsecurity patch is not in the mainline; the most basic is that the developers seem unwilling to follow, or uninterested in following, the normal path to inclusion.

The grsecurity patch implements a number of security features that are useful, particularly for web servers or servers that provide shell access to untrusted users. One of the major features is role-based access control (RBAC), which is an alternative to the traditional UNIX discretionary access control (DAC) or the more recent mandatory access control (MAC) provided by SELinux and Smack. The aim of RBAC is to create a "least privilege" system, where users and processes have only the minimum necessary privilege to accomplish their task. grsecurity also includes hardening of the chroot() system call, to eliminate privilege escalation and other vulnerabilities from within a "chroot jail". In addition, there are a number of other miscellaneous features like auditing and restricting /proc information, all of which are listed on the grsecurity features page.

Another major component of grsecurity is the PaX code, which restricts memory use so that various exploits, such as buffer overflows and other code execution vulnerabilities, are blunted or eliminated. It does this by making data pages non-executable using, or emulating, the "no execute" (or NX) bit. PaX also restricts mprotect() so that pages cannot be both writable and executable, which closes off another route for code injection. In addition, PaX provides much more aggressive address space layout randomization (ASLR) than is currently used by Linux. PaX is developed separately from grsecurity, by the anonymous "PaX Team", then incorporated into grsecurity by developer Brad Spengler.
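
The rule that a page may be writable or executable, but not both, can be sketched as a simple policy check. This is only an illustration: the function names are hypothetical, the check runs in user space, and the real PaX restrictions live inside the kernel and are more involved than a single flag test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* True if the requested protection is both writable and executable. */
bool prot_violates_wx(int prot)
{
    return (prot & PROT_WRITE) && (prot & PROT_EXEC);
}

/*
 * Hypothetical wrapper: refuse writable-and-executable requests with
 * EPERM, and pass everything else through to the real mprotect().
 */
int mprotect_no_wx(void *addr, size_t len, int prot)
{
    if (prot_violates_wx(prot)) {
        errno = EPERM;
        return -1;
    }
    return mprotect(addr, len, prot);
}
```

An exploit that writes code into a buffer and then tries to make that buffer executable while keeping it writable would be stopped at the second step by a check of this kind.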

The project has been around for a long time; grsecurity started in 2001, while PaX began in 2000. There are numerous satisfied users and grsecurity has been used in distributions such as NetSecL and Hardened Gentoo, but it has never made it into the mainline. Gabor Micsko recently posted a request on linux-kernel for Linus Torvalds to reconsider grsecurity:

The common opinion of the developers of grsecurity, PaX and their users is that acceptance of the code into the kernel would be the best solution for saving the project, beside finding another long-term sponsor.

Torvalds replied that much of what was in grsecurity and PaX was "insane and very annoying and invasive code." He then went on to explain some of the history:

The apparent inability (and perhaps more importantly - total unwilling[n]ess) from the PaX team to be able to see what makes sense in a long-term general kernel and what does not, and split things up and try to push the sensible things up (and know which things are too ugly or too specialized to make sense), caused many PaX features to never be merged.

Much of it did get merged over the years (mostly because some people spent the time to separate things out), but no, we're not going to suddenly start merging code like that just because the project is in trouble. None of the basic issues have been solved.

A perfect example of the unwillingness to work with the kernel hackers is embodied in the decision not to implement RBAC as a Linux Security Module (LSM). For better or worse, LSM is the mechanism used to implement access control in the kernel. Conceptually, it is a good fit for the grsecurity RBAC code. It might require additional LSM hooks, but working on getting those hooks added is the right approach. There was some uncertainty about LSM at one time, but it clearly is the way forward today.

There may also be an issue with the PaX code, in that anonymous contributions to the kernel are not allowed. Presumably Spengler, or some other interested hacker, could sign off on that code, but it cannot be accepted directly from "PaX Team".

To the extent grsecurity and PaX have been proposed for inclusion, they have always been presented as a single monolithic patch. There has never been an attempt to break the patch up into logical chunks that can be accepted or rejected on their individual merits. So far, that has not occurred even after the project lost its sponsor. But waiting until the last minute is not going to work. As Robert Hancock puts it:

Saying to the kernel developers "here, throw this huge blob of code into your kernel because otherwise we're taking our ball and going home" is not how it works.

If there is value in the existing code, interested users and developers need to work within the kernel process to get it accepted. To do that, one must identify the useful pieces and proceed from there. Valdis Kletnieks suggests:

Probably the best way to proceed would be for the stakeholders to come to some agreement on which parts are the "sane stuff" (which could be an interesting food fight), split those parts out, and submit them for inclusion as standalone separate patches.

This is yet another example of the perils of out-of-tree code. By all accounts, there are satisfied grsecurity users who may well be left behind if the grsecurity project fails to find sponsors by the end of March. They can, of course, continue running the grsecurity-enhanced kernels they currently have, but may not be able to take advantage of upcoming kernel advances.

Perhaps the stakeholders will gather together and continue updating grsecurity for newer kernels, but that still leaves the underlying problem. They would be better served spending at least part of their time working with the kernel hackers to get as much of grsecurity and PaX as possible merged into the mainline.

Btrfs aims for the mainline

By Jonathan Corbet
January 7, 2009

The Btrfs filesystem has been under development for the last year or so; for much of that time, it has been widely regarded as the most likely "next generation filesystem" for Linux. But, before it can claim that title, Btrfs must stabilize and find its way into the mainline kernel. Btrfs developer Chris Mason has been saying for a while that he thinks the code will come together more quickly if it is merged relatively soon, even if it is not yet truly ready for production use. General experience with kernel development tends to support this position: in-tree code gets more review, testing, and fixes than out-of-tree code. So the development community as a whole has been reasonably supportive of a relatively early Btrfs merge.

In our last Btrfs episode, Andrew Morton suggested that a 2.6.29 merge be targeted. Chris would like that to happen; to that end, he has posted a version of Btrfs for consideration. Unsurprisingly, that posting has already increased the amount of attention being paid to this code, with the result that Chris quickly got a list of things to fix. Most of those have now been addressed, but there are a few remaining issues which could still impede the merging of Btrfs in this development cycle. This article will look at the potential roadblocks.

One of those is the user-space API. Btrfs brings with it a whole set of new ioctl() calls, none of which have been seriously reviewed or even documented. These calls perform functions like creating snapshots, initiating defragmentation, creating or resizing subvolumes, adding devices to the volume set, etc. Interestingly, there has been no real complaint about the volume-management features of Btrfs in general. But the interface to features like that needs close scrutiny; normally, user-space APIs cannot be broken once they are merged into the mainline. There has been some talk of making an exception for Btrfs, since there is little chance of systems becoming dependent on a specific interface before Btrfs is production-ready.

Still, once distributions start shipping Btrfs tools - to help testers if nothing else - an API change would cause pain. Any potential for this kind of pain would make API changes very hard to do. So Linux may well end up being stuck with the early Btrfs API. Given that at least one developer thinks that this API needs a serious rework, this issue could turn out to be a serious roadblock indeed.

Then, there is the issue of the special-purpose locking primitives used in Btrfs. To understand this discussion, it's worth looking at the locking function used within Btrfs:

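    int btrfs_tree_lock(struct extent_buffer *eb)
    {
	int i;

	/* Fast path: the lock is free. */
	if (mutex_trylock(&eb->mutex))
	    return 0;
	/* Poll without sleeping, in case the holder releases it soon. */
	for (i = 0; i < 512; i++) {
	    if (mutex_trylock(&eb->mutex))
		return 0;
	}
	/* Give up polling and sleep until the lock is available. */
	mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb));
	return 0;
    }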

The lock in question is a mutex, but it is being acquired in an interesting way. If the lock is held by another process, this function will poll it up to 512 times, without sleeping, in the hope that it will become available quickly. Should that happen, the lock can be acquired without sleeping at all. After 512 unsuccessful attempts, the function will finally give up and go to sleep.
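
The same poll-then-sleep pattern can be tried out in user space. The following is a rough analogue using POSIX threads rather than kernel mutexes; the function name and the returned attempt count are inventions for this sketch. Note that pthread_mutex_trylock() returns zero on success, the opposite sense from the kernel's mutex_trylock().

```c
#include <assert.h>
#include <pthread.h>

#define SPIN_ATTEMPTS 512  /* same bound as the Btrfs loop */

/*
 * Try to take the mutex without blocking: once up front, then up to
 * SPIN_ATTEMPTS more times. If every attempt fails, fall back to a
 * blocking acquisition. Returns the number of failed attempts made
 * before the lock was obtained; SPIN_ATTEMPTS + 1 means the caller
 * had to block.
 */
int spin_then_block_lock(pthread_mutex_t *m)
{
    int i;

    assert(m != NULL);
    for (i = 0; i <= SPIN_ATTEMPTS; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return i;
    }
    pthread_mutex_lock(m);
    return i;
}
```

The fixed attempt count is the "magic constant" in this scheme: the right number depends on how long the lock is typically held, which the caller has no way of knowing.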

Chris justifies this behavior this way:

Btrfs is using mutexes to protect the btree blocks, and btree searching often hits hot nodes that are always in cache. For these nodes, the spinning is much faster, but btrfs also needs to be able to sleep with the locks held so it can read from the disk and do other complex operations.

For btrfs, dbench 50 performance doubles with the unconditional spin, mostly because that workload is almost all in ram. For 50 procs creating 4k files in parallel, the spin is 30-50% faster. This workload is a mixture of disk bound and CPU bound.

That kind of performance increase seems worth going for. In fact, it reflects a phenomenon which has been observed in other situations as well: even when sleeping locks are used, performance often improves if a processor spins for a while in the hope that a contended lock will become available. If the lock can be acquired without sleeping, then the overhead associated with putting the process to sleep and waking it up can be avoided. Beyond that, though, there is the fact that the process seeking to acquire the lock is probably well represented in the CPU's cache. Allowing that process to continue to run will, if the lock can be acquired quickly, almost certainly lead to better system performance.

For this reason, the adaptive realtime locks patch was developed last year, though it never found its way into the mainline. In response to the Btrfs discussion, Peter Zijlstra proposed a spinning mutex patch which is intended to provide the same benefits as the special Btrfs locking function, but for more general use and without the addition of magic constants. In Peter's patch, an attempt to acquire a contended lock will spin for as long as the process holding that lock is actually running on a CPU. If the lock holder goes to sleep, any process trying to acquire the lock also goes to sleep. The heuristic seems to make sense, though detailed benchmarks have not been posted. The patch was received reasonably well, though Linus has insisted that some changes be made.
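
The heuristic of spinning only while the lock holder is running can be sketched in user space as well. This is not Peter's patch: the kernel can ask the scheduler whether the owner is on a CPU, while this illustration has to rely on a flag that the holder maintains by hand. All of the names below are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct adaptive_lock {
    pthread_mutex_t mutex;
    atomic_bool owner_running;  /* hint: the current holder is on a CPU */
};

void adaptive_lock_init(struct adaptive_lock *l)
{
    assert(l != NULL);
    pthread_mutex_init(&l->mutex, NULL);
    atomic_init(&l->owner_running, false);
}

/*
 * Spin on trylock for as long as the holder claims to be running; block
 * as soon as it does not. Returns true if the lock was taken without
 * blocking, false if the caller had to sleep.
 */
bool adaptive_lock_acquire(struct adaptive_lock *l)
{
    bool never_blocked = true;

    while (pthread_mutex_trylock(&l->mutex) != 0) {
        if (!atomic_load(&l->owner_running)) {
            /* Holder is asleep: spinning would only waste cycles. */
            pthread_mutex_lock(&l->mutex);
            never_blocked = false;
            break;
        }
    }
    atomic_store(&l->owner_running, true);
    return never_blocked;
}

/* The holder calls this before it sleeps, for example to wait for I/O. */
void adaptive_lock_owner_sleeps(struct adaptive_lock *l)
{
    atomic_store(&l->owner_running, false);
}

/* The holder calls this when it resumes running with the lock held. */
void adaptive_lock_owner_wakes(struct adaptive_lock *l)
{
    atomic_store(&l->owner_running, true);
}

void adaptive_lock_release(struct adaptive_lock *l)
{
    atomic_store(&l->owner_running, false);
    pthread_mutex_unlock(&l->mutex);
}
```

Unlike a fixed spin count, this approach needs no tuning constant: the time spent spinning follows the behavior of the lock holder.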

So a more general spinning mutex may well find its way into the mainline. Whether it will go in for 2.6.29 is not clear, though. Developers tend to like their core locking primitives to be reasonably well tested; merging something which was developed toward the end of the merge window could be a hard sell. Until something like that happens, Chris is uninterested in removing his special locking function:

But, if anyone working on adaptive mutexes is looking for a coder, tester, use case, or benchmark for their locking scheme, my hand is up. Until then, this is my for loop, there are many like it, but this one is mine.

Finally, there is the question of the name. Some reviewers have suggested that the filesystem should be merged with a name which makes it clear that it's not meant for production use - "btrfsdev," for example. Chris is resistant to that idea, noting that, unlike existing filesystems, Btrfs is known to be new and has no reputation for stability. He has stated his willingness to make the change, though, if it is truly considered to be necessary. Bruce Fields pointed out that calling it "Btrfs" from the beginning could possibly burn future developers who boot an old kernel (with a non-production Btrfs) after switching to a newer, production-ready version of the filesystem.

All of this adds up to an uncertain fate for Btrfs in 2.6.29; there are a fair number of open issues and it's late in the merge window. Of course, Btrfs could be merged after 2.6.29-rc1; since it is a completely new subsystem, it won't cause regressions. But if Linus concludes that there are enough loose ends in the current Btrfs code, he may just decide to give it one more development cycle before bringing it into the mainline. So, while nobody seems to doubt that Btrfs will go in, the question of when remains open.

(With any luck, we will have an authoritative article on Btrfs for this page in the near future, once the author - you know who you are! - gets it written. Stay tuned.)

Patches and updates

Core kernel code

  • Casey Dahlin: waitfd. (January 7, 2009)

Device drivers

  • Dave Airlie: drm. (January 5, 2009)

Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds