Kernel development

Brief items

Kernel release status

The 2.6.26 merge window remains open, so there is no released 2.6 development kernel. See the article below for a summary of patches merged over the last week.

No stable kernel releases have been made over the last week. As of this writing, the 2.6.24.6 and 2.6.25.1 stable updates are in the review process; if all goes well, these updates should be released on May 1.

Kernel development news

Quotes of the week

Those who have been watching the linux-kernel list know that the 2.6.26 merge window has been a little rougher than some of those which came before. That has led to some fairly strong discussion over how changes find their way into the mainline. Here are a few selections.

I'm not saying the patch is wrong ... or that just because it broke voyager it shouldn't be done. What I'm saying is that it shouldn't have been put into the x86 tree without mailing list review.

Running a git tree isn't a private fiefdom, it's a public trust; to keep the trust of other developers, you have to run the tree in a transparent fashion ... and making the mailing list the only input to it is one way of ensuring this. It also helps with review that we're all so worried about so little being done ...

-- James Bottomley

But, we'd not mind at all posting 1000 x86.git patches to lkml (or another list) every 3 months (or more frequently), if people request that.
-- Ingo Molnar

You can post whatever patches you like a million times to lkml. That's not the problem. It's that the patches don't get reviewed, posting them more or to a different place doesn't help that.
-- David Miller

Sorting x86 arch code is inevitably going to break a few eggs, but I suspect the time cost has been more in Dave v Ingo (12 rounds, two falls, two submissions or a knockout) than actually sorting out the fallout of a couple of problem cases.
-- Alan Cox

So here's how we're going to fix David's problem:

- Everyone gets their stuff into linux-next.

- Lots of people _test_ linux-next. Just once a week.

Those two steps will improve the merge-window chaos a lot. Things will get better.

-- Andrew Morton

IMO, the merge window is way too short for actually testing anything. I rebuild the kernel once or even twice a day and there's no way I can really test it. I can only check if it breaks right away. And if it does, there's no time to find out what broke it before the next few hundreds of commits land on top of that.
-- Rafael Wysocki

And yes, there is a solution: don't develop so much. Don't allow thousands of developers to be involved. Do a small core group, and make development so hard or inconvenient that you only have a few tens of people who write code, and vet them and force them to jump through hoops when adding new features (or fixing old ones, for that matter).
-- Linus Torvalds

The 2.6.26 merge window, part 2

By Jonathan Corbet
April 30, 2008
Since last week's summary was written, another 3700 changesets have found their way into the mainline git repository. The most significant user-visible changes include:

  • New drivers have been merged for Wolfson WM9713 codecs, TI DAVINCI AC97 sound chips, Emagic Audiowerk 2 soundcards, x86 PC speakers (new driver which makes them look like sound cards), Asus AV100 (Xonar DX) sound cards, Micron MT9M001 and MT9V022 cameras, PXA27x Quick Capture cameras, Kworld ATSC 120 tuners, cx23417 MPEG encoders, Integrant ITD1000 tuners, Philips TDA10048HN-based demodulators, Philips SAA7171/3/4 audio/video decoders (the last out-of-tree IVTV driver), Auvitek AU8522 demodulators, Samsung S5H1411-based tuners, framebuffer, keyboard, and mouse virtual devices (for Xen), several Wolfson Microelectronics touchscreens, wireless Xbox 360 controllers, Zhen Hua PPM-4CH transmitters, SPCP8x5 USB to serial adaptors, NCR 53c9x SCSI controllers (replacement driver), Freescale 8610 and 5121 display interface units, Intel 965G/965GM integrated graphics controllers, TI OMAP sound controllers (including the one on the Nokia 810), Eee PC function keys, and Intel IXP4xx Ethernet devices.

  • There is now "basic" support for braille screen readers.

  • Support for the One Laptop Per Child XO architecture has been merged into the mainline.

  • The new virtual files found in /proc/pid/mountinfo provide information on all filesystem mounts visible to the relevant process.

  • The new virtual file /proc/vmallocinfo displays information on use of vmalloc space within the kernel.

  • The SPARC Niagara architecture now has NUMA support.

  • The Xen balloon driver (allowing memory to be added to or removed from virtual guests) has been merged.

  • By default, /dev/mem can no longer be used to access RAM; Fedora and Red Hat have applied this patch for years, but now it has found its way into the mainline.

  • The KVM paravirtualization subsystem now supports the S/390, PowerPC 440, and ia64 architectures.

  • Per-process "securebits" are supported. These bits control how a process's capability bits are managed; the patch is intended to help those who would transition over to a fully capability-based system. See the securebits article below for a more detailed description of this feature.

  • The getrusage() system call has a new RUSAGE_THREAD option which causes it to return information about the current thread only.

  • The device whitelist control group patch (described briefly in this article) has been merged.

  • It is now possible to create and use partitions on network block devices (NBD).

  • The audit subsystem can now test events against the type of the file being operated upon.

  • The VFS now makes backing device information available under /sys/class/bdi. Interested people can look at per-device readahead and writeback variables there.

  • The FUSE filesystem now supports the creation of shared writable memory mappings.
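As an illustration of the new RUSAGE_THREAD option mentioned above, here is a minimal user-space sketch. The helper name thread_cpu_usec() is hypothetical, and the fallback definition of RUSAGE_THREAD (value 1, as used by the Linux kernel) is only there for C libraries whose headers do not expose the constant; treat it as an assumption rather than a recommended practice.

```c
#define _GNU_SOURCE            /* asks glibc to expose RUSAGE_THREAD */
#include <sys/resource.h>
#include <sys/time.h>

/* Fallback for headers that hide the constant; 1 is the Linux value. */
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif

/*
 * Hypothetical helper: CPU time (user + system) consumed so far by the
 * calling thread only, in microseconds, or -1 if getrusage() fails
 * (for example, on a kernel older than 2.6.26).
 */
long long thread_cpu_usec(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1;

    return ((long long)ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
           + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```

A worker thread could call thread_cpu_usec() before and after a unit of work to measure its own CPU consumption without the numbers being polluted by other threads in the process, which is what RUSAGE_SELF would report.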

Changes visible to kernel developers include:

  • ioremap() on the x86 architecture will now always return an uncached mapping. Previously, it had taken a more relaxed approach, leaving the caching as the BIOS had set it up. The practical result was to almost always create uncached mappings, but with occasional exceptions. Drivers which depend on a cached mapping will now break; they will need to use ioremap_cache() instead.

  • The Video4Linux2 API now defines a set of controls for camera devices; they allow user space to work with parameters like exposure type, tilt and pan, focus, and more.

  • On the x86 architecture, there is a new configuration parameter which allows gcc to make its own decisions about the inlining of functions, even when functions are declared inline. In some cases, this option can reduce the size of the kernel's text segment by over 2%.

  • The legacy IDE layer has gone through a lot of internal changes which will break any remaining IDE drivers.

  • The nopage() virtual memory area operation has been removed; all in-tree code is now using fault() instead.

  • The SLUB allocator supports a new sysfs file (/sys/kernel/slab/name/order) which allows system administrators to change the size of page allocations used by the named slab.

  • A condition which triggers a warning from WARN_ON will now also taint the kernel.

  • The get_info() interface for /proc files has been removed. There is also a new function for creating /proc files:

        struct proc_dir_entry *proc_create_data(const char *name, mode_t mode,
                                                struct proc_dir_entry *parent,
                                                const struct file_operations *proc_fops,
                                                void *data);

    This version adds the data pointer, ensuring that it will be set in the resulting proc_dir_entry structure before user space can try to access it.

  • The object debugging infrastructure has been merged.

The merge window remains open; tune in next week for (what should be) the final set of changes merged for 2.6.26.

Restricting root with per-process securebits

By Jake Edge
April 30, 2008

Linux capabilities have had a long and somewhat tortuous journey as part of the Linux kernel. Slowly—and very carefully—functionality is being added to this security feature to get it to a point where it is a viable alternative to the all-or-nothing setuid(0) model. A recently merged patch adds a per-process securebits feature that will allow capabilities-based daemons or subsystems to coexist with existing setuid utilities.

Linux capabilities break up the privileged tasks normally associated with root (i.e. uid 0) into finer-grained abilities which can be individually granted or revoked for specific processes. The idea is to change the standard Unix model that root has all special privileges while all other users have none. The terminology is always a bit contentious, though, as Linux capabilities are derived from a POSIX proposal that was never adopted, but shares the name "capabilities" with an entirely different approach; this article is only concerned with capabilities of the Linux variety.

There has long been interest in creating a Linux system that did not rely upon a single root account. Capabilities are seen as the way to get there, but they have suffered from a bit of a chicken-and-egg problem. With the recent work to add file-based capabilities and restore CAP_SETPCAP to its original meaning, a true capabilities-based system is becoming possible. In the patch, which has been merged for 2.6.26, Andrew Morgan describes the new functionality:

The feature added by this patch can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits which was system-wide, unwieldy and which has ultimately withered to a dead relic in the source of the modern kernel.

The patch removes the global securebits variable, replacing it with an entry in struct task_struct, that can be manipulated by a process, but only for itself—and any children. Morgan envisions hybrid systems that have some utilities using capabilities to get their privileges along with some setuid(0) utilities. In that scenario, a capabilities-based utility or daemon may wish to limit what its children can do, even if they execute a setuid(0) binary. As part of the evolution, process trees can be created that cannot get root privileges.

Processes which have the CAP_SETPCAP capability can change their securebits setting via the prctl() system call. There are three separate bits that govern the interaction of capabilities and setuid:

  • SECURE_NOROOT – enabling this gives no special privileges to uid 0
  • SECURE_NO_SETUID_FIXUP – setting this bit disables capability fixes when transitioning from or to uid 0 via setuid. This might be done for compatibility with older programs that use setuid to reduce their privileges.
  • SECURE_KEEP_CAPS – when set, a process can retain its capabilities even when transitioning to a normal (not uid 0) user. This bit is cleared by exec().

Each of these bits also has a companion *_LOCKED bit that, if set, will not allow any user program to alter the corresponding setting. As Morgan notes in the patch, a program that can set its capabilities (has CAP_SETPCAP) can drop all privileges for itself and any child process by doing:

    prctl(PR_SET_SECUREBITS, 0x2f);

This is the equivalent of setting SECURE_NOROOT, SECURE_NOROOT_LOCKED, SECURE_NO_SETUID_FIXUP, SECURE_NO_SETUID_FIXUP_LOCKED, and SECURE_KEEP_CAPS_LOCKED.
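
To see where the 0x2f value comes from, the following sketch builds the mask from individual bit positions. The SB_* names are hypothetical stand-ins defined locally so that the example does not depend on kernel headers; the positions (each setting followed by its lock bit, starting at bit 0) are an assumption consistent with the value quoted above. The lock_out_root() helper will fail with EPERM unless the caller holds CAP_SETPCAP.

```c
#include <sys/prctl.h>

/* Hypothetical local names for the securebits positions (assumed layout). */
enum {
    SB_NOROOT                 = 0,
    SB_NOROOT_LOCKED          = 1,
    SB_NO_SETUID_FIXUP        = 2,
    SB_NO_SETUID_FIXUP_LOCKED = 3,
    SB_KEEP_CAPS              = 4,   /* deliberately left clear below */
    SB_KEEP_CAPS_LOCKED       = 5
};

#define SB_MASK(bit) (1UL << (bit))

/* Mask that disables uid-0 privilege and locks every setting in place. */
unsigned long no_root_mask(void)
{
    return SB_MASK(SB_NOROOT) | SB_MASK(SB_NOROOT_LOCKED)
         | SB_MASK(SB_NO_SETUID_FIXUP) | SB_MASK(SB_NO_SETUID_FIXUP_LOCKED)
         | SB_MASK(SB_KEEP_CAPS_LOCKED);
}

/* Apply the mask to the calling process; returns 0 on success, -1 on error. */
int lock_out_root(void)
{
    return prctl(PR_SET_SECUREBITS, no_root_mask(), 0, 0, 0);
}
```

Note that SECURE_KEEP_CAPS itself stays clear while its lock bit is set, so the process and its children can never turn capability retention back on.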

The memory of the sendmail-capabilities bug from 2000 makes some a bit queasy, or worse, about any patches that involve capabilities and setuid. Andrew Morton asks: "what was the bug which caused us to cripple capability inheritance back in the days of yore? (Some sendmail thing?)" That bug arose because unprivileged users could take away the CAP_SETUID capability from setuid binaries like sendmail. When sendmail then called setuid() to drop its privileges, the call failed, but sendmail did not check the return value, so it kept running with full privilege. This could be leveraged by a user to gain root privileges. It was a disconnect between capabilities and the longstanding behavior of Unix-like systems when dropping privileges.
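
The defensive pattern that this bug made famous can be sketched in a few lines of C. The function name drop_to_uid() is hypothetical and this is not sendmail's code; it simply shows a privilege drop that refuses to continue if setuid() fails or does not take effect.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch to the given uid and verify the result.
 * Returns 0 only if both the real and effective uid now match; a caller
 * should abort rather than carry on when this returns -1.
 */
int drop_to_uid(uid_t uid)
{
    if (setuid(uid) != 0) {
        perror("setuid");        /* e.g. EPERM when CAP_SETUID is missing */
        return -1;
    }

    /* Belt and braces: confirm that the change really happened. */
    if (getuid() != uid || geteuid() != uid)
        return -1;

    return 0;
}
```

A daemon would typically call this once, immediately after it has finished the work that needs privilege, and exit on failure.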

Morgan has written a detailed description of the sendmail-capabilities bug in response to Morton's questions. He makes it clear that he wants to move toward full capability support without breaking existing code:

I'm basically interested in evolving the capability implementation back to the POSIX.1e model and making it whole - but most certainly *without crippling legacy superuser support in the process* .

As folk get more comfortable with this full capability model. I believe we can delete more cruft from the main kernel, but even that clean up will leave a fully functional legacy model in place. I feel it should be for something like init, or one of its children to be able to run subsystems in capability-only or legacy modes.

Morton seemed satisfied that his concerns had been addressed, but still wonders about the future for capabilities: "So how do we ever get to the stage where we can recommend that distributors turn these things on, and have them agree with us?" This was echoed by Ismail Dönmez, who was looking for concrete examples of how to use the per-process securebits feature. Morgan provides a pointer to some examples along with his belief that sometime soon the capabilities developers will become confident enough to recommend turning off the "experimental" flag for the SECURITY_FILE_CAPABILITIES kernel configuration. That flag governs both the file-based capabilities as well as the per-process securebits. In addition, Morgan says:

More importantly I'm hopeful that in that time we'll have accumulated enough documentation and user-space experience and examples to convince others that this is, indeed, a viable feature to support in mainstream distributions.

A developerWorks article on file-based capabilities by Serge Hallyn and a web page on POSIX capabilities by Chris Friedhoff were both mentioned in the thread as good references for the work being done to actually use capabilities in systems. Those pre-date the securebits work, so Dönmez was looking for use-cases for the new feature. Morgan replied that containers were one, deferring to Hallyn who has some ideas on using securebits:

We tend to talk about 'system containers' versus 'application containers'. A system container would be like a vserver or openvz instance, something which looks like a separate machine. I was going to say I don't imagine per-process securebits being useful there, but actually since a system container doesn't need to do any hardware setup it actually might be a much easier start for a full SECURE_NOROOT distro than a real machine. Heck, on a real machine init and a few legacy [daemons] could run in the init namespace, while users log in and apache etc run in a SECURE_NOROOT container.

But I especially like the thought of for instance postfix running in a carefully crafted application container (with its own virtual network card and limited file tree and no visibility of other processes) with SECURE_NOROOT on.

Capabilities are an interesting, but complicated, security feature. For most of the ten years they have been part of the Linux kernel, they have been broken, ignored, or both. With the latest work being done by Hallyn, Morgan, and others, capabilities are finally becoming a fully-working alternative to things like SELinux. It will be interesting to see if more user utilities will become capability-aware and whether distributions start using capabilities. Some day, root may just fade away.

Ksplice: kernel patches without reboots

By Jonathan Corbet
April 29, 2008
The kernel developers are generally quite good about responding to security problems. Once a vulnerability in the kernel has been found, a patch comes out in short order; system administrators can then apply the patch (or get a patched kernel from their distributor), reboot the system, and get on with life knowing that the vulnerability has been fixed. It is a system which works pretty well.

One little problem remains, though: rebooting the system is a pain. At a minimum, it requires a few minutes of down time. In many situations, that down time cannot be tolerated. Reboots also disrupt any ongoing work, break existing network connections, and can cause the loss of results from long-running processes. And, most importantly of all, reboots prove traumatic for a certain subset of Linux administrators who prize a long uptime above almost all other things. Administrators currently have to choose between multi-year uptimes and security fixes; anything which frees them from a dilemma of this magnitude can only be welcome.

That "anything" might just be a recently-announced project called ksplice. With ksplice, system administrators can have the best of both worlds: security fixes without unsightly reboots.

An in-depth explanation of how ksplice works can be found in this document [PDF]. In short, ksplice requires as input the source tree for the running kernel and the security patch. It will then build two kernels, one with the patch and one without; the kernels are built with a special set of options which makes it easy to figure out which functions change as a result of the patch. The two kernels will be compared, with the purpose of finding those functions. Changes can propagate further than one might expect, especially if, for example, an inline function is modified.

Once a list of changed functions has been made, the updated code for those functions is packaged into a kernel module and loaded into the system. Then comes the tricky part: getting the running kernel to start using the new code. That requires patching the running code, which is a risky thing to do. Ksplice starts with a call to stop_machine_run(), which dumps a high-priority thread onto each processor, thus taking control of all processors in the system. It then examines all threads in the system to ensure that none of them are running in the functions to be replaced. If none are, trampoline jumps are patched into the beginning of each replaced function (they "bounce" calls to the old code into the replacement code) and life continues. Otherwise ksplice will back off and try again later.
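
The safety check at the heart of that procedure can be modeled in user space. The sketch below is not ksplice code: the structure and function names are hypothetical, and it assumes that each thread is represented by a single code address, which is a simplification made for illustration. It only shows the decision being made while the machine is stopped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one function to be replaced: [start, end). */
struct code_range {
    uintptr_t start;
    uintptr_t end;
};

/* True if addr lies inside the half-open range r. */
static bool in_range(uintptr_t addr, const struct code_range *r)
{
    return addr >= r->start && addr < r->end;
}

/*
 * Returns true only if no sampled thread address falls inside any of the
 * functions to be replaced. A false result means "back off and retry".
 */
bool safe_to_patch(const uintptr_t *thread_addrs, size_t n_threads,
                   const struct code_range *targets, size_t n_targets)
{
    for (size_t t = 0; t < n_threads; t++)
        for (size_t f = 0; f < n_targets; f++)
            if (in_range(thread_addrs[t], &targets[f]))
                return false;
    return true;
}
```

In this model, a function that always has some thread inside it makes safe_to_patch() return false on every attempt, which is exactly the situation in which the retry approach can never succeed.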

This method imposes a number of limitations. One is that only code changes can be patched in with ksplice; patches which make changes to data structures cannot be accommodated. Another comes from the retry-based approach to ensuring that no threads are running in the patched functions; what happens if one of those functions is never free? Kernel functions like schedule(), sys_poll(), or sys_waitid() are likely to always have processes running within them. In cases like this, ksplice will eventually give up and inform the user that the patch cannot be done; it is simply not possible to make changes to those particular functions.

These limitations mean that, out of 50 security patches examined by the ksplice developers, eight could not be applied with ksplice. So multi-year uptimes are probably still incompatible with the application of all security patches. Even so, ksplice certainly has the potential to reduce patch-related downtime considerably. Chances are good that there will be a fair amount of interest in ksplice at sites running high-uptime, mission-critical systems.

There are a few things in the way of an immediate merge of this code into the mainline. One is a matter of coding quality and can be fixed. Then, there is the matter of the lead developer being unconvinced that merging this code makes sense since it is, essentially, a standalone feature. Andi Kleen's response made the (usual) reasons for merging the code clear:

To be honest you weren't the first to come up with something like this (although you're the first to post to l-k as far as I know). But the usual problem of something that is kept out of tree is that it eventually bitrots and gets forgotten. The only sane way to make such extensions a generically usable linux feature is to merge them to mainline.

So, presumably, the code will eventually be proposed for a mainline merge. But there is one other little difficulty pointed out by Tomasz Chmielewski: Microsoft holds a patent described this way:

A system and method for automatically updating software components on a running computer system without requiring any interruption of service. A software module is hotpatched by loading a patch into memory and modifying an instruction in the original module to jump to the patch.

Microsoft came up with this novel new technique in the distant past: 2002. The posting immediately brought out a crowd of surprised graybeards who distinctly remember using such techniques on their PDP-11 systems some decades before Microsoft "invented" hot-patching. The basic claim of the patent would thus appear to be invalidated by some decades' worth of prior art, but some of the dependent claims include features (such as capturing all other processors on the system) which were unlikely to be useful on PDP-11s.

Given that the kernel developers are now well aware of this patent, they must take it into account when deciding whether to accept this code into the mainline. It would not be surprising if they chose to avoid baiting the Microsoft FUD machine in this way, even if they all agreed that the patent lacked validity. So a promising technology risks being left out of the kernel as the result of a software patent which was filed at least 30 years too late.

Page editor: Jonathan Corbet

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds