Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.35-rc1, released on May 30. Changes merged since last week's summary include "ramoops" (for saving oops information in persistent memory), direct I/O support in Btrfs, and some changes to how truncate() is handled. See the separate article below for a summary of changes, or the full changelog for all the details.

Stable updates: was released on June 1. It removes two patches which created problems for users; only those who have experienced difficulties need to think about upgrading.


Quotes of the week

The Linux approach is to do the job right. That means getting the interface right and so it works across all the interested parties (or as close as we can get)... This is the Linux way of doing things. It's like the GPL and being shouted at by Linus. They are things you accept when you choose to take part.
-- Alan Cox

The kernel history is full of examples where crappy solutions got rejected and kept out of the kernel for a long time even if there was a need for them in the application field and they got shipped in quantities with out of tree patches (NOHZ, high resolution timers, ...). At some point people stopped arguing for crappy solutions and sat down and got it right.
-- Thomas Gleixner

The whole "no reason to tolerate broken apps" mindset is simply misguided IMO, because it's based on unrealistic assumptions. That's because in general users only need the platform for running apps they like (or need or whatever). If they can't run apps they like on a given platform, or it is too painful to them to run their apps on it, they will rather switch to another platform than stop using the apps.
-- Rafael Wysocki

NAK NAK NAK NAK! QAK! HAK! Crap code! Stop adding undocumented interfaces. Just stop it. Now. Geeze.

How is anyone supposed to use this? What are the semantics of this thing? What are the units of its return value? What is the base value of its return value? Does it return different times on different CPUs? I assume so, otherwise why does sched_clock_cpu() exist? <looks at the sched_clock_cpu() documentation, collapses in giggles>

-- Andrew Morton after one patch too many

"The more we prohibit, the safer we are" is best left to the likes of TSA; if we are really interested in security and not in security theatre or BDSM fetishism, let's make sure that heuristics we use make sense.
-- Al Viro


Idling ACPI idle

By Jonathan Corbet
June 1, 2010
Len Brown has spent some years working to ensure that Linux has top-quality ACPI support. So it is interesting to see him working to short out some of that ACPI code, but that is what has happened with his new cpuidle driver for recent Intel processors. With this experimental feature - merged for 2.6.35 - Linux no longer depends on ACPI to handle idle state transitions on Nehalem and Atom processors.

The ACPI BIOS is the standard way of getting at processor idle states in the x86 world. So why would Linux want to move away from ACPI for its cpuidle driver? Len explains:

ACPI has a few fundamental flaws here. One is that it reports exit latency instead of break-even power duration. The other is that it requires a BIOS writer to get the tables right. Both of these are fatal flaws.

The motivating factor appears to be a BIOS bug shipping on Dell systems for some months now which disables a number of idle states. As a result, Len's test system takes 100W of power when using the ACPI idle code; when idle states are handled directly, power use drops to 85W. That seems like a savings worth having. The fact that Linux now uses significantly less power than certain other operating systems - which are dependent on ACPI still - is just icing on the cake.

In general, it makes sense to use hardware features directly in preference to BIOS solutions when we have the knowledge necessary to do so. There can be real advantages in eliminating the firmware middleman in such situations. It's nice to see a chip vendor - which certainly has the requisite knowledge - supporting the use of its hardware in this way.


Ambient light sensors

By Jonathan Corbet
June 2, 2010
Ambient light sensors do exactly that: tell the system how much light currently exists in the environment. They are useful for tasks like automatically adjusting screen brightness for optimal readability. There are a few drivers for such sensors in the kernel now, but there is no standard for how those drivers should interface to user space. Andrew Morton recently noticed this problem and suggested that it should be fixed: "This is very important! We appear to be making a big mess which we can never fix up."

As it happens, the developers of drivers for these sensors tried to solve this problem earlier this year. That work culminated in a pull request asking Linus to accept the ambient light sensors framework into the 2.6.34 kernel. That pull never happened, though; Linus thought that these sensors should just be treated as another (human) input device, and others requested that it be expanded to support other types of sensors. This framework has languished ever since.

Perhaps the light sensor framework wasn't ready, but the end result is that its developers have gotten discouraged and every driver going into the system is implementing a different, incompatible API. Other drivers are waiting for things to stabilize; Alan Cox commented: "We have some intel drivers to submit as well when sanity prevails." It's a problem clearly requiring a solution, but it's not quite clear who will make another try at it or when that could happen.


Kernel development news

2.6.35 merge window part 3

By Jonathan Corbet
May 31, 2010
The 2.6.35 merge window was closed with the 2.6.35-rc1 release on May 30. A relatively small number of changes have been merged since last week's summary; the most significant are summarized here.

User-visible changes include:

  • The "ramoops" driver allows the system to record oops information in persistent RAM for recovery later.

  • The Btrfs filesystem has gained basic direct I/O support.

  • The FUSE filesystem now enables user-space filesystem implementations to transfer data with the splice() system call, avoiding a copy operation.

  • A new, non-ACPI CPU-idle mechanism for Intel processors has been added on an experimental basis. It seems that, with enough cleverness, it's possible to save more power by handling idle states directly instead of letting the ACPI BIOS do it.

  • There are a few new drivers: ST-Ericsson AB8500 power management IC RTC chips, SMSC EMC1403 thermal sensors, Texas Instruments TMP102 sensors, MC13783 PMIC LED controllers, Cirrus EP93xx backlight controllers, ADP8860 backlight controllers, RDC R-321x GPIO controllers, Janz VMOD-TTL digital I/O modules, Janz VMOD-ICAN3 Intelligent CAN controllers, TPS6507x based power management chips and touchscreen controllers, ST-Ericsson AB3550 mixed signal circuit devices, AB8500 power management chips (replacing existing driver), S6E63M0 AMOLED panels, and NXP PCF50633 MFD backlight controllers.

Changes visible to kernel developers include:

  • The user-mode helper API (used by the kernel to run programs in user space) has changed somewhat. call_usermodehelper_setcleanup() has become:

        void call_usermodehelper_setfns(struct subprocess_info *info,
                                        int (*init)(struct subprocess_info *info),
                                        void (*cleanup)(struct subprocess_info *info),
                                        void *data);

    The new init() function will be called from the helper process just before executing the helper function. There is also a new function:

        int call_usermodehelper_fns(char *path, char **argv, char **envp,
                                    enum umh_wait wait,
                                    int (*init)(struct subprocess_info *info),
                                    void (*cleanup)(struct subprocess_info *),
                                    void *data);

    This variant is like call_usermodehelper() but it allows the specification of the initialization and cleanup functions at the same time.

  • The fsync() member of struct file_operations has lost its struct dentry pointer argument, which was not used by any implementations.

  • The new truncate sequence patches have been merged, changing how truncate() is handled in the VFS layer.
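The init() and cleanup() hooks in the user-mode helper API can be pictured with a user-space analogy. The sketch below is not kernel code: struct helper_info, run_helper_with_hooks(), refuse_init(), count_cleanup(), and the exit codes 126 and 127 are all hypothetical names and values chosen for illustration. It only shows the ordering described above, with init() running in the helper process just before the exec, and it assumes that a failing init() aborts the helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for the kernel's struct subprocess_info. */
struct helper_info {
    const char *path;
    char **argv;
    char **envp;
    void *data;          /* private data shared with the hooks */
};

/*
 * Run a helper program with optional hooks. Returns the helper's exit
 * status, or -1 if it could not be started or did not exit normally.
 */
static int run_helper_with_hooks(struct helper_info *info,
                                 int (*init)(struct helper_info *info),
                                 void (*cleanup)(struct helper_info *info))
{
    int result = -1;
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: init() runs in the helper process, just before the exec. */
        if (init && init(info) != 0)
            _exit(126);              /* assumption: a failed init aborts */
        execve(info->path, info->argv, info->envp);
        _exit(127);                  /* only reached if the exec failed */
    }
    if (pid > 0) {
        int status;

        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }
    /* Parent: cleanup() runs once the helper is finished (or never started). */
    if (cleanup)
        cleanup(info);
    return result;
}

/* Example hooks: an init that always refuses, and a cleanup that counts calls. */
static int refuse_init(struct helper_info *info)
{
    (void)info;
    return -1;
}

static void count_cleanup(struct helper_info *info)
{
    (*(int *)info->data)++;
}
```

Passing both hooks in a single call mirrors the convenience that call_usermodehelper_fns() offers over setting them separately.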

As is always the case, a few things were not merged. In the end, suspend blockers did not make it; there was really no question of that given the way the discussion went toward the end of the merge window. The fanotify file notification interface did not go in, despite the lack of public opposition. Also not merged was the latest uprobes posting. Concurrency-managed workqueues remain outside of the mainline, as does a set of patches meant to prepare the ground for that feature. Transparent hugepages did not go in, but it was probably a bit early for that code in any case. The open by handle system calls went through a bunch of revisions prior to and during the merge window, but remain unmerged. A number of these features can be expected to try again in 2.6.36; others will probably vanish.

All told, some 8,113 non-merge changesets were accepted during the 2.6.35 merge window - distinctly more than the 6,032 merged during the 2.6.34 window. Linus's announcement suggests that a few more changes might make their way in after the -rc1 release, but that number will be small. Almost exactly 1000 developers have participated in this development cycle so far. As Linus noted in the 2.6.35-rc1 announcement, the development process continues to look healthy.


What comes after suspend blockers

By Jonathan Corbet
June 1, 2010
It looked like it was almost a done deal: after more than a year of discussions, it seemed that most of the objections to the Android "suspend blockers" concept had been addressed. The code had gone into a tree which feeds into linux-next, and a pull request was sent to Linus. All that remained was to see whether Linus actually pulled it. That did not happen; by the end of the merge window, the newly reinvigorated discussion had made that outcome unsurprising. But the discussion which destroyed any chance of getting that code in has, in the end, yielded the beginnings of an approach which may be acceptable to all participants. This article will take a technical look at the latest round of objections and the potential solution.

As a reminder, suspend blockers (formerly "wakelocks") came about as part of the power management system used on Android phones. Whenever possible, the Android developers want to put the phone into a fully suspended state, where power consumption is minimized. The Android model calls for automatic ("opportunistic") suspend to happen even if there are processes which are running. In this way, badly-written programs are prevented from draining the battery too quickly.

But a phone which is suspended all the time, while it does indeed run a long time on a single charge, is also painful to use. So there are times when the phone must be kept running; these times include anytime that the display is on. It's also important to not suspend the phone when interesting things are happening; that's where suspend blockers come in. The arrival of a key event, for example, will cause a suspend blocker to be obtained within the kernel; that blocker will be released after the event has been read by user space. The user-space application, meanwhile, takes a suspend blocker of its own before reading events; that will keep the system running after the kernel releases the first blocker. The user-space blocker is only released once the event has been fully processed; at that point, the phone can suspend.
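The handoff between the kernel-space and user-space blockers can be modeled in a few lines. This is a simplified sketch, not the Android implementation: struct suspend_model and the function names are hypothetical, and all blockers are collapsed into a single counter. The point is that the user-space blocker is taken before the kernel's is released, so there is never a moment during event handling when the system may suspend.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: every held suspend blocker bumps one counter. */
struct suspend_model {
    int active_blockers;   /* kernel-space plus user-space blockers held */
    bool display_on;       /* a lit display also keeps the phone running */
};

static void blocker_acquire(struct suspend_model *m)
{
    m->active_blockers++;
}

static void blocker_release(struct suspend_model *m)
{
    assert(m->active_blockers > 0);
    m->active_blockers--;
}

static bool may_suspend(const struct suspend_model *m)
{
    return !m->display_on && m->active_blockers == 0;
}

/*
 * Walk through the key-event sequence described in the text. Returns true
 * if the system could have suspended before the event was fully processed.
 */
static bool key_event_handoff_has_gap(struct suspend_model *m)
{
    bool gap = false;

    blocker_acquire(m);        /* kernel: key event arrives */
    gap |= may_suspend(m);
    blocker_acquire(m);        /* user space: taken before reading events */
    gap |= may_suspend(m);
    blocker_release(m);        /* kernel: the event has been read */
    gap |= may_suspend(m);     /* the user-space blocker still holds */
    /* ... the application processes the event here ... */
    blocker_release(m);        /* user space: processing is complete */
    return gap;
}
```

After the final release the counter is back to zero, so the phone is free to suspend again.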

The latest round of objections included some themes which had been heard before: in particular, the suspend blocker ABI, once added to the kernel, must be maintained for a very long time. Since there was a lot of unhappiness with that ABI, it's not surprising that many kernel developers did not want to be burdened with it indefinitely. There are also familiar concerns about the in-kernel suspend blocker calls spreading to "infect" increasing numbers of drivers. And the idea that the kernel should act to protect the system against badly-written applications remains controversial; some simply see that approach as making a more robust system, while others see it as a recipe for the proliferation of bad code.

Quality of service

The other criticisms, though, came from a different direction: suspend blockers were seen as a brute-force solution to a resource management problem which can (and should) be solved in a way which is more flexible, meets the needs of a wider range of users, and which is not tied to current Android hardware. In this view, "suspend" is not a special and unique state of the system; it is, instead, just an especially deep idle state which can be managed with the usual cpuidle logic. The kernel currently uses quality-of-service (QOS) information provided through the pm_qos API to choose between idle states; with an expanded view of QOS, that same logic could incorporate full suspend as well.

In other words, using cpuidle, current kernels already implement the "opportunistic suspend" idea - for the set of sleep states known to the cpuidle code now. On x86 hardware, a true "suspend" is a different hardware state than the sleep states used by cpuidle, but (1) the kernel could hide those differences, and (2) architectures which are more oriented toward embedded applications tend to treat suspend as just another idle state already. There are signs that x86 is moving in the same direction, where there will be nothing all that special about the suspended state.

That said, there are some differences at the software level. Current idle states are only entered when the system is truly idle, while opportunistic suspend can happen while processes are running. Idle states do not stop timers within the kernel, while suspend does. Suspend, in other words, is a convenient way to bring everything to a stop - whether or not it would stop of its own accord - until some sort of sufficiently interesting event arrives. The differences appear to be enough - for now - to make a "pure" QOS-based implementation impossible; things can head in that direction, though, so it's worth looking at that vision.

To repeat: current CPU idle states are chosen based on the QOS requirements indicated by the kernel. If some kernel subsystem claims that it needs to run with latencies measured in microseconds, the kernel knows that it cannot use a deep sleep state. Bringing suspend into this model will probably involve the creation of a new QOS level, often called "QOS_NONE", which specifies that any amount of latency is acceptable. If nothing in the system is asking for a QOS greater than QOS_NONE, the kernel knows that it can choose "suspend" as an idle state if that seems to make sense. Of course, the kernel would also have to know that any scheduled timers can be delayed indefinitely; the timer slack mechanism already exists to make that information available, but this mechanism is new and almost unused.
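A rough sketch of that selection logic follows. It is not the kernel's cpuidle governor: the state table, its latency numbers, the encoding of QOS_NONE as -1, and the function name pick_idle_state() are all assumptions made for illustration. The timer question (whether pending timers can be delayed indefinitely) is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding: "any amount of latency is acceptable". */
#define QOS_NONE (-1)

struct idle_state {
    const char *name;
    int exit_latency_us;   /* time needed to wake up from this state */
    int is_suspend;        /* nonzero for full suspend */
};

/* Example table, shallowest to deepest; the numbers are made up. */
static const struct idle_state example_states[] = {
    { "C1",        1, 0 },
    { "C3",       50, 0 },
    { "C6",      200, 0 },
    { "suspend",   0, 1 },
};

/*
 * Return the index of the deepest state compatible with the tightest
 * latency requirement in the system. State 0 is assumed to be always
 * acceptable. Full suspend is chosen only when nothing asks for more
 * than QOS_NONE.
 */
static size_t pick_idle_state(const struct idle_state *states, size_t n,
                              int latency_req_us)
{
    size_t best = 0;

    for (size_t i = 0; i < n; i++) {
        if (states[i].is_suspend) {
            if (latency_req_us == QOS_NONE)
                best = i;
        } else if (latency_req_us == QOS_NONE ||
                   states[i].exit_latency_us <= latency_req_us) {
            best = i;
        }
    }
    return best;
}
```

With this table, a 10-microsecond requirement keeps the processor in the shallowest state, while dropping every requirement to QOS_NONE makes suspend eligible.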

In a system like this, untrusted applications could be run in some sort of jail (a control group, say) where they can be restricted to QOS_NONE. In some versions, the QOS level of that cgroup is changed dynamically between "normal" and QOS_NONE depending on whether the system as a whole thinks it would like to suspend. Once untrusted applications are marked in this way, they can no longer prevent the system from suspending - almost.

One minor difficulty that comes in is that, if suspend is an idle state, the system must go idle before suspending becomes an option. If the application just sits on the CPU, it can still keep the system as a whole from suspending. Android's opportunistic suspend is designed to deal with this problem; it will suspend the system regardless of what those applications are trying to do. In the absence of this kind of forced suspend, there must be some way to keep those applications from keeping the system awake.

One intriguing idea was to state that QOS_NONE means that a process might be forced to wait indefinitely for the CPU, even if it is in a runnable state; the scheduler could then decree the system to be idle if only QOS_NONE processes are runnable. Peter Zijlstra worries that not running runnable tasks will inevitably lead to all kinds of priority and lock inversion problems; he does not want to go there. So this approach did not get very far.

An alternative is to defer any I/O operations requested by QOS_NONE processes when the system is trying to suspend. A process which is waiting for I/O goes idle naturally; if one assumes that even the most CPU-hungry application will do I/O eventually, it should be possible to block all processes this way. Another is to have a user-space daemon which informs processes that it's time to stop what they are doing and go idle. Any process which fails to comply can be reminded with a series of increasingly urgent signals, culminating in SIGKILL if need be.

Meanwhile, in the real world

Approaches like this can be implemented, and they may well be the long-term solution. But they are not an immediate solution. Among other things, a purely QOS-based solution will require that drivers change the system's overall QOS level in response to events. When something interesting happens, the system should not be allowed to suspend until user space has had a chance to respond. So important drivers will need to be augmented with internal QOS calls - kernel-space suspend blockers in all but name, essentially. Timers will need to be changed so that those which can be delayed indefinitely do not prevent the system from suspending. It might also be necessary to temporarily pass a higher level of QOS to applications when waking them up to deal with events. All of this can probably be done in a way that can be merged, but it won't solve Android's problem now.

So what we may see in the relatively near future is a solution based on an approach described by Alan Stern. Alan's idea retains the use of forced suspend, though not quite in the opportunistic mode. Instead, there would be a "QOS suspend" mode attainable by explicitly writing "qos" to /sys/power/state. If there are no QOS constraints active when "QOS suspend" is requested, the system will suspend immediately; otherwise, the process writing to /sys/power/state will block until those constraints are released. Additionally, there would be a new QOS constraint called QOS_EVENTUALLY which is compatible with any idle state except full suspend. These constraints - held only within the kernel - would block suspend when things are happening.

In other words, Android's kernel-space suspend blockers turn into QOS_EVENTUALLY constraints. The difference is that QOS terms are being used, and the kernel can make its best choice on how those constraints will be met.

There are no user-space suspend blockers in Alan's approach; instead, there is a daemon process which tries to put the system into the "QOS suspend" state whenever it thinks that nothing interesting is happening. Applications could communicate with that daemon to request that the system not be suspended; the daemon could then honor those requests (or not) depending on whatever policy it implements. Thus, the system suspends when both the kernel and user space agree that it's the right thing to do, and it doesn't require that all processes go idle first. This mechanism also makes it easy to track which processes are blocking suspend - an important requirement for the Android folks.
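Since no code has been posted, the following is only a guess at the shape of the decision logic, with hypothetical names throughout (struct qos_suspend_model, write_qos_to_power_state(), daemon_try_suspend()). It models the two layers described above: the kernel blocks a "QOS suspend" request while QOS_EVENTUALLY constraints are held, and the daemon applies its own policy before asking at all. The policy shown, refusing whenever any application has asked to stay awake, is just one possibility.

```c
#include <assert.h>
#include <stdbool.h>

struct qos_suspend_model {
    int eventually_constraints;  /* QOS_EVENTUALLY constraints held in the kernel */
    int user_requests;           /* applications asking the daemon to stay awake */
};

enum suspend_result {
    SUSPEND_NOW,             /* the system suspends immediately */
    SUSPEND_WOULD_BLOCK,     /* the writer blocks until constraints are released */
    SUSPEND_NOT_REQUESTED,   /* the daemon decided not to ask */
};

/* Kernel side: what happens when "qos" is written to /sys/power/state. */
static enum suspend_result
write_qos_to_power_state(const struct qos_suspend_model *m)
{
    return m->eventually_constraints == 0 ? SUSPEND_NOW : SUSPEND_WOULD_BLOCK;
}

/* Daemon side: one possible policy, honoring every stay-awake request. */
static enum suspend_result
daemon_try_suspend(const struct qos_suspend_model *m)
{
    if (m->user_requests > 0)
        return SUSPEND_NOT_REQUESTED;
    return write_qos_to_power_state(m);
}
```

Because the daemon keeps its own list of requests, it is also the natural place to report which processes are currently preventing suspend.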

In summary, as Alan put it:

The advantages of this scheme are that this does everything the Android people need, and it does it in a way that's entirely compatible with pure QoS/cpuidle-based power management. It even starts along the path of making suspend-to-RAM just another kind of dynamic power state.

Android developer Brian Swetland agreed, saying "...from what I can see it certainly seems like this model provides us with the functionality we're looking for." So we might just have the form of a real solution.

There are a number of loose ends to tie down, of course. Additionally, various alternatives are still being discussed; one approach would replace user-space wakelocks with a special device which can be used to express QOS constraints, for example. There is also the little nagging issue that nobody has actually posted any code. That problem notwithstanding, it seems like there could be a way forward which would finally break the roadblock that has kept so much Android code out of the kernel for so long.


Symbolic links in "sticky" directories

By Jake Edge
June 2, 2010

Security problems that exploit badly written programs by placing symbolic links in /tmp are legion. This kind of flaw has existed in applications going back to the dawn of UNIX time, and new ones get introduced regularly. So a recent effort to change the kernel to avoid these kinds of problems would seem, at first glance anyway, to be welcome. But some kernel hackers are not convinced that the core kernel should be fixing badly written applications.

These /tmp symlink races are in a class of security vulnerabilities known as time-of-check-to-time-of-use (TOCTTOU) bugs. For /tmp files, typically a buggy application will check to see if a particular filename exists and/or if the file has a particular set of characteristics; if the file passes that test, the program uses it. An attacker exploits this by racing to put a symbolic link or different file in /tmp between the time of the check and the open or create. That allows the attacker to bypass whatever the checks are supposed to enforce.

For programs with normal privilege levels, these attacks can cause a variety of problems, but don't lead to system compromise. But for setuid programs, an attacker can use the elevated privileges to overwrite arbitrary files in ways that can lead to all manner of ugliness, including complete compromise via privilege escalation. There are various guides that describe how to avoid writing code with this kind of vulnerability, but the flaw still gets reported frequently.

Ubuntu security team member Kees Cook proposed changing the kernel to avoid the problem, not by removing the race, but by stopping programs from following the symlinks that get created. "Proper" fixes in applications will completely avoid the race by creating random filenames that get opened with O_CREAT|O_EXCL. But, since these problems keep cropping up after multiple decades of warnings, perhaps another approach is in order. Cook adapted code from the Openwall and grsecurity kernels that did just that.
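For reference, the application-side fix looks roughly like the sketch below. The helper names create_exclusive() and create_random_temp() and the /tmp/example-XXXXXX template are illustrative; open(), O_CREAT, O_EXCL, and mkstemp() are standard POSIX. With O_CREAT|O_EXCL, open() fails with EEXIST if the name already exists, including when it is a symbolic link, so an attacker's link is never followed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file that must not already exist. Returns a file descriptor,
 * or -1 with errno set (EEXIST if anything, even a symlink, has the name).
 */
static int create_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/*
 * Let the C library pick a random name and open it exclusively.
 * On success the chosen name is left in name_out.
 */
static int create_random_temp(char *name_out, size_t len)
{
    int n = snprintf(name_out, len, "/tmp/example-XXXXXX");

    if (n < 0 || (size_t)n >= len) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return mkstemp(name_out);   /* replaces the XXXXXX suffix */
}
```

There is no separate existence check here for an attacker to race against: the check and the creation are a single system call.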

Since the problems occur in shared directories (like /tmp and /var/tmp) which are world-writable, but with the "sticky bit" turned on so that users can only delete their own files, the patch restricts the kinds of symlinks that can be followed in sticky directories. In order for a symlink in a sticky directory to be followed, it must either be owned by the follower, or the directory and symlink must have the same owner. Since shared temporary directories are typically owned by root, and random attackers cannot create symlinks owned by root, this would eliminate the problems caused by /tmp file symlink races.
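The rule can be written as a small predicate. This is a sketch of the policy as described above, not Cook's patch: the function name and parameter list are hypothetical, and only the sticky bit is checked (any additional test for world-writability is left out).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Decide whether a process running as follower_uid may follow a symlink
 * owned by link_uid that lives in a directory with the given mode and owner.
 */
static bool may_follow_sticky_symlink(mode_t dir_mode, uid_t dir_uid,
                                      uid_t link_uid, uid_t follower_uid)
{
    if (!(dir_mode & S_ISVTX))
        return true;    /* the restriction applies only in sticky directories */
    if (link_uid == follower_uid)
        return true;    /* the follower created the link itself */
    if (link_uid == dir_uid)
        return true;    /* link and directory share an owner, typically root */
    return false;       /* somebody else's link in a shared directory */
}
```

In a root-owned /tmp, a link planted by user 1001 would thus be refused when user 1000 (or a setuid-root program) tries to follow it, while links created by root or by the follower still work.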

The first version of the patch elicited a few suggestions, and an ACK from Serge Hallyn, but no complaints. Cook obviously did a fair amount of research into the problem and anticipated some objections from earlier linux-kernel discussions, which he linked to in the post. He also linked to a list of 243 CVE entries that mention /tmp—not all are symlink races, but many of them are. When Cook revised and reposted the patch, though, a few complaints cropped up.

For one thing, Cook had anticipated that VFS developers would object to putting his test into that code, so he put it into the capabilities checks (cap_inode_follow_link()) instead. That didn't sit well with Eric Biederman, who said:

Placing this in cap_inode_follow_link is horrible naming. There is nothing capabilities about this. Either this needs to go into one or several of the security modules or this needs to go into the core vfs.

Alan Cox agreed that it should go into SELinux or some specialized Linux security module (LSM). He also suggested that giving each user their own /tmp mountpoint would solve the problem as well, without requiring any kernel changes: "Give your users their own /tmp. No kernel mods, no misbehaviours, no weirdomatic path walking hackery. No kernel patch needed that I can see."

But Cook and others are not convinced that there are any legitimate applications that require the ability to follow these kinds of symlinks. Given that following them has been a source of serious security holes, why not just fix it once and for all in the kernel? One could argue that changing the behavior would violate the POSIX standard—one of the objections Cook anticipated—but that argument may be a bit weak. Ted Ts'o believes that POSIX doesn't really apply because the sticky bit isn't in the standard:

So for people who to argue against this (which I believe to be a useful restriction, and not one that necessarily has to be done in a LSM), it's not sufficient to say that it is a POSIX violation, because it isn't. The sticky bit itself wasn't originally considered by POSIX, and many systems which implemented the sticky bit had no problems becoming [certified] as POSIX compliant.

Per-user /tmp directories might solve the problem, but come with an administrative burden of their own. Eric Paris notes that it might be a better solution, but it doesn't come for free:

Now, if we used filesystem namespaces regularly for years and users, administrators, and developers dealt with them often I agree that would probably be the preferred solution. It would solve this issue, but in introduces a whole host of other problems that are even more obvious and even likely to bite people.

Ts'o agrees: "I do have a slight preference against per-user /tmp mostly because it gets confusing for administrators, and because it could be used by rootkits to hide themselves in ways that would be hard for most system administrators to find." Based on that and other comments, Cook revised the patches again, moving the test into VFS, rather than trying to come in through the security subsystem.

In addition, he changed the code so that the new behavior defaulted "off" to address one of the bigger objections. Version 3 of the patch was posted on June 1, and has so far only seen comments from Al Viro, who doesn't seem convinced of the need for the change, but was nevertheless discussing implementation details.

It may be that Viro and other filesystem developers—Christoph Hellwig did not seem particularly in favor of the change for example—will oppose this change. It is, at some level, a band-aid to protect poorly written applications, but it also provides a measure of protection that some would like to have. As Cook pointed out, the Ubuntu kernel already has this protection, but he would like to see that protection extended to all kernel users. Whether that happens remains to be seen.



Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds