Brief items
The current development kernel is 2.6.35-rc1, released on May 30. Changes merged since last week's summary include "ramoops" (for saving oops information in persistent memory), direct I/O support in Btrfs, and some changes to how truncate() is handled. See the separate article below for a summary of changes; the full changelog has all the details.
Stable updates: 18.104.22.168 was released on June 1. It removes two patches which created problems for 22.214.171.124 users; only those who have experienced difficulties need to think about upgrading.
How is anyone supposed to use this? What are the semantics of this thing? What are the units of its return value? What is the base value of its return value? Does it return different times on different CPUs? I assume so, otherwise why does sched_clock_cpu() exist? <looks at the sched_clock_cpu() documentation, collapses in giggles>
The ACPI BIOS is the standard way of getting at processor idle states in the x86 world. So why would Linux want to move away from ACPI for its cpuidle driver? Len explains:
The motivating factor appears to be a BIOS bug shipping on Dell systems for some months now which disables a number of idle states. As a result, Len's test system takes 100W of power when using the ACPI idle code; when idle states are handled directly, power use drops to 85W. That seems like a savings worth having. The fact that Linux now uses significantly less power than certain other operating systems - which are dependent on ACPI still - is just icing on the cake.
In general, it makes sense to use hardware features directly in preference to BIOS solutions when we have the knowledge necessary to do so. There can be real advantages in eliminating the firmware middleman in such situations. It's nice to see a chip vendor - which certainly has the requisite knowledge - supporting the use of its hardware in this way.
noticed this problem and suggested that it should be fixed: "This is very important! We appear to be making a big mess which we can never fix up."
As it happens, the developers of drivers for these sensors tried to solve this problem earlier this year. That work culminated in a pull request asking Linus to accept the ambient light sensors framework into the 2.6.34 kernel. That pull never happened, though; Linus thought that these sensors should just be treated as another (human) input device, and others requested that it be expanded to support other types of sensors. This framework has languished ever since.
Perhaps the light sensor framework wasn't ready, but the end result is that its developers have gotten discouraged and every driver going into the system is implementing a different, incompatible API. Other drivers are waiting for things to stabilize; Alan Cox commented: "We have some intel drivers to submit as well when sanity prevails." It's a problem clearly requiring a solution, but it's not quite clear who will make another try at it or when that could happen.
Kernel development news
The 2.6.35 merge window ended with the 2.6.35-rc1 release on May 30. A relatively small number of changes have been merged since last week's summary; the most significant are summarized here.
User-visible changes include:
Changes visible to kernel developers include:
void call_usermodehelper_setfns(struct subprocess_info *info, int (*init)(struct subprocess_info *info), void (*cleanup)(struct subprocess_info *info), void *data);
The new init() function will be called from the helper process just before executing the helper function. There is also a new function:
int call_usermodehelper_fns(char *path, char **argv, char **envp, enum umh_wait wait, int (*init)(struct subprocess_info *info), void (*cleanup)(struct subprocess_info *), void *data);
This variant is like call_usermodehelper(), but it allows the initialization and cleanup functions to be specified in the same call.
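The ordering of the two callbacks can be illustrated outside the kernel. The following is a userspace analogy only, not kernel code: struct helper_info, run_helper_fns(), demo_init(), and demo_cleanup() are hypothetical names invented for this sketch, and treating a nonzero init() return as a reason to skip the exec is an assumption. It shows init() running in the helper process just before the exec, and cleanup() running once the helper has finished.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical userspace stand-in for the kernel's struct subprocess_info. */
struct helper_info {
    const char *path;
    char **argv;
    char **envp;
    void *data;
};

/*
 * Userspace analogy of a "run helper with init and cleanup" call:
 * fork, run init() in the child just before the exec, wait for the
 * helper, then run cleanup() in the parent.
 * Returns the helper's exit status, or -1 on failure.
 */
int run_helper_fns(const char *path, char **argv, char **envp,
                   int (*init)(struct helper_info *info),
                   void (*cleanup)(struct helper_info *info),
                   void *data)
{
    struct helper_info info = { path, argv, envp, data };
    int status = -1;
    pid_t pid;

    assert(path != NULL && argv != NULL);

    pid = fork();
    if (pid == 0) {
        /* Child: assumption of this sketch, a failing init() aborts. */
        if (init && init(&info) != 0)
            _exit(126);
        execve(path, argv, envp);
        _exit(127);                     /* only reached if exec failed */
    }

    if (pid > 0 && waitpid(pid, &status, 0) == pid)
        status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    else
        status = -1;

    if (cleanup)
        cleanup(&info);                 /* parent side, after the helper */
    return status;
}

/* Example init: give the helper a known working directory. */
int demo_init(struct helper_info *info)
{
    (void)info;
    return chdir("/");
}

/* Example cleanup: count invocations through the caller's data pointer. */
void demo_cleanup(struct helper_info *info)
{
    ++*(int *)info->data;
}
```

A caller would pass a path such as /bin/sh with its argument vector, plus a pointer to an int counter as the data argument; the counter changes only in the parent, since init() runs in the child's address space.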
As is always the case, a few things were not merged. In the end, suspend blockers did not make it; there was really no question of that given the way the discussion went toward the end of the merge window. The fanotify file notification interface did not go in, despite the lack of public opposition. Also not merged was the latest uprobes posting. Concurrency-managed workqueues remain outside of the mainline, as does a set of patches meant to prepare the ground for that feature. Transparent hugepages did not go in, but it was probably a bit early for that code in any case. The open by handle system calls went through a bunch of revisions prior to and during the merge window, but remain unmerged. A number of these features can be expected to try again in 2.6.36; others will probably vanish.
All told, some 8,113 non-merge changesets were accepted during the 2.6.35 merge window - distinctly more than the 6,032 merged during the 2.6.34 window. Linus's announcement suggests that a few more changes might make their way in after the -rc1 release, but that number will be small. Almost exactly 1000 developers have participated in this development cycle so far. As Linus noted in the 2.6.35-rc1 announcement, the development process continues to look healthy.
a pull request was sent to Linus. All that remained was to see whether Linus actually pulled it. That did not happen; by the end of the merge window, the newly reinvigorated discussion had made that outcome unsurprising. But the discussion which destroyed any chance of getting that code in has, in the end, yielded the beginnings of an approach which may be acceptable to all participants. This article will take a technical look at the latest round of objections and the potential solution.
As a reminder, suspend blockers (formerly "wakelocks") came about as part of the power management system used on Android phones. Whenever possible, the Android developers want to put the phone into a fully suspended state, where power consumption is minimized. The Android model calls for automatic ("opportunistic") suspend to happen even if there are processes which are running. In this way, badly-written programs are prevented from draining the battery too quickly.
But a phone which is suspended all the time, while it does indeed run a long time on a single charge, is also painful to use. So there are times when the phone must be kept running; these times include anytime that the display is on. It's also important to not suspend the phone when interesting things are happening; that's where suspend blockers come in. The arrival of a key event, for example, will cause a suspend blocker to be obtained within the kernel; that blocker will be released after the event has been read by user space. The user-space application, meanwhile, takes a suspend blocker of its own before reading events; that will keep the system running after the kernel releases the first blocker. The user-space blocker is only released once the event has been fully processed; at that point, the phone can suspend.
The latest round of objections included some themes which had been heard before: in particular, the suspend blocker ABI, once added to the kernel, must be maintained for a very long time. Since there was a lot of unhappiness with that ABI, it's not surprising that many kernel developers did not want to be burdened with it indefinitely. There are also familiar concerns about the in-kernel suspend blocker calls spreading to "infect" increasing numbers of drivers. And the idea that the kernel should act to protect the system against badly-written applications remains controversial; some simply see that approach as making a more robust system, while others see it as a recipe for the proliferation of bad code.
In other words, using cpuidle, current kernels already implement the "opportunistic suspend" idea - for the set of sleep states known to the cpuidle code now. On x86 hardware, a true "suspend" is a different hardware state than the sleep states used by cpuidle, but (1) the kernel could hide those differences, and (2) architectures which are more oriented toward embedded applications tend to treat suspend as just another idle state already. There are signs that x86 is moving in the same direction, where there will be nothing all that special about the suspended state.
That said, there are some differences at the software level. Current idle states are only entered when the system is truly idle, while opportunistic suspend can happen while processes are running. Idle states do not stop timers within the kernel, while suspend does. Suspend, in other words, is a convenient way to bring everything to a stop - whether or not it would stop of its own accord - until some sort of sufficiently interesting event arrives. The differences appear to be enough - for now - to make a "pure" QOS-based implementation impossible; things can head in that direction, though, so it's worth looking at that vision.
To repeat: current CPU idle states are chosen based on the QOS requirements indicated by the kernel. If some kernel subsystem claims that it needs to run with latencies measured in microseconds, the kernel knows that it cannot use a deep sleep state. Bringing suspend into this model will probably involve the creation of a new QOS level, often called "QOS_NONE", which specifies that any amount of latency is acceptable. If nothing in the system is asking for a QOS greater than QOS_NONE, the kernel knows that it can choose "suspend" as an idle state if that seems to make sense. Of course, the kernel would also have to know that any scheduled timers can be delayed indefinitely; the timer slack mechanism already exists to make that information available, but this mechanism is new and almost unused.
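The selection logic described here can be sketched as a small model. Everything in it is an assumption made for illustration: the state table, the latency figures, the encoding of QOS_NONE as INT_MAX, and the function names are not taken from the cpuidle or pm_qos code. The tightest outstanding latency request wins, and the deepest state whose wakeup latency fits that request is chosen; suspend becomes eligible only when nothing asks for more than QOS_NONE.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical encoding: QOS_NONE means any wakeup latency is acceptable. */
#define QOS_NONE INT_MAX

enum state_id { STATE_C1, STATE_C3, STATE_C6, STATE_SUSPEND };

struct idle_state {
    enum state_id id;
    int exit_latency_us;    /* illustrative numbers, not real hardware data */
};

/* Ordered from shallowest to deepest; suspend has unbounded latency. */
static const struct idle_state idle_states[] = {
    { STATE_C1,      1       },
    { STATE_C3,      100     },
    { STATE_C6,      1000    },
    { STATE_SUSPEND, INT_MAX },
};

/* The tightest (smallest) latency request wins; no requests means QOS_NONE. */
int effective_latency_limit(const int *requests_us, size_t n)
{
    int limit = QOS_NONE;

    for (size_t i = 0; i < n; i++)
        if (requests_us[i] < limit)
            limit = requests_us[i];
    return limit;
}

/* Pick the deepest state whose exit latency fits within the limit. */
const struct idle_state *pick_idle_state(int latency_limit_us)
{
    const struct idle_state *best = &idle_states[0];
    size_t n = sizeof(idle_states) / sizeof(idle_states[0]);

    for (size_t i = 0; i < n; i++)
        if (idle_states[i].exit_latency_us <= latency_limit_us)
            best = &idle_states[i];
    return best;
}
```

The model ignores the timer question raised above: a real implementation would also need to know that pending timers can be delayed indefinitely before treating suspend as just another idle state.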
In a system like this, untrusted applications could be run in some sort of jail (a control group, say) where they can be restricted to QOS_NONE. In some versions, the QOS level of that cgroup is changed dynamically between "normal" and QOS_NONE depending on whether the system as a whole thinks it would like to suspend. Once untrusted applications are marked in this way, they can no longer prevent the system from suspending - almost.
One minor difficulty is that, if suspend is an idle state, the system must go idle before suspending becomes an option. If an application just sits on the CPU, it can still keep the system as a whole from suspending. Android's opportunistic suspend is designed to deal with this problem; it will suspend the system regardless of what those applications are trying to do. In the absence of this kind of forced suspend, there must be some way to keep those applications from keeping the system awake.
One intriguing idea was to state that QOS_NONE means that a process might be forced to wait indefinitely for the CPU, even if it is in a runnable state; the scheduler could then decree the system to be idle if only QOS_NONE processes are runnable. Peter Zijlstra worries that not running runnable tasks will inevitably lead to all kinds of priority and lock inversion problems; he does not want to go there. So this approach did not get very far.
An alternative is to defer any I/O operations requested by QOS_NONE processes when the system is trying to suspend. A process which is waiting for I/O goes idle naturally; if one assumes that even the most CPU-hungry application will do I/O eventually, it should be possible to block all processes this way. Another is to have a user-space daemon which informs processes that it's time to stop what they are doing and go idle. Any process which fails to comply can be reminded with a series of increasingly urgent signals, culminating in SIGKILL if need be.
Approaches like this can be implemented, and they may well be the long-term solution. But they are not an immediate solution. Among other things, a purely QOS-based solution will require that drivers change the system's overall QOS level in response to events. When something interesting happens, the system should not be allowed to suspend until user space has had a chance to respond. So important drivers will need to be augmented with internal QOS calls - kernel-space suspend blockers in all but name, essentially. Timers will need to be changed so that those which can be delayed indefinitely do not prevent the system from suspending. It might also be necessary to temporarily pass a higher level of QOS to applications when waking them up to deal with events. All of this can probably be done in a way that can be merged, but it won't solve Android's problem now.
So what we may see in the relatively near future is a solution based on an approach described by Alan Stern. Alan's idea retains the use of forced suspend, though not quite in the opportunistic mode. Instead, there would be a "QOS suspend" mode attainable by explicitly writing "qos" to /sys/power/state. If there are no QOS constraints active when "QOS suspend" is requested, the system will suspend immediately; otherwise, the process writing to /sys/power/state will block until those constraints are released. Additionally, there would be a new QOS constraint called QOS_EVENTUALLY which is compatible with any idle state except full suspend. These constraints - held only within the kernel - would block suspend when things are happening.
In other words, Android's kernel-space suspend blockers turn into QOS_EVENTUALLY constraints. The difference is that QOS terms are being used, and the kernel can make its best choice on how those constraints will be met.
There are no user-space suspend blockers in Alan's approach; instead, there is a daemon process which tries to put the system into the "QOS suspend" state whenever it thinks that nothing interesting is happening. Applications could communicate with that daemon to request that the system not be suspended; the daemon could then honor those requests (or not) depending on whatever policy it implements. Thus, the system suspends when both the kernel and user space agree that it's the right thing to do, and it doesn't require that all processes go idle first. This mechanism also makes it easy to track which processes are blocking suspend - an important requirement for the Android folks.
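Since no code has been posted for this proposal, the following is only a toy model of the decision it describes, with every name in it (struct suspend_model, qos_eventually_get(), daemon_block(), and so on) invented for illustration. Kernel-side QOS_EVENTUALLY constraints are modeled as a counter, the daemon's policy as a list of processes that have asked it to keep the system awake, and suspend happens only when both sides agree.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLIENTS 8

/* Toy model of the proposal; none of these names exist in the kernel. */
struct suspend_model {
    int kernel_constraints;         /* active QOS_EVENTUALLY constraints */
    int client_pids[MAX_CLIENTS];   /* processes asking the daemon to wait */
    size_t nr_clients;
};

/* A driver takes a constraint when an interesting event arrives... */
void qos_eventually_get(struct suspend_model *m)
{
    m->kernel_constraints++;
}

/* ...and drops it once user space has consumed the event. */
void qos_eventually_put(struct suspend_model *m)
{
    if (m->kernel_constraints > 0)
        m->kernel_constraints--;
}

/* An application asks the daemon not to suspend; tracked by PID. */
bool daemon_block(struct suspend_model *m, int pid)
{
    if (m->nr_clients == MAX_CLIENTS)
        return false;
    m->client_pids[m->nr_clients++] = pid;
    return true;
}

/* The application withdraws its request. */
bool daemon_unblock(struct suspend_model *m, int pid)
{
    for (size_t i = 0; i < m->nr_clients; i++) {
        if (m->client_pids[i] == pid) {
            m->client_pids[i] = m->client_pids[--m->nr_clients];
            return true;
        }
    }
    return false;
}

/* Would a write of "qos" to /sys/power/state complete without blocking? */
bool qos_suspend_ready(const struct suspend_model *m)
{
    return m->kernel_constraints == 0;
}

/* The system suspends only when daemon policy and the kernel both agree. */
bool system_may_suspend(const struct suspend_model *m)
{
    return m->nr_clients == 0 && qos_suspend_ready(m);
}
```

Keeping the user-space requests in a per-process list is what makes it straightforward to report which processes are currently blocking suspend; the policy here (any request blocks) is the simplest possible one, and a real daemon could choose to ignore some requests.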
In summary, as Alan put it:
Android developer Brian Swetland agreed, saying "...from what I can see it certainly seems like this model provides us with the functionality we're looking for." So we might just have the form of a real solution.
There are a number of loose ends to tie down, of course. Additionally, various alternatives are still being discussed; one approach would replace user-space wakelocks with a special device which can be used to express QOS constraints, for example. There is also the little nagging issue that nobody has actually posted any code. That problem notwithstanding, it seems like there could be a way forward which would finally break the roadblock that has kept so much Android code out of the kernel for so long.
Security problems that exploit badly written programs by placing symbolic links in /tmp are legion. This kind of flaw has existed in applications going back to the dawn of UNIX time, and new ones get introduced regularly. So a recent effort to change the kernel to avoid these kinds of problems would seem, at first glance anyway, to be welcome. But some kernel hackers are not convinced that the core kernel should be fixing badly written applications.
These /tmp symlink races are in a class of security vulnerabilities known as time-of-check-to-time-of-use (TOCTTOU) bugs. For /tmp files, typically a buggy application will check to see if a particular filename exists and/or if the file has a particular set of characteristics; if the file passes that test, the program uses it. An attacker exploits this by racing to put a symbolic link or different file in /tmp between the time of the check and the open or create. That allows the attacker to bypass whatever the checks are supposed to enforce.
For programs with normal privilege levels, these attacks can cause a variety of problems, but don't lead to system compromise. But for setuid programs, an attacker can use the elevated privileges to overwrite arbitrary files in ways that can lead to all manner of ugliness, including complete compromise via privilege escalation. There are various guides that describe how to avoid writing code with this kind of vulnerability, but the flaw still gets reported frequently.
Ubuntu security team member Kees Cook proposed changing the kernel to avoid the problem, not by removing the race, but by stopping programs from following the symlinks that get created. "Proper" fixes in applications will completely avoid the race by creating random filenames that get opened with O_CREAT|O_EXCL. But, since these problems keep cropping up after multiple decades of warnings, perhaps another approach is in order. Cook adapted code from the Openwall and grsecurity kernels that did just that.
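For reference, the application-side fix mentioned above looks roughly like the following sketch. The function names are hypothetical; the calls themselves are standard POSIX. mkstemp() picks a random name and creates it with O_CREAT|O_EXCL semantics, and an explicit exclusive open refuses to proceed if anything, including a dangling symlink planted by an attacker, already exists at the path.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file only if nothing exists at 'path'. With O_CREAT|O_EXCL,
 * open() fails with EEXIST when the name is already taken, and it does
 * not follow a symbolic link sitting at that name.
 * Returns a file descriptor, or -1 with errno set.
 */
int open_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/*
 * Create a temporary file with a random name. 'tmpl' must be a writable
 * buffer ending in "XXXXXX"; mkstemp() fills in the name and performs
 * the exclusive create in one step.
 * Returns a file descriptor, or -1 with errno set.
 */
int make_private_temp(char *tmpl)
{
    return mkstemp(tmpl);
}
```

Neither call checks and then opens as two separate steps, so there is no window between check and use for an attacker to win.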
Since the problems occur in shared directories (like /tmp and /var/tmp) which are world-writable, but with the "sticky bit" turned on so that users can only delete their own files, the patch restricts the kinds of symlinks that can be followed in sticky directories. In order for a symlink in a sticky directory to be followed, it must either be owned by the follower, or the directory and symlink must have the same owner. Since shared temporary directories are typically owned by root, and random attackers cannot create symlinks owned by root, this would eliminate the problems caused by /tmp file symlink races.
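The rule can be written down as a small predicate. This is a sketch of the policy as described in the text, not the code from the patch: the function name and parameter list are made up, and it tests only the sticky bit, as the description above does.

```c
#define _XOPEN_SOURCE 700
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Decide whether a symlink may be followed under the proposed rule.
 *   dir_mode:     mode bits of the directory containing the symlink
 *   dir_uid:      owner of that directory
 *   link_uid:     owner of the symlink
 *   follower_uid: user on whose behalf the link is being followed
 */
bool may_follow_sticky_link(mode_t dir_mode, uid_t dir_uid,
                            uid_t link_uid, uid_t follower_uid)
{
    /* Outside sticky directories, behavior is unchanged. */
    if (!(dir_mode & S_ISVTX))
        return true;

    /* Followers may always follow their own symlinks. */
    if (link_uid == follower_uid)
        return true;

    /* Links created by the directory's owner (typically root) are trusted. */
    if (link_uid == dir_uid)
        return true;

    /* Anything else is a link planted by some other user: refuse. */
    return false;
}
```

With a root-owned /tmp of mode 01777, a link owned by UID 1001 is refused when UID 1000 tries to follow it, while links owned by UID 1000 or by root are still followed.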
The first version of the patch elicited a few suggestions, and an ACK by Serge Hallyn, but no complaints. Cook obviously did a fair amount of research into the problem and anticipated some objections from earlier linux-kernel discussions, which he linked to in the post. He also linked to a list of 243 CVE entries that mention /tmp—not all are symlink races, but many of them are. When Cook revised and reposted the patch, though, a few complaints cropped up.
For one thing, Cook had anticipated that VFS developers would object to putting his test into that code, so he put it into the capabilities checks (cap_inode_follow_link()) instead. That didn't sit well with Eric Biederman, who said:
Alan Cox agreed that it should go into SELinux or some specialized Linux security module (LSM). He also suggested that giving each user their own /tmp mountpoint would solve the problem as well, without requiring any kernel changes: "Give your users their own /tmp. No kernel mods, no misbehaviours, no weirdomatic path walking hackery. No kernel patch needed that I can see."
But Cook and others are not convinced that there are any legitimate applications that require the ability to follow these kinds of symlinks. Given that following them has been a source of serious security holes, why not just fix it once and for all in the kernel? One could argue that changing the behavior would violate the POSIX standard—one of the objections Cook anticipated—but that argument may be a bit weak. Ted Ts'o believes that POSIX doesn't really apply because the sticky bit isn't in the standard:
Per-user /tmp directories might solve the problem, but come with an administrative burden of their own. Eric Paris notes that it might be a better solution, but it doesn't come for free:
Ts'o agrees: "I do have a slight preference against per-user /tmp mostly because it gets confusing for administrators, and because it could be used by rootkits to hide themselves in ways that would be hard for most system administrators to find." Based on that and other comments, Cook revised the patches again, moving the test into VFS, rather than trying to come in through the security subsystem.
In addition, he changed the code so that the new behavior defaulted "off" to address one of the bigger objections. Version 3 of the patch was posted on June 1, and has so far only seen comments from Al Viro, who doesn't seem convinced of the need for the change, but was nevertheless discussing implementation details.
It may be that Viro and other filesystem developers—Christoph Hellwig did not seem particularly in favor of the change for example—will oppose this change. It is, at some level, a band-aid to protect poorly written applications, but it also provides a measure of protection that some would like to have. As Cook pointed out, the Ubuntu kernel already has this protection, but he would like to see that protection extended to all kernel users. Whether that happens remains to be seen.
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds