Brief items
The current development kernel is 2.6.35-rc1, released on May 30. Changes
merged since last week's summary include "ramoops" (for saving oops
information in persistent memory), direct I/O support in Btrfs, and some
changes to how truncate() is handled. See the separate article below for a
summary of the changes; the full changelog has all the details.
Stable updates: 2.6.32.15 was released on
June 1. It removes two patches which created problems for 2.6.32.14
users; only those who have experienced difficulties need to think about
upgrading.
The Linux approach is to do the job right. That means getting the
interface right and so it works across all the interested parties
(or as close as we can get)...
This is the Linux way of doing things. It's like the GPL and being
shouted at by Linus. They are things you accept when you choose to
take part.
-- Alan Cox
The kernel history is full of examples where crappy solutions got
rejected and kept out of the kernel for a long time even if there
was a need for them in the application field and they got shipped
in quantities with out of tree patches (NOHZ, high resolution
timers, ...). At some point people stopped arguing for crappy
solutions and sat down and got it right.
-- Thomas Gleixner
The whole "no reason to tolerate broken apps" mindset is simply
misguided IMO, because it's based on unrealistic assumptions.
That's because in general users only need the platform for running
apps they like (or need or whatever). If they can't run apps they
like on a given platform, or it is too painful to them to run their
apps on it, they will rather switch to another platform than stop
using the apps.
-- Rafael Wysocki
NAK NAK NAK NAK! QAK! HAK! Crap code! Stop adding undocumented
interfaces. Just stop it. Now. Geeze.
How is anyone supposed to use this? What are the semantics of this
thing? What are the units of its return value? What is the base
value of its return value? Does it return different times on
different CPUs? I assume so, otherwise why does sched_clock_cpu()
exist? <looks at the sched_clock_cpu() documentation, collapses in
giggles>
-- Andrew Morton after one patch too many
"The more we prohibit, the safer we are" is best left to the likes
of TSA; if we are really interested in security and not in security
theatre or BDSM fetishism, let's make sure that heuristics we use
make sense.
-- Al Viro
By Jonathan Corbet
June 1, 2010
Len Brown has spent some years working to ensure that Linux has top-quality
ACPI support. So it is interesting to see him working to short out some of
that ACPI code, but that is what has happened with his new cpuidle driver
for recent Intel processors. With this experimental feature - merged for
2.6.35 - Linux no longer depends on ACPI to handle idle state transitions
on Nehalem and Atom processors.
The ACPI BIOS is the standard way of getting at processor idle states in
the x86 world. So why would Linux want to move away from ACPI for its
cpuidle driver? Len explains:
ACPI has a few fundamental flaws here. One is that it reports exit
latency instead of break-even power duration. The other is that it
requires a BIOS writer to get the tables right. Both of these are
fatal flaws.
The motivating factor appears to be a BIOS bug shipping on Dell systems for
some months now which disables a number of idle states. As a result, Len's
test system takes 100W of power when using the ACPI idle code; when
idle states are handled directly, power use drops to 85W. That seems like
a savings worth having. The fact that Linux now uses significantly less
power than certain other operating systems - which are dependent on ACPI
still - is just icing on the cake.
In general, it makes sense to use hardware features directly in preference
to BIOS solutions when we have the knowledge necessary to do so. There can
be real advantages in eliminating the firmware middleman in such
situations. It's nice to see a chip vendor - which certainly has the
requisite knowledge - supporting the use of its hardware in this way.
By Jonathan Corbet
June 2, 2010
Ambient light sensors do exactly that: tell the system how much light
currently exists in the environment. They are useful for tasks like
automatically adjusting screen brightness for optimal readability. There are
a few drivers for such sensors in the kernel now, but there is no standard
for how those drivers should interface to user space. Andrew Morton recently
noticed this problem and suggested that it should be fixed: "This is very
important! We appear to be making a big mess which we can never fix up."
As it happens, the developers of drivers for these sensors tried to solve
this problem earlier this year. That work culminated in a
pull request asking Linus to accept the ambient light sensors framework
into the 2.6.34 kernel. That pull never happened, though; Linus thought
that these sensors should just be treated as another (human) input device,
and others requested
that it be expanded to support other types of sensors. This framework has
languished ever since.
Perhaps the light sensor framework wasn't ready, but the end result is that
its developers have gotten discouraged and every driver going into the
system is implementing a different, incompatible API. Other drivers are
waiting for things to stabilize; Alan Cox commented: "We have some intel drivers
to submit as well when sanity prevails." It's a problem
clearly requiring a solution, but it's not quite clear who will make
another try at it or when that could happen.
Kernel development news
By Jonathan Corbet
May 31, 2010
The 2.6.35 merge window was closed with the 2.6.35-rc1 release on May 30. A
relatively small number of changes have been merged since last week's
summary; the most significant are summarized here.
User-visible changes include:
- The "ramoops" driver allows the system to record oops information
in persistent RAM for recovery later.
- The Btrfs filesystem has gained basic direct I/O support.
- The FUSE filesystem now enables user-space filesystem implementations
to transfer data with the splice() system call, avoiding a
copy operation.
- A new, non-ACPI CPU-idle mechanism for Intel processors has been added
on an experimental basis. It seems that, with enough cleverness, it's
possible to save more power by handling idle states directly instead
of letting the ACPI BIOS do it.
- There are a few new drivers:
ST-Ericsson AB8500 power management IC RTC chips,
SMSC EMC1403 thermal sensors,
Texas Instruments TMP102 sensors,
MC13783 PMIC LED controllers,
Cirrus EP93xx backlight controllers,
ADP8860 backlight controllers,
RDC R-321x GPIO controllers,
Janz VMOD-TTL digital I/O modules,
Janz VMOD-ICAN3 Intelligent CAN controllers,
TPS6507x based power management chips and touchscreen controllers,
ST-Ericsson AB3550 mixed signal circuit devices,
AB8500 power management chips (replacing existing driver),
S6E63M0 AMOLED panels, and
NXP PCF50633 MFD backlight controllers.
Changes visible to kernel developers include:
- The user-mode helper API (used by the kernel to run programs in user
space) has changed somewhat.
call_usermodehelper_setcleanup() has become:
    void call_usermodehelper_setfns(struct subprocess_info *info,
                                    int (*init)(struct subprocess_info *info),
                                    void (*cleanup)(struct subprocess_info *info),
                                    void *data);
The new init() function will be called from the helper process
just before executing the helper program. There is also a new function:
    int call_usermodehelper_fns(char *path, char **argv, char **envp,
                                enum umh_wait wait,
                                int (*init)(struct subprocess_info *info),
                                void (*cleanup)(struct subprocess_info *),
                                void *data);
This variant is like call_usermodehelper() but it allows the
specification of the initialization and cleanup functions at the same
time.
- The fsync() member of struct file_operations has
lost its struct dentry pointer argument, which was not used
by any implementations.
- The new truncate sequence
patches have been merged, changing how truncate() is
handled in the VFS layer.
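The calling sequence described in the user-mode helper item above (init() runs in the helper process just before the program is executed, cleanup() runs afterward) can be illustrated with a user-space analogue. This is only a sketch, not kernel code: struct helper_info, run_helper_fns(), and the two example callbacks are hypothetical names, and treating a nonzero return from init() as "do not run the helper" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for the kernel's struct subprocess_info. */
struct helper_info {
    const char *path;
    char *const *argv;
    char *const *envp;
    void *data;
};

/*
 * Fork, call init() in the child just before execve(), wait for the child
 * in the parent, then call cleanup(). Returns the child's exit status, or
 * -1 if the child could not be started or did not exit normally.
 */
int run_helper_fns(const char *path, char *const argv[], char *const envp[],
                   int (*init)(struct helper_info *info),
                   void (*cleanup)(struct helper_info *info),
                   void *data)
{
    struct helper_info info = { path, argv, envp, data };
    int status = 0;
    int ret = -1;
    pid_t pid;

    assert(path != NULL && argv != NULL);

    pid = fork();
    if (pid == 0) {
        /* Child: a failing init() aborts the helper before exec (assumed). */
        if (init && init(&info) != 0)
            _exit(126);
        execve(path, argv, envp);
        _exit(127);             /* only reached if execve() failed */
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        ret = WEXITSTATUS(status);

    /* Parent: release whatever the caller attached to info.data. */
    if (cleanup)
        cleanup(&info);
    return ret;
}

/* Example callbacks: init changes directory, cleanup bumps a counter. */
static int init_chdir_root(struct helper_info *info)
{
    (void)info;
    return chdir("/");
}

static void cleanup_count(struct helper_info *info)
{
    ++*(int *)info->data;
}
```

Passing both callbacks in one call mirrors what call_usermodehelper_fns() offers over a separate setup step: the caller cannot forget to register the cleanup function for data it handed to init().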
As is always the case, a few things were not merged. In the end, suspend
blockers did not make it; there was really no question of that given the way
the discussion went toward the end of the merge window. The fanotify file
notification interface did not go in, despite the lack of public opposition.
Also not merged was the latest uprobes posting. Concurrency-managed
workqueues remain outside of the mainline, as does a set of patches meant to
prepare the ground for that feature. Transparent hugepages did not go in, but
it was probably a bit early for that code in any case. The open by handle
system calls went through a bunch of revisions prior to and during the merge
window, but remain unmerged. A number of these features can be expected to
try again in 2.6.36; others will probably vanish.
All told, some 8,113 non-merge changesets were accepted during the 2.6.35
merge window - distinctly more than the 6,032 merged during the 2.6.34
window. Linus's announcement suggests that a few more changes might
make their way in after the -rc1 release, but that number will be small.
Almost exactly 1000 developers have participated in this development cycle
so far. As Linus noted in the 2.6.35-rc1 announcement, the development
process continues to look healthy.
By Jonathan Corbet
June 1, 2010
It looked like it was almost a done deal: after more than a year of
discussions, it seemed that most of the objections to the Android "suspend
blockers" concept had been addressed. The code had gone into a tree which
feeds into linux-next, and a pull request was sent to Linus. All that
remained was to see whether Linus
actually pulled it. That did not happen; by the end of the merge window,
the newly reinvigorated discussion had made that outcome unsurprising. But
the discussion which destroyed any chance of getting that code in has, in
the end, yielded the beginnings of an approach which may be acceptable to
all participants. This article will take a technical look at the latest
round of objections and the potential
solution.
As a reminder, suspend blockers (formerly "wakelocks") came about as part
of the power management system used on Android phones. Whenever possible,
the Android developers want to put the phone into a fully suspended state,
where power consumption is minimized. The Android model calls for
automatic ("opportunistic") suspend to happen even if there are processes
which are running. In this way, badly-written programs are prevented from
draining the battery too quickly.
But a phone which is suspended all the time, while it does indeed run a
long time on a single charge, is also painful to use. So there are times
when the phone must be kept running; these times include any time that the
display is on. It's also important to not suspend the phone when
interesting things are happening; that's where suspend blockers come in.
The arrival of a key event, for example, will cause a suspend blocker to be
obtained within the kernel; that blocker will be released after the event
has been read by user space. The user-space application, meanwhile, takes
a suspend blocker of its own before reading events; that will keep the
system running after the kernel releases the first blocker. The user-space
blocker is only released once the event has been fully processed; at that
point, the phone can suspend.
The latest round of objections included some themes which had been heard
before: in particular, the suspend blocker ABI, once added to the kernel,
must be maintained for a very long time. Since there was a lot of
unhappiness with that ABI, it's not surprising that many kernel developers
did not want to be burdened with it indefinitely. There are also familiar
concerns about the in-kernel suspend blocker calls spreading to "infect"
increasing numbers of drivers. And the idea that the kernel should act to
protect the system against badly-written applications remains
controversial; some simply see that approach as making a more robust
system, while others see it as a recipe for the proliferation of bad code.
Quality of service
The other criticisms, though, came from a different direction: suspend
blockers were seen as a brute-force solution to a resource management
problem which can (and should) be solved in a way which is more flexible,
meets the needs of a wider range of users, and which is not tied to current
Android hardware. In this view, "suspend" is not a special and
unique state of the system; it is, instead, just an especially deep idle
state which can be managed with the usual cpuidle logic. The kernel currently
uses quality-of-service (QOS) information provided through the pm_qos API to
choose between idle states; with an expanded view of QOS, that same logic
could incorporate full suspend as well.
In other words, using cpuidle, current kernels already implement the
"opportunistic suspend" idea - for the set of sleep states known to the
cpuidle code now. On x86 hardware, a true "suspend" is a different hardware state
than the sleep states used by cpuidle, but (1) the kernel could hide
those differences, and (2) architectures which are more oriented
toward embedded applications tend to treat suspend as just another idle
state already. There are signs that x86 is moving in the same direction,
where there will be nothing all that special about the suspended state.
That said, there are some differences at the software level. Current idle
states are only entered when the system is truly idle, while opportunistic
suspend can happen while processes are running. Idle states do not stop
timers within the kernel, while suspend does. Suspend, in other words, is
a convenient way to bring everything to a stop - whether or not it would
stop of its own accord - until some sort of
sufficiently interesting event arrives. The differences appear to be
enough - for now - to make a "pure" QOS-based implementation impossible;
things can head in that direction, though, so it's worth looking at that
vision.
To repeat: current CPU idle states are chosen based on the QOS
requirements indicated by the kernel. If some kernel subsystem claims that
it needs to run with latencies measured in microseconds, the kernel knows
that it cannot use a deep sleep state. Bringing suspend into this model
will probably involve the creation of a new QOS level, often called "QOS_NONE",
which specifies that any amount of latency is acceptable. If
nothing in the system is asking for a QOS greater than QOS_NONE, the kernel
knows that it can choose "suspend" as an idle state if that seems to make
sense. Of course, the kernel would also have to know that any scheduled
timers can be delayed indefinitely; the timer slack mechanism already
exists to make that information available, but this mechanism is new and
almost unused.
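As a rough illustration of the selection logic described above, the following sketch picks the deepest idle state whose exit latency fits the tightest outstanding latency request, and treats "suspend" as eligible only when nothing asks for more than QOS_NONE. Everything here is hypothetical: the state table, the latency numbers, and the function names are invented for the example and are not the kernel's pm_qos or cpuidle interfaces. The sketch also ignores the timer question raised in the text.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical value standing in for the proposed "QOS_NONE" level:
 * any amount of wakeup latency is acceptable. */
#define QOS_NONE_LATENCY_US INT_MAX

struct idle_state {
    const char *name;
    int exit_latency_us;    /* time needed to wake up from this state */
};

/* Ordered from shallowest to deepest; the latencies are made-up numbers. */
static const struct idle_state idle_states[] = {
    { "C1",      1 },
    { "C3",      20 },
    { "C6",      200 },
    { "suspend", INT_MAX }, /* acceptable only under QOS_NONE */
};

/* The tightest (smallest) latency request wins; no requests means QOS_NONE. */
int effective_latency_us(const int *requests, size_t n)
{
    int min = QOS_NONE_LATENCY_US;
    size_t i;

    assert(requests != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (requests[i] < min)
            min = requests[i];
    return min;
}

/* Return the deepest state whose exit latency fits within the bound,
 * falling back to the shallowest state if none does. */
const struct idle_state *pick_idle_state(int max_latency_us)
{
    const size_t n = sizeof(idle_states) / sizeof(idle_states[0]);
    const struct idle_state *best = &idle_states[0];
    size_t i;

    for (i = 1; i < n; i++)
        if (idle_states[i].exit_latency_us <= max_latency_us)
            best = &idle_states[i];
    return best;
}
```

With requests of 5000 and 50 microseconds outstanding, the effective bound is 50 and the sketch settles on "C3"; only when every request is QOS_NONE does "suspend" become a candidate.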
In a system like this, untrusted applications could be run in some sort of
jail (a control group, say) where they can be restricted to QOS_NONE. In
some versions, the QOS level of that cgroup is changed dynamically between
"normal" and QOS_NONE depending on whether the system as a whole thinks it
would like to suspend. Once untrusted applications are marked in this way,
they can no longer prevent the system from suspending - almost.
One minor difficulty is that, if suspend is an idle state,
the system must go idle before suspending becomes an option. If an
application just sits on the CPU, it can still keep the system as a whole
from suspending. Android's opportunistic suspend is designed to deal with
this problem; it will suspend the system regardless of what those
applications are trying to do. In the absence of this kind of forced
suspend, there must be some way to keep those applications from keeping the
system awake.
One intriguing idea was to state that QOS_NONE means that a process might
be forced to wait indefinitely for the CPU, even if it is in a runnable
state; the scheduler could then decree the system to be idle if only
QOS_NONE processes are runnable. Peter Zijlstra worries that not running runnable tasks will
inevitably lead to all kinds of priority and lock inversion problems; he
does not want to go there. So this approach did not get very far.
An alternative is to defer any I/O operations requested by QOS_NONE
processes when the system is trying to suspend. A process which is waiting
for I/O goes idle naturally; if one assumes that even the most CPU-hungry
application will do I/O eventually, it should be possible to block all
processes this way. Another is to have a user-space daemon which informs
processes that it's time to stop what they are doing and go idle. Any
process which fails to comply can be reminded with a series of increasingly
urgent signals, culminating in SIGKILL if need be.
Meanwhile, in the real world
Approaches like this can be implemented, and they may well be the long-term
solution. But they are not an immediate solution. Among other things, a
purely QOS-based solution will require that drivers change the system's
overall QOS level in response to events. When something interesting
happens, the system should not be allowed to suspend until user space has
had a chance to respond. So important drivers will need to be augmented
with internal QOS calls - kernel-space suspend blockers in all but name,
essentially. Timers will need to be changed so that those which can be
delayed indefinitely do not prevent the system from suspending.
It might also be necessary to temporarily pass a higher level
of QOS to applications when waking them up to deal with events. All of
this can probably be done in a way that can be merged, but it won't solve
Android's problem now.
So what we may see in the relatively near future is a solution based on an approach described by Alan Stern. Alan's
idea retains the use of forced suspend, though not quite in the
opportunistic mode. Instead, there would be a "QOS suspend" mode
attainable by explicitly writing "qos" to /sys/power/state. If
there are no QOS constraints active when "QOS suspend" is requested, the
system will suspend immediately; otherwise,
the process writing to /sys/power/state will block until those
constraints are released. Additionally, there would be a new QOS
constraint called QOS_EVENTUALLY which is compatible with any idle state
except full suspend. These constraints - held only within the
kernel - would block suspend when things are happening.
In other words, Android's kernel-space suspend blockers turn into
QOS_EVENTUALLY constraints. The difference is that QOS terms are being
used, and the kernel can make its best choice on how those constraints will
be met.
There are no user-space suspend blockers in Alan's approach; instead, there
is a daemon process which tries to put the system into the "QOS suspend"
state whenever it thinks that nothing interesting is happening.
Applications could communicate with that daemon to request that the system
not be suspended; the daemon could then honor those requests (or not)
depending on whatever policy it implements. Thus, the system suspends when
both the kernel and user space agree that it's the right thing to do, and
it doesn't require that all processes go idle first. This mechanism also
makes it easy to track which processes are blocking suspend - an important
requirement for the Android folks.
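The user-space half of this scheme might look something like the sketch below. It is purely illustrative: no code had been posted, the "qos" keyword for /sys/power/state existed only as a proposal, and the structure and function names are invented here. The daemon's policy is reduced to a simple count of applications that have asked for the system to stay awake; the state file path is a parameter so that the logic can be exercised against an ordinary file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical daemon state: where to write, and how many applications
 * have currently asked that the system not be suspended. */
struct suspend_daemon {
    const char *state_path;     /* "/sys/power/state" on a real system */
    int app_requests;
};

/* An application asks the daemon to keep the system awake. */
void daemon_block(struct suspend_daemon *d)
{
    d->app_requests++;
}

/* An application withdraws an earlier request. */
void daemon_unblock(struct suspend_daemon *d)
{
    assert(d->app_requests > 0);
    d->app_requests--;
}

/*
 * Try to enter "QOS suspend". Returns 0 if policy declined because some
 * application still wants the system awake, 1 if the request was written,
 * and -1 on an I/O error. In the proposal, the write itself would block
 * until all in-kernel QOS_EVENTUALLY constraints had been released.
 */
int daemon_try_suspend(struct suspend_daemon *d)
{
    FILE *f;
    int ok;

    if (d->app_requests > 0)
        return 0;

    f = fopen(d->state_path, "w");
    if (f == NULL)
        return -1;
    ok = fputs("qos", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 1 : -1;
}
```

Because the daemon keeps the list of requests itself, it can also report which applications are holding the system awake, which is the tracking requirement mentioned above.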
In summary, as Alan put it:
The advantages of this scheme are that this does everything the
Android people need, and it does it in a way that's entirely
compatible with pure QoS/cpuidle-based power management. It even
starts along the path of making suspend-to-RAM just another kind of
dynamic power state.
Android developer Brian Swetland agreed,
saying "...from what I can see it certainly seems like this model
provides us with the functionality we're looking for." So we might
just have the form of a real solution.
There are a number of loose ends to tie down, of course. Additionally, various
alternatives are still being discussed; one
approach would replace user-space wakelocks with a special device which
can be used to express QOS constraints, for example. There is also the
little nagging issue that nobody has actually posted any code. That
problem notwithstanding, it seems like there could be a way forward which
would finally break the roadblock that has kept so much Android code out of
the kernel for so long.
By Jake Edge
June 2, 2010
Security problems that exploit badly written programs by placing symbolic
links in /tmp are legion. This kind of flaw has existed in
applications going back to the dawn of UNIX time, and new ones get
introduced regularly. So a recent effort to change the kernel to avoid
these kinds of problems would seem, at first glance anyway, to be welcome.
But some kernel hackers are not convinced that the core kernel should be
fixing badly written applications.
These /tmp symlink races are in a class of security vulnerabilities known as
time-of-check-to-time-of-use (TOCTTOU) bugs. For /tmp files, typically a
buggy application will check to see if a particular filename exists and/or if
the file has a particular set of characteristics; if the file passes that
test, the program uses it. An attacker exploits this by racing to put a
symbolic link or different file in /tmp between the time of the check and the
open or create. That allows the attacker to bypass whatever the checks are
supposed to enforce.
For programs with normal privilege levels, these attacks can cause a
variety of problems, but don't lead to system compromise. But for setuid
programs, an attacker can use the elevated privileges to overwrite
arbitrary files in ways that can lead to all manner of ugliness, including
complete compromise via privilege escalation. There are various guides
that describe how to avoid writing code with this kind of vulnerability,
but the flaw still gets reported frequently.
Ubuntu security team member Kees Cook proposed changing the kernel to avoid the
problem, not by removing the race, but by stopping programs from following
the symlinks that get created. "Proper" fixes in applications will
completely avoid the race by creating random filenames that get opened with
O_CREAT|O_EXCL. But, since these problems keep cropping up after
multiple decades of warnings, perhaps another approach is in order. Cook
adapted code from the Openwall and grsecurity kernels that did just that.
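For reference, the application-side fix mentioned above might be sketched as follows. mkstemp() generates a random name and opens it with O_EXCL, so it never reuses an existing file or follows a planted link; the second function shows the same flags used directly on a fixed name. The function names and the "/tmp/example-" prefix are assumptions of this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create a private temporary file with a random name. On success the
 * actual name is stored in name_out and an open descriptor is returned;
 * on failure the result is -1. */
int make_temp_file(char *name_out, size_t len)
{
    static const char tmpl[] = "/tmp/example-XXXXXX";

    assert(name_out != NULL);
    if (len < sizeof(tmpl))
        return -1;
    memcpy(name_out, tmpl, sizeof(tmpl));
    return mkstemp(name_out);   /* opens with O_CREAT|O_EXCL, mode 0600 */
}

/* Create a file at a fixed name only if nothing exists there. With
 * O_CREAT|O_EXCL, open() fails if the name is a symlink, even a
 * dangling one, instead of following it. */
int create_exclusive(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

Either way, there is no separate check step left for an attacker to race against: the existence test and the creation happen in a single system call.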
Since the problems occur in shared directories (like /tmp and
/var/tmp) which are world-writable, but with the "sticky bit"
turned on so that users can only delete their own files, the patch
restricts the kinds of symlinks that can be followed in sticky directories.
In order for a symlink in a sticky directory to be followed, it must either
be owned by the follower, or the directory and symlink must have the same owner.
Since shared temporary directories are typically owned by root, and random
attackers cannot create symlinks owned by root, this would eliminate the
problems caused by /tmp file symlink races.
The first version of the patch elicited a few suggestions, and an ACK by Serge
Hallyn, but no
complaints. Cook obviously did a fair amount of research into the problem
and anticipated some objections from earlier linux-kernel discussions,
which he linked to in the post. He also linked to a list of 243 CVE entries
that mention /tmp; not all are symlink races, but many of them are.
When Cook revised and reposted the patch, though, a
few complaints cropped up.
For one thing, Cook had anticipated that VFS developers would object to
putting his test into that code, so he put it into the capabilities
checks (cap_inode_follow_link()) instead. That didn't sit well
with Eric Biederman, who said:
Placing this in cap_inode_follow_link is horrible naming. There is nothing
capabilities about this. Either this needs to go into one or several
of the security modules or this needs to go into the core vfs.
Alan Cox agreed that it should go into
SELinux or some specialized Linux security module (LSM). He also suggested
that giving each user their own /tmp mountpoint would solve the
problem as well, without requiring any kernel changes: "Give your users their own /tmp. No kernel mods, no misbehaviours, no
weirdomatic path walking hackery. No kernel patch needed that I can
see."
But Cook and others are not convinced that there are any legitimate
applications that require the ability to follow these kinds of symlinks.
Given that following them has been a source of serious security holes, why
not just fix it once and for all in the kernel? One could argue that
changing the behavior would violate the POSIX standard—one of the
objections Cook anticipated—but that argument may be a bit weak. Ted
Ts'o believes that POSIX doesn't really
apply because the sticky bit isn't
in the standard:
So for people who to argue against this (which I believe to be a
useful restriction, and not one that necessarily has to be done in a
LSM), it's not sufficient to say that it is a POSIX violation, because
it isn't. The sticky bit itself wasn't originally considered by
POSIX, and many systems which implemented the sticky bit had no
problems becoming [certified] as POSIX compliant.
Per-user /tmp directories might solve the problem, but come with
an administrative burden of their own. Eric Paris notes that it might be a better solution, but
it doesn't come for free:
Now, if we used filesystem namespaces regularly for years
and users, administrators, and developers dealt with them often I agree
that would probably be the preferred solution. It would solve this
issue, but in introduces a whole host of other problems that are even
more obvious and even likely to bite people.
Ts'o agrees: "I do have a slight preference against per-user /tmp mostly because it
gets confusing for administrators, and because it could be used by
rootkits to hide themselves in ways that would be hard for most system
administrators to find." Based on that and other comments, Cook
revised the patches again, moving the test
into VFS, rather than trying to come in through the security subsystem.
In addition, he changed the code so that the new behavior defaulted
"off" to address one of the bigger objections. Version 3 of the patch was
posted on June 1, and has so far only seen comments from Al Viro, who
doesn't seem convinced of the need for the change, but was nevertheless
discussing implementation details.
It may be that Viro and other filesystem developers—Christoph Hellwig did
not seem particularly in favor of the change for example—will oppose
this change. It is, at some level, a band-aid to protect poorly written
applications, but it also provides a measure of protection that some would
like to have. As Cook pointed out, the
Ubuntu kernel already has this protection, but he would like to see that
protection extended to all kernel users. Whether that happens remains to
be seen.
Page editor: Jonathan Corbet