Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc5, released on October 13. Linus notes that things are calming down and seems generally happy.

Stable updates: 3.11.5, 3.10.16, 3.4.66, and 3.0.100 were released on October 13. There is probably only one more 3.0.x update to be expected before that kernel goes unsupported; 3.0 users should be thinking about moving on.
Quote of the week
Linux Foundation Technical Advisory Board Elections
Elections for (half of) the members of the Linux Foundation's Technical Advisory Board will be held as a part of the 2013 Kernel Summit in Edinburgh, probably on the evening of October 23. The nomination process is open now; anybody with an interest in serving on the board should get their nomination in soon.
Kernel development news
Mount point removal and renaming
Mounting a filesystem is typically an operation restricted to the root user (or a process with CAP_SYS_ADMIN). There are ways to allow regular users to mount certain filesystems (e.g. removable devices like CDs or USB sticks), but that needs to be set up in advance by an administrator. In addition, bind mounts, which mount a portion of an already-mounted filesystem in another location, always require privileges. User namespaces, however, will allow any user to be root inside their own namespace, and thus be able to mount files and filesystems in (currently) unexpected ways. As might be guessed, that can lead to some surprising behavior that a patch set from Eric W. Biederman is trying to address.
The problem crops up when someone tries to delete or rename a file or directory that has been used as a mount point elsewhere. A user only needs read access to a file (and execute permissions to the directories in the path) to be able to use it as a mount point, which means that users can mount filesystems over files they don't own. When the owner of the file (or directory) goes to remove it, they get an EBUSY error—for no obvious reason. Biederman has proposed changing that with a set of patches that would allow the unlink or rename to proceed and to quietly unmount anything mounted there.
For example, if two users were to set up new mount and user namespaces ("user1" creates "ns1", "user2" creates "ns2"), the existing kernel would give the following behavior:
    ns1$ ls foo
    f1 f2
    ns1$ mount --bind foo /tmp/user2/bar

Over in the other namespace, user2 tries to remove their temporary directory:
    ns2$ ls /tmp/user2/bar
    ns2$ rmdir /tmp/user2/bar
    rmdir: failed to remove ‘bar’: Device or resource busy
The visibility of mounts in other mount namespaces is part of the problem. A user getting an EBUSY when they attempt to remove their own directory may not even be able to determine why they are getting the error. They may not be able to see the mount on top of their file because it was made in another namespace. Coupled with user namespaces, this would allow unprivileged users to perform a denial of service attack against other users—including those more privileged.
Biederman's patches first add mount tracking to the virtual filesystem (VFS) layer. That will allow the later patches to find any mounts associated with a particular mount point. Using that, all of the mounts for a given directory entry (dentry) can be unmounted, which is exactly what is done when a mount point is deleted or renamed.
The idea was generally greeted favorably, but Linus Torvalds raised an issue: some programs are written to expect that rmdir() on a non-empty directory has no side effects, as it just returns ENOTEMPTY. The existing behavior is to return EBUSY if the directory is a mount point, but under Biederman's patches, any mount on the directory would be unmounted before determining whether the directory is empty and can be removed. That essentially adds a side effect to rmdir() even if it fails.
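To make Torvalds's concern concrete, here is a minimal user-space sketch (not taken from the discussion) of a program written against the existing semantics; it assumes that a failed rmdir() leaves the directory, and anything mounted on it, untouched:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        if (rmdir(argv[1]) == 0)
            printf("removed %s\n", argv[1]);
        else if (errno == ENOTEMPTY || errno == EBUSY)
            /* Under the initial patches, mounts on the directory
               would already have been unmounted by this point, even
               though the call itself failed. */
            printf("%s is in use; leaving it alone\n", argv[1]);
        else
            perror("rmdir");
        return 0;
    }

Under Biederman's first proposal, the "in use" branch could be reached after the directory's mounts had already been stripped away, which is exactly the side effect that worried Torvalds.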
In addition, depending on the mount propagation settings, the mount in another namespace might be visible. So, a user looking at "their" directory may actually be seeing files that were mounted by another user. But if they try to delete the directory (or some program does), it might succeed because the underlying mount point directory is empty, which may violate the principle of least surprise.
Torvalds was not at all sure that any application cares, but was concerned that it made the change to the semantics larger than needed. He also suggested a way forward: treat mounts made in the caller's own namespace differently from those made elsewhere.
Biederman agreed and proposed another patch that would cause rmdir() to fail with an EBUSY if there is a mount on the directory in the current mount namespace. Mounts in other mount namespaces would continue to be unmounted in that case. But there were some questions raised about whether renaming mount points (or unlink()ing file mount points) should get the same treatment.
Serge E. Hallyn asked: "Do you think we should do the same thing for over-mounted file at vfs_unlink()?" In other words: if the mount is atop a file that is removed (unlink()ed), rather than a directory, should the same rule be applied? The question was eventually broadened to include rename() as well.
At first, Biederman thought the rules should only apply to rmdir(), believing that the permissions in the enclosing directories should be sufficient to avoid any problems with the other two operations. But after some discussion with Miklos Szeredi and Andy Lutomirski, he changed his mind. For consistency, as well as alleviating a race condition in earlier (pre-UMOUNT_NOFOLLOW) versions of the fusermount command, "the most practical path I can see is to block unlink, rename, and rmdir if there is a mount in the local namespace".
The fusermount race comes about because of its attempt to ensure that the mount point it is unmounting does not change out from under it. A malicious user could replace the mount point with a symbolic link to some other filesystem, which the root-privileged fusermount would happily unmount. Earlier, Biederman had seen that problem as an insurmountable hurdle to his approach for fixing the rmdir() problem. But, not allowing mount point renames eliminates most of the concern with the fusermount race condition. There are still unlikely scenarios where an older fusermount binary and a newer kernel could be subverted to unmount any filesystem, but Szeredi, who is the FUSE maintainer, is not overly worried. It should be noted that there are other ways to "win" that race even in existing kernels (by renaming a parent directory of the mount point, for example).
New patches reflecting the changes suggested by various reviewers were posted on October 15. Biederman is targeting the 3.13 kernel, so there is some more time for reviewers to weigh in. It is a change that interested folks should be paying attention to, as it does subtly change the longtime behavior of the kernel.
It is, in some ways, another example of the unintended consequences of user namespaces. If user namespaces are not enabled, the problem is essentially just a source of potential confusion; it only becomes a denial of service when they are enabled. But, if distributions are to ever enable user namespaces, these kinds of problems need to be found and fixed.
Revisiting CPU hotplug locking
Last week's Kernel Page included an article on a new CPU hotplug locking mechanism designed to minimize the overhead of "read-locking" the set of available CPUs on the system. That article remains valid as a description of a clever and elaborate special-purpose locking system, but it seems unlikely that it describes code that will be merged into the mainline. Further discussion — along with an intervention by Linus — has caused this particular project to take a new direction.

The CPU hotplug locking patch was designed with a couple of requirements in mind: (1) actual CPU hotplug operations are rare, so that is where the locking overhead should be concentrated, and (2) as the number of CPUs in commonly used systems grows, it is no longer acceptable to iterate over the full set of CPUs with preemption disabled. That is why get_online_cpus() was designed to be cheap, but also to serve as a sort of sleeping lock. Both of those requirements came into question once other developers started looking at the patch set.
CPU hotplugging as a rare action
Peter Zijlstra's patch set (examined last week), in response to the above-mentioned requirements, went out of its way to minimize the cost of calls to get_online_cpus() and put_online_cpus() — the locking functions that ensure that no changes will be made to the set of online CPUs during the critical section. Interestingly, one of the first questions came from Ingo Molnar, who thought that get_online_cpus() still wasn't cheap enough. He suggested that read-locking the set of online CPUs should cost nothing, while actual hotplug operations should avoid contention by freezing all tasks in the system. Freezing all tasks is an expensive operation, but, in Ingo's view, an acceptable one for an event as rare as an actual CPU hotplug operation.
It was then pointed out (in the LWN comments too) that Android systems use CPU hotplug as a crude form of CPU power management. Ingo dismissed that use as "very broken to begin with", saying that proper power-aware scheduling should be used instead. That may be true, but it doesn't change the fact that hotplugging is used that way — or that the kernel lacks proper power-aware scheduling at the moment anyway. Paul McKenney posted an interesting look at the situation, noting that CPU hotplugging can serve as an effective defense against scheduler bugs that could otherwise ruin a system's battery life.
The end result is that, for the next few years at least, CPU hotplugging as
a power management technique seems likely to stay around. So, while it
still makes sense to put the expense of the necessary locking on that side
— actually adding or removing CPUs is not going to be a hugely fast operation
in the best of conditions — it would hurt some users to make hotplugging a
lot slower.
A different way

This was about the point where Linus came
along with
a suggestion of his own. Rather than set up complex locking, why not use
the normal read-copy-update (RCU) mechanism to protect CPU removals? In
short, if a thread sees a bit set indicating that a particular CPU exists,
all data associated with that CPU will continue to be valid for as long as
the reading thread holds an RCU read lock. When a CPU is removed, the bit
can be cleared, but the removal of the associated data would have to wait
until after an RCU grace period has passed. This mechanism is used
throughout the kernel and is well understood.
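In outline, the pattern would look something like this; use_cpu_data(), free_cpu_data(), and the surrounding functions are stand-ins for whatever per-CPU resources actually need protecting:

    #include <linux/cpumask.h>
    #include <linux/rcupdate.h>

    /* Reader side: a CPU seen as online has valid data for as long
       as the RCU read lock is held. */
    static void use_cpu(int cpu)
    {
        rcu_read_lock();
        if (cpu_online(cpu))
            use_cpu_data(cpu);        /* hypothetical consumer */
        rcu_read_unlock();
    }

    /* Removal side: clear the bit, wait for a grace period, and
       only then tear down the CPU's data. */
    static void remove_cpu(int cpu)
    {
        set_cpu_online(cpu, false);
        synchronize_rcu();
        free_cpu_data(cpu);           /* hypothetical teardown */
    }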
There is only one problem: holding an RCU read lock requires disabling
preemption, essentially putting the holding thread into atomic context.
Peter expressed his concerns about
disabling preemption in this way. Current get_online_cpus()
callers assume they can do things like memory allocation that might sleep;
that would not be possible if that code had to run with preemption
disabled. The other potential problem is that some systems have a
lot of CPUs; keeping preemption disabled while iterating over 4096
CPUs could introduce substantial latencies into the system. For these
reasons, Peter thought, disabling preemption was not the right way to solve
the hotplug locking problem.
Linus was, to put it mildly, unimpressed by
this reasoning. It was, he said, the path to low-quality code. Working
with preemption disabled, he said, is just the way things should be done in
the core kernel:
And in the kernel, we care. We have the resources. Plus, we can
also say "if you can't handle it, don't do it". We don't need new
features so badly that we are willing to screw up core code.
So the sleeping-lock approach has gone out of favor. But,
if disabling preemption is to be used instead, solutions
must be found to the atomic context and latency problems mentioned above.
With regard to atomic context, the biggest issue is likely to be memory
allocations which, normally, can sleep while the kernel works to free the
needed space.
There are two ways to handle memory allocations when preemption is
disabled. One of those is to use the GFP_ATOMIC flag, but code
using GFP_ATOMIC tends to draw a lot of critical attention from
reviewers. The alternative is to either pre-allocate the memory before
disabling preemption, or to temporarily re-enable preemption for long
enough to perform the allocation. With the latter approach, naturally, it
is usually necessary to check whether the state of the universe has changed
while preemption was enabled. All told, it makes for more complex
programming, but, as Linus noted, it can be very efficient.
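As an illustration of the pre-allocation pattern, consider this sketch, in which struct entry, entry_installed(), and install_entry() are all invented names:

    #include <linux/preempt.h>
    #include <linux/slab.h>

    static int add_entry(void)
    {
        struct entry *e;

        /* Allocate while sleeping is still allowed. */
        e = kmalloc(sizeof(*e), GFP_KERNEL);
        if (!e)
            return -ENOMEM;

        preempt_disable();
        if (entry_installed()) {
            /* The world changed while we could sleep; back out. */
            preempt_enable();
            kfree(e);
            return 0;
        }
        install_entry(e);
        preempt_enable();
        return 0;
    }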
Latency problems can be addressed by disabling preemption inside the loop
that passes over all CPUs, rather than outside of it. So preemption is
disabled while any given CPU is being processed, but it is quickly
re-enabled (then disabled again) between CPUs. That should eliminate any
significant latencies,
but, once again, the code needs to be prepared for things changing while
preemption is enabled.
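A sketch of that structure, with process_cpu() standing in for the real per-CPU work:

    #include <linux/cpumask.h>
    #include <linux/preempt.h>

    static void visit_cpus(void)
    {
        int cpu;

        for_each_possible_cpu(cpu) {
            preempt_disable();
            /* The CPU may have come or gone while preemption was
               enabled; recheck before touching its data. */
            if (cpu_online(cpu))
                process_cpu(cpu);     /* hypothetical */
            preempt_enable();
        }
    }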
Changing CPU hotplug locking along these lines would eliminate the need for
the complex locking code that was examined last week. But there is a cost
to be paid elsewhere: all code that uses get_online_cpus() must be
audited and possibly changed to work under the new regime. Peter has agreed that this approach is workable, though,
and he seems willing to carry out this audit. That work appears to be
underway as of this writing.
To some observers, this sequence of events highlights the difficulties of
kernel programming: a talented developer works to create some tricky code
that makes things better, only to be told that the approach is wrong. In
truth, early patch postings are often better seen as a characterization of
the problem than the final solution. As long as developers are willing to
let go of their approach when confronted with something better, things work
out for the best for everybody involved. That would appear to be the case
here; the resulting kernel will perform better while using code that is
simpler and adheres more closely to common programming practices.
A new direction for power-aware scheduling

Power-aware scheduling attempts to place processes on CPUs in a way that minimizes the system's overall consumption of power. Discussion in this area has been muted since we last looked at it in June, but work has been proceeding. Now a new set of power-aware scheduling patches shows a significant change in direction motivated by criticisms that were aired in June. This particular problem is far from solved, but the shape of the eventual solution may be becoming a bit more clear.

Thus far, most of the power-aware scheduling patches posted to the lists
have been focused on task placement —
packing "small" processes onto a small number of CPUs to allow others to be
powered down, for example. The problem with that approach, as Ingo Molnar
complained at the time, was that it failed
to recognize that there are several mechanisms used to control CPU power
consumption. These include the cpuidle subsystem (which decides when a CPU
can sleep and how deeply), the cpufreq subsystem (charged with controlling
the clock frequency for CPUs) and various aspects of the scheduler itself.
There is no integration between these subsystems; indeed, the scheduler is
almost entirely ignorant of what the cpuidle and cpufreq controllers are
doing. There are other problems as well: the notion of controlling a CPU's
frequency has been effectively rendered obsolete by current processor
designs, for example.
In the end, Ingo said that no power-aware scheduling patches would be
considered for merging until these problems were solved. In other words,
the developers working on these patches needed to solve not just their
problem, but the problem of rationalizing and integrating the work that has
been done by other developers in preceding years. Such things happen in
kernel development; it can be hard on individual developers, but it does
result in better code in the long term.
The latest approach

To address this challenge, Morten Rasmussen, who has been working on the big.LITTLE MP scheduler, has taken a step back;
his latest power-aware scheduling patch set
does not actually introduce much in the way of power-aware scheduling.
Instead, it is focused on the creation of an internal API that governs
communications between the scheduler and a new type of "power driver" that
is meant to eventually replace the cpuidle and cpufreq subsystems. The
power driver (there can only be one for all CPUs in the current patch set)
provides these operations to the scheduler:
    struct power_driver {
        int  (*at_max_capacity) (int cpu);
        int  (*go_faster) (int cpu, int hint);
        int  (*go_slower) (int cpu, int hint);
        int  (*best_wake_cpu) (void);
        void (*late_callback) (int cpu);
    };
Two of these methods allow the scheduler to query the power state of a
given CPU; at_max_capacity() allows the scheduler to ask whether the
processor is running at full speed, while best_wake_cpu() asks
which (sleeping) CPU would be the best to wake in response to increasing
load. The best_wake_cpu() call can make use of low-level
architectural knowledge to determine which CPU would require the least
power to bring up; it would, for example, favor CPUs that share power or
clock lines with currently running CPUs over those that would require
powering up a new package.
The scheduler can provide feedback to the power driver with the
go_faster() and go_slower() methods. These calls request
higher or lower speed from the given CPU without specifying an actual clock
frequency, which isn't really possible on a lot of current processors. The
power driver can then instruct the hardware to adopt a power policy that
matches what the scheduler is asking for. The hint parameter is
not used in the current patch set; its purpose is to indicate how much
faster or slower performance the scheduler would like to see. These calls
as a whole are hints, actually; the power driver is not required to carry
out the scheduler's wishes.
Finally, late_callback() exists to allow the power driver to
do work that may require sleeping or having interrupts enabled. Most of
the functions listed above can be called from within the scheduler at
almost any point, so they have to be written to run in atomic context. If
they need to do something that cannot be done in that context, they can set
the work aside; the scheduler will call late_callback() at a safe
time for that work to be done.
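The intended division of labor might look something like this sketch; the pending flag and program_hardware() are invented for illustration:

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(bool, change_pending);

    /* Called from scheduler context, possibly atomic: just take note. */
    static int sketch_go_faster(int cpu, int hint)
    {
        per_cpu(change_pending, cpu) = true;
        return 0;
    }

    /* Called later, at a point where sleeping is safe. */
    static void sketch_late_callback(int cpu)
    {
        if (per_cpu(change_pending, cpu)) {
            per_cpu(change_pending, cpu) = false;
            program_hardware(cpu);    /* hypothetical; may sleep */
        }
    }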
The current patch set makes just enough use of these functions to show how
they would be used. Whenever the scheduler adds a process to a given CPU's
run queue, it checks whether the total load exceeds what the CPU is able to
provide; if so, a call to go_faster() will be made to ask for more
performance. A similar test is done whenever a process is removed from a
CPU; if that CPU is providing more power than is needed,
go_slower() will be called. A separate test will call
go_faster() if the idle time on the CPU is low, even if the
computed load suggests that the CPU should not be busy. Rudimentary
implementations of go_faster() and go_slower() have been
provided; they are a simple wrapper around the existing cpufreq driver
code.
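In rough pseudo-form, the checks described above come down to something like the following, with cpu_load(), cpu_capacity(), and cpu_idle_time_low() standing in for the scheduler's actual metrics:

    /* On enqueue: ask for more performance if the CPU looks
       overloaded, or if its idle time is suspiciously low. */
    static void enqueue_check(struct power_driver *driver, int cpu)
    {
        if (cpu_load(cpu) > cpu_capacity(cpu) || cpu_idle_time_low(cpu))
            driver->go_faster(cpu, 0);    /* hint currently unused */
    }

    /* On dequeue: back off if the CPU now has capacity to spare. */
    static void dequeue_check(struct power_driver *driver, int cpu)
    {
        if (cpu_load(cpu) < cpu_capacity(cpu))
            driver->go_slower(cpu, 0);
    }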
What's coming

The full plan (as described in Morten's
Linux Plumbers Conference talk slides [PDF]) calls for the elimination
of cpufreq and cpuidle altogether once their functionality has been pulled
into the power driver. There will also be several more
functions to be provided by the power driver. These include
get_best_sleep_cpu() to get a hint for the best CPU to put to sleep,
enter_idle() to actually put a CPU into the sleep state,
load_scale() to help with the calculation of loads regardless of
the CPU's current power state, and task_boost() to give priority
to a specific CPU. task_boost() is aimed at systems that provide
features like "turbo mode," where one CPU can be overclocked, but only if
the others are idle.
The long-term plan also involves bringing back techniques like small-task packing, proper support for
big.LITTLE systems, and more. But those goals look distant at the moment;
Morten and company must first build a consensus around the proposed
architecture. That may take some doing, yet; scheduler developer Peter Zijlstra's first response was "I don't see anything except a random bunch of hooks without an over-all picture of how to get less power used." Morten has promised to fill out the story.
Some of these issues may be resolved on October 23, when a half-day minisummit will be held on the topic
in Edinburgh. Many of the relevant developers should be there, allowing
for quick resolution of a number of the outstanding issues. With luck,
your editor will be there too; stay tuned for the next episode in this
long-running story.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet