Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc5, released on October 13. Linus notes that things are calming down and seems generally happy.

Stable updates: 3.11.5, 3.10.16, 3.4.66, and 3.0.100 were released on October 13. There is probably only one more 3.0.x update to be expected before that kernel goes unsupported; 3.0 users should be thinking about moving on.
Quote of the week
Linux Foundation Technical Advisory Board Elections
Elections for (half of) the members of the Linux Foundation's Technical Advisory Board will be held as a part of the 2013 Kernel Summit in Edinburgh, probably on the evening of October 23. The nomination process is open now; anybody with an interest in serving on the board should get their nomination in soon.
Kernel development news
Mount point removal and renaming
Mounting a filesystem is typically an operation restricted to the root user (or a process with CAP_SYS_ADMIN). There are ways to allow regular users to mount certain filesystems (e.g. removable devices like CDs or USB sticks), but that needs to be set up in advance by an administrator. In addition, bind mounts, which mount a portion of an already-mounted filesystem in another location, always require privileges. User namespaces, however, will allow any user to be root inside their own namespace, and thus be able to mount files and filesystems in (currently) unexpected ways. As might be guessed, that can lead to some surprising behavior that a patch set from Eric W. Biederman is trying to address.
The problem crops up when someone tries to delete or rename a file or directory that has been used as a mount point elsewhere. A user only needs read access to a file (and execute permissions to the directories in the path) to be able to use it as a mount point, which means that users can mount filesystems over files they don't own. When the owner of the file (or directory) goes to remove it, they get an EBUSY error—for no obvious reason. Biederman has proposed changing that with a set of patches that would allow the unlink or rename to proceed and to quietly unmount anything mounted there.
For example, if two users were to set up new mount and user namespaces ("user1" creates "ns1", "user2" creates "ns2"), the existing kernel would give the following behavior:
    ns1$ ls foo
    f1 f2
    ns1$ mount --bind foo /tmp/user2/bar

Over in the other namespace, user2 tries to remove their temporary directory:
    ns2$ ls /tmp/user2/bar
    ns2$ rmdir /tmp/user2/bar
    rmdir: failed to remove ‘bar’: Device or resource busy
The visibility of mounts in other mount namespaces is part of the problem. A user getting an EBUSY when they attempt to remove their own directory may not even be able to determine why they are getting the error. They may not be able to see the mount on top of their file because it was made in another namespace. Coupled with user namespaces, this would allow unprivileged users to perform a denial of service attack against other users—including those more privileged.
Biederman's patches first add mount tracking to the virtual filesystem (VFS) layer. That will allow the later patches to find any mounts associated with a particular mount point. Using that, all of the mounts for a given directory entry (dentry) can be unmounted, which is exactly what is done when a mount point is deleted or renamed.
The idea was generally greeted favorably, but Linus Torvalds raised an issue: some programs are written to expect that rmdir() on a non-empty directory has no side effects, as it just returns ENOTEMPTY. The existing behavior is to return EBUSY if the directory is a mount point, but under Biederman's patches, any mount on the directory would be unmounted before determining whether the directory is empty and can be removed. That essentially adds a side effect to rmdir() even if it fails.
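To make Torvalds's concern concrete, here is a minimal user-space sketch (not taken from the discussion) of a program written against the existing semantics; it assumes that a failed rmdir() leaves the directory, and anything mounted on it, untouched:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        if (rmdir(argv[1]) == 0)
            printf("removed %s\n", argv[1]);
        else if (errno == ENOTEMPTY || errno == EBUSY)
            /* Under the initial patches, mounts on the directory
               would already have been unmounted by this point, even
               though the call itself failed. */
            printf("%s is in use; leaving it alone\n", argv[1]);
        else
            perror("rmdir");
        return 0;
    }

Under Biederman's first proposal, the "in use" branch could be reached after the directory's mounts had already been stripped away, which is exactly the side effect that worried Torvalds.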
In addition, depending on the mount propagation settings, the mount in another namespace might be visible. So, a user looking at "their" directory may actually be seeing files that were mounted by another user. But if they try to delete the directory (or some program does), it might succeed because the underlying mount point directory is empty, which may violate the principle of least surprise.
Torvalds was not at all sure that any application cares, but was concerned that it made the change to the semantics larger than needed. He also suggested a way forward: treat mounts made in the caller's own namespace differently from those made elsewhere.
Biederman agreed and proposed another patch that would cause rmdir() to fail with an EBUSY if there is a mount on the directory in the current mount namespace. Mounts in other mount namespaces would continue to be unmounted in that case. But there were some questions raised about whether renaming mount points (or unlink()ing file mount points) should get the same treatment.
Serge E. Hallyn asked: "Do you think we should do the same thing for over-mounted file at vfs_unlink()?" In other words: if the mount is atop a file that is removed (unlink()ed), rather than a directory, should the same rule be applied? The question was eventually broadened to include rename() as well.
At first, Biederman thought the rules should only apply to rmdir(), believing that the permissions in the enclosing directories should be sufficient to avoid any problems with the other two operations. But after some discussion with Miklos Szeredi and Andy Lutomirski, he changed his mind. For consistency, as well as alleviating a race condition in earlier (pre-UMOUNT_NOFOLLOW) versions of the fusermount command, "the most practical path I can see is to block unlink, rename, and rmdir if there is a mount in the local namespace".
The fusermount race comes about because of its attempt to ensure that the mount point it is unmounting does not change out from under it. A malicious user could replace the mount point with a symbolic link to some other filesystem, which the root-privileged fusermount would happily unmount. Earlier, Biederman had seen that problem as an insurmountable hurdle to his approach for fixing the rmdir() problem. But, not allowing mount point renames eliminates most of the concern with the fusermount race condition. There are still unlikely scenarios where an older fusermount binary and a newer kernel could be subverted to unmount any filesystem, but Szeredi, who is the FUSE maintainer, is not overly worried. It should be noted that there are other ways to "win" that race even in existing kernels (by renaming a parent directory of the mount point, for example).
New patches reflecting the changes suggested by various reviewers were posted on October 15. Biederman is targeting the 3.13 kernel, so there is some more time for reviewers to weigh in. It is a change that interested folks should be paying attention to, as it does subtly change the longtime behavior of the kernel.
It is, in some ways, another example of the unintended consequences of user namespaces. If user namespaces are not enabled, the problem is essentially just a source of potential confusion; it only becomes a denial of service when they are enabled. But, if distributions are to ever enable user namespaces, these kinds of problems need to be found and fixed.
Revisiting CPU hotplug locking
Last week's Kernel Page included an article on a new CPU hotplug locking mechanism designed to minimize the overhead of "read-locking" the set of available CPUs on the system. That article remains valid as a description of a clever and elaborate special-purpose locking system, but it seems unlikely that it describes code that will be merged into the mainline. Further discussion — along with an intervention by Linus — has caused this particular project to take a new direction.

The CPU hotplug locking patch was designed with a couple of requirements in mind: (1) actual CPU hotplug operations are rare, so that is where the locking overhead should be concentrated, and (2) as the number of CPUs in commonly used systems grows, it is no longer acceptable to iterate over the full set of CPUs with preemption disabled. That is why get_online_cpus() was designed to be cheap, but also to serve as a sort of sleeping lock. Both of those requirements came into question once other developers started looking at the patch set.
CPU hotplugging as a rare action
Peter Zijlstra's patch set (examined last week), in response to the above-mentioned requirements, went out of its way to minimize the cost of calls to get_online_cpus() and put_online_cpus() — the locking functions that ensure that no changes will be made to the set of online CPUs during the critical section. Interestingly, one of the first questions came from Ingo Molnar, who thought that get_online_cpus() still wasn't cheap enough. He suggested that read-locking the set of online CPUs should cost nothing, while actual hotplug operations should avoid contention by freezing all tasks in the system. Freezing all tasks is an expensive operation, but, in Ingo's view, an acceptable one for an event as rare as an actual CPU hotplug operation.
It was then pointed out (in the LWN comments too) that Android systems use CPU hotplug as a crude form of CPU power management. Ingo dismissed that use as "very broken to begin with", saying that proper power-aware scheduling should be used instead. That may be true, but it doesn't change the fact that hotplugging is used that way — or that the kernel lacks proper power-aware scheduling at the moment anyway. Paul McKenney posted an interesting look at the situation, noting that CPU hotplugging can serve as an effective defense against scheduler bugs that could otherwise ruin a system's battery life.
The end result is that, for the next few years at least, CPU hotplugging as
a power management technique seems likely to stay around. So, while it
still makes sense to put the expense of the necessary locking on that side
— actually adding or removing CPUs is not going to be a hugely fast operation
in the best of conditions — it would hurt some users to make hotplugging a
lot slower.
A different way

This was about the point where Linus came
along with
a suggestion of his own. Rather than set up complex locking, why not use
the normal read-copy-update (RCU) mechanism to protect CPU removals? In
short, if a thread sees a bit set indicating that a particular CPU exists,
all data associated with that CPU will continue to be valid for as long as
the reading thread holds an RCU read lock. When a CPU is removed, the bit
can be cleared, but the removal of the associated data would have to wait
until after an RCU grace period has passed. This mechanism is used
throughout the kernel and is well understood.
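In outline, the pattern would look something like this; use_cpu_data(), free_cpu_data(), and the surrounding functions are stand-ins for whatever per-CPU resources actually need protecting:

    #include <linux/cpumask.h>
    #include <linux/rcupdate.h>

    /* Reader side: a CPU seen as online has valid data for as long
       as the RCU read lock is held. */
    static void use_cpu(int cpu)
    {
        rcu_read_lock();
        if (cpu_online(cpu))
            use_cpu_data(cpu);        /* hypothetical consumer */
        rcu_read_unlock();
    }

    /* Removal side: clear the bit, wait for a grace period, and
       only then tear down the CPU's data. */
    static void remove_cpu(int cpu)
    {
        set_cpu_online(cpu, false);
        synchronize_rcu();
        free_cpu_data(cpu);           /* hypothetical teardown */
    }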
There is only one problem: holding an RCU read lock requires disabling
preemption, essentially putting the holding thread into atomic context.
Peter expressed his concerns about
disabling preemption in this way. Current get_online_cpus()
callers assume they can do things like memory allocation that might sleep;
that would not be possible if that code had to run with preemption
disabled. The other potential problem is that some systems have a
lot of CPUs; keeping preemption disabled while iterating over 4096
CPUs could introduce substantial latencies into the system. For these
reasons, Peter thought, disabling preemption was not the right way to solve
the hotplug locking problem.
Linus was, to put it mildly, unimpressed by
this reasoning. It was, he said, the path to low-quality code. Working
with preemption disabled, he said, is just the way things should be done in
the core kernel:
And in the kernel, we care. We have the resources. Plus, we can
also say "if you can't handle it, don't do it". We don't need new
features so badly that we are willing to screw up core code.
So the sleeping-lock approach has gone out of favor. But,
if disabling preemption is to be used instead, solutions
must be found to the atomic context and latency problems mentioned above.
With regard to atomic context, the biggest issue is likely to be memory
allocations which, normally, can sleep while the kernel works to free the
needed space.
There are two ways to handle memory allocations when preemption is
disabled. One of those is to use the GFP_ATOMIC flag, but code
using GFP_ATOMIC tends to draw a lot of critical attention from
reviewers. The alternative is to either pre-allocate the memory before
disabling preemption, or to temporarily re-enable preemption for long
enough to perform the allocation. With the latter approach, naturally, it
is usually necessary to check whether the state of the universe has changed
while preemption was enabled. All told, it makes for more complex
programming, but, as Linus noted, it can be very efficient.
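As an illustration of the pre-allocation pattern, consider this sketch, in which struct entry, entry_installed(), and install_entry() are all invented names:

    #include <linux/preempt.h>
    #include <linux/slab.h>

    static int add_entry(void)
    {
        struct entry *e;

        /* Allocate while sleeping is still allowed. */
        e = kmalloc(sizeof(*e), GFP_KERNEL);
        if (!e)
            return -ENOMEM;

        preempt_disable();
        if (entry_installed()) {
            /* The world changed while we could sleep; back out. */
            preempt_enable();
            kfree(e);
            return 0;
        }
        install_entry(e);
        preempt_enable();
        return 0;
    }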
Latency problems can be addressed by disabling preemption inside the loop
that passes over all CPUs, rather than outside of it. So preemption is
disabled while any given CPU is being processed, but it is quickly
re-enabled (then disabled again) between CPUs. That should eliminate any
significant latencies,
but, once again, the code needs to be prepared for things changing while
preemption is enabled.
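A sketch of that structure, with process_cpu() standing in for the real per-CPU work:

    #include <linux/cpumask.h>
    #include <linux/preempt.h>

    static void visit_cpus(void)
    {
        int cpu;

        for_each_possible_cpu(cpu) {
            preempt_disable();
            /* The CPU may have come or gone while preemption was
               enabled; recheck before touching its data. */
            if (cpu_online(cpu))
                process_cpu(cpu);     /* hypothetical */
            preempt_enable();
        }
    }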
Changing CPU hotplug locking along these lines would eliminate the need for
the complex locking code that was examined last week. But there is a cost
to be paid elsewhere: all code that uses get_online_cpus() must be
audited and possibly changed to work under the new regime. Peter has agreed that this approach is workable, though,
and he seems willing to carry out this audit. That work appears to be
underway as of this writing.
To some observers, this sequence of events highlights the difficulties of
kernel programming: a talented developer works to create some tricky code
that makes things better, only to be told that the approach is wrong. In
truth, early patch postings are often better seen as a characterization of
the problem than the final solution. As long as developers are willing to
let go of their approach when confronted with something better, things work
out for the best for everybody involved. That would appear to be the case
here; the resulting kernel will perform better while using code that is
simpler and adheres more closely to common programming practices.
A new direction for power-aware scheduling

Power-aware scheduling attempts to place processes on CPUs in a way that minimizes the system's overall consumption of power. Discussion in this area has been muted since we last looked at it in June, but work has been proceeding. Now a new set of power-aware scheduling patches shows a significant change in direction motivated by criticisms that were aired in June. This particular problem is far from solved, but the shape of the eventual solution may be becoming a bit more clear.

Thus far, most of the power-aware scheduling patches posted to the lists
have been focused on task placement —
packing "small" processes onto a small number of CPUs to allow others to be
powered down, for example. The problem with that approach, as Ingo Molnar
complained at the time, was that it failed
to recognize that there are several mechanisms used to control CPU power
consumption. These include the cpuidle subsystem (which decides when a CPU
can sleep and how deeply), the cpufreq subsystem (charged with controlling
the clock frequency for CPUs) and various aspects of the scheduler itself.
There is no integration between these subsystems; indeed, the scheduler is
almost entirely ignorant of what the cpuidle and cpufreq controllers are
doing. There are other problems as well: the notion of controlling a CPU's
frequency has been effectively rendered obsolete by current processor
designs, for example.
In the end, Ingo said that no power-aware scheduling patches would be
considered for merging until these problems were solved. In other words,
the developers working on these patches needed to solve not just their
problem, but the problem of rationalizing and integrating the work that has
been done by other developers in preceding years. Such things happen in
kernel development; it can be hard on individual developers, but it does
result in better code in the long term.
The latest approach

To address this challenge, Morten Rasmussen, who has been working on the big.LITTLE MP scheduler, has taken a step back;
his latest power-aware scheduling patch set
does not actually introduce much in the way of power-aware scheduling.
Instead, it is focused on the creation of an internal API that governs
communications between the scheduler and a new type of "power driver" that
is meant to eventually replace the cpuidle and cpufreq subsystems. The
power driver (there can only be one for all CPUs in the current patch set)
provides these operations to the scheduler:
    struct power_driver {
        int  (*at_max_capacity) (int cpu);
        int  (*go_faster) (int cpu, int hint);
        int  (*go_slower) (int cpu, int hint);
        int  (*best_wake_cpu) (void);
        void (*late_callback) (int cpu);
    };
Two of these methods allow the scheduler to query the power state of a
given CPU; at_max_capacity() allows the scheduler to ask whether the
processor is running at full speed, while best_wake_cpu() asks
which (sleeping) CPU would be the best to wake in response to increasing
load. The best_wake_cpu() call can make use of low-level
architectural knowledge to determine which CPU would require the least
power to bring up; it would, for example, favor CPUs that share power or
clock lines with currently running CPUs over those that would require
powering up a new package.
The scheduler can provide feedback to the power driver with the
go_faster() and go_slower() methods. These calls request
higher or lower speed from the given CPU without specifying an actual clock
frequency, which isn't really possible on a lot of current processors. The
power driver can then instruct the hardware to adopt a power policy that
matches what the scheduler is asking for. The hint parameter is
not used in the current patch set; its purpose is to indicate how much
faster or slower performance the scheduler would like to see. These calls
as a whole are hints, actually; the power driver is not required to carry
out the scheduler's wishes.
Finally, late_callback() exists to allow the power driver to
do work that may require sleeping or having interrupts enabled. Most of
the functions listed above can be called from within the scheduler at
almost any point, so they have to be written to run in atomic context. If
they need to do something that cannot be done in that context, they can set
the work aside; the scheduler will call late_callback() at a safe
time for that work to be done.
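The intended division of labor might look something like this sketch; the pending flag and program_hardware() are invented for illustration:

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(bool, change_pending);

    /* Called from scheduler context, possibly atomic: just take note. */
    static int sketch_go_faster(int cpu, int hint)
    {
        per_cpu(change_pending, cpu) = true;
        return 0;
    }

    /* Called later, at a point where sleeping is safe. */
    static void sketch_late_callback(int cpu)
    {
        if (per_cpu(change_pending, cpu)) {
            per_cpu(change_pending, cpu) = false;
            program_hardware(cpu);    /* hypothetical; may sleep */
        }
    }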
The current patch set makes just enough use of these functions to show how
they would be used. Whenever the scheduler adds a process to a given CPU's
run queue, it checks whether the total load exceeds what the CPU is able to
provide; if so, a call to go_faster() will be made to ask for more
performance. A similar test is done whenever a process is removed from a
CPU; if that CPU is providing more power than is needed,
go_slower() will be called. A separate test will call
go_faster() if the idle time on the CPU is low, even if the
computed load suggests that the CPU should not be busy. Rudimentary
implementations of go_faster() and go_slower() have been
provided; they are a simple wrapper around the existing cpufreq driver
code.
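In rough pseudo-form, the checks described above come down to something like the following, with cpu_load(), cpu_capacity(), and cpu_idle_time_low() standing in for the scheduler's actual metrics:

    /* On enqueue: ask for more performance if the CPU looks
       overloaded, or if its idle time is suspiciously low. */
    static void enqueue_check(struct power_driver *driver, int cpu)
    {
        if (cpu_load(cpu) > cpu_capacity(cpu) || cpu_idle_time_low(cpu))
            driver->go_faster(cpu, 0);    /* hint currently unused */
    }

    /* On dequeue: back off if the CPU now has capacity to spare. */
    static void dequeue_check(struct power_driver *driver, int cpu)
    {
        if (cpu_load(cpu) < cpu_capacity(cpu))
            driver->go_slower(cpu, 0);
    }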
What's coming

The full plan (as described in Morten's
Linux Plumbers Conference talk slides [PDF]) calls for the elimination
of cpufreq and cpuidle altogether once their functionality has been pulled
into the power driver. There will also be several more
functions to be provided by the power driver. These include
get_best_sleep_cpu() to get a hint for the best CPU to put to sleep,
enter_idle() to actually put a CPU into the sleep state,
load_scale() to help with the calculation of loads regardless of
the CPU's current power state, and task_boost() to give priority
to a specific CPU. task_boost() is aimed at systems that provide
features like "turbo mode," where one CPU can be overclocked, but only if
the others are idle.
The long-term plan also involves bringing back techniques like small-task packing, proper support for
big.LITTLE systems, and more. But those goals look distant at the moment;
Morten and company must first build a consensus around the proposed
architecture. That may take some doing, yet; scheduler developer Peter Zijlstra's first response was "I don't see anything except a random bunch of hooks without an over-all picture of how to get less power used." Morten has promised to fill out the story.
Some of these issues may be resolved on October 23, when a half-day minisummit will be held on the topic
in Edinburgh. Many of the relevant developers should be there, allowing
for quick resolution of a number of the outstanding issues. With luck,
your editor will be there too; stay tuned for the next episode in this
long-running story.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet