Kernel development

Brief items

Kernel release status

The 3.8 kernel was released on February 18; Linus said: "The release got delayed a couple of days because I was waiting for confirmation of a small patch, but hey, we could also say that it was all intentional, and that this is the special 'Presidents' Day Release'. It sounds more planned that way, no?" Some of the headline features in this release include metadata integrity checking in the xfs filesystem, the foundation for much improved NUMA scheduling, kernel memory usage accounting and associated usage limits, inline data support for small files in the ext4 filesystem, nearly complete user namespace support, and much more. See the KernelNewbies 3.8 page for lots of details.

Stable updates: 3.7.8, 3.4.31, and 3.0.64 were released on February 14, 3.7.9, 3.4.32, and 3.0.65 were released on February 17, and 3.2.39 came out on February 20.

Quotes of the week

One person's bug is another person's fascinating invertebrate.
Neil Brown

Comments in XFS, especially weird scary ones, are rarely wrong. Some of them might have been there for close on 20 years, but they are our documentation for all the weird, scary stuff that XFS does. I rely on them being correct, so it's something I always pay attention to during code review. IOWs, When we add, modify or remove something weird and scary, the comments are updated appropriately so we'll know why the code is doing something weird and scary in another 20 years time.
Dave Chinner

Just to get back at you though, I'll turn on an incandescent light bulb every time I have to use -f.
Chris Mason (to Eric Sandeen)

No kvmtool in the mainline

By Jonathan Corbet
February 20, 2013
The story of the "native Linux KVM tool" (or, more recently, "kvmtool") has been playing out since early 2011. This tool serves as a simple replacement for the QEMU emulator, making it easy to set up and run guests under KVM. The kvmtool developers have been working under the assumption that their code would be merged into the mainline kernel, as was done with perf, but others have disagreed with that idea. The result has been a repetitive conversation every merge window or two as kvmtool was proposed for merging.

The conversation for the 3.9 merge window has seemingly been a bit more decisive, though. Ingo Molnar (along with kvmtool developer Pekka Enberg) presented a long list of reasons why they thought it made sense to put kvmtool into the mainline repository. Ingo even compared kernel tooling to Somalia, saying that it was made up of "disjunct entities with not much commonality or shared infrastructure," though, presumably, with fewer pirates. Few others came to the defense of kvmtool, leaving Ingo and Pekka to carry forward the argument on their own.

Linus responded that he saw no convincing reason to put kvmtool in the mainline; indeed, he thought that tying kvmtool to the kernel could be retarding its development. He concluded with:

So here, let me state it very very clearly: I will not be merging kvmtool. It's not about "useful code". It's not about the project keeping to improve. Both of those would seem to be *better* outside the kernel, where there isn't that artificial and actually harmful tie-in.

That is probably the end of the discussion unless somebody can come up with a new argument that Linus will find more convincing. At this point, it seems that kvmtool is destined to remain out of the mainline kernel repository.

Kernel development news

3.9 Merge window part 1

By Jonathan Corbet
February 20, 2013
The 3.9 merge window has gotten off to a relatively slow start, with a mere 1,200 non-merge change sets pulled into the mainline as of this writing. The process may have been slowed a bit by a sporadic reboot problem that crept in relatively early, and which has not yet been tracked down. Even so, a number of significant changes have already found their way in for 3.9, with many more to follow.

Important user-visible changes include:

  • Progress has been made toward the goal of eliminating the timer tick while running in user space. The patches merged for 3.9 fix up the CPU time accounting code, printk() subsystem, and irq_work code to function without timer interrupts; further work can be expected in future development cycles.

  • A relatively simple scheduler patch fixes the "bouncing cow problem," wherein, on a system with more processors than running processes, those processes can wander across the processors, yielding poor cache behavior. For a "worst-case" tbench benchmark run, the result is a 15x improvement in performance.

  • The format of tracing events has been changed to remove some unused padding. This change created problems when it was first attempted in 2011, but it seems that the relevant user-space programs have since been fixed (by moving them to the libtraceevent library). It is worth trying again; smaller events require less bandwidth as they are communicated to user space. Anybody who observes any remaining problems would do well to report them during the 3.9 development cycle.

  • The ftrace tracing system has gained the ability to take a static "snapshot" of the tracing buffer, controllable via a debugfs file. See this ftrace.txt patch for documentation on how to use this feature; a brief usage sketch appears after this list.

  • The perf bench utility has a new set of benchmarks intended to help with the evaluation of NUMA balancing patches.

  • perf stat has been augmented to include the ability to print out information at a regular interval.

  • New hardware support includes:

    • Systems and processors: The "Goldfish" virtual x86 platform used for Android development, Technologic Systems TS-5500 single-board computers, and SGI Ultraviolet System 3 systems.

    • Input: Cypress PS/2 touchpads and Cypress APA I2C trackpads.

    • Miscellaneous: ST-Ericsson AB8505, AB9540, and AB8540 pin controllers, Maxim MAX6581, MAX6602, MAX6622, MAX6636, MAX6689, MAX6693, MAX6694, MAX6697, MAX6698, and MAX6699 temperature sensor chips, TI / Burr Brown INA209 power monitors, TI LP8755 power management units, NVIDIA Tegra114 pinmux controllers, Allwinner A1X pin controllers, ARM PL320 interprocessor communication mailboxes, Calxeda Highbank CPU frequency controllers, Freescale i.MX6Q CPU frequency controllers, and Marvell Kirkwood CPU frequency controllers.
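
Returning to the ftrace snapshot feature mentioned above, here is a minimal user-space sketch of how the new debugfs file might be driven. It assumes, per the documentation patch, that the control file appears as /sys/kernel/debug/tracing/snapshot, that writing "1" to it allocates the spare buffer and takes a snapshot, and that reading the file back dumps the captured trace; run it as root on a kernel with the feature enabled.

	#include <stdio.h>
	#include <stdlib.h>

	/* Assumed location of the snapshot control file */
	#define SNAPSHOT "/sys/kernel/debug/tracing/snapshot"

	int main(void)
	{
		char line[4096];
		FILE *f = fopen(SNAPSHOT, "w");

		if (!f) {
			perror(SNAPSHOT);
			return EXIT_FAILURE;
		}
		fputs("1\n", f);	/* allocate the spare buffer and take a snapshot */
		fclose(f);

		f = fopen(SNAPSHOT, "r");
		if (!f) {
			perror(SNAPSHOT);
			return EXIT_FAILURE;
		}
		while (fgets(line, sizeof(line), f))	/* dump the captured trace */
			fputs(line, stdout);
		fclose(f);
		return EXIT_SUCCESS;
	}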

Changes visible to kernel developers include:

  • The workqueue functions work_pending() and delayed_work_pending() have been deprecated; users are being converted throughout the kernel tree.

  • The "regmap" API, which simplifies management of device register sets, now supports a "no bus" mode if the driver supplies simple "read" and "write" functions. Regmap has also gained asynchronous I/O support.
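
As a rough illustration of the "no bus" regmap mode just described, a driver for a memory-mapped device might wire up its callbacks along these lines. This is a sketch only: the reg_read()/reg_write() callback signatures and the NULL bus argument to regmap_init() are assumptions about how the new mode is used, not details taken from the patches.

	/*
	 * Sketch of regmap "no bus" mode: the driver provides its own read
	 * and write callbacks instead of going through an I2C or SPI bus.
	 * The callback signatures and the NULL bus argument below are
	 * assumptions for illustration; consult the regmap patches for the
	 * real interface.
	 */
	#include <linux/err.h>
	#include <linux/io.h>
	#include <linux/regmap.h>

	static int my_reg_read(void *context, unsigned int reg, unsigned int *val)
	{
		void __iomem *base = context;

		*val = readl(base + reg);
		return 0;
	}

	static int my_reg_write(void *context, unsigned int reg, unsigned int val)
	{
		void __iomem *base = context;

		writel(val, base + reg);
		return 0;
	}

	static const struct regmap_config my_regmap_config = {
		.reg_bits = 32,
		.val_bits = 32,
		.reg_read = my_reg_read,
		.reg_write = my_reg_write,
	};

	/* In the driver's probe path: no regmap_bus is passed at all */
	static int my_setup_regmap(struct device *dev, void __iomem *base)
	{
		struct regmap *map = regmap_init(dev, NULL, base, &my_regmap_config);

		return IS_ERR(map) ? PTR_ERR(map) : 0;
	}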

If the usual schedule holds, the 3.9 merge window should stay open until approximately March 5. As usual, LWN will list the most significant changes throughout the merge window; tune in next week for the next exciting episode.

Multi-cluster power management

By Jonathan Corbet
February 20, 2013
The ARM "big.LITTLE" architecture is an interesting beast: it combines clusters of two distinct ARM-based CPU designs into a single processor. One cluster contains relatively slow Cortex-A7 CPUs that are highly power-efficient, while the other cluster is made up of fast, power-hungry Cortex-A15 CPUs. These CPUs can be powered up and down in any combination, but there are additional power savings if an entire cluster can be powered down at once. Power-efficient scheduling is already a challenge for Linux even on homogeneous architectures; big.LITTLE throws another degree of freedom into the mix that the scheduler is, as yet, entirely unprepared to deal with.

As a result, the initial approach to big.LITTLE is to treat each pair of fast and slow CPUs as if it were a single CPU with high- and low-frequency modes. That approach reduces the problem to writing an appropriate cpufreq governor at the cost of forcing one CPU in each pair to be powered down at any given time. The big.LITTLE patch set is more fully described in the article linked above; that patch set is coming along but is not yet ready for merging into the mainline. One piece of the larger patch set that might be ready for 3.9, though, is the "multi-cluster power management" (MCPM) code.

The Linux kernel has reasonably good CPU power management, but that code, like the scheduler, was not designed with multiple, dissimilar clusters in mind. Fixing that requires adding logic that can determine when entire clusters must be powered up and down, along with the code that actually implements those transitions. The MCPM subsystem is concerned with the latter part of the problem, which is not as easy as one might expect.

Multi-cluster power management involves the definition of a state machine that implements a 2x3 table of states. Along one axis are the three states describing the cluster's current power situation: CLUSTER_DOWN, CLUSTER_UP, and CLUSTER_GOING_DOWN. The first two are steady states, while the third indicates that the cluster is being powered down, but that the power-down operation is not yet complete. The other axis in the state table describes whether the kernel running on some CPU has decided that the cluster needs to be powered up or not; those states are called INBOUND_NOT_COMING_UP and INBOUND_COMING_UP. The table as a whole thus contains six states, along with a well-defined set of rules describing transitions between those states.
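
For illustration only, the two axes of that table can be written out as a pair of C enumerations; the names follow the article's description, while the actual definitions (and any per-cluster bookkeeping structure) live in the MCPM patch set itself:

	/* Illustrative sketch of the 2x3 MCPM state table described above */
	enum mcpm_cluster_state {
		CLUSTER_DOWN,		/* steady state: cluster powered off */
		CLUSTER_UP,		/* steady state: cluster fully running */
		CLUSTER_GOING_DOWN,	/* power-down decided but not yet complete */
	};

	enum mcpm_inbound_state {
		INBOUND_NOT_COMING_UP,	/* no CPU wants the cluster powered up */
		INBOUND_COMING_UP,	/* some CPU has decided the cluster must come up */
	};

	struct mcpm_cluster {		/* hypothetical: one entry per cluster */
		enum mcpm_cluster_state cluster;
		enum mcpm_inbound_state inbound;
	};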

Shutdown

To begin with, imagine a cluster that is in a small portion of the state space: it is either fully powered up or fully powered down:

[state diagram]

The cluster is either running or it is not; in either of the above state combinations, there is no plan to bring the cluster up (the INBOUND_COMING_UP substate would make no sense in a fully-running cluster in any case).

If we start from the top of the diagram (CLUSTER_UP), we can then trace out the sequence of steps needed to bring the cluster down. The first of those, once the power-down decision has been made, is to determine which CPU is (in the MCPM terminology) the "last man" that is in charge of shutting everything down and turning off the lights on its way out. Since the cluster is fully operational, that decision is relatively easy; a would-be last man simply acquires the relevant spinlock and elects itself into the position. Once that has happened, the last man pushes the cluster through to the CLUSTER_DOWN state:

[state diagram]

All transitions marked with solid red arrows are executed by the last man CPU. Once the decision to power down has been made, the cluster moves to CLUSTER_GOING_DOWN, where the cleanup work is done. Among other things, the last man will wait until all other CPUs in the cluster have powered themselves down. Once everything is ready, the last man pushes the cluster into CLUSTER_DOWN, powering itself down in the process.
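
Expressed as pseudocode in the spirit of the vlocks example shown below (all names here are invented for illustration and do not match the actual MCPM code), the last-man shutdown path looks roughly like this:

	/*
	 * Hypothetical pseudocode for the last-man shutdown sequence
	 * described above; helper functions are invented for illustration.
	 */
	void cluster_power_down(struct mcpm_cluster *c, spinlock_t *lock)
	{
		spin_lock(lock);
		if (c->cluster != CLUSTER_UP) {
			spin_unlock(lock);	/* someone else is already the last man */
			return;
		}
		c->cluster = CLUSTER_GOING_DOWN;	/* we are now the last man */
		spin_unlock(lock);

		/* wait for every other CPU in the cluster to power itself off */
		wait_for_other_cpus_to_power_down(c);
		/* flush caches, shut down shared cluster hardware, ... */
		flush_caches_and_shut_down_cluster_hardware(c);

		c->cluster = CLUSTER_DOWN;
		power_down_this_cpu();		/* does not return */
	}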

Coming back up

Bringing the cluster back up is a similar process, but with an interesting challenge: the CPUs in the cluster must elect a "first man" CPU to carry the initialization work far enough that the kernel can run safely on all the other CPUs. The problem is that, when a cluster first powers up, there may be no memory coherence between the CPUs in that cluster, so spinlocks are not a reliable mechanism for mutual exclusion. Some other mechanism must be used to safely choose a first man; that mechanism is called "voting mutexes" or "vlocks."

The core idea behind vlocks is that, while atomic instructions will not work between CPUs, it is still possible to use memory barriers to ensure that other CPUs can see a specific memory change. Acquiring a vlock in this environment is a multi-step operation: a CPU will indicate that it is about to vote for a lock holder, then vote for itself. Once (1) at least one CPU has voted for itself, and (2) all CPUs interested in voting have had their say, the CPU that voted last wins. The vlocks.txt documentation file included with the patch set provides the following pseudocode to illustrate the algorithm:

	int currently_voting[NR_CPUS] = { 0, };
	int last_vote = -1; /* no votes yet */

	bool vlock_trylock(int this_cpu)
	{
		/* signal our desire to vote */
		currently_voting[this_cpu] = 1;
		if (last_vote != -1) {
			/* someone already volunteered himself */
			currently_voting[this_cpu] = 0;
			return false; /* not ourself */
		}

		/* let's suggest ourself */
		last_vote = this_cpu;
		currently_voting[this_cpu] = 0;

		/* then wait until everyone else is done voting */
		for_each_cpu(i) {
			while (currently_voting[i] != 0)
				/* wait */;
		}

		/* result */
		if (last_vote == this_cpu)
			return true; /* we won */
		return false;
	}

Missing from the pseudocode is the use of memory barriers to make each variable change visible across the cluster; in truth, the memory caches for the cluster have not been enabled at the time that the first-man election takes place, so few barriers are necessary. Needless to say, vlocks are relatively slow, but that doesn't matter much when compared to a heavyweight operation like powering up an entire cluster.

Once a first man has been chosen, it drives the cluster through a set of states on its way back to full functionality:

[state diagram]

The dotted green lines indicate state transitions executed by the inbound, first-man CPU. When a decision is made to power the cluster up, the first man will switch to the CLUSTER_DOWN / INBOUND_COMING_UP combination. While the cluster is in this state, the first man is the only CPU running; its job is to initialize things to the point that the other CPUs can safely resume the kernel with properly-functioning mutual exclusion primitives. Once that has been achieved, the cluster moves to CLUSTER_UP / INBOUND_COMING_UP while the other CPUs come on line; a final transition to CLUSTER_UP / INBOUND_NOT_COMING_UP happens shortly thereafter.

That describes the basic mechanism, but leaves one interesting question unaddressed: what happens when CPUs disagree about whether the cluster should go up or down? Such disagreements will not happen during the power-up process; the cluster is being brought online to execute a specific task that will still need to be done. But it is possible for the kernel as a whole to change its mind about powering a cluster down; an unexpected interrupt or load spike could indicate that the cluster is still needed. In that case, a new first man may make an appearance while the last man is trying to clock out and go home. This situation is handled by having the first man transition the cluster into the sixth state combination:

[state diagram]

The CLUSTER_GOING_DOWN / INBOUND_COMING_UP state encapsulates the conflicted situation where the CPUs differ on the desired state. The eventual outcome needs to be a powered-up, functioning cluster. The last man must occasionally check for this state transition as it goes through its power-down rituals; when it notices that the cluster actually wants to be up, it faces a choice:

[state diagram]

The optimal solution would be to abort the power-down process, unwind any work that has been done, and put the cluster into the CLUSTER_UP / INBOUND_COMING_UP state, at which point the first man can finish the job. Should that not be practical, though, the last man can complete the job and switch to CLUSTER_DOWN / INBOUND_COMING_UP instead; the first man will then go through the full power-up operation. Either way, the end result will be a functioning cluster.
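
In the same illustrative pseudocode as before (invented names, not the MCPM code itself), the last man's decision might be expressed as:

	/*
	 * Hypothetical pseudocode for the conflict case: the last man
	 * notices that an inbound first man wants the cluster back up.
	 */
	void last_man_handle_inbound(struct mcpm_cluster *c)
	{
		if (c->inbound == INBOUND_COMING_UP && power_down_can_be_aborted(c)) {
			/* undo the partial shutdown and hand over to the first man */
			unwind_power_down_work(c);
			c->cluster = CLUSTER_UP;
			return;
		}

		/*
		 * Otherwise finish the shutdown; the first man will then run
		 * the full power-up sequence from CLUSTER_DOWN.
		 */
		flush_caches_and_shut_down_cluster_hardware(c);
		c->cluster = CLUSTER_DOWN;
		power_down_this_cpu();
	}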

A few closing notes

The above text pretty much describes the process used to change a cluster's power state; most of the rest is just architecture-specific details. For the curious, a lot more information can be found in cluster-pm-race-avoidance.txt, included with the MCPM patch set. It is noteworthy that the entire MCPM patch set is contained within the ARM architecture subtree; indeed, the entire big.LITTLE patch set is ARM-specific. Perhaps that is how it needs to be, but it is also not difficult to imagine that other architectures may, at some point, follow ARM into the world of heterogeneous clusters. There may come a time when many of the lessons learned here will need to be applied to generic code.

Traditionally, ARM developers have confined themselves to working with a specific ARM subarchitecture, leading to a lot of duplicated (and substandard) code under arch/arm as a whole. More recently, there has been a big push to work across the ARM subarchitectures; that has resulted in a lot of cleaned-up support code and abstractions for ARM as a whole. But, possibly, the ARM developers are still a little bit nervous about stepping outside of arch/arm and making changes to the core kernel when those changes are needed. Given that there are probably more Linux systems running on ARM processors than on any other architecture, it would be natural to expect that the needs of the ARM architecture would drive the evolution of the kernel as a whole. That is certainly happening, but, one could argue, it could be happening more often and more consistently.

One could instead argue that the big.LITTLE patch set is a short-term hack intended to get Linux running on the relevant hardware until a proper solution can be implemented. The "proper solution" is still likely to need MCPM, though, and, in any case, this kind of hack has a tendency to stick around for a long time. There is almost certainly a long list of use cases for which the basic big.LITTLE approach gives more than adequate results, while getting proper performance out of a true, scheduler-based solution may take years of tricky work. Cpufreq-based big.LITTLE support may need to persist for a long time while a scheduler-based approach is implemented and stabilized.

That work is currently underway in the form of the big LITTLE MP project; there are patches being passed around within Linaro now. Needless to say, this work does touch the core scheduler, with over 1000 lines added to kernel/sched/fair.c. Thus far, though, this work has been done by ARM developers with little input from the core scheduler developers and no exposure on the linux-kernel mailing list. One can only imagine that, once the linux-kernel posting is made, there will be a reviewer comment or two to address. So big LITTLE MP is probably not headed for the mainline right away.

Big LITTLE MP may well be one of the first significant core kernel changes to be driven by the needs of the mobile and embedded community. It will almost certainly not be the last. The changing nature of the computing world has already made itself felt by bringing vast numbers of developers into the kernel community. Increasingly, one can expect those developers to take their place in the decision-making process for the kernel as a whole. Once upon a time, it was said that the kernel was entirely driven by the needs of enterprises. To the extent that was true, the situation is changing; we are currently partway through a transition to where enterprise developers have a lot of help from the mobile and embedded community.

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

  • Rusty Russell: vringh (February 19, 2013)

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds