Brief items
The 3.8 kernel was released on February 18; Linus
said: "
The release got delayed a couple
of days because I was waiting for confirmation of a small patch, but hey,
we could also say that it was all intentional, and that this is the special
'Presidents' Day Release'. It sounds more planned that way, no?"
Some of the headline features in this release include metadata integrity
checking in the xfs filesystem, the foundation for much improved NUMA
scheduling,
kernel memory usage accounting
and associated usage limits,
inline data
support for small files in the ext4 filesystem, nearly complete
user namespace support, and much more. See
the
KernelNewbies 3.8 page for
lots of details.
Stable updates:
3.7.8,
3.4.31, and 3.0.64 were released on February 14,
3.7.9,
3.4.32, and
3.0.65 were released on February 17,
and 3.2.39 came out on February 20.
One person's bug is another person's fascinating invertebrate.
— Neil Brown
Comments in XFS, especially weird scary ones, are rarely
wrong. Some of them might have been there for close on 20 years,
but they are our documentation for all the weird, scary stuff that
XFS does. I rely on them being correct, so it's something I always
pay attention to during code review. IOWs, When we add, modify or
remove something weird and scary, the comments are updated
appropriately so we'll know why the code is doing something weird
and scary in another 20 years time.
— Dave Chinner
Just to get back at you though, I'll turn on an incandescent light
bulb every time I have to use -f.
— Chris Mason (to Eric Sandeen)
By Jonathan Corbet
February 20, 2013
The story of the "native Linux KVM tool" (or, more recently, "kvmtool") has
been playing out since early 2011. This tool serves as a simple
replacement for the QEMU emulator, making it easy to set up and run guests
under KVM. The kvmtool developers have been working under the assumption
that their code would be merged into the mainline kernel, as was done with
perf, but others have
disagreed
with that idea. The result has been a repetitive conversation every merge
window or two as kvmtool was proposed for merging.
The conversation for the 3.9 merge window has seemingly been a bit more
decisive, though. Ingo Molnar (along with kvmtool developer Pekka Enberg)
presented a long list of reasons why they
thought it made sense to put kvmtool into the mainline repository. Ingo
even compared kernel tooling to Somalia,
saying that it was made up of "disjunct entities with not much
commonality or shared infrastructure," though, presumably, with
fewer pirates. Few others came to the
defense of kvmtool, leaving Ingo and Pekka to carry forward the argument on
their own.
Linus responded that he saw no convincing
reason to put kvmtool in the mainline; indeed, he thought that tying
kvmtool to the kernel could be retarding its development. He concluded
with:
So here, let me state it very very clearly: I will not be merging
kvmtool. It's not about "useful code". It's not about the project
keeping to improve. Both of those would seem to be *better* outside
the kernel, where there isn't that artificial and actually harmful
tie-in.
That is probably the end of the discussion unless somebody can come up with
a new argument that Linus will find more convincing. At this point, it
seems that kvmtool is destined to remain out of the mainline kernel
repository.
Kernel development news
By Jonathan Corbet
February 20, 2013
The 3.9 merge window has gotten off to a relatively slow start, with a mere
1,200 non-merge change sets pulled into the mainline as of this writing.
The process may have been slowed a bit by a sporadic reboot problem that
crept in relatively early, and which has not yet been tracked down. Even
so, a number of significant changes have already found their way in for
3.9, with many more to follow.
Important user-visible changes include:
- Progress has been made toward the goal of eliminating the timer tick
while running in user space. The patches merged for 3.9 fix up the
CPU time accounting code, printk() subsystem, and irq_work
code to function without timer interrupts; further
work can be expected in future development cycles.
- A relatively simple scheduler
patch fixes the "bouncing cow problem," wherein, on a system with
more processors than running processes, those processes can wander
across the processors, yielding poor cache behavior.
For a "worst-case" tbench benchmark run, the result is a 15x
improvement in performance.
- The format of tracing events has been changed to remove some unused
padding. This change created problems
when it was first attempted in 2011, but it seems that the relevant
user-space programs have since been fixed (by moving them to the
libtraceevent library). It is worth trying again; smaller events
require less bandwidth as they are communicated to user space.
Anybody who observes any remaining problems
would do well to report them during the 3.9 development cycle.
- The ftrace tracing system has gained the ability to take a static
"snapshot" of the tracing buffer, controlled via a debugfs file. See
this ftrace.txt patch for documentation on how to use this feature; a
brief illustrative sketch appears after this list.
- The perf bench utility has a new set of benchmarks intended to help
with the evaluation of NUMA balancing patches.
- perf stat has been augmented to include the ability to print
out information at a regular interval.
- New hardware support includes:
- Systems and processors:
The "Goldfish" virtual x86 platform used for Android development,
Technologic Systems TS-5500 single-board computers, and
SGI Ultraviolet System 3 systems.
- Input:
Cypress PS/2 touchpads and
Cypress APA I2C trackpads.
- Miscellaneous:
ST-Ericsson AB8505, AB9540, and AB8540 pin controllers,
Maxim MAX6581, MAX6602, MAX6622,
MAX6636, MAX6689, MAX6693, MAX6694, MAX6697, MAX6698, and MAX6699
temperature sensor chips,
TI / Burr Brown INA209 power monitors,
TI LP8755 power management units,
NVIDIA Tegra114 pinmux controllers,
Allwinner A1X pin controllers,
ARM PL320 interprocessor communication mailboxes,
Calxeda Highbank CPU frequency controllers,
Freescale i.MX6Q CPU frequency controllers, and
Marvell Kirkwood CPU frequency controllers.
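Returning to the ftrace snapshot item above: the feature is driven from
user space by writing to, and reading from, a control file in the tracing
debugfs directory. The short program below is only an illustrative sketch;
the path and the exact write semantics are assumptions that should be
checked against the ftrace.txt patch linked above.

        #include <stdio.h>
        #include <stdlib.h>

        /* Path assumed here; see the ftrace documentation for details. */
        #define SNAPSHOT_FILE "/sys/kernel/debug/tracing/snapshot"

        int main(void)
        {
                FILE *f;
                char line[4096];

                /* Writing "1" is assumed to capture the current contents
                   of the trace buffer into the snapshot buffer. */
                f = fopen(SNAPSHOT_FILE, "w");
                if (!f) {
                        perror(SNAPSHOT_FILE);
                        return EXIT_FAILURE;
                }
                fputs("1\n", f);
                fclose(f);

                /* Reading the same file back returns the captured snapshot. */
                f = fopen(SNAPSHOT_FILE, "r");
                if (!f) {
                        perror(SNAPSHOT_FILE);
                        return EXIT_FAILURE;
                }
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);

                return 0;
        }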
Changes visible to kernel developers include:
- The workqueue functions work_pending() and
delayed_work_pending() have been deprecated; users are being
changed throughout the kernel tree.
- The "regmap" API, which simplifies management of device register sets,
now supports a "no bus" mode if the driver supplies simple "read" and
"write" functions. Regmap has also gained asynchronous I/O support.
If the usual schedule holds, the 3.9 merge window should stay open until
approximately March 5. As usual, LWN will list the most significant
changes throughout the merge window; tune in next week for the next
exciting episode.
By Jonathan Corbet
February 20, 2013
The ARM "
big.LITTLE" architecture is an
interesting beast: it combines clusters of two distinct ARM-based CPU
designs into a single processor. One cluster contains relatively slow
Cortex-A7 CPUs that are highly power-efficient, while the other cluster is
made up of fast, power-hungry Cortex-A15 CPUs. These CPUs can be powered
up and down in any combination, but there are additional power savings if
an entire cluster can be powered down at once. Power-efficient scheduling
is currently a challenge for Linux even on homogeneous architectures;
big.LITTLE throws another degree of freedom into the mix that the scheduler
is, as yet, entirely unprepared to deal with.
As a result, the initial approach to big.LITTLE is to treat each pair of
fast and slow CPUs as if it were a single CPU with high- and low-frequency
modes. That approach reduces the problem to writing an appropriate
cpufreq governor at the cost of forcing one CPU in each pair to be powered
down at any given time. The big.LITTLE patch set is more fully described
in the article linked above; that patch
set is coming along but is not yet ready for merging into the mainline.
One piece of the larger patch set that might be ready for 3.9, though, is
the "multi-cluster power management" (MCPM)
code.
The Linux kernel has reasonably good CPU power management, but that code,
like the scheduler, was not designed with multiple, dissimilar clusters in
mind. Fixing that requires adding logic that can determine when entire
clusters must be powered up and down, along with the code that actually
implements those transitions. The MCPM subsystem is concerned with the
latter part of the problem, which is not as easy as one might expect.
Multi-cluster power management involves the definition of a state machine
that implements a 2x3 table of states. Along one axis are the three states
describing the cluster's current power situation: CLUSTER_DOWN,
CLUSTER_UP, and CLUSTER_GOING_DOWN. The first two are
steady states, while the third indicates that the cluster is being powered
down, but that the power-down operation is not yet complete. The other
axis in the state table describes whether the kernel running on some CPU
has decided that the
cluster needs to be powered up or not; those states are called
INBOUND_NOT_COMING_UP and INBOUND_COMING_UP. The table
as a whole thus contains six states, along with a well-defined set of rules
describing transitions between those states.
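As a purely illustrative sketch (the type and field names here are invented
for this article rather than taken from the MCPM patches), the two axes and
their combination might be represented like this:

        /* Per-cluster power state, as described above. */
        enum cluster_power_state {
                CLUSTER_DOWN,           /* fully powered off */
                CLUSTER_UP,             /* fully operational */
                CLUSTER_GOING_DOWN,     /* power-down started but not complete */
        };

        /* Has some CPU decided that this cluster must be brought up? */
        enum cluster_inbound_state {
                INBOUND_NOT_COMING_UP,
                INBOUND_COMING_UP,
        };

        /*
         * The combination of the two axes gives the six states of the
         * MCPM state machine, one instance per cluster.
         */
        struct cluster_state {
                enum cluster_power_state        power;
                enum cluster_inbound_state      inbound;
        };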
Shutdown
To begin with, imagine a cluster that occupies only a small portion of the
state space: it is either fully powered up (CLUSTER_UP /
INBOUND_NOT_COMING_UP) or fully powered down (CLUSTER_DOWN /
INBOUND_NOT_COMING_UP). In either of those state combinations, there is no
plan to bring up the cluster (the INBOUND_COMING_UP substate would make no
sense in a fully-running cluster in any case).
If we start from the fully-running state (CLUSTER_UP), we can then
trace out the sequence of steps needed to bring the cluster down. The
first of those, once the power-down decision has been made, is to determine
which CPU is (in the MCPM terminology) the "last man" that is in charge of
shutting everything down
and turning off the lights on its way out. Since the cluster is fully
operational, that decision is relatively easy; a would-be last man simply
acquires the relevant spinlock and elects itself into the position. Once
that has happened, the last man pushes the cluster through to the
CLUSTER_DOWN state.
All of the transitions in this shutdown sequence are executed by the last man
CPU. Once the decision to power down has been made, the cluster moves to
CLUSTER_GOING_DOWN, where the cleanup work is done. Among other
things, the last man will wait until all other CPUs in the cluster have
powered themselves down. Once everything is ready, the last man pushes the
cluster into CLUSTER_DOWN, powering itself down in the process.
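While the cluster is still fully coherent, the election itself can rely on
ordinary locking; the following sketch (with invented lock, variable, and
function names, not taken from the MCPM patches) shows the general shape of
a spinlock-based last-man election:

        #include <linux/spinlock.h>

        /* Invented names for illustration; per-cluster in real code. */
        static DEFINE_SPINLOCK(cluster_lock);
        static bool last_man_chosen;

        /*
         * Called on a CPU that wants the whole cluster to go down.
         * Because the cluster is still coherent at this point, an
         * ordinary spinlock suffices.  Returns true if this CPU won the
         * election and must therefore do the cluster-wide teardown
         * (moving the cluster to CLUSTER_GOING_DOWN, waiting for the
         * other CPUs to power themselves off, then entering
         * CLUSTER_DOWN) before powering itself down.
         */
        static bool try_to_become_last_man(void)
        {
                bool won = false;

                spin_lock(&cluster_lock);
                if (!last_man_chosen) {
                        last_man_chosen = true;
                        won = true;
                }
                spin_unlock(&cluster_lock);

                return won;
        }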
Coming back up
Bringing the cluster back up is a similar process, but with an interesting
challenge: the CPUs in the cluster must elect a "first man" CPU to perform
the initialization work far enough that the kernel can run safely on all
the other CPUs. The problem is that, when a cluster first powers up, there
may be no memory coherence between the CPUs in that cluster, so spinlocks
are not a reliable mechanism for mutual exclusion. Some other mechanism
must be used to safely choose a first man; that mechanism is called "voting
mutexes" or "vlocks."
The core idea behind vlocks is that, while atomic instructions will not
work between CPUs, it is still possible to use memory barriers to ensure
that other CPUs can see a specific memory change. Acquiring a vlock in
this environment is a multi-step operation: a CPU will indicate that it is
about to vote for a lock holder, then vote for itself. Once (1) at
least one CPU has voted for itself, and (2) all CPUs interested in
voting have had their say, the CPU that voted last wins. The vlocks.txt documentation file included with
the patch set provides the following pseudocode to illustrate the
algorithm:
int currently_voting[NR_CPUS] = { 0, };
int last_vote = -1; /* no votes yet */

bool vlock_trylock(int this_cpu)
{
        /* signal our desire to vote */
        currently_voting[this_cpu] = 1;
        if (last_vote != -1) {
                /* someone already volunteered himself */
                currently_voting[this_cpu] = 0;
                return false; /* not ourself */
        }

        /* let's suggest ourself */
        last_vote = this_cpu;
        currently_voting[this_cpu] = 0;

        /* then wait until everyone else is done voting */
        for_each_cpu(i) {
                while (currently_voting[i] != 0)
                        /* wait */;
        }

        /* result */
        if (last_vote == this_cpu)
                return true; /* we won */
        return false;
}
Missing from the pseudocode is the use of memory barriers to make each
variable change visible across the cluster; in truth, the memory caches for
the cluster have not been enabled at the time that the first-man election
takes place, so few barriers are necessary. Needless to say, vlocks are
relatively slow, but that doesn't matter much when compared to a
heavyweight operation like powering up an entire cluster.
Once a first man has been chosen, it drives the cluster through a set of
states on its way back to full functionality. These state transitions are
executed by the inbound,
first-man CPU. When a decision is made to power the cluster up, the first
man will switch to the CLUSTER_DOWN / INBOUND_COMING_UP
combination. While the cluster is in this state, the first man is the only
CPU running; its job is to initialize things to the point that the other
CPUs can safely resume the kernel with properly-functioning mutual
exclusion primitives. Once that has been achieved, the cluster moves to
CLUSTER_UP / INBOUND_COMING_UP while the other CPUs come on line;
a final transition to CLUSTER_UP / INBOUND_NOT_COMING_UP happens
shortly thereafter.
That describes the basic mechanism, but leaves one interesting question
unaddressed: what happens when CPUs disagree about whether the cluster
should go up or down? Such disagreements will not happen during the
power-up process; the cluster is being brought online to execute a specific
task that will still need to be done. But it is possible for the kernel as
a whole to change its mind about powering a cluster down; an unexpected
interrupt or load spike could indicate that the cluster is still needed.
In that case, a new first man may make an appearance while the last man is
trying to clock out and go home. This situation is handled by having the
first man transition the cluster into the sixth state combination.
The CLUSTER_GOING_DOWN / INBOUND_COMING_UP state encapsulates the
conflicted situation where the CPUs differ on the desired state. The
eventual outcome needs to be a powered-up, functioning cluster.
The last man must occasionally check for this state transition as it goes
through its power-down rituals; when it notices that the cluster actually
wants to be up, it faces a choice:
The optimal solution would be to abort the power-down process, unwind any
work that has been done, and put the cluster into the CLUSTER_UP /
INBOUND_COMING_UP state, at which point the first man can finish the
job. Should that not be practical, though, the last man can complete the
job and switch to CLUSTER_DOWN / INBOUND_COMING_UP instead; the
first man will then go through the full power-up operation. Either way,
the end result will be a functioning cluster.
A few closing notes
The above text pretty much describes the process used to change a cluster's
power state; most of the rest is just architecture-specific details. For
the curious, a lot more information can be found in cluster-pm-race-avoidance.txt, included with
the MCPM patch set. It is noteworthy that the entire MCPM patch
set is contained within the ARM architecture subtree; indeed, the entire
big.LITTLE patch set is ARM-specific. Perhaps that is how it needs to be, but
it is also not difficult to imagine that other architectures may, at some
point, follow ARM into the world of heterogeneous clusters. There may come
a time when many of the lessons learned here will need to be applied to
generic code.
Traditionally, ARM developers have confined themselves to working with a
specific ARM subarchitecture, leading to a lot of duplicated (and
substandard) code under arch/arm as a whole. More recently, there
has been a big push to work across the ARM subarchitectures; that has
resulted in a lot of cleaned up support code and abstractions for ARM as a
whole. But, possibly, the ARM developers are still a little bit
nervous about stepping outside of arch/arm and making changes to
the core kernel when those changes are needed. Given that there are
probably more Linux systems running on ARM processors than any other, it
would be natural to expect that the needs of the ARM architecture would
drive the evolution of the kernel as a whole. That is certainly happening,
but, one could argue, it could be happening more often and more
consistently.
One could instead argue that the big.LITTLE patch set is a short-term hack intended to get
Linux running on the relevant hardware until a proper solution can be
implemented. The "proper solution" is still likely to need MCPM, though,
and, in any case, this kind of hack has a tendency to stick around for a
long time. There is almost certainly a long list of use cases for which
the basic big.LITTLE approach gives more than adequate results, while
getting proper performance out of a true, scheduler-based solution may take
years of tricky work. Cpufreq-based big.LITTLE support may need to persist
for a long time while a scheduler-based approach is implemented and stabilized.
That work is currently underway in the form of the big LITTLE MP project; there are patches being passed around within Linaro
now. Needless to say, this work does touch the core scheduler, with over
1000 lines added to kernel/sched/fair.c. Thus far, though, this
work has been done by ARM developers with little input from core scheduler
developers and no exposure on the linux-kernel mailing list. One can only
imagine that, once the linux-kernel posting is made, there will be a
reviewer comment or two to address. So big LITTLE MP is probably not
headed for the mainline right away.
Big LITTLE MP may well be one of the first significant core kernel changes
to be driven by the needs of the mobile and embedded community. It will
almost certainly not be the last. The changing nature of the computing
world has already made itself felt by bringing vast numbers of developers
into the kernel community. Increasingly, one can expect those developers
to take their place in the decision-making process for the kernel as a
whole. Once upon a time, it was said that the kernel was entirely driven
by the needs of enterprises. To the extent that was true, the situation is
changing; we are currently partway through a transition to a point where
enterprise developers have a lot of help from the mobile and embedded
community.
Patches and updates
Kernel trees
- Linus Torvalds: Linux 3.8.
(February 19, 2013)
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
- Eric W. Biederman: [PATCH review 00/85] userns changes for 9p, afs, ceph, cifs, coda,
gfs2, ncpfs, nfs, nfsd, and ocfs2.
(February 18, 2013)
Memory management
Networking
Security-related
Virtualization and containers
- Rusty Russell: vringh.
(February 19, 2013)
Miscellaneous
Page editor: Jonathan Corbet