The 3.3 kernel was released on March 18, so there is no development
kernel as of this writing. Some of the headline features in the 3.3 release
include the byte queue limits infrastructure, Open vSwitch, the return of
much of the Android code to the staging tree, C6X architecture support,
large physical address extension support for the ARM architecture, and
more. Much more information can be found on the KernelNewbies 3.3 page.
The 3.4 merge window is open; see the separate article below for a summary
of what has been merged so far.
Stable updates: the 3.0.25 and 3.2.12 stable updates were released, with the
usual pile of important fixes, on March 19. For users of older
kernels, two long-term kernel releases, with a
relatively small number of fixes, came out on March 17.
The 3.0.26 and 3.2.13 stable updates are in the review
process as of this writing; they can be expected soon.
Patch verification occurs in an artificial bubble of software
run/known by kernel developers. It can take years before the code
is exposed to real life situations.
-- Christoph Lameter
Thou shalt not, in the language of C, under any circumstances, on
the pain of death, declare or define a function with an empty set
of parentheses, for though in the language of C++ it meaneth the
same as (void), in C it meaneth (...) which is of meaningless as
there be no anchor argument by which the types of the varadic
arguments can be expressed, and which misleadeth the compiler into
allowing unsavory code and in some cases generate really ugly stuff
for varadic handling.
-- H. Peter Anvin
The only thing that gets drivers written is writing the damn
-- Adam Jackson
Brendan Gregg demonstrates flame graphs
as a tool for tracking down kernel performance
problems. "The perf report tree (and the ncurses navigator) do an
excellent job at presenting this information as text. However, with text
there are limitations. The output often does not fit in one screen (you
could say it doesn’t need to, if the bulk of the samples are identified on
the first page). Also, identifying the hottest code paths requires reading
the percentages. With the flame graph, all the data is on screen at once,
and the hottest code-paths are immediately obvious as the widest
" The flame graph code
hosted on Github.
At the 2011 Kernel Summit, Google developer
Paul Turner described a scheduler testing framework which, he said, would
be released soon. Naturally, things took longer than expected, but, on
March 14, Paul released
a version of
Linsched for general use. Given the amount of interest in this tool, it's
likely that it will find its way into the mainline in a relative hurry.
Linsched is a framework that can run the kernel scheduler with various
simulated workloads and draw conclusions about the quality of the decisions
made. It looks at overall CPU utilization, the number of migrations, and
more. It is able to simulate a wide range of hardware topologies.
The original Linsched posting was quite intrusive; it inserted over 5,000
lines of code into the kernel behind "#ifdef LINSCHED" lines.
A determined effort has reduced that number to all of 20 lines
of code. The rest has been cleverly hidden in a special "linsched"
architecture that provides just enough support to run the scheduler in user
space, where the actual simulation and measurement code also lives.
Making changes to the scheduler is a notoriously difficult task; one can
easily add regressions for specific workloads that go unnoticed until the
changes go into production. With enough simulated topologies and
workloads, a tool like Linsched should be able to remove a lot of that risk
from scheduler development. And that should lead to better kernel releases.
Kernel development news
The release of the 3.3 kernel on March 18 has led inevitably to the opening
of the merge window for the 3.4 development cycle. As of this writing,
some 3,500 non-merge changesets have been pulled into the mainline; this
cycle, in other words, has just begun.
A number of user-visible features have been merged for 3.4.
Also worth noting: the "ramster" transcendent memory functionality was
briefly added to the staging tree before being removed; various other
changes had caused it to be seriously broken. Ramster can be thought of as
a way of sharing memory across machines; a system with spare pages can host
data for another that is under memory pressure. See this article for more details and this article for an exposition of the vision
behind Ramster. Adding this functionality requires carving a number of
features out of the OCFS2 filesystem and making them globally available.
One assumes these patches will return for 3.5.
Changes visible to kernel developers include:
- Jump labels have been rebranded again; after a false start they are now known as "static keys". Details can be found in the new Documentation/static-keys.txt file.
- The (now) unused get_driver() and put_driver()
functions have been removed from the kernel.
- The debugfs filesystem understands the uid=, gid=,
and mode= mount options, allowing the ownership and
permissions for the filesystem to be set in /etc/fstab.
- The zsmalloc allocator has been added
to the staging tree; the older "xvmalloc" allocator has been removed.
- The Android "alarm" driver has been added to the staging tree.
- The deferred driver probing mechanism
has been merged.
- The list of power management stages continues to grow; the kernel has
new callbacks called suspend_late(), resume_early(),
poweroff_late(), and restore_early() for operations
that must be performed at just the right time.
- The "IRQ domain" abstraction has been merged; IRQ domains make it
easier to manage interrupts on systems with more than one interrupt
controller. See Documentation/IRQ-domain.txt for more information.
- The long-unused second argument to kmap_atomic() has been
removed. Thanks to some preprocessor trickery, calling
kmap_atomic() with two arguments still works, but a
deprecation warning will result.
- There is a new mechanism for the autoloading of drivers for specific
x86 CPU features. Such drivers should declare a
MODULE_DEVICE_TABLE with the x86cpu type; see the
comments at the head of arch/x86/kernel/cpu/match.c for details.
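As an aside, the new debugfs mount options mentioned in the list above can
be used directly in /etc/fstab; a line like the following (the group ID and
mode values are chosen arbitrarily for illustration) would restrict access
to the filesystem:

```
debugfs  /sys/kernel/debug  debugfs  uid=0,gid=1000,mode=0750  0  0
```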
The 3.4 merge window can be expected to continue until roughly
April 2. There are a lot of subsystem trees yet to be pulled, so one
can expect a large number of changes to go in between now and then.
In the beginning there was printk()
- literally: the 0.01 kernel
release included 44 printk()
calls. Since then, printk()
has picked up details like logging levels and a lot of new formatting
operators; it has also expanded to tens of thousands of call sites throughout
the kernel. Developers often reach for it as the first way to figure out
what is going on inside a misbehaving subsystem. If some developers have
their way, though, printk()
calls will become an endangered
species. But not everybody has signed on to that goal.
There are certainly plenty of ways in which printk() could be
improved. It imposes no standardization on messages, either across a
subsystem or over time. As a result, messages can be hard for programs
(or people) to parse, and they can change in trivial but obnoxious ways
from one kernel release to the next. The actual calls are relatively
verbose; among other things, that often causes
printk() statements to run afoul of the 80-column line width
restriction. Messages printed with printk() may also lack
important information needed to determine what the kernel is really trying to do.
Various attempts have been made to improve on printk() over the
years. Arguably the most successful of those is the set of functions
defined for device drivers:
int dev_dbg(struct device *dev, const char *format, ...);
int dev_info(struct device *dev, const char *format, ...);
int dev_notice(struct device *dev, const char *format, ...);
/* ... */
int dev_emerg(struct device *dev, const char *format, ...);
These functions, by embedding the logging level in the name itself, are
more concise than the printk() calls they replace. They also
print the name of the relevant device in standard form, ensuring that it's
always possible to associate a message with the device that generated it.
Use of these functions is not universal in device drivers, but it is
widespread and uncontroversial.
There is a rather lower level of consensus surrounding a different set of
functions (macros, really) that look like this:
int pr_info(const char *format, ...);
/* ... */
int pr_emerg(const char *format, ...);
These functions, too, encode the logging level in the function name, making
things more concise. They also attempt to at least minimally standardize
the format of logging by passing the format string through a macro
called pr_fmt(). That leads to a line like this appearing in
several hundred source files in the kernel:
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Due to the way
the macro works, this line must appear before the #include block
that would otherwise be at the beginning of the file. Defining
pr_fmt() in this way causes all strings printed from the file to
have the module name prepended; many subsystems use a literal string rather
than the module name, but the intent is the same.
The spread of pr_*() through the kernel is mainly the result of an
ongoing campaign by Joe Perches - notable for having just merged a
100,000-line whitespace-only ISDN subsystem cleanup patch for 3.4 - who
has converted thousands of
printk() calls over the years. To some developers, these changes
are a welcome cleaning-up of the code; to others, they represent pointless
code churn. The discussion has been quiet for a while, but it recently
came back when Joe tried to convert the ext4
filesystem; ext4 maintainer Ted Ts'o rejected the conversion, saying:
Changing printk's to pr_info and pr_cont is patch noise as far as
I'm concerned. Adds no value, and just breaks other patches.
David Miller commented
on this decision in a rather unsympathetic fashion:
Some kernel maintainers are real blockheads about code
cleanups. And being like that doesn't make you look established and
sophisticated, instead it makes you look like what you actually
are, a relic.
Ted probably does not feel like a relic, and he is probably not trying to
be sophisticated; he is almost certainly trying to maintain code he is
responsible for in the best way he can. In his view, changing a bunch of
code from one print function to another - possibly introducing a lot of
patch conflicts on the way - does not help in that regard. Beyond that, he
said, the standardization introduced by
these functions is nowhere near enough to solve the structured logging
problem, meaning that, someday, all those calls will have to be changed yet
another time when a proper solution is available.
Proponents of the change argue that some structure is better than none, and
that the new functions offer some useful flexibility when the time to add
more structure comes. They claim that the overall size of the kernel is
reduced (slightly) due to better sharing of strings. Messages printed with
pr_debug() can be enabled and disabled with the dynamic debugging interface, while straight
printk() calls cannot. And, perhaps most of all, they argue that
consistency across the code base has value - though that argument was heard
rather less when the pr_*() interface itself was relatively new.
Needless to say, this is not the kind of discussion that comes to any sort
of definitive conclusion. With regard to ext4, the conversion will
probably not take place anytime soon; that is Ted's turf, and it is
unlikely that anybody can summon arguments strong enough to convince Linus
to override him. Elsewhere in the kernel, though, these conversions will
certainly continue. As will, undoubtedly, the associated flame wars.
A non-uniform memory access (NUMA) system is a computer divided into
"nodes," where each node (which may contain multiple processors) has some
memory which is local to the node. All system memory is visible to all
nodes, but accesses to memory that is not local to the accessing node must
go over an inter-node bus; as a result, non-local accesses are
significantly slower. There is, thus, a real performance advantage to be
gained by keeping processes and their memory on the same node.
The Linux kernel has had NUMA awareness for some time, in that it
understands that moving a process from one node to another can be an
expensive undertaking. There is also an interface (available via the
mbind() system call) by which a process can request a
specific allocation policy for its memory. Possibilities include requiring
that all allocations happen within a specific set of nodes
(MPOL_BIND), setting a looser "preferred" node
(MPOL_PREFERRED), or asking that allocations be distributed across
the system (MPOL_INTERLEAVE). It is also possible to use
mbind() to request the active migration of pages from one node to another.
So NUMA is not a new concept for the kernel, but, as Peter Zijlstra noted
in the introduction to a large NUMA patch
set, things do not work as well as they could:
Current upstream task memory allocation prefers to use the node the
task is currently running on (unless explicitly told otherwise, see
mbind()/set_mempolicy()), and with the scheduler free to move the
task about at will, the task's memory can end up being spread all
over the machine's nodes.
While the scheduler does a reasonable job of keeping short running
tasks on a single node (by means of simply not doing the cross-node
migration very often), it completely blows for long-running
processes with a large memory footprint.
As might be expected, the patch set is dedicated to the creation of a
kernel that does not "completely blow." To that end, it adds a number of
significant changes to how memory management and scheduling are done in the kernel.
There are three major sub-parts to Peter's patch set. The first is a
reworked patch set first posted by Lee
Schermerhorn in 2010. These patches change the memory policy mechanism to
make it easier for the kernel to fix things up after a process's memory has
been allocated on distant nodes. "Page migration" is the process of moving
a page from one node to another without the owning process(es) noticing the
change. With Lee's patches, the kernel implements a variation called "lazy
migration" that does not immediately relocate any pages. Instead, the
target pages are simply unmapped from the process's page tables, meaning
that the next access to any of them will generate a page fault. Actual
migration is then done at page fault time. Lazy migration is a less
expensive way of moving a large set of pages; only the pages that are
actually used are moved, the work can be spread over time, and it will be
done in the context of the faulting process.
The lazy migration mechanism is necessary for the rest of the patch set,
but it has value on its own. So the feature is made available to user
space with the MPOL_MF_LAZY flag; it is intended to be used
with the MPOL_MF_MOVE flag, which would otherwise force the
immediate migration of the affected pages. There is also a new
MPOL_MF_NOOP flag allowing the calling process to request the
migration of pages according to the current policy without changing (or
even knowing) that policy.
With lazy migration, memory distributed across a system as the result of
memory allocation and scheduling decisions can be slowly pulled back to the
optimal node. But it is better to avoid making that kind of mess in the
first place. So the second part of the patch set starts by adding the
concept of a "home node" to a process. Each process (or "NUMA
entity" - meaning groups containing a set of processes) is assigned
a home node at fork() time. The scheduler will then try hard to
avoid moving a process off its home node, but within bounds: a process will
still be run on a non-home node if the alternative would be an unbalanced
system. Memory allocations will, by default, be performed on the home node, even if the
process is running elsewhere at the time.
These policies should minimize
the scattering of memory across the system, but, with this kind of
scheduling regime, it is inevitable that, eventually, one
node will end up with too many processes and too little memory while others
are underutilized. So, sometimes, it will be necessary to rebalance
things. When the scheduler notices that long-running tasks are being
forced away from their home nodes - or that they are having to allocate
memory non-locally - it will consider migrating them to a new node.
Migration is not a half-measure in this case; the scheduler will move both
the process and its memory (using the lazy migration mechanism) to the
target node. The move is expensive, but the process (and the system)
should run much more efficiently once it's done. It only makes sense for
processes that are going to be around for a while, though; the patch set
tries to approximate that goal by only considering processes with at least
one second of run time for migration.
The final piece is a pair of new system calls allowing processes to be put
into "NUMA groups" that will share the same home node. If one of them is
migrated, the entire group will be migrated. The first system call is:
int numa_tbind(int tid, int ng_id, unsigned long flags);
This system call will bind the thread identified by tid to the
NUMA group identified by ng_id; the flags argument is
currently unused and
must be zero. If ng_id is passed as MS_ID_GET, the
system call will, instead, simply return the current NUMA group ID for the
given thread. A value of MS_ID_NEW, instead, creates a new NUMA
group, binds the thread to that group, and returns the new ID.
The second new system call is:
int numa_mbind(void *addr, unsigned long len, int ng_id, unsigned long flags);
This call will set up a memory policy for the region of len bytes
starting at addr and bind it to the NUMA group identified by
ng_id. If necessary, lazy migration will be used to move the
memory over to the node where the given NUMA group is based. Once again,
flags is unused and must be zero. Once the memory is bound to the
NUMA group, it will stay with the processes in that group; if the processes
are moved, the memory will move with them.
Peter provided some benchmark results from a two-node system. Without the
NUMA balancing patches, over time, the benchmark ended up with just as many
remote memory accesses as local accesses - allocated memory was spread
across the system. With the NUMA balancer, 86% of the memory accesses were
local, leading to a significant speedup. As Peter put it: "These
numbers also show that while there's a marked improvement, there's still
some gain to be had. The current numa balancer is still somewhat
fickle." A certain amount of fickleness is perhaps to be expected
for such an involved patch set, given how young it is. Given some time,
reviews, and testing, it should evolve into a solid scheduler component,
giving Linux far better NUMA performance than it has ever had in the past.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
- Lucas De Marchi: kmod 7 (March 19, 2012)
- Kay Sievers: udev 182 (March 20, 2012)
Page editor: Jonathan Corbet