Kernel development
Brief items
Kernel release status
The 3.3 kernel was released on March 18, so there is no development kernel as of this writing. Some of the headline features in the 3.3 release include the byte queue limits infrastructure, Open vSwitch, the return of much of the Android code to the staging tree, C6X architecture support, large physical address extension support for the ARM architecture, and more. Much more information can be found on the KernelNewbies 3.3 page.
The 3.4 merge window is open; see the separate article below for a summary of what has been merged so far.
Stable updates: the 3.0.25 and 3.2.12 stable updates were released, with the usual pile of important fixes, on March 19. For users of older kernels, the 2.6.27.62 and 2.6.32.59 long term kernel releases, with a relatively small number of fixes, came out on March 17.
The 3.0.26 and 3.2.13 stable updates are in the review process as of this writing; they can be expected on or after March 23.
Quotes of the week
Gregg: Linux Kernel Performance: Flame Graphs
Brendan Gregg demonstrates "flame graphs" as a tool for tracking down kernel performance problems. "The perf report tree (and the ncurses navigator) do an excellent job at presenting this information as text. However, with text there are limitations. The output often does not fit in one screen (you could say it doesn’t need to, if the bulk of the samples are identified on the first page). Also, identifying the hottest code paths requires reading the percentages. With the flame graph, all the data is on screen at once, and the hottest code-paths are immediately obvious as the widest functions." The flame graph code is hosted on GitHub.
Linsched for 3.3
At the 2011 Kernel Summit, Google developer Paul Turner described a scheduler testing framework which, he said, would be released soon. Naturally, things took longer than expected, but, on March 14, Paul released a version of Linsched for general use. Given the amount of interest in this tool, it's likely that it will find its way into the mainline in a relative hurry.
Linsched is a framework that can run the kernel scheduler with various simulated workloads and draw conclusions about the quality of the decisions made. It looks at overall CPU utilization, the number of migrations, and more. It is able to simulate a wide range of hardware topologies with different characteristics.
The original Linsched posting was quite intrusive; it inserted over 5,000 lines of code into the kernel behind "#ifdef LINSCHED" lines. A determined effort has reduced that number considerably - to all of 20 lines of code. The rest has been cleverly hidden in a special "linsched" architecture that provides just enough support to run the scheduler in user space. The actual simulation and measurement code lives in the tools directory.
Making changes to the scheduler is a notoriously difficult task; one can easily add regressions for specific workloads that go unnoticed until the changes go into production. With enough simulated topologies and workloads, a tool like Linsched should be able to remove a lot of that risk from scheduler development. And that should lead to better kernel releases overall.
Kernel development news
3.4 Merge window part 1
The release of the 3.3 kernel on March 18 has led inevitably to the opening of the merge window for the 3.4 development cycle. As of this writing, some 3,500 non-merge changesets have been pulled into the mainline; this cycle, in other words, has just begun.
Some of the user-visible features merged for 3.4 include:
- The perf utility understands a new --uid flag,
which restricts data gathering to processes owned by the given user
ID. It is also now possible to specify multiple processes or threads
with the --pid and --tid options.
- The perf events subsystem can now sample "taken branch" events on
hardware with the "last branch record" functionality.
- The "zcache" compressed caching system (still in staging) can now use
the crypto API for access to compression algorithms.
- The "Yama" security module has been merged; for now it just implements
some restrictions on how the ptrace() system call can be
used, but others may follow. Yama is meant to be a place to collect
various discretionary access control mechanisms intended to make a system
more secure.
- The kernel now has read-only support for the qnx6fs filesystem used
with the QNX operating system.
- New drivers include:
- Crypto: Tegra AES crypto engines.
- Miscellaneous: EnergyMicro EFM32 UART/USART ports,
Maxim DS2781 battery monitors,
Solarflare SFC9000-family hwmon controllers,
Solarflare SFC9000-family SR-IOV controllers,
TI TPS62360 and TPS65217 power regulators,
Samsung S5M8767 regulators,
Renesas RSPI controllers,
SuperH HSPI controllers,
CSR SiRFprimaII SPI controllers,
Broadcom BCM63xx SPI controllers, and
Freescale i.MX on-chip ANATOP LDO regulators.
- Network: Xilinx 10/100/1000 AXI Ethernet controllers,
PEAK PCAN-ExpressCard, PCAN-USB and PCAN-PC CAN controllers,
NXP Semiconductor LPC32xx ARM SoC-based Ethernet controllers, and
TI CPSW switches.
- USB: Ozmo USB-over-WiFi controllers.
- Staging transitions: the old telephony drivers have been moved into staging in anticipation of their eventual removal from the kernel altogether.
The kernel now also contains an audio USB gadget driver compliant with USB audio class 2.0.
Also worth noting: the "ramster" transcendent memory functionality was briefly added to the staging tree before being removed; various other changes had caused it to be seriously broken. Ramster can be thought of as a way of sharing memory across machines; a system with spare pages can host data for another that is under memory pressure. See this article for more details and this article for an exposition of the vision behind Ramster. Adding this functionality requires carving a number of features out of the OCFS2 filesystem and making them globally available. One assumes these patches will return for 3.5.
Changes visible to kernel developers include:
- Jump labels have been rebranded again; after a false start they are now known as "static
keys". Details can be found in the new Documentation/static-keys.txt file.
- The (now) unused get_driver() and put_driver()
functions have been removed from the kernel.
- The debugfs filesystem understands the uid=, gid=,
and mode= mount options, allowing the ownership and
permissions for the filesystem to be set in /etc/fstab.
- The zsmalloc allocator has been added
to the staging tree; the older "xvmalloc" allocator has been removed.
- The Android "alarm" driver has been added to the staging tree.
- The deferred driver probing mechanism
has been merged.
- The list of power management stages continues to grow; the kernel has
new callbacks called suspend_late(), resume_early(),
freeze_late(), thaw_early(),
poweroff_late(), and restore_early() for operations
that must be performed at just the right time.
- The "IRQ domain" abstraction has been merged; IRQ domains make it
easier to manage interrupts on systems with more than one interrupt
controller. See Documentation/IRQ-domain.txt for more
information.
- The long-unused second argument to kmap_atomic() has been
removed. Thanks to some preprocessor trickery, calling
kmap_atomic() with two arguments still works, but a
deprecation warning will result.
- There is a new mechanism for the autoloading of drivers for specific x86 CPU features. Such drivers should declare a MODULE_DEVICE_TABLE with the x86cpu type; see the comments at the head of arch/x86/kernel/cpu/match.c for details.
The 3.4 merge window can be expected to continue until roughly April 2. There are a lot of subsystem trees yet to be pulled, so one can expect a large number of changes to go in between now and then.
The perils of pr_info()
In the beginning there was printk() - literally: the 0.01 kernel release included 44 printk() calls. Since then, printk() has picked up details like logging levels and a lot of new formatting operators; it has also expanded to tens of thousands of call sites throughout the kernel. Developers often reach for it as the first way to figure out what is going on inside a misbehaving subsystem. If some developers have their way, though, printk() calls will become an endangered species. But not everybody has signed on to that goal.
There are certainly plenty of ways in which printk() could be improved. It imposes no standardization on messages, either across a subsystem or over time. As a result, messages can be hard for programs (or people) to parse, and they can change in trivial but obnoxious ways from one kernel release to the next. The actual calls, starting with text like:
printk(KERN_ERR ...
are relatively verbose; among other things, that often causes printk() statements to run afoul of the 80-column line width restriction. Messages printed with printk() may also lack important information needed to determine what the kernel is really trying to say.
Various attempts have been made to improve on printk() over the years. Arguably the most successful of those is the set of functions defined for device drivers:
int dev_dbg(struct device *dev, const char *format, ...);
int dev_info(struct device *dev, const char *format, ...);
int dev_notice(struct device *dev, const char *format, ...);
/* ... */
int dev_emerg(struct device *dev, const char *format, ...);
These functions, by embedding the logging level in the name itself, are more concise than the printk() calls they replace. They also print the name of the relevant device in standard form, ensuring that it's always possible to associate a message with the device that generated it. Use of these functions is not universal in device drivers, but it is widespread and uncontroversial.
There is a rather lower level of consensus surrounding a different set of functions (macros, really) that look like this:
int pr_info(const char *format, ...);
/* ... */
int pr_emerg(const char *format, ...);
These functions, too, encode the logging level in the function name, making things more concise. They also attempt to at least minimally standardize the format of logging by passing the format string through a macro called pr_fmt(). That leads to a line like this appearing in several hundred source files in the kernel:
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Due to the way the macro works, this line must appear before the #include block that would otherwise be at the beginning of the file. Defining pr_fmt() in this way causes all strings printed from the file to have the module name prepended; many subsystems use a literal string rather than the module name, but the intent is the same.
The spread of pr_*() through the kernel is mainly the result of an ongoing campaign by Joe Perches - notable for having just merged a 100,000-line whitespace-only ISDN subsystem cleanup patch for 3.4 - who has converted thousands of printk() calls over the years. To some developers, these changes are a welcome cleaning-up of the code; to others, they represent pointless code churn. The discussion had been quiet for a while, but it recently came back when Joe tried to convert the ext4 filesystem; ext4 maintainer Ted Ts'o rejected the conversion.
David Miller commented on this decision in a rather unsympathetic fashion.
Ted probably does not feel like a relic, and he is probably not trying to be sophisticated; he is almost certainly trying to maintain code he is responsible for in the best way he can. In his view, changing a bunch of code from one print function to another - possibly introducing a lot of patch conflicts on the way - does not help in that regard. Beyond that, he said, the standardization introduced by these functions is nowhere near enough to solve the structured logging problem, meaning that, someday, all those calls will have to be changed yet another time when a proper solution is available.
Proponents of the change argue that some structure is better than none, and that the new functions offer some useful flexibility when the time to add more structure comes. They claim that the overall size of the kernel is reduced (slightly) due to better sharing of strings. Messages printed with pr_debug() can be enabled and disabled with the dynamic debugging interface, while straight printk() calls cannot. And, perhaps most of all, they argue that consistency across the code base has value - though that argument was heard rather less when the pr_*() interface was new and relatively unused.
Needless to say, this is not the kind of discussion that comes to any sort of definitive conclusion. With regard to ext4, the conversion will probably not take place anytime soon; that is Ted's turf, and it is unlikely that anybody can summon arguments strong enough to convince Linus to override him. Elsewhere in the kernel, though, these conversions will certainly continue. As will, undoubtedly, the associated flame wars.
Toward better NUMA scheduling
A non-uniform memory access (NUMA) system is a computer divided into "nodes," where each node (which may contain multiple processors) has some memory which is local to the node. All system memory is visible to all nodes, but accesses to memory that is not local to the accessing node must go over an inter-node bus; as a result, non-local accesses are significantly slower. There is, thus, a real performance advantage to be gained by keeping processes and their memory on the same node.
The Linux kernel has had NUMA awareness for some time, in that it understands that moving a process from one node to another can be an expensive undertaking. There is also an interface (available via the mbind() system call) by which a process can request a specific allocation policy for its memory. Possibilities include requiring that all allocations happen within a specific set of nodes (MPOL_BIND), setting a looser "preferred" node (MPOL_PREFERRED), or asking that allocations be distributed across the system (MPOL_INTERLEAVE). It is also possible to use mbind() to request the active migration of pages from one node to another.
So NUMA is not a new concept for the kernel, but, as Peter Zijlstra noted in the introduction to a large NUMA patch set, things do not work as well as they could:
While the scheduler does a reasonable job of keeping short running tasks on a single node (by means of simply not doing the cross-node migration very often), it completely blows for long-running processes with a large memory footprint.
As might be expected, the patch set is dedicated to the creation of a kernel that does not "completely blow." To that end, it adds a number of significant changes to how memory management and scheduling are done in the kernel.
There are three major sub-parts to Peter's patch set. The first is a reworked patch set first posted by Lee Schermerhorn in 2010. These patches change the memory policy mechanism to make it easier for the kernel to fix things up after a process's memory has been allocated on distant nodes. "Page migration" is the process of moving a page from one node to another without the owning process(es) noticing the change. With Lee's patches, the kernel implements a variation called "lazy migration" that does not immediately relocate any pages. Instead, the target pages are simply unmapped from the process's page tables, meaning that the next access to any of them will generate a page fault. Actual migration is then done at page fault time. Lazy migration is a less expensive way of moving a large set of pages; only the pages that are actually used are moved, the work can be spread over time, and it will be done in the context of the faulting process.
The lazy migration mechanism is necessary for the rest of the patch set, but it has value on its own. So the feature is made available to user space with the MPOL_MF_LAZY flag; it is intended to be used with the MPOL_MF_MOVE flag, which would otherwise force the immediate migration of the affected pages. There is also a new MPOL_MF_NOOP flag allowing the calling process to request the migration of pages according to the current policy without changing (or even knowing) that policy.
With lazy migration, memory distributed across a system as the result of memory allocation and scheduling decisions can be slowly pulled back to the optimal node. But it is better to avoid making that kind of mess in the first place. So the second part of the patch set starts by adding the concept of a "home node" to a process. Each process (or "NUMA entity" - meaning groups containing a set of processes) is assigned a home node at fork() time. The scheduler will then try hard to avoid moving a process off its home node, but within bounds: a process will still be run on a non-home node if the alternative would be an unbalanced system. Memory allocations will, by default, be performed on the home node, even if the process is running elsewhere at the time.
These policies should minimize the scattering of memory across the system, but, with this kind of scheduling regime, it is inevitable that, eventually, one node will end up with too many processes and too little memory while others are underutilized. So, sometimes, it will be necessary to rebalance things. When the scheduler notices that long-running tasks are being forced away from their home nodes - or that they are having to allocate memory non-locally - it will consider migrating them to a new node. Migration is not a half-measure in this case; the scheduler will move both the process and its memory (using the lazy migration mechanism) to the target node. The move is expensive, but the process (and the system) should run much more efficiently once it's done. It only makes sense for processes that are going to be around for a while, though; the patch set tries to approximate that goal by only considering processes with at least one second of run time for migration.
The final piece is a pair of new system calls allowing processes to be put into "NUMA groups" that will share the same home node. If one of them is migrated, the entire group will be migrated. The first system call is:
int numa_tbind(int tid, int ng_id, unsigned long flags);
This system call will bind the thread identified by tid to the NUMA group identified by ng_id; the flags argument is currently unused and must be zero. If ng_id is passed as MS_ID_GET, the system call will, instead, simply return the current NUMA group ID for the given thread. A value of MS_ID_NEW creates a new NUMA group, binds the thread to that group, and returns the new ID.
The second new system call is:
int numa_mbind(void *addr, unsigned long len, int ng_id, unsigned long flags);
This call will set up a memory policy for the region of len bytes starting at addr and bind it to the NUMA group identified by ng_id. If necessary, lazy migration will be used to move the memory over to the node where the given NUMA group is based. Once again, flags is unused and must be zero. Once the memory is bound to the NUMA group, it will stay with the processes in that group; if the processes are moved, the memory will move with them.
Peter provided some benchmark results from a two-node system. Without the NUMA balancing patches, over time, the benchmark ended up with just as many remote memory accesses as local accesses - allocated memory was spread across the system. With the NUMA balancer, 86% of the memory accesses were local, leading to a significant speedup. As Peter put it: "These numbers also show that while there's a marked improvement, there's still some gain to be had. The current numa balancer is still somewhat fickle." A certain amount of fickleness is perhaps to be expected for such an involved patch set, given how young it is. Given some time, reviews, and testing, it should evolve into a solid scheduler component, giving Linux far better NUMA performance than it has ever had in the past.
Page editor: Jonathan Corbet