The current stable kernel is 2.6.31; no stable updates have yet been released for this kernel. Stable updates for two older kernels were released on September 15; both contain a handful of important fixes.
We would build less kernels, then drink less coffee, becoming less nervous, more friendly. Everyone will offer flowers in the street, the icebergs will grow back and white bears will...
And eventually we'll be inspired enough to write perf love, the more than expected tool to post process ftrace "love" events.
hello = newCString "hello" >>= printk >> return 0
Just don't try to merge it upstream.

Arjan van de Ven introduces a new tool, called "timechart", on his weblog. Timechart is meant to help visualize and diagnose latency problems in a running Linux system. "To solve this, I have been working on a new tool, called Timechart, based on 'perf', that has the objective to show on a system level what is going on, at various levels of detail. In fact, one of the design ideas behind timechart is that the output should be 'infinitely zoomable'; that is, if you want to know more details about something, you should be able to zoom in to get these details."
In response to these problems, Laurent Pinchart has proposed a new subsystem implementing a global video buffer pool. These buffers would be allocated early in the system's lifetime, working around the unreliability of large contiguous allocations. Cache invalidation operations could be done ahead of time, eliminating a significant source of capture-time latency. Passing buffers between devices would be explicitly supported. The proposal is in an early stage, and Laurent would like comments from interested developers.
One of those is the reflink() system call (covered last week), which got an "I'm not pulling this" response from Linus. His objections included the way the system call was seemingly hidden in the ocfs2 tree, concern over how much VFS and security review it has received, and a dislike of the name. He would rather see a name like copyfile(), and he would like it to be more flexible; enabling server-side copying of files on remote filesystems was one idea which was raised.
In response, Joel Becker has proposed a new system call, called copyfile(), which would offer more options regarding just how the copy is done. There has not been much input from developers other than Linus, but Linus, at least, seems to like the new approach. So reflink() is likely to evolve into copyfile(), but there is clearly not time for that to happen in the 2.6.32 merge window.
The other development encountering trouble is fanotify (covered in July). The problem here is that there still is no real consensus on what the API should look like. The current implementation is based on a special socket and a bunch of setsockopt() calls, but there has been pressure (from some) to switch to netlink or (from others) to a set of dedicated system calls. Linus made a late entry into the discussion with a post in favor of the system call alternative; he also asked:
That led to an ongoing discussion about what fanotify is for, whether a new notification API is necessary, and whether fanotify can handle all of the things that people would like to do with it. See Jamie Lokier's post for a significant set of concerns. Linux developers have added two inadequate file notification interfaces so far; there is a certain amount of interest in ensuring that a third one would be a little better. So chances are good that fanotify will sit out this development cycle.
Kernel development news
Changes visible to kernel developers include:
The merge window should stay open for at least another week; it is not clear how LinuxCon and the Linux Plumbers Conference might affect the schedule. Next week's edition will contain an update on changes merged after the publication of this page.
On the BFS front, the (mildly) inflammatory part of the discussion would appear to have run its course. Anybody who has watched the linux-kernel list knows that serious attempts to fix problems often follow the storm; that appears to be the case this time around. Benchmarks are being posted by a number of people; as a general rule, the results of these benchmark runs tend to be mixed. There are also developers and users posting about problems that they are observing; see, for example, Jens Axboe's report of a ten-second pause while trying to run the xmodmap command.
As part of the process of tracking down problems, the conversation turned to tuning the scheduler. Ingo Molnar pointed out that there is a whole set of flags governing scheduler behavior, all of which can be tweaked by the system administrator:
The idea here is not that each user should be required to pick out the correct scheduler from a set of 32768 - a number which presumably seems high even to the "Linux is about choice" crowd. But these flags can be useful for anybody who is trying to track down why the behavior of the scheduler is not as good as it should be. When a tuning change improves things, it gives developers a hint about where they should be looking to find the source of the problem.
A particular test suggested by Ingo was this:
echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
(Politically-correct developers will, of course, have debugfs mounted under /sys/kernel/debug. Your editor takes no position on the proper debugfs mount point.)
One tester reported immediately that setting this flag made the problems go away. Jens also noted that his ten-second xmodmap problem was solved. The evidence of problems with the NEW_FAIR_SLEEPERS feature was compelling enough that Ingo posted a patch to disable it by default; that patch has been merged for 2.6.32.
For the curious, the NEW_FAIR_SLEEPERS feature is a simple tweak which gives a process a small runtime credit when it returns to the run queue after a sleep. It is meant to help interactive processes, but, clearly, something is not working as expected. Once the real problem has been tracked down, it's possible that the NEW_FAIR_SLEEPERS feature could, once again, be enabled by default. In the mean time, users experiencing interactivity problems may want to try disabling it and seeing if things get better.
Another default parameter is changing for 2.6.32; it controls which process runs first after a fork(). For much of the recent past, fork() has arranged things such that the child process gets to run before fork() returns to the parent; this behavior was based on the general observation that the child's work is often more important. There is a good reason to run the parent first, though: the parent's state is active in the processor, the translation lookaside buffer (TLB) contains the right information, etc. So parent-runs-first should perform better. It appears that recent tests showed that parent-runs-first does, indeed, outperform child-runs-first on that most important benchmark: kernel builds. That was enough to get the default changed.
There are some concerns that this change could expose application bugs. Jesper Juhl expresses those concerns this way:
Child-runs-first has never been a part of the fork() API, though; it's not something that applications should rely on. Even before the change, behavior could differ as a result of preemption, SMP systems, and more. So it's really true that child-runs-first was never guaranteed. But that will not make users feel any better if applications break. To help those users, there is a new kernel.sched_child_runs_first sysctl knob; setting it to one will restore the previous behavior.
Active CPU scheduling is interesting, but there is also work happening in another area: what happens when nobody wants the CPU? Contemporary processors include a number of power management features which can be used to reduce power consumption when nothing is going on. Clearly, anybody who is concerned about power consumption will want the processor to be in a low-power state whenever possible. There are, however, some problems with a naive "go into a low power state when idle" policy:
It turns out that the CPU "governor" code in the mainline kernel often gets this decision wrong, especially for the newer Intel "Nehalem" processors; the result is wasted energy and poor performance, where "poor performance" means a nearly 50% hit on some tests that Arjan van de Ven ran. His response was to put together a patch aimed at fixing the problems. The approach taken is interesting.
Clearly, it makes no sense to put the processor into a low-power state if it will be brought back to full power in the very near future. So all the governor code really has to do is to come up with a convincing prediction of the future so it knows when the CPU will be needed again. Unfortunately, the chip vendors have delayed the availability of the long-promised crystal-ball peripherals yet again, forcing the governor code to rely on heuristics; once again, software must make up for deficiencies in the hardware.
When trying to make a guess about when a CPU might wake up, there are two things to consider. One is entirely well known: the time of the next scheduled timer event. The timer will put an upper bound on the time that the CPU might sleep, but it is not a definitive number; interrupts may wake up the CPU before the timer goes off. Arjan's governor tries to guess when that interrupt might happen by looking at the previous behavior of the system. Every time that the processor wakes up, the governor code calculates the difference between the estimated and actual idle times. A running average of that difference is maintained and used to make a (hopefully) more accurate guess as to what the next idle time will really be.
Actually, several running averages are kept. The probability that a very long idle stretch will be interrupted by an interrupt is rather higher than when the expected idle period is quite short. So a separate correction factor is maintained for each order of magnitude of idle time - a 1ms estimate will have a different correction factor than a 100µs or 10ms guess. Beyond that, a completely different set of correction factors is used (and maintained) if there is I/O outstanding on the current CPU. If there are processes waiting on short-term (block) I/O, the chances of an early wakeup are higher.
The performance concern, meanwhile, is addressed by trying to come up with some sort of estimate of how badly power-management latency would hurt the system. A CPU which is doing very little work will probably cause little pain if it goes to sleep for a while. If, instead, the CPU is quite busy, it's probably better to stay powered up and ready to work. In an attempt to quantify "busy," the governor code calculates a "multiplier":
multiplier = 1 + 20*load_average + 10*iowait_count
All of the numbers are specific to the current CPU. So the multiplier is heavily influenced by the system load average, and a bit less so by the number of processes waiting for I/O. Or so it seems - but remember that processes in uninterruptible waits (as are used for block I/O) are counted in the load average, so their influence is higher than it might seem. In summary, this multiplier grows quickly as the number of active processes increases.
The final step is to examine all of the possible sleep states that the processor provides, starting with the deepest sleep. Each sleep state has an associated "exit latency" value, describing how long it takes to get out of that state; deeper sleeps have higher exit latencies. The new governor code multiplies the exit latency by the multiplier calculated above, then compares the result to its best guess for the idle time. If that idle time exceeds the adjusted latency value, that sleep state is chosen. Given the large multipliers involved, one can see that expected idle times must get fairly long fairly quickly as the system load goes up.
According to Arjan, this change restores performance to something very close to that of a system which is not using sleep states at all. The improvement is significant enough that Arjan would like to see the code merged for 2.6.32, even though it just appeared during the merge window. That might happen, though it is possible that it will be turned into a separate CPU governor for one development cycle, just in case regressions turn up.
Modern processors support hardware breakpoint or watchpoint debugging functionality, but the Linux kernel does not provide a way for debuggers, such as kgdb or gdb, to access these breakpoint registers in a shared manner. Thus, debuggers running concurrently can easily collide in their use of these registers, causing the debuggers to act in a strange and confusing manner. For example, continuing execution through a breakpoint, rather than breaking, would certainly confuse a programmer.
This issue is being addressed by a proposed kernel API called hw-breakpoint (alternatively hw_breakpoint). The hw-breakpoint functionality, developed in a series of patches by K. Prasad, Frederic Weisbecker, and Alan Stern, aims to provide a consistent, portable, and robust method for multiple programs to access the special hardware debug registers. These registers are useful for any application that needs to observe memory data accesses, or to trigger the collection of program information based on data accesses. Such applications include debugging, tracing, and performance monitoring. While these patches initially target the x86 architecture, they attempt to provide a generic API that can be supported in an architecture-independent manner on various processors. Although the details are still being ironed out, with hw-breakpoint, hardware debug resources can be made concurrently available to various users in a more portable manner.
The most common debugging scenario for the hw-breakpoint patches is tracking down memory corruption bugs. Programming mistakes such as bad pointers, buffer overruns, and improper memory allocation/deallocation can lead to memory corruption, where valid data is accidentally overwritten. These bugs can be hard to find: the corruption can occur anywhere in the program, and the resulting error often appears long after the corruption itself. These bugs cannot typically be found by focusing on the local sections of code that explicitly access the corrupted data. Instead, debugger watchpoints, which are a special type of breakpoint, are the first choice for debugging memory corruption problems.
Debugger breakpoints halt program execution at a given address and transfer control to the debugger. This allows the program state (variables, memory, and registers) to be examined. When programmers talk of breakpoints they usually are referring to software breakpoints. For example, in gdb the break command sets a software breakpoint at the specified instruction address. The break command replaces the specified instruction with a trap instruction that, when executed, passes control to gdb.
In contrast, watchpoints are best implemented using hardware breakpoints; software implementations of watchpoints are extremely slow. But, hardware breakpoints require special debug registers in the processor. These debug registers continuously monitor memory addresses generated by the processor, and a trap handler is invoked if the address in the register matches the address generated by the processor.
Memory accesses can be for data read, data write, or instruction execute (fetch), so hardware breakpoints usually support trapping on not only the address, but also the type of access: read, write, read/write, or execute. Hardware debug registers may also support trapping on IO port accesses in addition to memory accesses. In either case, a watchpoint is a trap on any type of data access rather than just an instruction execute access. Since memory corruption can happen anywhere in the program, a watchpoint set to trap on writes to the corrupted variable/location can be a good way to catch these bugs in the act.
These hardware debug registers are limited resources: Intel x86 processors support up to four hardware breakpoints/watchpoints using the special purpose DR0 to DR7 registers. Registers DR0 to DR3 can be programmed with the virtual memory address of the desired hardware breakpoint or watchpoint. DR4 and DR5 are reserved for processor use. DR6 is a status register that gives information about the last breakpoint hit, such as the register number of the breakpoint, and DR7 is the breakpoint control register. DR7 includes controls such as local and global enables, memory access type, and memory access length. However, as with any limited hardware resource, multiple software users must contend for access to these registers.
Since existing released kernels do not control or arbitrate access to these registers, software users can unknowingly clash in their usage, which usually will result in a software error or crash. Hw-breakpoint solves this problem by arbitrating the access to these limited hardware registers from both user-space and kernel-space software. User-space access, such as from gdb, is done via the ptrace() system call. Kernel-space access includes kgdb and KVM (only during context switches between host and guests). Hw-breakpoint arbitration keeps kernel and/or user space debuggers from stepping on each other's toes.
Additional kernel patches have been developed to take advantage of the hw-breakpoint API. A plug-in for ftrace (ftrace has previously been discussed in LWN articles here and here) has been developed to dynamically trace any kernel global symbol. This functionality, called ksym_tracer, allows all read and write accesses on a kernel variable to be displayed in debugfs. Since it uses the hw-breakpoint API, it relies on underlying hardware breakpoint support. This new feature of ftrace could be very useful for memory corruption bugs that are difficult to catch with watchpoints. These difficulties include such things as: 1) an erroneous write that is lurking beneath a large quantity of valid writes, 2) the necessity to set up a remote machine to run kgdb, and 3) kernel bugs which no longer manifest themselves when the machine is halted via breakpoints. Hw-breakpoint allows the concurrent use of both ksym_tracer and debugger watchpoints without the risk of hardware debug register corruption.
In addition to ftrace, perfcounters (see LWN articles here and here) can be enhanced through the generic hw-breakpoint functionality. Specifically, counters can be updated based on data accesses rather than instruction execution. A patch to perfcounters has been developed to use kernel-space hardware breakpoints to monitor performance events associated with data accesses. For example, spinlock accesses can be counted by monitoring the spinlock flag itself. Currently this patch is rather limited in supporting the definition and use of breakpoint counters. However, additional features are planned.
With the ftrace and perfcounter additions, the hw-breakpoint API can now potentially be used by several pieces of code: kgdb, KVM, ptrace, ftrace, and perfcounters. This increased potential usage has resulted in increased scrutiny of the API by various developers: hw-breakpoint is no longer solely of concern to debugger developers. This increased scrutiny has resulted in major changes to the hw-breakpoint code that are still ongoing. In particular, the coupling of perfcounters to hw-breakpoint has caused the rethinking of a significant chunk of the original hw-breakpoint functionality and structure.
The original (pre-perfcounter support) hw-breakpoint functionality was primarily developed by K. Prasad. It supported global, system-wide kernel-space breakpoints and per-thread user-space breakpoints. Whereas user-space breakpoints were only enabled during thread execution, kernel breakpoints were always present on all CPUs in the system. Additionally, no reservation policy was implemented. Requests for hardware debug registers were granted on a first-come, first-served basis. Once all physical debug registers were used, hw-breakpoint returned an error for further breakpoint requests.
This original hw-breakpoint implementation is "an utter mis-match" for the perfcounter functionality for three reasons, as pointed out by Peter Zijlstra. First, counters (either user or kernel-space) can be defined per-CPU or per-task; this conflicts with hw-breakpoint's system-wide kernel breakpoints. Second, per-task counters are scheduled by perfcounter to avoid unnecessary context swaps of the underlying hardware resources. Third, counters can be multiplexed, in a time-sliced fashion, beyond the underlying hardware PMU (performance monitoring unit) resource limit, which for x86 hardware breakpoints is four. These incongruities between perfcounter and hw-breakpoint led to a debate about any coupling between the two. However, a consensus formed that integrating hw-breakpoint into perfcounter's PMU reservation and scheduling infrastructure would be beneficial, given perfcounter's richer support for scheduling, reservation, and management of hardware resources. About these benefits Frederic Weisbecker writes:
Newly posted in the last week is Weisbecker's patch to integrate hw-breakpoint and perfcounter code. Conceptually, this splits the hw-breakpoint functionality into two halves: 1) the top level API, and 2) the low level debug register control. In between these halves lies the perfcounter functionality. With this patch each breakpoint is a specific perfcounter instance called a breakpoint counter. Perfcounter handles register scheduling, and thread/CPU attachment of these breakpoint counter instances. The modified hw-breakpoint API still handles requests from ptrace(), ftrace, and kgdb for breakpoints by creating a breakpoint counter. Breakpoint counters can also be created directly from the existing perfcounter system call (perf_counter_open()). The breakpoint counter layer interacts with the low-level, architecture specific hw-breakpoint code that handles reading and writing the processor's debug registers.
Unfortunately, because of the very recent integration into perfcounters, the hw-breakpoint API has changed, and additional changes to the API are planned. Rather than cover the existing API in detail, since it appears likely to change, I will give a summary of it. Two function calls are provided to set a new hardware breakpoint:
int register_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp);
int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, int cpu);

where:
cpu is the CPU number to set the breakpoint on; tsk is a pointer to the task_struct of the process to which the address belongs; bp is a pointer to the breakpoint property information, which includes: 1) a pointer to the handler function to be invoked upon hitting the breakpoint; 2) a pointer to architecture-dependent data (struct arch_hw_breakpoint).

The struct arch_hw_breakpoint provides breakpoint properties such as the memory address of the breakpoint, the type of memory access (read/write, read, or write), and the length of the memory access (byte, short, word, ...). These parameters are highly dependent upon the specific support provided by the hardware. For example, while x86 supports virtual memory addresses, other processors support physical memory addresses. Since the API aims for architecture independence, this structure is architecture dependent.
To avoid having to register and unregister a breakpoint if it just needs modification, the following function is provided:
int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp);

Hardware breakpoints are removed by an unregister function:
void unregister_hw_breakpoint(struct hw_breakpoint *bp)
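A hypothetical use of the API might look like the following kernel-space sketch. It follows the prototypes shown above, but since the API is still in flux this is illustrative only; the header path, the "triggered" callback field, and the arch_hw_breakpoint field names are assumptions, and the code is not expected to build against any particular tree:

```c
#include <linux/module.h>
#include <linux/hw_breakpoint.h>   /* assumed header for the API */

static struct hw_breakpoint watch_bp;

/* Invoked from the breakpoint exception when the watched data is hit;
 * the callback field name "triggered" is an assumption. */
static void watch_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
{
	printk(KERN_INFO "watched variable accessed\n");
}

static int __init watch_init(void)
{
	watch_bp.triggered = watch_triggered;
	/* Architecture-specific properties (address, access type, access
	 * length) would be filled into the embedded struct
	 * arch_hw_breakpoint here; the exact fields are arch-dependent. */
	return register_kernel_hw_breakpoint(&watch_bp, 0 /* cpu */);
}

static void __exit watch_exit(void)
{
	unregister_hw_breakpoint(&watch_bp);
}

module_init(watch_init);
module_exit(watch_exit);
```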
Hw-breakpoint has made its way into the -tip tree, the kernel source development tree maintained by Ingo Molnar. In June it was tentatively targeted for merging from -tip into the 2.6.32 kernel. However, the delayed integration with perfcounters has pushed any merge out past 2.6.32.
Whenever it is released, hw-breakpoint promises to provide a portable and robust method for debuggers to access hardware breakpoints without conflict. While the hw-breakpoint functionality started out as a relatively isolated feature to support debuggers, its existence has spawned new tracing and performance monitoring features. These new features should prove useful for various situations where data memory access, rather than instruction access provides the appropriate trigger to collect dynamic information. By leveraging the perfcounter resource scheduling and reservation functionality, hw-breakpoint has a very generalized method for managing limited hardware breakpoint registers. The release of hw-breakpoint promises to enable new ways for Linux users to track down difficult bugs such as memory corruption, and to enable diverse dynamic data access techniques (such as gdb watchpoints and ftrace ksym_tracer) to play well together.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds