The 2.6.32 merge window is open, so there is no current development
kernel release. The usual vast pile of patches has been merged; see the
article below for a summary.
The current stable kernel is 2.6.31; no stable updates have yet been
released for this kernel. For older kernels, two stable updates were released on
September 15. Both contain a handful of important fixes.
No need anymore to write some printk to debug, worrying,
sweating, feeling guilty because we know we'll need yet another
printk() after the reboot, and we even already know where while
it is compiling.
We would build less kernels, then drink less coffee, becoming
less nervous, more friendly. Everyone will offer flowers in
the street, the icebergs will grow back and white bears will...
And eventually we'll be inspired enough to write perf love,
the more than expected tool to post process ftrace "love" events.
-- Frederic Weisbecker
How can waiting for child1 to run a bit before forking off child2
_not_ hurt? The parent is the worker bee creator, the queen bee if
you will. Seems to me that making the queen wait until one egg
hatches and ages a bit before laying another egg is a very bad plan
if the goal is to have a hive full of short lived worker bees.
-- Mike Galbraith
(thanks to Ingo Molnar)
And yes, but the engineering model of the kernel development cycle
is that engineer hours are wasted and thrown away all the time.
They are surplus, sorry. That's how life works here.
-- Greg Kroah-Hartman
One of my functions is pointlessly sending patches at maintainers
so you don't have to.
-- Andrew Morton
There must be a crowd of people out there thinking that they would get into
kernel development, but only if they could do it in Haskell. Here is a
posting with instructions on how to do just that. "By making
GHC and the Linux build system meet in the middle we can have modules that
are type safe and garbage collected. Using the copy of GHC modified for the
House operating system as a base, it turns out to be relatively simple to
make the modifications necessary to generate object files for the
Kernel." This leads to code which looks like:
hello = newCString "hello" >>= printk >> return 0
Just don't try to merge it upstream.
On his weblog, Arjan van de Ven introduces a new tool called "timechart." Timechart is meant to help visualize and diagnose latency problems in a running Linux system. "To solve this, I have been working on a new tool, called Timechart, based on 'perf', that has the objective to show on a system level what is going on, at various levels of detail. In fact, one of the design ideas behind timechart is that the output should be 'infinitely zoomable'; that is, if you want to know more details about something, you should be able to zoom in to get these details."
The Video4Linux2 API has a well-developed interface for sharing video
buffers between user space and the kernel. It is not without its problems,
though. Simple video acquisition devices transfer large amounts of data
(video frames) but cannot do scatter/gather I/O, forcing the allocation of
large, physically-contiguous buffers. Queueing buffers for frame transfers
can be a significant source of latency, especially when user-space buffers
need to be locked into memory or when the architecture requires significant
cache invalidation operations. It would also be nice to be able to pass
buffers directly between video devices and related devices, such as
hardware codecs, but the current API does not support that well.
In response to these problems, Laurent Pinchart has proposed a new subsystem implementing a global
video buffer pool. These buffers would be allocated early in the system's
lifetime, working around the unreliability of large contiguous
allocations. Cache invalidation operations could be done ahead of time,
eliminating a significant source of capture-time latency. Passing buffers
between devices would be explicitly supported.
The proposal is in an early stage, and Laurent would like comments from
interested developers.
At this stage of the development cycle, attention naturally turns to what
has been merged into the mainline kernel. It can also be interesting,
though, to look at what is not
getting in. This time around, a few
things have run into opposition at merge time and may, as a result, not
find their way into the 2.6.32 kernel.
One of those is the reflink() system call (covered last week), which got an "I'm not pulling this" response from Linus.
His objections included the way the system call was seemingly hidden in the
ocfs2 tree, concern over how much VFS and security review it has received,
and a dislike of the name. He would rather see a name like
copyfile(), and he would like it to be more flexible; enabling
server-side copying of files on remote filesystems was one idea which was
mentioned. In response, Joel Becker has proposed a new
system call, called copyfile(), which would offer more options
regarding just how the copy is done. There has not been much input from
developers other than Linus, but Linus, at least, seems to like the new
approach. So reflink() is likely to evolve into
copyfile(), but there is clearly not time for that to happen in
the 2.6.32 merge window.
The other development encountering trouble is fanotify (covered in July). The problem
here is that there still is no real consensus on what the API should look
like. The current implementation is based on a special socket and a bunch
of setsockopt() calls, but there has been pressure (from some) to
switch to netlink or (from others) to a set of dedicated system calls.
Linus made a late entry into the discussion
with a post in favor of the system call alternative; he also asked:
I still want to know what's so wonderful about fanotify that we
would actually want
yet-another-filesystem-notification-interface. So I'm not saying
that I'll take a system call interface. I just don't think that
hiding interfaces behind some random packet interface is at all any
better.
That led to an ongoing discussion about what fanotify is for, whether a new
notification API is necessary, and whether fanotify can handle all of the
things that people would like to do with it. See Jamie Lokier's post for a significant set of
concerns. Linux developers have added two inadequate file notification
interfaces so far; there is a certain amount of interest in ensuring that a
third one would be a little better. So chances are good that fanotify will
sit out this development cycle.
Kernel development news
Linus started taking patches for the 2.6.32 merge window on
September 10. Thus begins the process which should lead to a final
kernel release around the beginning of December. As of this writing, some
4400 non-merge changes have been merged. The most significant
user-visible changes include:
- The per-BDI write back
threads patch has been merged; this should lead to better
writeback performance.
- The devtmpfs virtual
filesystem has been merged. This feature, which is seen by many as
the return of the much-disliked devfs subsystem, has been
controversial from the beginning, despite the fact that it differs
significantly from devfs and some distributions are already making
good use of it. So it's not surprising that there was opposition to it being merged. Linus
silently accepted it, though, so it will appear in 2.6.32.
- The keyctl() system call has a new command
(KEYCTL_SESSION_TO_PARENT) which causes the calling process's
keyring to replace its parent's keyring. This feature is evidently
useful for the AFS filesystem; there's also a new set of security
module hooks to control this functionality.
- The sysfs filesystem now understands security labels, allowing for
tighter security policy control over access to sysfs files.
- The S390 architecture is now able to "call home" and send kernel oops
reports to the service organization's mothership. This functionality
is controlled with the unobviously-named SCLP_ASYNC
configuration option.
- The OProfile code now implements multiplexing of performance counters,
allowing for the collection of a larger range of statistics.
- The SCHED_RESET_ON_FORK scheduler policy flag has been added. This
flag (described in this
article), causes a child process to not inherit elevated priority
or realtime scheduling from its parent.
- The perf tool has a new trace operation; it
generates a simple output stream from a user-specified set of
tracepoints.
- The default value of the child_runs_first scheduler sysctl
knob has been changed to "false." This causes the parent process to
continue running after a fork() rather than yielding
immediately to the child process. See this article for more
information on 2.6.32 scheduler changes.
- There is a new set of scheduler tracepoints which improve visibility
into wait, sleep, and I/O wait times. There are also new tracepoints
for module loading and reference count events, system call entry and
exit, network packet copies to user space, and KVM interrupt and
memory-mapped I/O events.
- A vast amount of work has happened within the wireless networking
subsystem; most of it consists of cleanups and improvements which are
not immediately visible to the user. Additionally, wireless
extensions compatibility has been improved and there is now network
namespace support in cfg80211.
- The SPARC64 architecture now has rudimentary performance counter
support.
- The KVM virtualization subsystem has gained a module called "irqfd";
it allows the host to inject interrupts into guest systems. Along
with irqfd comes
a new "ioeventfd" feature enabling emulated memory-mapped I/O in
guests. KVM also
now has support for the "unrestricted guest" mode supported by
latter-day Intel VMX-capable processors.
- The Intel TXT integrity
management mechanism is now in the mainline.
- There is a new "VGA arbitration" module which allows independent
applications to function properly with multiple VGA devices wired to
the same address space. Control is through /dev/vga_arbiter;
see Documentation/vgaarbiter.txt for more information.
- There is the usual pile of new drivers:
- Audio: Zoom2 system-on-chip boards,
Wolfson WM8523, WM8776, WM8974, WM8993 and WM8961 codecs,
Freescale IMX SSI devices,
Freescale i.MX1x and i.MX2x-based audio DMA controllers,
AD1938 and AD1836 sound chips,
ADI BF5xx chip audio devices,
Openmoko Neo FreeRunner (GTA02) sound devices,
DaVinci DM6446 or DM355 EVM audio devices,
Amstrad E3 (Delta) videophones,
Renesas SH7724 serial audio interfaces,
AKM AK4642/AK4643 audio devices,
Simtec TLV320AIC23 audio devices,
Conexant CX20582 codecs, and
Cirrus Logic CS4206 codecs.
- Boards and processors:
Atmel AT91sam9g45 and AT91sam9g10 processors,
Eukrea CPUIMX27, MBIMX27, CPUAT91, CPU9260, and CPU9G20 processors,
Broadcom BCMRing system-on-chip processors,
Nuvoton NUC900 and NUCP950 CPUs,
Marvell OpenRD Base boards,
Freescale i.MX25 processors,
Motorola Zn5 GSM phones,
phyCARD-s (aka pca100) platforms,
Airgoo Home Media Terminal devices,
Samsung S5PC1XX-based systems,
LaCie 2Big Network NAS systems,
ST Ericsson Nomadic 8815-based systems,
Freescale MPC837x RDB/WLAN boards,
Freescale P2020RDB reference boards, and
AppliedMicro PPC460SX Eiger evaluation boards.
- Block: RDC PATA controllers, PMC SIERRA Linux MaxRAID adapters, and
a (staging) driver called "cowloop", described as
"Cowloop is a "copy-on-write" pseudo block driver. It can
be stacked on top of a "real" block driver, and catches all write
operations on their way from the file systems layer above to the
real driver below, effectively shielding the lower driver from
those write accesses. The requests are then diverted to an
ordinary file, located somewhere else (configurable)."
- Networking: Broadcom BCM8727, BCM50610M and AC131 PHY devices,
Infineon ISAC/HSCX, ISACX, IPAC and IPACX ISDN chipsets,
AVM FRITZ!CARD ISDN adapters,
Traverse Technologies NETJet PCI ISDN cards,
Winbond W6692 based ISDN cards,
Sedlbauer Speedfax+ ISDN cards,
Atheros AR9287 and AR9271 chipsets,
TI wl1271 chipsets,
Xilinx 10/100 Ethernet Lite devices,
Marvell 88W8688 Bluetooth interfaces,
Marvell SD8688 Bluetooth-over-SDIO interfaces,
Ralink RT3090-based wireless adapters (staging), and
Realtek 8192 PCI devices (staging).
- Video4Linux: Zarlink ZL10039 silicon tuners.
- Miscellaneous: Marvell CESA cryptographic engines,
EP93xx pulse-width modulators,
Samsung S3C24XX or S3C64XX onboard ADCs,
Twinhan USB 6253:0100 remote controls,
Blackfin rotary input devices,
Sentelic Finger Sensing Pad devices,
TI TWL4030/TWL5030/TPS659x0 keypad devices,
Quatech USB2.0 to serial adaptors (staging),
the Android MSM shared memory driver (staging),
HTC Dream QDSP chips (staging),
HTC Dream camera devices (staging),
VME busses (staging),
Microsoft's Hyper-V virtualization drivers (staging),
Discretix security processor devices (staging),
ST Microelectronics LIS3L02DQ accelerometers (staging),
TAOS TSL2561 light-to-digital converters (staging),
Kionix KXSD9 accelerometers (staging),
MAXIM max1363 ADC devices (staging), and
VTI SCA3000 series accelerometers (staging).
Changes visible to kernel developers include:
- There is a new check_acl() operation added to struct
inode_operations. It's part of a push by Linus to move more
permissions testing logic into the VFS core and reduce locking in the
process.
- There is a new kernel_module_request() hook in the security
module API; it allows security modules to decide whether to allow
request_module() calls to succeed. There is also a
new set of hooks for the TUN driver.
- Spinlocks can be built as inline operations for architectures where
that performs better.
- The "classic read-copy-update" and "preempt RCU" implementations have
been removed in favor of "tree RCU" and "bloatwatch RCU".
- The low-level interrupt handling code has gained support for interrupt
controllers accessed by way of slow (I2C, say) busses. Among other
things, that leads to the addition of the IRQF_ONESHOT flag,
which causes an interrupt with a threaded handler to remain masked in
the time between the execution of the hard and threaded handlers.
- The tracing ring buffer is now entirely lockless on the writer's
side. See this article for details.
- As described briefly in this
article, the network driver API has changed. The return type for
ndo_start_xmit() is now netdev_tx_t, an
enum value. For most drivers, simply changing the declared
return type for that function will be sufficient.
- The blk-iopoll
block-layer interrupt mitigation code has been merged.
- Configuring the kernel with "make localmodconfig" will create
a configuration pared down to the modules currently loaded in the
running kernel. "make localyesconfig" builds the modules
into the kernel instead.
- The new power management
core has been merged.
The merge window should stay open for at least another week; it is not
clear how LinuxCon and the Linux Plumbers Conference might affect the
schedule. Next week's edition will contain an update on changes merged
after the publication of this page.
Scheduler-related development seems to come in bursts. Things will be
relatively quiet for a few development cycles, then activity will suddenly
increase. We would appear to be in one of those periods where developers
start to show a higher level of interest in what the scheduler is doing.
The posting of the BFS scheduler has certainly motivated some of this
activity, but there is more than that going on.
On the BFS front, the (mildly) inflammatory part of the discussion would
appear to have run its course. Anybody who has watched the linux-kernel
list knows that serious attempts to fix problems often follow the storm;
that appears to be the case this time around. Benchmarks are being posted
by a number of people; as a general rule, the results of these benchmark
runs tend to be mixed. There are also developers and users posting about problems
that they are observing; see, for example, Jens
Axboe's report of a ten-second pause while trying to run
xmodmap.
As part of the process of tracking down problems, the conversation turned
to tuning the scheduler. Ingo Molnar pointed
out that there is a whole set of flags governing scheduler behavior,
all of which can be tweaked by the system administrator:
Note, these flags are all runtime, the new settings take effect
almost immediately (and at the latest it takes effect when a task
has started up) and safe to do runtime. It basically gives us
32768 pluggable schedulers each with a slightly separate algorithm
- each setting in essence creates a new scheduler.
The idea here is not that each user should be required to pick out the
correct scheduler from a set of 32768 - a number which presumably seems
high even to the "Linux is about choice" crowd. But these flags can be useful for
anybody who is trying to track down why the behavior of the scheduler is
not as good as it should be. When a tuning change improves things, it
gives developers a hint about where they should be looking to find the
source of the problem.
A particular test suggested by Ingo was this:
echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
(Politically-correct developers will, of course, have debugfs mounted under
/sys/kernel/debug. Your editor takes no position on the proper
debugfs mount point.)
One tester reported immediately that
setting this flag made the problems go away. Jens also noted that his
ten-second xmodmap problem was solved. The evidence of problems
with the NEW_FAIR_SLEEPERS feature was compelling enough that Ingo posted
a patch to disable it by default; that
patch has been merged for 2.6.32.
For the curious, the NEW_FAIR_SLEEPERS feature is a simple tweak which
gives a process a small runtime credit when it returns to the run queue
after a sleep. It is meant to help interactive processes, but, clearly,
something is not working as expected. Once the real problem has been
tracked down, it's possible that the NEW_FAIR_SLEEPERS feature could, once
again, be enabled by default. In the mean time, users experiencing
interactivity problems may want to try disabling it and seeing if things
improve.
Another default parameter is changing for 2.6.32; it controls which process
runs first after a fork(). For much of the recent past, fork() has
arranged things such that the child process gets to run before
fork() returns to the parent; this behavior was based on the
general observation that the child's work is often more important. There
is a good reason to run the parent first, though: the parent's state is
active in the processor, the translation lookaside buffer (TLB) contains
the right information, etc. So parent-runs-first should perform better.
It appears that recent tests showed that parent-runs-first does, indeed,
outperform child-runs-first on that most important benchmark: kernel builds. That was
enough to get the default changed.
There are some concerns that this change could expose application bugs.
Jesper Juhl expresses those concerns this way:
I'm just worried that userspace programs have come to rely on a
certain behaviour and changing that behaviour may result in
undesired results for some apps. In a perfect world people would
just fix those apps that (incorrectly) relied on a certain
child-/parent-runs-first behaviour, but the world is not perfect,
and many apps may not even have source available.
Child-runs-first has never been a part of the fork() API, though;
it's not something that applications should rely on. Even before the
change, behavior could differ as a result of preemption, SMP systems, and
more. So it's really true that child-runs-first was never guaranteed. But
that will not make users feel any better if applications break. To help
those users, there is a new kernel.sched_child_runs_first sysctl
knob; setting it to one will restore the previous behavior.
Better cpuidle governance
Active CPU scheduling is interesting, but there is also work happening in
another area: what happens when nobody wants the CPU? Contemporary
processors include a number of power management features which can be used
to reduce power consumption when nothing is going on.
Clearly, anybody who is concerned about power consumption will want the
processor to be in a low-power state whenever possible. There are,
however, some problems with a naive "go into a low power state when idle"
approach:
- Transitions between power states will, themselves, consume power.
If a CPU is put into a very low-power state, only to be brought back
into operation a few microseconds later, the total power consumption
may actually increase.
- Power state transitions have a performance cost. An extreme example
would be simply pulling the plug altogether; power consumption will be
admirably low, but the system will experience poor response times that
not even the BFS scheduler can fix. Putting the CPU into a more
conventional low-power state will still create latencies; it takes a
while for the processor to get back into a working mode. So going
into a low-power state too easily will hurt the performance of the
system.
It turns out that the CPU "governor" code in the mainline kernel often gets
this decision wrong, especially for the newer Intel "Nehalem" processors;
the result is wasted energy and poor performance, where "poor
performance" means a nearly 50% hit on some tests that
Arjan van de Ven ran. His response was to put together a patch aimed at fixing the
problems. The approach taken is interesting.
Clearly, it makes no sense to put the processor into a low-power state if
it will be brought back to full power in the very near future. So all the
governor code really has to do is to come up with a convincing prediction
of the future so it knows when the CPU will be needed again.
Unfortunately, the chip vendors have delayed the availability of the
long-promised crystal-ball peripherals yet again, forcing the governor code
to rely on heuristics; once again, software must make up for deficiencies
in the hardware.
When trying to make a guess about when a CPU might wake up, there are two
things to consider. One is entirely well known: the time of the next
scheduled timer event. The timer will put an upper bound on the time that
the CPU might sleep, but it is not a definitive number; interrupts may wake
up the CPU before the timer goes off. Arjan's governor tries to
guess when that interrupt might happen by looking at the previous
behavior of the system. Every time that the processor wakes up, the
governor code calculates the difference between the estimated and actual
idle times. A running average of that difference is maintained and used
to make a (hopefully) more accurate guess as to what the next idle time
will really be.
Actually, several running averages are kept. The probability of a very
long idle stretch being interrupted by an interrupt is rather higher than
the probability when the expected idle period is quite short. So there is a
separate correction factor maintained for each order of magnitude of idle
time - a 1ms estimate will have a different correction factor than a
100µs or a 10ms guess will. Beyond that, a completely different set
of correction factors is used (and maintained) if there is I/O outstanding
on the current CPU. If there are processes waiting on short-term (block)
I/O, the chances of an early wakeup are higher.
The performance concern, meanwhile, is addressed by trying to come up with
some sort of estimate of how badly power-management latency would hurt the
system. A CPU which is doing very little work will probably cause little
pain if it goes to sleep for a while. If, instead, the CPU is quite busy,
it's probably better to stay powered up and ready to work. In an attempt
to quantify "busy," the governor code calculates a "multiplier":
multiplier = 1 + 20*load_average + 10*iowait_count
All of the numbers are specific to the current CPU. So the multiplier is
heavily influenced by the system load average, and a bit less so by the
number of processes waiting for I/O. Or so it seems - but remember that
processes in uninterruptible waits (as are used for block I/O) are counted
in the load average, so their influence is higher than it might seem. In
summary, this multiplier grows quickly as the number of active processes
increases.
The final step is to examine all of the possible sleep states that the
processor provides, starting with the deepest sleep. Each sleep state has
an associated "exit latency" value, describing how long it takes to get out
of that state; deeper sleeps have higher exit latencies. The new governor
code multiplies the exit latency by the multiplier calculated above, then
compares the result to its best guess for the idle time. If that idle time
exceeds the adjusted latency value, that sleep state is chosen. Given the
large multipliers involved, one can see that expected idle times must get
fairly long fairly quickly as the system load goes up.
According to Arjan, this change restores performance to something very
close to that of a system which is not using sleep states at all. The
improvement is significant enough that Arjan would like to see the code
merged for 2.6.32, even though it just appeared during the merge window.
That might happen, though it is possible that it will be turned into a
separate CPU governor for one development cycle in case regressions
turn up.
Modern processors support hardware breakpoint or watchpoint debugging
functionality, but the Linux kernel does not provide a way for debuggers,
such as kgdb or gdb, to access these breakpoint registers
in a shared manner. Thus, debuggers running concurrently can easily
collide in their use of these registers, causing the debuggers to act in
a strange and confusing manner. For example, continuing execution through a
breakpoint, rather than breaking, would certainly confuse a developer.
This issue is being addressed by a proposed kernel API called
hw-breakpoint (alternatively hw_breakpoint). The hw-breakpoint
functionality, developed in a series of patches by K. Prasad, Frederic
Weisbecker, and Alan Stern, aims to provide a consistent, portable, and
robust method for multiple programs to access special hardware debug
registers. These registers are useful for any application that requires
the ability to observe memory data accesses, or trigger the collection of
program information based on data accesses. Such applications include
debugging, tracing, and performance monitoring. While these patches
initially target the x86, they attempt to provide a generic API that can be
supported in an architecture independent manner on various processors.
Although the details are still being ironed out, with hw-breakpoint
hardware debug resources can be concurrently available to various users in
a more portable manner.
The most common debugging scenario that would use the hw-breakpoint
patches is the hunt for memory corruption bugs. Programming mistakes such as bad
pointers, buffer overruns, and improper memory allocation/deallocation can
pointers, buffer overruns, and improper memory allocation/deallocation can
lead to memory corruption where valid data is accidentally
overwritten. These bugs can be hard to find; the corruption can occur
anywhere in the program. The error resulting from the corruption often occurs
long after the corruption. These bugs cannot typically
be found by focusing on the local sections of code that explicitly access
the corrupted data. Instead, debugger watchpoints, which are a special type
of breakpoint, are the first choice for debugging memory corruption
problems.
Debugger breakpoints halt program execution at a given address and
transfer control to the debugger. This allows the program state (variables,
memory, and registers) to be examined. When programmers talk of breakpoints
they usually are referring to software breakpoints. For example, in
gdb the break command sets a software breakpoint at the
specified instruction address. The break command replaces the
specified instruction with a trap instruction that, when executed, passes
control to gdb.
In contrast, watchpoints are best implemented using hardware
breakpoints; software implementations of watchpoints are extremely slow.
But, hardware breakpoints require special debug registers in the processor.
These debug registers continuously monitor memory addresses generated by
the processor, and a trap handler is invoked if the address in the
register matches the address generated by the processor.
Memory accesses can be for data read, data write, or instruction execute
(fetch), so hardware breakpoints usually support trapping on
not only the address, but also the type of access: read,
write, read/write, or execute. Hardware debug registers may also support
trapping on IO port accesses in addition to memory accesses. In either
case, a watchpoint is a trap on any type of data access rather than just an
instruction execute access. Since memory corruption can happen anywhere in
the program, a watchpoint set to trap on writes to the corrupted
variable/location can be a good way to catch these bugs in the act.
These hardware debug registers are limited resources: Intel x86
processors support up to four hardware breakpoints/watchpoints using the
special purpose DR0 to DR7 registers. Registers DR0 to DR3 can be
programmed with the virtual memory address of the desired hardware
breakpoint or watchpoint. DR4 and DR5 are reserved for processor use. DR6
is a status register that gives information about the last breakpoint hit,
such as the register number of the breakpoint, and DR7 is the breakpoint
control register. DR7 includes controls such as local and global enables,
memory access type, and memory access length. However, as with any limited
hardware resource, multiple software users must contend for access to these
registers.
Since existing released kernels do not control or arbitrate
access to these registers, software users can unknowingly clash in
their usage, which usually will result in a software error or
crash. Hw-breakpoint solves this problem by arbitrating the access to these
limited hardware registers from both user-space and kernel-space software.
User-space access, such as from gdb, is done via the
ptrace() system call. Kernel-space access includes kgdb
and KVM (only during context switches between host and guests).
Hw-breakpoint arbitration keeps kernel and/or user space debuggers from
stepping on each other's toes.
Additional kernel patches have been developed to take advantage of the
hw-breakpoint API. A plug-in for ftrace (ftrace has previously been
discussed in LWN articles here and here) has been developed to
dynamically trace any kernel global symbol. This functionality, called
ksym_tracer, allows all read and write accesses on a kernel variable to be
displayed in debugfs. Since it uses the hw-breakpoint API, it relies on
underlying hardware breakpoint support. This new feature of ftrace could
be very useful for memory corruption bugs that are difficult to catch with
watchpoints. These difficulties include such things as: 1) an erroneous
write that is lurking beneath a large quantity of valid writes, 2) the
necessity to set up a remote machine to run kgdb, and 3) kernel
bugs which no longer manifest themselves when the machine is halted via
breakpoints. Hw-breakpoint allows the concurrent use of both ksym_tracer
and debugger watchpoints without the risk of hardware debug register
conflicts.
In addition to ftrace, perfcounters (see LWN articles here and here) can be enhanced through
the generic hw-breakpoint functionality. Specifically, counters can be
updated based on data accesses rather than instruction execution. A patch
to perfcounters has been developed to use kernel-space hardware breakpoints
to monitor performance events associated with data accesses. For example,
spinlock accesses can be counted by monitoring the spinlock flag itself.
Currently this patch is rather limited in supporting the definition and use
of breakpoint counters. However, additional features are planned.
With the ftrace and perfcounter additions, the hw-breakpoint
API can now potentially be used by several pieces of code: kgdb,
KVM, ptrace, ftrace, and perfcounters. This increased potential
usage has resulted in increased scrutiny of the API by various developers:
hw-breakpoint is no longer solely of concern to debugger developers. This
increased scrutiny has resulted in major changes to the hw-breakpoint code
that are still ongoing. In particular, the coupling of perfcounters to
hw-breakpoint has caused the rethinking of a significant chunk of the
original hw-breakpoint functionality and structure.
The original (pre-perfcounter support) hw-breakpoint functionality was
primarily developed by K. Prasad. It supported global, system-wide
kernel-space breakpoints and per-thread user-space breakpoints. Whereas
user-space breakpoints were only enabled during thread execution, kernel
breakpoints were always present on all CPUs in the system. Additionally,
no reservation policy was implemented. Requests for hardware debug
registers were granted on a first-come, first-served basis. Once all
physical debug registers were used, hw-breakpoint returned an error for
further breakpoint requests.
This original hw-breakpoint implementation is "an
utter mis-match" for supporting perfcounter functionality, for three
reasons, as pointed out
by Peter Zijlstra. First, counters (either user or kernel-space) can be
defined per-cpu or per-task; this conflicts with hw-breakpoint's
system-wide kernel breakpoints. Second, per-task counters are scheduled by
perfcounter only while their task runs, avoiding needless context swaps of
the underlying hardware resources. Third, counters can be multiplexed, in
a time-sliced fashion, beyond the underlying hardware PMU (performance
monitoring unit) resource limit, which for x86 hardware breakpoints is
four. These incongruities between perfcounter and hw-breakpoint led to a
debate about any coupling between hw-breakpoint and perfcounter. However,
a consensus formed that integrating hw-breakpoint into perfcounter's PMU
reservation and scheduling infrastructure would be beneficial, given
perfcounter's richer support for scheduling, reservation, and management of
hardware resources. About these benefits, Frederic Weisbecker writes:
And in the end we have a pmu (which unifies the control of
this profiling unit through a well established and known object for
perfcounter) controlled by a high level API that could also benefit to
other debugging subsystems.
Newly posted in the last week is Weisbecker's patch to
integrate hw-breakpoint and perfcounter code. Conceptually, this splits
the hw-breakpoint functionality into two halves: 1) the top level API, and
2) the low level debug register control. In between these halves
lies the perfcounter functionality. With this patch each breakpoint is a
specific perfcounter instance called a breakpoint counter. Perfcounter
handles register scheduling and thread/CPU attachment of these breakpoint
counter instances. The modified hw-breakpoint API still handles requests
from ptrace(), ftrace, and kgdb for breakpoints by
creating a breakpoint counter. Breakpoint counters can also be created
directly from the existing perfcounter system call
(perf_counter_open()). The breakpoint counter layer interacts
with the low-level, architecture specific hw-breakpoint code that handles
reading and writing the processor's debug registers.
Unfortunately, because of the very recent integration into
perfcounters, the hw-breakpoint API has changed and additional changes to
the API are planned. Since the existing API appears likely to change, I
will only summarize it rather than cover it in detail. Two function calls
are provided to set a new hardware breakpoint:
int register_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp);
int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, int cpu);
cpu is the CPU number to set the breakpoint on;
*tsk is a pointer to the task_struct of the process to which the address belongs;
*bp is a pointer to the breakpoint property information, which includes:
1) a pointer to the handler function to be invoked upon hitting the breakpoint;
2) a pointer to architecture-dependent data (struct arch_hw_breakpoint).
The struct arch_hw_breakpoint
provides breakpoint properties such
as the memory address of the breakpoint, type of memory access
(read/write, read, or write), and the length of memory access (byte,
short, word, ...). These parameters are highly dependent upon the
specific support provided by the hardware. For example, while x86
supports virtual memory addresses, other processors support physical
memory addresses. Since the API aims for architecture independence, this
structure is architecture dependent.
To avoid having to
register and unregister a breakpoint if it just needs modification, the
following function is provided:
int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp)
Hardware breakpoints are removed by an unregister function:
void unregister_hw_breakpoint(struct hw_breakpoint *bp)
Hw-breakpoint has made its way into the -tip tree, the kernel source
development tree maintained by Ingo Molnar. In June it was tentatively
targeted for merging from -tip into the 2.6.32 kernel. However,
the delayed integration with perfcounters has pushed any merge out past
2.6.32.
Whenever it is released, hw-breakpoint promises to provide a portable
and robust method for debuggers to access hardware breakpoints without
conflict. While the hw-breakpoint functionality started out as a relatively
isolated feature to support debuggers, its existence has spawned new
tracing and performance monitoring features. These new features should
prove useful for various situations where data memory access, rather than
instruction access, provides the appropriate trigger to collect dynamic
information. By leveraging the perfcounter resource scheduling and
reservation functionality, hw-breakpoint has a very generalized method for
managing limited hardware breakpoint registers. The release of
hw-breakpoint promises to enable new ways for Linux users to track down
difficult bugs such as memory corruption, and to enable diverse dynamic
data access techniques (such as gdb watchpoints and ftrace
ksym_tracer) to play well together.
Page editor: Jonathan Corbet