Kernel development

Brief items

Kernel release status

The 2.6.32 merge window is open, so there is no current development kernel release. The usual vast pile of patches has been merged; see the article below for a summary.

The current stable kernel is 2.6.31; no stable updates have yet been released for this kernel. For older kernels, two stable updates were released on September 15; both contain a handful of important fixes.

Quotes of the week

No need anymore to write some printk to debug, worrying, sweating, feeling guilty because we know we'll need yet another printk() after the reboot, and we even already know where while it is compiling.

We would build less kernels, then drink less coffee, becoming less nervous, more friendly. Everyone will offer flowers in the street, the icebergs will grow back and white bears will...

And eventually we'll be inspired enough to write perf love, the more than expected tool to post process ftrace "love" events.

-- Frederic Weisbecker

How can waiting for child1 to run a bit before forking off child2 _not_ hurt? The parent is the worker bee creator, the queen bee if you will. Seems to me that making the queen wait until one egg hatches and ages a bit before laying another egg is a very bad plan if the goal is to have a hive full of short lived worker bees.
-- Mike Galbraith (thanks to Ingo Molnar)

And yes, but the engineering model of the kernel development cycle is that engineer hours are wasted and thrown away all the time. They are surplus, sorry. That's how life works here.
-- Greg Kroah-Hartman

One of my functions is pointlessly sending patches at maintainers so you don't have to.
-- Andrew Morton

Writing kernel modules in Haskell

There must be a crowd of people out there thinking that they would get into kernel development, but only if they could do it in Haskell. Here is a web site with instructions on how to do just that. "By making GHC and the Linux build system meet in the middle we can have modules that are type safe and garbage collected. Using the copy of GHC modified for the House operating system as a base, it turns out to be relatively simple to make the modifications necessary to generate object files for the Kernel environment." This leads to code which looks like:

    hello = newCString "hello" >>= printk >> return 0

Just don't try to merge it upstream.

Van de Ven: Introducing 'timechart'

Arjan van de Ven introduces a new tool, called "timechart," on his weblog. Timechart is meant to help visualize and diagnose latency problems in a running Linux system. "To solve this, I have been working on a new tool, called Timechart, based on 'perf', that has the objective to show on a system level what is going on, at various levels of detail. In fact, one of the design ideas behind timechart is that the output should be 'infinitely zoomable'; that is, if you want to know more details about something, you should be able to zoom in to get these details."

Video buffer pools

The Video4Linux2 API has a well-developed interface for sharing video buffers between user space and the kernel. It is not without its problems, though. Simple video acquisition devices transfer large amounts of data (video frames) but cannot do scatter/gather I/O, forcing the allocation of large, physically-contiguous buffers. Queueing buffers for frame transfers can be a significant source of latency, especially when user-space buffers need to be locked into memory or when the architecture requires significant cache invalidation operations. It would also be nice to be able to pass buffers directly between video devices and related devices, such as hardware codecs, but the current API does not support that well.

In response to these problems, Laurent Pinchart has proposed a new subsystem implementing a global video buffer pool. These buffers would be allocated early in the system's lifetime, working around the unreliability of large contiguous allocations. Cache invalidation operations could be done ahead of time, eliminating a significant source of capture-time latency. Passing buffers between devices would be explicitly supported. The proposal is in an early stage, and Laurent would like comments from interested developers.

Bouncing off the merge window

At this stage of the development cycle, attention naturally turns to what has been merged into the mainline kernel. It can also be interesting, though, to look at what is not getting in. This time around, a couple of things have run into opposition at merge time and may, as a result, not find their way into the 2.6.32 kernel.

One of those is the reflink() system call (covered last week), which got an "I'm not pulling this" response from Linus. His objections included the way the system call was seemingly hidden in the ocfs2 tree, concern over how much VFS and security review it has received, and a dislike of the name. He would rather see a name like copyfile(), and he would like it to be more flexible; enabling server-side copying of files on remote filesystems was one idea which was raised.

In response, Joel Becker has proposed a new system call, called copyfile(), which would offer more options regarding just how the copy is done. There has not been much input from developers other than Linus, but Linus, at least, seems to like the new approach. So reflink() is likely to evolve into copyfile(), but there is clearly not time for that to happen in the 2.6.32 merge window.

The other development encountering trouble is fanotify (covered in July). The problem here is that there still is no real consensus on what the API should look like. The current implementation is based on a special socket and a bunch of setsockopt() calls, but there has been pressure (from some) to switch to netlink or (from others) to a set of dedicated system calls. Linus made a late entry into the discussion with a post in favor of the system call alternative; he also asked:

I still want to know what's so wonderful about fanotify that we would actually want yet-another-filesystem-notification-interface. So I'm not saying that I'll take a system call interface. I just don't think that hiding interfaces behind some random packet interface is at all any better.

That led to an ongoing discussion about what fanotify is for, whether a new notification API is necessary, and whether fanotify can handle all of the things that people would like to do with it. See Jamie Lokier's post for a significant set of concerns. Linux developers have added two inadequate file notification interfaces so far; there is a certain amount of interest in ensuring that a third one would be a little better. So chances are good that fanotify will sit out this development cycle.

Kernel development news

2.6.32 merge window, part 1

By Jonathan Corbet
September 16, 2009
Linus started taking patches for the 2.6.32 merge window on September 10. Thus begins the process which should lead to a final kernel release around the beginning of December. As of this writing, some 4400 non-merge changes have been merged. The most significant user-visible changes include:

  • The per-BDI writeback threads patch has been merged; this should lead to better writeback scalability.

  • The devtmpfs virtual filesystem has been merged. This feature, which is seen by many as the return of the much-disliked devfs subsystem, has been controversial from the beginning, despite the facts that it differs significantly from devfs and some distributions are already making good use of it. So it's not surprising that there was opposition to it being merged. Linus silently accepted it, though, so it will appear in 2.6.32.

  • The keyctl() system call has a new command (KEYCTL_SESSION_TO_PARENT) which causes the calling process's keyring to replace its parent's keyring. This feature is evidently useful for the AFS filesystem; there's also a new set of security module hooks to control this functionality.

  • The sysfs filesystem now understands security labels, allowing for tighter security policy control over access to sysfs files.

  • The S390 architecture is now able to "call home" and send kernel oops reports to the service organization's mothership. This functionality is controlled with the unobviously-named SCLP_ASYNC configuration option.

  • The OProfile code now implements multiplexing of performance counters, allowing for the collection of a larger range of statistics.

  • The SCHED_RESET_ON_FORK scheduler policy flag has been added. This flag (described in this article) causes a child process to not inherit elevated priority or realtime scheduling from its parent.

  • The perf tool has a new trace operation; it generates a simple output stream from a user-specified set of tracepoints.

  • The default value of the child_runs_first scheduler sysctl knob has been changed to "false." This causes the parent process to continue running after a fork() rather than yielding immediately to the child process. See this article for more information on 2.6.32 scheduler changes.

  • There is a new set of scheduler tracepoints which improve visibility into wait, sleep, and I/O wait times. There are also new tracepoints for module loading and reference count events, system call entry and exit, network packet copies to user space, and KVM interrupt and memory-mapped I/O events.

  • A vast amount of work has happened within the wireless networking subsystem; most of it consists of cleanups and improvements which are not immediately visible to the user. Additionally, wireless extensions compatibility has been improved and there is now network namespace support in cfg80211.

  • The SPARC64 architecture now has rudimentary performance counter support.

  • The KVM virtualization subsystem has gained a module called "irqfd"; it allows the host to inject interrupts into guest systems. Along with irqfd comes a new "ioeventfd" feature enabling emulated memory-mapped I/O in guests. KVM also now has support for the "unrestricted guest" mode supported by latter-day Intel VMX-capable processors.

  • The Intel TXT integrity management mechanism is now in the mainline.

  • There is a new "VGA arbitration" module which allows independent applications to function properly with multiple VGA devices wired to the same address space. Control is through /dev/vga_arbiter; see Documentation/vgaarbiter.txt for details.

  • There is the usual pile of new drivers:

    • Audio: Zoom2 system-on-chip boards, Wolfson WM8523, WM8776, WM8974, WM8993 and WM8961 codecs, Freescale IMX SSI devices, Freescale i.MX1x and i.MX2x-based audio DMA controllers, AD1938 and AD1836 sound chips, ADI BF5xx chip audio devices, Openmoko Neo FreeRunner (GTA02) sound devices, DaVinci DM6446 or DM355 EVM audio devices, Amstrad E3 (Delta) videophones, Renesas SH7724 serial audio interfaces, AKM AK4642/AK4643 audio devices, Simtec TLV320AIC23 audio devices, Conexant CX20582 codecs, and Cirrus Logic CS4206 codecs.

    • Boards and processors: Atmel AT91sam9g45 and AT91sam9g10 processors, Eukrea CPUIMX27, MBIMX27, CPUAT91, CPU9260, and CPU9G20 processors, Broadcom BCMRing system-on-chip processors, Nuvoton NUC900 and NUCP950 CPUs, Marvell OpenRD Base boards, Freescale i.MX25 processors, Motorola Zn5 GSM phones, phyCARD-s (aka pca100) platforms, Airgoo Home Media Terminal devices, Samsung S5PC1XX-based systems, LaCie 2Big Network NAS systems, ST Ericsson Nomadic 8815-based systems, Freescale MPC837x RDB/WLAN boards, Freescale P2020RDB reference boards, and AppliedMicro PPC460SX Eiger evaluation boards.

    • Block: RDC PATA controllers, PMC SIERRA Linux MaxRAID adapters, and a (staging) driver called "cowloop", described as: "Cowloop is a 'copy-on-write' pseudo block driver. It can be stacked on top of a 'real' block driver, and catches all write operations on their way from the file systems layer above to the real driver below, effectively shielding the lower driver from those write accesses. The requests are then diverted to an ordinary file, located somewhere else (configurable)."

    • Networking: Broadcom BCM8727, BCM50610M and AC131 PHY devices, Infineon ISAC/HSCX, ISACX, IPAC and IPACX ISDN chipsets, AVM FRITZ!CARD ISDN adapters, Traverse Technologies NETJet PCI ISDN cards, Winbond W6692 based ISDN cards, Sedlbauer Speedfax+ ISDN cards, Atheros AR9287 and AR9271 chipsets, TI wl1271 chipsets, Xilinx 10/100 Ethernet Lite devices, Marvell 88W8688 Bluetooth interfaces, Marvell SD8688 Bluetooth-over-SDIO interfaces, Ralink RT3090-based wireless adapters (staging), and Realtek 8192 PCI devices (staging).

    • Video4Linux: Zarlink ZL10039 silicon tuners.

    • Miscellaneous: Marvell CESA cryptographic engines, EP93xx pulse-width modulators, Samsung S3C24XX or S3C64XX onboard ADCs, Twinhan USB 6253:0100 remote controls, Blackfin rotary input devices, Sentelic Finger Sensing Pad devices, TI TWL4030/TWL5030/TPS659x0 keypad devices, Quatech USB2.0 to serial adaptors (staging), the Android MSM shared memory driver (staging), HTC Dream QDSP chips (staging), HTC Dream camera devices (staging), VME busses (staging), Microsoft's Hyper-V virtualization drivers (staging), Discretix security processor devices (staging), ST Microelectronics LIS3L02DQ accelerometers (staging), TAOS TSL2561 light-to-digital converters (staging), Kionix KXSD9 accelerometers (staging), MAXIM max1363 ADC devices (staging), and VTI SCA3000 series accelerometers (staging).

Changes visible to kernel developers include:

  • There is a new check_acl() operation added to struct inode_operations. It's part of a push by Linus to move more permissions testing logic into the VFS core and reduce locking in the process.

  • There is a new kernel_module_request() hook in the security module API; it allows security modules to decide whether to allow request_module() calls to succeed. There is also a new set of hooks for the TUN driver.

  • Spinlocks can be built as inline operations for architectures where that performs better.

  • The "classic read-copy-update" and "preempt RCU" implementations have been removed in favor of "tree RCU" and "bloatwatch RCU".

  • The low-level interrupt handling code has gained support for interrupt controllers accessed by way of slow (I2C, say) busses. Among other things, that leads to the addition of the IRQF_ONESHOT flag, which causes an interrupt with a threaded handler to remain masked in the time between the execution of the hard and threaded handlers.

  • The tracing ring buffer is now entirely lockless on the writer's side. See this article for details.

  • As described briefly in this article, the network driver API has changed. The return type for ndo_start_xmit() is now netdev_tx_t, an enum value. For most drivers, simply changing the declared return type for that function will be sufficient.

  • The blk-iopoll block-layer interrupt mitigation code has been merged.

  • Configuring the kernel with "make localmodconfig" will create a configuration pared down to the modules currently loaded in the running kernel. "make localyesconfig" builds the modules into the kernel instead.

  • The new power management core has been merged.

The merge window should stay open for at least another week; it is not clear how LinuxCon and the Linux Plumbers Conference might affect the schedule. Next week's edition will contain an update on changes merged after the publication of this page.

Various scheduler-related topics

By Jonathan Corbet
September 15, 2009
Scheduler-related development seems to come in bursts. Things will be relatively quiet for a few development cycles, then activity will suddenly increase. We would appear to be in one of those periods where developers start to show a higher level of interest in what the scheduler is doing. The posting of the BFS scheduler has certainly motivated some of this activity, but there is more than that going on.


On the BFS front, the (mildly) inflammatory part of the discussion would appear to have run its course. Anybody who has watched the linux-kernel list knows that serious attempts to fix problems often follow the storm; that appears to be the case this time around. Benchmarks are being posted by a number of people; as a general rule, the results of these benchmark runs tend to be mixed. There are also developers and users posting about problems that they are observing; see, for example, Jens Axboe's report of a ten-second pause while trying to run the xmodmap command.

As part of the process of tracking down problems, the conversation turned to tuning the scheduler. Ingo Molnar pointed out that there is a whole set of flags governing scheduler behavior, all of which can be tweaked by the system administrator:

Note, these flags are all runtime, the new settings take effect almost immediately (and at the latest it takes effect when a task has started up) and safe to do runtime. It basically gives us 32768 pluggable schedulers each with a slightly separate algorithm - each setting in essence creates a new scheduler.

The idea here is not that each user should be required to pick out the correct scheduler from a set of 32768 - a number which presumably seems high even to the "Linux is about choice" crowd. But these flags can be useful for anybody who is trying to track down why the behavior of the scheduler is not as good as it should be. When a tuning change improves things, it gives developers a hint about where they should be looking to find the source of the problem.

A particular test suggested by Ingo was this:

    echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features

(Politically-correct developers will, of course, have debugfs mounted under /sys/kernel/debug. Your editor takes no position on the proper debugfs mount point.)
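As a rough illustration of why each flag setting "in essence creates a new scheduler," the flags can be modeled as bits in a mask. The sketch below is a toy model in Python, not kernel code: only NEW_FAIR_SLEEPERS comes from the discussion above, the other names are placeholders, and the idea that 32768 corresponds to fifteen independent on/off flags is inferred from the number itself (2^15) rather than stated by Ingo.

```python
# Toy model of a feature-flag mask. NEW_FAIR_SLEEPERS is taken from the text;
# the PLACEHOLDER_* names are invented for illustration only.
FEATURES = ["NEW_FAIR_SLEEPERS", "PLACEHOLDER_A", "PLACEHOLDER_B"]


def apply_toggle(mask, token, features=FEATURES):
    """Return the mask after a write of `token`.

    Mirrors the convention seen in the echo command above:
    writing NAME sets a flag, writing NO_NAME clears it.
    """
    clear = token.startswith("NO_")
    name = token[3:] if clear else token
    bit = 1 << features.index(name)  # raises ValueError for unknown names
    return mask & ~bit if clear else mask | bit


def combinations(flag_count):
    """Number of distinct settings for `flag_count` independent on/off flags."""
    return 2 ** flag_count
```

With every flag set, `apply_toggle(0b111, "NO_NEW_FAIR_SLEEPERS")` clears only the lowest bit, and `combinations(15)` gives the 32768 figure quoted above.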

One tester reported immediately that setting this flag made the problems go away. Jens also noted that his ten-second xmodmap problem was solved. The evidence of problems with the NEW_FAIR_SLEEPERS feature was compelling enough that Ingo posted a patch to disable it by default; that patch has been merged for 2.6.32.

For the curious, the NEW_FAIR_SLEEPERS feature is a simple tweak which gives a process a small runtime credit when it returns to the run queue after a sleep. It is meant to help interactive processes, but, clearly, something is not working as expected. Once the real problem has been tracked down, it's possible that the NEW_FAIR_SLEEPERS feature could, once again, be enabled by default. In the meantime, users experiencing interactivity problems may want to try disabling it and seeing if things get better.


Another default parameter is changing for 2.6.32; it controls which process runs first after a fork(). For much of the recent past, fork() has arranged things such that the child process gets to run before fork() returns to the parent; this behavior was based on the general observation that the child's work is often more important. There is a good reason to run the parent first, though: the parent's state is active in the processor, the translation lookaside buffer (TLB) contains the right information, etc. So parent-runs-first should perform better. It appears that recent tests showed that parent-runs-first does, indeed, outperform child-runs-first on that most important benchmark: kernel builds. That was enough to get the default changed.

There are some concerns that this change could expose application bugs. Jesper Juhl expresses those concerns this way:

I'm just worried that userspace programs have come to rely on a certain behaviour and changing that behaviour may result in undesired results for some apps. In a perfect world people would just fix those apps that (incorrectly) relied on a certain child-/parent-runs-first behaviour, but the world is not perfect, and many apps may not even have source available.

Child-runs-first has never been a part of the fork() API, though; it's not something that applications should rely on. Even before the change, behavior could differ as a result of preemption, SMP systems, and more. So it's really true that child-runs-first was never guaranteed. But that will not make users feel any better if applications break. To help those users, there is a new kernel.sched_child_runs_first sysctl knob; setting it to one will restore the previous behavior.
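An application that needs the child to complete some step before the parent proceeds can say so explicitly instead of depending on scheduling order. The following is a minimal sketch in Python using a pipe; the function name and the "ready" message are invented for illustration, and real programs would choose whatever synchronization primitive suits them.

```python
import os


def run_child_then_parent():
    """Fork a child and make the parent wait for an explicit signal from it."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do whatever must happen first, then signal the parent.
        os.close(read_fd)
        os.write(write_fd, b"ready")
        os.close(write_fd)
        os._exit(0)
    # Parent: block on the pipe until the child has signalled. This holds
    # no matter which of the two processes the scheduler ran first.
    os.close(write_fd)
    message = os.read(read_fd, 5)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    return message, os.WEXITSTATUS(status)
```

Code written this way behaves the same under either setting of the sysctl knob.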

Better cpuidle governance

Active CPU scheduling is interesting, but there is also work happening in another area: what happens when nobody wants the CPU? Contemporary processors include a number of power management features which can be used to reduce power consumption when nothing is going on. Clearly, anybody who is concerned about power consumption will want the processor to be in a low-power state whenever possible. There are, however, some problems with a naive "go into a low power state when idle" policy:

  • Transitions between power states will, themselves, consume power. If a CPU is put into a very low-power state, only to be brought back into operation a few microseconds later, the total power consumption will increase.

  • Power state transitions have a performance cost. An extreme example would be simply pulling the plug altogether; power consumption will be admirably low, but the system will experience poor response times that not even the BFS scheduler can fix. Putting the CPU into a more conventional low-power state will still create latencies; it takes a while for the processor to get back into a working mode. So going into a low-power state too easily will hurt the performance of the system.

It turns out that the CPU "governor" code in the mainline kernel often gets this decision wrong, especially for the newer Intel "Nehalem" processors; the result is wasted energy and poor performance, where "poor performance" means a nearly 50% hit on some tests that Arjan van de Ven ran. His response was to put together a patch aimed at fixing the problems. The approach taken is interesting.

Clearly, it makes no sense to put the processor into a low-power state if it will be brought back to full power in the very near future. So all the governor code really has to do is to come up with a convincing prediction of the future so it knows when the CPU will be needed again. Unfortunately, the chip vendors have delayed the availability of the long-promised crystal-ball peripherals yet again, forcing the governor code to rely on heuristics; once again, software must make up for deficiencies in the hardware.

When trying to make a guess about when a CPU might wake up, there are two things to consider. One is entirely well known: the time of the next scheduled timer event. The timer will put an upper bound on the time that the CPU might sleep, but it is not a definitive number; interrupts may wake up the CPU before the timer goes off. Arjan's governor tries to guess when that interrupt might happen by looking at the previous behavior of the system. Every time that the processor wakes up, the governor code calculates the difference between the estimated and actual idle times. A running average of that difference is maintained and used to make a (hopefully) more accurate guess as to what the next idle time will really be.

Actually, several running averages are kept. The probability of a very long idle stretch being interrupted by an interrupt is rather higher than the probability when the expected idle period is quite short. So there is a separate correction factor maintained for each order of magnitude of idle time - a 1ms estimate will have a different correction factor than a 100µs or a 10ms guess will. Beyond that, a completely different set of correction factors is used (and maintained) if there is I/O outstanding on the current CPU. If there are processes waiting on short-term (block) I/O, the chances of an early wakeup are higher.
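The bookkeeping described here might look something like the sketch below. It is a simplified model, not Arjan's code: the class and method names are invented, the averaging weight is an arbitrary choice, and the correction is tracked as a ratio of actual to expected idle time, which is one plausible way to keep a running average of the difference.

```python
class IdlePredictor:
    """Per-bucket running correction of timer-based idle estimates (a sketch)."""

    def __init__(self, weight=0.25):
        self.weight = weight  # how strongly a new sample moves the average (assumed)
        self.factors = {}     # (magnitude bucket, io_pending) -> correction ratio

    @staticmethod
    def bucket(expected_us):
        """Order of magnitude of the estimate: 100us -> 2, 1ms -> 3, 10ms -> 4."""
        return len(str(max(int(expected_us), 1))) - 1

    def predict(self, timer_us, io_pending):
        """Scale the time to the next timer by the learned correction factor."""
        key = (self.bucket(timer_us), io_pending)
        return timer_us * self.factors.get(key, 1.0)

    def update(self, timer_us, actual_us, io_pending):
        """Fold the idle time actually observed into the matching bucket."""
        key = (self.bucket(timer_us), io_pending)
        ratio = min(actual_us / max(timer_us, 1), 1.0)
        old = self.factors.get(key, 1.0)
        self.factors[key] = (1 - self.weight) * old + self.weight * ratio
```

After one early wakeup in the 1ms bucket, later 1ms-scale estimates shrink, while 100µs estimates and estimates made with I/O outstanding keep their own, separate factors.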

The performance concern, meanwhile, is addressed by trying to come up with some sort of estimate of how badly power-management latency would hurt the system. A CPU which is doing very little work will probably cause little pain if it goes to sleep for a while. If, instead, the CPU is quite busy, it's probably better to stay powered up and ready to work. In an attempt to quantify "busy," the governor code calculates a "multiplier":

    multiplier = 1 + 20*load_average + 10*iowait_count

All of the numbers are specific to the current CPU. So the multiplier is heavily influenced by the system load average, and a bit less so by the number of processes waiting for I/O. Or so it seems - but remember that processes in uninterruptible waits (as are used for block I/O) are counted in the load average, so their influence is higher than it might seem. In summary, this multiplier grows quickly as the number of active processes increases.
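Written as a function, the formula is short enough to try with a few values. This is a direct transcription into Python; the function name is invented, and both inputs are assumed to be the per-CPU figures the article describes.

```python
def latency_multiplier(load_average, iowait_count):
    """Multiplier applied to a sleep state's exit latency on the current CPU."""
    return 1 + 20 * load_average + 10 * iowait_count
```

An idle CPU gets a multiplier of 1, while a load average of 2 with one process waiting on I/O already yields 1 + 40 + 10 = 51.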

The final step is to examine all of the possible sleep states that the processor provides, starting with the deepest sleep. Each sleep state has an associated "exit latency" value, describing how long it takes to get out of that state; deeper sleeps have higher exit latencies. The new governor code multiplies the exit latency by the multiplier calculated above, then compares the result to its best guess for the idle time. If that idle time exceeds the adjusted latency value, that sleep state is chosen. Given the large multipliers involved, one can see that expected idle times must get fairly long fairly quickly as the system load goes up.
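The selection step can be sketched as a loop over the available states, deepest first. The code below is an illustration in Python only: the state names and latency figures in the usage note are made up, and returning None when nothing qualifies is an assumption about what the caller would then do (stay in its shallowest or active state).

```python
from typing import NamedTuple, Optional, Sequence


class SleepState(NamedTuple):
    name: str
    exit_latency_us: float


def choose_state(states: Sequence[SleepState],
                 predicted_idle_us: float,
                 multiplier: float) -> Optional[SleepState]:
    """Pick the deepest state whose adjusted exit latency fits the idle guess.

    `states` must be ordered from shallowest to deepest sleep.
    """
    for state in reversed(states):  # start with the deepest sleep
        if predicted_idle_us > state.exit_latency_us * multiplier:
            return state
    return None  # no state is worth entering (assumed fallback)
```

With hypothetical states of 1, 20, and 200 microseconds exit latency and a predicted idle time of 500 microseconds, a multiplier of 1 selects the deepest state, while a multiplier of 21 rules it out (200 × 21 = 4200) and falls back to the middle one (20 × 21 = 420).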

According to Arjan, this change restores performance to something very close to that of a system which is not using sleep states at all. The improvement is significant enough that Arjan would like to see the code merged for 2.6.32, even though it just appeared during the merge window. That might happen, though it is possible that it will be turned into a separate CPU governor for one development cycle just in case regressions turn up.

Hw-breakpoint: shared debugging registers

September 16, 2009

This article was contributed by Jon Ashburn

Modern processors support hardware breakpoint or watchpoint debugging functionality, but the Linux kernel does not provide a way for debuggers, such as kgdb or gdb, to access these breakpoint registers in a shared manner. Thus, debuggers running concurrently can easily collide in their use of these registers, causing the debuggers to act in a strange and confusing manner. For example, continuing execution through a breakpoint, rather than breaking, would certainly confuse a programmer.

This issue is being addressed by a proposed kernel API called hw-breakpoint (alternatively hw_breakpoint). The hw-breakpoint functionality, developed in a series of patches by K. Prasad, Frederic Weisbecker, and Alan Stern, aims to provide a consistent, portable, and robust method for multiple programs to access special hardware debug registers. These registers are useful for any application that requires the ability to observe memory data accesses, or trigger the collection of program information based on data accesses. Such applications include debugging, tracing, and performance monitoring. While these patches initially target the x86, they attempt to provide a generic API that can be supported in an architecture-independent manner on various processors. Although the details are still being ironed out, with hw-breakpoint, hardware debug resources can be concurrently available to various users in a more portable manner.

The most common debugging scenarios that would use the hw-breakpoint patches are memory corruption bugs. Programming mistakes such as bad pointers, buffer overruns, and improper memory allocation/deallocation can lead to memory corruption where valid data is accidentally overwritten. These bugs can be hard to find; the corruption can occur anywhere in the program. The error resulting from the corruption often occurs long after the corruption. These bugs cannot typically be found by focusing on the local sections of code that explicitly access the corrupted data. Instead, debugger watchpoints, which are a special type of breakpoint, are the first choice for debugging memory corruption problems.

Debugger breakpoints halt program execution at a given address and transfer control to the debugger. This allows the program state (variables, memory, and registers) to be examined. When programmers talk of breakpoints they usually are referring to software breakpoints. For example, in gdb the break command sets a software breakpoint at the specified instruction address. The break command replaces the specified instruction with a trap instruction that, when executed, passes control to gdb.
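The instruction-replacement trick can be shown on a plain byte buffer. The sketch below is in Python and only models the idea: a real debugger patches the target process's memory (for example through ptrace()) rather than a local buffer, the function names are invented, and 0xCC is the x86 one-byte INT3 breakpoint opcode.

```python
INT3 = 0xCC  # x86 one-byte breakpoint (trap) instruction


def insert_breakpoint(code: bytearray, offset: int) -> int:
    """Overwrite the byte at `offset` with a trap; return the original byte."""
    saved = code[offset]
    code[offset] = INT3
    return saved


def remove_breakpoint(code: bytearray, offset: int, saved: int) -> None:
    """Put the original instruction byte back so execution can continue."""
    code[offset] = saved
```

The saved byte is what lets the debugger restore the instruction and resume the program after the trap has been handled.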

In contrast, watchpoints are best implemented using hardware breakpoints; software implementations of watchpoints are extremely slow. Hardware breakpoints, however, require special debug registers in the processor. These debug registers continuously monitor memory addresses generated by the processor, and a trap handler is invoked if the address in the register matches the address generated by the processor.

Memory accesses can be for data read, data write, or instruction execute (fetch), so hardware breakpoints usually support trapping on not only the address, but also the type of access: read, write, read/write, or execute. Hardware debug registers may also support trapping on IO port accesses in addition to memory accesses. In either case, a watchpoint is a trap on any type of data access rather than just an instruction execute access. Since memory corruption can happen anywhere in the program, a watchpoint set to trap on writes to the corrupted variable/location can be a good way to catch these bugs in the act.

These hardware debug registers are limited resources: Intel x86 processors support up to four hardware breakpoints/watchpoints using the special-purpose DR0 to DR7 registers. Registers DR0 to DR3 can be programmed with the virtual memory address of the desired hardware breakpoint or watchpoint. DR4 and DR5 are reserved for processor use. DR6 is a status register that gives information about the last breakpoint hit, such as the register number of the breakpoint, and DR7 is the breakpoint control register. DR7 includes controls such as local and global enables, memory access type, and memory access length. However, as with any limited hardware resource, multiple software users must contend for access to these registers.
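To make the DR7 controls concrete, here is a small Python sketch that composes a control value for one of the four slots. The bit positions follow the x86 layout as documented in Intel's architecture manuals (enable bits in the low byte, then a four-bit access-type/length field per slot starting at bit 16); the function and constant names are invented, and nothing here touches real hardware.

```python
# Access-type (R/W) field encodings
RW_EXECUTE, RW_WRITE, RW_IO, RW_READ_WRITE = 0b00, 0b01, 0b10, 0b11
# Length (LEN) field encodings, in bytes; 8-byte lengths need 64-bit processors
LEN_1, LEN_2, LEN_8, LEN_4 = 0b00, 0b01, 0b10, 0b11


def dr7_enable(dr7, slot, rw, length, global_enable=False):
    """Return `dr7` with breakpoint `slot` (0..3, i.e. DR0..DR3) configured."""
    if not 0 <= slot <= 3:
        raise ValueError("x86 provides only four breakpoint address registers")
    # Local enable bits are 0, 2, 4, 6; global enable bits are 1, 3, 5, 7.
    enable_bit = 1 << (2 * slot + (1 if global_enable else 0))
    # Each slot owns four bits from bit 16 up: R/W in the low two, LEN above.
    shift = 16 + 4 * slot
    dr7 &= ~(0b1111 << shift)
    dr7 |= (rw | (length << 2)) << shift
    return dr7 | enable_bit
```

A four-byte write watchpoint in slot 0, locally enabled, comes out as 0xD0001; two users who each compute and load such a value without coordination would silently overwrite each other's settings.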

Since existing released kernels do not control or arbitrate access to these registers, software users can unknowingly clash in their usage, which usually will result in a software error or crash. Hw-breakpoint solves this problem by arbitrating the access to these limited hardware registers from both user-space and kernel-space software. User-space access, such as from gdb, is done via the ptrace() system call. Kernel-space access includes kgdb and KVM (only during context switches between host and guests). Hw-breakpoint arbitration keeps kernel and/or user space debuggers from stepping on each other's toes.

Additional kernel patches have been developed to take advantage of the hw-breakpoint API. A plug-in for ftrace (ftrace has previously been discussed in LWN articles here and here) has been developed to dynamically trace any kernel global symbol. This functionality, called ksym_tracer, allows all read and write accesses on a kernel variable to be displayed in debugfs. Since it uses the hw-breakpoint API, it relies on underlying hardware breakpoint support. This new feature of ftrace could be very useful for memory corruption bugs that are difficult to catch with watchpoints. These difficulties include such things as: 1) an erroneous write that is lurking beneath a large quantity of valid writes, 2) the necessity to set up a remote machine to run kgdb, and 3) kernel bugs which no longer manifest themselves when the machine is halted via breakpoints. Hw-breakpoint allows the concurrent use of both ksym_tracer and debugger watchpoints without the risk of hardware debug register corruption.

In addition to ftrace, perfcounters (see LWN articles here and here) can be enhanced through the generic hw-breakpoint functionality. Specifically, counters can be updated based on data accesses rather than instruction execution. A patch to perfcounters has been developed to use kernel-space hardware breakpoints to monitor performance events associated with data accesses. For example, spinlock accesses can be counted by monitoring the spinlock flag itself. Currently this patch is rather limited in supporting the definition and use of breakpoint counters. However, additional features are planned.

With the ftrace and perfcounters patches, the hw-breakpoint API can now potentially be used by several pieces of code: kgdb, KVM, ptrace, ftrace, and perfcounters. This increased potential usage has resulted in increased scrutiny of the API by various developers: hw-breakpoint is no longer solely of concern to debugger developers. This increased scrutiny has resulted in major changes to the hw-breakpoint code that are still ongoing. In particular, the coupling of perfcounters to hw-breakpoint has caused the rethinking of a significant chunk of the original hw-breakpoint functionality and structure.

The original (pre-perfcounter support) hw-breakpoint functionality was primarily developed by K. Prasad. It supported global, system-wide kernel-space breakpoints and per-thread user-space breakpoints. Whereas user-space breakpoints were only enabled during thread execution, kernel breakpoints were always present on all CPUs in the system. Additionally, no reservation policy was implemented. Requests for hardware debug registers were granted on a first-come, first-served basis. Once all physical debug registers were used, hw-breakpoint returned an error for further breakpoint requests.
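
The first-come, first-served policy described above amounts to a very small allocator. The following is a minimal user-space sketch of that idea, not the kernel's code: the structure, the function names, and the use of -1 as the error value are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_DEBUG_REGS 4   /* x86 has four address registers, DR0..DR3 */

/* One entry per debug register; NULL means the register is free. */
struct bp_slots {
    const void *owner[NUM_DEBUG_REGS];
};

/* Grant the first free register to owner; return its index, or -1 if full. */
static int bp_slot_alloc(struct bp_slots *s, const void *owner)
{
    assert(s != NULL && owner != NULL);
    for (int i = 0; i < NUM_DEBUG_REGS; i++) {
        if (s->owner[i] == NULL) {
            s->owner[i] = owner;
            return i;
        }
    }
    return -1; /* every register is in use, so the request fails */
}

/* Release a register so that a later request can claim it. */
static void bp_slot_free(struct bp_slots *s, int slot)
{
    assert(s != NULL);
    if (slot >= 0 && slot < NUM_DEBUG_REGS)
        s->owner[slot] = NULL;
}
```

With no reservation policy, whichever user asks fifth is refused, regardless of who it is or how important its request might be.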

This original hw-breakpoint implementation is "an utter mis-match" for supporting perfcounter functionality for three reasons, as pointed out by Peter Zijlstra. First, counters (either user or kernel-space) can be defined per-cpu or per-task; this conflicts with hw-breakpoint's system-wide kernel breakpoints. Second, per-task counters are scheduled by perfcounter so as to avoid unnecessary context swaps of the underlying hardware resources. Third, counters can be multiplexed, in a time-sliced fashion, beyond the resource limit of the underlying hardware PMU (performance monitoring unit), which for x86 hardware breakpoints is four. These incongruities between perfcounter and hw-breakpoint led to a debate about whether the two should be coupled at all. However, a consensus formed that integrating hw-breakpoint into perfcounter's PMU reservation and scheduling infrastructure would be beneficial given perfcounter's richer support for scheduling, reservation, and management of hardware resources. About these benefits Frederic Weisbecker writes:

And in the end we have a pmu (which unifies the control of this profiling unit through a well established and known object for perfcounter) controlled by a high level API that could also benefit to other debugging subsystems.
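
To illustrate what time-sliced multiplexing beyond the hardware limit means, here is a toy round-robin model. It is not perfcounter's actual scheduling algorithm; the function names and the choice to advance the window by one counter per tick are assumptions made purely for illustration.

```c
#include <assert.h>

/*
 * Toy model: ncounters counters share nregs hardware registers.
 * At each tick, a window of nregs consecutive counters is scheduled,
 * and the window start advances by one counter per tick.
 * Returns 1 if `counter` holds a register during `tick`, else 0.
 */
static int counter_is_scheduled(int counter, int ncounters, int nregs, int tick)
{
    assert(ncounters > 0 && nregs > 0 && tick >= 0);
    assert(counter >= 0 && counter < ncounters);

    if (ncounters <= nregs)
        return 1;                         /* everything fits; no multiplexing */

    int start = tick % ncounters;         /* first counter in this tick's window */
    int offset = (counter - start + ncounters) % ncounters;
    return offset < nregs;
}

/* Count how many of the first nticks ticks a counter spends on hardware. */
static int ticks_scheduled(int counter, int ncounters, int nregs, int nticks)
{
    int total = 0;
    for (int tick = 0; tick < nticks; tick++)
        total += counter_is_scheduled(counter, ncounters, nregs, tick);
    return total;
}
```

With six breakpoint counters and four debug registers, each counter in this model is on the hardware for four ticks out of every six. A first-come, first-served scheme would instead reject the fifth and sixth requests outright.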

Newly posted in the last week is Weisbecker's patch to integrate hw-breakpoint and perfcounter code. Conceptually, this splits the hw-breakpoint functionality into two halves: 1) the top-level API, and 2) the low-level debug register control. In between these halves lies the perfcounter functionality. With this patch, each breakpoint is a specific perfcounter instance called a breakpoint counter. Perfcounter handles register scheduling and the thread/CPU attachment of these breakpoint counter instances. The modified hw-breakpoint API still handles requests from ptrace(), ftrace, and kgdb for breakpoints by creating a breakpoint counter. Breakpoint counters can also be created directly from the existing perfcounter system call (perf_counter_open()). The breakpoint counter layer interacts with the low-level, architecture-specific hw-breakpoint code that handles reading and writing the processor's debug registers.

Unfortunately, because of the very recent integration into perfcounters, the hw-breakpoint API has changed and additional changes to the API are planned. Rather than cover the existing API in detail, since it appears likely to change, I will give a summary of it. Two function calls are provided to set a new hardware breakpoint.

     int register_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp);
     int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, int cpu);
     cpu   is the cpu number to set the breakpoint on;
     *tsk  is a pointer to the 'task_struct' of the process to which the address belongs;
     *bp   is a pointer to the breakpoint property information which includes:
             1) a pointer to the handler function to be invoked upon hitting the breakpoint;
             2) a pointer to architecture-dependent data (struct arch_hw_breakpoint).

The struct arch_hw_breakpoint provides breakpoint properties such as the memory address of the breakpoint, type of memory access (read/write, read, or write), and the length of memory access (byte, short, word, ...). These parameters are highly dependent upon the specific support provided by the hardware. For example, while x86 supports virtual memory addresses, other processors support physical memory addresses. To keep the rest of the API architecture independent, these hardware-specific details are confined to this one architecture-dependent structure.

To avoid having to register and unregister a breakpoint if it just needs modification, the following function is provided:

    int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp);

Hardware breakpoints are removed by an unregister function:

    void unregister_hw_breakpoint(struct hw_breakpoint *bp);
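
How the register, modify, and unregister calls fit together can be sketched with a user-space mock. Everything below is a hypothetical stand-in: the four function names and argument lists are taken from the prototypes above, but the structure fields, the handler signature, the -1 error value, and all function bodies are assumptions, and mock_access() is an invented helper that plays the role of the hardware trap. The real API is in flux and will differ.

```c
#include <assert.h>
#include <stddef.h>

/* All definitions here are hypothetical user-space stand-ins. */
struct pt_regs;                           /* opaque; unused by the mock */
struct task_struct { int pid; };          /* placeholder */

struct arch_hw_breakpoint {               /* assumed fields */
    unsigned long address;                /* address to watch */
    unsigned int len;                     /* access length in bytes */
    unsigned int type;                    /* access type; encoding left abstract */
};

struct hw_breakpoint {
    void (*triggered)(struct hw_breakpoint *bp, struct pt_regs *regs);
    struct arch_hw_breakpoint *info;
};

#define MOCK_NUM_REGS 4
static struct hw_breakpoint *mock_regs[MOCK_NUM_REGS];

/* Index of the register holding bp; passing NULL finds the first free one. */
static int mock_find(const struct hw_breakpoint *bp)
{
    for (int i = 0; i < MOCK_NUM_REGS; i++)
        if (mock_regs[i] == bp)
            return i;
    return -1;
}

/* Claim a free mock register; fail if bp is incomplete or none is free. */
static int mock_install(struct hw_breakpoint *bp)
{
    if (bp == NULL || bp->info == NULL || bp->triggered == NULL)
        return -1;
    int slot = mock_find(NULL);
    if (slot < 0)
        return -1;
    mock_regs[slot] = bp;
    return 0;
}

int register_kernel_hw_breakpoint(struct hw_breakpoint *bp, int cpu)
{
    (void)cpu;                            /* the mock models a single CPU */
    return mock_install(bp);
}

int register_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp)
{
    return tsk ? mock_install(bp) : -1;
}

/* The caller updates bp->info in place; this only checks bp is installed. */
int modify_user_hw_breakpoint(struct task_struct *tsk, struct hw_breakpoint *bp)
{
    return (tsk && bp && mock_find(bp) >= 0) ? 0 : -1;
}

void unregister_hw_breakpoint(struct hw_breakpoint *bp)
{
    int slot = bp ? mock_find(bp) : -1;
    if (slot >= 0)
        mock_regs[slot] = NULL;
}

/* Simulate a memory access: run the handler of each matching breakpoint. */
static int mock_access(unsigned long address)
{
    int hits = 0;
    for (int i = 0; i < MOCK_NUM_REGS; i++) {
        struct hw_breakpoint *bp = mock_regs[i];
        if (bp == NULL)
            continue;
        assert(bp->info != NULL);         /* guaranteed by mock_install() */
        if (bp->info->address == address) {
            bp->triggered(bp, NULL);
            hits++;
        }
    }
    return hits;
}

/* Example handler: just count how often the breakpoint fires. */
static int demo_hits;
static void demo_handler(struct hw_breakpoint *bp, struct pt_regs *regs)
{
    (void)bp;
    (void)regs;
    demo_hits++;
}
```

A caller would fill in a struct arch_hw_breakpoint with the address of the variable to watch, point a struct hw_breakpoint at it and at demo_handler(), and register the result. Moving the watchpoint to another address then needs only a change to the address field followed by the modify call, rather than an unregister and a fresh register.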

Hw-breakpoint has made its way into the -tip tree, the kernel source development tree maintained by Ingo Molnar. In June it was tentatively targeted for merging from -tip into the 2.6.32 kernel. However, the delayed integration with perfcounters has pushed any merge out past 2.6.32.

Whenever it is released, hw-breakpoint promises to provide a portable and robust method for debuggers to access hardware breakpoints without conflict. While the hw-breakpoint functionality started out as a relatively isolated feature to support debuggers, its existence has spawned new tracing and performance monitoring features. These new features should prove useful for various situations where data memory access, rather than instruction access, provides the appropriate trigger to collect dynamic information. By leveraging the perfcounter resource scheduling and reservation functionality, hw-breakpoint gains a general method for managing limited hardware breakpoint registers. The release of hw-breakpoint promises to enable new ways for Linux users to track down difficult bugs such as memory corruption, and to enable diverse dynamic data access techniques (such as gdb watchpoints and ftrace ksym_tracer) to play well together.



Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds