User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

[Andrew Morton] The current 2.6 prepatch is 2.6.12-rc3, which was announced by Linus on April 20. This is the first such created with "git" rather than BitKeeper. The patches are mostly fixes, but there is a rework of the kobject API in there as well. The long-format changelog has the details.

There have been no -mm trees released in the last week; Andrew Morton is currently traveling (though, as can be seen from the picture to the right, not away from his computer).

Comments (none posted)

Kernel development news

Quotes of the week

Looks good from your explanation, but I'm too tired to look at the code. It's 1AM, and the kids get up at 7. I'm not much of a hacker, I usually crash by 10PM these days ;^)
-- Linus Torvalds

And we should all digitally sign every single object too, and we should use 4096-bit PGP keys and unguessable passphrases that are at least 20 words in length. And we should then build a bunker 5 miles underground, encased in lead, so that somebody cannot flip a few bits with a ray-gun, and make us believe that the sha1's match when they don't. Oh, and we need to all wear aluminum propeller beanies to make sure that they don't use that ray-gun to make us do the modification _outselves_.
-- Linus Torvalds, not impressed by SHA worries.

Comments (none posted)

A very quick guide to starting with git

Linus has posted a git archive containing the 2.6.12-rc2 kernel source with a small series of patches. His current plan is to not populate that repository with the full development history reclaimed from BitKeeper. Adding the history would massively bloat the size of the repository, and git currently lacks the tools to do anything interesting with that history anyway. So the repository starts with a clean slate and goes from there.

If you want to experiment with the new setup, the steps are relatively simple. The first of which is to be sure that you are sufficiently interested to pull down a 120MB repository and play with bleeding-edge tools; in many cases, it might be better to wait a little longer. Should you choose to continue, the first step is to grab the latest git-pasky distribution, found at Untar it, and go through a series of steps like:

    git pull pasky

That will yield the current git, with Petr's added tools. Put said tools into your path, create a directory for the kernel tree, and run:

    git init rsync://

The command will appear to do nothing for quite some time; it will eventually pull down the entire repository and check out a copy. You'll now have a copy of the current Linus mainline tree.

Typing "git log" will print out the checkin log messages in reverse chronological order. "git pull" will update the tree to the current mainline. Just typing "git" will yield a list of possible commands. The capability is there, at this point, to check in changes, merge changes from other trees, generate patches, etc. Enjoy, but expect things to continue to change in a hurry.

Comments (14 posted)

Big-endian I/O memory

The kernel provides a set of functions for working easily with I/O memory. Those functions assume that the memory is stored in little-endian byte order. This assumption is usually valid - PCI peripherals, for example, are supposed to always use that ordering. There are devices out there, however, which export big-endian I/O memory. Dealing with these devices has required implementing special-purpose code in the drivers.

One of the few significant changes merged after 2.6.12-rc2 is a new set of I/O memory functions for working with big-endian devices. These functions are:

    unsigned int ioread16be(void __iomem *addr);
    unsigned int ioread32be(void __iomem *addr)
    void iowrite16be (u16 datum, void __iomem *addr);
    viod iowrite32be (u32 datum, void __iomem *addr);

These functions will handle the necessary byte swapping (or lack thereof) to present properly-ordered values on the host architecture. They are exported to modules.

Comments (1 posted)

An introduction to KProbes

April 18, 2005

This article was contributed by Sudhanshu Goswami



KProbes is a debugging mechanism for the Linux kernel which can also be used for monitoring events inside a production system. You can use it to weed out performance bottlenecks, log specific events, trace problems etc. KProbes was developed by IBM as an underlying mechanism for another higher level tracing tool called DProbes. DProbes adds a number of features, including its own scripting language for the writing of probe handlers. However, only KProbes has been merged into the standard kernel.

In this article I will describe the implementation of KProbes as present in the kernel. KProbes heavily depends on processor architecture specific features and uses slightly different mechanisms depending on the architecture on which it's being executed. The following discussion pertains only to the x86 architecture. This article assumes a certain familiarity with the x86 architecture regarding interrupts and exceptions handling. KProbes is available on the following architectures however: ppc64, x86_64, sparc64 and i386.

A kernel probe is a set of handlers placed on a certain instruction address. There are two types of probes in the kernel as of now, called "KProbes" and "JProbes." A KProbe is defined by a pre-handler and a post-handler. When a KProbe is installed at a particular instruction and that instruction is executed, the pre-handler is executed just before the execution of the probed instruction. Similarly, the post-handler is executed just after the execution of the probed instruction. JProbes are used to get access to a kernel function's arguments at runtime. A JProbe is defined by a JProbe handler with the same prototype as that of the function whose arguments are to be accessed. When the probed function is executed the control is first transferred to the user-defined JProbe handler, followed by the transfer of execution to the original function. The KProbes package has been designed in such a way that tools for debugging, tracing and logging could be built by extending it.

[KProbes architecture] The figure to the right describes the architecture of KProbes. On the x86, KProbes makes use of the exception handling mechanisms and modifies the standard breakpoint, debug and a few other exception handlers for its own purpose. Most of the handling of the probes is done in the context of the breakpoint and the debug exception handlers which make up the KProbes architecture dependent layer. The KProbes architecture independent layer is the KProbes manager which is used to register and unregister probes. Users provide probe handlers in kernel modules which register probes through the KProbes manager.

KProbes Interface

The data structures and functions implementing the KProbes interface have been defined in the file <linux/kprobes.h>. The following data structure describes a KProbe.

struct kprobe {
    struct hlist_node hlist;                    /* Internal */
    kprobe_opcode_t addr;                       /* Address of probe */
    kprobe_pre_handler_t pre_handler;           /* Address of pre-handler */
    kprobe_post_handler_t post_handler;         /* Address of post-handler */
    kprobe_fault_handler_t fault_handler;       /* Address of fault handler */
    kprobe_break_handler_t break_handler;       /* Internal */
    kprobe_opcode_t opcode;                     /* Internal */        
    kprobe_opcode_t insn[MAX_INSN_SIZE];        /* Internal */

Let's first talk about registering a KProbe. Users can insert their own probe inside a running kernel by writing a kernel module which implements the pre-handler and the post-handler for the probe. In case a fault occurs while executing a probe handler function, the user can handle the fault by defining a fault-handler and passing its address in struct kprobe. The prototypes for these are defined as below.

typedef int (*kprobe_pre_handler_t)(struct kprobe*, struct pt_regs*);
typedef void (*kprobe_post_handler_t)(struct kprobe*, struct pt_regs*, 
              unsigned long flags);
typedef int (*kprobe_fault_handler_t)(struct kprobe*, struct pt_regs*, 
             int trapnr);

As can be seen the pre-handler and the post-handler both receive a reference to the probe as well as the registers saved for the context in which the probe was hit. These values can be used in the pre-handler or post-handler or if required, they can be modified before returning control to the subsequent instruction. This also means that the same handlers can be used for multiple probe locations. The flags parameter is currently unused. The trapnr parameter (for the fault handler function) contains the exception number which occurred while handling the KProbe. A user defined fault handler can return 0 to let KProbe handle the fault further. It returns 1 if it has handled the fault and wants to let the execution of the probe handler continue.

Note that currently the pre-handler cannot be NULL for a probe, although the use of post-handler is optional. This is considered a bug since there may be cases where the pre-handler may not be required but a post-handler is needed. In such situations the user will still have to define a pre-handler. Another bug (which can oops the kernel) is related to probes which are activated on the ret/lret instructions. Yet another bug is related to probes activated on int3 instructions. All of these problems should be fixed in the 2.6.12 release of the kernel. However, these bugs can be easily avoided so they do not present any serious issues for someone who wants to use KProbes immediately without applying patches.

The KProbe registration functions are defined as shown below.

int register_kprobe(struct kprobe *p);
int unregister_kprobe(struct kprobe *p);

The registration function takes a reference to the KProbe structure describing the probe. Note that the user's module which registers the probe should keep a reference to the structure until the probe is unregistered. Since access to KProbes is serialized, a probe can be registered or unregistered anytime except from inside the probe handlers themselves, which will deadlock the system. This is because probe handlers execute after the spinlock used for locking KProbes has been acquired. The same spinlock is locked just before unregistering the probe. So if an attempt is made to unregister a probe inside a probe handler the same path will try to lock the spinlock twice.

Multiple probes cannot be placed on the same address as of now. However, a patch has been submitted to the kernel mailing list which allows multiple probes to be registered at the same address through another interface. It might be included in the next release of the kernel. Until then, if such an attempt is made register_kprobe() returns -EEXIST.

JProbes are used to give access to a function's arguments at runtime. This is achieved by providing a JProbe handler with the same prototype as that of the function being probed. At runtime, when the original function is executed, control is transferred to the JProbe handler after copying the process's context. On return from the JProbe handler, the context - consisting of the process's registers and the stack - is restored, so any modifications to the context of the process in the JProbe handler are lost. The execution continues from the point at which the probe was placed with the original saved state. A JProbe is represented by the structure given below.

struct jprobe {
    struct kprobe kp;
    kprobe_opcode_t *entry; 	/* user-defined JProbe handler address */

The user places the address of the function which will handle this probe in the entry field. The addr field in struct kprobe should be populated with the address of the function whose arguments are to be accessed. The functions used to register and unregister a JProbe are given below.

int register_jprobe(struct jprobe *p);
void unregister_jprobe(struct jprobe *p);

The JProbe handler which is written by the user should call jprobe_return() when it wants to return instead of the return statement.

KProbes Manager

The KProbes Manager is responsible for registering and unregistering KProbes and JProbes. The file kernel/kprobes.c implements the KProbes manager. Each probe is described by the struct kprobe structure and stored in a hash table hashed by the address at which the probe is placed. Access to this hash table is serialized by the spinlock kprobe_lock. This spinlock is locked before a new probe is registered, an existing probe is unregistered or when a probe is hit. This prevents these operations from executing simultaneously on a SMP machine. Whenever a probe is hit, the probe handler is called with interrupts disabled. Interrupts are disabled because handling a probe is a multiple step process which involves breakpoint handling and single-step execution of the probed instruction. There is no easy way to save the state between these operations hence interrupts are kept disabled during probe handling.

The manager is composed of these functions which are followed by a simplified description of what they do. These functions are architecture independent. A side-by-side reading of the code in kernel/kprobes.c and these steps will clarify the whole implementation.

void lock_kprobes(void)
Locks KProbes and records the CPU on which it was locked

void unlock_kprobes(void)
Resets the recorded CPU and unlocks KProbes

struct kprobe *get_kprobe(void *addr)
Using the address of the probed instruction, returns the probe from hash table

int register_kprobe(struct kprobe *p)
This function registers a probe at a given address. Registration involves copying the instruction at the probe address in a probe specific buffer. On x86 the maximum instruction size is 16 bytes hence 16 bytes are copied at the given address. Then it replaces the instruction at the probed address with the breakpoint instruction.

void unregister_kprobe(struct kprobe *p)
This function unregisters a probe. It restores the original instruction at the address and removes the probe structure from the hash table.

int register_jprobe(struct jprobe *jp)
This function registers a JProbe at a function address. JProbes use the KProbes mechanism. In the KProbe pre_handler it stores its own handler setjmp_pre_handler and in the break_handler stores the address of longjmp_break_handler. Then it registers struct kprobe jp->kp by calling register_kprobe()

void unregister_jprobe(struct jprobe *jp)
Unregisters the struct kprobe used by this JProbe

What happens when a KProbe is hit?

[Kprobe execution diagram] The steps involved in handling a probe are architecture dependent; they are handled by the functions defined in the file arch/i386/kernel/kprobes.c. After the probes are registered, the addresses at which they are active contain the breakpoint instruction (int3 on x86). As soon as execution reaches a probed address the int3 instruction is executed, causing the control to reach the breakpoint handler do_int3() in arch/i386/kernel/traps.c. do_int3() is called through an interrupt gate therefore interrupts are disabled when control reaches there. This handler notifies KProbes that a breakpoint occurred; KProbes checks if the breakpoint was set by the registration function of KProbes. If no probe is present at the address at which the probe was hit it simply returns 0. Otherwise the registered probe function is called.

What happens when a JProbe is hit?

[JProbe execution diagram] A JProbe has to transfer control to another function which has the same prototype as the function on which the probe was placed and then give back control to the original function with the same state as there was before the JProbe was executed. A JProbe leverages the mechanism used by a KProbe. Instead of calling a user-defined pre-handler a JProbe specifies its own pre-handler called setjmp_pre_handler() and uses another handler called a break_handler. This is a three-step process.

In the first step, when the breakpoint is hit control reaches kprobe_handler() which calls the JProbe pre-handler (setjmp_pre_handler()). This saves the stack contents and the registers before changing the eip to the address of the user-defined function. Then it returns 1 which tells kprobe_handler() to simply return instead of setting up single-stepping as for a KProbe. On return control reaches the user-defined function to access the arguments of the original function. When the user defined function is done it calls jprobe_return() instead of doing a normal return.

In the second step jprobe_return() truncates the current stack frame and generates a breakpoint which transfers control to kprobe_handler() through do_int3(). kprobe_handler() finds that the generated breakpoint address (address of int3 instruction in jprobe_handler()) does not have a registered probe however KProbes is active on the current CPU. It assumes that the breakpoint must have been generated by JProbes and hence calls the break_handler of the current_kprobe which it saved earlier. The break_handler restores the stack contents and the registers that were saved before transferring control to the user-defined function and returns.

In the third step kprobe_handler() then sets up single-stepping of the instruction at which the JProbe was set and the rest of the sequence is the same as that of a KProbe.

Possible problems

There could be several possible problems which could occur when a probe is handled by KProbes. The first possibility is that several probes are handled in parallel on a SMP system. However, there's a common hash table shared by all probes which needs to be protected against corruption in such a case. In this case kprobe_lock serializes the probe handling across processors.

Another problem occurs if a probe is placed inside KProbes code, causing KProbes to enter probe handling code recursively. This problem is taken care of in kprobe_handler() by checking if KProbes is already running on the current CPU. In this case the recursing probe is disabled silently and control returns back to the previous probe handling code.

If preemption occurs when KProbes is executing it can context switch to another process while a probe is being handled. The other process could cause another probe to fire which will cause control to reach kprobe_handler() again while the previous probe was not handled completely. This may result in disarming the new probe when KProbes discovers it's recursing. To avoid this problem, preemption is disabled when probes are handled.

Similarly, interrupts are disabled by causing the breakpoint handler and the debug handler to be invoked through interrupt gates rather than trap gates. This disables interrupts as soon as control is transferred to the breakpoint or debug handler. These changes are made in the file arch/i386/kernel/traps.c.

A fault might occur during the handling of a probe. In this case, if the user has defined a fault handler for the probe, control is transferred to the fault handler. If the user-defined fault handler returns 0 the fault is handled by the kernel. Otherwise, it's assumed that the fault was handled by the fault handler and control reaches back to the probe handlers.


KProbes is an excellent tool for debugging and tracing; it can also be used for performance measuring. Developers can use it to trace the path of their programs inside the kernel for debugging purposes. System administrators can use it to trace events inside the kernel on production systems. KProbes can also be used for non-critical performance measurements. The current KProbes implementation, however, introduces some latency of its own in handling probes. The cause behind this latency is the single kprobe_lock which serializes the execution of probes across all CPUs on a SMP machine. Another reason is the mechanism used by KProbes which uses multiple exceptions to handle a single probe. Exception handling is an expensive operation which causes its own delays. Work needs to be done in this area to improve SMP scalability and improving the probe handling time to make KProbes a viable performance measuring tool.

KProbes however cannot be used directly for these purposes. In the raw form a user can write a kernel module implementing the probe handlers. However higher level tools are necessary for making it more convenient to use. Such tools could contain standard probe handlers implementing the desired features or they could contain a means to produce probe-handlers given simple descriptions of them in a scripting language like DProbes.

Related Links

An introductory article on KProbes with some examples on how to use it.
The scriptable tracing tool for Linux which works on top of KProbes.
Network Packet Tracing Patch
This patch is used to trace the path of network packets traveling through the kernel stack using DProbes.
KProbes debugfs patch
This patch lists all probes applied at any addresses through debugfs
SysRq key for KProbes Patch
This patch enables the use of SysRq key to be used for listing all applied probes.
The Linux Kernel Tracing Tool - in the works.


The author will like to thank his editor Jonathan Corbet, Kalyan T.B. (HP), Siddharth Seth (IIITB) and Bharata B. Rao (HP) for going through this article and giving their feedback, comments, suggestions etc. and helping to improve this article.

Comments (8 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers


Filesystems and block I/O



Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds